Please help! How to write/read MemArray in RANDOM order?


#1

EDIT: I’ll make this even simpler: all I need to know is how to write to a MemArray in RANDOM order, and how to read items out of that MemArray in RANDOM order. I’ve been scouring this website for how to do this, but despite the number of times I’ve seen “You should use MemArray” recommended, I can’t find where anybody has actually said HOW to do that.

Originally I was extracting the information out of the input array and storing it in a vector to perform the calculations on, but while that worked well for small arrays, that approach causes SciDB to crash once the vectors get too big. So I want to replace my use of vectors with MemArrays, so that the data will be stored on disk rather than all in memory, but my calculations require me to pass over and update the data multiple times and also check a given item’s neighbors, and I can’t find examples of how to do any of that.

[code]This is a two-part question, but hopefully both parts are easy:

Given an Array (in this case a MemArray) and a Coordinate, how do I quickly:

  1. Tell if the Array contains an item at the Coordinate, and
  2. Extract the item at the given Coordinate if it exists?

For more detail:

  • In my operator, I am processing an input Array and storing information about each item in the input Array into a new MemArray at the same Coordinate.
  • When I examine a new item in the input Array, I take a note of its Coordinate and I calculate that Coordinate’s neighbors (for example, in a 1D array, the neighbors of Coordinate 4 would be 3 and 5).
  • For each of the neighbor Coordinates, I check to see if the MemArray has a value at that Coordinate. If so, I want to extract that value for my calculations for the current item.
  • After performing calculations, I store a value into the MemArray at the item’s Coordinate and move to the next one.

I know I could brute force this by creating new ArrayIterators and ChunkIterators and scanning through the MemArray all the way through every time I want to check if a neighbor Coordinate is in it/extract that item, but I’m hoping there is a more elegant way of doing this.[/code]
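
For concreteness, the kind of probe I am after would look something like the following, assuming setPosition() on the Const iterators is the right tool for this, which is exactly the part I’m unsure about (memArray, attrId and neighborPos are placeholders):

[code]// What I am hoping exists, more or less (attrId is my attribute's id):
boost::shared_ptr<ConstArrayIterator> aIter = memArray->getConstIterator(attrId);
bool found = false;
Value item;
if (aIter->setPosition(neighborPos))                  // is there a chunk covering this coordinate?
{
    boost::shared_ptr<ConstChunkIterator> cIter =
        aIter->getChunk().getConstIterator();
    if (cIter->setPosition(neighborPos))              // is there a cell at this coordinate?
    {
        item = cIter->getItem();                      // extract the value
        found = true;
    }
}[/code]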


#2

Is this question harder than I thought…? What I need to know is how to use a MemArray like a vector, because my vectors are getting too large during my calculations and I want to use an Array to store the data on disk instead. If I can be told how to do that, I can finally finish this project, which has been taking me months.


#3

Okay, not only am I having extreme difficulty getting this to work, but now whenever I create a new chunk in a MemArray, it gets populated with all ones instead of starting out empty, and I can’t figure out why.


#4

One of my biggest problems right now is that I need to read from the MemArray I am writing, and I don’t know in advance which indexes I will need to read at any given time (they could all be in previously written, full chunks, or some could be in the chunk I am currently writing, etc.). The only way I can get this to work when reading from the currently open chunk is to call flush() before I try to read from the array (otherwise it always gives me a value of 1 for some reason), and flush() takes FOREVER, especially when it gets called multiple times per item in my input array. Is there any way to do this more elegantly? This seems like a simple question, and I had hoped it would have been answered when I first asked it over two weeks ago…


#5

Hi,

So the pattern you are looking for is implemented (slightly sloppily) in the file Aggregator.h. It keeps a stateArrayIterator and a stateChunkIterator, and you call the method setOutputPosition, which checks which chunk you currently have open. If you happen to have the same chunk open already, the work done is minimal. Otherwise it flushes the current chunk, opens or re-opens the chunk at the new position, and sets the position for the write. This works well when you don’t switch chunks very often. If you do have to switch chunks often, this will still work but will take longer, as you observed. In that case it might be worthwhile to write the unordered result into a linear form first (could be a buffer, could be a file, could be a 1D MemArray with the coordinates as an attribute), then call sort() on it, then write it out.
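
If it helps, here is a rough sketch of that pattern. This is not the actual Aggregator.h code: the OutputState struct and its member names are made up, and getChunkPositionFor(), updateChunk(), newChunk() and the iteration-mode flags are written from memory of Array.h and may differ between releases, so treat it as an outline rather than something to paste in:

[code]// Sketch only: keep one output chunk open, and flush/reopen only when the
// target cell falls into a different chunk.
struct OutputState
{
    boost::shared_ptr<ArrayIterator> arrayIter;    // writable iterator over the MemArray
    boost::shared_ptr<ChunkIterator> chunkIter;    // iterator over the currently open chunk
    Coordinates                      openChunkPos; // origin of the currently open chunk
};

void setOutputPosition(OutputState& st,
                       ArrayDesc const& schema,
                       Coordinates const& cellPos,
                       boost::shared_ptr<Query> const& query)
{
    // Round the cell position down to its chunk origin.
    Coordinates chunkPos = cellPos;
    schema.getChunkPositionFor(chunkPos);

    // Only flush and reopen when we actually move to a different chunk.
    if (!st.chunkIter || st.openChunkPos != chunkPos)
    {
        if (st.chunkIter)
        {
            st.chunkIter->flush();      // the expensive step we are trying to avoid
            st.chunkIter.reset();
        }
        if (st.arrayIter->setPosition(chunkPos))
        {
            // The chunk already exists: reopen it for update.
            st.chunkIter = st.arrayIter->updateChunk().getIterator(
                query, ChunkIterator::APPEND_CHUNK);
        }
        else
        {
            // First visit to this chunk: create it.
            st.chunkIter = st.arrayIter->newChunk(chunkPos).getIterator(
                query, ChunkIterator::IGNORE_OVERLAPS);
        }
        st.openChunkPos = chunkPos;
    }
    st.chunkIter->setPosition(cellPos); // writeItem(value) now targets cellPos
}[/code]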

Having everything set to 1 probably comes from creating an array with no empty tag attribute. Note:

[code]AttributeDesc et((AttributeID) outSchema.getAttributes().size(), DEFAULT_EMPTY_TAG_ATTRIBUTE_NAME, TID_INDICATOR, AttributeDesc::IS_EMPTY_INDICATOR, 0);
outSchema.addAttribute(et);[/code]

This help at all?


#6

I end up switching chunks very frequently, and flush() was making the operator laughably slow, so I had to figure out a way to avoid that as much as possible. Saving the data to an array and sorting it doesn’t help at all, since I need to be checking neighbors as I scan through the array; if I just copied the data to an array, sorted it, and then checked the neighbors in the resulting array, it would do me no more good than just reading from the input array in the first place.

I got this working well for 1D arrays using a sequential write, by storing the currently open chunk’s data in a vector, so that if I need to check a neighbor in the open chunk I can just access the vector rather than wasting time calling flush(). My MemArray is one-dimensional, with each coordinate mapped to a 1D index, so that I can do the sequential write on higher-dimensional arrays too. However, I’ve found out that this doesn’t work the way I would have thought with multidimensional arrays, since I cannot predict what order the array’s chunks will be visited in by an ArrayIterator/ChunkIterator.
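
(For reference, the caching trick mentioned above looks roughly like this; chunkIter, cellPos, neighborPos and value are made-up names for illustration, not my operator’s code verbatim:)

[code]// Mirror every cell written to the currently open chunk in a std::map, so
// neighbor lookups inside that chunk never need a flush().
std::map<Coordinates, double, CoordinatesLess> openChunkCache;

// Writing a value into the currently open chunk:
chunkIter->setPosition(cellPos);
chunkIter->writeItem(value);
openChunkCache[cellPos] = value.getDouble();

// Checking a neighbor:
std::map<Coordinates, double, CoordinatesLess>::const_iterator it =
    openChunkCache.find(neighborPos);
if (it != openChunkCache.end())
{
    double neighborVal = it->second;  // neighbor is in the open chunk: no flush()
}
else
{
    // Neighbor is in an already-flushed chunk: probe the MemArray with a
    // ConstArrayIterator/ConstChunkIterator and setPosition() instead.
}

// On a chunk switch: chunkIter->flush(); openChunkCache.clear();[/code]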

The best way I can think to describe this is to use pictorial examples, but I can try to explain it another way if this doesn’t make sense. Sometimes my iterators are doing this:

[code][1] (0, 0, 0)
[2] (0, 0, 1)
[3] (0, 0, 2)
[4] (1, 0, 0)
[5] (1, 0, 1)
[6] (1, 0, 2)
[7] (2, 0, 0)
[8] (2, 0, 1)
[9] (2, 0, 2)
. . .
[X] (3, 1, 2)

// The iterators iterate through all items where the second dimension = 0 before iterating through all items where the second dimension = 1[/code]

And other times they do this:

[code][1] (0, 0, 0)
[2] (0, 1, 0)
[3] (1, 0, 0)
[4] (1, 1, 0)
[5] (2, 0, 0)
[6] (2, 1, 0)
[7] (3, 0, 0)
[8] (3, 1, 0)
[9] (0, 0, 1)
. . .
[X] (3, 1, 2)

// The iterators iterate through all items where the third dimension = 0, then where the third = 1, and finally where the third = 2[/code]

Is there some way to ensure an Array/Chunk Iterator will always loop through a multidimensional array in the following way:

[1] (0, 0, 0) [2] (0, 0, 1) [3] (0, 0, 2) [4] (0, 1, 0) [5] (0, 1, 1) [6] (0, 1, 2) [7] (1, 0, 0) [8] (1, 0, 1) [9] (1, 0, 2) ... [X] (3, 1, 2)


#7

So… wait a sec.

At the moment, there is an “unspoken invariant” in the system that iteration over chunks, and cells in a chunk always happens in row-major order. More specifically, if you open an ArrayIterator and iterate using operator++ then each chunk Ci you encounter will be at coordinates greater than all previous chunks Ci-1…C0 where “greater than” is defined by CoordinatesLess in Metadata.h.
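
For intuition, here is a simplified equivalent of that comparison (the real CoordinatesLess lives in Metadata.h; this sketch just shows that earlier dimensions dominate, i.e. row-major order):

[code]// Lexicographic, row-major comparison over coordinate vectors.
bool coordinatesLess(Coordinates const& a, Coordinates const& b)
{
    assert(a.size() == b.size());
    for (size_t i = 0; i < a.size(); ++i)
    {
        if (a[i] != b[i])
            return a[i] < b[i];  // the leftmost differing dimension decides
    }
    return false;                // equal coordinates are not "less"
}[/code]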

In human terms, suppose you have:
val:double [x=1:10000,1000,0, y=1:10000,1000,0]
on a single instance. You open an ArrayIterator and start iterating and call getPosition on each chunk. You will see this order:

(1,1)
(1,1001)
(1,2001)
...
(1,9001)
(1001,1)
(1001,1001)
...
(2001,1)
...

Now, obviously, if you have multiple instances, each instance will iterate over its own chunks only. So the pattern above will have “gaps”, but the order will be the same.

And this invariant also applies to ChunkIterators. Suppose you open up the first chunk at (1,1) and start iterating over that. You will see:

(1,1)
(1,2)
...
(1,1000)
(2,1)
...

And all UDOs you write (if they use a virtual / Delegate array pattern) should conform to this. (Yes, aren’t you glad we told you?) Ops like join, merge, window rely on this property heavily.

Now, suppose you want to iterate in a different order. That happens sometimes and can be done. You can first iterate over the array, find all the chunk positions, then call setPosition on them in any order you like. For example, see Array::findChunkPositions() (Array.cpp), which returns a shared_ptr to a set of chunk coordinates. Note that some rare array types (InputArray) do not allow random-access reading, and there is a special method “ensureRandomAccess” to convert them if needed.
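
A rough sketch of that two-pass idea, using only getPosition()/setPosition() from the iterator interfaces in Array.h (array and attrId are placeholders for your input array and attribute; where findChunkPositions() is available it can replace the first pass):

[code]// Pass 1: record the position of every chunk this instance owns.
std::set<Coordinates, CoordinatesLess> chunkPositions;
boost::shared_ptr<ConstArrayIterator> aIter = array->getConstIterator(attrId);
while (!aIter->end())
{
    chunkPositions.insert(aIter->getPosition()); // chunk origin coordinates
    ++(*aIter);
}

// Pass 2: revisit the chunks in whatever order you like via setPosition().
for (std::set<Coordinates, CoordinatesLess>::const_iterator it = chunkPositions.begin();
     it != chunkPositions.end(); ++it)
{
    if (aIter->setPosition(*it))                 // random access into the array
    {
        ConstChunk const& chunk = aIter->getChunk();
        // ... open chunk.getConstIterator() and read cells, or setPosition()
        //     within the chunk to probe individual coordinates ...
    }
}[/code]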

This help?


#8

That is not the behavior I am seeing as I iterate over my array.

In my operator, I first call redistribute() to send the full input array to the coordinator, and then return an empty MemArray if it’s not the coordinator:

[code]// Save our array and query in our data members
_array = inInputArrays[0];
_query = inQuery;

// Send the array to the coordinator
if (_query->getInstancesCount() > 1)
{
    uint64_t coordinatorID = ((int64_t)_query->getCoordinatorID() == -1) ?
        _query->getInstanceID() :
        _query->getCoordinatorID();

    _array = redistribute(_array, _query, psLocalInstance, "", coordinatorID);

    if (_query->getInstanceID() != coordinatorID)
        return boost::shared_ptr<Array>(new MemArray(_schema, _query));
}[/code]

Then, as the coordinator, I iterate over the array (and its chunks). For demonstration purposes, I’m just going to store the position of each item as I visit it and do nothing else:

[code]class PhysicalMyOperator: public PhysicalOperator
{
private:

    vector<Coordinates> _visited;

.
.
.

// Grab our array iterator
shared_ptr<ConstArrayIterator> arrayIter =
    _array->getConstIterator(_settings->_inputAttributeId);

// Loop through our array iterator
while (!arrayIter->end())
{
    // Grab the chunk out of our array iterator
    ConstChunk const& chunk = arrayIter->getChunk();

    // Grab our chunk iterator
    shared_ptr<ConstChunkIterator> chunkIter = chunk.getConstIterator();

    // Loop through our chunk iterator
    while (!chunkIter->end())
    {
        _visited.push_back(chunkIter->getPosition());

        // Iterate over chunk
        ++(*chunkIter);
    }

    // Iterate over array
    ++(*arrayIter);
}[/code]

I then write out the _visited array to a string (which will contain the coordinate positions in the order they were visited) and throw an error displaying that information (again, for demonstration purposes).

Here are some of my results which very clearly show the items are NOT getting visited in the way you describe:

[code]AFL% show(dimTestDemo2);
{i} schema
{0} 'dimTestDemo2<val:double,boolVal:bool> [x=0:10,10,0,y=0:5,5,0,z=0:5,1,0]'

// VISITED ORDER:
[0, 0, 0]
[0, 1, 0]
[0, 2, 0]
[0, 3, 0]
[0, 4, 0]
[1, 0, 0]
[1, 1, 0]
[1, 2, 0]
[1, 3, 0]
[1, 4, 0]
[2, 0, 0]
[2, 1, 0]
[2, 2, 0]
[2, 3, 0]
[2, 4, 0]
[3, 0, 0]
[3, 1, 0]
[3, 2, 0]
[3, 3, 0]
[3, 4, 0]
[4, 0, 0]
[4, 1, 0]
[4, 2, 0]
[4, 3, 0]
[4, 4, 0]
[5, 0, 0]
[5, 1, 0]
[5, 2, 0]
[5, 3, 0]
[5, 4, 0]
[6, 0, 0]
[6, 1, 0]
[6, 2, 0]
[6, 3, 0]
[6, 4, 0]
[7, 0, 0]
[7, 1, 0]
[7, 2, 0]
[7, 3, 0]
[7, 4, 0]
[8, 0, 0]
[8, 1, 0]
[8, 2, 0]
[8, 3, 0]
[8, 4, 0]
[9, 0, 0]
[9, 1, 0]
[9, 2, 0]
[9, 3, 0]
[9, 4, 0]
[0, 0, 1]
[0, 1, 1]
[0, 2, 1]
[0, 3, 1]
[0, 4, 1]
[1, 0, 1]
[1, 1, 1]
[1, 2, 1]
[1, 3, 1]
[1, 4, 1]
[2, 0, 1]
[2, 1, 1]
[2, 2, 1]
[2, 3, 1]
[2, 4, 1]
[3, 0, 1]
[3, 1, 1]
[3, 2, 1]
[3, 3, 1]
[3, 4, 1]
[4, 0, 1]
[4, 1, 1]
[4, 2, 1]
[4, 3, 1]
[4, 4, 1]
[5, 0, 1]
[5, 1, 1]
[5, 2, 1]
[5, 3, 1]
[5, 4, 1]
[6, 0, 1]
[6, 1, 1]
[6, 2, 1]
[6, 3, 1]
[6, 4, 1]
[7, 0, 1]
[7, 1, 1]
[7, 2, 1]
[7, 3, 1]
[7, 4, 1]
[8, 0, 1]
[8, 1, 1]
[8, 2, 1]
[8, 3, 1]
[8, 4, 1]
[9, 0, 1]
[9, 1, 1]
[9, 2, 1]
[9, 3, 1]
[9, 4, 1]
[0, 0, 2]
[0, 1, 2]
[0, 2, 2]
[0, 3, 2]
[0, 4, 2]
[1, 0, 2]
[1, 1, 2]
[1, 2, 2]
[1, 3, 2]
[1, 4, 2]
[2, 0, 2]
[2, 1, 2]
[2, 2, 2]
[2, 3, 2]
[2, 4, 2]
[3, 0, 2]
[3, 1, 2]
[3, 2, 2]
[3, 3, 2]
[3, 4, 2]
[4, 0, 2]
[4, 1, 2]
[4, 2, 2]
[4, 3, 2]
[4, 4, 2]
[5, 0, 2]
[5, 1, 2]
[5, 2, 2]
[5, 3, 2]
[5, 4, 2]
[6, 0, 2]
[6, 1, 2]
[6, 2, 2]
[6, 3, 2]
[6, 4, 2]
[7, 0, 2]
[7, 1, 2]
[7, 2, 2]
[7, 3, 2]
[7, 4, 2]
[8, 0, 2]
[8, 1, 2]
[8, 2, 2]
[8, 3, 2]
[8, 4, 2]
[9, 0, 2]
[9, 1, 2]
[9, 2, 2]
[9, 3, 2]
[9, 4, 2]
[0, 0, 3]
[0, 1, 3]
[0, 2, 3]
[0, 3, 3]
[0, 4, 3]
[1, 0, 3]
[1, 1, 3]
[1, 2, 3]
[1, 3, 3]
[1, 4, 3]
[2, 0, 3]
[2, 1, 3]
[2, 2, 3]
[2, 3, 3]
[2, 4, 3]
[3, 0, 3]
[3, 1, 3]
[3, 2, 3]
[3, 3, 3]
[3, 4, 3]
[4, 0, 3]
[4, 1, 3]
[4, 2, 3]
[4, 3, 3]
[4, 4, 3]
[5, 0, 3]
[5, 1, 3]
[5, 2, 3]
[5, 3, 3]
[5, 4, 3]
[6, 0, 3]
[6, 1, 3]
[6, 2, 3]
[6, 3, 3]
[6, 4, 3]
[7, 0, 3]
[7, 1, 3]
[7, 2, 3]
[7, 3, 3]
[7, 4, 3]
[8, 0, 3]
[8, 1, 3]
[8, 2, 3]
[8, 3, 3]
[8, 4, 3]
[9, 0, 3]
[9, 1, 3]
[9, 2, 3]
[9, 3, 3]
[9, 4, 3]
[0, 0, 4]
[0, 1, 4]
[0, 2, 4]
[0, 3, 4]
[0, 4, 4]
[1, 0, 4]
[1, 1, 4]
[1, 2, 4]
[1, 3, 4]
[1, 4, 4]
[2, 0, 4]
[2, 1, 4]
[2, 2, 4]
[2, 3, 4]
[2, 4, 4]
[3, 0, 4]
[3, 1, 4]
[3, 2, 4]
[3, 3, 4]
[3, 4, 4]
[4, 0, 4]
[4, 1, 4]
[4, 2, 4]
[4, 3, 4]
[4, 4, 4]
[5, 0, 4]
[5, 1, 4]
[5, 2, 4]
[5, 3, 4]
[5, 4, 4]
[6, 0, 4]
[6, 1, 4]
[6, 2, 4]
[6, 3, 4]
[6, 4, 4]
[7, 0, 4]
[7, 1, 4]
[7, 2, 4]
[7, 3, 4]
[7, 4, 4]
[8, 0, 4]
[8, 1, 4]
[8, 2, 4]
[8, 3, 4]
[8, 4, 4]
[9, 0, 4]
[9, 1, 4]
[9, 2, 4]
[9, 3, 4]
[9, 4, 4]
[0, 0, 5]
[0, 1, 5]
[0, 2, 5]
[0, 3, 5]
[0, 4, 5]
[1, 0, 5]
[1, 1, 5]
[1, 2, 5]
[1, 3, 5]
[1, 4, 5]
[2, 0, 5]
[2, 1, 5]
[2, 2, 5]
[2, 3, 5]
[2, 4, 5]
[3, 0, 5]
[3, 1, 5]
[3, 2, 5]
[3, 3, 5]
[3, 4, 5]
[4, 0, 5]
[4, 1, 5]
[4, 2, 5]
[4, 3, 5]
[4, 4, 5]
[5, 0, 5]
[5, 1, 5]
[5, 2, 5]
[5, 3, 5]
[5, 4, 5]
[6, 0, 5]
[6, 1, 5]
[6, 2, 5]
[6, 3, 5]
[6, 4, 5]
[7, 0, 5]
[7, 1, 5]
[7, 2, 5]
[7, 3, 5]
[7, 4, 5]
[8, 0, 5]
[8, 1, 5]
[8, 2, 5]
[8, 3, 5]
[8, 4, 5]
[9, 0, 5]
[9, 1, 5]
[9, 2, 5]
[9, 3, 5]
[9, 4, 5]
[0, 5, 0]
[1, 5, 0]
[2, 5, 0]
[3, 5, 0]
[4, 5, 0]
[5, 5, 0]
[6, 5, 0]
[7, 5, 0]
[8, 5, 0]
[9, 5, 0]
[0, 5, 1]
[1, 5, 1]
[2, 5, 1]
[3, 5, 1]
[4, 5, 1]
[5, 5, 1]
[6, 5, 1]
[7, 5, 1]
[8, 5, 1]
[9, 5, 1]
[0, 5, 2]
[1, 5, 2]
[2, 5, 2]
[3, 5, 2]
[4, 5, 2]
[5, 5, 2]
[6, 5, 2]
[7, 5, 2]
[8, 5, 2]
[9, 5, 2]
[0, 5, 3]
[1, 5, 3]
[2, 5, 3]
[3, 5, 3]
[4, 5, 3]
[5, 5, 3]
[6, 5, 3]
[7, 5, 3]
[8, 5, 3]
[9, 5, 3]
[0, 5, 4]
[1, 5, 4]
[2, 5, 4]
[3, 5, 4]
[4, 5, 4]
[5, 5, 4]
[6, 5, 4]
[7, 5, 4]
[8, 5, 4]
[9, 5, 4]
[0, 5, 5]
[1, 5, 5]
[2, 5, 5]
[3, 5, 5]
[4, 5, 5]
[5, 5, 5]
[6, 5, 5]
[7, 5, 5]
[8, 5, 5]
[9, 5, 5]
[10, 0, 0]
[10, 1, 0]
[10, 2, 0]
[10, 3, 0]
[10, 4, 0]
[10, 0, 1]
[10, 1, 1]
[10, 2, 1]
[10, 3, 1]
[10, 4, 1]
[10, 0, 2]
[10, 1, 2]
[10, 2, 2]
[10, 3, 2]
[10, 4, 2]
[10, 0, 3]
[10, 1, 3]
[10, 2, 3]
[10, 3, 3]
[10, 4, 3]
[10, 0, 4]
[10, 1, 4]
[10, 2, 4]
[10, 3, 4]
[10, 4, 4]
[10, 0, 5]
[10, 1, 5]
[10, 2, 5]
[10, 3, 5]
[10, 4, 5]
[10, 5, 0]
[10, 5, 1]
[10, 5, 2]
[10, 5, 3]
[10, 5, 4]
[10, 5, 5]
[/code]

[code]AFL% show(dimTestDemo4);
{i} schema
{0} 'dimTestDemo4<val:double,boolVal:bool> [x=0:5,1,0,y=0:5,5,0,z=0:5,1,0]'

// VISITED ORDER:
[0, 0, 0]
[0, 1, 0]
[0, 2, 0]
[0, 3, 0]
[0, 4, 0]
[0, 0, 1]
[0, 1, 1]
[0, 2, 1]
[0, 3, 1]
[0, 4, 1]
[0, 0, 2]
[0, 1, 2]
[0, 2, 2]
[0, 3, 2]
[0, 4, 2]
[0, 0, 3]
[0, 1, 3]
[0, 2, 3]
[0, 3, 3]
[0, 4, 3]
[0, 0, 4]
[0, 1, 4]
[0, 2, 4]
[0, 3, 4]
[0, 4, 4]
[0, 0, 5]
[0, 1, 5]
[0, 2, 5]
[0, 3, 5]
[0, 4, 5]
[0, 5, 0]
[0, 5, 1]
[0, 5, 2]
[0, 5, 3]
[0, 5, 4]
[0, 5, 5]
[1, 0, 0]
[1, 1, 0]
[1, 2, 0]
[1, 3, 0]
[1, 4, 0]
[1, 0, 1]
[1, 1, 1]
[1, 2, 1]
[1, 3, 1]
[1, 4, 1]
[1, 0, 2]
[1, 1, 2]
[1, 2, 2]
[1, 3, 2]
[1, 4, 2]
[1, 0, 3]
[1, 1, 3]
[1, 2, 3]
[1, 3, 3]
[1, 4, 3]
[1, 0, 4]
[1, 1, 4]
[1, 2, 4]
[1, 3, 4]
[1, 4, 4]
[1, 0, 5]
[1, 1, 5]
[1, 2, 5]
[1, 3, 5]
[1, 4, 5]
[1, 5, 0]
[1, 5, 1]
[1, 5, 2]
[1, 5, 3]
[1, 5, 4]
[1, 5, 5]
[2, 0, 0]
[2, 1, 0]
[2, 2, 0]
[2, 3, 0]
[2, 4, 0]
[2, 0, 1]
[2, 1, 1]
[2, 2, 1]
[2, 3, 1]
[2, 4, 1]
[2, 0, 2]
[2, 1, 2]
[2, 2, 2]
[2, 3, 2]
[2, 4, 2]
[2, 0, 3]
[2, 1, 3]
[2, 2, 3]
[2, 3, 3]
[2, 4, 3]
[2, 0, 4]
[2, 1, 4]
[2, 2, 4]
[2, 3, 4]
[2, 4, 4]
[2, 0, 5]
[2, 1, 5]
[2, 2, 5]
[2, 3, 5]
[2, 4, 5]
[2, 5, 0]
[2, 5, 1]
[2, 5, 2]
[2, 5, 3]
[2, 5, 4]
[2, 5, 5]
[3, 0, 0]
[3, 1, 0]
[3, 2, 0]
[3, 3, 0]
[3, 4, 0]
[3, 0, 1]
[3, 1, 1]
[3, 2, 1]
[3, 3, 1]
[3, 4, 1]
[3, 0, 2]
[3, 1, 2]
[3, 2, 2]
[3, 3, 2]
[3, 4, 2]
[3, 0, 3]
[3, 1, 3]
[3, 2, 3]
[3, 3, 3]
[3, 4, 3]
[3, 0, 4]
[3, 1, 4]
[3, 2, 4]
[3, 3, 4]
[3, 4, 4]
[3, 0, 5]
[3, 1, 5]
[3, 2, 5]
[3, 3, 5]
[3, 4, 5]
[3, 5, 0]
[3, 5, 1]
[3, 5, 2]
[3, 5, 3]
[3, 5, 4]
[3, 5, 5]
[4, 0, 0]
[4, 1, 0]
[4, 2, 0]
[4, 3, 0]
[4, 4, 0]
[4, 0, 1]
[4, 1, 1]
[4, 2, 1]
[4, 3, 1]
[4, 4, 1]
[4, 0, 2]
[4, 1, 2]
[4, 2, 2]
[4, 3, 2]
[4, 4, 2]
[4, 0, 3]
[4, 1, 3]
[4, 2, 3]
[4, 3, 3]
[4, 4, 3]
[4, 0, 4]
[4, 1, 4]
[4, 2, 4]
[4, 3, 4]
[4, 4, 4]
[4, 0, 5]
[4, 1, 5]
[4, 2, 5]
[4, 3, 5]
[4, 4, 5]
[4, 5, 0]
[4, 5, 1]
[4, 5, 2]
[4, 5, 3]
[4, 5, 4]
[4, 5, 5]
[5, 0, 0]
[5, 1, 0]
[5, 2, 0]
[5, 3, 0]
[5, 4, 0]
[5, 0, 1]
[5, 1, 1]
[5, 2, 1]
[5, 3, 1]
[5, 4, 1]
[5, 0, 2]
[5, 1, 2]
[5, 2, 2]
[5, 3, 2]
[5, 4, 2]
[5, 0, 3]
[5, 1, 3]
[5, 2, 3]
[5, 3, 3]
[5, 4, 3]
[5, 0, 4]
[5, 1, 4]
[5, 2, 4]
[5, 3, 4]
[5, 4, 4]
[5, 0, 5]
[5, 1, 5]
[5, 2, 5]
[5, 3, 5]
[5, 4, 5]
[5, 5, 0]
[5, 5, 1]
[5, 5, 2]
[5, 5, 3]
[5, 5, 4]
[5, 5, 5]
[/code]

What’s going wrong?


#9

So… I don’t think anything is going wrong; it’s just that the chunking is tricky.

Remember: first you visit the chunks, then you visit the cells. So it’s like this.

chunk [0,0,0]
cells:
[0, 0, 0]
[0, 1, 0]
[0, 2, 0]
[0, 3, 0]
[0, 4, 0]

chunk [0,0,1]
cells:
[0, 0, 1]
[0, 1, 1]
[0, 2, 1]
[0, 3, 1]
[0, 4, 1]

chunk [0,0,2]
cells:
[0, 0, 2]
[0, 1, 2]
[0, 2, 2]
[0, 3, 2]
[0, 4, 2]

... chunks [0,0,3] through [0,0,5]
chunk [0,5,0]
...
chunk [1,0,0]
chunk [1,0,1]
...
chunk [1,5,0]
...

So as you see, the iteration over the chunks follows the rule I described. And the iteration over the cells inside a chunk also follows the rule. But, of course, by changing chunk sizes you change the overall order in which cells are visited…
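
If it helps to see it spelled out: the chunk a cell belongs to is just each coordinate rounded down to its chunk boundary (the schema can do this for you, I believe via ArrayDesc::getChunkPositionFor(), but the arithmetic is the point here; the DimensionDesc accessor names below are approximate):

[code]// Simplified, stand-alone version of "which chunk does this cell live in?"
Coordinates chunkPositionOf(Coordinates const& cellPos, Dimensions const& dims)
{
    Coordinates chunkPos(cellPos.size());
    for (size_t i = 0; i < dims.size(); ++i)
    {
        Coordinate start    = dims[i].getStart();         // dimension lower bound
        Coordinate interval = dims[i].getChunkInterval(); // chunk size along i
        chunkPos[i] = start + ((cellPos[i] - start) / interval) * interval;
    }
    return chunkPos;
}
// With [x=0:10,10,0, y=0:5,5,0, z=0:5,1,0]: cell (0,1,0) lives in chunk
// (0,0,0), while cell (0,1,1) lives in chunk (0,0,1), which is exactly the
// split shown above.[/code]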

This help?


#10

Oh wow, adjacent chunks don’t necessarily contain adjacent cells? That means the method I’ve been using won’t work at all… Is there any clean way to iterate through the cells of an array in row-major order? If not, I’m back at square one…

Thank you for your help by the way; this has been very frustrating and it is great to have assistance with this.


#11

Well… no, that’s not strictly true. You have a 3D coordinate system broken up into rectilinear 3D blocks. Pick any two blocks that are touching. Of course the data in them will be adjacent along one of the axes, but not along the other axes. The order in which the blocks are visited is determined by the invariant I mentioned. So changing the dimensions of the blocks will give a different cell-visit order.

Put in other words, chunk sizing is key. Suppose you had:
[x=0:10,1,0, y=0:5,6,0, z=0:5,6,0]
or
[x=0:10,1,0, y=0:5,3,0, z=0:5,6,0]
then you would get the order you want.

One solution might be to just call repart prior to the op and force the data to be chunked in a particular fashion.
Another solution would be to build a map of all the chunks and then iterate over them in any order you like via setPosition.
A third solution is to shove everything into a 1D array and add coordinates as attributes. Then call sort() on that.
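
For the third option, one way to build the sort key (purely illustrative; rowMajorRank is a made-up helper, not a SciDB function, and the DimensionDesc accessors are approximate) is to collapse the coordinates into a single row-major rank and use that as the 1D coordinate, so that iterating the sorted result visits cells as [0,0,0], [0,0,1], … regardless of how the input was chunked:

[code]// Row-major rank of a cell: earlier dimensions vary slowest.
int64_t rowMajorRank(Coordinates const& pos, Dimensions const& dims)
{
    int64_t rank = 0;
    for (size_t i = 0; i < dims.size(); ++i)
    {
        int64_t extent = dims[i].getLength();            // cells along dimension i
        rank = rank * extent + (pos[i] - dims[i].getStart());
    }
    return rank;
}[/code]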

By the way - what is the op?


#12

I’m w/ Alex.

Parallelism is a central goal of SciDB. If your algorithm requires that you proceed through the entire input array’s cells in a particular order, then your algorithm is fundamentally serial. None of our algorithms in SciDB handle things that way, even operators whose result looks as if it was computed that way. For example:

Suppose you have an array named inputArray with a schema and data that looks like this:

inputArray 
< 
  v : double
>
[ I=0:* ..., J=0:*, ... ]

 \   I --> 
  \
J  +------+------+------+------+------+------+------+------+------+------+
|  |  01  |  02  |  03  |  04  |  05  |  06  |  07  |  08  |  09  |  10  |
|  +------+------+------+------+------+------+------+------+------+------+
v  |  01  |  02  |  03  |  04  |  05  |  06  |  07  |  08  |  09  |  10  |
   +------+------+------+------+------+------+------+------+------+------+
   |  01  |  02  |  03  |  04  |  05  |  06  |  07  |  08  |  09  |  10  |
   +------+------+------+------+------+------+------+------+------+------+

Now, if you run the following query, you will get the following results:


cumulate ( inputArray, sum ( v ), I );

 \   I -->
  \
J  +------+------+------+------+------+------+------+------+------+------+
|  |  01  |  03  |  06  |  10  |  15  |  21  |  28  |  36  |  45  |  55  |
|  +------+------+------+------+------+------+------+------+------+------+
V  |  01  |  03  |  06  |  10  |  15  |  21  |  28  |  36  |  45  |  55  |
   +------+------+------+------+------+------+------+------+------+------+
   |  01  |  03  |  06  |  10  |  15  |  21  |  28  |  36  |  45  |  55  |
   +------+------+------+------+------+------+------+------+------+------+

Looking at that result, it appears as if we’ve run through the entire data set in row-major order, calculating the running sum along each row as we go. But this isn’t what actually happens, because doing it that way would be very inefficient from a “how do I maximize parallelism” perspective and also from a “how do I minimize inter-instance communication” perspective.

What we actually do is first calculate a set of intermediate results on a per-chunk basis (that is, we compute an intermediate result per chunk, entirely in parallel, with the same calculation applied to all other chunks), then merge those first-pass results, and finally calculate the query’s result, again entirely in parallel on a per-chunk basis, using the merged results of the first pass.
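
Here is a toy, single-process sketch of that two-pass shape, with plain std::vector “chunks” standing in for SciDB chunks. Nothing here is SciDB-specific; it just shows where the parallelism comes from:

[code]#include <vector>

// Toy illustration of the two-pass strategy: per-chunk local work, a cheap
// merge of one value per chunk, then per-chunk finalization.
std::vector<double> runningSumByChunks(std::vector<std::vector<double> > const& chunks)
{
    // Pass 1 (parallelizable per chunk): local running sums plus each chunk's total.
    std::vector<std::vector<double> > local(chunks.size());
    std::vector<double> chunkTotal(chunks.size(), 0.0);
    for (size_t c = 0; c < chunks.size(); ++c)
    {
        double sum = 0.0;
        for (size_t i = 0; i < chunks[c].size(); ++i)
        {
            sum += chunks[c][i];
            local[c].push_back(sum);
        }
        chunkTotal[c] = sum;
    }

    // Merge step: prefix-sum the per-chunk totals (one value per chunk).
    std::vector<double> offset(chunks.size(), 0.0);
    for (size_t c = 1; c < chunks.size(); ++c)
        offset[c] = offset[c - 1] + chunkTotal[c - 1];

    // Pass 2 (parallelizable per chunk): add each chunk's offset to its local sums.
    std::vector<double> result;
    for (size_t c = 0; c < chunks.size(); ++c)
        for (size_t i = 0; i < local[c].size(); ++i)
            result.push_back(local[c][i] + offset[c]);
    return result;
}[/code]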

It’s good advice, when programming w/ SciDB, to think through your algorithms in terms of, “How can I maximize the degree of parallelism?”