Does store() redistribute chunks among cluster nodes?


#1

Suppose I perform an operation like store(xgrid(…)), store(regrid(…)), or store(aggregate(…)).

Are the chunks of the resulting array redistributed/shuffled by the store operator among the available instances/cluster nodes?


#2

Yes.

You can use summarize:
summarize(array, 'per_instance=1')
or query the chunk map to find exactly which instances the data went to.


#3

Is it possible to disable shuffling?
For example, xgrid produces a 1:1 mapping between input and output chunks. Is it possible to store the output chunk(s) on the same instance as the source chunks?


#4

So the assignment of chunk → instance happens via a hash over (chunk position minus array origin) divided by chunk interval. The way xgrid inflates chunks means that the output will in fact be colocated with the input. For example:

$ iquery -naq "store(build(<val:double> [x=1:10,5,0, y=1:10,5,0], iif(x<6 and y<6, 1, iif(x<6,2, iif(y<6,3,4)))), foo)"
Query was executed successfully

$ iquery -otsv:l -aq "grouped_aggregate(apply(foo, instance, instanceid()),  instance, min(val), max(val), count(*))"
instance	val_min	val_max	count
0	1	4	50
1	2	3	50

$ iquery -naq "store(xgrid(foo, 7, 19), bar)"
Query was executed successfully

$ iquery -otsv:l -aq "grouped_aggregate(apply(bar, instance, instanceid()),  instance, min(val), max(val), count(*))"
instance	val_min	val_max	count
0	1	4	6650
1	2	3	6650

But of course this varies with different operators; xgrid is lucky. There’s a plugin that restricts array residency, forcing an array to reside only on specific instances: https://github.com/Paradigm4/elastic_resource_groups. That’s not exactly what you’re asking for, though.
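A minimal Python sketch of the placement rule described above, assuming the logical chunk coordinate ((position − origin) ÷ chunk interval) is hashed modulo the instance count. The real SciDB hash function is different; this only illustrates why xgrid’s inflated chunks stay colocated with their source chunks:

```python
# Hypothetical model of chunk -> instance assignment (NOT SciDB's actual hash):
# hash the logical chunk coordinate, then take it modulo the instance count.

NUM_INSTANCES = 2  # assumed two-instance cluster, as in the example above

def chunk_coord(pos, origin, interval):
    # Logical chunk coordinate: (position - origin) // chunk interval per dim.
    return tuple((p - o) // i for p, o, i in zip(pos, origin, interval))

def instance_for(pos, origin, interval):
    return hash(chunk_coord(pos, origin, interval)) % NUM_INSTANCES

# foo: [x=1:10,5,0, y=1:10,5,0] has chunk intervals (5, 5).
# bar = xgrid(foo, 7, 19) inflates each chunk by the same factors,
# giving intervals (35, 95), so every output cell keeps the logical
# chunk coordinate of its source chunk.
src = instance_for((6, 1), (1, 1), (5, 5))     # source chunk coord (1, 0)
dst = instance_for((36, 1), (1, 1), (35, 95))  # xgrid image, also (1, 0)
assert src == dst  # same logical coordinate -> same instance
```

Because xgrid scales the chunk interval by exactly the inflation factor, the logical chunk coordinates are unchanged and the hash sends each output chunk to the same instance as its input chunk.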

In the future there will be a way to store into a specific distribution scheme: range partition by row, range partition by column, different hash functions, and so on.


#5

Thank you.

Just to clarify: the following shows the bar array values stored on each particular instance, right?

instance	val_min	val_max	count
0	1	4	6650
1	2	3	6650

Does this mean that regrid will behave the same way if I reduce the array resolution by a factor of two (so the number of chunks does not change)?

Also, after applying xgrid or regrid, does SciDB choose the chunk size for the new array automatically? Or does it rechunk the new array so that its chunk size equals the chunk size of the old array?

Would you kindly share the hash function that SciDB uses for chunk placement?