Inter-node parallel processing


Hi Folks,

Suppose I initialize a SciDB array fed by a file on a remote physical instance. For example, here’s a snippet of my Python script:

sdb.query("store(aio_input('paths=/tmp/mri_3.out', 'instances=4294967299', 'num_attributes=1'), mri_4_3)")

If I apply some operation to the array (e.g., aggregate(mri_4_3, max(a0))), would the processing (CPU usage, memory usage, etc.) happen on the remote instance (i.e., 4294967299)? It would be very risky to load remote files entirely into local memory, which could cause an out-of-memory error if the data set is huge.
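As a side note, the curly quotes in the snippet above would be rejected by the SciDB AFL parser; the query string needs plain ASCII quotes. Here is a small sketch of a helper (hypothetical, not part of scidb-py) that builds the same store(aio_input(...)) string with correct quoting:

```python
# Hypothetical helper (not part of scidb-py) that assembles the
# store(aio_input(...)) AFL query string from the snippet above.
# Note the plain ASCII single quotes around each aio_input option.
def build_load_query(path, instance_id, num_attributes, target_array):
    return (
        "store(aio_input("
        f"'paths={path}', "
        f"'instances={instance_id}', "
        f"'num_attributes={num_attributes}'"
        f"), {target_array})"
    )

query = build_load_query('/tmp/mri_3.out', 4294967299, 1, 'mri_4_3')
print(query)
# store(aio_input('paths=/tmp/mri_3.out', 'instances=4294967299', 'num_attributes=1'), mri_4_3)
```

The resulting string could then be passed to sdb.query() as in the original snippet.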

I guess a more general question is: how do I control which instance works (both computation and I/O) on which data chunk?



I found this in the user guide:
“For any query, the client can choose any instance to serve as the coordinator (in effect overriding the default coordinator).”

Does that mean I can specify which instance/node loads the data into memory and uses the CPU cycles?



Ah, I see, so the array is not really loaded into memory but distributed as file chunks across multiple instances… not much the user can do except choose the chunk size.
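To make that one knob concrete: chunk size is declared in the array schema, so rechunking means restating the schema. A minimal sketch, assuming a one-dimensional string array; the repart operator is real SciDB AFL, but the schema and target array name here are hypothetical:

```python
# Sketch: chunk size lives in the array schema, so rechunking is a
# repart into a new schema. Schema and array names are hypothetical.
chunk_len = 1000000  # cells per chunk along the single dimension
schema = f"<a0:string>[i=0:*,{chunk_len},0]"
query = f"store(repart(mri_4_3, {schema}), mri_4_3_rechunked)"
print(query)
# store(repart(mri_4_3, <a0:string>[i=0:*,1000000,0]), mri_4_3_rechunked)
```

The string would again be run via sdb.query(); the rechunking itself executes inside the cluster, not on the client.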