Exporting an array stored in multiple nodes to a single file


#1

Hi,

I’m developing an operator for exporting arrays from SciDB to a specific file format (not the built-in CSV!).

The chunks may be distributed in SciDB across multiple nodes.

But I want to store the entire array in a single file in one of the nodes.

It seems that there are two ways to do this:

  1. I rewrite the query plan to introduce an ‘sg’ if there is more than one node, to ensure that the array is first gathered in memory on the desired node. The export operator then simply scans the array on that node and saves it to a file. (On the remaining nodes, the operator does nothing.)

  2. I may also be able to rely on the ability to retrieve results directly from remote nodes. In this case, the execute() part of the operator would simply scan the part of the array present on each node and return it. I would then implement a postSingleExecute() that goes through the collected results and saves them to a file on the coordinator node.

Option (1) seems better but it means that I have to change the query rewrite part… which doesn’t feel right, because I’m implementing a user-defined operator.

Option (2), if it works (I’m not really sure, since I’m not very familiar with this part), means that the exported file can only be saved on the coordinator node.

Do you have any thoughts on the preferred implementation? BTW, is there any example that I can follow? (And am I right that Option (2) is actually doable?)


#2

I guess Option (3) would be better: call redistribute directly from within the execute() part to ensure that the chunks are first sent to the desired node…
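To make the idea concrete, here is a rough Python simulation of the control flow I have in mind. It is only a sketch and does not use any SciDB API: `redistribute_to`, `execute`, `TARGET_INSTANCE` and the `(position, values)` chunk layout are all hypothetical names made up for illustration. The real operator would be C++ and would call the actual redistribute function instead.

```python
import io

TARGET_INSTANCE = 0  # hypothetical id of the instance that writes the file


def redistribute_to(target, chunks_by_instance):
    """Simulate sending every chunk to `target`; all other instances end up empty."""
    gathered = {instance_id: [] for instance_id in chunks_by_instance}
    gathered.setdefault(target, [])
    for chunks in chunks_by_instance.values():
        gathered[target].extend(chunks)
    return gathered


def execute(instance_id, local_chunks, out):
    """Per-instance export step: only the target writes, the others do nothing."""
    if instance_id != TARGET_INSTANCE:
        return 0
    written = 0
    # Write chunks in position order, one line per chunk (made-up file format).
    for position, values in sorted(local_chunks, key=lambda chunk: chunk[0]):
        out.write(f"{position}:{','.join(map(str, values))}\n")
        written += 1
    return written


# Chunks as (position, values) pairs, spread over three simulated instances.
chunks_by_instance = {
    0: [(0, [1, 2]), (2, [5, 6])],
    1: [(1, [3, 4])],
    2: [(3, [7, 8])],
}

gathered = redistribute_to(TARGET_INSTANCE, chunks_by_instance)
buffer = io.StringIO()  # stands in for the output file on the target node
counts = {i: execute(i, chunks, buffer) for i, chunks in gathered.items()}
exported = buffer.getvalue()
```

Note that in this sketch the whole array is held on the target instance before anything is written, which is the cost of this approach.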


#3

Interesting observations. Yes, option 3 will work quite well.

If you wanted to achieve this functionality without materializing, you could do it with something called ParallelAccumulatorArray. This is the class that the system currently uses to pull query results to the host.
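As a loose illustration of what "without materializing" means here, the Python sketch below pulls chunks on demand from several per-node streams and writes each one as it arrives, so the full array is never held in memory at once. This is not how ParallelAccumulatorArray is implemented; `remote_chunks`, `export_streaming` and the `(position, values)` chunk layout are hypothetical, and it assumes each node yields its chunks already ordered by position.

```python
import heapq
import io


def remote_chunks(chunks):
    """Hypothetical stand-in for one node's chunk stream, ordered by position."""
    for position, values in chunks:
        yield position, values


def export_streaming(streams, out):
    """Write chunks in position order, pulling lazily from each stream."""
    count = 0
    # heapq.merge consumes the iterators lazily, keeping one pending chunk per stream.
    for position, values in heapq.merge(*streams, key=lambda chunk: chunk[0]):
        out.write(f"{position}:{','.join(map(str, values))}\n")
        count += 1
    return count


node_a = remote_chunks([(0, [1, 2]), (2, [5, 6])])
node_b = remote_chunks([(1, [3, 4]), (3, [7, 8])])

stream_buffer = io.StringIO()  # stands in for the output file
stream_count = export_streaming([node_a, node_b], stream_buffer)
streamed = stream_buffer.getvalue()
```

The output file is the same as with a gather-then-write approach; the difference is only in how much data sits in memory at any one time.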

Another option (4, if you will) is to add another “format” to iquery (parallel to dense, sparse, csv, lcsv, …). That would be a change to the DBSaver class and not too hard to implement. The advantage of this option is that you get to use the ParallelAccumulatorArray for free and you don’t need to pre-materialize the array on the coordinator before you output it to the file.