Load operation


#1

What are all of the ways to load data into a SciDB array? We have a LOT of data and want to load it all. I know of the AQL and AFL load statements, and I know they can be executed through Python. But I am looking for the highest-performance method. For example, if I have a Python array, it would be great to load that data into a SciDB array directly from memory, without first writing it to a file and then running a load command to load the file. That step seems unnecessary.

Lastly, if such a thing doesn’t already exist, what would it entail to write a function that loads an in-memory array directly into a SciDB array?

Thanks,
Alan


#2

Hi Alan,

One thing to try would be to write the array into a fifo (a named pipe – see the bash command “mkfifo”) instead of a file, and then have SciDB read from it. It would be faster, as the data would never hit the disk. Still, there would be an expensive conversion to and from string. Expensive and unnecessary…
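For concreteness, here is a minimal sketch of the producer side in Python, assuming a one-dimensional double array and SciDB’s plain-text load format; the fifo path, array name, and exact text layout are placeholders you would adjust to your schema:

```python
import os

# Hypothetical names: adjust the fifo path and target array to your setup.
FIFO = "/tmp/scidb_load.fifo"

if not os.path.exists(FIFO):
    os.mkfifo(FIFO)  # create the named pipe; data passes through memory, not disk

values = [1.5, 2.5, 3.5]

# Opening the fifo for writing blocks until a reader (SciDB's load) opens the
# other end, so start the load in another shell first, e.g. something like:
#   iquery -aq "load(target_array, '/tmp/scidb_load.fifo')"
with open(FIFO, "w") as pipe:
    # One plausible 1-D dense text layout: [(v1),(v2),...]
    pipe.write("[" + ",".join("({0})".format(v) for v in values) + "]\n")
```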

What would it take to speed it up? You could take a look at the input() operator – LogicalInput.cpp and PhysicalInput.cpp. That’s the piece that reads data from a file (or a filesystem device like a fifo) and emits it into a SciDB array. All of our load operations go through input(). You could write an operator similar to input() that takes data from a different source (a socket?) and creates an array out of it. We’ve had some discussions about this idea, but no one is currently actively pursuing it. It would definitely be very helpful for a variety of scenarios!
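Until such a socket-aware operator exists, one rough workaround is to relay bytes from a socket into a fifo that load() is already reading. A minimal sketch (host, port, and fifo path are placeholders):

```python
import os
import shutil
import socket

# Hypothetical endpoints: a stopgap until a socket-aware input() exists.
FIFO = "/tmp/scidb_load.fifo"
HOST, PORT = "0.0.0.0", 9999

if not os.path.exists(FIFO):
    os.mkfifo(FIFO)

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind((HOST, PORT))
server.listen(1)

conn, _ = server.accept()        # wait for one producer connection
with open(FIFO, "wb") as pipe:   # blocks until SciDB's load opens the fifo
    shutil.copyfileobj(conn.makefile("rb"), pipe)  # stream bytes straight through
conn.close()
server.close()
```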

Come to think of it - the first step would be to write a binary-file input(). Then you could:
- output binary data into a fifo (from your Python script or what have you)
- have SciDB load() read the binary data from the fifo
This would also be useful for the general “save an array as a binary file for fast loading” use case. And it would be pretty much as fast as it can go - only minimal OS overhead. The binary format would have to be chunk-centric: the data would have to be organized one chunk at a time.
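To illustrate the chunk-centric idea: assuming such a binary input() existed and consumed raw little-endian doubles grouped one chunk at a time, the producer side might look like the sketch below. The chunk header layout here is purely hypothetical, since no such format exists in SciDB yet:

```python
import os
import struct

# Illustrative only: the binary input() operator and its format do not exist
# yet; this just shows what a chunk-at-a-time producer might look like.
FIFO = "/tmp/scidb_binload.fifo"
CHUNK_LEN = 4  # cells per chunk, matching the target array's chunk size

if not os.path.exists(FIFO):
    os.mkfifo(FIFO)

values = [float(i) for i in range(10)]

with open(FIFO, "wb") as pipe:
    for start in range(0, len(values), CHUNK_LEN):
        chunk = values[start:start + CHUNK_LEN]
        # Hypothetical chunk header: starting coordinate + cell count,
        # followed by the cells as raw little-endian doubles.
        pipe.write(struct.pack("<qq", start, len(chunk)))
        pipe.write(struct.pack("<%dd" % len(chunk), *chunk))
```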