Parallel Loading


#1

Gang:

I’ve been looking into parallel load scenarios using my own code, and thought I could do the following:

aql: create array timeseries_set<timeseries:int64,time:double,value:double>[row=0:*,1000000,0]
afl: load(timeseries_set, '/shared/a', 0, '(int64,double,double)')
afl: load(timeseries_set, '/shared/b', 1, '(int64,double,double)')

… the idea being that two separate processes could write data in binary load format to named pipes “a” and “b” on a shared filesystem, and have the data loaded by SciDB instances 0 and 1, respectively. However, it seems that this is actually treated as two consecutive loads of the array, with the second overwriting the first. Am I missing something? This seems to negate the value of being able to specify a particular instance in load(); it appears the only way to do a parallel load is to pass -1 for the instance.
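For what it’s worth, producing the binary load data itself is simple: for a `'(int64,double,double)'` template, each record is just the three fixed-width values packed back to back. A minimal Python sketch of a writer, assuming the fields are packed in template order with no delimiters and native little-endian layout (the function name and file path are illustrative):

```python
import struct

def write_records(path, records):
    """Write (timeseries, time, value) tuples in the fixed-width binary
    load format assumed by the '(int64,double,double)' template:
    one int64 followed by two doubles per record, little-endian,
    no delimiters between fields or records."""
    with open(path, "wb") as f:
        for ts, t, v in records:
            f.write(struct.pack("<qdd", ts, t, v))  # 8 + 8 + 8 = 24 bytes

# Example: two records -> a 48-byte file.
write_records("/tmp/a.bin", [(1, 0.5, 3.14), (2, 1.5, 2.72)])
```

The same writer works whether the target is a regular file or a named pipe created with mkfifo.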

Cheers,
Tim


#2

I have been doing binary parallel loads by splitting the array I am building into distinct slices and giving the slice files identical pathnames on each node’s local filesystem. In other words, I have files time1.bin, time2.bin, time3.bin, etc., and I copy them out to the server nodes so that 10.0.0.1 gets time1.bin renamed to /pathname/local.bin, 10.0.0.2 gets time2.bin, also renamed to /pathname/local.bin, and so on.

Then I execute the load command with instance -1 on the coordinator node. It seems to work fast and well: I cut my loading time from 4 weeks to just under 3 days. The single-node, text-based loader using named pipes is robust, but slow for very big datasets.
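The staging step above can be sketched in a few lines of Python. This writes one slice file per instance, each under a per-node staging directory but with the same basename, mirroring the rename-to-/pathname/local.bin convention; the round-robin split, directory layout, and helper name are illustrative choices, not part of SciDB:

```python
import os
import struct

def stage_slices(records, num_instances, out_dir="/tmp/stage"):
    """Round-robin (int64, double, double) records into one binary slice
    file per instance. Each slice gets the SAME basename ('local.bin')
    inside a per-node staging directory, so it can be copied to the
    identical path /pathname/local.bin on every node before running
    load(..., -1, ...) on the coordinator."""
    files = []
    for i in range(num_instances):
        d = os.path.join(out_dir, f"node{i}")  # stands in for 10.0.0.{i+1}
        os.makedirs(d, exist_ok=True)
        files.append(open(os.path.join(d, "local.bin"), "wb"))
    for n, (ts, t, v) in enumerate(records):
        files[n % num_instances].write(struct.pack("<qdd", ts, t, v))
    for f in files:
        f.close()

# Example: six records split across two instances (three records each).
stage_slices([(i, i * 0.5, i * 1.5) for i in range(6)], num_instances=2)
```

After copying each node’s local.bin into place, a single `load(timeseries_set, '/pathname/local.bin', -1, '(int64,double,double)')` on the coordinator has every instance read its own local copy.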

Cheers, George


#3

George:

Thanks for confirming my suspicions on this … I’m working on the same approach now. Because I’m experimenting with using SciDB as the backend for a web server, I’ll be happy when there’s some socket-based alternative so I can stop fussing with filesystem permissions :smile:

Cheers,
Tim