Optimizing load performance


#1

Hello,

I have a question about how I am loading dense array datasets into SciDB, and I want to make sure I have optimized it as much as possible. I have a fairly simple Python script that executes the loop below. I want to rely on SciDB's chunk estimation as much as possible, on the assumption that it will give the best performance. First, I try to get the optimal chunk size for the 1D load array by not setting the chunk interval when I create it. If I know the actual upper bound of the 1D array, should I specify it to get better loading performance? Next I use “using” to estimate the chunk size for the target array from the 1D array. Since almost every array I load will look similar to the 1D array, that seems reasonable to me.

This seems to work pretty well, but I am looking for some pointers, because some of the files will take over 3 days to load the way I have written things now. One of the datasets I am loading is a global MERIS satellite image with 8,398,080,000 pixels (129600*64800). With the current script it takes about 8 minutes to load 1 of 648 inserts. I should mention that I am using the SciDB-Py library to interact with the files and SciDB.

Would I see a performance gain by using the subprocess module and executing iquery statements directly (something like the snippet just below)? I am also thinking about using Python’s multiprocessing module. Essentially the script has two parts: part 1 writes the file and loads it into the 1D array, and part 2 does the redimension and insert. Could I run the two parts in different processes with multiprocessing (see the sketch after the loop below)? How does SciDB handle simultaneous requests? Would this result in slower performance for both?
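
For reference, this is roughly what I mean by calling iquery through subprocess instead of going through SciDB-Py (the query string and the file path below are just placeholders):

    import subprocess

    def run_afl(query):
        # Run one AFL statement through the iquery CLI:
        # -a = AFL mode, -n = don't fetch output, -q = query string.
        subprocess.run(["iquery", "-anq", query], check=True)

    # placeholder example; the real script builds these strings per strip
    run_afl("load(1Darray, '/tmp/strip_000.scidb')")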

Loop:

  1. WriteSciDBFile - write the load file in SciDB text format:

     {0}[
     (v1,v2,v3,…),
     (v1,v2,v3,…),
     (v1,v2,v3,…)
     ]

  2. create array 1Darray <v1:type, v2:type, v3:type,…> [xy=0:*,?,?]
     # I'm doing this in the hopes that I will get the best chunk size for loading the 1D array

  3. load(1Darray, scidbfile)

  4. if loop == 1:
         create array junkArray <v1:type, v2:type, v3:type,…> [xy=0:*,?,?]
         store(redimension(apply(1Darray, x, x1, …), junkArray), junkArray)
         # Create the target array that has the optimal chunk size for the 1D array I am loading
         create array targetArray <> [y=0:{y_max},?,?; x=0:{x_max},?,1] using junkArray

  5. # Insert the data from the 1D array into the target array in strips
     insert(redimension(apply(1Darray, x, x1+{yOffSet}, y, y1), targetArray), targetArray)

  6. if loop > 1:
         remove_versions(targetArray, #)
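
And here is a rough sketch of the multiprocessing idea I am asking about: load strip N+1 into its own 1D staging array while strip N is being redimensioned and inserted. The write_scidb_file helper, the attribute types, the strip width, and the offsets are placeholders standing in for what my script actually does, and it assumes targetArray already exists (created as in step 4 above).

    import multiprocessing as mp
    import subprocess

    def run_afl(query):
        # same iquery helper as in the snippet above
        subprocess.run(["iquery", "-anq", query], check=True)

    def write_scidb_file(strip_id, path):
        # placeholder: the real script writes this strip's pixels to 'path'
        # in the {0}[(…),(…)] text format shown in step 1 above
        raise NotImplementedError("placeholder for the real file writer")

    def part1(strip_id, path):
        # Part 1: write the load file for one strip and load it into a per-strip 1D array.
        write_scidb_file(strip_id, path)
        run_afl(f"create array OneD_{strip_id} <v1:double,v2:double,v3:double> [xy=0:*,?,?]")
        run_afl(f"load(OneD_{strip_id}, '{path}')")

    def part2(strip_id, offset):
        # Part 2: redimension this strip into the target array, then drop the staging array.
        run_afl(f"insert(redimension(apply(OneD_{strip_id}, x, x1 + {offset}, y, y1), "
                "targetArray), targetArray)")
        run_afl(f"remove(OneD_{strip_id})")

    if __name__ == "__main__":
        STRIP_WIDTH = 200  # placeholder: rows covered by one strip
        # (strip id, load file path, offset applied in the redimension) for all 648 strips
        strips = [(i, f"/tmp/strip_{i:03d}.scidb", i * STRIP_WIDTH) for i in range(648)]
        prev = None
        for strip_id, path, offset in strips:
            loader = mp.Process(target=part1, args=(strip_id, path))
            loader.start()              # start loading this strip ...
            if prev is not None:
                part2(*prev)            # ... while the previous strip is redimensioned
            loader.join()
            prev = (strip_id, offset)
        if prev is not None:
            part2(*prev)                # redimension the final strip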


#2

Hey David,

Yes, looks like there’s a lot we can optimize here - along several lines:

  1. speeding up data ingest
  2. speeding up redimensioning

When it comes to data ingest, the fastest way to do it is via binary load. If you don’t want to mess with that, then the fastest way to load text is to use aio to load TSV data. See a nice set of instructions here.
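
As a rough illustration of the aio route (this assumes the accelerated_io_tools plugin is installed; the attribute names, casts, and file path are placeholders, and exact parameters and output attribute names vary by plugin version), a TSV load would look something like:

    store(
      project(
        apply(
          aio_input('/tmp/strip_000.tsv', 'num_attributes=3'),
          v1, double(a0),
          v2, double(a1),
          v3, double(a2)
        ),
        v1, v2, v3
      ),
      stagingArray
    )

You would then redimension out of that staging array into your target the same way you do now.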

When it comes to redimensioning, your junkArray scheme most likely costs more time than it saves. The boundaries of the array don’t make a big difference when it comes to selecting the chunk size. I would definitely try to find a way to use one redimension instead of two, as redimension is an expensive operation.

It looks like your data are completely dense, so why not pick a chunk size along the lines of 800x800? Do attributes vary in size much?
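
Concretely, that would mean dropping junkArray, creating the target with explicit chunk intervals, and doing one redimension per strip - roughly along these lines (attribute names, bounds, and the offset are placeholders based on your description):

    create array targetArray <v1:double, v2:double, v3:double> [y=0:{y_max},800,0; x=0:{x_max},800,0]
    insert(redimension(apply(1Darray, x, x1 + {yOffSet}, y, y1), targetArray), targetArray)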

Finally, there is a prototype for faster redimensioning that may give you another ~2-4x boost depending on how many attributes your array has. You can give that a try as well. This one won’t auto-chunk for you yet - folks are working on bringing some of that into the core codebase…


#3

In regard to #2, let me be a bit more clear.

When I originally wrote the script, I wanted it to adjust dynamically to different datasets. To do that, the script needed to determine the optimal schema for the target array. So the first time through the loop it determines that schema using the extra redimension in step 4. It only does this once, and you are right that there is a definite cost, but I struggled to figure out how else I could do it.

I will try implementing the aio load to see if I can get a bit more speedup.