Load very large file


#1

Is there a good way to load a very large file (around 250GB)? I’m currently loading into a 1-dimensional array first and then redimensioning it into multiple dimensions.
This incurs a large storage overhead, since the load file needs to contain the dimension indices as well as the attribute values. For example, a 3D array needs a load format like (i,j,k,val1,val2).
Thanks.

Jian


#2

Hello!

It looks like you are not the only person with this problem. Take a look at viewtopic.php?f=6&t=575.
Looks like this is a big issue and we’ll need to address it.

I think that eventually there should be an automagical load from CSV to n-dimensional array with no intermediate steps where the system does all the hard work for you. Over time, I am sure we will get there.

For now, we need to work around the problem.
I guess the first option is to just eat the overhead. Note that the 1D copy is not needed after the redimension is done. Note also that you could possibly use smaller integer types like uint32 or uint16 for your i,j,k and then upcast them as part of the redimension. You could take that idea further and come up with some sort of space-saving encoding over your (i,j,k). If this temporary overhead is acceptable, it might be a good way to proceed for now.
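
For illustration, here is a minimal sketch of the upcast idea, assuming a 3D target like the one in your example. The array names (load_1d, target_array), the chunk sizes, and the apply()/int64() upcast usage are my assumptions, so adjust them to your schema and SciDB version:

#!/bin/bash
# Sketch: load the coordinates as uint32 to shrink the temporary 1D copy,
# then upcast them to int64 as part of the redimension.

iquery -aq "create array load_1d
<i_small:uint32,
 j_small:uint32,
 k_small:uint32,
 val1:double,
 val2:double>
[n=0:*,500000,0]"

iquery -aq "create array target_array
<val1:double,
 val2:double>
[i=0:*,100,0,
 j=0:*,100,0,
 k=0:*,100,0]"

# ... load the flat file into load_1d here ...

# apply() adds int64 copies of the coordinates, named after the target dimensions;
# project() keeps only the attributes the redimension needs.
iquery -aq "redimension_store(
  project(
    apply(load_1d, i, int64(i_small), j, int64(j_small), k, int64(k_small)),
    i, j, k, val1, val2),
  target_array)"

# The 1D copy is no longer needed once the redimension is done.
iquery -aq "remove(load_1d)"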

The other option: if i,j,k are integers, you can provide a sparse format input to scidb that looks like this:

{100,100,1}[[[(0.1, 0.2), (0.3, 0.4), ...], [(1.1, 1.2), (1.3, 1.4), ...]], ... ]]]
{100,100,2}[[[(0.1, 0.2), (0.3, 0.4), ...], [(1.1, 1.2), (1.3, 1.4), ...]], ... ]]]
...

Then scidb can eat this. This is the “sparse load format”. The input needs to be properly chunk-ified: if your array starts at 0,0,0 and has chunks of 100x100x100, then the first chunk in the file needs to run from 0,0,0 to 100,100,100, the second chunk from 0,0,100 to 100,100,200, and so on.
You could imagine writing a tool that outputs this format to a pipe; scidb then reads it from the pipe, so the converted data does not have to hit the disk. I’m not 100% sure where your data comes from.
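
If your data does come out of some conversion step, the pipe pattern might look roughly like this. Here converter_tool is a hypothetical program that writes the chunk-ified sparse format, the target array is assumed to already exist with matching chunk sizes, and the exact load() arguments may differ between SciDB versions:

#!/bin/bash
# Sketch: stream the sparse load format into scidb through a named pipe,
# so the converted data never lands on disk.

rm -f /tmp/sparse.fifo
mkfifo /tmp/sparse.fifo

# converter_tool is a stand-in for whatever produces the chunk-ified sparse format.
converter_tool /path/to/source_data > /tmp/sparse.fifo &

# Load from the pipe into the pre-created target array.
iquery -naq "load(target_array, '/tmp/sparse.fifo')"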


#3

Here’s another, slightly more efficient way to do this. It works only if the dimensions are all integer, and it avoids storing an intermediate 1D array:

Here’s our file testfile.csv

apoliakov@daitanto:~/csv_load_test$ cat testfile.csv 
i,j,k,val1,val2
0,0,0,1,2
0,0,1,2,3
0,0,2,4,5
0,0,3,5,6
0,0,4,7,8
0,0,5,9,10
0,0,6,11,12
0,0,7,13,14
0,0,8,15,16
0,0,9,17,18
1,1,0,1,20
1,2,1,2,30
1,3,2,4,50
1,4,3,5,60
1,5,4,7,80
1,6,5,9,100
1,7,6,11,120
1,8,7,13,140
1,9,8,15,160
0,1,0,1,2
0,2,1,2,3
0,3,2,4,5
0,4,3,5,6
0,5,4,7,8
0,6,5,9,10
0,7,6,11,12
0,8,7,13,14
0,9,8,15,16

And here’s a load script that loads it:

#!/bin/bash
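# Drop any previous copies of the arrays; errors are ignored if they do not exist.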
iquery -aq "remove(test_array_template)" > /dev/null 2>&1
iquery -aq "remove(target_array)" > /dev/null 2>&1

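# 1-D "template" array matching the flat CSV layout: one cell per line of the file.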
iquery -aq "create array test_array_template
<i:int64,
 j:int64,
 k:int64,
 val1:double,
 val2:double>
[n=0:*,10,0]"

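# Target 3-D array that the data is redimensioned into.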
iquery -aq "
create array target_array
<val1:double,
 val2:double>
[i=0:*,10,0,
 j=0:*,10,0,
 k=0:*,10,0]"

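# Use a named pipe so the csv2scidb output streams straight into the load without hitting the disk.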
rm -f /tmp/load.fifo
mkfifo /tmp/load.fifo

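# Convert the CSV to SciDB load format: -p "NNNNN" marks five numeric fields, -s 1 skips the header line.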
csv2scidb -i testfile.csv -p "NNNNN" -s 1 > /tmp/load.fifo &
sleep 2

# The -1 passed to the input() operator is good for parallel loading.
# If the -1 is removed, the system will first redistribute the array between instances and then load.
# Depending on your exact situation, the fastest option may vary.
iquery -o lcsv+ -aq "redimension_store(input(test_array_template, '/tmp/load.fifo', -1), target_array)"
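
After the script runs, one quick sanity check is to look at the resulting schema and contents:

iquery -aq "show(target_array)"
iquery -aq "scan(target_array)"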

The chunk sizes here are toy values, chosen to show that the script works across multiple chunks. You obviously want larger chunk sizes for real data.
We are investigating ways to build this into our parallel load tool directly. Hopefully we can offer a more pre-packaged solution in 12.10. Please stay tuned!