Load data from CSV file


#1

Hello!

I have a file with the following contents (epoch timestamp, value):

1511589798, 1900
1511589800, 1903

I’d like to load them into SciDB with the timestamp as a dimension. I do it this way:

  1. Create the array: create array test_10 <time:int64, val:double>[i=0:*,500000,0];
  2. Load the data: load(test_10, '/home/scidb/test_10.csv', -2, 'csv');
  3. Redimension and store: store(redimension(test_10, <val:double>[time=0:*,500000,0]), test_10_1);

After that I have the schema I want, with cells like

{1511589798}[1900]
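
The resulting schema can be double-checked with show() (a standard SciDB operator):

$ iquery -aq "show(test_10_1)"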

But there is a performance problem: the CSV contains a significant number of records (5,000,000), and the load takes 10 hours.

Is the approach I chose an efficient way to load the data?

Thanks


#2

Hi Anton,

I tried to repeat your experiment on a laptop and I’m getting a few seconds. I use SciDB to create at test dataset. I don’t know what your sampling period is but based on your two rows, I made it 2 milliseconds. This is one thing that may cause your slowness. If you have lots of gaps between data, you could be creating many small chunks and that might explain why it takes so long. If that’s the case, increase the chunk size of test_10_1.
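
One quick way to check for that (a sketch, assuming the Paradigm4 summarize plugin is installed) is to look at the per-chunk statistics of the redimensioned array:

$ iquery -aq "summarize(test_10_1)"

A large chunk count combined with a small average bytes-per-chunk would confirm the sparse-chunking theory.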

Here’s what I got on a laptop VM with 4 GB RAM and 4 cores:

$ iquery -anq "store(apply(build(<timestamp:int64>[i=1:5000000,50000,0], 1511589798 + i * 2), value, double(random()%2000)), td)"
Query was executed successfully

$ iquery -anq "save(td, '/home/apoliakov/test_10.csv', -2, 'csv')"
Query was executed successfully

$ head -n 5 /home/apoliakov/test_10.csv 
1511589800,569
1511589802,956
1511589804,1697
1511589806,1516
1511589808,635

$ wc -l /home/apoliakov/test_10.csv 
5000000 /home/apoliakov/test_10.csv

$ iquery -aq "create array test_10 <timestamp:int64, value:double>[i=0:*,500000,0]"
Query was executed successfully

$ time iquery -anq "load(test_10, '/home/apoliakov/test_10.csv', -2, 'csv')"
Query was executed successfully

real	0m4.340s
user	0m0.008s
sys	0m0.008s

$ time iquery -anq "store(redimension(test_10, <value:double>[timestamp=0:*,500000,0]), test_10_1)"
Query was executed successfully

real	0m3.144s
user	0m0.012s
sys	0m0.000s

Here’s my config:

[mydb]
server-0=kali,3
redundancy=0
...
base-port=1239
interface=eth0
security=trust

execution-threads=4
result-prefetch-threads=3
operator-threads=1

sg-send-queue-size=4
sg-receive-queue-size=4

mem-array-threshold=256
smgr-cache-size=256
merge-sort-buffer=128

If chunking is an issue, then you could try to have SciDB pick a chunk size for you using auto-chunking like this:

$ time iquery -anq "store(redimension(test_10, <value:double>[timestamp=0:*:?]), test_10_2)"

More notes about chunking:

A nice blog post with some more load options:
http://rvernica.github.io/2016/05/load-data
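
For example, here’s a sketch of a single-query load based on the aio_input operator that post covers (this assumes the accelerated_io_tools plugin is installed; aio_input emits each CSV field as a string attribute a0, a1, ..., and the parameter names are taken from the plugin’s README, so check them against your installed version):

$ iquery -anq "store(
    redimension(
      apply(aio_input('/home/scidb/test_10.csv', 'num_attributes=2', 'attribute_delimiter=,'),
            time, int64(a0),
            val, double(a1)),
      <val:double>[time=0:*,500000,0]),
    test_10_1)"

This parses, casts, redimensions, and stores in one pass, without the intermediate test_10 array.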