Errors from loadcsv.py with 2 instances


#1

Hi,

I have been developing with SciDB quite a lot recently, and it is good to see a new release :wink: I just switched to 2 instances on my local Ubuntu box to try out the parallel data loading (speed!), but I am puzzled by an error I have never seen before. If I switch back to one instance in the config, the data loads without error.

The file test_data.csv is a single-column CSV (one integer per row) with a header row.

scidb@ubuntu:~/svn/SciDB/tools$ loadcsv.py -n 1 -t N -b -i ../test_data/test_data.csv -x -a "test_data" -s "<value:double>[i=0:180223,20480,0]"
Parsing chunk size from provided load schema.
Getting SciDB configuration information.
This SciDB installation has 2 instance(s).
Creating CSV fragment FIFOs.
Creating DLF fragment FIFOs.
Starting CSV splitting process.
Starting CSV distribution and conversion processes.
Removing "test_data" array.
Creating "test_data" array.
Loading data into "test_data" array (may take a while for large input files). 1-D load only since no target array name was provided.

##### ERROR ##################
Load failed.
UserException in file: src/query/ops/input/InputArray.cpp function: end line: 196
Error id: scidb::SCIDB_SE_IMPORT_ERROR::SCIDB_LE_FILE_IMPORT_FAILED
Error description: Import error. Import from file 'test_data.csv.dlf' (instance 1) to array 'test_data' failed at line 81930, column 5, offset 517181, value='3489': Chunk is outside of array boundaries.
Failed query id: 1100875960846

##############################

Removing CSV fragment FIFOs.
Removing DLF fragment FIFOs.
Failure: Error Encountered.

scidb@ubuntu:~/svn/SciDB/tools$ 

Here is my config.ini for SciDB:

[test2]
server-0=localhost,1
db_user=test2user
db_passwd=test2passwd
install_root=/opt/scidb/13.2
metadata=/opt/scidb/13.2/share/scidb/meta.sql
pluginsdir=/opt/scidb/13.2/lib/scidb/plugins
logconf=/opt/scidb/13.2/share/scidb/log4cxx.properties
base-path=/home/scidb/scidb-data-2
base-port=1239
interface=eth0

I have also attached the CSV file for reference. Is there something special I need to do for a single-column CSV?

Thanks,
Patrick
test_data.csv.zip (110 KB)


#2

Interestingly, if I used:

loadcsv.py -n 1 -t N -b -i ../test_data/test_data.csv -x -a "test_data" -s "<value:double>[i=0:*,20480,0]"

Making the array unbounded seems to be okay… but I would prefer a bounded one, since I need to reshape it into a 2-D array :frowning:
Is there any way to convert an unbounded array to a bounded one? Thanks again.

Cheers,
Patrick


#3

Hey Patrick,

Yes, this is expected behavior with loadcsv.py. It splits the data between instances and uses the dimension “i” as a kind of placeholder for data smearing. Note that “i” has no real meaning: it does not come from your CSV file. It is auto-generated from the number of rows, the chunk size, and the number of instances, and it is not populated densely; treat it as a meaningless “row id” that loadcsv.py makes up.

For example, here’s a 4-instance cluster and some test data:

$ cat file.csv 
value
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

$ iquery -aq "create array test1 <val:double> [i=0:*,6,0]"   #chunk size of 6
Query was executed successfully

$ loadcsv.py -i file.csv -n 1 -a test1
$ iquery -aq "scan(test1)"
i,val
0,1       #instance 0
1,5
2,9
3,13
6,2       #instance 1 (note the gap from 3 to 6)
7,6
8,10
9,14
12,3     #instance 2
13,7
14,11
15,15
18,4      #instance 3
19,8
20,12

$ iquery -aq "create array test2 <val:double> [i=0:*,10,0]"    #repeat the experiment with chunk size 10 to see the difference:
Query was executed successfully

$ iquery -aq "scan(test2)"
i,val
0,1
1,5
2,9
3,13
10,2
11,6
12,10
13,14
20,3
21,7
22,11
23,15
30,4
31,8
32,12

Why does it work this way? Well, it allows maximum-speed parallel loading into all instances simultaneously. We felt it was fair game to “garble” the value “i” because it’s not in the data to begin with. Most of our customers don’t use “i”; instead, their CSV files carry the dimensions as columns. In that scenario, users do this load and then issue a “redimension_store” to reorganize the data according to their own dimensions. When you do that, it doesn’t matter whether the array is bounded or not.
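For the record, that workflow looks roughly like this (just a sketch: the flat_load array, the x/y attribute names, and the target schema are made up for illustration, assuming the CSV carried explicit x,y coordinates alongside the value):

$ iquery -aq "create array target <val:double> [x=0:999,1000,0, y=0:999,1000,0]"
$ iquery -aq "redimension_store(flat_load, target)"

Here flat_load is the 1-D array produced by loadcsv.py, with int64 attributes x and y taken from the CSV columns; redimension_store moves every value into the cell addressed by its own x,y, so the smeared “i” never matters.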

You seem to have a different case where the order of your values is important. Is that right?

To recompute the original ordinal of the value from i, you can do something like this:

$ iquery -olcsv+ -aq "apply(test1, j, ( i - (i/6)*6 + (i/6-instanceid())*6  )* 4 + instanceid())"
i,val,j
0,1,0
1,5,4
2,9,8
3,13,12
6,2,1
7,6,5
8,10,9
9,14,13
12,3,2
13,7,6
14,11,10
15,15,14
18,4,3
19,8,7
20,12,11

In the above, “6” is the chunk size of the array and “4” is the number of instances. You can see it generates an ordinal (0-14) for the original values (1-15); in our case the ordinal is always one less than the value. The same works for “test2” if we replace the chunk size “6” with “10”.
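To spell one of those out: take i=9, which sits on instance 1. Its position within the chunk is i - (i/6)*6 = 9 - 6 = 3, the chunk round is i/6 - instanceid() = 1 - 1 = 0, so j = (3 + 0*6)*4 + 1 = 13, which matches the row “9,14,13” above, since the value 14 came from the 14th data line of the CSV (ordinal 13).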

Once you have this you can reconstruct the original ordering with redimension, and redimension into a bounded array:

$ iquery -aq "count(test1)"
i,count
0,15

$ iquery -aq "redimension(apply(test2, j, ( i - (i/10)*10 + (i/10-instanceid())*10  )* 4 + instanceid()), <val:double> [j=0:14,4,0])"
j,val
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9
9,10
10,11
11,12
12,13
13,14
14,15

But redimension is expensive, so now you have to decide whether you’d rather do a single-stream load without redimension, or a parallel load followed by redimension. The latter will surely win as you increase the number of instances.

Also look into the reference for operators subarray, sort, unpack – they may be of interest depending on what exactly you are trying to do.
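On that note, subarray also answers your earlier question about turning an unbounded array into a bounded one: its result is bounded by the coordinates you pass. A rough sketch (180223 stands in for your real last index):

$ iquery -naq "store(subarray(test_data, 0, 180223), test_data_bounded)"

test_data_bounded then has the bounded dimension [i=0:180223,...]. Keep in mind, though, that after a parallel load the populated i values can run well past the row count (as in the examples above), so cutting at the row count only makes sense after a single-stream load.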


#4

Hi Alex,

Now I totally understand: basically, my input is missing the index for one of the dimensions, so I can’t redimension the data straight away. Yes, the order of the data is important, as the values are low-level readings from a machine similar to a gyro. I tried the approach you suggested, and I can see that redimension is indeed quite expensive. :frowning:

I ended up loading the data with:

loadcsv.py -n 1 -t N -b -i test_data.csv -x -a "test_data" -s "<value:double>[i=0:*,20480,0]"

Then I get the number of samples loaded:

samples=`iquery -o csv -aq "count(test_data)" | tail -n 1`

And reshape it into the final 2-D array via:

second_dimen=32
iquery -naq "store(
                    reshape(
                            subarray(test_data,0,($samples-1)),
                            <value:double>[i=0:($samples/$second_dimen - 1),20480,0, j=0:($second_dimen-1),20480,0]),
                test_data_2d)"

Definitely faster than redimension :smile: Somehow I thought this would be much easier than the Python plugin… :open_mouth:
Thanks for your help again!

Cheers,
Pat


#5

Understood

Just to be sure: on multiple instances, loadcsv.py will reorder the data, and you say that order is important. That’s something to keep in mind.
To completely preserve the order, you can just use csv2scidb and forgo parallel load. If you do that, you can load into a bounded array.
Redimension will scale with number of instances too.
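For reference, the single-stream path looks roughly like this (a sketch only; flag spellings are from memory of the 13.x tools, so check csv2scidb --help, and the file path and schema are just placeholders):

$ csv2scidb -s 1 -p N < test_data.csv > /tmp/test_data.scidb    # -s 1 skips the header row, -p N marks the column as numeric
$ iquery -aq "create array test_data <value:double> [i=0:180223,20480,0]"
$ iquery -naq "load(test_data, '/tmp/test_data.scidb')"

Because the file is converted and loaded as a single stream, cell i really is the i-th row of the CSV, so the bounded schema and the subsequent reshape both work as expected.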


#6

Hi Alex,

Thanks for the reminder - I found that out in regression testing, so I have switched to csv2scidb now.

Cheers,
Patrick