Load error with loadcsv.py


#1

Does anyone know how I can ask SciDB to create the pipes in another location, or is this a more general problem with my system config?

warmstrong@mongodb:~$ $SCIDB_BIN/loadcsv.py -t NNNNNNNNCCC -a 'balances' -i $DATA_DIR/balances.csv
Retrieving load array schema from SciDB and parsing it to determine load array chunk size.
Getting SciDB configuration information.
This SciDB installation has 12 instance(s).
Creating CSV fragment FIFOs.
Creating DLF fragment FIFOs.

##### ERROR ##################
Failed to create DLF fragment: "/var/lib/scidb/000/0/balances.csv.dlf".
mkfifo: cannot create fifo `/var/lib/scidb/000/0/balances.csv.dlf': Permission denied

##############################

Removing CSV fragment FIFOs.
Failure: Error Encountered.

Thanks,
Whit


#2

Aha, this needs to be run as the scidb user. That seems a little odd to me, since normal queries can be executed as a non-scidb user.

warmstrong@mongodb:~$ sudo -i -u scidb $SCIDB_BIN/loadcsv.py -t NNNNNNNNCCC -m -a 'balances' -i $DATA_DIR/balances.csv
Retrieving load array schema from SciDB and parsing it to determine load array chunk size.
Getting SciDB configuration information.
This SciDB installation has 12 instance(s).
Creating DLF fragment FIFOs.
Starting CSV splitting process.

-Whit


#3

Unfortunately, it's now failing with this error:

##### ERROR ##################
Load failed.
UserException in file: src/query/ops/input/InputArray.cpp function: end line: 196
Error id: scidb::SCIDB_SE_IMPORT_ERROR::SCIDB_LE_FILE_IMPORT_FAILED
Error description: Import error. Import from file 'balances.csv.dlf' (instance 0) to array 'balances' failed at line 2, column 71, offset 76, value=''': Unterminated character constant.
Failed query id: 1100873073787

This is what the data looks like:

warmstrong@mongodb:~$ head $DATA_DIR/balances.csv
19832725,2000-10-01,2000-10-01,210812,210812,8.5,0,1635.48,C,C,
19832726,2000-10-01,2000-09-01,79558.4,79490.4,7.75,0,581.73,C,C,
19832727,2000-10-01,2000-09-01,48256.7,48224.4,8.625,0,379.18,C,C,

Thanks,
Whit


#4

Strangely, this data loads fine with a regular LOAD command, but I still can't seem to get it to load with loadcsv.py.

warmstrong@mongodb:~/dvl/structured.credit/scidb.loader$ time $SCIDB_BIN/iquery -n -q "LOAD balances from '$DATA_DIR/balances.scidb';"
Query was executed successfully

real    180m3.624s
user    0m0.012s
sys     0m0.004s

but with loadcsv.py:

[code]warmstrong@mongodb:~/dvl/structured.credit/scidb.loader$ time sudo -i -u scidb loadcsv.py -i /var/lib/nas/scidb.loader.files/balances.csv -t "NSSNNNNNCCs" -a balances -s "<loan_id:int64,file_date:datetime,last_int_p:datetime,balance:float,inv_bal:float,int_rate:float,sch_princ:float,sch_mnth_p:float,mba_stat:char,ots_stat:char,exc_stat:char null> [i=0:*,10000,0]" -x
Parsing chunk size from provided load schema.
Getting SciDB configuration information.
This SciDB installation has 12 instance(s).
Creating CSV fragment FIFOs.
Creating DLF fragment FIFOs.
Starting CSV splitting process.
Starting CSV distribution and conversion processes.
Removing "balances" array.
Creating "balances" array.
Loading data into "balances" array (may take a while for large input files). 1-D load only since no target array name was provided.

##### ERROR ##################
Load failed.
UserException in file: src/query/ops/input/InputArray.cpp function: end line: 196
Error id: scidb::SCIDB_SE_IMPORT_ERROR::SCIDB_LE_FILE_IMPORT_FAILED
Error description: Import error. Import from file 'balances.csv.dlf' (instance 0) to array 'balances' failed at line 82, column 25, offset 6361, value='': Number errors exceeds threshold.
Failed query id: 1101841563424

##############################

Removing CSV fragment FIFOs.
Removing DLF fragment FIFOs.
Failure: Error Encountered.

real 0m0.334s
user 0m0.180s
sys 0m0.032s
[/code]

This is line 82:

warmstrong@mongodb:~/dvl/structured.credit/scidb.loader$ sed -n '82p' /var/lib/nas/scidb.loader.files/balances.csv 
14311560,1992-06-01,1992-05-01,66500,66500,9.3,0,549.49,C,C,

-Whit


#5

Strange. Is this a parallel load? If it's being done in parallel and the file is split up, line 82 of the fragment may not correspond to line 82 of the original file. I'm wondering if there's a null in the date column somewhere?
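
A quick way to check, assuming the two date fields are columns 2 and 3 as in the sample you posted, is an awk pass over the raw file:

[code]
# Print the line number and content of any row with an empty date field
# (fields 2 and 3 in this layout). Adjust the field numbers for your file.
awk -F, '$2 == "" || $3 == "" { print NR ": " $0 }' $DATA_DIR/balances.csv
[/code]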


#6

I finally got this to work by dropping the types parameter (-t):

warmstrong@mongodb:~$ time sudo -i -u scidb loadcsv.py -i /var/lib/nas/scidb.loader.files/balances.csv -a balances_test -s "<loan_id:int64,file_date:datetime,last_int_p:datetime,balance:float,inv_bal:float,int_rate:float,sch_princ:float,sch_mnth_p:float,mba_stat:char,ots_stat:char,exc_stat:char null> [i=0:*,10000,0]" -x
Parsing chunk size from provided load schema.
Getting SciDB configuration information.
This SciDB installation has 12 instance(s).
Creating CSV fragment FIFOs.
Creating DLF fragment FIFOs.
Starting CSV splitting process.
Starting CSV distribution and conversion processes.
Removing "balances_test" array.
Creating "balances_test" array.
Loading data into "balances_test" array (may take a while for large input files). 1-D load only since no target array name was provided.
Removing CSV fragment FIFOs.
Removing DLF fragment FIFOs.
Success: Data Loaded.


real    22m7.992s
user    20m41.814s
sys     2m9.556s

That's about an 8x speed-up (we have a 12-instance SciDB installation).

I guess the loader can infer the types from the target schema '-s' param, which makes sense. I'm still not sure why it failed when I used the types param; I tried many different combinations of C, S, and s, but none of them worked. Does any of the data need to be quoted to be considered a proper string or char?

Also, is there a way to test equality of two arrays? It would be nice to check whether the array loaded via loadcsv.py is identical to the one loaded via the LOAD command.

Thanks,
Whit


#7

OK, I'll ask someone to take a look at this. But it's good you got it to work.

It’s always a good idea to do some count(*) and sums after a load.
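
For example (just a sketch using the arrays from this thread), compare row counts and a checksum-style sum on one attribute via AFL:

[code]
iquery -aq "aggregate(balances, count(*), sum(balance))"
iquery -aq "aggregate(balances_test, count(*), sum(balance))"
[/code]

If the counts and sums agree, the two loads most likely match.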

If you want something clever, and the array dimensions match, you can try something like:

filter(join(A, B), A.attribute1 <> B.attribute1 or A.attribute2 <> B.attribute2 or A.attribute3 <> B.attribute3 ...)
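
For instance, to compare the two arrays from this thread (a sketch, assuming both were loaded with the same schema and dimensions), count the cells where an attribute differs; a result of 0 means the arrays agree on those attributes:

[code]
iquery -aq "aggregate(
    filter(join(balances, balances_test),
           balances.balance <> balances_test.balance
           or balances.int_rate <> balances_test.int_rate),
    count(*))"
[/code]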