Load runs a long time, does not finish, uses >15x the space


#1

Hello

I tried to load some vectors that I exported from an RDBMS, and the load ran for over 2.5 hours without completing before I gave up and killed it. The data file was 133 MB and contained, effectively, 100 vectors of 42,315 values each. The SciDB data on disk, though, had grown to 8.5 GB by that point!

scidb@ubuntu:~$ time iquery -naq "load (eigVect, '/home/scidb/export.txt')"
^C
real	162m28.905s
user	0m0.004s
sys	0m0.032s

For ease of transferring from the RDBMS to SciDB, I used a chunk size of 1 so that I could explicitly label each value in the file with its (row, column) position in the matrix:

AFL% create array eigVect <value : double> [i=0:42314,1,0, vectIndex=0:99,1,0];

head of the file:

{0,0}[[(-1.90783001E-003)]];
{0,1}[[(-8.21435009E-004)]];
{0,2}[[(2.80230997E-005)]]; 
{0,3}[[(-8.95769976E-004)]];
{0,4}[[(-2.9486001E-003)]]; 
{0,5}[[(-1.14246004E-003)]];
{0,6}[[(1.58106E-003)]];    
{0,7}[[(1.47579995E-003)]]; 
{0,8}[[(-3.93499015E-003)]];
{0,9}[[(8.95823003E-004)]]; 
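
(For context: generating lines in this format from a plain "i,j,value" dump is simple. Here is a minimal bash sketch; it is not the exact script I used, and the comma-separated input layout is an assumption:)

#!/bin/bash
# Sketch only: turn "i,j,value" lines into the one-cell-per-chunk
# load format shown above. Comma-separated input layout is assumed.

inputFile=$1

while IFS=, read -r i j value
do
    echo "{$i,$j}[[($value)]];"
done < "$inputFile"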

I suspect the small chunk size is the problem. Still, I’m a bit surprised that the size on disk ballooned to 8.5 GB!

I’m not sure if there is a bug hiding in here or something else. I’m reporting it and going to try a more reasonable chunk size.

Cheers,
Dave

Edit / follow-up: I could not shut down SciDB and had to kill the process. I tried to restart; iquery would start, but a "list()" command hung. I was then unable to shut down or even kill the SciDB process, so I shut down the entire system. After running "scidb.py initall test1", the disk space was cleaned up and I was able to start and use SciDB as before.
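
For anyone who ends up in the same state, the recovery amounted to roughly the following. (The stopall/startall lines are the standard scidb.py workflow rather than exactly what I typed; the only command I actually needed after the reboot was initall, and note that initall re-initializes the cluster and wipes stored arrays, so only use it if you can reload your data.)

scidb.py stopall test1     # may fail if the instance is wedged; reboot if so
scidb.py initall test1     # re-initialize catalog and data directories
scidb.py startall test1    # bring the cluster back up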


#2

OK, so I changed my setup so that each "column" is one chunk, and the load completed very, very fast (under ~10 seconds, I think).

AQL% create array eigVect_3 <value : double> [article=0:42314,42315,0,eig=0:99,1,0];

I converted my file so that it had the required format:

{0,0}[
[(-1.90783001E-003)],
[(4.43110999E-004)],
[(2.06423993E-003)],
[(-1.71595998E-003)],
[(-5.16451988E-003)],
[(5.42119006E-003)],
[(-2.41423002E-003)],
[(1.99372997E-003)],
[(-1.15074997E-003)],

I had the data in a single column, approximately in the correct format, and then used a bash script to get it into the format above.

original file - exported / created using SQL:

[(-1.90783001E-003)]
[(4.43110999E-004)]
[(2.06423993E-003)]
[(-1.71595998E-003)]
[(-5.16451988E-003)]
[(5.42119006E-003)]
[(-2.41423002E-003)]
[(1.99372997E-003)]
[(-1.15074997E-003)]
[(2.00516009E-003)]

bash script to convert:

#!/bin/bash
# Reformat a single-column export (one "[(value)]" line per cell) into
# SciDB load format: one "{0,columnIndex}[ ... ];" block per column.
# Usage: script.sh <inputFile> <vectorLength>

inputFile=$1
vectorLength=$2

let lastRowIndex=$vectorLength-1

let lineNum=0

while read -r line
do
    let rowIndex=$lineNum%$vectorLength
#    echo $rowIndex

    if [ $rowIndex -eq 0 ]
    then
        # Close the previous column's block (not needed before the first one).
        if [ $lineNum -gt 0 ]
        then
            echo "];"
        fi

        let columnIndex=$lineNum/$vectorLength
        echo "column index: $columnIndex" >&2

        # Open the next column's block.
        echo "{0,$columnIndex}["
    fi

    # Every row except the last in a column gets a trailing comma.
    if [ $rowIndex -lt $lastRowIndex ]
    then
        echo "$line,"
    else
        echo "$line"
    fi

    let lineNum=lineNum+1
done < "$inputFile"

# Close the final column's block.
echo "];"
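
Invocation is just the following (file names are placeholders, and "toColumns.sh" is whatever you call the script above):

./toColumns.sh export_single_column.txt 42315 > export_columns.scidb
iquery -naq "load (eigVect_3, '/home/scidb/export_columns.scidb')"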

#3

Hi diahr -

I see you've already posted a follow-up, so this reply may partly cross paths with it. Here's the news.

  1. In your original post, you included the following DDL:

create array eigVect <value : double> [i=0:42314,1,0, vectIndex=0:99,1,0];

I'm not surprised your load was so slow and took up so much disk space. In your CREATE ARRAY … you've said, in effect, "I want a 2D array with a single double attribute, of size 42315 x 100, broken up into chunks of size 1 x 1." That yields 4,231,500 chunks, each holding a single double. For each chunk, we store information like the chunk's physical location, array, coordinate range, cell count, etc. So for each 8-byte value, we're going to be storing about 100 bytes of overhead: on the order of 400 MB of metadata for roughly 34 MB of actual data, before any other per-chunk storage costs.

  2. When you went to this CREATE ARRAY …
create array eigVect_3 <value : double> [article=0:42314,42315,0,eig=0:99,1,0];

the reason it went so fast and used a much more reasonable amount of space is that each chunk now holds 42,315 values. In general, we recommend that people aim for chunks that each contain about 1,000,000 values. So I would suggest:

create array eigVect_3 <value : double> [article=0:42314,42315,0,eig=0:99,20,0];

This will get you chunks of about 850,000 values (42,315 values along article x 20 along eig = 846,300 per chunk).

  3. From your more detailed answer, are you starting with data in a format that looks like this?
article, eig, value
0,0,x
0,1,y
...

That is, are you starting with the data in a kind of CSV file and structured in a way that looks more like what you’d get in a SQL DBMS? If so, then rather than run your script, you might consider something like this:

CREATE ARRAY eigVect_Raw < article : int64, eig : int64, value : double > [LineNum=0:*,1000000,0];
load ( eigVect_Raw, '/tmp/raw_sql_dump.dat');
CREATE ARRAY eigVect_3 <value : double> [article=0:42314,42315,0,eig=0:99,20,0];
redimension_store ( eigVect_Raw, eigVect_3 );
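
Run from the shell, that sequence is just the statements above wrapped in iquery calls. (This is a sketch: it assumes the raw SQL dump has already been converted into SciDB's load-file format, e.g. with csv2scidb or your own converter, and written to /tmp/raw_sql_dump.dat.)

iquery -aq  "CREATE ARRAY eigVect_Raw < article : int64, eig : int64, value : double > [LineNum=0:*,1000000,0]"
iquery -naq "load ( eigVect_Raw, '/tmp/raw_sql_dump.dat' )"
iquery -aq  "CREATE ARRAY eigVect_3 <value : double> [article=0:42314,42315,0,eig=0:99,20,0]"
iquery -naq "redimension_store ( eigVect_Raw, eigVect_3 )"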

What we've found is that folks often start out with CSV data from (say) a SQL engine, and they need a scalable (i.e. parallelizable) way of doing the same kind of bulk transform your script is performing. redimension_store() actually does a lot more besides, and we've improved its efficiency enormously in 12.10 (yeah, yeah, #include <next_release_solves_everything.h>).

Paul

PS: On your other question, "Why the 'Select All:'?" Because you might want to cut-and-paste code out of a code block, and clicking that little link does exactly that.


#4

Hi Paul

Thank you for the explanation here and on the other thread. I had subsequently gotten "close" - I was loading test data, as you described, into a temporary array with multiple attributes holding row, col, value - but had not noticed / figured out redimension_store. That will be very convenient.

Also, thanks for explaining the “select all” - very nice feature.

Thanks for all the help.

Dave