HOWTO: Load dense matrix from CSV


#1

I have a dense matrix in a CSV file like this:

$ cat FILE.csv
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15

This is a 3x5 matrix. I want to load it into SciDB.


I could not find a direct way to load it into SciDB, but following the notes in the thread "suggest best way to load this data" I put together a recipe for loading data like this into SciDB. The steps are:

1. Convert CSV to one-dimensional SciDB format.

$ csv2scidb < FILE.csv > FILE.csv2scidb
$ cat FILE.csv2scidb 
{0}[
(1,2,3,4,5),
(6,7,8,9,10),
(11,12,13,14,15)
]
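
For environments where csv2scidb is not available, the same one-dimensional format can be produced with a short script. Here is a minimal Python sketch, assuming the file names from the example (the script name is made up):

$ cat csv_to_scidb_1d.py
# Minimal sketch: write the one-dimensional SciDB text format that
# csv2scidb produces for this example (FILE.csv -> FILE.csv2scidb).
import csv

with open('FILE.csv') as src, open('FILE.csv2scidb', 'w') as dst:
    rows = [row for row in csv.reader(src) if row]
    dst.write('{0}[\n')
    for idx, row in enumerate(rows):
        sep = ',' if idx < len(rows) - 1 else ''
        dst.write('(' + ','.join(row) + ')' + sep + '\n')
    dst.write(']\n')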

2. Convert one-dimensional SciDB format to two-dimensional SciDB format using a sequence of string replacements.

# - Every row: add '[' to start the row, in front of the '(' that starts the first column
# - Last row: add ']' to end the row
# - Every row (except the last): add ']' to close the row after the final ')';
#   temporarily replace the trailing ',' with ';'
# - Between columns: add ')' to close a column and '(' to open the next one
# - Undo the temporary replacement of ',' with ';'
# - Add the block identifier for the column dimension
$ sed "s/(/[(/g" FILE.csv2scidb | \
      sed "s/]/]]/"             | \
      sed "s/),/)];/g"          | \
      sed "s/,/),(/g"           | \
      sed "s/;/,/g"             | \
      sed "s/{0/{0,0/" > FILE.scidb
$ cat FILE.scidb 
{0,0}[
[(1),(2),(3),(4),(5)],
[(6),(7),(8),(9),(10)],
[(11),(12),(13),(14),(15)
]]
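
The same two-dimensional format can also be written directly from the CSV, skipping csv2scidb and the sed pipeline. A minimal Python sketch, again with a made-up script name and the file names from the example (the line breaks differ slightly from the sed output, but the structure is the same):

$ cat csv_to_scidb_2d.py
# Minimal sketch: write the two-dimensional SciDB text format directly
# from the CSV, equivalent to the sed pipeline above.
import csv

with open('FILE.csv') as src, open('FILE.scidb', 'w') as dst:
    rows = [row for row in csv.reader(src) if row]
    dst.write('{0,0}[\n')        # block identifier: the chunk starts at (0,0)
    for idx, row in enumerate(rows):
        cells = ','.join('(%s)' % v for v in row)
        sep = ',' if idx < len(rows) - 1 else ''
        dst.write('[' + cells + ']' + sep + '\n')
    dst.write(']\n')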

3. Load two-dimensional SciDB file into SciDB array.

AFL% create array FILE<val:int8>[i=0:2,3,0, j=0:4,5,0];
AFL% store(input(FILE, '/home/scidb/FILE.scidb'), FILE);
{i,j} val
{0,0} 1
{0,1} 2
{0,2} 3
{0,3} 4
{0,4} 5
{1,0} 6
{1,1} 7
{1,2} 8
{1,3} 9
{1,4} 10
{2,0} 11
{2,1} 12
{2,2} 13
{2,3} 14
{2,4} 15

The block identifiers in the .scidb file ({0,0} here, the starting coordinates of the chunk) and the chunk sizes in the create array statement (3 and 5 here, so that the whole matrix fits into a single chunk) might need adjustment on a case-by-case basis.
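
For scripted loads, the two statements above can also be run non-interactively through the iquery client. A minimal Python sketch, assuming iquery is on the PATH of a machine with access to the SciDB instance (the script name is made up):

$ cat load_file_array.py
# Minimal sketch: run the create/store statements through iquery.
import subprocess

queries = [
    "create array FILE<val:int8>[i=0:2,3,0, j=0:4,5,0]",
    "store(input(FILE, '/home/scidb/FILE.scidb'), FILE)",
]
for q in queries:
    # -a: AFL mode, -n: do not fetch the result, -q: query text
    subprocess.run(['iquery', '-a', '-n', '-q', q], check=True)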


#2

Oh boy. LOTS of changes since this.

Your best bet for simple experimentation is to look at this much more recent write-up. If you are trying to do something more demanding or more complex, your best bet is to chat with our customer services folks.


#3

Thanks for the pointer to the write-up. It is informative but it does not address this case.

For the sake of completeness: the csv2scidb tool was removed in 15.7, but the accelerated_io_tools plug-in is great for this type of data.

Starting from the original example input file:

$ cat /tmp/FILE.csv
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15

Once the accelerated_io_tools plug-in is installed and loaded, the file can be loaded like this:

AFL% redimension(
       between(
         apply(
           aio_input(
             '/tmp/FILE.csv',
             'attribute_delimiter=,', 'num_attributes=5', 'split_on_dimension=1'),
           i, tuple_no,
           j, attribute_no,
           val, int8(a)),
         0, 0, 0, 0, 
         2, 0, 0, 4),
       <val:int8 null>[i=0:2,3,0, j=0:4,5,0]);
{i,j} val
{0,0} 1
{0,1} 2
{0,2} 3
{0,3} 4
{0,4} 5
{1,0} 6
{1,1} 7
{1,2} 8
{1,3} 9
{1,4} 10
{2,0} 11
{2,1} 12
{2,2} 13
{2,3} 14
{2,4} 15
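
Note that 'num_attributes=5', the between() bounds, and the target schema above are all specific to this 3x5 matrix. Here is a minimal Python sketch that just parameterizes the query shown above for an R-row, C-column int8 matrix (the script and function names are made up; restricting the instance and block coordinates to 0 in between() only works while the whole file fits into a single AIO block, see the next post):

$ cat build_aio_query.py
# Minimal sketch: build the aio_input/redimension query above for an
# R-row, C-column int8 matrix.
def build_query(path, rows, cols):
    return (
        f"redimension("
        f"between("
        f"apply("
        f"aio_input('{path}', 'attribute_delimiter=,', "
        f"'num_attributes={cols}', 'split_on_dimension=1'), "
        f"i, tuple_no, j, attribute_no, val, int8(a)), "
        f"0, 0, 0, 0, {rows - 1}, 0, 0, {cols - 1}), "
        f"<val:int8 null>[i=0:{rows - 1},{rows},0, j=0:{cols - 1},{cols},0])"
    )

print(build_query('/tmp/FILE.csv', 3, 5))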

#4

Just one bit of warning with AIO: if you had more than 1 block of data then you would see a “gap” in the tuple_no assignment. The default block size is 8MB and the default chunk size is 10 million.

So you might have something like:

1. Block 1: tuple_no = [0 ... 512,345] (depending on how many rows fit into 8MB)
2. Block 2: tuple_no = [10,000,000 ... 10,535,623] (can't assume that blocks would contain equal #rows either)
3. Block 3: tuple_no = [20,000,000 ... 206...

Since you redimension using i = tuple_no, the result would be sparsely populated. This may or may not be a problem for you. Once you load the data in this form, you could apply tuple_no as an attribute and then sort by that attribute - that would give you a dense assignment.
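
To make the arithmetic concrete, here is a small Python sketch, assuming the default 10-million chunk size mentioned above and made-up per-block row counts:

$ cat tuple_no_gaps.py
# Minimal sketch: each AIO block numbers its tuples starting at
# block_no * chunk_size, so the tuple_no coordinates are sparse; sorting
# by tuple_no and re-enumerating gives a dense numbering.
CHUNK_SIZE = 10_000_000                     # default chunk size

# Made-up row counts per 8MB block (blocks need not hold equal #rows).
rows_per_block = [512_346, 535_624]

sparse = [block_no * CHUNK_SIZE + offset
          for block_no, nrows in enumerate(rows_per_block)
          for offset in range(nrows)]
print(sparse[512_345], sparse[512_346])     # 512345 10000000  <- the gap

# Dense re-assignment: sort by tuple_no, then number tuples 0, 1, 2, ...
dense = {t: n for n, t in enumerate(sorted(sparse))}
print(dense[10_000_000])                    # 512346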

An older version of load_tools actually counted the number of newlines and assigned tuple_no densely. That made it easier to map error locations to file line numbers. But it meant scanning for the ‘\n’ character on the coordinator. So AIO was made and we saw at least a 3-4x speed boost - but the gaps are part of the price.

We have thought about potentially adding a ‘dense_tuple_no’ option to AIO. This problem comes up sometimes - not very often.