[Help] suggest best way to load this data


#1

Hi,

I am trying to load a 361 (row) x 481 (column) matrix that is separated with a tab delimiter.
Every cell holds a double-precision float (e.g. -32767.0000).

I have two difficulties loading it into SciDB.
First, how can I replace the tabs in the matrix with commas using csv2scidb?
Second, should I go through the full redimension process? (e.g. raw CSV matrix -> single-column matrix -> one-dimensional array in SciDB -> multi-dimensional array in SciDB)

What do you think is the best way to load it?


#2

One at a time.

  1. “How can I replace the tabs in the matrix with commas using csv2scidb?”

So your data looks something like this?

[plumber@localhost tests]$ cat /tmp/Foo
0	0	10.0
0	1	10.0
0	2	10.0
1	0	10.0
1	1	10.0
1	2	10.0
2	0	10.0
2	1	10.0
2	2	10.0
[plumber@localhost tests]$

OK. Have a look at csv2scidb. There’s a command line option -d that lets you specify the field delimiter. In the case of [tab] it’s a bit tricky, but the following little example illustrates the idea.

[plumber@localhost tests]$ csv2scidb -help
csv2scidb: Convert CSV file to SciDB input text format.
Usage:   csv2scidb [options] [ < input-file ] [ > output-file ]
Default: -f 0 -c 1000000 -q
Options:
  -v        version information
  -i PATH   input file
  -o PATH   output file
  -a PATH   appended output file
  -c INT    length of chunk
  -f INT    starting coordinate
  -n INT    number of instances
  -d CHAR   delimiter: defaults to ,
  -p STR    type pattern: N number, S string, s nullable-string,
            C char, c nullable-char
  -q        quote the input line exactly by wrapping it in ()
  -s N      skip N lines at the beginning of the file
  -h        prints this helpful message

Note: the -q and -p options are mutually exclusive.

[plumber@localhost tests]$ cat /tmp/Foo | csv2scidb -c 3 -d "\t" -p NNN
{0}[
(0,0,10.0),
(0,1,10.0),
(0,2,10.0)
];
{3}[
(1,0,10.0),
(1,1,10.0),
(1,2,10.0)
];
{6}[
(2,0,10.0),
(2,1,10.0),
(2,2,10.0)
];
[plumber@localhost tests]$ 

Which yields the appropriate 1D load form.

  1. "Should I go through the course in redimension? "

    Answer in three parts.

2.1

What’s your ultimate goal? If it were me, the usual procedure is:

i. load ( 1D_Array, ‘file or pipe’ );
ii. redimension_store ( 1D_Array, Target_Array );

iquery -aq "CREATE ARRAY Test_1D
<
  I   : int64,
  J   : int64,
  val : double
>
[ ROWNUM=0:*,3,0 ];"

iquery -aq "CREATE ARRAY Test
<
  val : double
>
[ I=0:2,3,0, J=0:2,3,0 ];"

[plumber@localhost tests]$ rm -rf /tmp/load.fifo
[plumber@localhost tests]$ mkfifo /tmp/load.fifo
[plumber@localhost tests]$ cat /tmp/Foo | csv2scidb -c 3 -d "\t" -p NNN > /tmp/load.fifo &

iquery -aq "load ( Test_1D, '/tmp/load.fifo' );"
[(0,0,10),(0,1,10),(0,2,10),(1,0,10),(1,1,10),(1,2,10),(2,0,10),(2,1,10),(2,2,10)]

iquery -aq "redimension_store ( Test_1D, Test );"
[[(10),(10),(10)],[(10),(10),(10)],[(10),(10),(10)]]

2.2 How ambitious are you? This load procedure, converting ASCII .csv files to the internal load format and from there using redimension_store(…), was the first approach we developed. But it’s ( a ) slow (ASCII parsing is hard, and a lot of data is binary), ( b ) not parallelized (you want to split the load up over multiple machines), and ( c ) has a lot of moving parts.

Have a look at the documentation, where we describe multiple loading approaches that are new since about 12.12. These include parallel loading, an efficient binary load, etc.
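
For a flavor of the binary path (the reason it’s faster: no ASCII parsing at all), a load call along the following lines is possible in later releases. Treat the exact signature and the -2 instance argument as assumptions and check your version’s documentation; /tmp/Foo.bin is a made-up file of packed records:

# Hypothetical binary load: the file holds packed (int64, int64, double)
# records matching Test_1D's attributes, and the format string tells
# load() how to unpack them.
iquery -aq "load ( Test_1D, '/tmp/Foo.bin', -2, '(int64,int64,double)' );"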

2.3 There’s also a convenient, easy-to-use tool out there now called loadcsv.py. We’ve put considerable work into making it efficient, parallel, and easy to use, and it’s the vehicle we’ll be using to surface load improvements in the future. If you have “special” requirements then you can always fall back on the lower-level tools we provide. But loadcsv.py is there to make your initial loading / redimension_store(…) experience as easy and fast as possible.

[plumber@localhost tests]$ loadcsv.py -help
Usage: loadcsv.py [options]

SciDB Parallel CSV Loader

Options:
  -h, --help            show this help message and exit
  -d DB_ADDRESS         SciDB Coordinator Hostname or IP Address (Default =
                        "localhost")
  -p DB_PORT            SciDB Coordinator Port (Default = 1239)
  -r DB_ROOT            SciDB Installation Root Folder (Default =
                        "/Devel/trunk/stage/install")
  -i INPUT_FILE         CSV Input File (Default = stdin)
  -n SKIP               # Lines to Skip (Default = 0)
  -t TYPE_PATTERN       CSV Field Types Pattern: N number, S string, s
                        nullable-string, C char (e.g., "NNsCS")
  -D DELIMITER          Delimiter (Default = ",")
  -f STARTING_COORDINATE
                        Starting Coordinate (Default = 0)
  -c CHUNK_SIZE         Chunk Size (Default = 500000)
  -o OUTPUT_BASE        Output File Base Name (Default = INPUT_FILE or
                        "stdin.csv")
  -m                    Create Intermediate CSV Files (not FIFOs)
  -l                    Leave Intermediate CSV Files
  -M                    Create Intermediate DLF Files (not FIFOs)
  -L                    Leave Intermediate DLF Files
  -P SSH_PORT           SSH Port (Default = System Default)
  -u SSH_USERNAME       SSH Username
  -k SSH_KEYFILE        SSH Key/Identity File
  -b                    SSH Bypass Strict Host Key Checking
  -a LOAD_NAME          Load Array Name
  -s LOAD_SCHEMA        Load Array Schema
  -w SHADOW_NAME        Shadow Array Name
  -e ERRORS_ALLOWED     # Load Errors Allowed per Instance (Default = 0)
  -x                    Remove Load and Shadow Arrays Before Loading (if they
                        exist)
  -A TARGET_NAME        Target Array Name
  -S TARGET_SCHEMA      Target Array Schema
  -X                    Remove Target Array Before Loading (if it exists)
  -v                    Display Verbose Messages
  -V                    Display SciDB Version Information
  -q                    Quiet Mode
[plumber@localhost tests]$ 
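
By way of illustration, here is a hypothetical invocation for the /tmp/Foo file above, built only from the options listed in the help text. The array names and schemas are made up, and whether -D accepts an escaped tab exactly the way csv2scidb’s -d does is worth verifying on your build:

# Parallel load of tab-delimited (I, J, val) triples straight through to
# a redimensioned 2D target, removing any leftover arrays first (-x, -X).
loadcsv.py -i /tmp/Foo -D "\t" -t NNN \
           -a Test_1D_Load -s "<I:int64, J:int64, val:double> [RowNum=0:*,500000,0]" \
           -A Test -S "<val:double> [I=0:2,3,0, J=0:2,3,0]" \
           -x -X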

#3

Thanks. Your reply about using a FIFO and ‘loadcsv.py’ is very helpful to me.

But I still find csv2scidb inconvenient.
I realize my question was not detailed enough.

My matrix looks like this.

0 1 2 3 4 5 6 … 360
1 1 2 3 4 5 6 … 360
…
480 1 2 3 4 … 360

Yes, the raw data matrix is 2D with only one attribute.
That’s why I could not give the ‘-p’ option to csv2scidb.

I found that SciDB can directly load a 2D matrix in SciDB format without a temporary 1D matrix,
so I wrote a Python program to convert my TSV file into the 2D SciDB format.

I hope that csv2scidb could convert a 2D CSV file to the 2D SciDB format and load it into SciDB directly.


#4

Sadly, no.

But that said, have a look at the format that csv2scidb generates. It should be quite easy to convert your 2D ASCII array representation to the 2D SciDB load format. To give you an even better idea of what that looks like …

#!/bin/sh
#
#
exec_afl_query () {
    echo "Query: ${1}"
    /usr/bin/time -f "Elapsed Time: %E" iquery -o dcsv ${2} -aq "${1}"
};
#
#------------------------------------------------------------------------------
#
#  Hygiene.
CMD="remove ( Simple_2D )"
exec_afl_query "${CMD};"
#
#  Create the array. 
CMD="
CREATE ARRAY Simple_2D 
<
    val : double 
>
[ I=0:7,4,0, J=0:2,3,0]
"
exec_afl_query "${CMD};"
#
# Populate the array.
CMD="
store ( 
  build ( 
    Simple_2D,
    double( I * 4 + J )
  ),
  Simple_2D
)
"
exec_afl_query "${CMD};" -n 
#
# Save the contents of the array into a 2D load file. 
CMD="
save ( 
  Simple_2D, 
  '/tmp/Simple_2D.dat' 
)
"
exec_afl_query "${CMD};" -n
#
# And what does that file look like? 
cat /tmp/Simple_2D.dat

Query was executed successfully
Elapsed Time: 0:00.17
{0,0}[[(0),(1),(2)],[(4),(5),(6)],[(8),(9),(10)],[(12),(13),(14)]];
{4,0}[[(16),(17),(18)],[(20),(21),(22)],[(24),(25),(26)],[(28),(29),(30)]]

The last part of the script above illustrates the ASCII load format for SciDB. We really don’t encourage folk to use this, though; the only reason I’m bringing it up here is that it sounds like you already have data roughly in array shape, so it might be simpler for you to get to this format from where you are (see the sketch after the list below).

In slightly more formal terms:

  1. The load file is divided into “blocks”; one block per target chunk.

  2. The “blocks” are prepended with {X,Y} identifiers. These contain the chunk’s “origin”; the minimum integer value along each dimension for the chunk.

  3. The chunk data itself uses “[…]” characters to delimit rows and columns within the chunk. The outer “[…]” mark the beginning and end of the chunk, and the inner “[…]” delimit the start/end of each row within it.

  4. The attribute tuple is enclosed within a “(…)” pair. Attributes are comma separated. Strings are within quotes.

  5. If the cell is “empty” (i.e. the array is sparse, and the cell isn’t there) then we use “()” to indicate that.

This load format is really only good for “toy” examples. Most data being loaded into SciDB comes to us as a .csv file with “one row per cell”, and we have built machinery to convert 1D load data into nD target arrays.
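
That said, to make it concrete, here is a minimal sketch of how you might get from a tab-separated dense matrix to this format at the shell, assuming the whole matrix fits in a single chunk so that there is exactly one {0,0} block (the file names, the array name Matrix_2D, and the dimensions in the comment are placeholders for your 361x481 case):

# Wrap an entire TSV matrix in one {0,0} block: one inner [...] per row,
# one (value) per cell. Assumes the target array's chunks cover the whole
# matrix, e.g. [ I=0:360,361,0, J=0:480,481,0 ].
awk -F'\t' '
BEGIN  { printf "{0,0}[" }
NR > 1 { printf "," }
       { printf "["
         for (i = 1; i <= NF; i++)
             printf "%s(%s)", (i > 1 ? "," : ""), $i
         printf "]" }
END    { print "];" }
' /tmp/matrix.tsv > /tmp/matrix.scidb

iquery -aq "load ( Matrix_2D, '/tmp/matrix.scidb' );"

Splitting the output into multiple {X,Y} blocks to match smaller chunks follows the same pattern, just with more bookkeeping.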

Hope this helps!


#5

Thank you very much.

Your reply is a great help to understand scidb loading format.


#6

Sorry to interrupt you. When I use the csv2scidb command just as you said, it doesn’t work.


#7

Hi fate_testarossa -

How are you using it? And what do you get / see?

Paul


#8

[quote=“plumber”]Hi fate_testarossa -

How are you using it? And what do you get / see?

Paul[/quote]
That’s OK now, but it is too slow. Sometimes it stops, indicating a network error.

Can you help with the error that I encountered: viewtopic.php?f=11&t=1307

Thanks!