Can you suggest a better way to import data?


#1

When I use csv2scidb and load() to import data, I find it is too slow. Furthermore, when I use the loadcsv.py command I run into a problem, and from reading topics on this forum I gather that loadcsv.py is also slow. Can you suggest a faster way to import data?
Thanks!


#2

For those interested in understanding SciDB’s load options, I’ve put together a (pretty long) script that describes the various options and gives users some tools to test 'em out. The basic results are summarized below.

Executive Summary:

  1. ASCII/UTF8 loading, the default, is the most flexible but the slowest method for loading data into SciDB. Because data often comes in the form of a .csv file, it’s typically necessary to use ‘csv2scidb’ to convert it into the external SciDB ASCII/UTF8 format, although this conversion adds to the computational overhead of the load operation.

  2. ASCII/UTF8 loading in parallel considerably speeds things up, but requires some kind of splitter to divide the input file up into one data source for the parallel loader per SciDB instance. This ‘splitter’ function is what’s implemented by the loadcsv.py script.

  3. BINARY loading is much faster than ASCII/UTF8, but is (at the moment) limited by the fact that we can only load BINARY in serial (through the coordinator). BINARY loading is ideal when what you’re trying to load is a stream of binary data from a single source.

  4. OPAQUE save/load is the most efficient mechanism for getting data into and out of a SciDB cluster. The OPAQUE format essentially copies the native chunk format, together with some meta-data, and avoids almost all of the CPU load associated with type conversion, etc.

In general, OPAQUE loading is the ideal vehicle for moving data between SciDB instances, for backup, or when upgrading SciDB to a new version. With the 14.3 release of SciDB we’re introducing a script to automate the backup process for SciDB. This script uses the OPAQUE mode for exporting data.

  5. If you’re looking to support incremental load in your application, there’s a little sub-section at the end of this script which will give you a basis for measuring load rates and assessing the kinds of SLA guarantees you can expect when loading data into SciDB.
#!/bin/sh 
#
#   File:   Loading_Answers.sh
#
#  About: 
#
#   This script consists of an extended example intended to answer questions
#  about SciDB loading, and the data re-organization into arrays suitable 
#  for querying. 
#
#   You can run this script from start to finish if you'd like. But a better
#  way to use this is to examine its contents with a bit of care, cutting-n-
#  pasting sections of it into another window. There are a number of shell 
#  functions in this script to simplify the script and to let us focus on 
#  the important things in here. 
#
#------------------------------------------------------------------------------
#
#  Useful shell script functions. Use these to drive queries against the 
# SciDB instance and to measure how long things take. 
#
exec_afl_query () {
    echo "Query: ${1}"
    /usr/bin/time -f "Elapsed Time: %E" iquery -o dcsv ${2} -aq "${1};"
};
#
exec_aql_query () {
    echo "Query: ${1}"
    /usr/bin/time -f "Elapsed Time: %E" iquery -o dcsv ${2} -q "${1};"
};
#
#------------------------------------------------------------------------------
#
#  Scale Parameters:
#  ~~~~~~~~~~~~~~~~
#
#  To make this script a little more useful (to allow it to run over a 
# variety of array sizes ) the following little segment allows you to control 
# sizes (to scale the tests over multiple instances) of the dimensions of the 
# arrays we're testing. 
#
CHUNK_LEN=1000
DIM_LEN=10
#
#  The values above will give you two-dimensional arrays, where each dimension
#  has 10 chunks along its length--meaning 10 x 10 = 100 chunks overall--and
#  each chunk contains 1000 x 1000 = 1,000,000 cells. Note that there are 
#  parts of the script below which deal with sparse data at 0.5 occupancy. 
#
DIM_MAX=`expr \( ${DIM_LEN} \* ${CHUNK_LEN} \) - 1`
DIM_STR="0:${DIM_MAX},${CHUNK_LEN},0"
UNBOUNDED_DIM_STR="0:*,${CHUNK_LEN},0"
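#
#  With the defaults above (CHUNK_LEN=1000, DIM_LEN=10) those variables 
# expand to: 
#
#    DIM_MAX           = 9999
#    DIM_STR           = "0:9999,1000,0"
#    UNBOUNDED_DIM_STR = "0:*,1000,0"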
#
#------------------------------------------------------------------------------
#
#   Timings in this Script: Physical Configuration: 
#   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
#   Note that the box I am running all of these experiments on is rather 
#  old, and more geared to getting a handle on CPU, not I/O issues. 
#
# CPU:
#
#   From 'cat /proc/cpuinfo'
#
#   4 x Intel(R) Core(TM) i3 CPU         540  @ 3.07GHz
#   cpu cores   : 2
#   
#   From 'free -m'
#
#              total       used       free     shared    buffers     cached
# Mem:          7976       2148       5828          0        598       1087
#
#    8G of RAM. 
#
# Disk: 
#
#  From 'df -k' 
#
#   /dev/sda1            957380244   75791080 832957052   9% /
#
#
# SciDB Configuration: 
#
#   4 x Instances, 2G max per instance. 
#
#  $ scidb --version
#  SciDB Version: 14.3.7299
#  Build Type: RelWithDebInfo
#  Copyright (C) 2008-2013 SciDB, Inc.
#
#   Anyway, you get the idea. This is a pretty small, old "test" box, and 
#  although SciDB is installed in "cluster" form all instances share one disk. 
#  So whatever parallelism I'm able to show is only the consequence of CPU 
#  parallelism, and would (of course) be improved by having more disks (one 
#  per instance maximizes your parallelism, but your mileage will vary depending
#  on the nature of your workload). 
# 
#------------------------------------------------------------------------------
#
#   Functions to Re-Create and Populate Example Arrays: 
#   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
#   The following shell functions manipulate arrays; create them, populate 
#  them, etc. From time to time in this script we will reset the contents of 
#  the arrays we're going to work with here. Best to encapsulate those 
#  operations in functions. 
#
recreate() { 
#
#   Function to drop and (re-)create a 2D array with three attributes. Name 
#  of the array is the function's only argument. 
CMD_HYGIENE="DROP ARRAY ${1}";
exec_aql_query "${CMD_HYGIENE}"
#
CMD_CREATE_ARRAY="
CREATE ARRAY ${1} 
<
  attr1 : int32,
  attr2 : int32,
  attr3 : double 
>
[ I=${DIM_STR}, J=${DIM_STR} ]
"
exec_aql_query "${CMD_CREATE_ARRAY}"
#
}
#
recreate_unbounded() { 
#
CMD_HYGIENE="DROP ARRAY ${1}";
exec_aql_query "${CMD_HYGIENE}"
#
CMD_CREATE_ARRAY="
CREATE ARRAY ${1}
<
  attr1 : int32,
  attr2 : int32,
  attr3 : double
>
[ I=${UNBOUNDED_DIM_STR}, J=${DIM_STR} ]
"
exec_aql_query "${CMD_CREATE_ARRAY}"
#
}
#
populate() { 
#
#  Function to populate the 2D array whose name is the only argument. 
CMD_POPULATE_ARRAY="
store ( 
  project ( 
    apply ( 
      cross_join ( build ( < val : int32 > [ I=${DIM_STR} ], I ) AS X,
                   build ( < val : int32 > [ J=${DIM_STR} ], J ) AS Y
                 ),
      attr1, int32(X.val * ${DIM_LEN} + Y.val),
      attr2, int32(Y.val * ${DIM_LEN} + X.val),
      attr3, double( random()%1000000 )  / 1000000.0 
    ),
    attr1, attr2, attr3 
  ),
  ${1}
)
"
exec_afl_query "${CMD_POPULATE_ARRAY};" -n 
#
}
#
populate_sparse() { 
#
#   Minor variant on the example above which populates only 50% of the cells 
#  in the array. 
CMD_POPULATE_ARRAY_SPARSE="
store ( 
  project ( 
    bernoulli ( 
      apply ( 
        cross_join ( build ( < val : int32 > [ I=${DIM_STR} ], I ) AS X,
                     build ( < val : int32 > [ J=${DIM_STR} ], J ) AS Y
                   ),
        attr1, int32(X.val * ${DIM_LEN} + Y.val),
        attr2, int32(Y.val * ${DIM_LEN} + X.val),
        attr3, double( random()%1000000 )  / 1000000.0 
      ),
      0.5 
    ),
    attr1, attr2, attr3 
  ),
  ${1}
)
"
exec_afl_query "${CMD_POPULATE_ARRAY_SPARSE};" -n
#
}
#
#------------------------------------------------------------------------------
#
#   1. Basic Serial ASCII/UTF8 save/load
#      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
#   The point of this section is to illustrate the external ASCII/UTF8 format 
#  we use to organize array data in text files. This format can require more 
#  storage, and parsing / processing it can be computationally expensive 
#  because the format concatenates an array's attributes into a single 
#  'logical' per-cell record on save(...), a step that must be reversed on 
#  load(...).
#
#    However, it is a quite flexible external storage format, and as you 
#  can see from this example it is capable of supporting efficient loading 
#  of parallel, pre-chunked data into SciDB. 
#
#    All timings shown for 10,000x10,000 arrays, with 1000x1000 chunks. 
#
recreate Two_D_Test;
populate Two_D_Test;
#
# Elapsed Time: 1:01.67    
# Elapsed Time: 1:01.41
#
#  62 seconds. 
#  
CMD_SAVE_2D_ASCII_UTF8="
save ( Two_D_Test, '/tmp/Two_D_ChunkWise.txt')
"
exec_afl_query "${CMD_SAVE_2D_ASCII_UTF8};" -n 
#
#  Query was executed successfully
#  Elapsed Time: 8:42.25
#  Elapsed Time: 8:46.33
#
#   524 seconds. 
#
#   Have a look at the file /tmp/Two_D_ChunkWise.txt on the coordinator 
#  instance. (If you only have one SciDB instance in your installation, then 
#  that one is the coordinator.) The structure of that file will tell you all 
#  the details you need to know about how to organize a 2D ASCII/UTF8 load 
#  file. The important point is that the format supports chunk-wise loading: 
#  there is no need to load in 1D (as from a .csv file) and then 
#  redimension(...) the data to n-D. 
#
#   Each such file contains many chunks. The basic model for each chunk 
#  looks like this: 
#
#  { chunk_origin_coordinates } [[ { cell_coordinates } ( attributes ), ...]]
#
#   Which in practice might look like: 
#
#  { 0,0 } [[ {0,0}( 1,2,3.0 ), {0,2}( 2,3,4.0)  ... ]] 
#  {0,1000} [[ ... ]]
#
#   If the chunk is dense, then we might omit the inner "cell_coordinates". 
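#
#   Schematically (following the rule above, not actual generated output), 
#  a dense chunk might then look like: 
#
#  { 0,0 } [[ ( 1,2,3.0 ), ( 2,3,4.0 ), ... ]] 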
#
recreate Two_D_Test_Two;
#
#  Now. To load the data back, use the load(...) operator. 
#  
CMD_LOAD_2D_ASCII_UTF8="
load ( Two_D_Test_Two, '/tmp/Two_D_ChunkWise.txt')
"
exec_afl_query "${CMD_LOAD_2D_ASCII_UTF8};" -n 
#
#  Elapsed Time: 3:06.78
#  Elapsed Time: 3:08.69
#
#   187 seconds. 
#
#   As you can see, the load(...) is relatively fast, compared to the 
#  save(...). 
#
CMD_CHECK_ARRAY_EQUIVALENCE="
SELECT COUNT(*) 
  FROM Two_D_Test AS T1, Two_D_Test_Two AS T2
 WHERE T1.attr1 <> T2.attr1
    OR T1.attr2 <> T2.attr2
    OR T1.attr3 <> T2.attr3
"
exec_aql_query "${CMD_CHECK_ARRAY_EQUIVALENCE};"
#
# {i} COUNT
# {0} 0
#
#   Most SciDB users start with .csv files, rather than with files formatted
#  according to what SciDB wants to see. So we ship an external tool--
#  csv2scidb--to convert the contents of the .csv file into 1D, dense, load
#  data. The 1D load data must be converted into the n-D target before you can 
#  do any kind of serious analytic work with it, however. 
#
#   For more on this process, see below, where we load data from a BINARY 
#  source and redimension(...) it into the target shape on the fly. 
#  
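#   As a purely illustrative sketch of that csv2scidb path (the tiny .csv 
#  created here, the csv2scidb flags shown, and the 1D load schema are 
#  assumptions chosen for the example; they play no part in the timings in 
#  this script, and csv2scidb is assumed to be on your PATH): 
#
#  Make a small .csv with a header row and a couple of data rows. 
echo "I,J,attr1,attr2,attr3
0,0,1,2,0.5
0,1,2,3,0.25" > /tmp/Tiny_Example.csv
#
#  Convert it to the 1D SciDB text load format: '-s 1' skips the header row, 
#  '-p NNNNN' marks five numeric columns, '-c' sets the output chunk size. 
csv2scidb -s 1 -p NNNNN -c 1000000 < /tmp/Tiny_Example.csv > /tmp/Tiny_Example.scidb
#
#  The 1D file can then be loaded and redimension(...)-ed into a 2D target 
#  on the fly. (The exec call is left commented out so it does not disturb 
#  the arrays used for the timings below.) 
CMD_LOAD_FROM_CSV="
store (
  redimension (
    input (
      < I : int64, J : int64, attr1 : int32, attr2 : int32, attr3 : double >
      [ RowNum=0:*,1000000,0 ],
      '/tmp/Tiny_Example.scidb'
    ),
    Two_D_Test_Two
  ),
  Two_D_Test_Two
)
"
# exec_afl_query "${CMD_LOAD_FROM_CSV};" -n
#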
#------------------------------------------------------------------------------
#
#   2. Parallelism with ASCII/UTF8 save/load 
#      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
#   The obvious problem with the kind of save/load above is that the entire 
#  data load is pumped through a single instance ... the coordinator in this
#  case. So what we do is to allow you to save(...) and load(...) data into 
#  ASCII/UTF8 files in parallel. 
#
#   To accomplish this, we introduce a couple of additional (optional) params
#  to the save(...) and load(...) operators. To tell SciDB that you want 
#  to save(...) data in parallel on each instance and store data in the 
#  output file using the ASCII/UTF8 formatting, use the following command. 
#
CMD_SAVE_2D_ASCII_UTF8_PARALLEL="
save ( Two_D_Test, 'Two_D_ChunkWise', -1, 'auto')
"
exec_afl_query "${CMD_SAVE_2D_ASCII_UTF8_PARALLEL};" -n 
#
#  Elapsed Time: 4:23.79
#  Elapsed Time: 4:24.95
#
#  264 seconds. 
#
#   This parallelized unload is, for large parts of its run-time, IO bound, as
#  is the serialized save(...) above. 
#
#    The '-1' option in the third argument informs SciDB that it is to 
#   parallelize the save(...) operation. When operating in parallel, SciDB 
#   will create a local (per instance) copy of each instance's data. 
#
#    The 'auto' option specifies the format. Sometimes, people want to 
#   save(...) the per-instance data in .csv or some other format (to be loaded 
#   into a spreadsheet or passed over a network socket, say), and this option 
#   allows us to support that requirement. But for our purposes we save the 
#   file in the default, 'auto' format. The list of options here is the same 
#   list that is valid with the "-o" option in the "iquery" client tool. 
#
#    Each of the files produced by the save(...) command finds its way into 
#   the data directory, which you can find by looking at the base-path 
#   directory in the config.ini file. Generally you'll find a set of 
#   files all named 'Two_D_ChunkWise' in the base-path/000/Instance# 
#   directory. 
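#
#    For example, with base-path=/home/scidb/data (a placeholder; read the 
#   real value out of your own config.ini) and 4 local instances, you could 
#   list the per-instance files with something like: 
#
#      grep base-path /path/to/config.ini 
#      ls -l /home/scidb/data/000/*/Two_D_ChunkWise 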
#
#    Now ... how do you load(...) in parallel? 
#
recreate Two_D_Test_Two;
#
#   The following command illustrates how to load(...) the data we previously 
#  unloaded (using the save(...) command), using parallelism. Again, the '-1'
#  option tells SciDB to have each instance look in its local data directory 
#  for a file named 'Two_D_ChunkWise', and to load the chunks it finds there.
#  Note that the only valid format for a load file is the 'auto' format. 
#
#   Note that SciDB will internally figure out where each chunk actually 
#  belongs. So you can save(...) on (say) 4 instances, move the 4 files to 
#  4 of the instances in an 8-instance SciDB installation, and then 
#  use load(...) to migrate data between installations. 
#
CMD_LOAD_2D_ASCII_UTF8_PARALLEL="
load ( Two_D_Test_Two, 'Two_D_ChunkWise', -1)
"
exec_afl_query "${CMD_LOAD_2D_ASCII_UTF8_PARALLEL};" -n 
#
#  Elapsed Time: 1:36.54
#  Elapsed Time: 1:33.28
#
#   95 Seconds. 
#
#   Again, the scalability of the operation, in this case, is limited by the 
#  machinery. But this illustrates how parallelism dramatically reduces load 
#  times. 
#
#   In this case, loading and saving with 4 instances was about twice as 
#  fast as loading and saving by serializing through the coordinator. That's 
#  because--as we mentioned earlier--the machine on which this testing is 
#  being done only has a single disk. Better scalability can be achieved 
#  by actually parallelizing the I/O, as well as the CPU load. 
#
#   Check that the data is the same. . . 
#
CMD_CHECK_ARRAY_EQUIVALENCE="
SELECT COUNT(*) 
  FROM Two_D_Test AS T1, Two_D_Test_Two AS T2
 WHERE T1.attr1 <> T2.attr1
    OR T1.attr2 <> T2.attr2
    OR T1.attr3 <> T2.attr3
"
exec_aql_query "${CMD_CHECK_ARRAY_EQUIVALENCE};" 
# 
#  {i} COUNT
#  {0} 0
#
#   This example also illustrates a mechanism for loading pre-chunked (but 
#  ASCII/UTF8 formatted) data into SciDB. The parallelized, ASCII/UTF8 
#  formatted load method of loading data is ideal when what you have to load 
#  is presented as some kind of pre-chunked data, and you want to retain the 
#  flexibility that scripting languages can give you when working with the 
#  data. 
#
#   Using parallelized, ASCII/UTF8 loading gives you the scalability of 
#  parallelized loading and the flexibility of being able to use scripting 
#  languages to format your load data. Because you can load data in 
#  pre-chunked format, you will also be able to avoid the need to use 
#  redimension(...). BUT because you will be parsing ASCII/UTF8 to convert it 
#  into the appropriate binary types, using the ASCII/UTF8 format doesn't give 
#  you the best performance. 
#
#------------------------------------------------------------------------------
#
#   3. BINARY save/load:
#      ~~~~~~~~~~~~~~~~~
#
#   Given the overhead of parsing the ASCII/UTF8 format, and the way that 
#  BINARY encodings can be more space efficient than ASCII/UTF8, we introduced 
#  support for a 'binary' mode for data loading. The idea is that we're able 
#  to take a file that contains binary data and load it into SciDB, avoiding 
#  the need to parse the ASCII/UTF8 and convert it. As a result, loading data 
#  that's encoded in binary is going to be faster than loading the equivalent 
#  volume of data encoded as ASCII/UTF8. 
#
#   The following query illustrates how you use the SciDB save(...) to create 
#  an external binary file. 
#
recreate Two_D_Test;
populate Two_D_Test;
#
CMD_SAVE_2D_BINARY="
save ( Two_D_Test, 'Binary_Data', 0, '(int32,int32,double)')
"
exec_afl_query "${CMD_SAVE_2D_BINARY};" -n 
#
#   Elapsed Time: 0:44.52
#   Elapsed Time: 0:40.57
#
#   42 seconds. 
#
#   NOTE: Check the timing difference between the ASCII/UTF8 and the binary 
#         save and load. The ASCII/UTF8 unload took 524 seconds. BINARY is an 
#         order of magnitude faster because it incurs less CPU for conversion,
#         and because the size of the file is less. 
#
#  $ ls -lat /tmp/Two_D_ChunkWise.txt 
#  -rw-rw-r-- 1 plumber plumber 2288961889 Mar 19 16:14 /tmp/Two_D_ChunkWise.txt
#  $ ls -lat ~/Devel/Data/000/0/Binary_Data 
#  -rw-rw-r-- 1 plumber plumber 1600000000 Mar 19 17:04 /home/plumber/Devel/Data/000/0/Binary_Data
#
recreate Two_D_Test_Two;
#
CMD_LOAD_2D_BINARY_TWO="
load ( Two_D_Test_Two, 'Binary_Data', 0, '(int32,int32,double)')
"
exec_afl_query "${CMD_LOAD_2D_BINARY_TWO};" -n 
#
#  Elapsed Time: 1:08.53
#  Elapsed Time: 1:06.88
#
#  67 Seconds. 
#
#    NOTE: The load time for the parallel ASCII/UTF8 load was 95 seconds. The 
#          load from the single threaded ASCII/UTF8 was 187 seconds. 
#
CMD_CHECK_ARRAY_EQUIVALENCE="
SELECT COUNT(*) 
  FROM Two_D_Test AS T1, Two_D_Test_Two AS T2
 WHERE T1.attr1 <> T2.attr1
    OR T1.attr2 <> T2.attr2
    OR T1.attr3 <> T2.attr3
"
exec_aql_query "${CMD_CHECK_ARRAY_EQUIVALENCE};"
#
#  {i} COUNT
#  {0} 0
#
#   Now, any time you're dealing with binary data, there are a number of 
#  problems. 
#
#    1. The binary loader supports both missing codes, and variable length 
#       data types like strings. These factors make it relatively difficult to 
#       cut up a binary file along cell boundaries. For this reason, we don't
#       currently support parallel loading of binary data. (We might do so. 
#       Just not yet.) 
#
#    2. The BINARY mode is only really useful for dense array data. The array 
#       in the example above is dense, so we're able to save(...) and then 
#       reload(...) the 2D array's data. But if the array were sparse, we 
#       would have to "unpack" the sparse data (which has the effect of 
#       densifying it) and then redimension(...) it on the load(...) 
#       side. 
#   
#------------------------------------------------------------------------------
#
#   4. "Sparse" Data and BINARY save/load
#      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
#    Probably the biggest challenge with using BINARY mode is that the
#  data needs to be densely 'chunked'. For sparse data, the BINARY save/load 
#  pattern needs to use the unpack(...) operator to turn the n-D array data 
#  into a single, "dense" 1D array. 
#
recreate Two_D_Test;
populate_sparse Two_D_Test;
#
CMD_SAVE_2D_BINARY_SPARSE="
save ( 
  unpack( Two_D_Test, RowNum ), 
  'Binary_Data_Sparse', 0, '(int64,int64,int32,int32,double)'
)
"
exec_afl_query "${CMD_SAVE_2D_BINARY_SPARSE};" -n 
#
#  Elapsed Time: 0:57.06
#  Elapsed Time: 1:00.02
#
#   58 Seconds. 
#
#  NOTE: The number of values in the "sparse" data set is 1/2 the number 
#        of cells in the "dense" data set we were using prior. Thus 
#        the time taken to unpack(...) and save(...) this data is about 
#        the same for 1/2 the data as it is for the dense data. The 
#        cost of "unpacking" a chunk is about the same CPU cost as pulling
#        values from a chunk and rendering the data in binary format. 
#
recreate Two_D_Test_Two;
#
#   To reload the data, you need to load the 1D version, and then use the 
#  redimension(...) operator to convert it back into the original n-D target. 
#
CMD_LOAD_2D_BINARY_SPARSE="
redimension ( 
  input ( 
      <
        I     : int64,
        J     : int64,
        attr1 : int32,
        attr2 : int32,
        attr3 : double 
      > 
      [ RowNum=0:*,1000000,0 ], 
      'Binary_Data_Sparse', 0, '(int64, int64, int32,int32,double)'
  ),
  Two_D_Test_Two
)
"
exec_afl_query "${CMD_LOAD_2D_BINARY_SPARSE};" -n 
#
#  Elapsed Time: 3:00.70
#  Elapsed Time: 2:57.15
#
#   178 Seconds. 
#
#   So, to load(...) and redimension(...) about 1/2 the data takes about 
#  twice the amount of time as loading the dense data set. 
#
CMD_CHECK_ARRAY_EQUIVALENCE="
SELECT COUNT(*) 
  FROM Two_D_Test AS T1, Two_D_Test_Two AS T2
 WHERE T1.attr1 <> T2.attr1
    OR T1.attr2 <> T2.attr2
    OR T1.attr3 <> T2.attr3
"
exec_aql_query "${CMD_CHECK_ARRAY_EQUIVALENCE};"
#
#  {i} COUNT
#  {0} 0
#
#   So, the advantage of BINARY is that it's much faster than ASCII/UTF8 on 
#  small to medium sized load operations. But it does have the disadvantage 
#  that it's slightly less flexible, and can't be parallelized. It has the 
#  further disadvantage that if you're dealing with "sparse" data, you need 
#  extra processing to convert it to and from the 1D dense representation. 
#
#-----------------------------------------------------------------------------
#
#   5. OPAQUE save / load
#      ~~~~~~~~~~~~~~~~~~
#
#   Even though it saves CPU processing time and storage, managing BINARY 
#  data is still not as efficient as the last save / load method we'll 
#  look at. The motivation for the OPAQUE option was to provide something
#  that was as fast as possible. When you save(...) data in the OPAQUE mode, 
#  what we write to the file is essentially the internal chunk organization. 
#  That is, an array's attributes are pre-stripped into chunks (the other 
#  load(...) methods need to strip the cell-as-tuples into SciDB's 'columnar' 
#  format), and organized using the chunk's internal layout (run length 
#  encoded). 
#
#   The OPAQUE mode also works in parallel. As with the parallel ASCII/UTF8 
#  mode, the external files are located in each instance's data directory. 
#
#   Let's look at the dense data first. 
#
recreate Two_D_Test;
populate Two_D_Test;
#
CMD_SAVE_2D_OPAQUE="
save ( Two_D_Test, 'Opaque_Data', -1, 'OPAQUE')
"
exec_afl_query "${CMD_SAVE_2D_OPAQUE};" -n 
#
#  Elapsed Time: 0:11.13
#  Elapsed Time: 0:13.12
#
#   12 seconds. 
#
recreate Two_D_Test_Two;
#
CMD_LOAD_2D_OPAQUE="
load ( Two_D_Test_Two, 'Opaque_Data', -1, 'OPAQUE')
"
exec_afl_query "${CMD_LOAD_2D_OPAQUE};" -n 
#
#   Elapsed Time: 0:30.05
#   Elapsed Time: 0:28.87
#
#   29 Seconds. 
#
#   Loading is much more expensive than saving, because we need to do a 
#  lot more work to organize the meta-data: track chunks, their locality, 
#  and so on. However, as you can see from these timings, the OPAQUE 
#  load(...) is much faster than the BINARY. 
#
CMD_CHECK_ARRAY_EQUIVALENCE="
SELECT COUNT(*) 
  FROM Two_D_Test AS T1, Two_D_Test_Two AS T2
 WHERE T1.attr1 <> T2.attr1
    OR T1.attr2 <> T2.attr2
    OR T1.attr3 <> T2.attr3
"
exec_aql_query "${CMD_CHECK_ARRAY_EQUIVALENCE};"
#
#  {i} COUNT
#  {0} 0
#
#    The OPAQUE format works fine for sparse data also. 
#
recreate Two_D_Test;
populate_sparse Two_D_Test;
#
CMD_SAVE_2D_SPARSE_OPAQUE="
save ( Two_D_Test, 'Opaque_Sparse_Data', -1, 'OPAQUE')
"
exec_afl_query "${CMD_SAVE_2D_SPARSE_OPAQUE};" -n 
#
#  Elapsed Time: 0:12.65
#  Elapsed Time: 0:11.50
#
#  12 Seconds. 
#   
recreate Two_D_Test_Two;
#
CMD_LOAD_2D_SPARSE_OPAQUE="
load ( Two_D_Test_Two, 'Opaque_Sparse_Data', -1, 'OPAQUE')
"
exec_afl_query "${CMD_LOAD_2D_SPARSE_OPAQUE};" -n 
#
#  Elapsed Time: 0:59.88
#  Elapsed Time: 0:52.57
#
#  55 Seconds. 
#
exec_aql_query "${CMD_CHECK_ARRAY_EQUIVALENCE};"
#
#  {i} COUNT
#  {0} 0
#
#-----------------------------------------------------------------------------
#
#  6. Trickle Inserts and "Real Time" Analytics
#     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
#   One of the important, emerging use-cases for SciDB is support for "trickle 
#  loading" of data into SciDB. That is, the ability to support streaming 
#  ingest of data as it is generated by (for example) a sensor or some other 
#  software. To support trickle inserts SciDB provides the insert(...) 
#  operator, which appends data to the "end" of an array. 
#
#    An important property of any streaming analytic system is the kind of 
#  guarantee it makes with respect to how current data is. In other words, 
#  what is the interval of time between when the data is generated (or made 
#  available to SciDB for loading/ingest) and when it is visible to the 
#  query language. In this section we'll also provide a quick guide to 
#  let users get a handle on the kinds of guarantees they can make with their 
#  SciDB system. 
#  
#   First, let's look at how the insert(...) operator works, and contrast it 
#  with the store(...) used above. 
#
recreate Two_D_Test;
populate Two_D_Test;
#
#    The target array for trickle inserts, by definition, will have at least 
#   one unbounded dimension. It's considered a good practice in SciDB to 
#   append chunk-at-a-time, rather than incur the additional overhead of  
#   managing per-chunk histories. 
# 
recreate_unbounded Test_2D_Three;
#
#     We assume that the file 'Binary_Data_Sparse' is still in the coordinator's
#   data directory. The following query loads the contents of this file, and 
#   stores it in the unbounded array "Test_2D_Three". 
#  
CMD_LOAD_UNBOUND_2D_BINARY_SPARSE="
store ( 
  redimension (
    input (
        <
          I     : int64,
          J     : int64,
          attr1 : int32,
          attr2 : int32,
          attr3 : double
        >
        [ RowNum=0:*,1000000,0 ],
        'Binary_Data_Sparse', 0, '(int64, int64, int32,int32,double)'
    ),
    Test_2D_Three
  ),
  Test_2D_Three
)
"
exec_afl_query "${CMD_LOAD_UNBOUND_2D_BINARY_SPARSE};" -n
#
#  Elapsed Time: 5:02.81
#  Elapsed Time: 4:47.41
#
iquery -q "SELECT COUNT(*) FROM Test_2D_Three;"
# 
# {i} COUNT
# {0} 49991808
#
#    This next query is almost identical to the store(redimension(input(...)))
#  query above, except that it uses insert(...) rather than store(...), and 
#  the apply(...) operator shifts the entire data set out by 10,000 along 
#  the "I" dimension. 
#
CMD_APPEND_TO_2D_BINARY_SPARSE="
insert ( 
  redimension (
    project ( 
      apply ( 
        input (
          <
            nI    : int64,
            J     : int64,
            attr1 : int32,
            attr2 : int32,
            attr3 : double
          >
          [ RowNum=0:*,1000000,0 ],
          'Binary_Data_Sparse', 0, '(int64, int64, int32,int32,double)'
        ),
        I, nI + 10000
      ),
      I, J, attr1, attr2, attr3 
    ),
    Test_2D_Three
  ),
  Test_2D_Three
)
"
exec_afl_query "${CMD_APPEND_TO_2D_BINARY_SPARSE};" -n
#
#  Elapsed Time: 5:13.05
#  Elapsed Time: 5:04.29
#  Elapsed Time: 5:22.39
#
#   308 Seconds.
#
iquery -q "SELECT COUNT(*) FROM Test_2D_Three;"
#
#  {i} COUNT
#  {0} 99983616
#
#    Repeat the load, this time adding a third block of data starting at 
#  20,000. The previous two blocks spanned 0:9999, and 10,000:19,999. 
#
CMD_APPEND_TO_2D_BINARY_SPARSE_TWO="
insert ( 
  redimension (
    project (
      apply (
        input (
          <
            nI    : int64,
            J     : int64,
            attr1 : int32,
            attr2 : int32,
            attr3 : double
          >
          [ RowNum=0:*,1000000,0 ],
          'Binary_Data_Sparse', 0, '(int64, int64, int32,int32,double)'
        ),
        I, nI + 20000
      ),
      I, J, attr1, attr2, attr3
    ),
    Test_2D_Three
  ),
  Test_2D_Three
)
"
exec_afl_query "${CMD_APPEND_TO_2D_BINARY_SPARSE_TWO};" -n
#
# Elapsed Time: 5:13.51
# Elapsed Time: 5:17.64
#
iquery -q "SELECT COUNT(*) FROM Test_2D_Three;"
#
# {i} COUNT
# {0} 149975424
# 
#   So, SciDB can ingest and make available 49,991,808 additional data points,
#  each of which has 16 bytes of data, in about 5 minutes. That is, we can 
#  ingest 760M of data in 5 minutes, or 2.5Meg per second (on crummy hardware).
#  Note that if you include the index values present in the binary load file,
#  those figures double. We're loading 1.5G of data in 5 minutes, or about 
#  5Meg per second. 
#
#   If your requirements are to make data available in under 1 minute, it 
#  follows that your ingest rate can be up to about 150 Meg per minute on 
#  this hardware. 
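#
#   A quick back-of-the-envelope check of those figures (the 16 bytes of 
#  attribute data per cell comes from the schema, two int32 plus one double; 
#  the ~300 seconds is the rounded insert(...) time above; the integer 
#  arithmetic here rounds down): 
#
echo "Attribute data per insert: `expr 49991808 \* 16 / 1048576` MiB"
echo "Attribute data per second: `expr 49991808 \* 16 / 1048576 / 300` MiB/sec"
#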
#
#-----------------------------------------------------------------------------
#
#   7. Trickle Insert: Loading Smaller Blocks of Data at a Time
#      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
#   In the example above, we showed that we could (on crummy hardware) 
#  make a stream of data being generated at between 2.5M/second and 5M/sec
#  available to query with 5 minutes of "currency". Let's ask a slightly 
#  different question: if the currency guarantee request was (say) 1 minute,
#  what data arrival rate could we service? 
#   
#   Let's try creating load files at 2.5M, 25M and 250M. The complete binary 
#  file is 1.5G in size. So to reduce it to 5M (2.5M of attribute data plus 
#  the indices), which would have the correct number of cells, we need 
#  1500 / 5, or 1 / 300th of the original. 
#
CMD_SAVE_2D_BINARY_SPARSE_2_5M="
save (
  unpack( 
    bernoulli ( Two_D_Test, 0.00333 ),
    RowNum 
  ),
  'Binary_Data_Sparse_2.5M', 0, '(int64,int64,int32,int32,double)'
)
"
exec_afl_query "${CMD_SAVE_2D_BINARY_SPARSE_2_5M};" -n
#
#
CMD_SAVE_2D_BINARY_SPARSE_25M="
save (
  unpack( 
    bernoulli ( Two_D_Test, 0.0333 ),
    RowNum
  ),
  'Binary_Data_Sparse_25M', 0, '(int64,int64,int32,int32,double)'
)
"
exec_afl_query "${CMD_SAVE_2D_BINARY_SPARSE_25M};" -n
#
#
CMD_SAVE_2D_BINARY_SPARSE_250M="
save (
  unpack( 
    bernoulli ( Two_D_Test, 0.333 ),
    RowNum
  ),
  'Binary_Data_Sparse_250M', 0, '(int64,int64,int32,int32,double)'
)
"
exec_afl_query "${CMD_SAVE_2D_BINARY_SPARSE_250M};" -n
#
#  OK. Let's see what happens as we load this data. 
#
CMD_APPEND_TO_2D_BINARY_SPARSE_2_5_M="
insert (
  redimension (
    project (
      apply (
        input (
          <
            nI    : int64,
            J     : int64,
            attr1 : int32,
            attr2 : int32,
            attr3 : double
          >
          [ RowNum=0:*,1000000,0 ],
          'Binary_Data_Sparse_2.5M', 0, '(int64, int64, int32,int32,double)'
        ),
        I, nI + 30000
      ),
      I, J, attr1, attr2, attr3
    ),
    Test_2D_Three
  ),
  Test_2D_Three
)
"
exec_afl_query "${CMD_APPEND_TO_2D_BINARY_SPARSE_2_5_M};" -n
#
#  Elapsed Time: 0:14.06
#  Elapsed Time: 0:15.17
#
#   15 seconds to load 2.5M of data (really, 5M of data counting the indices).
#
CMD_APPEND_TO_2D_BINARY_SPARSE_25M="
insert (
  redimension (
    project (
      apply (
        input (
          <
            nI    : int64,
            J     : int64,
            attr1 : int32,
            attr2 : int32,
            attr3 : double
          >
          [ RowNum=0:*,1000000,0 ],
          'Binary_Data_Sparse_25M', 0, '(int64, int64, int32,int32,double)'
        ),
        I, nI + 40000
      ),
      I, J, attr1, attr2, attr3
    ),
    Test_2D_Three
  ),
  Test_2D_Three
)
"
exec_afl_query "${CMD_APPEND_TO_2D_BINARY_SPARSE_25M};" -n
#
#  Elapsed Time: 0:26.08
#  Elapsed Time: 0:25.81
#
#   26 seconds to load 25M of data. 
#
CMD_APPEND_TO_2D_BINARY_SPARSE_250M="
insert (
  redimension (
    project (
      apply (
        input (
          <
            nI    : int64,
            J     : int64,
            attr1 : int32,
            attr2 : int32,
            attr3 : double
          >
          [ RowNum=0:*,1000000,0 ],
          'Binary_Data_Sparse_250M', 0, '(int64, int64, int32,int32,double)'
        ),
        I, nI + 50000
      ),
      I, J, attr1, attr2, attr3
    ),
    Test_2D_Three
  ),
  Test_2D_Three
)
"
exec_afl_query "${CMD_APPEND_TO_2D_BINARY_SPARSE_250M};" -n
#
#  Elapsed Time: 3:24.10
#  Elapsed Time: 3:30.17
#
#   210 Seconds to load 250M. 
#
#   So. At both the 25M and 250M levels, we're loading between 1M and 1.2M 
#  per second. At the 1.5G load levels we're at 2.5M / sec. The larger the 
#  "lump" of data, the higher data throughput during the load, but (of course)
#  the load takes more time. 
#
#-----------------------------------------------------------------------------

#3

[quote=“plumber”]For those interested in understanding SciDB’s load options, I’ve put together a (pretty long) script that describes the various options and gives users some tools to test 'em out. The basic results are summarized below.[/quote]

Thanks! Sorry to interrupt you again. I modified the loadcsv.py code and successfully used it to load data from a .csv file, but I still ran into two problems.

Firstly, the command works when the .csv file is around 10GB. But when I try .csv files of 50GB and 100GB, the load appears to run successfully, until I look at how long it took. So I ran iquery and used 'select count(*) from *' to see whether I had actually loaded the data, but the result was zero. That surprised me: it never gave me any error indicator. :astonished:
Secondly, when I try again it reports:

Failed to obtain schema for load array.
SystemException in file: src/system/catalog/SystemCatalog.cpp function: _lockArray line: 2765
Error id: scidb::SCIDB_SE_SYSCAT::SCIDB_LE_PG_QUERY_EXECUTION_FAILED
Error description: System catalog error. Execution of query 'COORD-RD-insert into array_version_lock (array_name, array_id, query_id, instance_id, array_version_id, array_version, instance_role, lock_mode)(select $1::VARCHAR,$2,$3,$4,$5,$6,$7,$8 where not exists (select AVL.array_name from array_version_lock as AVL where AVL.array_name=$1::VARCHAR and AVL.lock_mode>$9 and AVL.instance_role=$10))' failed with error ERROR: could not extend relation base/16385/35247: No space left on device
HINT: Check free disk space.
.
Failed query id: 1100909153510

" even when i use “scidb.py stopall *;scidb.py startall *” to restart all the nodes.
[color=#FF0040]I have solved the second problem by rebooting the centos system.But why this?[/color]
My email is "asdf08010801@gmail.com" or "asdf08010801@sina.com"
Thanks.


#4

I also tried a 20GB csv file, which has 1,600,000 rows. When it indicated that the load was successful, I used a “select count(*)” statement but was puzzled to see only 300,000+ rows, whereas the 10GB file gave 800,000 rows.

Very surprising!


#5

So …

SciDB is a transactional system. Each load(…) operation is done under transactional guarantees. This means that a load(…) either succeeds completely or fails utterly (loads are atomic) and it means that during a load(…) operation, any changes that the load(…) is making are invisible to other users/connections/queries (transactional operations are isolated from one another).

This is why you’re seeing count(*) = 0 from the other connection. While the data is still being loaded, you won’t see any new data. To monitor what’s going on, have a look in the scidb.log file. We report what’s going on during a load there.
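For example (the exact path is an assumption; it depends on the base-path and instance layout in your config.ini), on the coordinator:

    tail -f /home/scidb/data/000/0/scidb.log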

Now … the error you’re reporting has (I think) to do with the Postgres instance. What’s happened is that Postgres has tried to do something locally related to a lock acquisition, and it’s failed because there’s not enough space for Postgres.

Where did you install Postgres? Use a “df -k” to check the status of the file-systems you have attached to the physical instance Postgres is installed on. I think you will find one of them is very nearly full. We don’t use Postgres to hold much data … only system catalogs … but if there’s no space for it to grow, then there’s no space for it to grow.
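If you’re not sure which filesystem Postgres sits on, something along these lines will tell you (the data directory shown is only a common default; substitute whatever Postgres actually reports):

    sudo -u postgres psql -c "SHOW data_directory;"
    df -k /var/lib/pgsql      # replace with the directory reported above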


#6

[quote=“plumber”]So …

SciDB is a transactional system. Each load(…) operation is done under transactional guarantees. This means that a load(…) either succeeds completely or fails utterly (loads are atomic) and it means that during a load(…) operation, any changes that the load(…) is making are invisible to other users/connections/queries (transactional operations are isolated from one another).

This is why you’re seeing count(*) = 0 from the other connection. While the data is still being loaded, you won’t see any new data. To monitor what’s going on, have a look in the scidb.log file. We report what’s going on during a load there.

Now … the error you’re reporting has (I think) to do with the Postgres instance. What’s happened is that Postgres has tried to do something locally related to a lock acquisition, and it’s failed because there’s not enough space for Postgres.

Where did you install Postgres? Use a “df -k” to check the status of the file-systems you have attached to the physical instance Postgres is installed on. I think you will find one of them is very nearly full. We don’t use Postgres to hold much data … only system catalogs … but if there’s no space for it to grow, then there’s no space for it to grow.[/quote]

Thanks, I will try it. :smile:


#7

[quote=“plumber”]For those interested in understanding SciDB’s load options, I’ve put together a (pretty long) script that describes the various options and gives users some tools to test 'em out. The basic results are summarized below.
[/quote]
I have updated SciDB to version 14.3, but I still can’t find a more efficient way to load csv files. Using the loadcsv.py command, a 10GB csv file took about 909s to load, 50GB took 8738s, and 100GB took 23435s. That is a very long time; can you give me some suggestions?

Thanks.


#8
  1. What’s the error you’re seeing with loadcsv.py? Did you get past the postgres errors? (Which seem to be related to the fact that the postgres installation is running out of disk space?)

  2. Can you share with us the commands you’re using to perform the load? I suspect that you might have your chunk sizes set small, but I really can’t tell.


#9

[quote=“plumber”]1. What’s the error you’re seeing with loadcsv.py? Did you get past the postgres errors? (Which seem to be related to the fact that the postgres installation is running out of disk space?)

  2. Can you share with us the commands you’re using to perform the load? I suspect that you might have your chunk sizes set small, but I really can’t tell.[/quote]

First, there are no errors. But I find that the loadcsv command seems to require that the amount of memory plus virtual memory be no smaller than the size of the csv file.

Second, I didn’t set the chunk sizes; I just used the default settings. I use the command like this: "loadcsv.py -a ‘[loadarray]’ -s ‘LOAD_SCHEMA’ -i ‘csv file absolute path’".

Thanks :smiley:


#10

OK. So …

  1. The implementation of loadcsv.py doesn’t have any particular requirements with respect to virtual memory. The underlying implementation uses a collection of named pipes in the data directories of the target machines, and a “splitter” on the coordinator to funnel “slices” of the .csv file to each of the named pipes on the targets. So your comment that “the loadcsv command seems to require that the amount of memory plus virtual memory be no smaller than the size of the csv file” confuses me.

The named pipes on the target instances are just FIFO queues. They pass data through but at any point in time, the data they hold is very small.

What errors are you seeing?

  2. There’s an option to loadcsv.py, “-C”, which lets you specify the chunk size of the target. Try setting it to 1 million (1000000); the default is half that size. (See the example at the end of this post.)

  3. If you have a look at the details of that script I posted, you’ll see that loading from a binary file is typically much faster than loading from a UTF-8 file. And typically, the source data you’re loading has often been converted from a binary source–another database, sensors, other computer systems–into UTF-8 for readability reasons.
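As a sketch of point 2 (the file path, array name, and schema below are placeholders, and the flags are just the ones already mentioned in this thread):

    loadcsv.py -i '/path/to/data.csv' \
               -a 'load_array' \
               -s '<attr1:int32,attr2:int32,attr3:double>[RowNum=0:*,1000000,0]' \
               -C 1000000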


#11

[quote=“plumber”]OK. So …

  1. The implementation of loadcsv.py doesn’t have any particular requirements with respect to virtual memory. The underlying implementation uses a collection of named pipes in the data directories of the target machines, and a “splitter” on the coordinator to funnel “slices” of the .csv file to each of the named pipes on the targets. So your comment that “the loadcsv command seems to require that the amount of memory plus virtual memory be no smaller than the size of the csv file” confuses me.

The named pipes on the target instances are just FIFO queues. They pass data through but at any point in time, the data they hold is very small.

What errors are you seeing?

  2. There’s an option to loadcsv.py, “-C”, which lets you specify the chunk size of the target. Try setting it to 1 million (1000000); the default is half that size.

  3. If you have a look at the details of that script I posted, you’ll see that loading from a binary file is typically much faster than loading from a UTF-8 file. And typically, the source data you’re loading has often been converted from a binary source–another database, sensors, other computer systems–into UTF-8 for readability reasons.[/quote]

OK, I will try it. Point 1 is what I saw from the “top” command: it showed that the splitcsv process that loadcsv.py calls uses a lot of memory.

Tks.