A (Non-definitive) Guide to Data Loading in SciDB



First, I put together a short Google Docs presentation that covers SciDB’s load/save/input features. The emphasis is on answering the question, “How do I load a large corpus of binary data files into SciDB?”. The presentation walks through SciDB’s features and functionality, and collects some of the “best practice” ideas we’ve learned from our customers.

Second, a brief reminder about the “load_tools” plugin on GitHub, which provides a simple, easy-to-use approach to loading text data into SciDB.

And third, I’ve created a long script (below) from which you can cut and paste the examples included in the presentation, to get some idea of how these things work. To run the script, you will first need to install a macro. To do so:

  1. Create a file named ‘/tmp/my_macro.txt’.
  2. Into that file, place the following macro definition:
op_unit( _LEN_ ) = build ( < dummy : bool > [ R=0:_LEN_-1,iif(_LEN_>1000000,1000000,_LEN_),0 ], true );
  3. The script below contains a call to load_module(’/tmp/my_macro.txt’) that loads the macro from that file.

Then run the script.

--
--   File:   ~/Devel/Examples/Load_Examples/load_examples.aql
--
--  About: 
--
--    The purpose of the script in this file is to provide examples that 
--   are more colorfully illustrated in the presentation on SciDB 
--   data loading. Many of the queries and examples in that document are
--   taken from this script. 
-- 
-------------------------------------------------------------------------------
--
--  Step 1: Let us create the first of the basic data sets we will use. In 
--          this case, the goal is to illustrate SciDB's various external
--          data/load formats.
--
SET LANG AFL;
--
--  Load the op_unit(...) macro ...
load_module ( '/tmp/my_macro.txt');
SET FETCH;
--
--   Q1: Hygiene (to make the script re-runnable) and the CREATE ARRAY ...
--       statement for the basic, 1D load array. 
--
remove ( loadArray );
CREATE ARRAY loadArray 
<
    latitude        : double,
    longitude       : double, 
    time            : datetime,
    attr_one        : double,
    attr_two        : int64,
    attr_three      : string ,
    time_index      : int64,
    latitude_index  : int64,
    longitude_index : int64 
>
[ RowNum=0:*,1000000,0 ];
--
--   Q2: Create some data for the array. Note that what we're doing here
--       is to generate some random() position information over the globe
--       and over a year after 2010-01-01, but instead of saving this 
--       data to the array, we'll instead push the data out to the 
--       file-system using save(...). The point we're trying to emphasize here
--       is that the SciDB input(...) and save(...) operators can work with 
--       query results, as well as array data. 
--
--       See lower for examples where we're using save(...), load(...) and 
--       input(...) for array-at-a-time data movement. 
--
SET NO FETCH;
save ( 
  project ( 
    apply ( 
      op_unit(10000000),
      latitude,   double(random()%180000)/1000.0,
      longitude,  double(random()%360000)/1000.0,
      time,       datetime('2010-01-01 00:00:00') + 
                                        (random()%(365*4) * 6 * 3600),
      attr_one,   double(random()%10000)/1000.0,
      attr_two,   random()%10,
      attr_three, iif ( random()%2=0, 'A', 
                    iif( random()%2=0, 'B',
                      iif (random()%2=0, 'C', 'D' )  ) )
    ),
    latitude, longitude, time, attr_one, attr_two, attr_three
  ),
  '/tmp/DataOut',
  0, 'tsv'
);
--  
-- NOTE: SciDB prepends a header line of attribute names to the text files it 
--       writes out using save(...). To trim the header line off the data file,
--       run it through "sed -e '1d'", as below (the second sed strips the 
--       single quotes SciDB puts around string values):
--
--  cat /tmp/DataOut | sed -e '1d' | sed -e "s/'//g" > /tmp/DataFile
--
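The same clean-up can be done without sed. A minimal Python sketch of the pipeline above (the file names are the ones used in this script; nothing here is SciDB-specific):

```python
def clean_tsv(src_path, dst_path):
    """Drop the attribute-name header line that save(..., 'tsv') writes,
    and strip the single quotes around string values -- the Python
    equivalent of: sed -e '1d' -e "s/'//g"."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        next(src, None)                    # skip the header line
        for line in src:
            dst.write(line.replace("'", ""))
```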
-------------------------------------------------------------------------------
--
--  Step 2: Let's explore how we use the input(...) operator.
--
--   Q3: Using the input(..) operator. This query illustrates how to read 
--       the contents of a file into a query, rather than just loading the 
--       data and storing it directly into an array using the load(...)
--       operator. 
--
--       The signature of the input(...) operator is almost identical to
--       load(...); the functional difference is that input(...) does NOT 
--       write data into its "shape" array (whether one is named, or an 
--       anonymous schema is used as in the example below). 
--
SET LANG AFL;
SET NO FETCH;
input ( 
  <
    latitude   : double,
    longitude  : double,
    time       : datetime,
    attr_one   : double,
    attr_two   : int64,
    attr_three : string
  > 
  [ RowNum=0:*,1000000,0 ],
  '/tmp/DataFile', 0, 'tsv' 
);
--
--   The following query is practically identical to the one above, except 
--   that instead of an anonymous schema it names an existing array 
--   (loadArray) to supply the input data's shape -- and, being load(...) 
--   rather than input(...), it also stores the data into that array. 
--
--  load ( 
--    loadArray,
--    '/tmp/DataFile', 0, 'tsv'
--  );
--
-------------------------------------------------------------------------------
--
--  Q4: The following query illustrates how the input(...) operator can be 
--      used in a query. In this case, in addition to loading the data,
--      the query calculates the integer "index" values that we will use with 
--      the n-Dimensional target array. 
--
SET NO FETCH;
store ( 
  apply ( 
    input (
      <
       latitude   : double,
       longitude  : double,
       time       : datetime,
       attr_one   : double,
       attr_two   : int64,
       attr_three : string
      > [ RowNum=0:*,1000000,0 ],
      '/tmp/DataFile', 0, 'tsv'
    ),
    time_index,     
                int64(time - int64(datetime('2010-01-01 00:00:00')))/3600,
    latitude_index, int64(floor(latitude * 10000)),
    longitude_index,int64(floor(longitude * 10000))
  ),
  loadArray
);
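The index arithmetic in Q4 is just "hours since the epoch date" plus a fixed-point scaling of the coordinates. The same computation in Python (epoch date as in the script):

```python
import math
from datetime import datetime

EPOCH = datetime(2010, 1, 1)  # matches datetime('2010-01-01 00:00:00')

def time_index(t):
    # mirrors int64(time - int64(datetime('2010-01-01 00:00:00'))) / 3600
    return int((t - EPOCH).total_seconds()) // 3600

def latitude_index(lat):
    # mirrors int64(floor(latitude * 10000))
    return math.floor(lat * 10000)
```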
--
-------------------------------------------------------------------------------
--
--  Q5: One of the more difficult aspects of SciDB development is figuring 
--      out how to "size" chunks. The basic problem is that you want to 
--      partition your data into blocks that are all about the same size. 
--      The difficulty is that it's frequently very hard to figure out, 
--      a priori, how to set your per-dimension chunk lengths. 
--
--      SciDB provides a mechanism to help you set these sizes. The 
--      following query illustrates how this can be used. It's important to 
--      note that although the query below creates the new array 
--      potentialTargetArray and replaces each of the '?' place-holders with 
--      actual values, it does not move any data. 
--
remove ( potentialTargetArray );
CREATE ARRAY potentialTargetArray 
<
    latitude   : double,
    longitude  : double, 
    time       : datetime,
    attr_one   : double,
    attr_two   : int64,
    attr_three : string 
>
[ time_index=0:*,?,0, 
  latitude_index=0:*,?,0, 
  longitude_index=0:*,?,0
] USING loadArray;
--
--   What does the result look like? 
--
SET FETCH;
show ( potentialTargetArray );
--
-- potentialTargetArray
-- < latitude:double, longitude:double, time:datetime, 
--   attr_one:double, attr_two:int64,   attr_three:string
-- > 
-- [ time_index=0:*,4096,0,
--   latitude_index=0:*,1048576,0, longitude_index=0:*,2097152,0
-- ];
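The idea behind USING can be approximated by hand: choose per-dimension chunk lengths whose product lands near a target number of cells per chunk. A rough sketch of that heuristic (this is an illustration only, NOT SciDB's actual algorithm, which also looks at the data's distribution):

```python
def chunk_lengths(extents, target_cells=1_000_000):
    """Even split of a cells-per-chunk budget across dimensions:
    each dimension gets ~target_cells**(1/ndims), capped at its
    extent. A rough stand-in for what the USING clause computes."""
    per_dim = round(target_cells ** (1.0 / len(extents)))
    return [max(1, min(extent, per_dim)) for extent in extents]
```

For three effectively unbounded dimensions and the default budget, this yields chunk lengths of about 100 each; a short dimension simply caps at its extent.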
--
--
-------------------------------------------------------------------------------
--
--  Q6: Create the actual target array, based on the shape computed using 
--      the USING functionality above. 
--
--      Note that we extend the array definition produced above to include 
--      an extra "tie-breaker" synthetic dimension. The idea being that 
--      it's fairly common to encounter "collisions" in the data; cases 
--      where two precise input values find their way into the same 
--      granular cell. 
--      
remove ( targetArray );
CREATE ARRAY targetArray
<
    latitude   : double,
    longitude  : double,
    time       : datetime,
    attr_one   : double,
    attr_two   : int64,
    attr_three : string
>
[ time_index=0:*,4096,0,
  latitude_index=0:*,1048576,0, longitude_index=0:*,2097152,0,
  seq=0:99,100,0 
];
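The synthetic seq dimension exists to resolve collisions. Conceptually, redimension gives each colliding record the next free seq slot within its target cell; in Python terms (illustrative only -- SciDB does this internally):

```python
from collections import defaultdict

def assign_seq(cells):
    """Give each record a per-cell 'seq' value: 0 for the first record
    that maps to a given (time, lat, lon) cell, 1 for the next, etc."""
    counters = defaultdict(int)       # cell coordinates -> next free seq
    out = []
    for cell in cells:
        out.append((cell, counters[cell]))
        counters[cell] += 1
    return out
```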
--
-- Q6.1: Now, take the data we just loaded, and move it into the newly created 
--       "target" array. This query illustrates how we would append a new 
--       block of data. 
--
SET NO FETCH;
insert ( 
  redimension ( 
    loadArray,
    targetArray 
  ),
  targetArray
);
--
--
SET FETCH;
op_count ( targetArray );
--
--------------------------------------------------------------------------------
--
--  Q7:  The following few queries illustrate how files of data can be 
--       repeatedly (iteratively) appended to the data set along (in this 
--       case) the time dimension. The geographic locations are still 
--       generated at random over the surface of the globe, but the time 
--       values now extend into the following year. 
--
--  Q7.1: 
--
SET LANG AFL;
SET NO FETCH;
save (
  project (
    apply (
      op_unit(10000000),
      latitude,   double(random()%180000)/1000.0,
      longitude,  double(random()%360000)/1000.0,
      time,       datetime('2010-01-01 00:00:00') +
                                        (random()%(2*365*4) * 6 * 3600),
      attr_one,   double(random()%10000)/1000.0,
      attr_two,   random()%10,
      attr_three, iif ( random()%2=0, 'A',
                    iif( random()%2=0, 'B',
                      iif (random()%2=0, 'C', 'D' )  ) )
    ),
    latitude, longitude, time, attr_one, attr_two, attr_three
  ),
  '/tmp/DataOut',
  0, 'tsv'
);
--
--  cat /tmp/DataOut | sed -e '1d' | sed -e "s/'//g" > /tmp/DataFile
--
--  Q7.2: 
--
remove ( loadArray );
CREATE ARRAY loadArray
<
    latitude        : double,
    longitude       : double,
    time            : datetime,
    attr_one        : double,
    attr_two        : int64,
    attr_three      : string ,
    time_index      : int64,
    latitude_index  : int64,
    longitude_index : int64
>
[ RowNum=0:*,1000000,0 ];
--
--  Q7.3: 
--
SET NO FETCH;
insert (
  apply (
    input (
      <
       latitude   : double,
       longitude  : double,
       time       : datetime,
       attr_one   : double,
       attr_two   : int64,
       attr_three : string
      > [ RowNum=0:*,1000000,0 ],
      '/tmp/DataFile', 0, 'tsv'
    ),
    time_index,
                int64(time - int64(datetime('2010-01-01 00:00:00')))/3600,
    latitude_index, int64(floor(latitude * 10000)),
    longitude_index,int64(floor(longitude * 10000))
  ),
  loadArray
);
--
--  Q7.4: Re-organize the loadArray data into a set of new "slices" in the 
--        targetArray. 
--
SET NO FETCH;
insert (
  redimension (
    loadArray,
    targetArray
  ),
  targetArray
);
--
--  Q7.5: Check the new count.... 
--
SET FETCH;
op_count ( targetArray );
--
--------------------------------------------------------------------------------
--
--  Q8:  Third year. 
--
-- Q8.1:
--
SET LANG AFL;
SET NO FETCH;
save (
  project (
    apply (
      op_unit(10000000),
      latitude,   double(random()%180000)/1000.0,
      longitude,  double(random()%360000)/1000.0,
      time,       datetime('2010-01-01 00:00:00') +
                                        (random()%(3*365*4) * 6 * 3600),
      attr_one,   double(random()%10000)/1000.0,
      attr_two,   random()%10,
      attr_three, iif ( random()%2=0, 'A',
                    iif( random()%2=0, 'B',
                      iif (random()%2=0, 'C', 'D' )  ) )
    ),
    latitude, longitude, time, attr_one, attr_two, attr_three
  ),
  '/tmp/DataOut',
  0, 'tsv'
);
--
--  cat /tmp/DataOut | sed -e '1d' | sed -e "s/'//g" > /tmp/DataFile
--
-- Q8.2:
--
remove ( loadArray );
CREATE ARRAY loadArray
<
    latitude        : double,
    longitude       : double,
    time            : datetime,
    attr_one        : double,
    attr_two        : int64,
    attr_three      : string ,
    time_index      : int64,
    latitude_index  : int64,
    longitude_index : int64
>
[ RowNum=0:*,1000000,0 ];
--
-- Q8.3:
--
SET NO FETCH;
insert (
  apply (
    input (
      <
        latitude   : double,
        longitude  : double,
        time       : datetime,
        attr_one   : double,
        attr_two   : int64,
        attr_three : string
      > [ RowNum=0:*,1000000,0 ],
      '/tmp/DataFile', 0, 'tsv'
    ),
    time_index,
                int64(time - int64(datetime('2010-01-01 00:00:00')))/3600,
    latitude_index, int64(floor(latitude * 10000)),
    longitude_index,int64(floor(longitude * 10000))
  ),
  loadArray
);
--
-- Q8.4: And ... append the data to the end of the target array ... 
--
SET NO FETCH;
insert (
  redimension (
    loadArray,
    targetArray
  ),
  targetArray
);
--
--  Q8.5: Check the new count.... 
--
SET FETCH;
op_count ( targetArray );
--
-------------------------------------------------------------------------------
--
--  Q9: At this point, we have loaded three "tiles" of data, scattered over 
--      a considerable interval of time (three years). Each addition of 
--      new data creates a new "version" in SciDB. 
--
--      To remove old versions, and to reclaim the space occupied by 
--      superseded data in the process (due to an update or a delete, for 
--      example), use the following commands. 
--
--  Q9.1: Get a list of the versions. 
--
versions ( targetArray );
--
--  Q9.2: Reclaim space and delete the meta-data for all versions of 
--        targetArray older than version 3. 
--
remove_versions ( targetArray, 3 );
--
--  Q9.3: Get a list of the versions. 
--
versions ( targetArray );
--
-------------------------------------------------------------------------------
--
--  In this section, we measure the performance of text vs. binary vs. 
--  opaque load. To do this, we:
--
--   1. Use the data we created previously, which is still in the 
--      loadArray array.
--
--   2. Unload the data set, and then measure the performance of 
--      reloading it (into a new array) using each "flavor": text, binary 
--      and opaque. 
--
--  Q10.1 Check the count of data cells. In this case, there are 9 attributes,
--        8 of them fixed-width at 8 bytes each, for a total of about 64 
--        bytes per cell (plus the string). So we're looking at a total 
--        volume of roughly 64 x 10,000,000 = 640,000,000 bytes = 640MB 
--        of data. 
--
SET FETCH;
op_count ( loadArray );
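The size estimate above, spelled out (the 8-bytes-per-value figure assumes double, datetime and int64 are all stored as 8-byte values; the string attribute adds a few more bytes per cell on top of this):

```python
# Back-of-the-envelope volume check for loadArray: 8 of its 9
# attributes are fixed-width 8-byte values (double/datetime/int64).
FIXED_WIDTH_ATTRS = 8
BYTES_PER_VALUE = 8
CELLS = 10_000_000
total_bytes = FIXED_WIDTH_ATTRS * BYTES_PER_VALUE * CELLS
print(total_bytes)   # 640000000 -> ~640 MB before the string attribute
```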
--
--  Q10.2: Unload the loadArray into a text (tsv) file. 
--
SET FETCH;
op_now();
--
SET NO FETCH;
save ( 
  loadArray,
  '/tmp/TextLoadData.txt', 
  0,
  'tsv'
);
--
-- cat /tmp/TextLoadData.txt | sed -e '1d' | sed -e "s/'//g" > /tmp/TextLoad.txt
--
-- Q10.3: Recreate the loadArray, and load the text data. 
--
SET NO FETCH;
remove ( loadArray );
--
CREATE ARRAY loadArray
<
    latitude        : double,
    longitude       : double,
    time            : datetime,
    attr_one        : double,
    attr_two        : int64,
    attr_three      : string ,
    time_index      : int64,
    latitude_index  : int64,
    longitude_index : int64
>
[ RowNum=0:*,1000000,0 ];
--
SET FETCH;
op_now();
--
--   2015-03-01 08:22:51 -08:00
--
--  Q10.3.1: load(...) the text data. 
SET NO FETCH;
load ( 
  loadArray, '/tmp/TextLoad.txt', 0, 'tsv'
);
--
SET FETCH;
op_now();
--
--   2015-03-01 08:26:49 -08:00
--
--   3 minutes, 58 seconds = 238 seconds. 
--
--  Binary ... 
--
--  Q10.4: save(...) the loadArray data in binary. 
--
SET NO FETCH;
save ( 
  loadArray,
  '/tmp/BinaryLoad.data',
  0,
  '(double,double,datetime,double,int64,string,int64,int64,int64)'
);
--
SET NO FETCH;
remove ( loadArray );
CREATE ARRAY loadArray
<
    latitude        : double,
    longitude       : double,
    time            : datetime,
    attr_one        : double,
    attr_two        : int64,
    attr_three      : string ,
    time_index      : int64,
    latitude_index  : int64,
    longitude_index : int64
>
[ RowNum=0:*,1000000,0 ];
--
SET FETCH;
op_now();
--
--   2015-03-01 08:35:44 -08:00
--
--  Q10.4.1: load(...) the loadArray data in binary. 
--
SET NO FETCH;
load (
  loadArray, '/tmp/BinaryLoad.data', 0,
  '(double,double,datetime,double,int64,string,int64,int64,int64)'
);
--
SET FETCH;
op_now();
--
--  2015-03-01 08:37:23 -08:00
--
--    1 minute, 39 seconds = 99 seconds. 
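The binary template names one fixed-width C type per attribute. Per SciDB's binary-format documentation, a string attribute is encoded as a 4-byte length (including the terminating NUL) followed by the characters and the NUL. A hedged Python sketch that packs one record in a '(double,int64,string)' layout -- handy for generating test load files outside SciDB, but verify the encoding against your SciDB version's docs before relying on it:

```python
import struct

def pack_record(d, i, s):
    """Pack one (double, int64, string) record for SciDB binary load:
    little-endian fixed widths, then the string as a uint32 length
    (including NUL) + bytes + NUL. Assumed encoding -- check your
    SciDB release's documentation."""
    raw = s.encode()
    return (struct.pack("<d", d) +
            struct.pack("<q", i) +
            struct.pack("<I", len(raw) + 1) + raw + b"\x00")
```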
--
--  Q10.5 OPAQUE data load(...) ...
-- 
SET NO FETCH;
save (
  loadArray,
  '/tmp/OpaqueLoad.data', 0, 'opaque'
);
--
SET NO FETCH;
remove ( loadArray );
CREATE ARRAY loadArray
<
    latitude        : double,
    longitude       : double,
    time            : datetime,
    attr_one        : double,
    attr_two        : int64,
    attr_three      : string ,
    time_index      : int64,
    latitude_index  : int64,
    longitude_index : int64
>
[ RowNum=0:*,1000000,0 ];
--
SET FETCH;
op_now();
--
--  2015-03-01 08:44:54 -08:00
--
--  Q10.5.1: load (...) the OPAQUE data ...
--
SET NO FETCH;
load (
  loadArray,
  '/tmp/OpaqueLoad.data', 0, 'opaque'
);
--
SET FETCH;
op_now();
--
--  2015-03-01 08:45:49 -08:00
--
--  55 seconds. 
--
--  Q10.6: Where does the data go in a parallel save? With instance id -1, 
--         each instance writes its own portion of the array to a file 
--         named 'ParallelBinaryLoad' in its local data directory. 
--
SET NO FETCH;
save (
  loadArray,
  'ParallelBinaryLoad',
  -1,
  '(double,double,datetime,double,int64,string,int64,int64,int64)'
);
--
-------------------------------------------------------------------------------
--
--  In this section, we look at how we would go about loading a long series 
-- of individual image files. The idea is to start by generating a single, 
-- simple slice of "image" data, and then to load it repeatedly (in practice
-- you would have a different file for each image, and you would extract the 
-- image's meta-data from the file name / directory). 
--
--  Then we add each "image" to a large "target" array that is composed by 
-- combining multiple images "over time". 
--
--  Q15: Generate a single binary "image" of size 3,500 x 7,000. To make this 
-- "realistic", we'll later make it span an area of 2.5 degrees x 5 degrees
-- when we load it. Each image represents a "slice" of time. 
--
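When the per-image metadata (acquisition time, tile origin) lives in the file name, a small parser feeds the load loop. A sketch, assuming a purely hypothetical naming scheme of img_YYYYMMDD_HH.bin (adapt the pattern to your files; the hour-based time_index follows the convention used earlier in this script):

```python
import re
from datetime import datetime

# Hypothetical naming scheme: img_<YYYYMMDD>_<HH>.bin
NAME_RE = re.compile(r"img_(\d{8})_(\d{2})\.bin$")
EPOCH = datetime(2010, 1, 1)

def metadata_from_name(path):
    """Return (time, time_index) for a file named img_YYYYMMDD_HH.bin,
    where time_index counts hours since 2010-01-01 00:00:00."""
    m = NAME_RE.search(path)
    if not m:
        raise ValueError("unrecognized file name: %s" % path)
    t = datetime.strptime(m.group(1) + m.group(2), "%Y%m%d%H")
    return t, int((t - EPOCH).total_seconds()) // 3600
```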
SET LANG AFL;
SET NO FETCH;
save (
  redimension ( 
    apply (
      op_unit(3500*7000),
      latitude_index,   R/7000,
      longitude_index,  R%7000,
      attr_one,         double(random()%10000)/1000.0,
      attr_two,         random()%10,
      attr_three,       iif ( random()%2=0, 'A',
                          iif( random()%2=0, 'B',
                            iif (random()%2=0, 'C', 'D' )  ) )
    ),
    < attr_one : double, attr_two : int64, attr_three : string > 
    [ latitude_index=0:3499,3500,0, longitude_index=0:6999,7000,0 ]
  ),
  '/tmp/DenseDataOut',
  0,
  '(double,int64,string)'
);
--
-- Q16: We will load each image into a load array. Each file corresponds 
--      to a "slice" of data at a point in time. We don't yet know the 
--      physical location -- we only have the image's X / Y reference 
--      dimensions -- but we can figure this out from other meta-data.
--
remove ( twoDimensionLoadArray );
CREATE ARRAY twoDimensionLoadArray
<
    latitude   : double,
    longitude  : double,
    time       : datetime,
    attr_one   : double,
    attr_two   : int64,
    attr_three : string
>
[ time_index=0:*,1,0,
  latitude_index=0:3499,3500,0, longitude_index=0:6999,7000,0
];
--
--  Q17: input(...) the data from the external file, and then figure 
--       out the array's location in latitude / longitude terms. 
--
--  The important point in this example is that data from a load file can 
--  be manipulated using SciDB's regular query processing mechanisms 
--  before it is stored into an array. Doing so can cut down the number 
--  of I/O operations required during the load and curation phase of 
--  building a SciDB database. 
--
SET LANG AFL;
SET NO FETCH;
insert ( 
  redimension ( 
    apply ( 
      input ( 
        < attr_one : double, attr_two : int64, attr_three : string >
        [ latitude_index=0:3499,3500,0, longitude_index=0:6999,7000,0 ],
        '/tmp/DenseDataOut',
        0,
        '(double,int64,string)'
      ),
      time_index, ( 6 * 10 ),
      time,       datetime('2010-01-01 00:00:00') + ( 3600 * 6 * 10 ),
      latitude,   double(40.0 + latitude_index * ( 5.0 / 3500.0 )),
      longitude,  double(-70.0 - longitude_index * ( 5.0 / 7000.0 ))
    ),
    twoDimensionLoadArray
  ),
  twoDimensionLoadArray
);
--
--  Q18: Using the USING facility to figure out how big to make each of the
--       per-dimension chunk lengths in an array. 
--
remove ( potentialTargetArray );
CREATE ARRAY potentialTargetArray
<
    latitude   : double,
    longitude  : double,
    time       : datetime,
    attr_one   : double,
    attr_two   : int64,
    attr_three : string
>
[ time_index=0:*,?,0,
  latitude_index=0:*,?,0,
  longitude_index=0:*,?,0
] USING twoDimensionLoadArray;
--
--
--  Q18.1: What does the array created by the USING look like? 
--
SET FETCH;
show ( potentialTargetArray );
--
--  {i} schema
--  {0} 'potentialTargetArray<latitude :double,
--                            longitude :double,
--                            time :datetime,
--                            attr_one :double,
--                            attr_two:int64,
--                            attr_three:string > 
--       [time_index=0:*,1,0, 
--        latitude_index=0:*,1024,0,longitude_index=0:*,2048,0]'
--
--  NOTE: The sizes of the latitude_index and longitude_index are such that 
--        each image is going to be broken into multiple chunks, and its 
--        data distributed. 
--
--  NOTE: What's very important to understand here is that the size of these
--        "chunks" is determined by the contents of the twoDimensionLoadArray
--        which has one image. As more images are added, they will need to 
--        be:
--
--        1. Geo-rectified. In Q17, the latitude and longitude *values* are 
--           computed relative to ( 40N, 70W ). It's more usually the case
--           that each *file* is the same size (3,500 x 7,000) but the file's
--           contents refer to a *different* geo-location. 
--
--           Geo-rectification is an important and complex topic, so the 
--           mechanism we're showing here is very, very bare-bones. We 
--           simply declare that the tile's (that is, the file's) origin 
--           at ( 40N, 70W ) maps to the ( 0, 0 ) coordinates of the 
--           overall array. 
--
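Under that bare-bones scheme, mapping a point back to the overall array's coordinates is a pure offset-and-scale calculation. A sketch using the constants from Q17 (origin 40N / 70W, 5 degrees of latitude across 3,500 rows, 5 degrees of longitude across 7,000 columns, longitude increasing westward) -- these constants are this example's assumptions, not a general geo-rectification:

```python
# Constants from Q17 of this script.
LAT0, LON0 = 40.0, -70.0
LAT_STEP = 5.0 / 3500.0      # degrees of latitude per row
LON_STEP = 5.0 / 7000.0      # degrees of longitude per column (westward)

def to_array_coords(lat, lon):
    """Invert Q17's lat/lon computation back to array indices."""
    return (round((lat - LAT0) / LAT_STEP),
            round((LON0 - lon) / LON_STEP))
```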
--  Q19: Create the 3D target array. 
--
remove ( targetArray );
CREATE ARRAY targetArray
<
    latitude   : double,
    longitude  : double,
    time       : datetime,
    attr_one   : double,
    attr_two   : int64,
    attr_three : string
>
[ time_index=0:*,1,0,
  latitude_index=0:*,1024,0,
  longitude_index=0:*,2048,0
];
--
--  Q20: Move the data from the load array into the 3D target array. 
--
SET NO FETCH;
store ( 
  redimension ( 
    twoDimensionLoadArray,
    targetArray
  ),
  targetArray
);
--
--
SET FETCH;
show ( targetArray );
--
--------------------------------------------------------------------------------
