Loading data into an array via a pipe


#1

In a prior version of SciDB (14.12), I was able to load data to an array using a pipe and the loadcsv.py script. This script also seemed to use all available instances while loading data.

In version 16.9 of SciDB, loadcsv.py has been deprecated and is no longer available. I would still like to load data into an array via a pipe, using all instances. Can this be done in SciDB 16.9?

Thank you,
Nathan


#2

Hey Nathan,
For loading text files quickly, right now we recommend the accelerated_io_tools (aio) package. See the documentation here:

You can use this with named pipes - you'd just have to create the named pipe first. See an example here:

You should see a significant speedup relative to what 14.12 could do with loadcsv.py. The aio package also has options for loading from multiple files at the same time - some finance folks like to do that for extra speed.
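
For illustration, here is a minimal sketch of the pipe approach (the pipe path, column count, delimiter, and target array name are placeholders - adjust them to your data):

mkfifo /tmp/load_pipe                    # create the named pipe
cat /path/to/data.csv > /tmp/load_pipe & # any producer can feed it in the background
iquery -anq "store(aio_input('/tmp/load_pipe', 'num_attributes=3', 'attribute_delimiter=,'), my_raw_array)"
rm /tmp/load_pipe                        # remove the pipe when you are done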


#3

Thank you for your response. I have a follow-up question. I am trying to follow your code to populate a test array.

I create a 1-D array that looks like this:

'attr_temp<p_key:int64> [row_num=0:*:0:*]'.
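
That is, it was created with a statement along the lines of:

iquery -aq "create array attr_temp <p_key:int64> [row_num=0:*:0:*]"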

Then I try to load data from PostgreSQL into attr_temp using this:

rm -f /tmp/pipe
mkfifo /tmp/pipe
psql -c"COPY(select p_key from attr limit 100) to STDOUT with CSV" > /tmp/pipe &
iquery -anq "insert(aio_input('/tmp/pipe', 'num_attributes=1', 'attribute_delimiter=\n'), attr_temp)"

I have also tried the store operator instead of the insert operator but I get the following error message in either case:
UserException in file: src/array/Metadata.cpp function: checkConformity line: 857
Error id: scidb::SCIDB_SE_INFER_SCHEMA::SCIDB_LE_DIMENSION_COUNT_MISMATCH
Error description: Error during schema inferring. Dimension counts for operator 'INSERT/STORE' do not match: [3] != [1].

Any thoughts on what I might be doing wrong here?
Thanks again.

Also, running this code

psql -c "COPY(select p_key from attr limit 100) to STDOUT with CSV" > /tmp/pipe &
cat < /tmp/pipe

gives

16
32
5
7
91
...

which is the correct input.


#4

Yeah - that’s expected - the <p_key:int64>[row_num] is not the schema that aio_input returns, so you’d need to coerce or redimension the data before you can insert.

For starters, just try storing it into a brand new array without pre-creation. This will show you what shape aio creates:

$ iquery -anq "remove(attr_temp2)" #make sure it doesn't exist / pick a unique name
$ iquery -anq "store(aio_input('/tmp/pipe', 'num_attributes=1'), attr_temp2)"
$ iquery -aq "show(attr_temp2)"
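
One practical note when testing against a pipe instead of a file: the pipe is drained by whichever query reads it, so the writer has to be restarted before each load attempt - roughly, with the psql command from your post:

$ psql -c "COPY(select p_key from attr limit 100) to STDOUT with CSV" > /tmp/pipe &
$ iquery -anq "store(aio_input('/tmp/pipe', 'num_attributes=1'), attr_temp2)"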

Putting the data into the shape you have would look something like this (projecting down to just p_key so the attributes line up for the insert):

$ iquery -anq "
insert(
 project(
  unpack(
   apply(
    aio_input('/tmp/pipe', 'num_attributes=1'),
    p_key, dcast(a0, int64(null))
   ),
   row_num
  ),
  p_key
 ),
 attr_temp
)"

The aio package does the multi-instance parsing inside the DB, and the input data may come from multiple files or streams. So, to disambiguate and for error tolerance, it returns an array with all-string attributes and 3 dimensions (source instance ID, destination instance ID, and line number). It also adds an 'error' attribute for error handling.
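
For instance, to spot-check a load for parse problems you can count cells where that attribute is set - a sketch against the attr_temp2 array stored above, assuming (per the aio docs) that the error attribute is null for cleanly parsed lines:

iquery -aq "aggregate(attr_temp2, count(*))"                               # total lines parsed
iquery -aq "aggregate(filter(attr_temp2, not is_null(error)), count(*))"  # lines flagged with an error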

For more info, take another look at the aio doc (github link above). There are many additional examples and tutorials. This walkthrough is quite nice: http://rvernica.github.io/2016/05/load-data


#5

Great! I should have read the docs a little closer.
Thanks for your help.


#6

This was just what I was looking for. Worked like a charm!

We created a large array (480 GB) from a PostgreSQL table containing approximately 161 million rows and 1100 attributes, and it loaded in about 3 hours. This was done across four servers, each running 8 instances. After creating the named pipes on a shared directory, here is the aio_input query:

iquery -anq "store(aio_input('paths=/shared/pipe.0;/shared/pipe.1;/shared/pipe.2;/shared/pipe.3;/shared/pipe.4;/shared/pipe.5;/shared/pipe.6;/shared/pipe.7;/shared/pipe.8;/shared/pipe.9;/shared/pipe.10;/shared/p\
ipe.11;/shared/pipe.12;/shared/pipe.13;/shared/pipe.14;/shared/pipe.15;/shared/pipe.16;/shared/pipe.17;/shared/pipe.18;/shared/pipe.19;/shared/pipe.20;/shared/pipe.21;/shared/pipe.22;/shared/pipe.23;/shared/pipe\
.24;/shared/pipe.25;/shared/pipe.26;/shared/pipe.27;/shared/pipe.28;/shared/pipe.29;/shared/pipe.30;/shared/pipe.31','instances=0;1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23;24;25;26;27;28;29;30;\
31', 'num_attributes=1101', 'attribute_delimiter=,'), attr_temp)" 
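
In case it is useful to anyone else, the pipes and their writers were set up with a loop roughly like the one below; how you partition the source table across the 32 streams is up to you (the modulo split on p_key here is just for illustration):

for i in $(seq 0 31); do
  rm -f /shared/pipe.$i
  mkfifo /shared/pipe.$i
  # one background writer per pipe
  psql -c "COPY (select * from attr where p_key % 32 = $i) to STDOUT with CSV" > /shared/pipe.$i &
done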

There is one more question, though. After the array has been created and SciDB returns control to the terminal, the memory footprint from the load persists. That is, it appears SciDB is not releasing memory after the query finishes, and this usage is not just page cache.

Any thoughts on this? Is there a command/script/operator that SciDB uses to clean up memory usage?

Thanks


#7

Hello - yes, that looks like a fair amount of data.

The short answer on memory is that SciDB will grow to an upper bound of memory usage and then stay there. You can use the configs to control where that upper bound is. If you need to clear the memory, you can do a scidb stopall / startall. Some folks do that if they run a large load / sort / compute job once a day and expect only small lookup queries afterwards. We do intend to make this control more responsive in the future.

For a longer story with more info on the configs, see here: Running Out of Memory!
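
For quick orientation, the relevant knobs are per-instance settings in config.ini, for example (parameter names as in recent SciDB releases; the values below are placeholders, not recommendations - see the post above for how to size them):

max-memory-limit=8192      # overall per-instance allocation cap, MB
mem-array-threshold=1024   # cache for temporary / in-memory arrays, MB
smgr-cache-size=1024       # storage manager chunk cache, MB
merge-sort-buffer=512      # buffer used by sorts, MB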


#8

Another thing that may be of interest is Elastic Resource Groups. This lets you dynamically attach and detach instances, for example:

  1. attach 4 new nodes
  2. run a big load or compute job
  3. detach

This is particularly cost-effective in the cloud - letting you pay for extra compute only when you need it. At the moment it does require the SciDB Enterprise Edition. More at https://github.com/Paradigm4/elastic_resource_groups


#9

Great information. I will definitely look into these in more depth.
Thank you once again.