Loading 87GB of integer data into an unbounded array


#1

I need a DB engine for storage and retrieval of a huge set of integer data with three integer “keys/dimensions” that are used to select subsets (multiple rows of two integer values each). Unfortunately the keys are quite non-localized (almost pseudo-random). I tried Berkeley DB, which has a very nice C API that supports loading C structures directly into tables, but it is too slow for such huge sets, so I am looking for different solutions.

I read chapter 5, “Loading Data”, of the SciDB manual and thought I could load my 87GB binary data file into SciDB. However, the example shows the binary data being loaded into a bounded array: it defines the cells’ attribute values but does not specify the cell coordinates (keys). That assumes the keys are consecutive (0, 1, 2, …, n), which is not my case; my keys are very non-localized, and I cannot define a bounded [2^64, 2^32, 2^32] array.

Does this mean that binary load is possible only for bounded arrays? Our arrays have huge dimensions and thus cannot be bounded. It seems that cell coordinates can be specified in a CSV file, but wouldn’t regenerating and importing 87GB of integer data as CSV be too slow and impractical? SciDB is supposed to handle terabytes and petabytes, right?


#2

redimension_store() is the answer. I am still in the process of generating test data; it will be interesting to see how long loading the binary data into a flat array and transforming it into a multidimensional array takes.
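For anyone who lands here with the same question, here is the basic pattern, sketched with placeholder names, a placeholder file path and a made-up binary format string (not the exact schemas used later in this thread): load the keys as ordinary attributes of a one-dimensional unbounded “flat” array, and only afterwards turn them into coordinates.

CREATE ARRAY flat <k1:int64, k2:int64, k3:int64, v1:int64, v2:int64> [i=0:*,1000000,0];
LOAD flat FROM '/path/to/data.bin' AS '(int64,int64,int64,int64,int64)';
CREATE ARRAY cube <v1:int64, v2:int64> [k1=0:*,1000000,0, k2=0:*,1000000,0, k3=0:*,1000000,0];

and then, at the AFL prompt:

redimension_store(flat, cube);

The chunk intervals above are only placeholders; choosing them sensibly turns out to be most of the work, as the rest of this thread shows.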


#3

Alas, it bails out with an out-of-memory error:

AFL% redimension_store(shazamFlat,shazam);
SystemException in file: src/query/executor/SciDBExecutor.cpp function: executeQuery line: 233
Error id: scidb::SCIDB_SE_NO_MEMORY::SCIDB_LE_MEMORY_ALLOCATION_ERROR
Error description: Not enough memory. Error 'std::bad_alloc' during memory allocation.
Failed query id: 1102907588183
AFL%

The relevant part of scidb.log:

2013-07-13 07:53:44,014 [0x7fd8ec483700] [DEBUG]: Prepare physical plan was sent out
2013-07-13 07:53:44,014 [0x7fd8ec483700] [DEBUG]: Waiting confirmation about preparing physical plan in queryID from 0 instances
2013-07-13 07:53:44,014 [0x7fd8ec483700] [DEBUG]: Execute physical plan was sent out
2013-07-13 07:53:44,014 [0x7fd8ec483700] [INFO ]: Executing query(1102907588183): redimension_store(shazamFlat,shazam); from program: 127.0.0.1:45487/opt/scidb/13.6/bin/iquery ;
2013-07-13 07:53:44,016 [0x7fd8ec483700] [DEBUG]: Request shared lock of array shazamFlat@1 for query 1102907588183
2013-07-13 07:53:44,016 [0x7fd8ec483700] [DEBUG]: Granted shared lock of array shazamFlat@1 for query 1102907588183
2013-07-13 07:53:44,016 [0x7fd8ec483700] [DEBUG]: Request exclusive lock of array shazam for query 1102907588183
2013-07-13 07:53:44,016 [0x7fd8ec483700] [DEBUG]: Granted exclusive lock of array shazam for query 1102907588183
2013-07-13 07:53:44,039 [0x7fd8ec483700] [DEBUG]: Request shared lock of array shazam@1 for query 1102907588183
2013-07-13 07:53:44,039 [0x7fd8ec483700] [DEBUG]: Granted shared lock of array shazam@1 for query 1102907588183
2013-07-13 07:53:44,062 [0x7fd8ec483700] [DEBUG]: [RedimStore] Begins.
2013-07-13 07:55:55,057 [0x7fd8ec483700] [DEBUG]: Query::done: queryID=1102907588183, _commitState=0, erorCode=67
2013-07-13 07:55:55,057 [0x7fd8ec483700] [ERROR]: executeClientQuery failed to complete: SystemException in file: src/query/executor/SciDBExecutor.cpp function: executeQuery line: 233
Error id: scidb::SCIDB_SE_NO_MEMORY::SCIDB_LE_MEMORY_ALLOCATION_ERROR
Error description: Not enough memory. Error 'std::bad_alloc' during memory allocation.
Failed query id: 1102907588183
2013-07-13 07:55:55,057 [0x7fd8ec483700] [DEBUG]: Query (1102907588183) is being aborted
2013-07-13 07:55:55,057 [0x7fd8ec483700] [ERROR]: Query (1102907588183) error handlers (2) are being executed
2013-07-13 07:55:55,057 [0x7fd8ec483700] [DEBUG]: Update error handler is invoked for query (1102907588183)
2013-07-13 07:55:55,081 [0x7fd8f7662700] [DEBUG]: Query (1102907588183) is being aborted
2013-07-13 07:55:55,081 [0x7fd8f7662700] [DEBUG]: Deallocating query (1102907588183)
2013-07-13 07:56:05,653 [0x7fd8ec483700] [DEBUG]: UpdateErrorHandler::handleErrorOnCoordinator: the new version 1 of array shazam is being rolled back for query (1102907588183)
2013-07-13 07:56:05,654 [0x7fd8ec483700] [DEBUG]: Performing rollback
2013-07-13 07:56:05,679 [0x7fd8ec483700] [DEBUG]: End of log at position 6724032 rc=112
2013-07-13 07:56:05,679 [0x7fd8ec483700] [DEBUG]: End of log at position 0 rc=0
2013-07-13 07:56:05,680 [0x7fd8ec483700] [DEBUG]: Rollback complete
2013-07-13 07:56:05,717 [0x7fd8ec483700] [DEBUG]: Broadcast ABORT message to all instances

Here’s my config.ini:

[SingleInstance]
server-0=localhost,0
db_user=singleinstance
db_passwd=singleinstance
install_root=/opt/scidb/13.6
pluginsdir=/opt/scidb/13.6/lib/scidb/plugins
logconf=/opt/scidb/13.6/share/scidb/log4cxx.properties
base-path=/home/scidb/DB-SingleInstance
tmp-path=/tmp
base-port=1239
interface=eth0
base-path=/backup/scidb
metadata=/opt/scidb/13.6/share/scidb/meta.sql
max-memory-limit=4096
network-buffer=256
mem-array-threshold=1024
smgr-cache-size=1024
execution-threads=4
result-prefetch-queue-size=4
result-prefetch-threads=4
chunk-segment-size=100485760

Shall I increase the max-memory-limit value? That seems hopeless, because the data is much larger than the 16GB of RAM available. However, a properly designed DB should not fail with memory allocation errors; it should fall back to disk buffers when there is not enough RAM.
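(Side note, in case it helps anyone reproducing this: config.ini changes only take effect after the cluster is restarted. With the scidb.py tooling shipped with 13.6 that should be something along these lines, where SingleInstance is the section name from the config above; treat the exact invocation as an assumption and check the usage output of scidb.py on your install.)

scidb.py stopall SingleInstance
scidb.py startall SingleInstance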

df -H /backup
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb6       1.1T  310G  719G  31% /backup

Queries used to create arrays:

CREATE ARRAY shazam <chid: uint16, time: uint64> [hash (uint64)=*,1000000,0, id (int64)=*,1000000,0, pos (int64)=*,1000000,0];
CREATE ARRAY shazamFlat <hash:uint64,id:int32,pos:int32,chid:uint16,time:uint64>[i=0:*,1000000,0];

Please help.


#4

I increased the chunk-segment-size value to 400MB:

[SingleInstance]
server-0=localhost,0
db_user=singleinstance
db_passwd=singleinstance
install_root=/opt/scidb/13.6
pluginsdir=/opt/scidb/13.6/lib/scidb/plugins
logconf=/opt/scidb/13.6/share/scidb/log4cxx.properties
base-path=/home/scidb/DB-SingleInstance
tmp-path=/tmp
base-port=1239
interface=eth0
base-path=/backup/scidb
metadata=/opt/scidb/13.6/share/scidb/meta.sql
max-memory-limit=8192
network-buffer=256
mem-array-threshold=1024
smgr-cache-size=1024
execution-threads=4
result-prefetch-queue-size=4
result-prefetch-threads=4
chunk-segment-size=401943040

I also decreased the chunk size of the target multidimensional array:

AQL% CREATE ARRAY shazam <chid: uint16, time: uint64> [hash (uint64)=*,50000,0, id (int64)=*,50000,0, pos (int64)=*,50000,0];
Query was executed successfully

But it fails anyway, albeit at a different place in the allocation code:

AQL% set lang afl;
AFL% redimension_store(shazamFlat,shazam);
SystemException in file: src/smgr/io/Storage.cpp function: allocate line: 3296
Error id: scidb::SCIDB_SE_STORAGE::SCIDB_LE_CANT_ALLOCATE_MEMORY
Error description: Storage error. Failed to allocate memory.
Failed query id: 1103600985591

Can someone suggest anything?


#5

Hello,

It looks like your chunks are still too large and you need to reduce the chunk sizes further. The chunk-segment-size setting is not relevant here, but leave it as-is for now.

Your dimensions pos and id are int64 and will therefore be sparsely populated. Your dimension hash (uint64) will be densely populated with sorted hash values (or you can choose to use an int64 here, or use the new example UDOs “uniq” and “index_lookup”; you have three alternatives here). So, at the moment each of your chunks spans 50,000 values along the “hash” dimension: if your data is very sparse, you will have about 50K values per chunk, but if your data is dense, you can have up to 50K x 50K x 50K (1.25 x 10^14) logical cells per chunk. Judging by the error message, the latter might be happening, but I can’t tell without looking at your data.

You may want to run “aggregate(shazamFlat, min(id), max(id), approxdc(id))” to get an idea.
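For example, from the shell (adding a count(*) query here as well, since the total row count feeds into the chunk-size arithmetic later in the thread):

iquery -aq "aggregate(shazamFlat, min(id), max(id), approxdc(id))"
iquery -aq "aggregate(shazamFlat, count(*))"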

Also - google around for more info on selecting the right chunk sizes.


#6

Even with chunks as small as this (I also made hash a signed int64 to make it sparse, since the most significant bit is never used):

CREATE ARRAY shazam <chid: uint16, time: uint64> [hash (int64)=*,100,0, id (int64)=*,100,0, pos (int64)=*,100,0];

It still produces an error:

AFL% redimension_store(shazamFlat,shazam);
SystemException in file: src/query/executor/SciDBExecutor.cpp function: executeQuery line: 233
Error id: scidb::SCIDB_SE_NO_MEMORY::SCIDB_LE_MEMORY_ALLOCATION_ERROR
Error description: Not enough memory. Error 'std::bad_alloc' during memory allocation.
Failed query id: 1100592073306

Now I am trying to decrease the chunk size of the flat array:

CREATE ARRAY shazamFlat <hash:int64,id:int32,pos:int32,chid:uint16,time:uint64>[i=0:*,10000,0];

However, loading the 87GB of data now takes a very long time (it has not finished yet).

With

CREATE ARRAY shazamFlat <hash:int64,id:int32,pos:int32,chid:uint16,time:uint64>[i=0:*,1000000,0];

it took just 2-3 hours. Now it takes much longer (I’ll report back when it finishes).

I’ll also run the suggested aggregate queries once the flat array has been imported.


#7

Hello,

  1. The chunk size for the 1D array is fine - we don’t need to change that.

  2. The chunk size for the multidimensional target array is probably an overcorrection now. Let’s do a real estimate. Can you run these queries and send me the results:

aggregate(shazamFlat, min(id), max(id), approxdc(id))
aggregate(shazamFlat, min(pos), max(pos), approxdc(pos))
aggregate(shazamFlat, min(hash), max(hash), approxdc(hash))
aggregate(shazamFlat, count(*))

We just need to calculate a good average chunk size; otherwise we are throwing darts in the dark.
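(A rough rule of thumb, not spelled out elsewhere in this thread: SciDB chunks are usually sized to hold somewhere around half a million to a million non-empty cells each, so once these counts are in, the chunk interval along a dimension can be estimated as

chunk_interval ≈ target_cells_per_chunk * approxdc(dim) / count(*)

i.e. the number of distinct dimension values that together contribute roughly the target number of rows.)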

Also, please attach your config file.


#8

Hello, Alex!
Thanks for the support.

Here’s the log of commands:

AQL% set lang afl;
AFL% set fetch;
AFL% aggregate(shazamFlat, min(id), max(id), approxdc(id))
CON> ;
[(-2147483648,2147483645,5025319398)]
AFL% aggregate(shazamFlat, min(pos), max(pos), approxdc(pos));
[(-2147483648,2147483646,4988330746)]
AFL% aggregate(shazamFlat, min(hash), max(hash), approxdc(hash));
[(0,49999999,49853201)]
AFL% aggregate(shazamFlat, count(*));
[(5000000457)]
AFL%

Here’s my current config file:

[SingleInstance]
server-0=localhost,0
db_user=singleinstance
db_passwd=singleinstance
install_root=/opt/scidb/13.6
pluginsdir=/opt/scidb/13.6/lib/scidb/plugins
logconf=/opt/scidb/13.6/share/scidb/log4cxx.properties
base-path=/home/scidb/DB-SingleInstance
tmp-path=/tmp
base-port=1239
interface=eth0
base-path=/backup/scidb
metadata=/opt/scidb/13.6/share/scidb/meta.sql
max-memory-limit=10240
network-buffer=256
mem-array-threshold=2048
smgr-cache-size=2048
merge-sort-buffer=1024
execution-threads=4
result-prefetch-queue-size=4
result-prefetch-threads=4
chunk-segment-size=803886080

The people who use the real datasets want subsets of the data to be selected by the hash column (the “key”). The generated “fake” test data will be replaced by real data only if it proves possible to load, rearrange and select the needed data in reasonable time (no more than 24 hours) on a 1-4 node cluster. Currently I use a single Ubuntu 12.04 LTS node with an i7-2600K, 16GB RAM and a terabyte SATA disk. More powerful hardware (especially the disk subsystem) can be requested, but only after initial success with the generated test data.


#9

I also wondered whether I wrongly defined (id, pos) as additional dimensions instead of attributes. Each (hash, id, pos) triple is unique, but selecting by id or pos is not needed. I will now try to create a non-flat array with only the hash dimension and check how long it takes and whether the out-of-memory error goes away.


#10

I changed the schema of the target array so that only hash is defined as a dimension, in the hope that redimension_store() will populate such an array faster and without errors:

AFL% CREATE ARRAY shazam <id:int32, pos:int32, chid: uint16, time: uint64> [hash (int64)=*,1000000,0];
Query was executed successfully

However redimension_store() still fails:

AFL% redimension_store(shazamFlat,shazam);
SystemException in file: src/query/executor/SciDBExecutor.cpp function: executeQuery line: 233
Error id: scidb::SCIDB_SE_NO_MEMORY::SCIDB_LE_MEMORY_ALLOCATION_ERROR
Error description: Not enough memory. Error 'std::bad_alloc' during memory allocation.
Failed query id: 1100334894327

Here are the aggregate statistics:

AFL% aggregate(shazamFlat, min(hash), max(hash), approxdc(hash));
Query was executed successfully
AFL% [(0,49999999,49853201)]

The config.ini is unaltered (above).


#11

It looks like the error could affect other users and, since I have stats of your data, I can try to reproduce it. I will let you know what I find out.


#12

Alright, I found two things that are causing the problem.

  1. Even if you specify hash as (int64), SciDB will still create a non-integer dimension unless you convert the attribute to int64 first using apply(). The reason this matters is that the non-integer dimension code puts all of the data in memory and increases memory usage. Solutions to these issues are on the way.

  2. SciDB uses a little too much memory and you have the max-memory-limit setting enabled. SciDB tries to allocate more memory than the limit, and you receive the error.

So I’ve developed the following recommendations for the system you described. This is the recommended config:

[...]
server-0=localhost,3
db_user=...
db_passwd=...
install_root=/opt/scidb/13.6
pluginsdir=/opt/scidb/13.6/lib/scidb/plugins
logconf=/opt/scidb/13.6/share/scidb/log4cxx.properties
base-path=...
tmp-path=...
base-port=1239
interface=eth0
max-memory-limit=4096
network-buffer=128
mem-array-threshold=128
smgr-cache-size=128
merge-sort-buffer=64
execution-threads=6
result-prefetch-queue-size=2
result-prefetch-threads=2

Some notes about the above:

  • I use 4 instances to get more parallelism
  • I have it set up to run up to 3 queries at one time, 2 threads per query, so one query will keep an 8-core processor busy
  • Lowered the memory numbers
  • I set the hard cap on each instance to 4GB (you said you have 16GB total, minus other programs, plus swap). If this is not enough, then go down to 3 instances and increase the memory limits.

You were right about not using pos and id as dimensions. But you need a synthetic dimension, because several records can land on the same hash value, and you need to know how many values there are per hash in order to bound it. If you already know that, move on to the next step; if you don’t, this is how you find out:

$ iquery -aq "create array count_redim<count:uint64 null> [ihash=0:*,1000000,0]"  ##took about 1.5 hours for me - my file is as big as yours
$ iquery -anq "redimension_store(apply(shazam, ihash, int64(hash)), count_redim, count(*) as count)"

I have randomly generated data so I get something like this:

$ iquery -aq "aggregate(count_redim, min(count), max(count))"
[(48,164)]

Note: this step can actually be done using redimension, without the “_store”.
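For example, something along these lines (untested here) should give the same min/max without storing count_redim:

iquery -aq "aggregate(redimension(apply(shazamFlat, ihash, int64(hash)), count_redim, count(*) as count), min(count), max(count))"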

And to be extra safe, I use the upper bound of 200 in my synthetic dimension:

$ iquery -aq "create array shazam_redim <id:int32,pos:int32,chid:int16,time:int64>[ihash=0:*,5000,0,synth=0:199,200,0]"
$ iquery -anq "redimension_store(apply(shazam, ihash, int64(hash)), shazam_redim)"

I just started this command on my server. If it doesn’t finish, it will be a matter of adjusting the memory limit. Of course, you could also remove the limit setting and allow the system to swap; it depends on your environment.
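(As a quick sanity check of those chunk sizes against the statistics from post #8, rounding freely:

~5,000,000,000 rows / ~50,000,000 distinct hash values ≈ 100 rows per hash on average
5,000 hash values per chunk x 200 synthetic positions = 1,000,000 logical cells per chunk, of which roughly 500,000 should actually be occupied

which is a far more reasonable cell count per chunk than the dense 50K x 50K x 50K worst case discussed earlier.)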

Hope this helps - sorry for the delay!


#13

Alex, did the last query finish?
I changed the config according to your recommendations, and the flat array is now transformed successfully. However, I found that the ihash index alone is not enough, because unfortunately there can be multiple (id, pos) values per hash. How can I overcome that? Is it possible to have multiple attribute values (records) per hash, other than by defining (id, pos) as dimensions, which causes the out-of-memory error?

Also, I wish SciDB were able to use disk buffers or temporary tables; I did not get such errors even for slow queries in MySQL. The errors could also be more informative: for example, the error message could suggest the required config option change.

Thanks.
Dmitriy


#14

I changed the schema of the source flat array a bit:

CREATE ARRAY shazamFlat <hash:int64,id_pos:int64,chid:uint16,time:uint64>[i=0:*,1000000,0];
LOAD shazamFlat FROM 'shazam.bin' AS '(int64,int64,uint16,uint64)';
create array count_redim <count:uint64 null> [ihash=0:*,1000000,0];
create array shazam_redim <id_pos:int64,chid:uint16,time:uint64>[ihash=0:*,5000,0,synth=0:199,200,0];

The change is minor: (id, pos) are packed into a single int64 field.

However, redimension_store() still fails, albeit with a different error:

AFL% redimension_store(apply(shazamFlat, ihash, int64(hash)), shazam_redim);
SystemException in file: src/util/FileIO.cpp function: writeAll line: 94
Error id: scidb::SCIDB_SE_IO::SCIDB_LE_PWRITE_ERROR
Error description: I/O error. pwrite failed to write 34446 byte(s) to the position 49180807168 with error 28.
Failed query id: 1101272002363
AFL%

There is enough free space on the volume:

# df -H /backup
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb6       1.1T  316G  714G  31% /backup
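(For reference: on Linux, error 28 is ENOSPC, “No space left on device”, so some filesystem or quota did fill up during the query even though /backup has room. The configs posted earlier in this thread also pointed at /home/scidb/DB-SingleInstance and tmp-path=/tmp; the config in use after the changes from #12 is not shown, but it may be worth checking every path SciDB writes to, for example:)

df -H /tmp /home/scidb/DB-SingleInstance /backup/scidb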