Another large file load and redimension problem


#1

I’ve been asked to evaluate SciDB, and I’m having problems in using it.
I have a table with some 50 fields and 2 billion rows. The fields are almost all double-precision floating-point numbers.

Loading from a CSV file went just fine, into an array with 50 attributes and one synthetic dimension.

create array ARR<F1:double,F2:double,F3:...,F50:int64>[i=0:*,500000,0];

The first two fields uniquely identify a row and are used in most queries, so I tried to create dimensions along them:

create array ARR_REDIM<F3:double,F4:...,F50:int64>[F1(double)=*,1000,0,F2(double)=*,1000,0];
redimension_store(ARR,ARR_REDIM);

This failed because it used too much virtual memory, over 40 GB: the server was actually killed by the out-of-memory (OOM) killer.

The fields F1 and F2 are unique down to the sixth decimal digit, so I tried to prepare an array with two added integer fields and to redimension on the new fields:

create array ARR_F<F1:double...F50:int64>[IF1=*,1000,0,IF2=*,1000,0];
redimension_store(apply(ARR,FF1,F1,FF2,F2,...,FF50,F50,IF1,int64(F1*1000000),IF2,int64(F2*1000000)),ARR_F);

This too runs out of memory, even when trying it in two steps:

select * into ARR_TMP from apply( as above );
redimension_store(ARR_TMP,ARR_F);

I tried to impose explicit limits on the dimensions, instead of leaving them unbounded:
[IF1=1:1295999999,1000,0,IF2=-350000000:350000000,1000,0]
with the same result.

I then tried to reduce the extent of the dimensions, adding a synthetic dimension to compensate for the lack of uniqueness:

select * into ARR_TMP2 from apply( ARR,FF1,F1,FF2,F2,...,FF50,F50,IF1,int64(F1*10),IF2,int64(F2*10));
create array ARR_F2<F1:double...F50:int64>[IF1=1:12959,100,0,IF2=-3500:3500,100,0,syn=1:*,15000,0];
redimension_store(ARR_TMP2,ARR_F2);

I arrived at the value 15000 for the ‘syn’ chunk size by trial and error, raising it after the “Too much duplicates” errors. With 15000 I got this error:
[code]
SystemException in file: src/query/executor/SciDBExecutor.cpp function: executeQuery line: 232
Error id: scidb::SCIDB_SE_NO_MEMORY::SCIDB_LE_MEMORY_ALLOCATION_ERROR
Error description: Not enough memory. Error ‘std::bad_alloc’ during memory allocation.
[/code]
Am I doing something wrong? Is there a way to predict the resources a given array-creation process will use?

I’m using a single-node SciDB 12.3 installation on a CentOS 6.3 Linux machine with 36 GB RAM + 4 GB swap. The data directory is on a 4x2 TB RAID set with XFS; temporary data is on a 2 TB XFS disk.

The config.ini file is the one from the manual, except for ‘server’ instead of ‘instance’, redundancy=0 instead of 1, and a different tmp-path:

[test1]
server-0=localhost,0
db_user=xxxx
db_passwd=xxxx
install_root=/opt/scidb/12.3
metadata=/opt/scidb/12.3/share/scidb/meta.sql
pluginsdir=/opt/scidb/12.3/lib/scidb/plugins
logconf=/opt/scidb/12.3/share/scidb/log4cxx.properties
base-path=/home/scidb/data
base-port=1239
interface=eth0
no-watchdog=true
redundancy=0
merge-sort-buffer=1024
network-buffer=1024
mem-array-threshold=1024
smgr-cache-size=1024
result-prefetch-queue-size=4
result-prefetch-threads=4
execution-threads=4
chunk-segment-size=10485760
tmp-path=/data7/scidb/tmp

Thank you very much for the attention.


#2

Yeah.

We’re completely rewriting redimension in 12.10 (Cheshire). There are several problems with the way the 12.3 redimension manages its intermediate result data, and the algorithm isn’t very scalable. But… can you try the following?

Instead of a single redimension that works on all 50 attributes, what happens if you create 10 target arrays (same dimensionality, different set of attributes per array) and then populate each of the 10 targets, one at a time? The queries you write will be more complex, but the underlying physical data organization will be more or less identical to what you would get with a single 50-attribute array.
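
For the first slice, that would be roughly something like the following (ARR_PART1 and the choice of F3..F7 are just placeholders, and I’m assuming those attributes are doubles):

create array ARR_PART1<F3:double,F4:double,F5:double,F6:double,F7:double>[F1(double)=*,1000,0,F2(double)=*,1000,0];
redimension_store(project(ARR,F1,F2,F3,F4,F5,F6,F7),ARR_PART1);

and similarly for the remaining attribute groups, reusing the same F1/F2 dimensions.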

The reason I’m asking this is to try to figure out where the resources are being (inefficiently) used.


#3

Two more thoughts:

  1. Try smaller chunks on the 1-D array. You’ve got 500,000 values @ 8 bytes each = about 4 MB per chunk per attribute, times 50 attributes = about 200 MB per “logical” chunk. When you redimension data, you’re moving blocks of it around. Slightly smaller chunks might help reduce the size of the intermediate result.
create array ARR<F1:double,F2:double,F3:...,F50:int64>[i=0:*,100000,0];
  2. Your config.ini file has:
smgr-cache-size=1024

This says you’re reserving only 1 GB of RAM for the chunk cache. Even if it works, it will be slow. Set this to (say) 16384.


#4

Thank you very much for your answer. I did some more tests along the lines you suggested.
I’ve only done 1/10th of my homework, arriving at just one redimensioned array, but I hope this can be useful. You know, every attempt took the better part of a working day (and I made several…)

I applied “smgr-cache-size=16384” in config.ini, as suggested.

The attempt: create a “subarray” with a selection of fields, 6 attributes + 2 dimensions out of the 50.

First attempt: use the values directly, with no conversion:

create array ARR_REDIM<F3:double,F4:...,F8:int64>[F1(double)=*,1000,0,F2(double)=*,1000,0];
redimension_store(ARR,ARR_REDIM);

Without restarting the server, ‘top’ shows VIRT rising from 29 GB to 34 GB + 2 GB of swap in less than 3 minutes, and I kill the process to avoid corrupting the DB.
By the way, I had to re-init the DB and reload the data after the crashes; apparently the data got corrupted in some way.

After restarting the server, VIRT starts from 700 MB, and after 14 minutes I see the process start to swap. Killed the process.

Another attempt, with the float dimensions multiplied by 10 and converted to integers, plus a synthetic dimension:

select F1,...,F8,IF1,IF2 into ARRplustwo FROM APPLY(ARR,IF1,int64(F1*10),IF2,int64(F2*10));
create array ARR_REDIM<F1:double,F2:double,F3:double,F4:...,F8:int64>[IF1=1:12959,100,0,IF2=-3500:3500,100,0,syn=1:*,15000,0];
redimension_store(ARRplustwo,ARR_REDIM);

This fails with “Too much duplicates”. Clearly 15000 is not enough for the synthetic dimension.
Here there is a catch-22: you cannot count the elements in a cell until you have redimensioned the array, but you cannot redimension the array without knowing how many elements per cell you are going to have.
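
If redimension_store in 12.3 accepts aggregate calls the way later releases do, I suppose one could estimate the per-cell counts beforehand with a small count-only target (ARR_COUNTS is just a made-up name), but I haven’t verified that this form is supported in this version:

create array ARR_COUNTS<cnt:uint64>[IF1=1:12959,100,0,IF2=-3500:3500,100,0];
redimension_store(ARRplustwo,ARR_COUNTS,count(*) as cnt);
aggregate(ARR_COUNTS,max(cnt));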

Anyway, I have a good indication that there are fewer than 410,000 duplicate rows, and I use this number:

create array ARR_REDIM<F1:double,F2:double,F3:double,F4:...,F8:int64>[IF1=1:12959,2,0,IF2=-3500:3500,2,0,syn=1:*,410000,0];
redimension_store(ARRplustwo,ARR_REDIM);

I lowered the IF1,IF2 chunk sizes to 2 after some attempts ended with “Chunk size 11941012 is larger than segment size 10485760” (10485760 being the chunk-segment-size from config.ini).
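
An alternative I didn’t try might be to raise that limit in config.ini instead of shrinking the chunks, e.g.:

chunk-segment-size=104857600

but I can’t say whether that would be advisable.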

This succeeded in about 5.5 hours, taking essentially all the available memory: 33.7 GB VIRT, and up to 1.8 GB of swap.

After that I’ve been able to perform some analysis and discover, for instance, that the maximum ‘count’ for a pair of values (IF1,IF2) was actually 89977.
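
For reference, a nested aggregate along these lines (grouping by the IF1,IF2 dimensions) should produce that figure:

aggregate(aggregate(ARR_REDIM,count(*) as cnt,IF1,IF2),max(cnt));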

As soon as possible I’ll try to build the whole set of arrays covering all the fields.

I’m looking forward to testing the new version.