I’ve been asked to evaluate SciDB, and I’m running into problems using it.
I have a table with some 50 fields and 2 billion rows. The fields are almost all double-precision floating-point numbers.
Loading from a CSV file went just fine, into an array with 50 attributes and one synthetic dimension:
create array ARR<F1:double,F2:double,F3:...,F50:int64>[i=0:*,500000,0];
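For completeness, the load itself was along these lines (the file names are placeholders, and I’m quoting the csv2scidb flags from memory):
csv2scidb -s 1 -p NNN...N < table.csv > table.scidb
load ARR from '/tmp/table.scidb';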
The first two fields uniquely identify a row and are used in most queries, so I tried to create dimensions along them:
create array ARR_REDIM<F3:double,F4:...,F50:int64>[F1(double)=*,1000,0,F2(double)=*,1000,0];
redimension_store(ARR,ARR_REDIM);
This failed after using too much virtual memory, over 40 GB: the server process was in fact killed by the kernel’s out-of-memory (OOM) killer.
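My guess, and I may well be wrong about the mechanism, is that non-integer dimensions make SciDB build a lookup from every distinct coordinate value to an integer position; with up to 2 billion distinct doubles per axis, such a map alone would run to many gigabytes. Is that what is eating the memory?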
The fields F1 and F2 are unique to the sixth decimal digit, so I tried to prepare an array with two added integer fields and redimension along the new fields:
create array ARR_F<F1:double...F50:int64>[IF1=*,1000,0,IF2=*,1000,0];
redimension_store(apply(ARR,FF1,F1,FF2,F2,...,FF50,F50,IF1,int64(F1*1000000),IF2,int64(F2*1000000)),ARR_F);
This too goes out of memory, even when trying it in two steps:
select * into ARR_TMP from apply( as above );
redimension_store(ARR_TMP,ARR_F);
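A side doubt about the conversion: I assumed int64() truncates rather than rounds, so two rows that agree to the sixth decimal but differ beyond it could still collide after scaling. Would an explicit rounding, e.g. IF1,int64(floor(F1*1000000+0.5)), be safer, assuming floor() is usable in apply expressions?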
I tried to impose finite limits on the dimensions instead of leaving them unbounded, with the same result.
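For the record, the bounds used below come from the extrema of the scaled fields, obtained with something like:
select min(F1), max(F1), min(F2), max(F2) from ARR;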
I then tried to reduce the limits of the dimensions, adding a synthetic dimension to compensate for the lack of uniqueness:
select * into ARR_TMP2 from apply(ARR,FF1,F1,FF2,F2,...,FF50,F50,IF1,int64(F1*10),IF2,int64(F2*10));
create array ARR_F2<F1:double...F50:int64>[IF1=1:12959,100,0,IF2=-3500:3500,100,0,syn=1:*,15000,0];
redimension_store(ARR_TMP2,ARR_F2);
I arrived at the value 15000 for the ‘syn’ chunk size by trial and error, raising it after each “Too much duplicates” error. With 15000 I got the error:
SystemException in file: src/query/executor/SciDBExecutor.cpp function: executeQuery line: 232
Error id: scidb::SCIDB_SE_NO_MEMORY::SCIDB_LE_MEMORY_ALLOCATION_ERROR
Error description: Not enough memory. Error ‘std::bad_alloc’ during memory allocation.
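One thing that strikes me in hindsight: the logical chunk of ARR_F2 is 100 × 100 × 15000 = 150,000,000 cells. If memory use during redimension_store scales with the logical rather than the occupied chunk volume, that alone might explain the bad_alloc, but I don’t know whether that is how it works.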
Am I doing anything wrong? Is there a way to estimate in advance the resources a given array-creation process will use?
I’m using a single-node SciDB 12.3 installation on a CentOS 6.3 Linux machine with 36 GB RAM + 4 GB swap. The data directory is on a 4x2TB RAID set formatted with XFS; temporary data is on a separate 2 TB XFS disk.
The config.ini file is the one from the manual, except for ‘server’ instead of ‘instance’, redundancy=0 instead of 1, and a different tmp-path:
[test1]
server-0=localhost,0
db_user=xxxx
db_passwd=xxxx
install_root=/opt/scidb/12.3
metadata=/opt/scidb/12.3/share/scidb/meta.sql
pluginsdir=/opt/scidb/12.3/lib/scidb/plugins
logconf=/opt/scidb/12.3/share/scidb/log4cxx.properties
base-path=/home/scidb/data
base-port=1239
interface=eth0
no-watchdog=true
redundancy=0
merge-sort-buffer=1024
network-buffer=1024
mem-array-threshold=1024
smgr-cache-size=1024
result-prefetch-queue-size=4
result-prefetch-threads=4
execution-threads=4
chunk-segment-size=10485760
tmp-path=/data7/scidb/tmp
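If the four 1024 values (merge-sort-buffer, network-buffer, mem-array-threshold, smgr-cache-size) are in megabytes, as I assumed, they budget only about 4 GB, so I don’t see how the configuration alone accounts for blowing past 40 GB.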
Thank you very much for your attention.