Processing very wide CSV files with SciDB


#1

Dear all,

I am currently processing a rather oddly-structured dataset, and I’d like your help.
My data resides in several CSV files of ~5000 rows each.
My problem stems from the number of columns in the files - some of them comprise 17000 columns per row.
To make matters worse, in the general case there is no useful correlation between the columns, so no additional groupings can be defined.
Besides an ID value and potentially a name field in each file, the remaining fields are either floats, or integers taking values from a small domain (e.g. 0, 1, 2).

As you understand, I end up having to process many, many small column chunks, which penalizes performance.
My schema ends up looking like this:

The valueX fields are either integers or floats, depending on the file.
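
The schema listing itself did not come through, but based on the description it presumably looks something like the following one-attribute-per-column flat array (the array name Imaging is taken from the log below; the attribute names and chunk size here are just assumptions for illustration):

```
CREATE ARRAY Imaging
<id: int64, name: string,
 value1: int64,
 value2: int64,
 ...
 value17000: int64>
[row=0:*,5000,0];
```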

This causes issues both when loading data and when executing queries.

I am using SciDB 13.9, deployed as a single instance on a quad-core machine with 16GB of RAM, and my configuration is:

[mydb]
server-0=localhost,0
db_user=...
db_passwd=...
install_root=/opt/scidb/install
pluginsdir=/opt/scidb/install/lib/scidb/plugins
logconf=/opt/scidb/install/share/scidb/log1.properties
base-path=...
base-port=1239
interface=eth0
smgr-cache-size=1024
merge-sort-buffer=512
max-memory-limit=12000

At the moment, a 200MB CSV file takes up to 15 minutes to load. From what I can tell, I/O is taking place pretty much all the time
(according to iotop, journaling activity - jbd2 etc. - is always present).

Besides that, after loading the file I run into problems when attempting to perform redimension_store or redimension.
Ideally, I would like to redimension so that the unique identifier acts as the dimension. However, any
redimension I attempt involving more than ~250 attributes results in the message:

UserException in file: src/array/MemArray.cpp function: getItem line: 2200 Error id: scidb::SCIDB_SE_EXECUTION::SCIDB_LE_NO_CURRENT_ELEMENT Error description: Error during query execution. No current element.

The full log contents are

2013-10-22 20:42:24,959 [0x7f91fbe9a700] [DEBUG]: [RedimStore] Begins.
2013-10-22 20:42:24,959 [0x7f91fbe9a700] [DEBUG]: [RedimStore] build mapping index took 0 ms, or 0 millisecond
2013-10-22 20:42:30,283 [0x7f91fbe9a700] [DEBUG]: [RedimStore] inputArray --> redimensioned took 5324 ms, or 5 seconds 324 milliseconds
2013-10-22 20:42:30,288 [0x7f91fbe9a700] [DEBUG]: [SortArray] Sort for array begins
2013-10-22 20:42:36,135 [0x7f91fb3b1700] [DEBUG]: [SortArray] Produced sorted run # 1
2013-10-22 20:42:36,189 [0x7f91fbe9a700] [DEBUG]: [SortArray] merge sorted chunks complete took 5901 ms, or 5 seconds 901 milliseconds
2013-10-22 20:42:36,189 [0x7f91fbe9a700] [DEBUG]: [RedimStore] redimensioned sorted took 5906 ms, or 5 seconds 906 milliseconds
2013-10-22 20:42:36,189 [0x7f91fbe9a700] [DEBUG]: [RedimStore] synthetic dimension populated took 0 ms, or 0 millisecond
2013-10-22 20:42:36,792 [0x7f91fbe9a700] [DEBUG]: SharedMemCache::sizeCoherent computed 54726146 used 54726146
2013-10-22 20:42:36,806 [0x7f91fbe9a700] [DEBUG]: SharedMemCache::sizeCoherent computed 27362765 used 27362765
2013-10-22 20:42:36,850 [0x7f91fbe9a700] [DEBUG]: SharedMemCache::sizeCoherent computed 0 used 0
2013-10-22 20:42:36,867 [0x7f91fbe9a700] [DEBUG]: Query::done: queryID=1100984835141, _commitState=0, errorCode=32
2013-10-22 20:42:36,867 [0x7f91fbe9a700] [ERROR]: executeClientQuery failed to complete: UserException in file: src/array/MemArray.cpp function: getItem line: 2200 Error id: scidb::SCIDB_SE_EXECUTION::SCIDB_LE_NO_CURRENT_ELEMENT Error description: Error during query execution. No current element.
Failed query id: 1100984835141
2013-10-22 20:42:36,867 [0x7f91fbe9a700] [DEBUG]: Query (1100984835141) is being aborted
2013-10-22 20:42:36,867 [0x7f91fbe9a700] [ERROR]: Query (1100984835141) error handlers (2) are being executed
2013-10-22 20:42:36,867 [0x7f91fbe9a700] [DEBUG]: Update error handler is invoked for query (1100984835141)
2013-10-22 20:42:36,869 [0x7f91fbe9a700] [DEBUG]: UpdateErrorHandler::handleErrorOnCoordinator: the new version 1 of array ImagingRedim is being rolled back for query (1100984835141)
2013-10-22 20:42:36,869 [0x7f91fbe9a700] [DEBUG]: Performing rollback
2013-10-22 20:42:36,877 [0x7f91fbe9a700] [DEBUG]: End of log at position 1420832 rc=112
2013-10-22 20:42:36,877 [0x7f91fbe9a700] [DEBUG]: End of log at position 0 rc=0
2013-10-22 20:42:36,877 [0x7f91fbe9a700] [DEBUG]: Rollback complete
2013-10-22 20:42:36,900 [0x7f91fbe9a700] [DEBUG]: Broadcast ABORT message to all instances for query 1100984835141
2013-10-22 20:42:36,900 [0x7f91fbe9a700] [DEBUG]: Deallocating query (1100984835141)
2013-10-22 20:42:36,900 [0x7f91fbe9a700] [DEBUG]: Releasing locks for query 1100984835141
2013-10-22 20:42:36,900 [0x7f91fbe9a700] [DEBUG]: SystemCatalog::deleteArrayLocks instanceId = 0 queryId = 1100984835141
2013-10-22 20:42:36,910 [0x7f91fbe9a700] [DEBUG]: Release lock of array Imaging@1 for query 1100984835141
2013-10-22 20:42:36,910 [0x7f91fbe9a700] [DEBUG]: Release lock of array ImagingRedim for query 1100984835141
2013-10-22 20:42:36,910 [0x7f91fbe9a700] [DEBUG]: Release lock of array ImagingRedim@1 for query 1100984835141
2013-10-22 20:42:36,916 [0x7f92071357c0] [DEBUG]: Query 1100984835141 is not found
2013-10-22 20:42:36,917 [0x7f92071357c0] [ERROR]: Dropping message of type=15, for queryID=1100984835141, from CLIENT because SystemException in file: src/query/Query.cpp function: getQueryByID line: 797 Error id: scidb::SCIDB_SE_QPROC::SCIDB_LE_QUERY_NOT_FOUND Error description: Query processor error. Query 1100984835141 not found.
Failed query id: 1100984705129

Off the top of my head, some things I could try are:

  • Store all the data in a contiguous file, thus avoiding redimension, but potentially penalizing loading (and performance?) even more.
  • To handle the redimension issue, one option would be splitting the contents of any “splittable” file into multiple files and joining them by ID when querying. There are a few cases where I can create groupings of “only” 50 attributes :smile:. In the default case, however, there is no correlation between the columns to be grouped into different files, so splitting would be a hack that complicates querying.
  • Treat the ‘payload’ (i.e. all the integer values) as extra dimensions, thus keeping few attributes while exploding the number of dimensions, but I don’t know what effect this would have. I also don’t know how useful this is for files with float values as ‘payload’ - from what I understand, support for NIDs is still not complete.
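
For the splitting option, the recombination itself would be straightforward in AFL: once each per-file array has been redimensioned onto the shared ID dimension, a join() lines them back up cell-by-cell (the array names here are hypothetical):

```
join(imaging_group1_redim, imaging_group2_redim)
```

join() requires the two arrays to have compatible dimensions, which would hold if both pieces were redimensioned onto the same id dimension with the same chunking.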

Any insight would be appreciated.

Thanks,
Manos


#2

The best way to approach this is to split the file so that your column number becomes a dimension.

Currently, your file looks like this:

ID, filename, position1, position2, ...
A, 'B', p1, p2, p3, ...
...

Split it into two files, like this:

File 1:

ID, filename
A, 'B'
C, 'D'
...

File 2:

ID, PositionNo, Value
A, 1, p1
A, 2, p2
...
C, 1, p1701
...

You can write a shell script to do that job, and then load your data into two SciDB arrays.
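
As a minimal sketch of that splitting step (in Python rather than shell for clarity; the file names and exact column layout are assumptions based on the example above):

```python
import csv

def split_wide_csv(src, ids_out, values_out):
    """Split a wide CSV (ID, filename, position1..positionN) into two narrow files:
       ids_out:    ID, filename
       values_out: ID, PositionNo, Value  -- one row per original value column."""
    with open(src, newline="") as fin, \
         open(ids_out, "w", newline="") as fids, \
         open(values_out, "w", newline="") as fvals:
        reader = csv.reader(fin)
        ids = csv.writer(fids)
        vals = csv.writer(fvals)
        for row in reader:
            rid, fname, positions = row[0], row[1], row[2:]
            ids.writerow([rid, fname])
            # Emit one (ID, PositionNo, Value) triple per wide column,
            # numbering positions from 1.
            for pos_no, value in enumerate(positions, start=1):
                vals.writerow([rid, pos_no, value])
```

The resulting values file is tall and narrow (three columns), which fits SciDB's model much better than thousands of separate attributes.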