redimension_store error


#1

I’m really trying to use SciDB, but I keep running into serious errors as soon as I move beyond trivial cases.

I have a large file of sparse data, about 10 GB. I’m running on an EC2 server with 12 instance-storage drives (2 TB each) attached and 16 CPUs. I’ve set up SciDB to run 15 instances distributed over the disks.

I couldn’t load this data with loadcsv.py directly, so I used csplit to split the file into separate files of 10 million lines each; only then was I able to load even one file with loadcsv.py:

loadcsv.py \
  -p 1600 \
  -i "/disk1/staging/quotes00" \
  -t NNNNNNNNNNN \
  -a "quotesFlat" \
  -s "<date:int64,msofday:int64,seqno:int64,symbol:int64,b:double NULL,bs:uint32 NULL,be:uint16 NULL,a:double NULL,as:uint32 NULL,ae:uint16 NULL,re:uint8>[i=0:*,100,0]" \
  -x

Any chunk size an order of magnitude larger than 100 fails.

I’m trying to use redimension_store to save the data in the final format needed for queries.

AQL% CREATE ARRAY quotes2 <b:double NULL,bs:uint32 NULL,be:uint16 NULL,a:double NULL,as:uint32 NULL,ae:uint16 NULL>[msofday=0:86400000,1,0,symbol=0:*,1,0,seqno=0:*,1,0,re=0:*,1,0];
Query was executed successfully
AQL% set lang afl;
AFL% redimension_store(quotesFlat, quotes2);
SystemException in file: src/query/executor/SciDBExecutor.cpp function: executeQuery line: 233
Error id: scidb::SCIDB_SE_NO_MEMORY::SCIDB_LE_MEMORY_ALLOCATION_ERROR
Error description: Not enough memory. Error 'std::bad_alloc' during memory allocation.
Failed query id: 1100933303168

Is this even possible in SciDB? Any help would be greatly appreciated. Thanks.


#2

Hello,

Sorry you’re having trouble; we’re here to help. Let me ask some questions and make some suggestions.

  1. When you say the load fails with loadcsv.py, what symptom are you seeing (freeze / error / slowness)?

  2. These chunk sizes are way off. I would try something like this (a fuller sketch follows below):

[msofday=0:86400000,86400000,0,symbol=0:*,5,0,seqno=0:*,100,0,re=0:*,1,0];
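Spelled out against your existing attribute list, the revised target array might look something like this (a sketch only; quotes3 is just a placeholder name):

AQL% CREATE ARRAY quotes3 <b:double NULL,bs:uint32 NULL,be:uint16 NULL,a:double NULL,as:uint32 NULL,ae:uint16 NULL>[msofday=0:86400000,86400000,0,symbol=0:*,5,0,seqno=0:*,100,0,re=0:*,1,0];
AQL% set lang afl;
AFL% redimension_store(quotesFlat, quotes3);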

What’s seqno in your case? Is that the exchange-assigned sequence number per trade or is it unique only for trades that occur in the same ms? It matters. Not sure what “re” is either.

We’ve done this with quote data, and there was a whole webinar on it: paradigm4.com/watch-the-usin … inar-form/
This should not be a difficult task. Just a few nuances.


#3

Thanks for the quick reply.

  1. Freeze and slowness. I ran into NETWORK-type errors multiple times. I’ve shut down the EC2 instance, but I can bring it back up later today and test again.

  2. There are over 500 million quotes within the day. Is my understanding of the chunking on the dimensions correct, i.e. that each chunk is 1 day x 5 symbols x 100 sequence numbers?

You are right: seqno is the tape-assigned number per quote (NBBO), and it is unique for the whole day. Disregard “re”.

Thanks for the webinar link. I’m doing a PoC specifically because of that webinar. I need to prove this out before we can move on to discussing Paradigm4’s add-on libraries.


#4

Ok understood.

  1. For the load: if and when you can give me the exact error text, it would help us diagnose, fix, and prevent the problem in the future.

  2. If the number is unique for the whole day, there are a couple of things you can do.
    When we did the demo, we actually stored the whole quote as a string. When we redimensioned the data, we used a sum(string) aggregate to concatenate quotes together, so our dimensions were simply [symbol=0:*,5,0,ms=0:*,86400000,0]. That averaged out to roughly 200K cells per chunk. Some stocks get a lot of quotes and some get few, and having multiple stocks in a chunk gives some semi-random smearing; some cells ended up holding multiple quotes. We then defined custom aggregates over those strings, and that gave us everything we needed. Using strings is a crude approach, but even so we achieved decent performance; we would have gone even faster with a custom binary UDT instead of strings and a few more days spent optimizing the aggregate. A rough sketch of this pattern follows below.
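For illustration only: assuming a hypothetical flat load array quotesFlatStr that carries symbol and msofday as int64 attributes plus the serialized quote in a string attribute named quote, the target array and the aggregating redimension might look roughly like this (whether the stock sum aggregate concatenates strings, or you need a small custom aggregate as mentioned above, depends on your build):

AQL% CREATE ARRAY quotesByTime <quote:string NULL>[symbol=0:*,5,0, msofday=0:*,86400000,0];
AQL% set lang afl;
AFL% redimension_store(quotesFlatStr, quotesByTime, sum(quote) as quote);

Each chunk then spans 5 symbols across the whole trading day, matching the layout described above.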

Another thing you can do is leave the seqno dimension in the array, so your dimensions could be something like [symbol=0:*,10000,0, ms=0:*,86400000,0, seqno=0:*,1000000,0]. Since seqno is unique for the day, that gives you about 1M cells per chunk, and the other dimensions just span the entire domain; conceptually, your chunks look like large diagonals in 3D space. You can then do aggregations by symbol and by time, but you do lose some windowing capabilities.
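A minimal sketch of that second layout, reusing the attribute list from your quotes2 array and writing the time dimension as msofday so it matches the attribute name in quotesFlat (quotesBySeq is just a placeholder name):

AQL% CREATE ARRAY quotesBySeq <b:double NULL,bs:uint32 NULL,be:uint16 NULL,a:double NULL,as:uint32 NULL,ae:uint16 NULL>[symbol=0:*,10000,0,msofday=0:*,86400000,0,seqno=0:*,1000000,0];
AQL% set lang afl;
AFL% redimension_store(quotesFlat, quotesBySeq);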

The first approach is definitely cleaner and gives you more capabilities, but it does require some C++ coding. We can help with some of that. Hope it makes sense.