Excessive storage footprint for dense array


#1

Hello.

I’ve been testing the storage size of different arrays in SciDB and am having trouble understanding why some arrays use so much storage. In particular, I’m storing a dense, 1-D array of ~3 billion integers. I’m storing each integer as ‘uint8’, so I would expect the uncompressed file size to be roughly 3 GB (1 byte per cell, 3 billion cells). Instead SciDB uses nearly 14 GB! If I turn on zlib compression I can bring this down to 4 GB, but that is still excessive and I don’t understand why it’s happening. The precise array schema I’m using is:
<ref:uint8> [gpos=0:*,1000000,0]
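For reference, a quick back-of-the-envelope sketch of those numbers (Python, just for illustration; the cell count is approximate):

# Expected raw size for ~3 billion uint8 cells, vs. what I'm observing.
cells = 3_000_000_000          # approximate number of cells
expected_gb = cells / 1e9      # 1 byte per uint8 cell
print(expected_gb)             # ~3.0 GB expected
print(14 / expected_gb)        # ~4.7x larger than expected, uncompressed
print(4 / expected_gb)         # ~1.3x larger even with zlib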

The only thing I can think of is that my array is essentially random, so not well suited to run-length encoding. Does SciDB always apply run-length encoding to an array, even in the worst case where the data is dense and there are no repeated values? I’m guessing that this is what is causing my inflated file size (but I’m just guessing). Can anyone confirm or deny that my guess is correct? Any suggestions for reducing the storage size?


#2

Wow, interesting result. Thanks for telling us about it.

I tried to repeat your experiment. I’m not getting 14 GB, but I am getting around double the raw size, at least on 15.7. Your suspicion about RLE is correct.

In particular, RLE kicks in when repeated values are present. So if the data looks like this: ‘112345567899’, the RLE encoder will create several “runs”: 11, 234, 55, 678, 99. For each run it stores a few extra bytes of header (length of the run, etc.), so this can become an inefficient case for a uint8.
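To make that concrete, here is a rough Python sketch of the splitting (purely illustrative; the real encoder and its on-disk layout are more involved):

# Split a sequence into RLE segments: repeated values become "runs",
# stretches without adjacent repeats become "literal" segments.
def split_segments(data):
    segments, start, i = [], 0, 0
    while i < len(data):
        j = i
        while j + 1 < len(data) and data[j + 1] == data[i]:
            j += 1
        if j > i:                              # a run of repeated values
            if i > start:
                segments.append(("literal", data[start:i]))
            segments.append(("run", data[i], j - i + 1))
            start = j + 1
        i = j + 1
    if start < len(data):
        segments.append(("literal", data[start:]))
    return segments

print(split_segments("112345567899"))
# [('run', '1', 2), ('literal', '234'), ('run', '5', 2),
#  ('literal', '678'), ('run', '9', 2)]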

In my experiment, I create one array “a” whose values count up with no adjacent repeats: “1 2 3 4 5 …”. I create another array “b” whose values are random() downcast to a uint8. In the case of “a”, RLE does not kick in. In the case of “b”, it is more likely that we will see repeated values. So for “a” I get about 3 GB and for “b” I get about 6 GB.

Perhaps your source of randomness is different, or perhaps you used a different tool to measure disk usage?

$ iquery -aq "create array a <val:uint8> [i=0:2999999999, 1000000,0]"
Query was executed successfully

$ iquery -anq "store(build(a, i%256), a)"
Query was executed successfully

$ iquery -aq "create array b <val:uint8> [i=0:2999999999, 1000000,0]"
Query was executed successfully

$ iquery -anq "store(build(b, random()), b)"
Query was executed successfully

$ iquery -aq "filter(list('arrays'), name='a' or name='b')"
{No} name,uaid,aid,schema,availability,temporary
{0} 'a',83,83,'a<val:uint8> [i=0:2999999999,1000000,0]',true,false
{1} 'b',85,85,'b<val:uint8> [i=0:2999999999,1000000,0]',true,false

$ iquery -aq "aggregate(filter(list('chunk map'), uaid=83), sum(asize))"
{i} asize_sum
{0} 3147264000

$ iquery -aq "aggregate(filter(list('chunk map'), uaid=85), sum(asize))"
{i} asize_sum
{0} 6292992000

This is something we will look at carefully.


#3

Just a few more thoughts: with a larger datatype (like int64 or double) the defect would not be so apparent. Every RLE run has an overhead of 16 bytes. If we have a run of more than 3 repeated 8-byte values, it makes sense to create a run, because we save more than the 16 bytes of overhead. But if there are fewer repeated values, we shouldn’t bother. So the unlucky case is one where pairs of repeated values occur frequently, and it’s exacerbated when the values are of a smaller size, just like in your case.
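To put numbers on it, here is a small sketch using that 16-byte figure (a simplified cost model, not the actual chunk format):

# Rough cost comparison, assuming an RLE run costs 16 bytes of header
# plus one copy of the value, while literal storage costs n * elem_size.
OVERHEAD = 16
for n, elem_size, label in [(2, 1, "pair of uint8:"),
                            (4, 8, "4 repeated int64:"),
                            (18, 1, "18 repeated uint8:")]:
    literal = n * elem_size
    as_run = OVERHEAD + elem_size
    print(label, literal, "bytes literal vs", as_run, "bytes as a run")
# pair of uint8: 2 bytes literal vs 17 bytes as a run
# 4 repeated int64: 32 bytes literal vs 24 bytes as a run
# 18 repeated uint8: 18 bytes literal vs 17 bytes as a run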

I suppose we didn’t catch this before because we hadn’t seen data like ‘1122334455…’ that also happened to be a small type like uint8.

In fact, such logic shouldn’t be hard to add to the RLE encoding mechanism: only create an RLE run when it actually yields compression. We’ll definitely evaluate this as a potential fix. Thanks again for pointing it out.
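The check itself could be as simple as something like this (again just a sketch against the simplified cost model above):

# Only create a run when the repeats it replaces outweigh the per-run overhead.
OVERHEAD = 16   # assumed per-run header cost, as above

def worth_a_run(run_length, elem_size, overhead=OVERHEAD):
    return (run_length - 1) * elem_size > overhead

print(worth_a_run(2, 1))    # False: a pair of uint8 values is not worth a header
print(worth_a_run(4, 8))    # True:  the "more than 3 repeated int64s" case
print(worth_a_run(18, 1))   # True:  a uint8 run only pays off past 17 repeats

Plugged into a splitter like the one sketched earlier in the thread, this would leave data like ‘112345567899’ as a single literal segment for uint8 values.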


#4

Thanks for the quick reply. It’s good to know I’m not doing something stupid, at least. Just to add a few more details in case they’re useful: I’m using the following macro to measure the size of the arrays I create:
asize(xx) = project(apply(aggregate(filter(list('datastores'), uaid = xx), sum(file_blocks_512) as sum_file_blocks_512), total_MiB, sum_file_blocks_512/2048.0), total_MiB);

The data I’m storing is the human reference genome, where I’ve concatenated all the chromosomes and mapped nucleotides to integers (i.e., A->0, C->1, G->2, T->3). It’s not exactly random and there are certainly some stretches with repeated values, but a large fraction of the values (I’m not sure exactly what fraction) will be non-repeated.
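For what it’s worth, the mapping step is essentially this (a minimal Python sketch; it ignores ambiguity codes such as N, which the reference does contain):

# Minimal sketch of the nucleotide -> integer mapping described above.
MAPPING = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_sequence(seq):
    return [MAPPING[base] for base in seq.upper()]

print(encode_sequence("ACGTAC"))   # [0, 1, 2, 3, 0, 1]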


#5

Oh cool. I’ve done a bunch of stuff with VCF, MAF, and BAM files, but haven’t tried to store a whole long sequence.

Clearly, only 2 bits per character are needed, so even using 1 byte per nucleotide is too much. Add the RLE silliness on top of that, and I can see how it gets unreasonably large.

We do have user-defined datatypes, but at the moment they must be 1 byte or more, so we can’t define a two-bit nucleotide UDT. However, maybe we can split the sequence into 4-letter words like “ACAT GTGC CCTA…” and then use one byte per 4 nucleotides. We could define a datatype to convert it to and from strings, other representations, whatever is needed. We would essentially divide the coordinate system by 4. This might be an interesting approach that would indeed save space…
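A rough sketch of that packing idea in Python, rather than as an actual UDT (the bit layout here is just one arbitrary choice):

# Pack 4 nucleotides (2 bits each) into one byte, and unpack again.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack4(word):                     # word is a 4-letter string, e.g. "ACAT"
    b = 0
    for ch in word:
        b = (b << 2) | CODE[ch]
    return b

def unpack4(b):
    return "".join(BASES[(b >> shift) & 3] for shift in (6, 4, 2, 0))

packed = [pack4(w) for w in ("ACAT", "GTGC", "CCTA")]
print(packed)                        # [19, 185, 92]
print([unpack4(b) for b in packed])  # ['ACAT', 'GTGC', 'CCTA']

A nucleotide at position g in the original array would then live in byte number g/4 (integer division) of the packed array, which is what dividing the coordinate system by 4 amounts to.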