SciDB data types and their sizes


#1

Hello,

when I issue
iquery -aq "summarize(build(<val:int16>[i=0:3; j=0:3],1));"

I get
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} 'all',16,122,2,16,16,16,48,61,74

so,
bytes/count = 122/16 = 7.625

Does this mean that SciDB uses about 8 bytes for each 2-byte signed integer?

Also
iquery -aq "summarize(build(<val:int16>[i=0:999; j=0:999],1));"
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} 'all',1000000,122,2,1000000,1e+06,1000000,48,61,74

iquery -aq "summarize(build(<val:int16>[i=0:999; j=0:999],1*j));"
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} 'all',1000000,2000120,2,1000000,1e+06,1000000,48,1.00006e+06,2000072

In the first case SciDB compresses the data efficiently (probably RLE?) and keeps only 122 bytes for 1000000 int16 values,
while in the second case it has to store all the data as is?

How can I understand and influence this?

Thank you!


#2

Data chunks are RLE encoded in row-major order.

Example:

1 1 1         versus      1 2 3
2 2 2                     1 2 3
3 3 3                     1 2 3

In the first example, the values to be encoded are 1, 1, 1, 2, 2, 2, 3, 3, 3, and in the second they are 1, 2, 3, 1, 2, 3, 1, 2, 3. SciDB uses a single-pass RLE, so the first set of numbers results in 3 segments (“1 with a run length of 3”, “2 with a run length of 3”, “3 with a run length of 3”), whereas the second results in 1 segment with 9 literals (1, 2, 3, 1, 2, 3, 1, 2, 3).
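To make the segment counts concrete, here is a minimal Python sketch of a single-pass run-length encoder. It is a toy model for illustration only, not SciDB's actual encoder or on-disk format; the function name `rle_segments` and the tuple layout are made up for this example.

```python
from itertools import groupby

def rle_segments(values):
    """Toy single-pass RLE (illustrative, not SciDB's real format).

    Repeated values become ("run", value, length); consecutive
    non-repeating values are merged into one ("literal", [values]).
    """
    segments = []
    for value, group in groupby(values):
        length = len(list(group))
        if length > 1:
            segments.append(("run", value, length))
        elif segments and segments[-1][0] == "literal":
            # Extend the literal segment that is already open.
            segments[-1][1].append(value)
        else:
            segments.append(("literal", [value]))
    return segments

left = [1, 1, 1, 2, 2, 2, 3, 3, 3]    # first grid, read row by row
right = [1, 2, 3, 1, 2, 3, 1, 2, 3]   # second grid, read row by row

print(rle_segments(left))   # [('run', 1, 3), ('run', 2, 3), ('run', 3, 3)]
print(rle_segments(right))  # [('literal', [1, 2, 3, 1, 2, 3, 1, 2, 3])]
```

The first grid collapses to three short run records, while the second has to carry all nine values.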

There is also an additional attribute, the empty bitmap for the given chunk (it records which cells in the chunk have data and which are empty). For dense arrays this attribute is small, but it can be larger for sparse arrays.

In your second query (1*j), adjacent cells in row-major order never hold the same value, so there are no runs to collapse and the RLE encoding is one segment with 1000000 literals.
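As for influencing this: what matters is how often the value changes between neighboring cells in row-major order. The sketch below only counts runs in plain Python; it does not call SciDB. The helper `count_runs` is hypothetical, and the `1*i` case is my own addition for comparison, so it would be worth confirming with summarize() on a real array.

```python
from itertools import groupby

def count_runs(n_i, n_j, cell_value):
    """Count maximal runs of equal values when an n_i x n_j array is
    walked in row-major order (j, the last dimension, varies fastest)."""
    cells = (cell_value(i, j) for i in range(n_i) for j in range(n_j))
    return sum(1 for _ in groupby(cells))

n = 1000
print(count_runs(n, n, lambda i, j: 1))  # constant, like build(..., 1)  -> 1
print(count_runs(n, n, lambda i, j: j))  # like build(..., 1*j)          -> 1000000
print(count_runs(n, n, lambda i, j: i))  # like build(..., 1*i), assumed -> 1000
```

If the chunk really is encoded in row-major order, values that are constant along j should compress far better than values that change with j.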

If you would like to see the per-attribute summary, you can use summarize(<ARRAY>, by_attribute:1).

This might show the following:

$ iquery -aq "summarize(build(<val:int16>[i=0:999; j=0:999],1), by_attribute:1);"
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} 'val',1000000,74,1,1000000,1e+06,1000000,74,74,74
{0,1} 'EmptyTag',1000000,48,1,1000000,1e+06,1000000,48,48,48


$ iquery -aq "summarize(build(<val:int16>[i=0:999; j=0:999],1*j), by_attribute:1);"
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} 'val',1000000,2000072,1,1000000,1e+06,1000000,2000072,2.00007e+06,2000072
{0,1} 'EmptyTag',1000000,48,1,1000000,1e+06,1000000,48,48,48
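Coming back to the original question about 8 bytes per int16: the numbers above suggest that most of the 122 bytes is fixed per-chunk overhead rather than per-cell cost. The arithmetic below is only a reading of the summarize() output shown in this thread; the split into "payload" and "overhead" is an assumption, not a documented chunk layout.

```python
INT16_BYTES = 2
cells = 1_000_000

val_bytes_constant = 74        # 'val' chunk for build(..., 1)
val_bytes_varying = 2_000_072  # 'val' chunk for build(..., 1*j)
empty_tag_bytes = 48           # 'EmptyTag' chunk in both runs

# Assumption: the literal segment stores one int16 per cell and
# everything else in the chunk is fixed overhead.
overhead = val_bytes_varying - cells * INT16_BYTES
print(overhead)                                        # 72
print(val_bytes_constant - overhead)                   # 2 (one int16 value)
print(val_bytes_constant + empty_tag_bytes)            # 122, the 'all' total
print((val_bytes_varying + empty_tag_bytes) / cells)   # 2.00012 bytes per cell
```

Under that reading, the 4x4 array costs 7.625 bytes per cell only because 122 bytes of mostly fixed overhead is spread over 16 cells; with 1000000 uncompressible cells the cost approaches the 2 bytes of an int16.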