Do dense and sparse chunks use different strategies to store cells?



When I store data in an array, I find that each chunk has exactly the same size. (I checked the size of the chunk files in the file system.)
The data I used contains some nulls, so I suspect that each chunk is physically stored as a fully dense array.

I am aware that cell offsets are predetermined for dense arrays, which can speed up cell selection. As far as I know, sparse arrays are stored with additional information about which cells are present, and cells without any data (nulls) are left out. Am I getting this right?

I want to know whether SciDB uses both strategies for storing chunks physically. If so, is the strategy chosen per chunk?

Thank you.


Hi! The short answer is that SciDB stores per-attribute chunks of both dense and sparse arrays in the same format: a run-length encoding of the values as they appear in row-major order.

Be aware that what you see in the file system are not chunks but “data stores”: per-array files containing many chunks of many different attributes. Some of the space in these files may be on a free list and unallocated, so the file size doesn’t correlate directly with the sizes of individual chunks, especially if a lot of insert() and remove_version() operators have executed.

Physical chunk size is going to be determined by how many runs of successive values (including the null value) appear in the row-major ordering. If you have a long run of nulls (or of 3.14, or of ‘some string’, etc.), the value is stored once in the attribute chunk, along with a count. Runs can span “missing cells”, so for example if you have an array

{0} 4
{1} 5
{6} 5
{7} 23

the two 5 values are condensed to one 5 with a count of two, even though they are not in successive logical positions. (This is accomplished using a special system-accessible chunk called the “empty bitmap” or EBM, which is used to map physical positions in the chunk to logical positions in the array.)
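To make the idea concrete, here is a minimal Python sketch of run-length encoding combined with an empty bitmap, using the four-cell array above. It only illustrates the concept and is not SciDB's actual on-disk format; the names encode_chunk, decode_chunk, and chunk_length are made up for this example, and None stands in for a null value.

```python
from itertools import groupby

def encode_chunk(cells, chunk_length):
    """Encode {logical position: value} as (empty bitmap, RLE runs).

    Hypothetical helper for illustration only, not a SciDB API.
    """
    # Empty bitmap: True where a cell exists at that logical position.
    bitmap = [pos in cells for pos in range(chunk_length)]
    # Values of the existing cells, in logical (row-major) order.
    values = [cells[pos] for pos in sorted(cells)]
    # Collapse consecutive equal values into (value, count) runs.
    # Missing cells are not in `values`, so runs can span them.
    runs = [(value, len(list(group))) for value, group in groupby(values)]
    return bitmap, runs

def decode_chunk(bitmap, runs):
    """Rebuild {logical position: value} from the bitmap and the runs."""
    values = (value for value, count in runs for _ in range(count))
    return {pos: next(values) for pos, present in enumerate(bitmap) if present}

cells = {0: 4, 1: 5, 6: 5, 7: 23}
bitmap, runs = encode_chunk(cells, chunk_length=8)
print(runs)    # [(4, 1), (5, 2), (23, 1)]
print(bitmap)  # [True, True, False, False, False, False, True, True]
```

The cells at positions 1 and 6 end up in a single run of two 5s, and the bitmap is what lets decode_chunk put them back at the right logical positions. A long run of nulls would collapse the same way, into one (None, count) pair.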

The paper SciDB MAC Storage Explained goes into excruciating detail about the storage subsystem:

I hope this helps!