How are chunks stored on disk?


#1

Hello,

I have been studying the literature behind SciDB over the last few days, and there is one thing that I haven’t been able to clarify yet. I understand that arrays are split into chunks (either regular or irregular) and that irregular chunks can be further divided into (regular) tiles.

I can also understand how SciDB’s catalog can point the query execution engine to the right chunks by keeping track of the dimension instance ranges for every one of them.

What I don’t understand is how the chunks are stored on disk, and specifically how the cells are stored. I took a look at the codebase, and the only thing I could work out is that the payload is run-length encoded; I couldn’t understand much more about how the cells are laid out inside each chunk.

I’ve read in one of your publications that “for sparse arrays, only non-null cells are stored inside chunks and their order is arbitrary.” [Soroush2011]

For example, let’s say that the iterator returns a chunk, and I want to perform a join with another chunk. How do I align all the cells, if the dimension values are random?
Do I have to loop over all the cells of the other chunk for each cell?

What I want to ask is, does SciDB also store dimension identifiers with the value itself for every cell, or are the dimensions indexed or ordered somehow within each chunk?

Thank you very very much!!


#2

For a ConstChunkIterator, SciDB 14.8 provides operator++(), getPosition(), and getData()/getItem() to iterate through all non-empty cells. Those should be enough for you to walk through the data sequentially. If you want to join two chunks, and your use case is that, given a cell position in one chunk, you compute a cell position in the other chunk and want to jump directly to it, you can call setPosition().
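
To make the access pattern concrete, here is a minimal sketch of a join that walks one chunk sequentially and probes the other by position. MockChunkIterator is a hypothetical stand-in backed by an in-memory map, not SciDB's ConstChunkIterator; it only mimics the method names mentioned above. The end() method, the bool return value of setPosition(), the double cell type, and the names Coordinates, JoinedCell and joinChunks are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Cell position inside a chunk: one coordinate per dimension.
using Coordinates = std::vector<int64_t>;

// Hypothetical stand-in for a chunk iterator (NOT the real SciDB class).
// It exposes only the non-empty cells of a sparse chunk.
class MockChunkIterator {
public:
    explicit MockChunkIterator(const std::map<Coordinates, double>& cells)
        : cells_(cells), it_(cells_.begin()) {}

    // Assumed helper: true once every non-empty cell has been visited.
    bool end() const { return it_ == cells_.end(); }

    // Advance to the next non-empty cell.
    void operator++() { assert(!end()); ++it_; }

    const Coordinates& getPosition() const { assert(!end()); return it_->first; }
    double getItem() const { assert(!end()); return it_->second; }

    // Jump directly to `pos`. Assumed to return false when that cell is
    // empty, in which case the iterator stays where it was.
    bool setPosition(const Coordinates& pos) {
        auto found = cells_.find(pos);
        if (found == cells_.end()) return false;
        it_ = found;
        return true;
    }

private:
    const std::map<Coordinates, double>& cells_;
    std::map<Coordinates, double>::const_iterator it_;
};

struct JoinedCell {
    Coordinates pos;
    double left;
    double right;
};

// Walk `left` sequentially and probe `right` at the same position,
// so there is no nested loop over all cells of the other chunk.
std::vector<JoinedCell> joinChunks(MockChunkIterator& left, MockChunkIterator& right) {
    std::vector<JoinedCell> out;
    for (; !left.end(); ++left) {
        const Coordinates& pos = left.getPosition();
        if (right.setPosition(pos)) {
            out.push_back({pos, left.getItem(), right.getItem()});
        }
    }
    return out;
}
```

With this pattern the work is one positional lookup per non-empty cell of the left chunk. How expensive each lookup is in a real chunk depends on the storage format, which this sketch does not model.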

You are right that chunks are stored using run-length encoding. You should never need to look inside the storage format of the chunks. But if you really want to know, see RLE.h and RLE.cpp, though they are very complicated.
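
For readers unfamiliar with the general idea, the following is a generic sketch of run-length encoding over a flat sequence of values. It is not the layout used in RLE.h and RLE.cpp, which is considerably more involved; the names Run, rleEncode and rleDecode are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One run: a value and how many times it repeats consecutively.
struct Run {
    int64_t value;
    std::size_t length;
};

// Collapse consecutive equal values into (value, length) runs.
std::vector<Run> rleEncode(const std::vector<int64_t>& values) {
    std::vector<Run> runs;
    for (int64_t v : values) {
        if (!runs.empty() && runs.back().value == v) {
            ++runs.back().length;   // extend the current run
        } else {
            runs.push_back({v, 1}); // start a new run
        }
    }
    return runs;
}

// Expand the runs back into the original sequence.
std::vector<int64_t> rleDecode(const std::vector<Run>& runs) {
    std::vector<int64_t> values;
    for (const Run& r : runs) {
        values.insert(values.end(), r.length, r.value);
    }
    return values;
}
```

For example, the sequence 7, 7, 7, 0, 0, 4 encodes to three runs: (7, 3), (0, 2) and (4, 1).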