Vertical Partitioning and Chunk Size


#1

I’m trying to determine the chunk sizes for a given array. However, I’m a little confused about a couple of things:

  1. What’s the ideal chunk size?

I’m only asking because the user manual, this topic (where the ideal size is said to be around 4-8 MB), and this paper each give their own recommendation.

I wonder, after all, what the proper size for a chunk is. And if there is no single proper size, what should I base a more efficient division of the dimensions on?

  2. Is the ideal chunk size (in MB) defined for an entire cell, or for each attribute? As I understand it, SciDB creates one chunk per attribute (vertical partitioning). If so, should I choose the chunk size taking into account the size of the entire cell, or only one attribute?

I’ll explain. Suppose I have a 3-attribute array where all of the attributes are double. When I calculate the chunk size so that it occupies 10 MB, should I take into account all three attributes (24 bytes times the number of cells per chunk) or only one (8 bytes per cell)? The arithmetic for both readings is spelled out below, right after question 3.

  3. Do I have the guarantee that SciDB stores the entire cell on the same worker? Suppose I create an array with 2 attributes. SciDB splits it on a per-attribute basis, so there is a chunk for each attribute. Will the values of a given cell, which end up in different chunks, be stored on the same worker? Or is there a possibility that a cell is split among different workers?
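
To make question 2 concrete, here is the arithmetic behind the two readings (just using the 10 MB figure above):

    counting all three attributes: 10 * 1048576 / 24 bytes ≈ 437,000 cells per chunk
    counting one double attribute: 10 * 1048576 / 8 bytes  = 1,310,720 cells per chunk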

That’s all for now.
Thanks in advance.


#2

Hello, some answers:

1,2) What’s the ideal chunk size?
For the double datatype, you want to have about 1 million elements in the chunk. Chunks are RLE-encoded, but if the encoding does not kick in, the chunk will hold about 8 MB of data.
This is independent of the number of attributes since the chunks for each attribute are processed separately.
There are, unfortunately, some older operators that will read all the chunks for a particular position at once. If that’s the case and you have a lot of attributes (e.g. 100), you might want to shrink the chunk size for now. In the future that code will go away.

If your data is sparse, you want to increase the declared chunk size so that each chunk still contains about 100K to 1M non-empty elements.
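
For example, a minimal sketch (array names, attribute names, and dimension bounds are made up for illustration; dimensions are declared as start:end, chunk length, overlap). A 1000 x 1000 chunk of doubles holds 1,000,000 cells, i.e. roughly 8 MB per attribute before encoding, and each attribute gets its own ~8 MB chunk:

    -- dense case: 1000 x 1000 = 1,000,000 cells per chunk, ~8 MB per double attribute
    CREATE ARRAY dense_example
        <a:double, b:double, c:double>
        [x=0:99999,1000,0, y=0:99999,1000,0];

    -- sparse case: if only about 1% of cells are occupied, a larger declared chunk
    -- (10000 x 10000 here) still holds roughly 1M non-empty cells
    CREATE ARRAY sparse_example
        <v:double>
        [x=0:999999,10000,0, y=0:999999,10000,0];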

3) Do I have the guarantee that SciDB stores the entire cell in the same worker?
Yes, that’s always the case at the moment.

Hope it helps.
-Alex Poliakov


#3

Thank you very much for the answers.
One last question: is there a way to find out the actual size of a chunk on disk? Is this kind of information available somewhere, maybe in the system catalog?


#4

There is a “hidden” command you can use called

    list('chunk map')
You definitely want to look at the output in “csv+” form, or write a query to interrogate it. One of the attributes is “asize” (allocated size), which is the size of the data on disk.
Bear in mind that chunks are sometimes “cloned”, so sum(asize) is not always the total size on disk.
If you want to go ahead and play with that, I recommend starting with small example arrays to get a feel for it.
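
A rough sketch of both approaches (treat the exact chunk-map attribute set as version-dependent; asize is the one mentioned above):

    -- dump the chunk map; running it through iquery with "-o csv+" keeps the output readable:
    --   iquery -o csv+ -aq "list('chunk map')"
    list('chunk map');

    -- sum the allocated on-disk size (asize, in bytes) over all chunks;
    -- because chunks can be cloned, this can overcount the real footprint
    aggregate(list('chunk map'), sum(asize));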