Chunk size and total array storage size


#1

When I stored the same array with different chunk sizes, I found that as the chunk size decreased (i.e. more chunks), the array's storage footprint increased. What is the reason for this? Is it because more physical addresses (metadata) for the chunks are stored in the data file?


#2

So … when we store a chunk, there are three things to keep in mind when estimating storage.

  1. There’s the per-chunk metadata overhead. We need to track each chunk’s physical location.

  2. There’s some per-chunk bookkeeping data that’s a fixed size.

  3. We compress and encode an attribute chunk’s data. So if you have, for example, a chunk where every value is “1”, we compress that down to a single RLE segment and value. BUT the per-chunk fixed overhead remains the same.

When you decrease a chunk’s logical size, you increase the number of chunks in the array. As the per-chunk overhead is fixed, you’re going to increase the storage requirements. The problem is particularly acute when you have a lot of data that benefits from RLE compression … lots of data but only a few distinct values.
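
To make the trade-off concrete, here’s a minimal sketch of that storage model. The byte figures are illustrative assumptions, not SciDB’s actual on-disk numbers; the point is only that for well-compressed data, total storage scales with the chunk count:

```python
# Toy storage model: each chunk pays a fixed cost (metadata +
# bookkeeping) plus a compressed payload. The byte figures are
# illustrative assumptions, not SciDB's real on-disk numbers.
PER_CHUNK_OVERHEAD = 64   # hypothetical fixed bytes per chunk
RLE_RUN_BYTES      = 16   # hypothetical bytes for one RLE run

def chunk_count(array_shape, chunk_shape):
    """Number of chunks needed to tile the array."""
    n = 1
    for dim, chunk in zip(array_shape, chunk_shape):
        n *= (dim + chunk - 1) // chunk   # ceiling division
    return n

def constant_array_storage(array_shape, chunk_shape):
    """Estimated bytes for an array where every value is identical,
    so each chunk's payload collapses to a single RLE run."""
    n = chunk_count(array_shape, chunk_shape)
    return n * (PER_CHUNK_OVERHEAD + RLE_RUN_BYTES)

# Halving the chunk edge quadruples the chunk count in 2-D, and
# with it the fixed overhead:
for edge in (1024, 512, 256, 128):
    shape = (1024, 1024)
    print(edge, chunk_count(shape, (edge, edge)),
          constant_array_storage(shape, (edge, edge)), "bytes")
```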


#3

So for the RLE-encoded case, as the number of chunks increases, the total chunk storage grows significantly beyond just the per-chunk metadata. E.g., a chunk of size 1000 x 1000 containing all 0 values has a storage size of 1 KB (excluding per-chunk metadata), and if it is split into 100 chunks of size 100 x 100, each chunk occupies more than 1 KB / 100 = 0.01 KB. This is understandable. Thanks
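
A minimal arithmetic check of that, assuming (as post #2 suggests) that a constant-valued chunk’s cost is essentially all fixed per-chunk cost, so every chunk costs roughly the same amount regardless of its logical size. The ~1 KB figure is the one measured above:

```python
# Each constant-valued chunk compresses to a single RLE run, so
# (under the assumption above) each one costs about the same
# fixed amount -- here the ~1 KB measured for one big chunk.
fixed_cost_kb = 1.0                    # measured cost of one chunk

one_big_chunk = 1 * fixed_cost_kb      # single 1000 x 1000 chunk
hundred_small = 100 * fixed_cost_kb    # 100 chunks of 100 x 100

print(one_big_chunk)   # 1.0 KB
print(hundred_small)   # 100.0 KB: each small chunk costs ~1 KB,
                       # far more than 1 KB / 100 = 0.01 KB
```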


#4

Hello!

I have a similar problem regarding array storage size. It seems that SciDB needs 100 to 300 times more space on disk than a netCDF file.

An example:
netCDF file: 105K
loadable binary: 322K
flat array: 324K
redimensioned (2-dimensional) array: 12332K to 33584K, depending on the chunk size settings.

Is there an explanation for this behaviour? Does SciDB reserve space for data yet to come? I am using the 14.3 virtual machine image with standard settings.

Best regards
Karl


#5

This really looks like a problem.

Can you share your example data files and netCDF extraction code w/ us, please?

(I’ll drop you a PM.)