Calculating chunk size


#1

I’m trying to calculate a good chunk size for a simple array holding geospatial data. Forgive me if I’m overlooking the obvious, but after reading the relevant section in the manual I’m still not entirely sure how to do it…

To quote the manual:

The first two sentences are clear; it’s the next two (in bold) that I don’t get. All I take away from them is “the chunk size is the chunk size”, possibly because I misunderstand the terms number of cells and/or chunk size. A straightforward calculation is then mentioned but not given or explained - could you give an example for the not-so-gifted like me? :wink:
The schema for my array looks like this

SpatialGridNode <lat:float,lon:float,cellid:int64 NULL DEFAULT null,land_flag:bool> [gpi]

In essence, you have geographic coordinates (lat/lon) and a flag to indicate whether these are on land or ocean (land_flag). The coordinates are grouped into regions called cells, hence the cellid.
So how do I calculate a good chunk size value for this array?

Oh, and is there a list of byte sizes for the data types? I checked this manual page but found nothing.


#2

Hi,

Have you seen this tutorial video?
scidb.org/forum/viewtopic.php?f=18&t=1204

It’s long, but there’s a section on chunk sizes at around 1:15:00. Does it help at all?


#3

Hadn’t seen it yet - thanks for the hint :smile:

The rule of thumb I take away is: if your data ISN’T sparse, the chunk lengths of your array’s dimensions, multiplied together, should yield something like 1 million cells - is that correct? And you arrive at the 1M by assuming an average of 8 bytes per cell and aiming for chunks of 10-20MB?
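(If I’ve got the arithmetic right, that would mean roughly 1,000,000 cells × 8 bytes/cell ≈ 8 MB of raw attribute data per chunk, which is at least in the same ballpark as that 10-20MB target?)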
Automatic chunk size calculation seems to use half a million cells per chunk as the default and assumes that the dimensions are square, right?

So, if I know my dimensions are not square (e.g. I know one dimension will have only a few entries while the other will have many), I could set the chunk size manually (on at least one of them) to improve performance?


#4

As I recall that was true in 14.3, but not as of 14.6. We changed things so that the fully default chunk size selection (i.e. what happens when you don’t specify anything) assumes (a) dense data and (b) a target chunk size of 1,000,000 entries.
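To make that concrete for the 1-D array in the first post: leaving everything unspecified in 14.6 works out to roughly the same thing as spelling the chunk length out yourself - something like the following, where the 0:* bound and the 0 overlap are just illustrative values I’ve filled in:

CREATE ARRAY SpatialGridNode
<lat:float, lon:float, cellid:int64 NULL DEFAULT null, land_flag:bool>
[gpi=0:*,1000000,0];

In other words, each chunk holds up to 1,000,000 consecutive gpi positions.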

Chunk size selection has a big impact on performance. It’s a physical tuning question that’s on a par with picking indices in a relational DBMS, or getting the mapred.min.split.size parameter right for your Hadoop setup.

Also: have a look at the calculate_chunk_length.py script (added in about 14.6). The idea behind this script is that it implements essentially the same strategies we use internally to figure out chunk sizes. It might (or might not) produce the perfect chunking, but what it gives you is usually a very good initial guess.

Current Best Advice (YMMV)?

Use the default settings for 1D arrays.
Use calculate_chunk_length.py to get a good first cut at per-dimension chunk lengths for nD arrays, based on the data you’re actually going to store.
Use the tools described in the video or in the forum posting viewtopic.php?f=11&t=1330&p=2815 to monitor how your chunking strategy is working … in terms of elements per chunk and per-chunk sizes … and adjust when (for example) you find your chunk sizes fall (on average) outside the 1M to 16M range.
Pay attention to your queries to determine things like overlap size. If you’re using lots of data window queries, using the overlapping chunks feature effectively can make your queries go really, really fast.
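To illustrate that last point, here is a hypothetical 2-D layout (the array name, dimension names, and bounds are all made up for the example). The chunk length of 1,000 per dimension hits the 1,000,000-cells-per-chunk target, and the trailing 10 on each dimension is a 10-cell overlap:

CREATE ARRAY grid2d
<land_flag:bool>
[lat_idx=0:17999,1000,10, lon_idx=0:35999,1000,10];

With that overlap in place, a windowed aggregate such as the AFL query window(grid2d, 5, 5, 5, 5, count(land_flag)) can be evaluated chunk by chunk, because every chunk already carries a 10-cell border of its neighbours’ data.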


#5

Where can I find the calculate_chunk_length.py script?


#6

Hi Jigeeshu, that script is installed in the /opt/scidb//bin directory.


#7

Also … jigeeshu?

Have a look at the USING clause on the CREATE ARRAY … statement.

Suppose the data you want to use to populate your 2-D array currently lives in a 1-D array. Let’s call the former (the 2-D array) targetArray and the latter (the 1-D array) sourceArray.

CREATE ARRAY targetArray
<
   attributes 
>
[ Dimensions Specification ] USING sourceArray;

So long as the attribute and dimension names line up between sourceArray and targetArray in the same way they would need to for a redimension, and you substitute question marks in the “Dimensions Specification” in place of the per-dimension chunk lengths, this USING query will create targetArray with the question marks replaced by values derived from the data in sourceArray.
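For example, with the SpatialGridNode array from the first post as the source, a hypothetical 2-D target (my guess at the intent: cellid and gpi as dimensions, the rest as attributes; the target array name is made up) would look something like this, with ? standing in for each per-dimension chunk length:

CREATE ARRAY SpatialGridTarget
<lat:float, lon:float, land_flag:bool>
[cellid=0:*,?,0, gpi=0:*,?,0]
USING SpatialGridNode;

The ? values get filled in from the actual cellid and gpi values found in SpatialGridNode. (Note that cellid is nullable in the source; since dimension coordinates can’t be null, only cells with a populated cellid can land in a target laid out this way.)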

Check it out.