How are Chunks Distributed?


#1

I’m curious how chunks are distributed among SciDB instances … if I create a bunch of arrays and store them using two SciDB instances, will even-numbered chunks always be stored on the first SciDB instance? I ask because I’d like to experiment with running a lot of queries in parallel, and am curious how well the load will be balanced.

Cheers,
Tim


#2

Hey Tim,

“Even-numbered” is hard to define when you have more than one dimension. Essentially, a hash is applied to each chunk's position, measured in chunk-sized steps from the array origin along each dimension. The hash tries to smear chunks evenly across instances while staying deterministic, so we can still quickly work out which instance a chunk is on.

This is the actual code:

uint64_t ArrayDesc::getChunkNumber(Coordinates const& pos) const
{
    Dimensions const& dims = _dimensions;
    uint64_t no = 0;
    /// The goal here is to produce a good hash function without using array
    /// dimension sizes (which can change in the case of unbounded arrays)
    for (size_t i = 0, n = pos.size(); i < n; i++)
    {
        // 1013 is a prime number close to 1024. 1024*1024 is assumed to be the optimal chunk size for a 2-d array.
        // For 1-d arrays the value of this constant is not important, because it is multiplied by 0.
        // 3-d arrays and arrays with more dimensions are less common, and using a prime number and XOR should
        // provide good enough (uniform) mixing of bits.
        no = (no * 1013) ^ ((pos[i] - dims[i].getStart()) / dims[i].getChunkInterval());
    }
    return no;
}

In practice, if the array is sparse, some chunks may be missing and some may be much bigger than others, so you can end up with chunk-to-chunk skew or instance-to-instance skew. There’s a post here, viewtopic.php?f=18&t=1091, that shows a query you can run to see where your data actually landed. And some smart academic folks are currently researching this very issue.
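
If you want to poke at the hash offline, here is a minimal standalone sketch of the same arithmetic. The Dim struct, chunkNumber and tally2D are hypothetical names made up for this example, not SciDB API. More importantly, the step that turns a chunk number into an instance (chunk number modulo instance count) is only an assumption for illustration; the function above produces the chunk number and nothing more.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a SciDB dimension: just the two values the hash reads.
struct Dim {
    int64_t start;          // what getStart() would return
    int64_t chunkInterval;  // what getChunkInterval() would return
};

// Same arithmetic as ArrayDesc::getChunkNumber, lifted out of the class.
uint64_t chunkNumber(const std::vector<int64_t>& pos, const std::vector<Dim>& dims)
{
    uint64_t no = 0;
    for (std::size_t i = 0; i < pos.size(); ++i) {
        // Chunk index along dimension i, counted from the array origin.
        const uint64_t idx =
            static_cast<uint64_t>((pos[i] - dims[i].start) / dims[i].chunkInterval);
        no = (no * 1013) ^ idx;
    }
    return no;
}

// Count how many chunks of a dense square 2-d array land on each instance.
// ASSUMPTION: instance = chunkNumber % instanceCount. This mapping is a guess
// made for illustration and is not confirmed by the code quoted above.
std::vector<std::size_t> tally2D(int64_t chunksPerSide, int64_t chunkInterval,
                                 std::size_t instanceCount)
{
    const std::vector<Dim> dims = {{0, chunkInterval}, {0, chunkInterval}};
    std::vector<std::size_t> counts(instanceCount, 0);
    for (int64_t r = 0; r < chunksPerSide; ++r) {
        for (int64_t c = 0; c < chunksPerSide; ++c) {
            const std::vector<int64_t> pos = {r * chunkInterval, c * chunkInterval};
            counts[chunkNumber(pos, dims) % instanceCount]++;
        }
    }
    return counts;
}
```

For a 1-d array the loop runs once, so the chunk number is just the chunk index. For a 2-d array, calling tally2D(8, 1000, 2) splits the 64 chunks 32/32, because 1013 is odd and the low bit of the hash is then the XOR of the low bits of the row and column chunk indices (a checkerboard). Whether a real cluster places chunks this way depends on the modulo assumption above, so the query from the linked post is still the way to check actual placement.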


#3

Alex:

Yeah, I meant to say “even numbered with a dense, 1D array” to keep it conceptually simple. Anyway, this is exactly what I needed, many thanks!

Cheers,
Tim