How to set an appropriate CHUNK value when "create array"?


#1

How do I set an appropriate CHUNK value when creating a 1-D array that has 31 million records?

Why did SciDB invent the concept of a CHUNK?

Does one cell correspond to one record in an RDBMS?

Thanks very much!


#2

Hello,

Yes, one Array cell corresponds roughly to one SQL tuple. One key SciDB difference is that we separate attributes from dimensions and treat them very differently. Lookup by dimensions, filtering by dimensions, and joins on dimensions are optimized and should perform quite well.
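To make the attribute/dimension distinction concrete, here is a toy Python model (mine, not SciDB code): looking up by a dimension is direct indexing into the array's coordinate space, while filtering by an attribute means scanning tuples, as it would in an unindexed RDBMS table.

```python
# Toy model of the difference between dimensions and attributes.
# RDBMS-style: a bag of tuples; every lookup is a scan unless indexed.
rows = [{"id": i, "val": i * 0.5} for i in range(10)]

# SciDB-style: the dimension (i) locates the cell directly;
# the attribute (val) is the payload stored at that coordinate.
cells = {i: i * 0.5 for i in range(10)}

# Lookup by dimension: direct indexing, no scan.
val_at_7 = cells[7]  # 3.5

# Filter by attribute: must examine every tuple.
matches = [r for r in rows if r["val"] == 3.5]
```

This is why queries that select or join on dimensions can be answered without touching most of the data, while attribute filters generally cannot.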

The chunk serves as an important unit of processing. Namely:

  • the chunk is a unit of storage; we write one chunk at a time and read one chunk at a time
  • the chunk is a unit of network transfer; we send data between nodes one chunk at a time
  • the chunk is often a unit of processing; many operators process one chunk of data at a time

Also observe that, when the array has n dimensions, chunking ensures that cells that are close to each other in n-dimensional space are likely to be stored in the same chunk, and therefore close to each other on disk.
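A small sketch of that locality property (illustrative only, not SciDB internals): with chunk intervals (cx, cy), a cell's chunk coordinate falls out of integer division, so neighboring cells land in the same chunk.

```python
# Map a 2-D cell coordinate to its chunk coordinate (toy model).
# cx, cy are the chunk intervals along each dimension.
def chunk_of(x, y, cx=1000, cy=1000):
    return (x // cx, y // cy)

# Two neighboring cells share a chunk...
same = chunk_of(500, 500) == chunk_of(501, 500)
# ...while a distant cell along x lands in a different one.
far = chunk_of(5500, 500)
```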

What’s a good chunk size for your 1-D array with 31 million cells?
We need to know two things:

  • how much data is in each cell (what are the attributes)?
  • how sparse is the array? are there empty cells, and how often do they occur?

Given these two pieces of information, you should set a chunk size such that a chunk is several megabytes - around 4-8 MB - small enough to fit in the CPU cache, big enough to amortize disk reads and network transfers.
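As a back-of-the-envelope illustration (the helper, the 8-byte attribute, and the AQL line in the comment are my assumptions, not values from this thread), here is how those two inputs turn into a chunk interval for the 31-million-cell 1-D array:

```python
# Pick a chunk interval so the expected chunk is ~target_bytes on disk.
def chunk_length(bytes_per_cell, density=1.0, target_bytes=8 * 2**20):
    """Cells per chunk for an expected chunk size of ~target_bytes.

    density: fraction of cells actually occupied (1.0 = fully dense);
    a sparser array needs a larger interval to reach the same byte size.
    """
    return int(target_bytes / (bytes_per_cell * density))

n_cells = 31_000_000
# Example: one double attribute (8 bytes per cell), fully dense.
length = chunk_length(bytes_per_cell=8)   # about a million cells per chunk
n_chunks = -(-n_cells // length)          # ceiling division

# The resulting schema would look roughly like (check the syntax
# against your SciDB version; this is an assumed example):
#   CREATE ARRAY A <val:double> [i=0:30999999,1048576,0]
```

If the array were only 10% dense, the same target would call for a ten-times-larger interval, which is why the sparsity question matters as much as the cell width.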

In the future we are considering some improvements including:

  • automatic chunk size (system does it for you)
  • separate units for processing versus storage versus network

#3

Thank you apoliakov! With your help, I can try it now. When will automatic chunk sizing appear in SciDB?
