Estimating storage space and uploading multi-TB datasets



I would like to experiment with SciDB to subset and analyze multi-terabyte gridded satellite data (three dimensions: x, y, t), and would welcome suggestions/views on:

  • Is there a rule of thumb for estimating the required storage space on disk? For instance, one of my arrays would be of size 1440 × 400 × 43800, with one attribute of type double per cell.
  • What is the best way of uploading the data: one huge CSV file followed by a single redimension_store(), or smaller CSV files insert()-ed into an existing array?
  • Expected performance for subsetting, and the recommended number of nodes.




For data that is dense and does not contain many repeated values, you can count on 8 bytes
per cell for a single attribute of type double. In your case, this works out to
1440 × 400 × 43800 × 8 = 201,830,400,000 bytes, or roughly 188 GiB. Remember that the
coordinate system comes “for free,” so it does not drastically add to your storage requirements.

If the data is sparse, plan for an additional 8 to 16 bytes of overhead per occupied cell.
That is, even if you have multiple attributes, the overhead for keeping track of sparse data
is paid per cell, not per attribute.

Additionally, if there are repeated values, SciDB’s RLE encoding can reduce the overall storage
requirements of the data.
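The two rules of thumb above can be sketched as a small back-of-envelope estimator. This is just arithmetic, not a SciDB API; the function names and the choice of 16 bytes for the sparse overhead are illustrative:

```python
def dense_bytes(dims, attr_bytes=8):
    """Dense rule of thumb: total cell count times bytes per attribute value."""
    cells = 1
    for d in dims:
        cells *= d
    return cells * attr_bytes

def sparse_bytes(occupied_cells, attr_bytes=8, overhead_per_cell=16):
    """Sparse rule of thumb: the 8-16 byte bookkeeping overhead is paid
    once per occupied cell, regardless of the number of attributes."""
    return occupied_cells * (attr_bytes + overhead_per_cell)

size = dense_bytes([1440, 400, 43800])
print(f"{size} bytes = {size / 2**30:.0f} GiB")  # 201830400000 bytes = 188 GiB
```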

For uploading, I would recommend using a few separate inserts, loading batches of 10–100 GB each. This cuts the large job into pieces of manageable size should anything go wrong.
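As a minimal sketch of the batched approach, one way to prepare the batches is to cut a single huge CSV into pieces of bounded size, each of which can then be loaded and insert()-ed on its own. The splitting below is ordinary file handling, not a SciDB API; the file names and byte budget are illustrative:

```python
def split_csv(path, max_bytes, out_prefix="batch"):
    """Split a CSV file into pieces of at most max_bytes each (cutting only
    on line boundaries), so each piece can be loaded and inserted separately."""
    parts = []
    part, size, out = 0, 0, None
    with open(path) as f:
        for line in f:
            # Start a new output file when the current one would overflow.
            if out is None or size + len(line) > max_bytes:
                if out:
                    out.close()
                part += 1
                name = f"{out_prefix}_{part}.csv"
                parts.append(name)
                out = open(name, "w")
                size = 0
            out.write(line)
            size += len(line)
    if out:
        out.close()
    return parts
```

Should one batch fail to load, only that file needs to be regenerated and re-inserted rather than the whole multi-terabyte job.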

As for subsetting and number of nodes: the more the merrier. As an example, you could expect
fairly good performance from a cluster similar to this:

4 GPX XS8-2460-4GPU servers, each with:

  • 16 CPU cores
  • 4 × 1 TB disks
  • 128 GB of RAM

This is one of the in-house clusters we use for problems of similar size. This hardware cost about $20,000 when purchased in 2011.