For dense data that does not contain many repeated values, you can count on 8 bytes
per cell, since each cell holds one double. In your case, this works out to 201,830,400,000 bytes,
or roughly 188 gigabytes. Remember that the coordinate system comes “for free,” so it
doesn’t drastically impact your storage requirements.
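As a back-of-the-envelope sketch, here is that dense estimate as a calculation. The cell count below is derived from the 201,830,400,000-byte figure above; substitute your array's actual dimensions:

```python
# Dense storage estimate: 8 bytes per cell, one double per cell.
BYTES_PER_DOUBLE = 8

# Cell count back-computed from the total above; replace with the
# product of your array's dimension lengths.
n_cells = 201_830_400_000 // BYTES_PER_DOUBLE  # 25,228,800,000 cells
dense_bytes = n_cells * BYTES_PER_DOUBLE
print(f"{dense_bytes} bytes = {dense_bytes / 2**30:.0f} GiB")
```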
If the data is sparse, you would need to plan for an additional 8 to 16 bytes per non-empty cell.
Note that even if you have multiple attributes, the overhead for keeping track of sparse data
is per cell, not per attribute.
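Extending the same estimate to the sparse case, assuming every cell in the 25.2-billion-cell array above is non-empty (in practice a sparse array would have far fewer):

```python
# Sparse storage adds 8-16 bytes of coordinate bookkeeping per non-empty
# cell, independent of how many attributes the array carries.
BYTES_PER_DOUBLE = 8
OVERHEAD_LO, OVERHEAD_HI = 8, 16

n_cells = 25_228_800_000  # non-empty cells; from the dense estimate above
lo = n_cells * (BYTES_PER_DOUBLE + OVERHEAD_LO)
hi = n_cells * (BYTES_PER_DOUBLE + OVERHEAD_HI)
print(f"sparse estimate: {lo / 2**30:.0f}-{hi / 2**30:.0f} GiB")
```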
Additionally, if there are repeated values, SciDB’s RLE encoding can reduce the overall storage
requirements of the data.
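SciDB's actual chunk encoding is more involved, but a toy run-length encoder illustrates why repeated values shrink so well:

```python
from itertools import groupby

def rle(values):
    # Collapse consecutive runs of equal values into (value, count) pairs.
    return [(v, len(list(g))) for v, g in groupby(values)]

# 2,500 doubles (20,000 bytes raw) collapse to three (value, count) pairs.
data = [0.0] * 1000 + [1.5] * 500 + [0.0] * 1000
print(rle(data))  # [(0.0, 1000), (1.5, 500), (0.0, 1000)]
```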
For uploading, I would recommend using several separate inserts, in batches of 10-100 GB each. This cuts the large job into pieces of manageable size in case anything goes wrong.
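A small helper for planning those batches; the 50 GiB batch size is just one choice from the 10-100 GB range suggested above:

```python
import math

def plan_batches(total_gib, batch_gib):
    # Split a load of total_gib into batches of at most batch_gib each;
    # the last batch picks up the remainder.
    n = math.ceil(total_gib / batch_gib)
    return [min(batch_gib, total_gib - i * batch_gib) for i in range(n)]

print(plan_batches(188, 50))  # [50, 50, 50, 38]
```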
As for subsetting and number of nodes: the more the merrier. As an example, you could expect
fairly good performance from a cluster similar to this:
4 GPX XS8-2460-4GPU servers, each having:
- 16 CPU cores
- 4 1-terabyte disks
This is one of the in-house clusters we use for problems of similar size. This hardware cost about $20,000 when purchased in 2011.