Hello Adrian,

I did some math with your numbers. It looks like you have ~8 billion objects at between 10 and 20 bytes per object, for a total of roughly 100-200 GB of data. Does that seem correct?
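For reference, here is the raw arithmetic behind that estimate as a quick sketch. The object count and per-object sizes are the approximate figures quoted above; the use of decimal gigabytes is an assumption, and storage overhead is not included, so the result only needs to agree with the estimate in order of magnitude.

```python
# Back-of-the-envelope size estimate from the approximate figures above.
num_objects = 8_000_000_000       # ~8 billion objects
bytes_low, bytes_high = 10, 20    # approximate per-object size range, in bytes

GB = 10**9  # decimal gigabytes (assumption)

low_gb = num_objects * bytes_low / GB
high_gb = num_objects * bytes_high / GB
print(f"{low_gb:.0f}-{high_gb:.0f} GB of raw data, before any storage overhead")
```

The raw payload comes to 80-160 GB, which is the same ballpark as the 100-200 GB figure once some overhead is allowed for.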

I’ve recently conducted some performance measurements using about 50 GB of data, spread over 4 machines, using the latest SciDB (not out yet, but very soon). The particular dataset I was looking at had 4 dimensions. I tested dimension-based filtering for queries like “give me a sort of everything where dimension 1 is between A and B and dimension 2 is between C and D and …” and found that SciDB performs quite well at such tasks. We do not use R-trees, but because data is chunked/clustered in multiple dimensions, we can usually perform better than a standard RDBMS at such multi-dimensional filtering.
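To illustrate why multi-dimensional chunking helps with this kind of filter, here is a minimal Python sketch. It is not SciDB code or SciDB's actual storage layout; `CHUNK`, `build_chunks`, and `box_query` are hypothetical names made up for the example, and integer coordinates are assumed. The point is only that a box query needs to visit the chunks that overlap the box and can skip the rest.

```python
from collections import defaultdict
from itertools import product

CHUNK = 100  # hypothetical chunk edge length, the same in every dimension

def build_chunks(points):
    """Group integer-coordinate points by the chunk that contains them."""
    chunks = defaultdict(list)
    for p in points:
        chunks[tuple(c // CHUNK for c in p)].append(p)
    return chunks

def box_query(chunks, lo, hi):
    """Return points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    # Enumerate only the chunk coordinates that overlap the query box.
    ranges = [range(l // CHUNK, h // CHUNK + 1) for l, h in zip(lo, hi)]
    hits = []
    for key in product(*ranges):
        for p in chunks.get(key, ()):  # chunks with no data are skipped
            if all(l <= c <= h for l, c, h in zip(lo, p, hi)):
                hits.append(p)
    return hits

# Four points in a 4-dimensional space (made-up sample data).
points = [(5, 5, 5, 5), (150, 20, 30, 40), (120, 110, 90, 10), (999, 999, 999, 999)]
chunks = build_chunks(points)
result = box_query(chunks, lo=(100, 0, 0, 0), hi=(199, 199, 99, 99))
print(sorted(result))
```

In this toy case the query touches only two chunk positions, and the points outside the box are never examined.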

Your performance will depend on many factors, such as:

- your hardware: node specs, processor speed, storage type, network, etc.
- how many nodes you have at your disposal
- the sparsity of the data

I also found that picking the right chunk size can affect performance by as much as 30%; this depends on the characteristics of the data and requires some tuning.

That covers “give me everything inside a box” queries. As for “nearest neighbors”, that’s more interesting. It should be easily doable, but I don’t know if the capability is fully built yet. I will need to spend some time investigating.
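In the meantime, one way a nearest-neighbor lookup could be layered on top of plain box queries is sketched below in Python. This is an assumption about a possible client-side approach, not a description of anything built into SciDB. The `box_query` function here is a brute-force stand-in for a database box filter, and `nearest` is a hypothetical helper; Euclidean distance is assumed.

```python
import math

def box_query(points, center, r):
    """Stand-in for a box filter: points within r of center in every dimension."""
    return [p for p in points if all(abs(a - b) <= r for a, b in zip(p, center))]

def nearest(points, center, r=1.0):
    """Find the nearest neighbor by growing a box until the best hit is provably closest."""
    if r <= 0:
        raise ValueError("initial half-width r must be positive")
    if not points:
        return None
    while True:
        hits = box_query(points, center, r)
        if not hits:
            r *= 2  # nothing found yet, so widen the box
            continue
        best = min(hits, key=lambda p: math.dist(p, center))
        d = math.dist(best, center)
        if d <= r:
            # Every point within distance d lies inside the box, so best is the answer.
            return best
        # best sits near a box corner; a closer point may lie just outside the box.
        r = d

# Made-up sample data: the first box finds only the corner point.
sample = [(0.9, 0.9), (1.1, 0.0)]
print(nearest(sample, (0.0, 0.0), r=1.0))
```

The second pass matters: a hit near a corner of the box can be farther away than a point just outside one of its faces, so the box is widened to the best distance found before the result is accepted.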

If you do collect some data, it would be very interesting to look at.

Thanks!

-Alex Poliakov