I need to run statistical analyses (hypothesis tests such as ANOVA and Student’s t-test, least-squares fits, medians, data mining, clustering…) on a very large quantity of distributed data (>100 TB, maybe columnar or key/value, mostly numerical).

I would rather not write all those functions again. I’d prefer to use something already implemented (out of the box) in a “big data” database, or in software that can connect to that database and apply its own functions or packages to the data.

Can I do it with SciDB?
I guess not directly.

I think I would need to use it alongside R (or Python, or some other analysis software).
But I guess that even then I won’t be able to feed every R function or package with data from SciDB, will I?

I could convert subsets of the data to R objects, but if you try to do that with a large column you end up with an out-of-memory error.
I could also try to create a “stream of data”, or extract chunks of data and pass them to R, but that would mean spending a long time figuring out how to build it from basic functions, and it would be error-prone.
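
To make the chunk idea concrete, here is a minimal sketch in plain Python of what processing chunk by chunk could look like for one of the tests above (one-way ANOVA). It is only an illustration under assumptions: the names RunningStats and one_way_anova_f are made up for this example, each chunk is assumed to be a small list of numbers already fetched from the database, and nothing here is SciDB-specific. It uses Welford's update and the pairwise merge of Chan et al., so only a count, a mean and a sum of squared deviations are kept per group.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Count, mean and sum of squared deviations (m2) of the values seen so far."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, chunk):
        """Fold one chunk of numbers into the summary (Welford's update)."""
        for x in chunk:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
        return self

    def merge(self, other):
        """Combine two summaries built on disjoint chunks (Chan et al. formula)."""
        if other.n == 0:
            return RunningStats(self.n, self.mean, self.m2)
        if self.n == 0:
            return RunningStats(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self):
        """Sample variance (n - 1 in the denominator)."""
        return self.m2 / (self.n - 1)


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA computed from per-group summaries only."""
    total_n = sum(g.n for g in groups)
    grand_mean = sum(g.n * g.mean for g in groups) / total_n
    ss_between = sum(g.n * (g.mean - grand_mean) ** 2 for g in groups)
    ss_within = sum(g.m2 for g in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy data: group "a" arrives in two chunks that are summarized separately
# (as two workers might do) and then merged.
a = RunningStats().update([1.0, 2.0]).merge(RunningStats().update([3.0]))
b = RunningStats().update([2.0, 3.0, 4.0])
c = RunningStats().update([5.0, 6.0, 7.0])
f_statistic = one_way_anova_f([a, b, c])
```

The p-value would still have to come from an F distribution with (number of groups - 1, total count - number of groups) degrees of freedom, for example with scipy.stats.f.sf if SciPy is available. Least-squares fits can be accumulated in a similar way, but an exact median cannot be merged from per-chunk summaries like this, so it needs either an approximate quantile method or support inside the database.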

I would also appreciate any advice on other databases or software able to do this.
I would also like to see benchmarks against HyperDex, Aerospike, Cassandra or Couchbase.


Look at the SciDB Python package.
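
As a rough sketch of how that could look, assuming the scidb-py package (imported as scidbpy) and a SciDB server reachable through Shim: the idea is to let SciDB compute the aggregate and fetch only the small result, instead of pulling a whole column into memory. The helper name summarize_attribute, the array name "measurements", the attribute name "value" and the URL are all hypothetical placeholders, and the exact AFL syntax should be checked against the SciDB version in use.

```python
def summarize_attribute(db, array_name, attribute):
    """Ask SciDB to aggregate one attribute and return only the small result.

    `db` is expected to be a scidb-py connection object; the AFL `aggregate`
    operator runs inside the database, so the raw column is never downloaded.
    """
    query = (
        f"aggregate({array_name}, "
        f"count({attribute}) as n, "
        f"avg({attribute}) as mean, "
        f"var({attribute}) as variance)"
    )
    return db.iquery(query, fetch=True)


# Hypothetical usage (needs scidb-py installed and a running SciDB/Shim server;
# the URL, array name and attribute name are placeholders):
#
#     from scidbpy import connect
#     db = connect("http://localhost:8080")
#     print(summarize_attribute(db, "measurements", "value"))
```

This only covers what the database can aggregate by itself; anything beyond the built-in aggregates would still need chunked processing or an extension on the SciDB side.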