My colleagues and I have been experimenting with what we call “analysis as a service.” Our customers upload their data to a server for analysis, and they can access the resulting model using a web browser. We’ve been using SciDB to store both the original data and the greatly reduced analytic results. Put another way, our SciDB instance contains many small arrays, a few medium ones, and a handful of large ones, all with varying schemas. As I reported earlier this year*, we began seeing performance issues once an instance held 5,000 arrays. I got some good advice on consolidating many small arrays into a few larger ones, but that only postponed the inevitable: we’re now beyond 20,000 arrays in our SciDB instance, and performance is becoming a problem again, even with consolidation. As an example, we ran a small scaling study on a two-worker SciDB instance, measuring array creation, write, and read times as we grew the number of arrays in the instance:
… note the linear growth in array creation and write times as the number of arrays grows. So my question is: is this likely to change anytime soon? Clearly our data is nowhere near “the [largest] on planet earth,” but it’s disappointing to have a product whose data model is a perfect fit for our needs and not be able to use it for this case. And if a many-small-arrays workload just isn’t a good fit for SciDB, are there any suggestions for what could handle this use case? Relational databases are still a lousy fit for this kind of array data, as is well known.
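For reference, here’s a minimal sketch of the kind of timing harness behind the numbers above. It drives SciDB through the iquery command-line client; the array names, the toy one-dimensional schema, and the batch size are placeholders rather than our production setup:

```python
import subprocess
import time

def afl(query, fetch=False):
    """Run one AFL statement via iquery; -n skips fetching results."""
    flags = "-aq" if fetch else "-anq"
    subprocess.run(["iquery", flags, query], check=True, capture_output=True)

def timed(fn, *args, **kwargs):
    """Return the wall-clock seconds taken by fn(*args, **kwargs)."""
    start = time.monotonic()
    fn(*args, **kwargs)
    return time.monotonic() - start

BATCH = 1000  # arrays added per round; placeholder value
for k in range(BATCH):
    name = f"bench_{k}"
    # Create an empty 1-D array, fill it, then read it back.
    t_create = timed(afl, f"create array {name} <val:double>[i=0:999,1000,0]")
    t_write  = timed(afl, f"store(build({name}, double(i)), {name})")
    t_read   = timed(afl, f"aggregate(scan({name}), count(*))", fetch=True)
    print(f"{k}\t{t_create:.3f}\t{t_write:.3f}\t{t_read:.3f}")
```

Running rounds like this back to back grows the catalog, so each successive round measures the same create/write/read operations against an ever larger number of arrays; it’s the per-operation times from those rounds that grow linearly in the plot.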