Practical limit on array count, redux


#1

Gang:

My colleagues and I have been experimenting with what we call “analysis as a service.” Our customers upload their data to a server for analysis, and they can access the resulting model using a web browser. We’ve been using SciDB to store both the original data and the greatly-reduced analytic results. Put another way, our SciDB instance contains many small arrays, a few medium ones, and a handful of large ones, all with varying schemas. As I reported earlier this year*, we began seeing performance issues with an instance containing 5000 arrays. I got some good advice on consolidating many small arrays into a few larger ones, but this only postponed the inevitable: we’re now beyond 20000 arrays in our SciDB instance, and performance is becoming a problem again, even with array consolidation. As an example, we did a little scaling study with a two-worker SciDB instance, checking array creation, write, and read times as we grew the number of arrays in the instance:

… note the linear growth in times for array creation and writes as the number of arrays grows. So my question is: is this likely to change anytime soon? Clearly our data is nowhere near “the [largest] on planet earth”, but it’s disappointing to have a product with the perfect data model for our needs and not be able to use it for this case. And if many small arrays just isn’t a good fit for SciDB, are there any suggestions for what could handle this use case? Relational databases are still a lousy fit, as is well known.

Many thanks,
Tim


#2

Hello, Tim!

Quite remarkable. Thank you for going to these lengths and posting results. I’d say you’ve got the largest total number of arrays in any SciDB installation that I know of.

Question: when you say “create”, does that mean simply issuing a "create array … " query?
It looks like this is all Postgres catalog overhead.

And I dug around in meta.sql and found a possible cause of the overhead:

create trigger check_array_repeated_attr_dim_names
after insert or update on array_dimension
for each row execute procedure check_no_array_dupes();

create trigger check_array_repeated_dim_attr_names
after insert or update on array_attribute
for each row execute procedure check_no_array_dupes();

That sure would appear to make things linear. The triggers are there to make sure an array’s attribute names don’t collide with its dimension names, and vice versa. This will need a delicate and well-thought-out fix. But an adventurous user could try

drop trigger check_array_repeated_attr_dim_names on array_dimension cascade;
drop trigger check_array_repeated_dim_attr_names on array_attribute cascade;

… as a first-order sandbox hack. If you find this gets rid of the problem we can think about what the fix would look like…
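
For context, here is a purely hypothetical sketch of the kind of per-row check that would explain the linearity. This is not the actual check_no_array_dupes() from meta.sql, and the array_id and name columns are assumptions on my part:

-- Hypothetical illustration only, not the shipped meta.sql code; the
-- column names array_id and name are assumed.  A per-row trigger that
-- rescans the whole catalog on every insert does work proportional to the
-- total number of arrays, which would produce exactly the linear growth
-- reported above.
create or replace function check_no_array_dupes_naive() returns trigger as $$
begin
    if exists (select 1
               from array_attribute aa
               join array_dimension ad on ad.array_id = aa.array_id
               where ad.name = aa.name) then
        raise exception 'attribute and dimension names must not collide';
    end if;
    return new;
end;
$$ language plpgsql;

-- A constant-time variant would restrict the check to the array actually
-- being modified, e.g. "where aa.array_id = NEW.array_id and
-- ad.array_id = NEW.array_id and aa.name = ad.name".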


#3

A less aggressive tactic would be to log into Postgres and simply say:

analyze;

See: postgresql.org/docs/9.1/stat … alyze.html
Perhaps ol’ Postgres is intelligent enough to handle this sort of thing automatically.
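
If a database-wide ANALYZE feels too heavy, a narrower pass over just the two catalog tables named in the triggers above might look like:

-- Refresh planner statistics only for the catalog tables the triggers touch.
analyze array_attribute;
analyze array_dimension;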

I’m very curious to see what you find.


#4

Another question, Tim: what is the maximum number of arrays in use in any one AFL query in your workload?


#5

Gang:

Sorry for the slow response; it took a while to get around to re-running our tests.

Alex: running ANALYZE did nothing to change the performance, but dropping the triggers you suggested gave us nice constant performance, as you predicted.
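
For anyone else trying the trigger drop, a quick sanity check against plain Postgres that the triggers are really gone (this assumes the two catalog tables are visible on the search path):

-- Should return no rows once the two triggers have been dropped.
select tgname, tgrelid::regclass as table_name
from pg_trigger
where not tgisinternal
  and tgrelid in ('array_attribute'::regclass, 'array_dimension'::regclass);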

Dongui: we’re typically accessing just a single array in our queries, unless you count redimension_store as two arrays.

Cheers,
Tim


#6

Hmmmm …

  1. The triggers are generally important, as they are all that stands between the innocent user and an error in the CREATE ARRAY statement that will lead to all kinds of problems down the line.

  2. Clearly, Postgres is not doing the right thing with that trigger. It’s looking at every row to evaluate the check, rather than probing only the rows that could actually conflict. We might need to create an index somewhere to improve the trigger processing time; a sketch of one possibility follows below.
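
As a sketch of that idea (these indexes are not part of the shipped meta.sql, and the array_id and name columns are assumptions), something like the following could let the trigger probe by array instead of scanning:

-- Hypothetical supporting indexes; names and columns are assumptions.
create index array_attribute_array_id_name_idx on array_attribute (array_id, name);
create index array_dimension_array_id_name_idx on array_dimension (array_id, name);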

For reference, everyone, we’ve created a ticket to work this issue.