Practical limit on array count?


#1

Gang:

I’m running an experiment that involves creating and loading a relatively large number (5000) of relatively small timeseries into SciDB, and I’m noticing a couple of things - first, ingestion seems to be getting slower as I go; second, Postgres (specifically, the postmaster process) is doing more work than the rest of the system. Are there practical limits on the number of arrays in a SciDB instance, and am I pushing the envelope with just 5000?

Cheers,
Tim


#2

Tim -

No. There’s no upper limit. But here’s what I suspect is the problem. When you create a lot of arrays, each with (say) a lot of attributes, you end up adding a helluva lot of rows (a lot of arrays times a lot of attributes) to the PostgreSQL catalogs.

Now - we’re using PostgreSQL at the moment because it’s a stable, reliable catalog manager. But it’s not the long-term solution, so we haven’t spent a lot of time worrying about or tuning our use of it. That said, if you’re seeing increased Postgres CPU time for operations like adding arrays, then I suspect that Postgres’ internal statistics are out of whack: it thinks it’s got fewer rows in its tables than it actually has, so it’s scanning rather than using an index. To fix this, run the Postgres “ANALYZE” operation on the Postgres database. That will update Postgres’ row counts for its tables, and it might improve the performance of the Postgres SQL we embed within our Catalog Manager.
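
If you want to try that by hand, here’s a minimal sketch - connect with psql to whatever database your SciDB installation uses as its catalog (the name comes from your config) and run a plain ANALYZE, which covers every table:

    -- from psql, connected to the SciDB catalog database:
    ANALYZE;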

But there’s one more thing. Why are you creating 5,000 separate timeseries arrays? SciDB is perfectly capable of handling arrays with more than two dimensions; indeed, it’s designed for them. As a thought, why not create a single 3D array containing your 5,000 2D timeseries data sets? That would reduce the Postgres dependency to next to nothing, and it certainly won’t slow down your queries.
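
Something along these lines, say - the array, dimension, and attribute names, chunk sizes, and the extra “series” dimension are purely illustrative:

    -- one 3D array holding every timeseries; each value of 'series' is one 2D data set
    CREATE ARRAY timeseries_cube
    <val:double null>
    [series=0:4999,1,0, channel=0:*,10,0, t=0:*,100000,0];

    -- pull a single 2D timeseries back out
    slice(timeseries_cube, series, 42);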

Care to post some schema decls for us to take a look at?

Paul


#3

Paul:

Many thanks for the quick response; it was pretty much what I was expecting to hear … I’ll try running ANALYZE and see what happens.

To your broader question: I’ve been exploring using SciDB as the backend for a general-purpose analysis tool, which brings its own set of challenges. In this case, each timeseries is a set of waveforms that share a common set of timestamps. However, not every timeseries has the same set of waveforms, and I don’t know how many to expect in advance. Although my examples are tiny at the moment, the real data is large enough that I don’t want to wait for all of it to arrive before starting ingestion. Thus, lots of timeseries.

FWIW, I’ve been representing each timeseries using a 1D SciDB array:
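
Roughly like this (the names are placeholders; the actual set of waveform attributes varies from series to series):

    -- one 1D load array per timeseries: a shared timestamp plus that series' waveforms
    CREATE ARRAY ts_raw_0001
    <ts:double, wf_a:double null, wf_b:double null>
    [i=0:*,500000,0];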

… then redimensioning to suit - for example, to rebin the data:
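
Again only a sketch - the bin width, target schema, and aggregates below are placeholders:

    -- compute a bin number from the timestamp, then redimension with aggregates to rebin
    CREATE ARRAY ts_binned_0001
    <wf_a:double null, wf_b:double null>
    [bin=0:*,500000,0];

    store(
      redimension(
        apply(ts_raw_0001, bin, int64(ts / 0.001)),   -- hypothetical 1 ms bins
        ts_binned_0001,
        avg(wf_a) as wf_a, avg(wf_b) as wf_b),
      ts_binned_0001);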

All this because we’re exploring what it means to wean our experimenters off filesystems :wink:

Cheers,
Tim


#4

So, I did a quick test before and after running ANALYZE, and got effectively identical results - it takes a bit over 5 seconds to count the 10,000-odd (I’m always pushing that envelope) arrays in SciDB:

    AFL% count(list('arrays'));
    i,count
    0,10694
    Query execution time: 5090ms

FWIW, the good news is that the slowdown I’m seeing doesn’t seem to affect any “real” queries on the arrays themselves - it only affects DDL-type commands, e.g. array creation, list('arrays'), and the like.

Cheers,
Tim