Advice on 575 GB of timeseries


#1

Gang:

I have a set of 45220 timeseries which are outputs from 2660 numerical simulations. On average, there are 1.5M samples in each timeseries, but the counts vary - some simulations run longer than others. Because the solvers are constantly adjusting the output timesteps, the samples aren’t uniform, and no two simulations share the same set of sample times. The goal is to get everything into SciDB where we can rebin the timeseries so their samples do match, then do the rest of our analysis. Because of the sample-mismatch problem, I’ve been ingesting each simulation’s output as a dense 1D array with the timestamps and values as attributes and rebinning them individually … but I don’t think this approach plays to SciDB’s strengths.

I’d love to get some ideas from the gurus on how to best organize this data.

Thanks in advance,
Tim

Sample simulation array schema:
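
(Roughly like this; the array name, the number of value attributes, and the chunk size are placeholders rather than the real ones.)

CREATE ARRAY sim_0001
  <time:double, v0:double, v1:double, v2:double>
  [i=0:*,1000000,0];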

Sample rebinning query for one timeseries from one simulation:
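
(Again just a sketch; the bin width of 0.001 and the names are placeholders. The idea is to derive a bin index from each timestamp with apply, then redimension with an aggregate.)

store(
  redimension(
    apply(sim_0001, bin, int64(floor(time / 0.001))),
    <v0_avg:double null> [bin=0:*,1000000,0],
    avg(v0) as v0_avg
  ),
  sim_0001_v0_rebinned
);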

Wash, rinse, repeat …


#2

Hi Tim,

Let me try, from my distant view, to identify the problems with the current approach. First, it seems that it creates 2660 (or 45220, not sure which?) distinct arrays. That would, no doubt, put some strain on the catalog. Second, and perhaps more bothersome, the timeseries average 1.5 million records. With 1D arrays chunked at one million, most of the data lands on instances 0 and 1, so you’re not getting any parallelism… Third, of course, SciDB is built to handle a few long queries, not many small ones.

Would it be possible to organize the data like this?

simulation_id, time, v0, v1, v2 ...
1, t1, v0_1, v1_1, v2_1,...
1, t2, v0_2, v1_2, v2_2,...
...
1, tn, v0_n, v1_n, v2_n,...
2, t1, v0_1, v1_1, v2_1,...
2, t2, v0_2, v1_2, v2_2,...
...

If we can do that, then we can load everything into a single, huge array and redimension it into two dimensions, [simulation_id, bin]. That amounts to one big redimension query and will surely take advantage of the parallelism.
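
Sketching it out (the array names, bin width, and chunk sizes below are placeholders; the shape of the query is what matters), the flat load array would look something like:

CREATE ARRAY combined_load
  <simulation_id:int64, time:double, v0:double, v1:double, v2:double>
  [i=0:*,1000000,0];

and the rebinning becomes one query that derives a bin index from each timestamp, promotes simulation_id and bin to dimensions, and aggregates each cell:

store(
  redimension(
    apply(combined_load, bin, int64(floor(time / 0.001))),
    <v0_avg:double null, v1_avg:double null, v2_avg:double null>
      [simulation_id=1:2660,1,0, bin=0:*,100000,0],
    avg(v0) as v0_avg, avg(v1) as v1_avg, avg(v2) as v2_avg
  ),
  rebinned
);

With a chunk length of 1 along simulation_id, each simulation’s bins can land on a different instance, which is where the parallelism comes from.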

I think I can pretty easily concoct a sed script or something that would merge your files into one big lump and append simulation_id as a column. Or perhaps we could compose a SciDB query to do it. But first, what do you think of the approach?


#3

Alex:

Correct, my first run mapped the data into 2660 arrays. I had hoped I might get some parallelism by running multiple queries simultaneously, but as you mention, with so few chunks per array the performance is disappointing. The schema you suggest makes sense to me, and I can definitely adjust my input scripts to organize the data in this fashion, but I’m less sure what the redimension is going to look like. Also, because I’ve had a lot of issues with running out of RAM during redimensioning, it would be really helpful to get your advice on how best to configure a SciDB cluster for this workload. The hardware I have on hand at the moment is a single machine with 32 cores, 78 GB of RAM, and 24 x 1 TB disks in a JBOD configuration. Any suggestions on cluster setup would be welcome, even if it’s “don’t use that hardware.”

Many thanks,
Tim