Timeseries data


#1

In looking over the current documentation, am I correct in concluding that SciDB does not have any native operators/libraries for handling timeseries? If it doesn't, what do other users use to perform these kinds of operations (e.g., the difference in milliseconds between two time points; the median/average value across a series; etc.)?


#2

Hi,

SciDB does support some basic timeseries-like computations.

To give you a watered-down example, I can organize my data into an array that looks like:

ts
<value1:double, value2:double>
[year=0:*,1,0, day=0:366,1,0, millisecond=0:86399999,86400000,0]

So this kind of array schema will partition the data into year (e.g. 2010), day-of-year (between 0 and 366) and millisecond-of-day. There are only 86.4M milliseconds in a day. Depending on how many events actually take place, and assuming you have event-rich periods only about 33% of the time, this would work OK. This will also ensure that the entire series for a whole day is stored on the same compute node, which makes window calculations quicker.
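
To make that concrete, creating and populating such an array would look roughly like the following. This is only a sketch: raw_events stands for a hypothetical one-dimensional staging array that already carries year, day and millisecond as int64 attributes alongside the two values, and redimension_store is the operator that pivots those attributes into dimensions (check the operator name against your release).

CREATE ARRAY ts
<value1:double, value2:double>
[year=0:*,1,0, day=0:366,1,0, millisecond=0:86399999,86400000,0];

redimension_store(raw_events, ts) -- pivot the year/day/millisecond attributes into the dimensions above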

Once you have such an array built, you can do aggregations like:

window(ts, 1,1,2000, avg(value1), stdev(value2)) -- a 2-second window moving average and standard dev computed over your data
aggregate(ts, sum(value1), day) -- the sum of value1 for every day
regrid(ts, 1,1,1000, avg(value2)) -- the average of value2, grouped into "second of day" buckets. The first value of the result contains the average for the first second of the day,... and so on.

You can also combine window() with merge, build and cross for some very powerful results. We do run queries like that for a few applications…
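
As one illustration (a rough sketch, where ts_day is a hypothetical single-day, one-dimensional slice with only the millisecond dimension), you could merge the real series with a zero-filled build() so that a moving average sees a value at every millisecond:

window(
  merge(
    project(ts_day, value1),                                       -- the real events, value1 only
    build(<value1:double> [millisecond=0:86399999,86400000,0], 0)  -- zeros for every empty millisecond
  ),
  2000, avg(value1))  -- 2-second moving average over the densified series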

Is this something akin to what you’re looking for or am I way off base?


#3

Based on that schema, there would be on the order of 2012 x 366 x 86.4M empty cells just to start with. Is there any performance impact from that many empty cells, in terms of disk usage, memory usage or query response times?

The same issue applies to the 86.4M cells per day: not all of them will be used. Is there an empty/full cell ratio that impacts performance?

Another possible schema would be:
<value1:double, value2:double>[year=0:*,1,0, month=1:12,1,0, day=1:31,1,0, hour=0:23,1,0, minute=0:59,1,0, second=0:59,1,0, millisecond=0:1000,1,0]

Generally speaking, what are the pros and cons of the two schemas in terms of efficiency and performance?

Mike


#4

Hi Mike,

You are asking a very astute question! In fact, we did a lot of work to make sure that we handle sparsity really well.

Given a chunk with M logical elements and N <= M non-empty elements, we linearize the N data elements into a buffer and RLE-compress it so that neighboring repeated values along the last dimension are not stored repeatedly. We also build an RLE-encoded bitmap that has a virtual “1” for every non-empty element and a “0” for every empty element, but the bitmap is stored and operated on in RLE-compressed form (see the toy illustration below). So the storage footprint is O(N): proportional to the number of non-empty elements, not to the number of logical cells. In other words, if you have two chunks:

  1. having 1 million non-empty elements and 1 million empty elements and
  2. having 1 million non-empty elements and 1 billion empty elements

The storage footprint, memory footprint and approximate performance for handling the two chunks are the same. All that matters is how many non-empty elements the chunk contains.
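
To illustrate the bitmap idea with a toy example (this is not the exact on-disk format), a chunk whose occupancy pattern is

  1111 0000000000000000 1111111111

is stored as run-length pairs

  (1 x 4), (0 x 16), (1 x 10)

so the bitmap's size grows with the number of runs of empty/non-empty cells, not with the number of logical cells.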

Some general caveats are:

  1. you cannot have more than 2^63 total possible logical elements in the chunk
  2. if your chunks are too small (fewer than about 10K physical elements), the ratio of overhead to data becomes inefficient, because every chunk carries a header of a few kilobytes. The schema you presented above suffers from chunks being too small: with a chunk interval of 1 in every dimension, each chunk only has space for a single value.
  3. It’s recommended that you structure your schema so that every chunk contains about 1M elements on average (unless you are using long strings). We understand that you will sometimes have skew, and some chunks will be denser than others, but the 1M average is a good guideline to follow (see the sketch after this list).
  4. The RLE encoding format is a new feature in 12.3, and since releasing it we have found that some operators (matrix multiply, window) will sometimes hiccup with very large logical chunks. This isn’t a fundamental problem, just a matter of rewriting some of the code. We’re knocking down those bugs as fast as we can and improving the performance of various components.
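
As a sketch of how point 3 could be applied to your dimension breakdown (the exact chunk intervals depend on your real event density, so treat these numbers as placeholders), you could keep a chunk interval of 1 on year, month and day but let one chunk span all of the finer dimensions, so that each chunk covers one calendar day:

CREATE ARRAY ts_by_day
<value1:double, value2:double>
[year=0:*,1,0, month=1:12,1,0, day=1:31,1,0, hour=0:23,24,0, minute=0:59,60,0, second=0:59,60,0, millisecond=0:999,1000,0];
-- note millisecond=0:999, since there are 1,000 milliseconds in a second;
-- each chunk then holds one day's events, and if a few percent of the day's
-- 86.4M milliseconds carry an event, that is on the order of 1M non-empty
-- elements per chunk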

Does this help?