Modeling 'time' in timeseries


#1

Hi,

I am trying to figure out the best way to define the dimensions of an array, and thought I would solicit the opinions of the forum members.

I need an array that will hold a single attribute “price” of type double. That can be thought of as the price of a security traded on an exchange, or the price of an item sold at a retail store.

The challenge is to define the array dimensions in such a way that queries do not “lift” chunks from disk that do not need to be included in the calculations.

Let’s assume:
(1) the optimal chunk size holds one hour’s worth of data;
(2) the total duration is 5 days (120 chunks in total);
(3) the array needs to be “indexable” by date and time with millisecond precision.

There are 3 queries that need to be performed in an efficient way:
(1) What is the average price on Day 3, between 2pm and 3pm?
(2) What is the average price between Day2@3pm and Day3@2pm?
(3) What is the average price between 2pm and 3pm, for each of Days 2, 3, 4 (group by day)?

With the most efficient schema definition and query optimization, the chunks “lifted” for each query would be:
(1) 1 chunk (1 hour’s worth of data)
(2) 23 chunks (23 hours)
(3) 3 chunks (3 hours)

One way we could define the array is with a single dimension, continuous time in milliseconds.
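
For example, something along these lines (just a sketch; the dimension name and bounds are illustrative, assuming 5 days starting at millisecond 0 and one-hour chunks):

create array prices1d <price:double> [ms=0:431999999,3600000,0]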

That makes queries (1) and (2) very easy, but what about the third?

(1) select avg(price) from between( array, 223200000, 226800000 )
(2) select avg(price) from between( array, 140400000, 223200000 )
(3) What is the third query here?

A different way to define the array would be with two dimensions: day and time of day.
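
For example (again just a sketch; ‘tod’ here stands for time of day in milliseconds since midnight, with one-hour chunks):

create array prices2d <price:double> [day=1:5,1,0, tod=0:86399999,3600000,0]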

That makes queries (1) and (3) easy, but what about the second?

(1) select avg(price) from between( array, 3, 50400000, 3, 54000000 )
(2) What is the second query?
(3) select avg(price) from between( array, 2, 50400000, 4, 54000000 ) group by day

I am sure other people have dealt with modeling time series data. I am just wondering whether there is an optimal array schema that handles timeseries queries without needing a “where” clause; but even if some filtering has to be done, what is the most efficient schema?

Thanks in advance for any suggestions and ideas.


#2

Hi Mike,

Initially I might go with 2D (as you had) and do query 2 as

select avg(price) from filter( between(array, 2, null, 3, null),  day = 2 and time > X or day = 3 and time < Y);

The rationale is that the inner between does most of the data elimination; you are left with a few chunks that you have to look at, and the final filter over them should not be so bad.
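
With the times from your example (Day 2 at 3pm through Day 3 at 2pm, with time-of-day in milliseconds from midnight), X would be 54000000 and Y would be 50400000, so something like this (untested; adjust to >= / <= depending on how you want to treat the endpoints):

select avg(price) from filter( between(array, 2, null, 3, null), day = 2 and time > 54000000 or day = 3 and time < 50400000 );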

Other options may include

  1. Custom UDO that does that kind of “between at the edges”. It wouldn’t be too hard to conceive or implement.

  2. You may opt to store two copies of the data and remember to maintain them. Databases often play this game, and often there are materialized views that are maintained for you automagically. SciDB is very likely to get there eventually.

  3. For other cases similar to 1D Query 3, there is another pattern where we build a “mask” array and then use something like join or cross_join between your array and the mask to get what we want. However it seems my initial suggestion will beat that solution in this case.

The one advantage of the 1D approach is that it’s better at doing various windows. Not sure if that’s part of your use case. Interested to talk more about this.


#3

Thanks Alex,

What I came up with was something like:
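
(dimension names and bounds here are just for illustration; ‘tod’ is milliseconds since midnight)

create array prices3d <price:double> [day=1:5,1,0, hour=0:23,1,0, tod=0:86399999,3600000,0]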

I do feel this is not a ‘smart’ solution, but I am not sure why! To visualize the chunks for a single day, you have a 24x24 matrix where only cells {1,1}, {2,2}, {3,3}, {4,4}, …, {24,24} will have values, and the same pattern carries over to all 5 days. All queries will be efficient in the sense that only relevant chunks will be read from disk for any type of query. I am not sure about disk utilization, though.

The more natural approach would be to treat time as a single dimension.
Can you give me more information, and possibly an example or two, on the 1D approach? How would I build and use the “mask” array in a situation similar to query (3) above?


#4

Mike,

That’s actually not a bad solution. It makes your series diagonal but allows you to do a quick lookup. It makes your array sparser, so you pay a little more of a penalty when it comes to storage size. Not a lot. Also, you can’t keep adding dimensions like this ad infinitum: the total chunk area or volume cannot, at the moment, exceed 2^63 logical cells. Also, your solution doesn’t naturally lend itself to a “group by day”, but you could probably easily use redimension aggregates to do that. Furthermore, it won’t work with variable windows because the data is laid out “diagonally”.
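
For the “group by day” part, a redimension aggregate would look roughly like this (untested sketch; I am assuming an array named prices with a day dimension and a price attribute, and computing the per-day average over the whole array; you would substitute your between()/filter() subset):

$ iquery -aq "redimension(prices, <avg_price:double null> [day=1:5,5,0], avg(price) as avg_price)"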

A bit about building masks. Suppose I have an array like this:

$ iquery -aq "create array foo <val:double> [x=1:100000000,1000000,0]"
$ iquery -anq "store(build(foo,x), foo)"

Suppose I want the values at coordinates, say, 1 to 5, then 2M+1 to 2M+5, then 4M+1 to 4M+5; in other words, give me the first 5 values of every other chunk. Well, I can build a series that looks like this:

$ iquery -aq "apply(cross(build(<a:int64> [i=0:4,5,0], i*2000000), build(<b:int64> [j=1:5,5,0], j)), x, a+b)" | head -n 15
i,j,a,b,x
0,1,0,1,1
0,2,0,2,2
0,3,0,3,3
0,4,0,4,4
0,5,0,5,5
1,1,2000000,1,2000001
1,2,2000000,2,2000002
1,3,2000000,3,2000003
1,4,2000000,4,2000004
1,5,2000000,5,2000005
2,1,4000000,1,4000001
2,2,4000000,2,4000002
2,3,4000000,3,4000003
2,4,4000000,4,4000004

I can then redimension along x, adding a dummy flag attribute. The result is a “mask” of sorts:

$ iquery -aq "redimension(apply(cross(build(<a:int64> [i=0:4,5,0], i*2000000), build(<b:int64> [j=1:5,5,0], j)), x, a+b, flag, true), <flag:bool> [x=1:100000000,1000000,0])" | head -n 15
x,flag
1,true
2,true
3,true
4,true
5,true
2000001,true
2000002,true
2000003,true
2000004,true
2000005,true
4000001,true
4000002,true
4000003,true
4000004,true

And finally I can join that to foo. Note this is all one big query; I just broke it up into pieces to make it easier to read:

$ iquery -aq "join(foo, redimension(apply(cross(build(<a:int64> [i=0:4,5,0], i*2000000), build(<b:int64> [j=1:5,5,0], j)), x, a+b, flag, true), <flag:bool> [x=1:100000000,1000000,0]))" | head -n 15
x,val,flag
1,1,true
2,2,true
3,3,true
4,4,true
5,5,true
2000001,2e+06,true
2000002,2e+06,true
2000003,2e+06,true
2000004,2e+06,true
2000005,2e+06,true
4000001,4e+06,true
4000002,4e+06,true
4000003,4e+06,true
4000004,4e+06,true

The result is the array foo, filtered to get just the values you want. The dummy flag attribute may be removed with a project(). And you can now compute your average.
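
Putting it all together, the final average is the same mask query wrapped in project() and aggregate(), something like:

$ iquery -aq "aggregate(project(join(foo, redimension(apply(cross(build(<a:int64> [i=0:4,5,0], i*2000000), build(<b:int64> [j=1:5,5,0], j)), x, a+b, flag, true), <flag:bool> [x=1:100000000,1000000,0])), val), avg(val))"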

But, as data gets sparser, the number of cells in the mask grows. Depending on how sparse the data is, my original suggestion may be much faster than this. This is a better method for denser data. Make sense?


#5

Thanks a lot Alex,

I will go ahead and model it as the 2D array (day, time of day) then. I will build some queries based on different data retrieval scenarios I have in mind and see how that works out.

Quick question, if you can help. Let’s say I am trying to grow my array to more than 5 days’ worth of data so I can do queries spanning several days (e.g. calculate ‘weekly’ or ‘monthly’ aggregates). The actual attribute values do not matter; they can be the same.

How would I go ahead and ‘copy’ the attributes from the first 5 days to the next 5 days so that I have an array with 10 days?

For example if my array is:

and I only have data for days 1…5, how do I duplicate/copy that data into days 6…10? (Keeping the same ‘timestamp’ dimension values but a different ‘day’.)


#6

Hi Mike,

Sorry about the wait. Here’s an example with two generic dimensions.

  1. Create array that’s 1 to 10K, squared:
$ iquery -aq "create array foo <val:int64> [x=1:10000,1000,0, y=1:10000,1000,0]"
  2. Fill in the first 1K by 1K square:
$ iquery -anq "store(between(build(foo,x),1,1,1000,1000 ),foo)"
  3. Take the first 1K by 1K square, offset it by 1000,1000, and reinsert it back at coordinates 1001,1001 to 2000,2000:
$ iquery -anq "insert(redimension(apply(foo, new_x, x+1000, new_y, y+1000), <val:int64> [new_x=1:10000,1000,0,new_y=1:10000,1000,0]), foo)"
  4. Observe that old versions are still accessible:
$ iquery -aq "aggregate(foo@1, sum(val), count(*))"
i,val_sum,count
0,500500000,1000000

$ iquery -aq "aggregate(foo@2, sum(val), count(*))"
i,val_sum,count
0,1001000000,2000000
  5. Offset and reinsert the last square diagonally again:
$ iquery -anq "insert(redimension(apply( between(foo, 1001,1001,2000,2000 ), new_x, x+1000, new_y, y+1000), <val:int64> [new_x=1:10000,1000,0,new_y=1:10000,1000,0]), foo)"

Redimension is the heavy-duty one-size-fits-all hammer. Not as nimble as it could be. It’s conceivable to write a lighter-weight operator for this kind of translation.
Further, insert works best when adding entire new chunks at a time. Inserts of less than one chunk are not recommended; they will start using up too much room on disk.
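
Applied to your (day, time-of-day) case, the same pattern would look roughly like this (untested; I am assuming an array named prices with schema <price:double> [day=1:10,1,0, tod=0:86399999,3600000,0] and data in days 1 through 5):

$ iquery -anq "insert(redimension(apply(between(prices, 1, null, 5, null), new_day, day+5), <price:double> [new_day=1:10,1,0, tod=0:86399999,3600000,0]), prices)"

Since the day chunk size is 1, the shifted data still lands on whole chunks.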
That help?


#7

Thanks a lot Alex… Will give it a try.