Aggregate on fixed time intervals


#1

I am having problems coming up with a proper AQL query to give me aggregates over fixed time intervals.

For example, using the XLDB11 Timeseries dataset (or a subset of it with only AAPL in the set), I have an array of prices indexed by a time dimension.

I am trying to get the aggregate of min( price ) and count(*) for 10 second intervals for one minute.

The result is:

{time} START,END,MIN,COUNT
{0} "2011-01-02 16:00:00","2011-01-02 16:00:11",101.48,10
{1} "2011-01-02 16:00:12","2011-01-02 16:00:27",101.48,10
{2} "2011-01-02 16:00:28","2011-01-02 16:00:42",103.48,10
{3} "2011-01-02 16:00:43","2011-01-02 16:00:54",100.48,10
{4} "2011-01-02 16:00:55","2011-01-02 16:00:59",103.48,3
That is a valid result, but not what I am looking for, because the array does not have attribute values at every dimension index.

I would like to get 6 rows back, on 10-second intervals:
"2011-01-02 16:00:00","2011-01-02 16:00:09",x,y
"2011-01-02 16:00:10","2011-01-02 16:00:19",x,y
"2011-01-02 16:00:20","2011-01-02 16:00:29",x,y
"2011-01-02 16:00:30","2011-01-02 16:00:39",x,y
"2011-01-02 16:00:40","2011-01-02 16:00:49",x,y
"2011-01-02 16:00:50","2011-01-02 16:00:59",x,y

I tried different things; the closest I got was using multiple dimensions, one for each aggregation interval I need. Very clumsy, and I am sure not optimal.

So I think my question comes down to this: how do I 'fill' the empty cells during aggregation? Is 'regrid' the right way to go?

Having START and END timestamps is not really important (min and max work fine). What matters most is getting the right number of rows back, and having the count(*) aggregate return 0 for any interval that contains no values.

Any hints or advice on which operators to use are greatly appreciated.


#2

Hi MikeJ!

Well. Here's how you do it. The explanation is broken up over a set of script fragments that you can wire together if you'd like, but you can also just cut-n-paste the individual elements into a shell window.

#
#  Some useful helper functions. Each echoes the query text, then runs it
#  through iquery with dcsv output, timing the execution. The first argument
#  is the query string; the optional second argument passes extra iquery flags.
#
#  Run an AFL query ( iquery -aq ).
exec_afl_query () {
    echo "Query: ${1}"
    /usr/bin/time -f "Elapsed Time: %E" iquery -o dcsv ${2} -aq "${1}"
};
#
#  Run an AQL query ( iquery -q ).
exec_aql_query () {
    echo "Query: ${1}"
    /usr/bin/time -f "Elapsed Time: %E" iquery -o dcsv ${2} -q "${1}"
};
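
For example, a quick sanity check of the AFL wrapper (list('arrays') is a standard AFL operator):

exec_afl_query "list ('arrays')"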

First, I'm going to try to get your application's semantics as close to the logical array model as I can. In SciDB, we support a basic, built-in 'datetime' type. This is more or less just a wrapper around a POSIX seconds-since-epoch offset, but it's a good idea to get a handle on the range of functions you can use with it.

If you want to get a handle on the SciDB functions that are available, together with their signatures as they relate to particular types, use the built-in list(…) operator. The following query lists the functions that involve the 'datetime' type. Note the use of the regex(…) function, which performs a regular expression (boolean) match over its argument.

CMD_LIST_FUNCTIONS="
SELECT F.name, F.profile
  FROM list ('functions') AS F
 WHERE regex ( F.profile, '(.*)datetime(.*)' )
"
exec_aql_query "${CMD_LIST_FUNCTIONS}"

The following query illustrates how to convert instances of the SciDB datetime type into int64 values, and back again.

CMD_MANIPULATE_DATETIME="
SELECT now() AS Now_Datetime,
       int64 ( now() ) AS Posix_Secs_Now_Datetime,
       datetime ('1970-1-1 00:00:00') AS Epoch,
       int64 ( datetime ( '1970-1-1 00:00:00' ) ) AS Posix_Epoch,
       int64 ( datetime ( '2013-3-11 00:00:00' ) ) AS Start_DT
  FROM build ( <  v : int64  > [ I=0:0,1,0 ], I  ) AS D
"
exec_aql_query "${CMD_MANIPULATE_DATETIME};"
# {I} Now_Datetime,Posix_Secs_Now_Datetime,Epoch,Posix_Epoch,Start_DT
# {0} "2013-04-12 16:30:22",1365784222,"1970-01-01 00:00:00",0,1362960000

OK. To see what's going on, let's create an array and populate it with datetime values. Note the limits on the INT_Time dimension: we're starting at DateTime = '2013-3-11 00:00:00', which is Posix Time = 1362960000. Note also the chunk size of 3600, which is 1 hour (in seconds).

CMD_HYGIENE="DROP ARRAY Timeseries_Example"
exec_aql_query "${CMD_HYGIENE};"

CMD_CREATE_NO_NID_ARRAY="
CREATE ARRAY Timeseries_Example
< 
  Price : double null,
  Qty   : int64 null
> 
[ INT_Time=1362960000:*,3600,0 ]
"
exec_aql_query "${CMD_CREATE_NO_NID_ARRAY};"

Now, let's populate the Timeseries_Example array. First, the following query generates 100 values at random points over the one-week period after '2013-3-11 00:00:00' (604800 is the number of seconds in one week).

#   NOTE: The results of CMD_CREATE_DATA won't be identical to what is 
#         produced in the "real" CMD_RD_DATA query, because of the different 
#         results of the random(). 
CMD_CREATE_DATA="
SELECT double(random()%10000) / 100.0 AS Price,
       random()%5 AS Qty,
       datetime ( INT_Time ) AS DTime
  FROM build ( < INT_Time : int64 > [ NUM=0:99,100,0 ], 
                int64 ( datetime('2013-3-11 00:00:00') + 
                           ((random() % 604800))) 
             )
"
exec_aql_query "${CMD_CREATE_DATA};"

Next, populate the Timeseries_Example array with data. This query is a slight variation on the one above. Because of the "REDIMENSION BY", we need to use an aggregate for each of the source attributes: in the general case we have to cope with "collisions", where the 1D input contains duplicate INT_Time values. In this particular case there almost certainly won't be any.

CMD_RD_DATA="
INSERT INTO Timeseries_Example
  SELECT MIN ( double(random()%10000) / 100.0 ) AS Price,
         MIN ( random()%5 ) AS Qty
  FROM build ( < INT_Time : int64 > [ NUM=0:99,100,0 ], 
                int64 ( datetime('2013-3-11 00:00:00') + 
                           (random() % 604800)) 
             )
REDIMENSION BY [ INT_Time=1362960000:*,3600,0 ]
"
exec_aql_query "${CMD_RD_DATA};"

Let’s look at that data.

CMD_ANALYZE="
SELECT * 
  FROM analyze ( apply ( Timeseries_Example, DT, datetime(INT_Time) ))
"
exec_aql_query "${CMD_ANALYZE};"

The details of this query’s result will vary, depending on how the RNG worked in the CMD_RD_DATA query above. But you should see about 100 entries in the Timeseries_Example array.

Now, let's reproduce your issue. Suppose you wanted to regrid this into daily entries between '2013-3-12' and '2013-3-15'. Your example uses a different granularity and different start/end points, but the principle is the same.

CMD_REGRID="
SELECT MIN ( datetime ( INT_Time )) AS Start_Time,
       MAX ( datetime ( INT_Time )) AS End_Time,
       MIN ( Price ) AS Low,
       MAX ( Price ) AS High,
       AVG ( Price ) AS Average,
       SUM ( Qty ) AS Volume,
       COUNT ( * ) AS Trade_Cnt
  FROM between ( Timeseries_Example, 
                 int64(datetime ('2013-3-12 00:00:00')),
                 int64(datetime ('2013-3-15 00:00:00'))
               )
 REGRID AS ( PARTITION BY INT_Time 86400 )
"
exec_aql_query "${CMD_REGRID};"

OK. So what's going on here? I hope the following diagram gives you a feel for it. To illustrate the point, this is a simple array A, with an INT_Time dimension and two attributes, Q and P.

# A ---> INT_Time
#       0     1     2     3     4     5     6     7     8     9    10     11
#   +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
#   | ( ) | Q,P | ( ) | Q,P | ( ) | Q,P | Q,P | ( ) | Q,P | ( ) | Q,P | Q,P |
#   +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
#      ^                 ^     ^                 ^     ^                 ^
#      |_________________|     |_________________|     |_________________|
#
#              0                        1                       2

The operation we want to compute is:

regrid ( A, 4 )  -> B [ 0:2 ] 

The regrid(…) op reports what it sees in the data. In the case of B[0] (which is A[0:3]), the A[0] cell is "empty" and therefore doesn't contribute any values. B[1] (which is A[4:7]) likewise takes no input from the empty cells at A[4] and A[7]. This is the way regrid(…) works. As a rule, we don't try to "make up" data that isn't there.
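
If you want to see this behavior in isolation, here's a minimal sketch over a throwaway 12-cell array (the predicate I % 3 <> 0 is just an arbitrary way to leave some cells empty, so the cell pattern differs slightly from the diagram above):

CMD_REGRID_BEHAVIOR="
SELECT COUNT ( * ) AS Cnt,
       MIN ( v ) AS Low
  FROM build_sparse ( < v : int64 > [ I=0:11,12,0 ], I, I % 3 <> 0 )
 REGRID AS ( PARTITION BY I 4 )
"
exec_aql_query "${CMD_REGRID_BEHAVIOR};"
# Expected (cells are present at I = 1,2,4,5,7,8,10,11):
# {I} Cnt,Low
# {0} 2,1
# {1} 3,4
# {2} 3,8

Each output cell counts only the non-empty input cells in its 4-cell block. None of the three blocks here is completely empty, but if one were, regrid(…) would produce an empty output cell for it rather than a zero.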

So - how do you get the result you want? In a nutshell, you need to use other SciDB operators to make A look like what’s below:

# A_new ---> INT_Time
#       0     1     2     3     4     5     6     7     8     9    10     11
#   +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
#   | (?) | Q,P | ( ) | Q,P | (?) | Q,P | Q,P | ( ) | Q,P | ( ) | Q,P | Q,P |
#   +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
#      ^                 ^     ^                 ^     ^                 ^
#      |_________________|     |_________________|     |_________________|
#
#              0                        1                       2

There's a SciDB operator called merge(…) that combines the contents of two input arrays. The operator's signature is merge ( A, B ), where A and B are "conformant" arrays, which means they must have the same list of attributes, and each attribute must have the same name, type, and properties like nullability. The merge(…) algorithm says: "If there is a cell in A, then put that in the output. If there is no cell in A, but there is a cell in B at that location, then place the B cell in the output. If neither A nor B has a cell at the location, then nothing ends up there."

The merge(…) operation is very familiar to users with an image processing background. It doesn’t have a direct analog in SQL, however.
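
To see the precedence rule in isolation, here's a minimal sketch over two throwaway build_sparse arrays (the values and predicates are arbitrary, chosen only so that the two inputs partially overlap):

CMD_MERGE_DEMO="
SELECT *
  FROM merge ( build_sparse ( < v : int64 > [ I=0:5,6,0 ], I,       I % 2 = 0 ),
               build_sparse ( < v : int64 > [ I=0:5,6,0 ], 100 + I, I < 4 )
             )
"
exec_aql_query "${CMD_MERGE_DEMO};"
# Expected: cells 0, 2 and 4 come from the first input, cells 1 and 3 are
# filled in from the second, and cell 5 stays empty because neither input
# has a cell there.
# {I} v
# {0} 0
# {1} 101
# {2} 2
# {3} 103
# {4} 4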

How best to handle this in your application? What I've done below is to create a "calendar" that you can use like a "mask". Then you can merge(…) the calendar / mask back with the Timeseries data to get the boundaries you want.

Something like this …

CMD_HYGIENE="
DROP ARRAY Timeseries_Day_Mask
"
exec_aql_query "${CMD_HYGIENE};"
#
CMD_BUILD_MASK="
SELECT double(missing(missing_reason(null))) AS Price,
       int64(missing(missing_reason(null))) AS Qty
 INTO Timeseries_Day_Mask
  FROM build_sparse ( < TS : int64 > [ INT_Time=1362960000:1363564800,3600,0],
                      INT_Time,
                      INT_Time%86400 = 0
                    )
"
exec_aql_query "${CMD_BUILD_MASK};"

NOTE: The 1362960000 is the Posix time for '2013-3-11 00:00:00' (see above), and the 86400 is the number of seconds in a 24 hour period: 3600 * 24 = 86400. By adjusting things like the start timestamp, the "86400" interval value, and so on, you can construct a "mask" that represents the calendar at whatever granularity, over whatever time period, you need.
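
For instance, to get back to the 10-second granularity of your original question, a mask covering a single minute would look something like the sketch below (Timeseries_10Sec_Mask is just a made-up name, and I'm reusing the '2013-3-11' start time from this example rather than the 2011 timestamps in your post):

CMD_BUILD_10SEC_MASK="
SELECT double(missing(missing_reason(null))) AS Price,
       int64(missing(missing_reason(null))) AS Qty
 INTO Timeseries_10Sec_Mask
  FROM build_sparse ( < TS : int64 > [ INT_Time=1362960000:1362960059,60,0 ],
                      INT_Time,
                      INT_Time%10 = 0
                    )
"
exec_aql_query "${CMD_BUILD_10SEC_MASK};"

A merge(…) with this mask, followed by a REGRID AS ( PARTITION BY INT_Time 10 ), should then give you exactly 6 rows back, one per 10-second interval.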

How do you use merge(…)?

CMD_MERGE_MASK_WITH_TIMESERIES="
SELECT * 
  FROM merge ( Timeseries_Example, Timeseries_Day_Mask )
"
exec_aql_query "${CMD_MERGE_MASK_WITH_TIMESERIES};"

Anyone familiar with how calendars and temporal data models work will by now be pointing out that things like leap years and leap seconds complicate this math. And financial market types will point out that trade-day and market-calendar rules make it more complicated still. But the underlying idea is the same, even for those more complex, practical cases. In an array DBMS, using masks for filtering is a very efficient, powerful idea.

Now. The idea is to use the "mask" to anchor the start/end of each day in the input timeseries, and then use regrid(…) the way you did initially. The following query illustrates the idea.

CMD_REGRID_2="
SELECT MIN ( datetime ( INT_Time )) AS Start_Time,
       MAX ( datetime ( INT_Time )) AS End_Time,
       MIN ( Price ) AS Low,
       MAX ( Price ) AS High,
       AVG ( Price ) AS Average,
       SUM ( Qty ) AS Volume,
       COUNT ( Qty ) AS Trade_Cnt
  FROM between ( 
         merge ( 
           Timeseries_Example, 
           Timeseries_Day_Mask
         ),
         int64(datetime ('2013-3-12 00:00:00')),
         int64(datetime ('2013-3-15 00:00:00'))
       )
 REGRID AS ( PARTITION BY INT_Time 86400 )
"
exec_aql_query "${CMD_REGRID_2};"

And that gets you the result you want: the aggregated attribute information, anchored to a calendar. Note the switch from COUNT(*) to COUNT(Qty): the mask cells carry null attribute values, so COUNT(Qty) ignores them and returns 0 for intervals that contain no real data, which is exactly the behavior you asked for.

Hope this helps!


#3

Thank you.
Appreciate the detailed solution. Very helpful information.


#4

Just a thought: you might also consider reordering merge and regrid in the query above. Instead of regrid(merge(A, B)), do merge(regrid(A), C), observing that C is a shrunk version of B that takes some math to build properly. Might be an optimization worth considering.