Chunk overlap


#1

I am trying to create an array with chunk overlap but it fails during redimension.

I took the following example from the guide:

event,year,person,time
1,1996,Bailey,9.84
1,2000,Greene,9.87
1,2004,Gatlin,9.85
1,2008,Bolt,9.69
2,1996,Keter,487.12
2,2000,Kosgei,503.17
2,2004,Kemboi,485.81
2,2008,Kipruto,490.34
3,1996,Thugwane,7956
3,2000,Abera,7811
3,2004,Baldini,7855
3,2008,Wanjiru,7596

The following commands work:

[code]iquery -q "CREATE ARRAY winnersFlat < event:int64, year:int64, person:string, time:double > [i=0:*,1000000,0]"

iquery -q "LOAD winnersFlat FROM '/tmp/data.scidb'"

iquery -q "CREATE ARRAY winners <person:string, time:double>[year=1996:2008,1000,0, event=0:3,2,2]"
[/code]
but the very last step fails:

$ iquery -a -q "insert ( redimension ( winnersFlat, winners ), winners )"
UserException in file: src/query/ops/redimension/LogicalRedimension.cpp function: inferSchema line: 226
Error id: scidb::SCIDB_SE_INFER_SCHEMA::SCIDB_LE_OP_REDIMENSION_STORE_ERROR3
Error description: Error during schema inferring. REDIMENSION_STORE cannot fill overlap area larger than chunk interval.
Failed query id: 1101971044349

A 0 overlap has no issues.

Any suggestions as to what I am doing wrong?

Thanks
mike


#2

Hello,

Redimension has a silly restriction where it does not support overlaps. To get this to work, redimension_store into a temp array, then insert the temp array into target, then remove temp.
This should be fixed soon…


#3

Thank you. Yes, redimension_store works.

A quick follow up question…

What is the best way I can distinguish if a value is from the overlap region or not?

For example, when I am iterating the chunks of an attribute I would like to have a condition such as:

if ( value_is_from_overlap )
{
    // DO SOMETHING
}
else
{
    // VALUE BELONGS ON THIS CHUNK
    // DO SOMETHING ELSE
}

How do I build the value_is_from_overlap variable?

Thanks
–mike


#4

You shouldn’t need to care. At least, not at the level of the AFL / AQL query language. The per-chunk overlap is a performance optimization we use to ensure that operators like window can be computed entirely in parallel.

Inside the operators we only deal with the relevant data. So operators like filter and apply will work with all the values whether they’re in the overlap or not. Operators like window use the pieces of the overlap they need to compute their results. And operators like aggregate, regrid etc will only work with values in the core of the array’s chunks.
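
To make the point about window concrete, here is a minimal standalone sketch in plain C++. It is not SciDB code: `Chunk1D`, `makeChunks` and `windowSumPerChunk` are made-up names, and the chunk layout is an assumption for illustration only. It shows that when each chunk carries `overlap` extra cells on both sides, a window of radius no larger than the overlap can be computed from the cells stored in that chunk alone, without asking a neighbouring chunk for data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one chunk of a 1-D array: it stores its core
// cells plus up to `overlap` cells copied from each neighbouring chunk.
struct Chunk1D {
    int64_t storedFirst;        // first stored coordinate (core start minus overlap, clipped)
    int64_t coreFirst;          // first coordinate of the core
    int64_t coreLast;           // last coordinate of the core
    std::vector<double> cells;  // values for storedFirst .. storedFirst + cells.size() - 1
};

// Split `data` (coordinates 0 .. data.size()-1) into chunks of `interval`
// cells, each carrying up to `overlap` extra cells on both sides.
std::vector<Chunk1D> makeChunks(const std::vector<double>& data,
                                int64_t interval, int64_t overlap) {
    assert(interval > 0 && overlap >= 0);
    const int64_t n = static_cast<int64_t>(data.size());
    std::vector<Chunk1D> chunks;
    for (int64_t start = 0; start < n; start += interval) {
        Chunk1D c;
        c.coreFirst = start;
        c.coreLast = std::min(start + interval, n) - 1;
        c.storedFirst = std::max<int64_t>(0, c.coreFirst - overlap);
        const int64_t storedLast = std::min(n - 1, c.coreLast + overlap);
        c.cells.assign(data.begin() + c.storedFirst, data.begin() + storedLast + 1);
        chunks.push_back(c);
    }
    return chunks;
}

// Sum over [i - radius, i + radius] for every core cell, reading only the
// cells stored in that cell's own chunk. Correct when radius <= overlap.
std::vector<double> windowSumPerChunk(const std::vector<Chunk1D>& chunks,
                                      int64_t radius) {
    std::vector<double> out;
    for (const Chunk1D& c : chunks) {
        const int64_t storedLast =
            c.storedFirst + static_cast<int64_t>(c.cells.size()) - 1;
        for (int64_t i = c.coreFirst; i <= c.coreLast; ++i) {
            const int64_t lo = std::max(c.storedFirst, i - radius);
            const int64_t hi = std::min(storedLast, i + radius);
            double sum = 0.0;
            for (int64_t j = lo; j <= hi; ++j) {
                sum += c.cells[j - c.storedFirst];
            }
            out.push_back(sum);
        }
    }
    return out;
}
```

With an overlap of 0 the same function gives wrong sums at chunk boundaries, because the neighbouring cells are simply not there; that is the gap the overlap fills.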

But what you see in the client space…at least, using iquery…is only data from the non-overlapping region. Why do you need to be aware of whether a value’s in the core or the overlap?


#5

I am looking to find out the overlap region from within a user defined operator (UDO).

Similar to the operators you mentioned, I need the same flexibility of determining how to use the overlap chunk values within my operators.

Currently, in a UDO, I see the overlap values (iterating without ChunkIterator::IGNORE_OVERLAPS) when iterating the chunk, but I have not been able to find a way to put in place an “if” statement as described above. I tried to go over some of the operators you mentioned but was not able to identify a ‘pattern’.

Thanks
–mike


#6

Well … the reason we don’t have any operators that care to know whether the chunk is in the overlap region or not is because yours would be the first!

Which worries me. Putting the check in there is going to incur pretty major run-time expense. As you say … the way to do it would be to look at the ArrayDesc for the input array, pull out its Dimensions, and then check, one cell at a time (you can get the position in the iterator with the ConstChunkIterator::getPosition() call) to see whether the cell is in the overlapping region or not.
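
The arithmetic behind that per-cell check can be sketched roughly as follows. This is standalone C++ rather than code against the SciDB headers: `DimSpec` is a hypothetical stand-in holding the start, end and chunk interval you would read from the dimension, and the helper names are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the fields of one array dimension
// (start, end, chunk interval); not the SciDB class itself.
struct DimSpec {
    int64_t start;
    int64_t end;
    int64_t chunkInterval;
};

// First coordinate of the core of the chunk whose core contains `coord`.
int64_t coreFirstOf(const DimSpec& d, int64_t coord) {
    assert(d.chunkInterval > 0 && coord >= d.start && coord <= d.end);
    return d.start + ((coord - d.start) / d.chunkInterval) * d.chunkInterval;
}

// Last coordinate of the core of the chunk whose core starts at `coreFirst`,
// clipped to the end of the dimension.
int64_t coreLastOf(const DimSpec& d, int64_t coreFirst) {
    return std::min(coreFirst + d.chunkInterval - 1, d.end);
}

// True when `coord`, seen while iterating the chunk whose core starts at
// `coreFirst`, lies outside that core, i.e. it was copied from a neighbour.
bool isInOverlap(const DimSpec& d, int64_t coreFirst, int64_t coord) {
    return coord < coreFirst || coord > coreLastOf(d, coreFirst);
}
```

Note that the core bounds depend only on the dimension start and chunk interval, not on the size of the overlap, so the bounds can be computed once per chunk and only the two comparisons are paid per cell.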

But … stepping back a second, there might be an alternative design. Would you mind sharing how your new operator is specified? Inputs, Outputs, how-input-maps-to-output?


#7

I think we are talking about the same thing but I am using the wrong terminology. I am looking to find out how to determine if a given cell/value is in the overlap section, of the chunk that is being iterated on, or not.

In any case, I am probably taking a wrong approach on how to implement the operator and my operator specs are not optimal.

Let me go all the way back and explain the problem I try to solve, and the approach I take. I am sure you will be able to give me a better design of how to do it.

Let's take the following array, which holds timeindex as a dimension and tradeprice, flag and time as attributes.

trades<tradeprice:double, flag:uint32, time:uint64>[ timeindex=0:86399999,10800000,0 ]

I want to create an operator that calculates the number of upticks and downticks from an input array, which will be a subarray of 'trades'.
Every time there is a tradeprice increase the uptick counter will go up and every time the tradeprice decreases the downtick counter will go up. The output array would be a single cell with those 2 attributes.

A sample query would look like:

select * from myOperator( between(trades, 36000000, 36059999 ) )

Which would provide as input all trades for one minute from 10:00:00.000 until 10:00:59.999

The problem is that I want the very first trade in that time interval to be counted as an uptick or downtick and to do that I need to know the trade before that, which is outside my input array.

So the question is… How do I best do that?

This is the approach I took (You can skip reading the remainder of this post as there must be a better approach)
Add 2 input parameters in the operator, the start & end timeindex ( same values as the input parameters to ‘between’ ) AND decrease the first parameter to ‘between’ by 60 seconds.
So now my operator looks like:

Which would provide as input all trades from 09:59:00.000 to 10:00:59.999, the minute I am interested in AND the minute before that, alongside the 2 parameters (startindex & endindex ) that specify the range I am truly interested in.

My computeLocalStats() method would then look like (pseudocode):

// time is the time (attribute) of the trade I am iterating over
// startindex is the first parameter passed as input
if ( time < startindex )
{
    // Don't compute this trade. Just record its price
}
else
{
    // This is a trade I am interested in. Compare against latest recorded trade price
    // and update uptick/downtick variable. Record its trade price to compute next trade
}

Not the greatest solution… but it works!!!

Now about overlapping… Without overlapping, the above solution FAILS at the beginning of the chunk. For example… if my chunk would start at 10:00:00.000 I would still get all my trades, but the trades for the minute before would go to a different instance, and I would still not be able to ‘account’ for the first trade.
Adding overlapping fixes that issue, but introduces the problem that certain ‘overlapped’ trades would be counted twice if within the time interval I am working on. For example overlapping trades on between(trades,null,null) would be counted twice.
If I was able to change the above pseudocode to:

if ( time < startindex OR value_is_from_overlap )
{
    // Don't compute this trade. Just record its price
}
else
{
    // This is a trade I am interested in. Compare against latest recorded trade price
    // and update uptick/downtick variable. Record its trade price to compute next trade
}

I think it would work out fine…Not an elegant solution but I think it would work. I am trying to figure out how to set value_is_from_overlap to true or false.
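
Spelled out as standalone C++, the counting logic I have in mind is roughly the sketch below. It is not tied to the SciDB API: `Trade`, `TickCounts` and `countTicks` are names made up for the example, the cells are assumed to arrive in timeindex order, and `fromOverlap` is exactly the flag I do not yet know how to compute.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical flattened view of one cell as the operator would see it.
struct Trade {
    int64_t timeindex;    // dimension coordinate
    double  tradeprice;   // attribute value
    bool    fromOverlap;  // true if the cell lies outside the chunk core
};

struct TickCounts {
    uint64_t upticks = 0;
    uint64_t downticks = 0;
};

// Walk the trades in timeindex order. Cells before `startindex` or from the
// overlap only update the "previous price"; they are never counted.
TickCounts countTicks(const std::vector<Trade>& trades, int64_t startindex) {
    TickCounts counts;
    bool havePrev = false;
    double prevPrice = 0.0;
    int64_t prevTime = 0;
    for (const Trade& t : trades) {
        assert(!havePrev || t.timeindex >= prevTime);  // input must be time-ordered
        const bool countIt =
            havePrev && t.timeindex >= startindex && !t.fromOverlap;
        if (countIt) {
            if (t.tradeprice > prevPrice) {
                ++counts.upticks;
            } else if (t.tradeprice < prevPrice) {
                ++counts.downticks;
            }
        }
        // Every cell, counted or not, becomes the reference for the next one.
        prevPrice = t.tradeprice;
        prevTime = t.timeindex;
        havePrev = true;
    }
    return counts;
}
```

A trade at an unchanged price counts as neither an uptick nor a downtick in this sketch.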

Appreciate the help.


#8

Cool…

So there are a couple ways to do this. First you could use these methods on the ConstChunk class:

virtual Coordinates const& getFirstPosition(bool withOverlap) const = 0;
virtual Coordinates const& getLastPosition(bool withOverlap) const = 0;

//Usage:
shared_ptr<ConstChunkIterator> iter = chunk.getIterator();
Coordinates const& first = chunk.getFirstPosition(false);
Coordinates const& last = chunk.getLastPosition(false);
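while(! iter->end())
{
  Value v = iter->getItem();
  Coordinates const& pos = iter->getPosition();
  if ( pos[0] >= first[0] && pos[0]<=last[0]) // only one dim, right?
  {
    //... cell is in the core of this chunk
  }
  else
  {
    //... cell is in the overlap
  }
  ++(*iter); // advance to the next cell, otherwise the loop never ends
}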
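
If the array ever has more than one dimension, the same test has to hold in every dimension. A minimal standalone sketch, assuming the coordinates are plain vectors of 64-bit integers (`Coords` and `isInCore` are names invented here, not part of the SciDB headers):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for a coordinate vector: one 64-bit integer per dimension.
using Coords = std::vector<int64_t>;

// `first` and `last` are the chunk bounds without overlap, i.e. what
// getFirstPosition(false) and getLastPosition(false) would give you.
// A cell is in the core only if every coordinate is inside those bounds.
bool isInCore(const Coords& pos, const Coords& first, const Coords& last) {
    assert(pos.size() == first.size() && pos.size() == last.size());
    for (std::size_t d = 0; d < pos.size(); ++d) {
        if (pos[d] < first[d] || pos[d] > last[d]) {
            return false;  // outside the core in dimension d
        }
    }
    return true;
}
```

Anything for which this returns false while you iterate the chunk with overlaps included is an overlap cell.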

That help?