Appending Data to Arrays with Labeled Coordinates in 13.12


#1

Astute users may have noticed 13.12 added index_lookup and uniq to scidb core as full-fledged operators. Further, index_lookup no longer requires that the index array is sorted. This provides some good advantages, in particular when using categorical dimensions and inserting data into arrays.

For example, let’s take our trades data sample (examples/example_udos/SampleQueries) and split it into two 1D pieces to simulate incremental data loading. The first piece looks like this:

show(trades_flat_1);
i,schema
0,'trades_flat_1<symbol:string,ms:int64,volume:uint64,price:double> [i=0:*,1000000,0]'

count(trades_flat_1);
i,count
0,447450

We redimension this into [symbol_id, ms, trade], pretty much following the procedure shown here: viewtopic.php?f=18&t=1204

store(uniq(project(trades_flat_1, symbol)), stock_symbol)
i,symbol
0,'FB'
1,'GOOG'
2,'JPM'
3,'MSFT'
4,'ZNGA'

store(redimension(index_lookup(trades_flat_1, stock_symbol, trades_flat_1.symbol, symbol_id), <price:double,volume:uint64> [symbol_id=0:*,5,0,ms=0:86399999,86400000,0,trade=0:499,500,0]),trades )

Note also: as of 13.12 there is virtuall no difference between store(redimension()) and redimension_store().

Now it is time to “add” a second batch of trade data to the mix:

show(trades_flat_2)
i,schema
0,'trades_flat_2<symbol:string,ms:int64,volume:uint64,price:double> [i=0:999999,1000000,0]'

First, let’s just look at the unique symbols in trades_flat_2:

uniq(sort(project(trades_flat_2,symbol)))
i,symbol
0,'AAPL'
1,'BAC'
2,'C'
3,'FB'

So there is some more ‘FB’ data as well as new symbols. Our first objective is to add new symbols to the end of “stock_symbol” but skip FB. Note index_lookup returns null for attribute values that are not present in the index. We can use that property to filter out just the new data:

filter(index_lookup(uniq(sort(project(trades_flat_2, symbol))) as A, stock_symbol, A.symbol, symbol_id), symbol_id is null)
i,symbol,symbol_id
0,'AAPL',null
1,'BAC',null
2,'C',null

Now with some cross/redimension magic, we insert this data into stock_symbol. This runs faster than it looks:

insert(redimension(apply(cross_join(filter(index_lookup(uniq(sort(project(trades_flat_2, symbol))) as A, stock_symbol, A.symbol, symbol_flag), symbol_flag is null), aggregate(stock_symbol, count(*) as count)), new_i, A.i+count), <symbol:string>[new_i=0:*,1000000,0]), stock_symbol)
i,symbol
0,'FB'
1,'GOOG'
2,'JPM'
3,'MSFT'
4,'ZNGA'
5,'AAPL'
6,'BAC'
7,'C'

The array stock_symbol is no longer sorted, however index_lookup will work with it fine - returning the value ‘5’ for ‘AAPL’ for instance. Now, finally, use the new version of stock_symbol to insert data from trades_flat_2 into all the right places:

insert(redimension(index_lookup(trades_flat_2, stock_symbol, trades_flat_2.symbol, symbol_id), trades), trades)

Of note: the old stock_symbol and trades arrays are available of course, thanks to versioning:

scan(stock_symbol@1)
i,symbol
0,'FB'
1,'GOOG'
2,'JPM'
3,'MSFT'
4,'ZNGA'

And, this works best when inserts are large (at least 10MB) and minimally overlap existing chunks.

To summarize, the above is a good approach to adding data to arrays that are using categorical coordinate labels. The key benefit is that the introduction of new dimension values (i.e. new companies going on the market) does not disturb the placement of any of the existing data. One disadvantage is that it is now more difficult to express a query such as “give me everything where stock_symbol > ‘C’ and stock_symbol < ‘MSFT’”. It is still doable with a cross_join(), but not a between(). Of course, real financial analysis hardly cares for such a query. Neither do bionformatics type workloads with “gene_id” for example.

Finally, check out the new R package on github. It offers some nice new syntax for querying using labels. For example:

> stock_symbol = scidb("stock_symbol")
> trades = scidb("trades")
> count(trades[stock_symbol=="'FB'", , ])
[1] 147069

More concise syntax is on the roadmap for future releases.


#2