Prototype Workaround for Non-Integer Dimensions in 13.6


#1

In the 13.6 source tree, you will find the directory examples/example_udos. This directory contains the implementation of 4 prototype user-defined operators. These prototypes serve two objectives:

  1. they are examples of how to write standalone UDOs - from simple to complex
  2. they provide a prototype solution for some non-integer dimension problems

If you are interested in learning to develop UDOs, start with the file example_udos/README - and move on from that to the Makefile and source files. Below I’ll discuss the non-integer dimensions workaround.

Example
We use the stock trades data which can be loaded with example_udos/SampleQueries/load_trades.sh

We now have the following 1D array:

$ iquery -aq "show(trades_flat)"
i,schema
0,'trades_flat<symbol:string,ms:int64,volume:uint64,price:double> [i=0:*,1000000,0]

$ iquery -aq "load_library('example_udos')"   #the .so file IS installed as part of the SciDB 13.6 package

Side reminder - populate your iquery.conf file to make life easier if you haven’t done so already:

$ cat ~/.config/scidb/iquery.conf 
{
  "format":"lcsv+"
}

Our trades array data looks like this:

$ iquery -aq "between(trades_flat,0,4)"
i,symbol,ms,volume,price
0,'AAPL',14409893,196,433
1,'AAPL',15699910,100,436
2,'AAPL',15699976,100,436
3,'AAPL',24744512,157,433
4,'AAPL',24776397,141,433

The objective is to redimension the data using symbol and ms as dimensions. This is where new operators come in. As a first step, we create the array of sorted, unique stock symbols:

$ iquery -aq "store(uniq(sort(project(trades_flat, symbol))), stock_symbols)"
i,symbol
0,'AAPL'
1,'BAC'
2,'C'
3,'FB'
4,'GOOG'
5,'JPM'
6,'MSFT'
7,'ZNGA'

Turns out there’s only 7 - this is a small dataset. Note the usage of uniq - it works just like its Unix namesake command; its input must be sorted. We can now use the array “stock_symbols” as a means of translating symbols to their ordinal and back. To go from the symbol to the ordinal, we use another new operator called index_lookup. This is how it works:

$ iquery -aq "between(index_lookup(trades_flat, stock_symbols, trades_flat.symbol, symbol_id), 0, 4)"
i,symbol,ms,volume,price,symbol_id
0,'AAPL',14409893,196,433,0
1,'AAPL',15699910,100,436,0
2,'AAPL',15699976,100,436,0
3,'AAPL',24744512,157,433,0
4,'AAPL',24776397,141,433,0

Note: “trades_flat.symbol” is the attribute to look up and “symbol_id” is the name of the new applied attribute. Now we can feed this into redimension_store:

$ iquery -aq "create array trades_redim <volume:uint64, price:double> [symbol_id=0:*,5,0, ms=0:*,86400000,0]" 
$ iquery -anq "redimension_store(index_lookup(trades_flat, stock_symbols, trades_flat.symbol, symbol_id), trades_redim)"

And the array is redimensioned. We can now use combinations of operators to relate the symbol IDs back to their string names and vice-versa. For example, this query returns the 3 stocks with the highest total traded volume:

$ iquery -aq "sort(join(redimension(between(sort(unpack(aggregate(trades_redim, sum(volume) as total_volume, symbol_id), i), total_volume desc), 0,2 ), <total_volume:uint64 null> [symbol_id=0:*,1000000,0]), stock_symbols), total_volume desc)"
n,total_volume,symbol
0,122237404,'BAC'
1,58390502,'ZNGA'
2,30949232,'C'

This query selects only the trades for AAPL and MSFT. Note the use of repart:

$ iquery -aq "count(cross_join( trades_redim as A, repart( filter( stock_symbols, symbol = 'MSFT' or symbol = 'AAPL'), <symbol:string> [symbol_id=0:*,5,0]) as B, A.symbol_id, B.symbol_id))"
i,count
0,108565

Note: it’s not as good as between(), but it is more versatile. It will be faster than filter() on large arrays.

Problems Addressed
This mechanism of using uniq and index_lookup enables all of the functionality of the current NIDs, and solves some problems. Namely:

  1. Current NIDs can use too much memory if there are too many unique values in the dimension. Here’s an example complaint: viewtopic.php?f=11&t=1144 . Both operators should be much friendlier to memory and respond better to tuning. index_lookup has its own ‘memory_limit=’ optional parameter. See some tuning notes below for more information.

  2. uniq() is useful on its own, it provides functionality that previously was available only as an awkward redimension_store.

  3. The source data for the dimension values no longer has to come from the array that’s being redimensioned. Imagine a case where you have a master list of all the possible stock symbols and partial data sets that may contain some of the symbols. You can now create multiple such arrays and combine them with each other.

  4. None of the NID restrictions apply - you can perform joins, inserts, cross_joins and merges. Provided you used the same index to create the values, you will get the correct result. You can also use regular redimension without the _store more often.

  5. Some users might enjoy the fact that there are no implicitly created hidden arrays and everything is out in the open. The user has full control over the lifetime of all the objects involved.

Tuning Notes
You might find that you are using the operators sort() and repart() more often than before, hence some hints.

Remember that sort will spawn prefetch-queue-size number of threads to sort data in parallel. Each thread will consume up to merge-sort-buffer MB of ram, so you don’t want to make that number too large. In practice, something like 64MB might be well-advised for an average machine.

If you find you are repart-ing often, consider that index arrays typically are not large and you might want to store several copies with different chunk sizes. Note that index_lookup does not require a particular chunk size.

Also note that index_lookup has a memory_limit option, which is memory usage on top of the intermediate array cache and storage manager cache. The more memory you give to this operator - the faster it works.

Alternatives
If I were actually dealing with real stock data I might adopt a different approach.

There are not that many different equities on the market. You can simply create a UDF that has a hardcoded table (or a custom hash) that converts from a symbol to a number, and another function that converts the number back to a symbol. A custom reversible hash might be a good choice since stock symbols use a limited alphabet and tend to be short. We call such a pair of functions a “functional NID”. That approach would be a better at handling new dimension values that are introduced over time (i.e. when a new financial instrument goes on the market). However, other applications (i.e. genomics) may be better served by this uniq/index_lookup mechanism.

Future and roadmap
Clearly, more work is needed in this area. We are happy to be able to offer a prototype that may help some folks. However, this isn’t a final product and we are in the process of designing and discussing what the actual final product will look like. To that end - we welcome your comments with requirements and use cases!


#2

How would the redimensioning work if there are two or more non-integer dimensions? I have two non-integer dimensions and I have created their corresponding index arrays but can’t figure out how to write the redimension command with two index_lookups.


#3

Sure, take a look at
viewtopic.php?f=11&t=1371


#4

Thanks! I was using a join of two index_lookups, but using nested index_lookups visibly improved performance.


#5