Scidb vs traditional sql


#1

I’m experiencing a little confusion about joins in traditional sql vs scidb.

I’ve read a few other posts which explain some of the differences between scidb’s array store vs sql and my question is really about how I should change the layout of my data to use scidb effectively.

We have a time series store which we’ve implemented in a traditional sql (normalized) way with a symbols table and an observations table:
symbols(id, ticker, exchange, sector, …)
observations(id, series_id, asofdate, value) in which series_id is foreign key’d to symbols(id)

What I have tried so far is to import each of these tables as separate arrays and then re-dimensioned them using id.

I was attempting to do some simple summary stats grouped by sector, but I have not been able to cross_join the two tables.

In the scidb model, should I simply import a ‘pre-joined’ version of the data?

i.e. export this query to csv for import into scidb:
select * from observations o
join symbols s on s.id = o.series_id

Should I start thinking about using all scidb arrays in de-normalized terms?

Thanks for the advice.

-Whit


#2

Hi Whit,

When denormalizing or pre-joining data (in any engine) you are always playing the tradeoff game between data storage space (denormalized data will take up more room on disk) versus join performance (it takes extra CPU time to do the joins). Faced with this kind of tradeoff, the right strategy will depend on how often you need to perform the join and what your data looks like. If you find that every useful query against A requires a join with B and “A join B” is not too big, it might make sense to somehow materialize “A join B”.

One salient difference between SciDB and relational engines is SciDB’s insistence on splitting “columns” into “attributes” and “dimensions”. Using id as a dimension is a good intuition. We often find timeseries application often use dimensions “symbol id” and “time” where time is some number of milliseconds or microseconds since a known date.

Currently scidb supports join (join on dimensions where two arrays have exact same dimensions) and cross_join (join on dimensions where two arrays share some similar dimensions). If you must join on attributes, you can use cross() followed by a filter() - but this can get expensive and should be used sparingly. At the moment join and cross_join support integer dimensions only. It’s possible you’re having a problem because “id” is not integer? If that’s the case, there is a workaround I can show you.

Does this help at all?


#3

I am using integers as the id variables.

I still don’t understand how to do the join.

I put together a small script to convey what I’m trying to do.

If someone can help me understand why the cross join doesn’t work I would really appreciate it.

-Whit

This is safe to run, it just creates a few tmp files to load the data.

#!/bin/bash

SCIDB_BIN=/opt/scidb/12.12/bin

OBS_FILE=`mktemp`
chmod a+r $OBS_FILE
cat > $OBS_FILE <<end-of-obs-data
{0}[
(1,2/4/2013,442.316),
(1,2/5/2013,457.844),
(1,2/6/2013,457.35),
(2,2/4/2013,98.26),
(2,2/5/2013,98.44),
(2,2/6/2013,97.7),
(3,2/4/2013,89.15),
(3,2/5/2013,89.74),
(3,2/6/2013,89.79),
(4,2/4/2013,58.03),
(4,2/5/2013,57.72),
(4,2/6/2013,57.59),
(5,2/4/2013,13.27),
(5,2/5/2013,13.42),
(5,2/6/2013,13.52)
]
end-of-obs-data

SYMBOLS_FILE=`mktemp`
chmod a+r $SYMBOLS_FILE
cat > $SYMBOLS_FILE <<end-of-sector-data
{0}[
(1,AAPL,TECH),
(2,CAT,INDUSTRIAL),
(3,XOM,ENERGY),
(4,COP,ENERGY),
(5,DELL,TECH)
]
end-of-sector-data


$SCIDB_BIN/iquery -a -q "remove(obs);"
$SCIDB_BIN/iquery -q "CREATE ARRAY obs <sid:uint64,asofdate:datetime,close:float>  [i=0:*,3,0];"
$SCIDB_BIN/iquery -n -q "LOAD obs from '$OBS_FILE';"
rm -f $OBS_FILE


$SCIDB_BIN/iquery -a -q "remove(obs_asofdate_sid)"
$SCIDB_BIN/iquery -q "CREATE ARRAY obs_asofdate_sid <close:float> [asofdate(datetime)=*,1,0,sid=1:*,1,0];"
$SCIDB_BIN/iquery -a -n -q "redimension_store(obs,obs_asofdate_sid);"

## symbols
$SCIDB_BIN/iquery -a -q "remove(symbols);"
$SCIDB_BIN/iquery -q "CREATE ARRAY symbols <id:uint64,ticker:string,sector:string>  [i=0:*,1,0];"
$SCIDB_BIN/iquery -n -q "LOAD symbols from '$SYMBOLS_FILE';"
rm -f $SYMBOLS_FILE


$SCIDB_BIN/iquery -a -q "remove(symbols_id)"
$SCIDB_BIN/iquery -q "CREATE ARRAY symbols_id <ticker:string,sector:string>  [id=1:*,1,0];"
$SCIDB_BIN/iquery -a -n -q "redimension_store(symbols,symbols_id);"

## join symbols table to observations table
## so that we can calculate aggregates using sector attribute of symbol table
$SCIDB_BIN/iquery -a -q "cross_join(obs_asofdate_sid,symbols_id,sid,id);"

#4

Ah yes, I was bitten by this as well. The problem is that the only dimensions fully supported by SciDB are int64s. uint64s will not work in a cross_join(). Since the datatype of the original attribute is an uint64, so is the new dimension from redimension_store(). Switch them to int64, and everything will work fine.

[gtdev11]~/test/scidb$ ./doit.sh
Query was executed successfully
Query was executed successfully
Query was executed successfully
Query was executed successfully
Query was executed successfully
Query was executed successfully
Query was executed successfully
Query was executed successfully
Query was executed successfully
Query was executed successfully
Query was executed successfully
Query was executed successfully
asofdate,sid,close,ticker,sector
0,1,442.316,"AAPL","TECH"
0,2,98.26,"CAT","INDUSTRIAL"
0,3,89.15,"XOM","ENERGY"
0,4,58.03,"COP","ENERGY"
0,5,13.27,"DELL","TECH"
1,1,457.844,"AAPL","TECH"
1,2,98.44,"CAT","INDUSTRIAL"
1,3,89.74,"XOM","ENERGY"
1,4,57.72,"COP","ENERGY"
1,5,13.42,"DELL","TECH"
2,1,457.35,"AAPL","TECH"
2,2,97.7,"CAT","INDUSTRIAL"
2,3,89.79,"XOM","ENERGY"
2,4,57.59,"COP","ENERGY"
2,5,13.52,"DELL","TECH"

The thing is, I am not sure it wouldn’t work, except the cross_join() code checks to see if all dimensions to be joined are int64, and if not, it then returns

UserException in file: src/query/ops/cross_join/LogicalCrossJoin.cpp function: inferSchema line: 207
Error id: scidb::SCIDB_SE_INFER_SCHEMA::SCIDB_LE_ARRAYS_NOT_CONFORMANT
Error description: Error during schema inferring. Arrays are not conformant.

Which is not actually the problem, but they do tell you which part of the code to look in at least. I suspect if they relaxed the constraint to actually be any integer type, it would work.


#5

wow… they should update the manual. Or better yet, update the code…

Thanks v v much!

-Whit


#6

Hey guys,

The real story is actually even a little more exciting! :smile:
Currently in SciDB there are two types of dimensions. They are called “integer” and “non-integer” (we often use the abbreviation NIDs). There’s actually also a sub-type called “functional NIDs” but let’s save that for a separate thread. What’s the difference between integer dimensions and NIDs? Integer dimensions are your standard vanilla way of organizing array data. I.e. [x=0:,1000000,0]. SciDB uses a lot of facts about the dimension to its advantage. Given a coordinates of a particular cell i.e. [x=123456] we can instantly determine what instance that value is in, and what chunk it’s in. Furthermore if I am joining the array with something like [y=0:,1000000,0] then just the knowledge that the dimensions are the same means that all the data for the join is collocated. So things like join and merge are trivial provided you have organized your data this way.

But of course we started having ideas about supporting non-integers for dimensions. What if the user wants a “stock symbol” or a “gene id” as a dimension? It would be annoying to convert it to an integer. So NIDs were born. And a NID works by taking all of the data in the dimension, sorting it, creating a hidden array that maps from a dimension value to an integer coordinate.

Here’s an example that shows a NID being created and how to examine the hidden NID array:

apoliakov@daitanto:/opt$ iquery -aq "create array foo <val:double> [x=1:10,10,0]"
Query was executed successfully
apoliakov@daitanto:/opt$ iquery -aq "store(build(foo, random()*1.0/100000), foo)"
x,val
1,10030
2,13513.5
3,18940.1
4,550.89
5,8121.08
6,6759.14
7,12824.4
8,11073.2
9,15172.7
10,17514.4
apoliakov@daitanto:/opt$ iquery -aq "create array bar <x:int64> [val(double)=*,10,0]"
Query was executed successfully
#now val is a NID 
apoliakov@daitanto:/opt$ iquery -aq "redimension_store(foo,bar)"
val,x
550.89,4
6759.14,6
8121.08,5
10030,1
11073.2,8
12824.4,7
13513.5,2
15172.7,9
17514.4,10
18940.1,3

apoliakov@daitanto:/opt$ iquery -aq "list('arrays', true)"
No,name,id,schema,availability
0,"bar",5,"bar<x:int64> [val(double)=*,10,0]",true
1,"bar@1",6,"bar@1<x:int64> [val(double)=*,10,0]",true
2,"foo",3,"foo<val:double> [x=1:10,10,0]",true
3,"foo@1",4,"foo@1<val:double> [x=1:10,10,0]",true
4,"NID_5@1:val",7,"not empty NID_5@1:val<value:double> [no=0:9,10,0]",true

#And we can access the NID_5 array using this syntax. It maps the double values to integer values:
apoliakov@daitanto:/opt$ iquery -aq "scan(bar:val)"
no,value
0,550.89
1,6759.14
2,8121.08
3,10030
4,11073.2
5,12824.4
6,13513.5
7,15172.7
8,17514.4
9,18940.1

One important note here is that the algorithm for joining NID arrays is a lot harder. Suppose I was joining the “bar” above with some other array that had a double dimension. “bar” has val = 550.89 at position 0. The other array could have 550.89 at position 500, could be on a different instance, etc. So we haven’t had the time to address NID joins yet. That’s why we advertise NIDs as a “not fully implemented” feature. And we recommend integer dimensions wherever possible. That’s why you are getting the error.

There are some ways to go around it if you really need it. If you know that your NID arrays match exactly, you could use “cast” to perform the join. There are some other hackish workarounds.

The other important note is that we decided that everything thats not an “int64” is a NID. That seems weird. But the rationale is that if you have an integer dimension - you should ALWAYS use int64. You will not get any advantage whatsoever from “int32” or “int16”. The dimension values are not stored to begin with - there won’t be any space saving. So perhaps the better way to do it would be to not allow folks to create arrays with “int16” or “int32” dimensions altogether - just tell them to use int64. I realize that sometimes you may have an int32 attribute going into redimension_store and you could be creating an in32 NID without even thinking about it. We’ll think about what we can do here…

Hope this helps. I’ll try to check up on this thread - see if you have any more questions.