Query performance on different schemas


#1

I have the following flat array holding stock time series:

iquery -q "CREATE ARRAY F < id:int64, day:int64, time:int64, seqnumber:int64, t:int64, tp:double > [ i=0:4999999,1000000,0 ]"

[quote]id: the symbol id
day: the day since some epoch
time: the time in ms since midnight
seqnumber: a sequence number for each stock, starting at 1 and increasing without bound
t: a message type for each entry in the array
tp: a single attribute[/quote]

My problem is that I am getting different performance results depending on how I redimension my target array. Two sample arrays are:

iquery -q "CREATE ARRAY Z < tp:double > [ day=10000:22000, 1, 0, id=0:*,1,0, seqnumber=0:*, 1000000, 1, t=0:2,3,0, time=0:86399999,86400000, 0 ]"

iquery -q "CREATE ARRAY X < tp:double > [ day=10000:22000, 1, 0, id=0:*,1,0, t=0:2,3,0, time=0:86399999,86400000, 0, seqnumber=0:*, 1000000, 1 ]"

I loaded only a single stock, with 4,086,732 cells:

[code]$ time iquery -o dcsv -q "select avg(tp),count(*) from between( X, 15727, 9445, null, null, null, 15727, 9445, null, null, null)"
{i} avg,count_1
{0} 17.3908,4086732

real 0m0.063s
user 0m0.011s
sys 0m0.005s[/code]

[code]$ time iquery -o dcsv -q "select avg(tp),count(*) from between( Z, 15727, 9445, null, null, null, 15727, 9445, null, null, null)"
{i} avg,count_1
{0} 17.3908,4086732

real 0m0.282s
user 0m0.015s
sys 0m0.004s[/code]

On array X I get consistent results in under 70 ms, while on array Z I consistently get results in excess of 260 ms.

What makes Z take a 4x performance hit relative to X?

I do not want to throw anybody off, but the following may have something to do with it… though it still cannot explain that much of a difference.

[quote]Array Z info:
{i} number_of_chunks,number_of_cells,storage_allocated,min_chunk_size,max_chunk_size,avg_chunk_size
{0} 10,8173480,99298504,59652,24000072,9.92985e+06
Array X info:
{i} number_of_chunks,number_of_cells,storage_allocated,min_chunk_size,max_chunk_size,avg_chunk_size
{0} 10,8173480,7842356,86944,1654296,784236[/quote]

The main reason I care is that on the better-performing array, the natural ordering of messages (by sequence number) is out of sync.

Any ideas?
Thanks,
–mike


#2

Mike,

I am not 100% sure what the data looks like, but it could be our run-length encoding. SciDB chunks are stored as run-length-encoded payloads, and the encoding always follows the order of the last specified dimension. So, data like

1 1 1
2 2 2
3 3 3

will, in fact, compress better than

1 2 3 1 2 3 1 2 3

That could be what you are observing; it would explain the size difference, and thus the runtime difference. SciDB is indeed very sensitive to chunk sizing and dimension choice at the moment.
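
For what it's worth, here is a small Python sketch of the idea (not SciDB's actual storage code; the helper name rle_runs is made up): it counts the runs an RLE encoder would emit for the two orderings above, and fewer runs means a smaller encoded payload.

[code]# Rough illustration of run-length encoding (not SciDB internals):
# count the (value, run-length) pairs needed for each linearization.

def rle_runs(values):
    """Number of runs an RLE encoder would emit for the sequence."""
    runs, prev = 0, object()  # sentinel that never equals real data
    for v in values:
        if v != prev:
            runs, prev = runs + 1, v
    return runs

grid = [[1, 1, 1],
        [2, 2, 2],
        [3, 3, 3]]

row_major = [v for row in grid for v in row]                  # 1 1 1 2 2 2 3 3 3
col_major = [grid[r][c] for c in range(3) for r in range(3)]  # 1 2 3 1 2 3 1 2 3

print(rle_runs(row_major))  # 3 runs  -> compresses well
print(rle_runs(col_major))  # 9 runs  -> one run per cell, no savings[/code]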


#3

I am not following… Does that apply to attribute values only, or to dimensions as well?
Are the two arrays you put above dimension indices or attribute values?

In my case, both X & Z have the exact same attribute values.

Just to check, I set the value of attribute tp to 777 for all cells in both X & Z:

[code]$ time iquery -o dcsv -q "select avg(tp),count(*) from between( X, 15727, 9445, null, null, null, 15727, 9445, null, null, null)"
{i} avg,count_1
{0} 777,4086732

real 0m0.099s
user 0m0.012s
sys 0m0.006s[/code]

[code]$ time iquery -o dcsv -q "select avg(tp),count(*) from between( Z, 15727, 9445, null, null, null, 15727, 9445, null, null, null)"
{i} avg,count_1
{0} 777,4086732

real 0m0.306s
user 0m0.011s
sys 0m0.004s
[/code]
The performance ratio remained the same.


#4

Sorry for the confusion. It applies to attributes. Here’s the example that I drew above, but expanded to 1…10,000.

iquery -aq "create array foo <val:double> [x=1:10000,1000,0,y=1:10000,1000,0]"

First, populate it with values that increase along y:

# 1 2 3 ...
# 1 2 3 ...

[code]iquery -anq "store(build(foo, y), foo)"
time iquery -aq "aggregate(foo, avg(val))"
i,val_avg
0,5000.5

real	0m0.596s
user	0m0.015s
sys	0m0.008s[/code]

Now do the same thing along x. It’s the same data, just transposed:

# 1 1 1 ...
# 2 2 2 ...

[code]iquery -anq "store(build(foo, x), foo)"
time iquery -aq "aggregate(foo, avg(val))"
i,val_avg
0,5000.5

real	0m0.095s
user	0m0.016s
sys	0m0.003s[/code]

So (for the very advanced user): placing the last dimension along the direction in which most values are repeated can result in better compression and better performance. We're not sure this will always be the case, but it is how the compression system works right now.


#5

How does that affect smaller sub-arrays?

Why does calculating an average price over a smaller sub-array (1/8,500th the original size) take almost 3 times longer on array Z?

[code]$ time iquery -o dcsv -q "select avg(tp),count(*) from between( X, 15727, 9445, null, 40000000, null, 15727, 9445, null, 40001000, null)"
{i} avg,count_1
{0} 777,561

real 0m0.085s
user 0m0.010s
sys 0m0.005s[/code]

[code]$ time iquery -o dcsv -q "select avg(tp),count(*) from between( Z, 15727, 9445, null, null, 40000000, 15727, 9445, null, null, 40001000)"
{i} avg,count_1
{0} 777,561

real 0m0.647s
user 0m0.015s
sys 0m0.004s[/code]


#6

After looking into it further, I think it is something unrelated to RLE.

It looks like an issue with overlaps.

Leaving everything else the same, but removing the overlap:

iquery -q "CREATE ARRAY Z < tp:double > [ day=10000:22000, 1, 0, id=0:*,1,0, seqnumber=0:*, 1000000, 0, t=0:2,3,0, time=0:86399999,86400000, 0 ]"

iquery -q "CREATE ARRAY X < tp:double >
[
day=10000:22000, 1, 0,                                         
id=0:*,1,0,
t=0:2,3,0, 
time=0:86399999,86400000, 0,
seqnumber=0:*, 1000000, 0
]"			[/code]	

Query times are down to more reasonable numbers, and the order of dimensions does not make any difference:

[code]$ time iquery -o dcsv -q "select avg(tp),count(*) from between( Z, 15727, 9445, null, null, null,  15727, 9445, null, null, null)"
{i} avg,count_1
{0} 17.3908,4086732

real    0m0.080s
user    0m0.008s
sys     0m0.009s[/code]

[code]$ time iquery -o dcsv -q "select avg(tp),count(*) from between( X, 15727, 9445, null, null, null,  15727, 9445, null, null, null)"
{i} avg,count_1
{0} 17.3908,4086732

real    0m0.073s
user    0m0.011s
sys     0m0.005s[/code]

#7

Very interesting.

There are a lot of moving parts at play here.

  • The between() operation itself does not take zero time; it is a bitmask AND operation that extracts a small sub-region from a chunk.
  • The query runs in a special “tile mode”, which has its own issues and is being re-engineered at the moment.
  • Indeed, when overlaps are involved there is another bitmask operation that strips them out, though I am not sure why that alone accounts for the difference; the sketch below gives the general idea of the masking involved.
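
For intuition only, here is a tiny Python model of that per-chunk masking. It is not SciDB's implementation; the names (box_mask, and_masks) and the toy chunk layout are invented for illustration. The point is just that an overlapped chunk carries extra cells that need one more mask pass to strip out:

[code]# Toy model of per-chunk cell masking (NOT SciDB code; all names invented).
# A chunk tracks which of its cells exist; between() ANDs a box mask with
# that bitmap. An overlapped chunk also stores cells on behalf of its
# neighbors, so a further mask is needed to drop those overlap cells.

def box_mask(coords, lows, highs):
    """1 where every coordinate of the cell lies inside [low, high]."""
    return [int(all(l <= x <= h for x, l, h in zip(c, lows, highs)))
            for c in coords]

def and_masks(*masks):
    """Bitwise AND of several per-cell masks."""
    return [int(all(bits)) for bits in zip(*masks)]

# 1-D chunk whose core covers cells 0..7 but which stores 8..9 as overlap.
coords = [(i,) for i in range(10)]
exists = [1] * 10                        # empty-cell bitmap: all present
core   = box_mask(coords, (0,), (7,))    # extra pass: strip overlap cells
query  = box_mask(coords, (3,), (9,))    # the between() box

print(and_masks(exists, query))          # no overlap: one AND suffices
print(and_masks(exists, query, core))    # with overlap: a second AND pass[/code]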

We'll keep an eye out for this sort of thing and see if we can figure it out.