Query Performance


#1

Recently I executed a couple queries on an array, filtering the data in order to produce variable sized result sets. They query is basically an aggregation, and a filter specifying a range for my dimensions.The query execution time remains the same, no matter if my filter specifies a very small range for the dimensions (resulting in an empty subarray), or if the range is large enough to encompass the entire array. So, I’m just trying to understand why the query execution time doesn’t vary accordingly the size of my result set.

My array has 13 attributes, five dimensions, and about 200 million cells.
Is this too little data, so the I/O necessary to execute the query is meaningless compared to the time required to filter it?

I’m assuming that SciDB selects the chunks that fit the requirements of the query, and then, selects the cells in these chunks whose values also match the filter. Does It mean that for every query, SciDB needs to verify each chunk in all nodes to select the right ones? Or does SciDB maintain the chunks ordered in some structure that allows them to be selected without the need of a complete scan? (I’m wondering if it’s even possible)


#2

Hello,

I am guessing you are using the filter() operator, which does evaluate a boolean predicate over every value. However, if you are filtering by dimension values, using operators between(), subarray() and slice() then scidb will eliminate certain chunks from the query quickly.

We have a decent video presentation on the subject. Check out:
youtube.com/watch?v=SsF_Mke0Mlw

Expand the table of contents and select
1:38:28 Filtering Operators: project, filter, between, slice, subarray

Let me know if this helps. Thanks.

  • Alex Poliakov

#3

Yes, it helped a lot. I was using the filter operator and a WHERE clause sometimes, but since I was filtering only dimensions I thought there would be no diferences. I was wrong, the subarray operator was faster, and the between operator was even faster, as expected.

I was wondering if there’s an array operator that provides way to filter a certain unrelated indexes for a dimension. Something like a IN in ordinary SQL:

Select * from Myarray where MyDimension in (val1, val2, val3, …, valN);

The only way to express this is by using OR clauses, right?


#4

For dimension values, consider cross_join:

{i} schema
{0} 'foo<val:double> [x=1:10000,1000,0,y=1:10000,1000,0]'

##Slow, naive way:
$ time iquery -aq "consume(filter(foo, x=12 or x=5123 or x=8890))"
Query was executed successfully

real	0m3.162s
user	0m0.014s
sys	0m0.006s

#Faster:
$ time iquery -aq "consume(cross_join( foo, redimension(apply(build(<x:int64> [i=1:3,3,0], '[(12),(5123),(8890)]', true), f, bool(true)), <f:bool>[x=1:10000,1000,0]) as B, foo.x, B.x))" 
Query was executed successfully

real	0m1.481s
user	0m0.011s
sys	0m0.008s

Check out the cross_join section in the video. The advantage will vary on a case-by-case basis. Give it at try.