Implementation of "filter", "between" and "subarray"


#1

Hi,
I have recently been doing research related to SciDB. To select a sub-array from an original array (selecting on dimensions only), three operators are available: “filter”, “between” and “subarray”. From reading and experimenting, I gathered that “between” is probably the preferable option. “subarray” may be implemented similarly to “between”, because in my tests they run at nearly the same speed. “filter” is in most cases much slower than “between” or “subarray” for selection on dimensions, but I have also encountered a case where “filter” was faster than “between”. Can anyone describe how these three operators work? And is it really the case that, to select a sub-array, I can simply choose “between” because it is more efficient?


#2

between(…) and subarray(…) can figure out which chunks are “in the zone” using the operator’s parameters, and the target array’s chunking scheme. For example, consider the following array:

  Array A < v > [ I, J ];

   \  I -->
    \  C1                                        C2
 J   \   00     01     02     03     04            05     06     07     08     09
 |    +------+------+------+------+------+      +------+------+------+------+------+
 | 00 |  47  |  52  |  54  |  31  |  10  |      |  19  |  14  |  08  |  07  |  27  |
 v    +------+------+------+------+------+      +------+------+------+------+------+
   01 |  39  |  35  |  53  |  29  |  24  |      |  44  |  28  |  43  |  52  |  22  |
      +------+------+------+------+------+      +------+------+------+------+------+
   02 |  42  |  39  |  36  |  22  |  40  |      |  09  |  12  |  11  |  48  |  31  |
      +------+------+------+------+------+      +------+------+------+------+------+

       C3                                         C4
      +------+------+------+------+------+      +------+------+------+------+------+
   03 |  10  |  32  |  23  |  28  |  28  |      |  52  |  50  |  30  |  54  |  34  |
      +------+------+------+------+------+      +------+------+------+------+------+
   04 |  00  |  12  |  28  |  48  |  20  |      |  26  |  06  |  37  |  16  |  04  |
      +------+------+------+------+------+      +------+------+------+------+------+
   05 |  33  |  07  |  53  |  42  |  24  |      |  19  |  44  |  33  |  46  |  03  |
      +------+------+------+------+------+      +------+------+------+------+------+

This is a 10 x 6 array with a single attribute. It has chunks that are 5 x 3 in size. Suppose you ask the following query:

aggregate ( 
  between ( A, 5, 3, 7, 4 ),
  count(*)
);

SciDB knows that the array’s chunks are each 5 x 3 in size, so it can work out that all the cells the query wants are in chunk C4 (the one that spans A[5,3 -> 9,5], since the indices start at 0). That means it only has to look at one chunk.
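
To make the chunk selection concrete, here is a minimal Python sketch of the arithmetic. It only models the idea and is not SciDB's actual implementation; the function name chunks_overlapping is hypothetical, and it assumes every dimension starts at 0, as in the example array.

```python
from itertools import product

def chunks_overlapping(low, high, chunk_sizes):
    """Return grid positions of the chunks that intersect an inclusive box.

    low, high   -- inclusive lower/upper coordinates, one entry per dimension
    chunk_sizes -- chunk length along each dimension
    Assumes every dimension starts at 0 (an assumption of this sketch).
    """
    ranges = [
        range(lo // size, hi // size + 1)
        for lo, hi, size in zip(low, high, chunk_sizes)
    ]
    return list(product(*ranges))

# between(A, 5, 3, 7, 4): I in [5, 7], J in [3, 4]; chunks are 5 (I) x 3 (J).
hits = chunks_overlapping(low=(5, 3), high=(7, 4), chunk_sizes=(5, 3))
print(hits)  # [(1, 1)]: second chunk along I, second along J, i.e. C4
```

A box that covers the whole array, such as low=(0, 0) and high=(9, 5), would return all four chunk positions instead.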

filter(…), on the other hand, is more general purpose. For example, suppose your query looked like this:

aggregate ( 
 filter ( A, I < v ),
 count(*)
);

In this query, SciDB needs to examine every cell in the array, which is what filter(…) is designed to do.
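
As a rough illustration of why this costs more, the Python sketch below runs the same predicate over the example array's values. It is a model of the per-cell evaluation only, not of how SciDB executes filter(…); the names A_ROWS and filter_count are made up for this example.

```python
# Cell values of the example array A, stored as A_ROWS[J][I].
A_ROWS = [
    [47, 52, 54, 31, 10, 19, 14,  8,  7, 27],
    [39, 35, 53, 29, 24, 44, 28, 43, 52, 22],
    [42, 39, 36, 22, 40,  9, 12, 11, 48, 31],
    [10, 32, 23, 28, 28, 52, 50, 30, 54, 34],
    [ 0, 12, 28, 48, 20, 26,  6, 37, 16,  4],
    [33,  7, 53, 42, 24, 19, 44, 33, 46,  3],
]

def filter_count(rows, predicate):
    """Evaluate predicate(i, j, v) on every cell; return (examined, matched)."""
    examined = 0
    matched = 0
    for j, row in enumerate(rows):
        for i, v in enumerate(row):
            examined += 1  # the predicate depends on v, so no cell can be skipped
            if predicate(i, j, v):
                matched += 1
    return examined, matched

# Model of: aggregate(filter(A, I < v), count(*))
examined, matched = filter_count(A_ROWS, lambda i, j, v: i < v)
print(examined, matched)  # examined is 60: every cell of the 10 x 6 array
```

Because the predicate reads the attribute v, the chunk boundaries give no way to rule out any chunk in advance.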

Bottom line? If you know the region of the array you care about, where the region is defined in terms of dimension index values, then use the operators that are designed to work with dimensions: between(…), subarray(…), cross_join(…), etc. But if your query involves an expression that needs to be evaluated once per cell, because it involves an attribute, say, then you need to use filter(…). You can always rewrite a between(…) as a filter(…), but the filter(…) version will almost always be less efficient than the between(…), because it has to examine every cell instead of skipping chunks.
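
The difference in work can be sketched in Python by counting how many cells each approach has to touch for the same box. This is a simplified model under stated assumptions (a dense 10 x 6 array, 5 x 3 chunks, dimensions starting at 0, a box that lies inside the array); the function names are hypothetical and do not correspond to anything in SciDB.

```python
CHUNK_I, CHUNK_J = 5, 3   # chunk size along I and J
SIZE_I, SIZE_J = 10, 6    # array extent along I and J

def in_box(i, j, low, high):
    """True if cell (i, j) lies inside the inclusive box low..high."""
    return low[0] <= i <= high[0] and low[1] <= j <= high[1]

def count_like_filter(low, high):
    """Test the box condition on every cell of the array."""
    examined = matched = 0
    for j in range(SIZE_J):
        for i in range(SIZE_I):
            examined += 1
            matched += in_box(i, j, low, high)
    return examined, matched

def count_like_between(low, high):
    """Visit only the chunks that overlap the box, then test their cells."""
    examined = matched = 0
    for ci in range(low[0] // CHUNK_I, high[0] // CHUNK_I + 1):
        for cj in range(low[1] // CHUNK_J, high[1] // CHUNK_J + 1):
            for i in range(ci * CHUNK_I, (ci + 1) * CHUNK_I):
                for j in range(cj * CHUNK_J, (cj + 1) * CHUNK_J):
                    examined += 1
                    matched += in_box(i, j, low, high)
    return examined, matched

# Same selection as between(A, 5, 3, 7, 4), written both ways.
print(count_like_filter((5, 3), (7, 4)))   # (60, 6)
print(count_like_between((5, 3), (7, 4)))  # (15, 6)
```

Both versions find the same 6 cells, but the chunk-aware version touches 15 cells instead of 60. On a real array with many chunks the gap would be correspondingly larger.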

Hope this helps!


#3

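There are two major kinds of sub-selection operations in SciDB: those that evaluate an expression against every cell, and those that use dimension ranges to open only the chunks of interest.

Given a SciDB array:

array X1 <attr1: string, attr2: float> [dim1 = ..., dim2 = ...]

the same selection can be written using only the first kind (filter):

filter(X1, attr1 = 'Bob' AND attr2 < 33.5 AND dim1 <= 32)

or by moving the dimension condition into the second kind (between):

between(filter(X1, attr1 = 'Bob' AND attr2 < 33.5), NULL, NULL, 32, NULL)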

between is typically faster when you are selecting a contiguous range along the dimensions (for the reasons described in @plumber’s response). However, when you are selecting a non-contiguous set of elements, you might need to use filter.

subarray, cross_join and cross_between all fall into the second, faster class of selection operations (i.e. those that only open the chunks of interest).