SciDB pairwise distance query performance


#1

Hi, sorry for possibly stupid question but I am looking for a database which would support an indexed* range query between two (even better between multiple) elements in Cartesian space.

[size=11](* by indexed, I mean implementing any feature (index, tree, bitmapping…) that would speed up the search comparing to exhaustive search)[/size]

I.e., I need to SELECT all pairs with specific mutual distance (specified as a range, ideally including uncertainty, but I guess that this would be too much to ask).

Problem:
I have a table of ~x billion points in a Cartesian space and a norm (can be simply L2 norm --> distance) and need to find all pairs that have the norm between <x1, x2>.
Obviously, exhaustive search will need to go through x^2/2 comparisons - practically not feasible. Linear index is not feasible either (too many entries). R/R*-tree derivative will help (limiting the comparisons), but still involve quite too many comparisons.

I have noticed that SciDB “Optical Astronomy (relational)” use-case (scidb.org/UseCases/Astronomy%20in%20ArrayDB.pdf) has the following example:

Find stars with stellar neighbor within distance x where at least one of the stellar  neighbors has the colors of a white dwarf
SELECT S1.objectId AS s1, S2.objectId AS s2
  FROM Star S1 -- S1 is the white dwarf
  JOIN Star S2 -- S2 is the second star
  WHERE SPH_DIST(S1.raDecl, S2.raDecl) < 5 -- arcseconds
    AND S1.ugColor < 0.4 -- and S1 meets Paul Szkody's color cut
    AND S1.grColor < 0.7 -- for white dwarfs
    AND S1.riColor > 0.4 
    AND S1.izColor > 0.4;

… but I am not sure how this SELECT is evaluated.

As far as I am aware, databases I have looked at do not support this type of query directly, only querying by a distance to a specific fixed point (e.g. postGIS), but maybe I’ve overlooked something.

Questions:

  1. Could you please briefly explain (or point me towards the appropriate documentation), how will SciDB evaluate such query?
  2. Do you have any idea how to quickly perform the query I am describing(pairwise distance query)?
  3. In general - can SciDB help me with this? :smile:

Thank you!
Regards,
Ondrej


#2

SciDB can definitely handle this type of query.

I am better at working with AFL (the query you presented is in AQL). The AFL query would look something like this:

project (filter (cross(Star as S1, Star as S2), S1.objectID != S2.objectID and abs(S1.distance - S2.distance) < 5 and S1.ugColor < 0.4 and S1.grColor < 0.7 and S1.riColor > 0.4 and S1.izColor > 0.4), S1.objectID as S1, S2.objectID as S2)

To view the query execution plans, the iquery client has an option that shows the logical and physical plans. You could use the following iquery command to see the plans:

$ iquery -r /dev/null -avq "
>    project
>       (filter
>          (cross(Star as S1, Star as S2),
>       S1.objectID != S2.objectID and
>       abs(S1.distance - S2.distance) < 5 and
>       S1.ugColor < 0.4 and S1.grColor < 0.7 and
>             S1.riColor > 0.4 and S1.izColor > 0.4),
>    S1.objectID as S1, S2.objectID as S2)"

The parameters for iquery are as follows:
[ul]
[] r – sends the output to a file (in this case to /dev/null so it won’t scroll a potentially large dataset across the screen)[/]
[] a – indicates that the query is in AFL (default is AQL)[/]
[] v – prints out the query plans.[/]
[] q – indicates that the query follows, between the double quotes[/]
[/ul]

Note that the performance on this query won’t be very good – it is n^2. We do have a prototype implementation of bestmatch in the SciDB source (in the examples folder). However, the prototype is currently broken. There are some people working on it, and when it is ready, we’ll let everyone know.