 # Looking for a best Cross Matching solutions

#1

There are two tables, table A and table B. Table A and table B have three field, ‘id’, ‘alpha’,‘delta’. And they all have more than 50,000 records.like:
table A: ``` id alpha delta 1 95.259888 61.118385 ... ... ```
In fact, each record in the table represents a position of a star.we take table A and table B to cross-match. Looking for the same star by the records in table A and table B. For example, every record in table A need to calculate with all records in table B.The formula is：

```Distance = acos(sin(A.alpha)*sin(B.alpha) +cos(A.alpha)*cos(B.alpha)*cos(|A.delta-B.delta|) );```
if the distance < 0.008, so the records in table A and table B are describe the same star.
The result table like this : ``` ida alpha delta idb alpha delta error 1 95.259888 61.118385 2 95.260040 61.117493 0.000895677 ```

Now i want to solve this question with SciDB,I was a novice，could someone give me a good solution?

#2

So here’s how we can interpret your problem:

We start with two flat scidb arrays:

``````A <alpha:double, delta:double, id:int64> [i=0:*,500000,0]
B <alpha:double, delta:double, id:int64> [i=0:*,500000,0]``````

Here’s a naive n-squared] method of doing your calculation in scidb:

``````filter (
cross ( A, B),
acos(sin(A.alpha)*sin(B.alpha)+cos(A.alpha)*cos(B.alpha)*cos(|A.delta-B.delta|)) < 0.008
);``````

That’s it. That should spit out only the objects that match.

But, depending on the nature of your data, there could be a better solution. We could, first, redimension A and B into a different set of dimensions, and then perform a spatial join between them. If you can come up with a rounding scheme that rounds your alpha and delta coordinates to integers, you could say:

``````create array A_projected <alpha:double, delta:double, id:int64> [rounded_alpha= ? , rounded_delta= ? ]
create array B_projected <alpha:double, delta:double, id:int64> [rounded_alpha= ? , rounded_delta= ? ]

--Careful: this will not tolerate more than once cell for each unique {rounded_alpha,rounded_delta} pair.
redimension_store(
apply(A,
rounded_alpha, ...
rounded_delta, ...
),
A_projected
);

redimension_store(
apply(B,
rounded_alpha, ...
rounded_delta, ...
),
B_projected
);

--Now we can use join, not cross.
filter (
join ( A_projected, B_projected),
acos(sin(A.alpha)*sin(B.alpha)+cos(A.alpha)*cos(B.alpha)*cos(|A.delta-B.delta|)) < 0.008
);``````

So, if you can convert alpha and delta to integers such that these three conditions are met:

• no two objects in A have the same {rounded_alpha, rounded_delta}
• no two objects in B have the same {rounded_alpha, rounded_delta}
• if A.{rounded_alpha, rounded_delta} = B.{rounded_alpha, rounded_delta} then the objects are at a distance of .008 or greater.

Then you can use this faster solution.
Take a look, I’m interested to hear how it works for you.

• Alex Poliakov

#3

I am guessing, tagging each object with an htmid would speed up cross matching tremendously.
HtmID is what Sloan Sigital Sky Survey (SDSS) uses for spatial indexing. I worked on skyserver.org/htm/ while there, and
am trying to see if there is an interest in adding this kind of spherical spatial reasoning to a SciDB plugin. Level 3 Earth science products all use some global gridding scheme which all suffer from rips and distortions. I am working on a 3D digital spherical image processing toolkit based on trixels (triangular pixels) over this mesh. I already have the basic topological functions (adjacency, neighborhood) worked out, and want to see if SciDB’s data partitioning to … but I digress.

The advantage over HTMID tagging is that you don’t have to acos(sin(A.alpha)*sin(B.alpha)+cos(A.alpha)*cos(B.alpha)*cos(|A.delta-B.delta|)) for most pairs of A,B’s when you can compare their HTMIDs and immediately know that the separation between them is much bigger then your threshold. YOu can eliminate 98% of all comparisons, and do the fine grain stuff only on the remaining 2% (your mileage may vary)

Just a thought …

George

#4

Thank you for all. Now i am attempting to solve this question use apoliakov’s method. First i use cross method.

``````create array A <id:int64,alpha:double, delta:double> [i=0:*,50000,0];
create array B <id:int64,alpha:double, delta:double> [i=0:*,50000,0];

apply ( filter (
cross ( A, B),
acos(sin(A.alpha)*sin(B.alpha)+cos(A.alpha)*cos(B.alpha)*cos(abs(A.delta-B.delta)))<0.008
),error,acos(sin(A.alpha)*sin(B.alpha)+cos(A.alpha)*cos(B.alpha)*cos(abs(A.delta-B.delta))));
``````

but there has a problem.
alpha and delta 's data type is double.But after i load the data in file.Data will always be rounded, such as 95.259888 load as 95.2599.i don’t know why .
In fact, my data requires very precise.how can i solve it.

To the " spatial join" method,i’m very sorry to apoliakov.I don’t konw what does “If you can come up with a rounding scheme that rounds your alpha and delta coordinates to integers” mean,and i don’t konw how to write:

``````create array A_projected <alpha:double, delta:double, id:int64> [rounded_alpha= ? , rounded_delta= ? ]
create array B_projected <alpha:double, delta:double, id:int64> [rounded_alpha= ? , rounded_delta= ? ]``````

can you give me a simple example? Thank you.

gfekete’s idea is very interesting.I will continue to understand your idea.if i have some question,i ask you again.Thank you.

#5

I still have some other problem.
I need to know the time that SciDB load data and calculate had cost.
Because my goal is test the performance of SciDB,then to promotion SciDB to the traditional database users.
Thank you.

#6

I think it’s just iquery outputting rounded data. We’ve had a few complaints about this before. iquery tries to make output “look pretty” and rounds floating points to a few significant digits. The data should be loaded fine and you should be able to use algebraic expressions to confirm that…

#7

I am planning to write and add a plugin to process and manipulate regions of interest on the surface of the sphere.
This would be a reimplementation of the HtmID concept already used in SDSS and others. I’d like to continue this dialog, maybe I can replicate your data on our SciDB cluster and can play with your data, your query and my plugin. Looks like a win-win to me Cheers, George

#8

I’m very sorry to apoliakov,i still do not know what does “rounded_alpha” and “rounded_alpha” mean? How to set their values？What syntax is in SciDB.Thank you. gfekete,thank you very much. I only have some test data,if you need i will give you. If i know your email,i will send it to you in few days.

#9

Let me give you a hypothetical example.

``````create array A <id:int64,alpha:double, delta:double> [i=0:*,50000,0];
create array B <id:int64,alpha:double, delta:double> [i=0:*,50000,0];``````

We can do the following:

``````create array A_redim <a_id:int64 null, a_alpha:double null, a_delta:double null> [rounded_alpha=...,...,...,rounded_dela=...,...,...];
redimension_store(
apply (
A,
a_id, id, a_alpha, alpha, a_delta, delta,
rounded_alpha, int64(alpha * ROUNDING_FACTOR),
rounded_delta, int64(delta * ROUNDING_FACTOR)
),
A_redim
);
create array B_redim <b_id:int64 null, b_alpha:double null, b_delta:double null> [rounded_alpha=...,...,...,rounded_dela=...,...,...];
redimension_store(
apply (
B,
b_id, id, b_alpha, alpha, b_delta, delta,
rounded_alpha, int64(alpha * ROUNDING_FACTOR),
rounded_delta, int64(delta * ROUNDING_FACTOR)
),
B_redim
);

--Perform an "outer spacial join" semantic:
store(
merge(
apply(A_redim, b_id, int64(null), b_alpha, double(null), b_delta, double(null)),
project( apply( B_redim, a_id, int64(null), a_alpha, double(null), a_delta, double(null)), a_id, a_alpha, a_delta, b_id, b_alpha, b_delta)
),
AB_redim_join
);

-- NOTE: if there is no pair a,b within WINDOW_SIZE of each other, then distance would be computed as null and fail the test.
filter(
apply(
window(
AB_redim_join, ALPHA_WINDOW_SIZE, DELTA_WINDOW_SIZE,
max(a_id) as a_id, max(b_id) as b_id, max(a_alpha) as a_alpha, max(a_delta) as a_delta, max(b_alpha) as b_alpha, max(b_delta) as b_delta
),
distance,  acos(sin(a_alpha)*sin(b_alpha)+cos(a_alpha)*cos(b_alpha)*cos(abs(a_delta-b_elta))))
),
distance<0.008
);``````

So you can have something like this. Again this is something I threw together quickly. You need to determine the ROUNDING_FACTOR and then you need to determine ALPHA_WINDOW_SIZE and DELTA_WINDOW_SIZE based on that. BUT, this ONLY works IF no two objects from A can be inside the same ROUNDING_FACTOR “cell”, and no two objects from B are inside the same ROUNDING_FACTOR “cell”. That depends on your data… which is why I mentioned, it may or may not work for you.