Best way to store data?


#1

Hello,
we are creating a database which will store astronomical data and are not sure of the most optimized way to create and array. We have the properties ra, dec, field and various other properties of the object. What we want to find out is given the ra, dec and radius, return all the objects within that radius. The question I have is what is the best way to create this array to analyze it the quickest such as the number of dimensions, chunk size etc. I am very new to SciDB and am not entirely sure of how it stores data. Thanks.


#2

You have a couple of options. At the moment, the best thing to do is to take your RA/DECL values, multiply them by the precision, and create an array with the integer-ized values as the dimensions:

CREATE EMPTY ARRAY Objects <
Proper_Motion_RA : int32,
Error_in_Motion_RA : int32,
Proper_Motion_DECL : int32,
Error_in_Motion_DECL : int32,
Obs_Epoch : double,
B_mag : double,
B_mag_flag : int32,
R_mag : double,
R_mag_flag : int32,
B_mag2 : double,
B_mag2_flag : int32,
R_mag2 : double,
R_mag2_flag : int32 >
[ RA=0:36000000,144000,8400, DECL=0:18000000,72000,4200];

I’ve used this kind of approach to model (say) the USNOB catalog. Now, conceptually this can be confusing. We’re inclined to think about the array as something physical. But that isn’t how SciDB treats the dimension bounds work. In this example, the RA dimension extends logically over the range 0 to 36000000: it isn’t the case that there are 36000000x18000000 cells in the array. Rather, there are as many cells in this “Objects” array as there are objects in the catalog.

Why are the RA and DECL components of the array dimensions, rather than attributes? The way to think about dimensions is as a kind of “index”. SciDB organizes the physical data in an array according to the array’s dimensions. So as your query involves ranges over the RA/DECL range, choose those as the dimensions.

What does the query look like?

AQL:

SELECT *
FROM Objects O
WHERE O.RA > :RA_Min AND O.RA < :RA_Max AND O.DECL > :smiley:ECL_Min AND O.DECL < :smiley:ECL_Max
AND (O.RA - :RA_Centroid) * ( O.DECL - :smiley:ECL_Centroid ) < (:Region_Range * :Region_Range);

Of course, it’s important to note that there are other ways to organize this kind of data. Rather than viewing astronomical objects by their position, we’ve learnt that folks sometimes want to view the objects by according to the “main sequence”. In which case, the “array space” isn’t RA/DECL, but rather it’s color / intensity.

CREATE EMPTY ARRAY Main_Sequence_Objects <
Proper_Motion_RA : int32,
Error_in_Motion_RA : int32,
Proper_Motion_DECL : int32,
Error_in_Motion_DECL : int32,
Obs_Epoch : double,
B_mag : double,
B_mag_flag : int32,
R_mag : double,
R_mag_flag : int32,
B_mag2 : double,
B_mag2_flag : int32,
R_mag2 : double,
R_mag2_flag : int32,
RA : int64,
DECL : int64 >
[ B_mag=,100000,0, R_mag=,100000,0];

Something to note: The values of the chunk sizes in this array are things I’ve just made up to complete the example.

Hope that helps.


#3

Hi Plumber,

Using RA/DEC as dimensions is really a good idea.
Just a quick follow up question: what is the most efficient way to load this array? (is the sparse loader for SciDB efficient for this case?)

Thanks
-Yushu


#4

To get a nice circular region around a point on a unit sphere involves a bit more complex formula for RA/DEC.

You may have wrap-around problems and you won’t get a nice cone-shaped spatial constraints as the actual extent of an RA/DEC cell depends on (the cosine of) DEC and there is more funny stuff around the poles. I would borrow an idea from the Sloan Digital Sky Survey (SDSS). It uses normalized x,y,z coordinates for Ra/dec. The XYZ space is more well behaved everywhere. All your RA/DEC values are normalized to a unit sphere ||x,y,z||^2 = 1 and then the separation between to objects is more easily computed. Then the cosine of the separation angle between two objects is the dot-product of their (x,y,z) coordinates. So if you store the x,y,z coordinates in addition to ra/dec you can easily get the true angle of separation with the dot-product. You don’t even have to perform the Acos() since when you compare two angles, you can also compare their cosines instead a < b iff cos a > cos b (with the appropriate conditions a,b, < pi, and you don’t care about reflections because it’s an even function)

You can also store a special number that you can use as an index or primary key that encodes location on a unit sphere. A technique used successfully in SDSS builds an index (1D, not 2D) of so-called Hierarchical Triangular Mesh ID numbers. I am thinking about employing similar techniques for building an array for climate data and use HTM stuff for spatial constraints in a query. There is a whole body of work from Alex Szalay’s group about this…

Cheers, George


#5

Hi George,

Do you know if there’s an opensource version of HTM implementation out there? The official one is for MS SQL only and no source.

As an alternative, I heard that there’s something called Q3C, does it also give you and ID for each RA/DEC value?

Thanks,

-Yushu


#6

There’s code for HTM in MySQL here: https://launchpad.net/scisql. A future release will be more compatible (and verified as such) with the original Johns Hopkins libraries.

Q3C is a similar indexing schema with somewhat different accuracy/density properties.


#7

There is also the C# from JHU: skyserver.org/HTM/HtmPrimer/tut_install.html
I wrote this while at JHU.

It has sources in C# and you can use it standalone outside the DB server. Works in Mono in Linux and Mac OS X too.

I have an almost working C++ port, if anyone is interested in it I’d be happy to work with it…
Cheers, George


#8

Three quick questions about using HTM_ID in SciDB:

  1. The HTM_ID (Or Q3C_ID) are int64 numbers. If we use this number as one dimension and build a 1D array, it will be REALLY SPARSE!! Does SciDB handle this efficiently (If it could)?
  2. Assume the super sparse 1D array is practical, how to get all objects within a cone around a postion specified by an HTM_ID? Is this a continuous region (in the HTM_ID) space?
  3. One step further, what is the best way to do cross-matching between 2 of these super sparse 1D arrays? (i.e. find out does an object in array No.1 also shows up in array No.2? and vise versa)

Thanks a lot!
Any help is greatly appreciated.

-Yushu


#9

Indeed, the HTMID space will be sparse, and HTMID numbers in a small neighborhood are not necessarily contiguous. However, a cone (or circle) is represented by a triangular mesh each of which represent a range of HTMIDs and often enough these ranges can coalesce once you put them in sorted order and merge two ranges whose high number is next to the following range’s low number.

Please take a look at this: skyserver.org/HTM/HtmPrimer/tut_covermap.html and related pages to see how we used htmid ranges to approximate any region (including a circle) and how that in turn was used to speed up spatial components of queries. I do plan to experiment with this in SciDB, I would love to share this work and look for challenges…

Now for the three answers:

  1. Everything that I read about SciDB makes me want to say that scidb is very apt at dealing with sparsely populated dimensions. Paul, Alex please chime in
  2. I refer you to skyserver.org you can also write to me, and I’ll send you papers.
  3. Many ways to do cross matching. o1 in array1 matched against objects in array2… Normally, you’d have to compare ra/dec numbers for each pair of objects, which is, of course, combinatorially impossible or at least not very feasible. If you have an associated HTMID for an object, you can immediately consider only those object in the other array with matching HTMids, and this will probably reduce your search candidates to a tiny fraction, depending on how sparse you HTMIDs are (We choose level 20 for practical considerations) Think of the HTMID as a triangular cell (trixel) that serves as a bin to hold all objects in the same neighborhood spanned my the trixel’s footprint…

Here is what a cone search would look like in a relational (SDSS, Alex Szalay et al. SQLServer) database:

The select returns object IDs and distance from the target
@Lat @Lon is the target, center of the search circle
5.5 is arc-minutes, the radius of search

select ObjID, dbo.fDistanceLatLon(@Lat,@Lon, Lat, Lon) as distance
from SpatialIndex
join fHtmCoverCircleLatLon(@Lat, @Lon, 5.5)
On HtmID between HtmIDStart and HtmIDEnd – coarse test
and dbo.fDistanceLatLon(@Lat,@Lon, Lat, Lon) < 5.5 – fine test
order by distance asc

HtmCoverCircleLatLon is a table valued function we wrote that returns HtmID ranges and so that the number of rows is limited to 20 to avoid combinatorial JOIN explosions.
DistanceLatlon() is another function that just uses the standard formulas to return the angular separation between to points expressed in ra/dec or lat/lon.

The notion should carry over to SciDB, I hope. I would like to get some hints and ideas on how to best do this in SciDB. If we succeed we could change how things get done around where I am now in big ways. But the burden of proof is on me, so I hope you all wish me luck :smile:

Sorry for the length of this response, but I am in a bit of a rush, and don’t have the time to be brief :smile:

George