Questions about the searching function of SciDB


#1

Hi:

I’m working on handling astronomical data using SciDB. The installation and data loading were successful.

But I found that when I select a single cell by attribute value using the filter() function, the performance is confusing.

For example, I have 5,000 cells in a single array, and I select a cell by its ID attribute: filter( testArray, id=123456 )
I believe that filter() itself runs very fast, but it takes a long time to show me the result, because it prints all of the other null cells (the ones that do not satisfy the search condition) at the same time. The result looks like this:
[[[()]]]; … … [[[()]]];[[[()]]];[[[(1206005843,291721349,671160980,256.637,-87.8129,0.02, true)]]];[[[()]]]; … … [[[()]]]
This display process can take 20 seconds.

I would like to know whether I am using filter() in a wrong way, or whether there is another function I don’t know about. How can I search a large data set and get good performance? Thanks a lot!


#2

Can you share the schema and the queries with us, wangruntao?

But - some thoughts.

  1. It seems, from the output that you are reporting, that you’re using a chunk size of 1. That’s what I get from the [[[()]]]. Our design point for the system was larger chunks, containing (say) 8,000,000 cells.

  2. filter() scans all of the data, so it isn’t hugely fast (we don’t currently support any indexing on attribute values). Rather, the idea is to optimize for performance in situations where the predicate is applied to the dimensions, using operators like slice() or subsample(). It isn’t clear from your example quite what you’re up to (what are the three dimensions, for example?), but I suspect you might be better off using dimensions rather than attributes (might not be possible with 0.75 - but let’s see); see the sketch after this list.

  3. We make a distinction, in SciDB, between cells that are “empty”, and cells that contain attributes with “null” values. What you see here, with the [[[()]]] isn’t a “null”. Rather, it says that this is a chunk, and that the chunk contains a single “empty” cell.

We need to keep “empty” as a data model concept because the arrays we manage might be either sparse (meaning there are cells in the array that can never contain any values) or dense (in which case there cannot be any “empty” cells). And our ideas for “null” are also more elaborate than what you’re no doubt familiar with from SQL. Scientific data is rarely “null” in the classic three-valued-logic sense. Rather, it’s “missing”, or “to be ignored”, or “requires special handling”. So while we do support “null” (and will go even further in the next release), we’re trying something different from SQL here. Users will (in the next release) be able to assign an out-of-band code to attributes, which they can then use to track the “missing” information and take appropriate steps with it.
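To make the dimensions-versus-attributes point concrete, here is a minimal sketch against a hypothetical 3-D array (the names testArray, x, y, z and the coordinates are made up; operator availability may vary by release):

filter( testArray, id = 123456 )            -- attribute predicate: every chunk must be scanned
slice( testArray, z, 42 )                   -- dimension predicate: reads only the chunks where z = 42
subarray( testArray, 0, 0, 40, 9, 9, 45 )   -- reads only the chunks covering the box (0,0,40) to (9,9,45)

Because the dimension-based forms know which chunks can possibly contain the answer, they never touch (or print) the rest of the array.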

But - it would really help us to see your schema, and get some idea about your data.

Thanks!


#3

Thanks for your reply, plumber!
As you suggested, I’ll share my application background and the schema I designed with you, hoping you can help me!
I’m working on loading 170,000 astronomical observation records into SciDB, and then comparing its performance with a traditional database.
Each record contains some attributes of one star; one example is as follows:

oid | htmid | hpid | RA | DEC | d(pix) | X(029.fit) | Y(029.fit) | X(007.fit) | Y(007.fit) | mag(r=3) | err3 | mag(r=4) | err4 | mag(r=5) | err5
1206005843 | -774238732 | 738342944 | 283.72745 | -87.600861 | 0.41 | 59 | 6670 | 3036 | -422 | 4.609 | 0.001 | 4.561 | 0.001 | 4.556 | 0.001

The oid is the Unix timestamp of the observation time. The htmid and the hpid are calculated from RA and DEC. The other attributes are not so important, so I won’t introduce them in detail.
Because the 170,000 records are actually records of 10,000 stars, I need to find the 10,000 stars according to their htmid, hpid, RA and DEC.

These records are simply divided into different files and folders by their observation time and date. Say, one file contains records with the same date and time, and one folder contains files with the same date.

My initial design is:
create array starTest < oid:int32, htmid:int32, hpid:int32, rra:double, ddec:double, dpix:float, x029:int32, y029:int32, x007:int32, y007:int32, mag3:float, emag3:float, mag4:float, emag4:float, mag5:float, emag5:float >[x=0:*,1,0, y=0:*,1,0, z=0:*,1,0]

That means that one line with the same y and z contains the records of one file, and one section with the same z contains the records of one folder.

Then I loaded some files for testing, and when I execute “filter( starTest, htmid=-774238732 )”, I get the problem described above.

Do you have any better idea for my application? How should I design the array to get good performance?

Thanks a lot!


#4

Thanks for the detailed response!

We seem to be slipping and slopping from forum to forum. :frowning:

Some thoughts:

  1. First, your create array ... statement. What you’re doing here is to create a heckuva lot of chunks:
create array starTest < oid:int32, htmid:int32, hpid:int32, rra:double, ddec:double, dpix:float, x029:int32, y029:int32, x007:int32, y007:int32, mag3:float, emag3:float, mag4:float, emag4:float, mag5:float, emag5:float >[x=0:*,1,0, y=0:*,1,0, z=0:*,1,0]

The dimension specification - [x=0:*,1,0, y=0:*,1,0, z=0:*,1,0 ] says, “This array has three dimensions, x, y and z, which begin at 0 but are unbounded in length. I want SciDB to decompose this array into chunks which are each of length 1 in the x dimension, 1 in the y dimension, and 1 in the z dimension, and I want these chunks to have no overlap”. In other words, you will have at most 1 “cell” in each chunk.

The problem with this is that you’re creating one chunk per attribute per cell: with 16 attributes and 170,000 records, that’s on the order of 16 x 170,000 chunks. One of the design points we adopted in SciDB was giving users the ability to put everything that is “close together” in logical space (in this case, the [ x, y, z ] space) close together in storage. That way, if you want to see (for example) what’s around you in any of the three dimensions, you’re pretty much guaranteed that the data will be in the chunk you’re starting with. (See the sketch after the subarray() example below for how this applies to your array.)

Second - unbounded dimensions won’t get you the best results. SciDB supports them, but they make some calculations hard. In the next release we will make it considerably easier to work with data where you don’t know the dimension bounds in advance. I assume you’re working with data in the RA / Decl space. I used SciDB to load (some of) the USNO-B catalog, and I used the following schema:

CREATE EMPTY ARRAY Objects
< Proper_Motion_RA : int32,
  Error_in_Motion_RA : int32,
  Proper_Motion_DECL : int32,
  Error_in_Motion_DECL : int32,
  Obs_Epoch : double,
  B_mag : double,
  B_mag_flag : int32,
  R_mag : double,
  R_mag_flag : int32,
  B_mag2 : double,
  B_mag2_flag : int32,
  R_mag2 : double,
  R_mag2_flag : int32 >
[ RA=0:35999999,144000,8400, DECL=0:17999999,72000,4200 ]

The idea is to divide the “space” into a 250 x 250 grid of chunks, each covering a fixed patch of the sky (at this scaling, about 1.4 degrees of RA by 0.7 degrees of DECL). Then if you want queries like:

subarray ( Objects, (126.734*10000), (-43.3245 * 10000), (126.956*10000), (-41.9483*10000) );

We can very quickly find you everything in the specified “region of the sky”.
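To apply the same idea to your starTest array: keep your three dimensions, but use much larger chunks so that many records land in each one. A sketch, where the chunk lengths (10000 along x, 10 along y and z) are purely illustrative guesses; the right values depend on how many records you have per file and files per folder:

create array starTest
< oid:int32, htmid:int32, hpid:int32, rra:double, ddec:double, dpix:float,
  x029:int32, y029:int32, x007:int32, y007:int32,
  mag3:float, emag3:float, mag4:float, emag4:float, mag5:float, emag5:float >
[ x=0:*,10000,0, y=0:*,10,0, z=0:*,10,0 ]

That puts up to 10000 x 10 x 10 = 1,000,000 cells in each chunk, much closer to the design point mentioned earlier, so a dimension-based query touches a handful of chunks instead of one chunk per record.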

  2. SciDB exploits arrays to get our performance. Unlike SQL DBMS engines, we don’t (yet) have things like B-trees over attributes (although we completely expect to get to them). For your query:
filter( starTest, htmid=-774238732 )

Because this is a filter over an attribute, we are obliged to scan the whole array.

In the next release, we will have functionality that lets you do a better job using arrays. But … that’s not there yet.
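In the meantime, one way to cut down both the scan and the flood of empty cells in the output is to restrict the attribute predicate to a region of the dimension space first. A sketch (the box coordinates here are made up; you’d use whatever x/y/z region you know the record falls in):

filter( subarray( starTest, 0, 0, 0, 9999, 9, 9 ), htmid = -774238732 )

subarray() reads only the chunks that cover the box, so filter() scans (and prints) cells from that region alone.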