bestMatch Prototype


#1

I’m not familiar with C++ so reading the code is a little bit tough for me. Does any know how to use the bestMatch prototype? I have 2 arrays in an xyzt-coordinate [x=-6371:6371,1000,0, y=-6371:6371,1000,0, z=-6371:6371,1000,0, t=0:*,86400000,0] (basically trajectories of 2 satellites), and I want to use the bestMatch operator to get the pair of points that’re closest in the x,y,z,t coordinates.

Thank you,
Khoa


#2

Hi Khoa,

The status of the prototype is, unfortunately, indeterminate. We know it’s worked in the past for some toy examples. We know that, because of our architecture, we are well-positioned to deliver such an operator. The problem is, it just hasn’t been priority so far, so we haven’t looked at it. I know for sure this prototype is not on the feature list for the next release 13.12, for example.

Of course, Paradigm4 has, in the past, shuffled priorities around in response to customers’ requests. Moreover, for a standalone component like this, we might be very happy to consider patch submissions from interested developers out there.


#3

Thanks Alex. I’ve taken some time to understand the bestMatch code, so I know what it’s doing right now. We’re actually writing a paper to evaluate an algorithm in MapReduce and scidb. Paul Brown suggested us to use the bestMatch in our queries, and it actually fits quite nicely. However, there’ve been problems in piping operators together. For example, when doing count(bestMatch(…)), it’s giving:

SystemException in file: src/array/MemArray.cpp function: operator++ line: 2257
Error id: scidb::SCIDB_SE_EXECUTION::SCIDB_LE_OPERATION_FAILED
Error description: Error during query execution. Operation 'setPosition' failed.
Failed query id: 1101711436702

So we had to do 2 steps, one is to save the result of bestMatch into an intermediate array (store(bestMatch(…), a), and one to apply the rest of the query on this array a.

One thing, though, is that we also try to implement a similar algorithm in MapReduce, and the result is actually better, about 4 times less in running time.

-Khoa


#4

Hi Khoa, thanks for the update.

It is a curious finding. Theoretically, SciDB should do well at this task because of chunking. Chunking organizes data better so that when we need to do a match, we only need to look at two chunks. It is a good indexing strategy. That strategy should provide an algorithmic advantage.

However, there are other things to consider like chunk sizing, thread and memory tuning settings, the fact that this code is an old prototype and has known bugs in it, and, as you mentioned, the overhead of storing the matches.

So, it looks like there may be room for improvement if you look at the problem carefully… Always feel free to post your config, hardware, schemas and queries up here and folks can take a look and make suggestions. Alas, I don’t have a whole lot of time to spend on this but I’ll try to help if I can.


#5

The implementation in Hadoop actually uses a very similar Indexing Strategy, i.e. collocating of corresponding chunks, so it should be similar to Scidb in some way. I didn’t actually expect it to run faster than SCIDB, but the fact the it did raises some questions for sure. In the next response, I will post more information related to my work.