How to implement join?


#1

I’m a freshman called Cherry from China. I want to ask 6 questions.

  1. Do JoinArray and crossJoinArray implement with Hash Join or any other join?
  2. What’s the differences between JoinArray and crossJoinArray according to the usage?
  3. What’s the differences between Physical Join and Logical Join?
  4. Does Join Array run all in memory?
  5. Is there any refenences to JoinArray in SciDB,such as paper and so on?
  6. How to create a array,such as [(1,3),(2,5),(3,null),(null,null)]?

Thank you very much.


#2

Hi Cherry!

Just a quick pre-amble. Most of your questions can be answered by reading either the documentation (where the join(…) and cross_join(…) operators are discussed, and where the build(…) operator is discussed) or by looking at the code (where your question about the implementation of join(…) can be answered). But … assuming these questions are motivated by curiosity and aren’t just homework… :wink:

  1. join(…) and cross_join(…) use different algorithms. join(…) is a kind of sort-merge join. We are able to do this because we are always assured that the inputs to the joins are sorted on the join keys, because the join keys are always array dimensions. It’s worth noting that the cross_join(…) actually implements a hash-join locally (per-instance) and a kind of semi-join globally, where all of the join keys (from the smaller side) are replicated on all of the compute instances.

  2. join ( A, B ) requires that the shapes of A and B are identical (that is, that A and B have the same number of dimensions, and the dimensions align) and it produces a result array C that has the same shape as A and B, but with the attributes of A and B concatenated. Further, there is only a cell in C if there are cells in both A and B at the same location. cross_join ( A, B ) produces an array C that has as many dimensions as there are in both A and B. If A is a 1D array, and B is a 2D array, then C will be a 3D array. If you specify additional dimensions in cross_join(…), then we reduce the number of dimensions in C. Consult the fine manual …

  3. Like most DBMSs, SciDB breaks query processing up into phases; parse, plan generation, logical planning, physical planning, execution. “LogicalJoin” plays a part in the plan generation and logical planning phase, where we figure out what are the shapes of the arrays being passed up the query plan. “PhysicalJoin” is the run-time execution of the join operator, as well as things like checking for interrupts and errors.

  4. No.

  5. No. It’s not particularly novel.

  6. Lots of ways. Have a look at the build(…) operator, and specifically, the options it gives you for building an array from a literal string.

Hope this helps!


#3

Thanks. But I still have a question. I read the ‘SciDB-Reference-v14-12.pdf’. It’s said below:
Deprecation Notice: The join() operator will be removed from a future release. Use cross_join()
instead.

I guess that it is because cross_join could instead join in terms of function. join(Array1, Array2) is equal to cross_join(Array1, Array2, Array1.row, Array2.row, Array1.col, Array2.col). Am I right? Is there any other reasons?

AFL% create array Array1< value:string NULL DEFAULT null, value2:string NULL DEFAULT null >[row=1:3,3,0,col=1:3,3,0]; AFL% create array Array2< value:string NULL DEFAULT null >[row=1:3,3,0,col=1:3,3,0]; AQL% insert into Array1 '[[()()(a7,5)][()(a9)()][()()()]]'; AQL% insert into Array2 '[[()()(33333333)][()(555555555)()][(777777777)()()]]'; AFL% store(join(Array1, Array2), Array3); AFL% cross_join(Array1, Array2, Array1.row, Array2.row, Array1.col, Array2.col);


#4

That’s correct.

The algorithm decision … between hash and merge-join … becomes an optimization question, rather than a choice of operator.


#5

Thanks a lot, Plumber. I got it.


#6

[quote=“plumber”][quote=“Cherry”]
I guess that it is because cross_join could instead join in terms of function. join(Array1, Array2) is equal to cross_join(Array1, Array2, Array1.row, Array2.row, Array1.col, Array2.col). Am I right? Is there any other reasons?
[/quote]

That’s correct.

The algorithm decision … between hash and merge-join … becomes an optimization question, rather than a choice of operator.[/quote]

Hi, plumber
which .cpp files implement the join algorithms(mergejoin or hashjoin) ? Because I want to research the join algorithms in SciDB.

Thank you.


#7

Your questions aren’t as simple to answer as you might think. SciDB uses neither hashjoin nor merge join (well … not in general … some algorithms use unordered maps internally, which means techniques that are more or less a hash-join, but that’s not a really the way hashjoin algorithms as you’ve read about them work). Once you’re in a world of column stores etc, the options and trade-offs change.

Data in every SciDB array is organized in a way that makes implementing joins as merge-joins simple and easy. That’s what you get when you pre-sort all the data fed as inputs into joins.

If you’re looking to compare join algorithms … SciDB is a particularly bad code base to use. You’re far better off with Postgres or even MySQL. Traditional trade-offs between hash-joins, merge-join etc are far more important in the way those engines work. SciDB … columns store … not so much.


#8

[quote=“plumber”]Your questions aren’t as simple to answer as you might think. SciDB uses neither hashjoin nor merge join (well … not in general … some algorithms use unordered maps internally, which means techniques that are more or less a hash-join, but that’s not a really the way hashjoin algorithms as you’ve read about them work). Once you’re in a world of column stores etc, the options and trade-offs change.

Data in every SciDB array is organized in a way that makes implementing joins as merge-joins simple and easy. That’s what you get when you pre-sort all the data fed as inputs into joins.

If you’re looking to compare join algorithms … SciDB is a particularly bad code base to use. You’re far better off with Postgres or even MySQL. Traditional trade-offs between hash-joins, merge-join etc are far more important in the way those engines work. SciDB … columns store … not so much.[/quote]

Hi plumber,

 Thank you for your reply.
 Then I want to get the location of the join (.cpp) files because I only try to optimize the Join algorithm.

Thank you.