JOIN on values


#1

I have two single dimmension arrays, each array has two column fields x and y which are as follows

ar4: 1,2; 1,2; 6,9
ar5: 5,2; 7,2; 6,1

So for cell 1 of ar4 x is 1 and y is 2.
when I run following command:

select * from ar4 join ar5 on ar4.y = ar5.y;

I get following output:

[(1,2,5,2),(1,2,7,2),()],[(1,2,5,2),(1,2,7,2),()],[(),(),()]

and when I run following command:

select * from ar4 join ar5 on ar4.x = ar5.x;

I get following output:

[(),(),()],[(),(),()],[(),(),(6,9,6,1)]

Can any one explain me the output

Thanks and Regards
Rishabh Agrawal
My Blog: nosql.rishabhagrawal.com/


#2

Hi Rishabh!

Well … welcome to the world of Array DBMSs! What you’re observing is a semantic difference between the SciDB Array Data Model and the SQL Relational Data Model. In a nutshell, when you’re dealing with arrays, shape matters. SQL tables have columns and unordered rows. AciDB arrays have dimensions–which defined the array’s shape–and attributes–which define it’s contents. When you write queries over arrays, you need to be mindful of the shape of the input and output arrays.

Let’s look at your examples in a little more detail. (All code examples here are working on the current 12.10 (Cheshire) trunk - so they might not all work for those of you still on 12.3. Mostly the differences are how the AQL works.)

Let’s start with two arrays:

#
#  Create and populate a couple of 2 attr, 1D arrays ... 
CMD="
 SELECT *
   INTO Simple_1D_A
   FROM array(
    < a : int64, b : int64 >[I=0:*,9,0],
    '[(0,8),(1,7),(2,6),(3,5),(4,4),(5,3),(6,2),(7,1),(8,0)]')"
iquery -q "$CMD"
CMD="
 SELECT *
   INTO Simple_1D_B
   FROM array(
    < a : int64, b : int64 >[I=0:*,9,0],
    '[(8,0),(7,1),(6,2),(5,3),(4,4),(3,5),(2,6),(1,7),(0,8)]')"
iquery -q "$CMD
#
#  What do they look like?
CMD="SELECT * FROM show ( Simple_1D_A )"
iquery -q "$CMD"
[("Simple_1D_A<a:int64,b:int64> [I=0:*,9,0]")]

CMD="SELECT * FROM show ( Simple_1D_B )"
iquery -q "$CMD"
[("Simple_1D_B<a:int64,b:int64> [I=0:*,9,0]")]

The shape of each of these arrays is (like yours) a 1D array with two attributes. The thing is, when you “join” two arrays, you need to specify what you want the “shape” of the result to be. Unlike SQL, where the “shape” of the result of a join is simply the attribute lists from the two inputs, in an Array DBMS need to be clear on what shape you’re producing from the “array join”. We have two options:

  1. The shape of the result is the same shape as the two input arrays. That is “A JOIN B” requires that the shape of A (number of dimensions, dimension types, etc) is the same as B, and the result (call it C) gets the same shape as the inputs. If you want to reach for a SQL analogy, this join is like a “natural” join.
  2. The shape of the result is the outer product if the shape of the two input arrays (which do not have to be the same shape as each other). If you want a SQL analogy, this “join” is the same as the cartesian product.

Examples:

CMD="SELECT * FROM Simple_1D_A S1, Simple_1D_B S2"
iquery -q "$CMD"
[(0,8,8,0),(1,7,7,1),(2,6,6,2),(3,5,5,3),(4,4,4,4),(5,3,3,5),(6,2,2,6),(7,1,1,7),(8,0,0,8)]

CMD="SELECT * FROM Simple_1D_A S1 JOIN Simple_1D_B S2 ON S1.I = S2.I"
iquery -q "$CMD"
[[(0,8,8,0),(0,8,7,1),(0,8,6,2),(0,8,5,3),(0,8,4,4),(0,8,3,5),(0,8,2,6),(0,8,1,7),(0,8,0,8)],[(1,7,8,0),(1,7,7,1),(1,7,6,2),(1,7,5,3),(1,7,4,4),(1,7,3,5),(1,7,2,6),(1,7,1,7),(1,7,0,8)],[(2,6,8,0),(2,6,7,1),(2,6,6,2),(2,6,5,3),(2,6,4,4),(2,6,3,5),(2,6,2,6),(2,6,1,7),(2,6,0,8)],[(3,5,8,0),(3,5,7,1),(3,5,6,2),(3,5,5,3),(3,5,4,4),(3,5,3,5),(3,5,2,6),(3,5,1,7),(3,5,0,8)],[(4,4,8,0),(4,4,7,1),(4,4,6,2),(4,4,5,3),(4,4,4,4),(4,4,3,5),(4,4,2,6),(4,4,1,7),(4,4,0,8)],[(5,3,8,0),(5,3,7,1),(5,3,6,2),(5,3,5,3),(5,3,4,4),(5,3,3,5),(5,3,2,6),(5,3,1,7),(5,3,0,8)],[(6,2,8,0),(6,2,7,1),(6,2,6,2),(6,2,5,3),(6,2,4,4),(6,2,3,5),(6,2,2,6),(6,2,1,7),(6,2,0,8)],[(7,1,8,0),(7,1,7,1),(7,1,6,2),(7,1,5,3),(7,1,4,4),(7,1,3,5),(7,1,2,6),(7,1,1,7),(7,1,0,8)],[(8,0,8,0),(8,0,7,1),(8,0,6,2),(8,0,5,3),(8,0,4,4),(8,0,3,5),(8,0,2,6),(8,0,1,7),(8,0,0,8)]]

That’s the simplest description of what’s going on. The “simple join” that has the form SELECT stuff FROM A, B ... has the same shape as the inputs[(0,8),(),(2,6),(),(4,4),(),(6,2),(),(8,0)]
but different contents (the attributes from each “aligned” cell are combined) while the results of the “cross join” has as many dimensions as the sum of the dimensions in the inputs. In this 1D cross 1D example, the output has 2D. If the inputs had 2D and 3D, then the output shape would have 5D. Note also that the order of the inputs to the cross matters. The dimensions from the second input array become the inner dimensions on the output. This becomes significant once you start pointing the outputs of your queries into liner algebra routines, for example.

The situation is (of course) complicated when the input arrays are sparse (or ragged), which is what’s going on in your example. Let’s tinker with the examples a bit.

CMD=" SELECT * FROM Simple_1D_A S1 WHERE S1.a % 2 = 0"
iquery -q "$CMD"
[(0,8),(),(2,6),(),(4,4),(),(6,2),(),(8,0)]

CMD=" SELECT * FROM Simple_1D_B S1 WHERE S1.a % 3 = 0"
iquery -q "$CMD"
[(),(),(6,2),(),(),(3,5),(),(),(0,8)]

I hope this example is close enough to your original to make the analogy clear.

When we use any kind of join the rule is that there’s a result cell only if there is a pair of matching cells from the two inputs. So in this case:

CMD="
SELECT * 
  FROM Simple_1D_A S1, Simple_1D_B S2 
 WHERE S1.a % 2 = 0 AND S2.a % 3 = 0"
iquery -q "$CMD"
[(),(),(2,6,6,2),(),(),(),(),(),(8,0,0,8)]

Recall how, in SQL, inner joins omit records from the two input tables when there isn’t a “match”. Same’s true of array algebras. The full cross join looks like this.

CMD="
 SELECT *
   FROM Simple_1D_A S1 CROSS JOIN Simple_1D_B S2
 WHERE S1.a % 2 = 0 AND S2.a % 3 = 0"
iquery -q "$CMD"

[[(),(),(0,8,6,2),(),(),(0,8,3,5),(),(),(0,8,0,8)],[(),(),(),(),(),(),(),(),()],[(),(),(2,6,6,2),(),(),(2,6,3,5),(),(),(2,6,0,8)],[(),(),(),(),(),(),(),(),()],[(),(),(4,4,6,2),(),(),(4,4,3,5),(),(),(4,4,0,8)],[(),(),(),(),(),(),(),(),()],[(),(),(6,2,6,2),(),(),(6,2,3,5),(),(),(6,2,0,8)],[(),(),(),(),(),(),(),(),()],[(),(),(8,0,6,2),(),(),(8,0,3,5),(),(),(8,0,0,8)]]

Now - I don’t think that you’ve cut-n-paste exactly what you’ve seen. The result of the select * from ar4 join ar5 on ar4.x = ar5.x; should be a 2D result. Overall, you should see something like this.

#
CMD="
 SELECT *
   INTO ar4
   FROM array(
    < x : int64, y : int64 >[I=0:*,3,0],
    '[(1,2),(1,2),(6,9)]')"
iquery -q "$CMD"
#
CMD="
 SELECT *
   INTO ar5
   FROM array(
    < x : int64, y : int64 >[I=0:*,3,0],
    '[(5,2),(7,2),(6,1)]')"
iquery -q "$CMD"
#
CMD="SELECT * FROM ar4, ar5;"
iquery -q "$CMD"
[(1,2,5,2),(1,2,7,2),(6,9,6,1)]
#
CMD="SELECT * FROM ar4 CROSS JOIN ar5 CROSS;"
iquery -q "$CMD"
[[(1,2,5,2),(1,2,7,2),(1,2,6,1)],[(1,2,5,2),(1,2,7,2),(1,2,6,1)],[(6,9,5,2),(6,9,7,2),(6,9,6,1)]]
#
CMD="SELECT * FROM ar4 CROSS JOIN ar5 WHERE ar4.x = ar5.x;"
iquery -q "$CMD"
[[(),(),()],[(),(),()],[(),(),(6,9,6,1)]]

Does this clarify things?