Find rows in array1 that are not present in array2


#1

Here is a question submitted by one of our users:

"In Scidb, if I want to have the intersection of two files, I could use merge.
My question is:

If I want to compare two data.frames to find the rows in array1 that are not present in array2, how to do this? Because SciDB dose not support “%in%”, I have no idea how to work out.
"


#2

Here is one solution using the equi_join library:

library(scidb)
scidbconnect()
a = scidb("build(< a1 : double, a2 : string > [r=0:3,100,0, c=0:0,100,0], 
          '[  [ (1, \\'One\\')] , 
              [ (2, \\'Two\\')] , 
              [ (3, \\'Three\\')] , 
              [ (4, \\'Four\\')]  
            ]',
          true)")
a[]
#   r c a1    a2
# 1 0 0  1   One
# 2 1 0  2   Two
# 3 2 0  3 Three
# 4 3 0  4  Four

b = scidb("build(< b1 : double, b2 : string > [r=0:3,100,0], 
          '  [ (0.12, \\'One\\') , 
               (0.45, \\'Two\\') , 
               (0.13, \\'Fifteen\\') , 
               (0.34, \\'Sixteen\\')  
            ]',
          true)")
b[]
#   r   b1      b2
# 1 0 0.12     One
# 2 1 0.45     Two
# 3 2 0.13 Fifteen
# 4 3 0.34 Sixteen

# First join array1 with matching elements in array2, but keep all values of array1
step1 = merge(a, b, by.x = "a2", by.y = "b2", all.x=TRUE) 
step1[]
#   instance_id value_no    a2 a1   b1
# 1           0        0   One  1 0.12
# 2           0        1   Two  2 0.45
# 3           0        2 Three  3   NA
# 4           0        3  Four  4   NA

# Next filter joined array on elements where array2 did not have match
step2 = subset(step1, "b1 is NULL")[]
step2[]
#   instance_id value_no    a2 a1 b1
# 1           0        2 Three  3 NA
# 2           0        3  Four  4 NA

Note that only the attributes of the array are retained via this merge. If you want to retain dimensions of the original array, you have to project the dimension as new named attributes e.g. a1b = transform(a1, r_ = r, c_=c) before running the merge.