Comparison matrix with dot product


#1

Hi everybody,

I have a 2-dim array with dimensions ‘patch_id’ and ‘feature’ and one attribute called ‘value’ (double). There are about 268’000 patch_ids and 1024 features:

fvNorm<value:double> [patch_id=0:*:0:1024; feature=0:*:0:1024]

I would like to extract each feature vector (a row of 1024 features, identified by its patch_id) and compare it against all other feature vectors in the array by computing the dot product of the two vectors, storing the results in a new 2-dimensional array. Both dimensions of the result represent patch_ids, so each cell holds the comparison of one vector with another.
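
In other words, for every pair of patch_ids i and j I want:

result[i, j] = sum over f of ( value[i, f] * value[j, f] ),  f = 0..1023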

Can anyone give me a hint on how to do this?
Thank you,
-Ivan


#2

Hi Ivan,

If I’m not mistaken, this is essentially a matrix product, right? There are a few ways to do it, and the choice depends on whether the array has empty cells or is sparse.

If the array is dense, you can use gemm from the dense_linear_algebra library. However, your array needs to be bounded, and you can use subarray to do that. gemm also requires a third argument, which is added to the product; you can supply an array of zeros.

This can also be done with joins: cross_join or equi_join, followed by an apply and an aggregate or grouped_aggregate. This is useful if you need to compute something special, not a standard dot product. Or use left_outer/right_outer in equi_join if you want to compute something special for values on the right that do not have a match on the left. I’ll show you the gemm way and the cross_join way with a simple array:

$ iquery -aq "show(bar)"
{i} schema
{0} 'bar<val:double> [x=1:*:0:256; y=1:*:0:256]'

$ iquery -aq "scan(bar)"
{x,y} val
{1,1} 2
{1,2} 3
{1,3} 4
{2,1} 3
{2,2} 4
{2,3} 5
{3,1} 4
{3,2} 5
{3,3} 6

$ iquery -aq "load_library('dense_linear_algebra')"
Query was executed successfully

$ iquery -aq "gemm(transpose(subarray(bar,1,1,3,3)),subarray(bar,1,1,3,3), build(<val:double>[x=0:2,256,0,y=0:2,256,0],0))"
{x,y} gemm
{0,0} 29
{0,1} 38
{0,2} 47
{1,0} 38
{1,1} 50
{1,2} 62
{2,0} 47
{2,1} 62
{2,2} 77

$ iquery -aq "aggregate(apply(cross_join(bar as A, cast(bar, <val:double>[x1,y1]) as B, A.x, B.x1), p, A.val *B.val), sum(p), y, y1)"
{y,y1} p_sum
{1,1} 29
{1,2} 38
{1,3} 47
{2,1} 38
{2,2} 50
{2,3} 62
{3,1} 47
{3,2} 62
{3,3} 77

gemm is much faster but also more demanding: it requires bounded arrays and has chunk-size restrictions.
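
If your chunks don’t already satisfy those restrictions, you can rechunk with repart before calling gemm. A rough sketch on the toy array above (the 32x32 chunking is just an illustrative choice; check the gemm documentation for your SciDB version for the exact chunk-interval limits):

$ iquery -aq "gemm(transpose(repart(subarray(bar,1,1,3,3), <val:double>[x=0:2:0:32; y=0:2:0:32])), repart(subarray(bar,1,1,3,3), <val:double>[x=0:2:0:32; y=0:2:0:32]), build(<val:double>[x=0:2:0:32; y=0:2:0:32], 0))"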

For those who might be curious, this is also an area where SciDB Enterprise Edition has a lot more functionality.


#3

Hi Alex,

Thank you! This is what I needed. I just tried the cross_join solution on a small array:

aggregate(apply(cross_join(fvNorm as A, cast(fvNorm, <val:double>[patch_id1,feature1]) as B, A.feature, B.feature1), p, A.value * B.val), sum(p), patch_id, patch_id1);

This works great with a small test array.
However, the real array has dimensions of 268’000 x 1024:

fvNorm<value:double> [patch_id=0:268000:0:1024; feature=0:1024:0:1024]

With cross_join (see above) this would create an array with 268’000 x 268’000 x 1024 cells (about 73.5 x 10^12). Can this be computed in a sensible amount of time on a single node running 10 instances? More than enough memory should be available (about 1 TB).

Thanks,
-Ivan


#4

Hey Ivan,

The good news about the cross_join is that it won’t materialize the 268000 x 268000 x 1024 cells in memory; it is one of the stream-through operators, which means it serves up values to the aggregate above it as they are requested. But gemm and other methods will go much faster. There’s also spgemm, the sparse counterpart of gemm, available in the linear_algebra library, which you can try.
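
For reference, the gemm version of your problem would look roughly like this. This is an untested sketch: it assumes your patch_ids run 0..267999 and features 0..1023, and it stores the result into a new array I’m calling fvDot here; adjust the bounds, chunk sizes, and names to your actual schema.

$ iquery -aq "load_library('dense_linear_algebra')"
$ iquery -anq "store(gemm(subarray(fvNorm, 0, 0, 267999, 1023), transpose(subarray(fvNorm, 0, 0, 267999, 1023)), build(<val:double>[patch_id=0:267999:0:1024; patch_id1=0:267999:0:1024], 0)), fvDot)"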

You can get a sense of the scalability curve by trying a subset of rows and slowly increasing that number. For example, between(fvNorm, 0, null, 1023, null) will quickly give you a slice with 1024 rows (one chunk’s worth). Your final output will be 268000 x 268000, but you can first try computing a 268000 x 1024 piece; that, in effect, completes about 1/261 of your problem. You can insert that result into a target array and then compute the next “slice”. So you can run the problem in pieces and quickly get an idea of how long the whole job will take. This slice-at-a-time strategy is not specific to cross_join, of course; it works with gemm and other methods too.
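
Concretely, a slice-at-a-time version of your cross_join query might look like this (a sketch: fvDot is a placeholder target name, and you’d shift the between bounds by 1024 on each pass):

$ iquery -aq "create array fvDot <p_sum:double>[patch_id=0:267999:0:1024; patch_id1=0:267999:0:1024]"
$ iquery -anq "insert(aggregate(apply(cross_join(between(fvNorm, 0, null, 1023, null) as A, cast(fvNorm, <val:double>[patch_id1,feature1]) as B, A.feature, B.feature1), p, A.value * B.val), sum(p), patch_id, patch_id1), fvDot)"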

In practice we find that people sometimes apply thresholding to avoid materializing the whole output, and sometimes they compute various distance metrics other than just the dot product. So another option is to use streaming. This lets you write your distance-metric calculation in R and, potentially, apply thresholds to the output if needed.
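
The basic shape of a streaming call, assuming you have the stream plugin installed, would be something like the following. The script path is a placeholder; the script itself would read chunks of fvNorm from stdin and write its distance values back to stdout, per the stream plugin’s protocol.

$ iquery -aq "load_library('stream')"
$ iquery -aq "stream(fvNorm, 'Rscript /path/to/my_distance.R')"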

Make sense?


#5

Great. Thanks, Alex, for the detailed answer. It definitely makes sense. I’ll probably also have a look at the streaming you proposed.
-Ivan