Array-level apply operator


#1

Is there an apply-like operator that would generate new arrays, instead of new attributes?

Here is a use case. I have a three-dimensional array like this:

AFL% store(
       redimension(
         build(<val:double>[i=0:3,4,0,j=0:3,4,0],i*4+j),
         <val:double>[i=0:3,4,0,j=0:3,4,0,foo=0:1,1,0]),
       temp);
[[[(0)],[(1)],[(2)],[(3)]],
 [[(4)],[(5)],[(6)],[(7)]],
 [[(8)],[(9)],[(10)],[(11)]],
 [[(12)],[(13)],[(14)],[(15)]]]

The first two dimensions, i and j, index the raw data, while the third dimension, foo, selects between the raw data (foo=0) and a “row” of data computed from each “row” i of raw data (foo=1). I have a FOO operator that takes a 1-dimensional array and produces a new 1-dimensional array of the same size. Applying FOO to the first “row”, i=0, results in:

AFL% FOO(slice(temp, i, 0, foo, 0));
[(), (2), (), (6)]

So, the raw data is:

AFL% slice(temp, foo, 0);
[[(0),(1),(2),(3)],
 [(4),(5),(6),(7)],
 [(8),(9),(10),(11)],
 [(12),(13),(14),(15)]]

While the foo data is:

AFL% slice(temp, foo, 1);
[[(),(2),(),(6)],
 [(4),(),(6),()],
 [(),(18),(),(22)],
 [(12),(),(),(15)]]

The goal is to apply the FOO operator to every “row” of the temp array and store the results at foo=1. One option is to loop outside SciDB (in a Bash script) and issue one query per row i. Is there a way to do this directly inside SciDB in a constant number of queries? I am thinking of something like the apply operator.


#2

Sorry, operators in SciDB may only produce a single result array by design.


#3

Hi,
Indeed, I don’t know of anything that does exactly what you want.

You can write an operator that accepts a 2D array and returns a 3D array. Your best bet would be to set the chunk interval along the new dimension to cover the entire range, and keep the existing chunk sizes the same. The operator then generates exactly one output chunk for each input chunk, which is easier than partially filling chunks and merging them. Depending on the exact computation, though, it could still be tricky. It looks like you need to produce a row of output for each row of input, and the input rows could possibly be on separate nodes?
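To make the one-output-chunk-per-input-chunk idea concrete, here is a minimal sketch in plain Python rather than in SciDB's operator API. Everything in it is an assumption for illustration: `placeholder_foo` is a made-up stand-in for FOO, `expand_chunk` is a hypothetical name, and each row is assumed to sit entirely inside one chunk, which is exactly the part that gets tricky when rows span nodes.

```python
from typing import Callable, List, Optional

Row = List[float]
Cell = List[Optional[float]]  # [raw value, computed value]; None marks an empty cell


def placeholder_foo(row: Row) -> List[Optional[float]]:
    """Hypothetical stand-in for FOO: keep even values, leave odd ones empty."""
    return [v if v % 2 == 0 else None for v in row]


def expand_chunk(
    chunk: List[Row],
    row_op: Callable[[Row], List[Optional[float]]] = placeholder_foo,
) -> List[List[Cell]]:
    """Turn one 2D input chunk into one 3D output chunk.

    The new last axis has length 2 and lives entirely inside the chunk:
    index 0 holds the raw value, index 1 holds the computed value.
    Assumes every row is fully contained in this chunk.
    """
    out = []
    for row in chunk:
        computed = row_op(row)
        out.append([[raw, new] for raw, new in zip(row, computed)])
    return out


# One input chunk holding two complete rows.
chunk = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
result = expand_chunk(chunk)
```

Because the new axis never crosses a chunk boundary, no merging of partially filled chunks is needed; the open question from the post, rows split across chunks or nodes, is not handled here.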

You can also take a hackier approach using user-defined aggregates (UDAs) and user-defined functions (UDFs):

  1. Create a UDA A that accepts a whole set of values, does something to them and returns a complex type (binary or string)
  2. Create a UDF F that accepts the output of A and some integer i, and returns the ith value

Then you could do something like:

project(
 apply(
  cross_join(
   aggregate(input, A(attribute) as a, dim1),
   build(<unused:bool> [i=0:99,100,0], true)
  ),
  v,
  F(a, i)
 ),
 v
)

That returns <v>[dim2, i], where dim2 is left over from the original array and i is the newly added dimension.

This could work OK for relatively small intervals along i.
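
The pack-then-unpack pattern behind that query can be sketched in plain Python. This is only an illustration, not SciDB code: `aggregate_a` and `extract_f` are hypothetical stand-ins for A and F, the doubling is a placeholder computation, and the sketch assumes the aggregate sees each row's values in order, which a real aggregate may not guarantee.

```python
from typing import Dict, List, Tuple


def aggregate_a(values: List[float]) -> str:
    """Stand-in for the UDA A: process a whole row, pack the result into a string."""
    processed = [2.0 * v for v in values]  # placeholder computation, not the real FOO
    return ",".join(repr(v) for v in processed)


def extract_f(packed: str, i: int) -> float:
    """Stand-in for the UDF F: return the ith value from A's packed output."""
    return float(packed.split(",")[i])


def rowwise_apply(rows: Dict[int, List[float]], width: int) -> Dict[Tuple[int, int], float]:
    """Mimic aggregate + cross_join + apply + project on a dict of rows."""
    # aggregate step: one packed value per row
    packed = {r: aggregate_a(vals) for r, vals in rows.items()}
    # cross_join with 0..width-1, then apply F to every (row, i) pair
    return {(r, i): extract_f(a, i) for r, a in packed.items() for i in range(width)}


rows = {0: [0.0, 1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0, 7.0]}
result = rowwise_apply(rows, 4)
```

In this sketch the packed string is re-parsed once per output cell, so the work per row grows roughly with the square of the interval along i, which fits the caveat about small intervals.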