Manipulating arrays without storing or fetching them


#1

Hello,
I’m working with SciDB and need to run a complex query that takes two arrays and produces one; all three are stored in the database. The input arrays are:

(‘Forest<flux:float,dist:int64> [index=0:180000,1000,0,pixel=0:1000,500,0]’)
(‘Theta<theta:float> [index=0:180000,1000,0,index_2=0:180000,1000,0]’)
The output array is:
(‘Bins<total_flux:float> [r_parallel=0:50,50,0,r_transverse=0:50,50,0]’)

The queries I need to execute are below. I build them as Python format strings and run the final query with iquery:

cross_forest = "cross(Forest,Forest)"
apply_filt = "apply(%s, flux_sq, flux*flux_2)" % cross_forest
forest_cross = "project(%s, flux_sq, dist, dist_2)" % apply_filt
cross_join_theta_and_forest = "cross_join(%s, Theta, %s.index, Theta.index, %s.index_2, Theta.index_2)" % (forest_cross, forest_cross, forest_cross)

calc_r = "apply(%s, r_parallel, dist*100 * theta*3, r_transverse, dist+dist_2+theta)" % cross_join_theta_and_forest
filter_r = "filter(%s, r_parallel + r_transverse<100)" % calc_r
project_r = "project(%s, flux_sq, r_parallel, r_transverse)" % filter_r

final_step = "redimension(%s, Bins, sum(flux_sq) as total_flux)" % project_r

iquery -aq "<final_step>;"

My problem is that I don’t need the intermediate arrays, yet it seems I have to store and fetch each of them, which takes a very long time. For example, the first command I would execute is:

iquery -aq "store(cross(Forest, Forest), forest_cross);"
and so on.

Is there any way for me to avoid fetching and saving the intermediate results?
Any other suggestions to speed up the process are welcome as well!

Thanks!


#2

SciDB queries are composable. Almost anywhere the documentation shows an operator taking an array as input, you can substitute a query instead. (The exceptions are cases where an array is used only to define a shape.)

So you’ve got …

cross_forest = "cross(Forest,Forest)"
apply_filt = "apply(%s, flux_sq, flux*flux_2)" % cross_forest
forest_cross = "project(%s, flux_sq, dist, dist_2)" % apply_filt

… which can be expressed as …

project (
  apply (
    cross ( Forest AS F1, Forest AS F2 ),
    flux_sq, F1.flux * F2.flux
  ),
  flux_sq, F1.dist, F2.dist
);

I’ve not worked out all of the details, but looking at the operators in your list I’m pretty confident that what you’re trying to do can be expressed as a single expression. And when SciDB actually computes that query, it won’t compute and store each intermediate result in full. Rather, it computes incremental output for each incremental input.
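As a rough sketch of that composition in Python (the language the original strings were written in): each operator is a template with one %s slot, and the templates are nested around the innermost expression to give one query string. The helper name compose is hypothetical, not part of any SciDB client library, and the sketch only builds the string; it does not check that SciDB accepts it.

```python
from functools import reduce


def compose(source, steps):
    """Wrap each operator template around the expression built so far.

    `source` is the innermost AFL expression; every entry in `steps`
    must contain exactly one %s placeholder for its input array.
    """
    return reduce(lambda inner, template: template % inner, steps, source)


# Operator templates taken from the thread, applied innermost first.
steps = [
    "apply(%s, flux_sq, F1.flux * F2.flux)",
    "project(%s, flux_sq, F1.dist, F2.dist)",
]

query = compose("cross(Forest AS F1, Forest AS F2)", steps)
print(query)
```

The resulting string is a single nested expression, so nothing has to be stored under an intermediate array name between steps.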

All of that said, I’m looking at your overall plan here. Are you sure you don’t just want to compute a matrix multiply? There are a lot of sums of products going on in there...


#3

Add the "-n" switch to your iquery command. Then it won’t fetch and print the results of your queries.
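A minimal sketch of calling iquery from Python with that switch, assuming iquery is on your PATH and using the -a (AFL), -n (no fetch) and -q (query) options. The function names build_iquery_command and run_query are made up for this example.

```python
import subprocess


def build_iquery_command(query, fetch=False, executable="iquery"):
    """Return the argument list for running one AFL query with iquery."""
    cmd = [executable, "-a"]
    if not fetch:
        # -n tells iquery not to fetch (and so not to print) the result.
        cmd.append("-n")
    cmd += ["-q", query]
    return cmd


def run_query(query, fetch=False):
    """Run the query; raises CalledProcessError if iquery exits non-zero."""
    return subprocess.run(build_iquery_command(query, fetch), check=True)
```

For example, run_query("store(cross(Forest, Forest), forest_cross)") would run the store without pulling the result back to the client.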


#4

Hi plumber,
Thank you, it works now. I changed it to the following:

cross_forest = "cross(Forest AS F1, Forest AS F2)"
apply_filt = "apply(%s, flux_sq, F1.flux * F2.flux)" % cross_forest
forest_cross = "project(%s, flux_sq, F1.dist, F2.dist)" % apply_filt
cross_join_theta_and_forest = "cross_join(%s, Theta, F1.index, Theta.index, F2.index, Theta.index_2)" % forest_cross
calc_r = "apply(%s, r_parallel, F1.dist*100 * Theta.theta*3, r_transverse, F1.dist+F2.dist+Theta.theta)" % cross_join_theta_and_forest
filter_r = "filter(%s AS CALC_R, CALC_R.r_parallel<50 and CALC_R.r_transverse<50 and CALC_R.r_parallel > 0 and CALC_R.r_transverse > 0)" % calc_r
project_r = "project(%s AS FILTER_R, FILTER_R.flux_sq, FILTER_R.r_parallel, FILTER_R.r_transverse)" % filter_r
final_step = "redimension(%s AS PROJECT_R, Bins, sum(PROJECT_R.flux_sq) as total_flux)" % project_r
print(final_step)

$IQUERY -aq "%s;" % final_step

The process indeed does something that is similar to a matrix multiplication. A matrix multiplication does not work because I need to keep some of the attributes (and not sum over any of them) until I build some needed attributes from them (i.e., I need all the triples for (dist_i, dist_j, theta_k) to create the attributes r_parallel and r_transverse).
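To make that concrete, here is a toy pure-Python sketch of the same idea, not the SciDB query itself. The bin each flux product lands in depends on the individual (dist_i, dist_j, theta_ij) triple, so the products have to be kept apart until the bin is known, rather than summed along a fixed axis as a matrix multiply would do. The function name, the toy data and the two formulas passed in below are all invented for illustration.

```python
from collections import defaultdict


def bin_flux_products(flux, dist, theta, r_par, r_trans, max_bin=50):
    """Sum flux[i] * flux[j] into bins keyed by (r_parallel, r_transverse).

    r_par and r_trans are caller-supplied functions of (dist_i, dist_j,
    theta_ij); only pairs with both values strictly between 0 and max_bin
    are kept.
    """
    bins = defaultdict(float)
    n = len(flux)
    for i in range(n):
        for j in range(n):
            rp = int(r_par(dist[i], dist[j], theta[i][j]))
            rt = int(r_trans(dist[i], dist[j], theta[i][j]))
            if 0 < rp < max_bin and 0 < rt < max_bin:
                bins[(rp, rt)] += flux[i] * flux[j]
    return dict(bins)


# Toy inputs; the formulas are placeholders, not the real ones.
flux = [1.0, 2.0]
dist = [1, 2]
theta = [[1, 2], [3, 4]]
result = bin_flux_products(
    flux, dist, theta,
    r_par=lambda di, dj, t: di * t,
    r_trans=lambda di, dj, t: di + dj + t,
)
print(result)
```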

Thanks!