SVD / operators with multiple results


#1

Folks:

Playing around with the SVD functionality, it looks to me as if you have to compute your SVD three times to get U, V, and the singular values. I presume that this is because an operator always returns exactly one array, but it seems pretty wasteful, and I can’t think of a time when I wouldn’t want all three results. Would the alternative be an SVD operator that writes results to named arrays in the spirit of store()? Not a pressing thing for me, but it seems as if many non-trivial analytic operations will need to handle the case of multiple results …

Cheers,
Tim


#2

Dear Tim,

We’re working on this. There are two new operators coming soon, an SVD for dense matrices based on ScaLAPACK, and a new method for dense or sparse matrices to efficiently compute a truncated SVD (a few largest singular values and corresponding vectors).

Each of the new methods will return a 3D array with the U, S, and V matrices in the extra dimension. The truncated SVD method will additionally return a 4th layer with diagnostic information.
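To make the packed layout concrete, here is a minimal pure-Python sketch of stacking U, S, and V as layers of one 3D array. It is not the SciDB implementation. The names `pack_layers` and `unpack_layers` are hypothetical, and storing S as the first row of its layer and zero-padding each layer to a common shape are assumptions made only for illustration.

```python
def pack_layers(u, s, v):
    """Stack U, S, and V as layers of one 3D list indexed [layer][row][col].

    Assumed layout (illustrative only): layer 0 holds U, layer 1 holds the
    singular values in its first row, layer 2 holds V. Every layer is
    zero-padded to the largest row and column counts among the three.
    """
    layers = [u, [list(s)], v]
    rows = max(len(m) for m in layers)
    cols = max(len(m[0]) for m in layers)

    def pad(m):
        out = [[0.0] * cols for _ in range(rows)]
        for i, row in enumerate(m):
            for j, x in enumerate(row):
                out[i][j] = x
        return out

    return [pad(m) for m in layers]


def unpack_layers(packed, m, n, k):
    """Recover U (m x k), S (length k), and V (n x k) from the packed layers.

    The caller has to supply m, n, and k, because the packed array no
    longer records the individual extents.
    """
    u = [row[:k] for row in packed[0][:m]]
    s = packed[1][0][:k]
    v = [row[:k] for row in packed[2][:n]]
    return u, s, v
```

For example, packing a 3 x 2 U, two singular values, and a 2 x 2 V yields three 3 x 2 layers, and `unpack_layers(packed, 3, 2, 2)` returns the original pieces.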

I’m not sure exactly when all this will be in place, but it should all be there by March.

Best regards,

Bryan Lewis


#3

Bryan:

Happy to hear that there’s ongoing progress in this area, particularly the truncated sparse SVD - my interest is in text analysis, where this will be an especially good fit.

That said, packing results into higher-dimension arrays sounds like a suboptimal approach for operators that return multiple arrays, since you lose the explicit extents of each result. That would seem to preclude writing operators that produce indeterminately-sized results, something that is of great interest in text analysis (e.g., splitting a string into an array of tokens, creating a dictionary of unique tokens, etc.). It also precludes writing operators that return multiple arrays with dissimilar types.
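As a rough illustration of the text-analysis case, here is a small Python sketch of an operation that returns two results of different types and of sizes that are not known in advance. The names `TokenizeResult` and `tokenize` are hypothetical and nothing here is SciDB API. It only shows the shape of result that is awkward to pack into a single fixed-extent array.

```python
from typing import NamedTuple


class TokenizeResult(NamedTuple):
    tokens: list      # every token, in document order (length depends on input)
    vocabulary: dict  # unique token -> integer id (size depends on input)


def tokenize(text):
    """Split text on whitespace and build a dictionary of unique tokens."""
    tokens = text.lower().split()
    vocabulary = {}
    for token in tokens:
        # Assign the next free id the first time a token is seen.
        vocabulary.setdefault(token, len(vocabulary))
    return TokenizeResult(tokens, vocabulary)
```

Each result keeps its own length and element type, which is the property a named-results or array-of-arrays return value would preserve.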

Given all the support for user defined types in SciDB, I wonder if it would be possible to have a “reference to array” type, allowing an operator to return an array-of-arrays?

Changing subjects, I recently ran a 49815 x 500 matrix through the SVD operator, and expected to get back a 49815 x 500 matrix, a size 500 vector, and a 500 x 500 matrix as results. Instead, I get back a 49824 x 512 matrix, a size 512 vector, and a 512 x 512 matrix (details below). I assume that chunk size is playing a role here, since all of those dimensions are multiples of 32, the chunk interval, but IMO those dimensions are wrong. Am I supposed to ignore the values that are outside the expected dimensions? Regardless of chunk size, why would the dimensions need to be altered?

Many thanks,
Tim

AFL% dimensions(frequency_matrix);
{No} name,start,length,chunk_interval,chunk_overlap,low,high,type
{0} "i",0,4611686018427387903,32,0,0,49814,"int64"
{1} "j",0,4611686018427387903,32,0,0,499,"int64"

AFL% dimensions(lsv);
{No} name,start,length,chunk_interval,chunk_overlap,low,high,type
{0} "i_1",0,49824,32,0,0,49823,"int64"
{1} "i_2",0,512,32,0,0,511,"int64"

AFL% dimensions(sv);
{No} name,start,length,chunk_interval,chunk_overlap,low,high,type
{0} "i",0,512,32,0,0,511,"int64"

AFL% dimensions(rsv);
{No} name,start,length,chunk_interval,chunk_overlap,low,high,type
{0} "i",0,512,32,0,0,511,"int64"
{1} "j",0,512,32,0,0,511,"int64"
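The reported sizes are consistent with each extent being rounded up to the next multiple of the chunk interval (32). That is only a guess from the numbers above, not confirmed behavior of the operator. The arithmetic can be checked with a short Python sketch, where `round_up_to_chunk` is a hypothetical helper name:

```python
def round_up_to_chunk(extent, chunk_interval):
    """Round extent up to the nearest multiple of chunk_interval."""
    # -(-a // b) is integer ceiling division for positive b.
    return -(-extent // chunk_interval) * chunk_interval


# Extents of the input matrix and the chunk interval from the output above.
rows, cols, chunk = 49815, 500, 32
padded_rows = round_up_to_chunk(rows, chunk)
padded_cols = round_up_to_chunk(cols, chunk)
```

With these inputs `padded_rows` is 49824 and `padded_cols` is 512, matching the dimensions reported for lsv, sv, and rsv.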