SVD of large matrix


#1

Hi Scidb users,

I’m trying to compute the SVD of a large-ish matrix (10000000 x 10). However, the operator crashes with the following exception: Error description: Operator error. Operation 'per-instance share of input matrix exceeds library size limit' failed. Can anyone explain what’s going on here? I’m using SciDB 16.9 CE on Ubuntu (in a Docker container, if that matters).

As a bit of context - I’m trying to compute the pseudo-inverse of a data matrix to solve a least-squares problem. If there’s a more efficient way to do this, I’m all ears.
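In NumPy terms (just to pin down what I mean, with toy stand-in data and my own names), the computation would be:

import numpy as np

X = np.random.rand(1000, 10)   # stand-in for the real N x K data matrix
y = np.random.rand(1000, 1)
w = np.linalg.pinv(X) @ y      # least-squares solution via the pseudo-inverse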


#2

Hey Anthony, sorry about the trouble.
How many SciDB instances are you running?
$ iquery -aq "list('instances')"


#3

Hi @apoliakov - I’m using 24 instances.


#4

Interesting. Our linear algebra devs tell me that error would be triggered if your matrix exceeds 2GB per instance. However, that does not seem possible given your dimensions. So, can you send us the output of this:

iquery -aq "show(ARRAY_NAME)" #Substitute with the name of your matrix
iquery -aq "summarize(ARRAY_NAME, by_instance:1)"

Also show us the exact query you’re trying to run.


#5

I noticed that @ahthomas is using SciDB 16.9 (BTW, we have SciDB 18.1 Docker images available too). The summarize operator does not ship by default in 16.9 (it is included with SciDB 18.1).

@ahthomas would have to install summarize from here.

Also, the syntax for summarize at SciDB 16.9 would be:

iquery -aq "summarize(ARRAY_NAME, 'per_instance=1')"

#6

@apoliakov, @Kriti_Sen_Sharma - sorry for the slow reply! In response to your comments:

  1. AFL% show(M10000000100);
     {i} schema
     {0} 'M10000000100<val:double> [row=0:10000000:0:1000; col=0:100:0:1000]'

  2. AFL% summarize(M10000000100, by_instance:1);
    {inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
    {0,0} 'all',42218000,344385772,836,101000,101000,101000,48,411945,825200
    {1,0} 'all',40501000,330389516,802,101000,101000,101000,48,411957,825044
    {2,0} 'all',43632000,355920192,864,101000,101000,101000,48,411945,825080
    {3,0} 'all',39996000,326269068,792,101000,101000,101000,48,411956,825660
    {4,0} 'all',43026000,350980392,852,101000,101000,101000,48,411949,825460
    {5,0} 'all',46561000,379828244,922,101000,101000,101000,48,411961,825396
    {6,0} 'all',38582000,314745600,764,101000,101000,101000,48,411971,825108
    {7,0} 'all',41208000,336170704,816,101000,101000,101000,48,411974,825416
    {8,0} 'all',48076000,392194392,952,101000,101000,101000,48,411969,825396
    {9,0} 'all',44844000,365803160,888,101000,101000,101000,48,411940,825428
    {10,0} 'all',43531000,355108744,862,101000,101000,101000,48,411959,825264
    {11,0} 'all',42521000,346884332,842,101000,101000,101000,48,411977,825464
    {12,0} 'all',42016000,342755352,832,101000,101000,101000,48,411966,825304
    {13,0} 'all',43228000,352616432,856,101000,101000,101000,48,411935,825060
    {14,0} 'all',38380000,313082792,760,101000,101000,101000,48,411951,825196
    {15,0} 'all',39592000,322974956,784,101000,101000,101000,48,411958,825488
    {16,0} 'all',41612000,339435116,824,101000,101000,101000,48,411936,825208
    {17,0} 'all',45248000,369130228,896,101000,101000,101000,48,411976,825272
    {18,0} 'all',39592000,322979280,784,101000,101000,101000,48,411963,825240
    {19,0} 'all',42824000,349331588,848,101000,101000,101000,48,411948,825348
    {20,0} 'all',40804000,332866632,808,101000,101000,101000,48,411964,825616
    {21,0} 'all',40400101,329582440,802,101,100748,101000,48,410951,825332
    {22,0} 'all',41612000,339461364,824,101000,101000,101000,48,411968,825156
    {23,0} 'all',39996000,326279440,792,101000,101000,101000,48,411969,825512

Also, I realized I am actually using SciDB 18.1 - sorry for the confusion!


#7

Hey - no worries - and the last piece of information: what exact query are you trying to run?


#8

The full query is the following (where {X} and {y} are placeholders for arbitrary arrays of shape NxK and Nx1 respectively):

gemm(gemm(project(apply(cross_join(
    transpose(gesvd({X}, 'VT')) as V,
    project(apply(gesvd({X}, 'S'), sigma_inv, pow(sigma, -1)), sigma_inv) as S_INV,
    V.i, S_INV.i), vsinv, v*sigma_inv), vsinv), transpose(gesvd({X}, 'U')),
    ZKN), {y}, ZK1)

But simply running gesvd({X}, 'VT') is sufficient to produce the error. The query should compute the w that minimizes ||Xw - y||, which can be shown to be w = Ay for A = V*S^{-1}*U^{T}, where svd(X) = U*S*V^{T}.
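For anyone following along, here’s a toy NumPy sketch of the same math (my own naming, random stand-in data) mirroring the query’s structure:

import numpy as np

N, K = 1000, 10
X = np.random.rand(N, K)
y = np.random.rand(N, 1)

# Thin SVD: X = U * S * V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# V * S^{-1} -- the cross_join/apply portion of the query above
VSinv = Vt.T @ np.diag(1.0 / s)

# The two gemm() calls: w = (V * S^{-1}) * (U^T * y)
w = VSinv @ (U.T @ y)

# Sanity check against a reference least-squares solver
assert np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0])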


#9

Interesting. Thanks @ahthomas. I can confirm we’ve reproduced this issue with random data.
Also, let’s confirm the matrix size: yours is actually 101 x 10,000,001; note the indices start at 0.

Thanks for finding this issue! We’ll investigate internally and get back to you.


#10

Awesome, thanks! And I think the matrix dims should be transposed: i.e. 10,000,001 x 101 (but the size is correct). I’m coming from MATLAB, where everything is 1-indexed, so I still get bitten by off-by-one bugs all the time :)


#11

A related question: what algorithm does SciDB use for computing the SVD? Additionally, in the query above, the SVD operator is called three times, once to extract each factor. Is the query optimizer able to cache intermediate computations across the calls to gesvd, or does it have to start from scratch each time?


#12

Hey @ahthomas - a few updates.

This operator calls into ScaLAPACK under the hood, and that is also where the error comes from. The per-instance portion of the matrix (the “per-instance share” in the error message) cannot exceed roughly 2GB per instance; it’s a 32-bit addressing issue inside ScaLAPACK. The nuance is that each dimension is rounded up to the nearest whole chunk. So, even though your col dimension only runs from 0 to 100, its chunk size is 1000, and that is where our discrepancy comes from. Another constraint is that the chunks need to be square.

So, for example, this works on 8 instances with 100x100 chunk sizes:

iquery -aq "create array M10000000100<val:double> [row=0:9999999:0:100; col=0:99:0:100]"
iquery -aq "gesvd(M10000000100, 'VT')"

Your example with 1000x1000 chunks should work on 64 instances.
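To make the arithmetic concrete, here’s a rough back-of-the-envelope model (mine, for illustration - not the exact internal check):

# Approximate per-instance share: each dimension rounds up to a whole
# number of chunks before the matrix is divided across instances
# (8 bytes per double). This is an illustrative model, not SciDB code.
def per_instance_bytes(rows, cols, chunk, instances):
    rows_padded = -(-rows // chunk) * chunk   # ceil to whole chunks
    cols_padded = -(-cols // chunk) * chunk
    return rows_padded * cols_padded * 8 / instances

GiB = 2.0 ** 30
print(per_instance_bytes(10_000_001, 101, 1000, 24) / GiB)  # ~3.1 GiB > 2 GiB: fails
print(per_instance_bytes(10_000_000, 100, 100, 8) / GiB)    # ~0.9 GiB: OK
print(per_instance_bytes(10_000_001, 101, 1000, 64) / GiB)  # ~1.2 GiB: OK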

We are working on fixing this 32-bit limitation in a future release.

As to the re-computation question: yes, right now repeated invocations of gesvd recompute the decomposition from scratch each time. That’s also a known issue we’re working on. In SciDB Enterprise Edition there is a separate “tsvd” operator for sparse matrices that uses a different algorithm.
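In the meantime, one workaround is to materialize each factor once, e.g. store(gesvd(M10000000100, 'VT'), M_VT) (the name M_VT is just an example), and run the gemm()s against the stored arrays so the decomposition is computed only once.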


#13

@apoliakov - just noticed I never responded to the above! Thanks for looking into this - I can confirm that it solves my problem. Also, a note that may be useful to future users: while it is possible to compute the least-squares solution from a direct SVD of the design matrix X, when X is tall and skinny (as is usually the case) it is more efficient to precompute X^{T}X, use the SVD to invert it, and then compute b = (X^{T}X)^{-1}(X^{T}y). This avoids repeated SVD calls on X, which may be a very large matrix.
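A toy NumPy sketch of that approach (my naming, random stand-in data):

import numpy as np

N, K = 100_000, 100
X = np.random.rand(N, K)
y = np.random.rand(N, 1)

# The Gram matrix is only K x K, so the SVD cost no longer depends on N.
G = X.T @ X

# Invert G via its SVD (G is symmetric positive definite here).
U, s, Vt = np.linalg.svd(G)
G_inv = Vt.T @ np.diag(1.0 / s) @ U.T

b = G_inv @ (X.T @ y)

# Matches a reference least-squares solve on the full X.
assert np.allclose(b, np.linalg.lstsq(X, y, rcond=None)[0])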