Diagonal Matrix


Hi SciDB Folks,

I’m trying to generate a large (10,000,000 x 10,000,000) diagonal matrix (so only the 10,000,000 elements for which row=column are nonzero). I’m currently using the following query:

         iif(row=col, (RANDOM() % 100) / 10.0, 0)), 
         val <> 0), diag)

However, iterating over all 10^14 values seems very inefficient and indeed the query is quite slow. Is there a way that I can generate a 10,000,000 x 1 vector for the main diagonal and then reshape it to be a matrix?


Hi @ahthomas, that’s a lot of zeroes you have there! :slight_smile: How about:

create array diag<val:double default 0.0>[row=0:10000000:0:1000; col=0:10000000:0:1000];
         apply(build(<val:double>[row=0:10000000:0:1000], (RANDOM() % 100) / 10.0),
               col, row),

I used a much smaller array size but this cut the time down from over a minute to 0.25s on a 10,000 x 10,000 array.


@mjl - awesome, works like a charm! Unfortunately, this was for the purpose of creating a sparse matrix to use in gemm which is now crashing with an OutOfMemory error :frowning: - From what I understand gemm fills in missing values with zeros at runtime. Do you know if this happens lazily, or all at once? If the latter, then that would certainly explain what’s going on.


Interesting :frowning: .

The default 0.0 clause in the type specification generates chunks of zeroes when the array is scanned, but these don’t take up much space individually because they are RLE encoded (so, one 0.0 value and a count). Still, there is some per-chunk performance overhead. By replacing default 0.0 with not null you should be able to use spgemm() to get the “fill missing values with zeroes” behavior from the operator without actually manufacturing the zero-filled chunks. (I don’t know whether gemm() does the “turn empty cells into zeroes” behavior, but spgemm() certainly does… thought it requires the attribute to be declared not null.)

Nevertheless, there should be no instance crashes on OutOfMemory conditions. Can you post the list(‘arrays’) output and the query that lead to the crash? Is this the same 192 instance cluster you mentioned in the other thread, with same config.ini etc.?


I discussed the issue a bit with @apoliakov. Apparently although the RLE-encoded zero chunks are small, they must be expanded before being passed to the MPI slave processes that run ScalaPACK. So that could cause the memory starvation. Using spgemm should help there.


@mjl - Thanks! Switching to spgemm() resolves the OOM error. Just curious - looking at top while spgemm is running, it looks like only a single DB instance is active (i.e. 1 has 100% CPU while the others all have zero). Is this expected behavior? I.e. is this operation done purely in a single instance?


Excellent, glad it’s working!

The spgemm code does indeed execute on every cluster instance. For short duration queries you have a better chance of seeing activity on just the “coordinator” node (the one that the iquery client or the shim is actually connecting to), since the non-distributed work of query execution happens there (parsing the query, broadcasting it to workers, merging their responses, managing locks, etc.) happens there.

Also, chunks of the array(s) are distributed across the cluster based on a hash algorithm, so it’s possible that some instances have more work to do than others. That’s especially true when the arrays are relatively small: for a large number of chunks the hash function does a good job of even distribution, but for small chunk counts the distribution can be bumpy. You can use the undocumented (because subject to change without notice) list(‘chunk maps’) query to get an understanding of how a particular array’s chunks are distributed. This recent thread describes it: “Find out the size of a compressed array”. Chunk size and distribution have a big impact on query performance. Once your proof-of-concept queries are working, it’s time to search this forum for ways to optimize, perhaps by adjusting the chunk sizes of your array dimensions to get a better distribution.

So short answer is, I don’t know why your particular query seems to be exercising just the one node, but there are legitimate reasons why it could be.


P.S. The place to start when pondering chunk distribution is the PDF here: “SciDB’s MAC storage explained”


Hey just to add a bit more info here. There are two potential strategies that SPGEMM can execute in. One strategy involves “rotating” the right array between instances. Another strategy is to “replicate” the right array. The latter may be better when the right array is small. If the op picks the “rotate” strategy, it is possible that there aren’t enough chunks to “land” on all instances. In such a case, that is one reason you may see skewed utilization.

For example, try:

spgemm(A, B, right_replicate: true)
spgemm(A, B, right_replicate: false)