Spgemm for boolean matrices?


#1

Hi, I am new to SciDB, and this is my first post here.

I like having spgemm for the distributed sparse matrix multiplication, but it only works with float or double type attributes. For my application I need to perform boolean matrix multiplication which should be significantly easier to implement and must peform better.

So, how can I do this in SciDB? Can I simply define the new boolean type with the * and + operators and still use standard spgemm? If not, is my only option then to develop my own boolean_spgemm? Is it even possible?

Thanks.


#2

Hi Alexey,

You can do a few things.

  1. convert the matrix to double in the middle of the query:
A <a:bool> [i, j]
B <b:bool> [j,k]

spgemm(
 project(apply(A, a_double, iif(a, 1.0, 0.0)),a_double),
 project(apply(B, b_double, iif(b, 1.0, 0.0)),b_double)
)

As you point out though, this may not perform as fast as possible.

  1. We also have a prototype called “dmetric” which defines pairwise distance metrics (as well as allow thresholding). It is completely open on github: github.com/Paradigm4/dmetric
    It is already a standalone plugin so it might be easy to edit.
    You could edit the Logical… and the Settings files to add a new “boolean” option. You could then edit the spgemmSemiringTraits.h to add a path. It should be doable with just a few careful changes.

Let us know how it goes.


#3

Alex, thanks!

The option #1, that is interpreting double as boolean, is what I tried and it was slower than I need. I am looking for at least 10x impovemens. I started with the lowest hanging fruits: optimizing the chunk size and increasing the number of instances running on a server which is still underutilized. I am using SciDB_14.12_02_13_2015 (ami-3cace654) AMI in AWS, but instead of m3 as a recommended instance type I am using c3.8xlarge, which is significantly larger. Then I changed config.ini to have 1 coordinator and 31 worker nodes, run scidb.py initall mydb, and scidb_restart.sh. I can see all 32 processes, but during the query execution only a few of them are actually doing something. If you could recommend any document that can help me to tune up the cluster in AWS I would appreciate it.

BTW, the task that I have is to multiply very sparse boolean matrices of a size of a few billions by a few billions, with around 10 billion non-zero elements. Luckily, the task can be partitioned into 5% of the total size partitions. I am hoping to be able to process the 5%, or 500 million of matrix elements, in a SINGLE server, avoiding the slow network traffic. I mean one matrix multiplication to be done in under 15 min.

The option #2, that is using dmetric and defining new semiring traits sounds like what I was looking for. I’ll give it a try after I complete the cluster tuneup. However I have a vague idea how to create plugins in SciDB.

Thanks again.