Option #1, that is interpreting double as boolean, is what I tried, and it was slower than I need; I am looking for at least a 10x improvement. I started with the lowest-hanging fruit: optimizing the chunk size and increasing the number of instances running on a server that is still underutilized. I am using the SciDB_14.12_02_13_2015 (ami-3cace654) AMI in AWS, but instead of m3, the recommended instance type, I am using c3.8xlarge, which is significantly larger. I then changed config.ini to have 1 coordinator and 31 worker instances, ran scidb.py initall mydb, and scidb_restart.sh. I can see all 32 processes, but during query execution only a few of them are actually doing anything. If you could recommend any document that would help me tune the cluster in AWS, I would appreciate it.
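For reference, the relevant part of my config.ini now looks roughly like the sketch below; the paths, database name, and port are illustrative, not my exact values, and the server-0=host,N line is the one I changed to get 1 coordinator plus 31 workers on the single box:

```ini
[mydb]
# 1 coordinator + 31 additional worker instances on this one server (32 total)
server-0=127.0.0.1,31
db_user=mydb
install_root=/opt/scidb/14.12
pluginsdir=/opt/scidb/14.12/lib/scidb/plugins
logconf=/opt/scidb/14.12/share/scidb/log4cxx.properties
base-path=/home/scidb/scidb_data
base-port=1239
interface=eth0
```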
BTW, the task I have is to multiply very sparse boolean matrices of a few billion by a few billion in size, with around 10 billion non-zero elements. Luckily, the task can be partitioned into pieces of about 5% of the total size. I am hoping to process one such 5% piece, i.e. about 500 million matrix elements, on a SINGLE server, avoiding slow network traffic; by that I mean one matrix multiplication done in under 15 min.
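To make that concrete, this is a sketch of what option #1 (double as boolean) looks like for me, assuming spgemm from the linear_algebra library; the array names, dimension bounds, and chunk sizes below are placeholders, not my real ones:

```sh
# load the sparse multiply operator
iquery -aq "load_library('linear_algebra')"

# non-zero cells are loaded as 1.0, all other cells stay empty (sparse)
iquery -q "CREATE ARRAY A <v:double> [r=0:499999999,1000,0, c=0:499999999,1000,0]"
iquery -q "CREATE ARRAY B <v:double> [r=0:499999999,1000,0, c=0:499999999,1000,0]"

# sparse matrix-matrix multiply on the double-encoded booleans
iquery -aq "spgemm(A, B)"
```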
Option #2, that is using dmetric and defining new semiring traits, sounds like what I was looking for. I'll give it a try after I complete the cluster tune-up. However, I have only a vague idea of how to create plugins in SciDB.
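Just to check I understand the idea: I imagine the new semiring traits would express the boolean (OR, AND) algebra, something like the purely hypothetical sketch below; this is only the algebra I need, not the actual plugin API:

```cpp
// Purely illustrative: a boolean (OR, AND) semiring written as a traits struct,
// where "plus" is logical OR and "times" is logical AND.
struct BoolOrAndSemiring
{
    typedef bool Value;

    static Value addIdentity()         { return false; }   // identity of OR
    static Value mulIdentity()         { return true;  }   // identity of AND
    static Value add(Value a, Value b) { return a || b; }  // accumulate partial products
    static Value mul(Value a, Value b) { return a && b; }  // combine matching entries
};
```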