Sparse, high-dimensional vector space in SciDB


#1

Hi all,

In an application, I have to store and search a large number of vectors (100M+) that live in a sparse, high-dimensional space (on the order of 10 to 10K dimensions).
This problem is analogous to a word-document occurrence matrix, with many documents, many words, but few words in each document.
My schema is roughly following this JSON structure:

Document { id: "a1", author: "Peter", word1: 0, word2: 1, word3: 2, word4: 0, ... word10040: 4, word10041: 0, ... }

So, the documents are the rows of a matrix and the words are the columns, with very high sparsity (fewer than 5% of the entries are non-zero).
The main operations I need to perform on these objects are: appending new rows at the end of the matrix, slicing (applying operations only to certain rows/columns), dot products between rows (or cosine distance), singular value decomposition, etc.
To date, I’ve been doing these things using SciPy arrays, but I’m hitting memory limits.
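A toy-sized sketch of the kind of in-memory workflow I mean, written here with scipy.sparse (illustrative only; the real matrix has 100M+ rows):

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Toy word-document matrix in coordinate form: rows = documents, cols = words.
rows = np.array([0, 0, 1, 2, 2])            # document ids
cols = np.array([1, 3, 2, 1, 4])            # word ids
vals = np.array([1.0, 2.0, 4.0, 3.0, 1.0])  # counts
X = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 5))

# Appending new rows: stack another sparse block underneath.
X = sparse.vstack([X, sparse.csr_matrix((1, 5))]).tocsr()

# Slicing: work on selected rows/columns only.
subset = X[[0, 2], :]

# Dot products between rows, then cosine similarity.
gram = (X @ X.T).toarray()
norms = sparse.linalg.norm(X, axis=1)
cosine = gram / (np.outer(norms, norms) + 1e-12)

# Truncated SVD of the sparse matrix.
U, s, Vt = svds(X, k=2)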

SciDB seems great for storing and searching large arrays in a transparent way, but I can’t find the ideal way of storing this kind of data in it. I guess the words could be dimensions, and things like “author” could be attributes of an array. Would that work OK with a very large number of dimensions (10K+)?

What is your recommendation? How would you deal with this type of data?


#2

Hi,

I think it might be possible to do something like this. Using SciDB notation, but omitting bounds and chunk sizes (which will be important to get right):

documents         <id:string, author:string>  [document_id]
words             <word:string>               [word_id]
word_occurrences  <count:uint64>              [document_id, word_id]

In other words, we are storing a matrix: the n-dimensional document vectors are encoded as cells in a 2D [document_id, word_id] space. We’ve done this sort of thing a few times before: Bitcoin transactions (see goo.gl/xT0Pd2), social network data, etc.
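To make the mapping concrete: each JSON document flattens into a handful of (document_id, word_id, count) cells for its non-zero words, and those cells are all the 2D array needs to hold. A quick illustrative sketch in Python (the helper name and word-key parsing are made up to match the example above, not anything SciDB-specific):

def document_to_cells(document_id, doc):
    # Flatten one JSON-style document into sparse (document_id, word_id, count)
    # cells; only non-zero counts are kept, so output size tracks the ~5% density.
    cells = []
    for key, value in doc.items():
        if key.startswith("word") and value != 0:
            word_id = int(key[len("word"):]) - 1   # word1 -> 0, word2 -> 1, ...
            cells.append((document_id, word_id, value))
    return cells

doc = {"id": "a1", "author": "Peter", "word1": 0, "word2": 1, "word3": 2, "word10040": 4}
print(document_to_cells(0, doc))   # [(0, 1, 1), (0, 2, 2), (0, 10039, 4)]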

The sparse truncated SVD and GLM operators are currently in the Enterprise Edition, but we sometimes make them available for academic research. Let us know if you’re interested.