In an application, I have to store and search a very large number of vectors (100M+) in a sparse, high-dimensional space (10 to 10K dimensions).
This problem is analogous to a word-document occurrence matrix: many documents and many words, but few words in each document.
My schema is roughly following this JSON structure:
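Something along these lines (the field names here are made up for illustration; the real schema differs in details):

```json
{
  "id": "doc-00042",
  "author": "jsmith",
  "words": { "sparse": 3, "matrix": 1, "search": 2 }
}
```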
So, the documents are the rows of the matrix and the words are the columns, with very high sparsity (fewer than 5% of the values are non-zero).
The main operations I need to perform on these objects are: appending new rows to the end of the matrix, slicing (applying operations only to certain rows/columns), dot products between rows (or cosine distances), singular value decomposition, etc.
To date, I’ve been doing these things using SciPy arrays, but I’m hitting memory limits.
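For reference, here is a toy sketch of what I am doing today (assuming `scipy.sparse`; sizes are tiny placeholders, the real matrix is several orders of magnitude larger, which is where memory runs out):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Random sparse "document-word" matrix: rows = documents, columns = words,
# ~5% non-zero entries.
docs = sparse.random(100, 1000, density=0.05, format="csr", random_state=0)

# Appending new rows: sparse matrices have no in-place append, so vstack
# builds a new matrix each time (one reason this gets expensive at scale).
new_doc = sparse.random(1, 1000, density=0.05, format="csr", random_state=1)
docs = sparse.vstack([docs, new_doc], format="csr")

# Slicing: CSR is efficient for row slices, CSC for column slices.
some_docs = docs[10:20, :]

# Cosine similarity between all row pairs: L2-normalize the rows, then a
# sparse-sparse dot product yields the full similarity matrix.
norms = np.sqrt(np.asarray(docs.multiply(docs).sum(axis=1)).ravel())
norms[norms == 0] = 1.0  # guard against empty documents
normalized = sparse.csr_matrix(docs.multiply(1.0 / norms[:, np.newaxis]))
similarities = normalized @ normalized.T

# Truncated SVD of the sparse matrix (k largest singular values).
u, s, vt = svds(docs, k=10)
```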
SciDB seems great for storing and searching large arrays transparently, but I can't find the ideal way of storing this kind of data in it. I guess words could be dimensions and things like "author" could be attributes of an array. Would that work well with a very large number of dimensions (10K+)?
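If I understand SciDB's model correctly, I would end up with a schema along these lines (AQL syntax; array/attribute names and chunk sizes are placeholders, and I am not sure this is idiomatic):

```
CREATE ARRAY corpus <count:uint32, author:string>
  [doc=0:*,10000,0, word=0:9999,1000,0];
```

My worry is whether a 2-D array with 10K+ cells along the `word` dimension and 100M+ along `doc`, at under 5% density, is something SciDB handles gracefully.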
What is your recommendation? How would you deal with this type of data?