Machine learning with Python and SciDB



I am currently working with large datasets that do not fit in memory (>2GB), so I want to use SciDB. I also want to run some machine learning algorithms and feature selection on my dataset, and I have started looking at existing Python packages to use together with SciDB-Py.

The problem I have is how to combine the functionality of packages like scikit-learn with SciDB's advantage of performing calculations directly in the database.

Is there someone who has worked with SciDB together with existing machine learning packages and can share their experiences? Are there any packages that fit my purposes or do I have to implement learning methods myself?


I have not used SciDB with existing machine learning packages. So I can only share with you some general opinions that I have.

SciDB provides low-level “building blocks”, such as operators to do matrix multiplication, filter by attributes, filter by dimensions, or join. Machine learning algorithms sit at a higher level and need to use those building blocks. So to use SciDB to do machine learning, it may be inevitable that you implement the learning methods yourself. There are, however, three ways to do it:

- The most popular use case is to implement some functionality as scripts.
- Functionality that is self-contained and reusable may be implemented as user-defined operators, expanding the set of building blocks.
- If you have some low-level third-party code (e.g. code that compresses/decompresses/understands/compares DNA sequences), you may implement a user-defined type that wraps the third-party code and overloads operations such as == or <.
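To make the “build it from the blocks” point concrete, here is a minimal sketch of a learning method expressed entirely in bulk matrix operations: ordinary least squares via the normal equations. It runs locally in plain NumPy for illustration; the same steps (transpose, matrix multiply, linear solve) correspond to the kind of operators SciDB exposes. The function and data are illustrative, not from any SciDB package.

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares via the normal equations.

    Each step is a bulk matrix operation of the kind an array
    database exposes as a building block (transpose, multiply, solve).
    """
    XtX = X.T @ X          # Gram matrix
    Xty = X.T @ y
    return np.linalg.solve(XtX, Xty)

# Toy data: y = 1 + 2*x, with an explicit intercept column.
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = 1.0 + 2.0 * np.arange(5.0)
beta = ols_fit(X, y)
print(beta)  # close to [1.0, 2.0]
```

The point is that once the matrix primitives are available in the database, a method like this is a short composition of them rather than a whole new engine.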

In terms of scripting, a lot of people are using SciDB through SciDB-R. You mentioned SciDB-Py, which is fine, but I am not sure we will support it forever. Personally I wrote some Python scripts (e.g. those in the bin directory of any SciDB installation) using our own scidblib Python package (also in the bin directory), which wraps iquery. My vision is that we will maintain and expand the scidblib Python package in the long run, and at some point publish the part of it that we commit to support.
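As a rough illustration of the “script over iquery” approach: I don't know scidblib's internals, so the helpers below are hypothetical and not its real API, but they show the general shape of shelling out to the iquery client and parsing its output.

```python
import subprocess

def build_iquery_cmd(afl_query, host="localhost", port=1239, no_fetch=False):
    """Assemble an iquery command line (hypothetical helper, not scidblib's API).

    -a      : interpret the query as AFL
    -o csv  : ask for CSV output, which is easy to parse from Python
    -n      : suppress result fetching (for side-effect-only queries)
    """
    cmd = ["iquery", "-c", host, "-p", str(port), "-a", "-o", "csv"]
    if no_fetch:
        cmd.append("-n")
    cmd += ["-q", afl_query]
    return cmd

def run_afl(afl_query, **kwargs):
    """Run an AFL query and return stdout lines (requires iquery on PATH)."""
    out = subprocess.check_output(build_iquery_cmd(afl_query, **kwargs))
    return out.decode().splitlines()

print(build_iquery_cmd("list('arrays')"))
```

`run_afl` obviously only works on a machine with a running SciDB and the iquery binary; the command-building part runs anywhere.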

Hope this helps.


Thanks, I am a beginner at this so everything helps =)

Ok, that makes sense. I haven’t looked into SciDB-R yet so I will do that and see if there is more support for package overlaps.


osivar, I am working on the same problem.
Is there a specific algorithm in SciDB you are planning to use?

scikit-learn is huge…


Donghui, what are the fundamental differences between SciDB-Py and scidblib? Can't they be merged together?


Hey @senya72 @osivar
Happy to report we’ve made considerable progress in this area. Take a look:


ah cool. i will def play with it


First comment: the data streaming options should not be limited to streaming by chunk, as a chunk is a unit of storage but not really a unit of computation, unless the two are intentionally made to coincide. The data should be streamed by something meaningful, maybe by dimension value, attribute value, or a combination of the two. For instance, if I want to do a calculation per stock, I should be able to stream by stock, assuming stock is a dimension; or if I wanted to do macro analysis, I could stream by country, assuming country is an attribute.


I agree. At the moment you can achieve that in some cases. For example, if your array has “stock” as the first dimension, you can use _sg(array, 3) to redistribute the data “by row”. Then all the data for the same stock will land on the same instance. You'll still have to stream in chunks, but you're guaranteed that one streaming process will see all the data for a particular stock.
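To sketch what the per-group computation looks like once all rows for a stock land on one instance (plain Python with illustrative names and data; a real stream() child would read its input from the plugin's chunk protocol, which is not shown here):

```python
from itertools import groupby
from operator import itemgetter

def per_stock_mean(rows):
    """Average price per stock.

    rows: iterable of (stock, price) pairs. Because the redistribution
    guarantees one process sees every row for a given stock, a local
    sort + groupby inside the process is enough to compute per-stock
    aggregates.
    """
    result = {}
    key = itemgetter(0)
    for stock, grp in groupby(sorted(rows, key=key), key=key):
        prices = [price for _, price in grp]
        result[stock] = sum(prices) / len(prices)
    return result

rows = [("AAPL", 10.0), ("MSFT", 20.0), ("AAPL", 14.0), ("MSFT", 22.0)]
print(per_stock_mean(rows))  # {'AAPL': 12.0, 'MSFT': 21.0}
```

The same pattern applies to streaming by any other dimension or attribute value, provided the redistribution step has already grouped the data accordingly.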

A more heavyweight option is redimension, of course. Many of the streaming workflows I've seen so far are of the form stream(redimension(cross_between())) or stream(redimension(...join())).

I agree that more tools in this space will be useful. Definitely thinking about it.