How does SciDB compare to or work with NumPy, Pandas, SciPy, etc.?


#1

I understand SciDB has a wrapper for Python, but does it fight with the analytical workflow built on libraries such as NumPy, Pandas and SciPy, or does it enhance or enable their usage?

I am reading Wes McKinney’s book “Python for Data Analysis” which covers the aforementioned Python tools.

What I am looking for is:

  1. Would SciDB enhance a workflow I am developing, or is it a completely different road?

  2. What does SciDB do that an analytical Python stack does not?

Thank you.


#2

First, SciDB-Py definitely does not intend to fight with / supersede mature libraries such as NumPy, Pandas and SciPy.

Let me answer your queries one by one:

To answer this, I quote a portion of the following paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4588804/

“Fast queries of the data through the web using SciDB, a parallelized database for high performance computing, make this process operate quickly. By using scripting containers, such as IPython or Jupyter, to analyze the data, scientists can utilize a wide variety of freely available graphing, statistics, and information management resources [… namely SciPy, NumPy, Pandas etc.].”

SciDB is a multidimensional array database that can manage massive datasets (terabytes to petabytes) across a hardware cluster. The analytical Python stack you mentioned does not do this. What the authors of the paper above did (and what many of our customers do) is use SciDB to store the data, run certain computations in the database, and then select and download smaller chunks of data for processing with the Python analytical stack. SciDB is very fast at the ‘select’ and ‘in-database’ steps, but not every operation in the Python analytical stack is available within SciDB. Separating storage, in-database computation, and out-of-database computation in this way therefore makes a lot of sense for many SciDB users.
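To make the division of labour concrete, here is a rough sketch of that three-step pattern. It does not call SciDB or SciDB-Py at all: the "database" is simulated with an in-memory NumPy array, and `FAKE_DB_ARRAY` and `select_and_aggregate` are hypothetical names invented for this illustration. In a real setup that function would be replaced by a query sent to the cluster through SciDB-Py, and the array would be far too large to hold locally.

```python
import numpy as np
import pandas as pd

# Stand-in for a large array that would live in SciDB (assumption: in a real
# deployment this data stays on the cluster and is never loaded locally).
rng = np.random.default_rng(seed=0)
FAKE_DB_ARRAY = rng.normal(size=(1000, 50))  # hypothetical: 1000 samples x 50 sensors


def select_and_aggregate(row_start, row_stop):
    """Hypothetical stand-in for a SciDB query.

    Mimics the 'select' and 'in-database computation' steps: pick a window
    of rows, average each column, and return only the small result.
    """
    window = FAKE_DB_ARRAY[row_start:row_stop]
    return window.mean(axis=0)


# Steps 1-2: select and reduce "in the database"; only 50 values come back.
column_means = select_and_aggregate(100, 200)

# Step 3: out-of-database analysis with the usual Python stack.
summary = pd.Series(column_means, name="mean_reading").describe()
top_sensors = np.argsort(column_means)[-3:][::-1]  # indices of the 3 largest means

print(summary)
print("Top sensors:", top_sensors)
```

The point of the sketch is only the shape of the workflow: the heavy selection and reduction happen where the data lives, and NumPy and Pandas work on the small result that is downloaded.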