SciDB vs. Spark


#1

Hi, new here. At the risk of showing my ignorance, I'd like to get opinions on the pros and cons of setting up a large-scale scientific computing solution based on SciDB vs. Spark, which is also very popular at the moment.

For example, what use cases are typically a bad fit for each?

To put things in context: for the problem at hand, the data set may not fit on one machine, in either memory or disk, but there are ways around that by preprocessing it into pieces. On the other hand, if SciDB really does solve the problem of storing sparse tensors and computing on them directly, then that preprocessing, along with the algorithmic limitations that come with it, might be superfluous. (I've sketched below what I mean.)
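For concreteness, here is roughly what I'm hoping for. This is only a sketch under assumptions: I'm going by the scidb-py bindings and SciDB's AFL as I understand them, and the connection URL, array name, dimensions, and chunk sizes are all made up.

```python
# Sketch: store a sparse 3-D tensor in SciDB and aggregate it in place,
# rather than chopping it up by hand first. Assumes the scidb-py package
# and a SciDB instance reachable over HTTP (shim); names are hypothetical.
from scidbpy import connect

db = connect('http://localhost:8080')

# A sparse 3-D array: only non-empty cells are stored, and the data is
# split into chunks (1000 cells per dimension here) across the cluster.
db.iquery("""
    CREATE ARRAY tensor <val:double>
        [i=0:99999,1000,0, j=0:99999,1000,0, k=0:99999,1000,0]
""")

# ... load values here (e.g. from CSV or binary files) ...

# Direct computation on the sparse array: sum val grouped by dimension i,
# without densifying or manually partitioning the tensor.
marginals = db.iquery("aggregate(tensor, sum(val), i)", fetch=True)
print(marginals)
```

Is that a realistic picture of how it would work?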

Thank you in advance!


#2

Hi, thanks for your interest.

We did some fairly careful benchmarking. There are even code snippets lying around: github.com/Paradigm4/variant_wa … _benchmark

We might release a lot of the figures and data later. But I’ll give you a very quick impression:

  1. SciDB is usually faster (at least 1.5x; we've observed 40x+ in some cases). That comes from C++, clustered storage, chunking, etc.
  2. You program SciDB from AFL / R / C++; you program Spark from Java / Scala. You may be partial to one depending on your expertise (there's a sketch contrasting the two right after this list).
  3. In some cases, Spark may make it easier to stand up and try a new kind of analytic. Spark also has somewhat better-looking GUIs and management tooling.
  4. SciDB is a database built for a particular purpose. Spark is more of a “build your own database” bag of tools: you pick a file system, a file format, serializers, concurrency-control strategies, and so on. You pick exactly what you need, but the drawback is that those pieces may not always play optimally with each other.
  5. Both systems are hard to tune. With Spark there is extra difficulty because you have to tune the individual interacting pieces.
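To make point 2 concrete, here's the same toy analytic, a grouped sum, against both systems. This is a sketch from memory rather than from the benchmark repo above: the SciDB side goes through the scidb-py bindings (AFL underneath), the Spark side through PySpark, and every name, URL, and path is hypothetical.

```python
# Sketch: sum val grouped by i, expressed against both systems.
# All array/table names, URLs, and paths below are made up.

# --- SciDB: one engine, one storage format. The AFL query runs where the
# chunked, clustered data already lives.
from scidbpy import connect

db = connect('http://localhost:8080')          # shim endpoint (assumed)
by_i = db.iquery("aggregate(A, sum(val), i)",  # group by dimension i
                 fetch=True)

# --- Spark: you assemble the stack yourself: cluster manager, file
# system, file format (Parquet here), serializer; then you express the
# same analytic on a DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("grouped-sum").getOrCreate()
df = spark.read.parquet("hdfs:///data/A.parquet")  # made-up path
df.groupBy("i").sum("val").show()
```

The one-line difference is really the point of items 4 and 5: in SciDB the storage and execution choices are fixed for you; in Spark each of those layers is a choice you make and then have to tune.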

Hope this helps. By way of disclaimer: I'm trying to be impartial, but I do work for P4 (Paradigm4). Let us know if your experience differs.