Provenance in SciDB


#1

I was wondering what the status and future plans are for adding provenance to SciDB, as described in The architecture of SciDB?


#2

Well … it’s a mixed bag.

  1. There’s full provenance on the data. SciDB uses a multi-version concurrency control (MVCC) approach to transactions. So unless you undertake some administrative step that obliterates it, you’re able to reconstruct the entire history of what your data looked like at any point in time. You’re even able to use queries to derive things like a change log.

  2. We log every query. All user queries find their way into the scidb.log file. So as long as you’re managing those files somehow you’re able to get a complete history of what queries were run, and when. Simple story? The power of ‘grep’!
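To make those two points a bit more concrete, here is a minimal Python sketch, not anything that ships with SciDB. It assumes you have already pulled two versions of an array out into plain dictionaries keyed by cell coordinates, and that you have scidb.log available as lines of text. The names `change_log` and `grep_log` and the snapshot layout are made up for illustration, and nothing here depends on the actual log format.

```python
import re
from typing import Dict, Iterable, List, Tuple

Cell = Tuple[int, ...]        # array coordinates, e.g. (row, col)
Snapshot = Dict[Cell, float]  # hypothetical in-memory copy of one array version


def change_log(old: Snapshot, new: Snapshot) -> List[Tuple[str, Cell, object, object]]:
    """Diff two snapshots of one array into (action, cell, before, after) rows."""
    changes = []
    # Visit every cell present in either version, in coordinate order.
    for cell in sorted(old.keys() | new.keys()):
        before, after = old.get(cell), new.get(cell)
        if before is None:
            changes.append(("insert", cell, None, after))
        elif after is None:
            changes.append(("delete", cell, before, None))
        elif before != after:
            changes.append(("update", cell, before, after))
    return changes


def grep_log(lines: Iterable[str], pattern: str) -> List[str]:
    """Return the log lines that match a regular expression, like `grep -E`."""
    rx = re.compile(pattern)
    return [line.rstrip("\n") for line in lines if rx.search(line)]
```

You would call `change_log(v1, v2)` on two snapshots to list inserted, deleted and updated cells, and `grep_log(open("scidb.log"), r"store\(")` to pull out every logged line that mentions a `store(` call. The sketch treats a missing cell as an empty cell, so it assumes no stored value is itself `None`.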

So far, so good. But then … things get tricky. We started out in the early days asking people about provenance and they all said, “Yes. We want that.” But then, when we tried to peel back the layers of the onion a little to find out what they actually meant by the term “provenance”, we got a bewildering array (no pun intended) of answers.

Some folk wanted considerably more metadata about the arrays: where the data came from, and how the downstream products were used. This would be OK in theory, but in practice we were obliged to confront questions like “In what format should that be made available?” Everyone has their own “standard” and, guess what, the standards are incompatible.

Other folk wanted to capture workflow at a “higher” level. They wanted to view the SciDB platform as a kind of workflow management tool to organize all of the scripting, external software interactions and so on. There are good standards for this kind of thing, such as the Open Provenance Model, but they’re very big and very general, and they run to things like vocabulary ontologies, which are quite beyond the scope of what we had in mind.

And finally, you’re always confronted with more things than you can reasonably do. And whenever we asked a customer “Would you rather have X or improved provenance?”, their answer has (so far) always been “X, please!”

So, TL;DR: we track the basic information needed to provide provenance about the history of SciDB’s data and queries, but we’re stuck on the question, “What next?”