New version of an array


#1

Hi guys,

I read in the documentation that SciDB creates a new version of an array when I update it. How does that work? Is it copying the array or just adding a tag to the new/updated element? I don’t really need new versions, so is it possible to disable versioning?

Regards,
Georgi.


#2

So. An answer in two parts.

  1. The way we want it to work (and the way it used to work) is as follows.

When you write a chunk to disk for the first time, we allocate a little extra space in it. When you update a chunk’s contents, we figure out a “backwards delta” of the differences between the “new” and the “previous” versions of the chunk’s data. (Recall, SciDB uses a multi-version concurrency control approach to transaction management.) We then write this “backwards delta” into the extra space at the end of the on-disk chunk.

When you want the “current” or “latest” version of the chunk’s data, you simply read the current chunk’s state. When you want a previous (older) version, you apply a series of backwards deltas. If you create so many backwards deltas that you fill the space, we create a new copy of the chunk at the latest version (with its own space).

The idea is to support ( a ) fast access to latest version with ( b ) support for going backwards through versions at ( c ) modest (minimal) storage space overhead. The design assumption is that in-place updates are rare. We’ve designed things with fast bulk loads / appends in mind, each of which creates additional chunks but rarely updates any.
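To make the scheme above concrete, here is a minimal Python sketch of it. This is not SciDB's code: the class name `VersionedChunk`, the `reserve` parameter, and counting the reserve in "changed cells" are all assumptions made for illustration. It only shows the idea of keeping the latest data readable directly, storing backwards deltas in a reserve, and starting a fresh copy when the reserve fills.

```python
class VersionedChunk:
    """Toy model of a chunk with backwards deltas (hypothetical, not SciDB's API)."""

    def __init__(self, cells, reserve=4):
        self.reserve = reserve  # assumed unit: number of changed cells that fit
        # Each segment holds a full copy at its newest version plus the
        # backwards deltas leading to the older versions it covers.
        self.segments = [{"first": 1, "cells": list(cells), "deltas": []}]
        self.version = 1

    def update(self, new_cells):
        seg = self.segments[-1]
        old = seg["cells"]
        # Backwards delta: the OLD value of every cell that is about to change.
        delta = {i: old[i] for i in range(len(old)) if old[i] != new_cells[i]}
        used = sum(len(d) for d in seg["deltas"])
        self.version += 1
        if used + len(delta) <= self.reserve:
            seg["deltas"].append(delta)
            seg["cells"] = list(new_cells)
        else:
            # Reserve is full: leave the old segment frozen at the previous
            # version and start a new copy with its own (empty) reserve.
            self.segments.append(
                {"first": self.version, "cells": list(new_cells), "deltas": []}
            )
        return self.version

    def read(self, version=None):
        if version is None:
            version = self.version
        if not 1 <= version <= self.version:
            raise ValueError("no such version")
        # Pick the newest segment that covers the requested version.
        seg = next(s for s in reversed(self.segments) if s["first"] <= version)
        cells = list(seg["cells"])
        # deltas[k] turns version first+k+1 back into version first+k,
        # so walk from the newest delta down to the requested version.
        for k in range(len(seg["deltas"]) - 1, version - seg["first"] - 1, -1):
            for i, old_value in seg["deltas"][k].items():
                cells[i] = old_value
        return cells


chunk = VersionedChunk([1, 2, 3, 4], reserve=2)
chunk.update([1, 9, 3, 4])   # version 2, delta fits in the reserve
chunk.update([1, 9, 7, 4])   # version 3, reserve is now full
chunk.update([0, 9, 7, 4])   # version 4, forces a new copy of the chunk
latest = chunk.read()        # no deltas applied
original = chunk.read(1)     # two backwards deltas applied
```

Reading the latest version never touches a delta, which is point ( a ); older versions cost one delta application per step back, which is point ( b ).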

  2. Notice how much of that was “future tense”? That’s because it isn’t actually the way things really work at the moment.

Our initial implementation of the “compute the delta” code uses a generic method. And it’s very, very slow. But as we had very few (no?) customers / users doing in-place updates, and lots of customers doing other things, we left the code in place but turned off the feature. Now, when you update a chunk, we just create a new copy of the whole thing. In theory this is terrible from a space utilization point of view. But no one’s doing updates, so no one’s (yet) complained.
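For a rough feel of the space difference, here is a small back-of-the-envelope sketch in Python. The function names and the "count stored cells" cost model are assumptions for illustration only; it ignores the reserved space, per-delta overhead, and anything else a real storage layer would add.

```python
def full_copy_cost(versions):
    # Current behaviour as described: every update stores a whole new chunk.
    return sum(len(v) for v in versions)


def backward_delta_cost(versions):
    # Intended behaviour: one full copy of the latest version, plus one
    # stored old value per cell that changed between consecutive versions.
    changed = sum(
        sum(1 for a, b in zip(prev, cur) if a != b)
        for prev, cur in zip(versions, versions[1:])
    )
    return len(versions[-1]) + changed


# Hypothetical workload: a 1000-cell chunk, then 10 updates of one cell each.
versions = [[0] * 1000]
for i in range(10):
    nxt = list(versions[-1])
    nxt[i] = 1
    versions.append(nxt)

copies = full_copy_cost(versions)        # 11 full chunks
deltas = backward_delta_cost(versions)   # 1 full chunk + 10 changed cells
```

With small, sparse updates like these the full-copy approach stores roughly one extra chunk per update, which is why it only stays tolerable while updates are rare.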

We plan to change the implementation of the “compute delta” code and get the approach described in point #1 working again later this year. But there you are. That’s how the sausage is made.

Hope this helps!