Bullet points versus RDBMS + GIS?


#1

Hi, I ran across SciDB and it seems like a possible fit for processing our large 3D images (50 to 500 gigabytes now, terabytes later). My organization could purchase one of the proprietary RDBMS solutions + GIS but as far as I can tell, there isn’t a simple solution for storing large 3D images and enabling read/writes on arbitrary subvolumes of the 3D image. Oracle Spatial GeoRaster seems to stop at 2D + bands, even though 3D GeoRaster is mentioned but not yet implemented? OpenGIS and other RDBMS GIS solutions seem to limit queries on raster data to 2D or are geared toward N-d points (sparse) instead of the dense arrays required of our 3D images. Is this the correct interpretation of the current RDBMS GIS landscape?

Putting aside provenance, user-defined C/C++ functions, and open source considerations (all significant IMHO but for the sake of argument), are there any commercial solutions that address very large 3D image storage and arbitrary subvolume access? And if so, what are SciDB’s advantages versus the commercial solution (e.g., speed, image size restrictions)?

Thanks in advance for any information you can provide!

Best,
Bill


#2

Hi Bill!

Well, it’s not an easy question, because we don’t (yet) have a good handle on the answers. But that said …

When it comes to software platforms that “address very large 3D image storage and arbitrary subvolume access”, SciDB is really the only game in town. This is precisely the kind of problem that we’re going after, especially if your application’s goal is to do something more than simply store and curate a large image library. We come from a “big science” perspective: lots of sensor equipment pointed at interesting events (the sky at night, protons and lead ions smashing together, seismic waves moving through the crust, false color images of the planet), where the goal is to answer a scientific question using the imagery. Is the universe expanding? Is there a Higgs boson? Where’s the oil? How much green is left?

The point is that, in addition to looking at the images, these communities want to use the data in the image to perform some kind of analysis. They want to use complex mathematical tools to pick apart the images to understand what the colors and movement mean, from a scientific perspective. The color profile and light intensity of distant objects tell us about how the universe is structured, for example. And you frequently want to test your understanding of the data at hand by clustering or correlating objects. SciDB does this too.

Now, we (SciDB) certainly aren’t a GIS. And we don’t lay claim to being a terrific operational platform for recording the metadata that describes the contents of the image library. Tools that make up the current RDBMS GIS landscape have things like spatial indexing, spatial and spatio-temporal data types, and business relationships with people who provide data sets you might want (ESRI, for example). We don’t have any of those.

But that said, we’re also motivated by the belief that the days of the “one size fits all” DBMS platform are long gone. If your application needs to combine sophisticated image manipulation and analysis with GIS and with operational data management, then you’re inevitably going to find yourself using a variety of tools to get the job done. We’re mindful of this trend in our design work; we’re trying to keep the interfaces to SciDB contents as open and as easy to get to as possible.

So - a quick summary:

  • SciDB implements an entirely new storage manager. All of the existing RDBMS engines are implemented to support an unordered, SQL/relational data model, so they can do things like pack unordered rows onto pages and decompose tables in interesting ways. SciDB’s storage manager instead tries to exploit the “ordered” properties of arrays. Cells that are adjacent in a SciDB array, in any dimensional direction, are stored adjacent to one another in the same physical data blocks. So if you want to do something like “arbitrary sub-volume access”, the SciDB storage manager ensures that the sub-volume is physically co-located.

  • The SciDB storage manager decomposes arrays into partitions (we call them “chunks”) and distributes them over the physical computers the installation is using. The result is that when you want to perform a value-search operation (“find me all the blue values”), we can execute it in parallel, with all of the nodes in your cluster operating concurrently on their own local chunks.

  • When you get to large scale analytics, the basic building block algorithms used within statistical processing are all linear algebraic: matrix operations that benefit from block-wise processing. In contrast with RDBMS engines, having an array data model benefits this kind of analysis enormously.
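
To make the chunking idea in the bullets above concrete, here is a minimal Python sketch. It is not SciDB code and does not use any SciDB API: the function names, the 256-cell chunk size, the 4-node cluster, and the round-robin placement are all assumptions made up for illustration. It only shows how a sub-volume request can be mapped to the few chunks that overlap it, and to the nodes that would hold those chunks.

```python
from itertools import product


def chunk_coords_for_subvolume(start, stop, chunk_shape):
    """Return grid coordinates of every chunk overlapping the half-open box [start, stop)."""
    ranges = [
        range(s // c, (e - 1) // c + 1)
        for s, e, c in zip(start, stop, chunk_shape)
    ]
    return list(product(*ranges))


def assign_chunks_to_nodes(array_shape, chunk_shape, n_nodes):
    """Hypothetical round-robin placement of chunks onto nodes (an assumption, not SciDB's policy)."""
    # Number of chunks along each dimension, rounding up.
    grid = [-(-dim // c) for dim, c in zip(array_shape, chunk_shape)]
    all_coords = product(*(range(g) for g in grid))
    return {coord: i % n_nodes for i, coord in enumerate(all_coords)}


# Illustrative numbers only: a 1024^3 image cut into 256^3 chunks on 4 nodes.
array_shape = (1024, 1024, 1024)
chunk_shape = (256, 256, 256)
placement = assign_chunks_to_nodes(array_shape, chunk_shape, n_nodes=4)

# A small sub-volume that straddles chunk boundaries in x and z.
touched = chunk_coords_for_subvolume((200, 0, 500), (300, 100, 520), chunk_shape)
nodes = sorted({placement[c] for c in touched})

print(len(placement), "chunks in total")
print("sub-volume touches chunks:", touched)
print("held on nodes:", nodes)
```

In this toy setup the sub-volume touches 4 of the 64 chunks, so only those chunks need to be read, and the nodes holding them can work on their pieces at the same time.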

All of that said, there are a lot of things SciDB has basically lifted from other people.

  • SciDB supports a query language API that allows users to codify the data manipulation they want to do in queries, rather than as procedural code. Query languages are enormously powerful tools.

  • SciDB borrows ideas from modern, distributed file-systems. We replicate data chunks on multiple physical nodes to insure against data loss, for example. And because we’re obliged to deal with large scale operations, we’re working on making our platform continue working even when we lose resources in the middle of a query.
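
As a rough illustration of the chunk replication mentioned above, here is a small Python sketch. The placement rule (primary node plus the next node in a ring), the function names, and the numbers are hypothetical and chosen only to show the idea; this is not how SciDB necessarily places replicas.

```python
def replica_nodes(chunk_index, n_nodes, n_copies=2):
    """Nodes holding copies of a chunk: a primary plus its ring neighbours (hypothetical rule)."""
    if n_copies > n_nodes:
        raise ValueError("cannot place more copies than there are nodes")
    return [(chunk_index + k) % n_nodes for k in range(n_copies)]


def readable_after_failure(n_chunks, n_nodes, failed, n_copies=2):
    """True if every chunk still has at least one copy on a node that has not failed."""
    return all(
        any(node not in failed for node in replica_nodes(i, n_nodes, n_copies))
        for i in range(n_chunks)
    )


# Illustrative numbers only: 64 chunks, 4 nodes, 2 copies of each chunk.
one_down = readable_after_failure(64, 4, failed={2})
two_adjacent_down = readable_after_failure(64, 4, failed={2, 3})

print("all chunks readable with node 2 down:", one_down)
print("all chunks readable with nodes 2 and 3 down:", two_adjacent_down)
```

With two copies per chunk, losing any single node leaves every chunk readable in this sketch, while losing two neighbouring nodes does not; more copies would be needed to tolerate that.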

Where are we weak?

  • We don’t do transactions very well. We support basic concurrent UPDATE and INSERT operations, and we provide isolated transactions. But we haven’t optimized this part of our system’s processing.

  • We’re pretty immature in our tooling and our interaction with other software. Specifically, we don’t do things like GIS operations very efficiently.

Hope this introduction helps!

KR

Pb


#3

Yes, it definitely helps. Thanks!