Storing Array Metadata


#1

Besides the data stored in the cells of the array, there is metadata associated with the array as a whole and not with the individual cells. I am wondering what the best way to store this metadata is.

For example, assume that an instrument generates a matrix of 1000x500 intensities. Besides this intensity matrix, there are instrument settings used for the measurement, such as power and wavelength. What is the best way to store such metadata?


Here are some possible options:

1. Additional Dimensions

One could create an array with four dimensions, two for the 1000x500 intensities matrix and one for each of the metadata values (power and wavelength), like this:

CREATE ARRAY data<intensity: double>
  [i=1:1000,1000,0, j=1:500,500,0, power=1:100,100,0, wavelen=1:1000,1000,0]

This might work best, but it would generate a large number of dimensions if there are a lot of metadata values. Also, this might get complex if the additional dimensions cannot be easily mapped to integer values.
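To get a feel for how sparse this layout would be, here is a rough back-of-the-envelope sketch in Python. It only uses the dimension ranges from the schema above; the helper name `logical_cells` is made up for illustration, and the arithmetic says nothing about how SciDB physically stores empty cells.

```python
# Dimension extents copied from the CREATE ARRAY statement above.
dims = {"i": 1000, "j": 500, "power": 100, "wavelen": 1000}

def logical_cells(dim_sizes):
    """Multiply the dimension extents to get the logical cell count."""
    total = 1
    for size in dim_sizes.values():
        total *= size
    return total

# One measurement fills a single (power, wavelen) slice of the array.
cells_per_measurement = dims["i"] * dims["j"]
total_logical = logical_cells(dims)
fill_fraction = cells_per_measurement / total_logical

print("logical cells:        ", total_logical)
print("cells per measurement:", cells_per_measurement)
print("fraction populated:   ", fill_fraction)
```

Each measurement occupies only one 1000x500 slice, so the array is almost entirely empty unless many power/wavelength combinations are actually measured.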

2. Additional Attributes

Another approach would be to store the metadata as additional attributes in each cell, like this:

CREATE ARRAY data<intensity: double, power: int8, wavelen: int8>
  [i=1:1000,1000,0, j=1:500,500,0]

This can easily accommodate metadata of different types, but it would create a lot of duplicate data, since an array instance will have the same power and wavelen in every cell.
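To put a number on the duplication, here is a small Python sketch. It builds the 500,000 identical `power` values for one array instance and runs them through a naive run-length encoder. The encoder and the value 42 are purely illustrative assumptions; this is not a description of SciDB's actual storage format.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]

n_cells = 1000 * 500            # one cell per (i, j) position
power_column = [42] * n_cells   # hypothetical power setting, same in every cell

encoded = run_length_encode(power_column)

print("logical values stored:", len(power_column))
print("runs after encoding:  ", len(encoded))
```

A constant attribute collapses to a single run under this kind of encoding, so the logical duplication does not have to translate into the same amount of physical storage.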

3. Additional Array

The metadata can be stored in an additional array which has a 1:1 mapping to the data array, like this:

CREATE ARRAY metadata<power: int8, wavelen: int8>
  [data_id=0:*,1000,0]

This would probably be the simplest way but would require joins to retrieve the metadata.
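Conceptually, the join is just a lookup by `data_id`. The Python sketch below mimics it with plain dictionaries and tuples. It assumes each data array instance is identified by a `data_id` (which the data schema above does not yet carry), and all values are made up. It is not SciDB query syntax.

```python
# Metadata array: one entry per data_id (hypothetical values).
metadata = {
    0: {"power": 10, "wavelen": 100},
    1: {"power": 20, "wavelen": 120},
}

# Data cells as (data_id, i, j, intensity); only a few cells shown.
data = [
    (0, 1, 1, 0.5),
    (0, 1, 2, 0.7),
    (1, 1, 1, 0.9),
]

def join_with_metadata(cells, meta):
    """Attach the per-array settings to every cell of that array."""
    for data_id, i, j, intensity in cells:
        settings = meta[data_id]
        yield (data_id, i, j, intensity, settings["power"], settings["wavelen"])

joined = list(join_with_metadata(data, metadata))
for row in joined:
    print(row)
```

The metadata is stored once per array instance and only gets attached to the cells at query time.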

4. Nested Arrays

Maybe the cleanest approach would be to have a nested array as mentioned in The architecture of SciDB, where the data array would have three attributes (a 1000x500 matrix for intensities, power, and wavelen) and a single dimension. I am not sure what the plans are for adding nested arrays to SciDB.


In the past, this has been briefly discussed here:


#2

The short answer to your question is that there is no short answer to your question.

:wink:

But let me try to help out with some observations:

  1. The “meta-data in attributes” strategy? I wouldn’t worry about the data duplication. At least not if what you’re worried about is increasing the volume of physical data you’re storing. At the physical level SciDB will encode and compress the heck out of that kind of data.

  2. We didn’t go with nested arrays mostly because of lessons from the SQL / relational world, where first normal form has proved itself very durable. The problem with “nested” arrays (like nested tables) is that they complicate the query language, the optimization problem, and the physical plan design without improving performance at all. If you think about it, adding a dimension is (more or less) the same as nesting arrays.

  3. Don’t worry about the cost of the joins. In general, the cost of the data manipulation work (whatever else it is you want to do with the data) dominates things. One problem, of course, is that expressing joins is a bit complex in SciDB’s query languages. But don’t let that hold you back (we’re looking at how we can fix that).

TL;DR version … SciDB is a DBMS, which means that schema design is an important aspect of using SciDB. Schema design is all about combining “data” and “meta-data” into a single, query-able database.