Storing images with variable sizes/dimensions


#1

I’m considering SciDB as a database to store our scientific image data. The way I currently think about my data is as follows:

  1. All data belongs to one or several projects, which can themselves be subprojects, etc., which can be easily handled by a relational database.
  2. Each “image” (datum) consists, typically, of three or four 1000x1000 arrays of ints and a 1000x1000 array of floats. However, the size can vary; for instance, occasionally we acquire 1000x200x4 arrays of floats, etc. I would like this data to be compressed and to be able to extract slices.
  3. Each “image” has ~50-100 associated metadata entries, which I think of as keys and values. The metadata keys will vary from image to image, but I need to be able to search images with shared metadata keys based on those metadata values. E.g., I would like to search for those images with a ‘Temperature’ key whose value falls within a certain range.

This whole system will have to be easily managed. We don’t have the resources to spare a lot of time or money. The data will probably be in the ~100GB to ~TB range, so it could fit on a single computer. It would be nice if it were scalable.

My current approach is to handle (1) in Django + MySQL and (2-3) in HDF5 files, as managed by PyTables. However, PyTables seems like a poor choice in a multiuser environment. Is this a feasible project with SciDB? Can you offer any advice on how to pursue it? I am not a database expert by any stretch of the imagination. Rather, I’m a graduate student researcher in an experimental lab, and I only spend a fraction of my time programming. I’m happy to give lots more detailed info if it would help.


Storing Array Metadata
#2

I’m posting a reply as a script. The idea is to try to illustrate how we see SciDB being used.

#!/bin/sh
#
# File:  Forum_Query_1.sh
#
# About:
#
#   This file contains the scripting I'll use to try to answer the
#   forum query by 'emarti' on July 5th, 2012.
#
set -x

#
# 1. All data belongs to one or several projects, which can themselves be
#    subprojects, etc., which can be easily handled by a relational
#    database.
#
# Well ... depending on how you're organizing the projects, SciDB might
# also be able to help you with the organization of your projects. We're
# a long way from being a fully functional SQL DBMS, but our pure data
# manipulation queries are quite reasonable, in terms of both performance
# and flexibility.

#
# 2. Each "image" (datum) consists, typically, of three or four 1000x1000
#    arrays of ints and a 1000x1000 array of floats. However, the size
#    can vary; for instance, occasionally we acquire 1000x200x4 arrays of
#    floats, etc. I would like this data to be compressed and to be able
#    to extract slices.
#
# So - it's not clear to me, from your description, whether we're dealing
# with a single data set or a number of independent data sets, but here's
# how we would organize one of your data sets. In this case I'm assuming
# that we are dealing with a single data set containing 4 images, where
# each image is 1000x1000.

#
# Hygiene: remove the target array if it is left over from a previous run.
iquery -aq "remove ( Data )"

DDL_Q1="CREATE ARRAY Data
<
  int_attr    : int64,
  double_attr : double
>
[ I = 0:3,1,0,
  X = 0:999,1000,0,
  Y = 0:999,1000,0
];"

iquery -aq "$DDL_Q1"
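#
# (An optional sanity check, not in the original post: the AFL show()
# operator prints the schema of a stored array, which is a quick way to
# confirm the DDL produced the shape you expected.)
iquery -aq "show ( Data )"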

#
# Now - a bit of a SciDB "best practices" lesson. Organizing your load
# files into a format that allows you to load this kind of data is doable
# but pretty complex. It's much easier to get the data into SciDB as a
# "single dimensional" array and then to re-organize it into the desired
# format. What does this two-phase (load, re-organize) process look like?
#
# First - create an array to hold the raw data. Note that the attributes
# of this array will find themselves re-organized into dimensions or
# attributes of the target array, so they need to share the same names.

#
# Hygiene: remove the load array if it is left over from a previous run.
iquery -aq "remove ( Raw_Data )"

DDL_Q2="CREATE ARRAY Raw_Data
<
  I           : uint64,
  X           : uint64,
  Y           : uint64,
  int_attr    : int64,
  double_attr : double
>
[ Cell=0:3999999,100000,0 ];"

iquery -aq "$DDL_Q2"

#
# This kind of 1D "raw load" array can be loaded from a CSV file that
# looks like this:
#
#   0,0,0,1,1.0
#   0,0,1,2,2.0
#   0,0,2,3,3.0
#   ...
#   0,0,999,1000,1000.0
#   0,1,0,1001,1001.0
#   0,1,1,1002,1002.0
#   ...
#   0,999,998,999999,999999.0
#   0,999,999,1000000,1000000.0
#   1,0,0,1,1.0
#
# Note that I've shown the data here in stride-major order, but the
# values in the CSV can be in any order. SciDB includes a tool called
# "csv2scidb" that takes the CSV file and formats its contents so that
# the loader can accept them. In general, we use the following approach
# to loading data.

#
# Hygiene: delete the fifo each time we run the script.
rm -rf /tmp/load_pipe

# We use a fifo to pipe data from the external file into SciDB.
mkfifo /tmp/load_pipe

# Suppose the data file above is stored in a file /tmp/Data.csv. The idea
# is to use the csv2scidb tool to convert the load file into a chunked
# load stream, and then to load it into the Raw_Data array.
cat /tmp/Data.csv | csv2scidb -c 100000 -p NNNNN > /tmp/load_pipe &
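#
# (A sketch of the load step itself, which the original post sets up but
# never shows: with csv2scidb writing into the fifo in the background,
# an AQL LOAD reads from the other end. It is commented out here because
# the script below populates Raw_Data with build() instead.)
# iquery -nq "LOAD Raw_Data FROM '/tmp/load_pipe';"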

#
# Rather than constructing an external load file, I'll use the SciDB
# build() operator to generate the same data directly.
DML_Q1="
SELECT uint64(Cell/1000000)        AS I,
       uint64((Cell%1000000)/1000) AS X,
       uint64(Cell%1000)           AS Y,
       Cell + 1                    AS int_attr,
       double(Cell+1)              AS double_attr
INTO   Raw_Data
FROM   build (<V:uint64> [Cell=0:3999999,100000,0], Cell);
"

iquery -nq "$DML_Q1"

#
# What did that query produce?
DML_Q2="
SELECT MIN(I), MAX(I),
       MIN(X), MAX(X),
       MIN(Y), MAX(Y),
       COUNT(*)
FROM   Raw_Data;
"

iquery -q "$DML_Q2"

#
# In other words, the 1D Raw_Data array contains 4,000,000 cells, and the
# cells have values that correspond to a 3D array, I x X x Y, where I=4,
# X=1000 and Y=1000. You might just as easily have loaded this data from
# a CSV file using the load() style of operations introduced above.
#
# Now, to convert the array from this 1D "raw" format into the 3D target
# format (4 images, each image 1000x1000), you would use the following
# AQL command.
DML_Q3="SELECT * INTO Data FROM Raw_Data;"
iquery -nq "$DML_Q3"

#
# Just checking that we have the same data in the 3D target array.
DML_Q4="
SELECT I,
       MIN(X), MAX(X),
       MIN(Y), MAX(Y),
       COUNT(*)
FROM   Data
GROUP BY I;
"

iquery -q "$DML_Q4"

#
# It's worth noting that I could just as easily have used the following
# single AQL query instead of the two queries above.
#
# Hygiene: remove the array if it is left over from a previous run.
iquery -aq "remove ( Data_2 )"

DDL_Q3="CREATE ARRAY Data_2
<
  int_attr    : int64,
  double_attr : double
>
[ I = 0:3,1,0,
  X = 0:999,1000,0,
  Y = 0:999,1000,0
];"

iquery -aq "$DDL_Q3"

DML_Q5="
SELECT (Cell/1000000)        AS I,
       ((Cell%1000000)/1000) AS X,
       Cell%1000             AS Y,
       Cell + 1              AS int_attr,
       double(Cell+1)        AS double_attr
INTO   Data_2
FROM   build (<V:uint64> [Cell=0:3999999,100000,0], Cell);
"

iquery -nq "$DML_Q5"
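#
# (A quick cross-check, not in the original post: if the two loading
# paths are equivalent, Data and Data_2 should hold the same number of
# cells. DML_CHECK is a name invented for this sketch.)
DML_CHECK="SELECT COUNT(*) FROM Data_2;"
iquery -q "$DML_CHECK"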

#
# So - some queries.
#
# How would you address a single image from among the 4? One way to do
# this is to pull out a "slice" of the Data array using slice().
DML_Q6="
SELECT SUM ( double_attr )
FROM   slice ( Data, I, 1 );
"

iquery -q "$DML_Q6"

# A small point here that helps to illustrate the differences between
# SQL and the SciDB data model. The following query, which produces the
# same result in this case, is more or less exactly SQL. But there's an
# important difference.
DML_Q7="
SELECT SUM ( double_attr )
FROM   Data
WHERE  I = 1;
"

iquery -q "$DML_Q7"

#
# While the aggregate returns the same result here, there is a difference
# in the nature of the intermediate result. Arrays always have shape:
# they always have a dimension count (rank) and a size (length of each
# dimension). What slice() does is to change the shape of the input array
# by pulling a particular sub-array from it and reducing the number of
# dimensions by 1.
#
#   Q1: SELECT * FROM slice ( Data, I, 1 );
#
# By contrast, the following query simply filters the contents of the
# input array, and doesn't change its shape.
#
#   Q2: SELECT * FROM Data WHERE I = 1;
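#
# (An aside, not in the original post, since the question asked about
# extracting slices: slice() is not the only extraction operator. A
# between() over the Data array keeps all three dimensions but restricts
# each to a range - here, a 100x100 window of image 1 - and subarray()
# does the same while shifting the origin back to zero. A sketch; the
# arguments are the low coordinates for each dimension, then the highs.)
DML_Q8="SELECT COUNT(*) FROM between ( Data, 1, 100, 100, 1, 199, 199 );"
iquery -q "$DML_Q8"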

#
# 3. Each "image" has ~50-100 associated metadata entries, which I think
#    of as keys and values. The metadata keys will vary from image to
#    image, but I need to be able to search images with shared metadata
#    keys based on those metadata values. E.g., I would like to search
#    those images with a 'Temperature' key whose value falls within a
#    certain range.
#
# OK - at this point I'm a bit confused. Could you help us out with an
# example?


#3

Plumber,

Our experiment takes images and immediately assigns them what I call metadata: a series of key-value pairs based on a first round of image processing. We take images of cold atoms, so the relevant metadata includes things like the number of atoms, the temperature, the details of the camera, etc. Currently, I store this in an HDF5 file. For instance, maybe the data looks like this:

/ image1 / rawdata / 1000x1000 array
/ image1 / metadata / “Fit type”: “Gaussian”
/ image1 / metadata / “Atom number”: 1.3e5
/ image1 / metadata / “Temperature”: 100
/ image1 / metadata / “Pixel size”: [1, 1.5]

/ image2 / rawdata / 1000x1000 array
/ image2 / metadata / “Fit type”: “Gaussian”
/ image2 / metadata / “Atom number”: 1.5e5
/ image2 / metadata / “Temperature”: 150
/ image2 / metadata / “Pixel size”: [1, 1.5]

/ image3 / rawdata / 1000x200x3 array
/ image3 / metadata / “Fit type”: “Parabolic”
/ image3 / metadata / “Atom number”: 2e5
/ image3 / metadata / “Parabolic size”: [50, 100]
/ image3 / metadata / “Pixel size”: [1, 1.5]

The type of data we take, i.e., the metadata key-value pairs, varies depending on the experiment we do. We change our experiments quickly and might suddenly have a new requirement on the image size or metadata. For instance, above, image3 has a different array size (1000x200x3) and a new tag, a (2,1) array called “Parabolic size”. I might want to search for all images with more than 1.3e5 atoms, which should return images 2 and 3. My thoughts right now are to use a hierarchical database (e.g., Mongo) or a relational database with records such as IntegerMetadata, StringMetadata, etc., where each record has the columns “key”, “value”, and “image” (a sketch of that layout follows below).
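For concreteness, here is a minimal sketch of what I mean by that relational layout (table and column names are just illustrative; `key` needs quoting because it is a reserved word in MySQL):

-- One row per (image, key) pair; sibling tables (IntegerMetadata,
-- StringMetadata, ...) would hold the other value types.
CREATE TABLE FloatMetadata (
    image  INT          NOT NULL,  -- references the image record
    `key`  VARCHAR(64)  NOT NULL,  -- e.g., 'Atom number', 'Temperature'
    value  DOUBLE       NOT NULL
);

-- "All images with more than 1.3e5 atoms" should then return images 2 and 3:
SELECT image
FROM   FloatMetadata
WHERE  `key` = 'Atom number' AND value > 1.3e5;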

Does this clarify my question?