Storing data as binary


#1

I’ve just started reading the SciDB documentation and wanted a clarification: if I implement a loader for binary data (you mention HDF5, for example, in one of the other discussions), will the data be converted to ASCII for storage by SciDB? I might have a terabyte of float data (4 bytes per sample). Assuming I need 12 UTF-8 characters per sample to describe the values without loss of precision (how does SciDB handle this?), I would need approximately 3 terabytes of storage. The big problem is not the storage space but the effort of converting a substantial chunk of this data from ASCII to binary every time I want to load it for analysis or visualization.
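To make the arithmetic concrete, here is a rough sketch in Python (standard library only). It assumes IEEE 754 single-precision samples and uses the 12-characters-per-sample figure purely as my own estimate; the helper names are made up for illustration. It checks that 9 significant decimal digits are enough to round-trip a 32-bit float through text, and computes the resulting storage ratio.

```python
import random
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit IEEE 754 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ascii_roundtrip_ok(x32: float, digits: int) -> bool:
    """True if x32 survives formatting with `digits` significant digits."""
    text = f"{x32:.{digits}g}"
    return to_float32(float(text)) == x32


# Synthetic stand-in for real samples (assumption: values of modest range).
random.seed(0)
samples = [to_float32(random.uniform(-1e6, 1e6)) for _ in range(10_000)]

# 9 significant digits are sufficient for any IEEE 754 single-precision value.
all_ok_9 = all(ascii_roundtrip_ok(s, 9) for s in samples)

# Storage estimate: 1 TB of 4-byte samples versus 12 ASCII characters each.
n_samples = 10**12 // 4
binary_bytes = 4 * n_samples
ascii_bytes = 12 * n_samples  # assumed width, including any delimiter
ratio = ascii_bytes / binary_bytes

print(f"round-trip with 9 digits: {all_ok_9}, ASCII/binary ratio: {ratio}")
```

The actual text width depends on the value range and the format chosen (scientific notation needs room for a sign and exponent), so 12 characters is only a ballpark figure.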

In practice, we have a lot of legacy formats for data we want to load. A typical example is 3D seismic data, which often comes off the boat in IBM float binary format and is very expensive to convert to IEEE floats for processing. The approach we are using is a one-time conversion of all these various formats to a unified object format that mixes ASCII and binary files.

It is my observation that all objects are by definition sets of key-value pairs (object_field - value), where a value might in fact be a multidimensional array à la SciDB. When stored, scalar values are written as key-value pairs in an ASCII object header file, and arrays are written as binary vectors, each vector being of a homogeneous type (string, byte, float, double, etc.). The array reference is stored in the header as a key-value pair whose value is a reference to the binary file or files containing the array's binary vectors. Effectively, the datastore is a hierarchical file data system (HFDS) with a well-defined structure describing the stored objects and their associated files (ASCII and binary). I describe all this because I would like your feedback on how this approach could effectively be built on the SciDB architecture. I see a lot of commonality between what I’m doing and what SciDB and the SciDB community are offering.
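To illustrate the layout I have in mind, here is a minimal sketch in Python (standard library only). The file naming, the `key = value` header syntax, and the `@file:typecode:count` reference form are all hypothetical choices for this example, not our actual format, and the binary vectors are written in native byte order for simplicity.

```python
import array
import os
import tempfile


def write_object(directory, name, fields):
    """Write scalars to an ASCII header and each array to its own binary file."""
    header_lines = []
    for key, value in fields.items():
        if isinstance(value, array.array):
            bin_name = f"{name}.{key}.bin"
            with open(os.path.join(directory, bin_name), "wb") as f:
                value.tofile(f)  # homogeneous binary vector
            # The header stores only a reference to the binary file.
            header_lines.append(f"{key} = @{bin_name}:{value.typecode}:{len(value)}")
        else:
            header_lines.append(f"{key} = {value}")
    with open(os.path.join(directory, f"{name}.hdr"), "w", encoding="ascii") as f:
        f.write("\n".join(header_lines) + "\n")


def read_object(directory, name):
    """Read the header, then follow array references to the binary files."""
    fields = {}
    with open(os.path.join(directory, f"{name}.hdr"), encoding="ascii") as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition(" = ")
            if value.startswith("@"):
                bin_name, typecode, count = value[1:].rsplit(":", 2)
                vec = array.array(typecode)
                with open(os.path.join(directory, bin_name), "rb") as bf:
                    vec.fromfile(bf, int(count))
                fields[key] = vec
            else:
                fields[key] = value  # scalars come back as strings here
    return fields


# Round-trip a tiny object: two scalars plus one float vector.
with tempfile.TemporaryDirectory() as d:
    traces = array.array("f", [0.5, -1.25, 3.0])
    write_object(d, "survey1", {"name": "survey1", "n_samples": 3, "traces": traces})
    loaded = read_object(d, "survey1")
```

The point of the design is that the arrays are never converted to text: loading an object means parsing a small header and reading the vectors straight from disk.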

Your thoughts and suggestions would be greatly appreciated.

Regards,

Tom Lasseter