Some doubts


#1

Hi everyone,

I’ve got some questions regarding SciDB and scientific data. More precisely, I’d like to know if SciDB is suitable for ETL on datasets coming from NOAA in the following formats: GRIB2, NetCDF and HDF5, which are all multidimensional value grid files (to put it simply).

Another question: could I insert the data directly, or do I have to do some parsing first? Does SciDB provide analytical tools to do, for instance, correlation on that type of dataset?

Thanks in advance,
Alejandro


#2

Hi Alejandrovk!

First up, we actually have a couple of examples of people doing precisely what you want to do. One crew are using SciDB to load and analyze MODIS data. The paper describing the work is indico.cern.ch/event/235511/mat … ides/0.pdf . Basically google around for “EarthDB”.

Second, it’s a general truism that file format standards are such a good idea that we have collectively decided that we need lots of them. In addition to the ones you mention, we’re also asked from time to time about FITS, as well as others for many different kinds of scientific data (bioinformatics, for example; FASTA and BAM). Not to mention formats for ‘R’, or Excel, or MATLAB. Or any of the very, very many formats that are used to push financial tick data around. We figure we could spend basically all of our time doing nothing but writing loaders for each of these formats.

Instead, we’ve focussed on two basic approaches.

SciDB supports a basic set of load tools that allow you to load data from a binary stream, or a .csv file. To use the SciDB default loader with your choice of standard file format, the basic idea is to convert the external file into a row-at-a-time stream, load the stream into a one-dimensional array, and then convert the one-dimensional array back into the desired shape. Alternatively, you can write your own format-specific loader and (we hope) contribute it back to the community by posting it on GitHub. We’ve set up an example to show how such a loader might work: github.com/Paradigm4/SciDB-HDF5
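To make the row-at-a-time idea concrete, here is a minimal sketch in plain Python (standard library only). It is not SciDB code and does not read GRIB2, NetCDF or HDF5. It assumes the grid has already been read into nested lists, and the names `grid_to_rows`, `rows_to_grid` and the tiny temperature grid are made up for illustration. The point is that each row carries its own coordinates, so the one-dimensional stream can be turned back into the original shape without losing cells.

```python
import csv
import io
from itertools import product


def grid_to_rows(grid, shape):
    """Yield one (i, j, ..., value) row per cell of a nested-list grid."""
    for index in product(*(range(n) for n in shape)):
        cell = grid
        for i in index:
            cell = cell[i]
        yield (*index, cell)


def rows_to_grid(rows, shape):
    """Rebuild the nested-list grid from (i, j, ..., value) rows."""
    def empty(dims):
        if len(dims) == 1:
            return [None] * dims[0]
        return [empty(dims[1:]) for _ in range(dims[0])]

    grid = empty(shape)
    for *index, value in rows:
        cell = grid
        for i in index[:-1]:
            cell = cell[i]
        cell[index[-1]] = value
    return grid


# Hypothetical 2 x 3 grid (time step x longitude index) with made-up values.
shape = (2, 3)
temperature = [[280.5, 281.0, 279.25], [282.0, 283.5, 284.75]]

# Step 1: flatten the grid into a row-at-a-time CSV stream.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["t", "lon", "temperature"])
writer.writerows(grid_to_rows(temperature, shape))
csv_text = buffer.getvalue()

# Step 2: read the flat stream back (this stands in for the 1-D load).
reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
parsed = [(int(t), int(lon), float(v)) for t, lon, v in reader]

# Step 3: use the coordinates in each row to restore the original shape.
restored = rows_to_grid(parsed, shape)
```

In a real load, steps 2 and 3 would happen inside SciDB rather than in Python; the sketch only shows why carrying the coordinates in every row keeps the conversion reversible.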

Third, “correlation on that kind of dataset”. Absolutely. Making possible the statistical analysis of these kinds of data sets at scale is the primary goal of SciDB. There are two ways to go about it, although I’m afraid I’ll need a little more detail about what you mean by “correlation”. What exactly did you have in mind here?

Paul


#3

Hello Paul!

Thanks for your response :smile:

Ok, I’ll check those links. If I understood correctly, I should convert the multi-dimensional arrays from my GRIB and NetCDF files to a one-dimensional array? I just hope I don’t lose any information in the process. I’m concerned because I’ve seen this happen with some libraries used to parse these types of files (github.com/scitools/iris): when flattening the n-dimensional arrays to 1D, some data is lost. I’ll see if there is an example of how to do this in the links you provided.

Regarding my correlation analysis, what I’d like to do is compute a correlation between a set of meteorological data from NOAA for a given area, say the North-East Atlantic, and traffic data from AIS (basically ships’ GPS traffic), to see whether there is a relation between the weather conditions and the routes those ships take. Do you think this is feasible with SciDB, or will I have to do my analysis in a third-party language, say Python, R or Java?
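To be concrete about what I mean by “correlation”, here is a rough sketch in plain Python of the kind of calculation I have in mind. Everything in it is an assumption for illustration: the values are made up, the grid cell labels are hypothetical, and I’m assuming both datasets have already been aggregated onto the same (time step, grid cell) keys. It computes a Pearson correlation coefficient over the cells that both datasets share.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up weather values: mean wave height in metres per (time step, cell).
wave_height = {
    (0, "A"): 1.0, (0, "B"): 2.0, (1, "A"): 3.0,
    (1, "B"): 4.0, (2, "A"): 5.0, (2, "B"): 6.0,
}

# Made-up AIS values: number of ships seen per (time step, cell).
# Note that (2, "B") is missing here, so that cell is skipped below.
ship_count = {
    (0, "A"): 50, (0, "B"): 40, (1, "A"): 30,
    (1, "B"): 20, (2, "A"): 10,
}

# Align the two datasets on the keys they have in common.
shared = sorted(wave_height.keys() & ship_count.keys())
r = pearson([wave_height[k] for k in shared],
            [ship_count[k] for k in shared])
```

With real data the hard part would be the alignment step, i.e. getting the weather grid and the ship positions onto a common set of time steps and cells, which is where I’d hope the database can help.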

Thanks!
Alejandro