Optimal way to load data into SciDB from C++


#1

Hello,

I am working on loading HDF data into SciDB (specifically HDF-EOS MODIS satellite data for the EarthDB project). We have this process working using the load() operator: export the HDF to CSV, then ingest the CSV into SciDB with load() via loadcsv.py. While this works reasonably well, we are trying to avoid the disk usage and processing overhead of creating temporary CSV files (we are dealing with massive amounts of data, so efficiency is key!). Thus we are rewriting our workflow to read directly from HDF files and write the output to SciDB, all within a single C++ utility with no temporary data files involved. While reading HDF files is straightforward, we are debating the best way to output the data to SciDB. We are exploring a number of options, including:

Option 1: Using the iquery.cpp source code to load data with query strings, e.g. appending chunks of data to the array with "insert(redimension(build…" (a sketch follows below). This has the advantage of being easy to write and maintain (the SciDB query syntax seems to change less between versions than the database internals do), but the disadvantage of more memory and processing overhead (e.g. converting all the data to query strings).
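
As a rough illustration of Option 1, here is a minimal C++ sketch that assembles such a query string (the target array name "target", the <val:double> schema, and the use of build's literal-values form are our own assumptions for illustration, not settled code):

    // Hypothetical sketch: turn a buffer of values into an AFL query string
    // of the form insert(redimension(build(...), target), target).
    // In the real utility the string would be submitted the same way
    // iquery.cpp submits queries; here it is just printed.
    #include <cstdio>
    #include <sstream>
    #include <string>
    #include <vector>

    std::string makeInsertQuery(const std::vector<double>& vals, long offset) {
        std::ostringstream q;
        // build()'s literal form: build(<schema>, '[(v1),(v2),...]', true)
        q << "insert(redimension(build(<val:double>[i=" << offset << ":"
          << offset + static_cast<long>(vals.size()) - 1 << ","
          << vals.size() << ",0],'[";
        for (size_t k = 0; k < vals.size(); ++k) {
            if (k) q << ",";
            q << "(" << vals[k] << ")";
        }
        q << "]',true),target),target)";
        return q.str();
    }

    int main() {
        std::vector<double> chunk;
        for (int k = 0; k < 4; ++k) chunk.push_back(k * 0.5); // fake HDF data
        std::printf("%s\n", makeInsertQuery(chunk, 0).c_str());
        return 0;
    }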

Option 2: Using the source code behind the file load() operator as a starting point (i.e. InputArray.cpp). Essentially we can modify this to read straight from the HDF file instead of the interim CSV file.

Option 3: Use the lower-level write methods in the database engine, e.g. writeItem() on data chunks (a sketch follows below). This has the advantage of being the most efficient, but the disadvantage of requiring us to handle more of the work (error checking, etc.) that the SciDB query tools already do.
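
For comparison, a very rough sketch of the shape Option 3 might take, using the internal chunk API. The names here (getIterator, newChunk, writeItem, flush) follow our reading of the source, but exact headers and signatures differ between SciDB releases (older ones use boost::shared_ptr, for instance), so treat every line as an assumption to verify against your tree:

    // Hypothetical sketch of writing one chunk through the internal API.
    #include <memory>
    #include "array/Array.h"  // scidb::Array, ArrayIterator, ChunkIterator

    using namespace scidb;

    void writeOneChunk(Array& array, std::shared_ptr<Query> const& query,
                       AttributeID attr, Coordinates const& chunkPos,
                       double const* data, size_t n) {
        std::shared_ptr<ArrayIterator> aIt = array.getIterator(attr);
        Chunk& chunk = aIt->newChunk(chunkPos);
        std::shared_ptr<ChunkIterator> cIt =
            chunk.getIterator(query, ChunkIterator::SEQUENTIAL_WRITE);
        Value v;
        for (size_t k = 0; k < n; ++k) {
            v.setDouble(data[k]);  // value read from the HDF file
            cIt->writeItem(v);
            ++(*cIt);              // advance to the next cell
        }
        cIt->flush();              // commit the chunk
    }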

Option 4 (similar to Option 3): Merge with some of the existing SciDB-HDF5 code (https://github.com/wangd/SciDB-HDF5). While this code provides a cogent example of HDF loading, it was written with a specific use case in mind and is not maintained against the current SciDB version. Ultimately, it uses the lower-level (writeItem) methods for SciDB data input rather than query strings.

We would appreciate any advice on the best way to proceed. Essentially, our goal is to load the data efficiently while letting existing SciDB code handle as much of the database internals as possible, without too much cost to overall loading performance. We plan to maintain the code for compatibility with new SciDB versions, so we would like to avoid reliance on database internals that are likely to change in the future.

If anyone has big-picture advice on this, or specific code examples, API documentation, etc., we would greatly appreciate it! Thanks!

John


#2

Hey John,

Cool!

How about a parallel binary save followed by a parallel binary load? Can you dump the data in a flat binary form, one packed record per cell? (A sketch is below.)

It's a lot like a CSV, but loading it is a lot faster. It can also be loaded in parallel from multiple instances at once.
Perhaps using named pipes instead of files is another good way to speed things up?
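
To make the binary dump concrete, here is a minimal hypothetical sketch in C++ (the attribute types <x:int64, y:int64, val:double>, the file path, and the values are all assumptions for illustration, not anything John described):

    // Write packed binary records that SciDB's binary load can read.
    // Assumes a flat target array with non-nullable attributes
    // <x:int64, y:int64, val:double>, loaded afterwards with something like:
    //   load(flat, '/tmp/dump.bin', -1, '(int64,int64,double)')
    #include <cstdint>
    #include <cstdio>

    int main() {
        // Could just as well be a named pipe created with mkfifo(2),
        // which avoids touching disk at all.
        FILE* out = std::fopen("/tmp/dump.bin", "wb");
        if (!out) return 1;
        for (int64_t i = 0; i < 1000; ++i) {
            int64_t x = i / 100;
            int64_t y = i % 100;
            double val = 0.5 * i;  // stand-in for a value read from HDF
            std::fwrite(&x, sizeof x, 1, out);
            std::fwrite(&y, sizeof y, 1, out);
            std::fwrite(&val, sizeof val, 1, out);
        }
        std::fclose(out);
        return 0;
    }

As we understand the binary format, fixed-size non-nullable attributes are packed back-to-back in native byte order, so a file written this way should match the '(int64,int64,double)' format specifier.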

I also worked with someone else on a genomics workflow where we actually started from the build() functionality. They had a data source from which different instances could each access pieces of the data, so they took BuildArray, gutted it, and had it extract different pieces of the data in parallel.

Sorry I don't have more time at the moment; please keep us posted on how it goes.


#3

Hi John,

While I'm not affiliated with the SciDB dev team, I'm also interested in loading HDF5 (actually, NetCDF) files into SciDB with the least possible overhead.

So far, I'm doing what apoliakov suggested: converting NetCDF to a binary dump, splitting the dump into n files (where n is the number of instances in my SciDB cluster), placing them in the instances' respective folders, and loading in parallel with the AFL command

    load(my_flat_array, 'my_binary_file', -1, 'format_specifier');

(see the manual). The NetCDF-to-binary conversion is done with a simple, crude C program using libnetcdf (unfortunately specific to the data I'm loading).
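
As an illustration of that last argument (the attribute types here are made up), for a flat array with attributes <x:int64, y:int64, val:double> the format specifier simply mirrors the attribute types:

    load(my_flat_array, '/path/to/my_binary_file', -1, '(int64,int64,double)');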

If you develop something more user-friendly and generic, I’d be happy to hear about it :smile:


#4

Ok, thanks for the input everyone. I’ll try those suggestions.

John


#5

I have developed a generic NetCDF-to-CSV converter in Python, if you are interested.


#6

Hi John,

I'm new to SciDB and have also run into the problem of loading HDF data files into SciDB. As you said, the existing SciDB-HDF5 code is not maintained against the current SciDB version. I was wondering whether your team has made any progress since you started the project. Thanks.