I am working on loading HDF data into SciDB (specifically HDF-EOS MODIS satellite data for the EarthDB project). We have this process working with the load() operator: we export HDF to CSV, then ingest the CSV into SciDB with load() via loadcsv.py. While this works reasonably well, we are trying to avoid the disk usage and processing overhead of creating temporary CSV files (we are dealing with massive amounts of data, so efficiency is key!). We are therefore rewriting our workflow to read directly from HDF files and write the output to SciDB, all within a single C++ utility with no temporary data files involved. While reading the HDF files is straightforward, we are debating the best way to output the data to SciDB. We are exploring a number of options, including:
Option 1: Using the iquery.cpp source code to load data with query strings; e.g. by appending chunks of data to the array with “insert(redimension(build…”. This has the advantage of being easy to write/maintain (i.e. the SciDB query syntax seems to change less than the database internals in new versions), but the disadvantage of requiring more memory/processing overhead (e.g. converting all the data to query strings).
Option 2: Using the source code behind the file load() operator as a starting point (i.e. InputArray.cpp). Essentially we can modify this to read straight from the HDF file instead of the interim CSV file.
Option 3: Use the lower-level write methods in the database (e.g. writeItem() on data chunks). This has the advantage of being the most efficient, but the disadvantage of requiring us to handle more of the work (error checking, etc.) that the SciDB query tools already do.
Option 4 (similar to Option 3): Merge with some of the existing SciDB-HDF5 code (https://github.com/wangd/SciDB-HDF5). While this code provides a cogent example of HDF loading, it was written with a specific use case in mind and is not maintained against the current SciDB version. Ultimately, it uses the lower-level (writeItem) methods for SciDB data input rather than query strings.
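To make Option 1 more concrete, here is a rough sketch of the string-building step we have in mind: a batch of cells read from the HDF file is formatted into one query string, which would then be handed to whatever client code executes queries. Everything specific here is an assumption for illustration only: the Cell struct, the makeInsertQuery() helper, the attribute names (row, col, val), the 1-D staging dimension i, and the exact arguments passed to build(), which may well differ between SciDB versions. It does not call any SciDB API.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One cell of a hypothetical 2-D target array: two coordinates plus one value.
struct Cell {
    long long row;
    long long col;
    double value;
};

// Format a batch of cells as a single query string of the form
// insert(redimension(build(...), target), target).
// The attribute names and the build() arguments are illustrative
// assumptions; check them against the SciDB version in use.
std::string makeInsertQuery(const std::string& target,
                            const std::vector<Cell>& cells)
{
    if (cells.empty()) {
        return std::string();  // nothing to insert
    }

    std::ostringstream q;
    q.precision(17);  // enough digits to round-trip a double

    // Stage the batch as a 1-D array holding exactly cells.size() entries.
    q << "insert(redimension(build("
      << "<row:int64,col:int64,val:double>"
      << "[i=0:" << (cells.size() - 1) << "," << cells.size() << ",0],'[";

    // Emit each cell as a (row,col,val) tuple inside the array literal.
    for (std::size_t k = 0; k < cells.size(); ++k) {
        if (k > 0) {
            q << ",";
        }
        q << "(" << cells[k].row << "," << cells[k].col << ","
          << cells[k].value << ")";
    }

    q << "]',true)," << target << ")," << target << ")";
    return q.str();
}
```

Calling makeInsertQuery("modis", batch) once per batch would produce one statement per chunk of input. This also shows where the overhead we are worried about comes from: every value is converted to text on our side and parsed again by the server.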
We would appreciate any advice on the best way to proceed. Essentially, our goals are to load the data efficiently while letting existing SciDB code handle as much of the database internals as possible, without adding too much overhead to overall loading performance. We plan to maintain the code for compatibility with new SciDB versions, so we would like to avoid relying on database internals that are likely to change in the future.
If anyone has some big-picture advice on this or specific code examples, API documentation, etc, we would greatly appreciate it! Thanks!