Parallel Application, large array


#1

I am experimenting with the LAMMPS molecular dynamics code; it’s a parallel application that produces a large array. I want to try to use SciDB to store this array. However, after reading some threads here, it seems that in order to get SciDB to store the array data, I have to first write this array data into a text file, and then store it? Is this correct?

What am I missing here?


#2

Hello,

Not necessarily. SciDB has multiple load options, one of which is binary. I recommend taking a look at scidb.org/HTMLmanual/14.8/scidb_ug/ch08.html
For example, you can load a binary blob that looks like

[code]
……
[/code]

into a one-dimensional (table-like) array <x:int64, y:double, z:string> [i]
And then use redimension to produce the final n-dimensional shape.

Let us know how it goes.


#3

Thanks!

It seems, though, that this expects the data to be in a file, which is then loaded into an array? I can’t seem to find a simple example of how to have a C/C++ application load binary data into SciDB as an array directly. I’m sure one exists; I’m just not able to find it easily.

Any help or advice?


#4

Hi,

The binary load example can be found here: scidb.org/HTMLmanual/14.8/scidb_ … 05s03.html

Perhaps you need to have a look at the iquery client too, here: scidb.org/HTMLmanual/14.8/scidb_ug/ch05s01.html


#5

gtchessplayer?

Strictly speaking, the data doesn’t need to be in a file. It simply needs to be read from an I/O stream … a socket, a pipe, or a file handle.

Perhaps a little more background on what you’re trying to achieve would help.

Is it the case that LAMMPS produces a large corpus of binary files that you want to ingest into SciDB? (In which case, some variety of binary load would seem to be the most efficient option.) Or are you receiving input on a stream of some sort? (There are several options here, binary load among them.) Or are you trying to integrate a SciDB client into a running LAMMPS process to insert data? (In which case, assuming the data you’re shipping isn’t that big, you can always construct an array using a literal build(…) operator from a text string.)

As “Judge” John Hodgman would explain things, “Specificity is the soul of narrative!” Details would help us.


#6
  1. The application is a parallel MPI application that we’ve run on up to 40k cores, where each output epoch (every 30 minutes) can be close to 1 TB in size.

  2. The data exists in memory buffers as an array of doubles. Each process owns a portion of this global array, identified by its local array dimensions. I can output the data via a file handle or stream if needed. To avoid interacting with the disk multiple times, I’d like to skip having the data written to a file first and then loaded into SciDB, and instead have the data go directly to SciDB, if that makes sense.

  3. An analysis component (also MPI parallel) will then query this data using AQL.

Thanks



#7

Thanks for the details! They certainly help.

So. The way the load(…) (or, more strictly, the input(…)) operators work is that they expect a “file name” as a parameter. But Linux/Unix possesses a number of file-system features that will let you achieve your goal of avoiding interacting with the disk multiple times (makin’ the magnet bounce).

I would:

( a ) Instead of files, use “named pipes”, which are essentially FIFO queues for data. “man mkfifo” for details. Everything in the SciDB manual that talks about reading from “files” or mentions “file-names” applies equally to these named pipes.

( b ) From your LAMMPS application, write the result data from each MPI process into one of these named pipes. You will almost certainly need to explicitly write coordinate information into the stream; that is, serialize the state of the model run, including the coordinates. (This is necessary because we’re going to have to reorganize the data once it’s inside SciDB … it’s probable that data from one MPI process will need to be written to multiple SciDB instances, especially if you’re using a SciDB feature like chunk overlap.)

( c ) Then, so far as SciDB is concerned, for reading purposes the “named pipes” you’ve created behave just like files. Only instead of reading data from a disk file, you’re actually reading from an in-memory, inter-process FIFO queue managed by the Linux kernel.

( d ) Once you’ve got the data into SciDB, you’re probably going to need to reorganize it to slot it into an array that tracks the changes to the model state over time. But that part of the work must be done regardless of whether you’re using a file or a fifo.

Is this clear? And hopefully helpful?