Loading Data Without Disk


#1

Gang:

I have an HTTP server, written in Python, that receives data from an experiment as it’s generated and stores it to a NoSQL database. I’d like to explore using SciDB instead, but I don’t see any way to do an insert / update that doesn’t ultimately rely on loading a file from disk. Am I missing something?

Many thanks,
Tim


#2

Hi Tim,

Replace “file from disk” with “filesystem object” and you are almost correct.
You have two options; option 1 is recommended over option 2:

  1. You can have a tool that streams the data from your remote server to SciDB using named pipes instead of files. The data starts on some remote machine, gets sent to the SciDB machines, and is written to named pipes, which block until they are read from. In fact, take a look at the loadcsv.py tool that comes with SciDB; it does that “magic” for you. See the documentation around p. 55 of the User Guide. At the moment this supports CSV files only, but you can cook up your own binary analog.

  2. You can use the array literal syntax (see p. 98), in conjunction with redimension. This works fine for spot corrections, but it is not the best way to load bulky data sets. (See the sketch of the literal format below.)
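For reference, the 1-D text literal that ends up being loaded (the same format the example later in this thread writes into its pipe) is just a bracketed list of attribute tuples, optionally preceded by a {start} coordinate. A throwaway helper for producing it from Python tuples, illustrative only and not part of scidbapi:

def to_literal(rows, start=0):
  # Render an iterable of attribute tuples as a 1-D array literal,
  # e.g. to_literal([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)])
  #      -> "{0}[(1.0,2.0,3.0),(4.0,5.0,6.0)]"
  body = ",".join("(%s)" % ",".join(str(v) for v in row) for row in rows)
  return "{%d}[%s]" % (start, body)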


#3

Alex:

Makes sense, I’ll check out loadcsv.py. A couple of follow-up questions:

  • Would this approach work with the SciDB binary load format? If I have to transform my data to write it to a pipe anyway, that seems like it would be the most efficient option …
  • Any problems loading multiple arrays simultaneously with this approach? I’d be receiving data at arbitrary times for multiple destination arrays, so I would have multiple load() queries going at once.

Many thanks!
Tim


#4

Yes, we’ve had folks get this to work with binary data. Release 13.2 has a specific fix, contributed by a friendly customer, that allows this. loadcsv.py won’t do it, because it looks for the ‘\n’ character, splits the stream into “lines”, and then sends different pieces to different SciDB instances for optimal parallel load. So you’ll have to create your own way of splitting the stream (i.e. you need to count the bytes, and handle variable-sized data if any).
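For what it’s worth, here is a rough sketch of the byte-counting idea. The three-doubles-per-cell record layout is an assumption for illustration, variable-sized attributes would need a real parser, and none of this is part of loadcsv.py:

import struct

RECORD = struct.Struct("<3d")  # assumed layout: three little-endian doubles per cell

def split_records(blob, n_instances):
  # Split a buffer of fixed-size binary records into n_instances pieces,
  # each containing a whole number of records, by counting bytes rather
  # than looking for '\n'.
  assert len(blob) % RECORD.size == 0
  n_records = len(blob) // RECORD.size
  per_piece = (n_records + n_instances - 1) // n_instances
  pieces = []
  for i in range(n_instances):
    lo = i * per_piece * RECORD.size
    hi = min((i + 1) * per_piece, n_records) * RECORD.size
    pieces.append(blob[lo:hi])
  return pieces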

I see no reason why this wouldn’t work for multiple loads. You can’t have two loads into the same array at the same time (one will wait for the other to finish), but multiple loads into different arrays are fine. You will probably need to adjust the threading settings accordingly; see User Guide section 2.7.2 and the surrounding material.
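In case it helps, here is a minimal sketch of what “multiple loads into different arrays” could look like from Python, reusing the same scidbapi calls as the example further down the thread. The array and pipe names are placeholders, and each process gets its own connection:

from multiprocessing import Process
import scidbapi as scidb

def load_one(array_name, pipe_path):
  # One "load" query per destination array, each reading from its own
  # named pipe that some writer is already feeding.
  db = scidb.connect("localhost")
  result = db.executeQuery("load %s from '%s'" % (array_name, pipe_path), "aql")
  db.completeQuery(result.queryID)

targets = [("test_a", "/tmp/pipe_a"), ("test_b", "/tmp/pipe_b")]
loaders = [Process(target=load_one, args=t) for t in targets]
for p in loaders:
  p.start()
for p in loaders:
  p.join()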

Looking forward to seeing how it works for you!


#5

Alex:

I’ve attached a small, serial, text-based example for the other Pythonistas in the audience. Overall, I’d say it’s workable, albeit perverse … once I’ve got a socket connection to the server, I want to use that connection to transfer data, and not have to worry about naming, managing access to, and cleaning up named pipes.

Note that I had to use the multiprocessing module to run the data-writing thread in a separate process - with in-process threads (the threading module), the data-writing thread remained blocked, even with the main thread blocked on the “load …” query. I suspect that SciDB’s Python client wrappers need to yield the GIL at some point, but I haven’t looked at the code.

Cheers,
Tim

from multiprocessing import Process as thread
#from threading import Thread as thread  # in-process threads deadlock; see note above

import os
import random
import scidbapi as scidb
import sys
import uuid

def execute_query(language, query, ignore_errors=False):
  # Run a single query and complete it; optionally swallow errors
  # (e.g. dropping an array that doesn't exist yet).
  try:
    sys.stderr.write("%s\n" % query)
    result = database.executeQuery(query, language)
    database.completeQuery(result.queryID)
  except Exception:
    if not ignore_errors:
      raise

def start_writing():
  # Create a uniquely named fifo and start a separate process that writes a
  # SciDB 1-D array literal into it; the writer blocks until the "load ..."
  # query below opens the read end of the pipe.
  named_pipe = "/tmp/%s" % uuid.uuid4().hex
  os.mkfifo(named_pipe, 0o666)
  def write_data():
    stream = open(named_pipe, "w")
    stream.write("{0}[")
    for row in range(10000000):
      if row:
        stream.write(",")
      stream.write("(%s)" % ",".join([str(random.random()) for i in range(3)]))
    stream.write("]")
  thread(group=None, target=write_data).start()
  return named_pipe

database = scidb.connect("localhost")
named_pipe = start_writing()
execute_query("aql", "drop array test", ignore_errors=True)
execute_query("aql", "create array test<a:double,b:double,c:double>[row=0:*,500000,0]")
execute_query("aql", "load test from '%s'" % named_pipe)
os.unlink(named_pipe)

#6

Not to mention that the writer and SciDB have to share a filesystem!

Naggingly-yours,
Tim


#7

With loadcsv.py I believe you can make an architecture like this:

LOADER_MACHINE                         SCIDB_CLUSTER
* iquery     <-- scidb / 1239  -->     * scidb
* loadcsv.py                               /|\          
input file        --- ssh --->         named fifo

I know it’s not what you want, but it’s better than requiring the writer and SciDB to share a filesystem. loadcsv.py manages all the fifo open/close bookkeeping for you, and it splits the file into rows as described above, so pieces are actually sent to all the machines in the cluster simultaneously.
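To make the ssh leg of that picture concrete, here is a rough Python sketch of pushing a local file into a fifo on the SciDB host. The host name and paths are placeholders; loadcsv.py normally does this plumbing, plus the parallel split, for you:

import subprocess

# Stream a local CSV into a remote fifo; SciDB reads the other end of the
# fifo via a "load ..." query issued through iquery or scidbapi.
with open("input.csv", "rb") as src:
  subprocess.check_call(["ssh", "scidb-host", "cat > /tmp/load_fifo"], stdin=src)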

But yes, we are surely aware that we’ll need a new method for direct ingest.


#8