Returning Data to Clients


#1

Gang:

Given a sufficiently large array containing two attributes, the following Python client code, which attempts to retrieve one attribute at a time, will successfully retrieve the first attribute and then fail on the second:

[code]import scidbapi as scidb

database = scidb.connect("localhost", 1239)
query = database.executeQuery("select * from test2", "aql")

attributes = query.array.getArrayDesc().getAttributes()
attributes = [attributes[i] for i in range(attributes.size()) if attributes[i].getName() != "EmptyTag"]
attribute_iterators = [query.array.getConstIterator(attribute.getId()) for attribute in attributes]

for attribute_iterator in attribute_iterators:
    while not attribute_iterator.end():
        value_iterator = attribute_iterator.getChunk().getConstIterator(scidb.swig.ConstChunkIterator.IGNORE_OVERLAPS | scidb.swig.ConstChunkIterator.IGNORE_EMPTY_CELLS)
        while not value_iterator.end():
            print value_iterator.getItem().getDouble()
            value_iterator.increment_to_next()
        attribute_iterator.increment_to_next()

database.completeQuery(query.queryID)
database.disconnect()
[/code]

The error message:

Traceback (most recent call last):
  File "test2.py", line 16, in <module>
    attribute_iterator.increment_to_next()
  File "/opt/scidb/13.3/lib/libscidbpython.py", line 996, in increment_to_next
    def increment_to_next(self): return _libscidbpython.ConstIterator_increment_to_next(self)
IndexError: UserException in file: src/array/StreamArray.cpp function: moveNext line: 103
Error id: scidb::SCIDB_SE_EXECUTION::SCIDB_LE_NO_CURRENT_BITMAP_CHUNK
Error description: Error during query execution. No current bitmap chunk.

If I use code that retrieves the attributes in “interleaved” order, it succeeds:

[code]import scidbapi as scidb

database = scidb.connect("localhost", 1239)
query = database.executeQuery("select * from test2", "aql")

attributes = query.array.getArrayDesc().getAttributes()
attributes = [attributes[i] for i in range(attributes.size()) if attributes[i].getName() != "EmptyTag"]
attribute_iterators = [query.array.getConstIterator(attribute.getId()) for attribute in attributes]

while not attribute_iterators[0].end():
    for attribute_iterator in attribute_iterators:
        value_iterator = attribute_iterator.getChunk().getConstIterator(scidb.swig.ConstChunkIterator.IGNORE_OVERLAPS | scidb.swig.ConstChunkIterator.IGNORE_EMPTY_CELLS)
        while not value_iterator.end():
            print value_iterator.getItem().getDouble()
            value_iterator.increment_to_next()
    for attribute_iterator in attribute_iterators:
        attribute_iterator.increment_to_next()

database.completeQuery(query.queryID)
database.disconnect()
[/code]

This was a big surprise to me, since SciDB is a column store, and interleaving attributes in this way significantly complicates client code. Could this be a bug? Is there some way around it? It also seems that both approaches work fine on “smaller” arrays (I’m vague on the definition of “smaller”), which led to a lot of confusion. Any thoughts?

Many thanks,
Tim


#2

Hey Tim,

Very interesting observation. When a query is executed, we create something called a “ParallelAccumulatorArray” on each instance, which is a class that supports methods like “give me your next chunk”. Then the coordinator starts polling instances for their chunks and tries to return them to the client in order. This mechanism has been in the system for quite a while, and it’s quite possible that it just doesn’t support retrieval of data in a “vertical” pattern. It might work for smaller arrays because it may be prefetching multiple chunks from all the attributes, thus keeping all the data cached when there’s a small number of chunks.

At the core, we are very much a column store and operator project() is definitely your friend. For your case, you could consider writing several project() queries, like
project(test2, attribute1)
project(test2, attribute2)

though this wouldn’t be so efficient if the array isn’t stored.
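A minimal sketch of the one-query-per-attribute idea, assuming the attribute names are already known on the client. The helper name `project_queries` is hypothetical; it only builds the query strings. Since project() is an AFL operator, each string would presumably be passed to `executeQuery` with "afl" rather than "aql" as the language argument.

```python
def project_queries(array_name, attribute_names):
    """Build one project() query string per attribute (hypothetical helper)."""
    return ["project(%s, %s)" % (array_name, name) for name in attribute_names]


# Array and attribute names taken from the example queries above.
queries = project_queries("test2", ["attribute1", "attribute2"])
for afl in queries:
    print(afl)
```

Each resulting array has a single attribute, so the client can walk it with one iterator from start to finish.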

This is the first time I’ve seen a request for supporting different client-side access patterns like this. I’ll make sure to let the other devs know about it…


#3

Alex:

Thanks for the quick response. I’ll likely code around this in my client, as I’m reluctant to have to store temporary results. Also, I was mistaken about array size having an effect on this issue (I had a typo in my array specifications). Both iteration patterns work when your results are contained in a single chunk; only the interleaved pattern works when your results require more than one chunk, which makes much more sense.

As a follow-up, it would be really useful to know what guarantees (if any) apply to the ordering of data returned to clients - I’ve mostly been working with 1D arrays up to this point, and assuming (it’s been true so far) that they’d be returned in order – but now you’ve got me worried :wink:
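One possible way to code around this in a client is to keep the interleaved iteration but append each value to a per-attribute buffer, so the rest of the program still sees whole columns. This is only a sketch and assumes the result fits in client memory. `read_columns` is a hypothetical name, and the `Fake...` classes are stand-ins that model only the iterator methods used earlier in this thread (`end`, `getChunk`, `getConstIterator`, `getItem`, `increment_to_next`); they are not the scidbapi classes.

```python
class ListIterator(object):
    """Stand-in iterator that walks a Python list (not a scidbapi class)."""

    def __init__(self, items):
        self._items = items
        self._pos = 0

    def end(self):
        return self._pos >= len(self._items)

    def increment_to_next(self):
        self._pos += 1

    def current(self):
        return self._items[self._pos]


class FakeChunkIterator(ListIterator):
    def getItem(self):
        return self.current()


class FakeChunk(object):
    def __init__(self, values):
        self._values = values

    def getConstIterator(self, flags):
        # The flags are ignored by this stand-in.
        return FakeChunkIterator(self._values)


class FakeAttributeIterator(ListIterator):
    """Each list item holds the values of one chunk."""

    def getChunk(self):
        return FakeChunk(self.current())


def read_columns(attribute_iterators, flags=0):
    """Read chunk by chunk across all attributes, buffering one list per attribute."""
    columns = [[] for _ in attribute_iterators]
    while not attribute_iterators[0].end():
        for column, attribute_iterator in zip(columns, attribute_iterators):
            value_iterator = attribute_iterator.getChunk().getConstIterator(flags)
            while not value_iterator.end():
                # With the real API this would likely be getItem().getDouble().
                column.append(value_iterator.getItem())
                value_iterator.increment_to_next()
        # Advance every attribute to its next chunk together.
        for attribute_iterator in attribute_iterators:
            attribute_iterator.increment_to_next()
    return columns


# Two attributes, each split into two chunks.
iterators = [
    FakeAttributeIterator([[1.0, 2.0], [3.0]]),
    FakeAttributeIterator([[10.0, 20.0], [30.0]]),
]
columns = read_columns(iterators)
print(columns)
```

The trade-off is memory: every attribute is held in full before the caller sees any of it.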

Cheers,
Tim


#4

Hey Tim,

Data returned will be ordered by chunk first, then the values are ordered inside the chunk. So you will see all of the cells from the first chunk, then all the cells from the second chunk, and so on… Chunks are ordered by the coordinate of their “top-left” corner. Both chunks and individual cells are ordered with the last dimension increasing first, like so:
x0,y0,z0
x0,y0,z1
x0,y0,z2

x0,y0,zn
x0,y1,z0

x1…

The way it’s implemented is that the coordinator will poll for who has the chunk with the lowest position, grab it, return it, get the next lowest chunk.
If you want something else you can use sort().
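The ordering described above can be modelled on plain coordinate tuples. This is an illustrative sketch, not SciDB code: `cells_in_chunk`, `stream_cells`, and the chunk layout are all made up for the example. Python compares tuples lexicographically, which gives the same "last dimension increasing first" order, and `heapq.merge` stands in for the coordinator repeatedly taking the chunk with the lowest position.

```python
import heapq
from itertools import product


def cells_in_chunk(origin, shape):
    """Yield the cell coordinates of one chunk, last dimension increasing first."""
    ranges = [range(start, start + size) for start, size in zip(origin, shape)]
    return product(*ranges)


def stream_cells(instance_chunks, shape):
    """Merge the chunk lists of all instances by position, then emit their cells."""
    per_instance = [sorted(origins) for origins in instance_chunks]
    for origin in heapq.merge(*per_instance):
        for cell in cells_in_chunk(origin, shape):
            yield cell


# Hypothetical 4x4 array split into 2x2 chunks spread over two instances.
# Each chunk is identified by the coordinate of its "top-left" corner.
instance_chunks = [
    [(0, 0), (2, 2)],
    [(2, 0), (0, 2)],
]
cells = list(stream_cells(instance_chunks, (2, 2)))
print(cells[:8])
```

Note that the stream is ordered chunk by chunk, so for arrays with more than one dimension it is not the same as a global row-major ordering of all cells.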


#5

Alex:

Excellent, I was hoping there’d be an ordering that clients can count on.

Also, a parting thought on the client API … in case you decide not to allow the client to specify data access patterns, the API could be greatly simplified … the current design, where the client allocates N iterators for N attributes, implies incorrectly that you can retrieve data in whatever order you choose. A less error-prone API could look like the following (working example using a layer over the SWIG-generated wrappers to make them more Pythonic):

[code]database = scidb.connect()
with database.query("aql", "select * from test") as result:
    for chunk in result.chunks():
        for attribute in chunk.attributes():
            for value in attribute.values():
                print value.getDouble()
[/code]

Cheers,
Tim