SciDBPy and array chunk size


#1

The speed of converting a SciDB-Py array wrapper to a NumPy array via .toarray() seems to depend on the array’s chunk size, but it’s not quite clear to me in what way. As far as I understand, .toarray() is the point at which the data gets transferred over the Shim connection, right?
I’ve loaded the same data into 5 SciDB arrays that differ only in chunk size, each a factor of 10 apart, ranging from 1,000 to 10,000,000.
Loading the data from the chunk-size-10,000,000 array takes ~2 minutes on my configuration, while the chunk-size-10,000 array does it in under a minute.

I don’t completely understand how the optimal chunk size is calculated, but for a chunk size of 10,000 to be right, each cell would have to hold about 1 KB of data (going by the rule of thumb from the manual that a chunk should contain ~10 MB). However, my schema has only 4 attributes, each with a data type size of at most 8 bytes, so I doubt that my cells hold anywhere near 1 KB…
Or is the optimal chunk size different for SciDB and SciDBPy? :wink:
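For concreteness, here is that arithmetic (assuming my cells really are 4 attributes of at most 8 bytes each, i.e. ~32 bytes per cell, and taking the manual’s ~10 MB target at face value):

```python
# Back-of-the-envelope chunk size from the "~10 MB per chunk" rule of thumb.
# The cell size is my assumption: 4 attributes, <= 8 bytes each.
bytes_per_cell = 4 * 8                # 32 bytes per cell
target_chunk_bytes = 10 * 1024 ** 2   # ~10 MB per chunk

cells_per_chunk = target_chunk_bytes // bytes_per_cell
print(cells_per_chunk)  # 327680, i.e. ~330k cells, not 10,000
```

So by that rule my chunks should hold on the order of 300,000 cells, which makes the 10,000 figure look even stranger to me.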

But maybe I’m trying to optimise at the wrong end and something in my configuration is totally screwed up, because loading 839,825 cells of 2 floats (which is what I’m doing) takes ~50 seconds. Should it really take that long (over a LAN)? As a comparison: transferring 6 MB of random data via scp from the SciDB machine takes no more than a second.
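For reference, the raw payload behind that transfer (and the size of the random file in the scp test) works out to:

```python
# Size of 839,825 cells holding 2 single-precision floats (4 bytes) each.
cells = 839825
payload_bytes = cells * 2 * 4
print(payload_bytes)                          # 6718600 bytes
print(round(payload_bytes / 1024.0 ** 2, 1))  # ~6.4 MB
```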


#2

Recent versions of SciDB-Py upload the array as a single chunk (this is also what SciDB-R does). The chunk size makes a performance difference for certain operations once arrays exist in the database, but it shouldn’t be relevant for uploading data – here I expect the time to be determined by how long it takes to upload data from your machine to the machine where Shim is running. Which brings me to…

Yes, this seems suspicious – I would expect speeds closer to uncompressed scp. On my machine:

In [16]: x = np.zeros(839825, dtype=[('a', 'f'), ('b', 'f')])

In [17]: %timeit sdb.from_array(x)
1 loops, best of 3: 460 ms per loop

vs scp

$ head -c 6718600 < /dev/urandom > /tmp/random
$ time scp /tmp/random scidb@192.168.56.101:/tmp/random
real	0m0.469s
user	0m0.079s
sys	0m0.016s

This is copying between OSX and a virtual machine on the same physical box, but I get similar comparisons when uploading to remote AMI instances.


#3

Hm, I played around a bit with the various ways of generating and transferring data to and from SciDB via SciDB-Py and Shim.

import numpy as np
import time
from scidbpy import connect

sdb = connect('http://scidb:8080')

# 1) Upload a 2D array of zeros
start_time = time.time()
x = np.zeros((839826, 2), dtype=np.float)
sdb.from_array(x)
print 'Sending zeros to SciDB: {0:.2f} sec'.format(time.time() - start_time)
print x.shape
print x
print ""

# 2) Upload a 1D array of zeros with a custom (structured) dtype
start_time = time.time()
x = np.zeros(839826, dtype=[('lon', 'f'), ('lat', 'f')])
sdb.from_array(x)
print 'Sending zeros to SciDB, custom dtype: {0:.2f} sec'.format(time.time() - start_time)
print x.shape
print x
print ""

# 3) Build the zeros inside SciDB with an AFL query, then download them
start_time = time.time()
x = sdb.new_array(shape=None, persistent=False)
sdb.query('store(apply(build(<lat:float>[i=1:839826,1000000,0],0), lon, 0), {A})', A=x)
print 'SciDB query on SciDB: {0:.2f} sec'.format(time.time() - start_time)
start_time = time.time()
loadeddata = x.toarray()
print 'Getting zeros from SciDB, produced by query: {0:.2f} sec'.format(time.time() - start_time)
print loadeddata.shape
print loadeddata
print ""

# 4) Build the zeros inside SciDB with sdb.zeros, then download them
start_time = time.time()
x = sdb.zeros((839826, 2), dtype="float")
print 'Generating zeros in SciDB: {0:.2f} sec'.format(time.time() - start_time)
start_time = time.time()
loadeddata = x.toarray()
print 'Getting zeros from SciDB, produced by zeros method: {0:.2f} sec'.format(time.time() - start_time)
print loadeddata.shape
print loadeddata
print ""

(Excuse my noob-ish python)

The results are:

Sending zeros to SciDB: 1.35 sec
(839826, 2)
[[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 ..., 
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]]

Sending zeros to SciDB, custom dtype: 0.65 sec
(839826,)
[(0.0, 0.0) (0.0, 0.0) (0.0, 0.0) ..., (0.0, 0.0) (0.0, 0.0) (0.0, 0.0)]

SciDB query on SciDB: 0.38 sec
Getting zeros from SciDB, produced by query: 1.57 sec
(839826,)
[(0.0, 0) (0.0, 0) (0.0, 0) ..., (0.0, 0) (0.0, 0) (0.0, 0)]

Generating zeros in SciDB: 8.74 sec
Getting zeros from SciDB, produced by zeros method: 2.69 sec
(839826, 2)
[[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 ..., 
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]]

I don’t quite understand the difference between the first two results (0.65 sec for sending the array with the custom dtype vs. 1.35 sec for sending 839826 * 2 floats), but obviously this is marginal in comparison to the next two:
Producing 839826 * 2 floats via a SciDB query and fetching them takes 0.38 + 1.57 sec (so approx. 2 sec in total).
Producing 839826 * 2 floats via sdb.zeros takes 8.74 sec and fetching them takes 2.69 sec (so approx. 11 sec in total).

Can you help me understand what’s going on?

I tried generating zeros via sdb.zeros with the custom dtype, but it looks like this is not supported yet:

x = sdb.zeros((839826, 2), dtype=[('lon', 'f'), ('lat', 'f')])

yields:

[...]
  File "/usr/lib/python2.7/site-packages/scidbpy/interface.py", line 1417, in _shim_urlopen
    raise Error(r.text)
scidbpy.errors.SciDBQueryError: UserQueryException in file: src/query/ops/build/LogicalBuild.cpp function: inferSchema line: 126
Error id: scidb::SCIDB_SE_INFER_SCHEMA::SCIDB_LE_OP_BUILD_ERROR2
Error description: Error during schema inferring. Constructed array should have one attribute.
store(build(<lon:float,lat:float> [i0=0:839825,1000,0,i1=0:1,1000,0],0), py1101152180650_00001)

#4

Hi,

Ok, there are several things to unpack here:

Uploading a 1 attribute, 2D array vs a 2 attribute, 1D array

Your first two arrays actually have different datatypes, which you can see if you print the array returned by sdb.from_array:

<f0:double> [i0=0:839825,839826,0,i1=0:1,2,0]
<lon:float,lat:float> [i0=0:839825,839826,0]

The issue here is that np.float is actually a 64-bit double, resulting in an array which is 2x larger and takes longer to upload. Using x = np.zeros((839826, 2), dtype=np.float32) for a more apples-to-apples comparison, the timings are much closer.
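A quick sketch of the size difference (note: np.float was just an alias for Python’s built-in float, i.e. a 64-bit double; the alias has since been removed from NumPy, so the snippet spells it out as np.float64):

```python
import numpy as np

# 'f' in a structured dtype means single precision (4 bytes);
# np.float meant double precision (8 bytes), i.e. np.float64.
print(np.dtype(np.float64).itemsize)                    # 8 bytes per value
print(np.dtype(np.float32).itemsize)                    # 4 bytes per value
print(np.dtype([('lon', 'f'), ('lat', 'f')]).itemsize)  # 8 bytes per record
```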

Speed of store(build(…)) vs sdb.zeros
Here are the schemas of the two arrays:

store(build(...)): <lat:float,lon:int64> [i=1:839826,1000000,0]        (0.23 sec on my machine)
zeros:             <f0:double> [i0=0:839825,1000,0,i1=0:1,1000,0]      (1.23 sec on my machine)

First, your 8.74 sec timing is suspiciously slow – do you get that result consistently, or was it a hiccup?

Ignoring the 8.74 sec figure, why is the first array built ~5x faster than the second? Let’s first get consistent attribute datatypes. I’ve changed the build query to

sdb.query('store(apply(build(<lat:double>[i=1:839826,1000000,0],0), lon, 0.0), {A})', A=x)

This doesn’t change the timings:

store(build(...)): <lat:double,lon:double> [i=1:839826,1000000,0]       (0.23 sec on my machine)
zeros:             <f0:double> [i0=0:839825,1000,0,i1=0:1,1000,0]       (1.23 sec on my machine)

What about the chunk size? Maybe there’s less database overhead because the first query stores everything in one chunk?

sdb.query('store(apply(build(<lat:double>[i=0:839825,1000,0],0), lon, 0.0), {A})', A=x)
<lat:double,lon:double> [i=0:839825,1000,0]             (1.76 sec on my machine)

Having a small chunk size does seem to add considerable overhead on my single-node database. Note however that splitting large arrays into several chunks will have a benefit on multi-node databases, since each machine will work in parallel.
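To put numbers on the per-chunk overhead, the two chunk sizes differ by almost three orders of magnitude in chunk count (a quick ceiling-division sketch):

```python
# Chunks needed to cover 839,826 cells along one dimension,
# using integer ceiling division.
n_cells = 839826

def n_chunks(chunk_size):
    return (n_cells + chunk_size - 1) // chunk_size

print(n_chunks(1000))     # 840 chunks of 1,000 cells
print(n_chunks(1000000))  # 1 chunk covers everything
```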

Likewise, passing a larger chunk_size to zeros makes it faster:

sdb.zeros((839826, 2), dtype="float", chunk_size=(839826, 1000))
py1101007558014_00006<f0:double> [i0=0:839825,839826,0,i1=0:1,1000,0]
0.35 sec

Downloading a 1D, 2-attribute array vs a 2D, 1-attribute array

Schemas and download speeds on my machine

<lat:double,lon:double> [i=0:839825,1000000,0]  1.3s
<f0:double> [i0=0:839825,1000000,0,i1=0:1,1000000,0] 2.7s

To download arrays, SciDB-Py needs to transfer the data value and location for every element. This leads to a larger payload for the second array. The first array downloads as 839826 * (8 bytes for lat + 8 bytes for lon + 8 bytes for i0). The second array downloads as 839826 * 2 * (8 bytes for f0 + 8 bytes for i0 + 8 bytes for i1).
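Plugging in the numbers (assuming 8 bytes for each value and each coordinate, per the accounting above):

```python
# Approximate download payloads for the two schemas above.
n = 839826

# 1D, two attributes: lat + lon + one coordinate per cell
payload_1d = n * (8 + 8 + 8)

# 2D, one attribute: value + two coordinates, for 2x as many elements
payload_2d = n * 2 * (8 + 8 + 8)

print(round(payload_1d / 1024.0 ** 2, 1))  # ~19.2 MB
print(round(payload_2d / 1024.0 ** 2, 1))  # ~38.4 MB
```

The 2x payload lines up with the roughly 2x difference in download time.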

As a side note, you can see how much data are downloaded (along with a bunch of other diagnostic info) by running

import logging
logging.getLogger('scidbpy').setLevel(logging.DEBUG)
logging.basicConfig()

#5

Hi Chris,

thanks for taking the time to look at this, it’s very much appreciated :smile:

Ah, thanks! Coming from the C++ world, I expected float to mean float, but I see now how naive I was :wink:

It’s consistently ~8 sec, and that’s what surprises me most. I assume that I need to get to the root of this to figure out my problem.
If I run the query generated by zeros directly on the SciDB machine with iquery, I get the 8 seconds as well:

time iquery -naq "store(build(<f0:double> [i0=0:839825,1000,0,i1=0:1,1000,0],0), SpeedMeasure1)"
Query was executed successfully
real	8.132s

I’ve got a 4-node SciDB cluster that runs on two VMs on two physical servers. Still, reducing the chunk size blows up the execution time considerably, from ~0.5 sec to ~12 sec.
I suppose that’s a hint that my SciDB cluster config is messed up…?

[quote]
As a side note, you can see how much data are downloaded (along with a bunch of other diagnostic info) by running

import logging
logging.getLogger('scidbpy').setLevel(logging.DEBUG)
logging.basicConfig()
[/quote]
Thanks for the tip, that actually helps me quite a lot :smile:


#6

Yes, the 8-second figure, and the slow iquery time, seem to isolate SciDB (and not SciDB-Py) as the root of your performance problems. That’s outside my area of expertise, so let me tap some other SciDB gurus to see if they can shed any light on this thread.