Scidb-py: data extraction is very slow


#1

Hi everybody,

I am making my first steps with SciDB. I managed to get it up and running on a single Linux machine, including Shim. I then ran the following example in an IPython shell:

In [1]: import scidbpy as scidb
In [2]: sdb = scidb.connect('http://localhost:8080')
In [3]: x = sdb.random((800,600))
In [4]: %timeit x[67, 456]
1 loops, best of 3: 1.17 s per loop
In [5]: %timeit x.toarray()
1 loops, best of 3: 2.4 s per loop

I expected this to take some milliseconds. But it takes more than 1 second to extract a single value from a 800x600 array. Is this normal? Or is there something wrong with my installation?

My plan was to use SciDB as a database for (2D) meteorological data. My use cases are, basically, extracting time series for given points, creating 2D plots, and serving these over the Web in “real time”. But if it takes more than 2 seconds to read a 800x600 array from the DB, plotting in real time is pretty hopeless.

Any advice is appreciated!

BTW, why are the R and Python packages communicating with SciDB through HTTP? Wouldn’t Unix Domain Sockets (or even TCP sockets) be much faster?


#2

Hi,

I can try to add some commentary about this timing information:

This line:

In [4]: %timeit x[67, 456]

is very inefficient for SciDB-Py, since the overhead of talking to the database, setting up the data transfer, etc. dominates the actual download.

This line:

In [5]: %timeit x.toarray()

is better, since most of the time is spent in the data transfer itself. To reconstruct your array locally, SciDB-Py needs to download the value and the two array indices of each element – this ends up being about 11.5 MB of data (800 x 600 x (8 bytes for value + 8 bytes for index1 + 8 bytes for index2)). The execution time is mostly occupied by shipping these ~11.5 MB over HTTP.
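The arithmetic behind that figure, spelled out:

```python
# Back-of-the-envelope size of the toarray() transfer described above:
# each element ships its float64 value plus two int64 coordinates.
rows, cols = 800, 600
bytes_per_elem = 8 + 8 + 8           # value + index1 + index2
payload = rows * cols * bytes_per_elem
print(payload / 1e6, "MB")           # ~11.5 MB
```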

SciDB-Py and SciDB-R both use Shim to communicate with SciDB, which is why we use HTTP (Shim exposes SciDB over an HTTP interface). We are exploring various ways of speeding up the communication – using another protocol besides HTTP is one option, as is compressing the data during transfer. We could also save on bandwidth by not transferring the array indices, but this is currently necessary, since SciDB streams array contents in a non-contiguous, chunked fashion.
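To make the compression option concrete, here is a small sketch (this is not how Shim actually transfers data, just an illustration): the two coordinate arrays are highly regular and compress extremely well with plain zlib, while random float64 values barely compress at all.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
rows, cols = 800, 600

# Dense random values: essentially incompressible.
values = rng.random((rows, cols))

# The two coordinate arrays a COO-style transfer would carry:
# very regular, hence highly compressible.
i, j = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")

for name, arr in [("values ", values),
                  ("index1 ", i.astype(np.int64)),
                  ("index2 ", j.astype(np.int64))]:
    raw = arr.tobytes()
    packed = zlib.compress(raw, level=6)
    print(f"{name}: {len(raw)/1e6:5.2f} MB raw -> {len(packed)/1e6:5.2f} MB compressed")
```

So even without dropping the indices, compressing them would remove most of the index overhead; the hard-to-shrink part is the values themselves.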

For some more context, here’s a small notebook that looks at the speed of toarray() for 1D arrays, as a function of element size. You can see that, on my machine, SciDB-Py is most efficient for arrays with ~1M or more elements.

nbviewer.ipython.org/gist/ChrisB … 115f4c76c0

cheers,
Chris


#3

Hi Chris,

Thanks a lot for your explanations. I did not know that the array indices are transferred as well. That obviously slows things down. I am happy to hear that you are considering options other than HTTP for sending data. I am sure there is a lot of potential. Just for comparison, running the same example on HDF5 gives the following result:

In [1]: f = h5py.File('/tmp/test.h5', 'r')   # some file containing an 800x600 float array
In [2]: f['myarray'].shape
(800, 600)
In [3]: %timeit f['myarray'][:]     # reading the array from disk into memory
1000 loops, best of 3: 676 µs per loop

I know the comparison is a bit unfair because, with HDF5, there is no interprocess communication involved. But still, 670 microseconds vs. 2.4 seconds makes a huge difference. HDF5 is very much optimized for fast I/O. SciDB, obviously, is not. But it’s clear that SciDB has a lot of advantages over HDF5 (which is just a file format). My hope is that SciDB-py/SciDBR is eventually going to catch up with HDF5 in terms of I/O performance.

Cheers,
Remo


#4

Hey Remo

This line is actually misleading:

In [3]: %timeit f['myarray'][:]
1000 loops, best of 3: 676 µs per loop

because I bet your OS caches the data on the first read and then serves it from RAM for the other 999 iterations. As a back-of-the-envelope estimate, an SSD reads data at around 100-300 MB/s, so reading 800x600 doubles from disk takes on the order of 10-40 ms.
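Redoing that back-of-the-envelope arithmetic (the throughput figures are just round numbers for scale, not measurements):

```python
# Cold-cache estimate: reading 800x600 float64 values from an SSD.
size_mb = 800 * 600 * 8 / 1e6               # ~3.84 MB on disk
for mb_per_s in (100, 300):                  # assumed SSD read speeds
    ms = size_mb / mb_per_s * 1000
    print(f"at {mb_per_s} MB/s: {ms:.0f} ms")
```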

But your point is fair: there are a lot of things SciDB could do to improve throughput beyond ~10 MB/s. Compression is a big potential gain, and something that we’re working on. However, the >5 GB/s throughput implied by your timing is a bit too optimistic :smile:

Out of curiosity, are you running SciDB-Py on the same machine as the database? We are also looking into options for speeding up this particular use case, since in this case we might be able to bypass some of the network overhead in Shim.


#5

Hey Chris,

Thanks a lot for your reply. And sorry for not responding sooner – I lost track of this thread. You’re totally right about the caching. And yes, I do run SciDB and SciDB-Py on the same machine. Looking forward to future versions of SciDB-Py… :smile: