Getting the array result is slow


#1

Hi,
I’m using the Python connector to connect to SciDB and fetch the result array. The way I do it is to iterate over every chunk, then over every element within the chunk, and store those elements in a Python list. But that’s pretty slow. Is there a better way to do this?
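
Roughly, the loop looks like this (the method names are approximate, from memory, so treat it as a sketch of the pattern rather than the exact connector API):

[code]
import scidbapi

# Connect and run a query (host/port are whatever your install uses).
db = scidbapi.connect("localhost", 1239)
result = db.executeQuery("scan(A)", "afl")

# Walk chunk by chunk, then element by element, copying into a Python list.
values = []
attrs = result.array.getArrayDesc().getAttributes()
chunk_iter = result.array.getConstIterator(attrs[0].getId())
while not chunk_iter.end():
    elem_iter = chunk_iter.getChunk().getConstIterator()
    while not elem_iter.end():
        # One getItem() per cell -- this per-element step is what's slow.
        values.append(elem_iter.getItem().getDouble())
        elem_iter.increment_to_next()
    chunk_iter.increment_to_next()
[/code]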

Thanks,
Jian


#2

First, let’s be blunt and admit that our experience with the Python connector has taught us a couple of valuable lessons about the pros and cons of tools like SWIG. SWIG’s great for getting a connector implemented cleanly and efficiently. But its run-time performance (SWIG does a magic-circle-hat-dance with every data object that passes between the SciDB client library and Python) is a bit ordinary.

Anyway, to answer your question: it all depends on what you want to do with the result. The alternatives are:

  1. If you just want to save the result of a query, one alternative is to use the “save()” operator. This is a server-side op that takes the data and saves it to a server-side file. Now, if you want to access the contents of that file using another tool (to put it into a spreadsheet, say), then you’re going to have to use a scripting language (Perl, Python) to re-organize the contents of that file, because “save()” exports it in a format that reflects the rich SciDB data model. However, you can use another operator like “unpack()” to render the result as something close to a 1-D csv file (see the sketch after this list).

  2. Getting data into and out of any DBMS is usually pretty time-consuming, and with SciDB the problem is made worse by the way our data objects tend to be very big. Think of it this way: your query is “ploughing the ocean”, and your client connection can, at best, sip the results through a straw. Consequently, it’s usually considered best practice when engineering big DBMS applications to minimize client <-> server data movement.
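
To make item 1 concrete: assuming a connector handle like the one in your snippet (the call style is approximate, and the exact “save()” argument list varies between SciDB releases, so check the reference for yours), the export could look like:

[code]
# unpack() flattens the n-D result into a 1-D array with an explicit
# index dimension i; save() then writes it to a file on the *server*.
db.executeQuery("save(unpack(A, i), '/tmp/result.out')", "afl")
[/code]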

Can you do inside the server what you were planning to do on the client?
Do you really need all the data? SciDB has a sampling operator, “bernoulli()”, that can be used to restrict your data to a random sample of the result; or you can use an operator like “regrid()” to reduce the size of the data being pulled out, at the expense of some precision.
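
For example, assuming a 2-D array A with a numeric attribute val (and the same approximate connector call style as above):

[code]
# Pull back roughly 1% of the cells, chosen at random.
db.executeQuery("bernoulli(A, 0.01)", "afl")

# Or average each 10x10 block of cells down to a single cell.
db.executeQuery("regrid(A, 10, 10, avg(val))", "afl")
[/code]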


#3

[quote=“plumber”]Can you do inside the server what you were planning to do on the client?
Do you really need all the data?[/quote]

Thanks for your response.
I’m building a visualization tool on top of SciDB. Essentially, I store a huge 2-D array in SciDB, where each cell is a pixel value, and I’m trying to render that array in the browser. Of course, I only need to query the subarray corresponding to the current viewport region, but even that is still unavoidably large.
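The per-viewport query itself is just a window selection, something like this (the array and variable names are illustrative, and the connector call style is the same approximate one as in my first post):

[code]
# y0, x0, y1, x1 come from the browser viewport.
query = "subarray(pixels, %d, %d, %d, %d)" % (y0, x0, y1, x1)
result = db.executeQuery(query, "afl")
[/code]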
As for the intermediate-file approach, it also incurs the overhead of writing to and then reading from disk. Compared with using pyconnector directly, which one do you think is better for large datasets?
Any ideas for handling such a case would be appreciated.

Jian