How to improve scanning performance?


#1

Hello,
I want to read data from SciDB and turn it into an image file.
But the elapsed time for “scan” is 16.8s, which seems slow to me.
Is this a normal speed, or is something wrong? Help me :cry:

Target array: tif_image<val:uint64> [x=0:9402:0:512; y=0:4122:0:256]
OS Version: CentOS 6.9
SciDB Version: 18.1.8
Shim Version: 18.1.2
SciDB-py Version: 18.1.2

AFL% summarize(tif_image);
{inst,attid} att,count,bytes,chunks,min_count,avg_count,max_count,min_bytes,avg_bytes,max_bytes
{0,0} ‘all’,20912926,182169784,386,44,108357,131072,48,471942,1183804

  • test.py

    from scidbpy import connect
    db=connect()
    %time arr = db.arrays.tif_image[:]
    
  • config.ini

    [cluster]
    server-0=localhost,0-3
    server-1=192.168.10.157,0-3
    install_root=/opt/scidb/18.1
    pluginsdir=/opt/scidb/18.1/lib/scidb/plugins
    logconf=/opt/scidb/18.1/share/scidb/log4cxx.properties
    db_user=scidb_pg_user
    db_passwd=scidb_pg_user
    base-port=1239
    base-path=/home/scidb/scidb_data
    redundancy=0
    security=trust
    
    target-cells-per-chunk=100000
    mem-array-threshold=128
    chunk-size-limit-mb=128
    smgr-cache-size=128
    merge-sort-buffer=32
    execution-threads=4
    result-prefetch-threads=3
    result-prefetch-queue-size=1
    operator-threads=1
    sg-send-queue-size=4
    sg-receive-queue-size=4
    max-arena-page-size=8

#2

Hi @derbar,
Are you seeing SciDB’s AFL operator ‘scan’ itself take 16 seconds? Are you using ‘scan’ to read the data back from the array, or are you timing the scidbpy script that you have written?
What does your physical system have for RAM, hard disk, and CPU?
Thanks,
Dave


#3

Hey @derbar,

A few thoughts in addition to Dave’s questions. Your data is about 182MB and we’re moving it through a 3-stage process:

  1. export out of scidb
  2. shim stages it to disk
  3. data is read into python

Item (1) is distributed in SciDB (the format conversion) and so that will scale up with more SciDB instances. But when I tried it on a 4-core EC2 machine, I saw issues with steps 2 and 3.
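As a rough sanity check, here is the end-to-end throughput implied by the numbers posted so far. This is only arithmetic on the byte and cell counts from the summarize output and the reported 16.8s; the variable names are just for illustration.

```python
# Figures taken from this thread: summarize() output and the reported wall time.
total_bytes = 182_169_784   # "bytes" column of summarize(tif_image)
cell_count = 20_912_926     # "count" column of summarize(tif_image)
elapsed_s = 16.8            # reported time for db.arrays.tif_image[:]

mb_per_s = total_bytes / elapsed_s / 1e6
bytes_per_cell = total_bytes / cell_count

print(f"effective throughput: {mb_per_s:.1f} MB/s")
print(f"stored bytes per cell: {bytes_per_cell:.2f}")
```

That works out to roughly 10.8 MB/s, which suggests the time is going into format conversion and parsing rather than raw I/O.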

You can speed up (2) by running a shim backed by /dev/shm or /run/shm if you have it on your system. For example, edit /var/lib/shim/conf and change the tmp= line like so:

tmp=/run/shm/shim
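If you are not sure which of the two locations exists on your machine, a small check like the one below can tell you. It is a minimal sketch: the helper name pick_shm_dir is made up for this post, and it tests writability for the user running the script, which may not be the user shim runs as.

```python
import os

def pick_shm_dir(candidates=("/dev/shm", "/run/shm")):
    """Return the first candidate that is an existing, writable directory, else None."""
    for path in candidates:
        if os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    return None

shm = pick_shm_dir()
if shm is None:
    print("no shared-memory directory found; keep the default tmp setting")
else:
    print(f"suggested shim setting: tmp={os.path.join(shm, 'shim')}")
```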

Also, recent SciDB-Py and accelerated_io_tools support the arrow format - and that speeds up Python re-ingest here. SciDB writes data in Arrow which makes it easier to read into Python. So for me this ran over 4x faster with the use_arrow flag:

import datetime
from scidbpy import connect

db = connect()
t1 = datetime.datetime.now()
arr = db.iquery("scan(tif_image_nn)", fetch=True, use_arrow=True)
t2 = datetime.datetime.now()
print(t2-t1)
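If you want to compare several variants outside IPython, a small reusable timer keeps the measurements consistent. The sketch below is only a suggestion: the timed helper is a hypothetical name, and it wraps a stand-in workload so it runs without a SciDB connection. In practice the body of the with block would be the db.iquery(...) call.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Time the enclosed block with a monotonic clock and print the duration."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f}s")

# Stand-in workload so the sketch runs without a SciDB connection.
# In this thread's setting the body would be something like:
#   arr = db.iquery("scan(tif_image)", fetch=True, use_arrow=True)
timings = {}
with timed("dummy fetch", timings):
    data = sum(range(100_000))
```

time.perf_counter is used instead of datetime.now because it is monotonic and intended for measuring intervals.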

It’s also important to point out that “scan” will fetch “the whole thing” - 182MB in your case. For a fast visualization, do you really intend to paint all of 9402x4122 pixels? That might take time to plot. So of course if you use filtering or regridding in SciDB, it will work faster because there’s less data to move. For example:

arr = db.iquery("regrid(tif_image_nn, 5, 5, avg(val))", fetch=True, use_arrow=True)
arr = db.iquery("filter(tif_image_nn, x<=800 and y<=600)", fetch=True, use_arrow=True)
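To give a feel for what a 5x5 regrid with avg does to the data volume, here is a plain-Python illustration of block averaging. It is not SciDB's implementation, only the same idea on a list of lists; regrid_avg is a name invented for this sketch, and the size estimate uses the bounding box of the array, not the number of non-empty cells.

```python
import math

def regrid_avg(grid, bx, by):
    """Average non-overlapping bx-by-by blocks of a 2-D list indexed as grid[x][y].

    Blocks at the upper edges may be partial; they average the cells that exist.
    """
    nx, ny = len(grid), len(grid[0])
    out = []
    for x0 in range(0, nx, bx):
        row = []
        for y0 in range(0, ny, by):
            block = [grid[x][y]
                     for x in range(x0, min(x0 + bx, nx))
                     for y in range(y0, min(y0 + by, ny))]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

small = [[x * 4 + y for y in range(4)] for x in range(4)]
print(regrid_avg(small, 2, 2))  # [[2.5, 4.5], [10.5, 12.5]]

# Bounding-box size of the array before and after a 5x5 regrid.
nx, ny = 9403, 4123   # x=0:9402 and y=0:4122, both inclusive
cells_before = nx * ny
cells_after = math.ceil(nx / 5) * math.ceil(ny / 5)
print(cells_before, cells_after)
```

A 5x5 regrid cuts the bounding box by about a factor of 25, so there is correspondingly less data to export, stage, and parse.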

#4

Thanks for the reply.
I tried changing the Shim option to tmp=/dev/shm/shim and using use_arrow.
After changing the Shim option, the scan is about 1 second faster. Nice.

But using the use_arrow option raises an error:

HTTPError: 406 Client Error: UserQueryException in file: src/query/ops/save/LogicalSave.cpp function: 
inferSchema line: 149
Error id: scidb::SCIDB_SE_INFER_SCHEMA::SCIDB_LE_UNSUPPORTED_FORMAT
Error description: Error during schema inferring. Unsupported format: arrow.
save(project(apply(scan(tif_image), x, x, y, y), x, y, 
val),'/home/scidb/scidb_data/0/0/shim_output_buf_l9g3nR',0,'arrow')
                                                           ^^^^^^^ for url: 
http://localhost:8080/execute_query?query=project%28apply%28scan%28tif_image%29%2C+x%2C+x%2C+y%2C+y%29%2C+x%2C+y%2C+val%29&save=arrow&id=khyldt0ri92f2wfe7ooqwcfbqjoo80js

I am using scidb-py 18.1.2 and have installed Apache Arrow.
I am not sure which SciDB version I have :joy:

[scidb@SM2-PC ~]$ scidb --version
SciDB Version: 18.1.8
Build Type: RelWithDebInfo
Commit: d7dfe80
Copyright (C) 2008-2017 SciDB, Inc.
[scidb@SM2-PC ~]$ shim -version
SciDB Version: 18.1.10
Shim Commit: a19d256
[scidb@SM2-PC ~]$

Which version am I using?


#5

The Arrow support right now is part of the accelerated_io_tools plugin (add-on). So you need a newer version of the plugin. You can get that here:
https://paradigm4.github.io/extra-scidb-libs/

Or if that gives you trouble, the source is here:


#6

I installed accelerated_io_tools and extra-scidb-libs-18.1-4-1.x86_64 too.

load_library('accelerated_io_tools')
After that, aio_save works well.

But db.iquery("scan(tif_image)", fetch=True, use_arrow=True) still raises an error.


#7

When you re-install a plugin, you need to restart scidb for it to take effect. Did you restart scidb?

What’s the error?


#8

I did restart scidb and shim and even rebooted the server.

ERROR: Unsupported format: arrow.
I wrote down the details in my earlier post above.


#9

Ah. My apologies. You also need to set shim to use the aio plugin. In your /var/lib/shim/conf set:
aio=1

That should do it. Sorry for the trouble!
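To double-check the two settings mentioned in this thread (tmp= and aio=), a small script can read the conf and report them. This is a minimal sketch that assumes the file consists of simple key=value lines; skipping '#' comments is my assumption, and parse_shim_conf is a name made up for this post. The sample text stands in for the real /var/lib/shim/conf.

```python
def parse_shim_conf(text):
    """Parse simple key=value lines; blank lines and '#' comments are skipped."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

# Example content only; on a real system read /var/lib/shim/conf instead.
sample = """
tmp=/dev/shm/shim
aio=1
"""
conf = parse_shim_conf(sample)
print("aio enabled:", conf.get("aio") == "1")
print("tmp dir:", conf.get("tmp"))
```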


#10

Oh, I found a bug in /etc/init.d/shimsvc…
I started shim using service shimsvc start, and aio=1 was not applied.
So I looked at the shimsvc file.

line 54 in shimsvc file:

test -n "${AIO}" && AIO="-a"

You need to make the following changes:

test -n "${AIO}" && AIO="-a ${AIO}"

Now I can use the arrow option, but another error occurs. :sweat_smile:

HTTPError: 502 Server Error: SystemException in file: src/network/BaseConnection.h function: receive line: 171
Error id: scidb::SCIDB_SE_NETWORK::SCIDB_LE_CANT_SEND_RECEIVE
Error description: Network error. Cannot receive network message: Read failed: End of file (asio.misc:2). An instance may be offline…

Help me…


#11

No, the argument to “shim” should be just “-a”. /var/lib/shim/conf should say aio=1, and that starts shim with the argument -a. We run that configuration all the time.

Taking shim out of the picture, just try this from iquery:

iquery -aq "aio_save(apply(tif_image, x_v, x), '/tmp/tif_image.out', 'format=arrow')"

Does that work?


#12

No, it does not work.

SystemException in file: src/network/BaseConnection.h function: receive line: 171
Error id: scidb::SCIDB_SE_NETWORK::SCIDB_LE_CANT_SEND_RECEIVE
Error description: Network error. Cannot receive network message: Read failed: End of file (asio.misc:2). An instance may be offline…


#13

Do regular queries still run?

iquery -aq "op_count(tif_image)"

Can you save in TSV format (not Arrow)?

iquery -aq "aio_save(apply(tif_image, x_v, x), '/tmp/tif_image.out', 'format=tsv')"

Possibly an issue with your installation of Arrow. Arrow is still a fairly new package. We found we had to create our own repository of arrow to avoid some conflicts. You mentioned you installed it separately? Perhaps there’s a conflict. Try removing your version and using these instructions?


#14

You’re right, there may have been a conflict.
I reinstalled with iquery -aq "install_github('paradigm4/accelerated_io_tools')", and now arrow works well.
The test results are very nice, too.

Elapsed time:

  1. No options: 16.9s
  2. tmp=/dev/shm/shim: 15.6s
  3. use_arrow=True: 3.86s
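For reference, the speedups implied by those three timings can be worked out as below. It uses only the numbers listed above; the dictionary keys are just labels for this sketch.

```python
# Elapsed times reported in this post, in seconds.
timings = {
    "no options": 16.9,
    "tmp=/dev/shm/shim": 15.6,
    "arrow": 3.86,
}

baseline = timings["no options"]
speedups = {name: baseline / t for name, t in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {timings[name]}s, {factor:.2f}x vs. baseline")
```

So the shared-memory tmp directory gives about 1.08x and arrow about 4.4x over the baseline.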

Thank you so much!


#15

Glad we got this squared away! There are some very nice things about Arrow.