Join dense arrays


#1

I'm new to SciDB, but I have been playing around with it for a couple of months. I'd like to use it for computation on dense multidimensional raster datasets, like remote sensing imagery. I'm sure I am doing some things wrong, so any help is appreciated.

Issue #1: data size and loading.
Dataset: glc2000.tif. This is a 33 MB raster of the world; width × height = 40320 × 16353 = 659,352,960 values.
I initially wrote the image out to SciDB CSV format, and that file was approximately 11 GB. Yikes! Because of that size I couldn't really take advantage of the aio loading tools, but perhaps I am wrong. Loading the data took over 12 hours, and redimensioning it took another 12, which doesn't seem like a very good strategy. However, I noticed a couple of interesting notes on the forum that I am continuing to work on: 1) scidb-py connects via SHIM, and I was able to successfully write multidimensional arrays of data from Python directly to SciDB. This seemed much more efficient than the other loading strategies, and I plan to write a script that uses it. 2) On the forum I found some work discussing the ability to have SHIM on each instance. I'd definitely be open to helping with that.
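
For reference, the flat-load-plus-redimension pipeline looked roughly like this (treat it as a sketch: the file path is a placeholder and the 1-D staging schema is approximate):

# 1-D staging array of (x, y, value) triples, loaded from the ~11 GB SciDB-format text file
iquery -nq "CREATE ARRAY glc2000_1d <x:int64, y:int64, value:int16> [i=0:*,500000,0]"
iquery -anq "load(glc2000_1d, '/home/scidb/glc2000_scidb.txt')"
# then redimension the staging array into the 2-D target
iquery -anq "store(redimension(glc2000_1d, glc2000_2d), glc2000_2d)"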

Issue #2: Question: count the number of pixels in Washington State.
show(py1102597373689_00001);
{i} schema
{0} 'py1102597373689_00001<f0:int16> [i0=0:2784,1000,0,i1=0:6474,1000,0]'

To get here I loaded a geographic boundary dataset of the United States, rasterized it, and turned it into an array, “py1102597373689_00001”. Now I have two 2-D arrays, and I want to overlay them (join) and get some statistics (aggregate). This is the final query that worked; 53 identifies Washington in the SciDB array and 840053 in PostgreSQL. What I am noticing so far is that ad hoc queries relying on the join are not as fast as I expected: PostgreSQL gets the result I want much faster. One thought is to change my 2-D grid to a 3-D one and see what the performance gain is there. Any thoughts or suggestions are appreciated.

iquery -atq "aggregate(filter(join(subarray(glc2000_2d,6187, 4548, 12662, 7333),py1102597373689_00001), f0=53), count(value));"
{i} value_count
{0} 261318
Query execution time: 387779ms

For comparison, the equivalent PostgreSQL/PostGIS query and its timing:

WITH geographic_table as (
    SELECT sgl.id as sample_geog_level_id,
           gi.id as geog_instance_id,
           gi.label as geog_instance_label,
           gi.code as geog_instance_code,
           bound.geog::geometry as geom
    FROM sample_geog_levels sgl
    inner join geog_instances gi on sgl.id = gi.sample_geog_level_id
    inner join boundaries bound on bound.geog_instance_id = gi.id
    WHERE sgl.id = 1051
), clip as (
    SELECT g.geog_instance_id, g.geog_instance_label, g.geog_instance_code,
           ST_Union(ST_Clip(r.rast, g.geom)) as rast
    FROM geographic_table g
    inner join landcover.glc2000_tp r on ST_Intersects(r.rast, g.geom)
    GROUP BY g.geog_instance_id, g.geog_instance_label, g.geog_instance_code
), selection_value_count as (
    SELECT geog_instance_id, geog_instance_label, geog_instance_code, (ST_ValueCount(rast)).*
    FROM clip
    WHERE geog_instance_code = 840053
)
SELECT sum(count)
FROM selection_value_count;

Time: 8448.855 ms


#2

Hey thanks for the example and the use case.

I don't see a reason why you can't load 11 GB with aio. Did something not work?
The best bet is probably binary loads: https://github.com/Paradigm4/binary_load_example
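
For a single int16 attribute, the load side would look something like this (a sketch; paths and staging names are placeholders, and the binary template has to match how the file was written):

# raw pixel values dumped as native-endian int16, assumed row-major with width 40320
iquery -nq "CREATE ARRAY glc_stage <value:int16> [i=0:659352959,1000000,0]"
iquery -anq "load(glc_stage, '/tmp/glc2000.bin', -2, '(int16)')"    # -2 = load on the coordinator
# derive x/y from the cell position and redimension into the 2-D target (adjust for your dimension order)
iquery -anq "store(redimension(apply(glc_stage, x, i / 40320, y, i % 40320), glc2000_2d), glc2000_2d)"

The win is that the binary file carries only the values (about 1.3 GB for 660 million int16s) instead of an 11 GB CSV with explicit coordinates.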

What’s the schema for glc2000?
It might help to look at your config.ini file too.


#3

Here is the schema for glc2000 in SciDB form:

show(glc2000_2d) ;
{i} schema
{0} 'glc2000_2d<value:int16> [x=0:16353,250,3,y=0:40320,250,3]'

Here is the config.ini file

[single_server]
server-0=localhost,3
install_root=/opt/scidb/VERSION
pluginsdir=/opt/scidb/VERSION/lib/scidb/plugins
logconf=/opt/scidb/VERSION/share/scidb/log4cxx.properties
db_user=scidb_pg_user
db_passwd=scidb_pg_user_pasw
base-port=1239
base-path=/home/scidb/scidb_data
redundancy=0

### Threading: max_concurrent_queries=2, threads_per_query=1
# max_concurrent_queries + 2:
execution-threads=4
# max_concurrent_queries * threads_per_query:
result-prefetch-threads=2
# threads_per_query:
result-prefetch-queue-size=1
operator-threads=1

### Memory: 1000MB per instance, 750MB reserved
# network: 300MB per instance assuming 5MB average chunks
# in units of chunks per query:
sg-send-queue-size=15
sg-receive-queue-size=15
# caches: 300MB per instance
smgr-cache-size=150
mem-array-threshold=150
# sort: 150MB per instance (specified per thread)
merge-sort-buffer=75

#4

Hey, there are a few things here:

The py… array has 1000x1000 chunks, which is reasonable for a dense array. glc2000_2d has 250x250 chunks, also reasonable but on the lower side. Joining a 1000x1000-chunked array against a 250x250-chunked one is a big part of what makes this slow. As you can see, it works, but under the hood it requires a so-called repartition of one of the arrays to make the chunk sizes match.

If you re-load (or repartition) the arrays so that the chunk sizes match, you should see much better join performance.
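
One way to get there without going back to the source file is a repart plus store, along these lines (untested sketch; the target name is a placeholder, and it does rewrite the whole array once):

# repartition glc2000_2d into 1000x1000 chunks with no overlap, matching the py... array
iquery -anq "store(repart(glc2000_2d, <value:int16> [x=0:16353,1000,0, y=0:40320,1000,0]), glc2000_2d_1000)"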

Not sure how much RAM you have but you can try increasing mem-array-threshold while decreasing smgr-cache-size. That may give you more scratch space for the loads and sorts.


#5

Sorry, I’ll be more specific re: the memory stuff:

try

mem-array-threshold=512

Leave smgr-cache-size unchanged for now.
The configurator can be a bit conservative, and if you really have 4 GB of RAM this should be OK.


#6

OK, I'm trying this part first. Could you elaborate on, or point me to some documentation about, how chunk size affects performance? Also, I wanted to read up a bit more on your indexing strategy; I talked to Paul Brown a while back and he mentioned its name, but I can't seem to recall it.


#7

Yes, there is an architecture paper that is currently being re-uploaded to our website; it should be back up in a couple of days. The next best explanation is the Mac Storage writeup.

But your question is more specifically about joins and the join implementation. Some prototyping work is ongoing in this area.

For the join you are using, the first major consideration is whether the chunk sizes match. A quick experiment illustrates the difference:

$ iquery -anq "store(build(<val:double> [x=1:4000,1000,0, y=1:4000,1000,0], random()), A)"
Query was executed successfully
$ iquery -anq "store(build(<val:double> [x=1:4000,1000,0, y=1:4000,1000,0], random()), B)"
Query was executed successfully
$ iquery -anq "store(build(<val:double> [x=1:4000,250,0, y=1:4000,250,0], random()), C)"
Query was executed successfully

$ time iquery -aq "op_count(join(A,B))"
{i} count
{0} 16000000

real	0m0.651s
user	0m0.004s
sys	0m0.004s

$ time iquery -aq "op_count(join(A,C))"
{i} count
{0} 16000000

real	0m19.371s
user	0m0.008s
sys	0m0.000s

In other words, the difference should be substantial.
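
The slow case above is roughly equivalent to doing the repartition explicitly yourself, e.g.:

$ iquery -aq "op_count(join(A, repart(C, <val:double> [x=1:4000,1000,0, y=1:4000,1000,0])))"

The join has to produce that matching chunking one way or another, which is why storing your data with matching chunks up front is the real fix.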


#8

Thanks for your help, but I am running into another problem.

When I try running the following query, it times out. I adjusted mem-array-threshold and now I seem to be encountering more problems. Suggestions?
What should I reduce smgr-cache-size to?

I have about 6 GB of memory on an Amazon machine, and a 100 GB disk with about 10 GB free.

store(redimension(glc2000_1d,glc2000_bigchunk),glc2000_bigchunk);


#9

Hey, this is 15.12 right?

Try these configs:

execution-threads=5
result-prefetch-threads=4
result-prefetch-queue-size=1
operator-threads=1

mem-array-threshold=512
merge-sort-buffer=512
smgr-cache-size=128

sg-send-queue-size=4
sg-receive-queue-size=4

You just need to do scidb.py stopall / startall for them to take effect.
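
With the cluster name from your config.ini above, that would be something like:

scidb.py stopall single_server
scidb.py startall single_server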

I’m assuming glc2000_bigchunk has chunks that are 1000x1000, right? And what’s the schema for glc2000_1d?

To build more confidence and get a better completion estimate you can try running the redimension in pieces. For example:

#assuming you have glc2000_1d <....> [i=0:*]

#1 insert the first 10 million values
time iquery -aq "insert(redimension(between(glc2000_1d, 0, 9999999), glc2000_bigchunk), glc2000_bigchunk)"

#2 insert the next 10 million values
time iquery -aq "insert(redimension(between(glc2000_1d, 10000000, 19999999), glc2000_bigchunk), glc2000_bigchunk)"

#... and so on

#10

I am running 15.7, which we compiled ourselves, so there are likely to be some issues there.
However, I am considering using one of your pre-existing AMIs for some benchmark testing. In particular, the array join operation seems very interesting, and I encountered something similar in PostgreSQL.

What machine specs would you recommend for these dense datasets?


#11

I see. Sure. I would check that the build is optimized (RelWithDebInfo).
Given that you’re clearly an academic researcher, I’ll privately send you our AMI quick-start guide with some instructions. That might come in handy.
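
A quick way to check, assuming you still have the CMake build tree around:

# CMAKE_BUILD_TYPE should say RelWithDebInfo (or Release), not Debug
grep CMAKE_BUILD_TYPE CMakeCache.txt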


#12

Sounds great. It seems like the limit is about 50 million points per redimension; beyond that I receive the connection error below. Does SciDB operate similarly to an RDBMS, with chunk/row “locking”?

time iquery -anq "insert(redimension(between(glc2000_1d, 0, 49999999), glc2000_bigchunk), glc2000_bigchunk)"
Query was executed successfully

real 29m41.578s
user 0m0.008s
sys 0m0.007s

SystemException in file: src/network/BaseConnection.h function: receive line: 414
Error id: scidb::SCIDB_SE_NETWORK::SCIDB_LE_CANT_SEND_RECEIVE
Error description: Network error. Cannot send or receive network messages.

#13

I moved the data into a new array with chunks set to 1000 on both dimensions.

Query performance improved a bit, but not as much as I wanted. After reading that document, I can try to improve performance further by increasing the chunk size (see the sketch at the end of this post).

 time iquery -aq "aggregate(filter(join(subarray(glc2000_bigchunk,187, 4548, 12662, 7333),py1102597373689_00001), f0=53), count(value));"
{i} value_count
{0} 261318

real 6m48.965s
user 0m0.016s
sys 0m0.000s
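
For the bigger chunks I'm thinking of something along these lines (untested sketch; the target names are placeholders, and it assumes glc2000_bigchunk kept the same dimensions and attribute as glc2000_2d):

# 2000x2000 chunks (~4 million cells per chunk), keeping both arrays' chunking matched
iquery -anq "store(repart(glc2000_bigchunk, <value:int16> [x=0:16353,2000,0, y=0:40320,2000,0]), glc2000_chunk2000)"
iquery -anq "store(repart(py1102597373689_00001, <f0:int16> [i0=0:2784,2000,0, i1=0:6474,2000,0]), wa_mask_chunk2000)"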