Network error while running aggregate


#1

I’m receiving a network error when running an aggregate query on an Amazon EC2 VM. I’m running version 13.12 on a single m3.2xlarge instance, and I’ve managed to reproduce the error with the following script.

What I’m ultimately trying to do is load 250 m resolution MODIS data that comes in 4800x4800 rasters (4800 x 4800 = 23,040,000 pixels, which is where the i dimension below comes from). Each file is a 16-day composite, so there are two images per month. I want to load a year’s worth of rasters (eventually multiple years, but one to start) and compute the average EVI value for each pixel in each month. I’ve set up my array with four dimensions (year, month, day, pixel index) and one attribute: Enhanced Vegetation Index (EVI). I want to aggregate over year and day, grouping by month and pixel index, to get the average EVI for each month at each pixel.

I’m running the following commands to set up an array with some dummy data and run the aggregation.

#!/bin/bash

set -x

# Cleanup
iquery -anq "remove(mod13q1_test)" &> /dev/null
iquery -anq "remove(mod13q1_test_climate)" &> /dev/null

# Generate an array of test data
time iquery -anq "
store(
    build(<evi:int16>
                 [year=2001:2001,1,0, month=1:12,12,0, doy=1:2,2,0, i=0:23039999,170000,0],
          month + i),
    mod13q1_test)"

# Compute mean EVI for each pixel in each month.
time iquery -anq "
store(
   aggregate(
      mod13q1_test, avg(evi), stdev(evi), month, i),
   mod13q1_test_climate)"

This is the output that I receive:

+ iquery -anq '
store(
    build(<evi:int16>
                 [year=2001:2001,1,0, month=1:12,12,0, doy=1:2,2,0, i=0:23039999,170000,0],
          month + i),
    mod13q1_test)'
Query was executed successfully

+ iquery -anq '
store(
   aggregate(
      mod13q1_test, avg(evi), stdev(evi), month, i),
   mod13q1_test_climate)'
SystemException in file: src/network/BaseConnection.h function: receive line: 294
Error id: scidb::SCIDB_SE_NETWORK::SCIDB_LE_CANT_SEND_RECEIVE
Error description: Network error. Cannot send or receive network messages.

I don’t get the error every time I run the script, but sometimes the operation takes over an hour to complete. Does anyone know what’s going on here, or what I can try in order to get more information?


#2

Hello,

Yes, this might be a known issue. You are aggregating from a sparse form into a dense form, so your result chunks become dense at 12 x 170000 = 2,040,000 cells each, which may be too large and could be running you out of memory.

Try this workaround and see if it works:

redimension(
   mod13q1_test,
   <evi_avg: double null, evi_stdev: double null>
   [month=1:12,12,0, i=0:23039999,80000,0],
   avg(evi) as evi_avg, stdev(evi) as evi_stdev)

The rationale is that 80000 is roughly 1M / 12, so you maintain about 1M elements per chunk in the result. Let me know if this does a better job…
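
For a rough sense of the sizes involved (assuming 8 bytes per double, and remembering the result has two attributes):

# Per-attribute chunk footprint, before and after the workaround:
#   original grouping: 12 * 170000 = 2,040,000 cells, ~16 MB per attribute
#   workaround:        12 *  80000 =   960,000 cells, ~7.7 MB per attribute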


#3

Well, the redimension query hasn’t crashed, but it’s been running for over an hour. Meanwhile, I can run an aggregation over all dimensions in a few seconds:

% time iquery -anq "consume(aggregate(mod13q1_test, avg(evi)))"
Query was executed successfully

real    0m2.256s
user    0m0.004s
sys     0m0.012s

So it does seem like a memory issue. Are there any settings that I should try tweaking? Or should I try upgrading to 14.3?


#4

Ah yes, I missed the 13.12. You are most likely hitting this bug, which was fixed in 14.3: viewtopic.php?f=11&t=1293&hilit=regrid&start=10

Also: aggregating over a large number of groups is always going to be slower than a grand aggregate or an aggregate over a few groups. All the work goes into splitting the data into groups; the more groups, the harder the job.
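
To put numbers on it for this schema (the counts follow from the dimensions above):

# Grand aggregate: a single group over all ~553 million cells
aggregate(mod13q1_test, avg(evi))

# Grouped aggregate: 12 * 23,040,000 = 276,480,000 output groups
aggregate(mod13q1_test, avg(evi), month, i)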


#5

Ok, I’ve upgraded to 14.3. The good news is that I haven’t seen the network error again. But my aggregation query is still taking over an hour to run. I’ve tried both my original aggregation call and the redimension version that you suggested. I understand that this query will take longer than a grand aggregation, but an hour per query is too long for my use case. Does anyone have advice on how I might restructure the data or the query to get better performance? Or on what I might try in order to understand where the time is being spent?

As a test case I’m using an array with this schema, aggregating over day (grouping by month and i):

<evi:double> [month=1:12,12,0, day=1:2,2,0, i=0:23039999,40000,0]

aggregate(mod13q1_test, avg(evi), month, i)

I think I’ve set the chunk sizes correctly: 12 * 2 * 40000 = 960,000 cells per chunk, just under the 1M target. Any advice is appreciated. Thanks!


#6

Well, it is definitely hardware-dependent. Here’s what I’m getting on my 4-machine cluster. The cluster cost a total of $25K three years ago; I can give you the specs.

$ iquery -aq "create array foo <evi:double> [month=1:12,12,0, day=1:2,2,0, i=0:23039999,40000,0]"
Query was executed successfully

$ iquery -anq "store(build(foo,random()), foo)"
Query was executed successfully

$ time iquery -anq "store(aggregate(foo, avg(evi), month, i), bar)"
Query was executed successfully

real	0m28.410s
user	0m0.020s
sys	0m0.000s

So you’re saying on your hardware this takes over an hour?

The aggregate is a two-phase op. Each instance first computes a local aggregation of the chunks it has; then the local aggregations are redistributed and merged with what the other instances have computed, in parallel. In this case the local aggregation result is fairly large, so my guess would be that you don’t have enough memory to hold all of it at once, and as it’s being populated, pieces of it are getting swapped out to disk.

You could also try a piecewise approach. For example, I’ve got 64 instances (4 machines x 16 instances each), so I can do something like this:

# Aggregate the first 64 chunks, one chunk per instance
$ time iquery -anq "store(aggregate(between(foo, null, null, 0, null, null, 2559999), avg(evi), month, i), bar)"
Query was executed successfully

real	0m4.870s
user	0m0.004s
sys	0m0.008s

# Aggregate the second 64 chunks
$ time iquery -anq "insert(aggregate(between(foo, null, null, 2560000, null, null, 5119999), avg(evi), month, i), bar)"
Query was executed successfully

real	0m4.822s
user	0m0.008s
sys	0m0.004s
...

This will do the same computation eventually but will keep memory usage down… Curious to see how we can help…
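
If you don’t want to type out the remaining slices by hand, here’s an untested sketch of the loop; the stride of 2,560,000 cells is 64 chunks x 40,000 from the schema above (so 9 slices in total), and the variable names are just illustrative:

#!/bin/bash
# Sketch: insert the remaining slices one 64-chunk stripe at a time.
# The first two stripes were handled by the store/insert above.
STRIDE=2560000
MAX=23039999
start=5120000
while [ $start -le $MAX ]; do
  end=$(( start + STRIDE - 1 ))
  [ $end -gt $MAX ] && end=$MAX
  time iquery -anq "insert(
     aggregate(
        between(foo, null, null, $start, null, null, $end),
        avg(evi), month, i),
     bar)"
  start=$(( end + 1 ))
done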


#7

Interesting… I’m running one m3.2xlarge VM on Amazon, with four SciDB instances. The aggregation query clocked in at 61 minutes the last time I tried. Maybe I’ll try adding a few more machines to the cluster and see if I can reproduce your results. I’ll also try playing with the SciDB memory settings. Thanks for your help.


#8

Before you kill yourself - can you post your config.ini file?


#9

Here’s my config.ini. The database is running on a VM with 8 “virtual CPUs” and 30 GB RAM.

[mydb]
server-0=localhost,3
db_user=mydb
db_passwd=mydb
install_root=/opt/scidb/14.3
pluginsdir=/opt/scidb/14.3/lib/scidb/plugins
logconf=/opt/scidb/14.3/share/scidb/log1.properties
base-path=/home/scidb/DB-mydb
base-port=1239
interface=eth0
redundancy=1


#10

I added a second machine to my cluster and the aggregation time dropped from an hour to fifteen minutes, so it looks like I can move forward by scaling out. But I would still be interested to know whether anything could be changed to make the operation perform well on a single machine. It’s only about 4.5 GB of data; I would have thought a machine with 30 GB of RAM could handle it comfortably…
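
In the meantime I’ll experiment with the memory-related settings in config.ini. The parameter names below are the ones I found in the SciDB configuration docs (please correct me if I’ve misread them), and the values are just first guesses for a 30 GB machine running four instances:

# Hypothetical additions to the [mydb] section; values are in MB.
max-memory-limit=6144        # hard per-instance memory cap
mem-array-threshold=1024     # cache for in-memory array chunks
merge-sort-buffer=512        # per-operator sort buffer
smgr-cache-size=1024         # storage-manager chunk cache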