Error: Not enough memory


#1

Hello All

I am getting an "Error: Not enough memory" message during regridding or aggregating.

After the error message appears, the SciDB processes continue maxing out system memory; I have to run stopall/startall to free it.

Any ideas?

Thanks in advance,

Nick


#2

Hi Nick,

There’s a known issue where regridding or group-by aggregation that collapses a high-dimensional sparse space to the point where density rises sharply can run into trouble. The workaround for now is to use a redimension aggregate, as sketched below.
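For reference, a redimension aggregate looks roughly like this (a minimal sketch only; A, x, i, and j are placeholders, not your actual schema):

redimension(
  A,
  <x_avg:double null> [i=1:1000,100,0, j=1:1000,100,0],
  avg(x) as x_avg
)
-- A, x, i, j stand in for your array, attribute, and target dimensions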

We can help you further if you provide your config.ini, the array schema, the output of the list-chunk method, and how much memory and how many cores you are running on.

Thanks. Sorry for the inconvenience.


#3

OK, here is the Config.ini file:

[carve]
server-0=localhost,7
install_root=/opt/scidb/13.12
metadata=/opt/scidb/13.12/share/scidb/meta.sql
pluginsdir=/opt/scidb/13.12/lib/scidb/plugins
logconf=/opt/scidb/13.12/share/scidb/log4cxx.properties
db_user=carve
db_passwd=carve
base-port=1239
base-path=/media/scidb/scidb1312
redundancy=0
mem-array-threshold=845
smgr-cache-size=845
merge-sort-buffer=423
network-buffer=423
max-memory-limit=5120
execution-threads=1
result-prefetch-threads=1
result-prefetch-queue-size=1
operator-threads=1

I am running this on 8 cores and 40 GB of memory.

Here is the array schema:

modis_alaskaalbers_1km_nrt<lst_day:double NULL DEFAULT null,lst_night:double NULL DEFAULT null> [ROW=1:26100,1000,0,COL=1:34660,1000,0,DATE=0:10,1,0]

And here is the output from the list-chunk method:

iid,aid,name,nchunks,min_ccnt,avg_ccnt,max_ccnt,total_cnt
7,358,'modis_alaskaalbers_1km_nrt@1',1298,66000,957524,1000000,1242866000

Thanks,

Nick


#4

Nick,

All the chunking looks fine and the settings look decent. You’re going from 1M-cell dense chunks to 1M-cell dense chunks, so it’s not the issue I thought it was.
I am guessing this is just some general memory sloppiness. We’ve noticed that sometimes SciDB will act as if it’s leaking memory. We saw it happen in sort (and fixed some of it in the upcoming 14.3), and perhaps there are issues in regrid too. It could also be that some other array was redimensioned improperly and its chunks are now taking a disproportionate amount of memory in the chunk map.

Some options:

  1. Check the full output of the list-chunk method and make sure no array has an “nchunks” value that is too high (100K+); an oversized chunk map is a common memory consumer.
  2. Reduce the mem-array-threshold and smgr-cache-size settings further. Try reducing each by 50% (see the sketch after this list).
  3. Send us your query and the count() of the elements in your array. Let us know when it happens (all the time? some of the time? anything that leads up to it or makes it more likely?).
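For example, relative to the config.ini you posted, halving those two settings would look roughly like this (the exact values are not critical):

mem-array-threshold=422
smgr-cache-size=422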

Let me know. Thanks for helping us improve.


#5

Thanks for your help on this. I initialized a SciDB instance with your suggested configuration and am still getting memory errors. Presently, I have only the one array loaded.

Here is a count of the array elements:

AFL% aggregate(modis_alaskaalbers_1km_nrt, count(lst_day));
{i} lst_day_count
{0} 60840959

These queries using regrid and aggregate are both returning a memory error:

store(regrid(modis_alaskaalbers_1km_nrt, 1, 1, 11, avg(lst_day)), regrid_test);
store(aggregate(modis_alaskaalbers_1km_nrt, avg(lst_day), ROW, COL), agg_test);
Regridding with a coarser block size (so the output grid is much smaller) works fine:

store(regrid(modis_alaskaalbers_1km_nrt, 100, 100, 11, avg(lst_day)), regrid_test);
The system memory stays constant when I run the reduced case.


#6

OK, interesting. We’ll file a bug and investigate the regrid. Meanwhile, how about trying a workaround query:

redimension( apply( modis_alaskaalbers_1km_nrt, NEW_DATE, 0), <avg_lst_night:double null> [ROW=1:26100,1000,0,COL=1:34660,1000,0,NEW_DATE=0:0,1,0], avg(lst_night) as avg_lst_night)

This returns the equivalent array, but it uses a different pathway, so it might work better in this case. Let me know if that helps. Also, if you can upload the part of your scidb.log file where you get the error, it would help us investigate.


#7

Just FYI …

We’ve opened an internal ticket #3842 about this. We’re able to reproduce the problem.


#8

I am uploading the scidb.log for the regrid query.

The redimension query you suggested as a workaround also returned a memory error after running for around 5 hrs. I have uploaded that part of the log file as well.

Would you suggest any other memory restrictions in the config.ini? max-memory-limit?

I am currently using a merge over the time dimension to avoid aggregation; I think that will suit my purposes if it succeeds.
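Something along these lines (just a sketch of the idea, not my exact query; merge keeps the first non-empty value at each ROW/COL position):

store(
  merge(
    slice(modis_alaskaalbers_1km_nrt, DATE, 0),
    slice(modis_alaskaalbers_1km_nrt, DATE, 1)
  ),
  merge_test
)
-- merge_test is a placeholder target name; the remaining DATE slices nest the same way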

Thanks,

Nick
regrid_scidb.log (21.2 KB)
redimension_scidb.log (549 KB)


#9

Nick,

Thanks for helping us identify the problem. There is a memory-fragmentation type of situation which looks and acts a lot like a memory leak. That means the problem gets worse with increasing data sizes; lowering the memory settings will help, but won’t get rid of the problem completely.

Remember that max-memory-limit is a hard OS-level limit. It is what makes you error out when there isn’t enough memory (i.e., if you remove it, you allow SciDB to swap, and eventually it will be killed when the swap is exhausted). So you probably want to set max-memory-limit to (TOTAL_MEMORY - OS_OVERHEAD - OTHER_PROGRAMS_OVERHEAD) / NUM_INSTANCES. You can lower the other settings like mem-array-threshold, smgr-cache-size, network-buffer, and merge-sort-buffer.
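As a rough worked example for your box, assuming 8 instances on the 40 GB server and about 2 GB set aside for the OS and other processes (both numbers are assumptions to adjust for your setup):

max-memory-limit = (40960 - 2048) / 8 = 4864 MB per instance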

There are a number of tricks we can use to get things to work. The sure-fire way is to break the problem into pieces and do it one piece at a time:

store ( regrid( between(modis_alaskaalbers_1km_nrt,      1, null , null, 1000, null, null), 1,1,11, avg(lst_day)), regrid_test)
insert( regrid( between(modis_alaskaalbers_1km_nrt, 1001, null , null, 2000, null, null), 1,1,11, avg(lst_day)), regrid_test)
insert( regrid( between(modis_alaskaalbers_1km_nrt, 2001, null , null, 3000, null, null), 1,1,11, avg(lst_day)), regrid_test)
...

You can iterate like that, and when the leak builds up, restart SciDB and continue. Not ideal, but it gets the job done.

Now, since the avg aggregate has 16 bytes of state (a sum and a count), its state is allocated on the heap, and that allocation is what’s getting fragmented. The leak is greatly reduced if, instead, you compute sum and count separately:

store ( regrid( between(modis_alaskaalbers_1km_nrt,      1, null , null, 1000, null, null), 1,1,11, sum(lst_day), count(lst_day)), regrid_test)

You can then apply the division manually, along the lines sketched below.
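For example (a sketch; it assumes the default output attribute names from sum() and count(), and you may need an explicit cast of the count to double):

project(
  apply(regrid_test, lst_day_avg, lst_day_sum / lst_day_count),
  lst_day_avg
)
-- lst_day_sum and lst_day_count are the assumed default names; lst_day_avg is a placeholder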

Further, this problem has to do with reopening chunks multiple times. So if you sort the data a certain way first, you can reduce it further:

redimension( sort(unpack(project(modis_alaskaalbers_1km_nrt, lst_night), i), ROW, COL), <avg_lst_night:double null> [ROW=1:26100,1000,0,COL=1:34660,1000,0], avg(lst_night) as avg_lst_night)
-- could use sum and count here too

There we are. I am not 100% certain what will work best for you, so I’m giving you different options. The first (piecewise) approach is a sure-fire way to finish the job. The other two are “one-step” alternatives to try out.
Appreciate you helping us make our product better. As soon as we have a fix, or patch of some kind, we’ll put it up here.


#10

Any idea when this might be resolved? Using this type of iteration slows these operations considerably, correct?

Thanks,

Nick


#11

Hi Nick,

Happy to report we have a fix. If you’re building from 13.12 source, then apply the following:

Index: src/array/MemArray.cpp
===================================================================
--- src/array/MemArray.cpp	(revision 6872)
+++ src/array/MemArray.cpp	(working copy)
@@ -2344,6 +2344,7 @@
                 dataChunk->unPin();
             }
         }
+        values.~ValueMap();
     }
 
     bool RLEChunkIterator::setPosition(Coordinates const& pos)

Basically, it is a leak; we were a little bit too smart with our custom allocator. With that patch applied, you should hopefully see a predictable memory footprint. You should then be able to raise some of the memory limits for better performance.

There’s a separate issue where something like this may use up too much disk space. If that happens, try the redimension() query above. The fix for that is on the way.
Thanks for being patient!


#12

After rebuilding with the suggested changes, I am no longer getting the memory error!

Thanks for your help on this.

Nick