Redimension_store / memory consumption


#1

I’m trying to understand SciDB memory usage, so I’m running the following with two instances on a VM with 4 GB of RAM for testing:

create array trips<pickup_lat:double, pickup_lon:double, dropoff_lat:double, dropoff_lon:double>[row=0:*,1000000,0]

... load 6,000,000 cells into trips ...

create array trips2<dropoff_lat:double, dropoff_lon:double>[pickup_lat(double)=*,1000,0, pickup_lon(double)=*,1000,0]

redimension_store(trips, trips2)

When I run redimension_store, my memory consumption goes through the roof and the query fails with

Error id: scidb::SCIDB_SE_NO_MEMORY::SCIDB_LE_MEMORY_ALLOCATION_ERROR Error description: Not enough memory. Error 'std::bad_alloc' during memory allocation.

I’m able to use max-memory-limit to control how far the query goes before dying, but none of the other configuration parameters seems to affect my memory usage, no matter how low I set them. In particular, I was expecting mem-array-threshold to keep usage down, but it seems to have no effect. Any suggestions?

Cheers,
Tim


#2

Hi Tim,

Yes. This is the unfortunate part of what we mean when the release notes say “Support for non-integer dimensions is incomplete…” The current non-integer dimension code does not abide by any of the memory controls. What happens is we build a map, on each instance, from non-integer values to integers. The size of this map depends on the number of unique values in the dimension. Here we have up to 6 million doubles, times two dimensions, times the per-element overhead of the map, times two instances. 4 GB (minus the OS overhead) may be cutting it close.
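
As a rough back-of-the-envelope (the per-entry overhead here is just my ballpark for a typical map node, not a measured figure):

  6,000,000 values x 2 dimensions x 2 instances x ~50-100 bytes per map entry ≈ 1.2-2.4 GB

and that is only the mapping, before any of the other buffers redimension_store itself needs.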

scidb.log will have a good record of what redimension_store is doing at each step and at what point you run out of memory. You might consider posting that section of the log here.

You can try it with just one instance on the 4GB VM and see what the memory usage peaks at.

In general, I can’t say for sure, but this looks very geospatial. I’ve seen a couple of geospatial use cases round the coordinates, for example to 4 decimal places via int64(lat * 10000.0 + 0.5), which then become large integer dimensions, e.g. [lat = -900000:900000, …]. Each cell in the array then corresponds to a small patch of the earth’s surface, roughly 10m x 10m, and sometimes the original coordinates are retained as attributes. Something along these lines (sketched below) may work as an alternative - it depends on what you are trying to do.
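
For example, starting from your trips array above, something like this (untested sketch; the array name and the chunk sizes are placeholders you would want to tune, and the extra dimension i is there to absorb points that round to the same lat/lon cell):

create array trips_by_pickup<pickup_lat:double, pickup_lon:double, dropoff_lat:double, dropoff_lon:double>
  [pickup_lati=-900000:900000,1000,0, pickup_loni=-1800000:1800000,1000,0, i=0:*,100,0]

redimension_store(
  apply(trips,
        pickup_lati, int64(pickup_lat * 10000 + 0.5),
        pickup_loni, int64(pickup_lon * 10000 + 0.5)),
  trips_by_pickup)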


#3

Alex:

Thanks for the quick response - makes sense. You’re correct that this is a geospatial use-case - I’ll try the rounding approach and see how that works.

Cheers,
Tim


#4

Following up: after switching to integer dimensions and a single instance, I’m still running out of memory. Here’s the relevant section of my scidb.log:

2013-04-05 15:16:30,878 [0x7f6d8356b700] [DEBUG]: (Pre)Single executing queryID: 1100876874368
2013-04-05 15:16:30,882 [0x7f6d8356b700] [DEBUG]: Create array trips_by_pickup@1(66) in query 1100876874368
2013-04-05 15:16:30,883 [0x7f6d8356b700] [DEBUG]: Query is serialized: [pPlan]:
>[pNode] PhysicalRedimensionStore agg 0 ddl 0 tile 0 children 1
  schema trips_by_pickup@1<pickup_lat:double,pickup_lon:double,dropoff_lat:double,dropoff_lon:double> [pickup_lati=-900000:900000,1000,0,pickup_loni=-1800000:1800000,1000,0,i=0:*,100,0]
  props sgm 1 sgo 1
  distr roro
  bound start {-900000, -1800000, 0} end {900000, 1800000, 4611686018427387903} density 1 cells 4611686018427387903 chunks 4611686018427387903 est_bytes 9.82289e+20
>>[pNode] impl_materialize agg 0 ddl 0 tile 0 children 1
   schema trips@1<pickup_lat:double,pickup_lon:double,dropoff_lat:double,dropoff_lon:double,pickup_lati:int64,pickup_loni:int64> [rowtrips=0:59999999,1000000,0]
   props sgm 1 sgo 1
   distr roro
   bound start {0} end {59999999} density 1 cells 60000000 chunks 60 est_bytes 7.98e+09
>>>[pNode] physicalApply agg 0 ddl 0 tile 1 children 1
    schema trips@1<pickup_lat:double,pickup_lon:double,dropoff_lat:double,dropoff_lon:double,pickup_lati:int64,pickup_loni:int64> [rowtrips=0:59999999,1000000,0]
    props sgm 1 sgo 1
    distr roro
    bound start {0} end {59999999} density 1 cells 60000000 chunks 60 est_bytes 7.98e+09
>>>>[pNode] physicalScan agg 0 ddl 0 tile 1 children 0
     schema trips@1<pickup_lat:double,pickup_lon:double,dropoff_lat:double,dropoff_lon:double> [rowtrips=0:59999999,1000000,0]
     props sgm 1 sgo 1
     distr roro
     bound start {0} end {59999999} density 1 cells 60000000 chunks 60 est_bytes 5.58e+09

2013-04-05 15:16:30,883 [0x7f6d8356b700] [DEBUG]: Prepare physical plan was sent out
2013-04-05 15:16:30,883 [0x7f6d8356b700] [DEBUG]: Waiting confirmation about preparing physical plan in queryID from 0 instances
2013-04-05 15:16:30,883 [0x7f6d8356b700] [DEBUG]: Execute physical plan was sent out
2013-04-05 15:16:30,883 [0x7f6d8356b700] [INFO ]: Executing query(1100876874368): redimension_store(apply(trips,pickup_lati,int64(pickup_lat * 10000 + 0.5),pickup_loni,int64(pickup_lon * 10000 + 0.5)), trips_by_pickup); from program: 127.0.0.1:48588/home/slycat/install/python/bin/python2.7 slycat-scidb-geo-test.py --trip-count=60000000 --no-load ;
2013-04-05 15:16:30,884 [0x7f6d8356b700] [DEBUG]: Request shared lock of array trips@1 for query 1100876874368
2013-04-05 15:16:30,884 [0x7f6d8356b700] [DEBUG]: Granted shared lock of array trips@1 for query 1100876874368
2013-04-05 15:16:30,884 [0x7f6d8356b700] [DEBUG]: Request exclusive lock of array trips_by_pickup for query 1100876874368
2013-04-05 15:16:30,884 [0x7f6d8356b700] [DEBUG]: Granted exclusive lock of array trips_by_pickup for query 1100876874368
2013-04-05 15:16:30,888 [0x7f6d8356b700] [DEBUG]: Request shared lock of array trips_by_pickup@1 for query 1100876874368
2013-04-05 15:16:30,888 [0x7f6d8356b700] [DEBUG]: Granted shared lock of array trips_by_pickup@1 for query 1100876874368
2013-04-05 15:16:30,891 [0x7f6d8356b700] [DEBUG]: [RedimStore] Begins.
2013-04-05 15:16:30,891 [0x7f6d8356b700] [DEBUG]: [RedimStore] build mapping index took 0 ms, or 0 millisecond
2013-04-05 15:20:24,115 [0x7f6d8356b700] [DEBUG]: [RedimStore] inputArray --> RowCollection took 233221 ms, or 3 minutes 53 seconds 221 milliseconds
2013-04-05 15:25:59,589 [0x7f6d8356b700] [DEBUG]: [RedimStore] RowCollection --> beforeRedistribution took 335365 ms, or 5 minutes 35 seconds 365 milliseconds
2013-04-05 15:25:59,630 [0x7f6d8356b700] [DEBUG]: SG_AGGREGATE started
2013-04-05 15:25:59,630 [0x7f6d8356b700] [DEBUG]: [RedimStore] redistributeAggregate took 21 ms, or 21 milliseconds
2013-04-05 15:26:22,399 [0x7f6d80a7a700] [ERROR]: Job::execute: unhandled exception in job: std::bad_alloc
2013-04-05 15:26:22,515 [0x7f6d80c7c700] [ERROR]: Job::execute: unhandled exception in job: SystemException in file: src/smgr/io/Storage.cpp function: allocate line: 2856
Error id: scidb::SCIDB_SE_STORAGE::SCIDB_LE_CANT_ALLOCATE_MEMORY
Error description: Storage error. Failed to allocate memory.
Failed query id: 1100876874368
2013-04-05 15:26:22,515 [0x7f6d80979700] [ERROR]: Job::execute: unhandled exception in job: std::bad_alloc
2013-04-05 15:26:22,569 [0x7f6d80b7b700] [ERROR]: Job::execute: unhandled exception in job: std::bad_alloc
2013-04-05 15:26:22,619 [0x7f6d8356b700] [DEBUG]: Warning: accessCount is 1 due clean up of mem array 'trips_by_pickup@1
2013-04-05 15:26:22,795 [0x7f6d8356b700] [DEBUG]: Broadcast ABORT message to all instances for query 1100876874368
2013-04-05 15:26:22,796 [0x7f6d8356b700] [DEBUG]: Query::done: queryID=1100876874368, _commitState=0, erorCode=-1
2013-04-05 15:26:22,805 [0x7f6d8356b700] [ERROR]: executeClientQuery failed to complete: SystemException in file: src/util/Job.cpp function: execute line: 55
Error id: scidb::SCIDB_SE_EXECUTION::SCIDB_LE_UNKNOWN_ERROR
Error description: Error during query execution. Unknown error: std::bad_alloc.
Failed query id: 1100876874368
2013-04-05 15:26:22,808 [0x7f6d8356b700] [DEBUG]: Query (1100876874368) is being aborted
2013-04-05 15:26:22,808 [0x7f6d8356b700] [ERROR]: Query (1100876874368) error handlers (1) are being executed
2013-04-05 15:26:22,808 [0x7f6d8356b700] [DEBUG]: Update error handler is invoked for query (1100876874368)
2013-04-05 15:26:23,250 [0x7f6d8356b700] [DEBUG]: UpdateErrorHandler::handleErrorOnCoordinator: the new version 1 of array trips_by_pickup is being rolled back for query (1100876874368)
2013-04-05 15:26:23,272 [0x7f6d8356b700] [DEBUG]: End of log at position 2288 rc=104
2013-04-05 15:26:23,272 [0x7f6d8356b700] [DEBUG]: End of log at position 0 rc=0
2013-04-05 15:26:23,500 [0x7f6d8356b700] [DEBUG]: Deallocating query (1100876874368)
2013-04-05 15:26:23,500 [0x7f6d8356b700] [DEBUG]: Releasing locks for query 1100876874368
2013-04-05 15:26:23,500 [0x7f6d8356b700] [DEBUG]: SystemCatalog::deleteArrayLocks instanceId = 0 queryId = 1100876874368
2013-04-05 15:26:23,502 [0x7f6d8356b700] [DEBUG]: Release lock of array trips@1 for query 1100876874368
2013-04-05 15:26:23,502 [0x7f6d8356b700] [DEBUG]: Release lock of array trips_by_pickup for query 1100876874368
2013-04-05 15:26:23,502 [0x7f6d8356b700] [DEBUG]: Release lock of array trips_by_pickup@1 for query 1100876874368
2013-04-05 15:26:24,663 [0x7f6d89a3e840] [DEBUG]: Disconnected

Suggestions on what I should change in my config.ini are welcome.

Cheers,
Tim


#5

Hi Tim,

Sorry to hear that. Did you try it on a single instance at all?

As fate would have it, I’m investigating an issue with large loads eating a lot of memory for a different user. One of the findings is that the default Linux malloc behavior causes memory fragmentation that’s unacceptable for us, and we’re finding we need to override the default mallopt() settings to avoid it. You could try adding these items to your config file:

small-memalloc-size=4096
large-memalloc-limit=50000000

It might help.

The other thing going on here: now that we’re using rounded integers, we may want to increase the chunk sizes for the lat/lon dimensions, because the array will be sparse. This is hard for me to judge - I don’t know whether your dataset covers the whole world or just a small region of one city - but you want roughly 1 million non-empty elements per chunk. Just knowing the min and max of the lat and lon will help determine a decent estimate.
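
If it’s convenient, something along these lines should report those ranges (a sketch using the attribute names from your original trips array; adjust as needed):

aggregate(trips, min(pickup_lat), max(pickup_lat), min(pickup_lon), max(pickup_lon))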

Also - is there a way you could share this dataset with me? It seems fairly small. You could even send it to me via email. I definitely want to help get to the bottom of this.


#6

Alex:

Yep, I switched to a single SciDB instance to make the numbers easier. My “dataset” is just a collection of randomly-generated coordinates, uniformly sampled from a rectangle that covers New York.

I’ll try the config items you mention.

Cheers,
Tim


#7

tshead, have you been able to find a solution to your problem? I’m facing a similar issue with memory allocation during query execution, and I’ve yet to find a way to fix it.

If you’ve managed to find a fix or a workaround, would you please post the answer here?
Thank you