Hello again. Glad the change worked. Let me respond to this point-by-point:
1. Why does the footprint stay at 1GB?
This is a known issue I mentioned before. It has to do with the memory allocator and the threading model that SciDB uses. Memory allocated to a thread is only released when the thread is reused for a new query. Thus, in practice, the SciDB memory footprint will grow to an upper bound and then stay at that level. As you found, the configuration settings can be used to adjust that upper bound. Internally, we're looking into this.
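To illustrate, here is a sketch of the per-instance knobs in config.ini that bound the footprint. The setting names come from the SciDB config reference, but the values are placeholders you'd tune for your machine, and whether max-memory-limit is available depends on your version:

```ini
# all values are per SciDB instance, in MB (placeholder values)
mem-array-threshold=128   # cache for temp / mid-query arrays
smgr-cache-size=128       # cache for persistent arrays
merge-sort-buffer=64      # allocated per thread (see point 3 below)
max-memory-limit=2048     # hard cap on an instance's allocation, if your version supports it
```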
2. SciDB instances versus EC2 instances
Let’s make sure we’re talking about the same thing. A “SciDB instance” is a standalone Linux process, responsible for a portion of the data and query execution. An “EC2 instance” is a virtual machine that runs an OS. This creates some confusion and I wonder if that’s part of the problem in our discussion :/. Typically we recommend running one SciDB instance per 1-2 CPU cores (depending on the multiuser workload), and folks rarely run a single instance; the whole point of SciDB is to be distributed. Default configs often use 4 instances, and the default config shipped with some of the AMIs uses 16. Some log files you posted in another thread indicate you actually had 16. So it is possible, but unlikely, that you’re running a single SciDB instance.
You can check your config with
iquery -aq "list('instances')"
The number of instances to run is specified by the server lines in config.ini. In the simplest case:
server-0=127.0.0.1,3   # this means 4 SciDB instances (1 is implied)
This number is important because the memory configs are specified per instance. When we say
mem-array-threshold=128, it means every SciDB instance may use up to 128MB for that cache. To get the total amount of memory allowed for mem-array-threshold across the cluster, multiply the 128 by your number of instances.
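So the cluster-wide budget for that one setting is just a multiplication. A quick sketch with placeholder numbers (128MB per instance, 16 instances, as in the AMI example above):

```shell
# Rough cluster-wide budget for mem-array-threshold (placeholder numbers)
PER_INSTANCE_MB=128   # mem-array-threshold from config.ini
NUM_INSTANCES=16      # count of SciDB instances in your cluster
echo "$(( PER_INSTANCE_MB * NUM_INSTANCES )) MB total"   # prints: 2048 MB total
```

Remember the same multiplication applies to smgr-cache-size and the other per-instance settings.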
3. More on how these configs work
mem-array-threshold and smgr-cache-size control two caches (the first for mid-query and temp arrays, the second for persistent arrays). Setting these numbers to low values means the system will spill to and read from disk more often. Setting them to larger values means the system will try to keep more data in memory. All the spilling and caching happens automatically, so the actual data volume processed by a query can be much larger than these configs. A good practice is to start with low values and then gradually increase them as you tune for your workload.
merge-sort-buffer is used only by some operators (on a per-thread basis, not per-instance) to accumulate buffers of data before they are sorted. It is also used for hash tables and other intermediate structures that operators may build. Its most common users are
sort and some of the Labs plugins.
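Because merge-sort-buffer is per-thread, the worst case scales with both the thread count and the instance count. A sketch with placeholder numbers (the execution-threads setting name and all values are assumptions for illustration):

```shell
# Worst-case merge-sort-buffer usage across the cluster (placeholder numbers)
MERGE_SORT_BUFFER_MB=128   # merge-sort-buffer from config.ini; per thread
EXECUTION_THREADS=4        # threads per instance (assumed setting/value)
NUM_INSTANCES=16           # instances in the cluster
echo "$(( MERGE_SORT_BUFFER_MB * EXECUTION_THREADS * NUM_INSTANCES )) MB worst case"   # prints: 8192 MB worst case
```

In practice only operators that actually sort or build hash tables touch this buffer, so the real usage is usually well below this ceiling.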
Does this help?