Hey @dahaynes
It depends on the schema of the target array and a few other factors. Optimizing redimension can be a bit tricky before you get a feel for it.
What redimension does is sort data in parallel, then split it up into chunks - according to the target array schema. Those chunks are then passed to insert. If the data is too big to fit in memory, redimension also tries to spill pieces to disk while sorting.
So the run time will depend on:

- the chunking of the target array (square or skinny),
- how the redimensioned data lands in the target chunks (1 chunk, 3 chunks, or a bunch of small fragments all over), and
- how big of a job it is - if it’s too small, you may not use all the nodes; if it’s too big, it will run out of memory, start spilling to disk and go slower.

Then the data is fed into insert, and insert can slow down if your new data overlaps the chunks you’ve loaded before. This all depends on the chunk sizes too.
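To make that concrete, here’s a minimal sketch of the pattern - the array names, schemas and chunk lengths below are made up for illustration, not your actual arrays:

```
-- hypothetical 1-D load array, roughly what a flat load produces
create array raw_load <x:int64, y:int64, val:double> [cell=0:*,500000,0];

-- hypothetical 2-D target; the chunk lengths are placeholders
create array target <val:double> [x=0:99999,1000,0, y=0:99999,1000,0];

-- redimension sorts raw_load in parallel and regroups it into target's chunks,
-- then insert writes those chunks into target
insert(redimension(raw_load, target), target);
```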
The config that affects redimension most is actually `mem-array-threshold` - I wouldn’t touch those `memalloc` configs you pointed out.
On 16.9 (this changed in 18.1) you should initially set `mem-array-threshold` such that `(mem-array-threshold + smgr-cache-size) * num_instances <= TOTAL_RAM * 0.5`, in megabytes. Basically that tells SciDB to use half your total cluster RAM for caches and temp arrays. The other half would go to mid-query scratch space and sort buffers (depending on your concurrency levels). Then you can reduce `smgr-cache-size` to as little as 8 and increase `mem-array-threshold` if needed. Then, if no one else is doing anything on the machine, try increasing that 0.5 number.
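For example, with purely made-up numbers - say a single 64 GB machine running 4 instances, so the per-instance budget is 64 * 1024 * 0.5 / 4 = 8192 MB - the config.ini fragment might look like:

```
[mydb]
server-0=localhost,3        # 1 coordinator + 3 workers = 4 instances (hypothetical layout)
smgr-cache-size=8           # shrink the storage cache to the minimum suggested above
mem-array-threshold=8184    # 8184 + 8 = 8192 MB per instance, i.e. half of RAM across 4 instances
```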
The best strategy for loading is to ingest, say, 120 chunks at a time (10 per instance, maybe) and arrange it so that new loads never touch chunks you’ve already loaded - if possible.
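For instance - with a made-up schema again - if the target is chunked every 1000 cells along a day dimension and each batch covers its own block of days, successive inserts only ever create brand-new chunks:

```
-- hypothetical: batch_1 holds days 0-999 and batch_2 holds days 1000-1999,
-- so each insert lands in fresh chunks and never rewrites earlier ones
insert(redimension(batch_1, target), target);
insert(redimension(batch_2, target), target);
```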
There are more notes here: SciDB's MAC(tm) Storage Explained
If you show me your array schema and your config.ini I can help more.