You’ve got a couple of things going on here.
There’s no need to store(…) the result of your sampling. SciDB’s query language lets you compose operators, so you can pipe the result of sample(…) (or bernoulli(…), which is a better operator for this job) directly into aggregate(…).
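For instance, a two-step version along these lines (the array name input_sample and the attribute name val are just placeholders; substitute whatever you’re actually using):

store ( bernoulli ( input, 0.2 ), input_sample )
aggregate ( input_sample, avg ( val ) as avg_val )

collapses into a single query:

aggregate ( bernoulli ( input, 0.2 ), avg ( val ) as avg_val )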
A word about your sample rate: you’re sampling 20% of the input data. Chances are that’s going to touch every chunk in the input array, so it won’t really reduce the I/O by much. I’d also note that sample(…) samples whole chunks, not individual cells; for a true per-cell Bernoulli sample, use bernoulli(…). It’s the better choice here.
Anyway - try this:
aggregate ( input, count(*) as cnt )
The query above will give you the total number of cells in the array, and usually very quickly, because under the covers we don’t consult the data itself, only the bitmasks we maintain. Now suppose you want a sample size of 4,000 (a good rule of thumb). Then you want a sample rate of 4,000 / P, where P is the population, i.e. the result of the query above. Call this value p; say it comes out to 0.0001.
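To make the arithmetic concrete (the numbers here are purely illustrative): if the count comes back as P = 40,000,000 cells, then p = 4,000 / 40,000,000 = 0.0001, which is the rate used in the next query.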
Plug your sample rate into the following query:
aggregate ( bernoulli ( input, 0.0001 ), avg ( val ) as avg_val, count(*) as sample_size )
Depending on your chunk size, this should be faster than what you’re doing now. If you have a really, really big data set, pull p even lower: at very low rates bernoulli(…) can step over entire chunks and cut the I/O further, although that effect only really pays off once the array has more than about 1,000 chunks and the data runs to trillions of cells.