Calculate the average after sampling


#1

I’m trying to calculate some statistics after sampling a subset of a dataset. However, the query runs very slowly.

For example, to calculate the average of a subset sampled from a dataset ‘input’ with a probability of 0.2, I use the following AFL statements:

avg(store(sample(input, 0.2), temp));

or

aggregate(store(sample(input, 0.2), temp), avg(val));

I used the store operator to write the sampled data into a temporary array, and I suspect this temporary store is one reason for the poor performance. Is there a way to write the query that avoids the ‘store’ step?


#2

You’ve got a couple of things going on here.

  1. There’s no need to store(…) the result of your sampling. SciDB’s query language lets you compose operators, so you can pipe the result of sample(…) (or bernoulli(…), which is a better operator for this job) directly into aggregate(…); see the sketch right after this list.

  2. Looking at your sample rate for a minute: you’re sampling 20% of the input data. Chances are that’s going to touch every chunk in the input array, which won’t really reduce the I/O by much. I’d also note that the ‘sample(…)’ operator samples whole “chunks”, not cells. For a genuine Bernoulli random sample of cells, use bernoulli(…). It’s a better sampler for this purpose.
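
To make item 1 concrete, here is the un-stored form of your original query (keeping your attribute name ‘val’ and your 0.2 rate, so treat it as a sketch rather than gospel):

aggregate ( bernoulli ( input, 0.2 ), avg ( val ) )

No temporary array, no extra write to disk; the sampled cells flow straight into the aggregate.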

Anyway - try this:

aggregate ( input, count(*) as cnt )

The query above will give you the total size of the data, usually very quickly, because under the covers we don’t consult the data itself, only the bitmasks we maintain. Now, suppose you want a sample size of 4,000 (a good rule of thumb). Then you want a sample rate of 4,000 / P, where P is the population (the result of the query above). Call this value ‘p’; say it comes out to 0.0001.
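
To make the arithmetic concrete (the population figure here is invented purely for illustration): if the count(*) query comes back with P = 40,000,000 cells, then p = 4,000 / 40,000,000 = 0.0001, which is the rate used below.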

Plug your sample rate into the following query:

aggregate ( bernoulli ( input, 0.0001 ), avg ( val ) AS Avg_Val, count(*) as Sample_Size )

Depending on your chunk size, this should be faster than what you’re doing. If you have a really, really big data set, pull p even lower: a low enough rate lets the scan step over entire chunks and cut the I/O further, although that effect is most useful when the chunk count is > 1,000 and your data sizes run to trillions of cells.


#3

Thanks for your reply. I tried your suggested queries, and the performance looks good to me. Now I know how to control the sample rate. :smiley:

BTW, the first query can also be replaced by the ‘analyze’ operator in AFL, which provides some basic statistics about the queried dataset.
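
For anyone else reading, the call is simply the following (the exact output columns may vary by version, so take this as my recollection rather than documentation):

analyze ( input )

On my build it reports, per attribute, things like the min, max, an approximate distinct count, and the number of non-null cells.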


#4

Regarding analyze(…) …

Need to be a tiny bit careful w/ it. analyze(…) is intended (as you say) as a tool that provides a basic set of statistical information about the contents of an array, and you certainly can use it to get the number of cells. BUT … analyze(…) will be much, much slower than the quick aggregate ( …, count(*) … ) query, because analyze(…) reads the contents of every attribute in the input array. And analyze(…) does a lot more work besides, the most compute-intensive piece being an estimate of the number of distinct values in each attribute (very handy to know when you’re planning the chunking for your n-dimensional array).

Oh, and I wouldn’t try analyze ( bernoulli ( input, p ) ). The problem is that sampling can throw the distinct-count and range estimates way, way off.

Glad this helped! Let us know how you go.