Histogram


#1

Hi,

I’d really like to see an operator for building histograms. It seems a little awkward using the current AFL http://trac.scidb.org/wiki/Docs/Release_11.06/ArrayFunctionalLanguage_11.06. From what I can see, I need to do a filter, count for each bin manually, joining each count together. I guess I’m looking for something with a syntax of:

histogram(input,bins) => array with length = len(bins)+1, given that input is a single-attribute array of any dimensionality, and bins is an array of boundaries between bins, of the same type as the input array. For example, if bins(uint) = [2,3,10], then the histogram would have the following bins:
uint_min <= x < 2
2 <= x < 3
3 <= x < 10
10 <= x < = uint_max

It should be possible to extend this to parameterize the dimensions that are being summarized, e.g., for a video clip of x,y,t, I want to have a histogram that’s count(bin, t) instead of just count(bin).

I’m not sure how this can be elegantly extended to a multi-attribute syntax, but it’s not that important to me since most of the use-cases I can think of involve single-attribute arrays.

Please feel free to point out how this behavior can be synthesized out of trivial combinations of AFL that I have somehow overlooked. I won’t be embarrassed (much).

Just thought this would be a useful feature,
-Daniel


#2

Hi Daniel,

Here’s some black magic courtesy of Mr. Paul Brown:

#we have an array xyz with the following data:
iquery -aq "show(xyz)"
[("xyz<a:double NOT NULL> [x=0:1999,100,0,y=0:1999,100,0]")]
iquery -aq "min(xyz)"
[(0)]
iquery -aq "max(xyz)"
[(0.997352)]

#We compute a 10-bucket equi-width histogram which means width is (max-min)/10 = 0.0997352

iquery -aq "store ( build  ( <val:double> [Bucket=0:9,10,0], Bucket ), deciles)"
[(0),(1),(2),(3),(4),(5),(6),(7),(8),(9)]

iquery -aq "count (
    filter (      
        cross ( 
            xyz as X, 
            project ( 
                join (
                    apply ( deciles, Min, Bucket * 0.0997352), 
                    apply ( deciles, Max, (Bucket+1) * 0.0997352)
                ), 
                Min, Max
            ) as B
        ), 
        X.a >= B.Min AND X.a < B.Max
    ), 
    B.Bucket
)"
[(3369297),(88),(94),(82),(69),(91),(73),(72),(69),(64)]

#recheck:
iquery -aq "count(filter(xyz, a<0.0997352))"
[(3369297)]
iquery -aq "count(filter(xyz, a>=0.0997352 and a<0.0997352*2))"
[(88)]

It looks like it’s almost what you want - except the open-ended ranges which you could probably add.


#3

Hi Alex,

This looks great. I’m sure I can make use of this. Thanks a bunch!

But seeing as we both agree that it looks like black magic and not the first thing to occur to a casual SciDB user, can we bundle this sort of logic into a macro, if not a compiled function?

What I would like is some facility in AFL or AQL to define a name and parameterization for that sort of black magic (and future such examples), so I can reuse my own (and others’) black magic. Perhaps this is already on the TODO list. If not, can this be done? Where might it be on the priority list?

I collected a couple examples of the sort of facility I’m looking for:

MATLAB: mathworks.com/help/techdoc/ref/function.html
Maple: kb.mit.edu/confluence/pages/view … Id=3907032
Mathematica: reference.wolfram.com/mathematic … ction.html
R: cran.r-project.org/doc/manuals/R … e-examples

I know SciDB already has a UDF facility. What I am suggesting is something more lightweight. The key difference is that it doesn’t require a compiler or write-access to SciDB’s plugin directory. This means that if I write a bad function in this form, it’s more difficult for me to accidentally bring down the server.

Thoughts?
-Daniel


#4

Hey Daniel,

From the SQL side of things, various folks have been trying to attack this kind of problem for some time. One simple solution is a “view” which is essentially a query that is stored under a specific name. So you could say something to the effect of:

create view histogram_xyz as ...[black_magic_query_here];
--- now the name histogram_xyz can be treated as the name of an array in other queries ---
select * from histogram_xyz;

Here’s a Postgres example:
postgresql.org/docs/8.1/stat … eview.html

And when views aren’t enough, various flavors of SQL have a notion of “stored procedures”. One key difference between stored procedures and UDFs is that UDFs tend to be issued from within a SELECT statement and tend to see only one tuple at a time, whereas stored procedures tend to use SELECT statements as building blocks and are usually written in a high-level language.

Wikipedia: en.wikipedia.org/wiki/Stored_procedure
Postgres example: postgresqldbnews.blogspot.com/20 … using.html
Oracle example: download.oracle.com/docs/cd/B193 … s_6009.htm

This is a bit of a harder task because it involves defining a language for these things.

I believe both views and stored procedures (in some form) are on the roadmap, but not sure where they are on the priority list. Will keep you posted when I have more info.


#5

Maybe I’ll regret this… but I started to write an operator for creating histograms.
Thank’s to Alex’s sample UDO and my getting the courage to read examples and sources for operators, I met with some success with creating another UDO that I could load_library into SciDB and not crash everything :smile:

Here is the plan

AFL% mkhisto(array, nbins, minV, maxV);

would return an array of int64s whose len() is nbins. To make my life easier, the input array is limited to one double attribute and one dimension. The result array is assumed to be small, most histograms for my needs have fewer than a thousand bins, so I am probably making some simplifying assumptions about how to chunk it.

Please, if anyone else out there is working on similar stuff, please let me look over your shoulder.

Thanks and cheers,

George


#6

Two thoughts:

  1. Have a look at 13.6, where Alex P. has added a couple of useful example operators that should demonstrate a little better how to make parallelism work for this kind of operator.

  2. Might I suggest that that you slightly modify your signature?

mkhisto(array, nbins, minV, maxV [, dimension] );

Rather than only returning one histogram per input array, return one histogram for each of the groups identified by the combination of the dimensions. For an example of what I have in mind, have a look at the aggregate(…) operator.


#7

Great advice, Thanks!
I have to wait until IGARSS is over, colleague needs some stuff done with the current version, then we switch to 13.6
I understand that we have to reload everything…

Cheers, George


#8

Hi Everyone,

Here is an example of computing histograms in SciDB, just to add to the record:

Setup: Approximate a random variable sampled from a standard Gaussian distribution using Box-Muller, saving it in array z:

store( project( apply( apply( build(<x:double>[i=1:1000,1000,0],(random()%1000000)/1000000.0), y,(random()%1000000)/1000000.0), z,sqrt(-2*log(x))*cos(2*3.141592653589793*y)), z) ,z)

OK now compute a histogram of z with 10 bins:

project( apply( redimension( substitute( apply(cross_join(z,aggregate(z,min(z) as min, max(z) as max)), t,int64(9*(z-min)/(max-min))),build(<v:int64>[i=0:0,1,0],0),t), <count:uint64 null, min:double null, max:double null>[t=0:*,1000000,0], count(t) as count), x, t*(max-min)/9.0 + min), x,count)
The output is a table with two attributes, count and x. The x attribute is the x-value scale and the count attribute holds the counts.

Note the use of the number ‘9’ above, meaning divide into 10 bins. Change that number to alter the number of bins, nothing else has to change.

The approach is to transform the variable z into [0,1], integer multiply by the number of bins - 1, then apply a redimension aggregate to get the counts. At the end we apply the reverse transform to show the original scale along with the counts. The histogram produced above matches exactly that produced by R with the same data and breaks (R has many options on choosing breaks).


#9

Here is a shell script that improves on the basic histogram idea in the last post. It handles boundary cases correctly and gives you control over which bin values on the interval breaks fall into.

One problem with this script is that bins with zero counts appear as sparse (empty) entries in the array. You can use merge to fill in those zeros if you like.

If you use SciDB from R, this is more or less the ‘hist’ function for SciDB arrays in the R package.

#!/bin/bash
# Basic histogram script
# Usage:
# hist.sh <scidb array> <bins> [right]
#
# where, the scidb array has a single numeric attribute, bins is a number of
# bins to divide the range of the attribute into. Bins must be a whole number.
#
# The bin division is uniform and the resulting bin intervals are closed on the
# left and open on the right.  If you supply the optional right argument, then
# the bins are open on the left and closed on the right.

if test $# -lt 2; then
  echo "Usage: hist.sh <array> <bins> [right]"
  exit 1
fi

# Get the attribute name
a=$(iquery -aq "show(${1})" | sed 1d | sed -e "s/.*<//" | sed -e "s/:.*//")
echo $a

if test $# -gt 2; then
  query="project( apply( redimension( substitute( apply(cross_join(${1},aggregate(${1}, min(${a}) as min, max(${a}) as max)), bin,iif(${a}=min,1,ceil(${2}.0*(${a}-min)/(1+max-min)))  ),build(<v:int64>[i=0:0,1,0],0),bin), <count:uint64 null, min:double null, max:double null>[bin=0:${2},1000000,0], count(bin) as count), x, bin*(1+max-min)/${2}.0 + min), x,count)"
else
  query="project( apply( redimension( substitute( apply(cross_join(${1},aggregate(${1}, min(${a}) as min, max(${a}) as max)), bin,floor(${2}.0 * (${a}-min)/(1+max-min))),build(<v:int64>[i=0:0,1,0],0),bin), <count:uint64 null, min:double null, max:double null>[bin=0:${2},1000000,0], count(bin) as count), x, bin*(1+max-min)/${2}.0 + min), x,count)"
fi

# Run the query
echo "${query}"
iquery -aq "${query}"