distinct_count strange value


#1

Hello!

I got a strange result for the attribute GPI (int64) from the analyze funtion for my “flat” array after loading it. As i understand distinct_count this should not be possible. For me it looks like a bug.

analyze(SpatialGridNode_18012)
{attribute_number} attribute_name,min,max,distinct_count,non_null_count
{0} ‘cellid’,‘0’,‘2591’,2592,3264391
{1} ‘gpi’,‘0’,‘3264390’,3963546,3264391
{2} ‘land_flag’,‘false’,‘true’,2,3264391
{3} ‘lat’,’-89.9829’,‘89.9829’,1601,3264391
{4} ‘lon’,’-180’,‘180’,1624105,3264391

Best regads
Karl


#2

The “distinct count” in the analyze(…) operator is an estimate. It’s not a precise value.

If you want the precise distinct count, then you need to sort the data, discard the duplicates, and count the result. Something like the query below:

aggregate ( 
   uniq ( 
     sort ( 
       project ( SpatialGridNode_18012, gpi ),
       gpi
     )
   ),
   count(*) AS exact_distinct_count
)

This query is very, very computationally expensive relative to an approximation that makes a single pass over the data and comes up with an estimate. What I am guessing, looking at your data here, is that you actually have 3,264,391 distinct values. When an estimate of 3,963,546 (which is off by quite a bit more than I would like … closer to 20% … but this is a fairly small data set) exceeds the count, it’s a big clue that the attribute contains unique values. The purpose of this estimate is for use in things like chunk sizing, and to estimate query result sizes in things like redimension(…).

We should change the name of the attribute produced by the analyze(…) operator to make the semantics more clear.


#3

Thanks plumber, very illuminating. Maybe it would be great to calculate the statistics on insert/updated and persist the values. Then you would not need to guess and it the speed of the operation would not suffer.
Yes, you could change the header of the field to make clear that it is an estimate.