Analyze output question



I was wondering if someone could help explain the output of an analyze command.

First, here’s the background.

The goal is to load a 13 GB CSV edge file (format ‘src,dst,1’) and turn it into an adjacency matrix. The file has 602346147 lines (no header).

Here are the steps:

create array edataFlat <src:int64, dst:int64, val:int32> [i=0:*, 100000, 0]

This was successful, so I ran analyze(edataFlat) in order to get dimensions for the adjacency matrix. Here’s the output:

   attribute_number  attribute_name  min  max       distinct_count  non_null_count
1  0                 dst             0    37225696  34445147        602346147
2  1                 src             0    37225696  39730005        602346147
3  2                 val             1    1         1               602346147

Here’s the question.

How can the distinct_count of the src vertices (39730005) be greater than the max vertex number (37225696)?

Thanks in advance.


The distinct_count you get from analyze(…) is an approximation. If the number of distinct values is less than a few thousand, the count is completely precise. Above that, we fall back on an approximation method to avoid having to sort the data, which would increase the time taken to run analyze(…) by several orders of magnitude.
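To make the idea concrete, here is a minimal Python sketch of one well-known approximate distinct counter (k-minimum-values). This is only an illustration, not necessarily the method analyze(…) uses, which isn't specified in this thread. The names hash01 and approx_distinct, and the choice k=256, are assumptions made up for the example. Like the behaviour described above, it is exact for small inputs and switches to an estimate once there are more than k distinct values; the estimate can land above or below the true count.

```python
import hashlib
import heapq


def hash01(value):
    """Map a value to a deterministic pseudo-random float in (0, 1]."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def approx_distinct(values, k=256):
    """Estimate the number of distinct values without sorting them.

    Keeps only the k smallest distinct hash values seen so far
    (hypothetical helper, illustrative only).
    """
    smallest = set()  # the hashes currently kept
    heap = []         # same hashes, negated, so heap[0] is the largest kept
    for v in values:
        h = hash01(v)
        if h in smallest:
            continue  # duplicate of a value we already track
        if len(heap) < k:
            smallest.add(h)
            heapq.heappush(heap, -h)
        elif h < -heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(heap, -h)
            smallest.discard(evicted)
            smallest.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: count is exact
    # k-th smallest hash of n uniform values sits near k/n, so n ~ (k-1)/h_k.
    return int((k - 1) / -heap[0])
```

An estimator like this knows nothing about min and max, so nothing stops it from reporting more distinct values than the range allows. In the output above, src values between 0 and 37225696 can take at most 37225697 distinct values, so the reported 39730005 is at least roughly 6.7% too high.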

If you want a precise count, have a look in the examples directory for the uniq(…) operator. It takes the result of a sort(…) as input and scans over it, discarding duplicate values. You can then use aggregate(uniq(sort(inputArray, …)), count(*) as precise_distinct_cnt) to get a precise number of distinct values. Of course, this takes a lot longer to compute.
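For readers who want to see the sort, uniq, count pipeline in miniature, here is a small Python sketch of the same logic. It is not AFL and does not call SciDB; the function names uniq and precise_distinct_count are just illustrative stand-ins for the operators mentioned above.

```python
def uniq(sorted_values):
    """Yield each value once, assuming equal values are adjacent (sorted input)."""
    previous = object()  # sentinel that compares unequal to any real value
    for v in sorted_values:
        if v != previous:
            yield v
            previous = v


def precise_distinct_count(values):
    """Exact distinct count: sort, drop adjacent duplicates, count the rest."""
    return sum(1 for _ in uniq(sorted(values)))
```

For example, precise_distinct_count([3, 1, 2, 3, 1, 0]) counts the values 0, 1, 2 and 3. The sort is what makes this expensive on 600 million rows, which is why analyze(…) avoids it.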