Recent versions of Cloudera's Impala added NDV, a "number of distinct values" aggregate function that uses the HyperLogLog algorithm to estimate the number of distinct values in a column, in parallel, in a fixed amount of space.
This can make a really, really big difference: in a large table I tested this on, which had roughly 100M unique values of mycolumn, using NDV(mycolumn) got me an approximate answer in 27 seconds, whereas the exact answer using count(distinct mycolumn) took ... well, I don't know how long, because I got tired of waiting for it after 45 minutes.
It's fun to note, though, that because of another recent addition to Impala's dialect of SQL, the fnv_hash function, you don't actually need to use NDV; instead, you can build HyperLogLog yourself from mathematical primitives.
HyperLogLog hashes each value it sees, and then assigns it to a bucket based on the low-order bits of the hash. It's common to use 1024 buckets, so we can get the bucket by using a bitwise & with 1023:
select
  (fnv_hash(mycolumn) & 1023) bucket
from
  mytable

Here's some sample output:
+--------+
| bucket |
+--------+
| 837    |
| 566    |
| 642    |
| 674    |
| 486    |
+--------+
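To see what this bucketing step is doing outside of SQL, here's a sketch in Python. The `fnv1a_64` and `bucket` helpers are my own names, and the hash is a hand-rolled FNV-1a used as a stand-in for Impala's fnv_hash (Impala's exact FNV variant may differ, so treat the specific hash values as illustrative):

```python
def fnv1a_64(data: bytes) -> int:
    # FNV-1a, 64-bit: xor in each byte, then multiply by the FNV prime.
    # A stand-in for Impala's fnv_hash; the exact variant may differ.
    h = 14695981039346656037  # FNV-1a 64-bit offset basis
    for byte in data:
        h = ((h ^ byte) * 1099511628211) & 0xFFFFFFFFFFFFFFFF
    return h

def bucket(value: str, num_buckets: int = 1024) -> int:
    # Low-order bits of the hash pick the bucket, same as `& 1023` in the SQL.
    # num_buckets must be a power of two for the mask trick to work.
    return fnv1a_64(value.encode()) & (num_buckets - 1)
```

Any power-of-two bucket count works the same way; 1024 just trades a bit more memory for a bit more accuracy.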
HyperLogLog's estimate is based on the number of leading zero bits in the hash; it needs that count (plus one) for each bucket. It assumes an unsigned 32-bit hash, whereas fnv_hash gives us a signed 64-bit value, so we'll first mask with 2^32-1 to keep the low 32 bits. Taking the base-2 logarithm finds the position of the highest set bit, and subtracting that from 32 gives exactly the leading-zero count plus one. Let's call that value z:
select
  (fnv_hash(mycolumn) & 1023) bucket,
  (32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from
  mytable

+--------+---+
| bucket | z |
+--------+---+
| 599    | 1 |
| 574    | 2 |
| 43     | 1 |
| 360    | 3 |
| 644    | 3 |
+--------+---+
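The log2 expression above is just a leading-zero count in disguise. A sketch in Python (the `z_value` name is mine), which assumes the low 32 bits of the hash are nonzero, exactly as the SQL implicitly does:

```python
import math

def z_value(h: int) -> int:
    # 32 - floor(log2(h32)) equals (number of leading zeros in the
    # 32-bit value) + 1; undefined when the low 32 bits are all zero,
    # just like log2(0) in the SQL.
    h32 = h & 0xFFFFFFFF
    return 32 - int(math.floor(math.log2(h32)))
```

Equivalently, for a nonzero 32-bit value this is 33 minus its bit length, which avoids floating point entirely.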
Actually, all we care about is the maximum value of this for each bucket, so we can group by bucket and use max:
select
  (fnv_hash(mycolumn) & 1023) bucket,
  max(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from
  mytable
group by
  bucket

+--------+----+
| bucket | z  |
+--------+----+
| 283    | 22 |
| 977    | 17 |
| 630    | 16 |
| 208    | 15 |
| 315    | 20 |
+--------+----+
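Outside SQL, the group-by-and-max step is just a per-bucket running maximum; a minimal Python sketch (the `max_z_per_bucket` name is mine):

```python
from collections import defaultdict

def max_z_per_bucket(pairs):
    # pairs is an iterable of (bucket, z) rows, like the previous
    # query's output; keep only the largest z seen for each bucket
    best = defaultdict(int)
    for b, z in pairs:
        best[b] = max(best[b], z)
    return dict(best)
```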
The estimate itself is derived from the harmonic mean of these bucket values: it works out to α·m²/Σ2^(-z), where m = 1024 is the bucket count and α ≈ 0.721 is HyperLogLog's bias-correction constant for this many buckets. Buckets that never saw a value still contribute 2^0 = 1 to the sum, which is what the + 1024 - count(*) term handles. We can move the bucket creation to a nested subquery, and then use the outer query to sum up the buckets and apply the constants (if 0.721 seems a bit magical and arbitrary, maybe at some point I'll edit this gist to explain, but for now I'll just wave my hands and say Because Math).
select
  floor((0.721 * 1024 * 1024) / (sum(pow(2, z * -1)) + 1024 - count(*))) estimate
from
  (select
    (fnv_hash(mycolumn) & 1023) bucket,
    max(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
  from mytable
  group by bucket) bucket_values

And... it gives a reasonable-looking answer! In around 46 seconds; twice as slow as the builtin, but still perfectly respectable.
+----------+
| estimate |
+----------+
| 94485825 |
+----------+
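Putting the whole pipeline together in Python makes it easy to sanity-check the formula against a known cardinality. This is a sketch under the same assumptions as before: `fnv1a_64` and `hll_estimate` are my own names, FNV-1a stands in for Impala's fnv_hash, and it mirrors the SQL exactly, including reusing the low hash bits for both the bucket and z:

```python
import math

def fnv1a_64(data: bytes) -> int:
    # stand-in for Impala's fnv_hash; the exact FNV variant may differ
    h = 14695981039346656037
    for byte in data:
        h = ((h ^ byte) * 1099511628211) & 0xFFFFFFFFFFFFFFFF
    return h

def hll_estimate(values, m=1024):
    # keep the max z per bucket, mirroring the GROUP BY / max() query
    z = {}
    for v in values:
        h = fnv1a_64(str(v).encode())
        h32 = h & 0xFFFFFFFF
        if h32 == 0:
            continue  # log2(0) is undefined; the SQL would yield NULL here
        b = h & (m - 1)
        zv = 32 - int(math.floor(math.log2(h32)))
        z[b] = max(z.get(b, 0), zv)
    # harmonic-mean estimate; each empty bucket contributes 2**0 == 1,
    # which is the role of the "+ 1024 - count(*)" term in the SQL
    denom = sum(2.0 ** -v for v in z.values()) + (m - len(z))
    return math.floor(0.721 * m * m / denom)
```

With m = 1024 the expected relative error is roughly 1.04/√1024 ≈ 3%, so an estimate of 94485825 against ~100M true distinct values is in the right neighborhood. Note also that feeding the same value twice changes nothing: duplicates hash to the same bucket and z, which is the whole point.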
~11.5 billion records in an Impala nested-array (.amft) parquet table:

select ndv(concat_ws('|', part_code, cast(activity_dollars as string)))
from fact.amft

got: 9,460,217 in 5.4 minutes.

select count(distinct part_code, activity_dollars)
from fact.amft

got: 9,298,947 in 5.1 minutes.
So in this case Impala's exact count distinct ran faster than ndv().
ps. Reran the ndv() query again after the count distinct finished, to eliminate disk-caching effects etc.; got 5.6 minutes - still slower.
I think it might have something to do with this being a parquet table, so count(distinct) doesn't have to read the whole data set; it just needs to read metadata from row chunks / column groups.
To use ndv() I had to pass it a single argument, converted to a string, and then Impala actually had to read the whole dataset.