We’ve been pondering how much of a priority we should make transitioning to BigQuery’s new Standard SQL dialect. One of its many selling points is that it promises scalable COUNT DISTINCTs. Looker tends to use COUNT DISTINCT more often than you would if you were hand-writing SQL, so I was curious how much faster they are. The answer is a lot!
-- Standard SQL:
SELECT COUNT(DISTINCT ngram) FROM publicdata.samples.trigrams
-- Exact: 68,051,509 ~6s
-- Legacy SQL - exact:
SELECT EXACT_COUNT_DISTINCT(ngram) FROM [publicdata:samples.trigrams]
-- Resources exceeded
-- Legacy SQL - accurate up to 1,000:
SELECT COUNT(DISTINCT ngram, 1000) FROM [publicdata:samples.trigrams]
-- 69,734,045 (off by 2.5%) -> 1.8s
-- Legacy SQL - accurate up to 100,000:
SELECT COUNT(DISTINCT ngram, 100000) FROM [publicdata:samples.trigrams]
-- 67,816,890 (off by 0.3%) -> 34s
-- Legacy SQL - accurate up to 1,000,000:
SELECT COUNT(DISTINCT ngram, 1000000) FROM [publicdata:samples.trigrams]
-- 68,151,294 (off by 0.15%) -> 57s
It turns out that you can’t compute an exact count distinct on this dataset in legacy SQL (so the answer is answer is technically ∞). You can get pretty close in 10x as long, though. Worth the effort to migrate!
If you’re interested in working with BigQuery and Looker, you should come and join us at RealSelf in Seattle (we offer relocation). Feel free to drop me a line (alan@).