Question

Symmetric Aggregate Performance in BigQuery


We’ve recently started trying out BigQuery with Looker and have run into performance issues with the symmetric aggregate functionality.


The query Looker generates calculates the symmetric aggregates with SQL like


COALESCE(looker_array_sum(ARRAY_AGG(DISTINCT CONCAT( …


to get all the distinct values (relative to the grain of the measure), and then strips out the measure and sums it up. However, in our case we want to store the data at the most granular level possible (tens of millions of rows per day), which means BigQuery can’t handle creating an array with that many elements.
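For anyone following along, the "strip out the measure and sum it up" step is done by a temporary UDF. A minimal sketch of what a helper like looker_array_sum presumably does, assuming a '||' delimiter between the key and the measure (an illustration, not Looker’s actual UDF):

CREATE TEMP FUNCTION looker_array_sum(a ARRAY<STRING>)
RETURNS FLOAT64 AS ((
  -- Split each 'key||value' element on the delimiter, keep the value,
  -- and sum; the DISTINCT upstream has already deduplicated the keys.
  SELECT SUM(CAST(SPLIT(x, '||')[OFFSET(1)] AS FLOAT64))
  FROM UNNEST(a) AS x
));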


I wonder if there’s a better approach for doing this? Since in this case all our fan-outs come from un-nesting columns, in theory we could rewrite the SQL, regardless of the fields selected, so that the symmetric aggregates are calculated in a distributed way in a subquery (at the grain of the base table, where each array stays small), and then sum up the results. E.g.


SELECT
  app,
  SUM(app_starts) AS app_starts,
  SUM(ad_requests) AS ad_requests
FROM (
  SELECT
    date,
    device_id,
    app,
    COALESCE(looker_array_sum(ARRAY_AGG(DISTINCT
      CONCAT(device_id, app, CAST(date AS STRING), '||', CAST(app_starts AS STRING)))), 0) AS app_starts,
    SUM(ad_source.ad_requests) AS ad_requests
  FROM table_with_one_row_per_device_and_app_and_date
  LEFT JOIN UNNEST(table_with_one_row_per_device_and_app_and_date.nested_field_with_one_row_per_ad_source) AS ad_source
  WHERE date = '2018-04-01'
  GROUP BY date, device_id, app
) subq
GROUP BY app


The disadvantage is that this would involve a ton of dynamic SQL to bypass the default Looker behaviour and handle all possible query combinations (if that’s even possible).


I’m wondering if there’s another approach to this, or a way of avoiding BigQuery running out of memory. The other option we have is to remodel the data to avoid any fan-outs, but that has its own trade-offs, so it’s more of a last resort.


6 replies


I haven’t tried this out with an UNNEST join, but assuming you can constrain that join with an additional ON constraint, you could use this approach: A cure for the one_to_many blues
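I.e. something along these lines for the join itself (a minimal sketch; ad_source.ad_network and its value are made-up names, not fields from your schema):

-- Sketch: an UNNEST join constrained with an additional ON clause,
-- so only the nested rows you actually need fan out.
SELECT t.app, ad_source.ad_requests
FROM table_with_one_row_per_device_and_app_and_date AS t
LEFT JOIN UNNEST(t.nested_field_with_one_row_per_ad_source) AS ad_source
  ON ad_source.ad_network = 'network_a'  -- hypothetical constraint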

Thanks! That’s actually very similar to the approach we’re taking. The only real difference is that we’re pre-building the result of all the table joins and storing it in the nested field (so basically no joins, at the cost of a lot more rows / storage, which makes more sense for BigQuery).


It’d be nice if we could use nesting to represent the relationships in the data without having to store loads of nulls, but we’re able to use it partially, so it’s an OK compromise for now.

Is there a particular reason that an array was chosen for symmetric aggregates on BigQuery, rather than the MD5 hash used for our Exasol connection (where the SQL looks more like this):


SUM(DISTINCT HEXSTRING2DEC(SUBSTR(UPPER(HASH_MD5(concat(…
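(For anyone following along: the hash trick turns each row’s primary key into a huge number, adds the measure to it, and lets SUM(DISTINCT …) collapse the duplicates the fan-out created; subtracting the distinct hashes back out leaves the true total. A rough sketch of the shape, with made-up names — fanned_out_rows and key_offset are illustrative, not the real generated SQL:)

SELECT
  app,
  -- Duplicated rows share an identical key_offset + app_starts value,
  -- so DISTINCT collapses them; subtracting the offsets leaves the sum.
  SUM(DISTINCT key_offset + app_starts) - SUM(DISTINCT key_offset) AS app_starts
FROM fanned_out_rows  -- assumed to carry key_offset = hash(pk) * big_constant
GROUP BY app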


We’ve only seen a couple of people hit the limit of symmetric aggregates on BigQuery. BigQuery doesn’t have very large integers (128-bit integers), which are required for the method you’re talking about. BigQuery does have HyperLogLog functions, so you can do approximate counts for really large data sets, and maybe that will work for you. These functions are symmetric and have no limits (and were invented for the use case you’re describing):


https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#hll_countinit
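E.g. something like this (a sketch with illustrative table and column names) — the sketches are tiny and mergeable, so there’s no per-group array to exhaust memory:

-- Approximate distinct devices per app via HyperLogLog sketches.
-- HLL_COUNT.INIT builds a small fixed-size sketch per group;
-- HLL_COUNT.MERGE combines sketches and returns the estimate.
SELECT
  app,
  HLL_COUNT.MERGE(device_sketch) AS approx_distinct_devices
FROM (
  SELECT date, app, HLL_COUNT.INIT(device_id) AS device_sketch
  FROM daily_events
  GROUP BY date, app
)
GROUP BY app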


Does this change anything?


https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#numeric-type


Only beta at the moment, but surely this is going to scale better than the array?
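If so, presumably the hash approach becomes something like this sketch (made-up names throughout — FARM_FINGERPRINT for the key hash, fanned_out_rows as the post-join rows — not whatever Looker would actually generate):

-- Sketch: the hash trick using NUMERIC for headroom. NUMERIC's 38
-- decimal digits can hold a 64-bit fingerprint scaled well past the
-- measure's range, which plain INT64 cannot.
SELECT
  app,
  SUM(DISTINCT key_offset + CAST(app_starts AS NUMERIC))
    - SUM(DISTINCT key_offset) AS app_starts
FROM (
  SELECT
    app,
    app_starts,
    CAST(FARM_FINGERPRINT(CONCAT(device_id, app, CAST(date AS STRING)))
      AS NUMERIC) * 4294967296 AS key_offset
  FROM fanned_out_rows
)
GROUP BY app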


@IanT we’ve got the more standard implementation of symmetric aggregates working (internally) with BigQuery now. We’re testing, and yes, it does scale better than the array. You should see something in the next couple of releases.
