Symmetric Aggregate Performance in BigQuery

We’ve recently started trying out BigQuery with Looker and have run into performance issues with the symmetric aggregate functionality.

The query Looker generates calculates the symmetric aggregates by building an array of all the distinct values (relative to the grain of the measure), then stripping the key back out of each element and summing the measure. However, in our case we want to store the data at the most granular level possible (so tens of millions of rows per day), which means BigQuery can't handle creating an array with that many elements.
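For illustration, the generated measure SQL follows this general shape (a sketch only — `pk` and `app_starts` are placeholder names, and the exact SQL Looker emits varies by version):

```sql
-- Sketch of the array-based symmetric aggregate on BigQuery
-- (pk and app_starts are illustrative placeholders):
-- the primary key is concatenated onto the measure, distinct values are
-- collected with ARRAY_AGG, and a UDF (looker_array_sum) strips the key
-- back off each element and sums the remaining measure values.
SELECT
  COALESCE(looker_array_sum(ARRAY_AGG(DISTINCT
    CONCAT(CAST(pk AS STRING), '||', CAST(app_starts AS STRING)))), 0) AS total_app_starts
FROM fanned_out_join
```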

I wonder if there’s a better approach for doing this? Since in this case all our fan-outs come from un-nesting columns, in theory I think we could rewrite the SQL, regardless of the fields selected, so that we calculate the symmetric aggregates in a distributed way in a subquery (grouped at the grain of the measure) and then sum up the results. E.g.:







    SELECT SUM(app_starts) FROM (
      SELECT
        date, device_id, app,
        COALESCE(looker_array_sum(ARRAY_AGG(DISTINCT
          CONCAT(CONCAT(device_id, app, date), '||', CAST(app_starts AS STRING)))), 0) AS app_starts
      FROM table_with_one_row_per_device_and_app_and_date
      LEFT JOIN UNNEST(table_with_one_row_per_device_and_app_and_date.nested_field_with_one_row_per_ad_source) AS ad_source
      WHERE date = '2018-04-01'
      GROUP BY date, device_id, app
    ) subq



The disadvantage of this is that it would involve a ton of dynamic SQL to bypass the default Looker behaviour and handle all possible query combinations (if it’s even possible).

I’m wondering if there’s another approach to this, or a way of avoiding BigQuery running out of memory. The other option we have is to remodel the data to avoid any fan-outs, but that has its own trade-offs, so it’s more of a last resort.

6 replies


I haven’t tried this out with an UNNEST join, but assuming you can constrain that join with an additional ON constraint, you could use this approach: A cure for the one_to_many blues
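Roughly, the idea there is to aggregate the many side down to the join key in a derived table first, so the join becomes one-to-one and plain SUMs are safe again. A sketch under assumed names (`base_table`, `ad_sources`, and the columns are all made up):

```sql
SELECT
  t.date, t.device_id, t.app,
  t.app_starts,                                    -- measure is no longer fanned out
  COALESCE(a.ad_source_count, 0) AS ad_source_count
FROM base_table AS t
LEFT JOIN (
  -- pre-aggregate the many side to the grain of the join key
  SELECT date, device_id, app, COUNT(*) AS ad_source_count
  FROM ad_sources
  GROUP BY date, device_id, app
) AS a
  ON a.date = t.date AND a.device_id = t.device_id AND a.app = t.app
```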

Thanks! That’s actually very similar to the approach we’re taking. The only real difference is that we’re pre-building the result of all the table joins and storing them in the nested field (so basically no joins, at the cost of lots of rows / storage instead, which makes more sense for BigQuery).

It’d be nice if we could use nesting to represent the relationships in the data without having to store loads of NULLs, but we’re able to use it partially, so it’s an OK compromise for now.

Is there a particular reason that using an array for symmetric aggregates was chosen for BigQuery rather than the MD5 hash it uses for our Exasol connection (where the SQL looks more like this):
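From memory, the hash-based pattern on dialects with big-integer support looks something like the following (a simplified sketch, not the exact Exasol SQL Looker emits; `big_hash` and the column names are placeholders):

```sql
-- Simplified sketch of the MD5 / big-integer symmetric aggregate.
-- big_hash(pk) stands in for something like:
--   CAST(CONV(SUBSTR(MD5(CAST(pk AS VARCHAR)), 1, 16), 16, 10) AS DECIMAL(38,0))
-- Adding the measure to a huge per-row hash, summing the DISTINCT values,
-- then subtracting the sum of the distinct hashes leaves the
-- de-duplicated sum of the measure:
SELECT
  SUM(DISTINCT big_hash(pk) + app_starts) - SUM(DISTINCT big_hash(pk)) AS total_app_starts
FROM fanned_out_join
```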



We’ve only seen a couple of people hit the limit of symmetric aggregates on BigQuery. BigQuery doesn’t have very large integers (128-bit integers), which are required for the method you’re talking about. BigQuery does have HyperLogLog functions, though, so you can do approximate counts for really large data sets, and maybe that will work for you. These functions are symmetric and have no such limits (and were invented for the use case you’re describing).
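For reference, the BigQuery functions in question look like this (`events` and `device_id` are made-up names):

```sql
-- One-shot approximate distinct count:
SELECT APPROX_COUNT_DISTINCT(device_id) AS devices
FROM events;

-- Or build re-aggregatable HLL++ sketches (e.g. per day) that can be
-- merged later without re-scanning the raw data:
SELECT date, HLL_COUNT.INIT(device_id) AS device_sketch
FROM events
GROUP BY date;

-- Later, merge the daily sketches into an overall approximate count:
-- SELECT HLL_COUNT.MERGE(device_sketch) FROM daily_sketches;
```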


Does this change anything?

It’s only in beta at the moment, but surely this is going to scale better than the array?


@IanT we’ve got the more standard implementation of symmetric aggregates working (internally) with BigQuery now. We’re testing it, and yes, it does scale better than the array. You should see something in the next couple of releases.