We’ve recently started trying out BigQuery with Looker and have run into performance issues with the symmetric aggregate functionality.
Looker calculates the symmetric aggregates by generating SQL like

`COALESCE(looker_array_sum(ARRAY_AGG(DISTINCT CONCAT( …`

to get all the distinct values (relative to the grain of the measure), then strips the measure back out of each string and sums it up. However, we want to store the data at the most granular level possible (tens of millions of rows per day), and BigQuery runs out of memory trying to build an array with that many elements.
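Spelled out, the pattern looks something like the sketch below. This is a reconstruction, not Looker's exact output: the table and field names are ours, and the `looker_array_sum` definition is a SQL stand-in for the temporary UDF Looker appears to define at the top of its generated query.

```sql
-- Stand-in for the temporary UDF Looker defines (assumption: Looker's real
-- definition differs); it splits each "key||value" string and sums the values.
CREATE TEMP FUNCTION looker_array_sum(a ARRAY<STRING>)
RETURNS FLOAT64 AS (
  (SELECT SUM(CAST(SPLIT(s, '||')[SAFE_OFFSET(1)] AS FLOAT64))
   FROM UNNEST(a) AS s)
);

-- Rough shape of the generated query (our names). The single
-- ARRAY_AGG(DISTINCT ...) has to materialise one string per fanned-out row,
-- so at tens of millions of rows the array itself blows past memory limits.
SELECT
  COALESCE(looker_array_sum(ARRAY_AGG(DISTINCT CONCAT(
    CAST(device_id AS STRING), CAST(app AS STRING), CAST(date AS STRING),
    '||', CAST(app_starts AS STRING)))), 0) AS total_app_starts
FROM table_with_one_row_per_device_and_app_and_date AS t
LEFT JOIN UNNEST(t.nested_field_with_one_row_per_ad_source) AS ad_source
WHERE date = '2018-04-01';
```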
I wonder if there's a better approach for doing this? Since in our case all the fan-outs come from un-nesting columns, in theory we could rewrite the SQL, regardless of the fields selected, so that the symmetric aggregates are calculated in a distributed way in a subquery and then summed up (the outer query is sketched below the subquery). E.g. the subquery would be:
```sql
SELECT
  date, device_id, app,
  COALESCE(looker_array_sum(ARRAY_AGG(DISTINCT CONCAT(
    CAST(device_id AS STRING), CAST(app AS STRING), CAST(date AS STRING),
    '||', CAST(app_starts AS STRING)))), 0) AS app_starts
FROM table_with_one_row_per_device_and_app_and_date
LEFT JOIN UNNEST(table_with_one_row_per_device_and_app_and_date.nested_field_with_one_row_per_ad_source) AS ad_source
WHERE date = '2018-04-01'
GROUP BY date, device_id, app
```
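The outer query would then only need to sum the per-key results. Something like this (again a sketch, assuming the same `looker_array_sum` UDF is in scope):

```sql
-- Sketch of the outer aggregation: each inner group is at the primary-key
-- grain, so its DISTINCT array holds only a handful of elements, and the
-- outer SUM combines the already de-duplicated per-key values.
SELECT SUM(app_starts) AS total_app_starts
FROM (
  SELECT
    date, device_id, app,
    COALESCE(looker_array_sum(ARRAY_AGG(DISTINCT CONCAT(
      CAST(device_id AS STRING), CAST(app AS STRING), CAST(date AS STRING),
      '||', CAST(app_starts AS STRING)))), 0) AS app_starts
  FROM table_with_one_row_per_device_and_app_and_date
  LEFT JOIN UNNEST(table_with_one_row_per_device_and_app_and_date.nested_field_with_one_row_per_ad_source) AS ad_source
  WHERE date = '2018-04-01'
  GROUP BY date, device_id, app
) AS per_key
```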
The disadvantage is that this would involve a ton of dynamic SQL to bypass the default Looker behaviour and handle every possible query combination (if it's even possible).
I'm wondering if there's another approach to this, or a way of stopping BigQuery from running out of memory. The other option we have is to remodel the data to avoid any fan-outs, but that has its own trade-offs, so it's more of a last resort.
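For what it's worth, if we did go down that road, one option might be to avoid un-nesting in the join altogether and instead aggregate the nested field in a correlated subselect, so the outer query stays at one row per device/app/date and plain SUM measures remain correct. A minimal sketch, where `impressions` is a hypothetical field on the nested ad-source structs:

```sql
-- Sketch only: `impressions` is a hypothetical field on the nested ad-source
-- structs. Aggregating the nested array in a correlated subselect keeps the
-- outer query at one row per device/app/date, so no fan-out occurs and
-- ordinary SUM() measures work without symmetric aggregates.
SELECT
  date,
  device_id,
  app,
  app_starts,
  (SELECT SUM(ad_source.impressions)
   FROM UNNEST(t.nested_field_with_one_row_per_ad_source) AS ad_source
  ) AS ad_impressions
FROM table_with_one_row_per_device_and_app_and_date AS t
WHERE date = '2018-04-01'
```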