BIGQUERY (HowTo): BigQuery Cluster and Partition Best Practices

Knowledge Drop

Last tested: Sep 21, 2018

  • Default to INNER join
  • If LEFT is required, denormalize, nest, or figure out a way to push filter to fact table
  • Make sure that Looker isn't overriding your cluster ON clause
  • Clustering pushes predicates, partitioning does not

See the below example for inner join vs left. In this case the table is clustered by a.cust_id. We don't filter on a.cust_id, as the user is interacting with the b.cust_name. A left join will still do a full table scan.

NO FILTER - Query complete (21.453 sec elapsed, 42.04 GB processed)
INNER - Query complete (3.783 sec elapsed, 1.78 GB processed)
LEFT - Query complete (11.192 sec elapsed, 42.04 GB processed)
 

SELECT  a.prod_skid,  COUNT(*)FROM  ls.POSDS_WKLY_FCT aLEFT JOIN  ls.POSDS_CUST_DIM_enhanced bON  a.cust_id = b.cust_idLEFT JOIN  ls.POSDS_PROD_DIM AS posds_prod_dimON  a.CUST_ID = posds_prod_dim.CUST_ID  AND a.PROD_SKID = posds_prod_dim.PROD_SKIDWHERE  b.cust_name = 'Demo Customer US - P&G Only'GROUP BY  1


Another example. In this case Looker is applying a filter on  a.CUST_ID is not null. A direct filter on the clustered column overrides whatever optimization you get from joining on the clustered column to a filtered table. In this case it was because of an access filter we were applying where most people default to all. Again, denormalization here could be helpful - as even though dim table isn't large, it makes for easier querying.
 

  access_filter: {    field: cust_id    user_attribute: customer_id  }


BAD FILTER - Query complete (20.744 sec elapsed, 63.03 GB processed)
NO BAD FILTER - Query complete (4.927 sec elapsed, 2.65 GB processed)
 

SELECT  1 AS posds_wkly_fct_is_ya,  COALESCE(SUM(posds_wkly_fct.POS_UNIT_SALES_AMT),    0)FROM  ls.POSDS_WKLY_FCT AS posds_wkly_fctINNER JOIN  ls.POSDS_CUST_DIM_enhanced AS posds_cust_dim_enhancedON  posds_wkly_fct.CUST_ID = posds_cust_dim_enhanced.CUST_IDINNER JOIN  ls.POSDS_PROD_DIM AS posds_prod_dimON  posds_wkly_fct.CUST_ID = posds_prod_dim.CUST_ID  AND posds_wkly_fct.PROD_SKID = posds_prod_dim.PROD_SKIDWHERE  (posds_cust_dim_enhanced.CUST_NAME = 'Demo Customer US - P&G Only')  AND (NOT (posds_wkly_fct.CUST_ID IS NULL))GROUP BY  1LIMIT  500


The previous above examples show that predicate pushdown works when joining on clustered fields. However that is not the case with partitions. The fact table is partitioned by sales_date. See results for when we hit the partition directly in the where clause vs an equivalent filter on the right side of a join. We're swapping out a.sales_date with b.day_date. So either you need to make sure you're filtering appropriately in the fact table OR you could do this workaround instead. 
CREATE TABLE better_cluster PARTITION BY ignore_this CLUSTER BY sales_date ASSELECT *, DATE('2018-01-01') AS ignore_this FROM old_table

FILTER ON DIM - Query complete (2.923 sec elapsed, 62.98 GB processed)
FILTER ON FACT - Query complete (1.761 sec elapsed, 80.06 MB processed)
 

SELECT  a.sales_date,  SUM(pos_cost_amt)FROM  `swarm-central-pg-looker.ls.POSDS_WKLY_FCT` aJOIN  `swarm-central-pg-looker.ls.POSDS_ROLL_TIME_DIM_DAY` bON  a.sales_date = b.day_date  AND a.cust_id = b.cust_idWHERE  a.sales_date = '2016-10-01'  AND fact_table = 'POSDS_WKLY_FCT'  AND calendar_name = 'Customer'  AND b.cust_id = 4000  AND perd_name = 'L12M'GROUP BY  1


 

This content is subject to limited support.                

Version history
Last update:
‎06-14-2021 05:50 PM
Updated by: