BigQuery Table Sampling: LookML Syntax Explanation

  • 4 August 2021
  • 2 replies
  • 188 views

Userlevel 4
  • Community Admin
  • 30 replies

Table sampling is a feature that enables users to query random subsets of data from a large BigQuery tables, which may be useful in a data science / machine learning workflow (you can read about BigQuery Machine Learning (BQML) in Looker here). 

 

As of August 2021, it seems that TABLESAMPLE SYSTEM does not support aliasing table. The following query:
 

SELECT * FROM dataset.my_table TABLESAMPLE SYSTEM (10 PERCENT) AS mytable

will result in this error: 

​​​​​​​notFound: Not found: Table dataset.my_table TABLESAMPLE SYSTEM (10 PERCENT) was not found in location US


With that being said, we would need to make a derived table to get around with the alias issue. Inside the view for the derived table, I have added a parameter to help with adjusting the value for selection condition: For example, if we put in 10 in the percent filter in the explore front end, the generated SQL would be `TABLESAMPLE SYSTEM (10 PERCENT)`, and BigQuery will return 10% of the data.

 

explore: tablesamplet_system {}

view: tablesamplet_system {

derived_table: {
sql: SELECT * FROM bigquery-public-data.google_analytics_sample.ga_sessions_20170801 TABLESAMPLE SYSTEM ({%parameter percent %} PERCENT) ;;
}

parameter: percent {
type: number
}

dimension: visitId {
type: number
primary_key: yes
}

}

 

Final result: 

 


Please share if you have cool use cases on table sampling!​​​​​​​


2 replies

As tablesample use cases need as much query efficiency as possible, it would be wonderful to use this feature in a native non-derived view.  

The TABLESAMPLE does support aliases, but the TABLESAMPLE clause has to come after the alias - which is currently not supported.  This would be a great, and easy to implement feature, to add an attribute to views for Bigquery tablesampling.  The clause would just have to be implemented to come after the table alias.

It could get tricky with CTEs, I don’t think you can TABLESAMPLE a table twice in one statement.

My use case is with terabyte tables, wanting to reduce query costs and still gather meaningful information from huge tables.

Reply