[Analytic Block] Partitioned Date Filters in BigQuery

  • 23 February 2017
  • 14 replies
  • 793 views

Userlevel 6
Badge

BigQuery is a scanning database, which means it scans the entire table for the columns referenced in the query.


Because BigQuery elastically scales up compute power as needed, queries never really get slow, but they can get expensive if you scan really big tables.


To cut down on cost (both computational and financial), BigQuery lets you partition large tables by date so that instead of scanning the entire table, a query can scan just a day or multiple days worth of data. To do this, BigQuery introduces a pseudo-column _PARTITIONTIME and when this column is used in a WHERE clause, BigQuery limits the amount of data scanned.


_PARTITIONTIME can be used in any part of the query and will just return the DATE of the partition as a type TIMESTAMP. (It will not return the timestamp of the individual row, since _PARTITIONTIME is always a date).


Making use of _PARTITIONTIME


So, _PARTITIONTIME is a powerful tool for preventing unnecessarily expensive queries. The problem is, you’d normally have to teach everyone querying BigQuery to remember to use partitions in every query where it’s applicable.


LookML solves that problem. Instead of teaching each person, you can teach the pattern to Looker once and never worry about it again.


Someone querying a partitioned table really doesn’t care about partitioning, they want it all to work seamlessly, setting date filters in the background for them. Here’s a Looker Block to make that happen.


Modeling in LookML


Let’s assume we are querying a partitioned table called logs that has events. The code below will present a single field group called Event Date and make sure there is always a filter on _PARTITIONTIME to prevent accidentally over-expensive queries.


explore: logs {
# Make sure there is always a filter on event_date, event_week, event_month or event_year
# Default to the last complete day of data
conditionally_filter: {
filters: {
field: logs.event_date
value: "1 days ago for 1 day"
}
}

view: logs {
# Combine the partition date filters and the time filters into a single field group.
dimension_group: event {
type: time
timeframes: [date,week,month,year]
sql: _PARTITIONTIME ;;
# NOTE: for manually partitioned files use code below
# sql: TIMESTAMP(PARSE_DATE('%Y%m%d', REGEXP_EXTRACT(_TABLE_SUFFIX,r'\d\d\d\d\d\d\d\d'))) ;;
}

# define a different set of dimensions, but have them shown together in the Explorer
dimension_group: _event {
label: "Event"
type: time
timeframes: [time,minute10, minute5, minute30, hour,hour_of_day,raw]
sql: ${TABLE}.event_time;;
datatype: epoch
}

...
}

NOTE: BigQuery Manually Partitioned Files (Table Wildcards).


BigQuery also supports manually partitioned tables. Tables are selected using a wildcard (*) in the table name of the from statement and _TABLE_SUFFIXin much the same way _PARTITION_TIME is used. _TABLE_SUFFIX returns the string that matched the table wild card. Change the sql: for the event dimension_group to make this pattern work with manually partitioned files…


Read more about BigQuery table Wildcards

https://cloud.google.com/bigquery/docs/querying-wildcard-tables


14 replies

I came up with the following to allow easy filtering on specific times (and not just dates) in a consistent way, while still pushing down the _PARTITIONTIME condition:



view: logs {

derived_table: {

sql:

SELECT * FROM my_dataset.my_logs_table

WHERE _PARTITIONTIME BETWEEN TIMESTAMP_TRUNC({% date_start time_filter %}, DAY)

AND TIMESTAMP_TRUNC({% date_end time_filter %}, DAY)

AND TIMESTAMP_MILLIS(CAST(requestTime*1000 AS int64)) BETWEEN {% date_start time_filter %}

AND {% date_end time_filter %}

;;

}



filter: time_filter {

type: date_time

}

}

We’ve been using this technique a lot recently, and there’s still one thing that I’m struggling with.



We often need to join two date-partitioned tables together. If I add a date filter to each of the tables separately, then in the explore the user has to enter the date criteria separately for each of the tables. This is time consuming and error prone:




Instead, I’d like to be able to choose the date range once, on the primary table in the explore, and have it propagate to the joined table. If this was a dimension, then I could just add it to the sql_on expression, along with the primary key, i.e.:



exlore: log1 {

join: log2 {

sql_on: ${log2.request_id} = ${log1.request_id} AND ${log2.time_filter} = ${log2.time_filter} ;;

}

}



But this doesn’t work if event_date is a filter. I couldn’t find any way in LookML to propagate a filter in a similar way. Am I missing anything?



The workaround that I came up with is to refer to the filter from the log1 table from the log2 derived table definition using a fully-qualified name, e.g.:



view log2 {

derived_table: {

sql:

SELECT * FROM request_data.request_end_logs_avro

WHERE _PARTITIONTIME BETWEEN TIMESTAMP_TRUNC({% date_start log1.time_filter %}, DAY)

... ;;

}

}



This works, but if log2 is involved in multiple explores then I need to extend it and duplicate the derived_table block with a different filter reference every time.



Is there any cleaner way to propagate filters?



Thanks,


Sagi

Userlevel 6
Badge

Hi Sagi,



A couple of things. You can more easily write your derived tables as:



view: logs {

derived_table: {

sql:

SELECT * FROM my_dataset.my_logs_table

WHERE {% condition logs.time_filter %} _PARTITIONTIME {% endcondition %}

AND {% condition logs.time_filter %} TIMESTAMP_MILLIS(CAST(requestTime*1000 AS int64))

{% endcondition %} ;;

}

}



You can join two partitioned derived tables using a sql_where: clause in the join. sql_where: puts the code in the where clause instead of the join. The benefit here is that the code is only referenced when the join is made.



explore: logs {

join: logs2 {

sql_on: ${logs.something_id} = ${logs2.something_id} ;;

sql_where: {% condition logs.time_filter %} logs2._PARTITION_TIME {%end_condition%} ;;

}

}

Nice! Building on top of this, I managed to eliminate the derived tables entirely by moving the conditions to a sql_always_where in the explore. Much cleaner and far more reusable now:



explore: logs {

conditionally_filter: {

filters: {

field: logs.time_filter

value: "last 24 hours"

}

}



fields: [ALL_FIELDS*, -logs2.time_filter]



sql_always_where: logs._PARTITIONTIME BETWEEN TIMESTAMP_TRUNC({% date_start logs.time_filter %}, DAY)

AND TIMESTAMP_TRUNC({% date_end logs.time_filter %}, DAY)

AND {% condition logs.time_filter %} ${request_raw} {% endcondition %} ;;



join: logs2 {

sql_on: ${logs.something_id} = ${logs2.something_id} ;;

sql_where: logs2._PARTITIONTIME BETWEEN TIMESTAMP_TRUNC({% date_start logs.time_filter %}, DAY)

AND TIMESTAMP_TRUNC({% date_end logs.time_filter %}, DAY)

}

}
Userlevel 4

Nice @Sagi! I’m going to borrow this pattern for some of our sample models 🙂

Thanks! Just one more thing: I found out that the following attempt of simplification:



{% condition logs.time_filter %} logs._PARTITIONTIME {% endcondition %}



Actually breaks the query whenever setting the filters to something that isn’t a whole day. For example, if the filter is set to the past 2 hours, the expression above would generate:



WHERE ((( _PARTITIONTIME ) >= ((TIMESTAMP_ADD(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR), INTERVAL -1 HOUR))) 

AND ( _PARTITIONTIME ) < ((TIMESTAMP_ADD(TIMESTAMP_ADD(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR), INTERVAL -1 HOUR), INTERVAL 2 HOUR)))))



Which evaluates to something like:



_PARTITIONTIME >= '2017-06-06 20:00:00 UTC' AND _PARTITIONTIME < '2017-06-06 22:00:00 UTC'



Which is always false, because _PARTITIONTIME is set to the timestamp of the day at midnight UTC, e.g. ‘2017-06-06 00:00:00 UTC’.



Therefore, you really must add the TIMESTAMP__TRUNC in there, like I did in my original example:



logs._PARTITIONTIME BETWEEN TIMESTAMP_TRUNC({% date_start logs.time_filter %}, DAY)

AND TIMESTAMP_TRUNC({% date_end logs.time_filter %}, DAY)



I will update the code block above with this correction to ensure it won’t trip anyone.

How can one make a partitioned table from non partitioned table in big query ?

Userlevel 6
Badge

Unfortunately, currently, the only way to create a partitioned table is to do it manually from the command line and create partitions by hand.

Are derived looker tables the only way to leverage wildcard queries? I tired overriding sql_table_name in a view and it didn’t work.



I was reading this post and we have an interesting use case where being able to query across partitioned tables w/o custom SQL would come in handy. When I spoke w/ the help desk about this they were under the impression that we could not do this. Is there anyone specific I should work with on this?

Userlevel 3

Hi Dan,



You can put wildcard queries into the sql_table_name with this syntax:



sql_table_name: `wildcard_example.events_201*` ;;

Wow thanks, I will try it ASAP. I was under the impression I could not, so this makes my day!

For anyone out there using Google BigQuery Legacy SQL, you can use a combination of templated filters and GBQ to accomplish querying on date partitioned tables





view: logs {


sql_table_name: TABLE_DATE_RANGE([ PREFIX OF TABLE BEFORE DATE ],


TIMESTAMP({% date_start table_filter %}),


TIMESTAMP({% date_end table_filter %})) ;;







filter: table_filter {


type: date


}





Here is a link to GBQ Legacy SQL Function: https://cloud.google.com/bigquery/query-reference#tablewildcardfunctions

Userlevel 4

Just a related feature when it comes to PDTs for anyone reading through this: https://docs.looker.com/reference/view-params/partition_keys

Userlevel 3

Might find this useful: How To: Smart Partition Key Filtering with Liquid

Reply