Question

Incremental PDTs on BigQuery - or how to produce Daily Snapshots on SCD Type 1 Schemas

  • 17 May 2016
  • 2 replies
  • 183 views

Userlevel 3

Very frequently we ask how some total on one day compares to a total metric on another day (month, year). As long as we store data for those previous dates, it becomes a simple exploration task. In fact, BigQuery even natively supports Daily Partitioning of data for these tables (see this discussion) to improve performance. One way to visualize this is to imagine these examples:



  • event tables, in which every event has an associated date

  • transactional tables, in which every transaction corresponds to a date


However, how to infer the day over day (month over month, …) logic for tables that do not have historical information with corresponding dates? In the case of tables in which rows are overwritten–not appended–this requires some additional logic.


Ideal Data Types


This block is most useful on SCD Type 1 schema. Typical tables that are stored in this way are: Users, Authors, Doctors, Devices, ... - basically, entity tables.


Expected Output




Try It Yourself


To visualize this, let’s imagine that a wearables company has users carrying devices. Here’s a simple schema illustrating what the raw data might look like:


Users Table


| id | is_active  |     updated_at      |
| -- | ------------- | ------------------- |
| 1 | True | NULL |
| 2 | True | 01/01/2016 10:01:04 |

It does not matter how someone is defined as active or not active, but the important point is that it is a switch that can be turned on or off for any one user at any point in time.


In the above example, the total number of active users = 2.


Now on some day, say February 2nd, one user turns inactive. The table becomes:


| id | is_active  |     updated_at      |
| -- | ------------- | ------------------- |
| 1 | False | 02/02/2016 00:00:01 |
| 2 | True | 01/01/2016 10:01:04 |

In this second example, the total number of active users = 1.


Normally, the information that on Jan 1st - Feb 1st, the total number of active users = 2, would be lost.


But in BigQuery we can make use of Looker’s Persistent Derived Tables to store previous states of data. We can query tables that were dynamically generated by Looker into its looker_scratch schema. For example, to get the previous version of a table “active_totals”:


SQL
SELECT * FROM TABLE_QUERY(lookerdata:looker_scratch, "table_id CONTAINS 'active_totals'")

To append new data, we just pull it from the original table:


SQL
SELECT ... FROM [bigquery:active_totals] AS x_table ...

There is one tiny drawback of using TABLE_QUERY however: it assumes that there is at least one table matching that pattern - otherwise it fails. This should not matter, except for the first time we build a snapshot table. So we need to make sure that we have some empty referenceable table on day 1.



Old LookML
- view: active_totals_preliminary
derived_table:
sql: |
SELECT
CURRENT_DATE() AS date
, IFNULL(INTEGER(COUNT(1)), 0) AS number_of_active_users
FROM [bigquery:public.users] users
GROUP EACH BY 1
LIMIT 0 # Table will be empty
persist_for: 9999999 hours



New LookML
view: active_totals_preliminary {
derived_table: {
sql: SELECT
CURRENT_DATE() AS date
, IFNULL(INTEGER(COUNT(1)), 0) AS number_of_active_users
FROM [bigquery:public.users] users
GROUP EACH BY 1
LIMIT 0 # Table will be empty
;;
persist_for: "9999999 hours"
}
}


Notice that active_totals_preliminary is matched in the above TABLE_QUERY SELECT.


Everything together:



Old LookML
- explore: active_totals

- view: active_totals
derived_table:
sql: |
SELECT *
FROM
(
#grab data for all days prior to (TODAY-1)
SELECT
date
, number_of_active_users
FROM
TABLE_QUERY(
lookerdata:looker_scratch
, "table_id CONTAINS 'active_totals'"
)
),
(
#grab data for yesterday (i.e. TODAY-1)
SELECT
DATE(DATE_ADD(TIMESTAMP(CONCAT(CURRENT_DATE(), ' 00:00:00')), -1, 'DAY')) AS date
, IFNULL(INTEGER(COUNT(1)), 0) AS number_of_active_users
FROM [bigquery:public.users] AS users
WHERE
# state of a user at any point in time
users.is_active
GROUP EACH BY 1
),
(
#make sure that an empty preliminary table has been built
SELECT
date
, number_of_active_users
FROM
${active_totals_preliminary.SQL_TABLE_NAME}
LIMIT 0
)
sql_trigger_value: SELECT CURRENT_DATE()

fields:
- dimension: date
sql: date

- dimension: total_number

- view: active_totals_preliminary
derived_table:
sql: |
SELECT
CURRENT_DATE() AS date
, IFNULL(INTEGER(COUNT(1)), 0) AS number_of_active_users
FROM [bigquery:public.users] users
GROUP EACH BY 1
LIMIT 0 # Table will be empty
persist_for: 9999999 hours

fields:



New LookML
explore: active_totals {}

view: active_totals {
derived_table: {
sql: SELECT *
FROM
(
#grab data for all days prior to (TODAY-1)
SELECT
date
, number_of_active_users
FROM
TABLE_QUERY(
lookerdata:looker_scratch
, "table_id CONTAINS 'active_totals'"
)
),
(
#grab data for yesterday (i.e. TODAY-1)
SELECT
DATE(DATE_ADD(TIMESTAMP(CONCAT(CURRENT_DATE(), '00:00:00')), -1, 'DAY')) AS date
, IFNULL(INTEGER(COUNT(1)), 0) AS number_of_active_users
FROM [bigquery:public.users] AS users
WHERE
# state of a user at any point in time
users.is_active
GROUP EACH BY 1
),
(
#make sure that an empty preliminary table has been built
SELECT
date
, number_of_active_users
FROM
${active_totals_preliminary.SQL_TABLE_NAME}
LIMIT 0
)
;;
sql_trigger_value: SELECT CURRENT_DATE() ;;
}

dimension: date {
sql: date ;;
}

dimension: total_number {}
}

view: active_totals_preliminary {
derived_table: {
sql: SELECT
CURRENT_DATE() AS date
, IFNULL(INTEGER(COUNT(1)), 0) AS number_of_active_users
FROM [bigquery:public.users] users
GROUP EACH BY 1
LIMIT 0 # Table will be empty
;;
persist_for: "9999999 hours"
}


}


2 replies

Userlevel 4

For each old LookML code block in this article and its comments, we just added the New LookML equivalent code.

Userlevel 2

Just wanted to clarify that this is BigQuery Legacy SQL; for BigQuery Standard SQL, you can use the wildcard syntax. Instead of


FROM
TABLE_QUERY(
lookerdata:looker_scratch
, "table_id CONTAINS 'active_totals'"
)

you can just write


FROM
`lookerdata.looker_scratch.*active_totals*`

Reply