Question

Modeling nested data in Big Query Standard SQL

  • 29 November 2016
  • 1 reply
  • 1595 views

Userlevel 3

With the release of Big Query’s Standard SQL dialect, it’s now much easier to model nested data sets in LookML.

 

For a quick primer on how nested and repeated files work in BigQuery, and why they’re valuable, take a look at Daniel’s post.

 

 

You can also head here for a complete Looker Block for Google Analytics Premium and GA 360 data, which puts the concepts described below into a plug-and-play LookML model.

 

 

These examples are based on the Google Analytics Premium (GAP) Big Query export data set, which makes heavy use of nested repeated fields.

 

 

Join-based querying of nested-repeated (i.e. arrays of STRUCT) fields, replacing the need for FLATTENS

 

The Google Analytics Premium schema has one table for session-level data, with individual hit (i.e. event) records nested within each session. To query hit-level and session level data at the same time (e.g. counting total number of sessions and total number of hits over time), there is a new join-based syntax for referencing those two different levels of nesting ([documented here] (https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#removing_repetition_with_flatten)). Mirroring this syntax in Looker, you’d define separate view files for session and hit fields in your model, and join them together in an explore:

 

 

view: session {

sql_table_name: `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_*` ;;

# Define all session-related dimensions and measures in this view file

}



view: hits {

#note that there is no sql_table_name here because we've defined it in the sql in the explore below

# Define all hit-related dimensions and measures in this view file

}

explore: session {

join: hits {

sql: ,UNNEST(visits.hits) as hits ;;

relationship: one_to_many

}

}

 

You can also select individual elements, or aggregate across multiple elements, from a nested-repeated columns, and use those unnested columns as dimensions in GROUP by statements.

 

This is particularly valuable for handling custom hit attributes in the GAP export schema:

 

 

  dimension: custom_attribute_1_value {

type: string

sql: (SELECT value FROM UNNEST(${TABLE}.customdimensions) where index = 1) ;;

}

 

You can even aggregate across several nested repeated records for a given event. This too can be used in a GROUP BY, or even further aggregated as a measure:

 

 

  dimension: number_of_custom_attributes {

type: number

sql: (SELECT count(*) FROM UNNEST(${TABLE}.customdimensions)) ;;

}



measure: average_number_of_custom_attributes {

type: average

sql: ${number_of_custom_attributes} ;;

}

 

Also of note, replacing table_date_range() and table_query()

 

This discourse post describes how to model date-separated tables, like the ga_sessions tables in the GAP schema, using bigquery’s new partitioning tools.

 


1 reply

Userlevel 3

Here are a few even more complex examples of dimensions you can build using subqueries:


dimension: number_of_long_pauses {
sql:
/* Find the number of times that X time elapses between events */
( select count(*) -- 2. Count the hits
from (
/* -- 1: For every hit (i.e. event), get it's timestamp, and the timestamp of the prior hit */
select time,
lag(time) over(order by time) as last_event_time
from unnest(hits)
)
where time - last_event_time > 1000 /* 3. Where the hit was more than 1000 miliseconds after the prior hit */
) ;;
}

dimension: event_type_with_longest_pause {

sql:
(select last_value(event_category) over(order by time_spent asc)
from
(select
eventInfo.eventCategory as event_category,
lead(time) over(order by time) as time_spent
from unnest(hits))
limit 1
)
;;
}

Reply