Modeling nested data in Big Query Standard SQL

zlebowitz · 11-29-2016 02:56 PM

With the release of Big Query’s Standard SQL dialect, it’s now much easier to model nested data sets in LookML.

For a quick primer on how nested and repeated files work in BigQuery, and why they’re valuable, take a look at Daniel’s post.

You can also head here for a complete Looker Block for Google Analytics Premium and GA 360 data, which puts the concepts described below into a plug-and-play LookML model.

These examples are based on the Google Analytics Premium (GAP) Big Query export data set, which makes heavy use of nested repeated fields.

Join-based querying of nested-repeated (i.e. arrays of STRUCT) fields, replacing the need for FLATTENS

The Google Analytics Premium schema has one table for session-level data, with individual hit (i.e. event) records nested within each session. To query hit-level and session level data at the same time (e.g. counting total number of sessions and total number of hits over time), there is a new join-based syntax for referencing those two different levels of nesting ([documented here] (https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#removing_rep...)). Mirroring this syntax in Looker, you’d define separate view files for session and hit fields in your model, and join them together in an explore:

view: session {

  sql_table_name: `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_*` ;;

# Define all session-related dimensions and measures in this view file

  }



view: hits {

#note that there is no sql_table_name here because we've defined it in the sql in the explore below

# Define all hit-related dimensions and measures in this view file

  }

explore: session {

  join: hits {

    sql: ,UNNEST(visits.hits) as hits ;;

    relationship: one_to_many

  }

}

You can also select individual elements, or aggregate across multiple elements, from a nested-repeated columns, and use those unnested columns as dimensions in GROUP by statements.

This is particularly valuable for handling custom hit attributes in the GAP export schema:

  dimension: custom_attribute_1_value {

    type: string

    sql: (SELECT value FROM UNNEST(${TABLE}.customdimensions) where index = 1) ;;

  }

You can even aggregate across several nested repeated records for a given event. This too can be used in a GROUP BY, or even further aggregated as a measure:

  dimension: number_of_custom_attributes {

    type: number

    sql: (SELECT count(*) FROM UNNEST(${TABLE}.customdimensions)) ;;

  }



  measure: average_number_of_custom_attributes {

    type: average

    sql: ${number_of_custom_attributes} ;;

  }

Also of note, replacing table_date_range() and table_query()

This discourse post describes how to model date-separated tables, like the ga_sessions tables in the GAP schema, using bigquery’s new partitioning tools.

zlebowitz

Here are a few even more complex examples of dimensions you can build using subqueries:

dimension: number_of_long_pauses {
sql:
/* Find the number of times that X time elapses between events */
( select count(*)  -- 2. Count the hits 
  from (
     /* -- 1: For every hit (i.e. event), get it's timestamp, and the timestamp of the prior hit */
      select time, 
             lag(time) over(order by time) as last_event_time
      from unnest(hits)
      )
  where time - last_event_time > 1000  /* 3. Where the hit was more than 1000 miliseconds after the prior hit */
      ) ;;
}

dimension: event_type_with_longest_pause {

sql: 
  (select last_value(event_category) over(order by time_spent asc)
  from 
  (select 
    eventInfo.eventCategory as event_category, 
    lead(time) over(order by time) as time_spent
  from unnest(hits))
  limit 1
  ) 
;;
}