Aggregate Awareness using _in_query

Looker will not be updating this content, nor does it guarantee that everything is up to date.

You can define aggregate tables in your LookML projects, and Looker will use aggregate awareness logic to find the smallest, most efficient table available to run a query while still maintaining correctness. See the Aggregate Awareness documentation page for more information.
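For reference, native aggregate awareness is declared on an Explore with the aggregate_table parameter. The following is only a minimal sketch, not part of this article's examples; the explore, field, and datagroup names are illustrative and assume an orders Explore with a created time dimension group, a count measure, and an existing datagroup:

explore: orders {
  # Minimal native aggregate table; all names here are illustrative
  aggregate_table: orders_by_month {
    query: {
      dimensions: [orders.created_month]
      measures: [orders.count]
    }
    materialization: {
      datagroup_trigger: orders_datagroup  # assumes a datagroup named orders_datagroup is defined in the model
    }
  }
}

The rest of this article focuses on the Liquid-based _in_query approach, which can be useful when you need fully custom routing logic.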

With the Liquid variable _in_query, we can check which fields a user has selected in an Explore and make on-the-fly modifications to the SQL. This opens a large number of potential use cases; this article focuses on a concept called "aggregate awareness," which can be used to streamline performance by reducing database resource consumption.

Many Looker customers report on very large volumes of granular data at an event or transaction level. This is helpful for getting to the bottom of specific scenarios, like examining individual records, but querying the granular data incurs a large expense when all you need is basic statistics. The natural solution is to create a more summarized table for these use cases, which is less expensive to query but may need to be its own Explore. That can leave end users unsure of which Explore is the best one to use for a given question.

_in_query solves this problem elegantly by allowing us to route the query to just the right level of summarization, using one Explore. This can reduce query times for basic statistics.

Example


This is a basic usage example, similar to the one provided in Looker's Liquid variable documentation:

view: orders {
  sql_table_name:
    {% if orders.created_date._in_query %}
      orders
    {% elsif orders.created_week._in_query %}
      orders_smry_week
    {% elsif orders.created_month._in_query %}
      orders_smry_month
    {% else %}
      orders_smry_year
    {% endif %} ;;

  dimension_group: created {
    type: time
    timeframes: [date, week, month, year]
    sql: ${TABLE}.created_at ;;
  }
}

In this example, there are four tables with different levels of summarization based on time: orders, orders_smry_week, orders_smry_month, and orders_smry_year. In other respects each has an identical schema. The if view_name.field_name._in_query logic routes the user to the full transactional table if they select or filter on created_date, a weekly summary if they select or filter on created_week, etc. The result is far less traffic against the raw transactional orders table.
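To make the routing concrete, here is a rough sketch of the SQL Looker might generate for a query that selects only the created_month dimension. This is illustrative only, not actual Looker output, and the column alias is hypothetical:

-- Illustrative rendering only: created_month is in the query,
-- so the Liquid condition swaps the monthly summary table into FROM
SELECT
  DATE_TRUNC('month', orders.created_at) AS orders_created_month
FROM orders_smry_month AS orders
GROUP BY 1

If the user had selected created_date instead, the same LookML would render FROM orders, hitting the full transactional table.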

This approach can have a tremendous impact on overall database contention, especially when many queries are run against transactional or event-level data. As slow queries run, faster queries stack up behind them; the more requests that can be satisfied from highly summarized data, the more quickly requests move through the database, which reduces the chance of a contention-based slowdown. The individual query itself will also return faster.
 

Can't Summarize via ETL? (Use This in Conjunction with PDTs)


The above example assumes that you are creating these progressive summaries independently of Looker. How can we achieve the same thing using the LookML modeling layer? Persistent derived tables (PDTs) to the rescue:

view: event_facts_yearly {
  derived_table: {
    sql_trigger_value: select DATE_TRUNC('year',current_date) ;;
    distribution_style: all
    sortkeys: ["TIME"]
    sql: SELECT
        DATE_TRUNC('year',created_at) AS TIME
        ,event_type
        ,traffic_source
        ,browser
        ,os
        ,country
        ,uri
        ,COUNT(*) AS COUNT
        ,COUNT(DISTINCT user_id) AS DISTINCT_USERS
      FROM
        PUBLIC.EVENTS
      WHERE
        1=1
      GROUP BY
        1,2,3,4,5,6,7
      ;;
  }
}

view: event_facts_monthly {
  derived_table: {
    sql_trigger_value: select DATE_TRUNC('month',current_date) ;;
    distribution_style: all
    sortkeys: ["TIME"]
    sql: SELECT
        DATE_TRUNC('month',created_at) AS TIME
        ,event_type
        ,traffic_source
        ,browser
        ,os
        ,country
        ,uri
        ,COUNT(*) AS COUNT
        ,COUNT(DISTINCT user_id) AS DISTINCT_USERS
      FROM
        PUBLIC.EVENTS
      WHERE
        1=1
      GROUP BY
        1,2,3,4,5,6,7
      ;;
  }
}


view: event_facts_daily {
  derived_table: {
    sql_trigger_value: select current_date ;;
    distribution_style: all
    sortkeys: ["TIME"]
    sql: SELECT
        DATE_TRUNC('day',created_at) AS TIME
        ,event_type
        ,traffic_source
        ,browser
        ,os
        ,country
        ,uri
        ,COUNT(*) AS COUNT
        ,COUNT(DISTINCT user_id) AS DISTINCT_USERS
      FROM
        PUBLIC.EVENTS
      WHERE
        1=1
      GROUP BY
        1,2,3,4,5,6,7
      ;;
  }
}


view: events {
  derived_table: {
    sql:
      SELECT * FROM
      {% if events.time_date._in_query %}
        ${event_facts_daily.SQL_TABLE_NAME}
      {% elsif events.time_month._in_query %}
        ${event_facts_monthly.SQL_TABLE_NAME}
      {% elsif events.time_year._in_query %}
        ${event_facts_yearly.SQL_TABLE_NAME}
      {% else %}
        PUBLIC.EVENTS
      {% endif %}
      ;;
  }

  dimension_group: time {
    timeframes: [raw, date, month, year]
    type: time
    sql: ${TABLE}.TIME ;;
  }

  dimension: event_type {
    type: string
    sql: ${TABLE}.event_type ;;
  }

  dimension: traffic_source {
    type: string
    sql: ${TABLE}.traffic_source ;;
  }

  dimension: browser {
    type: string
    sql: ${TABLE}.browser ;;
  }

  dimension: os {
    type: string
    sql: ${TABLE}.os ;;
  }

  dimension: country {
    type: string
    sql: ${TABLE}.country ;;
  }

  dimension: uri {
    type: string
    sql: ${TABLE}.uri ;;
  }

  measure: count {
    type: sum
    sql: ${TABLE}.count ;;
    drill_fields: [time_date, count]
  }

  measure: distinct_users {
    type: sum
    sql: ${TABLE}.distinct_users ;;
  }
}

explore: events {}

In this example, we are creating each level of summarization as its own PDT: yearly, monthly, and daily. Anything else will be satisfied from the raw events table. Note that we don't even need to provide fields for the summary views. The events view's derived table will then select from the raw, daily, monthly, or yearly table, based on what the user chooses, and Looker will keep those PDTs refreshed according to their specified sql_trigger_value.

Note: We are using a SQL derived table rather than sql_table_name because we cannot use ${pdt_view.SQL_TABLE_NAME} inside of sql_table_name.
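At query time, each ${...SQL_TABLE_NAME} reference resolves to the name of the corresponding persisted table in your connection's scratch schema. For example, the daily branch would render to something roughly like the following; the scratch schema and generated table name here are purely illustrative:

-- Illustrative only: the actual scratch schema and generated table name vary by instance
SELECT * FROM
looker_scratch.LR_XXXX_event_facts_daily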


Summarize by Dimensions Other Than Time


The examples above demonstrate summarization by time, but this can be done with any type of dimension.

Let's look at another example drawing from two tables: one basic summary table, and one table that includes a high-cardinality column such as user_id:

view: usage_facts {
  sql_table_name:
    {% if usage_facts.user_id._in_query %}
      summarized_by_user_id
    {% else %}
      basic_summary
    {% endif %} ;;

  dimension: user_id {
    type: number
    sql: ${TABLE}.user_id ;;
  }

  dimension: usage {
    type: number
    sql: ${TABLE}.usage ;;
  }

  measure: total_usage {
    type: sum
    sql: ${usage} ;;
  }
}

Now, when someone needs usage fact information, they will go to the more expensive, granular summarized_by_user_id table only if they have selected or filtered on user_id.
 

Going Further


We can get as sophisticated as our use case demands. We have the full power of Liquid inside of the sql and sql_table_name parameters. We can base the routing on user input to a parameter field, and we can nest if statements for complex routing between tables.
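For instance, a parameter field can drive the routing in sql_table_name directly. The following is a minimal sketch, separate from the examples in this article; the view name, summary_level parameter, and table names are all hypothetical:

view: orders_param_routed {
  # Hypothetical parameter the user sets in the Explore to pick a summary grain
  parameter: summary_level {
    type: unquoted
    allowed_value: { value: "detail" }
    allowed_value: { value: "monthly" }
  }

  sql_table_name:
    {% if orders_param_routed.summary_level._parameter_value == 'monthly' %}
      orders_smry_month
    {% else %}
      orders
    {% endif %} ;;
}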

Below is a complex example using all of the concepts at once. Let's imagine we have user-level information and our measure is sensitive to whether or not it is grouped by user_id. We already have an existing daily summary of this information, but we want Looker to summarize by month. We also need a daily and monthly summarization at the user level. The end user should choose whether user_id is respected in the aggregation totals.

For this purpose, we could write code like the following:

view: event_facts_monthly {
  derived_table: {
    sql_trigger_value: select DATE_TRUNC('month',current_date) ;;
    distribution_style: all
    sortkeys: ["TIME"]
    sql: SELECT
        DATE_TRUNC('month',created_at) AS TIME
        ,COUNT(*) AS COUNT
      FROM
        PUBLIC.EVENTS
      WHERE
        1=1
      GROUP BY
        1
      ;;
  }
}

view: event_facts_user {
  derived_table: {
    sql_trigger_value: select DATE_TRUNC('day',current_date) ;;
    distribution_style: all
    sortkeys: ["TIME"]
    sql: SELECT
        DATE_TRUNC('day',created_at) AS TIME
        ,user_id
        ,COUNT(*) AS COUNT
      FROM
        PUBLIC.EVENTS
      WHERE
        1=1
      GROUP BY
        1,2
      ;;
  }
}

view: event_facts_user_monthly {
  derived_table: {
    sql_trigger_value: select DATE_TRUNC('month',current_date) ;;
    distribution_style: all
    sortkeys: ["TIME"]
    sql: SELECT
        DATE_TRUNC('month',created_at) AS TIME
        ,user_id
        ,COUNT(*) AS COUNT
      FROM
        PUBLIC.EVENTS
      WHERE
        1=1
      GROUP BY
        1,2
      ;;
  }
}


view: events {
  derived_table: {
    sql:
      -- Assign the value the user entered to a more readable user_input variable for branching
      {% assign user_input = events.calc_method_value_pass_through._sql %}

      SELECT * FROM
      -- Select the appropriate table for the calculation method
      {% if user_input contains 'User Level' %}
        -- Select the appropriate table based on the time grain chosen
        {% if events.time_date._in_query or events.time_week._in_query %}
          ${event_facts_user.SQL_TABLE_NAME}
        {% else %}
          ${event_facts_user_monthly.SQL_TABLE_NAME}
        {% endif %}
      {% else %}
        -- Select the appropriate table based on the time grain chosen
        {% if events.time_date._in_query or events.time_week._in_query %}
          schema.event_facts -- For example, if this daily summary already exists in another schema and is not a Looker PDT
        {% else %}
          ${event_facts_monthly.SQL_TABLE_NAME}
        {% endif %}
      {% endif %}
      ;;
  }

  dimension_group: time {
    type: time
    timeframes: [date, week, month, year]
    sql: ${TABLE}.TIME ;;
  }

  parameter: calc_method {
    type: string
    suggestions: ["User Level", "Global"]
  }

  dimension: calc_method_value_pass_through {
    hidden: yes
    type: string
    sql: {% parameter calc_method %} ;;
  }

  measure: count {
    type: sum
    sql: ${TABLE}.count ;;
    drill_fields: [time_date, count]
  }

}

We recommend making use of aggregate awareness with the _in_query variable as a LookML best practice to speed up queries and make the best use of database resources.
