Question

LookML measures in derived tables ... or "approaches to aggregate(aggregates)"

  • 5 April 2017
  • 8 replies
  • 1066 views

Userlevel 2

Given that some SQL dialects cannot perform aggregations on aggregations [e.g. avg(sum(xyzvalue))] I was wondering if anyone has any ideas on how to tackle this while still allowing a “user” to select dynamic dimensions over which these pre-aggregations and then post-aggregations take place.


EXAMPLE

Assume the data I’m working with is defined by the following:



select 1 uid, 100 value, 'x' clientid, 101 itemid, '20161230' asofdate UNION ALL
select 2 uid, 200 value, 'x' clientid, 202 itemid, '20161230' asofdate UNION ALL
select 3 uid, 500 value, 'x' clientid, 303 itemid, '20161230' asofdate UNION ALL
select 4 uid, 800 value, 'y' clientid, 101 itemid, '20161230' asofdate UNION ALL
select 5 uid, 250 value, 'y' clientid, 303 itemid, '20161230' asofdate UNION ALL
select 6 uid, 100 value, 'x' clientid, 101 itemid, '20161231' asofdate UNION ALL
select 7 uid, 400 value, 'x' clientid, 202 itemid, '20161231' asofdate UNION ALL
select 8 uid, 600 value, 'x' clientid, 303 itemid, '20161231' asofdate UNION ALL
select 9 uid, 100 value, 'y' clientid, 303 itemid, '20161231' asofdate



A few illustrative analyses I may want to allow a user to perform are:



  • Calculate the average total value (across all items) by client by asofdate…or “average total value from client by date”

  • Calculate the average total value (across all items and all clients) by asofdate …or “average total value by date”

  • Calculate the average total value (across all clients) by item by asofdate…or “average total value from item by date”
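
To make the first of these concrete: outside Looker, the aggregate-of-aggregates shape can be sanity-checked against the sample rows above, e.g. with an in-memory SQLite table (a sketch only; the table name `sample_data` is made up for illustration):

```python
import sqlite3

# Recreate the sample rows from the question in an in-memory SQLite table,
# then run "average total value from client by date" as a two-step query:
# the inner query pre-aggregates (total per client per date), the outer
# query post-aggregates (average of those totals per date).
rows = [
    (1, 100, 'x', 101, '20161230'), (2, 200, 'x', 202, '20161230'),
    (3, 500, 'x', 303, '20161230'), (4, 800, 'y', 101, '20161230'),
    (5, 250, 'y', 303, '20161230'), (6, 100, 'x', 101, '20161231'),
    (7, 400, 'x', 202, '20161231'), (8, 600, 'x', 303, '20161231'),
    (9, 100, 'y', 303, '20161231'),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sample_data (uid, value, clientid, itemid, asofdate)")
con.executemany("INSERT INTO sample_data VALUES (?, ?, ?, ?, ?)", rows)

result = con.execute("""
    SELECT asofdate, AVG(client_total) AS avg_total_value
    FROM (SELECT asofdate, clientid, SUM(value) AS client_total
          FROM sample_data GROUP BY asofdate, clientid) totals
    GROUP BY asofdate ORDER BY asofdate
""").fetchall()
print(result)  # [('20161230', 925.0), ('20161231', 600.0)]
```

This is exactly the nesting that dialects reject when written as avg(sum(value)) in a single SELECT, which is why it has to happen in two layers.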


There are several approaches I've tested, but with little success. I may be missing something and/or may not fully understand the related posts I found, but I would love to hear if anyone has successfully implemented something similar and could shed some light for me:




  1. PDTs with window/analytic functions: this allows for pre-aggregation, but it doesn't fit this use case because it restricts the dimensions over which the aggregations take place.




  2. Waterfall sequencing of pre- and post-aggregation PDTs: "PDTstep1" with my pre-aggregations and "PDTstep2" as the aggregation of the aggregations in "PDTstep1" (example code below), but the LookML aggregations defined in PDTstep1 don't come through as actual columns in PDTstep2.





explore: PDTstep1and2 {
  from: PDTstep1
  join: PDTstep2 {
    sql_on: ${PDTstep1and2.uid} = ${PDTstep2.uid} ;;
    type: inner
    relationship: one_to_one
  }
}


view: PDTstep1 {
  derived_table: {
    sql:
      select 1 uid, 100 value, 'x' clientid, 101 itemid, '20161230' asofdate UNION ALL
      select 2 uid, 200 value, 'x' clientid, 202 itemid, '20161230' asofdate UNION ALL
      select 3 uid, 500 value, 'x' clientid, 303 itemid, '20161230' asofdate UNION ALL
      select 4 uid, 800 value, 'y' clientid, 101 itemid, '20161230' asofdate UNION ALL
      select 5 uid, 250 value, 'y' clientid, 303 itemid, '20161230' asofdate UNION ALL
      select 6 uid, 100 value, 'x' clientid, 101 itemid, '20161231' asofdate UNION ALL
      select 7 uid, 400 value, 'x' clientid, 202 itemid, '20161231' asofdate UNION ALL
      select 8 uid, 600 value, 'x' clientid, 303 itemid, '20161231' asofdate UNION ALL
      select 9 uid, 100 value, 'y' clientid, 303 itemid, '20161231' asofdate ;;
  }

  dimension: uid {
    primary_key: yes
    type: number
    value_format_name: id
    sql: ${TABLE}.uid ;;
  }

  dimension: date_key {
    type: string
    sql: ${TABLE}.asofdate ;;
  }

  dimension: asofdate {
    type: date
    sql: ${TABLE}.asofdate ;;
  }

  dimension: client_id {
    type: string
    sql: ${TABLE}.clientid ;;
  }

  dimension: item_id {
    type: number
    value_format_name: id
    sql: ${TABLE}.itemid ;;
  }

  measure: value_pre_sum {
    type: sum
    sql: ${TABLE}.value ;;
    value_format_name: decimal_2
  }
}


view: PDTstep2 {
  derived_table: {
    sql:
      select *
      from ${PDTstep1.SQL_TABLE_NAME} agg ;;
  }

  dimension: uid {
    hidden: yes
    type: number
    value_format_name: id
    sql: ${TABLE}.uid ;;
    # primary_key: yes
  }

  dimension: date_key {
    hidden: yes
    type: string
    sql: ${TABLE}.date_key ;;
  }

  dimension: asofdate {
    type: date
    sql: ${TABLE}.asofdate ;;
  }

  dimension: client_id {
    hidden: no
    type: string
    sql: ${TABLE}.clientid ;;
  }

  dimension: item_id {
    hidden: no
    type: number
    value_format_name: id
    sql: ${TABLE}.itemid ;;
  }

  measure: value_sum {
    type: sum
    sql: ${TABLE}.value_pre_sum ;;
    value_format_name: decimal_2
  }

  measure: value_avg {
    type: average
    sql: ${TABLE}.value_pre_sum ;;
    value_format_name: decimal_2
  }
}

8 replies

Userlevel 5

You're correct that LookML aggregations defined in step1 won't come through as columns in step2. However, if you define those aggregations in the SQL itself, then they will come through as columns, allowing you to aggregate on top of them. You could do something like this:


view: PDTstep2 {
  derived_table: {
    sql:
      SELECT asofdate, SUM(value) AS sum_value
      FROM ${PDTstep1.SQL_TABLE_NAME}
      GROUP BY 1 ;;
  }

  # The sum already happened in the SQL above, so this is a plain
  # number dimension rather than a type: sum measure.
  dimension: value {
    type: number
    sql: ${TABLE}.sum_value ;;
    value_format_name: decimal_2
  }

  measure: value_avg {
    type: average
    sql: ${value} ;;
    value_format_name: decimal_2
  }
}
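
With the sample rows, that derived table holds one pre-aggregated row per date (1850 for 20161230, 1200 for 20161231), so selecting value_avg with no extra dimensions yields their average. A sketch of the SQL Looker would roughly generate (exact output depends on the dialect):

```sql
-- Sketch only: value_avg selected with no dimensions; the inner SELECT is
-- PDTstep2's derived table from the example above.
SELECT AVG(sum_value) AS value_avg      -- (1850 + 1200) / 2 = 1525
FROM (
  SELECT asofdate, SUM(value) AS sum_value
  FROM PDTstep1                         -- i.e. ${PDTstep1.SQL_TABLE_NAME}
  GROUP BY 1
) PDTstep2
```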

If you want to get fancy with templated filters, then you can let users choose which field to group on:


view: PDTstep2_fancy {
  derived_table: {
    sql:
      SELECT
        CASE
          WHEN {% condition PDTstep2_fancy.group_selector %} 'client' {% endcondition %} THEN clientid
          WHEN {% condition PDTstep2_fancy.group_selector %} 'item' {% endcondition %} THEN itemid
        END AS grouped_id,
        asofdate,
        SUM(value) AS sum_value
      FROM ${PDTstep1.SQL_TABLE_NAME}
      GROUP BY 1, 2 ;;
  }

  filter: group_selector {
    type: string
    suggestions: ["client", "item"]
  }

  dimension: grouped_id {
    type: string
    sql: ${TABLE}.grouped_id ;;
  }

  dimension: value {
    type: number
    sql: ${TABLE}.sum_value ;;
    value_format_name: decimal_2
  }

  measure: value_avg {
    type: average
    sql: ${value} ;;
    value_format_name: decimal_2
  }
}

Let me know if this is helpful!

Userlevel 2

Thanks for the quick response, @sam!


I'm trying to steer clear of templated filters, since that would limit the dimensions for analysis to single selections, and only those I define ahead of time.


For your first example, would the only aggregation be over asofdate? In other words, if a user added the client dimension in the UI, the PDTstep2 definition wouldn't dynamically change to the following, would it?



view: PDTstep2 {
  derived_table: {
    sql:
      SELECT asofdate, CLIENTID, SUM(value) AS sum_value
      FROM ${PDTstep1.SQL_TABLE_NAME}
      GROUP BY 1, 2 ;;
  }
}

Userlevel 5

You’re right @mplooker - going the pre-aggregated route means that the SUM won’t change its grouping if a user adds in another dimension.


To simplify the problem here, we want an average of a SUM. But we also want the grouping of the SUM to be dynamically chosen by the user in the UI. Is that right?


I think the only way to get at this is with templated filters as I suggested. I’ll keep thinking on this and we’ll see if any other Discoursers have bright ideas here.

Userlevel 5

You could get at this with table calcs, if you don't want to use PDTs at all. Say you use the mean table calc to always point at your SUM measure. Then, no matter how the SUM measure is grouped, the table calc will always display the average of that measure.
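
For example, assuming the explore exposes the value_pre_sum measure from the PDTstep1 view above (the field reference here is a guess at the explore's field name; adjust as needed), the table calculation is just Looker's built-in mean() pointed at the measure:

```
mean(${pdtstep1.value_pre_sum})
```

Because table calcs operate on the result set, the average follows whatever grouping the user has chosen in the UI.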

Userlevel 2

Thanks, @sam.


Just because I’ve been struggling to find a workaround for this, here are some thoughts I had:


Another solution could be for the application to build intermediate tables that exist only for the query execution (so either CTEs or actual persistent temp tables that are immediately dropped) BUT that include the lookml-defined aggregation measures with all user manipulations (like filters, dimensions, etc). Then the last step would be to apply the same (applicable) joins to the resulting table that were used in the intermediate step outlined above.


So using the example above, say the user wants to



  1. Compute the total client sum(value) for each day THEN

  2. Compute the average client’s total sum(value) for each asofdate; or

  3. Compute average daily total sum(value) for each clientID…


The basic setup can be seen below:



As mentioned above, trying to solve this in LookML is complicated by the fact that the first PDT doesn't actually include any of the measure definitions, and Looker throws an error when you try to use the measure name from PDTstep1 in PDTstep2.


As such, Looker renders this as:


WITH PDTstep1 AS (
  select 1 uid, 100 value, 'x' clientid, 101 itemid, '20161230' asofdate UNION ALL
  select 2 uid, 200 value, 'x' clientid, 202 itemid, '20161230' asofdate UNION ALL
  select 3 uid, 500 value, 'x' clientid, 303 itemid, '20161230' asofdate UNION ALL
  select 4 uid, 800 value, 'y' clientid, 101 itemid, '20161230' asofdate UNION ALL
  select 5 uid, 250 value, 'y' clientid, 303 itemid, '20161230' asofdate UNION ALL
  select 6 uid, 100 value, 'x' clientid, 101 itemid, '20161231' asofdate UNION ALL
  select 7 uid, 400 value, 'x' clientid, 202 itemid, '20161231' asofdate UNION ALL
  select 8 uid, 600 value, 'x' clientid, 303 itemid, '20161231' asofdate UNION ALL
  select 9 uid, 100 value, 'y' clientid, 303 itemid, '20161231' asofdate
),
PDTstep2 AS (
  select *
  from PDTstep1 agg
)
SELECT
  TOP 500
  CONVERT(VARCHAR(10), PDTstep2.asofdate, 120) AS "pdtstep2.asofdate_1",
  AVG(PDTstep2.value_pre_sum) AS "pdtstep2.value_avg_1"
FROM PDTstep2
GROUP BY CONVERT(VARCHAR(10), PDTstep2.asofdate, 120)
ORDER BY 1 DESC

This throws an error because there is no column named 'value_pre_sum' in PDTstep1 as defined.


But in theory, the explore could be set up so that both PDTstep1 and PDTstep2 are available to the user: they select ClientID and AsofDate in PDTstep1, and then value_avg and AsofDate from PDTstep2 (though the dimensions in PDTstep2 might have to be dynamically included/excluded based on the dimensions selected in PDTstep1), and the resulting SQL would render something like this:


WITH PDTstep1 AS (
  select 1 uid, 100 value, 'x' clientid, 101 itemid, '20161230' asofdate UNION ALL
  select 2 uid, 200 value, 'x' clientid, 202 itemid, '20161230' asofdate UNION ALL
  select 3 uid, 500 value, 'x' clientid, 303 itemid, '20161230' asofdate UNION ALL
  select 4 uid, 800 value, 'y' clientid, 101 itemid, '20161230' asofdate UNION ALL
  select 5 uid, 250 value, 'y' clientid, 303 itemid, '20161230' asofdate UNION ALL
  select 6 uid, 100 value, 'x' clientid, 101 itemid, '20161231' asofdate UNION ALL
  select 7 uid, 400 value, 'x' clientid, 202 itemid, '20161231' asofdate UNION ALL
  select 8 uid, 600 value, 'x' clientid, 303 itemid, '20161231' asofdate UNION ALL
  select 9 uid, 100 value, 'y' clientid, 303 itemid, '20161231' asofdate
),
PDTstep1_INTERMEDIATE AS (
  select
    CONVERT(VARCHAR(10), PDTstep1.asofdate, 120) AS asofdate,
    PDTstep1.clientid AS clientid,
    SUM(value) AS value_presum
  from PDTstep1
  GROUP BY CONVERT(VARCHAR(10), PDTstep1.asofdate, 120), PDTstep1.clientid
)
SELECT
  TOP 500
  CONVERT(VARCHAR(10), PDTstep1_INTERMEDIATE.asofdate, 120) AS asofdate,
  COALESCE(AVG(PDTstep1_INTERMEDIATE.value_presum), 0) AS value_avg
FROM PDTstep1_INTERMEDIATE
GROUP BY CONVERT(VARCHAR(10), PDTstep1_INTERMEDIATE.asofdate, 120)
ORDER BY 1 DESC
Userlevel 3

Hey there, Maxie from Looker Support chiming in.


I saw an internal thread discussing something similar. The intention of this pattern (which has not been implemented and is still in the "interesting idea" phase) is to:



either require or allow aggregating of records from some tables before performing joins.

This is a very common use case when selecting measures from multiple views, and when grouping by categories, dates/calendars, statuses, accounts, etc.



Basically, the idea was to join in pre-aggregated tables, as opposed to the raw tables, to improve performance, but it also seems applicable to your situation here. So if you start with something like this (minus the logic for symmetric aggregates):


SELECT product_category, symmetric sum(orders.amount), count(distinct page_view.id)
FROM product_categories
LEFT JOIN orders on orders.product_category=product_categories.id
LEFT JOIN page_views on page_views.product_category=product_categories.id
GROUP BY 1

The pattern would instead produce something like this:


SELECT product_category, SUM(orders._agg_amount), SUM(page_views._agg_count)
FROM product_categories
LEFT JOIN
  (SELECT product_category, SUM(amount) AS _agg_amount FROM orders GROUP BY 1)
  AS orders ON orders.product_category = product_categories.id
LEFT JOIN
  (SELECT product_category, COUNT(*) AS _agg_count FROM page_views GROUP BY 1)
  AS page_views ON page_views.product_category = product_categories.id
GROUP BY 1

I was specifically thinking of your comment



Another solution could be for the application to build intermediate tables that exist only for the query execution (so either CTEs or actual persistent temp tables that are immediately dropped) BUT that include the lookml-defined aggregation measures with all user manipulations (like filters, dimensions, etc).



when I ran across this and it seemed like it could be related.


Does this seem like it’d be applicable to your ask here? If so, I’ll make sure the detailed feedback you’ve left here (and any additional feedback you have) makes it into the appropriate discussion.

Userlevel 2

@maxcorbin this is exactly what I’m talking about.


Happy to talk through the use cases and any considerations to get this into the product as quickly as possible. This is probably the #1 issue we run into with respect to modeling, so it would be a game changer.

Userlevel 2

@maxcorbin And specifically, regarding my bolded suggestion below about building persistent temp tables only for the execution of the query (and immediately dropping them): this approach could potentially optimize performance tremendously. We've tried to deliver this by creating PDTs with persist_for values of a second or two, but it doesn't work as seamlessly or intuitively in practice as we would like.



Another solution could be for the application to build intermediate tables that exist only for the query execution (so either CTEs or actual persistent temp tables that are immediately dropped) BUT that include the lookml-defined aggregation measures with all user manipulations (like filters, dimensions, etc).

