Question

LookML measures in derived tables ... or "approaches to aggregate(aggregates)"

  • 5 April 2017
  • 8 replies
  • 1162 views

Userlevel 2

Given that some SQL dialects cannot perform aggregations on aggregations [e.g. avg(sum(xyzvalue))], I was wondering if anyone has ideas on how to tackle this while still allowing a "user" to select the dynamic dimensions over which these pre-aggregations and then post-aggregations take place.
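
For instance, a minimal sketch of the failing shape (the table t and its columns are hypothetical here):

-- Rejected by many dialects (e.g. SQL Server's "cannot perform an aggregate
-- function on an expression containing an aggregate"):
SELECT clientid, AVG(SUM(value))  -- aggregate of an aggregate
FROM t
GROUP BY clientid;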



EXAMPLE


Assume the data I’m working with is defined by the following:





select 1 uid, 100 value, 'x' clientid, 101 itemid, '20161230' asofdate UNION ALL
select 2 uid, 200 value, 'x' clientid, 202 itemid, '20161230' asofdate UNION ALL
select 3 uid, 500 value, 'x' clientid, 303 itemid, '20161230' asofdate UNION ALL
select 4 uid, 800 value, 'y' clientid, 101 itemid, '20161230' asofdate UNION ALL
select 5 uid, 250 value, 'y' clientid, 303 itemid, '20161230' asofdate UNION ALL
select 6 uid, 100 value, 'x' clientid, 101 itemid, '20161231' asofdate UNION ALL
select 7 uid, 400 value, 'x' clientid, 202 itemid, '20161231' asofdate UNION ALL
select 8 uid, 600 value, 'x' clientid, 303 itemid, '20161231' asofdate UNION ALL
select 9 uid, 100 value, 'y' clientid, 303 itemid, '20161231' asofdate





A few illustrative analyses I may want to allow a user to perform are listed below (see the SQL sketch after the list for the second one):





  • Calculate the average total value (across all items) by client by asofdate… or "average total value from client by date"
  • Calculate the average total value (across all items and all clients) by asofdate… or "average total value by date"
  • Calculate the average total value (across all clients) by item by asofdate… or "average total value from item by date"
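
To make the second analysis concrete, the plain-SQL rewrite needs two steps; here is a sketch, with sample_data standing in for the UNION ALL rows above:

-- Step 1 (pre-aggregation): total value per client per asofdate.
-- Step 2 (post-aggregation): average those client totals per asofdate.
SELECT asofdate, AVG(client_total) AS avg_client_total
FROM (
  SELECT clientid, asofdate, SUM(value) AS client_total
  FROM sample_data
  GROUP BY clientid, asofdate
) pre_agg
GROUP BY asofdate;

-- Against the sample rows: 20161230 -> (800 + 1050) / 2 = 925,
--                          20161231 -> (1100 + 100) / 2 = 600.

The question is how to let the user pick the inner and outer grouping dimensions dynamically, instead of hard-coding them as above.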




There are several approaches I've tested, but with little success… I may be missing something and/or may not fully understand the related posts I found, but I would love to hear if anyone has been able to successfully implement something similar and could shed some light for me:







  1. PDTs with window/analytic functions: this allows for pre-aggregation, but it doesn't fit this use case because it restricts the dimensions over which the aggregations take place…







  2. Waterfall sequencing of pre- and post-aggregation PDTs: "PDTstep1" with my pre-aggregations and "PDTstep2" as the aggregation of the aggregations in "PDTstep1" (example code below), but the LookML aggregations defined in PDTstep1 don't come through as actual "columns" in "PDTstep2".









explore: PDTstep1and2 {
  from: PDTstep1
  join: PDTstep2 {
    sql_on: ${PDTstep1and2.uid} = ${PDTstep2.uid} ;;
    type: inner
    relationship: one_to_one
  }
}





view: PDTstep1 {
  derived_table: {
    sql:
      select 1 uid, 100 value, 'x' clientid, 101 itemid, '20161230' asofdate UNION ALL
      select 2 uid, 200 value, 'x' clientid, 202 itemid, '20161230' asofdate UNION ALL
      select 3 uid, 500 value, 'x' clientid, 303 itemid, '20161230' asofdate UNION ALL
      select 4 uid, 800 value, 'y' clientid, 101 itemid, '20161230' asofdate UNION ALL
      select 5 uid, 250 value, 'y' clientid, 303 itemid, '20161230' asofdate UNION ALL
      select 6 uid, 100 value, 'x' clientid, 101 itemid, '20161231' asofdate UNION ALL
      select 7 uid, 400 value, 'x' clientid, 202 itemid, '20161231' asofdate UNION ALL
      select 8 uid, 600 value, 'x' clientid, 303 itemid, '20161231' asofdate UNION ALL
      select 9 uid, 100 value, 'y' clientid, 303 itemid, '20161231' asofdate ;;
  }

  dimension: uid {
    type: number
    value_format_name: id
    sql: ${TABLE}.uid ;;
    primary_key: yes
  }

  dimension: date_key {
    type: string
    sql: ${TABLE}.asofdate ;;
  }

  dimension: asofdate {
    type: date
    sql: ${TABLE}.asofdate ;;
  }

  dimension: client_id {
    type: string
    sql: ${TABLE}.clientid ;;
  }

  dimension: item_id {
    type: number
    value_format_name: id
    sql: ${TABLE}.itemid ;;
  }

  measure: value_pre_sum {
    type: sum
    sql: ${TABLE}.value ;;
    value_format_name: decimal_2
  }
}





view: PDTstep2 {
  derived_table: {
    # Note: the materialized PDTstep1 table only contains the raw SQL columns
    # (uid, value, clientid, itemid, asofdate). The LookML measure
    # value_pre_sum is NOT a column here, which is the crux of the problem.
    sql:
      select *
      from ${PDTstep1.SQL_TABLE_NAME} agg ;;
  }

  dimension: uid {
    hidden: yes
    type: number
    value_format_name: id
    sql: ${TABLE}.uid ;;
    # primary_key: yes
  }

  dimension: date_key {
    hidden: yes
    type: string
    sql: ${TABLE}.date_key ;;
  }

  dimension: asofdate {
    type: date
    sql: ${TABLE}.asofdate ;;
  }

  dimension: client_id {
    hidden: no
    type: string
    sql: ${TABLE}.clientid ;;
  }

  dimension: item_id {
    hidden: no
    type: number
    value_format_name: id
    sql: ${TABLE}.itemid ;;
  }

  # These measures reference value_pre_sum, which does not exist as a column
  # in the materialized table -- this is the error discussed later in the thread.
  measure: value_sum {
    type: sum
    sql: ${TABLE}.value_pre_sum ;;
    value_format_name: decimal_2
  }

  measure: value_avg {
    type: average
    sql: ${TABLE}.value_pre_sum ;;
    value_format_name: decimal_2
  }
}

8 replies

Userlevel 3

Hey there, Maxie from Looker Support chiming in.



I saw an internal thread discussing something similar. The intention of that pattern (which has not been implemented and is still in the "interesting idea" phase) is to:





either require or allow aggregating records from some tables before performing joins.













This is a very common use case when selecting measures from multiple views, especially when grouping by categories, dates/calendars, statuses, accounts, etc.





Basically the idea was to join in pre-aggregated tables, as opposed to the raw tables, to improve performance, but it also seems to be applicable to your situation here. So if you start with something like this (minus the logic for symmetric aggregates):



SELECT product_category,
       SUM(orders.amount),            -- symmetric-aggregate logic omitted
       COUNT(DISTINCT page_views.id)
FROM product_categories
LEFT JOIN orders ON orders.product_category = product_categories.id
LEFT JOIN page_views ON page_views.product_category = product_categories.id
GROUP BY 1



The pattern would instead produce something like this:



SELECT product_category, SUM(orders._agg_amount), SUM(page_views._agg_count)
FROM product_categories
LEFT JOIN
  (SELECT product_category, SUM(amount) AS _agg_amount FROM orders GROUP BY 1)
  AS orders ON orders.product_category = product_categories.id
LEFT JOIN
  (SELECT product_category, COUNT(*) AS _agg_count FROM page_views GROUP BY 1)
  AS page_views ON page_views.product_category = product_categories.id
GROUP BY 1



I was specifically thinking of your comment





Another solution could be for the application to build intermediate tables that exist only for the query execution (either CTEs, or actual persistent temp tables that are immediately dropped) BUT that include the LookML-defined aggregation measures with all user manipulations (filters, dimensions, etc.).





when I ran across this, and it seemed like it could be related.



Does this seem like it’d be applicable to your ask here? If so, I’ll make sure the detailed feedback you’ve left here (and any additional feedback you have) makes it into the appropriate discussion.

Userlevel 2

@maxcorbin this is exactly what I’m talking about.



Happy to talk through the use cases and any considerations to get this into the product as quickly as possible, as this is probably the #1 issue we run into with respect to modeling. As such, this would be a game changer.

Userlevel 5

You're correct that LookML aggregations defined in step1 won't come through as columns in step2. However, if you actually define those aggregations in the SQL, then they will come through as columns, allowing you to aggregate them again. You could do something like this:



view: PDTstep2 {
  derived_table: {
    sql:
      SELECT asofdate, SUM(value) AS sum_value
      FROM ${PDTstep1.SQL_TABLE_NAME}
      GROUP BY 1 ;;
  }

  dimension: asofdate {
    type: date
    sql: ${TABLE}.asofdate ;;
  }

  # The pre-aggregated SUM is now a real column, exposed as a dimension.
  dimension: value {
    type: number
    sql: ${TABLE}.sum_value ;;
    value_format_name: decimal_2
  }

  measure: value_avg {
    type: average
    sql: ${value} ;;
    value_format_name: decimal_2
  }
}



If you want to get fancy with templated filters, you can let users choose which field to group on:



view: PDTstep2_fancy {
  derived_table: {
    sql:
      SELECT
        CASE
          WHEN {% condition PDTstep2_fancy.group_selector %} 'client' {% endcondition %}
            THEN clientid
          WHEN {% condition PDTstep2_fancy.group_selector %} 'item' {% endcondition %}
            THEN itemid
        END AS grouped_entity,  -- alias (hypothetical name) so the column can be exposed
        asofdate,
        SUM(value) AS sum_value
      FROM ${PDTstep1.SQL_TABLE_NAME}
      GROUP BY 1, 2 ;;
  }

  filter: group_selector {
    type: string
    suggestions: ["client", "item"]
  }

  dimension: value {
    type: number
    sql: ${TABLE}.sum_value ;;
    value_format_name: decimal_2
  }

  measure: value_avg {
    type: average
    sql: ${value} ;;
    value_format_name: decimal_2
  }
}
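
To illustrate how the templated filter resolves: if a user sets group_selector to "client", the {% condition %} blocks render as literal comparisons, so the generated SQL would look roughly like this (a sketch; the FROM target stands in for the materialized PDT's actual scratch-schema name):

SELECT
  CASE
    WHEN ('client' = 'client') THEN clientid  -- true branch
    WHEN ('client' = 'item')   THEN itemid    -- never taken
  END AS grouped_entity,
  asofdate,
  SUM(value) AS sum_value
FROM looker_scratch.pdtstep1  -- hypothetical materialized table name
GROUP BY 1, 2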



Let me know if this is helpful!

Userlevel 5

You’re right @mplooker - going the pre-aggregated route means that the SUM won’t change its grouping if a user adds in another dimension.



To simplify the problem here, we want an average of a SUM. But we also want the grouping of the SUM to be dynamically chosen by the user in the UI. Is that right?



I think the only way to get at this is with templated filters as I suggested. I’ll keep thinking on this and we’ll see if any other Discoursers have bright ideas here.

Userlevel 5

You could get at this with table calcs if you wanted to avoid PDTs entirely. Say you used the mean table calc to always point at your SUM measure. Then, no matter how the SUM measure is grouped, the table calc will always display the average of that measure. For example:
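
A minimal sketch, assuming the value_sum measure on PDTstep2 above (table calculations reference model fields with ${view.field} syntax):

mean(${PDTstep2.value_sum})

Because the calculation runs over whatever rows come back in the data table, it re-averages automatically no matter which dimensions the user selects.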

Userlevel 2

Thanks for the quick response, @sam!



I'm trying to steer clear of templated filters, since they would limit the dimensions for analysis to single selections only, and only to those I specifically define ahead of time.



For your first example, would the only aggregation be over asofdate? In other words, if a user added the client dimension in the UI, the "PDTstep2" definition wouldn't dynamically change to the following, would it?





view: PDTstep2 {
  derived_table: {
    sql:
      SELECT asofdate, CLIENTID, SUM(value) AS sum_value
      FROM ${PDTstep1.SQL_TABLE_NAME}
      GROUP BY 1, 2 ;;
  }
}


Userlevel 2

Thanks, @sam.



Just because I’ve been struggling to find a workaround for this, here are some thoughts I had:



Another solution could be for the application to build intermediate tables that exist only for the query execution (either CTEs, or actual persistent temp tables that are immediately dropped) BUT that include the LookML-defined aggregation measures with all user manipulations (filters, dimensions, etc.). The last step would then be to apply the same (applicable) joins to the resulting table that were used in the intermediate step outlined above.



So using the example above, say the user wants to:

  1. Compute the total client sum(value) for each day, THEN
  2. Compute the average client's total sum(value) for each asofdate; or
  3. Compute the average daily total sum(value) for each clientID…




The basic setup is the PDTstep1/PDTstep2 arrangement shown in my original post.





As mentioned above, trying to solve this in LookML is complicated by the fact that the first PDT doesn't actually include any of the measure definitions, so an error is thrown when trying to use the measure name from PDTstep1 in PDTstep2.



As such, Looker renders this as:



WITH PDTstep1 AS (
  select 1 uid, 100 value, 'x' clientid, 101 itemid, '20161230' asofdate UNION ALL
  select 2 uid, 200 value, 'x' clientid, 202 itemid, '20161230' asofdate UNION ALL
  select 3 uid, 500 value, 'x' clientid, 303 itemid, '20161230' asofdate UNION ALL
  select 4 uid, 800 value, 'y' clientid, 101 itemid, '20161230' asofdate UNION ALL
  select 5 uid, 250 value, 'y' clientid, 303 itemid, '20161230' asofdate UNION ALL
  select 6 uid, 100 value, 'x' clientid, 101 itemid, '20161231' asofdate UNION ALL
  select 7 uid, 400 value, 'x' clientid, 202 itemid, '20161231' asofdate UNION ALL
  select 8 uid, 600 value, 'x' clientid, 303 itemid, '20161231' asofdate UNION ALL
  select 9 uid, 100 value, 'y' clientid, 303 itemid, '20161231' asofdate
),
PDTstep2 AS (
  select * from PDTstep1 agg
)
SELECT TOP 500
  CONVERT(VARCHAR(10), PDTstep2.asofdate, 120) AS "pdtstep2.asofdate_1",
  AVG(PDTstep2.value_pre_sum) AS "pdtstep2.value_avg_1"
FROM PDTstep2
GROUP BY CONVERT(VARCHAR(10), PDTstep2.asofdate, 120)
ORDER BY 1 DESC



And this throws an error, because there is no column named 'value_pre_sum' in PDTstep1 as defined.



But in theory, the explore could be set up so that both PDTstep1 and PDTstep2 are available to the user: they select ClientID and AsofDate in PDTstep1, and then value_avg and AsofDate from PDTstep2 (though the dimensions in PDTstep2 might have to be dynamically included/excluded based on the dimensions selected in PDTstep1). The resulting SQL would then render something like this:



WITH PDTstep1 AS (
  select 1 uid, 100 value, 'x' clientid, 101 itemid, '20161230' asofdate UNION ALL
  select 2 uid, 200 value, 'x' clientid, 202 itemid, '20161230' asofdate UNION ALL
  select 3 uid, 500 value, 'x' clientid, 303 itemid, '20161230' asofdate UNION ALL
  select 4 uid, 800 value, 'y' clientid, 101 itemid, '20161230' asofdate UNION ALL
  select 5 uid, 250 value, 'y' clientid, 303 itemid, '20161230' asofdate UNION ALL
  select 6 uid, 100 value, 'x' clientid, 101 itemid, '20161231' asofdate UNION ALL
  select 7 uid, 400 value, 'x' clientid, 202 itemid, '20161231' asofdate UNION ALL
  select 8 uid, 600 value, 'x' clientid, 303 itemid, '20161231' asofdate UNION ALL
  select 9 uid, 100 value, 'y' clientid, 303 itemid, '20161231' asofdate
),
PDTstep1_INTERMEDIATE AS (
  select
    CONVERT(VARCHAR(10), PDTstep1.asofdate, 120) AS asofdate,
    PDTstep1.clientid AS clientid,
    SUM(value) AS value_pre_sum
  from PDTstep1
  GROUP BY CONVERT(VARCHAR(10), PDTstep1.asofdate, 120), PDTstep1.clientid
)
SELECT TOP 500
  CONVERT(VARCHAR(10), PDTstep1_INTERMEDIATE.asofdate, 120) AS asofdate,
  COALESCE(AVG(PDTstep1_INTERMEDIATE.value_pre_sum), 0) AS value_avg
FROM PDTstep1_INTERMEDIATE
GROUP BY CONVERT(VARCHAR(10), PDTstep1_INTERMEDIATE.asofdate, 120)
ORDER BY 1 DESC
Userlevel 2

@maxcorbin And specifically on my suggestion below about building persistent temp tables only for the execution of the query, which are then immediately dropped: this approach could potentially improve performance tremendously. We've tried to deliver this by creating PDTs with persist_for values of a second or two, but it doesn't work as seamlessly/intuitively in practice as we would like (a sketch of that workaround follows the quote below).





Another solution could be for the application to build intermediate tables that exist only for the query execution (either CTEs, or actual persistent temp tables that are immediately dropped) BUT that include the LookML-defined aggregation measures with all user manipulations (filters, dimensions, etc.).
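
For reference, the short-lived-PDT workaround might look like the following, a minimal sketch with hypothetical view/table names (persist_for keeps the materialized table for only a couple of seconds, so it is rebuilt for nearly every query):

view: pre_agg_shortlived {
  derived_table: {
    # Rebuilt almost every query; as noted above, not seamless in practice.
    persist_for: "2 seconds"
    sql:
      SELECT clientid, asofdate, SUM(value) AS client_total
      FROM my_schema.source_table  -- hypothetical source
      GROUP BY clientid, asofdate ;;
  }

  dimension: asofdate {
    type: date
    sql: ${TABLE}.asofdate ;;
  }

  measure: avg_client_total {
    type: average
    sql: ${TABLE}.client_total ;;
  }
}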


