Question

Only measuring the "latest" values

  • 19 December 2017
  • 8 replies
  • 4668 views


What’s in a measure?


All too often, we limit ourselves needlessly when we equate measures, metrics, or aggregates with the few basic aggregates that SQL provides out of the box: SUM, COUNT, AVG, maybe MEDIAN if we’re feeling creative and our dialect supports it.


For a moment, though, let’s step back and think about what measures are essentially for: taking a potentially large number of underlying data points, usually within a group/dimension, and condensing them down into a digestible datum. Thousands of orders? Total revenue. Millions of sessions? Average time on site. Hundreds of users per segment? A histogram of visit frequency. A dozen campaigns with hundreds of dates and thousands of visits? A sparkline of visits over the last 30 days.


OK, some of the above are a bit over the top, but the point is: your imagination is the limit. Here’s a simple way to build a measure that summarizes a set of data points by using only the latest one. Examples include inventory snapshots, balance sheets, and the historical state of attributes given a timestamped list of changes.


The “latest” measure


Let’s say you have inventory snapshots that you take for each piece of inventory at the end of each day:


INVENTORY_SNAPSHOT
inventory_id | taken_at   | amount
-------------|------------|-------
           1 | 2018-01-01 |     17
           2 | 2018-01-01 |      9
           3 | 2018-01-01 |     29
           1 | 2018-01-02 |     13
           2 | 2018-01-02 |     37
           3 | 2018-01-02 |     19
...

For any given window (or windows) of time, we want to aggregate all the rows for a given inventory ID by selecting the latest one. There are different ways to do this, but I like this one because it’s short, reusable, and has O(n) time complexity and O(1) memory complexity per group, as opposed to approaches based on sorting/ordering, which are typically O(n log n):


SELECT
  inventory_id,
  SUBSTRING(MAX( taken_at::varchar(10) || amount ), 11)::int AS ending_amount
FROM inventory_snapshot
-- (plus any date/window filters)
GROUP BY 1
-- Written for Redshift. The :: operator is type casting, || is string concatenation

Let’s unpack that a bit: for all the rows matching the filter within a given inventory ID group, we first construct a concatenated field from a fixed-length version of the sorting field (taken_at) and the field we care about (amount). With this calculated field, the MAX aggregate function gets us the latest row for each inventory ID, and SUBSTRING then discards the date prefix, leaving just the value.
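To make that concrete, here is how the expression evaluates for inventory_id 1 in the sample table above (a walkthrough of the same expression, not additional syntax):

taken_at::varchar(10) || amount  ->  '2018-01-0117', '2018-01-0213'
MAX(...)                         ->  '2018-01-0213'  -- dates are fixed-width, so string order matches date order
SUBSTRING(..., 11)::int          ->  13              -- the amount from the latest snapshot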


Using it in LookML


All well and good, but how do we translate this to LookML? You’ll notice that the “latest” aggregate is tightly bound to one grouping (inventory ID in our example), so it’s not a great candidate to be a measure directly in our view, where users could pair it with arbitrary dimensions. Instead, it makes sense to put that aggregation and grouping inside a derived table, where the pairing will be hard-coded.


On the other hand, the aggregate is loosely bound to the time grouping, since we want users to be able to swap in different levels of granularity, or even no time grouping at all for just the absolute latest data. In addition, this inventory-by-time aggregate often needs to be further aggregated to other uncoupled levels of grouping. With all that in mind, here is one way to tie these things together, using the _in_query Liquid variable to write some “dynamic” SQL:


explore: inventory {
  join: inventory_snapshot_dynamic {
    type: left_outer
    relationship: one_to_many
    sql_on: ${inventory_snapshot_dynamic.inventory_id} = ${inventory.id} ;;
  }
}

view: inventory_snapshot_dynamic {
  derived_table: {
    sql:
      SELECT inventory_id,
      {% if inventory_snapshot_dynamic.dynamic_granularity_date._in_query %}
        DATE_TRUNC('day', taken_at)
      {% elsif inventory_snapshot_dynamic.dynamic_granularity_week._in_query %}
        DATE_TRUNC('week', taken_at)
      {% elsif inventory_snapshot_dynamic.dynamic_granularity_month._in_query %}
        DATE_TRUNC('month', taken_at)
      {% elsif inventory_snapshot_dynamic.dynamic_granularity_quarter._in_query %}
        DATE_TRUNC('quarter', taken_at)
      {% elsif inventory_snapshot_dynamic.dynamic_granularity_year._in_query %}
        DATE_TRUNC('year', taken_at)
      {% else %} 0
      {% endif %} AS dynamic_granularity
      {% if inventory_snapshot_dynamic.sum_ending_amount._in_query %}
        , SUBSTRING(MAX( taken_at::varchar(10) || amount ), 11)::int AS ending_amount
      {% endif %}
      FROM inventory_snapshot
      WHERE {% condition inventory_snapshot_dynamic.dynamic_granularity_date %} taken_at {% endcondition %}
        AND {% condition inventory_snapshot_dynamic.dynamic_granularity_week %} taken_at {% endcondition %}
        AND {% condition inventory_snapshot_dynamic.dynamic_granularity_month %} taken_at {% endcondition %}
        AND {% condition inventory_snapshot_dynamic.dynamic_granularity_quarter %} taken_at {% endcondition %}
        AND {% condition inventory_snapshot_dynamic.dynamic_granularity_year %} taken_at {% endcondition %}
      GROUP BY 1, 2
    ;;
  }
  dimension: inventory_id {}
  dimension_group: dynamic_granularity {
    group_label: "History Granularity"
    type: time
    datatype: date
    sql: ${TABLE}.dynamic_granularity ;;
    timeframes: [year, quarter, month, week, date]
  }
  measure: sum_ending_amount {
    type: sum
    sql: ${TABLE}.ending_amount ;;
  }
}
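For intuition, if a user selects the month timeframe and the Sum Ending Amount measure, the Liquid above renders the derived table roughly like this (an illustrative sketch, not verbatim Looker output; each condition tag renders as the user’s filter, or as a no-op when unfiltered):

SELECT inventory_id,
  DATE_TRUNC('month', taken_at) AS dynamic_granularity,
  SUBSTRING(MAX( taken_at::varchar(10) || amount ), 11)::int AS ending_amount
FROM inventory_snapshot
WHERE 1=1 -- plus the rendered date filters, if any
GROUP BY 1, 2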

Sharing is caring


The above works well enough in isolation, but when you want this and other similar dynamic views to be composable into one explore, it helps to isolate and externalize the date fields so multiple views can reference them. Here’s an illustrative partial example:


explore: inventory {
  join: dynamic_granularity {sql: ;; relationship: one_to_one} # Field-only view
  join: inventory_snapshot_dynamic {
    type: left_outer
    relationship: one_to_many
    sql_on: ${inventory_snapshot_dynamic.inventory_id} = ${inventory.id} ;;
  }
  join: sales_dynamic {
    type: left_outer
    relationship: one_to_many
    sql_on: ${sales_dynamic.inventory_id} = ${inventory.id}
      {% if inventory_snapshot_dynamic._in_query %}
      AND sales_dynamic.dynamic_granularity = inventory_snapshot_dynamic.dynamic_granularity
      {% endif %} ;;
  }
}

view: dynamic_granularity {
  label: "[History]"
  dimension_group: dynamic_granularity {
    group_label: "Granularity"
    type: time
    datatype: date
    sql: COALESCE(
      {% if inventory_snapshot_dynamic._in_query %}
        inventory_snapshot_dynamic.dynamic_granularity,
      {% endif %}
      {% if sales_dynamic._in_query %}
        sales_dynamic.dynamic_granularity,
      {% endif %}
      NULL ) ;;
    timeframes: [year, quarter, month, date]
  }
}

view: inventory_snapshot_dynamic {
  derived_table: {
    sql:
      SELECT inventory_id,
      {% if dynamic_granularity.dynamic_granularity_date._in_query %}
        DATE_TRUNC('day', taken_at)
      {% elsif dynamic_granularity.dynamic_granularity_week._in_query %}
        DATE_TRUNC('week', taken_at)
      ...

Footnotes



  1. The term “semi-additive measure” is often invoked in these kinds of discussions


8 replies

Fabio,


Thanks for taking the time to post this. Very helpful.


I tried a somewhat similar approach to this same problem and came across an edge case (described below) that didn’t work. I’d like to know whether this case would work with your approach, but it’s hard for me to tell just by looking at the code. The case goes like this:


Test Case

The user selects date matching “last 7 days” in the filter, but does not add the date to the report. In this case, the report incorrectly sums across the seven days.

If the user adds the date to the report, then it works as it should.


Have you encountered this case? Are you able to verify whether or not it works under this approach? I would very much appreciate any help you can provide.


Hi, good catch! I believe this should be addressable.



  1. Use field._is_selected rather than field._in_query in the subquery’s SELECT clause

  2. Create an additional filter field that users should use for date filtering

    2a. Make it clear that this field, rather than the others, is the one to filter on by adding it to the explore’s always_filter

    2b. Unlike with dimensions, when you specify the sql of a filter field, it overrides the SQL that gets put into the outer WHERE clause. We’ll use sql: TRUE ;; since we don’t want to filter in the outer query, where the dates may already have been transformed

    2c. Inside the subquery (or subqueries), add WHERE {% condition new_filter_field %} date_column_to_filter {% endcondition %}


I don’t have a test dataset available just now to verify the above suggestion, but I believe it should work as intended; see the sketch below. Let me know if you run into any issues!
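For concreteness, here’s a minimal sketch of steps 2b and 2c, reusing the snapshot view from the article (the date_filter field name is illustrative):

view: inventory_snapshot_dynamic {
  derived_table: {
    sql:
      SELECT inventory_id,
        ... -- granularity and latest-value expressions as before
      FROM inventory_snapshot
      -- 2c: apply the user's date filter inside the subquery, before dates are transformed
      WHERE {% condition inventory_snapshot_dynamic.date_filter %} taken_at {% endcondition %}
      GROUP BY 1, 2
    ;;
  }
  # 2b: sql: TRUE ;; overrides the outer WHERE clause, so the filter only takes effect inside the subquery
  filter: date_filter {
    type: date
    sql: TRUE ;;
  }
}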

Fabio,

I very much appreciated your quick response. Thank you.

I’ll try to take this approach and adapt it to my situation. Thanks again for the tip.

Juan


@fabio

Is this still the best design pattern for handling semi-additive facts? I’m having trouble getting this to work as expected.

I’m trying to generalize this against a date dimension instead of for each individual fact.


@la5rocks

Hi David, long time no talk!

As far as the measure itself goes: yes, I would say this CONCAT+MAX approach is still what I would recommend. I have found it convenient to write some UDFs to make the expression more usable, but that depends on which SQL engine/dialect you’re working with.
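For example, in Redshift you could wrap the two halves of the expression in scalar SQL UDFs (hypothetical function names; adjust the argument types to your schema):

-- Hypothetical helpers for the CONCAT+MAX pattern (Redshift scalar SQL UDFs)
CREATE OR REPLACE FUNCTION f_latest_key(date, int)
RETURNS varchar
IMMUTABLE
AS $$ SELECT $1::varchar(10) || $2::varchar $$ LANGUAGE sql;

CREATE OR REPLACE FUNCTION f_latest_value(varchar)
RETURNS int
IMMUTABLE
AS $$ SELECT SUBSTRING($1, 11)::int $$ LANGUAGE sql;

-- Usage: f_latest_value(MAX(f_latest_key(taken_at, amount)))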

As far as making the date logic generalized, one approach that has worked really well for me is specifying the date fields like so:

view: date {
  view_label: "[Date]"

  filter: view_description {
    label: "(info) -➤"
    description: "This is a co-dimension view. These fields combine the primary date dimensions from one or more tables (e.g., views, orders, registrations, etc.)"
    sql: TRUE /*This field is just a description, not useful in a query */;;
  }

  filter: date_filter {
    # Useful if your tables tend to have date-based sorting/clustering/partitioning; for use with always_filter
    type: date_time
    datatype: timestamp
    label: "Date Limit"
    sql: COALESCE({% condition %} ${TABLE} {% endcondition %}, TRUE) ;; # TRUE if null, i.e. when applied to a row with no date column
  }

  dimension_group: _ {
    type: time
    datatype: timestamp
    sql: ${TABLE} ;; #### Notice we just reference the table alias directly ####
    timeframes: [week,date,raw]
  }

  # Any other date-based fields

}




And then, at the explore level, applying the date fields to the relevant tables like so, after having joined in those tables on separate rows (e.g. using an outer join on FALSE, or using “join paths”):
 

explore: foo {
  ...
  join: date {
    type: cross
    relationship: one_to_one
    sql_table_name: UNNEST([COALESCE(
      {% if views._in_query %} views.created_at, {% endif %}
      {% if orders._in_query %} orders.ordered_at, {% endif %}
      {% if registrations._in_query %} registrations.created_at, {% endif %}
      CAST(NULL AS TIMESTAMP)
    )]) ;;
  }
}

The above is BigQuery syntax, but any engine with a lateral join or cross apply syntax should work as well.
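For instance, a rough Postgres-flavored sketch of the same join (assuming the same column names; LATERAL lets the subquery reference the previously joined tables):

-- Hypothetical LATERAL version of the UNNEST join above
LEFT JOIN LATERAL (
  SELECT COALESCE(views.created_at, orders.ordered_at, registrations.created_at) AS date_value
) AS "date" ON TRUE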

By specifying the tables in the join at the explore level, you can avoid creating a large number of views that need to be maintained just for the one explore. And by having a convention across your project of always using the “date” alias, you can reuse date logic pretty conveniently.
 


How does this deal with the additive part of semi-additive?

I’d like my measures to aggregate over all dimensions except time. Is that where you’ve used UDFs to work that out? Otherwise this gets problematic with nested aggregation.


The additive part would still live in a derived table, as suggested in the article. Really, the only new thing I’ve suggested in my latest comment is a slightly more reusable approach to declaring and joining in the field-only view, so that you can depend on it always having the same name.

 

But perhaps I’m not understanding the hurdle you’re trying to overcome. Do you have an example you could share with me?

Hi,

 

Now I understand that I misinterpreted the value.

 

Thanks
