Creating Dynamic Tiers

rachel_johnson · 10-16-2017 05:05 PM

The content of this article has been updated and migrated to a Looker Help Center article.

But, the information in the comments is useful, so the article will remain here for discussion. For the actual contents, please visit the Help Center.

jesse_carah

I created a similar pattern which uses the modulo operation to build dynamic tiers. The big difference is that you get concatenated tiers like 10-20, 20-30, etc. This is what the tiers look like:

Mathematical Approach:

Like I said before, I use the modulo operation (denoted by %) to calculate the tiers. The approach can be explained with some simple math:

Let r represent the user-defined bucket size
Let n represent the the value for which we want to assign a bucket

We’ll first calculate the lower bound of the bucket with the following formula:

n - n % r

We then calculate the upper bound with this formula:

n - n % r + r

Example:

Let r = 5
Let n = 33
n % r = 33 % 5 = 3

This implies:

Lower bound = 33 - 3 = 30
Upper bound = 33 - 3 + 5 = 35
When then concatenate these two values and get 30 - 35 as our bucket.

The LookML

First, I create a parameter that enables the end user to select his or her desired bucket size:

  parameter: bucket_size {
    default_value: "10"
    type: number
  }

I then build a dimension that takes the user input from the bucket_size parameter and calculates even-sized buckets.

 dimension: dynamic_bucket  {
    sql:
        concat(${sale_price} - mod(${sale_price},{% parameter bucket_size %}),
          '-', ${sale_price} - mod(${sale_price},{% parameter bucket_size %}) + {% parameter bucket_size %})
      ;;
  order_by_field: dynamic_sort_field   
  }

One annoyance is that this dimension is a string, which can cause issues with sorting. For example, the value 100-110 would come before 90-100 if we applied an ascending sort on our bucket. For this reason, we’ll need to build a sort field which determines – you guessed it – how our dimension will sort.

  dimension: dynamic_sort_field {
    sql: 
      ${sale_price} - mod(${sale_price},{% parameter bucket_size %});;
      type: number
    hidden: yes
  }

Because our sort field is now a number instead of a string, the dynamic bucket column will sort as intended. It’s also worth noting that this dimension has the same cardinality as our bucket, which ensures that our result set does not fan out. Check out this post to learn more about fanouts.

–

That’s it! Go ahead and bucket away to your heart’s desire.

mplooker

or if anyone is using sql server which doesn’t have the exact same MOD and TRUNCATE functions, here is a pattern that provides a concatenated label as well as the numerical sort:

dimension:  MEASURE_raw  {
    type: number
    group_label: "MEASURE"
    hidden: no
    sql: ${TABLE}.MEASURE
  }

parameter: MEASURE_bucket_size {
    default_value: "2"
    type: number
  }

  dimension: MEASURE_dynamic_bucket  {
    description: "tiers of MEASURE"
    group_label: "MEASURE"
    sql:
    cast(
    ROUND(${MEASURE_raw}/ {% parameter MEASURE_bucket_size %}, 0) * {% parameter MEASURE_bucket_size %}
    as varchar(256)
    )
    +'-'+
    cast(
    ROUND(${MEASURE_raw}/ {% parameter MEASURE_bucket_size %}, 0) * {% parameter MEASURE_bucket_size %}
    + {% parameter MEASURE_bucket_size %}
    as varchar(256)
    )
    +'%'

    ;;
    order_by_field: MEASURE_bucket_sort
  }

  dimension: MEASURE_bucket_sort {
    group_label: "MEASURE"
    sql:
      ROUND(${MEASURE_raw}/ {% parameter MEASURE_bucket_size %}, 0) * {% parameter MEASURE_bucket_size %};;
    type: number
    hidden: no
  }

davemasom

Hi all, this was really helpful. Is there a way to allow a user to specify a number of tiers they want and have Looker calculate the correct size of the tiers? This was one of the first requests we got from business users when we put them through training.

For example - “split order items into 10 equally-sized buckets”

bens1

Hey Dave, I’d be curious to hear the answer to this as well. I agree, it’ll be pretty amazing to have a user-facing way of bucket creation. I’m picturing something like this:

User chooses dimension “Subdivide”
User chooses “by cumulative values (i.e. equal total value in each bin)” or “by bin size” with a parameter, and the bin assignment per record is done in one of the following ways depending on the selection:
3a. for cumulative values, the system would prompt for a number of bins by pinning a new parameter to the filters, and essentially produce a cumulative histogram
3b. the dynamic tiers discussion above should work for “by bin size”

PS thanks to the discussion participants above for the head-start on this!

izzymiller

I bet there’s a way to do this using a parameter to take the input of # of tiers and insert it into some custom written SQL, similar to the patterns above but altered to take tier # of tiers instead of bucket size. If that gets anyones creative juices flowing, I’d love to see it too!

If not, maybe I’ll give it a try

shiggins

@davemasom @bens1 Not sure if you ended up with a solution you were happy with, but I played around with this using NTILE for Redshift and it’s relatively straight forward to have the tiers be dynamically calculated so there’s histogram equalisation happening.

Here’s the important parts of code:

  derived_table: {
    sql: SELECT ntile,
                  (min(age)|| '-' ||max(age)) as cohort,
                 Count(*) as total_orders,
                 sum(revenue) as revenue
          FROM   (SELECT age,
                         Ntile({{ num_buckets._parameter_value}}) OVER (ORDER BY age),
                         sale_price as revenue
                  FROM   order_items o
                  JOIN users u
                  ON u.id = o.user_id
                  ) x
          GROUP  BY 1
        ;;
  }
  parameter: num_buckets {
    description: "Select the number of buckets you want the orders to be broken down by i.e. how many cohorts do you want to see? More = granular, less = coarse."
    type: unquoted
    allowed_value: {value:"3"}
    allowed_value: {value:"5"}
    allowed_value: {value:"8"}
    allowed_value: {value:"10"}
    allowed_value: {value:"15"}
  }```

By default this is going to get the `NTILE` based on order count because of the tables, but additional parameters and other logic could make it flexible for a user to change what is being used for the '10 equally-sized buckets'.

davemasom

Very cool! I never found a solution - I will check this one out. Thanks!

shiggins

I should add, Redshift doesn’t support width_bucket() which allows you to adjust the buckets based on a cumulative value (e.g. revenue per user) as opposed to just an even distribution of rows per partition. To my knowledge, Snowflake, Postgres and Oracle support this.

davisclark

Hi Sean,

Why do the intervals in your histogram appear to fluctuate in size? Furthermore, Why are the values in $s and not frequency counts?

Thanks in advance.

shiggins

@davisclark the intervals are being set dynamically to answer Ben’s use case:

3b. the dynamic tiers discussion above should work for “by bin size”

The value format is just dictated in the Measure so that could be in $s or integer format, whatever is preferred.

Actually an update to this, there’s a new feature landing in 6.20 called bind_all_filters that’d make this whole calculation of tiers and buckets far more flexible. Mega genius @Bryan_Weber wrote this article that everyone should implement right now - [Analytic Block] - Pivot by Top X - Introducing bind_all_filters: yes

davisclark

Thanks, @shiggins. I appreciate the reply.

Dynamic Tiers and Histograms are fundamentally different analyses. A user should be able to enter a list of dynamic tiers when variable intervals are desired.

Histograms are used to visualize a distribution & conduct frequency analysis. While Histograms can technically feature bins of variable width, they rarely do and, when they do, the bar’s width must vary in proportion to the size of the interval that it covers. In this situation, the frequency is encoded as the area of the bar, not the height.

In your visualization, with 5 bins, the final interval is 55-101, whereas the first is 12-16 and the second 16-36. Not only does this chart fail to convey the shape of the distribution, it does so in the most misleading way possible, allowing the user to input number of bins into a parameter labeled Histogram Equalisation, while using equal columns widths in the chart itself.

Again I appreciate the reply, but I feel this post, absent a full disclosure, is harmful to the broader Looker community.

shiggins

@davisclark granted, this wasn’t intended to be an example of a histogram, so I can see where I was confusing and possibly conflating the two asks, but an example of generating dynamic # of buckets and the functions that can be used to calculate those buckets upstream based on the original question.
Looker doesn’t have a histogram vis so the only way I could show the fact those buckets have equal amount of records (by way of NTILE or width_bucket) is showing bars (green) that are equal across those buckets. Probably better to just remove that gif, to avoid any further confusion.

white1

@davemasom @izzymiller I’ve posted on how to dynamically create tiers based on end-user input over here.

Hope this helps!