Name Game: step 4 - normalizing populations with cohorts

  • 28 March 2022
  • 0 replies
  • 4 views

Userlevel 5

This content, written by Lloyd Tabb, was initially posted in Looker Blog on Jan 15, 2016. The content is subject to limited support.

A cohort is a group of people banded together and treated as a group. In this model, we are going to treat all people of a given gender born in a particular year in a particular state as a cohort.

Cohort table: the code

We're going to use to create a cohort table that takes the raw data and computes the number of people in each cohort. We'll declare dimensions for the cohorts so we can join this table to our usual names table. We'll make the year, state, and gender dimensions hidden so they do not appear in the field picker.

We'll also declare a measure cohort_size, so we can combine smaller cohorts into larger ones. This computation is very difficult to do in other systems but correctly handles this computation.

view: cohorts {
  derived_table: {
    sql: SELECT
        year, state, gender,
        SUM(number) as number,
        row_number() OVER () as id
      FROM names
      GROUP BY 1,2,3
       ;;
  }

  dimension: id {
    primary_key: yes
  }

  dimension: year {
    hidden: yes
  }

  dimension: state {
    hidden: yes
  }

  dimension: gender {
    hidden: yes
  }

  dimension: number {
    #X# Invalid LookML inside "dimension": {"type":null}
  }

  measure: cohort_size {
    type: sum
    sql: ${number} ;;
  }
}

Code: the join

We're going to join each name to the cohort that has the same gender, year, and state:

# build a percent of cohort to normalize data
explore: names_step_4 {
  join: cohorts {
    sql_on: ${state} = ${cohorts.state}
      AND ${year} = ${cohorts.year}
      AND ${gender} = ${cohorts.gender}
       ;;
    relationship: many_to_one
  }
}

The view

And finally, we simply add a single measure, percentage_of_cohort. This measure will tell us what percentage this particular name or names were of the total cohort. Looker correctly computes this measure no matter what filters we add or which dimensions we group by.

# build out cohort analysis to normalize data by cohort
#  so you can compare names popularity between states or years
#
view: names_step_4 {
  extends: [names_step_3]

  measure: percentage_of_cohort {
    type: number
    sql: ${total_population} / ${cohorts.cohort_size} ;;
    value_format_name: percent_2
  }
}

Explore!

All of the queries below can be created by . Select dimensions and measures, add filters and bonk the 'Run Button'.

So far in this analysis, we've been using population to tell us the popularity of a given name. However, looking at this cohort graph we can see that we've been wrong. One line shows that the population of people with the name James has risen and fallen. The other line shows that from 1920 to 1950 fully 3% of all people were named 'James' but since then the popularity of the name has been on decline. Using cohorts, we can normalize the data to the population size and see actual trends in popularity.

The cohorts work regardless of how we group the data. Here is a graph of popularity of the name 'James' by state. Percentages are against all the male people born in that state for all time.

We can see the changes in popularity of the name James over time by region. The popularity calculation works regardless of how the data is grouped or filtered.

Here's a list of all of the steps:


0 replies

Be the first to reply!

Reply