[Analytic Block] Weighted Medians in BigQuery with UDFs

Weighted Medians can be pretty helpful in data analysis. The example we are going to use here is the Names, populations and years. In any given year, some number of people are born with a given name. A weighted median answers the question:

What year where there as many people born with the name [James] as there were in following years?

We’re going to build a generic measure that can answer this question, regardless of how the data is aggregated.

Eyeballing the Data.

The graphs below show population and years for the names James and Isabel. We can see that the median year for James is 1953 (about 1/2 the area) and the median year for Isabel is 1998.

Explore Data in Full Screen

Explore Data in Full Screen

Let’s first Play with it First

Median Year lets us find names that were popular in given eras and areas. Below is a table that shows some names and states and the median year in for a given name and state. If you sort the columns, you can see where and when names originated or at least where they peaked.

Explore Data in Full Screen

The Complete LookML Model

connection: "bigquery_publicdata_standard_sql"

persist_for: "24 hours"

explore: custom_functions {
  extension: required
  sql_preamble:
      CREATE TEMP FUNCTION _pairs_sum_float(a ARRAY<STRUCT<num FLOAT64, weight FLOAT64>>)
      RETURNS ARRAY<STRUCT<num FLOAT64, weight FLOAT64>> AS ((
        SELECT
           ARRAY_AGG(STRUCT(num,weight))
        FROM (
          SELECT
            num
            , SUM(weight) as weight
          FROM UNNEST(a)
          GROUP BY 1
          ORDER BY 2 DESC
        )
      ));


      CREATE TEMP FUNCTION MEDIAN_WEIGHTED(a_nums ARRAY<STRUCT<num FLOAT64, weight FLOAT64>>)
      RETURNS FLOAT64 AS ((
        SELECT
          num
        FROM (
         SELECT
            MAX(cumulative_weight) OVER() max_weight
            , cumulative_weight
            , num
          FROM (
            SELECT
              SUM(weight) OVER (ORDER BY num) as cumulative_weight
              , weight
              , num
            FROM UNNEST(_pairs_sum_float(a_nums)) a
            ORDER BY num
          )
        )
        WHERE cumulative_weight > max_weight/2
        ORDER BY num
        LIMIT 1
    ));
  ;;
}

explore: names {
  extends: [custom_functions]
}

view: names  {
  sql_table_name: `fh-bigquery.popular_names.usa_1910_2013` ;;
  dimension: name {}
  dimension: state {}
  dimension: year {type:number}
  dimension: population {
    type: number
    sql: ${TABLE}.number ;;
  }

  measure: total_population {
    type: sum
    sql: ${population} ;;
  }

  measure: median_year {
    type: number
    sql: MEDIAN_WEIGHTED(ARRAY_AGG(STRUCT(CAST(${year} as FLOAT64) as num, CAST(${population} as FLOAT64) as weight)));;
  }
}

Things to Notice

For the most part, this is a standard LookML model. Some things you might notice.

The Function Declarations

We’ve put the custom functions into a single explore

We’ve put the custom functions into an abstract explore (using extensions: required), this will let use these functions in multiple different explores without having to copy code around.

explore: custom_functions {
  extension: required
  sql_preamble:
      CREATE TEMP FUNCTION _pairs_sum_float(a ARRAY<STRUCT<num FLOAT64, weight FLOAT64>>)
      RETURNS ARRAY<STRUCT<num FLOAT64, weight FLOAT64>> AS ((
        SELECT
           ARRAY_AGG(STRUCT(num,weight))
        FROM (
          SELECT
            num
            , SUM(weight) as weight
          FROM UNNEST(a)
          GROUP BY 1
          ORDER BY 2 DESC
        )
      ));


      CREATE TEMP FUNCTION MEDIAN_WEIGHTED(a_nums ARRAY<STRUCT<num FLOAT64, weight FLOAT64>>)
      RETURNS FLOAT64 AS ((
        SELECT
          num
        FROM (
         SELECT
            MAX(cumulative_weight) OVER() max_weight
            , cumulative_weight
            , num
          FROM (
            SELECT
              SUM(weight) OVER (ORDER BY num) as cumulative_weight
              , weight
              , num
            FROM UNNEST(_pairs_sum_float(a_nums)) a
            ORDER BY num
          )
        )
        WHERE cumulative_weight > max_weight/2
        ORDER BY num
        LIMIT 1
    ));
  ;;
}

When we want to use a custom function in an explore, we extend from it.

explore: names {
  extends: [custom_functions]
}

The MEDIAN_WEIGHTED function takes an array of structs as a parameter.

To create aggregate UDFs in BigQuery, you aggregate data into an array and pass it to a function. In this case for each data point, we need the number and the weight we want to give that number. We first combine the data point into a STRUCT and then aggregate them all into a group with the ARRAY_AGG function. Then calls MEDIAN_WEIGHT. Type checking on BigQuery UDFs is very strict so we have to cast integer numbers to floats to be able to call the function.

  measure: median_year {
    type: number
    sql: MEDIAN_WEIGHTED(ARRAY_AGG(STRUCT(CAST(${year} as FLOAT64) as num, CAST(${population} as FLOAT64) as weight)));;
  }
1 1 624
1 REPLY 1

Good news! Google BigQuery has released a feature which allows you to avoid using sql_preamble. You can just create the function once and it will persist! You could probably even run the create statement as the Looker BQ user via the SQL runner and it should just work.

What’s happening in BigQuery: New persistent user-defined functions,...

The latest releases from BigQuery include new persistent user-defined functions (UDFs), increased concurrency limits, new GIS and encryption functions, and more.

Top Labels in this Space
Top Solution Authors