[Analytic Block] Weighted Medians in BigQuery with UDFs

  • 25 January 2017
  • 1 reply
  • 321 views

Userlevel 6
Badge

Weighted Medians can be pretty helpful in data analysis. The example we are going to use here is the Names, populations and years. In any given year, some number of people are born with a given name. A weighted median answers the question:



What year where there as many people born with the name [James] as there were in following years?



We’re going to build a generic measure that can answer this question, regardless of how the data is aggregated.


Eyeballing the Data.


The graphs below show population and years for the names James and Isabel. We can see that the median year for James is 1953 (about 1/2 the area) and the median year for Isabel is 1998.



Explore Data in Full Screen



Explore Data in Full Screen


Let’s first Play with it First


Median Year lets us find names that were popular in given eras and areas. Below is a table that shows some names and states and the median year in for a given name and state. If you sort the columns, you can see where and when names originated or at least where they peaked.



Explore Data in Full Screen


The Complete LookML Model


connection: "bigquery_publicdata_standard_sql"

persist_for: "24 hours"

explore: custom_functions {
extension: required
sql_preamble:
CREATE TEMP FUNCTION _pairs_sum_float(a ARRAY<STRUCT<num FLOAT64, weight FLOAT64>>)
RETURNS ARRAY<STRUCT<num FLOAT64, weight FLOAT64>> AS ((
SELECT
ARRAY_AGG(STRUCT(num,weight))
FROM (
SELECT
num
, SUM(weight) as weight
FROM UNNEST(a)
GROUP BY 1
ORDER BY 2 DESC
)
));


CREATE TEMP FUNCTION MEDIAN_WEIGHTED(a_nums ARRAY<STRUCT<num FLOAT64, weight FLOAT64>>)
RETURNS FLOAT64 AS ((
SELECT
num
FROM (
SELECT
MAX(cumulative_weight) OVER() max_weight
, cumulative_weight
, num
FROM (
SELECT
SUM(weight) OVER (ORDER BY num) as cumulative_weight
, weight
, num
FROM UNNEST(_pairs_sum_float(a_nums)) a
ORDER BY num
)
)
WHERE cumulative_weight > max_weight/2
ORDER BY num
LIMIT 1
));
;;
}

explore: names {
extends: [custom_functions]
}

view: names {
sql_table_name: `fh-bigquery.popular_names.usa_1910_2013` ;;
dimension: name {}
dimension: state {}
dimension: year {type:number}
dimension: population {
type: number
sql: ${TABLE}.number ;;
}

measure: total_population {
type: sum
sql: ${population} ;;
}

measure: median_year {
type: number
sql: MEDIAN_WEIGHTED(ARRAY_AGG(STRUCT(CAST(${year} as FLOAT64) as num, CAST(${population} as FLOAT64) as weight)));;
}
}


Things to Notice


For the most part, this is a standard LookML model. Some things you might notice.


The Function Declarations


We’ve put the custom functions into a single explore


We’ve put the custom functions into an abstract explore (using extensions: required), this will let use these functions in multiple different explores without having to copy code around.


explore: custom_functions {
extension: required
sql_preamble:
CREATE TEMP FUNCTION _pairs_sum_float(a ARRAY<STRUCT<num FLOAT64, weight FLOAT64>>)
RETURNS ARRAY<STRUCT<num FLOAT64, weight FLOAT64>> AS ((
SELECT
ARRAY_AGG(STRUCT(num,weight))
FROM (
SELECT
num
, SUM(weight) as weight
FROM UNNEST(a)
GROUP BY 1
ORDER BY 2 DESC
)
));


CREATE TEMP FUNCTION MEDIAN_WEIGHTED(a_nums ARRAY<STRUCT<num FLOAT64, weight FLOAT64>>)
RETURNS FLOAT64 AS ((
SELECT
num
FROM (
SELECT
MAX(cumulative_weight) OVER() max_weight
, cumulative_weight
, num
FROM (
SELECT
SUM(weight) OVER (ORDER BY num) as cumulative_weight
, weight
, num
FROM UNNEST(_pairs_sum_float(a_nums)) a
ORDER BY num
)
)
WHERE cumulative_weight > max_weight/2
ORDER BY num
LIMIT 1
));
;;
}

When we want to use a custom function in an explore, we extend from it.


explore: names {
extends: [custom_functions]
}

The MEDIAN_WEIGHTED function takes an array of structs as a parameter.


To create aggregate UDFs in BigQuery, you aggregate data into an array and pass it to a function. In this case for each data point, we need the number and the weight we want to give that number. We first combine the data point into a STRUCT and then aggregate them all into a group with the ARRAY_AGG function. Then calls MEDIAN_WEIGHT. Type checking on BigQuery UDFs is very strict so we have to cast integer numbers to floats to be able to call the function.


  measure: median_year {
type: number
sql: MEDIAN_WEIGHTED(ARRAY_AGG(STRUCT(CAST(${year} as FLOAT64) as num, CAST(${population} as FLOAT64) as weight)));;
}

1 reply

Userlevel 4

Good news! Google BigQuery has released a feature which allows you to avoid using sql_preamble. You can just create the function once and it will persist! You could probably even run the create statement as the Looker BQ user via the SQL runner and it should just work.


Reply