[Analytic Block] Creating a Median Aggregate Function in BigQuery with a User Defined Function

  • 24 January 2017
  • 9 replies
  • 688 views

Userlevel 6
Badge

About this Block

 

BigQuery allows for the creation of User Defined SQL function written in either SQL or Javascript. This block shows how to create and use a MEDIAN function written in SQL. 

How it works:

 

 

LookML explores have a parameter sql_preamble: that allow you to specify sql statements that execute before the query is run. BigQuery has a CREATE TEMP FUNCTION statement that lets you create user defined functions. We have written a function computes median and added it to the explore.

 

 

Every query that runs from this explore will have the median function inserted before the SQL code for the query.

 

 

explore: sf_salary {

sql_preamble:

CREATE TEMP FUNCTION MEDIAN(a_num ARRAY<FLOAT64>)

RETURNS FLOAT64 AS ((

SELECT

AVG(num)

FROM (

SELECT

row_number() OVER (ORDER BY num) -1 as rn

, num

FROM UNNEST(a_num) num

)

WHERE

rn = TRUNC(ARRAY_LENGTH(a_num)/2)

OR (

MOD(ARRAY_LENGTH(a_num), 2) = 0 AND

rn = TRUNC(ARRAY_LENGTH(a_num)/2)-1 )

));

;;

}

 

Calling the Function

 

 

Functions in BigQuery can take Array of any type (read about arrays in BigQuery here). We first collect all the numeric values into an array, useing the ARRAY_AGG() and then pass the array to the MEDIAN() user defined function which scans the array and looks for the median value.

 

 

  measure: average_base_pay {

type: average

sql: ${pay} ;;

value_format_name: "decimal_0"

}



measure: median_base_pay {

type: number

sql: MEDIAN(ARRAY_AGG(${pay})) ;;

value_format_name: "decimal_0"

}

 

The LookML Code

 

 

connection: "bigquery_publicdata_standard_sql"





explore: sf_salary {

sql_preamble:

CREATE TEMP FUNCTION MEDIAN(a_num ARRAY<FLOAT64>)

RETURNS FLOAT64 AS ((

SELECT

AVG(num)

FROM (

SELECT

row_number() OVER (ORDER BY num) -1 as rn

, num

FROM UNNEST(a_num) num

)

WHERE

rn = TRUNC(ARRAY_LENGTH(a_num)/2)

OR (

MOD(ARRAY_LENGTH(a_num), 2) = 0 AND

rn = TRUNC(ARRAY_LENGTH(a_num)/2)-1 )

));

;;

}



view: sf_salary {

sql_table_name: `lookerdata.sfsalary.salaries` ;;



measure: count {

type: count

drill_fields: [detail*]

}



dimension: id {

type: number

sql: ${TABLE}.Id ;;

}



dimension: employee_name {

type: string

sql: ${TABLE}.EmployeeName ;;

}



dimension: job_title {

type: string

sql: ${TABLE}.JobTitle ;;

}



# Pay is stored as a string, safely convert it to a float.

dimension: pay {

type: string

sql: CASE WHEN

REGEXP_CONTAINS(${TABLE}.TotalPay, r'^[\d\.]+$')

THEN CAST(${TABLE}.TotalPay AS FLOAT64)

END ;;

}



measure: average_base_pay {

type: average

sql: ${pay} ;;

value_format_name: "decimal_0"

}



measure: median_base_pay {

type: number

sql: MEDIAN(ARRAY_AGG(${pay})) ;;

value_format_name: "decimal_0"

}



dimension: year {

type: string

sql: ${TABLE}.Year ;;

}



dimension: status {

type: string

sql: ${TABLE}.Status ;;

}



set: detail {

fields: [

id,

employee_name,

job_title,

pay,

year,

status

]

}

}

 

The SQL for the Query

 

 

CREATE TEMP FUNCTION MEDIAN(a_num ARRAY<FLOAT64>)

RETURNS FLOAT64 AS ((

SELECT

AVG(num)

FROM (

SELECT

row_number() OVER (ORDER BY num) -1 as rn

, num

FROM UNNEST(a_num) num

)

WHERE

rn = TRUNC(ARRAY_LENGTH(a_num)/2)

OR (

MOD(ARRAY_LENGTH(a_num), 2) = 0 AND

rn = TRUNC(ARRAY_LENGTH(a_num)/2)-1 )

));



SELECT

sf_salary.JobTitle AS sf_salary_job_title,

COUNT(*) AS sf_salary_count,

MEDIAN(ARRAY_AGG((CASE WHEN

REGEXP_CONTAINS(sf_salary.TotalPay, r'^[\d\.]+$')

THEN CAST(sf_salary.TotalPay AS FLOAT64)

END))) AS sf_salary_median_base_pay,

AVG((CASE WHEN

REGEXP_CONTAINS(sf_salary.TotalPay, r'^[\d\.]+$')

THEN CAST(sf_salary.TotalPay AS FLOAT64)

END) ) AS sf_salary_average_base_pay

FROM `lookerdata.sfsalary.salaries` AS sf_salary



GROUP BY 1

ORDER BY 2 DESC

LIMIT 500


9 replies

Userlevel 2

Hi Lloyd, this is the good option in order to include UDF in SQL. But in LookML it seems like this parameter (sql_preamble) was EXPERIMENTAL and LookML mentions that it will be removed in future release. Due to this I am getting LookML error saying “Invalid property of explore:sql_preamble” and I am not able to push code to Production.



Any workaround for this. I am trying to use JS UDF in google BQ.

Userlevel 6
Badge

@khaah7 - In 4.6, sql_preamble: is implemented (and the warning has been removed). We’ve been playing with it for a while and figured it was good enough. Most other dialects let you install UDFs, but because BigQuery is stateless, we decided to implement this as sql_preamble:



Feel free to use it knowing the warning will disappear next release.

Userlevel 2

Hi Lloyed, is it possible to do that in version 4.2 as that’s what we are currenly using or have to upgrade to 4.6 in order to use that. Currently I am getting error in 4.2 and without resolving this error, I can not push LookML to production.

Userlevel 6
Badge

Yes, you can turn off validation until you get to 4.6. Go into “project settings” and uncheck ‘Require LookML Validation to Commit’. You can turn it back on after you upgrade.



Userlevel 2

Hi Lloyd, this will be grate. I am not seeing option to save/update setting after unchecking the option in project setting. I have admin role.

Userlevel 6
Badge

We’ll reach out. I’m not sure why you are seeing this.

Hey @kshah7,



The project settings are located under the Projects drop-down arrow. If you scroll down on the page, there should be an option to “Update Project Settings.”



If the option still isn’t there, there might be something else going on. If so, can you please email help.looker.com so we can take a closer look!



Thanks!





Adding on to this. Here is something I recently did using Javascript to create custom SQL functions within BigQuery for use in Looker.



  sql_preamble:

CREATE TEMPORARY FUNCTION url_decode(enc STRING)

RETURNS STRING

LANGUAGE js AS """

try {

return decodeURI(enc)

} catch (e) { return null }

return null;

""";

;;

for the measure, it might be useful to include IGNORE NULLS to ensure that the median is calculated as intended. For example, sum(value) and avg(value) ignore nulls. Similarly, it would make sense for median(value) to do the same.



MEDIAN(ARRAY_AGG(${pay}) IGNORE NULLS))

Reply