A window into the soul (of your data)

Brett_Sauve · ‎03-27-2022

This content, written by Brett Sauve, was initially posted in Looker Blog on Dec 2, 2014. The content is subject to limited support.

A lesser known feature of some SQL dialects is something called the "window function". While MySQL users will be left out in the cold, most other SQL dialects can take advantage of their power. They can be a little tricky to wrap your mind around at first, but certain calculations - which are very complex or impossible without window functions - can become straightforward.

Intriguing ...

To demonstrate the power of window functions, let's take a look at an example set of customer data:

name	status	lifetime_spend
Neil Armstrong	Platinum	1000.00
Buzz Aldrin	Platinum	2000.00
Yuri Gagarin	Platinum	3000.00
John Glenn	Gold	400.00
Alan Shepard	Gold	500.00
Jim Lovell	Gold	600.00

Now suppose you want to know how the customer ranks in spending against the other customers in their status. In other words, you're hoping for a result set that looks like this:

name	status	lifetime_spend	status_rank
Neil Armstrong	Platinum	1000.00	3
Buzz Aldrin	Platinum	2000.00	2
Yuri Gagarin	Platinum	3000.00	1
John Glenn	Gold	400.00	3
Alan Shepard	Gold	500.00	2
Jim Lovell	Gold	600.00	1

It's about to get real

Logically, this seems like a straightforward thing to do, so one might expect a fairly easy method in SQL. It is easy with a window function, but without one, it gets kind of nasty. If you do a quick Google search you're likely to come across several approaches. Some of them utilize concatenation hacks, clever self joins, and even user-defined variables. Did you even know SQL had variables you could define?

Just to demonstrate one option, which avoids variables for those who haven't used them, you could run a query like this:

SELECT   c1.name,

         c1.status,

         c1.lifetime_spend,

         COUNT(*) AS status_rank

FROM     customer AS c1

JOIN     customer AS c2

ON       c1.lifetime_spend <= c2.lifetime_spend AND

         c1.status = c2.status

GROUP BY c1.name,

         c1.status,

         c1.lifetime_spend

Demonstrating how this query works is not the primary focus of this entry, so we won't get into a step-by-step explanation. However, it can be helpful to consider how the JOIN works in this query. Before applying any column selections or grouping, you would arrive at a table like this:

c1.name	c1.status	c1.lifetime_spend	c2.name	c2.status	c1.lifetime_spend
Neil Armstrong	Platinum	1000.00	Neil Armstrong	Platinum	1000.00
Neil Armstrong	Platinum	1000.00	Buzz Aldrin	Platinum	2000.00
Neil Armstrong	Platinum	1000.00	Yuri Gagarin	Platinum	3000.00
Buzz Aldrin	Platinum	2000.00	Buzz Aldrin	Platinum	2000.00
Buzz Aldrin	Platinum	2000.00	Yuri Gagarin	Platinum	3000.00
Yuri Gagarin	Platinum	3000.00	Yuri Gagarin	Platinum	3000.00
...	...	...	...	...	...

Now it's a little easier to see that the number of times each name appears in the first column is the rank for that name, within its status.

Stop the pain

Certainly SQL queries can get a lot more complex than the one we just saw. However, to do something as straightforward as calculating a rank, it's a lot of fussing about. With a window function, this gets a lot easier and more elegant:

SELECT name,

       status,

       lifetime_spend,

       RANK() OVER(PARTITION BY status ORDER BY lifetime_spend DESC) AS status_rank

FROM   customer

That's it, no tricky logic, just exactly what you want.

Break it down now

Let's breakdown the concept of a window function to understand exactly how it works. The general idea is that you perform some sort of calculation over a group of rows in the table. Then, you list the result of that calculation next to each row that is a part of that group.

This is where window functions can get a little confusing. Typically, when you perform a calculation across multiple rows, you need to use it in combination with GROUP BY. Then your calculation (like COUNT or SUM) is run for each group. For example, you might run:

SELECT   status,

         SUM(lifetime_spend) AS total_spend

FROM     customer

GROUP BY status

This would create the following table:

status	total_spend
Platinum	6000.00
Gold	1500.00

Window functions perform those same types of calculations across multiple rows, but they return a result for each individual row in the table. In other words, they don't need to be used with GROUP BY. Instead of using GROUP BY, you specify your desired groups of aggregation by using PARTITION BY. For example, you could run:

SELECT name,

       status,

       lifetime_spend,

       SUM(lifetime_spend) OVER(PARTITION BY status) AS total_status_spend

FROM   customer

This would generate the following table:

name	status	lifetime_spend	total_status_spend
Neil Armstrong	Platinum	1000.00	6000.00
Buzz Aldrin	Platinum	2000.00	6000.00
Yuri Gagarin	Platinum	3000.00	6000.00
John Glenn	Gold	400.00	1500.00
Alan Shepard	Gold	500.00	1500.00
Jim Lovell	Gold	600.00	1500.00

This is part of the magic of a window function. You can perform aggregation without having to squish down your table via GROUP BY.

There are also some special window functions that let you do other interesting things. As we've seen, RANK lets us rank rows based on how they compare to other rows in their group. Other useful functions include LAG and LEAD, which let us get values from a row that is before or after the current row. NTILE lets you find the percentile / quartile / decile / etc a row is in within its group. Other similar functions, and their usage, are easily found on the web for your particular SQL dialect.

The window function in your life

Who doesn't love some good syntax?

The syntax of a window function works like this:
RANK() OVER(PARTITION BY status ORDER BY lifetime_spend DESC)

RANK(): Every window function starts off with the calculation to be performed. These can be the aggregate functions that you are probably familiar with, like COUNT, SUM, AVG, etc. You can also use the special window functions like RANK, LAG, LEAD, NTILE, etc.
OVER( … ): Every window function must use OVER so that SQL knows that you want to perform a window function. OVER also accepts the other settings for your window function, if needed. It is possible you won't need any additional options, and will just use OVER().
PARTITION BY status: If you want to perform your calculation over specific groups, you specify those groups using PARTITION BY. PARTITION BY accepts the name of a column, or multiple columns. The unique values of that column (or columns) is considered a group. In this sense, PARTITION BY is very similar to GROUP BY. If you want the calculation to be performed across the entire table, instead of for specific groups, you can just leave out this statement.
ORDER BY lifetime_spend DESC: Some calculations are sensitive to the order of rows within each group. For example, if you want to know something's RANK, you need to tell SQL how it should order the rows before it can determine who is in "first place", "second place", and so on. If you need to explain how the rows should be ordered within each group, use ORDER BY. Other types of calculations are not sensitive to the order of the rows. For example, it doesn't matter how the rows are ordered when using SUM, because a sum is the same no matter what order it's performed in. If you don't need an ORDER BY, you can exclude it.

Examples!

Show the total amount sold, for all orders, next to each row:

SUM(order_value) OVER()

Show the total amount sold, for each store, next to each row for that store:

SUM(order_value) OVER(PARTITION BY store)

Show the rank of each order according to its value, compared to all orders, next to the row for that order:

RANK() OVER(ORDER BY order_value DESC)

Show the rank of each order according to its value, compared to its store, next to the row for that order:

RANK() OVER(PARTITION BY store ORDER BY order_value DESC)

Show the quartile of each order according to its value, compared to its store, next to the row for that order:

NTILE(4) OVER(PARTITION BY store ORDER BY order_value DESC)

Just to get a little crazy, and demonstrate that window functions can do pretty complex things … show the name of the person who purchased the order that is two orders greater in value than the current order, when considering rank by store and department, and say "none found" if the order was in the top 2 already:

LAG(customer_name, 2, &#39;none found&#39;) OVER(

  PARTITION BY store, department

  ORDER BY order_value DESC

)

More power: rolling averages, cumulative sums, and beyond

Though we've covered a lot, there is one more feature of window functions that can give you even more analytical power. It's called the "frame clause", and it allows you to calculate things like rolling averages, cumulative sums, and many other interesting values.

So far, all the grouping we've done is according to the unique values within a column or set of columns (using PARTITION BY). However, you can do even fancier things by making the group relative to each row. In other words, as SQL runs through each row, it will perform a calculation for the surrounding rows. This allows you to do things like get a sum for only the 5 previous rows.

Here is a diagram that demonstrates which rows SQL will consider (in purple) when it reaches a given row (in aqua), if we tell it to look at the 5 previous rows:

1
2
3
4
5
6
7
8
9
10

1
2
3
4
5
6
7
8
9
10

1
2
3
4
5
6
7
8
9
10

Mmmm ... more syntax

To create a moving group like this, you'll use a "frame clause", which looks like this:
SUM(over_value) OVER( PARTITION BY store ORDER BY over_value DESC ROWS BETWEEN 5 PRECEDING AND CURRENT ROW )

ROWS BETWEEN: Start a frame clause by using ROWS BETWEEN. This indicates that you want a moving group, relative to the current row.
5 PRECEDING: This is where you tell SQL the row that will start your moving group. The different options are:
- UNBOUNDED PRECEDING: start off at the first row of the partition
- X PRECEDING: start off x rows before the current row
- CURRENT ROW: start off at the current row
- X FOLLOWING: start off x rows after the current row
AND CURRENT ROW: This is where you tell SQL the row that will end your moving group. The different options are:
- AND X PRECEDING: end x rows before the current row
- AND CURRENT ROW: end at the current row
- AND X FOLLOWING: end x rows after the current row
- AND UNBOUNDED FOLLOWING: end at the last row of the partition
  
  It's relatively common to exclude this section from the frame clause. If you do so, SQL will assume you want to end at the CURRENT ROW.

ORDER BY and PARTITION BY with a moving group

The way that partitioning and ordering works with a moving group of rows like this can be a little confusing.

The ORDER BY clause is important when you are using a frame clause. SQL needs to know how it should order the rows before it can decide which rows are before or after any other.

The PARTITION BY clause is not necessarily required when using a frame clause, but there are times you might want to use one. If you do, you should know that the rows which are considered in the calculation must be within the same partition as the current row. This can have an effect near the beginning or end of a partition. For example, if we asked for 2 rows before and 2 rows after the current row:

At the beginning of the partition some of the initial rows can be excluded ...

In the middle of the partition everything is as expected ...

At the end of the partition some of the final rows can be excluded ...

Partition A

Partition B

Partition C

Partition A

Partition B

Partition C

Partition A

Partition B

Partition C

More examples!

Calculate a rolling average of order size, for all orders, based on the previous 10 orders:

AVG(order_value) OVER(

  ORDER BY order_time

  ROWS BETWEEN 10 PRECEDING AND CURRENT ROW

)

Calculate a rolling average of order size, for each store, based on the previous 10 orders:

AVG(order_value) OVER(

  PARTITION BY store

  ORDER BY order_time

  ROWS BETWEEN 10 PRECEDING AND CURRENT ROW

)

Calculate a cumulative sum of total amount sold, since the beginning of time:

SUM(order_value) OVER(

  ORDER BY order_time

  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

)

Calculate a cumulative sum of total amount sold, starting over each day:

SUM(order_value) OVER(

  PARTITION BY order_date

  ORDER BY order_time

  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

)

Using window functions in Looker

One important thing to keep in mind about window functions is that they cannot be placed in a GROUP BY clause. In Looker, dimensions are always placed into the GROUP BY clause. Therefore, you cannot simply place a window function in a dimension. For example, the following LookML will not work properly:

# Will result in an error

dimension: value_rank_by_store {

  type: number

  sql: RANK() OVER(PARTITION BY store ORDER BY order_value DESC) ;;

}

Make it work with derived tables

For this reason, generally speaking, window functions are used in derived tables. For example, to add a "value_rank_by_store" dimension to your order table, you might first create a derived table like this one:

view: order_window_facts {

  derived_table: {

    sql:

      SELECT

        order_id,

        RANK() OVER(PARTITION BY store ORDER BY order_value DESC)

          AS value_rank_by_store

      FROM

        order ;;

  }



  dimension: order_id {

    type: int

    sql: ${TABLE}.order_id

    primary_key: yes

    hidden: yes

  }



  dimension: value_rank_by_store {

    type: number

    sql: ${TABLE}.value_rank_by_store

  }

}

Then you could easily join it back to your original order table in your model file:

explore: order {

  join: order_window_facts {

    sql_on: ${order.id} = ${order_window_facts.order_id} ;;

  }

}

The CliffsNotes version

To summarize the above:

Window functions can be an easy and elegant way to add ranking, rolling averages, cumulative sums, and other powerful calculations to your queries.
Window functions allow you to perform aggregate calculations (like COUNT, SUM, and AVG, etc) against a group of rows, then list the result next to all the rows in that group.
Window functions can also give you new types of calculations like RANK, LAG, and NTILE.
In Looker, you usually implement window functions via derived tables.