A lesser-known feature of some SQL dialects is something called the "window function." While MySQL users will be left out in the cold, most other SQL dialects can take advantage of their power. They can be a little tricky to wrap your mind around at first, but certain calculations — which are very complex or impossible without window functions — can become straightforward.
Intriguing
To demonstrate the power of window functions, let's take a look at an example set of customer data:
Now, suppose you want to know how the customer ranks in spending, against the other customers in their status. In other words, you're hoping for a result set that looks like this:
It's About to Get Real
Logically, this seems like a straightforward thing to do, so one might expect a fairly easy method in SQL. It is easy with a window function, but without one, it gets kind of nasty. If you do a quick Google search you're likely to come across several approaches. Some of them utilize concatenation hacks, clever self joins, and even user-defined variables. Did you even know SQL had variables you could define?
Just to demonstrate one option (which avoids variables for those who haven't used them), you could run a query like this:
SELECT c1.name,
c1.status,
c1.lifetime_spend,
COUNT(*) AS status_rank
FROM customer AS c1
JOIN customer AS c2
ON c1.lifetime_spend <= c2.lifetime_spend AND
c1.status = c2.status
GROUP BY c1.name,
c1.status,
c1.lifetime_spend
Demonstrating how this query works is not the primary focus of this entry, so we won't get into a step-by-step explanation. However, it can be helpful to consider how the
JOIN
works in this query. Before applying any column selections or grouping, you would arrive at a table like this:
Now, it's a little easier to see that the number of times each name appears in the first column is the rank for that name, within its status.
Stop the Pain
Certainly, SQL queries can get a lot more complex than the one we just saw. However, to do something as straightforward as calculating a rank, it's a lot of fussing about. With a window function, this gets a lot easier and more elegant:
SELECT name,
status,
lifetime_spend,
RANK() OVER(PARTITION BY status ORDER BY lifetime_spend DESC) AS status_rank
FROM customer
That is it — no tricky logic, just exactly what you want.
Break it Down Now
Let's break down the concept of a window function, to understand exactly how it works. The general idea is that you perform some sort of calculation over a group of rows in the table. Then, you list the result of that calculation next to each row that is a part of that group.
This is where window functions can get a little confusing. Typically, when you perform a calculation across multiple rows, you need to use it in combination with GROUP BY
. Then, your calculation (like COUNT
or SUM
) is run for each group. For example, you might run:
SELECT status,
SUM(lifetime_spend) AS total_spend
FROM customer
GROUP BY status
This would create the following table:
Window functions perform those same types of calculations across multiple rows, but they return a result for each individual row in the table. In other words, they don't need to be used with GROUP BY
. Instead of using GROUP BY
, you specify your desired groups of aggregation by using PARTITION BY
. For example, you could run:
SELECT name,
status,
lifetime_spend,
SUM(lifetime_spend) OVER(PARTITION BY status) AS total_status_spend
FROM customer
This would generate the following table:
This is part of the magic of a window function. You can perform aggregation, without having to squish down your table via GROUP BY
.
There are also some special window functions that let you do other interesting things. As we have seen, RANK
lets us rank rows based on how they compare to other rows in their group. Other useful functions include LAG
and LEAD
, which let us get values from a row that is before or after the current row. NTILE
lets you find the percentile/quartile/decile/etc a row is in within its group. Other similar functions and their usage are easily found on the web for your particular SQL dialect.
The Window Function in Your Life
Who Doesn't Love Some Good Syntax?
The syntax of a window function works like this:
RANK() OVER(PARTITION BY status ORDER BY lifetime_spend DESC)
- RANK(): Every window function starts off with the calculation to be performed. These can be the aggregate functions you are probably familiar with like
COUNT
,SUM
,AVG
, etc. You can also use the special window functions likeRANK
,LAG
,LEAD
,NTILE
, etc.
- OVER(. . .): Every window function must use
OVER
, so SQL knows you want to perform a window function.OVER
also accepts the other settings for your window function, if needed. It is possible you won't need any additional options, and will just useOVER()
.
- PARTITION BY status: If you want to perform your calculation over specific groups, you specify those groups using
PARTITION BY
.PARTITION BY
accepts the name of a column, or multiple columns. The unique values of that column (or columns) is considered a group. In this sense,PARTITION BY
is very similar toGROUP BY
. If you want the calculation to be performed across the entire table, instead of for specific groups, you can just leave out this statement.
- ORDER BY lifetime_spend DESC: Some calculations are sensitive to the order of rows within each group. For example, if you want to know something's
RANK
, you need to tell SQL how it should order the rows, before it can determine who is in "first place," "second place," and so on. If you need to explain how the rows should be ordered within each group, useORDER BY
. Other types of calculations are not sensitive to the order of the rows. For example, it doesn't matter how the rows are ordered when usingSUM
, because a sum is the same no matter what order it's performed in. If you don't need anORDER BY
, you can exclude it.
Examples!
Show the total amount sold, for all orders, next to each row:
SUM(order_value) OVER()
Show the total amount sold, for each store, next to each row for that store:
SUM(order_value) OVER(PARTITION BY store)
Show the rank of each order according to its value, compared to all orders, next to the row for that order:
RANK() OVER(ORDER BY order_value DESC)
Show the rank of each order according to its value, compared to its store, next to the row for that order:
RANK() OVER(PARTITION BY store ORDER BY order_value DESC)
Show the quartile of each order according to its value, compared to its store, next to the row for that order:
NTILE(4) OVER(PARTITION BY store ORDER BY order_value DESC)
Just to get a little crazy, and demonstrate that window functions can do pretty complex things, show the name of the person who purchased the order that is two orders greater in value than the current order, when considering rank by store and department, and say "none found," if the order was in the top 2 already:
LAG(customer_name, 2, 'none found') OVER(
PARTITION BY store, department
ORDER BY order_value DESC
)
More Power: Rolling Averages, Cumulative Sums, and Beyond
Though we have covered a lot, there is one more feature of window functions that can give you even more analytical power. It's called the "frame clause," and it allows you to calculate things like rolling averages, cumulative sums, and many other interesting values.
So far, all the grouping we've done is according to the unique values within a column or set of columns (using PARTITION BY
). However, you can do even fancier things by making the group relative to each row. In other words, as SQL runs through each row, it will perform a calculation for the surrounding rows. This allows you to do things like get a sum for only the 5 previous rows.
Here is a diagram that demonstrates which rows SQL will consider (in purple) when it reaches a given row (in aqua), if we tell it to look at the 5 previous rows:
More Syntax
To create a moving group like this, you'll use a "frame clause," which looks like this:
SUM(over_value) OVER(
PARTITION BY store
ORDER BY over_value DESC
ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
)
ROWS BETWEEN
Start a frame clause by using ROWS BETWEEN
. This indicates that you want a moving group, relative to the current row.
5 PRECEDING
This is where you tell SQL the row that will start your moving group. The different options are:
UNBOUNDED PRECEDING
: start off at the first row of the partitionX PRECEDING
: start off x rows before the current rowCURRENT ROW
: start off at the current rowX FOLLOWING
: start off x rows after the current row
AND CURRENT ROW
This is where you tell SQL the row that will end your moving group. The different options are:
AND X PRECEDING
: end x rows before the current rowAND CURRENT ROW
: end at the current rowAND X FOLLOWING
: end x rows after the current rowAND UNBOUNDED FOLLOWING
: end at the last row of the partition
It's relatively common to exclude this section from the frame clause. If you do so, SQL will assume you want to end at the CURRENT ROW
.
ORDER BY and PARTITION BY with a Moving Group
The way that partitioning and ordering works with a moving group of rows like this can be a little confusing.
The ORDER BY
clause is important when you are using a frame clause. SQL needs to know how it should order the rows, before it can decide which rows are before or after any other.
The PARTITION BY
clause is not necessarily required when using a frame clause, but there are times you might want to use one. If you do, you should know that the rows which are considered in the calculation must be within the same partition as the current row. This can have an effect near the beginning or end of a partition. For example, if we asked for 2 rows before and 2 rows after the current row:
More Examples!
Calculate a rolling average of order size, for all orders, based on the previous 10 orders:
AVG(order_value) OVER(
ORDER BY order_time
ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
)
Calculate a rolling average of order size, for each store, based on the previous 10 orders:
AVG(order_value) OVER(
PARTITION BY store
ORDER BY order_time
ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
)
Calculate a cumulative sum of total amount sold, since the beginning of time:
SUM(order_value) OVER(
ORDER BY order_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
Calculate a cumulative sum of total amount sold, starting over each day:
SUM(order_value) OVER(
PARTITION BY order_date
ORDER BY order_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
Using Window Functions in Looker
One important thing to keep in mind about window functions is that they cannot be placed in a GROUP BY
clause. In Looker, dimensions are always placed into the GROUP BY
clause. Therefore, you cannot simply place a window function in a dimension. For example, the following LookML will not work properly:
# Will result in an error
dimension: value_rank_by_store {
type: number
sql: RANK() OVER(PARTITION BY store ORDER BY order_value DESC) ;;
}
Make it Work with Derived Tables
For this reason (generally speaking), window functions are used in derived tables. For example, to add a "value_rank_by_store" dimension to your order table, you might first create a derived table like this one:
view: order_window_facts {
derived_table: {
sql:
SELECT
order_id,
RANK() OVER(PARTITION BY store ORDER BY order_value DESC)
AS value_rank_by_store
FROM
order ;;
}
dimension: order_id {
type: int
sql: ${TABLE}.order_id
primary_key: yes
hidden: yes
}
dimension: value_rank_by_store {
type: number
sql: ${TABLE}.value_rank_by_store
}
}
Then, you could easily join it back to your original order table in your model file:
explore: order {
join: order_window_facts {
sql_on: ${order.id} = ${order_window_facts.order_id} ;;
}
}
The Cliffs Notes Version
To summarize the above:
- Window functions can be an easy and elegant way to add ranking, rolling averages, cumulative sums, and other powerful calculations to your queries.
- Window functions allow you to perform aggregate calculations (like
COUNT
,SUM
, andAVG
, etc.) against a group of rows, then list the result next to all the rows in that group. - Window functions can also give you new types of calculations like
RANK
,LAG
, andNTILE
. - In Looker, you usually implement window functions via derived tables.