NOTE: When this entry was originally written, Looker users needed to use the following strategies to properly calculate certain metrics. As of Looker 3.16, symmetric aggregates have been implemented, which render much of the information in this article unnecessary.
However, this article is still useful for writing SQL in general, and specifically for working with SQL dialects that do not support symmetric aggregates.
If you're a SQL user, some of the first SQL concepts you probably learned about were joins and aggregate functions (such as COUNT
and SUM
). One thing that is not always taught is how these two concepts can interact and sometimes produce incorrect results. In this article we'll discuss what to look out for, the concept of a "fanout," and why it matters to SQL writers and Looker users alike.
Starting with a friendly join
Let's start off with a simple example, where we'll join together a couple of tables. Our first table will show our customers' names and the number of visits each customer has made to our website:
customer
customer_id | first_name | last_name | visits |
---|---|---|---|
1 | Amelia | Earhart | 2 |
2 | Charles | Lindbergh | 2 |
3 | Wilbur | Wright | 4 |
Our second table will include all the orders that those customers have placed. You can see that each order is linked to the customer who placed it by the customer's ID.
order
order_id | amount | customer_id |
---|---|---|
1 | 25.00 | 1 |
2 | 50.00 | 1 |
3 | 75.00 | 2 |
4 | 100.00 | 3 |
Joining these tables together in SQL would be pretty straightforward:
SELECT * FROM customer LEFT JOIN order ON customer.customer_id = order.customer_id
The result of that query would be this table:
customer_id | first_name | last_name | visits | order_id | amount | customer_id |
---|---|---|---|---|---|---|
1 | Amelia | Earhart | 2 | 1 | 25.00 | 1 |
1 | Amelia | Earhart | 2 | 2 | 50.00 | 1 |
2 | Charles | Lindbergh | 2 | 3 | 75.00 | 2 |
3 | Wilbur | Wright | 4 | 4 | 100.00 | 3 |
Aggregate functions gone bad
Now that we have a joined table, we need to be careful about how we use aggregate functions like COUNT
and SUM
.
Aggregate functions on a single table
Let's consider the customer table all by itself again. If we want to know the total number of customers, we can execute a simple query like this:
SELECT COUNT(*) FROM customer
SQL will count up the rows in the table as follows:
customer
COUNT |
customer_id | first_name | last_name | visits |
---|---|---|---|---|
1 | 1 | Amelia | Earhart | 2 |
2 | 2 | Charles | Lindbergh | 2 |
3 | 3 | Wilbur | Wright | 4 |
We'll get a count of 3, which is correct.
Or, if we want to know the total number of customer visits, we can execute another straightforward query like this:
SELECT SUM(visits) FROM customer
SQL will add up the number of visits in the table as follows:
customer
customer_id | first_name | last_name | visits |
---|---|---|---|
1 | Amelia | Earhart | 2 |
2 | Charles | Lindbergh | + 2 |
3 | Wilbur | Wright | + 4 |
SUM |
8 |
We'll get a result of 8, which is also correct.
Aggregate functions on the joined tables
So far, so good. However, if we try to use the same aggregate functions on either of our joined tables, we'll start to see incorrect results.
Running a basic count on the joined table, we will no longer get the correct number of customers:
SELECT COUNT(*) FROM customer LEFT JOIN order ON customer.customer_id = order.customer_id
SQL will count up the rows in the table as follows:
COUNT |
customer_id | first_name | last_name | visits | order_id | amount | customer_id |
---|---|---|---|---|---|---|---|
1 | 1 | Amelia | Earhart | 2 | 1 | 25.00 | 1 |
2 | 1 | Amelia | Earhart | 2 | 2 | 50.00 | 1 |
3 | 2 | Charles | Lindbergh | 2 | 3 | 75.00 | 2 |
4 | 3 | Wilbur | Wright | 4 | 4 | 100.00 | 3 |
We'll get a result of 4, even though there are really only 3 customers. You can see that Amelia is counted twice.
Similarly, if we try to sum the number of visits, we will no longer get the correct result:
SELECT SUM(visits) FROM customer LEFT JOIN order ON customer.customer_id = order.customer_id
SQL will add up the number of visits in the table as follows:
customer_id | first_name | last_name | visits | order_id | amount | customer_id |
---|---|---|---|---|---|---|
1 | Amelia | Earhart | 2 | 1 | 25.00 | 1 |
1 | Amelia | Earhart | + 2 | 2 | 50.00 | 1 |
2 | Charles | Lindbergh | + 2 | 3 | 75.00 | 2 |
3 | Wilbur | Wright | + 4 | 4 | 100.00 | 3 |
SUM |
10 |
We'll get a result of 10, even though there are only 8 visits. Amelia's 2 visits are added twice.
A fanout happened while you weren't looking
In the example we've been looking at, the primary table (customer) had only three rows. The "primary table" is the table that is in the FROM
clause of our SQL queries. After the join, we now have 4 rows. Since the joined table has more rows than the primary table, we say that a fanout has occurred.
To avoid a fanout, we can write the join in the opposite order. So, instead of this:
SELECT * FROM customer LEFT JOIN order ON customer.customer_id = order.customer_id
We would do this:
SELECT * FROM order LEFT JOIN customer ON order.customer_id = customer.customer_id
Now our joined table looks like this:
order_id | amount | customer_id | customer_id | first_name | last_name | visits |
---|---|---|---|---|---|---|
1 | 25.00 | 1 | 1 | Amelia | Earhart | 2 |
2 | 50.00 | 1 | 1 | Amelia | Earhart | 2 |
3 | 75.00 | 2 | 2 | Charles | Lindbergh | 2 |
4 | 100.00 | 3 | 3 | Wilbur | Wright | 4 |
The eagle-eyed observer will recognize that this is exactly like our other joined table, except for the order of the columns. That is true, and it's why most writers of SQL think of our two different joins as exactly the same.
However, when we do the join in this order, we do not have a fanout. Our original table (order) had four rows, and our joined table also has four rows, so there has been no fanout. Herein lies a key point: To help avoid fanouts, begin your joins with the most granular table.
Fanouts, schmanouts, who cares?
In both the fanout and the non-fanout cases, we still need to worry about the accuracy of aggregate functions. However, there is a subtle difference in the type of problems we're going to see. If you use Looker, this is a very important concept.
- No fanout — You can trust aggregate functions on your primary table but not necessarily on your joined tables.
- Fanout — You cannot necessarily trust aggregate functions on either your primary table or your joined tables.
To drive home this point, let's take a look at our previous examples.
Fanout example
If you'll recall from the previous example, the following join resulted in a fanout, because while the customer table only had three rows, the joined table had four rows.
SELECT * FROM customer LEFT JOIN order ON customer.customer_id = order.customer_id
customer_id | first_name | last_name | visits | order_id | amount | customer_id |
---|---|---|---|---|---|---|
1 | Amelia | Earhart | 2 | 1 | 25.00 | 1 |
1 | Amelia | Earhart | 2 | 2 | 50.00 | 1 |
2 | Charles | Lindbergh | 2 | 3 | 75.00 | 2 |
3 | Wilbur | Wright | 4 | 4 | 100.00 | 3 |
Since we are in a fanout situation, we cannot trust that aggregate functions will work on the primary table (customer). As we saw previously, SUM(visits)
will give us a value of 10, even though only 8 visits have actually occurred.
No fanout example
When we reversed the join, we did not get a fanout, because the order table had four rows, and the joined table also had four rows.
SELECT * FROM order LEFT JOIN customer ON order.customer_id = customer.customer_id
order_id | amount | customer_id | customer_id | first_name | last_name | visits |
---|---|---|---|---|---|---|
1 | 25.00 | 1 | 1 | Amelia | Earhart | 2 |
2 | 50.00 | 1 | 1 | Amelia | Earhart | 2 |
3 | 75.00 | 2 | 2 | Charles | Lindbergh | 2 |
4 | 100.00 | 3 | 3 | Wilbur | Wright | 4 |
Without any fanout, we can trust that aggregate functions will work on the primary table (order). For example, SUM(amount)
will give us a value of 250.00, which is the correct amount of money collected.
Two friendly joins, one frenemy join
If we want to avoid fanouts, it's important that we understand the three different types of joins.
One-to-one (friendly)
If one row of your primary table only ever matches up with one row of your joined table, you have a one-to-one join. This type of join will not result in a fanout, and aggregate functions will be accurate no matter where you use them.
Example: Suppose you have a person table and a DNA table. Since only one person can be matched with one DNA record, this is a one-to-one join.
Many-to-one (friendly)
If many rows of your primary table match up with the same row in your joined table, you have a many-to-one join. This type of join will also not result in a fanout, and aggregate functions will at least be accurate on the primary table.
Example: Suppose you have a person table and a state-of-residence table. Since many people can live in one state, this is a many-to-one join.
One-to-many (frenemy)
If one row of your primary table can match up with multiple rows in your joined table, you have a one-to-many join. This type of join can result in a fanout, and aggregate functions are not necessarily accurate anywhere.
Example: Suppose you have a person table and a children table. Since one person can have more than one child, this is a one-to-many join.
A fanout witch hunt
Understand your join type
The first, and preferred, method to check for a fanout is to understand the type of join that is occurring. One-to-one and many-to-one joins won't ever result in a fanout. However, if you know you are in a one-to-many situation, then there will always be the risk of a fanout. Even if a fanout has not already occurred, it will be a risk in the future if new rows are added.
Count rows before and after the join
The second method you can use to check for a fanout is to query a COUNT
before and after the join. The queries would look like this:
SELECT COUNT(*) FROM my_primary_table SELECT COUNT(*) FROM my_primary_table LEFT JOIN my_joined_table ON my_primary_table.my_column_1 = my_joined_table.my_column_2
If the count increases between the two queries, we know that a fanout has occurred. Since we're looking for an increase, it's important that the second query use a LEFT JOIN
. We don't want to artificially decrease the number of rows being reported just because a row in the primary table doesn't have a corresponding row in the joined table.
Unfortunately, this method cannot tell you if there is a risk of a future fanout. To know that, you need to understand the type of join that is occurring.
Looker to the rescue
If you're one of the lucky folks who use Looker, there are several LookML options that can help protect you from these pitfalls.
Joining with foreign_key
makes fanouts impossible
In Looker, you'll typically define joins between your views (views represent an existing or derived table) in your model file. At Looker, we recommend you implement joins using the foreign_key
declaration when possible. In LookML, a join that uses this method will be written like this:
explore: person { join: dna { foreign_key: dna_id } }
In plain English, we would interpret this to mean "take the primary key from dna, and join it to dna_id from person".
It is impossible for a join performed with foreign_key
to result in a fanout, as long as you've defined your primary keys correctly. Since the joined views are being joined with their primary key, which is unique, we know that only one row from the joined views can ever be matched with a row from the explore
.
Sometimes the column relationships in your database will not allow you to use foreign_key
, even though no fanout will occur. That's perfectly fine, and in this situation you can use sql_on
to define your join. The example above could be written like this:
explore: person { join: dna { sql_on: ${dna.id} = ${person.dna_id} ;; } }
As we learned in this article, if you eliminate the possibility of a fanout, you can be sure that the measures in your explore
can be trusted.
Aggregate measures from joined views are hidden unless the join is one-to-one
That sure is a mouthful, so let's work through it. In this article, we've seen that aggregate functions don't always work for tables that have been joined in. For that reason, Looker will not display aggregate functions (except for counts) for joined views. That way, your users will not accidentally retrieve bad information.
Counts can be used because Looker automatically uses a COUNT DISTINCT
in these cases. This is safe because, even if a row from a joined table is repeated multiple times, a COUNT DISTINCT
will only count it once.
Sometimes it is safe to use other aggregate functions, besides just COUNT DISTINCT
, on joined views. This occurs when the join relationship is one-to-one. If you know that a join is one-to-one, you can tell Looker about this, using the relationship
declaration. It looks like this:
explore: person { join: dna { foreign_key: dna_id relationship: one_to_one } }
When you make this declaration, Looker will start displaying aggregate functions for the joined view.
When a fanout is OK
Sometimes you know that a fanout is going to occur, but you've thought it through and decided that you don't need aggregate functions or that the ones you plan to use will work properly. In that case, Looker will definitely allow you to define your joins however you like. Just use sql_on
to define your join. As a reminder, the syntax will look like:
explore: person { join: child { sql_on: ${child.parent_id} = ${person.id} ;; } }
But I want to have my cake and eat it, too!
There are some clever ways to avoid inaccuracy with aggregate functions, even if you have problematic joins. If you can force the information in your joined tables to have a one-to-one relationship with your primary table, you can use aggregate functions to your heart's content.
Group your data in a one-to-one join
At the very beginning of this article, we started with the customer table, joined in order, and then saw some bad behavior when we extracted total customers and total visits. As a reminder, the tables look like this:
customer
|
order
|
---|
One approach we might take, to get this data joined together in a safe way, is to group the order table by customer_id. In so doing, we'll create a single row to be matched with each row in customer and end up with a one-to-one relationship. The SQL to perform this grouping might be written as:
SELECT customer_id, SUM(amount) AS total_amount FROM order GROUP BY customer_id
The resulting table will look like this:
customer_id | total_amount |
---|---|
1 | 75.00 |
2 | 75.00 |
3 | 105.00 |
If you compare this result set to the customer table, you can see that it will join quite nicely. To actually execute the join in SQL, you would need to use a sub query:
SELECT * FROM customer LEFT JOIN ( SELECT customer_id, SUM(amount) AS total_amount FROM order GROUP BY customer_id ) AS customer_totals ON customer.customer_id = customer_totals.customer_id
The resulting table will look like this:
customer_id | first_name | last_name | visits | customer_id | total_amount |
---|---|---|---|---|---|
1 | Amelia | Earhart | 2 | 1 | 75.00 |
2 | Charles | Lindbergh | 2 | 2 | 75.00 |
3 | Wilbur | Wright | 4 | 3 | 105.00 |
Now any aggregate function calculations we want to use (such as a SUM
on visits or total_amount) will work just fine.
Using Looker's derived tables to do the same thing
Looker has a great feature called derived tables. Derived tables have many sophisticated options but can be used simply to achieve the grouping of the order table that we want. This is how the LookML would be written:
view: customer_totals { derived_table: { sql: SELECT customer_id, SUM(amount) AS total_amount FROM order GROUP BY customer_id ;; } }
You could now treat this data as if it were a real table in your database, just like any other. You can define measures and dimensions just like any other view. This data can also be joined in LookML just like any other table. In this example, the LookML might be:
explore: customer { join: customer_totals { foreign_key: customer_id relationship: one_to_one } }
Quick summary
To summarize everything we've just covered:
- Aggregate functions like
SUM
andCOUNT
can misbehave if used against joined tables. - If a join has a one-to-one relationship, aggregate functions will work just fine.
- If a join has a many-to-one relationship, aggregate functions will work on the primary table but might not work on the joined tables.
- If a join has a one-to-many relationship, aggregate functions may not work anywhere.
- If you're using Looker, protect yourself from many of these pitfalls by making your joins with
foreign_key
if possible, and settingrelationship: one-to-one
when appropriate. - If you need to join tables that have a one-to-many relationship, and want to use aggregate functions, try grouping the joined tables first such that they will have a one-to-one relationship with the primary table.