One of Looker's most powerful features is something we call symmetric aggregates. But, because they can look pretty gnarly and mostly happen behind the scenes, coming upon them can be a bit confusing. So here's a quick rundown of why they exist, how they work, and why they're so unattractive.
Why Do We Need Symmetric Aggregates?
SQL, the language of data analysis, is amazingly powerful. But with great power comes great responsibility, and analysts have a responsibility to avoid accidentally calculating incorrect aggregates, such as sums, averages, and counts. (There are literally threads all over the Internet about this problem.)
Doing these calculations incorrectly is surprisingly easy and probably the cause of more analyst heartburn than we'll ever know. This example makes it clear how you can go wrong.
Imagine you have two tables, orders
and order_items
. The order_items
table records one row for each item in an order, so the relationship between the tables is one-to-many (because one order can have many items, but each item can only be part of one order). Let's say the orders
table looks like this:
order_id | user_id | total | order_date |
---|---|---|---|
1 | 100 | $ 50.36 | 2017-12-01 |
2 | 101 | $ 24.12 | 2017-12-02 |
3 | 137 | $ 50.36 | 2017-12-02 |
The
SUM(total)
equals 124.84
.
Now, let's say the
order_items
table contains six rows:
order_id | item_id | quantity | unit_price |
---|---|---|---|
1 | 50 | 1 | $ 23.00 |
1 | 63 | 2 | $ 13.68 |
2 | 63 | 1 | $ 13.68 |
2 | 72 | 1 | $ 5.08 |
2 | 79 | 1 | $ 5.36 |
3 | 78 | 1 | $ 50.36 |
If you wanted to get the count of items ordered, that'd be easy, too. The SUM(quantity)
equals 7
.
Now, let's join these two tables using their shared column, order_id
, which gives us:
order_id | user_id | total | order_date | item_id | quantity | unit_price |
---|---|---|---|---|---|---|
1 | 100 | $ 50.36 | 2017-12-01 | 50 | 1 | $ 23.00 |
1 | 100 | $ 50.36 | 2017-12-01 | 63 | 2 | $ 13.68 |
2 | 101 | $ 24.12 | 2017-12-02 | 63 | 1 | $ 13.68 |
2 | 101 | $ 24.12 | 2017-12-02 | 72 | 1 | $ 5.08 |
2 | 101 | $ 24.12 | 2017-12-02 | 79 | 1 | $ 5.36 |
3 | 137 | $ 50.36 | 2017-12-02 | 78 | 1 | $ 50.36 |
Now we can see some new things, such as that three items were ordered on Dec 1 and four items were ordered on Dec 2. Some of our previous calculations, like SUM(quantity)
, will continue to work fine. But what if we try to calculate the total spent? Now we've got a problem!
If we use our previous calculation, SUM(total)
, the 50.36 total for order_id = 1
is going to be counted twice, because it contains two different items, and the 24.12 total for order_id = 2
is going to be counted three times! So
SUM(total)
will give us 223.44
, instead of the correct answer, which is 124.84
.
This is bad. While it's easy to avoid this kind of mistake when you have two tiny example tables, things get way more complicated in real life, with lots of tables and lots of data. This is exactly the kind of miscalculation someone could make without even realizing.
This is the problem symmetric aggregates solve.
How Symmetric Aggregates Work
Symmetric aggregates prevent analysts — and anyone else using Looker — from accidentally miscalculating these types of aggregates. This takes a huge burden off analysts' shoulders, because they can trust that no one is going to charge ahead with incorrect data.
Symmetric aggregates do this, in plain English, by keeping track of what you're calculating and making sure to count each fact in the calculation the correct number of times. So, in the above example, the symmetric aggregates function realizes that
total
is a property of orders
(not order_items
), so it needs to count each order's total only once to get the correct answer.
It does this by using a unique primary key that analysts define in Looker. That means that when Looker is doing calculations on the joined table...
order_id | user_id | total | order_date | item_id | quantity | unit_price |
---|---|---|---|---|---|---|
1 | 100 | $ 50.36 | 2017-12-01 | 50 | 1 | $ 23.00 |
1 | 100 | $ 50.36 | 2017-12-01 | 63 | 2 | $ 13.68 |
2 | 101 | $ 24.12 | 2017-12-02 | 63 | 1 | $ 13.68 |
2 | 101 | $ 24.12 | 2017-12-02 | 72 | 1 | $ 5.08 |
2 | 101 | $ 24.12 | 2017-12-02 | 79 | 1 | $ 5.36 |
3 | 137 | $ 50.36 | 2017-12-02 | 78 | 1 | $ 50.36 |
it says to itself, "Hey, even though there are two rows with
order_id = 1
, I shouldn't double count the total, because I've already included it in my calculation. And for the three rows with
order_id = 2
, I should ignore the totals in the second and third rows, because they're repeats."
It's worth noting that symmetric aggregates depend on a unique primary key and the correct join relationship being specified in the model. So, if the results you're getting look wrong, talk to an analyst to make sure that everything is set up correctly.
Why Do Symmetric Aggregates Look That Way?
If you haven't seen symmetric aggregates in the wild, it can be a bit unsettling. When symmetric aggregates kick in, Looker goes from writing nice, well-behaved SQL, like this:
SELECT
order_items.order_id AS "order_items.order_id",
order_items.sale_price AS "order_items.sale_price"
FROM order_items AS order_items
GROUP BY 1,2
ORDER BY 1
LIMIT 500
to wild, crazy SQL, like this:
SELECT
order_items.order_id AS "order_items.order_id",
order_items.sale_price AS "order_items.sale_price",
(COALESCE(CAST( ( SUM(DISTINCT (CAST(FLOOR(COALESCE(users.age ,0)
*(1000000*1.0)) AS DECIMAL(38,0))) +
CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
* 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0)) )
- SUM(DISTINCT CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
* 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))) )
AS DOUBLE PRECISION)
/ CAST((1000000*1.0) AS DOUBLE PRECISION), 0)
/ NULLIF(COUNT(DISTINCT CASE WHEN users.age IS NOT NULL THEN users.id
ELSE NULL END), 0)) AS "users.average_age"
FROM order_items AS order_items
LEFT JOIN users AS users ON order_items.user_id = users.id
GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 500
The exact format that symmetric aggregates take depends on the dialect of SQL Looker is writing, but all formats do the same basic thing: If multiple rows have the same primary key, the symmetric aggregates function only counts them one time. It does this by using the little-known SUM DISTINCT
and AVG DISTINCT
functions that are part of the SQL standard.
To see how this happens, let's take the calculation we did above and work it through with symmetric aggregates. Of the seven columns in the joined tables, we only need two: the one we're aggregating (total
) and the unique primary key for orders (order_id
).
order_id | total |
---|---|
1 | $ 50.36 |
1 | $ 50.36 |
2 | $ 24.12 |
2 | $ 24.12 |
2 | $ 24.12 |
3 | $ 50.26 |
Symmetric aggregates take the primary key (order_id
, in this case) and creates a very large number for each, which is guaranteed to be unique and always give the same output for the same input. (It generally does this with a hashing function, the details of which are beyond the scope of this article.) So that'd look something like this:
big_unique_number | total |
---|---|
802959190063912 | $ 50.36 |
802959190063912 | $ 50.36 |
917651724816292 | $ 24.12 |
917651724816292 | $ 24.12 |
917651724816292 | $ 24.12 |
110506994770727 | $ 50.36 |
Then, for each row, Looker does this:
SUM(DISTINCT big_unique_number + total) - SUM(DISTINCT big_unique_number)
And guess what. That works.
It reliably gives you the correctly aggregated totals, counting each total exactly the right number of times. It isn't fooled by repeated rows or by multiple orders that have the same total. Don't believe it? Do the math yourself and you'll get a better feel for how it works.
There's no question that the SQL required to do this isn't the prettiest to look at: With CAST()
, and md5()
, and SUM(DISTINCT)
, and STRTOL()
, you certainly wouldn't want to write it by hand. But, luckily, you don't have to. Looker writes it for you, and it just works. No wrong answers, no joins that "aren't allowed," and no headaches for analysts.
NOTE: When an aggregation is going to work properly without needing symmetric aggregates, Looker detects this automatically and doesn't bother with symmetric aggregates. Because symmetric aggregates impose some performance costs, this further optimizes the SQL that Looker generates, to make it as efficient as possible while still guaranteeing the right answer.