This article is written for anyone attempting to use LookML to build an explore in Looker. The concepts will be easier to understand if you have a proficiency in SQL, in particular the difference between inner and outer joins. Brush up with this concise article from w3schools if you’re unsure.
Looker has the ability to be a powerful SQL engine for your company. Abstract modeling in LookML allows data teams and IT teams to build general rules which are always true, freeing business analysts to build queries in the wild which are always correct even if the data team never anticipated them. The core driver of this ability is the symmetric aggregate algorithm which solves an industry-wide issue with SQL joins. However, two things must be done correctly to take advantage of the algorithm: primary keys must be accurate in every view that contains a measure (typically all of them), and relationship parameters must be correct in every join.
In many ways, understanding the primary key of a table is essentially the same as understanding what the table is and what might be done with it. The only thing that needs to be true is that the column (or set of concatenated columns) you choose as the primary key must have no repeated values in it. This is easy to test in Looker's SQL Runner as described in the Getting Your Primary Key Right Using SQL Runner and Creating a Compound Primary Key article.
Now that we’ve verified our primary keys, we can determine the correct value for the join’s relationship parameter. The purpose of the relationship parameter is to tell Looker whether to invoke symmetric aggregates when the join is written into a SQL query. A possible approach here would be to tell Looker to always invoke them, which would always produce accurate results. However, this comes at a performance cost so it is best to do so judiciously.
The process to determine the correct value is slightly different between inner and outer joins.
Consider an example. We have a table of orders with a primary key of order_id and a table of customers with a primary key of customer_id.
This join would be represented in LookML like this:
The “many_to_one” relationship here is referring to the number of times one value of the join field (customer_id) is represented in each table. In the left table (orders), a single customer ID is represented many times (or at least one is, customer #1). In the right table, every customer ID is only represented once since it is the primary key of that table. Therefore, records in the left table could have many matches for a single value in the right table. If customer_id wasn’t unique in every row of the customer table, then the relationship would be many_to_many.
This same reasoning can be determined programmatically by checking primary keys. Just follow these steps to determine the correct relationship value:
- Start by writing many_to_many as the relationship. As long as our primary keys are correct, this will always produce accurate results because Looker will always trigger the symmetric aggregation algorithm and enforce accuracy. However, since the algorithm complicates queries and adds run time, it is beneficial to try and change one or both sides to ‘one’ instead of ‘many.’
- Take a look at the field(s) which are in your sql_on clause from the left table. If together they form the primary key of the left table, you can change the left side of the relationship parameter to “one.” If not, it typically must remain a “many”. (For information about a special case, see the "Fine Print" section below.)
- Next, look at the field(s) representing your right table. If they form the primary key of the right table, you can change the right side to a “one.”
It’s best practice to write your sql_on phrase starting with the left table represented on the left side of equalities or inequalities, and the right table on the right side. That way it will be easier to determine the relationship, and also make it clear at a glance to which table already in the explore you are attaching this new table.
For outer joins, we also need to take into consideration that a fanout might also occur when null records are added during the join. This is particularly important because left outer joins are the default in Looker. While null records do not affect sums or averages, they do affect the way Looker runs a measure of type count. If this is done incorrectly, the null records will be counted (which is undesirable).
In a full outer join, null records can be added to either table if its join key is missing values that exist in the other table. Consider an example:
Just like in an inner join, the relationship between the tables’ primary keys is many_to_one. However, the added null record forces the need for symmetric aggregates on the left table as well. Therefore, we must change the relationship parameter to many_to_many, because performing this join disrupts counts on the left table.
If this example had been a left outer join, the null row would not have been added and the extra customer record would have been dropped. In that case, the relationship would still be many_to_one. This is the Looker default because it is assumed the base table defines the analysis. In this case we are analyzing orders, not customers. If the customer table was at the left the situation would be different.
In some Explores, the base table joins to one or more views which in turn need to join to one or more other views. In our example here, that would mean a table joined to the customer table. In these situations, it is best to only look at the individual join being written when evaluating the relationship parameter. Looker will understand when a downstream fanout affects a query even though the affected view isn’t in the join which actually created the fanout.
How Does Looker Help Me?
There are mechanisms in Looker to help ensure this is done right. One is a check for primary key uniqueness. Whenever there is a fanout and symmetric aggregates are needed to compute a measure, Looker checks the leveraged primary key for uniqueness. If it is not unique, an error will appear at query run time (but there is no LookML Validator error for this).
Also, if there is no way for Looker to handle a fanout (usually because no primary key is indicated), no measures will appear in the Explore from that view. Simply designate a field as the primary key to allow your measures to get into the Explore.
- Some dialects Looker can connect with but there isn't support for symmetric aggregates. Find that list in our docs.
- The Inner Join section above says to determine the correct relationship value you should look at the field(s) which are in your sql_on clause from the left table: "If together they form the primary key of the left table, you can change the left side of the relationship parameter to 'one.' If not, it typically must remain as 'many'. This is true unless your table contains multiple columns which have no repeated records in them. In this case, you can treat any such column as if it were a primary key when formulating your relationship, even if it is not the column designated “primary_key: yes”. It can be handy to make sure that there is some sort of software rule in place that ensures this will always remain true for the column you designate. If so, go ahead and treat it as such and make note of its special property in the view file for others to reference in the future (complete with SQL Runner link to prove it). Be aware, though. Looker will confirm the truth of implied uniqueness when a field is designated as the primary_key, but it will not do the same for other fields. It will simply not invoke the symmetric aggregates algorithm.