Please also refer to our Aggregate Awareness documentation page for additional details.
Introduction
This article is intended to be a step-by-step guide for implementing aggregate awareness in a practical scenario, including identifying opportunities for implementation, the value aggregate awareness drives, and a simple workflow for implementing it in a real model. It is not a deep explanation of every aggregate awareness feature or edge case, nor an exhaustive catalogue of its capabilities.
What Is Aggregate Awareness?
In Looker, you mostly query against raw tables or views in your database. Sometimes these are Looker persistent derived tables (PDTs).
You may often encounter very large datasets or tables that, in order to be performant, require aggregation tables or roll-ups.
Commonly, you may create aggregation tables like an orders_daily table that contains limited dimensionality. These tables need to be modeled separately in the Explore, and they do not sit neatly in the model. These limitations lead to poor user experiences when users have to choose between multiple Explores for the same data.
Now, with Looker's aggregate awareness, you can pre-construct aggregate tables to various levels of granularity, dimensionality, and aggregation; and you can inform Looker of how to use them within existing Explores. Queries will then leverage these roll-up tables where Looker deems appropriate, without any user input. This will cut down query size, reduce wait times, and enhance user experience.
What Value Does Aggregate Awareness Offer?
There are a number of significant value propositions aggregate awareness offers to drive extra value from your existing Looker model:
- Performance improvement: Implementing aggregate awareness will make user queries faster. Looker will use a smaller table if it contains data needed to complete the user's query.
- Cost savings: Certain dialects charge by the size of query on a consumption model. By having Looker query smaller tables, you will have a reduced cost per user query.
- User experience enhancement: Along with an improved experience that retrieves answers faster, consolidation eliminates redundant Explore creation.
- Reduced LookML footprint: Replacing existing, Liquid-based aggregate awareness strategies with flexible, native implementation leads to increased resilience and fewer errors.
- Ability to leverage existing LookML: Aggregate tables use the query object, which reuses existing modeled logic rather than duplicating logic with explicit custom SQL.
Basic Example
Below is a very simple implementation in the Looker model to demonstrate how lightweight aggregate awareness can be. This tutorial explains each step, the path to choosing the Explore that's being optimized, and the type of aggregate table being built.
Let's say we have a flights table with one row for every flight recorded through the FAA, and we modeled this in Looker with its own view and Explore. We can then define an aggregate table like this:
explore: flights {
  aggregate_table: flights_by_week_and_carrier {
    query: {
      dimensions: [carrier, depart_week]
      measures: [cancelled_count, count]
    }
    materialization: {
      sql_trigger_value: SELECT CURRENT_DATE ;;
    }
  }
}
That's it. Now a user querying the flights Explore will automatically be able to leverage the aggregate table defined above. The user won't have to inform Looker of any special conditions; Looker will just use that table if it's a fit for the fields the user selects.
Here is an example Looker SQL tab for a query that uses the aggregate table:
Here is an example Looker SQL tab for a query that does not have an associated aggregate table:
Setup
Enabling PDTs
In order for Looker to utilize aggregate awareness, PDTs need to be enabled on the connection you intend to use. To enable PDTs for your connection, follow the steps provided for your specific dialect here in our documentation.
Supported Databases
Aggregate awareness is supported by the majority of Looker-supported database dialects, including Google BigQuery, Amazon Redshift, Snowflake, MySQL, Postgres, SQL Server, and Athena.
Notable exceptions include Druid, Dremio, and Oracle ADWC.
Check the Looker documentation for the list of supported dialects for the most current release.
Step 1 — Identify Opportunities
Now that you have enabled your connection for PDTs, you are ready to start implementing aggregate awareness. The first step is to identify where aggregate awareness can play a role in optimization or driving any of the values mentioned above.
Content, Dashboards
The first step to finding opportunities for implementing aggregate awareness is finding dashboards and Looks that are slow. Often, these will be obvious; your users are one of the most valuable resources for finding poorly performing content. However, that alone is often not sufficient, so you may need to let the data identify content that performs poorly but is still heavily used.
For identifying dashboards with a high runtime, you can use your Looker instance's System Activity. As a shortcut, you can open this link in a browser, then replace "hostname" in the URL with your Looker instance's name. You'll see an Explore like this:
As you can see, there are a number of dashboards that are high utilization and also perform worse than the mean. An obvious choice here would be the Sample Visualizations dashboard shown above. It uses two Explores, so there's opportunity to optimize both.
Explores
The next step is figuring out which Explores need to be optimized with aggregate awareness. Identifying Explores that are heavily used and have clearly problematic areas (lower than average query response times) is key to addressing those issues first.
Looker provides pre-built queries to identify the most-used Explores as well as a pre-built Explore you can use to understand the field usage patterns in said Explores.
You can again use System Activity as a starting point for identifying opportunities for optimizing Explores. As a shortcut, you can open this link in a browser, then replace "hostname" in the URL with your Looker instance's name. You'll see an Explore like this:
It's easy to identify at a glance Explores that are:
- Used heavily by users (as opposed to queries from the API or from scheduled deliveries)
- Queried often
- Performing poorly (relative to others)
From this, we can conclude that the flights and order_items Explores are probable candidates for aggregate awareness implementation.
Data, Field Usage
Finally, we can identify other opportunities at the data level by understanding fields that users commonly select and filter by. Having understood the content problems and the Explores that drive those dashboards, now we can see what dimensions and measures we'll want to use in our aggregate tables.
Use the System Activity Field Usage query to understand the commonly selected fields within the Explores you identified above. As a shortcut, you can open this link in a browser, then replace "hostname" in the URL with your Looker instance's name. Replace filters accordingly. You'll see an Explore like this:
It's quite clear that flights.count and flights.depart_week are the two most commonly selected fields for this Explore. We will note that for when we build our aggregate tables.
It's also important to note that there are elements of subjectivity and understanding of the data that will guide your selection criteria.
For example, we can safely assume that, from the top four fields above, users commonly want to understand how many flights there were and how many were canceled, and they want to see that data by week and broken down by carrier. That seems like a clear, logical, and real-world combination of fields and metrics.
Summary
The above steps should serve as a guide for finding dashboards, Explores, and fields that should be considered for optimization. It's also worth understanding that these three signals may not overlap: the problematic dashboards may not be powered by the problematic Explores, and building aggregate tables with the commonly used fields may not help those dashboards at all. It is possible these end up as three discrete aggregate awareness implementations.
Step 2 — Designing Aggregate Tables
Having identified problematic areas of the model and content that aggregate awareness can assist with, the next step is arguably the most important: identifying what type of tables will best serve these problem areas.
Aggregate tables do not need to be an exact match for your query to be used. If your query is at the week granularity and you have a daily roll-up table, Looker will use your aggregate table instead of your raw, timestamp-level table. Similarly, an aggregate table rolled up to the day level can also serve queries at coarser timeframes, such as month or year.
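To make this concrete, here is a minimal sketch of such a daily roll-up; it assumes the flights view exposes a depart_date timeframe from the same dimension group as depart_week, so Looker could also use this table for week-, month-, or year-level queries on that dimension group:
explore: flights {
  aggregate_table: flights_daily_rollup {
    query: {
      # depart_date is an assumed date-level timeframe of the same
      # dimension group as depart_week; a finer timeframe can be
      # rolled up to answer coarser timeframe queries.
      dimensions: [depart_date]
      measures: [cancelled_count, count]
    }
    materialization: {
      sql_trigger_value: SELECT CURRENT_DATE ;;
    }
  }
}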
Supported Aggregates
It's important to understand what is or is not possible with aggregate awareness before you start building your tables.
Supported Measures:
- Standard measures: Measures of type SUM, COUNT, AVERAGE, MIN, and MAX
- Composite measures: Measures of type NUMBER, STRING, YESNO, and DATE
- Approximate distinct measures: Dialects that can use HyperLogLog functionality
Unsupported Measures:
Important: While these are unsupported, you can still leverage aggregate awareness by generating exact-match aggregate tables.
- Distinct measures: Because distinctness can be calculated only on atomic, non-aggregated data, *_DISTINCT measures are not supported outside of the approximates that use HyperLogLog.
- Cardinality-based measures: As with distinct measures, medians and percentiles cannot be pre-aggregated and are not supported.
Aggregate Table Granularity
Before arbitrarily building tables for combinations of dimensions and measures, you should determine common patterns of usage and field selection to make aggregate tables that will be used as often as possible with the biggest impact. Note that all fields used in the query (whether selected or filtered) must be in the aggregate table in order for the table to be used for the query. Again, refer to the above note about exact match granularity as you can cover multiple dimensionalities within a single aggregate table and still see large performance gains.
For example, when optimizing the Flights Explore from above, you can assume that users commonly query for Depart Week and Flights Count but also filter by Carrier. Therefore, creating a flights_by_week_and_carrier table (example below) will result in more frequent aggregate table usage than a set of flights_by_week and flights_by_carrier tables. Again, your business users and anecdotal evidence, as well as data from Looker's System Activity, will help guide your decision-making process.
explore: flights {
  aggregate_table: flights_by_week_and_carrier {
    query: {
      dimensions: [carrier, depart_week]
      measures: [cancelled_count, count]
    }
    materialization: {
      sql_trigger_value: SELECT CURRENT_DATE ;;
    }
  }
}
In our example above, there are two dimensions selected very frequently (flights.depart_week and flights.carrier) as well as two measures (flights.count and flights.cancelled_count). As such, a logical aggregate table to build would use all four of these fields. So our first aggregate table will be flights_by_week_and_carrier and will look like this:
Note that generating this query from the base table on Amazon Redshift took 15.8 seconds and scanned 38 million rows, without any joins. Pivoting the query, which would be a normal user operation, took 29.5 seconds.
After implementing our flights_by_week_and_carrier table, the subsequent query took 7.2 seconds and scanned 4,592 rows (a 99.98% reduction in table size). Pivoting the query took 9.8 seconds.
Despite the small sample size, we can see how often these fields are used:
Even if we very modestly estimate that 25% of these queries used all four fields in the simplest fashion (a simple select, no pivot), eliminating the 8.6-second delta across those 3,379 queries works out to roughly 8 hours, 4 minutes of aggregate user wait time saved (3,379 × 8.6 seconds).
Note that the example model we're using here is very basic. These results should not be used as a benchmark or frame of reference for your model.
After applying the exact same flow to our ecommerce model (the order_items Explore shown above), the results are below:
Source          | Query Time   | Rows Scanned
Base Table      | 13.1 seconds | 285,000
Aggregate Table | 5.1 seconds  | 138,000
Delta           | 8 seconds    | 147,000
The fields used in the query and subsequent aggregate table were brand, created_date, orders_count, and total_revenue, using two joins. The fields had been used a total of 11,000 times. Estimating the same combined usage of ~25%, the aggregate saving for users would be 6 hours, 6 minutes (8 seconds × 2,750 = 22,000 seconds). The aggregate table took 17.9 seconds to build.
Looking at these results, it's worth taking a moment to step back and assess the returns potentially gained from:
- Optimizing larger, more complex models/Explores that have "acceptable" performance and may see performance improvements from better modeling practices
versus
- Using aggregate awareness to optimize simpler models that are used more frequently and are performing poorly
It is possible to start seeing diminishing returns as you squeeze the last bit of performance from Looker and your database. Always be cognizant of baseline performance expectations, particularly from business users, and of whatever limitations your database imposes (such as concurrency, query thresholds, and cost), which you shouldn't expect aggregate awareness to overcome.
Other Considerations
When designing an aggregate table, remember that having more fields will result in a bigger, slower table. Bigger tables can optimize more queries and therefore be used in more situations, but won't be as fast.
For example:
explore: flights {
  aggregate_table: flights_by_week_and_carrier {
    query: {
      dimensions: [carrier, depart_week, flights.distance, flights.arrival_week, flights.cancelled]
      measures: [cancelled_count, count, flights.average_distance, flights.total_distance]
    }
    materialization: {
      sql_trigger_value: SELECT CURRENT_DATE ;;
    }
  }
}
This will result in the aggregate table being used for any combination of the dimensions shown with any of the measures, so its "spread" is quite wide. But a simple SELECT of carrier and count now produces a scan of an 885,000-row table, as opposed to 4,592 rows when the table was based on two dimensions. That 885K-row table is still a 97% reduction in table size (from the original 38M rows); however, adding one more dimension increases the table size to 20M rows. As such, there are diminishing returns as you increase the spread.
Step 3 — Building Tables
We hope you can now see the value of aggregate awareness, understand the scope of its use, and have a good understanding of the types of tables you want to build. This section explains the methods involved in creating the tables themselves.
Taking our example Explore that we investigated and identified as an opportunity to optimize, we'll build three different aggregate tables for it:
- flights_by_week_and_carrier
- flights_by_month_and_distance
- flights_by_year
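The flights_by_week_and_carrier table was already defined above. As a rough sketch of what the other two might look like (the depart_month, depart_year, and distance field names are assumptions; use whatever timeframes and dimensions your flights view actually defines):
explore: flights {
  aggregate_table: flights_by_month_and_distance {
    query: {
      # depart_month is assumed to be a timeframe of the depart dimension group
      dimensions: [depart_month, distance]
      measures: [cancelled_count, count]
    }
    materialization: {
      sql_trigger_value: SELECT CURRENT_DATE ;;
    }
  }

  aggregate_table: flights_by_year {
    query: {
      # depart_year is also an assumed timeframe name
      dimensions: [depart_year]
      measures: [cancelled_count, count]
    }
    materialization: {
      sql_trigger_value: SELECT CURRENT_DATE ;;
    }
  }
}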
Example: Building the flights_by_week_and_carrier table
Do the following to build an aggregate table for flights_by_week_and_carrier:
- Go to the Explore you have identified for aggregate awareness optimization.
- Select the dimensions and measures to be included in the aggregate table (flights.depart_week, flights.carrier, flights.count, flights.cancelled_count).
- Go to the Explore's gear menu and select Get LookML.
- Copy the LookML from the aggregate_table section.
- Paste the LookML into the model file and save the file.
Now we have the flights_by_week_and_carrier table to use for our queries. Once you deploy this change to production, your users will also leverage this aggregate table.
Persistence
Any aggregate table you create has a persistence parameter similar to PDTs and NDTs. You should take care to align the automatic regeneration of these aggregate tables with your caching policy by leveraging datagroups. Best practice is to use the same datagroup for the aggregate table that is used for the Explore. In the absence of datagroups, use the sql_trigger_value parameter instead. A generic, catch-all value for this is:
sql_trigger_value: SELECT CURRENT_DATE() ;;
This will rebuild your aggregate tables once a day, when the value of CURRENT_DATE() changes.
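For example, here is a minimal sketch of pairing an aggregate table with the datagroup that already governs the Explore's cache; the datagroup name flights_default_datagroup is an assumption, so substitute whatever datagroup your model actually uses:
datagroup: flights_default_datagroup {
  # Assumed datagroup; reuse the one already tied to the Explore's caching policy
  sql_trigger: SELECT CURRENT_DATE ;;
  max_cache_age: "24 hours"
}

explore: flights {
  persist_with: flights_default_datagroup
  aggregate_table: flights_by_week_and_carrier {
    query: {
      dimensions: [carrier, depart_week]
      measures: [cancelled_count, count]
    }
    materialization: {
      # Rebuild the aggregate table on the same trigger as the Explore cache
      datagroup_trigger: flights_default_datagroup
    }
  }
}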
Timeframe Logic
When Looker builds an aggregate table, it will contain data up to the point in time the aggregate table was built. Any data that has been subsequently appended to the base table would normally be excluded from the results of a query using that aggregate table.
In this diagram, the two orders received today will not be present in the Orders aggregate table, since the orders were received after the aggregate table was built.
However, Looker can UNION the aggregate table and the base table together when a user queries for a timeframe that overlaps with the aggregate table.
Now, if a user filters for a timeframe that overlaps with the end of both the aggregate and the base table, as shown above, the three orders we see in blue and the two orders in green will both be shown in the user's results. In order to trigger this behavior in Looker, certain conditions need to be met:
- The aggregate table has a time filter.
- The aggregate table includes a dimension based on the same time field as the time filter.
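As a sketch of an aggregate table that satisfies both conditions (the orders Explore, the created_date dimension, the 90-day filter value, and the orders_datagroup name are illustrative assumptions):
explore: orders {
  aggregate_table: orders_by_date {
    query: {
      # Condition 2: a dimension based on the same time field as the filter
      dimensions: [created_date]
      measures: [count]
      # Condition 1: a time filter on the aggregate table
      filters: [orders.created_date: "90 days"]
    }
    materialization: {
      datagroup_trigger: orders_datagroup
    }
  }
}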
Summary
To recap, to build an aggregate awareness implementation, there are three fundamental steps:
- Identify opportunities where optimization using roll-up tables is appropriate and impactful.
- Design aggregate tables that will provide the most coverage for common user queries while still remaining small enough to reduce the size of those queries sufficiently.
- Build the aggregate tables in the Looker model, pairing the persistence of the table with the persistence of the Explore cache.