Background
A Persistent Derived Table (or PDT for short) is a table that Looker can create and manage in a target database. The table is loaded with data from a SQL statement defined in a view file, and is refreshed on a regular basis. This is in contrast to Ephemeral Derived Tables, which are never written to the database, but are used in SQL queries like a database view.
In a LookML view, an Ephemeral DT is defined as:
```lookml
view: customer_facts {
  derived_table: {
    sql: SELECT ... FROM ... WHERE ... ;;
  }
}
```
Enabling PDT Capabilities for Database Connections
Starting in Looker 7.8, admins can set a level of concurrency for PDT builds with the Max PDT Builder Connections setting on the Connection Settings admin page.
To add persistence to derived tables, enable the Persistent Derived Tables setting and specify a database or schema in the Temp Dataset field on the Connection Settings page, available through the Admin panel.
The username Looker uses to connect to the database must have permission to `CREATE` and `DROP` tables and other database objects in the specified Temp Dataset schema. Dialect-specific instructions for configuring a temp schema and user permissions can be found in Looker's Database Configuration documentation.
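For example, on a PostgreSQL-family dialect, the setup might look like the following sketch (the schema name `tmp` and user `looker_user` are placeholder assumptions; check the dialect-specific documentation for exact syntax):

```sql
-- Hypothetical PostgreSQL-style setup; names and exact syntax vary by dialect
CREATE SCHEMA tmp;
GRANT ALL ON SCHEMA tmp TO looker_user;
```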
Adding Persistence
In a LookML view file, a PDT must be persisted with either a `persist_for` parameter, a `sql_trigger_value` parameter, or a `datagroup_trigger` that is tied to a `datagroup`, but not a combination of those options.
- Use `persist_for` to specify a "time to live" for the PDT.
  - For example: `persist_for: "8 hours"`.
- Use `sql_trigger_value` to specify a query that returns a single value that will change when the PDT should be regenerated.
  - For example, to rebuild whenever the `MAX(last_update_ts)` value changes, you could use: `sql_trigger_value: SELECT MAX(last_update_ts) FROM customers`.
- Use `datagroup_trigger` to group PDT rebuilds and/or tie them to your ETL or caching processes.
  - For example, to rebuild when triggered by the datagroup named `order_datagroup`, you could use: `datagroup_trigger: order_datagroup`.
In the LookML view file, a PDT might look like this:
```lookml
view: customer_facts {
  derived_table: {
    sql: SELECT ... FROM ... WHERE ... ;;
    sql_trigger_value: SELECT MAX(last_update_ts) FROM customers ;;
  }
  # field definitions would continue here.
}
```
A PDT must have either a `persist_for: "duration"`, `sql_trigger_value: sql_statement`, or `datagroup_trigger: datagroup_name` attribute defined.
Using persist_for
When using `persist_for`, the PDT is built the first time a user's query attempts to use it. Other queries that reference the PDT will use the cached PDT data, as long as the duration specified in `persist_for` has not elapsed since the PDT was last built. If the duration has elapsed, the PDT is rebuilt when it is next needed.
For example, `persist_for: "20 minutes"` means the query results returned from a PDT will always be less than 20 minutes old. If the cached data is older, the PDT is rebuilt before a query's results are returned. A PDT is only rebuilt in response to a query attempting to use it, so if the system is idle, the PDT may not be rebuilt for some time. As a result, some user queries that attempt to use the PDT may need to wait while it is rebuilt, and those users will see slower response times.
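Tying this together, a minimal sketch of a view persisted for 20 minutes, following the pattern used above (the view name and SQL are placeholders):

```lookml
view: customer_facts {
  derived_table: {
    sql: SELECT ... FROM ... WHERE ... ;;
    # rebuilt on next use once the cached data is more than 20 minutes old
    persist_for: "20 minutes"
  }
}
```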
Using sql_trigger_value
When using `sql_trigger_value: sql_statement`, Looker's PDT regenerator thread runs the specified `sql_statement` on a regular basis (by default, every five minutes) and records the result. If the result differs from the previous run, the PDT is reloaded. This means the PDT can be loaded before a user actually needs the data. Typically, the `sql_statement` looks like one of these examples:
- `SELECT COUNT(*) FROM source_table`: reload the data whenever the number of rows in some other table increases.
- `SELECT MAX(update_timestamp) FROM source_table`: reload the table whenever the underlying data is updated.
- `SELECT CURRENT_DATE`: reload the table at midnight.
- `SELECT CAST(DATE_ADD(hrs, -3, CURRENT_TIMESTAMP) AS DATE)`: reload the table at 3 AM.
The frequency with which Looker's regenerator thread checks and runs `sql_trigger_value` and `datagroup` queries can be adjusted on the Connection Settings page, accessible through the Admin panel.
Using datagroup_trigger
Persisting PDTs with `datagroup_trigger` is similar to using `sql_trigger_value`, except that the `sql_statement` is specified in the `sql_trigger` parameter of a `datagroup`, which is defined at the model level. Datagroups can also specify a time limit on cache age with an additional parameter, `max_cache_age`. Unlike `persist_for`, `sql_trigger` and `max_cache_age` can be applied together in a `datagroup` to create a caching policy that covers both PDTs and Explores.
For example, let's say we've defined the following `datagroup` in a model:
```lookml
datagroup: orders_datagroup {
  sql_trigger: SELECT max(id) FROM my_tablename ;;
  max_cache_age: "24 hours"
}
```
Applying this datagroup to a PDT with `datagroup_trigger: orders_datagroup` will rebuild the derived table whenever the value returned by `sql_trigger: SELECT max(id) FROM my_tablename` differs from the previously stored value. Additionally, `max_cache_age: "24 hours"` invalidates the data once it has been cached for 24 hours, to avoid the risk of queries returning old data.
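In the view file, this might look like the following sketch, reusing the customer_facts pattern from above (the SQL is a placeholder):

```lookml
view: customer_facts {
  derived_table: {
    sql: SELECT ... FROM ... WHERE ... ;;
    # persistence is governed by the model-level datagroup defined above
    datagroup_trigger: orders_datagroup
  }
}
```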
As with `sql_trigger_value`, Looker's regenerator thread checks `datagroup` `sql_trigger` queries approximately every five minutes by default. This setting can be adjusted on the Connection Settings page, accessible through the Admin panel.
Recommended Settings For Sorting and Indexing
The following settings are optional but highly recommended. They control how the data in the PDT is sorted and indexed.
For Redshift and Aster, `distribution` specifies the column whose value is used to spread the data around the cluster. When two tables are joined by the column specified in the `distribution` parameter, the database can find the join data on the same node, so internode I/O is minimized.

Example: `distribution: customer_id`
For Redshift, setting the `distribution_style` parameter to `ALL` instructs the database to keep a complete copy of the data on each node. This is often used to minimize internode I/O when joining relatively small tables. Setting this value to `EVEN` instructs the database to spread the data evenly through the cluster without using a distribution column; `EVEN` can only be specified when `distribution` is not specified.

Example: `distribution_style: ALL`
The `sortkeys` parameter is used on Redshift only. Its values specify which columns of the PDT are used to sort the data on disk to make searching easier. On Redshift, `sortkeys` or `indexes` may be used, but not both.

Example: `sortkeys: [customer_name, first_purchase_date, last_purchase_date]`
The `indexes` parameter is used on most databases. Its values specify which columns of the PDT are indexed. (On Redshift, `indexes` are used to generate "interleaved sort keys.")

Example: `indexes: [column_name, column_name, ...]`
So, the overall definition of a PDT could be:
```lookml
view: customer_facts {
  derived_table: {
    sql: SELECT ... FROM ... WHERE ... ;;
    sql_trigger_value: SELECT MAX(last_update_ts) FROM customers ;;
    distribution: customer_id
    sortkeys: [customer_name, first_purchase_date, last_purchase_date]
  }
  # field definitions would continue here.
}
```
Understanding PDT Names
When Looker determines that the PDT should be created, it generates a PDT name made up of the "scratch schema" + "table status code" + "hash value" + "view name", by:
- Generating a new unique code for the name of the table; it will be a string of letters and numbers. For example: `4DEM41PGKN2KFC63Y8RTF`
- Prepending the unique code with a string like `LC$` to indicate the current status of the table (`LC$` = "Looker Create" and `LR$` = "Looker Read"). For example: `LC$4DEM41PGKN2KFC63Y8RTF`
- Appending the name of the view to create a table name for the generated table. For example: `LC$4DEM41PGKN2KFC63Y8RTF_customer_fact`
- Prepending the scratch schema. For example: `tmp.LC$4DEM41PGKN2KFC63Y8RTF_customer_fact`
Once the PDT name is created, Looker:
- Uses the derived table SQL to fashion a `CREATE TABLE AS SELECT` (or CTAS) statement and executes it. For example: `CREATE TABLE tmp.LC$4DEM41PGKN2KFC63Y8RTF_customer_fact AS SELECT ... FROM ... WHERE ...`
- When the table is built, issues the statements to create the indexes.
- Renames the table from `LC$..` ("Looker Create") to `LR$..` ("Looker Read") to indicate that the table is ready to use.
- Drops any old version of the table that should no longer be in use.
There are a few important implications of this:
- The SQL that forms the derived table must be valid inside a CTAS statement.
- The column aliases in the result set of the `SELECT` statement must be valid column names.
- The names used when specifying `distribution`, `sortkeys`, and `indexes` must be the column names listed in the SQL definition of the derived table, not the field names defined in the LookML (illustrated in the sketch below).
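To illustrate that last point, here is a hypothetical sketch (the view and column names are invented for this example): the `indexes` values must match the SQL column aliases, such as `lifetime_value`, not the LookML field names.

```lookml
view: customer_value {
  derived_table: {
    sql: SELECT customer_id, SUM(order_value) AS lifetime_value
         FROM warehouse.order
         GROUP BY customer_id ;;
    persist_for: "24 hours"
    # These must be the SQL column aliases above, not LookML field names:
    indexes: [customer_id, lifetime_value]
  }
}
```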
When to Use a PDT
First, Try Not to Use a PDT
A PDT is often very useful for a specific analysis, but it also makes the resulting Explore less flexible.
A trap that new Looker developers often fall into is taking the SQL they previously used for their reports and inserting it into Looker unchanged. This reduces Looker to working like the tools it replaces. Instead, take advantage of Looker's more flexible and powerful capabilities.
As a general rule of thumb, a developer should try to model without using PDTs. Wait until an issue is actually identified before building a PDT.
Even when an issue is identified, in some cases it can be resolved through other means. Make sure to analyze the execution plans of slow queries. This can be done in Looker with the Explain from SQL Runner tool. The addition of an index or the change of a column data type might resolve an issue without the need to build a PDT.
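In many dialects, getting the plan is as simple as prefixing the query, for example (a sketch reusing the tables from this article; the exact command and output vary by database):

```sql
-- Inspect the execution plan of a slow query (syntax varies by dialect)
EXPLAIN
SELECT c.customer_id, COUNT(o.order_id)
FROM warehouse.customer c
LEFT JOIN warehouse.order o ON c.customer_id = o.customer_id
GROUP BY c.customer_id;
```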
Using PDTs to Test Optimizations
The PDT functionality makes it very simple to test different indexes, distributions, and other options without needing a large amount of support from your DBA or ETL developers.
Consider a case where you have a table, but wish to test different indexes. Your initial LookML for the view may look like this:
```lookml
view: customer {
  sql_table_name: warehouse.customer ;;
}
```
To test the new indexes, change the view LookML like this:
```lookml
view: customer {
  # sql_table_name: warehouse.customer
  derived_table: {
    sql: SELECT * FROM warehouse.customer ;;
    persist_for: "8 hours"
    indexes: [customer_id, customer_name, salesperson_id]
  }
}
```
Query this once to trigger generation of the PDT. Then run your test queries and compare the results. If the results are good, you can ask your DBA or ETL team to add those indexes to the original table.
Remember to change your view code back to remove the PDT.
Pre-Joining/Aggregating Data
Sometimes the volume or type of data is such that pre-joining and/or aggregating some of the data is very useful.
For example, suppose we want to report on customers by cohort based on when they made their first order. This query might be expensive in real time, but using the following PDT, the query can be calculated once and reused:
```lookml
view: customer_order_facts {
  derived_table: {
    sql:
      SELECT
        c.customer_id,
        MIN(o.order_date) OVER (PARTITION BY c.customer_id) AS first_order_date,
        MAX(o.order_date) OVER (PARTITION BY c.customer_id) AS most_recent_order_date,
        COUNT(o.order_id) OVER (PARTITION BY c.customer_id) AS lifetime_orders,
        SUM(o.order_value) OVER (PARTITION BY c.customer_id) AS lifetime_value,
        RANK() OVER (PARTITION BY c.customer_id ORDER BY o.order_date ASC) AS order_sequence,
        o.order_id
      FROM warehouse.customer c
      LEFT JOIN warehouse.order o ON c.customer_id = o.customer_id ;;
    sql_trigger_value: SELECT CURRENT_DATE ;;
    indexes: [customer_id, order_id, order_sequence, first_order_date]
  }
}
```
Add a Primary Key Using a PDT
Primary keys can be critical when calculating aggregates. Most tables have some column or expression that serves as a primary key. Every now and then, however, a table might not have any reasonable way to uniquely identify a row. In this case, a PDT can be used to generate a unique ID:
```lookml
view: transaction {
  # sql_table_name: warehouse.transaction
  derived_table: {
    sql:
      SELECT
        -- generate a synthetic primary key for each row
        ROW_NUMBER() OVER (ORDER BY transaction_date) AS transaction_pk,
        t.*
      FROM warehouse.transaction t ;;
    sql_trigger_value: SELECT CURRENT_DATE ;;
    indexes: [customer_id, product_id, salesperson_id, transaction_date]
  }
  dimension: transaction_pk {
    type: number
    primary_key: yes
    hidden: yes
    sql: ${TABLE}.transaction_pk ;;
  }
  # additional field definitions would continue here.
}
```
PDT Anti-Patterns
There are several situations where a PDT might not be the best choice:
- The most important issue occurs when the underlying data changes frequently and a "snapshot in time" is not sufficient.
  - For example, consider the `customer_order_facts` PDT above. If the data is fed from the transactional system on a nightly basis, this pattern makes sense. If the data is fed continuously or nearly continuously, the PDT will quickly become out of date. Does it matter if the Looks and dashboards that rely on this view are not up to date? That depends on the use case. Do your non-technical Looker end users understand that this data is not constantly updating? Consider addressing that in training, with field naming choices, with notes and descriptions on fields and Explores, or with notes on a dashboard tile.
- The cost and time involved in rebuilding PDTs often enough may be too high.
  - In the prior situation, the `sql_trigger_value` could be set so the table rebuilds more often. However, the rebuild process recreates the entire table rather than just modifying a few rows of data. At some rebuild frequency, the cost becomes too high and the database spends more time rebuilding PDTs than serving useful queries. In these cases, instead of a PDT in Looker, consider having your ETL processes create the table using an incremental update strategy, as sketched below.
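As a rough illustration, an incremental ETL statement might look like the following sketch (the table and column names are hypothetical, and real incremental loads must also handle updated and deleted rows):

```sql
-- Hypothetical incremental load: append only rows newer than what is
-- already loaded, instead of rebuilding the whole table
INSERT INTO warehouse.customer_order_facts
SELECT ...
FROM warehouse.order o
WHERE o.last_update_ts >
      (SELECT MAX(last_update_ts) FROM warehouse.customer_order_facts);
```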