When creating persistent derived tables (PDTs), performance can be an issue. Especially when the table is very large, querying the table may be slow, just as it can be for a large table in your database. You can improve performance by filtering the data or by controlling how the data in the PDT is sorted and indexed:
- Add filters to limit the dataset.
With particularly large datasets, having many many rows will slow down queries against a PDT. If you usually query only recent data, it might a good idea to add a filter to the WHERE clause of your PDT. This way, only relevant data will be added to the table each time it rebuilds so that running queries will be much faster. Then you can create a separate, larger PDT for historical analysis to allow for both fast queries for recent data, and the ability to query old data.
- Use
indexes
(MySQL or Postgres) orsortkeys
anddistribution
(Redshift).When creating a large PDT, indexing the table (on MySQL or Postgres) or adding sortkeys/distribution (on Redshift) can help immensely with performance. It is usually best to add
indexes
on ID or date fields. For Redshift, it is usually best to addsortkeys
on ID or date fields and adistribution
on the field used for joining.
Recommended Settings to Improve Performance
The following settings are optional, but highly recommended. They control how the data in the PDT is sorted and indexed:
- For RedShift and Aster, distribution specifies the column name whose value is used to spread the data around a cluster. When two tables are joined by the column specified in the
distribution
the database can find the join data on same node, so internode I/O is minimized.Example:
distribution: customer_id
- For RedShift, setting the
distibution_style
toALL
instructs the database to keep a complete copy of the data on each node. This is often used to minimize internode I/O when joining relatively small tables. Setting this value toEVEN
instructs the database to spread the data evenly through the cluster without using a distribution column. This value can only be specified whendistribution
is not specified.Example:
distribution_style: ALL
- The parameter
sortkeys
is used on RedShift only. The values specify what columns of the PDT are used to sort the data on disk to make searching easier. On RedShift,sortkeys
orindexes
may be used, but not both.Example:
sortkeys: [customer_name, first_purchase_date, last_purchase_date]
- The parameter
indexes
is used on most databases. - The values specify what columns of the PDT are indexed. (On RedShift,
indexes
are used to generate interleaved sort keys.)Example:
indexes: [column_name, column_name, ...]
The overall definition of a PDT could be:
view: customer_facts { derived_table: { SELECT ... FROM ... WHERE ... ;; sql_trigger_value: SELECT MAX(last_update_ts) FROM customers ;; distribution: customer_id; sortkeys: [customer_name, first_purchase_date, last_purchase_date] } # field definitions would continue here. }