These best practices reflect recommendations shared by a cross-functional team of seasoned Lookers. These insights come from years of experience working with Looker customers from implementation to long-term success. The practices are written to work for most users and situations, but as always use best judgment when implementing.
Redshift is a clustered, columnar store cloud database that consists of nodes and is well-suited to large analytical queries against massive datasets. However, query performance is negatively affected when too much data has to be transferred across a cluster network. While some network traffic is inevitable, reducing network traffic can help to mitigate this performance degradation.
There are a variety of things that you can do to reduce network traffic and improve query performance.
- Denormalize Redshift schemas where possible to reduce the number of joins required, improving parallelization and therefore query performance.
- Set strategic distribution keys on underlying database tables. A distribution key dictates how data gets distributed across the nodes of a cluster. Using the same distribution key for multiple tables causes data to be collocated across nodes, reducing network traffic and improving query performance. Note: This section refers to distribution keys on the underlying table in Redshift, not the Looker parameter of the same name.
- Consider utilizing the “all” distribution style for very small tables that are frequently used in queries. This will cause the table to be present on all nodes and therefore always co-located with other data required in a query.
- Only consider "even" distribution style if a table is really large and never (or very rarely) needs to be joined to another table, as this will ensure that the power of the entire cluster is used for computing, and the need for co-located data is minimal when no joins are occurring.
- Monitor the skew of the cluster when using distribution keys. If the skew is high, then common joins and expressions are likely under-utilizing some nodes, while others are overworked.
- Set strategic sort keys on underlying database tables. Utilize date fields, join fields, and/or fields that are commonly used as filters for reporting as sort keys. Leverage interleaved sort keys for complex data set (such as sessionized data). Note: This section refers to sort keys on the underlying table in Redshift, not the Looker parameter of the same name.
- Ensure that the Redshift cluster is sized appropriately. Ensure that disk space and memory are not exceeded and that compute is not pegged at 100. Given the same amount of resources, the more nodes that are available, the better the performance will be.
- Continuously monitor the Redshift cluster closely to identify opportunities for optimization. Installing the Redshift performance block within the Looker instance is a great tool for this, which contains pre-defined performance optimization views, explores and dashboards.
For additional database optimization tips, check out our article on Optimizing Database Setup for Reading Data