Best Practice: Optimize Redshift Performance by Reducing Network Traffic Across a Cluster



Looker will not be updating this content, nor does it guarantee that everything is up to date.

 

These best practices reflect recommendations shared by a cross-functional team of seasoned Lookers. These insights come from years of experience working with Looker customers from implementation to long-term success. The practices are written to work for most users and situations, but as always use best judgment when implementing.

Amazon Redshift is a clustered, columnar-store cloud database that consists of nodes and is well suited to large analytical queries against massive datasets. However, query performance degrades when too much data has to be transferred across the cluster network. While some network traffic is inevitable, reducing it can help mitigate this performance degradation.

There are a variety of things that you can do to reduce network traffic and improve query performance:

  • Denormalize Redshift schemas where possible, to reduce the number of joins required, improving parallelization and therefore query performance.
  • Set strategic distribution keys on underlying database tables. A distribution key dictates how data gets distributed across the nodes of a cluster. Using the same distribution key for multiple tables causes data to be colocated across nodes, reducing network traffic and improving query performance. Note: This refers to distribution keys on the underlying table in Redshift, not the Looker parameter of the same name.
    • Consider utilizing the "all" distribution style for very small tables that are frequently used in queries. This will cause the table to be present on all nodes and therefore always colocated with other data required in a query.
    • Only consider the "even" distribution style if a table is very large and never (or very rarely) needs to be joined to another table. This ensures that the entire cluster's compute power is used for scanning the table, and when no joins occur, the need for colocated data is minimal.
    • Monitor the skew of the cluster when using distribution keys. If the skew is high, then common joins and expressions are likely under-utilizing some nodes while overworking others.
  • Set strategic sort keys on underlying database tables. Utilize date fields, join fields, and/or fields that are commonly used as filters for reporting as sort keys. Leverage interleaved sort keys for complex data sets (such as sessionized data). Note: This section refers to sort keys on the underlying table in Redshift, not the Looker parameter of the same name.
  • Ensure that the Redshift cluster is sized appropriately. Verify that disk space and memory are not exhausted and that CPU utilization is not pegged at 100%. Given the same total resources, the more nodes that are available, the better the performance will be.
  • Continuously monitor the Redshift cluster closely to identify opportunities for optimization. Installing the Redshift performance block within the Looker instance is a great tool for this, as it contains predefined performance-optimization views, Explores, and dashboards.
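The distribution and sort key advice above can be sketched in DDL. This is a minimal illustration, not a schema from the original article: the table and column names (`orders`, `customers`, `order_statuses`, `raw_events`) are hypothetical.

```sql
-- Large tables that are frequently joined: give both the same DISTKEY
-- so matching rows land on the same node and the join avoids network traffic.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    created_at  TIMESTAMP,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (created_at);          -- a date field commonly used as a report filter

CREATE TABLE customers (
    customer_id BIGINT,
    name        VARCHAR(256)
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- Very small, frequently joined lookup table: replicate a full copy
-- to every node so it is always colocated with the data it joins to.
CREATE TABLE order_statuses (
    status_id   SMALLINT,
    status_name VARCHAR(64)
)
DISTSTYLE ALL;

-- Very large table that is never joined: spread rows evenly
-- so every node shares the scan work.
CREATE TABLE raw_events (
    event_id BIGINT,
    payload  VARCHAR(65535)
)
DISTSTYLE EVEN;
```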
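For the skew-monitoring point above, one way to spot skewed tables is Redshift's `SVV_TABLE_INFO` system view, whose `skew_rows` column reports the ratio of rows on the fullest slice to rows on the emptiest slice (higher means more skew). A sketch of such a check:

```sql
-- Surface the most skewed tables; a high skew_rows value suggests the
-- chosen distribution key is concentrating data on a few nodes.
SELECT "table",
       diststyle,
       skew_rows
FROM svv_table_info
WHERE skew_rows IS NOT NULL
ORDER BY skew_rows DESC
LIMIT 20;
```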

For additional database optimization tips, check out this post about Optimizing Database Setup for Reading Data.

