Troubleshooting Redshift performance - extensive guide

  • 27 January 2015

This article covers general steps for troubleshooting performance issues that arise when running Looker queries on a Redshift database. It is not exhaustive, since there can be many reasons for poor performance. Some of the commonly perceived reasons for poor database performance are:







  1. Poorly designed model

    Ex:

  • using an unnecessary join
  • querying data without filters

  2. Less than optimal database set-up

    Ex:

  • missing specification of sort keys
  • missing key procedures during the ETL process

  3. Poorly written queries by Looker

    Ex:

  • Looker using a Redshift function that has been deprecated




It is important to correctly diagnose the type of performance issue you are experiencing. For example, while Looker does handle a big part of the logic for writing SQL queries, most of that logic is passed to Looker in the form of LookML definitions. If the LookML is defined correctly, on average Looker will generate more efficient queries than if all of the possible queries and their derivatives were written by hand.



Here we will focus on type (2) performance issues.



Prior requirements:





  • You have an understanding of sort and distribution keys (see Lloyd's blog post on Redshift)

  • The primary model owner (developer) has successfully applied a similar LookML model to other datasets of a similar size with no significant performance issues (an incorrect model can significantly impact performance)

  • conditionally_filter or always_filter is applied at least once within the model

  • You have admin privileges




Understanding performance through simplified queries



In order to understand what needs to be changed in the database set-up, it is practical to first discover whether the issue is due to a lack of horsepower or due to a non-optimally defined schema.



Often, running several simplified queries can point you in the right direction. Assuming we are concerned with performance of table1 and, optionally, with that of table2 joined in, we shall try the following:



Query 1



SELECT * FROM table1 LIMIT 10



Query 2



SELECT time_stamp_column FROM table1 ORDER BY 1 <ASC/DESC> LIMIT 10



Query 3



SELECT time_stamp_column FROM table1
WHERE time_stamp_column > <USE LOOKER GENERATED DATE FILTER CLAUSE FOR SOME SMALL PERIOD>
ORDER BY 1 LIMIT 10



Query 4



SELECT table1.time_stamp_column FROM table1
LEFT JOIN table2
ON <PUT THE EQUALITY CONDITION HERE>
WHERE table1.time_stamp_column > <USE LOOKER GENERATED DATE FILTER CLAUSE FOR SOME SMALL PERIOD>
ORDER BY 1 LIMIT 10



If any of these SELECT queries runs slowly, the best course of action is to investigate it using the EXPLAIN command (more on query planning, EXPLAIN, and simple sample results).
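For example, prefixing Query 2 with EXPLAIN returns its query plan without executing it:

EXPLAIN
SELECT time_stamp_column FROM table1 ORDER BY 1 LIMIT 10;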



There are, however, some quick guidelines. A slowly returning Query 1 suggests either that this is a very wide table or that some of the columns contain a lot of data. Both statements are easily checked, and both should become irrelevant if you replace the * (star) with a single small column. While Looker only queries the columns needed for the selected dimensions, special care should be taken when querying these tables from Looker's PDTs (i.e. avoid using a *).



A slowly returning Query 2 suggests either that sort keys are not specified correctly, that there are unsorted regions, or that network traffic between nodes in the cluster takes too much time. Query 3 is another way to get at the sort key problem observed in Query 2, although it typically introduces the question of whether the time dimension is correctly defined by the user in LookML (e.g. epoch, datetime, etc.), so that query deserves special care.



Finally, if querying a single table is responsive, we want to investigate the join relationship between the two tables. There are many potential problems:





  1. the two tables use different distribution keys, so a join creates unnecessary network traffic between nodes

  2. if no sort key is specified on the column in the equality clause of the right table, Redshift will have to scan the entire right table (very slow)

  3. one or both columns use a very big VARCHAR: this is not at all an issue for storage, but it is a problem when values are hashed, since the entire declared capacity of every field is hashed.




Again, these are very rough guidelines, but can serve as good starting points and as potential benchmarks during future changes.



Do keep in mind that Redshift, separately from Looker, maintains its own cache of compiled queries. This should not matter for these simple SELECTs, but you might need to vary the above queries slightly to get a more objective benchmark, for example by adding or removing small columns, changing the ORDER direction, or adjusting the limit; although that is somewhat of a dark art.
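For instance, a trivially varied version of Query 2 (any small change of this kind will do):

SELECT time_stamp_column FROM table1 ORDER BY 1 DESC LIMIT 11;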



Does model logic correspond to “sortkeys” and “distkey”?



Oftentimes, Redshift keys are specified prior to the actual building of a model. This typically causes problems because, at the time you create the table schema, you don't yet know how each of the tables will be queried.



A number of questions need to be answered first. Will the table in question be used only as an explore, only as a secondary view joined into an explore, or both? Will the table always be queried for all of time, or will users require only the last few days? You should have some rough idea of the answers to these questions before attempting to set up the appropriate database schema (WARNING: changing keys on a Redshift table requires dropping the table).



Next, you will need to check the existing definition of keys (see the PG_TABLE_DEF table):



SELECT * FROM pg_table_def WHERE tablename = 'some_table_name' AND (distkey OR sortkey <> 0);



Sometimes your table uses a non-default (non-public) schema. In that case, you or your administrator will need to add it to your username's search path:



ALTER USER looker SET search_path TO '$user',some_secondary_schema,public;
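You can then confirm the new setting:

SHOW search_path;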



Key points to watch out for:





  • Is the same distkey specified for large tables that can be joined together? Otherwise, joins might cause too much network traffic between nodes.


  • Are there sortkey(s) specified on the columns used by the default filters in the model?

  • Are there sortkey(s) specified on the ids/primary keys of tables where these ids are used in join clauses?




Typically, the largest tables consist of event or transaction data over time. This means that in 90% of cases the sort key should be applied to a column containing time. There can, of course, be other secondary sort keys, or situations where the event table is never used as a Looker Explore.
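As a minimal sketch of a schema that follows these guidelines (the table and column names are hypothetical):

CREATE TABLE events (
  event_id   BIGINT,
  user_id    BIGINT,     -- also the distkey of other large tables it joins to
  created_at TIMESTAMP   -- the time column used by default date filters
)
DISTKEY (user_id)
SORTKEY (created_at);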



Did sorting happen correctly?



Assuming the keys are correctly defined, next check whether:





  1. Data has been inserted (or COPYed) in the right sort order


  2. Sorting completed 100%




Redshift relies heavily on your loading the data correctly (see Sort Key Order or COPY). If you are performing individual INSERTs very frequently, most likely you are not inserting the data in the sort order specified in your schema.



It is also quite common to miss the difference in your LookML model between date dimensions derived from timestamps and date dimensions that refer to rollups. Using the former within the model while sorting on the latter will likely result in a less than optimal set-up. In most cases you will need to drop the table and re-set the sort order on the more granular column (i.e. the one containing the timestamp, epoch, etc.), as sketched below.
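A rough sketch of that deep-copy process, with hypothetical names (remember that this drops the original table):

CREATE TABLE events_new (
  event_id   BIGINT,
  user_id    BIGINT,
  created_at TIMESTAMP
)
SORTKEY (created_at);  -- the more granular time column

INSERT INTO events_new SELECT * FROM events ORDER BY created_at;

DROP TABLE events;
ALTER TABLE events_new RENAME TO events;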



To check whether rows are fully sorted, you can query Redshift’s admin tables:



SELECT btrim(pg_namespace.nspname::character varying::text) AS schema_name,
       btrim(p.name::character varying::text) AS table_name,
       sum(p."rows") AS "rows",
       sum(p.sorted_rows) AS sorted_rows,
       sum(p."rows") - sum(p.sorted_rows) AS unsorted_rows,
       CASE
         WHEN sum(p."rows") <> 0
           THEN 1.0::double precision - sum(p.sorted_rows)::double precision / sum(p."rows")::double precision
         ELSE NULL::double precision
       END AS unsorted_ratio
FROM stv_tbl_perm p
JOIN pg_database d ON d.oid = p.db_id::oid
JOIN pg_class ON pg_class.oid = p.id::oid
JOIN pg_namespace ON pg_namespace.oid = pg_class.relnamespace
WHERE btrim(pg_namespace.nspname::character varying::text) = 'public' AND p.id > 0
GROUP BY btrim(pg_namespace.nspname::character varying::text), btrim(p.name::character varying::text)
ORDER BY sum(p."rows") - sum(p.sorted_rows) DESC, sum(p.sorted_rows) DESC;



Make sure to change the schema from public to whatever schema you are using for your table(s).



This should produce a table like this:



 schema_name | table_name |     rows    | sorted_rows | unsorted_rows |  unsorted_ratio
-------------+------------+-------------+-------------+---------------+-------------------
 public      | table1     | 64048263264 | 52214727344 |   11833535920 | 0.18475966898936
 public      | my_table2  | 21777720385 | 18456376872 |    3321343513 | 0.152511073440343



Another way to indirectly infer whether the table is loaded in sort order is to run a Vacuum procedure. The more sorted the table is, the less time the Vacuum should take (see Vacuuming Tables). This might be useful if you are already running a lengthy Vacuum and wondering why it is taking so long. To estimate how much more time is needed for the Vacuum to complete:



SELECT * FROM svv_vacuum_progress;



A case for Vacuum with frequent DELETEs/UPDATEs



In some ETL jobs, data is not just COPYed over but is also updated or deleted. This is often the case with rollups. In rollup tables we only care about facts (e.g. averages, sums per user/group). That means that as facts change, rows need to be either updated or removed entirely in favor of newer rows. This is not an optimal set-up, but it does save storage.



In these cases, make sure to run Vacuum frequently, even if the data was originally inserted in sorted order. Keep in mind that user-run queries can slow down significantly if they are performed during one of the Vacuum jobs.
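For example (the table name is hypothetical):

VACUUM FULL my_rollup_table;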



There are a couple of solutions to frequent Vacuuming. One is to utilize Redshift's workload management (WLM), thereby giving higher priority to user-run queries over longer ETL jobs. This will further delay the completion of your Vacuum procedure, so ideally you would temporarily increase the number of nodes for the ETL period, making it brief and less costly to user queries.



Another approach involves splitting one table into many smaller monthly/daily/hourly tables (see Using Time Series Tables). The idea is to create a new small table at some interval and then UNION all of the tables into, conceptually, a single table using Redshift's CREATE VIEW syntax, as sketched below.



This allows Vacuum/Analyze to be performed on just one table while everything else is left alone. The query optimizer will then skip some or most of the tables altogether, based on the time period specified in the WHERE clause.
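A minimal sketch of the idea, assuming hypothetical monthly tables events_201501 and events_201502:

CREATE VIEW events AS
SELECT * FROM events_201501
UNION ALL
SELECT * FROM events_201502;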



Workload Management



It is important to consider Looker's use within the wider business's requirements for Redshift, and then how to make changes to accommodate these requirements. WLM enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries.



If Looker is the only application affected by WLM, your optimization will typically focus only on performance during the ETL process or during queries that load data into Redshift. More on implementing WLM…
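As a minimal illustration, an ETL session can label its queries so that WLM routes them to a dedicated queue (the group name etl is an assumption; it must match a query_group defined in your WLM configuration):

SET query_group TO 'etl';  -- hypothetical queue name from your WLM configuration
-- ... run ETL statements here ...
RESET query_group;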



A closer look at “Explain”: Network Traffic



Consider this example plan for a sorted query on a large table:

XN Limit  (cost=1000705977145.87..1000705977145.90 rows=10 width=4)
  ->  XN Merge  (cost=1000705977145.87..1000716373406.35 rows=4158504192 width=4)
        Merge Key: tstamp
        ->  XN Network  (cost=1000705977145.87..1000716373406.35 rows=4158504192 width=4)
              Send to leader
              ->  XN Sort  (cost=1000705977145.87..1000716373406.35 rows=4158504192 width=4)
                    Sort Key: tstamp
                    ->  XN Seq Scan on stumble_events  (cost=0.00..41585041.92 rows=4158504192 width=4)



Network traffic is slow, and you should expect its relative cost in the plan to be large. A number of variables can make network traffic worse or more frequent: incorrect distribution keys, incorrect compression, or, as mentioned above, large VARCHAR specifications.



distribution keys


This is often an issue with highly normalized data. Your events table might be correctly distributed on user_id, but a number of other tables might not even have a user_id column. That's O.K. if these tables are small (they will simply be distributed across nodes), but it is problematic for larger tables.



incorrect compression


Primary/foreign keys have to be decompressed before they can be used. Whatever compression saves in storage and network traffic is likely to be lost in processing latency, especially if the length of these keys is already optimally designed for the number of potential rows. The combination of sort keys and delta encodings is not as straightforward, but it is better to keep it simple: apply compression to non-id/non-date columns only.



You can run ANALYZE COMPRESSION <table name> to have Redshift give you guidance on the best compression for non-id fields, but ignore its advice for ids and time columns. Redshift can understand the variability in the data, but it cannot understand how you plan on using the table in terms of foreign/primary key relationships.



It is possible that Redshift could predict compression better if foreign/primary key relationships were already defined, but Looker is a lot more flexible when it comes to the use of foreign/primary keys, so it is very unlikely that those constraints would be enforced.



large VARCHAR specifications


This is not a problem in storage, where the stored size is automatically based on the actual value. However, if you use such a column in a COUNT DISTINCT, a GROUP BY, or a join specification, then Redshift has to create a HASH based on the maximum declared length of the column. That means you can get many times the magnitude of your actual stored values traveling across the network (possibly 100x more).
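A hypothetical illustration of right-sizing VARCHAR columns:

CREATE TABLE users (
  uuid VARCHAR(36),    -- sized to the value; cheap to hash in joins and GROUP BYs
  bio  VARCHAR(65535)  -- fine for storage, but avoid joining or grouping on it
);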



Horsepower



If you have lots of data, the answer might simply be more nodes, or the replacement of HDD nodes in favor of faster but smaller SSD nodes. While this is a financial decision, it is easy to test both configurations without the need to drop or create any new tables. You just need to make sure that the number of SSD nodes can accommodate the cumulative size of your HDD nodes.



To find your total current capacity and actual physical size on disk:



SELECT owner AS node, diskno, used, capacity FROM stv_partitions ORDER BY 1, 2, 3, 4;



Review your History



Finally, monitor your queries. Are there any queries that take longer than expected to execute? Looker provides an explore to track the performance of queries executed in Looker under <domain>/explore/i__looker/history. You can also browse query counts and individual queries from the Redshift console. And, of course, this is possible through the command line as well:



SELECT query, pid, elapsed, substring FROM svl_qlog
ORDER BY starttime DESC
LIMIT 100;



Update: many of the tasks explained here, and more, can now be facilitated by using the performance dashboards in our Redshift block.

