Optimizing Redshift Performance

  • 9 January 2015

Sort Keys



Typically, the primary timestamp or date column of any given table will be the best candidate for your sort key. Rows are stored on disk in sort key order, so Redshift can skip blocks that fall outside a query's range, which speeds up queries that filter, sort, or limit by that column.
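
For example, a minimal sketch (the table and column names here are made up for illustration):

    -- Hypothetical events table: most queries filter or sort on created_at,
    -- so it is a good single-column sort key.
    CREATE TABLE events (
        event_id   BIGINT,
        user_id    BIGINT,
        event_type VARCHAR(64),
        created_at TIMESTAMP
    )
    SORTKEY (created_at);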



Distribution Keys



For tables that frequently join out to other tables (such as fact tables in the star schema world), it's best to set your distribution key to the foreign key with the greatest cardinality. For tables that other tables join into (such as dimension tables in the star schema world), it's best to make the primary key the distribution key. When both sides of a join are distributed on the join column, matching rows live on the same slice, and Redshift avoids redistributing data at query time.
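
As a sketch (again with made-up names), distribute the dimension table on its primary key and the fact table on the matching high-cardinality foreign key:

    -- Dimension table: distributed on its primary key.
    CREATE TABLE users (
        user_id     BIGINT,
        signup_date DATE
    )
    DISTKEY (user_id);

    -- Fact table: distributed on the high-cardinality foreign key it joins on,
    -- so its rows land on the same slice as the matching users rows.
    CREATE TABLE page_views (
        view_id   BIGINT,
        user_id   BIGINT,
        viewed_at TIMESTAMP
    )
    DISTKEY (user_id)
    SORTKEY (viewed_at);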



More about Sort & Distribution Keys



The Redshift documentation has a page on how best to choose sort and distribution keys depending on how your data is structured and queried.



Analyzing Table Design



The Redshift documentation also has a nice page with a script that you can run to analyze your table design. The script checks whether you've got sort keys, distribution keys, and column compression dialed in.
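
Relatedly, Redshift can recommend column encodings for an existing table on its own; a quick example (table name hypothetical):

    -- Samples the table's data and reports a suggested compression
    -- encoding for each column.
    ANALYZE COMPRESSION page_views;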



Vacuum



As you update tables, it's good practice to vacuum. Newly added rows land, at least temporarily, in an unsorted region on disk, and deleted or updated rows leave behind space that isn't reclaimed until a vacuum runs. (You can specify a VACUUM SORT ONLY to skip the space-reclamation step and save time.)
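
For example (table name hypothetical):

    -- Full vacuum: reclaims space from deleted rows and re-sorts the table.
    VACUUM FULL page_views;

    -- Sort-only vacuum: re-sorts newly added rows without reclaiming disk
    -- space, which is usually faster.
    VACUUM SORT ONLY page_views;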



To learn more about optimizing performance in Redshift, check out this blog post by one of our analysts.


3 replies

Here is a LookML view of that script for analyzing table design, in case anyone wants it.


- view: audit_redshift_table_analysis
  derived_table:
    sql: |
      with temp_staging_tables_1 as (
        SELECT n.nspname as schemaname, c.relname as tablename, c.oid as tableid,
               (SELECT COUNT(*) FROM STV_BLOCKLIST b WHERE b.tbl = c.oid) as size_in_megabytes
        FROM pg_namespace n, pg_class c
        WHERE n.oid = c.relnamespace
          AND nspname NOT IN ('pg_catalog', 'pg_toast', 'information_schema')
          AND c.relname <> 'temp_staging_tables_1'),
      temp_staging_tables_2 as (
        SELECT tableid, MIN(c) as min_blocks_per_slice, MAX(c) as max_blocks_per_slice, COUNT(DISTINCT slice) as slice_count
        FROM (SELECT t.tableid, slice, COUNT(*) AS c
              FROM temp_staging_tables_1 t, STV_BLOCKLIST b
              WHERE t.tableid = b.tbl
              GROUP BY t.tableid, slice) blocks_per_slice
        GROUP BY tableid)
      SELECT t1.*,
             CASE WHEN EXISTS (SELECT *
                               FROM pg_attribute a
                               WHERE t1.tableid = a.attrelid
                                 AND a.attnum > 0
                                 AND NOT a.attisdropped
                                 AND a.attisdistkey = 't')
                  THEN 1 ELSE 0 END as has_dist_key,
             CASE WHEN EXISTS (SELECT *
                               FROM pg_attribute a
                               WHERE t1.tableid = a.attrelid
                                 AND a.attnum > 0
                                 AND NOT a.attisdropped
                                 AND a.attsortkeyord > 0)
                  THEN 1 ELSE 0 END as has_sort_key,
             CASE WHEN EXISTS (SELECT *
                               FROM pg_attribute a
                               WHERE t1.tableid = a.attrelid
                                 AND a.attnum > 0
                                 AND NOT a.attisdropped
                                 AND a.attencodingtype <> 0)
                  THEN 1 ELSE 0 END as has_col_encoding,
             100 * CAST(t2.max_blocks_per_slice - t2.min_blocks_per_slice AS FLOAT)
                 / CASE WHEN (t2.min_blocks_per_slice = 0)
                        THEN 1 ELSE t2.min_blocks_per_slice END as pct_skew_across_slices,
             CAST(100 * t2.slice_count AS FLOAT) / (SELECT COUNT(*) FROM STV_SLICES) as pct_slices_populated
      FROM temp_staging_tables_1 t1
      left join temp_staging_tables_2 t2 on t1.tableid = t2.tableid

  fields:
  - measure: count
    type: count
    drill_fields: detail*

  - dimension: schemaname
    type: string
    sql: ${TABLE}.schemaname

  - dimension: tablename
    type: string
    sql: ${TABLE}.tablename

  - dimension: tableid
    type: int
    sql: ${TABLE}.tableid

  - dimension: size_in_megabytes
    type: int
    sql: ${TABLE}.size_in_megabytes

  - dimension: has_dist_key
    type: int
    sql: ${TABLE}.has_dist_key

  - dimension: has_sort_key
    type: int
    sql: ${TABLE}.has_sort_key

  - dimension: has_col_encoding
    type: int
    sql: ${TABLE}.has_col_encoding

  - dimension: pct_skew_across_slices
    type: number
    sql: ${TABLE}.pct_skew_across_slices

  - dimension: pct_slices_populated
    type: number
    sql: ${TABLE}.pct_slices_populated

  sets:
    detail:
      - schemaname
      - tablename
      - tableid
      - size_in_megabytes
      - has_dist_key
      - has_sort_key
      - has_col_encoding
      - pct_skew_across_slices
      - pct_slices_populated

Yes, I created a view for this also:


    SELECT schema as schema
         , "table" as table_name
         , diststyle as distribution_style
         , size as table_size_in_mb
         , sortkey1 as first_sortkey
         , sortkey_num as number_of_sortkeys
         , max_varchar as max_size_varchar
         , pct_used as percent_used
         , empty as percent_empty
         , unsorted as percent_unsorted
         , stats_off as staleness
         , tbl_rows as total_rows
         , skew_sortkey1 as skew_sortkey
         , skew_rows as skew_rows
    FROM svv_table_info
    WHERE schema = 'looker_scratch'
    -- AND "table" ILIKE 'lr$%'
    -- AND skew_rows IS NOT NULL
    ORDER BY skew_rows DESC

Thanks Scott, I’ll check those out. I have just started building some auditing views, so I don’t have much yet.
