We have all of our PDTs on a time schedule (either every hour or every day), but this causes problems in two cases:
1 - The underlying table is being rebuilt -> the PDT ends up empty or missing some data
2 - The database is currently under high load -> running expensive PDTs slows everything else down
In neither of these cases do we want the PDT to rebuild. What is the best-practice way to achieve this? My thinking so far has been to extract the SQL triggers into a view in the database and put something like the snippet below in every Looker view. This lets us force the value of the sql_trigger not to change during periods where case (1) or (2) applies, but it feels like a really heavy solution to the problem.
sql_trigger_value: SELECT * FROM schema.pdt_trigger_values WHERE view_name = 'VIEW NAME'
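For concreteness, here's a rough sketch of what the database side of that could look like. All names here (`pdt_trigger_state`, `frozen_value`, `maintenance_mode`) are hypothetical, and the "normal" trigger value is just an hourly timestamp for illustration:

```sql
-- Hypothetical sketch: one row of trigger state per Looker view.
-- While maintenance_mode is true, the view returns the frozen value,
-- so the sql_trigger never changes and no rebuild fires.
CREATE TABLE schema.pdt_trigger_state (
    view_name        VARCHAR(256),
    frozen_value     VARCHAR(256),   -- last value seen before maintenance began
    maintenance_mode BOOLEAN
);

CREATE VIEW schema.pdt_trigger_values AS
SELECT
    s.view_name,
    CASE
        WHEN s.maintenance_mode THEN s.frozen_value
        ELSE TO_CHAR(DATE_TRUNC('hour', GETDATE()), 'YYYY-MM-DD HH24')
    END AS trigger_value
FROM schema.pdt_trigger_state s;
```

The load process (or an ops script during high-load periods) would flip `maintenance_mode` on before doing anything disruptive and flip it back off afterwards, at which point the normal schedule resumes.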
Any suggestions would be welcome.
I’d like to use your first solution (and indeed we do this where we can), but there are two distinct issues:
1 - Data comes in in drips. There is a constant feed of updates, and I don't really want to be rebuilding constantly.
2 - If the underlying_table is being rebuilt, then count(*) will return 0 at some moments in time, leading to the PDT being rebuilt empty.
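For issue (1) at least, a trigger based on a load timestamp rather than count(*) can dampen the drip feed — truncating to the hour means a constant stream of inserts only changes the trigger value once per hour. This assumes the table has something like a `loaded_at` column (hypothetical here), and it doesn't solve issue (2) on its own, since MAX over an empty table still changes value mid-rebuild:

```lookml
# Sketch, assuming a loaded_at column on the source table:
# constant inserts only move this value once per hour.
sql_trigger_value: SELECT DATE_TRUNC('hour', MAX(loaded_at)) FROM schema.underlying_table
```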
I thought about this over the Easter weekend, and potentially the solution is just to make some changes upstream so that data comes into Redshift in a way more amenable to Looker triggers (e.g. pool it up and only transfer every hour or so, and make the table truncation take exclusive locks on the relevant tables). Then your final solution would work nicely.
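On the upstream side, one way to make the reload safe is to build the new data in a staging table and swap it in atomically, so a count(*)-style trigger never observes an empty table. This is only a sketch (the staging table and the rename-swap pattern are my assumption, not something from the thread), and it relies on the renames happening inside one transaction — worth noting that TRUNCATE in Redshift commits implicitly, which is exactly why a plain truncate-and-reload exposes the empty-table window:

```sql
-- Hypothetical sketch of an atomic reload in Redshift: readers (and
-- Looker trigger checks) see the old table until the commit, never an
-- empty one.
CREATE TABLE schema.underlying_table_staging (LIKE schema.underlying_table);

-- ...load the staging table here (COPY / INSERT)...

BEGIN;
ALTER TABLE schema.underlying_table RENAME TO underlying_table_old;
ALTER TABLE schema.underlying_table_staging RENAME TO underlying_table;
COMMIT;

DROP TABLE schema.underlying_table_old;
```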
I'd be keen to see the final suggestion implemented; there's very little correlation between how much we value the freshness of a PDT and how computationally intensive that PDT is.