Question

Incremental PDTs - rebuild entire table nightly

  • 31 January 2022
  • 11 replies
  • 211 views

Hi All! Does anyone have thoughts on the following use case for incremental PDTs?

I have an incremental PDT that refreshes every 30 minutes with an increment_offset of 5 days. Is it possible to set a trigger so that this PDT rebuilds in its entirety once per day, from the beginning of time and ignoring the increment offset, just like it did on the initial build?

The use case is that customers want to see very recent results during the day, but every night there should be a 'true-up' for late-arriving data - and it's OK that this nightly build takes 5x longer.
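For context, a rough LookML sketch of the setup described above, using hypothetical table, column, and datagroup names (analytics.orders, activity_date, orders_30_min_datagroup), might look like this:

    datagroup: orders_30_min_datagroup {
      # Trigger value changes every 30 minutes, so dependent PDTs are checked/rebuilt on that cadence.
      sql_trigger: SELECT FLOOR(EXTRACT(EPOCH FROM GETDATE()) / 1800) ;;
    }

    view: order_activity {
      derived_table: {
        datagroup_trigger: orders_30_min_datagroup
        increment_key: "activity_date"   # incremental on the activity date
        increment_offset: 5              # re-process the trailing 5 days on every build
        sql:
          SELECT
            activity_date,
            location_id,
            SUM(revenue) AS revenue
          FROM analytics.orders
          GROUP BY 1, 2 ;;
      }
    }

The question is how to keep the fast incremental builds during the day while still getting one full, from-scratch build every night.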




You could:

Increment the PDT on the load date rather than the activity date, so that it only appends newly loaded data (you can then cut it down to a 1-day increment load rather than 5). You still include the activity date for analysis in the PDT. This only works if you have additive measures across days (see the sketch after this reply).

 

or

 

Delete the PDT from the database, forcing Looker to rebuild it in full.
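For the first option, a sketch of what the load-date increment could look like (loaded_at is a hypothetical audit column recording when a row landed in the warehouse; as noted above, this only works if the measures are additive across days):

    view: order_activity_by_load {
      derived_table: {
        datagroup_trigger: orders_30_min_datagroup
        # Increment on when rows landed in the warehouse rather than on the activity date,
        # so a 1-day offset is enough to pick up late-arriving rows.
        increment_key: "loaded_date"
        increment_offset: 1
        sql:
          SELECT
            DATE_TRUNC('day', loaded_at) AS loaded_date,  -- hypothetical load-time audit column
            activity_date,                                -- still available for analysis
            location_id,
            SUM(revenue) AS revenue
          FROM analytics.orders
          GROUP BY 1, 2, 3 ;;
      }
    }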

@IanT Thanks for the response! I don’t think the first option would work because we have data that occasionally changes going back 10+ days - which is why we would want to do an entire rebuild every night.

The second option is interesting. The PDT in the database would need to have a stable name in order for it to be deleted. I could set up a script in Redshift to do the nightly delete. I wonder if it would have unintended effects on Looker if a PDT has simply disappeared from the database...
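A rough sketch of what that nightly Redshift script could do, assuming the PDTs are built into a looker_scratch schema and that the generated table name ends with the view name (Looker prefixes PDT names with a hash, so the full name is not stable):

    -- 1. Find the physical table(s) Looker generated for the view in the scratch schema.
    SELECT schemaname, tablename
    FROM pg_tables
    WHERE schemaname = 'looker_scratch'          -- assumption: your PDT scratch schema
      AND tablename LIKE '%order_activity';      -- assumption: generated names end with the view name

    -- 2. Drop whatever step 1 returns (template this into the nightly job).
    DROP TABLE IF EXISTS looker_scratch.lr_abc123_order_activity;  -- hypothetical generated name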


Does the data actually change, or is it in event format and just comes in “late”?

 

You should test option 2, but I believe it will rebuild the table the next time the 30-minute trigger is hit. If any queries run before then, Looker will either fall back to the original Explore (if it's an aggregate table) or rebuild the PDT and then run your analysis (so the first query is slow if it lands within that 30-minute window).

The underlying data would change. For example, a ‘location name’ could be changed by a client, and that change should be applied to all historical data. I’ll test option 2 and will update the thread with the results!


Good catch DataChico

I would also like to see a function to recreate the PDT from time to time.

The current solution is fine, but as described here, it is not good enough if you have historical changes.
We currently refresh once a weekend (a manual process where we add a "1 = 1" to the datagroup, or comment it out again).
This leads to a complete refresh...
Only a workaround, not a solution...
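For anyone following along, the toggle moebe describes lives on the datagroup's trigger SQL; a sketch with made-up names (orders_datagroup, analytics.orders, loaded_at):

    datagroup: orders_datagroup {
      # Normal form of the trigger:
      #   sql_trigger: SELECT MAX(loaded_at) FROM analytics.orders ;;
      # Weekend "true-up" form: per moebe's workaround, editing the trigger SQL
      # (adding or removing the dummy predicate) and redeploying causes the PDTs
      # that use this datagroup to rebuild in full.
      sql_trigger: SELECT MAX(loaded_at) FROM analytics.orders WHERE 1 = 1 ;;
      max_cache_age: "24 hours"
    }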

 

 

@moebe Thanks for sharing your previous post - looks like you’ve spent quite a bit of time with this problem. I wonder if there is any automated way to add the 1 = 1 into the datagroup? There may be a way to do it using the API, but again, a workaround and not a real solution. This should be added to the product roadmap for Looker...

I have an issue where a bug in my data source means I need to rebuild the whole incremental PDT, and I can't find a way to do that without changing the query, pushing to production, changing it back, and pushing to production again (which I will do, since I don't see a better option). Alternatively, I could change the offset to reach back to the day the bug happened (which in my case is a month ago, so still better than rebuilding the whole thing). We face a similar issue to the one above, and we were thinking about a different approach:

Issue: some columns in our data source can change, e.g. an is_customer_alive boolean :)

We could put an update_time on the row and use it as the increment key, but would that duplicate the rows? → I actually don't know and haven't tried it yet.

Solution considered → build the incremental PDT from the whole table except the columns that can change. Those columns would go into a new view that is incremental on update_time, holding just a foreign key and the values needed. Then do a sanity check on how Looker appends the data: if it duplicates the foreign key, add another query on top to make the FK unique (if needed). That layer might need a full, non-incremental PDT, or maybe no PDT at all, depending on the size of your data and how you use it.
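A rough LookML sketch of that second view, with made-up names (analytics.customers, customer_id, is_customer_alive, update_time):

    view: customer_status_changes {
      derived_table: {
        datagroup_trigger: orders_datagroup
        increment_key: "updated_date"
        increment_offset: 1
        sql:
          SELECT
            customer_id,                                   -- foreign key back to the main PDT
            is_customer_alive,                             -- the column that can change
            DATE_TRUNC('day', update_time) AS updated_date
          FROM analytics.customers ;;
      }
      # Appending on update_time can leave more than one row per customer_id, so the
      # dedup / "make the FK unique" step described above may still be needed on top.
    }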

We are about to make those changes and test them, so I can't confirm yet that it is a good solution, but hopefully it helps you.

Just to add one more piece of info: we could rebuild the full PDT by adding a space to the query… but I couldn't figure out how to fully rebuild a view that has incremental PDTs without doing a pull request.

Would it be possible to split the table in two?

The archive table would rebuild completely each night, with a live (derived) table pulling in only records updated in the current day. A subsequent derived_table could be written to pull only the latest record for the primary key across the two “feeder” tables.

This eliminates incremental scheduling; however, depending on the volume, structure, and database, it could slow things down.
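A sketch of that final dedup layer, assuming two hypothetical feeder PDTs (orders_archive, rebuilt nightly, and orders_today, holding only rows touched today) that share the same columns plus an updated_at timestamp:

    view: orders_current {
      derived_table: {
        datagroup_trigger: orders_datagroup
        sql:
          SELECT *
          FROM (
            SELECT
              u.*,
              ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
            FROM (
              SELECT * FROM ${orders_archive.SQL_TABLE_NAME}   -- full nightly rebuild
              UNION ALL
              SELECT * FROM ${orders_today.SQL_TABLE_NAME}     -- only rows updated today
            ) AS u
          ) AS ranked
          WHERE rn = 1 ;;
      }
    }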

The current solution is fine, but as described here, it is not good enough if you have historical changes.

 

Thank you for the article link.

Would it be possible to split the table in two?

The archive table would rebuild completely each night, with a live (derived) table pulling in only records updated in the current day. A subsequent derived_table could be written to pull only the latest record for the primary key across the two “feeder” tables.

This eliminates incremental scheduling; however, depending on the volume, structure, and database, it could slow things down.

Yep. I would do an incremental PDT with all the fields that don't change, a full-rebuild PDT with the columns that do change, and join them at the Explore level.
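A minimal sketch of that pattern, with made-up names throughout (customer_facts for the fields that never change, customer_status for the ones that do):

    # Incremental PDT: history that never changes after it lands.
    view: customer_facts {
      derived_table: {
        datagroup_trigger: orders_datagroup
        increment_key: "created_date"
        increment_offset: 1
        sql: SELECT customer_id, created_date, first_order_value FROM analytics.customers ;;
      }
      dimension: customer_id {
        primary_key: yes
        sql: ${TABLE}.customer_id ;;
      }
    }

    # Small full-rebuild PDT: only the columns that can change, so it stays cheap to rebuild nightly.
    view: customer_status {
      derived_table: {
        datagroup_trigger: orders_datagroup
        sql: SELECT customer_id, is_customer_alive FROM analytics.customers ;;
      }
      dimension: customer_id {
        primary_key: yes
        sql: ${TABLE}.customer_id ;;
      }
      dimension: is_customer_alive {
        type: yesno
        sql: ${TABLE}.is_customer_alive ;;
      }
    }

    explore: customer_facts {
      join: customer_status {
        type: left_outer
        relationship: one_to_one
        sql_on: ${customer_facts.customer_id} = ${customer_status.customer_id} ;;
      }
    }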