Question

Handling Datagroups and Snowflake

  • 17 January 2020
  • 7 replies
  • 2376 views

Userlevel 3

Hello everybody

we are using Looker on Snowflake.

A REST Webservice calls our ETL. This can happen 1 to N times a day.

After the ETL is finished with its valuation, we want to renew several dashboards and/or looks automatically or send single looks or dashboards also by mail.

The only GUI possibility I see is to use Datagroups for this.


Since we are using Snowflake and our business does not access the data 24/7 (it can, may and sometimes does), but mainly during peak hours we can “suspend” Snowflake quite often.


The default for a Snowflake connection at Looker is to query the datagroup every 5 minutes. (PDT And Datagroup Maintenance Schedule - /5***)

This would cause Looker to connect to Snowflake every 5 minutes and Snowflake would never be suspended.

If we reduce the times to e.g. one hour, we reduce the costs but lose time. We would then register any database updates too late and thus inform the business about market changes too late.


How did you solve this problem for you? Or which solution approaches do you see?


7 replies

Userlevel 6

remove the sql trigger from the datagroup and hit the looker datagroup api endpoint to set the stale_before time for the datagroup at the exact moment your data is updated.

Hello Ian,


Thank you very much for this feedback: I had the exact same issue as Benjamin and until now, I had changed the default CRON to 1h in the Connection settings to avoid WH wakeup in Snowflake, which is not ideal at all!


But may I ask you how you set it up on your side? From your explanation, I understand that you are using the Looker API update datagroup API endpoint right?



So, did you setup a simple script or a more robust application on the backend to handle this? I’m thinking about how to handle new datagroups created, connecting the ETL to this, the frequency of check to update the “stale before” parameter? I assume you had to build some solid logic to handle all this correctly?


Thanks again 😉

Anthony

Userlevel 6

A pipeline in our etl orchestration tool will trigger the relevant data group for the data it has just updated the data for. It can be as simple as just running a bash or python script.

Thanks a lot for this Ian, I understand better now : seems less “big” than what I was thinking of 😉

Userlevel 3

Hi @IanT

Just to clarify, as support told me today, that



I have confirmed that triggering a datagroup via the API is not currently possible. I can certainly pass your feedback along to our product team on this point. I thought I would reiterate that it is possible to invalidate a datagroup’s cache, as I mentioned during our chat, by hitting the update_datagroup API endpoint and setting the stale_before parameter to the current time (in a unix epoch timestamp). But, as we discussed, that won’t actually rebuild the PDTs at the time of the call - rather, they will be rebuilt the next time a piece of content built on that PDT is queried.



but you/you have made it. So I hope that I had only a communication problem with the support.

How exactly do you do it that your ELT triggers the datagroup and thus triggers all NDT, PDT, Aggregate Tables and Schedules?

Userlevel 6

I don’t think I have checked out NDTs, PDTs, Agg tables and schedules. In the production products we use this for we don’t have those, we trigger emails after we blow the cache via the run once endpoint because we don’t want schedules hanging from a datagroup incase we want to reset them and NOT send the email automatically.

We update the datagroup using this payload: {“stale_before”: time.time(), “triggered_at”: time.time()}.

This blows the cache but as I say, not tested to see what the triggering actually does…if anything.

+1. I’ve also used this approach (invalidate cache via API, hit run_once API endpoint to trigger PDT rebuilds etc.) in the past and it works reasonably, although it is a bit clunky.

Reply