Scheduling (unlimited) Results to S3

zachtaylor

#The Basics

As of Looker 4.2, you can schedule reports directly to an S3 bucket. Results can be unlimited, allowing users to schedule and send large result sets, provided they meet the streaming criteria (that is, the report can’t contain table calculations or totals or, in some dialects, pivots). The scheduler will let you know as you’re scheduling if the report can be unlimited or not. To see this option in the scheduler modal, the user needs to have the send_to_s3 permission.

#When to Schedule to S3

Sending reports directly to S3 works well when email is not an option because of the size of the result set. Because we use SendGrid to process scheduled emails from Looker, and it has a 19.5MB limit (as of November 2016), the reports we send via email have to be limited in size. Streaming to S3 lets customers bypass browser, memory, and email limitations. It can also be useful if you want to automate a process that pulls data down from S3 and feeds it into other applications or tools within your business.

Another use case is customers who want to store snapshots or periodic reports in S3.
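For example, a downstream job can poll the bucket and grab the newest export before loading it into another tool. Here is a minimal sketch with boto3, assuming a bucket name and key prefix of your own (both placeholders, not anything Looker creates for you):

```python
# Minimal sketch: fetch the most recent scheduled export from S3 with boto3
# so another system can consume it. Bucket and prefix are placeholders.
import boto3

BUCKET = "my-looker-exports"   # placeholder bucket name
PREFIX = "daily_orders"        # placeholder schedule name / key prefix

s3 = boto3.client("s3")

def latest_export_key(bucket: str, prefix: str):
    """Return the key of the most recently modified object under the prefix."""
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    objects = response.get("Contents", [])
    if not objects:
        return None
    return max(objects, key=lambda obj: obj["LastModified"])["Key"]

if __name__ == "__main__":
    key = latest_export_key(BUCKET, PREFIX)
    if key:
        s3.download_file(BUCKET, key, "latest_report.csv")
```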

#How It Works

  1. Create a schedule for a saved Look or Dashboard.
  2. Select ‘Amazon S3’ from the ‘Destination’ dropdown and enter your S3 credentials (the user needs the send_to_s3 permission to see this option).
  3. Select ‘Schedule’, or send a test via ‘Send Test’.
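If you would rather set schedules up programmatically than through the scheduler modal, the same delivery can be created through the Looker API. Below is a hypothetical sketch using the official Python SDK (looker_sdk); the keys inside parameters and secret_parameters are assumptions, so confirm the exact names in the ScheduledPlanDestination reference for your Looker version before relying on them.

```python
# Hypothetical sketch: creating an S3 schedule through the Looker API with
# the official Python SDK (looker_sdk). The keys inside `parameters` and
# `secret_parameters` are assumptions -- confirm them against the
# ScheduledPlanDestination documentation for your Looker version.
import json

import looker_sdk
from looker_sdk import models40 as models

sdk = looker_sdk.init40()  # reads credentials from looker.ini or env vars

destination = models.ScheduledPlanDestination(
    type="s3",
    format="csv",
    address="s3://my-looker-exports/daily_orders",  # placeholder bucket/prefix
    parameters=json.dumps({
        "access_key_id": "AKIA...",   # assumed parameter name
        "region": "us-east-1",        # assumed parameter name
    }),
    secret_parameters=json.dumps({
        "secret_access_key": "...",   # assumed parameter name
    }),
)

plan = sdk.create_scheduled_plan(
    body=models.WriteScheduledPlan(
        name="Daily orders to S3",
        look_id="42",                 # the saved Look to schedule
        crontab="0 6 * * *",          # every day at 6:00
        send_all_results=True,        # stream unlimited results
        scheduled_plan_destination=[destination],
    )
)
print(plan.id)
```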

#Custom Applications

Have a great idea for ways that this feature can be used? Post details below to give ideas to others in the Looker community!


I may have missed it, but does the output automatically overwrite the previous file or is it saved as a separate unique results set? Thank you.

Hey @Charlie_Bryan,
We would overwrite the previous file if there were already a file with that name, but we append a timestamp to the file name, so in practice we almost always create a new one!

Just a quick update to @sam8’s comment above:

As of Looker 4.18 we also add an identifier to file names so that even files generated at the same time should have unique names.

Is there a way to have the S3 delivery write to the same S3 object every time? I appreciate the ability to have a snapshot of data, but I’m also looking to have the latest export always available at the same location.

I may end up just sending it to a webhook and doing my own S3 uploading if it’s not possible.

There isn’t a way to modify what object the S3 delivery will write to, @natetallman . It will always append a timestamp and a unique identifier, so overwriting shouldn’t ever happen.

I’ll bring this feedback to the product team. In the meantime, I think your idea of sending to a webhook is a good one; it will give you full flexibility over your delivery.
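For anyone taking that webhook route, here’s a rough sketch of a receiver that writes every delivery to the same S3 key, so the latest run always overwrites the previous one. The payload shape I use below (an attachment object with a data field) is an assumption on my part; send yourself a test delivery and inspect it before relying on this.

```python
# Rough sketch of the webhook approach: receive a Looker webhook delivery and
# write the report to a fixed S3 key, so each run overwrites the last one.
# The payload shape ("attachment" with a "data" field) is an assumption --
# inspect a real test delivery first. Binary formats may arrive base64-encoded.
import boto3
from flask import Flask, request

app = Flask(__name__)
s3 = boto3.client("s3")

BUCKET = "my-looker-exports"         # placeholder bucket
KEY = "latest/remarketing_feed.csv"  # fixed key, overwritten on every run

@app.route("/looker-webhook", methods=["POST"])
def handle_webhook():
    payload = request.get_json(force=True)
    attachment = payload.get("attachment", {})
    body = attachment.get("data", "")
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=body.encode("utf-8"))
    return "", 200

if __name__ == "__main__":
    app.run(port=8080)
```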

+1 for being able to choose the naming convention of the files, or at the very least having the option to overwrite. We need this to generate our re-marketing feeds, and I thought this would be an option 😢

Any update with regards to being able to choose the file naming convention? Would be great to just overwrite the same file every time.

Hi Seth- I’ve notified the product team about your inquiry.

Thanks for checking in.

This functionality is great, but a relatively “simple” change (yes I know all changes are simple if you don’t have to do them) would be to allow name format strings. That way we could control the filename a bit. We run hourly processes and pass them to other systems. Filename predictability makes coding the downstream systems easier.

You could have “name” and then “append”.

“%%hour%%” in the append field for instance would create the name with only the time rounded to the hour.
“%%last_full_hour%%” in the append field for instance would create the name with only the time rounded to the last full hour (which matches your own terminology in the looker UI).

If the append field is left blank then the default applies and it would have ghosted “%%timestamp%%%%guid%%”.

Hey @Stephen_Johnsto,

This is a great idea! I’ll be sure to pass the feedback along to our product team.

Any news regarding this issue? That would be really helpful for a specific flow we have also.
Thanks a lot

Hey there @Marceau_Bouleng, no updates to speak of, but one solution if you’re in need of this immediately would be to use an AWS Lambda function to detect the incoming webhook and then rename the file according to your criteria. There’s a good outline of what this would look like here: Anyone using Looker webhooks to call a AWS Lambda function

Hope this helps, I’ll pass along your desire to see this implemented. Have a good one!

@will_adams @Marceau_Bouleng

I just wrote a very simple Lambda function that copies the file to a second bucket and strips the timestamp+hash from the filename. The result is one bucket with a historical archive of all previously scheduled snapshots, and a second bucket with the most up-to-date version of that schedule.

I’ll be writing up a post tomorrow with instructions and will share as soon as it’s ready.
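In case it’s useful while the write-up is pending, here’s a hypothetical sketch of what a Lambda like that could look like. The destination bucket and the suffix-stripping regex are assumptions; adjust them to match the filenames you actually see in your bucket.

```python
# Hypothetical sketch of the Lambda described above: triggered by an S3
# "ObjectCreated" event, it copies the new export into a second bucket under
# a stable name by stripping the timestamp/identifier suffix Looker appends.
# The destination bucket and the regex are assumptions -- adjust them to
# match the suffix format you actually see on your files.
import re
import urllib.parse

import boto3

s3 = boto3.client("s3")
LATEST_BUCKET = "my-looker-exports-latest"  # placeholder destination bucket

def lambda_handler(event, context):
    for record in event["Records"]:
        source_bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # e.g. "daily_orders_2019-01-01T0600_ab12cd34.csv" -> "daily_orders.csv"
        stable_key = re.sub(r"_[\d\-T:]+_[0-9a-f]+(\.\w+)$", r"\1", key)
        s3.copy_object(
            Bucket=LATEST_BUCKET,
            Key=stable_key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
```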

@Marceau_Bouleng @Seth_Newman @quandrew @natetallman @Charlie_Bryan Please check out my new article about managing S3 schedules from Looker with AWS Lambda! And please let me know if I can clarify anything for you.

Thanks a lot Jesse,
Will try it for sure this week.

cobbr2

Thanks, @jesse_carah; I’ve adapted your solution for our use case (not quite open-sourceable at the moment, though; all the hard work is in Terraform setting up the lambda, etc).

One thing comes up with JSON-simple formats: I’m using some of these exports as a way to bridge queries we write in Looker back into Athena for further analysis with Hadoop (yeah, not exactly a normal data warehouse data flow, and it may not last more than a few months).

In that use case, the representation of the JSON output as an array of hashes requires us to transform it one more time; Athena expects newline-separated hashes with no enclosing array. I haven’t managed to convince its JSON SerDe to read the output Looker produces (I can only use org.openx.data.jsonserde.JsonSerDe, because of the embedded dots in keys).

Anybody else with this use case? I can always hack our lambda to do it, but figured I’d put the case in the wild so others could see.

Hey @cobbr2,

Glad you were able to adapt the Lambda for your use-case.

The dots in the field names are something we can’t address in Looker today. However, I did see that there is a “dots.in.keys” property in Openx-JsonSerDe:

Hive doesn’t support column names containing periods. In theory they should work when quoted in backticks, but they don’t, as noted in SO#35344480. To work around this issue, set the property dots.in.keys to true in the SerDe properties and access these fields by substituting the period with an underscore.

Link here.

Let me know if that resolves your issue.

cobbr2

Thanks, @jesse_carah

I’m using that SerDe. The array brackets apparently give that SerDe enough trouble that it won’t parse at all so far. Unfortunately, while it supports the dots.in.keys property (and the ignore.malformed.json property, both of which I’ve enabled), all I get is NULLs for values. My experimentation time is limited this week, and we can use a CSV export temporarily, so I won’t have a result to post here for a few days.

Ahh, I see. I faced a similar constraint when trying to load JSON from Looker into Redshift, and I sidestepped the issue by using CSV exports, like you suggested. Hacking the Lambda to convert the JSON may also prove to be a pretty straightforward approach. At any rate, forcing newline-delimited JSON seems like an odd design decision.
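For what it’s worth, the transform itself is small. Here’s a sketch of converting Looker’s JSON array export into the newline-delimited JSON Athena expects, which could run inside the same Lambda before the copy step:

```python
# Sketch of the conversion discussed above: Looker's JSON export is a single
# array of objects, while Athena's JSON SerDe wants one object per line
# (newline-delimited JSON).
import json

def array_to_ndjson(raw: str) -> str:
    """Convert a JSON array string into newline-delimited JSON."""
    rows = json.loads(raw)
    return "\n".join(json.dumps(row) for row in rows)

# Example:
# array_to_ndjson('[{"a": 1}, {"a": 2}]') -> '{"a": 1}\n{"a": 2}'
```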

I’ll reach out internally to see if anyone knows of any more direct approaches. Let me know how things go!

2 posts were split to a new topic: Support S3 encrypted buckets

Alex_Gray

I created a new topic on whether IAM roles are supported when sending results to S3:

I’m posting it here in case anyone runs across it via this post.

Our bucket also denies uploads unless the server-side encryption bit is set in the request.

What ACL does Looker use when we export to S3? We’re having issues with the end destination (another AWS account) being unable to access the file.

We’ve set up the IAM user with an access key and secret key.

For some reason, when we select the limit to be “Results in Table”, it works, but when we select the limit to be “All results”, it fails.

Does it need a separate set of permissions for “All Results”, which might be streaming data? The size of the result for “All Results” is only 67KB, and yet the S3 put is failing.

Error: Upload to S3 bucket {bucket_name} aborted due to error: The ciphertext refers to a customer master key that does not exist, does not exist in this region, or you are not allowed to access.

Every time we need to send or schedule data to an S3 bucket, do we need to enter the access and secret keys?