Looker Self-Hosting Knowledge Share

thomas_brittain · 09-23-2020 05:39 AM

Hello wonderful Looker people,

Is there anyone else out there who is self-hosting their Looker instances and willing to share best-practices, horror stories, or anything else regarding the challenges of self-hosting? Remember, this is for posterity.

I’ll kick it off. Right now I’m reviewing our scalability. We repackage Looker into our final product, and we have quadrupled our customer base in the last 2 years. We are starting to run into performance issues. Before I share our current dilemma, it is important to note that I’m a data analyst put in charge of taking care of our Looker nodes, not a proper DevOps person. And Looker staff, feel free to correct me if I get this wrong (please, it would help me learn).

The current struggle I’m working through is getting more CPUs per node. Currently, we have 4 nodes, each with 8 CPUs. It seems that when a customer creates a dashboard with many tiles (25+) and schedules it to be rendered, all of the rendering jobs get assigned to the same node. So we see that node maxed out for 2-15 minutes while the job completes. And I believe that if a user is operating off the same node, their experience is horrific. I’m requesting we slowly add more CPUs to our nodes, while exploring enforcing a limit on queries per dashboard. Of course, the latter is difficult because there doesn’t seem to be a built-in option.
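Without a built-in limit, one workaround is to audit dashboards by tile count and follow up on the heavy ones by hand. Below is a minimal sketch of that check, assuming you already have a mapping of dashboard name to tile count from your own metadata export. The function name and the sample data are hypothetical; only the threshold of 25 tiles comes from the numbers above.

```python
from typing import Dict, List


def dashboards_over_tile_limit(tile_counts: Dict[str, int], max_tiles: int = 25) -> List[str]:
    """Return the names of dashboards with more than max_tiles tiles, largest first."""
    over = [(name, count) for name, count in tile_counts.items() if count > max_tiles]
    over.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in over]


# Hypothetical tile counts, purely for illustration.
example_counts = {"Sales overview": 12, "Ops wall": 40, "Customer 360": 26}
print(dashboards_over_tile_limit(example_counts))
```

This only reports offenders; it does not block anyone from saving or scheduling a large dashboard.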

Anyone else out there willing to share?

thomas_brittain

I didn’t know this article existed. I thought I’d link it here in case others need it.

Monitoring Looker

agovindan-16279

@thomas_brittain We are in the same boat as you in my current organization. We are just onboarding with Looker and looking for best practices and tips related to a self-hosted Looker installation. Again, like you, we are a team of analysts, but we are trying to work with our DevOps team to set this up. They are clueless and we are even more clueless!

Documentation seems to be very scarce and not answering a lot of questions. It's all based on an assumption that it will be Google/Looker hosted only which is bad!

For example, one of the questions we have - Is it recommended to set up the MySQL database during initial installation? What would be a typical scenario when the Hyper-SQL database size goes beyond 600MB and performance issues start cropping up - how many concurrent users OR how many dashboards being accessed? Also what tables does this metadata contain? It says configurations, users and other data? What is this "other data"? Is it this Hyper-SQL/MySQL DB we have to look into for monitoring license usage? Is there some schema diagram and documentation available for this DB?

This and many more such questions. Is there some group of Looker Administrators? (a rare breed I would think considering many seem to be taking the Looker hosted installation approach!)

Eric_Lyons

Hi @thomas_brittain this is a great question. I specifically wanted to pull out this portion:

The current struggle I’m working through is getting more CPUs per node. Currently, we have 4 nodes, each with 8 CPUs. It seems that when a customer creates a dashboard with many tiles (25+) and schedules it to be rendered, all of the rendering jobs get assigned to the same node. So we see that node maxed out for 2-15 minutes while the job completes. And I believe that if a user is operating off the same node, their experience is horrific. I’m requesting we slowly add more CPUs to our nodes, while exploring enforcing a limit on queries per dashboard. Of course, the latter is difficult because there doesn’t seem to be a built-in option.

One potential option is to create a set of dedicated renderer or scheduler nodes. You can move those render-specific nodes out from behind your load balancer (so that general traffic cannot hit these VMs) and set the renderer threads on the nodes behind the ELB (which I will call the UI nodes) to 0 in the lookerstart.cfg file.

We then set the render threads on the render nodes to the default number of threads, or you could potentially increase it, provided the render nodes have enough memory allocated. This way, render tasks get directed to your renderer nodes, which should remove some of the memory overhead from the UI nodes.

Thanks,

Eric

thomas_brittain

Hello @Eric_Lyons,

Thanks for the note.

We actually tried this. However, I found that creating a “scheduler node” was not as simple as setting

--concurrent-render-jobs=0
--scheduler-threads=0

on the nodes in the LB rotation and removing the “scheduler node” from the LB’s rotation. We tried this and noticed the nodes in rotation would still send out schedules, though far fewer. At the time (late 2019), I assumed it might have to do with Alerts or other features not yet covered by the startup parameters, but I did not continue to investigate, as our issue turned out to be network-related.

On a different note, we discovered the performance issue was entirely due to the bandwidth capacity of the DBs → Looker cluster link. If there is not enough throughput into Looker, then the Looker nodes seem to run hot, which may lead one to think it is a compute resource issue. I’m assuming the nodes are spending a lot of compute managing jobs which cannot complete because they do not yet have the data they need. But that’s speculation.
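To get a feel for how throughput alone can stretch a scheduled render, here is a back-of-the-envelope sketch. All numbers are hypothetical (they are not measurements from our cluster), and the function name is made up for illustration; it just converts a result size in megabytes and a link speed in megabits per second into transfer time.

```python
def transfer_seconds(result_megabytes: float, link_megabits_per_second: float) -> float:
    """Time to move query results over the DB -> Looker link, ignoring latency and overhead."""
    if link_megabits_per_second <= 0:
        raise ValueError("link speed must be positive")
    return result_megabytes * 8 / link_megabits_per_second


# Hypothetical case: a 25-tile dashboard where each tile pulls 40 MB of results
# over a 100 Mbit/s link shared by nothing else.
total_megabytes = 25 * 40
print(transfer_seconds(total_megabytes, 100))  # 80.0 seconds spent just moving data
```

If the link is shared with other traffic, the effective speed is lower and the wait grows accordingly.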

@Eric_Lyons, would you be able to provide more specifics on creating a “scheduler / render” node, including the exact startup parameters needed to be successful? It’s for posterity.

thomas_brittain

For other on-prem Looker admins out there, I wrote a Python script for downloading specific Looker versions to make it easier to apply patches. Hope it helps.

Python script for download specific JARs

Cheers!

Darton

One way we solved the scalability issue was rather cunning. As we all know, we can choose which instances act as schedulers and which are general purpose (UI usage). We can also control master delegation, and the master node should usually not be doing any rendering, scheduling, or other work for that matter.

So, in AWS, we provision one EC2 instance in a separate auto-scaling group (let's call it master) so that Looker automatically assumes it is the master node. We don't allow load balancer traffic to this one; we just use it as a node-to-node commander. Then we created a different auto-scaling group with traffic from the load balancer (let's call it ui), where we set --concurrent-render-jobs=0 and --scheduler-threads=0. Lastly, we create several instances under a different auto-scaling group with traffic from the load balancer (let's call it schedulers) that run the scheduled jobs.

This solved the scaling issue where a randomly delegated master node could suffer from CPU saturation or memory pressure and crash, leaving Looker to delegate a new master, which made the cluster unstable. And since we use auto-scaling groups, we are free to auto-scale the schedulers and ui groups in AWS based on CPU/memory without killing the master node.

This solution works very well for our "on-prem" setup, but it comes with one consequence: you always have to provision your VMs in the right order. First provision a solo VM (Looker is forced to make it the master node; this one gets no load balancer traffic), then provision the rest of the VMs in the cluster for scheduling and UI usage. Hope it helps. Cheers.
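The ordering constraint above can be captured in a small planning helper. This is only a sketch: the group names follow the post, the instance counts and the function name are hypothetical, and it produces a launch order rather than calling any AWS API.

```python
from typing import List, Tuple


def provisioning_order(groups: List[Tuple[str, int]]) -> List[str]:
    """Return instance launch order with the single 'master' instance first."""
    master_counts = [count for name, count in groups if name == "master"]
    if master_counts != [1]:
        raise ValueError("expected exactly one 'master' group containing one instance")
    order = ["master-1"]  # the solo VM that Looker will treat as the master node
    for name, count in groups:
        if name == "master":
            continue
        order.extend(f"{name}-{i}" for i in range(1, count + 1))
    return order


# Hypothetical group sizes.
print(provisioning_order([("ui", 2), ("master", 1), ("schedulers", 1)]))
```

In a real rollout you would also wait for the master instance to be up before launching the remaining groups.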

rdunlavy

@thomas_brittain for the separate scheduler and render setup, the complete list of flags to add to the nodes behind the LB would be:

--scheduler-threads=0
--unlimited-scheduler-threads=0
--concurrent-render-caching-jobs=0
--concurrent-render-jobs=0

The --unlimited-scheduler-threads one is most likely the cause of the behavior you’re seeing as you described it. There are two separate sets of threads working on unlimited scheduled jobs (any schedules of a look/explore with “All Results” selected in the config) and limited scheduled jobs (everything else).

The --concurrent-render-caching-jobs one will prevent the nodes behind the LB from running any queries related to render jobs. Rendering PDFs/PNGs is a two-stage process: first the queries are pre-run to cache the results, then Chromium is booted up and the dashboard is loaded to generate the image after retrieving the results from cache. Setting --concurrent-render-caching-jobs=0 prevents the node from doing the first step, and --concurrent-render-jobs=0 prevents it from doing the second.
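To keep the two node types consistent, the flag list can be generated per role instead of hand-edited on each VM. Here is a minimal sketch. The four flags are the ones listed above; the role names, the function name, and the assumption that lookerstart.cfg takes a single LOOKERARGS="..." line are mine, so check them against your own install before using anything like this.

```python
# Flags from this thread for nodes that should NOT run schedules or renders.
UI_NODE_FLAGS = [
    "--scheduler-threads=0",
    "--unlimited-scheduler-threads=0",
    "--concurrent-render-caching-jobs=0",
    "--concurrent-render-jobs=0",
]


def looker_args_line(role: str) -> str:
    """Build a LOOKERARGS line for a hypothetical 'ui' or 'scheduler' role."""
    if role == "ui":
        flags = UI_NODE_FLAGS
    elif role == "scheduler":
        flags = []  # leave defaults so this node picks up schedule and render work
    else:
        raise ValueError(f"unknown role: {role!r}")
    # Assumed file format: one LOOKERARGS="..." line in lookerstart.cfg.
    return 'LOOKERARGS="{}"'.format(" ".join(flags))


print(looker_args_line("ui"))
```

Any other startup options you already pass would need to be merged into the same line.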

thomas_brittain

One more tool for the on-prem admin tool-bag.

Looker UI response time testing with Python and Selenium