Question

How to isolate development databases from production ones?

  • 2 December 2015
  • 3 replies
  • 360 views

In my organization, we have some practices and requirements that don’t match up well with Looker’s assumptions about how developers should work with data.


They’ve caused us to add a lot of additional tooling to make things work; I thought I’d describe them in hopes of



  1. Explaining why those assumptions don’t work and getting product improvements

  2. Helping others who need to work around those assumptions.


The problem


We are a health tech company. That means we have to be very good at compartmentalizing data, encrypting and de-identifying it, and maintaining our agility.


In implementing a specific feature (“story”) here, the typical workflow is:



  1. Create a script that enhances the warehouse’s schema

  2. Create an ETL or event stream which provides that data to the warehouse

  3. Add explores and views to our Looker models

  4. Add visualizations to a dashboard


We want to be able to release (and rollback) the “whole story” and “nothing but the story”. But we deliver no value to our customers or internal users by just getting the data to a warehouse. So most of the time, we want all four of those steps to go together.


Also, since we can literally affect the life and death of people, stories go through an acceptance test process. Code (including new schemas) have to be deployed on integration environments, then on acceptance test environments, and then to the production environment (of course, only once we’re sure the code (and schema) works).


Yet, of course, there are more than one developer and there are many stories to tell. And we go through that acceptance test process quickly; we release weekly.


What we do


To make that process work, we want every developer and every integration environment to have their own warehouse (just like they have their own OLTP databases, message queues, etc). That isolates their changes to the story we’re telling.


In Looker, that means that every model file has to be changed in two ways for each environment: the connection for that model has to change, and the name of the model file has to change.


Also, no views or explores can name their models.


Every developer and environment gets their own Github repository. E.g., my “cobbr2/looker-master” repo is a fork of the company’s “ConsultingMD/looker-master” repo. They also get their own Looker project, which is configured to look at that Github repo.


Every developer and environment gets a database in our “test” Redshift cluster, and our Looker world is configured to have a connection per database.


The company maintains 4 Github repos for Looker alone: the looker-master, plus two for acceptance testing and one for production.


In order to start work on a story, developers:



  • Use a clone of their repo on their personal workstation.

  • That clone is set up with two git upstreams: their personal fork and the corporate looker-master.

  • Pull from our corporate looker-master.

  • Run a script which renames the model files and changes the connection information in every model file (it also removes model: entries in views and explores).


Then, in Looker, they sync their personal LookML project developer branch to “production”.


When we complete work on a story, we have to continue this convoluted process:



  • Create a proper “story” branch (fitting our naming convention) from the development branch on the personal repo

  • Run the renaming script to rename things for the corporate looker-master repo

  • Submit a PR using the normal Github fork PR process


In order to avoid polluting that story with the next one, the developer then:



  • Checks out the master branch of their personal repo

  • Merges the corporate master into it again

  • Runs the renamer again

  • Then in Looker, reverts their personal branch to production, and starts again.


This process is error prone and very inconvenient:



  • Imagine what the developer has to do if their pull request is rejected (use their workstation to stash their current branch, revert the renaming commit on the tip of their old story branch, reset their developer branch to the old story branch, fix the bug, do it again).

  • With only a few developers and roughly 50 views (many of which aren’t explorable), we have about 450 entries on our Explore menu

  • There are tons of nuisance commits to deal with the file renames (7% of our commits)

  • The tools have to use generalized YAML parsers and YAML generators which don’t cope with the extra spaces that Looker allows in LookML (more nuisance commits) or reformat the files (more noise in our code reviews).


Why the current Looker branching process doesn’t work



  • It doesn’t allow us to isolate databases by specific developer or environment.

  • It doesn’t provide a scheme for separate branches per story

  • It requires us to change the model file names just to get separate connections per developer


Suggestions?



  • Provide an “environment” or “configuration” abstraction which contains the database connection separately from the models (and the branch)

  • Allow developers to “branch from master” and name their branch

  • Allow users to select the branch they are working on

  • Permit users to branches

  • Stop using the “master” branch to mean “production”. In normal projects, “master” is many commits ahead of the “latest release”.

  • To support that, perhaps we would want to make “projects” represent a pair: a branch and an environment. So “production” could be using a release branch (e.g., “rc/branch/2015-11-30”) and a production database environment, and “release management” becomes just naming the correct branch. And developers create new projects for every story.


I’m not sure those suggestions quite add up to what we need, but they’d go a long way in the right direction.


In the meantime, one thing we really like about LookML is that we can process it with 3rd party parsers and generators. If we lose that, we lose the ability to work around the problems Looker’s abstractions impose.


3 replies

Userlevel 3

+1 on this.

As our data and reporting infrastructure continues to mature (and become criticial to business processes), a well thought through productionisation process along with compartmentalisation will become ever more valuable.

+1 would love more granular branch control!

+1. We have very similar issues, and maintain 2 separate full environments (staging and production) on different Git repos. In particular we struggle with stories that fail or are delayed in UAT. When this happens it’s a very tedious process to back these out of the staging repo master branch before pushing the staging repo changes to the production repo.

Reply