Question

Kimball Data Warehousing? Or does Looker let us do more?

  • 20 August 2017
  • 4 replies
  • 379 views

Hi folks,


We’ve been using using Looker successfully for the past 6-8 months, but are now outgrowing the initial design, and want to create a more powerful Data Warehouse – more data in it, more analyses supported, etc. As part of the process, my team and I are reading the classic(?) Kimball’s Data Warehouse book.


My question is pretty general at this point: Do you find that you need to follow all the techniques in that book strictly? Or do you find that because Looker has things like symmetric aggregates and lets you explore complex joins more easily, you are breaking some of the “rules” in the book? For example, I’m wondering about strictly sticking to a star schema vs. doing some snowflaking? i.e. Does that book need an update because of what Looker has done?


If we follow the Kimball approach, at least on the face of it, it looks we might have to do a lot more in ETL than I did in the first iteration of the data warehouse. Maybe that’s what Looker does? Allows for simpler ETL, since more can be done inside the LookML layer? Do you have any examples of this you can share?


I will likely have more specifics to ask, but wanted to open with this one!


Thanks so much,

Artem


4 replies

Userlevel 2

This is quite an old thread. But I stumbled upon it looking for some specific information, and wanted to comment for anyone else that happens to land here with similar questions.


Kimball’s designs are not based on the assumption that storage is expensive. One of Kimball’s main goals is to de-normalize data. This always ends up in redundant data that takes up more space. And if one actually reads any of the books, they are always defending the de-normalization of data in lieu of saving space.


Second, to say that Kimball’s designs assume that DBMSs are slow is misguided. Just because a methodology was born in an era of slower systems, doesn’t automatically make the design outdated. To this day, DBAs and data architects still caution heavily about the expense of poorly constructed SQL statements and too many JOINs. Performance concerns haven’t changed. And why should they? Hardware is certainly faster. But data is also exponentially bigger. A database that held 5,000,000 records in 1995 might hold 500,000,000,000,000 records today. Dimensional modelling addresses performance in ANY environment. Just because systems are faster today, doesn’t mean we should ignore everything else. If we did, we would find ourselves in the “just throw some more RAM and a new CPU in there” trap. Data warehousing is not necessarily cheap. By ignoring the performance of your data models, you’ll likely end up spending a lot more in computing credits from your host.


Finally, some of the main points that Dimensional modelling addresses actually have nothing to do with hardware and system software. Many of Kimball’s focuses included:



  • Making the data more available to end-users

  • Modelling the data for simple navigation

  • Eliminating inconsistencies in atomic data and calculated data

  • Unification and assimilation of data from multiple source databases


Most of Kimball’s points are still very relevant today. In many respects, they are more relevant with the large amounts of data we are dealing with - and will continue to deal with in the future.

Userlevel 6
Badge

Kimball’s designs are based on the assumptions that databases are relatively slow and that storage is expensive. Neither of these things are true anymore. Most of Kimball’s designs are around a particular form of cubing.


Column stores are amazingly fast and efficient. Scanning and storage is cheap.


So to answer your question, Kimball’s book has limited value in today’s world.


There are lots of really good documented patterns in LookML that can help with lots of this.


A good starting place to find patterns in data are here:


https://discourse.looker.com/c/lookml/blocks


Is there anything in particular that you are trying to do that you are finding difficult?

Thank you so much for the reply, Lloyd. The blocks are helpful, and I didn’t know there was a way to see them all in one place like that.


The reason we were looking at Kimball’s book is we wanted some instructions on how to best design the tables in a Data Warehouse, even before LookML comes into play. I cobbled together a Data Warehouse v1 with some help from your excellent onboarding team, but I felt like I made some bad design decisions early on that were hurting us, so when designing Data Warehouse v2, was hoping to avoid them.


I see what you mean about Kimball, which is why I was asking the question, although the latest edition seems to note a lot of changes in cost of hardware and the impact on the design. The stuff I found of value are (to pick a tiny example) avoiding having a NULL as Foreign Key and instead linking it to a related item that would have a default value. Or various patterns around how to handle slowly changing dimensions (SCDs).


So what I’m looking specifically is something that explains how to best design a Data Warehouse to be based on best practices and also be Looker-friendly! Maybe you can write a book one day. I just haven’t seen a lot of best practices, and individual blocks don’t explain how to put it all together.


In lieu of a book, I’d also love to see if you or anyone else wants to trade best practices on how to best handle a typical subscription-based business. We spend money on ads in various channels (e.g. Facebook, Adwords, Radio), have people pay us subscription revenue over time for various digital SKUs they buy, and we run a lot of optimization experiments on our users. I’d be curious if there is a full example of something like that, as it feels to be similar to 100s of other businesses, and I’d love not to reinvent the wheel. What I love in Kimball’s book are complete case studies of an industry, and showing how he would lay out the tables/columns. I just want a Looker-friendly and updated version!


Any other ideas on how to avoid reinventing the wheel are certainly appreciated. And yes, the blocks are an awesome help!


Artem

Kimball’s designs are based on the assumptions that databases are relatively slow and that storage is expensive. Neither of these things are true anymore. Most of Kimball’s designs are around a particular form of cubing.

 

Column stores are amazingly fast and efficient. Scanning and storage is cheap.

 

 

So to answer your question, Kimball’s book has limited value in today’s world.

 

 

There are lots of really good documented patterns in LookML that can help with lots of this.

 

 

A good starting place to find patterns in data are here:

 

 

https://discourse.looker.com/c/lookml/blocks

 

 

Is there anything in particular that you are trying to do that you are finding difficult?

 

@lloydtabb 

The link is not longer valid, do you mind to point me out where can I found documented patterns in LookML? is there any book you can recommend for this purpose?

Reply