Question

Kimball Data Warehousing? Or does Looker let us do more?

  • 20 August 2017
  • 5 replies
  • 1014 views

Hi folks,


We’ve been using Looker successfully for the past 6-8 months, but we’re now outgrowing the initial design and want to create a more powerful Data Warehouse – more data in it, more analyses supported, etc. As part of the process, my team and I are reading Kimball’s classic(?) Data Warehouse book.


My question is pretty general at this point: Do you find that you need to follow all the techniques in that book strictly? Or do you find that because Looker has things like symmetric aggregates and lets you explore complex joins more easily, you are breaking some of the “rules” in the book? For example, I’m wondering about strictly sticking to a star schema vs. doing some snowflaking? i.e. Does that book need an update because of what Looker has done?
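To make the snowflaking question concrete, here is roughly the kind of explore I have in mind (every table, field, and join name below is invented, so treat it as a sketch rather than our actual model):

```lookml
# Hypothetical snowflaked chain: orders -> products -> product_categories,
# plus a one_to_many join where Looker's symmetric aggregates matter.
explore: orders {
  join: order_items {
    type: left_outer
    sql_on: ${orders.order_id} = ${order_items.order_id} ;;
    relationship: one_to_many   # fans out order rows; symmetric aggregates keep order-level measures correct
  }
  join: products {
    type: left_outer
    sql_on: ${order_items.product_id} = ${products.product_id} ;;
    relationship: many_to_one
  }
  join: product_categories {
    type: left_outer
    sql_on: ${products.category_id} = ${product_categories.category_id} ;;
    relationship: many_to_one
  }
}
```

Kimball would presumably want products and product_categories flattened into a single dimension; my question is whether declaring the chain like this in LookML makes that flattening less necessary.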


If we follow the Kimball approach, at least on the face of it, it looks like we might have to do a lot more in ETL than I did in the first iteration of the data warehouse. Maybe that’s part of what Looker enables – simpler ETL, since more can be done inside the LookML layer? Do you have any examples of this you can share?
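For instance, I imagine something like this hypothetical persistent derived table, where a light transformation lives in LookML instead of upstream ETL (table and column names are made up):

```lookml
# Sketch of a persistent derived table that rolls payments up to one row
# per subscription per day, rebuilt once a day by the trigger query.
view: daily_subscription_revenue {
  derived_table: {
    sql:
      SELECT
        subscription_id,
        DATE(charged_at) AS charge_date,
        SUM(amount)      AS revenue
      FROM payments
      GROUP BY 1, 2 ;;
    sql_trigger_value: SELECT CURRENT_DATE ;;
  }

  dimension: subscription_id {
    type: number
    sql: ${TABLE}.subscription_id ;;
  }

  dimension_group: charge {
    type: time
    timeframes: [date, week, month]
    sql: ${TABLE}.charge_date ;;
  }

  measure: total_revenue {
    type: sum
    sql: ${TABLE}.revenue ;;
  }
}
```

Is that the sort of thing people mean when they say LookML can take over some of the ETL?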


I will likely have more specifics to ask, but wanted to open with this one!


Thanks so much,

Artem


5 replies


Kimball’s designs are based on the assumptions that databases are relatively slow and that storage is expensive. Neither of these things are true anymore. Most of Kimball’s designs are around a particular form of cubing.


Column stores are amazingly fast and efficient. Scanning and storage are cheap.


So to answer your question, Kimball’s book has limited value in today’s world.


There are lots of really good documented patterns in LookML that can help with lots of this.


A good starting place to find patterns is here:


https://discourse.looker.com/c/lookml/blocks


Is there anything in particular that you are trying to do that you are finding difficult?

Thank you so much for the reply, Lloyd. The blocks are helpful, and I didn’t know there was a way to see them all in one place like that.


The reason we were looking at Kimball’s book is that we wanted some guidance on how best to design the tables in a Data Warehouse, even before LookML comes into play. I cobbled together a Data Warehouse v1 with some help from your excellent onboarding team, but I felt like I made some bad design decisions early on that were hurting us, so when designing Data Warehouse v2 I was hoping to avoid them.


I see what you mean about Kimball, which is why I was asking the question, although the latest edition does seem to note the changes in hardware costs and their impact on the design. The things I found of value include (to pick a tiny example) avoiding a NULL foreign key and instead linking it to a related row with a default value, or the various patterns for handling slowly changing dimensions (SCDs).
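To sketch the NULL-foreign-key idea in Looker terms (Kimball would do the substitution in ETL; the table names and the -1 “Unknown” key below are just assumptions on my part):

```lookml
# Hypothetical join that maps a missing channel_id to a default 'Unknown'
# dimension row (channel_id = -1) so the join never NULLs out rows.
explore: ad_spend {
  join: channels {
    type: left_outer
    sql_on: COALESCE(${ad_spend.channel_id}, -1) = ${channels.channel_id} ;;
    relationship: many_to_one
  }
}
```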


So what I’m looking for specifically is something that explains how to design a Data Warehouse that follows best practices and is also Looker-friendly! Maybe you can write a book one day. I just haven’t seen a lot of best practices, and individual blocks don’t explain how to put it all together.


In lieu of a book, I’d also love to see if you or anyone else wants to trade best practices on how to handle a typical subscription-based business. We spend money on ads in various channels (e.g. Facebook, Adwords, Radio), have people pay us subscription revenue over time for various digital SKUs they buy, and we run a lot of optimization experiments on our users. I’d be curious if there is a full example of something like that, as it feels similar to hundreds of other businesses, and I’d love not to reinvent the wheel. What I love in Kimball’s book are the complete case studies of an industry, showing how he would lay out the tables/columns. I just want a Looker-friendly, updated version!
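For example (every name below is invented), I picture something shaped roughly like one fact explore per business process, sharing conformed dimensions:

```lookml
# Sketch only: subscription_charges and ad_spend are separate business
# processes, while customers/skus/channels are shared dimension views.
explore: subscription_charges {   # one row per charge
  join: customers {
    sql_on: ${subscription_charges.customer_id} = ${customers.customer_id} ;;
    relationship: many_to_one
  }
  join: skus {
    sql_on: ${subscription_charges.sku_id} = ${skus.sku_id} ;;
    relationship: many_to_one
  }
}

explore: ad_spend {               # one row per channel per day
  join: channels {
    sql_on: ${ad_spend.channel_id} = ${channels.channel_id} ;;
    relationship: many_to_one
  }
}
```

But I have no idea yet where the optimization experiments fit into a layout like that, which is exactly the kind of thing a worked case study would answer.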


Any other ideas on how to avoid reinventing the wheel are certainly appreciated. And yes, the blocks are an awesome help!


Artem


This is quite an old thread. But I stumbled upon it looking for some specific information, and wanted to comment for anyone else that happens to land here with similar questions.


Kimball’s designs are not based on the assumption that storage is expensive. One of Kimball’s main goals is to de-normalize data, which always results in redundant data that takes up more space. If you actually read any of the books, they consistently defend de-normalization at the expense of storage.


Second, to say that Kimball’s designs assume that DBMSs are slow is misguided. Just because a methodology was born in an era of slower systems doesn’t automatically make the design outdated. To this day, DBAs and data architects still caution heavily about the expense of poorly constructed SQL statements and too many JOINs. Performance concerns haven’t changed. And why should they? Hardware is certainly faster, but data is also exponentially bigger. A database that held 5,000,000 records in 1995 might hold 500,000,000,000,000 records today. Dimensional modelling addresses performance in ANY environment. Just because systems are faster today doesn’t mean we should ignore everything else; if we did, we would find ourselves in the “just throw some more RAM and a new CPU in there” trap. Data warehousing is not necessarily cheap. By ignoring the performance of your data models, you’ll likely end up spending a lot more in computing credits with your host.


Finally, some of the main points that Dimensional modelling addresses actually have nothing to do with hardware and system software. Many of Kimball’s focuses included:



  • Making the data more available to end-users

  • Modelling the data for simple navigation

  • Eliminating inconsistencies in atomic data and calculated data

  • Unification and assimilation of data from multiple source databases


Most of Kimball’s points are still very relevant today. In many respects, they are more relevant with the large amounts of data we are dealing with - and will continue to deal with in the future.

@lloydtabb 

The blocks link above is no longer valid. Could you point me to where I can find documented patterns in LookML now? Is there any book you can recommend for this purpose?

Artem

There is a big problem with people thinking their databases have infinite processing speed and capacity and that Looker can instantly display the answer to any problem. If you are hoping to get results back in a few seconds, or to limit your spend with BigQuery or elsewhere in the cloud, don’t expect that without doing some data architecture work on the back end.

To add to the comments above: Kimball’s primary purpose is to ensure that a company’s data conforms to what the business expects it to be. We do this by looking at each business process and ensuring that its descriptive attributes conform with every other process that uses the same attributes. Likewise, we ensure that each business process’s finest grain of information is carefully managed and that the attributes attached to it continuously conform to what the business expects. When that “beep” happens, we still need to make sure our attributes conform to the business’s expectations; we can’t simply trust that the beep will make any sense to stakeholders when we try to graph it. Kimball also defines a rigorous ETL methodology that ensures the model can withstand change.

Looker, at face value, provides the capability to support a small but important part of the Kimball model, conforming your dimensions and metrics, by establishing a customizable metadata layer. But metadata alone won’t ensure the quality of your results or conformance with business stakeholder expectations.
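As a rough sketch of what conforming a metric in that metadata layer can look like (the names below are invented), the revenue definition lives in one place so every explore and dashboard that uses it agrees on it:

```lookml
# Hypothetical view: net_revenue is defined once, here, rather than
# re-derived slightly differently in every report.
view: subscription_charges {
  sql_table_name: warehouse.fact_subscription_charges ;;

  measure: net_revenue {
    type: sum
    sql: ${TABLE}.amount - ${TABLE}.refunded_amount ;;
    value_format_name: usd
  }
}
```

That helps with consistency, but it says nothing about whether the underlying tables were designed and loaded correctly in the first place.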

Kimball proposes a three-pronged parallel approach to developing a data warehouse and BI solution once the business requirements are established: dimensional modeling, technical architecture, and BI development.

You will see that Power BI acknowledged this shortcoming of their software by recommending that their customers implement Kimball as a default approach. Although I’m not seeing this with Looker, the fact still stands that the laws of physics are immutable: you will hit processing bottlenecks and see rising processing costs in your cloud accounts as you scale your app if these things are not considered up front.

I recommend using the Kimball Toolkit and whatever other toolkits you can find to ensure this does not happen. If possible, cache your dimensional information as far downstream as possible for any web session so as to limit call-backs to your server -- e.g. dropdown controls are usually cached rather than reloaded every time a person slices their dashboard. Think about your dimensions and aggregations the same way you would those drop-downs. Why incur the cost of repeatedly recalculating and reloading the same data, moment by moment, across your enterprise, or of continually retrieving the same wide values over and over, when the data is slowly changing or not changing at all over time?
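In LookML terms, that caching idea might look something like the hypothetical datagroup below, so query caching and rebuilds follow how often the data actually changes instead of hitting the warehouse on every dashboard load (names are invented):

```lookml
# Sketch: cache results until the nightly load lands, or for 24 hours at most.
datagroup: nightly_etl {
  sql_trigger: SELECT MAX(loaded_at) FROM warehouse.etl_audit ;;
  max_cache_age: "24 hours"
}

explore: subscription_charges {
  persist_with: nightly_etl
  # joins omitted
}
```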

 

 
