At Looker we are often asked about best practices for designing a new data warehouse. Typically the question comes up just as companies are moving to MPP, and often column-oriented, databases, so it is clear from the start that replicating the design of an operational database is not appropriate.
It is also clear that there is no single answer: a Google search turns up millions of results. That said, here are a few rules of thumb you can apply as you build an analytical data warehouse to work with Looker:
- simplicity (aka shortest path)
- performance
- single copy of data
- transparent EL process
Shortest Path
You should not need mapping tables or entity-attribute-value (EAV) tables to get to a value. The path between any two dimensions in a single SELECT query should not involve more than a couple of joins.
Typically, "long path" designs arise when the original data was stored in a NoSQL format. Because there is little analytical value in performing such transformations in Looker (even if Looker makes them more transparent), it is better to perform this logic before the data reaches the warehouse.
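As a hedged illustration (the table and column names here are hypothetical), compare the join path needed to read just two attributes out of an EAV table with the flat table that should replace it:

```sql
-- Long path: self-joining a hypothetical EAV table just to read two attributes.
SELECT e1.entity_id,
       e1.value AS customer_name,
       e2.value AS customer_city
FROM entity_attributes e1
JOIN entity_attributes e2
  ON e2.entity_id = e1.entity_id
 AND e2.attribute = 'city'
WHERE e1.attribute = 'name';

-- Shortest path: flatten the attributes into columns before loading the warehouse.
SELECT customer_id, name, city
FROM customers;
```

Every additional attribute in the first shape costs another self-join; in the second it is just another column.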
Performance
We’ve already had a number of posts on performance: 1) case studies, 2) Redshift Performance, and 3) more on Redshift if you want to understand storage usage and its distribution/skew. Performance matters because in the Looker world no one waits in line to get a report: even calculations over billions of rows are expected to be instantaneous.
Because performance often requires redundancy or more complicated EL/ETL processes, it must be balanced against the desire to keep a single copy of data and against the goal of a transparent EL process. In some cases, date fields might need to be duplicated across tables; in other cases, de-normalization of tables is best done outside of Looker.
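As a sketch of that trade-off (all table names here are hypothetical), a denormalized table built in the EL process trades a second copy of the order date for single-table scans in Looker:

```sql
-- Built in the EL process, not in Looker: one wide table lets reports
-- avoid a join, at the cost of duplicating order-level fields per item.
CREATE TABLE order_items_denorm AS
SELECT oi.order_item_id,
       oi.product_id,
       oi.quantity,
       o.order_id,
       o.customer_id,
       o.created_at AS order_created_at  -- duplicated from orders for filter/sort performance
FROM order_items oi
JOIN orders o ON o.order_id = oi.order_id;
```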
Single copy of data
Storing the same information in multiple columns across several tables is very confusing: it is often unclear whether the copies differ in how the data was collected.
Typically, the rule of thumb is to store any data only once if it can ever change. This means that customer information, for example, should not be duplicated across several tables.
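A minimal sketch, assuming hypothetical customers and orders tables: keep mutable customer attributes in one place and reference them by key, rather than copying them onto every fact table:

```sql
-- Customer attributes live once; they can change without becoming inconsistent.
CREATE TABLE customers (
    customer_id BIGINT PRIMARY KEY,
    email       VARCHAR(255),
    city        VARCHAR(100)
);

-- Orders reference customers by key instead of duplicating email/city.
CREATE TABLE orders (
    order_id    BIGINT PRIMARY KEY,
    customer_id BIGINT REFERENCES customers (customer_id),
    created_at  TIMESTAMP
);
```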
Transparent EL process
High-quality analytics requires an understanding of all the assumptions the model makes about the data, and LookML provides that transparency. However, if a complicated transformation, join, or filter is applied before the data lands in the warehouse, your assumptions about the underlying data might be wrong.
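For example, consider a hypothetical staging query. A filter baked in during EL is invisible downstream; loading the raw rows and declaring the filter in the model keeps the assumption in view:

```sql
-- Opaque: a filter baked into the EL step is invisible to analysts in Looker.
INSERT INTO warehouse.orders
SELECT *
FROM staging.orders
WHERE status <> 'test';

-- Transparent: load the raw rows unchanged...
INSERT INTO warehouse.orders
SELECT *
FROM staging.orders;
-- ...and declare the "exclude test orders" assumption in LookML instead,
-- where anyone reading the model can see it and question it.
```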
In summary
It is easiest to visualize these priorities in terms of a 2x2 matrix. For example:
Simplicity vs. Performance
Likewise, it can be important to consider Simplicity vs. Single Copy of Data, Performance vs. Single Copy of Data, etc.