About This Block
State data, also known as slowly changing dimensions (SCDs), are a growing datatype found in modern data management and analytics. SCDs are specific types of data that can change over time at any moment. With the proper table schema, we want to establish a structure that can not only identify the current state of a record, but also provide historical record keeping of any previous states of that record.
You may find yourself interested in managing state data for the below reasons:
- Desire to perform deep-dive historical analysis
- Working with modern, “append-only” technologies such as BigQuery and Hadoop as your analytics datastore.
In this block, I will be exploring a way to leverage Looker to analyze this type of data (specifically slowly changing dimensions type 2)
Ideal Data Types
Below are some examples of data that are often considered state data:
- Dimension tables containing randomly changing attributes (address, subscription, etc)
- Milestone or progress tables
- Historic tables
Let’s start with a sample state data table - I’m going to use a “User Account” table as an example. It is reasonable to expect that a given user’s account status may change over time. Therefore, the table captures attributes about each user, and the respective timeframes for each status (or “Account Status” in the picture below), prior to a change in status. Notice how a user’s current “Account Status” is indicated by specifying the “End” timestamp as
Given that we are populating this users table this way, how can we leverage Looker to answer current and historical questions about our users?
Current State: Users and statuses
While a simple
COUNT might work on a standard users table, it will overreport users on a SCD table. This is due to the existence of a record for every account state of a user. Therefore, we will create a custom measure.
- measure: user_count
We will now be looking at a count of distinct users within the table based on the user_id. We need to use a count_distinct because the same user can appear multiple time in this table. The output of this measure against the table above will be 10.
Being able to leverage count_distinct to get the unique user count from a state table is useful; however, a different approach is needed to only look at current records. Remember how all current records have an end timestamp of
9999-12-31 23:59:59? We can use the phrase
after today in the matches filter type to exclude historial rows.
Simply apply the “after today” phrase into a matches filter on the
end dimension. Now you can bring in all desired fields.
Historical State: User Count and Detail
The real power of the state table is being able “rewind” and gain state data at any previous point in time. Again, let’s start with a simple user count. In the current example, I created a custom measure with a filter that only pulled current records. Since we want the flexibility to look at any point in history, we will perform our analysis entirely in the Explore while leveraging filters there.
Pretend we want to see user account states back on November 19th, 2011. Simple specify a start and end date filter and bring in the fields. Remember to filter the “begin date” before the date of interest and the “end date” after the date of interest.
Notice how this is the current state for some users but a previous state for others. Comparing to the image above, you can see that certain users are at a lower status at this point in time.
User state data that evolves over time lends itself nicely to timeseries analysis. In our example, we have users whose statuses improve over time. We can setup custom measures to watch as the counts change over time.
First, we need to join our user accounts state table to a standardized date table. This is accomplished via a join that respects the user state. Notice how the join predicate is satisfied as long as the user’s state is applicable during that date of the date table.
From here, we can build things like filtered measures to watch when statuses are obtained by all our users, as well as how quickly those statuses are vacated (either by promotion or demotion).
Where is the code?
Getting 401 on the explores, and pictures won’t load.
I would be interested in counting, for example, the number of days a user account has been active. Given a SCD with columns Status (Active, Inactive), Start Date, End Date.
Does this block do this?
The links in this post are no longer active, but there is an example of this model here: https://github.com/llooker/salesforce_fivetran/blob/master/sf_opportunity_snapshot.view.lkml