I’m using a Kimball-lite data schema and have run into an issue with building charts according to how our business users would like to see them.
The issue has been dubbed “the zero problem” internally, and it has to do with reporting on things that didn’t happen. Although this pattern repeats itself across our system, the simplest use case to illustrate the point is our question and answer modeling.
In our app, users can answer multiple choice questions. I’ve modeled it generally like this:
For a given question, we want to show a chart of its answers and how many responses each one got. Crucially, we want to show all answer options even if they got no responses!
For data, let’s assume the only user metadata field is something called is_foodie
, and we have two multiple choice questions:
- What is your favorite food?
a. Salad
b. Pizza
c. Tacos
d. Coq au vin - What is your favorite color?
a. Blue
b. Purple
c. Red
We also have a smattering of responses to these questions.
Data Tables
surrogate_key | name | is_foodie |
---|---|---|
1 | Ian | false |
2 | John | true |
3 | Sally | true |
4 | Peter | false |
5 | Jeff | false |
surrogate_key | question_key | question_body | choice_body |
---|---|---|---|
1 | 1 | What is your favorite food? | Salad |
2 | 1 | What is your favorite food? | Pizza |
3 | 1 | What is your favorite food? | Tacos |
4 | 1 | What is your favorite food? | Coq au vin |
5 | 2 | What is your favorite color? | Blue |
6 | 2 | What is your favorite color? | Purple |
7 | 2 | What is your favorite color? | Red |
surrogate_key | dim_user_key | dim_question_key |
---|---|---|
1 | 1 | 2 |
2 | 2 | 4 |
3 | 3 | 4 |
4 | 4 | 3 |
5 | 5 | 3 |
6 | 1 | 5 |
7 | 2 | 6 |
8 | 4 | 7 |
Setting this up in the straightforward way in Looker will yield queries roughly like this in order to get a count of responses per question, grouped by choice:
SELECT
dim_questions.choice_body
, COUNT(fact_responses.surrogate_key)
FROM fact_responses
JOIN dim_questions
ON fact_responses.dim_question_key = dim_questions.surrogate_key
WHERE dim_questions.question_key = 1
GROUP BY 1
ORDER BY 1
With our data, that yields:
choice_body | count |
---|---|
Coq au vin | 2 |
Pizza | 1 |
Tacos | 2 |
Unfortunately, no one responded that they liked salad so that’s not reported in the chart! This is a nonstarter for our use case. Writing SQL manually I would solve this with a right join.
SELECT
dim_questions.choice_body
, COUNT(fact_responses.surrogate_key)
FROM fact_responses
RIGHT JOIN dim_questions
ON fact_responses.dim_question_key = dim_questions.surrogate_key
WHERE dim_questions.question_key = 1
GROUP BY 1
ORDER BY 1
And that yields what we want:
choice_body | count |
---|---|
Coq au vin | 2 |
Pizza | 1 |
Salad | 0 |
Tacos | 2 |
All good so far, and I can do the above in Looker by modeling my explore with right join.
But now let’s try to slice and dice the data. In this case, let’s look only at what the foodies answered. Simply adding the filter to the chart in the straightforward way yields SQL like this:
SELECT
dim_questions.choice_body
, COUNT(fact_responses.surrogate_key)
FROM fact_responses
JOIN dim_users
ON fact_responses.dim_user_key = dim_users.surrogate_key
RIGHT JOIN dim_questions
ON fact_responses.dim_question_key = dim_questions.surrogate_key
WHERE dim_questions.question_key = 1
AND dim_users.is_foodie
GROUP BY 1
ORDER BY 1
choice_body | count |
---|---|
Coq au vin | 2 |
Uh oh, the zeros are missing again! That’s because the WHERE
filtration occurs after the join, and the values are either null
or false
. I think the only way to solve this is to use the join’s ON
clause for filtration rather than the WHERE
clause (and then to make sure that the right joined dimension is joined last).
SELECT
dim_questions.choice_body
, COUNT(fact_responses.surrogate_key)
FROM fact_responses
JOIN dim_users
ON fact_responses.dim_user_key = dim_users.surrogate_key
AND dim_users.is_foodie
RIGHT JOIN dim_questions
ON fact_responses.dim_question_key = dim_questions.surrogate_key
WHERE dim_questions.question_key = 1
GROUP BY 1
ORDER BY 1
choice_body | count |
---|---|
Coq au vin | 2 |
Pizza | 0 |
Salad | 0 |
Tacos | 0 |
So far I’ve only figured out how to do that in Looker by using parameters, with LookML similar to this:
join: dim_users {
type: inner
sql_on:
${fact_responses.dim_user_key} = ${dim_users.surrogate_key}
{% if dim_users.is_foodie_parameter._in_query %}
and dim_users.is_foodie = {% parameter dim_users.is_foodie_parameter %}
{% endif %}
;;
relationship: many_to_one
}
This works in simple cases, but has a few downsides:
- A parameter must be set up that duplicates the dimension for all possible filters
- The join query must be amended for each possible filter
- Looker’s filter support is much more robust that this supports
The latter point is the one I’m most hung up on. I’d like to expose looks built this way on dashboards with other looks built in the straightforward way, and allow filters to work on all of them simultaneously. I can do this by mapping the dashboard’s filter to the chart’s parameter, but this only works in simple cases, like a single string value.
My main question is:
Is it possible to support this use case and take advantage of Looker’s filtration options?
That includes being able to filter on dates by range (e.g. “last 7 days” on another dimension), including multiple values (e.g. user name in ('Bob', 'Sally')
), etc.
Has anyone else encountered this? What are other solutions? What am I missing?
Thanks!