Question

Modeling Degenerate Dimensions (Like Dates) Across Multiple Tables

  • 31 January 2017
  • 10 replies
  • 892 views

Userlevel 1

I have a scenario I’m working through that looks something like this:


I have 2 tables (in practice way more than 2 tables) that entail GA site analytics that are both joined with a content items dimensions table and each with a date dimension. One fact table, with aggregated metrics, looks like this:


cms_id | date | metric | value
123abc | yesterday | 'pageviews' | 10000

and another fact table, which deals with traffic source metrics specifically, looks like this:


cms_id | date | source | medium | campaign | pageviews
123abc | yesterday | twitter | socialmarketing | SF | 1200

Treating cms_id as a consolidated dimension works great because I can specify that both of these tables join with my content items table on that cms_id, and that’s fine. But I don’t have a corresponding dimension table for dates! So right now, in my explore there are 2 date dimensions and you need to know coming in which dimension to use depending on which table you’re working with!


What’s the right way to model this, so that I can select date in one spot in my explore and have it do the right thing on both of these tables, without creating a dates dimension table?


10 replies

Userlevel 3
Badge

Hey @jfhbrook! I think I understand, but just to clarify: Are you trying to get the dates in both tables to work as a single unit? It sounds like maybe you’re looking to ensure that when someone filters on one of these dates, it will apply to both.


We should be able to do this using templated filters, but let me know if I’ve missed the mark here.

Userlevel 1

That’s roughly correct! Though, I don’t want to have to pick an arbitrary table as having the “canonical” date, since I’m not always querying that table, so I’m not sure a templated filter is what I want either. We have a trial of this in one of our in-development explores but it means that random_ass_table.date is being used on the filter everywhere, even when random_ass_table is otherwise irrelevant to the query.

What about using an alias for the random_ass_table.date dimension so that it blends in with the other ones, then hiding the other dates? To a user using the explore it will just look like a normal date even if it’s coming from some other table.

Userlevel 6
Badge

You could make a third derived table that just contains cms_id and date and then join each of the metric tables to it. Then hide cms_id and date in both the joineed tables.


explore: stats {
join: table1 {
sql_on: ${stats.cms_id} = ${table1.cms_id} and ${stats.event_date} = ${table1.date}
relationship: one_to_many
}
join: table2 {
sql_on: ${stats.cms_id} = ${table2.cms_id} and ${stats.event_date} = ${table2.date}
relationship: one_to_many
}
}


view: stats {
derived_table:
sql: SELECT DISTINCT cms_id, date FROM <one of the tables> ;;
persist_for: "24 hours"
}
dimension_group: event {
...
}
dimension: cms_id {}
}

view: table1 {
dimension: date {hidden: yes}
dimension: cms_id {hidden: yes}
...
}
Userlevel 1

I decided to try an approach where I try generating a dimension table for the date, which is pretty cheap considering I only need a few years worth of dates with no metadata:


view: dates_dimension {
derived_table: {
sql: SELECT now()::date - generate_series(0, 365 * 3 - 1) AS date ;;
}

dimension: primary_key {
primary_key: yes
hidden: yes
type: string
sql: ${TABLE}.date ;;
}

dimension_group: date {
label: "Master Date"
type: time
timeframes: [date, week, month]
convert_tz: no
sql: ${TABLE}.date ;;
}
}

That part is mostly fine. So now I want to join it across multiple tables. The current structure of these tables is something like,


  join: properties {
type: left_outer
relationship: one_to_many
sql_on: ${Articles.id} = ${properties.business_id} ;;
}

join: cms_lookup {
type: left_outer
relationship: one_to_many
sql_on: ${properties.id} = ${cms_lookup.property_id} ;;
}

join: content_items {
type: left_outer
relationship: one_to_many
sql_on: ${cms_lookup.platform_account_id} = ${content_items.platform_account_id} ;;
}

...

join: ga_content_metrics_daily {
type: left_outer
relationship: one_to_many
sql_on: ${content_items.platform_item_id} = ${ga_content_metrics_daily.platform_item_id} ;;
}

join: ga_traffic_sources_daily {
type: left_outer
relationship: one_to_many
sql_on: ${content_items.platform_item_id} = ${ga_traffic_sources_daily.platform_item_id} ;;
}

and I want to add the date field. So if I only cared about one of these tables I could do:


  join: dates_dimension {
relationship: one_to_one
sql_on: ${dates_dimension.date_date} = ${ga_content_metrics_daily.date_date} ;;
}

but I need to specify this join for multiple tables! “why don’t you just use from,” I can hear you saying–I tried that, but then the dimension again shows up as a separate field for each thing I’m joining it to, defeating the purpose!


Maybe sql_on needs ANDs in the on clause?


EDIT: I thought about just generating the product of the cms ids and the dates, but in a world with 50k content items and 1000 dates, that ends up with a table of 50 million rows and I don’t want that


EDIT 2: I ended up implementing the cartesian-product-of-ids-and-dates idea, which from a query standpoint works really well. Time will tell if this is sustainable from a data size standpoint.

Hey @jfhbrook, Glad that taking the cartesian-product-of-ids-and-dates idea is working for you. This is great from a query standpoint, but I totally understand your concern with this method from a data size standpoint.


If this becomes a problem in the future, you could go back to using the generating_series function. If you’re using Redshift, please note that this function is deprecated and could be completely withdrawn at some point. By generating the series, and using the combination of the from and fields parameters, we could choose which fields we want to include in the join. This would still allow for all of the dates to be joined to every view, while limiting which fields are shown in the explore so as to avoid duplication.


If you decide to go this route, the best practice would be to create a set that contains all of the fields that you want to bring through. This could then be applied to each joins’ fields parameter, only bringing through the fields in the set. The explore would only show the fields in the set as well.

Userlevel 6
Badge

I re-read your post and maybe reorganizing as a funnel would make your life easier.



Userlevel 1

An update for those of you following along: This approach worked for us for a while, but is starting to cause some pretty serious performance issues. We’re in the process of just eating the added complexity of multiple date fields.

Userlevel 1

I’ve been working with @jfhbrook, and we’ve come up with a solution that appears to be efficient and avoids unnecessary joins.


First, we define a view that contains a single date dimension (this is how we did it in Redshift, SQL will vary depending on db type):


view: measurement_date {
view_label: " Measurement Date"
derived_table: {
sql:
SELECT getdate()::date - row_number() over (order by true) + 1 AS date
FROM content_items LIMIT (365 * 2) ;;
sql_trigger_value: SELECT TRUNC(GETDATE()) ;;
distribution_style: all
sortkeys: [ "date" ]
}

dimension: date {
primary_key: yes
hidden: yes
type: date
}

dimension_group: date {
type: time
label: "Measurement"
description: "Master date range dimension. Use this to break out or filter measures (followers, etc) by the date they occured."
timeframes: [date, week, month]
convert_tz: no
}

}

Then, we joined this view into our explore like this:


  join: cms_dates_lookup {
from: measurement_date
type: left_outer
relationship: one_to_many
sql_on: 1 = 1 ;;
}

Then, we used liquid variables to conditionally join metrics to this view if any of the measurement date fields were selected in the query:


  join: ga_api_content_metrics_daily {
type: left_outer
relationship: one_to_many
sql_on: ${content_items.platform_item_id} = ${ga_api_content_metrics_daily.cms_item_id}
{% if
cms_dates_lookup.date_month._in_query or
cms_dates_lookup.date_date._in_query or
cms_dates_lookup.date_week._in_query
%}
AND ${ga_api_content_metrics_daily.date} = cms_dates_lookup.date
{% endif %} ;;
}

Note that the cms_dates_lookup.date value is raw SQL, we didn’t use a variable like ${cms_dates_lookup.date_date}. That’s because any views that you mention via variables will be be joined in your query, even if the conditional expression evaluates to false.

But if I have 3 tables that would only be related to the date field, what would be the solution.

Reply