Do you have date tables in your model? If so, you’ve probably noticed performance issues around them, but may not know how to resolve them.
There are various problems that tend to crop up when people use date tables, but in this article I’ll focus specifically on the intentional (though unnecessary and problematic) fanning out of records in order to count or summarize by date a number of entities that last for a period of time.
Examples of questions that directly follow this pattern would be:
- How many support tickets were open each day?
- How many employees are covering each hour?
- How many registered users do we have each week?
- How much active subscription revenue do we have each month?
- How many orders were outstanding or in-process each day?
Examples of questions that are facilitated by first having the above information would be:
- How many open tickets weren’t updated by the support team each day?
- Which hours have fewer than 2 employees covering them?
- Each week, what is the distribution of number of times (including 0) that users were active?
- What is the average month-over-month growth rate of active subscription revenue?
It’s tempting to take our entities (tickets, users, subscriptions, etc.) and join a date table on the condition that the date falls between the entity’s start and end date (or the current date for open-ended entities). This is tempting because you then get a row for each day each entity was valid, and you can simply aggregate over the date to get your answer.
Here is some example SQL:

```sql
SELECT
    date_table.month,
    COUNT(*) AS active_entities,
    SUM(entity.value) AS active_value
FROM entity
LEFT JOIN date_table
    ON date_table.month >= date_trunc('month', entity.start_date)
   AND date_table.month <= date_trunc('month', entity.end_date)
GROUP BY 1
```
Why it’s problematic:
- O(e*d) join - In most databases (assume yours too, unless you know otherwise), joins on inequalities cause the database to perform a nested loop join, meaning the number of comparisons it must do is e*d, where e and d are the number of rows in your entity and date tables respectively. As your dataset grows, in both entities and dates, this work grows quadratically.
- Broadcasting of rows - If your database is an MPP database, then in order to perform a nested loop join it must broadcast data around the network so that all of the comparisons can be made.
- O(e*d) intermediate result set - Not only does the number of comparisons grow quadratically, but if you have open-ended entities like users, the size of the intermediate result set is also quadratic, which can lead to large materialized tables and hash operations that spill to disk.
- Easy to compound the problem by one-to-many joining activity - It can be tempting to join event or activity rows to the date table as well, multiplicatively increasing the cardinality of the join.
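A rough back-of-the-envelope calculation makes the quadratic growth concrete. This sketch assumes illustrative numbers (one million entities, a ten-year daily date table); the variable names are mine, not from any particular database:

```python
# Rough cost model for the fan-out join (illustrative assumptions:
# 1M entities, a 10-year daily date table).
entities = 1_000_000
days = 10 * 365

# Nested loop join: every entity row is compared against every date row.
comparisons = entities * days

# If entities remain valid for long stretches (open-ended users being the
# worst case), the intermediate result set also approaches e*d rows.
# Assume here that the average entity is active for half the date range.
avg_active_days = days // 2
intermediate_rows = entities * avg_active_days

print(f"{comparisons:,} comparisons, ~{intermediate_rows:,} intermediate rows")
```

Under these assumptions that is 3.65 billion comparisons and nearly 2 billion intermediate rows, well past the point where materialization and hashing spill to disk.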
The performant pattern
The basic idea is to aggregate your entity table twice, once by start period, and once by end period, and then to use an ordered window function to calculate the net aggregate over time.
```sql
--The outer sum is for the running total over months
--The inner sum is to combine the 2 rows for each month for "new" and "closed"
SELECT
    entity_period.month,
    SUM(SUM(entity_period.ct)) OVER
        (ORDER BY entity_period.month ROWS UNBOUNDED PRECEDING) AS active_entities,
    SUM(SUM(entity_period.value)) OVER
        (ORDER BY entity_period.month ROWS UNBOUNDED PRECEDING) AS active_value
FROM (
    -- New entities each month
    SELECT
        date_trunc('month', entity.start_date) AS "month",
        COUNT(*) AS ct,
        SUM(entity.value) AS value
    FROM entity
    GROUP BY 1

    UNION ALL

    -- Closed entities each month
    -- Depending on your business logic, you may want to add a month or period
    -- to the close date, so that entities that are active for any portion
    -- of the month are only removed in the subsequent month.
    -- Also, you could prefer to prorate their value for the current month
    SELECT
        date_trunc('month', entity.end_date) AS "month",
        0 - COUNT(*) AS ct,
        0 - SUM(entity.value) AS value
    FROM entity
    WHERE entity.end_date IS NOT NULL
    GROUP BY 1

    /* Optionally, if you have very sparse data and want
       to make sure all months are accounted for ...
    UNION ALL
    SELECT "month", 0 AS ct, 0 AS value FROM months_table
    */
) AS entity_period
GROUP BY 1
ORDER BY 1
```
The complexity of this query should be O(e + p log p), where e is the number of entities and p is the number of periods. The "log p" term comes from the sort operation, but since you are grouping your activity into a much smaller number of periods, this is typically not an issue.
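To see the start/end aggregation pattern run end to end, here is a minimal sketch using Python's built-in sqlite3 module. The entity table, its columns, and the sample rows are illustrative assumptions, and SQLite's strftime stands in for date_trunc; the query is otherwise the same shape as above, with the per-month net computed in a subquery and the running total applied by the window function:

```python
# Minimal demo of the start/end aggregation pattern (illustrative schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (start_date TEXT, end_date TEXT, value REAL);
INSERT INTO entity VALUES
  ('2023-01-05', '2023-03-10', 10.0),  -- active Jan through Mar
  ('2023-01-20', '2023-01-25', 20.0),  -- opened and closed in Jan
  ('2023-02-01', NULL,          5.0);  -- still open
""")

rows = conn.execute("""
SELECT month,
       SUM(net_ct)    OVER (ORDER BY month ROWS UNBOUNDED PRECEDING) AS active_entities,
       SUM(net_value) OVER (ORDER BY month ROWS UNBOUNDED PRECEDING) AS active_value
FROM (
    SELECT month, SUM(ct) AS net_ct, SUM(value) AS net_value
    FROM (
        -- New entities each month
        SELECT strftime('%Y-%m', start_date) AS month, 1 AS ct, value
        FROM entity
        UNION ALL
        -- Closed entities, removed in the month *after* they close
        SELECT strftime('%Y-%m', end_date, '+1 month') AS month, -1 AS ct, -value
        FROM entity
        WHERE end_date IS NOT NULL
    ) AS starts_and_ends
    GROUP BY month
) AS entity_period
ORDER BY month
""").fetchall()

for month, active_entities, active_value in rows:
    print(month, active_entities, active_value)
```

With the sample rows above this prints 2 active entities worth 30.0 in 2023-01, 2 worth 15.0 in 2023-02, and 1 worth 5.0 in 2023-04. Note that 2023-03 is absent because no entity starts or closes then; the optional months_table union in the query above fills exactly those gaps.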