Do you have date tables in your model? If so, you’ve probably noticed performance issues around them, but may not know how to resolve them.
There are various problems that tend to crop up when people use date tables, but in this article I’ll focus specifically on the intentional (though unnecessary and problematic) fanning out of records in order to count or summarize by date a number of entities that last for a period of time.
Examples of questions that directly follow this pattern would be:
Examples of questions that are facilitated by first having the above information would be:
It’s tempting take our entities (tickets, users, subscriptions, etc.) and join a date table on the condition that the date is between the entity’s start and end-date (or the current date for open-ended entities). This is tempting because then for each entity, you have a row for each day the entity was valid, and then you can just aggregate over the date to get your answer.
Here is some example SQL:
SELECT
date_table.month,
count(*) as active_entities,
SUM(value) as active_value
FROM entities
LEFT JOIN date_table
ON date_table.month >= date_trunc('month', entity.start_date)
AND date_table.month <= date_trunc('month', entity.end_date)
Why it’s problematic:
The basic idea is to aggregate your entity table twice, once by start period, and once by end period, and then to use an ordered window function to calculate the net aggregate over time.
SELECT
entity_period.month,
--The outer sum is for the running total over months
--The inner sum is to combine the 2 rows for each month for "new" and "closed"
SUM(SUM(entity_period.ct)) OVER
(ORDER BY entity_period.month ROWS UNBOUNDED PRECEDING) as active_entities,
SUM(SUM(entity_period.value)) OVER
(ORDER BY entity_period.month ROWS UNBOUNDED PRECEDING) as active_value
FROM(
-- New entities each month
SELECT
date_trunc('month',entity.start_date) as "month",
COUNT(*) as ct,
SUM(entity.value) as value
FROM entity
GROUP BY 1
UNION ALL
-- Closed entities each month
SELECT
-- Depending on your business logic, you may want to add a month or period
-- to the close date, so that entities that are active for any portion
-- of the month are only removed in the subsequent month.
-- Also, you could prefer to prorate their value for the current month
date_trunc('month',entity.end_date) as "month",
0 - COUNT(*) as ct,
0 - SUM(entity.value) as value
FROM entity
GROUP BY 1
/* Optionally, if you have very sparse data and want
to make sure all months are accounted for ...
UNION ALL
SELECT "month", 0 as ct, 0 as value FROM months_table
*/
) as entity_period
GROUP BY 1
ORDER BY 1
The complexity of this query should be O(e + p log p) where e is the number of entities, and p is the number of periods. The “log p” is due to the sort operation, but since you are grouping your activity to a smaller number of periods, this is typically not an issue.