Stats table calcs: comparing rates over time (Bonus: anomaly detection!)

If you haven't already read my post on using the beta distribution for an A/B test, that might be a good warm-up read. (The disclaimers from that post still apply 😉)

In that example, the data took the form of a proportion: a number of successes out of a number of trials. What do you do if you instead want to compare rates of events over intervals, such as user signups, error events, or web traffic over time?

Well, in the A/B testing example, the data points can be thought of as being drawn from a Bernoulli distribution: either yes or no with a given probability. That probability is in turn modeled by the Bernoulli distribution's conjugate prior, the beta distribution, so we used the beta_inv table calc.

In the present scenario, the data points can be thought of as being drawn from a Poisson distribution: its possible values are non-negative whole numbers representing the number of events happening in an interval, assuming those events occur independently at a given rate. That rate, which is what we are actually interested in quantifying, can be thought of as being drawn from the Poisson's conjugate prior: a gamma distribution.
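(Concretely, the conjugate update works like this: if the rate has a Gamma(α, β) prior, with shape α and rate β, then after observing k events over t unit intervals the posterior is Gamma(α + k, β + t). The tiny 0.001 terms in the formulas below amount to a near-flat Gamma(0.001, 0.001) prior.)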

For the sake of illustration, let's say you had data like this:

Month   | Days    | Errors
------------------------------------------
Jan     | 31      | 12
Feb     | 28      | 13
Mar     | 31      | 19

Is the rate of errors in March higher than in February? On the surface it looks that way, since the observed rate was 19/31 = 0.61 events per day, compared to the observed rate of 13/28 = 0.46 events per day in February.

But with any observation there is variability, and we are interested in the "true" underlying rate that generated this data, more so than the observed rate in our sample. The conjugate prior distribution lets us quantify exactly that: given a set of observations, what the true underlying rate may be.

With our example, when you use the conjugate prior (gamma distribution), you see there is quite a bit of range, including lots of overlap between what the rate of errors may be for March and for February:
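If you want to sanity-check that overlap outside of the table calcs, here is a minimal Python sketch (my own illustration, not part of the Look; it assumes the same near-flat Gamma(0.001, 0.001) prior as the formulas below):

import numpy as np
from scipy import stats

# Posterior for each month: shape = events + 0.001, rate = days + 0.001
# (scipy's gamma takes a shape `a` and a scale = 1 / rate)
feb = stats.gamma(a=13 + 0.001, scale=1 / (28 + 0.001))
mar = stats.gamma(a=19 + 0.001, scale=1 / (31 + 0.001))

# Monte Carlo estimate of P(March's true rate > February's true rate)
rng = np.random.default_rng(0)
p = (mar.rvs(100_000, random_state=rng) > feb.rvs(100_000, random_state=rng)).mean()
print(round(p, 2))  # comes out around 0.78: probably higher, but far from conclusive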

Just like in the A/B testing example, you could use the inverse distribution function to come up with credibility intervals and then visualize them using the timeline viz:

The upper/lower bound formulas for 90% credibility intervals would be

gamma_inv(
    0.5 +/- 0.45,
    ${...events} + 0.001, 
    1/( ${...days} + 0.001)
)
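
In Python terms (a sketch for anyone following along outside of Looker, assuming scipy's shape/scale parameterization, which lines up with gamma_inv's arguments above):

from scipy import stats

def credible_interval(events, days, width=0.90):
    # Mirrors the gamma_inv table calc: shape = events + 0.001, scale = 1 / (days + 0.001)
    dist = stats.gamma(a=events + 0.001, scale=1 / (days + 0.001))
    tail = (1 - width) / 2  # 0.05, i.e. the 0.5 - 0.45 and 0.5 + 0.45 above
    return dist.ppf(tail), dist.ppf(1 - tail)

print(credible_interval(13, 28))  # February: roughly (0.27, 0.69) events/day
print(credible_interval(19, 31))  # March:    roughly (0.40, 0.86) events/day

Note how much the two intervals overlap.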

Bonus: Anomaly Detection

Let's try to detect when there is a drop in a rate of events. We'll continue with the gamma distribution, but these concepts can just as easily be applied to detect a change in a proportion, with the beta distribution.

We'll make a Look that is designed to be run every day. We'll have it pull some trailing activity by day, for example 60 days' worth. Then we'll add some formulas to detect when there is a significant drop:


(In this example, the threshold happens to be on a day with 0 events, but it is highlighting a difference in data before and after the line, not simply highlighting that day for being 0)

A practical consideration is choosing the windows of time to compare. There are many ways to do this, with different levels of finesse, such as dynamically choosing baseline periods, normalizing for weekly or seasonal trends, or pushing the window functions into SQL instead of table calcs. For demonstrative purposes, I'll simply use table calcs to consider windows of increasing size looking back from the current day, and compare those windows to similarly sized prior windows. Then I'll highlight in red the first day (if any) with a significant, sizable drop. Here are the specific formulas I used (a Python sketch of the same logic follows the list):

  • N Periods: row()
  • 2 N rows available?: row() <= count(${accidents.event_date}) / 2
    ^ We will use this to "Hide No's" since we need half of the rows for comparison's sake only
  • Trailing N Total: running_total(${accidents.count})
  • Prior N Total: sum(offset_list(${accidents.count},1,row()))
  • Trailing N Upper Bound: gamma_inv(0.95, ${trailing_n_total} + 0.001, 1 / (${n_periods} + 0.001))
  • Prior N Lower Bound: gamma_inv(0.05, ${prior_n_total} + 0.001, 1 / (${n_periods} + 0.001))
  • Is practical difference?: ${trailing_n_upper_bound} < ${prior_n_lower_bound} * 0.667
    ^ Normally you want to specify a ratio of "practical equivalence"; here, a drop of more than one third
  • As 0/1: if( ${is_practical_difference}, 1, 0)
  • Difference running total: running_total(${as_01})
  • Most Recent Deviation: if( ${difference_running_total}=1 AND offset(${difference_running_total},-1)=0, max(${accidents.count}), null)
    ^ This will be used to highlight the deviation on the chart
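
If it helps to see the whole loop in one place, here is a minimal Python sketch of the same logic (my own port, not the Look itself; the 0.667 threshold and 0.001 prior terms come straight from the formulas above):

from scipy import stats

def first_deviation(counts):
    # counts: daily event counts sorted DESCENDING by date, so counts[0] is today.
    # Returns the first window size N showing a significant, sizable (> 1/3) drop
    # versus the same-sized prior window, or None if there is no such window.
    for n in range(1, len(counts) // 2 + 1):      # "2 N rows available?"
        trailing_total = sum(counts[:n])          # Trailing N Total
        prior_total = sum(counts[n:2 * n])        # Prior N Total
        trailing_upper = stats.gamma.ppf(0.95, trailing_total + 0.001, scale=1 / (n + 0.001))
        prior_lower = stats.gamma.ppf(0.05, prior_total + 0.001, scale=1 / (n + 0.001))
        if trailing_upper < prior_lower * 0.667:  # "Is practical difference?"
            return n
    return None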

Setting that last measure to a column allows us to get a "highlight":
[screenshot: chart with the most recent deviation highlighted in red]

In conclusion, we can say (with high probability) that to the right of this line, the true underlying rate of events has dropped by at least a third, as compared to the same-sized period immediately to the left of the line. And because we're using Bayesian statistics, these formulas are robust against small datasets, meaning no manual exception handling or awkward sample-size declarations up front.


@fabio1 – very cool, thanks. How are you producing those overlaid distributions in the events per day view?

Hi @AbhiSivasailam!

I'll admit I got a bit creative there 🙂 I just made a result set with 100 numbers as the dimension, used a table calc to translate them over the domain I wanted to plot (min_number + ${num_from_0_to_100}/100 * (max_number - min_number)), and then used the gamma_dist table calc (cumulative=no) to get the PDF value:
[screenshot: table calc setup for the plotted PDF values]
I just hardcoded three measures for my visual, but I think you should be able to do this with pivots to dynamically pass in different alpha and beta values for each of N series.
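
In Python terms, the trick is roughly this (a sketch; the domain endpoints and the shape/scale values are placeholders for whatever posteriors you want to plot):

import numpy as np
from scipy import stats

# The numbers 0-100 as the dimension, translated over the domain of interest
min_number, max_number = 0.0, 1.2  # placeholder domain
x = min_number + np.arange(101) / 100 * (max_number - min_number)

# PDF value at each point (the cumulative=no case of the gamma_dist table calc)
feb_pdf = stats.gamma.pdf(x, 13 + 0.001, scale=1 / (28 + 0.001))
mar_pdf = stats.gamma.pdf(x, 19 + 0.001, scale=1 / (31 + 0.001))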

This would be useful for my use case - I have a simpler model using the rolling standard deviation and rolling mean to detect anomalies (I guess it would be like using a normal distribution for the underlying process but honestly I just chose it as it was simpler and seemed to work well enough).

You can now condition schedules to be sent or not based on table calculations (using the "Hide No's" feature on the calc's column header), so you can now do the alerting described in the bonus section!

@fabio1 I am not sure how your prior_n_total works, but how can it be greater than the trailing_n_total?

For the sake of illustration, let's say today is Sunday, and my data contains 6 days from the previous week, descending, so row 1 is Saturday, 2 is Friday, … 6 is last Monday.

On row 1, Saturday, I want to compare the interval from start (1) to current (1) against a similarly sized interval immediately preceding it (2 to 2). In other words, Saturday vs Friday. Trailing_n is the first period. Prior_n is the second period.

To continue the example, let's say there was no significant difference, so we continue comparing larger and larger intervals to look for significance via more data.

So on row 2, Friday, I want to compare the interval from start (1) to current (2) against a similarly sized interval immediately preceding it (3 to 4). In other words, Saturday + Friday vs Thursday + Wednesday.

To achieve this, trailing_n uses running_total, which totals from the first row to the current row. And prior_n uses offset_list to grab a set of data relative to the current row, starting at offset 1 (1 row after the current row) and spanning row() rows (i.e., as many rows after the current row as the current row number).

If the calculations donโ€™t seem to be working, let me know!

I don't follow. If the running total is cumulative on day n, wouldn't prior then be the previous cumulative total on day n-1?

If you can illustrate with the numbers in your table, rather than abstractly, it would help to see what exactly you're referring to.

Thanks.

Yes, it would. The key is that this table is sorted descending by date, so "-1 day" is "+1 row". Perhaps there is another sorting + formula combination that makes more intuitive sense, but this one seemed to work well for me.

Looking at row 1 vs row 2, with counts 2 vs 1, you can see Trailing N is 2 and Prior N is 1.
Looking at rows 1&2 vs rows 3&4, with counts 2&1 vs 3&3, you can see Trailing N is 3 (2+1) and Prior N is 6 (3+3).
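
The same arithmetic in a few lines of Python, using those counts (my illustration, not the table calcs themselves):

counts = [2, 1, 3, 3]  # rows 1-4, sorted descending by date

for n in (1, 2):
    trailing = sum(counts[:n])    # running_total(count) at row n
    prior = sum(counts[n:2 * n])  # offset_list(count, 1, row()) at row n, summed
    print(n, trailing, prior)     # prints "1 2 1", then "2 3 6"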

@fabio1 could you explain the steps for getting this result in more detail? My results aren't getting binned.

It might help to see the results you're getting. If you want to post them here (like the screenshot I posted), that works; or, if you prefer to keep it private and you have access to chat support, they also have lots of experience with this type of thing and can surely help work through it.

@fabio1 how were you able to create your gamma distribution chart? (attached) I'm trying to plot the probability density function of a normal distribution I have, but I can't seem to figure out how. I have my data, the mean, and the standard deviation. Any advice?

Hi @mcharboneau! Here's how I did that:
