Standard Deviation and Simple Time Series Outlier ...

Michael_Erasmus · 09-11-2015 07:33 AM

I recently used this technique as a quick and dirty way to find anomalies in time series data and thought I would share.

Let’s say you have simple time series data, event counts grouped by day:

SELECT 
	TO_CHAR(DATE_TRUNC('week', updates.date),'YYYY-MM-DD') AS "updates.date_week",
	COUNT(*) AS "updates.count"
FROM public.updates AS updates

To find big ‘jumps’ in this data, you could use the 3 sigma rule. Basically any observations that fall outside of three standard deviations from the mean is considered an outlier.

mu  = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std  THEN  x is outlier

To model this in a Look, I used table calculations. The first thing we need is the Standard Deviation of the count field. This is calculated by taking the square root of the average of the squared deviations of the values from their mean value.

This is pretty easy to define as a table calculation:

round(sqrt(mean(power(${updates.count} - mean(${updates.count}), 2))),2)

Once we have that, we can pretty easily define our 3-sigma rule:

if(abs(${updates.count} - mean(${updates.count})) >  ${standard_deviation} * 3, "Yes", "No")

And that’s all you need to detect outliers!

Zam

Sweet pattern, Michael! Wanted to touch base and let you know about a few other features that might be simplifications/additions to this pattern.

We’ve built out the ability to hide table calculated no values from a visualization, i.e. you could hide your outliers.
There are stddev_pop and stddev_samp functions in table calculations.

Sai_Satish

The data points you consider for the Mean , STD calculations matters… I feel baseline average should be calculated by averaging the hourly rollups (Example - 9:00 AM to 10:00 AM hourly rollups ) for the three Mondays within the three-week period.

Similarly, The standard deviation is calculated from the baseline average for rollups

The standard deviation provides a statistical indicator of how much variability exists in the population data that factored into the baseline average calculations.

•For hourly reporting, the standard deviation is calculated for the polled values.

•For daily reporting, the standard deviation is calculated for hourly averages.

•For weekly reporting and beyond, the standard deviation is calculated for the daily averages.

minosdz

Hi, thank you for your answer @Sai_Satish.
I’d like to compare each day to the previous week’s similar days of the week(Monday to the last x Mondays, Friday to the last X Fridays…etc). Let’s say I have multiple users each one has a number of transactions per day during the last 10 weeks period. Then, I want to compare this Monday transactions_count to the prior 10 Mondays data by averaging the number of transaction-during the whole past 10 Mondays(I used std_dev and mean too to detect outliers).

Now, if I understood you correctly, I have to compare my actual Monday data to the mean(hourly)data of the prior Mondays.!

Do you mind if you clarify a little bit?

Standard Deviation and Simple Time Series Outlier Detection Using Table Calculations