Practical and robust anomaly detection in time series (blog.twitter.com)
107 points by dsr12 on Jan 6, 2015 | 18 comments



Another good package to look at is Etsy's Skyline - https://github.com/etsy/skyline

Their introduction of the Kale stack (which includes Skyline) is a great read - https://codeascraft.com/2013/06/11/introducing-kale/

I spent a month or so evaluating anomaly detection systems, and I can tell you a few things the Twitter post fails to mention:

1. You can get a long way with an ensemble of simple techniques. And it's always better than any single technique.

I wouldn't recommend trying to install Skyline, but re-implementing their ensemble of anomaly classifiers might take you a day or two, and it will get you 90% of the way there (a sketch of the voting idea is at the end of this comment).

2. The false positive rate is probably your most important metric.

Detecting anomalies is good, but your ops team already has plenty of alerts to deal with. If you throw false positives at them from a new system, they will hate you and ignore the new system. Most papers report ROC curves as the classification metric; that can be OK too.

3. Don't build something complex when a threshold will do.

If a point anomaly is obvious to a human, you absolutely should not build a complex system to detect it; just use thresholds. It's only when you want to detect anomalies before they cross a threshold that you should start on this kind of task. That leads me to

4. Almost all anomalies have a temporal component.

If your detector isn't ultimately looking at multiple sources of data and finding patterns that initially look like normal behavior (or odd behavior over time, like a change in frequency), then it's not adding as much value as it could. Slow trends, increased predictability, absent spikes that are still within threshold: those are the kinds of anomalies your simple systems will miss, and detecting them adds a lot of value.

Ultimately, anything that makes ops' life easier and alerts them sooner to real problems is good. But in anomaly detection it is easy to fool yourself into thinking you need something complex to start out with, and then, once you've built that complex thing, into thinking it is "working" because it finds 95% of the obvious outliers.
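
As a rough illustration of the ensemble-vote idea in point 1 (a minimal sketch in Python, not Skyline's actual classifiers; the detectors, thresholds, and vote count are illustrative choices):

  import numpy as np

  def zscore_tail(series, threshold=3.0):
      # Flag the latest point if it sits far from the mean in std-dev units.
      mean, std = np.mean(series), np.std(series)
      return std > 0 and abs(series[-1] - mean) / std > threshold

  def mad_tail(series, threshold=6.0):
      # Median absolute deviation: more robust to existing outliers than the z-score.
      median = np.median(series)
      mad = np.median(np.abs(series - median))
      return mad > 0 and abs(series[-1] - median) / mad > threshold

  def moving_window_tail(series, window=30, threshold=3.0):
      # Compare the latest point against a trailing window only.
      tail = series[-window:]
      mean, std = np.mean(tail), np.std(tail)
      return std > 0 and abs(series[-1] - mean) / std > threshold

  def is_anomalous(series, min_votes=2):
      # Ensemble vote: anomalous only if several simple detectors agree.
      x = np.asarray(series, dtype=float)
      votes = sum(bool(d(x)) for d in (zscore_tail, mad_tail, moving_window_tail))
      return votes >= min_votes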


Why would you not recommend installing Skyline? I would like to have something that watches metrics for me, and we're already using Graphite for stats.


Getting the full package working was (several months ago) a painful process. It was much easier to just grab the algos.


One of my responsibilities is to do post-outage reporting to estimate how much money we lost. Right now I use Holt-Winters to give a forecast starting just before the outage began. I then estimate the loss to be the difference between the forecast and the data points that fall outside of the forecast's confidence interval.

Is this a statistically valid method? I chose Holt-Winters because I'm inexperienced and it's appealingly intuitive. Should I be looking at anomaly detection methods instead? Would they be able to tell me what "normal" would have been for the duration of the anomaly?
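
For reference, my current procedure looks roughly like this (a sketch, assuming hourly data with a 24-hour season; statsmodels' classic Holt-Winters doesn't expose prediction intervals directly, so I approximate a band from the in-sample residuals):

  import numpy as np
  from statsmodels.tsa.holtwinters import ExponentialSmoothing

  def estimate_outage_loss(history, outage_actuals, seasonal_periods=24):
      # Fit Holt-Winters on pre-outage data only.
      history = np.asarray(history, dtype=float)
      outage_actuals = np.asarray(outage_actuals, dtype=float)
      fit = ExponentialSmoothing(
          history, trend="add", seasonal="add",
          seasonal_periods=seasonal_periods,
      ).fit()

      # Forecast over the outage window.
      forecast = fit.forecast(len(outage_actuals))

      # Approximate a confidence band from in-sample residuals.
      resid_std = np.std(history - fit.fittedvalues)
      lower = forecast - 1.96 * resid_std

      # Loss = shortfall, counted only where actuals fall below the band.
      shortfall = forecast - outage_actuals
      return shortfall[outage_actuals < lower].sum()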


I don't know if it's statistically valid, but I've used methods based on this to do the kind of calculations you're talking about, in a role similar to the one you described.

There are a few more subtleties that may be important for you to take into account, primarily:

1) You need to sample fairly extensively before/after the outage to calibrate more accurately against Holt-Winters (the seasonal projection should capture the trend, but the actual numbers are probably running at some slight or significant rate above/below the projections).

2) When running those samples, it's important to sample data where you believe the data points are definitely not impacted by the outage. This is often quite challenging, since outages sometimes span low/peak traffic periods or ramp-up/down periods.

3) Finally, it can be hard to pinpoint the actual start/end of the event (i.e. to identify the time samples you want to include in your measurement of the outage cost). The end is particularly tricky, since there's often pressure from queued operations (by software, or by users who are itching to complete what they were trying to do) that makes your samples fluctuate. That backfill pressure can be substantial and is important not to ignore when measuring the actual cost of the issue. Say you're a retail site: if you have a 15-minute period with a 50% order drop, but in the first 5 minutes after service is restored the order rate runs 50% above projections, do you count that as 15 minutes of 50% order drop, or 10 minutes of 50% order drop? Both are legitimate, but it's important to know which metric you're measuring yourself against so you're as correct/honest as you can be.


Thanks for the reply!

1. This is a good point. I haven't incorporated sampling after the outage into the analysis, but that should be a good qualitative measure of the accuracy of the forecast.

2. I typically have good data from server logs of when an outage started. Outages during low volume periods are quite difficult to analyze though. I usually revert to just comparing the outage volume to the average volume for the whole outage period.

3. The end is typically more difficult to determine, as there's often a period of instability while servers are restarted sporadically, followed by a "recovery" driven by the backfill pressure you mentioned. My solution is to count any samples above the forecast's confidence interval as "recovery" and to subtract the total recovery from the loss estimate.
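
In code, that netting-out step looks roughly like this (a sketch; the forecast, actuals, and residual standard deviation come from the same Holt-Winters fit as above):

  import numpy as np

  def net_outage_loss(forecast, actuals, resid_std, z=1.96):
      # Shortfall below the lower band, minus the backfill "recovery"
      # counted above the upper band.
      forecast = np.asarray(forecast, dtype=float)
      actuals = np.asarray(actuals, dtype=float)
      lower, upper = forecast - z * resid_std, forecast + z * resid_std
      loss = (forecast - actuals)[actuals < lower].sum()
      recovery = (actuals - forecast)[actuals > upper].sum()
      return loss - recovery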


H-W is not enough. I don't know what is though. http://forecasters.org/submissions09/LawtonRichardISF2009.pd...


Do you find it useful / helpful for the engineering org to do this in a rigorous fashion? Seems like it could be misused very easily. Just curious.


That's hard to say. I work in a large (>10,000) person organization, so the loss estimates usually move up the management onion to the point where I have no idea what happens to them.

I know that rigorous estimates are useful to convince management that spending money on reliability is a net-positive investment. For example: "migrating these servers will cost $10,000, but they went down for 30 minutes yesterday and we lost $1M."

I do worry that they'll be used to publicly shame people when they make mistakes that bring down systems, but I'm confident that my management will be responsible and mature enough not to do that.


This looks like an interesting counterpart to Etsy's Skyline project[1], which was built to read from incoming Graphite data streams. The two do seem like they could be complementary.

[1] https://github.com/etsy/skyline/wiki


Does anyone have a clue what "ESD" stands for in this context? The article is too buzzword-heavy to be very meaningful, even to a practitioner (this is surprisingly common in data analysis, where there seems to be a culture of naming things in the most opaque way possible.)


The abstract in reference 3 says it is the "extreme studentized deviate".


Thanks. I searched all over the place but was guessing the "D" stood for "disaggregation".

In general, things that depend on (approximate) normality, which ESD apparently does (unsurprisingly, given the name), don't count as "robust" in my personal lexicon. There is relatively little reason in this day and age to continue using parametric tests of any kind, particularly for data that are almost certain to contain significant non-normal components.


Do you have any suggestions for a more robust method for anomaly detection in seasonal time series data?


Every case is different (as the article says, generalizing these things is not easy), but the basic approach always involves three things:

1) Build a model of the expected time series. There is no way of avoiding this. To find an anomaly you must define "that which is expected", either in terms of the actual data, differences, or moments.

2) Measure the distribution around the expected values based on past data.

3) Apply some test that answers some version of the question, "What is the plausibility of the belief that the new data are drawn from the combination of model plus distribution?" The trick here is that you aren't interested in all anomalies, just "significant" ones, which may have different temporal behaviour, etc.

The important thing is to test relative to the distribution you actually have, which is never going to be particularly normal, especially in the wings, and the wings are what you really care about when trying to be maximally sensitive to real anomalies. Normal distributions almost always underestimate the tails, which makes them prone to false triggers -- which, as another poster here has pointed out, you really do not want.

Robustness against "unknown unknowns" in the anomaly distribution is one thing you want to be particularly careful about. Weird things happen all the time in real data, and you are generally looking for anomalies of a particular kind, with particular characteristics. The ideal anomaly detector will catch those without going off on every odd thing that happens.
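
A minimal sketch of steps 2 and 3 with no normality assumption, using the empirical quantiles of past residuals (the residual definition and the false alarm rate here are illustrative choices):

  import numpy as np

  def fit_residual_bounds(past_values, past_expected, false_alarm=0.001):
      # Learn two-sided bounds from the empirical distribution of past
      # residuals (actual minus expected); no normality assumed.
      resid = np.asarray(past_values, float) - np.asarray(past_expected, float)
      return (np.quantile(resid, false_alarm / 2),
              np.quantile(resid, 1 - false_alarm / 2))

  def residual_is_anomalous(value, expected, bounds):
      # Flag a new observation whose residual falls outside the bounds
      # learned from history.
      lower, upper = bounds
      return not (lower <= value - expected <= upper)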


Don't underestimate how far you can get with simple Holt-Winters exponential smoothing to generate a forecast, then noting deviations from that forecast. With the Taylor (Taylor 2010) modifications you can handle up to three different types of seasonality.
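
A minimal sketch of that forecast-and-flag-deviations loop (assuming hourly data with a single 24-hour season; statsmodels' Holt-Winters handles one seasonal period, so the Taylor double/triple-seasonal variants would need a multi-seasonal implementation):

  import numpy as np
  from statsmodels.tsa.holtwinters import ExponentialSmoothing

  def flag_deviations(series, seasonal_periods=24, z=4.0):
      # Fit Holt-Winters, then flag points whose residual is far outside
      # the typical residual scale (robust MAD estimate of sigma).
      y = np.asarray(series, dtype=float)
      fit = ExponentialSmoothing(
          y, trend="add", seasonal="add", seasonal_periods=seasonal_periods
      ).fit()
      resid = y - fit.fittedvalues
      scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
      return np.abs(resid) > z * scale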


Observation (1).

The data they are looking at is essentially a univariate stochastic point process, that is, an arrival process. The most important special case is the Poisson process. There the times between arrivals are independent, identically distributed random variables with exponential distribution with parameter the arrival rate. The number of arrivals in an interval is random with Poisson distribution (compare with the terms of the Taylor series for exp(x)).

See early in

E. Cinlar, 'Introduction to Stochastic Processes'.

There, the Poisson process gets a 'qualitative, axiomatic' definition -- an arrival process with stationary, independent increments. A cute derivation from this purely qualitative description yields the full details of the Poisson process.

One point in favor of this qualitative approach is that, in practice, the assumptions are commonly obvious just from intuition.

Another solid approach to a Poisson process is the renewal theorem; there is a careful treatment in W. Feller's now classic volume II.

The theorem says that under mild assumptions a sum of independent renewal processes converges to a Poisson process. Arrivals at Twitter look like a nearly perfect example.

So, basically without any anomalies the Twitter data is a Poisson process.
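
A toy simulation of the superposition result (the gamma inter-arrival times, event counts, and number of sources are arbitrary illustrative choices):

  import numpy as np

  rng = np.random.default_rng(0)

  def renewal_arrivals(n_events, shape=3.0, scale=1.0):
      # One source's arrival times: a renewal process with gamma
      # inter-arrival times (decidedly non-exponential).
      return np.cumsum(rng.gamma(shape, scale, size=n_events))

  # Superpose many independent sources and look at the merged stream.
  merged = np.sort(np.concatenate([renewal_arrivals(50) for _ in range(2000)]))
  gaps = np.diff(merged)

  # For an exponential distribution the mean equals the standard
  # deviation; the merged inter-arrival times should come close to that.
  print(gaps.mean(), gaps.std())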

Observation (2).

In part, the anomaly detector is based on the extreme Studentized deviate (ESD) statistical hypothesis test as in

http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3...

but this test has a Gaussian assumption, not a Poisson assumption. So, there should be some mention of how the Gaussian assumption is justified.
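
For reference, a sketch of the plain generalized ESD test from the NIST handbook linked above (not the OP's full method, which layers a seasonal treatment on top of it):

  import numpy as np
  from scipy import stats

  def generalized_esd(x, max_outliers, alpha=0.05):
      # Generalized ESD (Rosner 1983): returns indices of detected
      # outliers. Note the normality assumption on the non-outliers.
      x = np.asarray(x, dtype=float)
      n = len(x)
      mask = np.ones(n, dtype=bool)
      candidates, r_stats, lambdas = [], [], []

      for i in range(1, max_outliers + 1):
          vals = x[mask]
          mean, std = vals.mean(), vals.std(ddof=1)
          dev = np.abs(x - mean)
          dev[~mask] = -np.inf              # ignore points already removed
          idx = int(np.argmax(dev))
          r_stats.append(dev[idx] / std)
          candidates.append(idx)
          mask[idx] = False

          p = 1 - alpha / (2 * (n - i + 1))
          t = stats.t.ppf(p, n - i - 1)
          lambdas.append((n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1)))

      # Number of outliers = largest i with R_i > lambda_i.
      num = 0
      for i in range(max_outliers):
          if r_stats[i] > lambdas[i]:
              num = i + 1
      return candidates[:num]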

Point (1).

The work in the OP, that is, an anomaly detector, is basically, nearly inescapably, a statistical hypothesis test and thus faces the usual issues of false alarm rate (the significance level of the test, i.e. the conditional probability of declaring an anomaly when there is none), detection rate, and the classic Neyman-Pearson best-possible result.

In particular, it is important and usually standard to have a means to adjust, control, and know the false alarm rate, but in the OP I saw no mention of false alarm rate, power of the test, etc.

On a server farm bridge or in a network operations center (NOC) with near-real-time anomaly detection, a false alarm rate that is too high is a serious concern. With realistic detectors, a false alarm rate that is too low means the detection rate is also too low, which is likewise a concern.

More.

There are some tests that are both multi-dimensional and distribution-free, with false alarm rate known exactly in advance and adjustable in small steps over a wide range. Such tests might be good for monitoring for 'zero day' problems, that is, ones never seen before, in serious server farms and networks.
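
As a one-dimensional illustration of exact, distribution-free false alarm control (not the multi-dimensional tests referred to above), consider an order-statistic threshold: if the reference data are i.i.d. from any continuous distribution, a new value from that same distribution exceeds the k-th largest reference value with probability exactly k/(n+1):

  import numpy as np

  def order_statistic_threshold(reference, false_alarm):
      # Distribution-free one-sided threshold with an exact false alarm
      # rate of k/(n+1), adjustable in steps of 1/(n+1).
      ref = np.sort(np.asarray(reference, dtype=float))
      n = len(ref)
      k = max(1, int(round(false_alarm * (n + 1))))
      return ref[n - k]   # the k-th largest reference value

  # Example: with 9999 reference points and k = 10, the one-sided false
  # alarm rate is exactly 10/10000 = 0.1%, with no distributional assumptions.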


  > but this test has a Gaussian assumption, not a Poisson assumption.
In practice the Gaussian assumption might not be too badly abused (at least for Twitter) since the Poisson distribution approaches the Gaussian when lambda is large.
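
A quick numerical check of that approximation (the rate here is an arbitrary example):

  import numpy as np
  from scipy import stats

  lam = 10_000                          # e.g. events per minute
  x = lam + 3 * np.sqrt(lam)            # a "3 sigma" count

  # Exact Poisson tail vs. Gaussian approximation with mean lam and
  # standard deviation sqrt(lam); the two should be very close.
  print(stats.poisson.sf(x, lam))
  print(stats.norm.sf(x, loc=lam, scale=np.sqrt(lam)))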



