Anomalous Adventures Part 1: Elasticsearch or R?


A couple of months ago, Elasticsearch released automated anomaly detection as part of X-Pack. It is amazing! X-Pack Machine Learning is just so easy to use. The team at Elasticsearch have dealt with all the statistical pain (picking the right methods, dealing with trend and seasonality...) as well as all the practical pain (running in real-time, no need to maintain a complex code base...). However, there's one issue: X-Pack Machine Learning is a giant black box. What makes X-Pack Machine Learning tick? Why does it select some points as anomalous, whilst ignoring others? For us, these are critically important questions - our SLAs depend on them. So we have started to roll up our sleeves and dig into these questions.

Kibana & the Anomaly Explorer


The anomaly explorer in Kibana is sensational. It is easy to navigate, starting with a high-level summary and providing full details below. We love it. We had been struggling to sell anomaly detection to our internal stakeholders prior to turning to Elasticsearch, but it was Kibana's Anomaly Explorer that managed to kick off real buy-in. It is intuitive to use and provides a sense of self-discovery that has helped people learn to seek their own answers and, ultimately, to trust the outputs. Forgive me, I have covered up some of the labels (AlertName and Influenced By) above.

The image above is a nice example of Kibana's Anomaly Explorer, but also of the types of questions that people are asking about the results. For example, the group of blue anomalies around January 8th and 9th are marked as low severity, despite being more than 10 times higher than normally observed. Our users want to know: why is the severity so low? Should we investigate these or let them pass? Can we customise the thresholds? The last question is the easiest to answer - yes, we can customise the thresholds. All of the anomalies are saved to an index in Elasticsearch and we can do anything we want with this information. The first question, why the severity is so low, is a little trickier to answer.
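On the thresholds point, here is a minimal sketch of pulling the saved anomaly records back out of Elasticsearch from R and applying a cut-off of our own choosing. It assumes the default .ml-anomalies-* results index and the standard record fields; the job name "alert_volume" is a made-up placeholder, not one of our actual jobs.

```r
# Minimal sketch: pull anomaly records above our own threshold straight from
# the X-Pack ML results index. The index pattern and field names follow the
# X-Pack ML defaults; the job_id "alert_volume" is a placeholder.
library(httr)
library(jsonlite)

query <- list(
  size  = 1000,
  query = list(bool = list(filter = list(
    list(term  = list(result_type = "record")),
    list(term  = list(job_id = "alert_volume")),
    list(range = list(record_score = list(gte = 10)))  # our threshold, not Kibana's 25/50/75
  )))
)

res <- POST("http://localhost:9200/.ml-anomalies-*/_search",
            body = toJSON(query, auto_unbox = TRUE),
            content_type_json())

records <- fromJSON(content(res, as = "text"))$hits$hits
```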

Notice, in the image above, that there are spikes at 12 PM each day. These are really interesting and they highlight one of the best and one of the most frustrating features of X-Pack's anomaly detection. Clearly, the spikes form a repeating pattern; they are not unexpected, so they are not anomalous. But our users are really interested in these kinds of repeating patterns, and I haven't yet found a way to extract information about them from Elasticsearch.

Repeating patterns are a great example of something we want to explore in more depth: patterns that we need to understand and potentially act on. There are others as well, for example strongly trending data streams or highly variable data streams. It is clear, from exploring the results of X-Pack's anomaly detection, that Elasticsearch is accounting for these trends and patterns, but it doesn't elevate them as trends of interest - they simply aren't anomalous. So we need to dig a little deeper. Unfortunately, our X-Pack trial has also expired... so we are turning to R to help us dive into the nitty gritty details of what's happening under the hood of X-Pack's anomaly detection.

Elasticsearch vs. R

I love R. I've been using R for a few years now, and I love its flexibility and the richness of statistical tools being driven by researchers around the world. That said, I am a little scared of rolling my own anomaly detection routines, which will come at a large development cost and, more importantly, a huge maintenance cost. But our users are highly technical and highly skeptical people - we need to be able to prove our case to them, and for that I am happy to tackle this problem in R.

Our first step has been to try to reproduce the Elasticsearch results in R. We've explored a variety of methods for this, from basic to more complex (a sketch of the simplest follows the list):

  • simple moving averages
  • probabilistic models
  • time series methodologies

For each of these methods, we plotted the Elasticsearch anomalies against our own detected anomalies and we calculated a variety of performance metrics (Type I and II error rates, accuracy, F1 Score and diagnostic odds ratio). Visually, we have achieved quite reasonable consistency with Elasticsearch:

Apologies for the poor resolution, but hopefully you get the general feel. Above we have plotted four data streams (A - D) over time. Anomalies detected via R are marked as blue circles; anomalies detected via Elasticsearch are marked by the red ribbon at the bottom of each plot. Let's be clear, this isn't quite apples-to-apples. We did not partition by data stream when we analysed these in Elasticsearch, but we did include the data stream as a key influencer. In R, we partitioned the data and analysed each stream separately because, as you can see, all four exhibit quite different behaviours. Because of this, we haven't been able to separate the Elasticsearch anomalies by data stream; the red ribbons are therefore the same in all four plots and represent the total anomaly output across them all. Long story short, we need to look across all four plots to determine whether R found a corresponding anomaly.
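For anyone wanting to build a similar comparison plot, a ggplot2 sketch along these lines works; the data layout and column names (timestamp, value, stream, r_anomaly, es_anomaly) are assumptions, not our exact code.

```r
# Illustrative sketch of the comparison plot: R-detected anomalies as blue
# circles on each series, Elasticsearch anomalies as red marks along the
# bottom, one panel per data stream. Column names are assumptions.
library(ggplot2)

ggplot(df, aes(timestamp, value)) +
  geom_line(colour = "grey40") +
  geom_point(data = subset(df, r_anomaly), colour = "blue", shape = 1, size = 2) +
  geom_rug(data = subset(df, es_anomaly), sides = "b", colour = "red") +
  facet_wrap(~ stream, ncol = 1, scales = "free_y")
```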

Visually, the results are promising. R is certainly noisier at the beginning of the data streams but, overall, there is good agreement between the two. The performance metrics give a more quantitative comparison:

Overall accuracy: 76%. Our (to date) very basic methods in R are more conservative than Elasticsearch, resulting in roughly 10% more anomalies.

Type II Error Rate: 14%. These are the inconsistencies that we really care about: the anomalies that Elasticsearch detected but which R failed to detect. We need this to be as low as possible.

Diagnostic Odds Ratio: 13.8 [95% CI: 5.3 - 35.6]. This is well above 1 and a very positive sign. The Type I error rate (detected by R but not by Elasticsearch) will keep this lower than it perhaps could be, but we would prefer a more conservative approach, so we are happy with this.
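For reference, all of these metrics fall out of a simple confusion-matrix calculation with Elasticsearch treated as the reference; a sketch (the es_anomaly / r_anomaly columns are assumptions about how the joined data is laid out):

```r
# Sketch of the comparison metrics, treating Elasticsearch as the reference:
# TP = flagged by both, FN = Elasticsearch only (Type II), FP = R only (Type I),
# TN = flagged by neither. Columns es_anomaly / r_anomaly are assumptions.
tp <- sum( df$es_anomaly &  df$r_anomaly)
fn <- sum( df$es_anomaly & !df$r_anomaly)
fp <- sum(!df$es_anomaly &  df$r_anomaly)
tn <- sum(!df$es_anomaly & !df$r_anomaly)

accuracy     <- (tp + tn) / (tp + tn + fp + fn)
type_i_rate  <- fp / (fp + tn)   # R flags, Elasticsearch doesn't
type_ii_rate <- fn / (fn + tp)   # Elasticsearch flags, R misses
f1           <- 2 * tp / (2 * tp + fp + fn)

# Diagnostic odds ratio with a Wald-style 95% CI on the log scale
dor    <- (tp * tn) / (fp * fn)
se_log <- sqrt(1/tp + 1/fn + 1/fp + 1/tn)
ci     <- exp(log(dor) + c(-1.96, 1.96) * se_log)
```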

Exploring the Type II Error Rate

It's the Type II Error rate that we are really concerned about: the anomalies which were detected by Elasticsearch but not detected by R. To get a feel for how exposed we were, we extracted these inconsistent results and plotted the distribution of their Anomaly Score generated by Elasticsearch:
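A sketch of that extraction and plot (the record_score column and the anomaly flags are assumptions about how the joined data is laid out):

```r
# Sketch of the Type II follow-up: keep the anomalies that Elasticsearch
# flagged but R missed, and look at the distribution of their anomaly scores.
library(ggplot2)

missed <- subset(df, es_anomaly & !r_anomaly)
ggplot(missed, aes(record_score)) +
  geom_histogram(binwidth = 5, fill = "steelblue", colour = "white") +
  labs(x = "Elasticsearch anomaly score", y = "Count")
```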


In the plot above, we can see that the majority of Type II Errors occur for records with low Anomaly Scores. By default, Elasticsearch categorises an anomaly as:

  • Warning (Score: 0 - 25)
  • Minor (Score: 25 - 50)
  • Major (Score: 50 - 75)
  • Critical (Score: 75 - 100)

The obvious question, then, is whether these inconsistencies are truly of interest. Taking a probabilistic approach, there is a 14% chance that we will miss an anomaly which falls within the Major - Critical bands, and only a 5% risk that we will miss a Critical anomaly.
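Those per-band figures come from a simple breakdown of the Elasticsearch anomalies by severity band, something along these lines (column names are assumptions):

```r
# Sketch: bucket the Elasticsearch anomalies into the default severity bands
# and compute the proportion of each band that R failed to flag.
# Columns record_score, es_anomaly and r_anomaly are assumptions.
es_hits <- subset(df, es_anomaly)
es_hits$severity <- cut(es_hits$record_score,
                        breaks = c(0, 25, 50, 75, 100),
                        labels = c("warning", "minor", "major", "critical"),
                        include.lowest = TRUE)
tapply(!es_hits$r_anomaly, es_hits$severity, mean)  # miss rate per band
```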

From a practical point of view, our events are cumulative. This means that if we were to initially miss a Major event, it would continue to escalate, and the risk that we would miss it twice in a row would be less than 2% (0.14 × 0.14 ≈ 1.96%). These are favourable odds.

Final Thoughts

We are becoming big fans of Elasticsearch and X-Pack Anomaly Detection. However, we have a strong requirement to dive under the hood of this black box and understand what is driving Elasticsearch's outputs. This is why we have turned to R and to methods that we can interrogate. In this first phase, we've managed to largely reproduce the results from Elasticsearch, and we are comfortable with R being somewhat more conservative. But perhaps more importantly, we now have the ability to quantify our performance, and we have highlighted some really interesting Type II Errors which we need to follow up on...

