Session abstract:
So you’ve created some machine learning algorithms and tested them out in the lab, and they seem to be working fine. But how can you monitor them in production, especially if they are constantly learning and updating, and you have many of them? This was the challenge we faced at Anodot, and I’ll talk about the interesting way we solved it.
At Anodot, we have approximately 30 different types of machine learning algorithms, each with its own parameters and tuning capabilities, designed to provide real-time anomaly detection. Many of these are online learning algorithms, which are updated with every new piece of data that arrives. Adding to the complexity, the outputs of some of the algorithms act as the inputs of others. These algorithms run constantly on the vast number of signals sent to our SaaS cloud (currently, more than 35 million signals are reported to Anodot every 1 to 5 minutes). We knew from day one that it was crucial to track the performance of these algorithms, so we would know if something happened that improved or degraded their performance, but we faced a challenge: how to accomplish this?
First, we collect time series metrics that constantly measure various performance indicators for each of the algorithms. We measure the number of anomalies we discover for each customer, their score distribution, the number of seasonal patterns discovered, classification changes and rates between the various model selection algorithms, the number of clusters and their quality from our various clustering algorithms, and much more.
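To make this concrete, here is a minimal sketch of what such instrumentation might look like. The MetricsClient class, metric names, and data shapes are hypothetical stand-ins for illustration, not Anodot's actual code.

```python
# Hypothetical sketch: an algorithm run emits its own performance indicators
# as time series metrics. Names and the metrics backend are illustrative only.
import time
from collections import Counter

class MetricsClient:
    """Minimal stand-in for a metrics backend: buffers (name, tags, value, timestamp)."""
    def __init__(self):
        self.samples = []

    def gauge(self, name, value, **tags):
        self.samples.append((name, tags, value, time.time()))

metrics = MetricsClient()

def report_detection_run(customer_id, anomalies, clusters):
    """Emit per-run performance indicators for one customer."""
    # How many anomalies this run produced for this customer.
    metrics.gauge("algo.anomalies.count", len(anomalies), customer=customer_id)

    # Score distribution, summarised into buckets so it stays a handful of series.
    buckets = Counter(round(a["score"], 1) for a in anomalies)
    for score_bucket, count in buckets.items():
        metrics.gauge("algo.anomalies.score_bucket", count,
                      customer=customer_id, bucket=score_bucket)

    # Clustering quality indicators.
    metrics.gauge("algo.clusters.count", len(clusters), customer=customer_id)
    if clusters:
        avg_quality = sum(c["quality"] for c in clusters) / len(clusters)
        metrics.gauge("algo.clusters.avg_quality", avg_quality, customer=customer_id)
```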
But manually tracking changes in these algorithm performance metrics is not feasible: there are too many models and algorithms to track by hand, even with dashboards and reports. So we applied our own anomaly detection algorithms to these metrics, which lets us quickly discover any change in their behaviour and surface their abnormal patterns. When we get an alert on such an abnormal change, we determine whether it was caused by some algorithm tuning or whether it is a valid shift.
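As a rough sketch of this "monitor the monitors" loop, the snippet below feeds the collected metric series through a simple rolling z-score detector and raises an alert on abnormal points. The detector is a deliberately simplistic stand-in for Anodot's actual detection algorithms; the function and parameter names are assumptions made for illustration.

```python
# Hypothetical sketch: the performance-indicator series are themselves fed to an
# anomaly detector, and abnormal changes trigger alerts for human triage.
from collections import deque
from statistics import mean, stdev

class RollingZScoreDetector:
    """Toy detector: flags values far from the recent rolling mean."""
    def __init__(self, window=100, threshold=4.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def update(self, value):
        """Return True if `value` is anomalous relative to the recent window."""
        is_anomaly = False
        if len(self.window) >= 10:
            mu, sigma = mean(self.window), stdev(self.window)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        self.window.append(value)
        return is_anomaly

def monitor(metric_stream, alert):
    """metric_stream yields (metric_name, value); alert is called on abnormal changes."""
    detectors = {}
    for name, value in metric_stream:
        detector = detectors.setdefault(name, RollingZScoreDetector())
        if detector.update(value):
            # A human (or a triage rule) then decides: algorithm tuning, or a valid shift?
            alert(name, value)
```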
In the talk, I'll show multiple examples of how this approach helped us detect and fix problems, and eventually design better learning algorithms. I'll also describe the general methodology for "learning the learner".