Introduction
If you have ever been interested in running ML models in production (or if you have already done it!), you have probably heard of drift monitoring. That’s good news! It means that you have put your finger on a really important aspect of in-production ML services: their measured performance (usually on a test dataset) can, and surely will, vary over time. In fact, lots of things can happen once your model is deployed: the distribution of the data it was trained on changes, the context from which the data is extracted evolves… In short, your model will face data that can differ a lot from what it is used to.
Being able to monitor changes in your model’s performance in a production environment is crucial to keep a high standard of service quality and to stay fully in line with the MLOps philosophy. Since MLOps is an iterative cycle, you need to monitor your production service as early as possible, so that you can quickly go back to an experimental phase if needed. An even better solution would be not only to monitor your model but to be warned as soon as its performance drops below a certain threshold. Why not even automatically retrain your model on fresh data?
Let’s think about how to build a drift monitoring pipeline!
Monitoring VS Detecting: Why isn’t it that easy?
Detecting drift in data or in a model’s performance is not that complicated when you think about it. We need to find a good measure of it, compute it, and finally decide whether the data streams have drifted or not (i.e. whether the model’s performance has decreased or not).
There are many appropriate techniques that can help us identify drift within data. Consider the Kolmogorov-Smirnov (K-S) test for instance: this non-parametric univariate test is commonly used to check whether two samples of data come from the same distribution. If you run it on a new batch of data and an old one, you get an indication of whether the data has drifted or not.
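As a quick illustration, here is a minimal sketch of running the K-S test with SciPy on a reference sample and a production sample of a single feature (the synthetic data and the 5% significance level are just illustrative conventions, not hard rules):

```python
import numpy as np
from scipy.stats import ks_2samp

# Reference sample: the feature as it looked at training time
reference = np.random.normal(loc=0.0, scale=1.0, size=1_000)
# Production sample: the same feature observed later (here, deliberately shifted)
production = np.random.normal(loc=0.3, scale=1.0, size=1_000)

# Two-sample Kolmogorov-Smirnov test: do both samples come from the same distribution?
result = ks_2samp(reference, production)

if result.pvalue < 0.05:  # common (but arbitrary) significance level
    print(f"Potential drift detected (KS statistic={result.statistic:.3f}, p={result.pvalue:.4f})")
else:
    print(f"No significant drift detected (p={result.pvalue:.4f})")
```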
An even easier solution would be to check the performance of our model at two different times (therefore on different batches of data). The difference between the two measures can be read as a measure of drift: my R² has decreased by 15% in a month, so I may have to retrain my model to counter the drift.
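In code, this simple performance-based drift signal could look like the following minimal sketch (the synthetic batches and the 15% threshold are illustrative placeholders):

```python
import numpy as np
from sklearn.metrics import r2_score

def performance_drop(y_true_old, y_pred_old, y_true_new, y_pred_new):
    """Relative drop in R² between an old and a new batch of predictions."""
    r2_old = r2_score(y_true_old, y_pred_old)
    r2_new = r2_score(y_true_new, y_pred_new)
    return (r2_old - r2_new) / abs(r2_old)

# Toy example with synthetic values standing in for two real monthly batches
rng = np.random.default_rng(0)
y_true = rng.normal(size=100)
drop = performance_drop(y_true, y_true + rng.normal(scale=0.1, size=100),
                        y_true, y_true + rng.normal(scale=0.5, size=100))
if drop > 0.15:  # "my R² has decreased by 15%"
    print(f"R² dropped by {drop:.0%}: time to investigate, and maybe retrain.")
```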
So, what is the real difficulty here? Monitoring drift.
In fact, we have seen that detecting drift can be done quite easily with the right tools, but monitoring it implies automation and data persistence (i.e. accumulating metrics and storing them). It is not just about computing a predefined measure of drift (as we saw above), but also about tracking its evolution over time. We need to compute it periodically, store the results for the period of interest, retrieve previous results to compare them globally, and probably define an alerting system that warns us when the model’s performance drifts beyond an acceptable level.
It sounds like we have a bit of work ahead. Let’s check how we could do this in practice!
The building blocks of a drift monitoring pipeline
First, let’s focus on a specific forecasting use case in the financial industry. We have an ML service (in production!) that, every week, forecasts the daily value (that is, the change in price) of a given asset over the next week. It uses a lot of data coming from different sources… In short, there are many factors that could lead to drift. Our goal is to monitor it.
💾 First, I need to store my predictions to compare them later on. Indeed, since I am making one prediction per week over a seven-day window, I need to wait for the next week to have access to the true values. Meanwhile, I need to make sure that I don’t lose my predicted values! We can use any data storage (cloud storage may be an easy option), upload the predictions as soon as they are computed, and we will be fine.
🎯 Then, we make sure that we can access the true values. Of course, comparing the predicted values with their true counterparts requires that we can access them easily. In our case, that means one week after the forecast.
🌡️ We define the measure to be monitored. It can be as simple as an R², an MAE or an RMSE. An interesting metric to consider could also be the difference between the R² (or any other metric) at different timestamps: it provides a more dynamic view of the evolution of your metrics. But I guess you are starting to see the difficulty here. We now need to store not only the predicted values, but also the metrics from one week to the next, so that we can finally compare them over time.
🔁 Finally, we need to make everything dynamic. Once we have all of the above, we are still not done. We need to add the time dimension to this metric computation: from one week to the next, we build a time series of our measured values, and that time series needs to be appended every week. We also need to think about how to display it, how to set up the alerts, and how to keep everything in sync. Everything needs to happen at different times but periodically, correctly interacting with our data storage systems and retrieving the right values at the right time to finally augment and display a time series. That can be a lot to maintain! A sketch of what such a weekly job could look like follows right after this list.
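Here is a minimal sketch of such a hand-rolled weekly monitoring job, using local CSV files as a stand-in for whatever storage you pick. All file names, the `fetch_true_values` helper and the alert threshold are hypothetical placeholders:

```python
import datetime as dt
from pathlib import Path

import pandas as pd
from sklearn.metrics import r2_score

PREDICTIONS_DIR = Path("predictions")       # hypothetical storage for weekly predictions
METRICS_FILE = Path("metrics_history.csv")  # hypothetical time series of weekly metrics
R2_ALERT_THRESHOLD = 0.6                    # arbitrary alerting threshold

def weekly_monitoring_job(today: dt.date, fetch_true_values) -> None:
    """Compare last week's stored predictions with the now-available true values."""
    last_week = today - dt.timedelta(days=7)

    # 1. Retrieve the predictions stored by last week's forecasting run
    predictions = pd.read_csv(PREDICTIONS_DIR / f"{last_week.isoformat()}.csv")

    # 2. Retrieve the true daily values for that window (source-specific, hence injected)
    true_values = fetch_true_values(start=last_week, end=today)

    # 3. Compute the monitored metric
    r2 = r2_score(true_values, predictions["forecast"])

    # 4. Append it to the metric time series (data persistence)
    history = pd.read_csv(METRICS_FILE) if METRICS_FILE.exists() else pd.DataFrame(columns=["date", "r2"])
    history = pd.concat([history, pd.DataFrame([{"date": today.isoformat(), "r2": r2}])], ignore_index=True)
    history.to_csv(METRICS_FILE, index=False)

    # 5. Rudimentary alerting
    if r2 < R2_ALERT_THRESHOLD:
        print(f"⚠️ R² dropped to {r2:.2f}: the model may be drifting.")
```

And this still leaves out scheduling (cron, an orchestrator, …), authentication to the data sources, retries, and the dashboard itself.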
As we just saw, doing everything by hand can quickly become difficult. Thankfully, there exist built-in tools that can save us time and ensure performance. Let’s see an example.
Monitor your service with Craft AI
When you’re using a third-party platform to deploy your model, it can come with a built-in drift monitoring functionality that saves you a lot of time and effort. If not, you will have to use other tools to cover the whole MLOps cycle. Craft AI offers a monitoring functionality directly embedded in its MLOps platform! Let’s see how it works.
When using the Craft AI platform to deploy your model, you have the opportunity to define metrics directly in your code, in an easy way, using the Python SDK. For instance, defining an R² metric inside my drift monitoring pipeline would give something like this:
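(A minimal sketch: the way the SDK is instantiated and the `record_metric_value` call are assumptions to be checked against the Craft AI SDK documentation.)

```python
from craft_ai_sdk import CraftAiSdk
from sklearn.metrics import r2_score

sdk = CraftAiSdk()  # assumed to pick up credentials from the environment

def compute_drift_metrics(y_true, y_pred):
    # Compute this week's R² on the freshly available true values
    r2 = r2_score(y_true, y_pred)

    # Assumed SDK helper: record the value for the current pipeline execution,
    # so the platform stores it and appends it to the metric's time series.
    sdk.record_metric_value("r2", r2)
```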
The great thing about doing this is that you don’t need to care about synchronization, metric storage, or appending the new R² values to the time series. The Craft AI platform does this for you.
In the monitoring tab, you can see the evolution over time of the metrics you defined in your code when creating this monitoring pipeline. It is also extremely easy to interact with the display, for instance to retrieve a particular execution with a suspicious performance measure. This way, you can locate potential sources of bugs and monitor drift at the same time.
This is strongly anchored in the MLOps philosophy of being able to quickly iterate between a production environment and an experimental one: thanks to my monitoring, I notice that something is not going well with my service, so I can start experimenting right away and redeploy whenever I feel ready!
Starting from version 1.2.0 of the Craft AI platform, you will also be able to simply set up alerts in the alerting center, or define metric triggers to retrain your model, for instance. This keeps your ML service under control and lets you achieve high quality standards easily!
Conclusion
We have seen why drift monitoring is crucial for any ML service in production and what is required to build an efficient monitoring pipeline. Measuring and detecting drift is not the hard part; the difficulties start when we want to automate the process and synchronize everything (think about our financial example…). Using an MLOps platform can simplify a Data Scientist’s life in many aspects, and especially this one. That’s what Craft AI offers you!
Dive into MLOps with a hands-on, personalized demo. See how our solution can bring your machine learning use cases into the real world. Schedule your demo now!