The difficulties of industrializing AI

For a long time, the main objective of a Data Scientist has been to find the best algorithmic recipe to answer a given business problem. To facilitate this prototyping phase, many tools have emerged, such as open-source libraries and Data Science platforms; the latter even go so far as to offer a no-code experience.

14/04/2022

MLOps


However, an essential aspect of artificial intelligence projects has been ignored for too long: industrialization. We must not lose sight of the fact that only an AI in production, i.e. a system whose results (forecasts, recommendations, etc.) are made available to end users, can deliver significant productivity gains for companies.

According to Gartner, nearly 85% of AI projects fail to go into production. And for those that succeed, the observation is clear: the costs are significant and the constraints numerous. So what explains this “industrialization wall”?

Deploy and redeploy the machine learning model

Once the algorithmic recipe has been validated during prototyping, it becomes necessary to confront the model with dynamic data arriving in real time. The production environment must have data connectors (databases, web APIs, cloud storage spaces, etc.) that are stable and efficient.
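As an illustration, here is a minimal sketch of how transient failures from such a connector can be absorbed with retries; the callable and its failure mode are hypothetical, not a real API:

```python
import time

def fetch_with_retry(fetch, retries=3, backoff_s=1.0):
    """Call a zero-argument data-fetching function (database query, web API
    call, cloud-storage read...), retrying on transient failures."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(backoff_s * 2 ** (attempt - 1))  # exponential backoff

# Illustrative flaky source that succeeds on the third call
calls = {"n": 0}
def flaky_source():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient outage")
    return [1, 2, 3]

rows = fetch_with_retry(flaky_source, retries=3, backoff_s=0.01)
print(rows)  # [1, 2, 3]
```

The same wrapper can front any of the connectors mentioned above; the point is that stability has to be engineered in, not assumed.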

Prototyping results are then challenged by real-time data, which too often differs from the prototyping data. The method must then be revised accordingly.

After this first test, the code must gain in maintainability and performance before going into production, so that it can handle the load and support each evolution. This reworking of the code (refactoring) is a necessary step; optimizing too early would be counterproductive, as it would reduce the agility of prototyping. The production environment must accommodate this conversion of methods between prototyping and production.

“Scaling up and integrating into production information systems require specific skills, multiplying the initial cost of the project.”

Scaling up and integrating into production information systems require specific, highly sought-after and expensive skills, such as ML Engineers, DevOps engineers and Developers, multiplying the initial cost of the project. A manual approach to these steps, with nothing capitalized on for reuse, makes production possible in theory, but expensive, long and perilous from one release to the next.

The difficulties become even more significant when it comes to modifying and updating the algorithmic recipe, which is rarely fixed once it is in production. Each redeployment is very time-consuming, as it requires repeating the production steps. Evolutions thus become very difficult to redeploy, which may encourage teams to keep unsuitable solutions in production or to devote a lot of time to redeployments.

Production frictions often lead to solutions being abandoned: too cumbersome, too risky, too time-consuming.

Monitoring production

Once the model is deployed, permanent supervision is necessary so that Data teams can be alerted to service malfunctions, if possible in advance, and can diagnose and correct them.

Two different types of malfunctions may occur.

  • The first possible malfunction: users no longer receive results. This may stem from an error in the code, an unexpected case, unavailability of the source data, etc.
  • The second possible malfunction, more specific to Machine Learning services: users receive forecasts, but the forecasts are of poor quality. This is the case, for example, when a predictive maintenance service raises too many false alarms, or a recommendation engine sends inadequate or uninteresting proposals. In this situation, the system is operational from a software perspective, but not from a business perspective, which can lead to poorer adoption of AI due to a loss of confidence in its results.

For the first type of malfunction, having service logs to trace the error and debug it is key. The observability of operations is essential here: it drastically reduces the time needed to identify and correct the problem.
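A minimal sketch of structured logging for such traceability; the event names and fields are illustrative, not a standard:

```python
import json
import logging
import sys

# One JSON record per event, so a failing request can be traced back to its
# inputs and model version when debugging.
logger = logging.getLogger("ml_service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def log_event(event, **fields):
    """Emit one structured log record and return it for inspection."""
    record = json.dumps({"event": event, **fields})
    logger.info(record)
    return record

# Illustrative failure path: the request id and model version are made up
try:
    raise ValueError("unexpected input case")
except ValueError as exc:
    log_event("prediction_failed", request_id="req-42",
              model_version="1.3.0", error=str(exc))
```

Because each record is machine-readable, logs can be filtered by request, model version or error type instead of being grepped line by line.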

In addition, to prevent service interruptions related to input data, Data Scientists or ML Engineers must implement data quality checks and validations (for example, checking data types, data losses, missing data, outlier values, name changes in data fields, etc.).
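These checks can be sketched as follows; the schema, field names and bounds are purely illustrative, not taken from any specific service:

```python
# Validate one input record against an expected schema before scoring.
EXPECTED_SCHEMA = {"temperature": float, "pressure": float, "sensor_id": str}

def validate_record(record, schema=EXPECTED_SCHEMA, bounds=(-50.0, 150.0)):
    """Return a list of problems found in one input record (empty = valid)."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    # Simple outlier check on one numeric field
    value = record.get("temperature")
    if isinstance(value, float) and not (bounds[0] <= value <= bounds[1]):
        problems.append(f"temperature out of range: {value}")
    return problems

print(validate_record({"temperature": 21.5, "pressure": 1.0, "sensor_id": "a"}))  # []
print(validate_record({"temperature": 999.0, "sensor_id": 7}))
```

Rejecting or quarantining invalid records at the door is usually cheaper than debugging a crashed or silently wrong service downstream.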

“The observability of operations is essential here. It drastically reduces the time needed to identify and correct the service.”

The second type of malfunction is much less obvious to detect. Identifying a loss in the mathematical performance of a Machine Learning model requires more complex management strategies. A main challenge is to set up continuous performance checks and raise alerts when performance decreases. This is where drift management, re-training strategies and, in some cases, model modification come in.
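One common drift signal is the Population Stability Index (PSI), which compares the distribution of a feature in production against the training data. A self-contained sketch, with illustrative data and the usual (but not universal) alert thresholds:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    Common rule of thumb: < 0.1 no significant drift, 0.1-0.25 moderate
    drift, > 0.25 significant drift (thresholds vary by team).
    """
    lo, hi = min(expected), max(expected)

    def frequencies(sample):
        counts = [0] * bins
        for x in sample:
            # clamp into [0, bins-1] so out-of-range production values count
            i = min(max(int((x - lo) / (hi - lo) * bins), 0), bins - 1)
            counts[i] += 1
        # small epsilon avoids division by zero on empty bins
        return [(c + 1e-6) / (len(sample) + 1e-6 * bins) for c in counts]

    e, a = frequencies(expected), frequencies(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(1000)]          # reference distribution
same = [i / 100 for i in range(0, 1000, 2)]     # similar distribution
shifted = [5.0 + i / 100 for i in range(1000)]  # clearly drifted

print(psi(train, same) < 0.1)      # True: no alert
print(psi(train, shifted) > 0.25)  # True: raise a drift alert
```

Computed continuously on incoming data, such a metric turns silent model degradation into an explicit alert that can trigger re-training.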

To support deployment and maintenance, Data teams will need, among other things, version management (versioning) and experiment tracking tools: to be able to return to any previous state (with the right version of the data, code, model, etc.) in a simple and safe way, and to compare models with each other.
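What such a tracking record might capture, as a sketch; in practice teams rely on dedicated tools, and every name and value here is illustrative:

```python
import hashlib
import json
import time

def track_run(params, metrics, data, code_version, store):
    """Append one immutable run record so any past state can be revisited."""
    record = {
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
        # hash of the training data, so the exact dataset can be re-identified
        "data_hash": hashlib.sha256(
            json.dumps(data, sort_keys=True).encode()).hexdigest(),
        "code_version": code_version,
    }
    store.append(record)
    return record

runs = []
track_run({"max_depth": 3}, {"rmse": 0.42}, [[1, 2], [3, 4]], "git:abc123", runs)
track_run({"max_depth": 5}, {"rmse": 0.37}, [[1, 2], [3, 4]], "git:def456", runs)

# Comparing models becomes a query over the run history
best = min(runs, key=lambda r: r["metrics"]["rmse"])
print(best["params"])  # {'max_depth': 5}
```

Tying data hash, code version, parameters and metrics together in one record is what makes "return to any previous state" possible at all.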

The road to production is long and winding if Machine Learning services are to deliver value over the long term.

Optimization of the algorithmic recipe, management of the production environment, code refactoring, etc.: these are all key steps and points of friction that can endanger the industrialization of an AI project.

All of these tasks sit at the crossroads of the fields of expertise of a Data Scientist and a DevOps engineer.

This is why the emergence and adoption of MLOps (Machine Learning Operations) are so important: it finally provides Data teams with a methodology and tools to calmly cross the industrialization wall.

By also reducing production costs and increasing the reliability of results, MLOps has everything it takes to enable artificial intelligence to truly take off.
