Machine Learning Model Testing and Monitoring Strategy
Within the life cycle of analytical models, testing is one of the main practices that ensures their quality and correct execution. We must test not only that the model itself works correctly, but also that all the components involved in its training and execution do.
The following graphic from the Google paper "Hidden Technical Debt in Machine Learning Systems" shows the code and components needed to implement this cycle. The part devoted to training or prediction is very small compared to everything that surrounds it:
Source: Hidden Technical Debt in Machine Learning Systems
Google also published another paper, "The ML Test Score", which identifies all the types of tests needed to cover this complete lifecycle. The purpose of this post is to summarize the types of tests identified in that paper, which can help define a testing strategy for our machine learning model lifecycle.
ML Test Score
The paper approaches the development of analytical models with the same industrialization mindset as traditional software development: the entire lifecycle must be automated (construction, testing, deployment, etc.), and every step must be treated as a process that cannot fail, since a failure would block the release of new models or could impact their correct execution or that of the applications that depend on them.
The paper groups the tests directly related to the development and operation of analytical models into four main groups:
Source: The ML Test Score
Data Tests
This group covers the tests related to data processing, both in the training phase and in the execution phase. Here we must take into account the following tests:
- Check that training input data is of sufficient quality, for example by validating it against a schema (see the sketch after this list).
- All features are useful for model training.
- The cost of extracting the features must be proportional to the benefit they bring to the model.
- Training data must meet the defined privacy requirements.
- The code used to process the input data must also be analyzed and tested as it may also contain bugs.
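As an illustration of the first point, here is a minimal sketch of a data quality test written with pytest. The column names, expected types, value ranges and the snapshot path are hypothetical, not taken from the article:

```python
# test_training_data.py - minimal sketch of a data quality test (hypothetical schema).
import pandas as pd
import pytest

# Expected schema: column name -> dtype (illustrative only).
EXPECTED_SCHEMA = {"age": "int64", "income": "float64", "churned": "int64"}


@pytest.fixture
def training_data() -> pd.DataFrame:
    # In a real pipeline this would read from the feature store or a validated snapshot.
    return pd.read_parquet("data/training_snapshot.parquet")


def test_schema_matches(training_data):
    # Every expected column is present with the expected dtype.
    for column, dtype in EXPECTED_SCHEMA.items():
        assert column in training_data.columns, f"missing column: {column}"
        assert str(training_data[column].dtype) == dtype


def test_basic_quality(training_data):
    # No nulls in the target, plausible value ranges, no duplicated rows.
    assert training_data["churned"].notna().all()
    assert training_data["age"].between(0, 120).all()
    assert not training_data.duplicated().any()
```

Tests like these can run in the same CI pipeline as the rest of the code, so bad data blocks training in the same way a failing unit test blocks a build.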
Model Development Tests
As with the data processing phase, the model development phase also needs to be tested adequately:
- All code used to develop the model has to be analyzed, tested and versioned.
- Check the correlation between offline metrics (hit rate, error/loss, etc.) and online metrics (sales, customer acquisition, etc.).
- Check that the selected hyper-parameters are the optimum ones and that the tuning work has been carried out.
- Check that the model does not stagnate across successive training runs, or whether a lower training frequency would be sufficient.
- Compare results with simpler baseline models to verify that the added complexity actually improves effectiveness (see the sketch after this list).
- The model must have the same performance across the whole range of data to be treated, so we must verify that quality is consistent across data slices.
- The model must comply with fairness requirements, so it is necessary to check that its predictions are not unduly correlated with, for example, protected user categories.
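The baseline and data-slice checks can also be written as tests. The following sketch assumes `model`, `X_train`, `y_train`, `X_test`, `y_test` and `segments` are pytest fixtures defined elsewhere, and the thresholds are arbitrary examples:

```python
# test_model_quality.py - sketch of baseline and per-slice checks.
# `model`, `X_train`, `y_train`, `X_test`, `y_test` and `segments` are assumed
# to be pytest fixtures provided elsewhere in the test suite.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score


def test_beats_simple_baseline(model, X_train, y_train, X_test, y_test):
    # The candidate model should clearly outperform a majority-class baseline.
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
    model_acc = accuracy_score(y_test, model.predict(X_test))
    assert model_acc > baseline_acc + 0.05  # margin chosen only as an example


def test_consistent_performance_across_slices(model, X_test, y_test, segments):
    # Accuracy on each data slice (e.g., per country or user segment) should not
    # fall far below the global accuracy on the test set.
    global_acc = accuracy_score(y_test, model.predict(X_test))
    for segment in np.unique(segments):
        mask = segments == segment
        slice_acc = accuracy_score(y_test[mask], model.predict(X_test[mask]))
        assert slice_acc >= global_acc - 0.10, f"slice {segment} underperforms"
```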
ML Infrastructure Tests
- The training must be able to be reproduced automatically, so we must verify that we do not have non-deterministic elements in the development of the model.
- Test the training code of the model, its libraries, dependencies, etc.
- Test the entire training process, validating data and code.
- Validate the quality of the model before deploying it, setting minimum prediction-quality thresholds and comparing it with previous models (a sketch of such a quality gate follows this list).
- Use progressive deployment strategies (Canary, A/B) to verify that the new version of the deployed model improves accuracy over the previous one and is safe.
- Implement rollback strategies for model deployments, so that a previous version can be restored in case of issues.
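A pre-deployment quality gate like the one mentioned above could be a small script run by the CI/CD pipeline. In this sketch the metric files, the metric name and the thresholds are illustrative assumptions:

```python
# validate_before_deploy.py - sketch of a deployment quality gate (paths and thresholds assumed).
import json
import sys

MIN_AUC = 0.75          # hard floor the candidate must reach
MAX_REGRESSION = 0.01   # allowed drop versus the model currently in production


def load_metrics(path: str) -> dict:
    # Metrics are assumed to be exported by the training pipeline as JSON, e.g. {"auc": 0.81}.
    with open(path) as f:
        return json.load(f)


def main() -> int:
    candidate = load_metrics("metrics/candidate.json")
    production = load_metrics("metrics/production.json")

    if candidate["auc"] < MIN_AUC:
        print(f"candidate AUC {candidate['auc']:.3f} below minimum {MIN_AUC}")
        return 1
    if candidate["auc"] < production["auc"] - MAX_REGRESSION:
        print("candidate regresses versus the production model, blocking deployment")
        return 1

    print("quality gate passed")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

A non-zero exit code fails the pipeline, so a model that does not meet the thresholds never reaches the progressive deployment stage.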
Monitoring Tests
- A change in the data also has an impact on the model, so it is necessary to monitor these changes.
- Monitor that there are no differences between the training data and the data the model is serving on (training/serving skew; see the sketch after this list).
- Check that we do not have models that have gone too long without retraining, as problems could arise when retraining them (data or training flows no longer available, the model may have been losing effectiveness, etc.).
- Monitor model performance, not only in effectiveness but also in training times, execution, bandwidth, etc.
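To illustrate the training/serving skew check, here is a sketch that compares feature distributions with a two-sample Kolmogorov-Smirnov test. The data paths and the alert threshold are assumptions for the example:

```python
# monitor_skew.py - sketch of a training/serving drift check (data sources and threshold assumed).
import pandas as pd
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.01  # below this we flag the feature as drifted (illustrative)


def drifted_features(train_df: pd.DataFrame, serving_df: pd.DataFrame) -> list[str]:
    # Compare each numeric feature's training distribution with recent serving data.
    drifted = []
    for column in train_df.select_dtypes(include="number").columns:
        statistic, p_value = ks_2samp(train_df[column], serving_df[column])
        if p_value < P_VALUE_THRESHOLD:
            drifted.append(column)
    return drifted


if __name__ == "__main__":
    train_df = pd.read_parquet("data/training_snapshot.parquet")
    serving_df = pd.read_parquet("data/serving_last_24h.parquet")
    features = drifted_features(train_df, serving_df)
    if features:
        print(f"ALERT: distribution shift detected in features: {features}")
    else:
        print("no significant training/serving skew detected")
```

A check like this can run on a schedule and feed the team's alerting system, so drift triggers retraining before the model silently degrades.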
Next steps
To implement all these tests in the analytical model lifecycle, we need to encourage a cultural change among both data scientists and ML engineers, and to treat the development of analytical models as we treat the development of applications that run in production and can impact our business when they do not work well.
There are different ways to apply these tests, from a more manual approach supported by checklists to automating as much as possible with an MLOps approach. In the next article we will discuss the latter approach.