Understanding Data Drift and Model Drift: Drift Detection in Python

| Feature | Model Drift | Data Drift |
| --- | --- | --- |
| Definition | A decline in a model's performance due to changes in the environment. | A change in the statistical properties of the data used to train a model. |
| Cause | Change in the relationship between input variables and target variable. | Change in the distribution of input data. |
| Example | Spam detection model struggles as spammers use new tactics. | Customer purchase prediction model performs poorly as customer demographics shift. |
| Impact | Predictions become less accurate or reliable. | Model may still function but on outdated information. |

What is Drift?

Machine learning models are trained on historical data, but once they are deployed in the real world they can become outdated and lose accuracy over time because of a phenomenon called drift. Drift is a change over time in the statistical properties of the data a model sees, relative to the data it was trained on. This can cause the model to become less accurate or to behave differently than it was designed to.

In other words, "drift" is the decline in a model's ability to make accurate predictions due to changes in the environment in which it is being used.

Why do Machine Learning Models Drift?

There are several reasons why machine learning models can drift over time.

One common reason is simply that the data that the model was trained on becomes outdated or no longer represents the current conditions.

For example, consider a machine learning model trained to predict the stock price of a company based on historical data. If we train the model with data from a stable market, it might do well at first. However, if the market becomes more volatile over time, the model might not be able to accurately predict the stock price anymore because the statistical properties of the data have changed.

Another reason for model drift is that the model was not designed to handle changes in the data. Some machine learning models can handle changes in the data better than others, but no model can avoid drift completely.

Types of Drift

There are two main types of drift to consider:

1. Concept Drift

Concept drift, also known as model drift, occurs when the relationship between the input variables and the target that the model was trained to predict changes over time. For example, imagine that a machine learning model was trained to detect spam emails based on the content of the email. If the types of spam emails that people receive change significantly, the model may no longer be able to accurately detect spam.
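One way to make this concrete is to track accuracy over consecutive windows of labelled predictions and flag windows that fall well below the first one. The sketch below is only an illustration: `model`, the synthetic stream, the window size of 500, and the 0.1 tolerance are all assumptions, and the concept change is deliberately extreme (the label rule fully inverts) so the effect is easy to see.

```python
import random


def windowed_accuracy(y_true, y_pred, window):
    """Accuracy over consecutive, non-overlapping windows.

    A trailing partial window is ignored.
    """
    scores = []
    for start in range(0, len(y_true) - window + 1, window):
        pairs = zip(y_true[start:start + window], y_pred[start:start + window])
        scores.append(sum(t == p for t, p in pairs) / window)
    return scores


def model(x):
    # Hypothetical stand-in for a trained classifier that learned
    # "x > 0.5 means spam" from historical data.
    return int(x > 0.5)


rng = random.Random(0)
xs = [rng.random() for _ in range(2000)]  # the input distribution never changes

# Synthetic ground truth: the old rule holds for the first 1000 items,
# then the relationship between x and the label inverts (concept drift).
y_true = [int(x > 0.5) if i < 1000 else int(x <= 0.5) for i, x in enumerate(xs)]
y_pred = [model(x) for x in xs]

scores = windowed_accuracy(y_true, y_pred, window=500)
baseline = scores[0]
TOLERANCE = 0.1  # assumed acceptable drop in accuracy
drifted = [i for i, s in enumerate(scores) if s < baseline - TOLERANCE]

print(scores)   # [1.0, 1.0, 0.0, 0.0]
print(drifted)  # [2, 3]
```

Note that the inputs `xs` come from the same distribution throughout, so a check on the input data alone would not notice this change; only the labelled outcomes reveal it.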

Concept drift can be further divided into four categories (Learning under Concept Drift: A Review, Jie Lu et al.): sudden drift, gradual drift, incremental drift, and reoccurring concepts.

2. Data Drift

Data drift, also known as covariate shift, occurs when the distribution of the input data changes over time. For example, consider a machine learning model that was trained to predict the likelihood of a customer purchasing a product based on their age and income. If the distribution of ages and incomes of the customers changes significantly over time, the model may no longer be able to predict the likelihood of a purchase accurately.
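A common way to check a single numeric feature for this kind of shift is the two-sample Kolmogorov-Smirnov (KS) statistic, which measures the largest gap between the empirical distributions of a reference sample and a recent sample. The sketch below computes it with the standard library only; the synthetic age data and the 0.1 threshold are assumptions chosen for illustration. In practice, `scipy.stats.ks_2samp` computes the same statistic and also returns a p-value.

```python
import bisect
import random


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest absolute difference between
    the empirical cumulative distribution functions of the two samples."""
    ref = sorted(reference)
    cur = sorted(current)
    gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for v in ref + cur:
        cdf_ref = bisect.bisect_right(ref, v) / len(ref)
        cdf_cur = bisect.bisect_right(cur, v) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


rng = random.Random(42)
# Synthetic customer ages: training data, a new batch drawn from the same
# distribution, and a new batch whose average age is ten years higher.
train_ages = [rng.gauss(35, 8) for _ in range(1000)]
stable_ages = [rng.gauss(35, 8) for _ in range(1000)]
shifted_ages = [rng.gauss(45, 8) for _ in range(1000)]

THRESHOLD = 0.1  # assumed cutoff; tune it for your own data and sample sizes

stable_score = ks_statistic(train_ages, stable_ages)
shifted_score = ks_statistic(train_ages, shifted_ages)

print(f"stable batch:  KS={stable_score:.3f} drift={stable_score > THRESHOLD}")
print(f"shifted batch: KS={shifted_score:.3f} drift={shifted_score > THRESHOLD}")
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), so it can be compared across features regardless of their units.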

It is important to be aware of both concept drift and data drift and take steps to prevent or mitigate their effects. Some strategies for addressing drift include continuously monitoring and evaluating the performance of a model, updating the model with new data, and using machine learning models that are more robust to drift.
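The monitoring and updating strategies above can be tied together with a simple rule that turns a drift score into an action. The sketch below uses the Population Stability Index (PSI) as the score, which is a swapped-in choice rather than something the source prescribes. The `next_action` helper, the income data, and the 0.1 and 0.25 cutoffs (a commonly quoted rule of thumb) are assumptions for illustration.

```python
import math
import random


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index for one numeric feature.

    Bins are equal-width over the reference range, which is assumed
    to be non-degenerate (min(reference) < max(reference)).
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from causing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def next_action(psi_value, warn=0.1, retrain=0.25):
    """Hypothetical policy mapping a PSI score to a monitoring action."""
    if psi_value >= retrain:
        return "retrain"
    if psi_value >= warn:
        return "investigate"
    return "ok"


rng = random.Random(7)
# Synthetic incomes: training data, a similar batch, and a shifted batch.
reference_income = [rng.gauss(50_000, 10_000) for _ in range(2000)]
stable_income = [rng.gauss(50_000, 10_000) for _ in range(2000)]
shifted_income = [rng.gauss(65_000, 10_000) for _ in range(2000)]

for name, batch in [("stable", stable_income), ("shifted", shifted_income)]:
    score = psi(reference_income, batch)
    print(f"{name}: PSI={score:.3f} action={next_action(score)}")
```

A check like this only covers changes in the input data; it would run alongside performance monitoring on labelled outcomes, since concept drift can occur without any change in the input distribution.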

source: https://www.datacamp.com/tutorial/understanding-data-drift-model-drift

Thoughts 🤔 by Soumendra Kumar Sahoo is licensed under CC BY 4.0