Data and concept drift are frequently mentioned in ML monitoring, but what exactly are they, and how are they detected? Furthermore, given the common misconceptions, are data and concept drift things to be avoided at all costs, or natural and acceptable consequences of running models in production? Read on to find out.
Data Drift
What Is It?
Perhaps the more common of the two is data drift, which refers to any change in the input data distribution after the model has been trained. In other words, data drift occurs when the inputs a model is presented with in production no longer correspond to the distribution it was given during training. This typically presents itself as a change in a feature’s distribution, i.e., specific values for a given feature may become more common in production while other values become less prevalent. For example, consider an e-commerce company serving a customer lifetime value (LTV) prediction model to optimize marketing efforts. A reasonable feature for such a model would be a customer’s age. However, suppose this same company changed its marketing strategy, perhaps by initiating a new campaign targeted at a specific age group. In this scenario, the distribution of ages being fed to the model would likely change, causing a distribution shift in the age feature and perhaps a degradation in the model’s predictive capacity. This would be considered data drift.
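One common way to quantify a shift in a single feature such as age is the Population Stability Index (PSI), which compares the binned distribution seen in training with the one seen in production. The sketch below is a minimal illustration and not a method prescribed by this article: the ages are made up, and the bin edges and the helper names bin_proportions and psi are assumptions chosen for the example.

```python
import bisect
import math

def bin_proportions(values, edges, eps=1e-6):
    """Share of values in each bin; out-of-range values go to the nearest end bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = bisect.bisect_right(edges, v) - 1
        idx = min(max(idx, 0), len(counts) - 1)
        counts[idx] += 1
    # Floor at eps so empty bins do not cause log(0) or division by zero.
    return [max(c / len(values), eps) for c in counts]

def psi(train_values, prod_values, edges):
    """Population Stability Index between two samples of one feature."""
    expected = bin_proportions(train_values, edges)
    actual = bin_proportions(prod_values, edges)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# Illustrative, made-up ages (not real data).
age_edges = [18, 30, 40, 50, 60, 70]
train_ages = [25] * 30 + [35] * 30 + [45] * 20 + [55] * 20
# Hypothetical campaign that brings in many more customers aged 50 to 60.
prod_ages = [25] * 15 + [35] * 15 + [45] * 20 + [55] * 50

age_psi = psi(train_ages, prod_ages, age_edges)
print(f"PSI for age: {age_psi:.3f}")
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a substantial shift, but that threshold is a convention and not a guarantee. Whether a given score matters still depends on how the model performs on the new distribution.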
When Should You Care?
Contrary to popular opinion, not all data drift is bad, nor does it always imply that your model needs retraining. For example, your model in production may encounter more customers in the 50–60 age bracket than it saw during training. However, this does not necessarily mean that the model saw too few 50-to-60-year-olds during training; the distribution of ages already known to the model may simply have shifted. In this case, retraining the model would probably be unnecessary.
However, other cases would demand model retraining. For example, your training dataset may have been small enough that your model didn’t encounter any outliers during training, such as customers over the age of 100. When in production, though, the model might see such customers. In this case, the data drift is problematic, and addressing it is essential. Therefore, having a way to assess and detect the different types of data drift that a model may encounter is critical to getting the best performance.
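The problematic case described above, where production data falls outside anything the model saw in training, can be checked directly. The following is a minimal sketch under stated assumptions: the ages are invented, the function name out_of_range_rate is hypothetical, and using the training minimum and maximum as the boundary is only one simple choice.

```python
def out_of_range_rate(train_values, prod_values):
    """Fraction of production values outside the range seen in training."""
    lo, hi = min(train_values), max(train_values)
    outside = [v for v in prod_values if v < lo or v > hi]
    return len(outside) / len(prod_values), outside

# Illustrative, made-up ages (not real data).
train_ages = [19, 24, 31, 38, 45, 52, 60, 67, 74, 80]
prod_ages = [22, 35, 41, 58, 63, 79, 101, 104]

rate, unseen = out_of_range_rate(train_ages, prod_ages)
print(f"{rate:.0%} of production ages were never seen in training: {unseen}")
```

Unlike a shift within the known range, a nonzero rate here means the model is extrapolating, which is the situation where retraining or at least a closer look is more likely to be warranted.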
Concept Drift
What Is It?
Concept drift refers to a change in the relationship between a model’s inputs and its target variable. This can happen when changes in market dynamics, customer behavior, or demographics result in new relationships between inputs and targets that degrade your model’s predictions. The key to differentiating concept drift from data drift is the targets: data drift applies only when your model encounters new, unseen, or shifting input data. In contrast, concept drift occurs when the fundamental relationship between inputs and outputs changes, including on data that the model has already seen. Going back to our example of the LTV prediction model, suppose a country-wide economic shift happens in which customers of a certain age group suddenly have more money to spend, resulting in more purchases of your business’s product within this demographic. This is what happened during the Covid-19 pandemic, when US government-issued stimulus checks fell into the hands of millions of underemployed millennials throughout the country. The number of millennials interacting with your model wouldn’t necessarily change, but the amount they would spend on purchases would. Detecting this concept drift and retraining the model would be vital to maintaining its performance.
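Because concept drift leaves the inputs unchanged while the targets move, one way to surface it is to compare a frozen model’s residuals (actual minus predicted) across two time windows. The sketch below assumes that true outcomes eventually become available. The model frozen_model, the spend figures, and the age band used for the uplift are all hypothetical and serve only to illustrate the idea.

```python
from statistics import mean

def frozen_model(age):
    # Hypothetical stand-in for an already-trained LTV model.
    return 10.0 * age

def mean_residual(rows):
    """Average of (actual - predicted) over (age, actual_spend) rows."""
    return mean(actual - frozen_model(age) for age, actual in rows)

# Illustrative, made-up data: the same customers (same ages) in both windows.
ages = [22, 28, 34, 41, 47, 55, 63]
before = [(a, 10.0 * a + (5 if i % 2 else -5)) for i, a in enumerate(ages)]
# Assumed shock: customers aged 25 to 40 now spend 150 more than before.
after = [(a, 10.0 * a + (150 if 25 <= a <= 40 else 0)) for a in ages]

print(f"mean residual before: {mean_residual(before):.2f}")
print(f"mean residual after:  {mean_residual(after):.2f}")
```

The age distribution is identical in both windows, so a feature-level drift check would report nothing, yet the residuals move away from zero. That combination is the signature of concept drift as described above.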
When Should You Care?
In some sense, you should always care about concept drift, at least to be aware that it has happened. Because concept drift refers to an underlying shift in the relationship between inputs and targets, the model must be retrained to capture the new correspondences. However, you will only want to retrain the model if the relationships you’re aiming to capture are still relevant to your downstream business KPIs. While this will often be the case, it is not guaranteed. For example, your business model might shift such that you decide you care more about the amount of time customers spend on your website (so that you can increase ad revenue) than the amount of money they spend on your actual products (which may have been small to begin with). You’d probably want to train an entirely different model in such a circumstance, so concept drift in the original model would no longer be a concern.
Tips for Monitoring Both Types of Drift
What Not to Do
As our previous examples have illustrated, simply being alerted to the presence of data or concept drift is not sufficient. A deeper understanding of how shifts in the data distribution, or in the relationships between inputs and targets, affect model performance and downstream business KPIs is critical to addressing drift in the proper context. Unfortunately, many tools fail because they only alert data scientists to changes in the overall data distribution. Changes to smaller, specific data segments often foreshadow more drastic distributional shifts. The key to successfully addressing drift is being alerted to these subtler, earlier shifts and attending to them promptly. By the time a drift significant enough to detect in the overall distribution has occurred, the problem has usually already manifested itself in multiple areas and significantly degraded model performance on substantial amounts of data. At that point, remedying the issue becomes a game of catch-up in which you are always one step behind, with data flowing through your system that your model was not properly trained to handle.
What You Should Do Instead
The proper way of addressing data and concept drift is to create a feedback loop within your business process and monitor your model in the context of the business function it serves. You want to decide on actual, quantifiable performance metrics that allow you to rapidly assess how your model is performing at any instant, thereby enabling you to understand whether changes in the data distribution correlate with a decrease in performance. Ultimately, this will allow you to connect input features to actual business outcomes and learn when the underlying concept has shifted. If it has, you can then understand it in context and decide whether it’s worth taking steps to address it.
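As a rough illustration of such a feedback loop, the sketch below tracks a per-week drift score for one feature alongside a per-week error metric and checks how strongly the two move together. The weekly numbers are invented for the example, and the choice of Pearson correlation, like the names weekly_drift and weekly_error, is an assumption and not something this article prescribes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Illustrative, made-up weekly values (not real measurements).
weekly_drift = [0.02, 0.03, 0.05, 0.12, 0.20, 0.31]  # drift score for one feature
weekly_error = [10.1, 10.0, 10.4, 11.9, 13.5, 15.8]  # model error on realized outcomes

r = pearson(weekly_drift, weekly_error)
print(f"correlation between drift and error: {r:.2f}")
```

A high correlation suggests that drift in this feature is worth investigating, while a drift score that rises without any change in error may be the benign kind of shift. Correlation alone does not establish cause, so it is best read as a prompt for a closer look.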
Finally, you want to ensure that you’re measuring changes to your data on a granular level. Within machine learning, forsaking the trees for the forest can let errors manifest in problematic ways. Having a good understanding of your model’s performance requires being attuned to specific data segments. These are often the first to show issues before they propagate to the entire distribution. Continuing with our LTV model example, if customers in a smaller state, such as Rhode Island, were the first to receive their stimulus checks, this might not be a significant enough shift to register across the overall distribution. However, knowing about this change could alert you that more global shifts in the data distribution were forthcoming (i.e., other states would soon be issuing stimulus checks). Thus, detecting changes in data at the granular level is extremely important for the early identification of data and concept drift and for squeezing the best performance from your models.
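The Rhode Island scenario can be sketched by comparing the shift in average spend per segment with the shift across all customers. The data below is invented, and the segment labels, row counts, and helper names are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean

def mean_by_segment(rows):
    """Average value per segment for (segment, value) rows."""
    groups = defaultdict(list)
    for segment, value in rows:
        groups[segment].append(value)
    return {segment: mean(values) for segment, values in groups.items()}

def relative_shifts(reference, current):
    """Relative change in the mean for each segment present in both windows."""
    ref, cur = mean_by_segment(reference), mean_by_segment(current)
    return {s: (cur[s] - ref[s]) / ref[s] for s in ref if s in cur}

# Illustrative, made-up spend per customer: one large state, one small state.
reference = [("CA", 100.0)] * 1000 + [("RI", 100.0)] * 10
current = [("CA", 100.0)] * 1000 + [("RI", 160.0)] * 10  # only RI has changed

per_state = relative_shifts(reference, current)
overall = relative_shifts(
    [("ALL", v) for _, v in reference],
    [("ALL", v) for _, v in current],
)["ALL"]

print(f"overall shift: {overall:.1%}")
print({state: f"{shift:.0%}" for state, shift in per_state.items()})
```

With these numbers, the overall average barely moves because the small segment is swamped by the large one, while the per-segment view shows a large change in Rhode Island. That is the early warning the paragraph above describes.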
Data drift and concept drift can both cause a model to stop performing as intended due to changes in data; however, they manifest for different reasons. Data drift arises when there is a shift in the input data distribution between training a model and serving it in production. In these cases, the shift may be inconsequential or may require model retraining, depending on how well the model generalizes to the new distribution. On the other hand, concept drift occurs when the underlying function mapping inputs to targets changes. In these cases, model retraining is nearly always required to capture the new relationships, assuming that these relationships are still relevant to your downstream business KPIs. Ultimately, you want to establish a feedback loop between business outcomes and data features to detect data and concept drift. You should also define robust performance metrics based on these outcomes, assessing how well your model is doing and correlating this with specific features. And finally, you want to make sure that you are monitoring changes to your data at a granular level so that you are alerted to shifts in the distribution before they propagate and affect the entire dataset.