1 Introduction

Organizations worldwide are experiencing exponential data growth, and the effective use of such data in analytical workflows presents unprecedented challenges for data engineering functions (Abedjan et al. 2016). The Extraction, Transformation and Loading (ETL) process within data management functions is tasked with validating incoming data input files, verifying file structure, and auditing source formats (Kimball and Ross 2013). While the process ensures that data files are ingested without error, it does not validate the content within a data field. For instance, when multiple sources feed a data field, the scale, unit, or plus-minus sign of a newly added source may differ, yet such discrepancies often go undetected during the ETL process. Derived data fields, such as a score produced by a mathematical model or formula, can easily hide data input issues from standard ETL validations. Such issues are commonly observed first by the downstream users of the model and are denoted as a manifestation of “concept drift”.

Concept drift refers to changes in the distributions and statistical properties of data over time (Gama et al. 2014; Riess 2022). It makes it challenging for machine learning models to accurately project previously learned patterns onto new circumstances, resulting in a degradation of model performance; depending on the application, the consequences can be severe. However, drift adaptation is a gradual process, with a primary emphasis on adjusting to evolving data patterns and implementing corrective actions to the model.

Remedying data issues promptly in a production environment is expensive, and any delay in intervention risks automated analytical functions making decisions based on erroneous data before the issue is detected. This becomes increasingly critical in the context of recent artificial intelligence (AI) applications, where data-driven decision-making occurs with minimal supervision (Polyzotis et al. 2017). Many of these data observability challenges caught the interest of the data management research community only recently. A major issue is that the behavior of AI systems depends on the data ingested, which can change due to errors in upstream data pipelines. As a consequence, algorithmic and system-specific challenges often cannot be disentangled in complex AI applications (Polyzotis et al. 2018).

1.1 Problem statement

This study addresses the problem of detecting drifts in data distributions, that is, divergence within the same data fields (input variables) processed from two different sample populations. We elaborate on this problem using a hypothetical bank example. Bank A obtains a monthly performance data file from a credit bureau for all its credit card holders. One of the fields is a customer behavior model (CBM) score, which helps the bank predict the future payment delinquency of its credit card holders. The bank automates its credit card renewal processes, and the automated policy prevents auto-renewal when the CBM score is below 620. In November 2022, the bank observed a decline in auto-renewal rates from 95 percent to 90 percent. With a million customers renewing every month, this translated to 50,000 credit cards requiring manual renewal. Upon reviewing the input files, analysts noticed an issue with the data file, which is shown in the graphs below.

The graph below shows that the November 2021 CBM score distribution was centered around a mean value of 675; when the November 2022 file was processed, this distribution drifted, with the average score shifting to 600, even though there was no known change in the profile of the credit card holders. Analysts later discovered that the CBM scoring model at the bureau had not accurately processed one of its inputs, pushing more customers into the CBM buckets below 600. Such hard-to-detect ETL data issues are expensive for a bank that relies on automation. In this instance, 50,000 customers experienced automatic renewal denials, necessitating manual review efforts and adversely affecting the overall customer experience. Beyond rolling back the incorrect data in production, the downstream impact of reversing a decision poses operational and reputational risks to the bank. As more organizations embrace AI for automation and decisioning processes, the severity of challenges related to input data problems becomes more pronounced.

Fig. 1 The first graph shows the drift in the most recent month compared with the same month of the previous year. The second histogram arranges the CBM scores by deciles and shows the percentage difference in each bucket. In a data-intensive environment, where data files are processed daily and every file contains hundreds of fields, front-end validations like this are not practical

The efficiency of ETL processes lies in their ability to handle input files of diverse frequencies and sizes. However, these processes lack a built-in mechanism for assessing the variance of content within data fields. The presence of inconsistent data can significantly distort model results, often negating the benefits of AI approaches (Hellerstein 2008). As data continues to grow exponentially and the adoption of black-box machine learning models rises, it becomes crucial to monitor less obvious data issues such as drift, since manual front-end validations are impractical.

The concept of data drift can be traced back to early studies in information theory and statistics, which laid the groundwork for subsequent advances in drift detection, adaptation strategies, and their integration into machine learning frameworks. A seminal paper published in 1951 (Kullback and Leibler 1951), whose measure of divergence between data distributions later became widely known as Kullback-Leibler (KL) divergence, has played a foundational role in many such studies. This paper builds upon this research and proposes a novel method for early drift detection in data ingestion pipelines.

1.2 Objectives of the study

The objective of this study is two-fold: firstly, to present Kullback-Leibler (KL) divergence as a method for detecting drifts early in data distributions, and secondly, to propose a solution addressing the identified drift problem, with the specific aim of enhancing data engineering functions in organizations that have adopted AI in automation and decision-making.

The rest of the paper is organized as follows. In Section 2, we discuss related research and how we identified the gap and formulated our objective. In Section 3, we present the methodology in two components: first, the derivation of the Population Stability Index (PSI) as a variant of Kullback-Leibler divergence, and second, the simulation approach employed to generate data for the application of the PSI technique developed in the preceding section. In Section 4, we present the experiment results and their implications, followed by a concise summary and concluding remarks in Section 5, outlining potential avenues for future research.

2 Related work

The literature we have reviewed in this context can be categorized into three groups. The first set of studies addresses various ETL and non-ETL data issues, including incorrect or inconsistent data, outliers, duplicates, missing values, integrity constraint violations, data validity in model quality, schema evolution, training-serving skew, and overall data management challenges in the context of machine learning model management (Fig. 1).

The common theme among the second set of studies is the exploration of concept drift in machine learning models, particularly in online supervised learning scenarios. These studies delve into adaptive learning processes and strategies to handle evolving data distributions. Additionally, they highlight the utilization of various techniques such as evolutionary algorithms, metaheuristics, and ensemble methods to effectively detect and adapt to concept drift in non-stationary data streams.

The last set of studies focuses specifically on the application of Kullback-Leibler (KL) divergence, or its variants, in various domains to address issues related to data distribution shifts, concept drift detection, and classifier evaluation. These studies use KL divergence as a statistical measure to quantify dissimilarity between probability distributions, enabling the detection of anomalies, the monitoring of system behavior, and the identification of distributional shifts. The following paragraphs summarize these three groups of studies.

Hellerstein (2008) examines data quality challenges in large organizations, particularly focusing on incorrect or inconsistent data. They emphasize data cleaning techniques like outlier detection and exploratory data analysis to effectively address these issues. Abedjan et al. (2016) explore data cleaning for enterprise applications, addressing errors such as outliers, duplicates, missing values, and integrity constraint violations. They stress the importance of using a combination of tools and strategies for comprehensive error coverage. Gudivada et al. (2017) discuss data quality considerations in the context of big data and machine learning, suggesting a reevaluation of traditional approaches. They introduce a data governance-driven framework and highlight tools for managing data quality beyond traditional cleaning and transformations. Polyzotis et al. (2017) tackle data management challenges within machine learning pipelines, focusing on tasks such as comprehending, validating, cleaning, and enriching training data. They emphasize the significance of data validity in model quality and address challenges like schema evolution and training-serving skew. Polyzotis et al. (2018) address data management issues in the context of machine learning model management, covering various aspects from training to deployment and monitoring. They underline the complexity of managing ML models and call for further research on data management challenges specific to ML systems.

Gama et al. (2014) provide a comprehensive examination of concept drift in online supervised learning, detailing adaptive learning processes, categorizing strategies for handling concept drift, and surveying techniques and algorithms. Their review serves as a valuable resource for understanding concept drift adaptation. In contrast, Ghomeshi et al. (2019) focus on addressing concept drift in non-stationary data stream classification by introducing the Evolutionary Algorithm-based Concept Drift (EACD) ensemble method. This approach dynamically resizes its ensemble to detect and adapt to different types of drift, offering superior performance in diverse non-stationary environments compared to existing algorithms. Riess (2022) explores automated adaptation to concept drift in machine learning models, highlighting population-based methods such as Genetic Algorithm and Particle Swarm Optimization. The study identifies challenges in evaluating minority-class performance and in the transparency of real-world data drift characteristics, suggesting future research directions for improved concept drift detection and correction.

Zeng et al. (2014) develop statistics based on KL divergence for monitoring large-scale technical systems. Their study focuses on detecting anomalous system behavior by comparing estimated density functions with reference density functions, particularly for Gaussian distributed process variables. Basterrech and Wozniak (2022) address concept drift in continual learning, introducing Kullback-Leibler divergence for ongoing monitoring of changes in probability distributions in multi-dimensional data streams. Their method, the KL-divergence-based concept drift detector (KLD), offers a fast and robust decision rule to predict and understand concept drift occurrences. Ponti et al. (2017) introduce the decision cognizant Kullback-Leibler divergence (DC-KL) as a measure for evaluating classifier agreement in decision-making systems with multiple classifiers; this research contributes to discerning between classifier congruence and incongruence in pattern recognition systems. Lin (2017) applies a variant of KL divergence called the population stability index (PSI) in financial model validation, aiming to measure distributional shifts between two samples over time. Yurdakul (2018) explores the statistical properties of KL divergence, and specifically of PSI, in scorecard monitoring; that study is a valuable reference on PSI as a distinct case of KL divergence, offering deeper insights into the interpretability of PSI statistics.

The existing literature extensively investigates data management challenges, offering valuable insights into data quality, cleaning, and management. However, there is a noticeable gap in integrating scalable techniques like KL divergence or its variants for drift detection in data ingestion pipelines. While KL divergence and similar algorithms are employed for concept drift detection or front-end model validations, they primarily focus on adjusting to evolving data patterns and are slow to detect data issues. As organizations increasingly adopt AI technologies, there is a pressing need for robust data governance practices to mitigate this risk. This paper aims to address this gap by proposing a scalable drift detection algorithm within data ingestion pipelines that utilizes a variant of KL divergence.

3 Methodology

The selection of Kullback-Leibler (KL) divergence as the evaluation metric in this study is based on the comprehensive review of existing literature, which highlights its significance in addressing data distribution shifts and concept drift detection. Unlike other algorithms that primarily focus on adjusting to evolving data patterns, KL divergence offers a statistical measure to quantify dissimilarity between data distributions, enabling the early detection of anomalies and automated intervention.

3.1 PSI as a variant of KL divergence

Given two probability distributions P (actual) and Q (expected) of a discrete random variable x, \(x \in \{x_1, x_2, \ldots , x_B\}\), KL divergence is defined as:

$$\begin{aligned} D_{KL}(P(x) \,||\, Q(x)) = \sum _{i=1}^{B} P(x_i) \cdot \ln \left( \frac{P(x_i)}{Q(x_i)}\right) \end{aligned}$$
(1)

KL divergence can be interpreted as the expected excess surprise incurred when the expected distribution Q is used in place of the actual distribution P; it quantifies the divergence of the actual from the expected. B is the number of (discrete) buckets of the distribution.
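As a concrete illustration, Eq. (1) can be evaluated directly from two binned distributions. The following is a minimal Python sketch; the function name, the toy distributions, and the smoothing constant are our illustrative assumptions, not part of the paper:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """D_KL(P || Q) over B discrete buckets, per Eq. (1).

    eps guards against empty buckets (log of zero); this smoothing is
    an implementation assumption, not part of the definition.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()  # renormalize after smoothing
    return float(np.sum(p * np.log(p / q)))

# Two toy 3-bucket distributions; note the asymmetry discussed next.
p = [0.7, 0.2, 0.1]
q = [0.3, 0.4, 0.3]
print(kl_divergence(p, q))  # D_KL(P || Q) ~ 0.345
print(kl_divergence(q, p))  # D_KL(Q || P) ~ 0.353, a different value
```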

\(D_{KL}\) measures divergence; however, researchers note that it is not a true distance measure, as its definition is not symmetric. That is:

$$\begin{aligned} D_{KL}(Q(x) \,||\, P(x)) \ne D_{KL}(P(x) \,||\, Q(x)) \end{aligned}$$

A symmetric measure is obtained by defining:

$$\begin{aligned} D(P, Q)= & {} D_{KL}(P \,||\, Q) + D_{KL}(Q \,||\, P) \\= & {} \sum _{i=1}^{B} P(x_i) \ln \left( \frac{P(x_i)}{Q(x_i)}\right) + \sum _{i=1}^{B} Q(x_i) \ln \left( \frac{Q(x_i)}{P(x_i)}\right) \\= & {} \sum _{i=1}^{B} P(x_i) \ln \left( \frac{P(x_i)}{Q(x_i)}\right) - \sum _{i=1}^{B} Q(x_i) \ln \left( \frac{P(x_i)}{Q(x_i)}\right) \\= & {} \sum _{i=1}^{B}(P(x_i) - Q(x_i)) \ln \left( \frac{P(x_i)}{Q(x_i)}\right) \end{aligned}$$

This variant of KL divergence is known as Population Stability Index (PSI) and is widely used in machine learning and model validations as a divergence measure. The following steps will show how to compute PSI using the CBM score data we discussed in the problem statement.

From the derivation above,

$$\begin{aligned} PSI = \sum _{i=1}^{B}(P(x_i) - Q(x_i)) \ln \left( \frac{P(x_i)}{Q(x_i)}\right) \end{aligned}$$
(2)

In the context of the CBM score data distribution, B is the number of bins the CBM account data was grouped into. For example, bin 1 contains the number of accounts with a CBM score between 300 and 400. \(P(x_i)\) is the percentage of accounts in bin i in November 2022; this is the actual data. \(Q(x_i)\) is the percentage of accounts in bin i in November 2021; this is the baseline, or expected, data distribution in that bin. PSI is then calculated as shown in Table 1 below.

Table 1 Calculation of PSI

The PSI calculated in this example is 0.1106. PSI thresholds are used to determine the similarity between the baseline and new samples. A PSI below 0.1 indicates similarity, or no significant drift; a PSI between 0.1 and 0.2 indicates substantial divergence; and \(PSI > 0.2\) indicates a significant shift. However, these are only guidelines, and confidence intervals can differ across distributions.
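The sketch below reproduces this kind of check in Python: it bins a baseline sample (Q) into deciles, bins the new sample (P) with the same edges, evaluates Eq. (2), and applies the guideline thresholds above. The quantile-based bin edges, the smoothing constant, and the standard deviation of the synthetic scores are our assumptions; the means of 675 and 600 follow the CBM example from the problem statement.

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index (Eq. 2) of a new sample (actual, P)
    against a baseline sample (expected, Q).

    Bin edges are quantiles of the baseline, so each bucket holds
    roughly 1/n_bins of the expected data; eps smooths empty buckets.
    """
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values
    q = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    p = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
baseline = rng.normal(675, 50, 100_000)  # Nov 2021 CBM scores (std assumed)
current = rng.normal(600, 50, 100_000)   # Nov 2022 scores after the drift
value = psi(baseline, current)
if value < 0.1:
    print(f"PSI={value:.4f}: no significant drift")
elif value < 0.2:
    print(f"PSI={value:.4f}: substantial divergence - review")
else:
    print(f"PSI={value:.4f}: significant shift - investigate")
```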

In data validation applications, the PSI threshold can be adjusted to capture even minor changes, depending on the risk appetite of the business. Additionally, the computation of PSI can be extended to encompass all data fields that impact downstream AI models. By adopting this approach, comprehensive real-time data validation is ensured before critical decisions are made by these systems.
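A hypothetical sketch of such a gate follows, reusing the `psi` function from the previous listing; the field names, the single global threshold, and the pause-on-drift rule are illustrative assumptions rather than a prescribed design.

```python
import numpy as np

def validate_batch(baseline, incoming, fields, threshold=0.1):
    """Gate an ETL load: return the fields whose PSI breaches the threshold.

    `baseline` and `incoming` map field names to numeric arrays (the
    reference sample and the newly ingested file). Missing values are
    dropped before binning.
    """
    flagged = {}
    for field in fields:
        q = np.asarray(baseline[field], dtype=float)
        p = np.asarray(incoming[field], dtype=float)
        value = psi(q[~np.isnan(q)], p[~np.isnan(p)])
        if value >= threshold:
            flagged[field] = value
    return flagged

# Example: pause ingestion when any monitored field drifts.
# drifted = validate_batch(ref, new, ["ad_response", "sales_volume",
#                                     "deposits", "cbm_score"])
# if drifted:
#     raise RuntimeError(f"Data drift detected, load paused: {drifted}")
```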

Table 2 Base sample (Q) - simulation criteria

3.2 Simulation approach

Simulating data that reflects real-life scenarios is an important step in this study. The advantage of simulation is that we can reproduce all known data issues without having to wait to experience them in a production environment. We can also experiment with the proposed technique and document how it resolves the issues.

To test the similarity of the base (Q) and target (P) distributions, we created the following four scenarios. The base file had four data fields: Advertisement Response, Sales Volume, Deposits, and CBM Score. The sample size, mean, and standard deviation used for each field are summarized in Table 2 below. The assumption of normality is not necessary for PSI calculations, but in real life these data fields tend to be normally distributed around the specified means.

The next step is to simulate the target sample of the same data fields by introducing data issues observed in the real world. The error scenarios applied to the four data series are summarized in Table 3 below.

Table 3 Through the door sample (P) simulation

Ad response Advertisement response is a data series that reports the number of responses to various advertisement campaigns delivered through online advertising platforms such as Google, Facebook, and Twitter. At times, incomplete files may be delivered by these platforms. To mimic data missing from one of the major platforms, 10 percent of the values, selected at random, were set to missing.

Sales volume This field represents the sales transactions of an international luxury car dealer. Prices are unlikely to fluctuate significantly in this segment, so a significant price increase suggests double counting or accounting mistakes during a system migration. The sales transaction price was increased by 10 percent for 50 percent of randomly selected records in the first quartile, shifting many of them into the second quartile of the base sample.

Deposits This field represents the deposit distribution of a major bank across millions of customers. A newly added branch banking system reports the numbers in 1000s instead of actual amounts for 20 percent of randomly selected cases.

CBM score The CBM score is a monthly behavior score of a bank’s credit card holders, refreshed monthly to monitor the health of the portfolio. Due to an input error in the model, 10 percent of the bank’s quartile-four customers show a score drop of 50 points, sending them into lower buckets.
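The four error scenarios above can be reproduced in a few lines of NumPy. The sketch below follows the descriptions in the text; the sample size, means, and standard deviations are illustrative placeholders rather than the exact Table 2 parameters.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000  # placeholder sample size; Table 2 holds the actual criteria

# Base sample Q: four normally distributed fields (parameters assumed).
base = {
    "ad_response": rng.normal(500, 100, N),
    "sales_volume": rng.normal(90_000, 15_000, N),
    "deposits": rng.normal(10_000, 3_000, N),
    "cbm_score": rng.normal(675, 50, N),
}

# Target sample P: a copy of the base with the four errors injected.
target = {k: v.copy() for k, v in base.items()}

# Ad response: 10% of values, chosen at random, set to missing.
target["ad_response"][rng.random(N) < 0.10] = np.nan

# Sales volume: +10% for 50% of the records in the first quartile.
q1 = base["sales_volume"] < np.quantile(base["sales_volume"], 0.25)
target["sales_volume"][q1 & (rng.random(N) < 0.50)] *= 1.10

# Deposits: 20% of cases reported in 1000s (here, divided by 1000).
target["deposits"][rng.random(N) < 0.20] /= 1000.0

# CBM score: 10% of quartile-four customers drop 50 points.
q4 = base["cbm_score"] > np.quantile(base["cbm_score"], 0.75)
target["cbm_score"][q4 & (rng.random(N) < 0.10)] -= 50

# PSI per field (drop missing values first), cf. Tables 4 and 5:
# for f in base:
#     t = target[f]
#     print(f, psi(base[f], t[~np.isnan(t)]))
```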

The simulated base and target (through-the-door, TTD) distributions are plotted below (Figs. 2, 3, 4 and 5) to visualize the divergence between the samples.

Fig. 2 Ad response

4 Experiment results

Tables 4 and 5 summarize the PSI calculations for each data distribution at the decile and demi-decile levels, respectively. For Ad Response, the PSI component values range from 0.00043 (at decile 5) to 0.10097 (at decile 1), with an overall PSI of 0.06724. While the overall PSI suggests no significant drift, the high PSI value at decile 1 indicates a potential anomaly in the data. This observation is supported by the graph for Ad Response. The PSI’s capability to detect such drifts at the component level offers valuable flexibility in implementing a configurable rule that pauses the ETL process for investigation. Furthermore, Table 5 for Ad Response demonstrates that when we expanded the number of bins to 20, the issue was magnified, with the total PSI value reaching 0.1. For Sales Volume, Deposits, and CBM Score, the total PSI values indicate moderate to significant data drift, in line with the graphical representations. Additionally, in all cases, expanding the bins led to increased PSI values, highlighting the sensitivity of PSI to bin size. A comprehensive breakdown of the calculations at both decile and demi-decile levels is provided in the Appendix.

5 Summary and conclusion

As detailed in Section 3, to simulate the data divergence issue within data streams, we chose four baseline data fields: Ad Response, Sales Volume, Deposits, and CBM Score. To introduce realistic variations, we deliberately injected real-life errors, causing distortions in the distributions. Subsequently, we computed the Population Stability Index (PSI) with various bin sizes; the summarized results are presented in Table 6 below.

Fig. 3 Sales volume

The guidelines used are as follows: when PSI is less than 0.1, the distributions are considered similar, showing ‘little drift’. PSI values between 0.1 and 0.25 indicate ‘moderate drift’, which warrants a review. PSI greater than 0.25 suggests significant divergence, or ‘significant drift’, from the baseline distribution, requiring immediate attention.

As expected, PSI effectively detected the distortions introduced into the data fields during the simulations. Ad Response was the only data field that showed a below-threshold value when PSI was measured using deciles. However, when binned into twenty buckets, its PSI was significant at 0.11. Demi-decile binning in general produced higher PSI values.

Fig. 4 Bank deposits

To conclude, PSI provides a straightforward and interpretable metric of distributional shift between two samples over time, making it easy to understand and implement in practical scenarios. Unlike many complex drift detection algorithms, PSI involves simple computations that are easy to implement in SQL while processing data. PSI is also robust to changes in data volume and frequency, allowing effective monitoring of data drift in dynamic environments where data streams vary in size and update frequency. Moreover, with appropriate thresholds, PSI can detect subtle shifts in data distributions early, enabling timely intervention to mitigate potential issues arising from data drift. Overall, the simplicity, robustness, and sensitivity of PSI make it a valuable tool for detecting data drift and maintaining the integrity of analytical workflows in data-driven organizations.

Fig. 5 CBM score

Table 4 Summary of PSI results - PSI at deciles
Table 5 Summary of PSI results - PSI at demi-deciles

Future research The PSI thresholds currently followed are industry best practices borrowed from engineering and modeling applications. The properties of PSI need to be studied in the context of large-volume data engineering applications. The costs of false positives and false negatives differ with the type of data field, so PSI thresholds should be determined based on cost-benefit analysis. Optimal discretization (binning) is another area left to explore in a future study.

Table 6 Summary of PSI results - PSI with various bin sizes