Exploring uplift modeling with high class imbalance

Nyberg, Otto; Klami, Arto

doi:10.1007/s10618-023-00917-9

Exploring uplift modeling with high class imbalance

Open access
Published: 23 January 2023

Volume 37, pages 736–766, (2023)
Cite this article

Download PDF

You have full access to this open access article

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Exploring uplift modeling with high class imbalance

Download PDF

3568 Accesses
4 Citations
1 Altmetric
Explore all metrics

Abstract

Uplift modeling refers to individual level causal inference. Existing research on the topic ignores one prevalent and important aspect: high class imbalance. For instance in online environments uplift modeling is used to optimally target ads and discounts, but very few users ever end up clicking an ad or buying. One common approach to deal with imbalance in classification is by undersampling the dataset. In this work, we show how undersampling can be extended to uplift modeling. We propose four undersampling methods for uplift modeling. We compare the proposed methods empirically and show when some methods have a tendency to break down. One key observation is that accounting for the imbalance is particularly important for uplift random forests, which explains the poor performance of the model in earlier works. Undersampling is also crucial for class-variable transformation based models.

ImbalancedLearningRegression - A Python Package to Tackle the Imbalanced Regression Problem

Uplift Modeling

Underlying Objective

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

High class imbalance is prevalent in e-commerce where conversion rates are typically in the range of 0.1–5% (Diemert et al. 2018; Richardson et al. 2007). The rate depends on whether the conversion event is e.g. a click, a visit, or a purchase, with the more valuable events (the purchases) being at the lower end of this spectrum. High class imbalance makes modeling difficult as observations contribute to a cost function in proportion to their number, resulting in the cost function being easily minimized when an algorithm largely ignores the minority class. A common way to deal with this problem in classification tasks is by undersampling where observations from the majority class are dropped to better balance the number of positive and negative observations. However, models trained on undersampled data are not well-calibrated in the original task. Usually this is not a problem as a simple threshold is enough to decide which class an observation belongs to, and if needed the model can be calibrated afterwards using a calibration method (e.g. Zadrozny and Elkan 2002). However, these techniques do not directly translate to uplift modeling.

Uplift modeling is the art of modeling the causal effect of some treatment on individual observations (Rzepakowski and Jaroszewicz 2010). More formally it is defined as the difference between two probabilities

$$\begin{aligned} \tau (x) = p(y=1 \vert x, do(t=1)) - p(y=1 \vert x, do(t=0)) \end{aligned}$$

(1)

where x are the features of an observation, y is the class-label, t is the treatment label where $t=1$ indicates treatment and $t=0$ no treatment, and do(.) is the do-operator (Pearl 2009).

As uplift is the difference between two probabilities, we need to be more careful in accounting for the distortion of the probabilities caused by undersampling. Recently Nyberg et al. (2021) proposed a method relying on stratified undersampling for uplift estimation, but that solution relies on simplifying assumptions that are not always valid. This paper explores different undersampling and calibration methods in the context of uplift modeling, and in particular, proposes the first general method applicable for improving uplift estimates for tasks with high class imbalance that makes no simplifying assumptions on the dataset (in addition to the assumptions made by uplift modeling). We present four undersampling methods (three of which are novel) and three new calibration methods (one of which is novel) with theoretical foundations, and empirically evaluate them on the largest available uplift datasets that exhibit high class imbalance as well as on one synthetic dataset. We also demonstrate how the results depend on data characteristics, such as the amount of imbalance and the base conversion rate, and illustrate when specific methods can fail.

The main finding is that high class imbalance can be effectively addressed with undersampling. All tested uplift models improved when the class imbalance was accounted for when the datasets were large enough, and for some methods the effect is dramatic. Methods based on the class-variable transformation (Jaskowski and Jaroszewicz 2012; Lai 2006) do not work at all without undersampling but become competitive when the imbalance is corrected for, and for uplift random forests (Guelman et al. 2015) we observe a 50–60% improvement in the standard performance metric with undersampling. In some previous comparisons, such as Fernández-Loría and Provost (2022), Semenova and Temirkaeva (2019) and Belbahri et al. (2021), random forest -based methods have performed poorly and we postulate that accounting for the imbalance by undersampling would have changed some of the conclusions of these works. Another interesting observation is that we were able to reliably estimate uplift when there were as few as 200 positive observations in the minority class. Even though the exact number required naturally depends on the specific case, our result is encouraging in terms of practical applicability of uplift modeling for industries dealing with problems with extremely small conversion rates and moderate amounts of data.

2 Related work

We build on two streams of previous work: class imbalance and uplift models. This section provides the necessary background on both aspects for understanding the rest of the paper.

2.1 Class imbalance

The term “imbalance” has been used to refer to multiple different aspects in uplift modeling: Olaya et al. (2020) used it to describe an imbalance in treatment policy usually referred to as “confounding effects” (Austin 2011), and Betlei et al. (2018) used it to describe a setting where there is a large difference in the number of treated and untreated observations. In contrast, this paper deals with the imbalance in class labels (the outcomes), following the terminology of Nyberg et al. (2021). This problem has been thoroughly studied in the context of classification and is commonly referred to as “class imbalance” (Kaur et al. 2019), but remains understudied in uplift modeling.

There are two main techniques for dealing with high class imbalance in classification: weighting and sampling, including oversamping, undersampling, and synthetic sampling. Moreover, oversampling and undersampling are sometimes combined (Chawla et al. 2002). In weighting, the minority class observations are given a larger weight in the cost function to ensure that the algorithm will account for them appropriately, whereas in oversampling the observations of the minority class are resampled so that there are multiple copies of them. Synthetic sampling generates new unique observations based on the properties of existing observations (Chawla et al. 2002). Undersampling, in turn, refers to techniques that discard some of the observations in the majority class(es).

Even though both weighting and sampling could potentially be used in the context of uplift modeling, we specifically focus on undersampling because it maps so elegantly to the typical use cases. Especially in e-commerce, one can easily collect a large number of negative observations and e.g. datasets used in this paper contain more than ten million observations. By undersampling the negative observations we can reduce the size of the training dataset and hence also the computation time. In contrast, oversampling in these cases would result in extremely large training sets.

2.2 Uplift models

In this work we consider only learning scenarios where the data has been collected in a randomized trial so that the treated and untreated observations come from the same underlying distribution p(x) and the choice of the treatment has been made independently of x. With this assumption, the do-notation in Eq. (1) simplifies to conditioning on t. There are also models that do not require this assumption, that work on observational data (e.g. Johansson et al. 2016) or on data where the treatment policy is known (e.g. Austin 2011), but these include other assumptions that are often hard to verify and including these in the experiments would complicate matters without bringing additional value.

Uplift modeling is an active research topic and numerous different principles and practical models have been proposed; see Gubela et al. (2020) for a recent overview. Our interest is in studying specifically the undersampling process as a means of accounting for high class imbalance, largely in a method-agnostic manner. Basic understanding of the modeling approaches is needed e.g. to understand which undersampling methods are compatible with what models, but we leave out the technical details of the learning algorithms as they are not central for our work. We will evaluate the different undersampling approaches in the context of a few models, selected as representative examples of popular methods belonging to different families, and these particular methods are explained briefly next.

The double classifier by Radcliffe and Surry (1999) is a classic model motivated directly by Eq. (1). It is a type of T-learner (Künzel et al. 2019) as it uses two classification models, one to model $p(y \vert x,do(t=1))$ and another to model $p(y \vert x,do(t=0))$ and simply estimates the uplift by computing the difference between the two probabilities. We use the double classifier with logistic regression as base classifier, denoting the model by DC-LR. Even though DC-LR has historically received little attention, in part due to critique provided by Radcliffe and Surry (1999) and Guelman et al. (2015), it performed best in a relatively recent comparison by Semenova and Temirkaeva (2019).

The class-variable transformation (CVT) was proposed by Jaskowski and Jaroszewicz (2012) and Lai (2006). In CVT, the outcome variable y and treatment label t are used to create a new variable z so that $z_i=1$ when $y_i=1$ and $t_i=1$, or when $y_i=0$ and $t_i=0$. Otherwise $z_i=0$. With this transformation, uplift becomes $\tau (x) = 2 \cdot p(z \vert x) - 1$, i.e. the uplift problem is transformed into a classification problem. This way the uplift problem can be solved with one classifier rather than two. CVT with logistic regression (CVT-LR) performed best in the comparison of Nyberg et al. (2021) on one of the datasets we will use in our evaluation.

A somewhat related approach is the revert label (RL) proposed by Athey and Imbens (2015). A similar class-transformation is performed so that the new variable is defined as

$$\begin{aligned} r_i = \frac{t_i \cdot y_i}{\pi (x_i)} - (1-t_i) \frac{y_i}{1-\pi (x_i)} \end{aligned}$$

(2)

where $\pi (x_i)$ is the propensity score (the probability that an observation of type $x_i$ was treated). When the training data is collected in a randomized controlled trial, this is assumed to be a constant value for all $x_i$. The big difference between CVT and RL is that while the former transforms the learning problem into a classification problem, the latter transforms it into a regression problem. The r in RL takes at least three values and the uplift is the expectation of r. As the expectation is continuous, this is best treated as a regression problem. This is also the same formulation later proposed by Rudaś and Jaroszewicz (2018). As a practical method building on the RL concept, a neural network that minimizes the mean-squared error between the revert label and the output similarly to Belbahri et al. (2021) is included in the experiments. It can be shown that this formulation is equivalent to the one presented by Gutierrez and Gérardy (2017) where they showed that it is possible to minimize the mean-squared error between the uplift estimate and the actual unobservable uplift.

Another interesting family of models are uplift random forests. The forest proposed by Guelman et al. (2015) was included in the experiments instead of e.g. the causal random forest by (Wager and Athey 2018) as the former is better suited for binary class labels. In contrast to all the previous models, trees and forests try to directly model what makes an observation susceptible to influence. This is accomplished by applying a splitting criterion that maximizes heterogeneity in the resulting leafs, i.e. that results in leafs where the treated observations have as different positive rate as possible from untreated observations (given some constraints on leaf size etc.). Despite their recent popularity, the empirical evidence has not been entirely convincing.

The experiments in this work consider only the four models described above, each representing a common family of uplift models. The undersampling methods can also be used with various other uplift models, such as the S- and X-learners (Künzel et al. 2019) and the model proposed by Lo (2002). These models would be compatible with (some of) the proposed undersampling methods, but we leave their evaluation as future work to keep the empirical experiments manageable.

3 Methods

The main goal of our work is to establish best practices for addressing high class imbalance in uplift modeling problems using undersampling as the technical solution. As mentioned earlier, undersampling has a long history in classification problems but our recent preliminary work Nyberg et al. (2021) remains thus far the only investigation into the problem in uplift modeling. In this section we further develop the initial ideas in that work to a comprehensive formulation. We start by defining the basic concepts and notation used for addressing probabilities estimated from undersampled data, and present four alternative undersampling strategies for uplift problems, three of which are novel and one of which was previously presented in Nyberg et al. (2021). The methods differ in terms of which observations are discarded and at what rate. We then explain three methods for calibration of uplift estimates in Sect. 3.3.

3.1 The undersampling process

Undersampling refers to dropping randomly selected observations of the majority class to better balance the ratio between treated and untreated observations. For all of the proposed methods, we always keep all of the positive observations and drop some of the negative observations and all formulas in this paper are formulated assuming that $y=0$ is the majority class. This way, the positive class $y=1$ will have larger prevalence in the undersampled data. We define undersampling using a factor k so that

$$\begin{aligned} p^*(y=1) = k \cdot p(y=1) \end{aligned}$$

(3)

where $p(y=1)$ denotes the probability of positive observations before undersampling and $p^*(y=1)$ is the corresponding probability after undersampling.^{Footnote 1} Here $p(y=1)$ is estimated from data and equals the fraction of positive observations. That is, k tells how much the probability of positive observations increases because of the undersampling. To improve the balance we need to have $k \ge 1$ (with equality corresponding to no undersampling) but additionally the factor has a natural upper bound $k < \frac{1}{p(y=1)}$. This corresponds to dropping all negative observations.

In practical terms, the undersampling is carried out by looping over the negative observations and independently keeping each one with the probability

$$\begin{aligned} s = \frac{1/k - p(y=1)}{1 - p(y=1)}. \end{aligned}$$

(4)

We have chosen to formulate the undersampling process using the factor k, rather than the probability s, for several reasons: (a) it is directly interpretable as the change of probability (Eq. (3)), (b) it leads to more clear and concise equations for the stratified undersampling procedure introduced later in Sect. 3.2.3, and (c) it leads naturally to one calibration method (Sect. 3.3.2).

The factor k defines the average change. The uplift as defined in Eq. (1), however, depends on the conditional probabilities. Their distortion is characterized by

$$\begin{aligned} p^*(y=1 \vert x) = \frac{p(y=1 \vert x)}{p(y=1 \vert x) + s \cdot (1 - p(y=1 \vert x))}. \end{aligned}$$

(5)

This follows directly from the undersampling process that reduces the proportion of negative observations, indicated by $(1-p(y=1 \vert x))$, by a factor of s while keeping all positive observations. Since s is in the denominator, this distortion is non-linear in terms of the probability $p(y=1 \vert x)$. This means that the quantities needed for estimating uplift change because of the undersampling and this change needs to be accounted for to obtain unbiased estimates. When $p(y=1 \vert x)$ is small the relationship is approximately linear and corresponds to multiplication with k as in the average case, but for larger probabilities this does not hold.

Figure 1 illustrates the effect of undersampling. Here we assume the probabilities $p(y=1 \vert x)$ are estimated using maximum likelihood (ratio of positive and negative observations) in local neighborhoods of x, with the square indicating one such neighborhood. When we set $k=2$ for this data with high class imbalance ($p(y=1)=0.0083$), we keep negative observations with probability $s=0.4958$ (Eq. (4)). Since the true probability is small, we have $s \approx \frac{1}{k}$. In the local neighborhood indicated by the square, the proportion of positive observations approximately doubles when dropping approximately half of the negative observations. However, if our data had a local neighborhood with high probability $p(y=1 \vert x)$ this would not be the case. For example, for $p(y=1 \vert x)=0.3$ we would get $p^*(y=1 \vert x) \approx 0.46$, corresponding to slightly more than $50\%$ increase in the positive rate. When k is large, this non-linear distortion becomes significant also for smaller probabilities.

3.2 Undersampling for uplift modeling

The equations above hold for any undersampling method that drops negative observations. Next, we present four different undersampling methods that can be used for improving class balance in uplift modeling. The methods differ in terms of what rate treated and untreated negative observations are discarded. To indicate this we introduce additional notation where the undersampling parameters k and s are replaced by $k_{t=1}$, $k_{t=0}$, $s_{t=1}$, and $s_{t=0}$ as needed to indicate when the undersampling is applied only to treated or untreated observations.

3.2.1 Undersampling for classification

The double classifier method (Radcliffe and Surry 1999) directly trains two models for treated and untreated observations separately, and hence standard undersampling for classification can be used to improve accuracy of these models independently. That is, we can separately perform undersampling for the treated and untreated samples, always dropping only negative observations. More formally, this is defined as

$$\begin{aligned} \begin{aligned}&p^*(y=1 \vert t=1) = k_{t=1} \cdot p(y=1 \vert t=1) \\&p^*(y=1 \vert t=0) = k_{t=0} \cdot p(y=1 \vert t=0) \end{aligned} \end{aligned}$$

(6)

where typically $k_{t=1} \ne k_{t=0}$ since the positive rates and hence the severity of class imbalance differs in the treated and untreated observations. The factors $k_{t=1}$ and $k_{t=0}$ are chosen independently using hold-out validation on the validation set (see Sect. 4.2) and a measure of classification performance, e.g. AUC-ROC. The model estimating $p^*(y=1 \vert t=1)$ is evaluated on the treated observations in the validation set and the model estimating $p^*(y=1 \vert t=0)$ on the untreated observations in the validation set.

As the undersampling process distorted the probabilities, the scores output by the classifiers will not correspond to true probabilities. This distortion needs to be corrected. For classifiers this can be done by calibration, the process of mapping scores to empirical estimates, with several practical methods like isotonic regression (Zadrozny and Elkan 2002), Bayesian binning into quantiles (Naeini et al. 2015), or platt-scaling (Platt 1999) available. We used isotonic regression. After calibration, we can estimate the uplift using Eq. (1) directly. The obvious drawback of this conceptually simple strategy is that it is only compatible with the double classifier approach.

3.2.2 Naive undersampling

This is a method where negative observations are dropped with equal probability regardless of whether they are treated or untreated. This corresponds to naively doing undersampling as it has been done for classification without accounting for the differences between treated and untreated observations.

The treated and untreated observations typically have different average positive rate resulting in different severity of class imbalance. In addition, as the treated and untreated observations typically come from different underlying distributions, the optimal undersampling rate will differ. Naive undersampling ignores this and is implicitly based on the assumption that the underlying distributions and the severity of class imbalance is similar in both treated and untreated observations. We define it using

$$\begin{aligned} p^*(y=1) = k \cdot p(y=1), \end{aligned}$$

(7)

and the undersampling is carried out using a single s derived using Eq. (4). The parameter k is found using hold-out validation. In contrast to the previous method, we now need to use an uplift evaluation metric for selecting the optimal parameter. We used AUUC that is also used as the main evaluation metric for uplift methods (see Sect. 4.1).

This approach is conceptually simple, compatible with all uplift models and only requires choosing one undersampling factor. However, it is biased whenever $p(y=1 \vert t=1) \ne p(y=1 \vert t=0)$, as will be explained in more detail in the next subsection. Nevertheless, it can still improve the performance in some cases as will be shown in Sect. 4.

3.2.3 Stratified undersampling

Stratified undersampling was presented in our preliminary work Nyberg et al. (2021). Similar to naive undersampling, it drops both treated and untreated majority class observations using one common factor k so that

$$\begin{aligned} \begin{aligned}&p^*(y=1 \vert t=1) = k \cdot p(y=1\vert t=1) \\&p^*(y=1 \vert t=0) = k \cdot p(y=1\vert t=0). \end{aligned} \end{aligned}$$

(8)

In contrast to the naive undersampling, however, we now use different s for the two groups: We compute $s_{t=1}$ and $s_{t=0}$ separately for the two populations using Eq. (4), now using the group-conditional probabilities $p(y=1 \vert t=1)$ and $p(y=1 \vert t=0)$ instead of the overall rate.

As indicated by Eq. (5), the undersampling process changes the probabilities in a non-linear manner. However, if both $p(y=1 \vert x, t=1)$ and $p(y=1 \vert x, t=0)$ are sufficiently small for all x, then the change is approximately linear and we have $p^*(y=1 \vert x, t=1) \approx k \cdot p(y=1 \vert x, t=1)$ and $p^*(y=1 \vert x, t=0) \approx k \cdot p(y=1 \vert x, t=0)$. Then the uplift $\tau (x)$ will also be approximately linear in k so that $\tau ^*(x) \approx k \cdot \tau (x)$. Nyberg et al. (2021) explicitly relied on this linearity assumption.

In the rare case when $p(y=1 \vert t=1) = p(y=1 \vert t=0)$, stratified undersampling is equivalent to naive undersampling. To better understand the difference when this is not the case, we can convert the common s used in naive undersampling back to two separate factors $k_{t=1}$ and $k_{t=0}$ (using inverse of Eq. (4)). This means the naive undersampling corresponds to using different undersampling factors despite using a common s, and consequently we no longer have clear linear relation for the uplifts as both terms are modified by different factors.

3.2.4 Split undersampling

The most comprehensive undersampling method we consider is split undersampling which undersamples the treated and untreated observations with different factors $k_{t=1}$ and $k_{t=0}$. The equations are then

$$\begin{aligned} \begin{aligned}&p^*(y=1 \vert t=1) = k_{t=1} \cdot p(y=1 \vert t=1) \\&p^*(y=1 \vert t=0) = k_{t=0} \cdot p(y=1 \vert t=0), \end{aligned} \end{aligned}$$

(9)

This is equivalent to the equations of undersampling for classification, but now factors $k_{t=1}$ and $k_{t=0}$ are chosen jointly. That is, an uplift model is now trained on the undersampled dataset, and the combination of factors $k_{t=1}$ and $k_{t=0}$ is evaluated using hold-out validation and an uplift metric. We again used AUUC as the criterion.

This approach is general in the sense that it puts no assumptions on the positive rates or conditional probabilities present in the data. It is also general in that it includes both stratified and naive undersampling as special cases. We obtain the former when $k_{t=1}=k_{t=0}$ and the latter when $k_{t=1}=\frac{1}{s_{t=0} \cdot (1-p(y=1 \vert t=1)) + p(y=1 \vert t=1)}$. These equalities are the result of these two methods aiming to control the distortion in probabilities so that they are easily manageable. In contrast, split undersampling requires no such dependence between $k_{t=1}$ and $k_{t=0}$. As the treated and untreated observations usually have different positive rates, and hence severity of class imbalance, the optimal undersampling parameters to deal with the class imbalance will also usually be different. Hence split undersampling has the potential to find undersampling parameters that better fit the problem.

As a consequence of freely choosing $k_{t=1}$ and $k_{t=0}$, the uplift estimates produced by the model trained on the undersampled data will no longer produce well-ranked predictions. The predictions might even have the wrong sign. This will need special attention later in the calibration step. We also note that even though we only consider binary uplift problems in this work, the split undersampling method directly generalizes to multi-class uplift problems (Papangelou 2021) and provides the first solution for addressing class imbalance for these. This will be elaborated on later in Sect. 3.3.3 when discussing calibration of split undersampling estimates.

3.3 Calibration methods

All of the undersampling methods distort the probabilities in a non-linear way (Eq. (5)). When rank alone is sufficient for the intended use (Verbeke et al. 2012; Devriendt et al. 2021; Gubela et al. 2020), both naive and stratified undersampling will produce adequate results without calibration. However, this is not the case for undersampling for classification and split undersampling. These two methods distort the probabilities with and without treatment so that the difference between these, the uplift estimates, will not be ranked in a meaningful way. This is further dealt with in Sect. 3.3.3.

Sometimes calibrated uplift estimates are needed for downstream processing. E.g. in the case of using free delivery as treatment in an online store, both a calibrated uplift estimate $\tau (x)$ and a calibrated probability estimate for $p(y=1\vert x, t=1)$ are needed for optimal targeting. Then the treatment should only be applied if $\tau (x) \cdot profit \ge cost \cdot p(y=1 \vert x, t=1)$, where profit refers to the profit of the sale excluding delivery costs and cost refers to the cost of delivery. This is discussed in more detail by Haupt and Lessmann (2020).

In the experiments, we calibrated all uplift estimates. With undersampling for classification the calibration is applied after model training but before combining the two models to an uplift model. For the rest, the calibration can be performed as a separate post-processing step using the methods described next.

3.3.1 Isotonic regression and $\tau $-isotonic regression

Isotonic regression produces a function g(s) that minimizes $\sum _i(g(s_i) - y_i)^2$ under a monotonicity constraint so that $g(s_i) \le g(s_j)$ for $s_i < s_j$. When y is binary and $s_i$ and $s_j$ are scores outputted by some classification algorithm, this becomes a calibration algorithm. This is commonly used as a post-processing step to transform the outputs to well-calibrated probabilities (Zadrozny and Elkan 2002). Isotonic regression is in this form used in this paper in undersampling for classification (Sect. 3.2.1).

Nyberg et al. (2021) extended calibration with isotonic regression to uplift modeling. We call this $\tau $-isotonic regression to separate it from isotonic regression for calibration. In the revert-label formulation ${\mathbb {E}}(r \vert x) = \tau (x)$ (Athey and Imbens 2015), hence by replacing $y_i$ with the revert-label $r_i$, g(s) becomes an estimator for uplift. Using $\tau $-isotonic regression will ensure that the uplift estimates will match empirical estimates. In the experiments, this calibration method is used together with naive undersampling to correct for the distortion introduced by undersampling. The method itself places no requirements on the uplift model or the scores, but it enforces monotonicity in the estimates.

3.3.2 Renormalization

Renormalization is a calibration method specifically for calibrating estimates obtained with stratified undersampling. For that case both of the probabilities $p^*(y=1 \vert x, t=1)$ and $p^*(y=1 \vert x, t=0)$ estimated from undersampled data are approximately k times as large as the actual probabilities, and consequently so are the uplift estimates $\tau ^*(x)$. This distortion can be corrected easily with division by k, thus renormalizing the estimate. This correction is only applicable for stratified undersampling as it relies on use of equal k factors, and as explained in detail by Nyberg et al. (2021) it is accurate only when the conversion rates are small. For larger rates the distortions are no longer sufficiently linear.

3.3.3 Local neighborhood calibration

Local neighborhood calibration uses two input probabilities to produce one calibrated uplift estimate. Using two input probabilities enables the calibration method to change the rank of uplift estimates between observations. This is something that cannot be accomplished by $\tau $-isotonic regression or renormalization and it is necessary to correct for the distortions introduced by split undersampling. This calibration method also extends to multi-class problems. Denoting the probability that an observation of class j is kept after undersampling by $s_{y=j, t}$, the probability of observations of that class in some local neighborhood of x is

$$\begin{aligned} p^*(y=j \vert x, t) = \frac{s_{y=j, t} \cdot p(y=j \vert x, t)}{\sum _{l\in J} s_{y=l,t} \cdot p(y=l \vert x, t)}. \end{aligned}$$

(10)

No assumptions are made as to whether there is one class that is in majority in the multi-class case, hence the class for $s_{y, t}$ is explicitly specified. Elsewhere in the paper the majority class is assumed to be the negative class and the y is dropped from the notation. By rearranging, the original probabilities before undersampling can be calculated by solving the system of equations

$$\begin{aligned} {\left\{ \begin{array}{ll} &{} p(y=j \vert x, t) = \frac{p^*(y=j \vert x,t)}{s_{y=j, t} \cdot (1 - p^*(y=j \vert x, t))} \cdot \sum _{l \in J, l \ne j}s_{y=l, t} \cdot p(y=l \vert x, t) \\ &{} \sum _{l \in J} p(y=l \vert x, t) = 1. \end{array}\right. } \end{aligned}$$

(11)

Setting $J = \{0, 1\}$ corresponds to the case with a binary class variable. Then the notation can be simplified so that $j=0 \Rightarrow s_{y=j, t}=s_t$ (Eq. (4)) and $j=1 \Rightarrow s_{y=j, t}=1$ as all of the positive observations are kept. Solving the system of equations then results in the maximum likelihood estimate (see “Appendix 1” for details)

$$\begin{aligned} p(y=1 \vert x, t) = \frac{s_t \cdot p^*(y=1 \vert x, t)}{1 - p^*(y=1 \vert x, t) \cdot (1 - s_t)}. \end{aligned}$$

(12)

Assuming that the output of a model approximates $p^*(y=1 \vert x, t)$, the distortion introduced by undersampling can be corrected using the equation above. Note that the equations cover the calibration of one probability. This calibration needs to be done separately for the conversion probability with $t=1$ and $t=0$ with appropriate parameters. Only then can a corrected uplift estimate be calculated as the difference between these two.

4 Experiments and results

We illustrate and evaluate the new methods using three experiments where each experiment addresses a separate research question. Before presenting the experiments and the results, we describe the metrics and datasets.

4.1 Metrics and hold-out validation

The main evaluation metric used is the area under the uplift curve (AUUC) (Jaskowski and Jaroszewicz 2012) commonly used for evaluating uplift models. It measures the expected increase in positive rate due to targeting treatments rather than randomizing them, averaged over all treatment rates. Hence it is the expected increase in positive rate due to your model if you have no preference on treatment rate. AUUC is a general purpose metric for goodness of fit and is particularly suitable in academic contexts where the use case is undefined. The absolute values of AUUC are often small even when the relative improvements are large. For legibility, we will report results as mAUUC ($1000 \cdot AUUC$) but will additionally clarify in the text the relative improvement.

As AUUC depends only on the rank of the observations and not the magnitude, we additionally used the expected uplift calibration error (EUCE) (Nyberg et al. 2021) as a metric to estimate how well the predictions match empirical rates. To estimate EUCE, all observations are first sorted based on the uplift predictions into m bins so that each bin contains $C = N / m$ observations with the first bin containing the observations with smallest predictions etc. N is the number of observations. For each bin j, the empirical uplift is estimated as

$$\begin{aligned} b_j = \frac{\sum y_{i, t=1}}{N_{j, t=1}} - \frac{\sum y_{i, t=0}}{N_{j, t=0}} \end{aligned}$$

(13)

where the sum is over all observations i in bin j. $N_{j, t=1}$ and $N_{j, t=0}$ refer to the number of treated and untreated observations in bin j, whereas $y_{i, t=1}$ and $y_{i, t=0}$ refer to the labels of the treated and untreated observations in the bin. Further, denoting the average uplift estimate for the observations in one bin $u_j = \frac{\sum \tau (x_i)}{C}$, EUCE can be expressed as

$$\begin{aligned} EUCE = \frac{1}{m} \sum _j |u_j - b_j |. \end{aligned}$$

(14)

Following the original formulation, the number of bins m used for estimating EUCE was set to 100 in the experiments.

For all methods we select the optimal undersampling factors k (or $k_{t=1}$ and $k_{t=0}$) using simple hold-out validation. The datasets were randomly split into training sets (50%), validation sets (25%), and testing sets (25%), where the training data is used for learning the models, the validation data for selecting the undersampling factor, and the final metrics are evaluated on the test data. This setup was deemed sufficient for the tree largest datasets. For the two smaller ones, this procedure was repeated 10 times so that the observations were randomly re-asssigned to the sets for each run, and the testing set metrics were averaged for the result tables.

In the experiments the values for $k_{t=1}$ and $k_{t=0}$ tested were $\{1, 2, 4, 8, 16, 32, 64, 128, 256\}$. In addition, in Experiment 1 the $k_{t=0}$ values included all values for $k_{t=1} \cdot 1.55$. These choice were made because this includes stratified undersampling and cases where $p^*(y=1 \vert t=1) = p^*(y=1 \vert t=0)$. The second choice captures the intuition that if both the treated and untreated observations have the same conversion rate, the issues caused by undersampling should be of similar magnitude in both cases. The best parameters were chosen using AUC-ROC on the validation set for classic undersampling and AUUC on the validation set for all other models.

4.2 Datasets

Evaluating methods for correcting class imbalance requires data that exhibits high class imbalance and is sufficiently large for evaluating the uplift reliably. We evaluate the methods on the three largest publicly available datasets, and additionally on two smaller datasets to illustrate the limitations of the methods. Details on the datasets are provided in Table 1.

Table 1 Statistics of the datasets as used in the experiments

Exploring uplift modeling with high class imbalance

Abstract

Similar content being viewed by others

ImbalancedLearningRegression - A Python Package to Tackle the Imbalanced Regression Problem

Uplift Modeling

Underlying Objective

Explore related subjects

1 Introduction

2 Related work

2.1 Class imbalance

2.2 Uplift models

3 Methods

3.1 The undersampling process

3.2 Undersampling for uplift modeling

3.2.1 Undersampling for classification

3.2.2 Naive undersampling

3.2.3 Stratified undersampling

3.2.4 Split undersampling

3.3 Calibration methods

3.3.1 Isotonic regression and \(\tau \)-isotonic regression

3.3.2 Renormalization

3.3.3 Local neighborhood calibration

4 Experiments and results

4.1 Metrics and hold-out validation

4.2 Datasets

4.3 Experiment 1: comparing methods and models

4.4 Experiment 2: reducing dataset size

4.5 Experiment 3: when is \(p(y=1 \vert t)\) small?

5 Discussion

6 Conclusion

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

A Appendix: Local neighborhood correction

A Appendix: Local neighborhood correction

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation