1 Introduction

Customers are a critical asset for every company operating in a contractual setting. The ability to retain profitable customers is a significant determinant of customer equity, i.e., the total lifetime value of a company’s customers (McCarthy et al., 2017). Consequently, customer retention is a strategic imperative (Rust et al., 2004) and customer churn prediction (CCP) models are a crucial tool for data-driven customer relationship management (Wu et al., 2022).

CCP models exploit data at the individual level (e.g., demographic, socio-economic, and behavioral data) to predict whether customers will terminate an existing business relationship within a future time window. These models help companies anticipate and remedy decreases in the stream of cash flows associated with customer churn. Moreover, CCP models can help uncover the underlying drivers of churn or decaying customer loyalty. Such insights are useful to revisit business processes and service offerings and raise customer equity in the long run. Further applications include the estimation of customer lifetime value, which relies on high-quality estimates of customer retention (Schweidel et al., 2014). Finally, CCP models can be deployed to govern the targeting of retention campaigns, either as standalone solutions or as a component of (causal) uplift models (Janssens et al., 2022).

The increasing availability of customer data combined with lower costs of data storage and computational infrastructure has fostered the use of supervised machine learning (ML) models to predict customer churn over the past decades (Verbeke et al., 2012). The ability to scale well with high-dimensional data (e.g., an increasing number of customers and/or customer features) and to capture complex, non-linear dependencies between features and the churn event makes ML models the tool of choice for CCP in both industry and academic applications.

The temporal aspects of features have an impact on the performance of customer churn prediction models. Risselada et al. (2010) observed that the predictive performance of different types of CCP models deteriorates quickly over time and suggested the development of dynamic models. To make models more generalizable, Gattermann-Itschert and Thonemann (2021) suggest multi-slicing, an approach for training CCP models on data that is composed of different parts covering different time horizons. This paper takes an orthogonal approach. We develop dynamic CCP models that incorporate time-varying behavioral customer features in the form of recency, frequency, and monetary value (RFM).

The marketing literature emphasizes the predictive power of RFM features (Zhang et al., 2015) and many prior studies on CCP have considered corresponding predictors (Janssens et al., 2022). However, conventional statistical or ML classifiers such as logistic regression or tree-based models do not readily accommodate time-varying features. They assume that observations are independently and identically distributed, which conflicts with the nature of customer-level time-series data. Hence, processing time-varying data requires a non-trivial effort of manual feature engineering or aggregation, which potentially hinders the predictive performance of ML models. Mapping time series data to a fixed-length feature vector is not only labor intensive but also almost always results in a loss of information. Recent developments in deep neural network (DNN) architectures for sequential data have the potential to overcome the problems that shallow ML models face when exploiting time-varying data. Given a comprehensive space of architectural choices, the design of a DNN-based CCP model that accommodates both time-varying and time-invariant features is not straightforward. Moreover, recent empirical evidence in related classification tasks in the financial industry suggests that deep learning models might not outperform simpler alternatives based on gradient boosting models (Gunnarsson et al., 2021). Transferring corresponding results to the CCP setting, it is important to examine whether purely DNN-based models are effective and whether the respective merits of DNNs (handling time-varying features) and conventional classifiers (e.g., gradient boosting) can be integrated to obtain a hybrid, best-of-breed solution.

This paper contributes to the literature by empirically testing the potential of DNNs to predict customer churn using time-varying RFM features in the financial services industry. We systematically explore the vast option space of model architectures at the interface of deep vs. shallow and static vs. dynamic churn models and their different forms of integration. This offers churn analysts valuable guidance on how to capitalize on available customer data with both time-varying and time-invariant features. Our experiments are based on a unique data set from a European financial service provider that encompasses anonymized information of 480 thousand customers collected over 48 months. We find that recurrent neural networks outperform transformer models for CCP using time-varying RFM measures. This finding is confirmed when other, time-invariant customer features enter a CCP model, independent of how different sets of features are integrated. Finally, we find no statistical evidence that hybrid approaches, which integrate DNN predictions in conventional classifiers (based on regularized logistic regression and extreme gradient boosting), improve performance further. This highlights that DNNs are suitable standalone classifiers for predicting churn using time-varying RFM measures.

The paper is organized as follows. Section 2 reviews related work on classic, ML-based, and deep learning approaches to predict customer churn. Sections 3 and 4 describe the methodology and the experimental configuration, respectively. We report empirical results in Sect. 5 and conclude thereafter.

2 Related literature

A large body of CCP literature comprises different fields of study including operations research, marketing, statistics, and computer science. The promise of increased predictive accuracy and the requirements of operational churn management to handle data sets with a vast number of customer characteristics, multi-collinearity problems, and noisy features have raised a considerable amount of interest in ML-based churn prediction (Janssens et al., 2022; Qi et al., 2009; Wu et al., 2022). Our paper contributes to the existing CCP literature by exploring the beneficial impact of using time-varying over static customer characteristics and investigating the potential of DNN.

To date, the majority of CCP literature employs models for large-scale cross-sectional and static data. Verbeke et al. (2012) provide a comprehensive comparison of the performance of ML models, which highlights that most previous studies focus on cross-sectional data. More recently, Janssens et al. (2022) propose a novel expected maximum profit measure for B2B churn prediction to directly incorporate the heterogeneity of customer values and profit concerns of the company using gradient boosting on a large cross-sectional data set from a North American B2B beverage retailer. Another recent exemplary study proposes combining PCA with AdaBoost to deliver an enhanced and more stable churn prediction performance in a cross-sectional e-commerce context (Wu et al., 2022). As an exception, Chen et al. (2012) extend the support vector machine model to accommodate time-varying variables and predict customer churn without prior feature aggregation. More generally, deriving static features from time-varying information requires the use of heuristic aggregation procedures such as moving or weighted averages, or augmenting the set of features with time-lagged observations of these time-varying variables. These techniques are well-established in the telecommunications industry (Wei & Chiu, 2002) and e-commerce (Koehn et al., 2020) to predict customer behavior based on call-detail records and click-stream data, respectively. This motivates us to investigate the merit of directly modeling customer churn using time-varying features and DNNs.

DNNs have witnessed rising popularity in the CCP literature to leverage novel data sources like social-network features (Óskarsdóttir et al., 2017) or text data (De Caigny et al., 2020). In contrast, the use of DNNs to extract information from time-varying (RFM) features has been sparsely explored in prior work. Table 1 summarizes related work that has used time-varying features and DNNs in a CCP context.

Table 1 Prior work on churn prediction using time-varying features and DNN

Closely related to our work, several studies have experimented with time-varying features and deep learning, but none of these studies include time-varying RFM variables. Tan et al. (2018) and Zhou et al. (2019) obtain a churn model by combining convolutional (CNN) and long short-term memory (LSTM) neural networks. Both studies report that their models outperform benchmarks that do not use time-varying information. Both papers distinguish between static and time-varying features, yet they do not study how to best combine the two types of features in a prediction model. Results in Wangperawong et al. (2016) and Zaratiegui et al. (2015) from applying CNNs for churn prediction provide further empirical evidence regarding the merit of deep learning-based churn models and the use of time-varying features. Results in these studies, however, are based on a limited number of variables from a relatively short, few-month timeframe. Wangperawong et al. (2016) also consider time-varying features alone, while most real-world CCP applications involve both time-varying and time-invariant features. Further applications that combine supervised and unsupervised approaches for churn prediction in dynamic contexts include Liu et al. (2018) and Yang et al. (2018). Liu et al. (2018) combine time-varying and static features but do not distinguish between the two types of features. They also consider a relatively short time frame of four months for their time-varying features. Yang et al. (2018) consider an even shorter time window of two weeks.

In conclusion, our study is the first to examine the potential of behavioral time-varying variables to predict churn using DNNs in the financial services industry. Compared to existing work, our study relies on a much longer period of time-varying features spanning multiple years. Unlike prior work, in which temporal data is represented by some application-specific data, RFM variables are a well-established, widely used, and generic class of features in marketing decision-support (Zhang et al., 2015). Moreover, our thorough assessment of DNNs and RFM variables contributes original empirical evidence on which architectures obtain better performance for CCP. For instance, transformer networks, a state-of-the-art deep learning approach, are, to the best of our knowledge, tested in a CCP context for the first time in this paper. Further, we propose modeling frameworks that can be combined with existing customer churn prediction models, and we investigate different options on how to best combine static and time-varying features. Based on the analysis of prior literature, we intend to answer the following research questions:

RQ1: what is the most effective DNN architecture for CCP using time-varying RFM measures?

RQ2: what is the most effective DNN architecture for CCP using mixed data (time-varying RFM measures and static customer variables)?

RQ3: can the performance of a mixed-feature DNN model be improved further by hybridizing it with a conventional CCP classifier?

3 Methodology

The methodology described in this study is graphically depicted in Fig. 1. The models we compare differ on two dimensions: (1) the features on which they are trained and (2) the algorithms deployed to train classifiers. First, we explain the difference between time-varying and static features in Sect. 3.1. Next, in Sect. 3.2.1, we introduce DNNs that can handle time-varying input, followed by strategies to combine static and time-varying inputs in these DNNs. Last, we introduce hybrid approaches that incorporate the output of the DNNs in traditional classifiers in Sect. 3.2.2.

Fig. 1 Visual representation of variable sets and classifiers considered in this study

3.1 Static and time-varying features

Our methods presume that two categories of features exist. The first category involves static customer variables, which remain constant over time (e.g., demographic characteristics). The second category includes time-varying measures of recency, frequency, and monetary value. These time-varying RFM variables are available on a monthly level and are derived from ongoing product contracts. In line with existing research, we calculate recency as the number of days that have passed since the last new product was subscribed, frequency as the number of open contracts on a given date, and monetary value as the total monthly value associated with a customer’s open product portfolio. Figure 2 provides an example of the calculation and the time-varying aspect of RFM variables. Note how the values of recency, frequency, and monetary value change whenever a contract is opened or closed, or simply as time passes.

Fig. 2 Example of the outcome and RFM calculation for a fictitious customer
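The RFM definitions in Sect. 3.1 can be illustrated with a minimal sketch. The contract record layout (keys `start`, `end`, `monthly_value`), the function name `rfm_snapshot`, and the convention that a contract counts as open until its end date are assumptions made for this example; they are not taken from the data set used in the paper.

```python
from datetime import date

def rfm_snapshot(contracts, on_date):
    """Return (recency, frequency, monetary) for one customer on `on_date`.

    `contracts` is a list of dicts with hypothetical keys:
    'start' (date), 'end' (date, or None if still open), 'monthly_value' (float).
    """
    started = [c for c in contracts if c["start"] <= on_date]
    if not started:
        return None  # no contract history yet on this date
    # Recency: days since the most recent new product subscription.
    last_start = max(c["start"] for c in started)
    recency = (on_date - last_start).days
    # Frequency: number of contracts that are open on the snapshot date.
    open_now = [c for c in started if c["end"] is None or c["end"] > on_date]
    frequency = len(open_now)
    # Monetary value: total monthly value of the open product portfolio.
    monetary = sum(c["monthly_value"] for c in open_now)
    return recency, frequency, monetary

# Fictitious customer: one long-running contract and one that is closed again.
contracts = [
    {"start": date(2017, 1, 15), "end": None, "monthly_value": 20.0},
    {"start": date(2017, 6, 1), "end": date(2017, 9, 1), "monthly_value": 35.0},
]
# Monthly snapshots form the time-varying RFM sequence for this customer.
snapshots = [rfm_snapshot(contracts, date(2017, m, 1)) for m in (2, 7, 10)]
```

Each snapshot is one time step of the sequence; opening the second contract resets recency and raises frequency and monetary value, and closing it lowers the latter two again while recency keeps growing.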

3.2 Classification algorithms

3.2.1 Deep neural networks

We consider recurrent models, with or without attention mechanisms, and transformer-type neural network architectures. In this subsection, we sketch the intuition behind these models while emphasizing the decisions required to model time-varying and static variables.

Recurrent neural networks are, by design, well-suited for time-varying input variables. Consider a vanilla recurrent neural network (RNN) and assume that we observe, for each customer i, a sequence of features \({X}_{1},\dots ,{X}_{T}\) of fixed length T. Each hidden state \({h}_{t}\) processes the information for time step t. Note that this architecture depends on the current value of the variables \({X}_{t}\) and is dynamic since the hidden state \({h}_{t}\) depends on its past value \({h}_{t-1}\), which, in turn, incorporates information extracted from previous realizations of the variables, e.g., \({X}_{t-1}\). The hidden state for period t, with t = 1,…,T, is given by:

$$ h_{t} = \tanh \left( W_{x} X_{t} + W_{h} h_{t - 1} \right) \tag{1} $$

with \(\tanh \left( \cdot \right)\) as the hyperbolic tangent function (see Footnote 1). To predict the probability of churn, we add a dense layer with a sigmoid activation function. The free parameters, that is, the connection weight matrices \({W}_{x}\) and \({W}_{h}\), are then estimated by computing the gradient of the loss \(L\), defined as the binary cross-entropy over N customers:

$$ L = \sum_{i = 1}^{N} \left[ - y_{i} \log \left( \hat{y}_{i} \right) - \left( 1 - y_{i} \right)\log \left( 1 - \hat{y}_{i} \right) \right] \tag{2} $$

For notational convenience, the rest of the paper uses the symbol \(W\) to refer to a properly shaped matrix of all the weights in the network. For the estimation of the weights \(W\), the initial hidden state is usually set to zero (\({h}_{0}=0\)). We refer to Goodfellow et al. (2016) for an overview of the backpropagation algorithm and optimization-specific details.
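The recurrence and the loss above can be written out as a short forward pass. This is a plain-Python sketch for one customer, not the implementation used in the experiments; the output-layer parameters `w_out` and `b_out` and all function names are hypothetical, and no training (backpropagation) is shown.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(X, W_x, W_h):
    """Return the hidden states h_1..h_T for one sequence X of T feature vectors."""
    h = [0.0] * len(W_h)  # initial hidden state h_0 = 0
    states = []
    for x_t in X:
        pre = [a + b for a, b in zip(matvec(W_x, x_t), matvec(W_h, h))]
        h = [math.tanh(v) for v in pre]  # h_t = tanh(W_x X_t + W_h h_{t-1})
        states.append(h)
    return states

def churn_probability(h_T, w_out, b_out=0.0):
    """Dense output layer with sigmoid activation on the last hidden state."""
    z = sum(w * h for w, h in zip(w_out, h_T)) + b_out
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy summed over customers; eps guards against log(0)."""
    return sum(-y * math.log(max(p, eps)) - (1 - y) * math.log(max(1 - p, eps))
               for y, p in zip(y_true, y_pred))

# Toy sequence with T = 3 monthly steps and three (standardized) RFM features.
X = [[0.2, 1.0, -0.5], [0.3, 1.0, -0.5], [0.0, 2.0, 0.4]]
W_x = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]  # 2 hidden units x 3 inputs
W_h = [[0.2, 0.0], [-0.1, 0.3]]             # 2 hidden units x 2 hidden units
states = rnn_forward(X, W_x, W_h)
p_churn = churn_probability(states[-1], w_out=[0.7, -0.4])
```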

The vanilla RNN may suffer from the exploding or vanishing gradient problem, which in turn degrades the ability of RNNs to learn long-term dependencies. Gated Recurrent Unit (GRU) (Cho et al., 2014) and Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997; see Footnote 2) neural networks address these issues. Both architectures introduce gating mechanisms that facilitate a different flow of information in the network. There is empirical evidence that the GRU and LSTM architectures offer comparable performance and that both perform better in sequence modeling than the vanilla RNN (Chung et al., 2014). Thus, we focus on these models. Formally, the GRU architecture is given by the following components:

$$ \begin{aligned} r_{t} &= \sigma \left( W_{r} X_{t} + W_{hr} h_{t - 1} \right), & n_{t} &= \tanh \left( W_{n} X_{t} + r_{t} \left( W_{hn} h_{t - 1} \right) \right) \\ z_{t} &= \sigma \left( W_{z} X_{t} + W_{hz} h_{t - 1} \right), & h_{t} &= \left( 1 - z_{t} \right) n_{t} + z_{t} h_{t - 1} \end{aligned} \tag{3} $$

where \({r}_{t}\), \({z}_{t}\), and \({n}_{t}\) represent the reset, update, and new information gates, respectively (see Footnote 3). Note how the reset and update gates are similar to the structure of the vanilla recurrent network except for the use of a sigmoid function instead of the hyperbolic tangent. Similarly, the new information gate \({n}_{t}\) resembles the vanilla recurrent network with the extra characteristic that the reset gate multiplies the transformed previous hidden state. Despite these similarities, the hidden state \({h}_{t}\) in the GRU model is computed as a weighted average of the previous hidden state \({h}_{t-1}\) and the new information \({n}_{t}\), where the weight, in turn, depends on \({z}_{t}\), a function of the original inputs \({X}_{t}\) and \({h}_{t-1}\).
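A single GRU update following the component equations above might look as follows. The sketch assumes that products between gate vectors and states are element-wise and omits bias terms, as the equations do; the dictionary keys simply mirror the subscripts of the weight matrices and are otherwise arbitrary.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    """Multiply matrix W (list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x_t, h_prev, W):
    """One GRU update; W maps 'r', 'hr', 'z', 'hz', 'n', 'hn' to weight matrices."""
    # Reset and update gates: sigmoid of input and previous-state contributions.
    r = [sigmoid(a + b) for a, b in zip(matvec(W["r"], x_t), matvec(W["hr"], h_prev))]
    z = [sigmoid(a + b) for a, b in zip(matvec(W["z"], x_t), matvec(W["hz"], h_prev))]
    # New information: the reset gate scales the transformed previous state.
    hn = matvec(W["hn"], h_prev)
    n = [math.tanh(a + r_k * b) for a, r_k, b in zip(matvec(W["n"], x_t), r, hn)]
    # Hidden state: weighted average of new information and previous state.
    return [(1 - z_k) * n_k + z_k * h_k for z_k, n_k, h_k in zip(z, n, h_prev)]
```

With all weights set to zero, both gates equal 0.5 and the new information is zero, so the update simply halves the previous hidden state. This is a convenient sanity check of the weighted-average form.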

Consider as an alternative the LSTM architecture given in Eq. (4):

$$ \begin{aligned} i_{t} &= \sigma \left( W_{ii} X_{t} + W_{hi} h_{t - 1} \right), & g_{t} &= \tanh \left( W_{ig} X_{t} + W_{hg} h_{t - 1} \right) \\ f_{t} &= \sigma \left( W_{if} X_{t} + W_{hf} h_{t - 1} \right), & c_{t} &= f_{t} c_{t - 1} + i_{t} g_{t} \\ o_{t} &= \sigma \left( W_{io} X_{t} + W_{ho} h_{t - 1} \right), & h_{t} &= o_{t} \tanh \left( c_{t} \right) \end{aligned} \tag{4} $$

This architecture uses a cell state \({c}_{t}\), weighted by the output gate \({o}_{t}\), to compute the hidden state \({h}_{t}\). The cell state is a weighted sum of the cell state in the previous period \({c}_{t-1}\) and an updated cell state proposal \({g}_{t}\), where the weights are given by the forget gate \({f}_{t}\) and the input gate \({i}_{t}\), respectively.
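The LSTM update can be sketched in the same style. As before, this is an illustration under the assumption of element-wise gate products and no bias terms; the helper `gate` and the dictionary layout of the weights are choices made for this example only.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    """Multiply matrix W (list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gate(W_in, W_hid, x_t, h_prev, activation):
    """Apply `activation` to W_in x_t + W_hid h_{t-1}, element by element."""
    return [activation(a + b)
            for a, b in zip(matvec(W_in, x_t), matvec(W_hid, h_prev))]

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM update; W maps 'ii', 'hi', 'if', 'hf', 'io', 'ho', 'ig', 'hg' to matrices."""
    i = gate(W["ii"], W["hi"], x_t, h_prev, sigmoid)    # input gate
    f = gate(W["if"], W["hf"], x_t, h_prev, sigmoid)    # forget gate
    o = gate(W["io"], W["ho"], x_t, h_prev, sigmoid)    # output gate
    g = gate(W["ig"], W["hg"], x_t, h_prev, math.tanh)  # cell state proposal
    # Cell state: forget-weighted old cell state plus input-weighted proposal.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # Hidden state: output gate applied to the squashed cell state.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c
```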

We consider variations of DNNs for sequential data in terms of (1) using the entire sequence of hidden states \({h}_{t}\) or just a subset, (2) using bidirectional models, (3) using separate input layers for static and time-varying data, and (4) using attention mechanisms.

First, in an RNN, the hidden state \({h}_{t}\) for time step \(t\) summarizes all the information up to that point of the sequence. Hence, we can extract information from the entire sequence using only the last hidden state. Alternatively, we can use the entire sequence of hidden states to estimate churn, which could improve the quality of the estimated probabilities at the cost of adding parameters in the final layer.

Next, RNNs allow for reversing the order of the sequence \({X}_{1},\dots ,{X}_{T}\), which facilitates bidirectional modeling. This is relevant in our context because a bidirectional model can better capture differences between the customers that arise earlier in time. Again, it is possible to consider the entire sequence of bidirectional hidden states \({h}_{t}\) or just the last one to compute probabilities.

Last, the inclusion of static variables in the model is not straightforward. As a first alternative, we could merge the sequential and static variables and include them together in the \({X}_{t}\) matrix. We call this approach Merging in our experiments. Merging combines the static data with the time-varying data at the input level, so that each time point carries a value of every static feature, just as it does for the time-varying variables. This implies that the dimension of the weights \(W\) increases considerably as the sequence also consists of static features. Moreover, given that static features by definition do not change over time, they might preclude the model from fully exploiting the sequential patterns in the data.

As a second alternative, we could divide the features into two subsets and use DNNs designed for each type of data. We refer to this approach as Concatenation: we apply the RNN only to the subset of sequential variables and concatenate the hidden states from this model with the hidden states from a feedforward neural network for the static data (see Footnote 4). In other words, such an approach concatenates the hidden representations of the time-varying features at point T with the static features. In theory, such an approach would allow the architecture to specialize and better exploit the two types of variables.

As a final variation, we consider the use of attention mechanisms with the concatenation approach. Attention allows the architecture to weight different points in the sequence when making predictions, such that the final hidden state is not the only component summarizing the sequence. The literature offers several alternatives to introduce attention mechanisms (Chaudhari et al., 2021; Galassi et al., 2021). We focus on a global attention type as described in Luong et al. (2015). Compared to local attention, global attention is more expensive since it considers all hidden states of the encoder when deriving the context vector instead of only a subset of hidden states. However, if sequences are relatively short, as in the focal study, global attention is relatively easy to train, and it offers good performance in natural language processing tasks. We refer to this model variation as Attention.
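The three ways of combining static and time-varying information can be contrasted in a few lines. The sketch below only shows how the inputs and representations are assembled; the function names are hypothetical, the attention variant uses a dot-product score with the last hidden state as query (one of the scoring options in Luong et al., 2015, chosen here as an assumption), and no learned parameters are involved.

```python
import math

def merge_inputs(sequence, static):
    """Merging: repeat the static feature vector at every time step of the input."""
    return [x_t + static for x_t in sequence]

def concatenate_representations(h_T, static_hidden):
    """Concatenation: join the last recurrent state with the static branch output."""
    return h_T + static_hidden

def global_attention_context(states):
    """Global dot-product attention over all hidden states, queried by the last one."""
    query = states[-1]
    scores = [sum(q * s for q, s in zip(query, h_s)) for h_s in states]
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted average of all hidden states.
    context = [sum(w * h_s[d] for w, h_s in zip(weights, states))
               for d in range(len(query))]
    return weights, context

sequence = [[0.2, 1.0, -0.5], [0.0, 2.0, 0.4]]  # T = 2 steps of RFM features
static = [1.0, 0.0]                             # e.g. two encoded static variables
merged = merge_inputs(sequence, static)         # each step now has 5 features
```

In the Merging variant the recurrent weights must cover five inputs per step instead of three, which illustrates why the dimension of the weights grows with the number of static features.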

The transformer architecture is a different kind of DNN for modeling sequential data. It is designed to overcome the computational burden of RNNs by relying only on the attention mechanism. RNNs generate a sequence of hidden states \({h}_{t}\) as a function of the previous hidden state \({h}_{t-1}\) and the input at position t (Vaswani et al., 2017). As sequences become longer, the inherently sequential nature of RNNs precludes parallelization, with a major impact on computation time. The transformer architecture relies on encoder and decoder stacks that embed the input and output sequences. The attention mechanism is applied within the layers of the encoder and decoder stacks. An attention function maps a query vector and a set of key-value vector pairs to an output vector. Explaining the inner workings of the transformer model is beyond the scope of this paper; we refer instead to the original implementation in Vaswani et al. (2017) and the notes in Rush (2018). For the objective of this paper, it is sufficient to understand that the self-attention mechanism in the transformer model allows us to encode the sequential data. As with recurrent models, we can incorporate static variables using the merging or concatenation approaches.
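To make the query, key, and value mapping concrete, the following sketch implements scaled dot-product attention as defined in Vaswani et al. (2017), softmax(QKᵀ/√d_k)V, for a single head. It deliberately leaves out the learned projection matrices, multiple heads, positional encodings, and the surrounding encoder layers, so it should be read as an illustration of the attention function only.

```python
import math

def softmax(values):
    top = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for lists of query, key, and value vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Self-attention over a toy sequence of monthly RFM vectors: Q = K = V = X.
X = [[0.1, 1.0, 0.5], [0.2, 1.0, 0.5], [0.0, 2.0, 0.9]]
encoded = attention(X, X, X)
```

Because every position attends to every other position independently, all rows of `encoded` can be computed in parallel, in contrast to the step-by-step recurrence of an RNN.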

3.2.2 Hybrid approaches

In addition to evaluating the potential of standalone deep neural networks and comparing alternative architectures, we assess the potential of hybrid approaches that combine DNN and ML models. Specifically, similar to earlier approaches (De Caigny et al., 2020), we construct ML classifiers that incorporate DNN predictions as features. This approach investigates the value of integrating DNNs that are purpose-built to accommodate time-varying RFM features into existing CCP models. We focus on two ML models with a proven track record in the CCP literature: (1) the regularized (lasso) logistic regression, and (2) the tree-based extreme gradient boosting. The regularized logistic regression is the industry standard model and serves as our benchmark. It can handle high-dimensional data (Hastie et al., 2009) and is easier to interpret than other ML models due to its linear form. However, the regularized logistic model requires manually specifying the interaction terms in the model as well as other transformations to deal with non-linearities. Tree-based gradient boosting models overcome these limitations and offer a strong benchmark. Moreover, previous research for related classification tasks in the financial industry suggests that tree-based gradient boosting models are not outperformed by neural networks (Gunnarsson et al., 2021). In short, gradient boosting constructs an ensemble of decision trees sequentially. We refer to Chen and Guestrin (2016) for details on the model estimation.
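The hybrid idea, feeding DNN predictions into a conventional classifier as an additional feature, can be sketched as follows. The lasso logistic regression here is a small hand-written stand-in fitted by proximal gradient descent (a gradient step followed by soft-thresholding); it is not the estimation routine used in the paper, and the penalty `lam`, learning rate, number of epochs, and toy data are arbitrary assumptions.

```python
import math

def augment_with_dnn_scores(static_rows, dnn_probs):
    """Append each customer's DNN churn probability to the static feature vector."""
    return [row + [p] for row, p in zip(static_rows, dnn_probs)]

def fit_lasso_logistic(X, y, lam=0.01, lr=0.5, epochs=500):
    """L1-penalized logistic regression on the mean log-loss (bias unpenalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x_i, y_i in zip(X, y):
            z = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad_w[j] += (p - y_i) * x_i[j] / n
            grad_b += (p - y_i) / n
        # Gradient step, then soft-thresholding (proximal operator of the L1 norm).
        w = [w_j - lr * g_j for w_j, g_j in zip(w, grad_w)]
        w = [math.copysign(max(abs(w_j) - lr * lam, 0.0), w_j) for w_j in w]
        b -= lr * grad_b
    return w, b

# Toy data: one uninformative static feature plus hypothetical DNN probabilities.
static_rows = [[1.0], [-1.0], [1.0], [-1.0]]
dnn_probs = [0.9, 0.8, 0.2, 0.1]
y = [1, 1, 0, 0]
X_hybrid = augment_with_dnn_scores(static_rows, dnn_probs)
weights, bias = fit_lasso_logistic(X_hybrid, y)
```

The same augmented matrix could be passed to a gradient boosting library instead; only the feature construction step is specific to the hybrid setup.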

4 Experimental configuration

4.1 Data set

For our experiments, we obtained a data set containing monthly client records through a partnership with a major French provider of financial services. The provided sample consists of customers for whom the focal company is their primary financial services provider to ensure the high quality of the behavioral data. The database contains variables that are frequently used to predict churn, like demographic characteristics and behavioral information to compute sequential RFM variables. The time-varying RFM measures are engineered as described in Sect. 3.1. Figure 3 visualizes how average frequency, recency, and monetary values differ between churners and non-churners over time. In line with the literature, churners are relatively new (lower recency), have consistently fewer products (lower frequency), and are less valuable (lower monetary value).

Fig. 3 Evolution over time of mean RFM variables by churn status

Table 2 provides definitions of all the variable categories other than RFM that we use in the experiments. The three categories (i.e., demographic, behavior, and customer-company interaction) comprise 37 variables, which are frequently used in CCP (De Caigny et al., 2020).

Table 2 Overview of static customer variables in our data set

In line with the CCP literature, our main target variable is a dichotomous indicator that equals one if the customer cancels all the contracts with the company within a fixed observation window of 12 months, which in our sample starts on April 1st, 2018. Moreover, the churn definition is aligned with the company’s current business processes.

4.2 Data preprocessing and experimental setup

Our data preprocessing and experimental setup are based on conventional practices in recent CCP studies. First, we standardize the features and obtain the parameters for the standardization from the training data. Next, we treat our training samples for the presence of class imbalance because it can negatively impact the predictive performance. To train the model, we apply an under-sampling approach in line with De Caigny et al. (2018). Because our focus is on the core contributions, exploring alternative sampling approaches is out of the scope of the focal study.
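A minimal sketch of these two preprocessing steps is given below. It assumes z-score standardization and random under-sampling of the majority class to a 1:1 ratio; the target ratio, the seed handling, and the function names are assumptions for this example, since the text only states that an under-sampling approach is applied to the training data.

```python
import random
import statistics

def fit_standardizer(train_rows):
    """Estimate per-feature mean and standard deviation on the training data only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(rows, means, stds):
    """Apply training-set parameters to any split (training, validation, or test)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes are equally frequent."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in keep], [labels[i] for i in keep]

train = [[1.0, 10.0], [3.0, 10.0]]
means, stds = fit_standardizer(train)
test_scaled = standardize([[5.0, 12.0]], means, stds)
```

Fitting the standardizer on the training rows and reusing it for the test rows keeps information from the holdout data out of the model, and under-sampling is applied to the training split only.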

We base the results on the holdout test sets of a fivefold cross-validation, which is common practice in CCP (Van Nguyen et al., 2020). To tune the hyper-parameters of the models, we rely on a grid search and a nested cross-validation procedure. The grid search helps to reduce the variability that would arise from a random search of hyper-parameters. The nested cross-validation provides a clearer picture of the models’ performance and stability compared to a single split. Table 3 provides an overview of the considered parameter settings.
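The interplay of grid search and nested cross-validation can be sketched as follows. The callback `score_fn(params, train_idx, eval_idx)` is a hypothetical placeholder that would train a model with the given hyper-parameters on the training indices and return a validation score; the number of inner folds and the simple interleaved fold assignment (no shuffling or stratification) are assumptions for this illustration.

```python
from itertools import product

def kfold_indices(n, k):
    """Yield (train, test) index lists for k interleaved folds over n observations."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def nested_cv(n, grid, score_fn, outer_k=5, inner_k=3):
    """Return (best_params, outer_score) for every outer fold."""
    names = sorted(grid)
    candidates = [dict(zip(names, combo))
                  for combo in product(*(grid[name] for name in names))]
    results = []
    for train_idx, test_idx in kfold_indices(n, outer_k):
        def inner_score(params):
            # Average validation score over the inner folds of the outer training set.
            scores = []
            for inner_train, inner_val in kfold_indices(len(train_idx), inner_k):
                tr = [train_idx[j] for j in inner_train]
                va = [train_idx[j] for j in inner_val]
                scores.append(score_fn(params, tr, va))
            return sum(scores) / len(scores)
        best = max(candidates, key=inner_score)  # exhaustive grid search
        # The outer test fold is only used once, with the selected setting.
        results.append((best, score_fn(best, train_idx, test_idx)))
    return results
```

Hyper-parameters are selected using the inner folds only, so each of the five outer test folds yields a performance estimate that is untouched by the tuning step.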

Table 3 Overview of the hyper-parameter settings

4.3 Evaluation metrics

To evaluate the performance of the models, we report the area under the receiver operating characteristic curve (AUC), the top-decile lift (TDL), and the expected maximum profit criterion (EMPC). The AUC assesses a model’s performance independent of the decision threshold used to convert estimated churn probabilities into dichotomous class labels of churn and non-churn. Decision-makers can easily interpret this metric because it captures the probability that a randomly selected non-churner has a lower predicted churn probability than a randomly selected churner. Next, TDL evaluates the performance of a model within the top ten percent of customers with the highest predicted probability of churn compared to a random selection. Hence, this metric describes how much better the classifier identifies churners than a random selection of customers: it expresses the churn rate in the top decile relative to the overall churn rate. The EMPC facilitates assessing a model from a profit perspective and finding the most profitable set of customers to target by making assumptions around the expected future revenues and the distribution of misclassification costs (Verbraken et al., 2013). The assumptions for expected future revenues (CLV) and costs (retention offer and contacting costs) are based on De Caigny et al. (2020), who detail this for the financial services industry. We use the default options of the empChurn R package for the distribution of the misclassification costs, as often done in prior churn prediction studies (e.g., Verbraken et al., 2013).
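The AUC and TDL definitions translate directly into code. The sketch below uses the pairwise-comparison form of the AUC, which is quadratic in the number of customers and therefore only suitable for illustration, and counts ties as one half; the EMPC is not reproduced here because it depends on the cost and benefit assumptions handled by the empChurn package.

```python
def auc(y_true, scores):
    """Probability that a random churner is scored above a random non-churner."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def top_decile_lift(y_true, scores):
    """Churn rate among the 10% highest-scored customers over the overall churn rate."""
    n_top = max(1, len(scores) // 10)
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    base_rate = sum(y_true) / len(y_true)
    return top_rate / base_rate

# Ten fictitious customers, two of whom churn.
y = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
p = [0.9, 0.8, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.05]
model_auc = auc(y, p)              # 15 of 16 churner/non-churner pairs ranked correctly
model_tdl = top_decile_lift(y, p)  # top decile is one customer, and it is a churner
```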

4.4 Statistical testing

Our experimental setup results in five performance estimates per model and for each evaluation metric. To statistically compare model performance measures, we rely on the corrected repeated k-fold cross-validation test suggested by Bouckaert and Frank (2004). The test is appropriate for pairwise comparisons of classifiers with multiple performance measures obtained from cross-validation with an arbitrary number of replications and folds on a single data set. The t-statistic they suggest for this purpose is given by:

$$ t = \frac{\frac{1}{kr}\sum_{i = 1}^{k} \sum_{j = 1}^{r} x_{i,j}}{\sqrt{\left( \frac{1}{kr} + \frac{n_{2}}{n_{1}} \right)\hat{\sigma }^{2}}} \tag{5} $$

where \({n}_{1}\) is the number of customers used for training models, \({n}_{2}\) is the number of customers in the test set, r is the number of experimental repetitions, k is the number of folds, \({x}_{i,j}\) is the measured difference in predictive performance for fold i and replication j, and finally, \({\widehat{\sigma }}^{2}\) is the variance of the \(k\cdot r\) performance differences. In comparisons that involve more than two classifiers, we deploy Li's procedure (Li, 2008) to correct p-values and protect against an elevation of the family-wise error.
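The test statistic can be computed from the list of fold-level performance differences as in the sketch below. It assumes that \({\widehat{\sigma }}^{2}\) is the sample variance (denominator k·r − 1), which the text does not state explicitly; the p-value computation and the multiple-comparison correction are left out.

```python
import math
import statistics

def corrected_repeated_cv_t(diffs, n_train, n_test):
    """Corrected t-statistic for the k*r performance differences in `diffs`."""
    m = len(diffs)                      # m = k * r fold-level differences
    mean_diff = statistics.fmean(diffs)
    var = statistics.variance(diffs)    # sample variance (assumption, see lead-in)
    # The n_test / n_train term corrects for overlapping training sets across folds.
    return mean_diff / math.sqrt((1.0 / m + n_test / n_train) * var)

# One run of fivefold cross-validation: k = 5, r = 1, test share of 1/5.
auc_diffs = [0.01, 0.02, 0.03, 0.02, 0.02]
t_stat = corrected_repeated_cv_t(auc_diffs, n_train=4, n_test=1)
```

Only the ratio of test to training size enters the statistic, so the fold sizes can be passed either as counts or as proportions.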

5 Results

This section discusses the observed results. Table 4 presents the detailed average performance levels of the various models considered in the study, measured in terms of AUC, TDL, and EMPC. The analyses are structured along the research questions introduced above.

Table 4 Results (averages and standard errors) in terms of AUC, top-decile lift (TDL), and expected maximum profit criterion (EMPC) for the classifiers in this study

RQ1: what is the most effective DNN architecture for CCP using time-varying RFM measures?

The first research question is dedicated to a comparison of alternative DNN architectures for accommodating time-varying RFM measures. Table 5 presents the results of the statistical tests comparing RNN and Transformer models estimated on time-varying RFM measures, which clearly indicate that RNNs dominate a transformer model for handling time-varying RFM measures. This holds for all three performance metrics under consideration.

Table 5 Pairwise comparison between the RNN and transformer models estimated on time-varying RFM measures

RQ2: what is the most effective DNN architecture for CCP using mixed data (time-varying RFM measures and static customer variables)?

Next, we extend the feature set by including static customer variables alongside time-varying RFM measures. Section 3.2.1 presented alternative network architectures to accommodate these additional features in RNN and transformer models. Table 6 presents the results of pairwise model comparisons.

Table 6 Pairwise comparison of DNN architectures estimated on mixed data: time-varying RFM measures and static customer variables

As can be seen in the first column of Table 6, comparisons are grouped into three parts. The first set involves a comparison of the three RNN architecture variants: merged, concatenation, and attention. The second set compares network architectures in the transformer category: merged, concatenation, attention, and finally, multi-head attention. A final comparison tests the best RNN variant against the best transformer variant. From the results presented in Table 6, the following conclusions emerge. First, among RNN architectures, concatenation exhibits the best performance in absolute terms. Statistical tests demonstrate its superiority over the merged variant, but not over the architecture based on attention. Second, among transformer variants, none of the architectures is dominant, at least not in statistical terms. The highest performance can be observed for the architecture with multi-head attention. Finally, a comparison of the best variant of both architecture families reveals that the RNN with concatenation significantly outperforms the best transformer variant, i.e., a transformer network with multi-head attention. In summary, the results identify recurrent neural networks with concatenation as the most promising DNN architecture for accommodating a mixed set of variables.

Table 7 Pairwise comparison between RNN and Transformer models estimated on time-varying RFM measures vs. best-in-class RNN and Transformer architectures estimated on mixed data (time-varying RFM measures with static customer variables)

Table 7 presents the results of a comparison between RNN and transformer networks based solely on time-varying RFM features, and their best-in-class counterparts based on mixed data. Unsurprisingly, the tests show that performance improves significantly on all metrics when models are trained on a mix of time-varying RFM features and static customer variables. This shows that the added complexity of these network variants is justified by increased predictive performance. The observed result also supports the argument expressed in Sect. 2 that predicting customer churn using time-varying features alone does not offer a fully comprehensive picture of the merits of DNNs.

Table 8 Pairwise comparisons of the RNN (Concatenation) and hybrid models integrating predictions produced by RNN (Concatenation) in (1) regularized logistic regression and (2) extreme gradient boosting

RQ3: can the performance of a mixed-feature DNN model be improved further by hybridizing it with a conventional CCP classifier?

Finally, the potential of hybridized approaches is tested. Specifically, we examine whether incorporating DNN-model predictions in conventional ML classifiers raises predictive performance. Table 8 presents the results of three comparisons. The best overall DNN architecture, i.e. RNN with concatenation, is compared to (1) a regularized logistic regression model, and (2) extreme gradient boosting, both incorporating the predictions of RNN with Concatenation as an additional feature. In addition, both hybrid variants are compared.

The statistical testing results in Table 8 indicate that the benefit of hybridizing models depends on the ML learner that serves as a host. When RNN predictions are embedded in a regularized logistic regression, performance does not improve significantly over a standalone RNN model. However, when RNN predictions are embedded in an extreme gradient boosting model, performance improves significantly over a standalone RNN model. This holds for all three performance measures. Finally, we observe the overall best performance for the hybrid extreme gradient boosting model that incorporates RNN predictions, which also outperforms the hybrid regularized logistic regression. Differences are statistically significant for all three performance criteria.
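The hybridization step itself amounts to feature augmentation, which the following minimal sketch illustrates. It appends a first-stage churn score to each customer's feature vector and fits a second-stage classifier on the augmented data. The hand-written L2-regularized logistic regression is only a stand-in for the host learners of the study, all names and numbers are hypothetical, and the sketch assumes the first-stage scores were produced out-of-sample so that the host model does not inherit leaked label information.

```python
import math


def augment_with_dnn_score(feature_rows, dnn_scores):
    """Append each customer's first-stage churn score as one extra column."""
    return [list(row) + [score] for row, score in zip(feature_rows, dnn_scores)]


def predict_proba(model, row):
    """Churn probability from a (weights, bias) logistic model."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(b + sum(wi * xi for wi, xi in zip(w, row)))))


def fit_l2_logreg(X, y, l2=0.01, lr=0.5, epochs=500):
    """Batch gradient descent on the L2-penalized logistic loss.

    The bias term is left unpenalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, target in zip(X, y):
            err = predict_proba((w, b), row) - target
            for j in range(d):
                grad_w[j] += err * row[j] / n
            grad_b += err / n
        w = [wi - lr * (gw + l2 * wi) for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy data (assumed values): one static feature plus a first-stage score
static_rows = [[0.2], [0.8], [0.5], [0.4], [0.9], [0.1]]
dnn_scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]   # hypothetical RNN outputs
churned = [1, 0, 1, 0, 1, 0]

X_hybrid = augment_with_dnn_score(static_rows, dnn_scores)
hybrid_model = fit_l2_logreg(X_hybrid, churned)
```

Replacing `fit_l2_logreg` with a gradient boosting learner would leave the augmentation step unchanged, since the host model simply sees one additional input column.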

To conclude, Table 9 reports the set of selected hyper-parameters by outer fold for the models with the best performance based on \(AU{C}_{\mathrm{outer}}\). For the regularized logistic model there is a broad range of selected regularization terms \(\lambda \) and there is no clear tendency for the model to perform worse in the outer test folds. Results for gradient boosting show no variability regarding the learning rate. They also suggest using a larger number of boosting rounds and randomly sampling half of all available variables. The RNN concatenation model performed better on average when using a unidirectional GRU encoder for the time-varying variables and taking into account only the last hidden state of the sequence. Regarding the transformer model, there is no stable pattern in the hyper-parameters except for the number of layers (8).

Table 9 Optimal hyperparameter values per cross-validation fold

6 Discussion

Having provided empirical answers to our core research questions, we next discuss the implications of the observed results and reconcile our findings.

First, we find RNN-type networks to provide a more suitable framework for extracting information from time-varying RFM variables than transformer networks. Much literature advocates the advantages of transformers. They have virtually replaced RNNs in language-related problems and have become increasingly popular for computer vision tasks. Our findings run counter to this trend of the transformer becoming the lingua franca of deep learning. Scalability resulting from the ease of parallelizing computations is a major advantage of the transformer. This advantage facilitates pretraining transformer networks on very large data sets, which would be impossible with RNNs. Large pretrained networks lead to the superiority of the transformer in tasks like language or image processing where enormous amounts of data for pretraining are easily available. Marketing applications like CCP do not involve the same masses of data. Moreover, it is not at all clear whether concepts like pretraining and transfer learning are applicable in marketing and/or whether this would be successful. Therefore, scalability advantages, which enable transformers to benefit from richer pretraining in other domains, do not translate into higher performance in the CCP context studied here. Given that we do not consider transfer learning but train our networks from scratch, which is probably the standard setting in many marketing applications, the strict sequentiality of hidden state updates, which is the computational bottleneck of the RNN framework, might explain why the RNN extracts more useful information from the time-varying RFM variables and therefore predicts customer churn more accurately than transformers (see Table 4).
In the same vein, the observed results shed some light on the sequential dependency structure in RFM variables. Transformers are much credited for their ability to capture long-term dependencies in sequential data. The superiority of RNNs as observed in Table 4 indicates that churn patterns in the focal data set are mainly driven by near- or short-term effects. This is notable because switching costs in the financial services industry are higher than in many other service businesses. Relatively high switching costs suggest that the willingness to churn builds up over time and that there is some lead time before a customer finally cancels a service. In this regard, finding the RNN to excel in an industry with high switching costs, where one would expect long-term dependencies between RFM data and the churn event, could indicate that RNNs will also outperform transformer networks in other domains where changing service providers is easier than in financial services.

Second, many customer behavior forecasting settings including CCP require processing static and time-varying covariates. Neural networks facilitate the processing of alternative types of data in a single network architecture. Results associated with RQ2 clarify alternative options for integrating demographic (static) and RFM (time-varying) customer features in a CCP model. Consistent with expectations, we find a simple pooling of static and time-varying features in the input vector to perform less well than multi-modal architectures in which task-specific subnetworks process static and dynamic inputs, respectively, and subsequent layers concatenate the derived latent representations of each type before producing predictions. Attention mechanisms offer yet another approach to integrating static and time-varying features. Multi-head attention is the key component of transformer networks. Our results echo this characteristic in that transformer networks perform best with multi-head attention. For RNNs, our results warrant recommending that marketing analysts use the concatenation approach when devising a CCP model.
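The data flow of the concatenation approach can be sketched as follows. The sketch assumes a GRU encoder whose last hidden state summarizes the RFM sequence and a single dense ReLU layer for the static features. Class and function names, layer sizes, and the random, untrained weights are hypothetical; bias terms and training are omitted. It illustrates how the two latent representations are joined, not the exact architecture of Sect. 3.2.1.

```python
import math
import random


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyGRU:
    """Unidirectional GRU without bias terms (illustrative only)."""

    def __init__(self, n_in, n_hidden, rng):
        # One input matrix W and one recurrent matrix U per gate:
        # z = update gate, r = reset gate, h = candidate state.
        self.W = {g: rand_matrix(n_hidden, n_in, rng) for g in "zrh"}
        self.U = {g: rand_matrix(n_hidden, n_hidden, rng) for g in "zrh"}
        self.n_hidden = n_hidden

    def last_hidden(self, sequence):
        """Process the sequence step by step and return the final hidden state."""
        h = [0.0] * self.n_hidden
        for x in sequence:
            z = [sigmoid(a + b) for a, b in
                 zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
            r = [sigmoid(a + b) for a, b in
                 zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
            reset_h = [ri * hi for ri, hi in zip(r, h)]
            cand = [math.tanh(a + b) for a, b in
                    zip(matvec(self.W["h"], x), matvec(self.U["h"], reset_h))]
            h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
        return h


def churn_score(gru, W_static, w_out, rfm_sequence, static_features):
    """Encode both inputs separately, concatenate, and map to a probability."""
    h_seq = gru.last_hidden(rfm_sequence)                                # sequence subnetwork
    h_static = [max(0.0, a) for a in matvec(W_static, static_features)]  # dense ReLU layer
    joint = h_seq + h_static                                             # concatenation
    return sigmoid(sum(w * v for w, v in zip(w_out, joint)))


rng = random.Random(0)
gru = TinyGRU(n_in=3, n_hidden=4, rng=rng)   # 3 inputs: recency, frequency, monetary value
W_static = rand_matrix(2, 5, rng)            # 5 static features -> 2 latent units
w_out = [rng.uniform(-0.5, 0.5) for _ in range(4 + 2)]

rfm_sequence = [[rng.random() for _ in range(3)] for _ in range(6)]   # 6 periods
static_features = [rng.random() for _ in range(5)]
score = churn_score(gru, W_static, w_out, rfm_sequence, static_features)
```

In contrast, the merged variant would copy the static features into every time step of `rfm_sequence` and feed the widened sequence to a single encoder.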

Third, we find further evidence for the power of the gradient boosting framework to aid customer-centric decision tasks. When comparing CCP models derived from a single DNN to hybrid classifiers, in which DNN outputs are transferred to a second-stage classifier, the combination of RNN and gradient boosting yields the most accurate churn model overall. Much empirical work highlights the outstanding performance of gradient boosting. A recent study in the related field of credit scoring (Gunnarsson et al., 2021), for example, suggests that gradient boosting outperforms DNNs in credit scoring. Our results mirror the strong performance of gradient boosting and suggest that the combination of an RNN for extracting information from time-varying features and a gradient boosting-based post-processor for integrating observed and latent customer characteristics gives a highly powerful CCP model. In appraising this recommendation, it is only fair to mention several previous studies that have applied ML algorithms including gradient boosting to heterogeneous data sets including both static and time-varying features. Corresponding work uses a range of feature transformations to aggregate the latter before applying an ML-based prediction model (e.g., Koehn et al., 2020). To maintain a clear focus on the research questions that inspired this study, we did not report empirical results from comparing DNN-based models to purely static CCP models with feature aggregations. However, these results are available for the interested reader as supplementary material (see footnote 5). In these comparisons, the aggregated time-varying RFM features (across different transformations) were processed by a regularized logistic regression and an extreme gradient boosting classifier, and the results confirm the superiority of sequence learning with RNNs over feature aggregation.

7 Conclusion

ML has become an increasingly important tool to support decision-making in marketing. Corresponding decision models forecast customer behavior using transactional data related to the recency, frequency, and monetary value of client transactions. Such RFM variables are naturally dynamic and time-evolving. Their incorporation into predictive ML models is nontrivial and offers considerable degrees of freedom. Focusing on the prominent use case of customer churn prediction (CCP), the paper has studied the potential of recently introduced deep learning approaches for sequential data modeling in a CCP context. We have examined the space of modeling options comprising different network architectures for sequence prediction, pooling strategies for blending dynamic and static customer features, and approaches to obtain an overall churn prediction in a single-stage CCP model versus a hybrid strategy, which incorporates DNN-based churn predictions into a conventional CCP model.

The empirical results observed in a multi-factorial experiment related to customer churn in the financial services industry provide original insights into the interplay of alternative modeling options and offer marketing analysts a principled routine for integrating time-varying RFM variables into CCP and, more generally, customer behavior forecasting models. In practice, the developed tool could inform business decision-makers about customers’ risk of churning. Our experiments present several modeling options that allow financial services providers to create better churn prediction models using sequential data. First, the hybrid approaches serve as a fairly easy way to extend existing customer churn prediction models with sequential data. As our results demonstrate, such an approach could already significantly improve churn prediction models that do not integrate time-varying data. Next, we present a pure DNN-based approach that integrates various types of data. Such an approach could be extended with other data sources that were not included in this experiment.

Our study exhibits limitations, which pave the way for future research. Most importantly, we employ a data set from the financial services industry. Feature-rich, real-world data comprising time-varying and static customer characteristics for customer behavior and specifically churn prediction is not easily available in the public domain. Access to the proprietary data set has therefore been a key asset facilitating this research. However, we cannot claim external validity of the observed results beyond the employed data and welcome future work to revisit the observed results, DNN architectures, and CCP models using different data sets. Another limitation concerns the focus on correlational models to predict churn as opposed to causal models predicting retention uplift. When the primary goal of CCP is to target retention campaigns, decision-makers should use uplift models. However, in appraising the limitation of not using uplift modeling, it is important to note that campaign targeting is not the only objective of CCP and that many uplift modeling strategies such as the S-, T-, or X-learner employ conventional (i.e., correlational) ML models as a building block. In this regard, our results are immediately relevant to uplift modeling and inform campaign targeting decisions whenever an uplift model processes time-varying features.
Next, we acknowledge that another important goal of CCP beyond campaign targeting is to understand the drivers of customer churn. DNN models are black boxes and require additional explainable AI (xAI) tools to reveal the behavioral patterns derived from transactional customer data. While model-agnostic and DNN-specific xAI tools exist, we see the largest need for future work in this direction, namely building and testing xAI tools that support both time-varying and time-invariant features. Mixed data is common in CCP and many other applications in marketing and beyond. Having clarified how to leverage corresponding data for CCP in this study, the next necessary step is to develop tools for interpreting the DNN-based prediction approaches we find to excel. Finally, our study provides insights into the impact of time-varying features on churn, but there could also be external influences on the features. An important example is covariate shift, for which more research is needed in the customer churn prediction domain.