1 Introduction

Customer base analysis (CBA) is an essential component of customer relationship management for companies in a non-contractual setting that are interested in a long-term relationship with their customers. CBA uses information on past purchases to analyse and predict transactional patterns. Classical models in this field assume that the inter-purchase times of an individual customer follow a given type of distribution, where the distribution parameter(s) themselves follow another probability distribution to account for heterogeneity within the customer cohort. In addition, most of these models imply that every customer will at some point stop doing business with the company. This may happen, for example, when a customer starts purchasing from a competitor, no longer needs the product, or literally dies. The estimation of a customer’s lifetime is particularly challenging in a non-contractual setting because the timing of the dropout cannot be directly observed. Lifetimes can be modelled in a similar way to the purchase process, with individual dropout rates that themselves follow a heterogeneity distribution.

CBA models generally provide two measures that offer direct management benefits. The expected number of future purchases and the probability of a customer still being active at the end of the observation period are valuable metrics that enable the optimisation of marketing strategies regarding incentive or reactivation campaigns. In addition, the estimated parameters can be used to calculate the customer lifetime value (CLV) (see e.g. Fader et al. 2005b and Glady et al. 2009). On an operational level, the CLV can additionally substantiate marketing decisions like individual service offers. On the general management level, the cumulated CLV helps to determine the financial value of a company (McCarthy and Fader 2018).

The Pareto/Negative Binomial Distribution (Pareto/NBD) model was introduced by Schmittlein et al. (1987) and is one of the most acknowledged and cited CBA models in the literature. Its performance is highly regarded among researchers (see e.g. Fader et al. 2005a; Gupta et al. 2006; Batislam et al. 2007). In the early 2000s, the major point of criticism of the Pareto/NBD model was its computational complexity despite its rather simple mathematical assumptions (Fader and Hardie 2005; Jain and Singh 2002). This complexity mainly arises from the large number of hypergeometric functions that need to be calculated. In addition, determining the maximum likelihood estimates (MLE) often led to numerical problems (Ma and Liu 2007; Hoppe and Wagner 2010). This was reflected by the fact that only very few empirical validations had been performed at that time (Batislam et al. 2007). Technological progress has eased this issue in several respects. Not only have processing speed and methods improved considerably, but predefined functions and packages in publicly available software such as R (R Core Team 2020) also reduce the individual effort required to apply such a model and enable its use as a benchmark model (Fader et al. 2005a; Jerath et al. 2011; Bemmaor and Glady 2012; Platzer and Reutterer 2016).

Over the past 10–15 years, the concept of Markov Chain Monte Carlo (MCMC) algorithms has taken root in CBA (Ma and Liu 2007; Abe 2009; Singh et al. 2009; Ma and Büschken 2011; Schweidel et al. 2014; Platzer and Reutterer 2016). Although their mathematical basis is more complex than the calculation of the MLE, these algorithms enable a new type of data analysis (Paap 2002) offering three main advantages. Firstly, MCMC procedures provide not only point estimates (i.e. the mode) of the model parameters but the entire posterior distribution of the parameters, which also allows the mean, median, or standard deviation to be determined. Secondly, these procedures can estimate individual level parameters in addition to describing the heterogeneity distribution across customers. Thirdly, it is possible to augment the timing of the individual dropouts, which gives new insights into the unobserved customer lifetime.

MCMC methods have been applied to the Pareto/NBD model using three different algorithms or algorithm frameworks, namely those by Abe (2009), Ma and Liu (2007), and Singh et al. (2009). However, the extant literature on the parameter estimation of this model is incomplete in three significant aspects. Firstly, prior research has not performed a comparison between the different MCMC algorithms or with MLE. Though MCMC algorithms provide much richer information about the model parameters than MLE, it is unclear if these algorithms perform better regarding parameter recovery or forecast accuracy and are thus worth the additional implementation effort. Secondly, the influence of the (hyper) prior distributions, i.e. the distributions of the heterogeneity parameters r, α, s, and β, is left unconsidered in all three implementation methods. Robert (2007) states that the (hyper) prior distributions are the key to Bayesian inference and their determination is thus the most important step in the MCMC procedure. However, none of the authors who introduced an MCMC algorithm for the Pareto/NBD model has addressed this issue. Singh et al. (2009) use a \({\varGamma} (5,5)\) distribution for each of the heterogeneity parameters without further elaboration, whereas Abe (2009) and Ma and Liu (2007) do not state which hyper parameters they employed at all. As the Pareto/NBD model is mostly used for cohorts of new customers or products and thus for small data sets, the choice of hyper distributions may be crucial for the analysis and inferences (Edwards et al. 1963). Thirdly, to the best knowledge of the authors, there exists no research on minimal data set requirements for the application of the Pareto/NBD model with MCMC. Two studies (Schmittlein and Peterson 1994; Hoppe and Wagner 2010) performed such examinations but referred to MLE only.

In this article, we contribute to extant literature by addressing these neglected issues. We use simulated data sets with known underlying parameter values to assess the hyper parameter sensitivity as well as the recovery and forecast quality of the different estimation procedures. In study 1, we compare the different MCMC algorithms (partially based on the R package BTYDplus (Platzer 2016)) with each other as well as with MLE. In order to systematically analyse the parameter estimates, especially with respect to the prior distribution sensitivity, we choose the simply structured parameter space of Fader et al. (2005), which provides three equally distributed values for each heterogeneity parameter. The results show that MCMC, and in particular Abe’s algorithm, outperforms MLE. The diagnosed superiority of Abe’s MCMC algorithm leads to the question whether existing data set restrictions for the practical application of the Pareto/NBD model in the literature (Schmittlein and Peterson 1994; Hoppe and Wagner 2010) can be relaxed as these refer to MLE. In study 2, we address this question by replicating the simulation study of Hoppe and Wagner (2010) and additionally incorporating Abe’s algorithm. As this research question focusses on the data set properties, the parameter space for this simulation study is based on behavioural characteristics, leading to a more complex set of heterogeneity parameters. Both studies require preliminary work on the choice of hyper parameters, which gives us valuable insights into this mostly disregarded but crucial aspect of MCMC in CBA.

The remainder of this paper is organised as follows. In Sect. 2, we recall the Pareto/NBD model assumptions, describe the properties of the different MCMC algorithms, and introduce the measures we employ for their comparison. The explicit procedures, probabilities, and distributions are outlined in detail in the Technical Appendix (A). In Sect. 3, we compare the different MCMC algorithms with each other and with MLE as outlined above. In Sect. 4, we analyse the MCMC performance on different data set sizes to derive minimal requirements for cohort sizes and the length of the calibration period. In Sect. 5, we summarise and discuss our results and derive implications for researchers and practitioners alike.

2 The Pareto/NBD model

2.1 Model assumptions

The Pareto/NBD model is based on the following assumptions (Schmittlein et al. 1987). To simplify the notation, we omit the customer index i in individual-level formulas. In addition, we explain the effects of different numerical values for each of the parameters.

  1.

    Each of the N customers goes through two stages. The “lifetime” with a firm starts with the initial purchase and ends at a non-observable time \({{\tau}} \), where the state of inactivity is non-reversible.

  2.

    While the customer is active, the time elapsed between two consecutive purchases follows an exponential distribution with parameter \({\lambda} \), i.e.

    $$\begin{aligned} f(t_{j}-t_{j-1})={\lambda }e^{-{\lambda }(t_{j}-t_{j-1})}, \end{aligned}$$
    (1)

    where tj denotes the time elapsed from the initial \((t_{0}=0)\) to the jth purchase. As \(E(t_{j}-t_{j-1})=\frac{1}{{\lambda }}\), the inter-purchase times tend to be shorter for larger values of \({\lambda} \), leading to a higher number of purchases. Therefore, an arbitrary customer with \({\lambda }=0.05\) has an average inter-purchase time of 20 time units (e.g. weeks) and, leaving the dropout process aside, we would expect this customer to make five purchases within 100 time units.

  3.

    The customer’s lifetime \({\tau} \) is exponentially distributed with parameter \({\mu} \), i.e.

    $$\begin{aligned} f({\tau} )={\mu} e^{-{\mu} {\tau} }. \end{aligned}$$
    (2)

    Comparable to the purchase process, \(E({\tau} )=\frac{1}{{\mu} }\), i.e. a small \({\mu} \) implies a longer lifetime with the company. A customer with \({\mu} =0.1\) would thus be expected to drop out after 10 time periods. Another perspective on the dropout process is the defection rate (DR). For an exponentially distributed lifetime \({\tau} \), DR is a constant given by

    $$\begin{aligned} DR=1-e^{-{\mu} }, \end{aligned}$$
    (3)

    which quantifies the probability that a customer will defect within one time unit. For example, the probability that a customer with \({\mu} =0.07\) defects within the next time unit is \(6.8\%\).

  4.

    The purchase rate \({\lambda} \) varies across customers and follows a gamma distribution with parameters r and \(\alpha \), i.e.

    $$\begin{aligned} g({\lambda }|r,\alpha )={\displaystyle \frac{\alpha ^{r}{\lambda }^{r-1}e^{-{\lambda }\alpha }}{\varGamma (r)}}. \end{aligned}$$
    (4)

    A direct behavioural understanding of the parameters \((r,\alpha )\) is very difficult. Instead, we can interpret the expected value and dispersion by considering that

    $$\begin{aligned} E({\lambda })=\frac{r}{\alpha },\quad \sigma _{{\lambda }}=\frac{\sqrt{r}}{\alpha },\quad CV_{{\lambda }}=\frac{1}{\sqrt{r}}, \end{aligned}$$
    (5)

    with \(\sigma _{{\lambda }}\) being the standard deviation and \(CV_{{\lambda }}\) the coefficient of variation (CV) for \({\lambda} \). This implies that, e.g. for \(r=0.5\) and \(\alpha =10\), the average purchase rate \(E({\lambda })\) is 0.05 and \({\lambda} \) varies on average by \(1.41=141\%\) within the cohort.

  5.

    Heterogeneity in \({\mu} \) follows a gamma distribution with parameters s and \(\beta \), i.e.

    $$\begin{aligned} g({\mu} |s,\beta )={\displaystyle \frac{\beta ^{s}{\mu} ^{s-1}e^{-{\mu} \beta }}{\varGamma (s)}}. \end{aligned}$$
    (6)

    Similar to the purchase process, a direct interpretation of \((s,\beta )\) is unfeasible. Using

    $$\begin{aligned} E({\mu} )=\frac{s}{\beta },\quad \sigma _{{\mu} }=\frac{\sqrt{s}}{\beta },\quad CV_{{\mu} }=\frac{1}{\sqrt{s}}, \end{aligned}$$
    (7)

    allows us to infer that, e.g. for \(s=2\) and \(\beta =20\), the average dropout parameter \(E({\mu} )\) is 0.1 and that it varies by approximately \(0.71=71\%\) within the cohort (see the numerical sketch at the end of this subsection).

  6.

    \({\lambda} \) and \({\mu} \) vary independently across customers.

The posterior distributions of the individual and heterogeneity parameters are derived in A2.
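To make the behavioural interpretation of the parameters concrete, the following R snippet reproduces the numerical examples above directly from formulas (3), (5), and (7). It is purely illustrative; the parameter values are the ones used in the assumptions and carry no further meaning.

  lambda <- 0.05                  # individual purchase rate (assumption 2)
  1 / lambda                      # mean inter-purchase time: 20 time units

  mu <- 0.07                      # individual dropout rate (assumption 3)
  1 - exp(-mu)                    # defection rate per time unit, formula (3): approx. 0.068

  r <- 0.5; alpha <- 10           # purchase heterogeneity (assumption 4)
  c(mean = r / alpha, sd = sqrt(r) / alpha, cv = 1 / sqrt(r))    # formula (5)

  s <- 2; beta <- 20              # dropout heterogeneity (assumption 5)
  c(mean = s / beta, sd = sqrt(s) / beta, cv = 1 / sqrt(s))      # formula (7)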

2.2 MCMC algorithms for the Pareto/NBD model

The extant literature provides three different MCMC algorithms for the Pareto/NBD model. We present each of them in 2.2.1 to 2.2.3 and discuss their technical differences in 2.2.4.

2.2.1 Ma and Liu (2007)

Ma and Liu (2007) generate the individual parameter estimates \(\{\hat{{\lambda }}_{i},\hat{{\mu} }_{i}\}\) as well as the heterogeneity parameter estimates \((\hat{r},\hat{\alpha },\hat{s},\hat{\beta })\) by applying a Gibbs sampler. Since only the posteriors of \({\alpha }\) and \({\beta }\) reduce to a known distribution, we draw from the combined posterior distributions (22)–(25) by using a slice sampling routine (Neal 2003) to simulate the heterogeneity parameters. For our study, we use a slice sampling routine based on the R package BTYDplus (Platzer 2016). To reduce the computational cost, the slice sampling is outsourced to C++ using the Rcpp package (Eddelbuettel and François 2011; Eddelbuettel 2013).
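For readers unfamiliar with slice sampling, the sketch below shows a generic univariate slice sampler with stepping-out and shrinkage as described by Neal (2003). It is not the BTYDplus/Rcpp implementation used in our study; log_f stands for the log of the respective full conditional (e.g. of r given \(\{{\lambda }_{i}\}\) and \(\alpha \)), and the tuning constants w and m are illustrative.

  slice_sample_1d <- function(x0, log_f, w = 1, m = 20) {
    # for positive parameters, log_f should return -Inf outside the support
    log_y <- log_f(x0) - rexp(1)          # log of a uniform height under f(x0)
    # stepping out: place an interval of width w around x0 and expand it
    L <- x0 - runif(1) * w
    R <- L + w
    J <- floor(runif(1) * m)
    K <- (m - 1) - J
    while (J > 0 && log_f(L) > log_y) { L <- L - w; J <- J - 1 }
    while (K > 0 && log_f(R) > log_y) { R <- R + w; K <- K - 1 }
    # shrinkage: sample uniformly on (L, R), shrinking towards x0 on rejection
    repeat {
      x1 <- runif(1, L, R)
      if (log_f(x1) > log_y) return(x1)
      if (x1 < x0) L <- x1 else R <- x1
    }
  }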

2.2.2 Abe (2009)

Abe (2009) also generates individual and heterogeneity parameter estimates, but additionally simulates the unobserved individual dropout times \(\{{\tau} _{i}\}\) by applying a data augmentation technique (Tanner and Wong 1987). He first creates a latent indicator variable \(z_{i}\), which specifies whether customer i is still active at \(T_{i}\), by using formula (17) for \(P\left( alive\right) \). Depending on the aliveness status, he draws \(\hat{{\tau} }_{i}\) from the appropriate distribution shown in Figs. 1 and A3.1. The specific value of \({\tau} _{i}\) simplifies the posterior distributions for \(\{{\lambda }_{i}\}\) and \(\{{\mu} _{i}\}\) according to A2. Therefore, we can draw the individual parameters straight from a gamma distribution using a standard random number generator. Abe’s version of the posterior distribution for \(\{{\mu} _{i}\}\) noted in (19) is conditional on being alive rather than conditional on the specific value of \({\tau} _{i}\). We therefore implement a slightly different version by using \(\varGamma (s+1,\beta +{\tau} _{i})\) from formula (21) in both cases of the aliveness status in order to consider all information available. As the data augmentation has no effect on the heterogeneous likelihood, we need to perform the same slice sampling technique for \(\{r,\alpha ,s,\beta \}\) as in the algorithm introduced by Ma and Liu. Abe’s algorithm in the described form is also coded in the BTYDplus package (Platzer 2016).
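The following sketch illustrates one data-augmentation sweep for a single customer in the spirit of Abe’s algorithm, using the standard conditional forms for the Pareto/NBD individual-level quantities. The exact expressions are given in A2 and A3.1, so the formulas below should be read as an illustration rather than a reference implementation; x, t.x, and T.cal denote the number of repurchases, the time of the last purchase, and the length of the calibration period.

  draw_individual <- function(x, t.x, T.cal, lambda, mu, r, alpha, s, beta) {
    # 1) latent aliveness indicator z drawn from P(alive), cf. formula (17)
    p_alive <- 1 / (1 + mu / (lambda + mu) *
                      (exp((lambda + mu) * (T.cal - t.x)) - 1))
    z <- rbinom(1, 1, p_alive)
    # 2) augment the dropout time tau
    if (z == 1) {
      tau <- T.cal + rexp(1, rate = mu)   # still alive: tau lies beyond T.cal
    } else {
      # defected within (t.x, T.cal): inverse-CDF draw from an exponential
      # with rate lambda + mu truncated to this interval
      rho <- lambda + mu
      u <- runif(1)
      tau <- t.x - log(1 - u * (1 - exp(-rho * (T.cal - t.x)))) / rho
    }
    # 3) conjugate updates of the individual rates given tau
    lambda_new <- rgamma(1, shape = r + x, rate = alpha + min(tau, T.cal))
    mu_new     <- rgamma(1, shape = s + 1, rate = beta + tau)   # formula (21)
    c(lambda = lambda_new, mu = mu_new, tau = tau, z = z)
  }

A full iteration would then update the heterogeneity parameters \((r,\alpha ,s,\beta )\) with the slice sampling step described above.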

2.2.3 Singh et al. (2009)

Singh et al. (2009) take a different approach in their MCMC procedure. The major difference from the other algorithms is that they dispense with the individual parameters \(\{{\lambda }_{i},{\mu} _{i}\}\). They regard the NBD-distributed counting process and the Pareto-distributed dropout process separately. The full conditional distributions in A2 require values for the dropout times \(\{{\tau} _{i}\}\). The candidates for this individual lifetime are drawn from a truncated Pareto-II-distribution with parameters s and \(\beta \). A candidate \({\tau} _{i}^{(c)}\) is accepted with probability (30) from A3.3. The heterogeneity parameters \(r,\alpha ,s,\) and \(\beta \) are drawn from (26)–(29) using a slice sampling routine.

2.2.4 Similarities and differences

Figure 1 shows a schematic overview of similarities and differences of the three MCMC algorithms.

Fig. 1 Comparison of the different MCMC procedures

Even though Singh et al. (2009) and Abe (2009) both use a data augmentation procedure, Singh et al.’s (2009) algorithm reveals some weaknesses in the direct comparison. The most obvious one is that they do not provide estimates for \(\{\hat{{\lambda }}_{i},\hat{{\mu} }_{i}\}\), which makes certain evaluations impossible on the individual level. Additionally, their data augmentation technique allows the estimated dropout times to stay constant over many iterations. In contrast, Abe’s augmented lifetimes inevitably change in every step, which leads to a more granular distribution. Lastly, their algorithm underperforms in terms of computational cost. Using data augmentation simplifies the posterior distributions and thus generally speeds up the estimation. However, the absence of the individual parameter level causes higher data input requirements in the slice sampling step, which is very time-consuming.

Ma and Liu (2007) use the very laborious slice sampling routine to estimate the individual parameters \(\{\hat{{\lambda }}_{i},\hat{{\mu} }_{i}\}\), and their algorithm does not provide lifetime estimates. These are two major disadvantages compared to the data augmentation procedures.

At this point, we can conclude that the conception of Abe’s algorithm is superior to the others as it combines all advantages concerning (1) individual parameter estimates, (2) augmented dropout dates, and (3) calculation time, while showing no weaknesses compared to the other methods.

When we think of the likelihood as a distribution, there are three measures of central tendency that we can use as point estimates, namely the mode, the mean, and the median. MLE is defined as the likelihood mode, whereas MCMC procedures also provide the mean and median of the posterior parameter distributions. We additionally examine which of these measures is the best point estimate regarding parameter recovery in our simulation study.

All MCMC algorithms in this study are performed using four chains with random initial values and 5,000 draws each, where the first 2,000 steps are defined as the burn-in. The MLE comparison values are determined by a routine from the BTYD package in R (Dziurzynski et al. 2014) which is based on the optim() function of the stats package (R Core Team 2020).
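To give an impression of the implementation effort, the following sketch shows how the estimations can be set up in R. The function names follow the BTYDplus and BTYD packages cited above, but argument names and defaults may differ between package versions, so the call details should be treated as an assumption and checked against the package documentation; cal.cbs denotes a customer-by-sufficient-statistic table with columns x, t.x, and T.cal.

  library(BTYDplus)   # MCMC samplers (Platzer 2016)
  library(BTYD)       # MLE routine (Dziurzynski et al. 2014)

  # Abe's algorithm: 4 chains, 5,000 draws each, first 2,000 as burn-in
  draws.abe <- abe.mcmc.DrawParameters(cal.cbs, mcmc = 5000, burnin = 2000,
                                       chains = 4)

  # Pareto/NBD sampler without data augmentation as an analogue of
  # Ma and Liu's Gibbs/slice sampling approach
  draws.ml <- pnbd.mcmc.DrawParameters(cal.cbs, mcmc = 5000, burnin = 2000,
                                       chains = 4, use_data_augmentation = FALSE)

  # MLE benchmark based on optim()
  mle <- pnbd.EstimateParameters(cal.cbs)   # returns (r, alpha, s, beta)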

2.3 Performance metrics

Let \(\theta _{i}=\left( r,\alpha ,s,\beta ,E({\lambda })=\frac{r}{\alpha },E({\mu} )=\frac{s}{\beta }\right) _{i}\) denote the vector of the true parameter values and \(\hat{\theta }_{i,k}\) the parameter point estimate (mean, median, or MLE) for customer i (\(i=1,\ldots,N\)) of data set k (\(k=1,\ldots,K\)). We analyse the recovery of the heterogeneity parameter estimates \((\hat{r},\hat{\alpha },\hat{s},\hat{\beta })\) of our simulated data sets by using the mean percentage error (MPE) and the mean absolute percentage error (MAPE), which are defined as:

$$\begin{aligned}&MPE\left( \hat{\theta }\right) ={\displaystyle \frac{100}{K\cdot N}\cdot \sum _{k=1}^{K}\sum _{i=1}^{N}\frac{\hat{\theta }_{i,k}-\theta _{i,k}}{\theta _{i,k}}}, \end{aligned}$$
(8)
$$\begin{aligned}&MAPE\left( \hat{\theta }\right) ={\displaystyle \frac{100}{K\cdot N}\cdot \sum _{k=1}^{K}\sum _{i=1}^{N}\frac{\left| \hat{\theta }_{i,k}-\theta _{i,k}\right| }{\theta _{i,k}}.} \end{aligned}$$
(9)

For the performance comparison of the recovery of the individual parameters \(\{{\lambda }_{i}\}\) and \(\{{\mu} _{i}\}\), the MPE and MAPE are unsuitable measures as the parameter values may be very close to zero and therefore produce infinite or undefined M(A)PE values. Hence, we draw on the concept of the mean arctangent absolute percentage error (MAAPE) instead (Kim and Kim 2016). The MAAPE is defined as follows:

$$\begin{aligned} MAAPE\left( \hat{\theta }\right) ={\displaystyle \frac{100}{K\cdot N}\cdot \sum _{k=1}^{K}\sum _{i=1}^{N}arctan\left( \frac{\left| \hat{\theta }_{i,k}-\theta _{i,k}\right| }{\theta _{i,k}}\right) }. \end{aligned}$$
(10)

The arctan(x) function is bounded by \(\left[ 0,\frac{\pi }{2}\right] \) for non-negative arguments, which makes the MAAPE robust against very small parameter values as well as estimate outliers.

Abe and Singh et al. provide augmented values for the individual dropout times \(\{\hat{{\tau} }_{i}\}\), which we use to derive information on the activity status of an individual customer that differs from what P(alive) provides. We define the survival accuracy SA as

$$\begin{aligned} SA=\frac{100}{K\cdot N}\cdot \sum _{k=1}^{K}\sum _{i=1}^{N}I_{\{(T_{i}-{\tau} _{i})\cdot (T_{i}-\hat{{\tau} }_{i})>0\}}, \end{aligned}$$
(11)

with

$$\begin{aligned} I_{\{(T_{i}-{\tau} _{i})\cdot (T_{i}-\hat{{\tau} }_{i})>0\}}={\left\{ \begin{array}{ll} 1, &{} (T_{i}-{\tau} _{i})(T_{i}-\hat{{\tau} }_{i})>0\\ 0, &{} (T_{i}-{\tau} _{i})(T_{i}-\hat{{\tau} }_{i})\le 0 \end{array}\right. }, \end{aligned}$$
(12)

where the indicator function equals one if and only if the median estimate \(\hat{{\tau} }_{i}\) correctly indicates whether a customer has defected before \(T_{i}\).

The analysis of the future number of individual purchases \(x_{i}^{*}\) requires the mean absolute error (MAE) because the true value may be zero, which leads to undefined terms in the MPE, MAPE, and MAAPE. It is defined as

$$\begin{aligned} MAE\left( \hat{x^{*}}\right) ={\displaystyle \frac{1}{K\cdot N}\cdot \sum _{k=1}^{K}\sum _{i=1}^{N}\left| \hat{x^{*}}_{i,k}-x_{i,k}^{*}\right| .} \end{aligned}$$
(13)
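For reference, the error measures (8)–(13) can be implemented in a few lines of R. The helper functions below operate on vectors of estimates and true values pooled over customers and data sets and are purely illustrative.

  mpe   <- function(est, true) 100 * mean((est - true) / true)            # (8)
  mape  <- function(est, true) 100 * mean(abs(est - true) / true)         # (9)
  maape <- function(est, true) 100 * mean(atan(abs(est - true) / true))   # (10)
  mae   <- function(est, true) mean(abs(est - true))                      # (13)

  # survival accuracy (11)-(12): share of customers whose augmented median
  # dropout time lies on the same side of T.cal as the true dropout time
  survival_accuracy <- function(tau_hat, tau_true, T.cal) {
    100 * mean((T.cal - tau_true) * (T.cal - tau_hat) > 0)
  }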

3 Study 1: Comparison of the MCMC procedures

3.1 Data set generation for study 1

Our parameter space for creating the synthetic data sets of study 1 is based on Fader et al. (2005) and uses \(r,s\in \{0.25,0.5,0.75\}\) and \(\alpha ,\beta \in \{5,10,15\}\), allowing \(3^{4}=81\) combinations of these parameter values. To reduce the computational cost, we choose a random sample of 500 with replacement from these 81 combinations. This implies an average number of \(\frac{500}{3}=166.7\) data sets for the analysis of a single underlying heterogeneity parameter value. As \(E({\lambda })=\frac{r}{\alpha }\) and \(E({\mu} )=\frac{s}{\beta }\) rely on two parameters each, their average number of data sets reduces to \(\frac{500}{3^{2}}=55.6\). To simulate the individual purchase and dropout rate of the customers for each of these 500 data sets, we use the integrated random number generator in R and draw a sample of 1,500 individual values for \(\{{\lambda }_{i}\}\) and \(\{{\mu} _{i}\}\) from \(\varGamma (r,\alpha )\) and \(\varGamma (s,\beta )\) respectively. For each data set and customer, we then draw one realisation from \(Exp({\mu} _{i})\) to obtain the lifetime \(\{{\tau} _{i}\}\) of the individual customers by using the integrated random number generator for the exponential distribution. Lastly, we simulate an initial purchase time for each customer as a fraction of the first time unit and generate successive inter-purchase times by drawing random values from \(Exp({\lambda }_{i})\) until the individual lifetime \(\{{\tau} _{i}\}\) is exceeded. We simulate these transaction data for a total period of 104 time units, of which 78 are used as the calibration period and the remaining 26 as the forecast period, corresponding to 1.5 years and 6 months on a weekly basis. To reduce the computational effort, we keep these values as well as the cohort size of 1,500 fixed throughout study 1.
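The following sketch outlines this data-generating process for a single synthetic data set; the chosen parameter combination is arbitrary and the seed is set only for reproducibility.

  set.seed(1)
  r <- 0.5; alpha <- 10; s <- 0.5; beta <- 10   # one of the 81 combinations
  N <- 1500; T.total <- 104; T.cal <- 78        # cohort size, horizon, calibration

  lambda <- rgamma(N, shape = r, rate = alpha)  # individual purchase rates
  mu     <- rgamma(N, shape = s, rate = beta)   # individual dropout rates
  tau    <- rexp(N, rate = mu)                  # individual lifetimes

  simulate_customer <- function(lambda_i, tau_i, T.total) {
    first <- runif(1)                           # initial purchase within week 1
    purchases <- first
    t <- first
    repeat {
      t <- t + rexp(1, rate = lambda_i)
      if (t - first > tau_i || t > T.total) break
      purchases <- c(purchases, t)
    }
    purchases                                   # timestamps; split at T.cal later
  }
  transactions <- lapply(seq_len(N), function(i)
    simulate_customer(lambda[i], tau[i], T.total))

The purchase timestamps up to T.cal form the calibration period and the remaining ones the forecast period.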

3.2 Comparison of the hyper parameters

The application of MCMC procedures requires the choice of a prior distribution for the heterogeneity parameters \((r,\alpha ,s,\beta )\). This means that we need to choose a distribution type and specify its parameters (“hyper parameters”). We perform the MCMC algorithms with different hyper parameters to learn about the sensitivity of the parameter recovery. Using gamma distributions \(\varGamma (h_{1},h_{2})\) as the conjugate prior, we need to specify their expected values and CV given by \(\frac{h_{1}}{h_{2}}\) and \(h_{1}^{-0.5}\), respectively. As each of the heterogeneity parameters \((r,\alpha ,s,\beta )\) can take three values, we choose the middle one as the expected value of the prior distribution, i.e. \(E(r)=E(s)=\frac{h_{1}}{h_{2}}=0.5\) and \(E(\alpha )=E(\beta )=\frac{h_{1}}{h_{2}}=10\). This enables us to analyse the effect of the expectation of \(\varGamma (h_{1},h_{2})\) being set too large (for \(r,s=0.25\) and \(\alpha ,\beta =5\)), too small (for \(r,s=0.75\) and \(\alpha ,\beta =15\)), or accurate (for \(r,s=0.5\) and \(\alpha ,\beta =10\)). We simultaneously vary the coefficient of variation (CV) of \(\varGamma (h_{1},h_{2})\) between 0.01 and 2.
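Concretely, a desired prior mean and CV translate into the gamma hyper parameters \((h_{1},h_{2})\) as shown below; the values in the example calls correspond to the central parameter values of study 1 and a CV of 0.5.

  # E = h1 / h2 and CV = h1^(-1/2)  =>  h1 = CV^(-2), h2 = h1 / E
  hyper_gamma <- function(prior_mean, prior_cv) {
    h1 <- prior_cv^(-2)
    h2 <- h1 / prior_mean
    c(shape = h1, rate = h2)
  }
  hyper_gamma(prior_mean = 0.5, prior_cv = 0.5)   # prior for r and s
  hyper_gamma(prior_mean = 10,  prior_cv = 0.5)   # prior for alpha and beta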

Figure 2 shows the MAPE of all four heterogeneity parameters for all three MCMC procedures based on their median draw, both averaged over the 500 data sets (bold red line) and separated by their underlying parameter value. Most plots for the total MAPE show a slight U-shape. The numeric values suggest an optimal CV of 0.5 for the algorithms of Abe and Singh et al. for the given parameter space. The MAPE curves for the low (blue dotted line) and high (green dashed line) parameter values illustrate that an overestimated prior mean causes higher deviations than an underestimated one, in particular when combined with a small CV. Thus, in case of uncertainty about the choice of hyper parameters when employing the algorithms of Abe or Singh et al., it is recommended to combine a smaller expected value with a larger dispersion. The estimates of Ma and Liu, however, show a high sensitivity towards both the mean and the CV of the hyper distribution. Considering the different scaling of their plots on both axes because of their high MAPE values, their algorithm only generates comparatively good estimates when using highly informative prior distributions. For the further comparison with the other procedures, we use \(CV=0.1\) when estimating parameters for Ma and Liu’s algorithm, as this value provides the smallest MAPE for all four heterogeneity parameters.

In addition to the conjugate prior distribution, we also apply uninformative hyper prior distributions based on Jeffreys (1946). However, as they turn out to be less accurate than a gamma distribution, they will not be considered in the further analysis. To decide between the mean and median draw as the best fitting point estimate, we compare their MPE and MAPE values. The median outperforms the mean in all respects, especially for larger CV values as these promote outliers. For the further analysis, we therefore restrict ourselves to the median draw whenever a point estimate is required and compare the performance metrics for the different MCMC procedures with their respective optimal hyper parameters.

Fig. 2 MAPE of the heterogeneity parameters for the different hyper distributions of study 1

3.3 Comparison of the MCMC procedures

3.3.1 Recovery of the heterogeneity parameters

Table 1 reports the MPE and MAPE values for \(\theta \) using the optimal hyper distribution of each algorithm as derived above. As Fig. 2 already suggested, Ma and Liu’s procedure generates strongly overestimated values because the given parameter space is too broad to satisfy the need for very informative priors. Singh et al.’s results are very similar to Abe’s in the absolute deviation but show an underestimation in the dropout process parameters s and \(\beta \). The MLE routine performs well for the purchase process but produces high deviations in the dropout process. We can therefore conclude that Abe’s algorithm is clearly preferable to all other estimation procedures.

Table 1 MPE and MAPE for \(\theta \)

3.3.2 Recovery of the individual parameters

As Singh et al. and MLE do not provide any individual parameter information, we can only compare the recovery of the individual parameters \(\{{\lambda }_{i},{\mu} _{i}\}\) for Abe’s and Ma and Liu’s algorithms. Table 2 reports the MAAPE values, which were calculated in total as well as separated by the number of repurchases in the calibration period. We should consider that very small parameter values promote higher relative deviations and therefore increase the MAAPE. The results show that the MAAPE for \(\{{\lambda }_{i}\}\) decreases with an increasing number of purchases because a higher number of repurchases gives us more information on the individual \({\lambda} \) and also implies larger values for \({\lambda} \). In contrast, the estimation of \({\mu} \) is subject to two opposing effects. A higher number of repurchases gives more indirect information on the lifetime. Simultaneously, it also suggests a later dropout and thus a lower value of \({\mu} \), which promotes larger values for the MAAPE. These effects cause decreasing MAAPE values for \(\{{\lambda }_{i}\}\) and a U-shaped curve for \(\{{\mu} _{i}\}\). As the M(A)PE values for \(E({\lambda })\) and \(E({\mu} )\) in Table 1 already suggested, Abe’s individual estimates outperform Ma and Liu’s in the MAAPE. This holds in particular for two domains that are hard to estimate, namely the purchase parameter \(\{{\lambda }_{i}\}\) for the cohort of single buyers and the entire dropout process, represented by \(\{{{\mu} _{i}}\}\).

Table 2 MAAPE for study 1, Abe with hyper CV 0.5, Ma and Liu with hyper CV 0.1

3.3.3 Survival accuracy

Abe and Singh et al. both provide augmented values for the individual lifetimes \(\{{\tau} _{i}\}\). The distribution of the survival accuracy (11) over the 500 data sets is box-plotted in Fig. 3 and shows a significantly higher rate for Abe’s algorithm than for Singh et al.’s. In Fig. 4, we compare the actual activity status with the median draw of \(\{{\tau} _{i}\}\) for all customers of all data sets. According to the plotted values, Abe’s algorithm performs notably better at correctly classifying the churned customers. The reason for this can be attributed to the different simulation routines for \(\{{\tau} _{i}\}\), which are described in A3.1 and A3.3. As Singh et al. use an acceptance probability, the values for \(\{{\tau} _{i}\}\) tend to remain unchanged for many simulation steps, leading to a less accurate distribution and thus to a smaller survival accuracy.

Fig. 3 Survival accuracy

Fig. 4 Survival cross tables (in %)

3.3.4 Forecast accuracy

In our final analysis of study 1, we compare the forecast accuracy of the four algorithms. Corresponding to a length of 6 months on a weekly basis, we calculate \(E(x^{*})\) for our forecast period of 26 time units using three different methods. These methods depend on the available parameters and are described in A4. Method 1 solely requires the heterogeneity parameters \((r,\alpha ,s,\beta )\) and can hence be applied to all procedures. Method 2 is the individual expectation conditional on the customer still being active. As it requires values for \(\{{\lambda }_{i}\}\), \(\{{\mu} _{i}\}\), and \(\{{\tau} _{i}\}\), it can only be applied to Abe’s algorithm. In method 3, the above condition is removed by multiplying the individual expectations with P(alive) and thus it only requires values for \(\{{\lambda }_{i}\}\) and \(\{{\mu} _{i}\}\). Therefore, it can be applied to the procedures of Abe and Ma and Liu. We conjecture that considering individual level parameters in methods 2 and 3 should lead to better forecasts for an individual customer, whereas the use of the heterogeneous estimates in method 1 should yield a better forecast on the aggregated level.
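As an illustration of the individual-level logic, the sketch below computes a method-3-type forecast by multiplying the expected number of purchases of an active customer over the next T* time units, \(\frac{\lambda }{\mu }(1-e^{-\mu T^{*}})\), with P(alive). These closed forms are the standard results for exponential purchase and lifetime processes; the exact expressions used in our study are those of A4.

  p_alive <- function(lambda, mu, t.x, T.cal) {
    1 / (1 + mu / (lambda + mu) * (exp((lambda + mu) * (T.cal - t.x)) - 1))
  }
  forecast_method3 <- function(lambda, mu, t.x, T.cal, T.star = 26) {
    exp_if_active <- lambda / mu * (1 - exp(-mu * T.star))   # expected purchases if active
    p_alive(lambda, mu, t.x, T.cal) * exp_if_active          # unconditional expectation
  }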

Table 3 reports different measures for the forecast accuracy. The individual MAE is defined in (13) and shows the mean deviation over all customers and data sets on the individual level. The aggregated MPE and MAPE refer to the total sum of future purchases of all customers in a data set. These forecast metrics reveal three effects. Firstly, small individual MAE values do not necessarily imply a small M(A)PE on the accumulated level. As the purchases of zero buyers cannot be underestimated, a general underestimation leads to a small individual-level MAE for zero buyers but to a higher deviation in the accumulated purchases. This effect is very distinct here as 80.3% of the customers over all data sets made no purchase in the forecast period. Secondly, Ma and Liu’s overestimation of \(E({\mu} )\) (see Table 1) implies shorter lifetimes and hence fewer future purchases, leading to the previously described effect of small individual MAE values but large M(A)PE values. Thirdly, we can observe the expected effect that individual level parameters perform well when considering the individual purchase behaviour whereas heterogeneous estimates do better on the aggregated level. The reason for this is a “smoothing” effect which prevents extremely small values for \(x^{*}\). For Abe’s algorithm, for example, 54.3% of all customers have a forecast of less than 0.01 purchases when using the individual method 3, whereas method 1 predicts less than 0.01 purchases for only 1.4% of all customers. We can conclude that the choice of the best forecast method depends on its purpose. Forecasts on the cohort level should be made by using method 1 either with Abe’s MCMC algorithm or MLE. To obtain the most accurate individual forecast, an individual-level method and thus MCMC is required; in this case, Abe’s algorithm slightly outperforms Ma and Liu’s.

Table 3 Purchase forecast accuracy for the different procedures and formulas

3.4 Summary of study 1

The inferences we can draw from study 1 regarding the best Pareto/NBD procedure are very clear. Abe’s algorithm is the only one that provides the full range of estimate values and can hence be applied to all covered analyses. Across all performance metrics examined, it is at least on par with the other procedures and mostly outperforms them. In addition, it requires considerably less computing time than the other two algorithms. Furthermore, we showed that the augmented values of \(\{{\tau} _{i}\}\) are a valuable addition to the parameter estimates. Concerning the sensitivity analysis, we have shown that the choice of the hyper parameters has a crucial influence on the parameter estimation.

To investigate the data set requirements for applying the Pareto/NBD model in study 2, we will limit ourselves to the superior algorithm of Abe and compare it with MLE as a benchmark.

4 Study 2: Data set requirements

In study 2, we replicate the simulation study of Hoppe and Wagner (2010) and examine whether the minimal data set requirements they derived for MLE can be relaxed when using Abe’s MCMC procedure.

4.1 Data set generation for study 2

The simulation framework developed by Hoppe and Wagner (2010) is based on behavioural characteristics. As these cannot directly be mapped by the heterogeneity parameters, the authors define the average purchase frequency, the average dropout rate, and their dispersions within the cohort and translate them back into values for \((r,\alpha ,s,\beta )\) using the formulas (3), (5), and (7). Table 4 shows the behavioural characteristics used for study 2. The \(3^{4}=81\) combinations of their values result in the heterogeneity parameter space as noted in Table 5.

For each of the 81 scenarios, Hoppe and Wagner (2010) created 100 synthetic data sets for cohort sizes of \(N\in \{250,500,750,1000,1250,1500\}\) and calibration periods of \(T_{cal}\in \{12,18,24,30\}\) time units. Since the parameter estimation with a MCMC procedure requires a multiple of time and storage space compared to MLE, we reduce the sample size of the synthetic data sets to 30 replications for each scenario and perform both MLE and Abe’s MCMC algorithm.

Similar to study 1, we randomly generate 1,500 individual parameters \(\{{\lambda }_{i},{\mu} _{i}\}\) for each data set of each combination of \((r,\alpha ,s,\beta )\) and simulate the individual lifetimes and purchase profiles. For smaller cohort sizes \(N<1,500\), we only consider the first N customers and vary the cut-off dates for shorter calibration periods of \(T_{cal}<30\).

Table 4 Behavioural characteristics and derived parameter values
Table 5 Derived parameter space

4.2 Technical limitations

The number of data sets that cannot be estimated with the MLE routine from the BTYD package (Dziurzynski et al. 2014) increases with a decreasing number of customers and calibration period length. In total, 17.6% of all data sets either cannot be estimated or contain parameter estimates equal to the upper boundary defined in the optimisation routine and are therefore regarded as illegitimate estimates. Hence, we additionally apply the solnl() optimiser from the NlcOptim package (Chen and Yin 2019). As the optim() function tends to struggle with small values of \(E({\mu} )\) and solnl() with large values of \(E({\mu} )\), we are able to reduce the ratio of missing or illegitimate estimates to 7.6%. In cases where both routines produce results, we choose the one with the higher log-likelihood value. Hoppe and Wagner (2010) reported similar estimation problems in their study but gave no detailed information on their fail ratio.
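The two-stage MLE procedure can be sketched as follows. The call to solnl() from the NlcOptim package minimises the negative log-likelihood provided by BTYD; the starting values, bounds, and failure checks shown here are illustrative choices rather than those of the original computations, and argument details may vary between package versions.

  library(BTYD)
  library(NlcOptim)

  fit_pnbd_mle <- function(cal.cbs, max.param.value = 10000) {
    ll <- function(params) pnbd.cbs.LL(as.numeric(params), cal.cbs)
    # routine 1: optim()-based estimation from the BTYD package
    est1 <- tryCatch(pnbd.EstimateParameters(cal.cbs,
                                              max.param.value = max.param.value),
                     error = function(e) NULL)
    if (!is.null(est1) && any(est1 >= max.param.value)) est1 <- NULL  # boundary hit
    # routine 2: solnl() on the negative log-likelihood
    est2 <- tryCatch({
      res <- solnl(X = c(1, 1, 1, 1), objfun = function(p) -ll(p),
                   lb = rep(1e-4, 4), ub = rep(max.param.value, 4))
      as.numeric(res$par)
    }, error = function(e) NULL)
    # keep the estimate with the higher log-likelihood; NULL if both fail
    cands <- Filter(Negate(is.null), list(est1, est2))
    if (length(cands) == 0) return(NULL)
    cands[[which.max(vapply(cands, ll, numeric(1)))]]
  }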

4.3 Hyper parameter sensitivity

For the analysis of the hyper parameter sensitivity of Abe’s MCMC algorithm, we use the mean of the parameter space in Table 5 as the prior mean and vary the CV of the heterogeneity parameters between 0.5 and 5. For each of the 81 parameter combinations, we apply these values to a sample of three data sets. Since the influence of the hyper parameters increases with a decreasing size of the data set (Edwards et al. 1963), we use reasonably small data set restrictions based on Hoppe and Wagner (2010) by specifying \(N=750\) and \(T_{cal}=18\) time units to reduce the computational effort of this analysis.

Figure 5 shows the MAPE of \(\hat{\theta }\) with respect to the different hyper parameters. The numeric values of the MAPE as well as the MPE reveal the smallest accumulated error for a CV of the heterogeneity parameters of 1, which we therefore apply to the whole data set framework. The parameter space in study 2 has a much wider range than in study 1, explaining this higher value of the hyper CV. However, this enhanced range of parameters also yields another effect. As we use the same hyper parameters for each parameter combination regardless of their true underlying values, we should be aware that the performance metrics we obtain in study 2 will be more conservative than in study 1.

Fig. 5 MAPE for the different hyper CVs of study 2

4.4 Parameter recovery

Table 6 compares the marginal MAPE values (i.e. the MAPE averaged over all parameter sets and data sets) of Abe’s algorithm with MLE (in italic) and shows their progression over the different numbers of customers and observation periods. MCMC outperforms MLE in the recovery of the true parameters in every respect, in particular in \(E({\mu} )\) for very small data sets with \(N\le 500\) and \(T_{cal}\le 18\) time units. This is mainly driven by the fact that MLE targets only the mode of the likelihood, which is challenging when dealing with very uninformative data sets. In contrast, MCMC considers the entire distribution and is thus less prone to extreme outliers in the case of very flat likelihood functions.

Further, Table 6 shows that the recovery of the purchase process parameters is more sensitive to the cohort size N than to the length of the calibration period \(T_{cal}\). In contrast, and in line with the results obtained by Hoppe and Wagner (2010), the estimates of the dropout process are highly dependent on both N and \(T_{cal}\).

Table 6 Marginal MAPE for the study 2 data sets

4.5 Minimal data set requirements

Hoppe and Wagner (2010) derived the minimal data set requirements solely based on a criterion for \(E({\lambda })\), disregarding the dropout process. Due to the low sensitivity of the purchase process to \(T_{cal}\), they could restrict their limit recommendations to specifications of N. They determined the 95% quantile of the MAPE for \(E({\lambda })\) for each combination of purchase frequency, dropout rate (as defined in Table 4), N, and \(T_{cal}\).

We cannot fully replicate the minimal requirements of Hoppe and Wagner (2010) when applying the thresholds to our MLE values. Though the MAPE of the MCMC estimates outperforms MLE, the difference is not large enough to allow a relaxation of their minimal requirements on the cohort size. On the contrary, we obtain even more restrictive data set limitations for medium and high purchase rates. Still, we enhance their data set specifications with a condition on \(E({\mu} )\), using the same method as for \(E({\lambda })\). Lewis (1982) defines thresholds for the interpretation of the MAPE. The limit of 10% used by Hoppe and Wagner (2010) for \(E({\lambda })\) corresponds to a highly accurate estimation. As the parameters of the dropout process are considerably more difficult to estimate, we use the MAPE threshold of 50% for a reasonably accurate estimation of \(E({\mu} )\). We apply these twofold thresholds to \(E({\lambda })\) and \(E({\mu} )\) with a type-I error of 5%. \(T_{cal}\) serves as a second dimension in the minimal requirements because of the high sensitivity of the \(E({\mu} )\) estimates to the length of the calibration period.
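The acceptance criterion behind these requirements can be summarised in a few lines of R; the column names of the hypothetical results data frame are illustrative.

  # a combination of N and T.cal qualifies if the 95% quantile of the MAPE over
  # its replications stays below 10% for E(lambda) and below 50% for E(mu)
  meets_requirements <- function(results, q = 0.95,
                                 limit_lambda = 10, limit_mu = 50) {
    quantile(results$mape_E_lambda, q) <= limit_lambda &&
      quantile(results$mape_E_mu, q) <= limit_mu
  }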

Table 7 shows the new, combined data set requirements that result from applying the \(E({\lambda })\) and \(E({\mu} )\) criteria to our MCMC estimates. These are based on the purchase and dropout rate defined in Table 4. The combinations of N and \(T_{cal}\) which are stated here use the lowest possible value of the cohort size. Hence, the requirements related to \(T_{cal}\) can be relaxed when N is increased.

Table 7 Enhanced data set requirements

5 Discussion and implications

Using MCMC algorithms for the Pareto/NBD model is a powerful tool and undoubtedly worth the additional implementation effort. However, the results from study 1 emphasise the necessity of a sensitivity analysis for the hyper parameters. The prior distributions that we derived in our studies might give a tentative idea of how in particular the CV of the heterogeneity parameters could be chosen. Still, our results do not replace the acquisition of prior information on the specific data sets. Regarding the choice of the hyper parameters, available information from different sources like experts, theories, or other data sets should be considered (Rossi and Allenby 2003).

Despite the application of two MLE routines, we received no results for 7.6% of the data sets, whereas the parameters of all data sets could be estimated with MCMC while simultaneously reducing the risk of extreme outliers. Therefore, using MCMC algorithms for CBA parameter estimation in practice is less error-prone and, when applied to a single data set only, still manageable in terms of calculation time.

We contribute to extant literature by demonstrating that the parameter recovery and forecasting accuracy of Abe’s algorithm is superior to other MCMC methods and to MLE. Moreover, it is the only procedure that generates the complete posterior distribution of the parameters on the individual as well as on the aggregate level and augments the unobserved individual dropout times \(\{{\tau} _{i}\}\). Thus, it provides the entire range of available information. In particular compared to MLE, the use of Abe’s algorithm thereby enriches the toolbox available to marketing management. Study 1 shows that using the individual parameters \(\{{\lambda }_{i},{\mu} _{i}\}\) rather than \(\{r,\alpha ,s,\beta \}\) on the aggregate level increases the accuracy of individual purchase forecasts. In addition, the provision of the full posterior distributions makes it possible to explicitly account for (forecasting) uncertainty through confidence intervals or other distributional measures.

The individual level parameter values \(\{{\lambda }_{i},{\mu} _{i}\}\) enable managers to identify customers with a high purchase rate and a short estimated lifetime (i.e. large \({\lambda }_{i}\) and \({\mu} _{i}\)) for individually targeted marketing activities like discount coupons or win-back campaigns. Moreover, the posterior distribution of these individual parameters can be used to calculate the distribution of the next customer purchase time, allowing management to use predefined quantiles for activity timing. In cases where point estimates are required, study 1 shows that the median draw of the posterior distribution outperforms the ML estimators. On the aggregate level, the posterior distributions can be used to enrich the calculation of the CLV by considering uncertainty through confidence intervals.

Regardless of whether the parameters are estimated using MLE or MCMC, the size of the available data set plays a decisive role for the practical application of the Pareto/NBD model. In study 2, we amend the extant minimal data set requirements of Hoppe and Wagner (2010), which are restricted to the required cohort size, by additionally considering a threshold based on the dropout process. Further, we expand the minimal requirements with a second dimension, including limitations for the minimal length of the calibration period.

Like any simulation-based research, there are limitations to the results of these studies. We were not able to fully replicate Hoppe and Wagner’s (2010) results and found generally stricter cohort size requirements for medium and high purchase frequencies. This may, to a certain extent, be caused by the smaller number of replications (30 instead of 100) that we used to reduce the computational effort. In addition, these requirements refer to perfectly Pareto/NBD distributed data sets of a predefined parameter space. The examination of their validity for data sets which violate these prerequisites is left for future research. Within our simulation studies, we applied identical prior distributions to all data sets irrespective of the true heterogeneity parameter values, which leads to more conservative values of the performance metrics. When employing an MCMC procedure on a single real data set, we can assume that the prior distribution may fit better. We thus propose replicating study 2 with data set specific hyper parameters in future research. This might allow relaxing the minimal data set requirements in case of more precise prior information. Furthermore, the influence of the hyper parameters should also be examined in the context of a decreasing data set size. Future research could also test the MCMC procedure not only against MLE but also against alternative parameter estimation approaches such as quantile-based procedures.