1 Introduction

Personalized marketing is a key strategy of modern database marketing that supports targeted recommendations, promotions and direct-mail campaigns in various business fields. The analysis of personalized marketing responses from retailer transaction data is challenging because of the fundamental sparsity of observed purchases. In fact, most customers purchase only a small fraction of the available products on any given shopping trip, and when a purchase is recorded in a category, it is frequently for just one offering. The actual sample of transactional data is therefore much smaller than the data space reflected by a data cube with dimensions corresponding to the numbers of customers, products and purchase occasions. Under such circumstances, standard marketing models of choice break down because of the high frequency of nonpurchase for almost every product.

Additionally, increasing the number of products in a traditional marketing model is problematic because of potential complexities in the structure of demand based on an orthodox economics model and the accompanying increase in the required number of model parameters. Existing models of choice and demand, for example, are typically limited to fewer than twenty or so product alternatives tracked across possibly hundreds of customers [9, 30]. Unfortunately, this limitation is often at odds with the goals of practitioners who want to optimize marketing promotions for a wide set of customers and products in their stores.

In this paper, we propose a method of personalized market response analysis that can treat widely diverse products. The method identifies effective marketing promotions of individual products to individual customers using sparse transaction data. To resolve the difficulty of data sparsity, we first compress the data space comprising customers and products into a reduced-dimensional latent class space. For the dimension reduction of customers and products, we propose a model that combines a latent variable model with a marketing model whose response parameters capture the effects of marketing variables such as discounts or promotions. Response parameters are introduced into the latent class by connecting each choice to its own marketing variable. Consequently, the parameters can be estimated stably because a sufficiently large sample size can be used. Then, we decompress the extracted associations back to individual customers using the estimated parameters of customers and products for personalization.

Our model identifies the latent class for each customer at each point in time, providing information related to the array of products that the customer is likely to purchase. This latent class is a key variable for constructing personalized information. In the spirit of market basket analysis in data mining, we make no a priori assumptions about substitute and complementary goods. Our model takes an exploratory approach to analysis: it does not test assumptions about the form of the utility function across hundreds of offerings. However, our model does include marketing variables so that their effects on choice can be measured and used for prediction.

The contributions of this paper are the following:

  • Proposal of a method for estimating individual market responses for widely diverse products.

  • Development of a marketing model that incorporates a latent variable model.

  • Identification of personalized effective marketing variables for widely diverse products from sparse supermarket transaction data.

Sections 2 and 3 review related work and preliminary material related to our method. We present the proposed method for personalized marketing for widely diverse products in Sect. 4. Section 5 presents an empirical study using real transaction data. Conclusions are offered in Sect. 6.

2 Related work

2.1 Marketing model for personalization

The marketing model of customer heterogeneity [1, 23] for choice behavior commonly studied in the marketing field uses the framework of hierarchical Bayes modeling [30]. Details are described in Sect. 3.1. Heterogeneity models can measure the effects of marketing promotions for individual customers explicitly as market response coefficients. The main purpose is to elucidate the richness of the competitive environment within a product category or brand. The models are constructed from marketing variables, parameters and structures grounded in economic concepts such as budget constraints, the presence of substitutes and complements, and utility functions. However, most advanced models entail a high computational cost for parameter estimation because the model structure expressing the process of customer purchase behavior in the style of economics tends to be complicated. Therefore, existing models of choice and demand, for example, are typically limited to fewer than twenty or so product alternatives.

Our model is similar to adaptive personalization systems proposed by [3, 8, 10, 31]. However, it differs in that our model structure facilitates analysis of widely various product categories.

2.2 Dimension reduction method

Many statistical and data-mining methods for dimension reduction have been assessed for transaction data analysis: traditional latent class analysis [13], correspondence analysis [14], self-organizing maps [38] and joint segmentation [29]. The benefit of such methods is that they can treat a large set of customer and product data to seek hidden patterns in a reduced-dimensional space. Tensor factorization [25, 39] can decompose a data cube of a large set of customers, products and time periods into scalable low-rank matrices to find hidden patterns related to customer behavior. However, such methods do not explicitly incorporate a marketing variable structure in the way marketing models do. Our method is designed to extract information related to individual customers’ responses to changing marketing variables directly.

2.3 Topic modeling and latent variables model

The topic model, a kind of latent variable model, is a generalization of a finite mixture model in which each data point is associated with a draw from a mixing distribution [35]. Models of voting blocs [12, 32] track the votes of legislators (aye or nay) across multiple bills, with each bill associated with a potentially different concern or issue. Similarly, the latent Dirichlet allocation (LDA) model [6] allocates words within documents to a few latent topics with patterns that are meaningful and interpretable. Each vote and each word are associated with a potentially different issue or topic. Therefore, the mixing distribution is applied to each individual vector of observations and not to the panelist’s entire set of observations (e.g., the series of votes by a legislator or the set of words by an author). In our analysis of household purchases, we allow the vector of observed purchases across all product categories on an occasion to be related to a different latent context (topic or issue). This allowance enables us to view a customer’s purchases as responding to different needs or occasions (e.g., family dinner, snacks) and enables us to identify the ensemble of goods that collectively define latent purchase segments across numerous products.

In the analysis of purchase behavior using topic models for transaction data [18], dynamic patterns between purchased products and customer interests are extracted. [17] fused heterogeneous transaction data and customer lifestyle questionnaire data, whereas [19] identified customer purchase patterns using a topic model with price information related to the purchased products. These approaches identify patterns among customers and products. Topic models that extend LDA by incorporating additional data in the analysis, typified by labeled LDA [28] and supervised LDA [7], have been proposed. Various latent variable models typified by the infinite relational model [22], the ideal point topic model [12], the stochastic block model [27] and the infinite latent feature model [15] have also been proposed for knowledge discovery of binary relations from multiple variables. However, none of these approaches is suitable for relating marketing variables to individual customer choices as explanatory variables.

3 Preliminaries

3.1 Hierarchical Bayes probit model

The binary probit model is a popular marketing model for choice, i.e., purchase or nonpurchase. Let \(y_{cit}\) denote customer c’s purchase record of product i at time t, assigning \(y_{cit}=1\) if customer c purchased the product, and \(y_{cit}=0\) otherwise. We assume the dataset includes C customers and I products through T periods. Denote \(u_{cit}\) as the utility of customer c’s purchase of product i at time t. We assume a binary probit model with \(u_{cit}>0\) if \(y_{cit}=1\), and \(u_{cit} \le 0\) if \(y_{cit}=0\). The marketing variables of product i at time t are expressed as a vector \({{\varvec{x}}}_{it} = [x_{it1}, \ldots , x_{itM}]^{T}\), where M is the number of marketing variables. \({{\varvec{x}}}_{it}\) includes information related to the price or promotion of products.

Here, we consider an analysis of product i only. The binary probit model expresses \(u_{cit}\) by a linear regression model as

$$\begin{aligned} u_{cit} = {{\varvec{x}}}^{T}_{it} {\varvec{\beta }} + \epsilon _{cit}, \end{aligned}$$
(1)

where \({\varvec{\beta }} = [\beta _{1}, \ldots , \beta _{M}]^{T}\) is a regression coefficient vector with respect to product i and \( \epsilon _{cit} \) is a Gaussian error with mean 0 and variance 1. Next, we consider the probability of \(u_{cit} > 0\), which naturally coincides with the probability that \( y_{cit}=1\). The probability \( p(u_{cit} > 0)\) can be determined as

$$\begin{aligned} p(u_{cit}> 0)&= p\left( {{\varvec{x}}}^{T}_{it} {\varvec{\beta }} + \epsilon _{cit}>0\right) \nonumber \\&= F({{\varvec{x}}}^{T}_{it} {\varvec{\beta }}), \end{aligned}$$
(2)

where F is the cumulative distribution function of the standard Gaussian distribution. These model structures and assumptions are natural and reasonable for customer choice, and the model is used in many works in marketing, economics and urban engineering [36].
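
As a concrete illustration of Eqs. (1) and (2), the following sketch evaluates the probit choice probability \(F({{\varvec{x}}}^{T}_{it} {\varvec{\beta }})\) in Python; the marketing variables and coefficients are made-up values, not estimates from this study.

```python
# Minimal sketch of the probit choice probability in Eqs. (1)-(2).
# The vectors below are illustrative assumptions, not estimates from the study.
import numpy as np
from scipy.stats import norm

x_it = np.array([1.0, 0.85, 1.0, 0.0])   # hypothetical [intercept, price, display, feature]
beta = np.array([-1.2, -0.8, 0.6, 0.4])  # hypothetical response coefficients for product i

utility_mean = x_it @ beta               # deterministic part of u_cit
p_purchase = norm.cdf(utility_mean)      # p(u_cit > 0) = F(x_it' beta)
print(round(float(p_purchase), 3))
```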

If we simply extend the probit model to a personalized parameter \({\varvec{\beta }}_{c} = [\beta _{c1}, \ldots , \beta _{cM}]^{T}\) \((c=1, \ldots , C)\) for each customer, then the coefficients cannot be estimated because of insufficient data: most customers do not purchase most products.

The hierarchical Bayes probit model [1] can estimate \({\varvec{\beta }}_{c} \) by assuming a prior distribution for \({\varvec{\beta }}_{c}\). A multivariate normal distribution is used as the prior distribution of \({\varvec{\beta }}_{c} \) because it is conjugate to the likelihood function of the probit model. This assumption of the prior distribution is convenient for parameter estimation and is used in many existing works [30]. However, these models do not treat widely diverse products and are typically limited to fewer than twenty or so products [9, 30] because of high computational costs.

3.2 Dimension reduction by LDA

Here we briefly introduce the idea of topic models in the context of customer purchases. We seek the probability p(i|c) that customer c purchases product i. However, these probabilities cannot be calculated accurately because of data sparseness. The topic model calculates p(i|c) by introducing a latent class \(z \in \{1, \dots , Z\}\) whose dimension is markedly smaller than the numbers of customers and products.

The latent variable is used to represent the sparse data matrix as a finite mixture of vectors, as is common in topic models:

$$\begin{aligned}&\left[ \begin{matrix} p\left( i=1|c=1 \right) & \cdots & p\left( i=1|c=C \right) \\ \vdots & \ddots & \vdots \\ p\left( i=I|c=1 \right) & \cdots & p\left( i=I|c=C \right) \\ \end{matrix} \right] \nonumber \\&=\sum \limits _{z=1}^{Z}{\left[ \begin{matrix} p\left( 1|z \right) \\ \vdots \\ p\left( I|z \right) \\ \end{matrix} \right] }\left[ \begin{matrix} p\left( z|1 \right) & \cdots & p\left( z|C \right) \\ \end{matrix} \right] . \end{aligned}$$
(3)

More specifically, we decompose a large probability matrix of size \(I \times C\) into two small probability matrices of sizes \(I \times Z\) and \(Z \times C\) based on the property of conditional independence. Hereinafter, we denote the probability that customer c belongs to latent class z as p(z|c) and designate it the membership probability. Similarly, p(i|z) denotes the probability that customers belonging to latent class z purchase product i.
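
The decomposition in Eq. (3) can be checked numerically with a small example. In the sketch below, the two factor matrices are drawn at random purely for illustration; with valid probability columns, their product is again a valid \(I \times C\) matrix of conditional probabilities.

```python
# Sketch of Eq. (3): an I x C matrix of p(i|c) as the product of an I x Z matrix
# of p(i|z) and a Z x C matrix of p(z|c). All values are randomly generated toys.
import numpy as np

I, C, Z = 5, 4, 2
rng = np.random.default_rng(0)

p_i_given_z = rng.dirichlet(np.ones(I), size=Z).T   # I x Z, each column sums to 1
p_z_given_c = rng.dirichlet(np.ones(Z), size=C).T   # Z x C, each column sums to 1

p_i_given_c = p_i_given_z @ p_z_given_c             # I x C matrix of p(i|c)
assert np.allclose(p_i_given_c.sum(axis=0), 1.0)    # each customer's column sums to 1
```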

Parameter \(\theta _{cz}\) of a categorical distribution is used for the probability p(z|c). The categorical distribution is a multinomial distribution with parameters \(\varvec{\theta }_c\) = \(\left[ {{\theta }_{c1}}\cdots {{\theta }_{cZ}} \right] \). The \(\varvec{\theta }_c\) is specified so that the selection probability of customer c with respect to product i is conditionally independent given the latent class z: all information about customer heterogeneity of purchases is conveyed through the latent classes. The prior distribution for \(\varvec{\theta }_c\) is assumed to be the Dirichlet distribution, the natural conjugate prior of the categorical distribution:

$$\begin{aligned} {\varvec{\theta }_{c}}\sim \text {Dirichlet}\left( \varvec{\tilde{\gamma }} \right) , \end{aligned}$$
(4)

where \(\varvec{\tilde{\gamma }}\) is a hyperparameter of Dirichlet distribution.

The main difference between the voting blocs model and LDA is the assumed distribution for the probabilities p(i|z) in the \(I \times Z\) matrix. The voting blocs model presumes a Bernoulli distribution for the probability p(i|z), whereas LDA assumes a categorical (i.e., multinomial) distribution for the probability matrix.

3.3 Problem settings

Here, we suppose the following three conditions: (1) the dataset \(\{ y_{cit} \}\) is sparse, that is, most \( y_{cit} \) are zero; (2) the number of target products I is greater than several hundred; and (3) the customer’s purchase behavior follows the marketing model

$$\begin{aligned} u_{cit} = {{\varvec{x}}}^{T}_{it} {\varvec{\alpha }}_{ci} + \epsilon _{cit}, \end{aligned}$$
(5)

where \({\varvec{\alpha }}_{ci} = [\alpha _{ci1}, \ldots , \alpha _{ciM}]^{T}\) is a regression coefficient vector of customer c with respect to product i. For the situation described above, we consider a method to ascertain personalized market response coefficients \(\{ {\varvec{\alpha }}_{ci} \}\).
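
To make the sparsity condition concrete, the following toy sketch builds a binary data cube \(\{y_{cit}\}\) in which only a tiny fraction of cells contain a purchase; the sizes and the purchase rate are assumptions chosen only to mimic the setting, not properties of any real dataset.

```python
# Toy illustration of the problem setting: a sparse data cube y[c, i, t] in which
# most (customer, product) pairs have no purchases at all, so alpha_ci cannot be
# estimated directly. Sizes and the purchase rate are assumed for illustration.
import numpy as np

C, I, T = 200, 500, 90                               # hypothetical cube dimensions
rng = np.random.default_rng(1)
y = (rng.random((C, I, T)) < 0.002).astype(np.int8)  # roughly 0.2% of cells are purchases

purchases_per_pair = y.sum(axis=2)                   # purchase count for each (c, i) pair
print((purchases_per_pair == 0).mean())              # share of pairs with zero purchases
```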

4 Proposed method

4.1 Model development

Under circumstances of sparse data, it is not possible to estimate the parameters \({\varvec{\alpha }}_{ci} \) directly with existing methods such as maximum likelihood estimation because the sample of purchase data is too small. To resolve that difficulty, we reduce the dimension of customers and products to latent classes in which customers with similar purchase behavior under the marketing variables are summarized. We estimate the parameters associated with a latent class in the dimension-reduced space using the purchase data of customers that belong to the same latent class. In that situation, it is possible to estimate the parameters stably because we can use a sufficiently large sample size. We then recover information about \( {\varvec{\alpha }}_{ci} \) using the estimated parameters of the latent classes \( {\varvec{\beta }}_{zi}\) and the latent class membership at each observation \( z_{cit} \). Definitions of \( {\varvec{\beta }}_{zi} \) and \( z_{cit} \) are given later.

Here, we couple the binary choice probability with a voting blocs model to reduce the dimension of the customer and product space.

$$\begin{aligned} p\left( {{u}_{cit}}>0 \right) =\sum \limits _{z=1}^{Z}{p\left( {{u}_{it}}>0|z \right) p\left( z|c \right) } \end{aligned}$$
(6)

We denote the utility associated with the latent class z as \(u_{it}^{(z)}\); then, the choice probability can be represented as \(p(u_{it}>0|z) = p(u_{it}^{(z)}>0)\). Assuming a linear Gaussian structure on the utility \(u_{it}^{(z)}\) for marketing variables, the right-hand side of (3) can be represented as

$$\begin{aligned} \sum \limits _{z=1}^{Z}{\left[ \begin{matrix} F\left( {\varvec{x}_{1t}^{T}}{\varvec{\beta }_{z1}} \right) \\ \vdots \\ F\left( {\varvec{x}_{It}^{T}}{\varvec{\beta }_{zI}} \right) \\ \end{matrix} \right] }\left[ {{\theta }_{1z}}\cdots {{\theta }_{Cz}} \right] \end{aligned}$$
(7)

where \(\varvec{\beta }_{zi}= [\beta _{zi1}, \dots , \beta _{ziM}]^T\) is a response coefficient vector of latent class z with respect to product i. The heterogeneity of the latent classes is introduced through a hierarchical model with a random effect for the response coefficient \(\varvec{\beta }_{zi}\),

$$\begin{aligned} \varvec{\beta }_{zi} \sim N_M(\varvec{\mu }_i,V_i), \end{aligned}$$
(8)

where the prior distributions for \(\varvec{\mu }_i\) and \(V_i\) follow an M-dimensional multivariate normal distribution \(N_M(\tilde{\varvec{\mu }}, \tilde{\sigma }^2 V_i)\) and an inverse Wishart distribution \(IW(\tilde{W}, \tilde{w})\), where \(\tilde{\varvec{\mu }}\), \(\tilde{\sigma }^2\), \(\tilde{W}\) and \(\tilde{w}\) are hyperparameters specified by the analyst. We assume that the M-dimensional coefficient vector \( \varvec{\beta }_{zi}\) for each segment z is a draw from a distribution whose mean and covariance are product-specific.
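
A small sketch of Eqs. (7) and (8) is given below: class-level coefficients \(\varvec{\beta }_{zi}\) are drawn from a product-specific normal prior, and the purchase probability for a customer is the mixture of probit probabilities weighted by the membership probabilities. All sizes, hyperparameter values and marketing variables are illustrative assumptions.

```python
# Sketch of Eqs. (7)-(8): draw beta_zi ~ N_M(mu_i, V_i) for each latent class z and
# mix the probit probabilities with the membership probabilities theta_c.
# Every numerical value here is an assumption for illustration only.
import numpy as np
from scipy.stats import norm

Z, M = 3, 4                                    # latent classes, marketing variables
rng = np.random.default_rng(2)

mu_i = np.array([-1.0, -0.5, 0.5, 0.3])        # hypothetical prior mean for product i
V_i = 0.1 * np.eye(M)                          # hypothetical prior covariance for product i
beta_zi = rng.multivariate_normal(mu_i, V_i, size=Z)    # Z x M draws (Eq. 8)

theta_c = rng.dirichlet(np.ones(Z))            # membership probabilities p(z|c)
x_it = np.array([1.0, 0.9, 0.0, 1.0])          # [intercept, price, display, feature]

p_purchase = theta_c @ norm.cdf(beta_zi @ x_it)  # Eq. (7): sum_z theta_cz * F(x_it' beta_zi)
```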

The likelihood is given as

$$\begin{aligned}&\ell \left( \left\{ {{y}_{cit}} \right\} |\left\{ {\varvec{\theta }_{c}} \right\} ,\left\{ {\varvec{\beta }_{zi}} \right\} ,\left\{ {\varvec{x}_{it}} \right\} \right) \nonumber \\&=\prod \limits _{c=1}^{C}{\prod \limits _{i\in {{I}_{c}}}{\prod \limits _{t\in {{T}_{c}}}{\sum \limits _{z=1}^{Z}{\left[ {{\theta }_{cz}}p\left( {{y}_{cit}}|{\varvec{x}_{it}},{\varvec{\beta }_{zi}},z \right) \right] }}}} \end{aligned}$$
(9)

where \(p(y_{cit}|\varvec{x}_{it},\varvec{\beta }_{zi},z)\) denotes the kernel of the binary probit model conditional on z, and \(T_c\) denotes the subset of times t at which customer c purchased any product in the store. Also, \(I_c\) is the subset of products i purchased by customer c at least once during the period \(t=1, \dots , T\), i.e., \({{T}_{c}} = \left\{ t \mid \sum \nolimits _{i=1}^{I}{{{y}_{cit}}}>0 \right\} \) and \({{I}_{c}} = \left\{ i \mid \sum \nolimits _{t=1}^{T}{{{y}_{cit}}}>0 \right\} \).

Equation (9) is difficult to use directly because the likelihood includes summations over the latent class z. Instead, we use a data augmentation approach [34] with respect to the latent variable z. We introduce variables \(z_{cit} \in \{1, \ldots , Z\}\) denoting the label of the latent class for each customer c, each purchased product i and each purchasing event t. Conditioning on the \(z_{cit}\) for each purchasing transaction, as in LDA [6], the likelihood in (9) simplifies to

$$\begin{aligned}&\ell \left( \left\{ {{y}_{cit}} \right\} | \left\{ {\varvec{\theta }_{c}} \right\} ,\left\{ {{z}_{cit}} \right\} ,\left\{ {\varvec{\beta }_{zi}} \right\} ,\left\{ {\varvec{x}_{it}} \right\} \right) \nonumber \\&= \prod \limits _{c=1}^{C}{\prod \limits _{i\in {{I}_{c}}}{\prod \limits _{t\in {{T}_{c}}}{p\left( {{z}_{cit}}=z|{\varvec{\theta }_{c}} \right) p\left( {{y}_{cit}}|{\varvec{x}_{it}},{\varvec{\beta }_{zi}},{{z}_{cit}=z} \right) }}} \end{aligned}$$
(10)

where \(p(z_{cit}=z|\varvec{\theta }_c)\) denotes a categorical distribution when \(\varvec{\theta }_c\) is given. Hereinafter, \(( z_{cit}=z )\) is denoted as \( z_{cit} \) to simplify notation.
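
For intuition, the sketch below evaluates a single term of the augmented likelihood in Eq. (10): given a sampled label \(z_{cit}\), the contribution is \(p(z_{cit}|\varvec{\theta }_c)\) times the probit kernel for \(y_{cit}\). The numerical inputs are placeholders.

```python
# One term of the augmented likelihood in Eq. (10), with placeholder inputs.
import numpy as np
from scipy.stats import norm

theta_c = np.array([0.6, 0.3, 0.1])          # membership probabilities for customer c
beta = np.array([[-1.0, -0.6, 0.5, 0.3],     # hypothetical beta_zi for each class z
                 [-0.5, -1.2, 0.8, 0.4],
                 [-1.5, -0.2, 0.1, 0.1]])
x_it = np.array([1.0, 0.7, 1.0, 0.0])        # marketing variables of product i at time t
z_cit, y_cit = 1, 1                          # sampled class label and observed purchase

p_y1 = norm.cdf(x_it @ beta[z_cit])          # probit kernel p(y_cit = 1 | x_it, beta, z_cit)
likelihood_term = theta_c[z_cit] * (p_y1 if y_cit == 1 else 1.0 - p_y1)
```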

The posterior distribution of the parameters, including the latent state variables \(\{z_{cit}\}\) and augmented utilities \(\{u_{cit}^{(z)}\}\), of the proposed model is then given as

$$\begin{aligned}&p\left( {\{\varvec{\theta }_c \},\{z_{cit} \}, \left\{ u_{cit}^{(z)} \right\} , \{\varvec{\beta }_{zi} \},\{\varvec{\mu }_i \},\{V_i \}\mid \{\varvec{x}_{it} \},\{y_{cit} \}} \right) \nonumber \\&= p\left( {\left\{ {\varvec{\theta }_c } \right\} \mid \{z_{cit} \}} \right) \nonumber \\&\quad \times p\left( {\left\{ {z_{cit} } \right\} \mid \left\{ {\varvec{\theta }_c, \varvec{\beta }_{zi}, \varvec{x}_{it}, y_{cit} } \right\} } \right) \nonumber \\&\quad \times p\left( {\left\{ {u_{cit}^{(z)} } \right\} \mid \left\{ {\varvec{\beta }_{zi}, z_{cit}, \varvec{x}_{it}, y_{cit} } \right\} } \right) \nonumber \\&\quad \times p\left( {\left\{ {\varvec{\mu }_i, V_i } \right\} \mid \{\varvec{\beta }_{zi} \}} \right) \nonumber \\&\quad \times p\left( {\left\{ {\varvec{\beta }_{zi} } \right\} \mid \left\{ {u_{cit}^{(z)}, \varvec{\mu }_i, V_i, \varvec{x}_{it} } \right\} } \right) \nonumber \\&\propto p\left( { \{\varvec{\theta }_c \},\{z_{cit} \}, \left\{ u_{cit}^{(z)} \right\} , \{\varvec{\beta }_{zi} \},\{\varvec{\mu }_i \}, \{V_i \},\{\varvec{x}_{it} \},\{y_{cit} \} } \right) \nonumber \\&= \left[ {\prod \limits _{c = 1}^C {p\left( {\varvec{\theta }_c } \right) } } \right] \left[ {\prod \limits _{i = 1}^I {p\left( {\varvec{\mu }_i, V_i } \right) } \prod \limits _{z = 1}^Z {p\left( {\varvec{\beta }_{zi} \mid \varvec{\mu }_i, V_i } \right) } } \right] \nonumber \\&\quad \, \Biggl [ \prod \limits _{c = 1}^C \prod \limits _{i \in I_c} \prod \limits _{t \in T_c} p\left( {z_{cit} \mid \varvec{\theta }_c } \right) p\left( {u_{cit}^{(z)} \mid \varvec{\beta }_{zi}, z_{cit}, \varvec{x}_{it}, y_{cit} } \right) \nonumber \\&\quad \, p\left( {y_{cit} \mid \varvec{\beta }_{zi}, z_{cit}, \varvec{x}_{it} } \right) \Biggr ]. \end{aligned}$$
(11)

4.2 Characteristics of the proposed model

Figure 1 presents a graphical representation of the proposed model. Here, it is noteworthy that \(\{\varvec{\beta }_{zi}\}\) differs from smoothing parameters in the literature of LDA [6]. The \(\{\varvec{\beta }_{zi}\}\) in our model, which are regression coefficient vectors for marketing activities, play a key role in our analysis because latent segments and augmented utilities are characterized by the estimated \(\{\varvec{\beta }_{zi}\}\).

Fig. 1 Graphical representation of the proposed model

The latent classes z serve to define types of purchase baskets across the I products. The first term of (7) defines a vector of choice probabilities for each product under study, assuming that the purchase occasion is of type z. Products with high probability are likely to be jointly present in the basket. Therefore, our model identifies likely bundles of goods purchased on shopping trips of different types. The second term is the probability that a customer’s purchases are of type z. Our model does not treat heterogeneity in the traditional manner of marketing models, in which a common set of customer parameters governs all purchases of an individual. We instead assume that each purchase belongs to one of Z types, and that customers can be characterized in terms of the probability that their purchases are of these types.

Our model differs from related standard models in two respects. First, the likelihood is defined over products and time periods in which purchases are observed to take place at least once, as indicated by the variables \(T_c\) and \(I_c\). It is composed not only of purchase but also of nonpurchase occasions for identifying market response parameters. In this sense, our model differs from topic models used in text analysis, where the likelihood is formed using the words present in a corpus, not the words that are absent. Second, heterogeneity is introduced at the observation level, allowing the different transactions of a customer to reflect different latent states z at every (c, i, t), as denoted by \(z_{cit}\). This provides useful information for characterizing customers and products and for predicting their purchases. This differs from the traditional latent class model [21], where the likelihood of all customer purchases contributes to inferences about a customer’s latent class membership (z) and parameters (\(\beta \)).

4.3 Estimation of personalized market response coefficients

The estimated posterior means \(\hat{\varvec{\beta }}_{zi}\), \(\hat{u}_{cit}^{(z)}\) and \(\hat{z}_{cit}^{(z)}\) can be transformed into statistics that are relevant for personalization. Here, \(\hat{z}_{cit}^{(z)} \equiv \varvec{E}[p(z_{cit}=z)], z=1,\ldots ,Z\) at each point of the data cube (c, i, t). Given the estimates \(\hat{\varLambda }=\{\hat{\varvec{\beta }}_{zi}, \hat{u}_{cit}^{(z)}, \hat{z}_{cit}^{(z)} \}\), we can construct market response estimates for each customer and each product by projecting the estimates of latent utility on the marketing variables. The estimates are obtained from an auxiliary regression of the latent utility \(\hat{U}_{ci}^{(k)}\), stacked from \(\hat{u}_{cit}^{(k)}\) with the state \(k= \text {argmax}_{z} \; \hat{z}_{cit}^{(z)}\) changing over time, on the corresponding marketing variables \(X_{ci}\) constituted by \(\varvec{x}_{it}\) \((t \in T_c)\).

$$\begin{aligned} \hat{\varvec{\alpha }}_{ci}=\left( {X_{ci}}^{T}X_{ci}\right) ^{-1}{X_{ci}}^{T}\hat{U}_{ci}^{(k)}. \end{aligned}$$
(12)

The estimates presented above provide a bridge between the granularity of the model, where heterogeneity is introduced at each point in the data cube, and managerial inferences and decisions that are made across products (e.g., which customers to reward), across customers (e.g., which products to promote) and over time. In addition, the standard t test of linear regression models is useful for testing the significance of the estimates.
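
The auxiliary regression in Eq. (12) and the accompanying t test amount to ordinary least squares on the stacked latent utilities. The sketch below uses synthetic placeholders for \(\hat{U}_{ci}^{(k)}\) and \(X_{ci}\); in practice these would come from the fitted model.

```python
# Sketch of Eq. (12): regress the stacked latent utilities U_ci on the stacked
# marketing variables X_ci by OLS, then screen coefficients with a two-sided t test.
# U_ci and X_ci below are synthetic placeholders, not outputs of the actual model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
T_c, M = 40, 4
X_ci = np.column_stack([np.ones(T_c), rng.random((T_c, M - 1))])        # stacked x_it
U_ci = X_ci @ np.array([-1.0, -0.8, 0.5, 0.3]) + rng.normal(size=T_c)   # stacked utilities

alpha_ci, _, _, _ = np.linalg.lstsq(X_ci, U_ci, rcond=None)             # Eq. (12)

resid = U_ci - X_ci @ alpha_ci                                          # OLS residuals
sigma2 = resid @ resid / (T_c - M)
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X_ci.T @ X_ci)))            # standard errors
t_stat = alpha_ci / se
significant = np.abs(t_stat) > stats.t.ppf(0.975, df=T_c - M)           # 5% two-sided test
```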

4.4 Parameter estimation

We use variational Bayes (VB) inference [5, 20] instead of standard Markov chain Monte Carlo (MCMC) inference because MCMC methods can incur a large computational cost in large-scale problems. VB inference approximates the target posterior distribution by variational optimization in a computationally efficient manner. This approximation is necessary for our analysis. VB has another advantage over MCMC in that it is not prone to the label-switching problem encountered in MCMC estimation [24]. The VB inference, the update equations and the derivations for our model are detailed in Appendices A and B. The precision and computation time of parameter estimation of our model by VB and MCMC in several situations are shown in Appendices E and F, respectively.

5 Application

5.1 Data description and settings

A customer database from a general merchandise store, recorded from April 1 to June 30, 2002, is used in our analysis. A customer identifier, price, display and feature variables were recorded for each purchase occasion. The dataset includes 94,297 transactions involving 1650 customers and 500 products. The products were chosen because they were displayed and featured at least once during the data period. The marketing variables are price \((P_{it})\), display \((D_{it})\) and feature \((F_{it})\); that is, \(\varvec{x}_{it}=[1 \,\, P_{it} \,\, D_{it} \,\, F_{it}]^T\). Here, \(P_{it}\) is the price relative to the maximum price of product i in the observational period. The display and feature variables are binary, equal to one if product i is displayed or featured, respectively, at time t, and zero otherwise.
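
A sketch of how one marketing-variable vector \(\varvec{x}_{it}=[1 \,\, P_{it} \,\, D_{it} \,\, F_{it}]^T\) might be assembled is shown below; the raw values are invented and the preprocessing is our reading of the description above, not the authors' actual code.

```python
# Assemble x_it = [1, P_it, D_it, F_it] from raw records (values are invented).
import numpy as np

price_it = 198.0                  # observed price of product i at time t (hypothetical)
max_price_i = 248.0               # maximum price of product i in the observation period
display_it, feature_it = 1, 0     # binary display / feature indicators

P_it = price_it / max_price_i     # price relative to the product's maximum price
x_it = np.array([1.0, P_it, display_it, feature_it])
```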

In VB estimation, the iterations are terminated when the variational lower bound improves by less than \(10^{-3} \%\) of the current value in two consecutive iterations. (The variational lower bound is described in Appendix C.) The hyperparameters and initial values are set as explained in Appendix A. These settings for the hyperparameters and the stopping rule of the VB iterations are adopted hereinafter for all empirical studies.
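
The stopping rule can be sketched as follows: iterate the VB updates until the relative improvement of the variational lower bound falls below \(10^{-3}\%\) (i.e., \(10^{-5}\)) in two consecutive iterations. The function `update_lower_bound` is a hypothetical placeholder for one VB sweep.

```python
# Sketch of the VB stopping rule described above; `update_lower_bound` is a
# hypothetical callable that performs one VB sweep and returns the new lower bound.
def run_vb(update_lower_bound, max_iter=1000, tol=1e-5):
    lb_prev = None
    small_steps = 0
    lb = float("-inf")
    for _ in range(max_iter):
        lb = update_lower_bound()
        if lb_prev is not None and abs(lb - lb_prev) < tol * abs(lb_prev):
            small_steps += 1        # improvement below 1e-3 % of the current value
        else:
            small_steps = 0
        if small_steps >= 2:        # two consecutive small improvements: stop
            break
        lb_prev = lb
    return lb
```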

Table 1 RMSEs of predictions of the four methods

5.2 Prediction performance

Table 1 presents the root-mean-square error (RMSE) of the four methods with respect to the number of latent classes Z. The RMSE measures the difference between the observed purchase behavior \(y_{cit}\) and the predicted probability \(p(y_{cit}=1)\) in the data cube and is calculated using hold-out samples recorded during July 1–31, 2002. We measure the prediction performance of the four methods on unknown samples. The table includes results of the probit model, the logit model, the latent class logit model [21] and the proposed method. The RMSEs of the probit model and the logit model are calculated on the presumption that, for each product, the data are generated by the behavior of a single representative consumer. The latent class logit model assumes latent classes of customers only. The RMSEs of these three models are calculated independently for each product. The calculation of the RMSEs of the latent class logit model with respect to \(Z=10, 15\) and 20 did not converge. We used the R function glm for the probit and logit models and the R package FlexMix for the latent class logit model. Results show that the proposed method has higher prediction performance than the other methods.
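
The RMSE criterion itself is straightforward; the sketch below computes it for a handful of made-up hold-out observations and predicted probabilities, standing in for the July hold-out sample.

```python
# RMSE between observed hold-out purchases y_cit and predicted probabilities
# p(y_cit = 1); the arrays are made-up stand-ins for the July hold-out sample.
import numpy as np

y_holdout = np.array([1, 0, 0, 1, 0])                # observed purchases
p_predicted = np.array([0.7, 0.1, 0.2, 0.4, 0.05])   # predicted purchase probabilities

rmse = np.sqrt(np.mean((y_holdout - p_predicted) ** 2))
```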

Additionally, the table shows that the decrease in the RMSE of the proposed method becomes gradual around \(Z=10\). We therefore illustrate the following analysis using the \(Z=10\) solution. Selecting the number of segments using the variational lower bound is common practice in mixture modeling [5, 11]. We tried, but were unsuccessful in estimating an optimal Z because the variances of the estimated variational lower bound across multiple trials for each Z were too large to ascertain an optimal Z. Therefore, we leave this as an area for future research.

5.3 Insight to personalized marketing

5.3.1 Heterogeneity analysis

The management of pricing, displays and feature activity within a store involves decisions that cut across time and customers, and which require knowledge of which product categories are most sensitive to these actions. More recently, targeted coupon delivery systems have allowed for the individual-level customization of prices. Managing these decisions requires a view of the sensitivity of customers and product categories to these actions.

Individual-level estimates of market response are obtained using Eq. (12), and a two-sided significance test at the 5% level is conducted on each estimate by t test to decide the effectiveness of the marketing variables in the empirical analysis. In fact, customers will display variation in their sensitivity to variables such as price across product categories because of varying aspects of the product categories (e.g., necessity versus luxury goods, amount of product differentiation, price expectations) and different purposes of the shopping visit over time (e.g., shopping for oneself or others, large versus small shopping trips).

We can marginalize \(\hat{\varvec{\alpha }}_{ci}\) over either of its arguments, c and i, to obtain characterizations of customers and products that are useful for analysis. The empirical marginal distribution of customer parameter estimates is obtained by averaging across the 500 products in our analysis, i.e., \(\left\{ {\sum \nolimits _{i=1}^{I}{{{\hat{\varvec{\alpha }} }_{ci}}}}/{I}\; \right\} \). A histogram over the 1650 customers for each marketing variable is displayed on the left side of Fig. 2, providing information related to the general distribution of heterogeneity faced by the firm for actions such as price customization. We find the individual estimates to be plausible in that the price coefficients are negative and the display and feature coefficients are estimated to be positive.

Fig. 2 Marginal distribution of parameter estimates of individual customers

Fig. 3 Marginal distribution of parameter estimates of individual products

Fig. 4 Personalized effective marketing variables for individual customers and products: 100 customers and 100 products

We can also summarize heterogeneity across customers and examine the distribution of marketing variable effects for the 500 products in our analyses. The empirical marginal distributions for individual products, obtained by averaging over the 1650 customers, i.e., \(\left\{ {\sum \nolimits _{c=1}^{C}{{{\hat{\varvec{\alpha }} }_{ci}}}}/{C}\; \right\} \), are depicted in Fig. 3. The products that were never displayed and featured in the data period have been omitted from the histograms in Fig. 3. These estimates are useful for ascertaining which product categories should receive merchandising support in the form of in-store displays and feature advertising. Results show that the estimates are plausible for most product categories, with negative price coefficients and positive display and feature coefficients, but there exists fairly wide variation in the effectiveness of these variables across products. Many product categories appear to be unresponsive to merchandising efforts.
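
The two marginalizations used for Figs. 2 and 3 can be written compactly as averages of the personalized coefficient array over one index. In the sketch below, `alpha` is a placeholder array of shape \((C, I, M)\) holding the estimates from Eq. (12).

```python
# Marginalize the personalized coefficients alpha_ci: averaging over products gives
# one M-vector per customer (Fig. 2); averaging over customers gives one per product
# (Fig. 3). `alpha` is a zero-filled placeholder for the estimates from Eq. (12).
import numpy as np

C, I, M = 1650, 500, 4
alpha = np.zeros((C, I, M))              # placeholder for the estimated alpha_ci vectors

customer_profiles = alpha.mean(axis=1)   # shape (C, M): per-customer averages over products
product_profiles = alpha.mean(axis=0)    # shape (I, M): per-product averages over customers
```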

5.3.2 Personalized effective marketing promotions

Figure 4 provides a two-dimensional summary of the data and coefficient estimates for the top 100 products and customers. Figure 4a is a scatter plot of the two-dimensional data cube with respect to customers (c) and products (i), aggregated along the time (t) dimension. If a customer has never purchased a specific product in the dataset, then the coordinate (i, c) is colored “white”; it is “black” if the customer has purchased the product at least once. We observe that the customer-product space is still very sparse.

Figures 4b–d show the results of testing at the 5% significance level for nonzero individual response coefficients. In Fig. 4b, coordinates with a significant price coefficient are indicated in “black,” whereas “white” indicates that the estimate is not significant. The effectiveness of display and feature promotions is defined similarly. We find that our model produces many significant price, display and feature coefficients.

An interesting aspect of our analysis is that, because of the imputation present in the latent variable model for nonpurchases, significant coefficients can arise even when a customer has never purchased a product. The latent variable model greatly reduces the dimensionality of the data cube and produces individual estimates in a sparse data environment. Our analyses yield coefficient estimates at the level of individual customers and products by way of latent topics that transcend the product categories. Our model enables marketers to develop effective pricing and promotional strategies by recognizing the presence of latent topics, or shopping baskets, at each point in time in the data cube.

6 Conclusion

We proposed a descriptive model of demand based on the idea of latent variables underlying the products purchased by customers. We allow a product’s purchase probability to be affected by price, display and feature advertising variables, but do not treat purchases as arising from a process of constrained utility maximization. An important benefit of this approach is that it enables us to sidestep complications associated with competitive effects and to model a much larger set of products than is possible with existing economic models. By retaining prices and other marketing variables in our model, we can predict the effects of these variables on a product’s own sales. This trade-off is unavoidable in the analysis of transaction databases where purchases are tracked across thousands of products. The proposed model is applicable to personalized marketing across numerous and diverse products. We showed how the model produces information useful for personalized marketing for both specific customers and specific products, and how it effectively accommodates the data sparseness caused by infrequent customer purchases.

Future research will combine marketing models with other latent variable models or tensor factorization methods and compare the prediction performance with that of the proposed model. We would also like to apply the method to other market datasets to verify the prediction performance. Additionally, our model assumes that the topic structure is stable over time. However, it is possible that customers’ market responses and purchase patterns change over time because of factors such as new trends, state dependence and the arrival of new purchase and delivery technologies. We believe that the development of a dynamic topic model for purchases is an interesting extension of our work and leave this point for future research.