Personalized market response analysis for a wide variety of products from sparse transaction data

Advanced database marketing seeks to ascertain individual customers' responses to discounts or displays of widely various products from transaction data. However, transaction data recorded in supermarkets or electronic commerce are fundamentally sparse because most customers purchase only a few of all the products in a shop. Existing methods cannot elucidate personalized responses because the sample of purchase data is too small. This paper proposes a personalized market response estimation method for a wide set of customers and products from such sparse data. The method compresses sparse transaction data, together with information related to responses to marketing variables, into a reduced-dimensional space for feasible parameter estimation. The results are then decompressed into the original space using augmented latent variables to obtain individual response parameters. Results show that the method can find suitable marketing promotions for individual customers for every analyzed product.


Introduction
Personalized marketing is a key strategy of modern database marketing that supports targeted recommendations, promotions and direct-mail campaigns in various business fields. The analysis of personalized marketing responses from retailer transaction data is challenging because of the fundamental sparsity of observed purchases. In truth, very few customers purchase most products on most of their shopping trips. When a purchase is recorded in one category, it is frequently for just one offering. The actual sample of transaction data is much smaller than the data space reflected by a data cube with dimensions corresponding to the number of customers, number of products and occasions. Under such circumstances, standard marketing models for choice break down because of the high frequency of nonpurchase for almost every product.
Additionally, increasing the number of products in a traditional marketing model is problematic because of potential complexities in the structure of demand based on an orthodox economics model and the accompanying increase in the required number of model parameters. Existing models of choice and demand, for example, are typically limited to fewer than twenty or so product alternatives that are tracked across possibly hundreds of customers [9,30]. Unfortunately, this limitation is often at odds with the goals of practitioners who want to optimize a marketing promotion for a wide set of customers and products in their shops.
As described in this paper, we propose a method of personalized market response analysis that can treat widely diverse products. The method identifies effective marketing promotions of individual products to individual customers using sparse transaction data. To resolve the difficulty of data sparsity, we first compress the data space comprising customers and products to a reduced-dimensional latent class. For the dimension reduction of customers and products, we propose a model that combines a latent variable model and a marketing model with response parameters to marketing variables such as discounts or promotions. Response parameters are introduced into the latent class by connecting each choice to its own marketing variable. Consequently, the parameters can be estimated stably because a sufficiently large sample size can be used. Then, we decompress the extracted associations back to individual customers, using the estimated parameters of customers and products, for personalization.
Our model identifies the latent class for each customer at each point in time, providing information related to the array of products that a customer is likely to purchase. This is a key variable for the construction of personalized information. We do not make a priori assumptions about substitute and complementary goods, in the spirit of market basket analysis in data mining. Our model takes an exploratory approach to analysis: it does not test assumptions about the form of the utility function across hundreds of offerings. However, our model does include marketing variables so that their effects on choice can be measured and used for prediction.
The contributions of this paper are the following:
- Proposal of an individual market response estimation method for widely diverse products.
- Development of a marketing model with a latent variable model.
- Findings of personalized effective marketing variables for widely diverse products from sparse supermarket transaction data.
Sections 2 and 3 review related work and preliminary research related to our method. We present the proposed method for personalized marketing for widely diverse products in Sect. 4. Section 5 presents an empirical study using real transaction data. Conclusions are offered in Sect. 6.

Marketing model for personalization
The marketing models of customer heterogeneity [1,23] for choice behavior commonly studied in the marketing field use the framework of hierarchical Bayes modeling [30]; details are described in Sect. 3.1. Heterogeneity models can measure the effects of marketing promotion for individual customers explicitly as market response coefficients. Their main purpose is to elucidate the richness of the competitive environment within a product category or brand. The models are constructed from marketing variables, parameters and structures grounded in economic concepts such as a budget constraint, the presence of substitutes and complements and/or utility functions. However, most advanced models entail a high computational cost for parameter estimation because model structures that express the process of customer purchase behavior in the style of economics tend to be complicated. Therefore, existing models of choice and demand, for example, are typically limited to fewer than twenty or so product alternatives.
Our model is similar to adaptive personalization systems proposed by [3,8,10,31].However, it differs in that our model structure facilitates analysis of widely various product categories.

Dimension reduction method
Many statistical and data-mining methods for dimension reduction have been assessed for transaction data analysis: traditional latent class analysis [13], correspondence analysis [14], self-organizing maps [38] and joint segmentation [29]. The benefit of such methods is that they can treat a large set of customer and product data to seek hidden patterns in a reduced-dimensional space. Tensor factorization [25,39] can decompose a data cube of a large set of customers, products and time periods into scalable low-rank matrices to find hidden patterns related to customer behavior. However, such studies cannot address a marketing-variable structure explicitly as marketing models do. Our method is designed to extract information related to individual customers' responses to changing marketing variables directly.

Topic modeling and latent variables model
The topic model, a kind of latent variable model, is a generalization of the finite mixture model in which each data point is associated with a draw from a mixing distribution [35]. Models of voting blocs [12,32] track the votes of legislators (aye or nay) across multiple bills, with each bill associated with a potentially different concern or issue. Similarly, the latent Dirichlet allocation (LDA) model [6] allocates words within documents to a few latent topics with patterns that are meaningful and interpretable. Each vote and each word are associated with a potentially different issue or topic. Therefore, the mixing distribution is applied to the individual vector of observations and not to the panelist's entire set of observations (e.g., the series of votes by a legislator or the set of words by an author). In our analysis of household purchases, we allow the vector of observed purchases across all product categories on an occasion to be related to a different latent context (topic or issue). This allowance enables us to view a customer's purchases as responding to different needs or occasions (e.g., family dinner, snacks) and enables us to identify the ensemble of goods that collectively define latent purchase segments across numerous products.
In the analysis of purchase behavior using topic models for transaction data [18], dynamic patterns between purchased products and customer interests are extracted. [17] fused heterogeneous transaction data and customer lifestyle questionnaire data, whereas [19] identified customer purchase patterns using a topic model with price information related to the purchased products. These approaches identify patterns among customers and products. Topic models typified by labeled LDA [28] and supervised LDA [7], which extend LDA by incorporating additional data in the analysis, have been proposed. Various latent variable models typified by the infinite relational model [22], the ideal point topic model [12], the stochastic block model [27] and the infinite latent feature model [15] have also been proposed for knowledge discovery of binary relations from multiple variables. However, none of these approaches is suitable for relating marketing variables to individual customer choices as explanatory variables.

Hierarchical Bayes probit model
The binary probit model is a popular marketing model for choice, i.e., purchase or not. Let $y_{cit}$ denote customer $c$'s purchase record of product $i$ at time $t$, with $y_{cit} = 1$ if customer $c$ purchased the product and $y_{cit} = 0$ otherwise. We assume the dataset includes $C$ customers and $I$ products through $T$ periods. Denote by $u_{cit}$ the utility of customer $c$'s purchase of product $i$ at time $t$. We assume a binary probit model with $u_{cit} > 0$ if $y_{cit} = 1$ and $u_{cit} \le 0$ if $y_{cit} = 0$. The marketing variables of product $i$ at time $t$ are expressed as a vector $x_{it} = [x_{it1}, \ldots, x_{itM}]^T$, where $M$ is the number of marketing variables; $x_{it}$ includes information related to the price or promotion of products.
Here, we consider an analysis of product $i$ only. The binary probit model expresses $u_{cit}$ by a linear regression model as
$$u_{cit} = x_{it}^T \beta + \varepsilon_{cit},$$
where $\beta = [\beta_1, \ldots, \beta_M]^T$ is a regression coefficient vector with respect to product $i$ and $\varepsilon_{cit}$ is a Gaussian error with mean 0 and variance 1. Next, we consider the probability of $u_{cit} > 0$, which naturally coincides with the probability of $y_{cit} = 1$. The probability $p(u_{cit} > 0)$ can be determined as
$$p(y_{cit} = 1) = p(u_{cit} > 0) = F(x_{it}^T \beta),$$
where $F$ is the cumulative distribution function of the standard Gaussian distribution. These model structures and assumptions are natural and reasonable for customer choice.
Many works use the model in marketing, economics and urban engineering [36].
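The probit choice probability above can be sketched in a few lines of Python. This is an illustrative helper, not part of the model: the function name and the coefficient values are ours, not estimates from any dataset.

```python
import math

def probit_prob(x, beta):
    """p(y = 1) = F(x' beta), with F the standard normal CDF."""
    u = sum(xi * bi for xi, bi in zip(x, beta))          # deterministic utility x' beta
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))    # F(u) via the error function

# Illustrative marketing vector: intercept, relative price, display, feature.
p = probit_prob([1.0, 0.8, 1.0, 0.0], [-1.0, -0.5, 0.7, 0.4])
```

With a negative price coefficient, a deeper discount (lower relative price) raises the purchase probability, as the model intends.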
If we simply extend the probit model to a personalized parameter $\beta_c = [\beta_{c1}, \ldots, \beta_{cM}]^T$ $(c = 1, \ldots, C)$ for individual customers, then the coefficients cannot be estimated: because most customers do not purchase most products, too few samples remain.
The hierarchical Bayes probit model [1] can estimate $\beta_c$ under an assumed prior distribution of $\beta_c$. A multivariate normal distribution is used as the prior of $\beta_c$ because it is the conjugate distribution for the likelihood function of the probit model. This assumption is convenient for parameter estimation and is used in many existing works [30]. However, such models do not treat widely diverse products and are typically limited to fewer than twenty or so products [9,30] because of high computational costs.

Dimension reduction by LDA
Here we briefly introduce the idea of topic models in the context of customer purchases. We seek the probability $p(i|c)$ that customer $c$ purchases product $i$. However, these probabilities cannot be calculated accurately because of data sparseness. The topic model calculates $p(i|c)$ by introducing a latent class $z \in \{1, \ldots, Z\}$ whose dimension is markedly smaller than the numbers of customers and products.
The latent variable is used to represent the sparse data matrix as a finite mixture of vectors commonly found in topic models.
More specifically, we decompose a large probability matrix of size $I \times C$ into two small probability matrices of sizes $I \times Z$ and $Z \times C$ based on the property of conditional independence. Hereinafter, we denote the probability that customer $c$ belongs to latent class $z$ as $p(z|c)$ and designate it the membership probability. Also, for simplification, the probability that customers belonging to latent class $z$ purchase product $i$ is $p(i|z)$.
Parameter $\theta_{cz}$ of a categorical distribution is used for the probability $p(z|c)$. The categorical distribution is multinomial with parameters
$$\theta_c = [\theta_{c1}, \ldots, \theta_{cZ}]^T, \qquad \sum_{z=1}^{Z} \theta_{cz} = 1.$$
The $\theta_c$ is specified so that the selection probability of customer $c$ with respect to product $i$ is conditionally independent given the latent class $z$:
$$p(i|c) = \sum_{z=1}^{Z} p(i|z)\, p(z|c);$$
all information about customer heterogeneity of purchases is conveyed through the latent classes. The prior distribution for $\theta_c$ is assumed to be the Dirichlet distribution, the natural conjugate prior of the categorical distribution:
$$p(\theta_c) = \mathrm{Dirichlet}(\theta_c \mid \gamma),$$
where $\gamma$ is a hyperparameter of the Dirichlet distribution. The main difference between the voting blocs model and LDA is the assumed distribution for the probabilities $p(i|z)$ in the $I \times Z$ matrix: the voting blocs model presumes a Bernoulli distribution for the probability $p(i|z)$, whereas LDA assumes a categorical (i.e., multinomial) distribution for the probability matrix.
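The low-rank structure above can be sketched numerically. The snippet below uses random Dirichlet draws as stand-ins for the estimated probability matrices (the matrix sizes follow the empirical study later in the paper; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
I, C, Z = 500, 1650, 10  # products, customers, latent classes

# Stand-ins for the estimated probability matrices; each column is a distribution.
p_i_given_z = rng.dirichlet(np.ones(I), size=Z).T  # I x Z matrix of p(i|z)
p_z_given_c = rng.dirichlet(np.ones(Z), size=C).T  # Z x C matrix of p(z|c)

# Conditional independence: p(i|c) = sum_z p(i|z) p(z|c)
p_i_given_c = p_i_given_z @ p_z_given_c            # I x C matrix of p(i|c)
```

Note the parameter economy: the two factors require $Z(I + C)$ entries rather than the $IC$ entries of the full matrix.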

Problem settings
Here, we suppose the following three conditions: (1) the dataset $\{y_{cit}\}$ is sparse, that is, most $y_{cit}$ are zero; (2) the number of target products $I$ is greater than several hundred; and (3) the customer's purchase behavior follows the marketing model
$$u_{cit} = x_{it}^T \alpha_{ci} + \varepsilon_{cit},$$
where $\alpha_{ci} = [\alpha_{ci1}, \ldots, \alpha_{ciM}]^T$ is a regression coefficient vector of customer $c$ with respect to product $i$. For the situation described above, we consider a method to ascertain the personalized market response coefficients $\{\alpha_{ci}\}$.
Proposed method

Model development
Under circumstances of sparse data, the parameters $\alpha_{ci}$ cannot be estimated directly using existing methods such as maximum likelihood estimation because of the small sample of purchase data. To resolve that difficulty, we reduce the dimension of customers and products to latent classes in which customers with similar purchase behavior under marketing variables are summarized. We estimate the parameters associated with a latent class in the dimension-reduced space using the purchase dataset of the customers belonging to the same latent class. In that situation, the parameters can be estimated stably because a sufficiently large sample size can be used. We then recover $\alpha_{ci}$ using the estimated latent-class parameters $\beta_{zi}$ and the latent class membership at each observation, $z_{cit}$. Definitions of $\beta_{zi}$ and $z_{cit}$ are given later.
Here, we couple the binary choice probability with a voting blocs model to reduce the dimensionality of the customer and product space.
We denote the utility associated with latent class $z$ as $u^{(z)}_{it}$; then, the choice probability can be represented as
$$p(y_{cit} = 1 \mid z) = p(u^{(z)}_{it} > 0). \qquad (3)$$
Assuming a linear Gaussian structure on the utility $u^{(z)}_{it}$ for the marketing variables, the right-hand side of (3) can be represented through
$$u^{(z)}_{it} = x_{it}^T \beta_{zi} + \varepsilon^{(z)}_{it},$$
where $\beta_{zi} = [\beta_{zi1}, \ldots, \beta_{ziM}]^T$ is a response coefficient vector of latent class $z$ with respect to product $i$. The heterogeneity of the latent class is introduced through a hierarchical model with a random effect for the response coefficient,
$$\beta_{zi} \sim N(\mu_i, V_i),$$
where the prior distributions for $\mu_i$ and $V_i$ are a normal distribution $N(\mu, \sigma^2 V_i)$ and an inverse Wishart distribution $IW(W, w)$, with $\mu$, $\sigma^2$, $W$ and $w$ hyperparameters specified by the analyst. We assume that the $M$-dimensional coefficient vector $\beta_{zi}$ for each segment $z$ is a draw from a distribution with product-specific mean and covariance. The likelihood is given as
$$L = \prod_{c=1}^{C} \prod_{i \in I_c} \prod_{t \in T_c} \sum_{z=1}^{Z} p(y_{cit} \mid x_{it}, \beta_{zi}, z)\, p(z \mid \theta_c), \qquad (7)$$
where $p(y_{cit} \mid x_{it}, \beta_{zi}, z)$ denotes the kernel of the binary probit model conditional on $z$, $T_c$ denotes the subset of $t$ in which customer $c$ purchased any product in the store, and $I_c$ is the subset of products $i$ purchased by customer $c$ at least once during the period $t = 1, \ldots, T$; i.e., $T_c = \{t \mid \sum_{i=1}^{I} y_{cit} > 0\}$ and $I_c = \{i \mid \sum_{t=1}^{T} y_{cit} > 0\}$. Equation (7) is difficult to use directly because the likelihood includes summations over the latent class $z$. Instead, we use a data augmentation approach [34] with respect to the latent variable $z$. We introduce variables $z_{cit} \in \{1, \ldots, Z\}$ denoting the label of the latent class for each customer $c$, each purchased product $i$ and each purchasing event $t$. Conditioning on the $z_{cit}$ for each purchasing transaction, as in the LDA [6], the likelihood in (7) simplifies to
$$L = \prod_{c=1}^{C} \prod_{i \in I_c} \prod_{t \in T_c} p(y_{cit} \mid x_{it}, \beta_{z_{cit} i}, z_{cit})\, p(z_{cit} = z \mid \theta_c),$$
where $p(z_{cit} = z \mid \theta_c)$ denotes a categorical distribution given $\theta_c$. Hereinafter, $(z_{cit} = z)$ is denoted as $z_{cit}$ to simplify notation.
The posterior distribution of the parameters, including the latent states $\{z_{cit}\}$ and the augmented utilities $\{u^{(z)}_{cit}\}$, is estimated by the variational Bayes inference described later.

Characteristics of the proposed model
Figure 1 presents a graphical representation of the proposed model. Here, it is noteworthy that $\{\beta_{zi}\}$ differs from the smoothing parameters in the LDA literature [6]. The $\{\beta_{zi}\}$ in our model, which are regression coefficient vectors for marketing activities, play a key role in our analysis because latent segments and augmented utilities are characterized by the estimated $\{\beta_{zi}\}$.
The latent classes $z$ serve to define types of purchase baskets across the $I$ products. The first term of (7) defines a vector of choice probabilities for each product under study, assuming that the purchase occasion is of type $z$. Products with high probability are likely to be jointly present in the basket. Therefore, our model identifies likely bundles of goods purchased on shopping trips of different types.

[Fig. 1: Graphical representation of the proposed model, with plates at the customer, purchase ($T_c$), product and segment levels; observed variables are $x_{it}$ and $y_{cit}$.]

The second term is the probability that a customer's purchases are of type $z$. Our model does not treat heterogeneity in the traditional manner of marketing models, in which there is a common set of customer parameters for all purchases of an individual. We instead assume that each purchase belongs to one of $Z$ types, and that customers can also be characterized in terms of the probability that their purchases are of these types.
Our model differs from related standard models in two respects. First, the likelihood is defined over products and time periods in which purchases are observed to take place at least once, as indicated by the variables $T_c$ and $I_c$. It is composed not only of purchase but also of nonpurchase occasions for identifying market response parameters. In this sense, our model differs from topic models used in text analysis, where the likelihood is formed using the words present in a corpus, not the words that are absent. Second, heterogeneity is introduced at the observation level, allowing the different transactions of a customer to reflect different latent states $z$ at every $(c, i, t)$, as denoted by $z_{cit}$. This provides useful information for characterizing customers and products and for predicting their purchases. It differs from the traditional latent class model [21], where the likelihood of all customer purchases contributes to inferences about a customer's latent class membership ($z$) and parameters ($\beta$).

Estimation of personalized market response coefficients
The estimated posterior means $\hat\beta_{zi}$, $\hat u^{(z)}_{cit}$ and $\hat z_{cit}$ can be transformed into statistics that are relevant for personalization. Here, the personalized response coefficient is recovered as the membership-weighted average of the latent-class coefficients,
$$\hat\alpha_{ci} = \sum_{z=1}^{Z} \Bigl( \frac{1}{|T_c|} \sum_{t \in T_c} \hat z_{citz} \Bigr) \hat\beta_{zi}. \qquad (12)$$
The estimates presented above provide a bridge between the granularity of the model, where heterogeneity is introduced at each point in the data cube, and managerial inferences and decisions that are made across products (e.g., which customers to reward), across customers (e.g., which products to promote) and over time. In addition, the standard $t$ test of standard linear regression models is useful for testing the significance of the estimates.
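One way to recover an individual coefficient vector from the latent-class estimates, reading the decompression step as a membership-weighted average, might look as follows. This is a sketch with random stand-ins for the posterior means; all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
Z, M, T_c = 10, 4, 20  # latent classes, marketing variables, purchase occasions of customer c

beta_zi = rng.normal(size=(Z, M))            # posterior-mean coefficients per latent class
z_cit = rng.dirichlet(np.ones(Z), size=T_c)  # per-occasion class responsibilities (T_c x Z)

w_z = z_cit.mean(axis=0)                     # average class membership over the T_c occasions
alpha_ci = w_z @ beta_zi                     # personalized coefficient vector (length M)
```

The weights sum to one, so each personalized vector lies in the convex hull of the latent-class coefficient vectors.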

Parameter estimation
We use variational Bayes (VB) inference [5,20] instead of the standard Markov chain Monte Carlo (MCMC) inference because MCMC methods can incur large computational costs in large-scale problems. VB inference approximates the posterior distribution of the target by variational optimization in a computationally efficient manner. This approximation is necessary for our analysis. VB has another advantage over MCMC in that it is not prone to the label-switching problem encountered in MCMC estimation [24]. The VB inference, the update equations and the derivations for our model are detailed in Appendices A and B. The precision and computation time of parameter estimation by VB and MCMC in several situations are shown in Appendices E and F, respectively.

Data description and settings
A customer database from a general merchandise store, recorded from April 1 to June 30, 2002, is used in our analysis. A customer identifier, price, display and feature variables were recorded for each purchase occasion. The dataset includes 94,297 transactions involving 1650 customers and 500 products. The products were chosen because each was displayed and featured at least once in the data period. The marketing variables are price ($P_{it}$), display ($D_{it}$) and feature ($F_{it}$); that is, $x_{it} = [1\ P_{it}\ D_{it}\ F_{it}]^T$. $P_{it}$ is the price relative to the maximum price of product $i$ in the observational period. Display and feature are binary entries, equal to one if product $i$ is displayed or featured at time $t$, and zero otherwise.
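Constructing $x_{it}$ from raw records follows directly from these definitions. Here is a minimal sketch with invented prices and promotion flags for one product:

```python
import numpy as np

# Invented raw records for one product over four occasions.
price   = np.array([198.0, 178.0, 198.0, 158.0])
display = np.array([0.0, 1.0, 0.0, 1.0])
feature = np.array([0.0, 0.0, 1.0, 1.0])

P_it = price / price.max()  # relative price: price divided by the maximum observed price
X = np.column_stack([np.ones_like(price), P_it, display, feature])  # each row is one x_it
```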
In VB estimation, the iterations are terminated when the variational lower bound improves by less than $10^{-3}$% of the current value in two consecutive iterations (the variational lower bound is described in Appendix C). The hyperparameters and initial values are set as explained in Appendix A. These settings for the hyperparameters and the stopping rule of the VB iterations are adopted hereinafter for all empirical studies.

Prediction performance
Table 1 presents the root-mean-square error (RMSE) of the four methods with respect to the number of latent classes $Z$. The RMSE measures the difference between the observed purchase behavior $y_{cit}$ and the predicted probability $p(y_{cit} = 1)$ in the data cube and is calculated using hold-out samples recorded during July 1-31, 2002; it thereby measures the prediction performance of the four methods on unknown samples. Additionally, the table shows that the decrease in RMSE of the proposed method becomes smooth around $Z = 10$. We illustrate the following analysis using the $Z = 10$ solution.
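The RMSE criterion is the usual root-mean-square error between observed purchases and predicted probabilities over the hold-out cube. A toy computation (the values are invented, not drawn from Table 1):

```python
import numpy as np

y_obs  = np.array([1.0, 0.0, 0.0, 1.0, 0.0])  # hold-out purchase indicators y_cit
p_pred = np.array([0.7, 0.2, 0.1, 0.5, 0.3])  # predicted probabilities p(y_cit = 1)

rmse = float(np.sqrt(np.mean((y_obs - p_pred) ** 2)))
```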
Selecting the number of segments using the variational lower bound is common practice for mixture models [5,11]. We tried but were unsuccessful in estimating an optimal $Z$ because the variance of the estimated variational lower bound across multiple trials for each $Z$ was too large to ascertain an optimum. Therefore, we leave this as an area for future research.

Heterogeneity analysis
The management of pricing, displays and feature activity within a store involves decisions that cut across time and customers, and which require knowledge of which product categories are most sensitive to these actions.More recently, targeted coupon delivery systems have allowed for the individual-level customization of prices.Managing these decisions requires a view of the sensitivity of customers and product categories to these actions.
Individual-level estimates of market response are obtained using Eq. (12), and a two-sided significance test at the 5% level is conducted on each estimate by $t$ test to decide the effectiveness of the marketing variables in the empirical analysis. In fact, customers will display variation in their sensitivity to variables such as price across product categories because of varying aspects of the product categories (e.g., necessary versus luxury goods, amount of product differentiation, price expectations) and different purposes of the shopping visit over time (e.g., shopping for oneself or others, large versus small shopping trip).
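The per-coefficient screening can be sketched as a simple two-sided test. This is a normal-approximation sketch, with 1.96 standing in for the exact $t$ quantile at the 5% level; the numeric inputs are invented:

```python
def is_significant(estimate, std_error, crit=1.96):
    """Two-sided test at the 5% level: reject 'no effect' when |t| exceeds crit."""
    return abs(estimate / std_error) > crit

# e.g., an invented price coefficient with its standard error
sig = is_significant(-0.42, 0.15)
```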
We can marginalize $\hat\alpha_{ci}$ over either of its arguments, $c$ or $i$, to obtain characterizations of customers and products useful for analysis. The empirical marginal distribution of customer-level parameter estimates is obtained by averaging across the 500 products in our analysis, i.e., $\sum_{i=1}^{I} \hat\alpha_{ci}/I$. A histogram over the 1650 customers for each marketing variable is displayed on the left side of Fig. 2, providing information related to the general distribution of heterogeneity faced by the firm for actions such as price customization. We find the individual estimates to be plausible in that the price coefficients are negative and the display and feature coefficients are positive.
We can also summarize heterogeneity across customers and examine the distribution of marketing-variable effects for the 500 products in our analysis. The empirical marginal distributions for individual products, averaging over the 1650 customers, i.e., $\sum_{c=1}^{C} \hat\alpha_{ci}/C$, are depicted in Fig. 3; products that were never displayed or featured in the data period are omitted from the histograms. These estimates are useful for ascertaining which product categories should receive merchandising support in the form of in-store displays and feature advertising. Results show that the estimates are plausible for most product categories, with negative price coefficients and positive display and feature coefficients, but fairly wide variation exists in the effectiveness of these variables across products. Many product categories appear to be unresponsive to merchandising efforts.

Personalized effective marketing promotions
Figure 4 provides a two-dimensional summary of the data and coefficient estimates for the top 100 products and customers. Figure 4a is a scatter plot of the two-dimensional data cube with respect to customers ($c$) and products ($i$), aggregated along the time ($t$) dimension. If a customer has never purchased a specific product in the dataset, then the coordinate $(i, c)$ is colored white; it is black if they have purchased the product at least once. We observe that the customer-product space is still very sparse.
Figures 4b-d show the results of testing at the 5% significance level for nonzero individual response coefficients. In Fig. 4b, coordinates with a significant price coefficient are indicated in black; white indicates that the estimate is not significant. The effectiveness of display and feature promotions is defined similarly. We find that our model produces many significant price, display and feature coefficients.
An interesting aspect of our analysis is that, because of the imputation present in the latent variable model for nonpurchases, significant coefficients can arise even when a customer has never purchased a product. The latent variable model greatly reduces the dimensionality of the data cube and produces individual estimates in a sparse data environment. Our analyses yield coefficient estimates at the level of individual customers and products by way of latent topics that transcend the product categories. Our model enables marketers to develop effective pricing and promotional strategies by recognizing the latent topics, or shopping baskets, present at each point in time in the data cube.

Conclusion
We proposed a descriptive model of demand based on the idea of latent variables underlying the products purchased by customers. We allow a product's purchase probability to be affected by price, display and feature advertising variables, but do not treat purchases as arising from a process of constrained utility maximization. An important benefit of this approach is that it enables us to side-step complications associated with competitive effects and to model a much larger set of products than is possible with existing economic models. By retaining prices and other marketing variables in our model, we can predict the effects of these variables on a product's own sales. This trade-off is unavoidable in the analysis of transaction databases where purchases are tracked across thousands of products. The proposed model is applicable to personalized marketing across numerous and diverse products. We show how the model produces information useful for personalized marketing for both specific customers and specific products, and how it effectively accommodates the data sparseness caused by infrequent customer purchases.
Future research will combine marketing models with other latent variable models or tensor factorization methods and compare their prediction performance with that of the proposed model. We would also like to apply the method to other market datasets to verify its prediction performance. Additionally, our model assumes that the topic structure is stable over time. However, it is possible that customers' market responses and purchase patterns change over time because of factors such as new trends, state dependence and the arrival of new purchase and delivery technologies. We believe that the development of a dynamic topic model for purchases is an interesting extension of our work, and we leave this point for future research.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix A: Variational Bayes inference for the proposed model
This appendix details the variational inference of the proposed model. The target and approximate distributions are denoted, respectively, as $p$ and $q$; the latter is called the variational distribution. Distributions $p$ and $q$ share a parameter set $\Theta$. In general, when the data $D$ are given, the log marginal likelihood $\log p(D)$ of the target distribution is decomposed into two components as
$$\log p(D) = L(q) + KL(q \,\|\, p), \qquad \text{(A1)}$$
where $L(q)$ is the variational lower bound in VB inference and $KL(q \,\|\, p)$ is the Kullback-Leibler divergence between the variational and target distributions. $KL(q \,\|\, p)$ is well known to be zero if $p$ and $q$ are the same distribution. Therefore, a reasonable solution for estimating the posterior distribution $p$ is the variational distribution $q$ for which $KL(q \,\|\, p)$ is minimized. However, it is difficult to evaluate $KL(q \,\|\, p)$ because the expression involves the posterior distribution $p(\Theta \mid D)$. In contrast, $L(q)$ involves the joint distribution $p(D, \Theta)$, which is easily evaluated in many cases because it is the product of the prior and the likelihood in Bayesian models. In fact, maximizing $L(q)$ is equivalent to minimizing $KL(q \,\|\, p)$ because the log marginal likelihood of the target distribution is constant for a given dataset. Under these circumstances, assuming that the distribution $q$ and the parameter set $\Theta$ factorize into groups,
$$q(\Theta) = \prod_{j=1}^{J} q_j\bigl(\Theta^{(j)}\bigr),$$
the variational parameters can be maximized by the following updating algorithm [20]:
$$\log q_j\bigl(\Theta^{(j)}\bigr) = E_{q_{\setminus j}}\bigl[\log p(D, \Theta)\bigr] + \mathrm{const.} \qquad (13)$$
The variational parameters are updated for each parameter set $\Theta^{(j)}$ until convergence of the algorithm. The initial variational parameters are proper random values. The VB algorithm is guaranteed to converge after several iterations because $L(q)$ is convex with respect to each $q_j(\Theta^{(j)})$ [5]. The variational lower bound increases monotonically as the iterations proceed; therefore, convergence can be confirmed by checking the value of $L(q)$ at each iteration.
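The coordinate-ascent scheme above can be sketched generically. In the snippet below, the update and lower-bound functions are placeholders standing in for the model-specific equations of this appendix, and the monotonicity check mirrors the convergence test on $L(q)$:

```python
def run_vb(factors, update, lower_bound, tol=1e-6, max_iter=100):
    """Generic coordinate-ascent VB: update each factor in turn, stop when L(q) stalls."""
    prev = float("-inf")
    for _ in range(max_iter):
        for j in range(len(factors)):
            factors[j] = update(j, factors)  # maximize L(q) in q_j with the rest fixed
        lb = lower_bound(factors)
        assert lb >= prev - 1e-9             # L(q) must increase monotonically
        if lb - prev < tol:
            break
        prev = lb
    return factors, lb

# Toy instance: each "factor" is pulled to 1, and the bound is minus the squared error.
facs, lb = run_vb([0.0, 5.0],
                  update=lambda j, f: 1.0,
                  lower_bound=lambda f: -sum((x - 1.0) ** 2 for x in f))
```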
We introduce the variational distributions and parameters for the proposed model. The variational distribution is configured as
$$q(\Theta) = q_c\, q_z\, q_u\, q_\beta\, q_{\mu,V},$$
where $q_c$ is a Dirichlet distribution with variational parameter $\gamma^*_c$, $q_z$ represents a categorical distribution with variational parameter $\theta^*_{cit}$, $q_u$ denotes a truncated normal distribution, $q_\beta$ stands for an $M$-dimensional multivariate normal distribution with two variational parameters (mean vector $\beta^*_{zi}$ and covariance matrix $V_{\beta^*_{zi}}$), and $q_{\mu,V}$ signifies a multivariate normal-inverse Wishart distribution with variational parameters $\mu^*_i$, $\sigma_{\mu^*_i}$, $w^*_i$, $W^*_i$. We set the hyperparameters as $\gamma = [0.01, \ldots, 0.01]^T$, $\mu = [0, \ldots, 0]^T$, $\sigma^2 = 1$, $W = I_M$ and $w = 10$, and the initial values $\{\theta^*_{cit}\}$ are set to proper random values.

Appendix B: Derivation of VB algorithm for proposed model
The update procedure derives from the analytical calculation of Equation (13). The update equation for each variational parameter is obtained from the following expectation values, where D = {{x_it}, {y_cit}}.
The update procedures of variational parameters

Optimization of $\gamma^*_c$
The Dirichlet and categorical distributions are of the following forms.
Therein, Γ(·) is the gamma function. Also, δ(z_{cit} = z) is the delta function defined such that δ(z_{cit} = z) = 1 if z_{cit} = z and δ(z_{cit} = z) = 0 otherwise. The expectation value $E_{\Theta \setminus q_\theta}[\log p(D, \Theta)]$ is then calculated for each c, where $\theta^*_{cit z}$ is an element of $\theta^*_{cit}$. Here and hereinafter, const. denotes any term not involving the relevant parameters. The second line of the above equations describes a log-Dirichlet function with parameter $\gamma_z + \sum_{i \in I_c}\sum_{t \in T_c} \theta^*_{cit z}$. Therefore, we obtain the update of $\gamma^*_c$.

Here we designate the digamma function as Ψ(·), which will be useful for later discussion, and we summarize the properties of the truncated normal distribution in the probit model. $u^{(z)}_{cit}$ follows a normal distribution with mean $x^T_{it}\beta_{zi}$ and variance 1. Moreover, $u^{(z)}_{cit}$ must satisfy $y_{cit} = 1$ if $u^{(z)}_{cit} > 0$ and $y_{cit} = 0$ otherwise, so $u^{(z)}_{cit}$ is generated from a truncated normal distribution. Its expectation value follows from the standard truncated normal moments,

$$E\big[u^{(z)}_{cit}\big] = x^T_{it}\beta_{zi} + \frac{\phi(x^T_{it}\beta_{zi})}{\Phi(x^T_{it}\beta_{zi})} \;\; (y_{cit} = 1), \qquad E\big[u^{(z)}_{cit}\big] = x^T_{it}\beta_{zi} - \frac{\phi(x^T_{it}\beta_{zi})}{\Phi(-x^T_{it}\beta_{zi})} \;\; (y_{cit} = 0),$$

where φ and Φ denote the PDF and CDF of the standard normal distribution. Consequently, the expected value $E_{\Theta \setminus q_z}[\log p(D, \Theta)]$ is given by Eq. (B9). The first term on the right-hand side of Eq. (B9) is obtained as in [6], whereas the second term is evaluated as Eq. (B10). To solve Eq. (B9) for $\theta^*_{cit z}$, we must evaluate the four terms of Eq. (B10). The first term includes a CDF for which the expectation value is difficult to obtain analytically. Therefore, we expand the term as a zeroth-order Taylor expansion in terms of the CDF of the normal distribution and the logarithm function. Such bold approximations are standard strategies for adapting topic models with VB to practical computation (e.g., the zeroth-order Taylor approximation of [4,33] and the zeroth- and first-order delta approximations of [8]). The four expectation values in Eq. (B10) are then written out, and finally $\theta^*_{cit z}$ is updated accordingly.
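The truncated normal moments above are easy to check numerically. The helper below uses our own naming; `mu` stands for $x^T_{it}\beta_{zi}$, and the closed forms are cross-checked against `scipy.stats.truncnorm`.

```python
import numpy as np
from scipy.stats import norm, truncnorm

# Mean of the latent utility u ~ N(mu, 1) under probit truncation:
# truncated to u > 0 when y = 1 and to u <= 0 when y = 0.
def trunc_mean(mu, y):
    if y == 1:
        return mu + norm.pdf(mu) / norm.cdf(mu)    # E[u | u > 0]
    return mu - norm.pdf(mu) / norm.cdf(-mu)       # E[u | u <= 0]

mu = 0.3
# Cross-check against scipy's generic truncated normal on (0, inf) ...
m_pos = truncnorm.mean((0 - mu) / 1.0, np.inf, loc=mu, scale=1.0)
# ... and on (-inf, 0].
m_neg = truncnorm.mean(-np.inf, (0 - mu) / 1.0, loc=mu, scale=1.0)
```

The truncation pulls the mean above `mu` when a purchase is observed and below it otherwise, which is exactly what the data augmentation requires.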

Optimization of $\beta^*_{zi}$ and $V_{\beta^*_{zi}}$
First, we derive the inverse Wishart distribution function and adopt some well-known properties of the multivariate normal and inverse Wishart distributions [2,5].
We obtain the optimization procedures of $\beta^*_{zi}$ and $V_{\beta^*_{zi}}$ from the following expected value. The first and second terms of the second line are given by the last and third lines of Eq. (B11), whereas the third and fourth terms are given, respectively, by Eqs. (B2) and (B3), derived in a manner similar to Eq. (B10). $\beta^*_{zi}$ and $V_{\beta^*_{zi}}$ are then updated arithmetically, where $\bar{u}_{zi} \equiv E[u]$. Here $\bar{u}_{zi}$ is a vector, and $X_i$ and $X_{zi}$ are matrices; the numbers of elements in $\bar{u}_{zi}$, $X_i$ and $X_{zi}$ are determined by the size of the customer base and by $T_c$.
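Because the prior on $\beta_{zi}$ is normal and the latent utilities have unit variance, the arithmetic update has the familiar conjugate Bayesian linear regression form. The sketch below shows that generic form with simulated stand-ins (our own variable names; the class-level weighting by θ* is omitted for brevity).

```python
import numpy as np

# Generic conjugate update: latent utilities u = X beta + eps with
# eps ~ N(0, I), prior beta ~ N(mu0, V0).  The posterior is
# N(beta_star, V_star), matching the arithmetic form of the
# beta*_zi / V_beta*_zi update.
rng = np.random.default_rng(0)
M = 3                                    # number of marketing covariates
X = rng.normal(size=(200, M))            # stacked covariate rows x_it
beta_true = np.array([0.5, -1.0, 0.2])
u_bar = X @ beta_true + rng.normal(size=200)   # stand-in for E[u]

mu0 = np.zeros(M)                        # prior mean (plays the role of mu_i)
V0 = np.eye(M)                           # prior covariance (plays the role of V_i)

V_star = np.linalg.inv(X.T @ X + np.linalg.inv(V0))
beta_star = V_star @ (X.T @ u_bar + np.linalg.inv(V0) @ mu0)
print(beta_star)                         # close to beta_true
```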

Optimization of $\mu^*_i$, $\sigma_{\mu^*_i}$, $w^*_i$ and $W^*_i$
Here we consider the joint distribution of a multivariate normal distribution of $\mu_i$ and an inverse Wishart distribution of $V_i$, and we derive the update equations for the four types of variational parameters from this joint distribution. To this end, we require the following expectation value of the joint distribution function.
First, we extract from this expectation value all terms linked to the variational parameters $\mu^*_i$ and $\sigma_{\mu^*_i}$. The second term of the resulting expression is obtained in the same manner as Eq. (B14). The multivariate normal distribution function is then constructed in a straightforward manner. Next, we optimize $w^*_i$ and $W^*_i$ using Eq. (B14) and the corresponding relation for log q. The expectation value $E_{\Theta \setminus q_V}[\log p(D, \Theta)]$ is calculated in a straightforward manner using Eqs. (B15) and (B16). Finally, we obtain the update equations for $w^*_i$ and $W^*_i$. It is noteworthy that $\sigma_{\mu^*_i}$ and $w^*_i$ are constant if the hyperparameters and the number of latent classes are given.
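Because $q_{\mu,V}$ is normal-inverse Wishart, the updates take the textbook conjugate NIW form; in particular, the count-based parameters depend only on the hyperparameters and the number of latent classes, which is why $\sigma_{\mu^*_i}$ and $w^*_i$ stay constant. The sketch below shows that generic NIW update with our own variable names and simulated inputs, not the paper's exact equations.

```python
import numpy as np

# Textbook normal-inverse-Wishart conjugate update for (mu, V) given
# Z "observations" (here, stand-ins for the class-level coefficient
# vectors beta_zi).  kappa_star and w_star depend only on the
# hyperparameters and Z, mirroring the constancy noted in the text.
rng = np.random.default_rng(1)
M, Z = 3, 5
betas = rng.normal(size=(Z, M))          # simulated beta*_zi vectors

mu0 = np.zeros(M); kappa0 = 1.0          # prior mean and prior strength
w0 = 10.0; W0 = np.eye(M)                # prior d.o.f. and scale (w, W)

beta_bar = betas.mean(axis=0)
S = (betas - beta_bar).T @ (betas - beta_bar)    # scatter matrix

kappa_star = kappa0 + Z                  # constant given hyperparams and Z
w_star = w0 + Z                          # likewise constant
mu_star = (kappa0 * mu0 + Z * beta_bar) / kappa_star
d = (beta_bar - mu0).reshape(-1, 1)
W_star = W0 + S + (kappa0 * Z / kappa_star) * (d @ d.T)
```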

Posterior means $\hat{\beta}_{zi}$, $\hat{u}^{(z)}_{cit}$ and $\hat{z}^{(z)}_{cit}$
The estimated posterior means $\hat{\beta}_{zi}$, $\hat{u}^{(z)}_{cit}$ and $\hat{z}^{(z)}_{cit}$ used in Sect. 4 to construct statistics for joint segmentation and personalization are calculated from the VB estimates after the iterative procedure converges.

Appendix C: Variational lower bound of proposed model
The variational lower bound $L(\Theta^*)$ is given as a sum in which each component represents the expectation of the variables of the proposed model. The expectations except $L^{(p)}_u$ and $L^{(q)}_u$ are straightforward. The entropy of $u^{(z)}_{cit}$ is given in terms of a random variable ξ of the distribution [16]. The value of $L(\Theta^*)$ is calculated as the summation of the 10 expectations (C1)-(C10) above.

Appendix E: Simulation study

… $N_M(\bar{\beta}_3, \sigma I_M)$ (c = 201, …, 300), and σ is set as 0.1. $I_M$ is the identity matrix of size M. Then, the utilities for 30 days are simulated as $\bar{u}_{cit} = x^T_{it}\bar{\alpha}_{ci} + \bar{\varepsilon}_{cit}$ ($\bar{\varepsilon}_{cit} \sim N(0, 1)$). The purchase records $\{\bar{y}_{cit}\}$ are generated as $\bar{y}_{cit} = 1$ if $\bar{u}_{cit} > 0$ and $\bar{y}_{cit} = 0$ otherwise.
Here, we generate 10 simulation datasets using the procedures explained above. Table 2 presents the means and standard deviations of the estimates after 200 iterations over the ten simulation datasets.

Appendix F: Computation time
The computation time is investigated for C = {1000, 5000, 10000}, I = {100, 500, 1000}, T = 30 and Z = {5, 10, 20} in the same setting as the simulation study of Appendix E. Consequently, 27 scenarios were explored in the study. The MCMC estimator is described in Appendix D. For computational feasibility, we extrapolate the time for 6000 MCMC samples from 10 samples; the choice of 6000 MCMC samples is consistent with the simulation study of [8]. The simulated data are the same as those used above. The results reported below were calculated in an identical computational environment (64-bit Python 2.7.5 with NumPy, implemented on a 3.5-GHz processor (Quad-Core Xeon; Intel Corp.) with 256-GB memory). Table 3 reports the computation time in hours for the VB and MCMC estimators. For both algorithms, the computational cost increases linearly with the size of the dataset, specified in terms of the numbers of customers, products and latent classes. In all scenarios, the MCMC computation times exceed those of VB; the VB algorithm is approximately 20-50 times faster than MCMC, depending on the scenario. The computation time for large-scale data (C = 10000, I = 1000) by MCMC is estimated at over 450 h. MCMC becomes increasingly prohibitive as the numbers of customers and choice alternatives increase.

Appendix G: Interpretation of latent classes
We obtain the probability of customer segment membership by aggregating over products (i) and time (t), and aggregating over customers (c) and time (t) yields the probability of product segment membership.
Therein, I(·) is the indicator function, equal to one if the argument holds and zero otherwise. We take the sums over the instances of purchase because we believe that nonpurchase can occur for many reasons other than nonmembership (e.g., a large household inventory of the product). Our estimates of customer and product latent membership are driven by customers' actions, not their inactions. Our model of purchase behavior allows for heterogeneity at each observation, acknowledging that each purchase occasion can be viewed as a building block for analysis. Some occasions are associated with trips to the store, whereas other occasions might be focused on a specific set of offerings. Moreover, customers might exhibit behavior consistent with multiple occasions, or topics, over time. Although it might be desirable for firms to classify goods and respondents into segments for understanding customers and goods of different types, our model can be applied to analysis at a more disaggregate level. Alternatively, our model is useful for associating both offerings and customers with latent topics, or segments, for understanding and managing market basket purchases. Table 4 displays the results of the joint segmentation of products and customers using Eqs. (C1) and (C2). The five products with the highest probability and their average levels of marketing activity are shown for the respective segments. The first column reports the brand name. The second column reports the product category associated with the offering. The remaining columns display the average level of marketing activity, i.e., the average price rate, average display rate and average feature rate. The title of each segment includes the numbers of products and customers jointly classified into the same segment. The segments are interpreted as follows.
The first segment has 31 customers and 9 products assigned to it. This segment includes beverages across different categories with small discount rates and low rates of feature advertising. The second segment is composed of identical brands in the dessert category; the products are infrequently discounted and have a higher rate of display than in the first segment. Segments 3 through 7 have fewer customers and products and exhibit greater variation in the level of marketing activity. Particularly, Segment 5 contains two offerings with the same brand names in both the ice cream and dressing categories, both with high rates of display and feature activity. Segment 6 includes mainly products from the drink category and is similar in marketing activity to Segment 5. Segment 7 also comprises drink products with higher marketing levels, as well as other products with lower levels of activity. Products in Segment 8 span a variety of product categories with a higher level of display. Segment 9, the largest cluster with 946 customers and 332 products, is characterized as having the highest level of display activity. Segments 8 and 10 include less frequently discounted and more frequently displayed products; the former is double the size of the latter in customers and triple in products.
The potential use of this information lies in managing cross-category behavior. Knowing the products typically purchased on shopping trips of different types is useful for ascertaining the range of impact of price promotions and merchandising activity. If customers have a budget for a particular shopping occasion, rather than for a particular product category, then a price reduction will have a broader effect than in traditional models of demand. Our model allows the identification of the boundary of these effects as part of the topic, or latent segment, characterization.
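The membership aggregation described above can be sketched directly from the variational class probabilities, summing over purchase instances only. The array names and simulated inputs below are our own; `theta[c, i, t]` stands for $\theta^*_{cit}$ and `y[c, i, t]` for the purchase indicator.

```python
import numpy as np

rng = np.random.default_rng(0)
C, I, T, Z = 4, 6, 5, 3
theta = rng.dirichlet(np.ones(Z), size=(C, I, T))  # simulated theta*_cit
y = rng.integers(0, 2, size=(C, I, T))             # simulated purchase flags
y[:, 0, 0] = 1    # ensure every customer has at least one purchase
y[0, :, 0] = 1    # ensure every product has at least one purchase

masked = theta * y[..., None]            # keep purchase instances only
cust = masked.sum(axis=(1, 2))           # aggregate over products and time
cust /= cust.sum(axis=1, keepdims=True)  # customer segment membership (C x Z)
prod = masked.sum(axis=(0, 2))           # aggregate over customers and time
prod /= prod.sum(axis=1, keepdims=True)  # product segment membership (I x Z)
```

Masking by the purchase indicator is what makes the estimates action-driven: rows of `cust` and `prod` are valid probability vectors built only from observed purchases.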

Fig. 2 Marginal distribution of parameter estimates of individual customers

Table 1
RMSEs of predictions of the four methods: the probit model, the logit model, the latent class logit model [21] and the proposed method. The RMSEs of the probit and logit models are calculated on the presumption that the data are generated by a single consumer's behavior for each product. The latent class logit model assumes latent classes of customers only. The RMSEs of these three models are calculated independently for each product. The calculation of RMSEs for the latent class logit model with Z = 10, 15 and 20 did not converge. We used the R function glm for the probit and logit models and the R package FlexMix for the latent class logit model. Results show that the proposed method has higher prediction performance than the other methods.

Table 2
Estimates for the simulation data, calculated as $50^{-1}\sum_{i=1}^{I}\hat{\beta}_{zi}$, where $\hat{\beta}_{zi}$ is the estimated posterior mean of $\beta_{zi}$. Results indicate that the VB estimates are close to the true values for all parameters in every segment.

Table 3
Computation time (in hours) by VB and MCMC

Table 4
Characteristics of latent classes