How to Generalize from a Hierarchical Model?

Models of consumer heterogeneity play a pivotal role in marketing and economics, specifically in random coefficient or mixed logit models for aggregate or individual data and in hierarchical Bayesian models of heterogeneity. In applications, the inferential target often pertains to a population beyond the sample of consumers providing the data. For example, optimal prices inferred from the model are expected to be optimal in the population and not just optimal in the observed, finite sample. The population model, random coefficients distribution, or heterogeneity distribution is the natural and correct basis for generalizations from the observed sample to the market. However, in many if not most applications standard heterogeneity models such as the multivariate normal, or its finite mixture generalization lack economic rationality because they support regions of the parameter space that contradict basic economic arguments. For example, such population distributions support positive price coefficients or preferences against fuel-efficiency in cars. Likely as a consequence, it is common practice in applied research to rely on the collection of individual level mean estimates of consumers as a representation of population preferences that often substantially reduce the support for parameters in violation of economic expectations. To overcome the choice between relying on a mis-specified heterogeneity distribution and the collection of individual level means that fail to measure heterogeneity consistently, we develop an approach that facilitates the formulation of more economically faithful heterogeneity distributions based on prior constraints. In the common situation where the heterogeneity distribution comprises both constrained and unconstrained coefficients (e.g., brand and price coefficients), the choice of subjective prior parameters is an unresolved challenge. 
As a solution to this problem, we propose a marginal-conditional decomposition that avoids the conflict between wanting to be more informative about constrained parameters and only weakly informative about unconstrained parameters. We show how to efficiently sample from the implied posterior and illustrate the merits of our prior as well as the drawbacks of relying on means of individual level preferences for decision-making in two illustrative case studies.


Introduction
Models of consumer heterogeneity play a pivotal role in marketing and economics. Typical applications are random coefficients or mixed logit models for aggregate or panel data (e.g., Revelt and Train, 1998; Train, 2009), and hierarchical Bayesian models. Influential applications of these models involve inference from household scanner panel data or from discrete choice experiments (e.g., Allenby, Arora, and Ginter, 1998; Allenby and Lenk, 1994; Dubé, Hitsch, and Rossi, 2010; Rossi, McCulloch, and Allenby, 1996; Sawtooth, 2013). In most applications, the inferential target pertains to a population beyond the sample of consumers providing the data for model calibration. For example, pricing, product design, or product line decisions informed by the sample data through the model are expected to be optimal in the population and not just in the observed, finite sample. The population model, i.e., the heterogeneity or random coefficients distribution, is the natural and correct basis for generalizations from the observed sample of consumers or respondents to the market. The fact that inferences about parameters of this distribution are consistent in the sample size ($N$), even if the number of observations contributed by each consumer ($T$) is very small, makes this approach attractive from a statistical perspective.
Unfortunately, standard population distributions often lack economic rationality. For example, Reiss and Wolak (2007) remark that the estimated distribution of marginal utility of fuel economy in Berry, Levinsohn, and Pakes (1995) suggests that about half of consumers in the car market dislike fuel economy. As another example, Dubé et al. (2010) and Dubé, Hitsch, Rossi, and Vitorino (2008) find support for positive price coefficients in the inferred heterogeneity distribution. Such economically unreasonable characterizations of consumer heterogeneity prevent meaningful counterfactual predictions from the model. As an obvious example, models that support positive price coefficients in the inferred heterogeneity distribution preclude model based price optimization.
While a completely theory driven specification of heterogeneity distributions appears to be beyond reach, some authors argue in favor of theory driven constraints in the population distribution (e.g., Allenby, Brazell, Howell, and Rossi, 2014;Boatwright, McCulloch, and Rossi, 1999). The goal is a heterogeneity model that is maximally flexible regarding some aspects of the population distribution, but deterministically constrained by economic theory regarding other aspects of this distribution. This paper builds on this idea and develops it further.
In applications, a prior understanding of preferences in the population often suggests a large number of sign and order restrictions, for example, that the price parameter in an indirect utility function is negative or that consumers prefer a more fuel-efficient to a less fuel-efficient car, everything else equal. So-called constrained parameter problems are relevant across academic fields, and a body of literature has dealt with this topic. Gelfand, Smith, and Lee (1992) provide an overview of how to impose sign and order constraints based on truncated distributions using Gibbs sampling. Allenby, Arora, and Ginter (1995) introduce this approach into marketing in the context of individual level conjoint analysis. Boatwright et al. (1999) develop a sampler in the spirit of Gelfand et al. (1992), but for a hierarchical sales response regression model. However, sign and order restrictions in models of heterogeneity still present unresolved challenges. In principle, one could adopt truncated normal distributions that implement prior constraints as outlined in Gelfand et al. (1992) for heterogeneity distributions. However, as we show below, any truncated distribution of heterogeneity leads to a so-called "doubly intractable" inference problem. The log-normal prior avoids this difficulty. The basic idea of using log-normal distributions to implement sign and order constraints is not new. For example, Allenby et al. (2014) use the exponential transformation $\beta_p = -\exp(\beta_p^*)$, with $\beta_p^* \in \mathbb{R}$ distributed according to a hierarchical normal mixture prior, to enforce that the model has zero support for positive price coefficients. In this specification, the problem is that $\beta_p^*$ is measured on the log scale and standard diffuse subjective prior settings imply absurdly large and small values of the transformed coefficients $\beta_p$ (e.g., Allenby et al., 2014). In the common situation where the heterogeneity distribution thus comprises both constrained and unconstrained coefficients, the choice of subjective prior parameters is an unresolved challenge.
As a solution to this problem we propose a marginal-conditional decomposition that avoids the conflict between wanting to be more subjectively informative about constrained parameters and only weakly informative about unconstrained parameters. We show that this decomposition is important whenever the heterogeneity distribution comprises a mix of constrained and unconstrained coefficients, e.g., brand and price coefficients. Our decomposition applies both to the fully parametric multivariate normal setting as well as to its semi-parametric generalizations. In addition, we show how to efficiently sample from the implied posterior building on the likelihood based pre-tuning of proposal densities in Rossi, Allenby, and McCulloch (2005).
Finally, we contrast profit implications of relying on the inferred population distribution to an ad-hoc approach that approximates heterogeneity using means of individual level coefficients. This latter approach is still common in applied academic and industry research. It is ad-hoc because it fails to measure heterogeneity consistently, distorting inference towards the population mean. As a consequence, markets will misleadingly appear too homogeneous, translating into too little product differentiation and too much price competition in counterfactual calculations. A side-effect of this distortion is a reduction of sign and order violations in the approximated heterogeneity distribution, which likely contributed to the popularity of this ad-hoc approach.
In a nutshell the goal of this paper is to facilitate the formulation of more economically faithful hierarchical prior distributions of heterogeneity for better market simulators and improved counterfactual calculations. We thereby hope to broaden the applicability of models of heterogeneity, and to convince applied academic and industry researchers to abandon market simulators built on means of individual level preferences. The remainder of the paper proceeds as follows: Section 2 formally introduces different ways of generalizations from hierarchical Bayesian models and discusses implications for market simulation. In Section 3 we develop the hierarchical prior formulation and in Section 4 we discuss efficient sampling of individual level coefficients. Section 5 then investigates the relative performance of the proposed approach using simulated data. Sections 6 and 7 report the results from two empirical illustrations based on household scanner panel data on purchases of fresh hen's eggs (Kotschedoff and Pachali, 2020) and data from a discrete-choice experiment on tablet PCs. Finally, we summarize and discuss results in Section 8.

Different ways of generalizations and market simulations
Different ways of generalizing from hierarchical models to consumer preferences, choices, and market shares in the target population are best illustrated in a decision theoretic framework. For this purpose, and without loss of generality, we abstract away from competition and fixed costs, and assume constant marginal prices and costs in the following. If the decision-maker knew the distribution of preferences in the population, denoted as $p(\beta|\tau)$, he would choose the action $a \in A$ that maximizes profits $\int \pi(a,\beta)\,p(\beta|\tau)\,d\beta = E_{\beta|\tau}[\pi(a,\beta)] = \pi(a)$ by solving the following maximization problem:
$$\max_{a \in A}\ \pi(a), \qquad \pi(a) \propto \int MS(a,\beta)\,\big(P(a) - C(a)\big)\,p(\beta|\tau)\,d\beta \tag{1}$$
Here $MS(a,\beta)$ is the market share from action $a$ and preference $\beta$, as implied by a choice model, $C(a)$ denotes marginal costs associated with action $a$, and $P(a)$ the marginal price, which may itself constitute an action; thus $(P(a) - C(a))$ is the contribution margin. Finally, the proportionality results from ignoring the market size.
Because the preference distribution in the population is generally unknown, the decision-maker forms an expectation about profits based on data $Y = (y_1, \dots, y_i, \dots, y_N)$, where $y_i$ is the $T_i$-vector of observations from individual $i$ in the sample, and based on prior assumptions about the choice model underlying $MS(a,\beta)$, the distribution of preferences in the population $p(\beta|\tau)$, and the parameters $\tau$ in this distribution. He then maximizes the posterior expected profit:
$$E[\pi(a) \mid Y] \propto \int\!\!\int MS(a,\beta)\,\big(P(a) - C(a)\big)\,p(\beta|\tau)\,p(\tau|Y)\,d\beta\,d\tau \tag{2}$$
This estimator of expected profits entirely relies on posterior knowledge of the hierarchical prior distribution. We thus refer to this approach as "generalizing based on the hierarchical prior". It is easily computed to an arbitrary degree of precision based on MCMC draws from the posterior distribution $p(\tau|Y)$ coupled with draws from the hierarchical prior distribution $p(\beta|\tau)$. However, because it entirely relies on the posterior of the hierarchical prior, all prior parametric assumptions will come to bear.
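To make the computation concrete, the following minimal sketch approximates posterior expected profit on a price grid by nesting draws of $\beta$ within draws of $\tau$. Everything specific in it is an assumption made for illustration: a single product against an outside good, a binary logit share, a log-normal price coefficient, and synthetic "posterior draws" of $\tau$ that stand in for actual MCMC output.

```python
import math
import random

def logit_share(price, beta_brand, beta_price):
    """Binary logit share of one product against an outside good with utility 0."""
    v = beta_brand + beta_price * price
    return 1.0 / (1.0 + math.exp(-v))

def expected_profit(price, cost, tau_draws, rng, n_beta=200):
    """Monte Carlo estimate of posterior expected profit per unit of market size."""
    total = 0.0
    for mu_b, sd_b, mu_p, sd_p in tau_draws:        # stand-ins for draws from p(tau | Y)
        for _ in range(n_beta):                      # draws from p(beta | tau)
            b = rng.gauss(mu_b, sd_b)                # unconstrained brand coefficient
            p = -math.exp(rng.gauss(mu_p, sd_p))     # sign-constrained price coefficient
            total += logit_share(price, b, p) * (price - cost)
    return total / (len(tau_draws) * n_beta)

rng = random.Random(1)
# Hypothetical posterior draws of tau = (brand mean, brand sd, log-price mean, log-price sd).
tau_draws = [(1.0 + rng.gauss(0, 0.05), 0.5, rng.gauss(0, 0.05), 0.3) for _ in range(50)]

cost = 0.5
grid = [0.5 + 0.1 * j for j in range(30)]
# Re-seeding per price gives common random numbers across the grid.
profits = [expected_profit(p, cost, tau_draws, random.Random(2)) for p in grid]
best_price = grid[profits.index(max(profits))]
print("profit-maximizing price on the grid:", round(best_price, 2))
```

Using the same random numbers at every grid point keeps the estimated profit curve smooth in price, so the location of the maximum is not driven by simulation noise.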
If, for example, the hierarchical prior supports positive and negative price coefficients as in a normal distribution, the posterior of the hierarchical prior will necessarily, and possibly substantially, support positive price coefficients. The problem may persist even if the data reliably locate all individual specific posterior price coefficient distributions in the negative domain. The reason is that the best normal approximation matches the first and second moment of the distribution to be fitted, which may result in substantial support for positive coefficients even if all coefficients to be fitted are negative.
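A small numerical illustration of this moment-matching argument follows. It assumes a hypothetical population in which every price coefficient is the negative of a log-normal variate with log-scale mean 0 and variance 1, and fits a normal distribution by matching mean and standard deviation. For this particular choice, the moment-matched normal places roughly 22 percent of its mass on positive values even though no coefficient in the population is positive.

```python
import math
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(0)

# Hypothetical population: all price coefficients are strictly negative.
betas = [-math.exp(rng.gauss(0.0, 1.0)) for _ in range(100_000)]

# Normal approximation that matches the first and second moment.
fit = NormalDist(mu=mean(betas), sigma=stdev(betas))

# Mass that the fitted normal assigns to positive price coefficients.
share_positive = 1.0 - fit.cdf(0.0)
print("largest coefficient in the population:", max(betas))
print("normal-implied share of positive coefficients:", round(share_positive, 3))
```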
To mitigate the extrapolation of parametric assumptions in directions that violate economic theory, market simulators often rely on the collection of individual level posterior mean estimates $\{\bar\beta_i\}_{i=1}^{N}$, where $\bar\beta_i = \int \beta_i\,p(\beta_i|Y, y_i)\,d\beta_i$. The shrinkage of individual level posterior means to the population mean in general reduces the number of sign and order violations, albeit at the expense of severely inconsistent inferences about heterogeneity. Expected profits from action $a$ are then estimated as:
$$\hat\pi(a) \propto \frac{1}{N}\sum_{i=1}^{N} MS(a,\bar\beta_i)\,\big(P(a) - C(a)\big) \tag{3}$$
However, as we illustrate in Appendix A.1, this estimator, which aggregates individual level estimates that are optimal in the sense of a bias-variance trade-off, itself fails optimality criteria and is inconsistent no matter how large the sample of consumers $N$, as long as individual level likelihoods are not perfectly informative about individual level preferences. In practice, individual level likelihoods tend to be diffuse, which motivates hierarchical models in the first place.
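The inconsistency can be seen in a stylized normal-normal setting, which is an assumption made here for illustration and not the model of Appendix A.1. Suppose $\beta_i \sim N(\mu, V)$ and each consumer contributes one noisy signal $b_i \sim N(\beta_i, s^2)$. With known $\mu$ and $V$, the posterior mean is $\mu + w(b_i - \mu)$ with $w = V/(V + s^2)$, so the variance across posterior means is $wV < V$ regardless of $N$. The sketch below checks this by simulation.

```python
import random
from statistics import pvariance

def variance_of_posterior_means(n, mu=-2.0, V=1.0, s2=1.0, seed=0):
    """Variance across individual posterior means in a normal-normal model."""
    rng = random.Random(seed)
    w = V / (V + s2)                              # shrinkage weight on the individual signal
    post_means = []
    for _ in range(n):
        beta_i = rng.gauss(mu, V ** 0.5)          # true individual coefficient
        b_i = rng.gauss(beta_i, s2 ** 0.5)        # noisy individual level estimate
        post_means.append(mu + w * (b_i - mu))    # posterior mean, shrunk towards mu
    return pvariance(post_means)

# The true heterogeneity variance is V = 1.0; the posterior means understate it
# by the factor w = 0.5, and a larger sample of consumers does not help.
for n in (1_000, 100_000):
    print(n, round(variance_of_posterior_means(n), 3))
```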
A third estimator of expected profits from action $a$ builds on the collection of individual level posterior distributions. We refer to this form of generalization as lower level model non smoothed (n.s.) because it relies on the lower, individual level models, but does not summarize individual level posteriors to a point estimate:
$$\hat\pi(a) \propto \frac{1}{N}\sum_{i=1}^{N} \int\!\!\int MS(a,\beta_h)\,\big(P(a) - C(a)\big)\,p(\beta_h|y_i,\tau)\,p(\tau|Y)\,d\beta_h\,d\tau \tag{4}$$
The difference between this estimator and that defined in Equation 2 is that $y_i$ is used both to inform the posterior $p(\tau|Y)$ and the prediction to new consumers' preferences in $p(\beta_h|y_i,\tau)$. When individual level posterior distributions essentially degenerate to a point because of highly informative individual level likelihoods, the estimator in Equation 4 converges to that defined in Equation 3.
When individual level posterior distributions come from diffuse individual level likelihoods, as usual, the estimator in Equation 4 will be very similar to that in Equation 2. Thus, parametric assumptions in the hierarchical prior distributions will be similarly influential. Consistent with these assessments, we only find negligible differences between generalizations based on the posterior of the hierarchical prior and lower level model n.s. in the empirical applications discussed below.
Which way of generalization should we use for market simulation in practice? Every trained Bayesian analyst will point out the inconsistency associated with relying on the collection of individual level posterior means. Such an analyst knows that posterior predictive preference distributions as defined in Equation 2 and Equation 4 allow for consistent inference (in $N$), albeit conditional on functional form assumptions.
However, because standard parametric and semi-parametric assumptions such as multivariate normal or its finite mixture generalization violate basic economic intuition in many applications, consistency conditional on these assumptions is not too helpful. Thus, many applied researchers and practitioners opt for generalizations, i.e., market simulation based on the collection of individual level posterior means (Equation 3) that often substantially reduce the share of sign and order violations. We aim to overcome the choice between relying on the posterior of a mis-specified hierarchical prior and the collection of individual level posterior means that fail to measure heterogeneity, by showing how to specify more economically faithful hierarchical prior distributions based on prior constraints. The goal is a hierarchical prior that both is maximally flexible regarding some aspects of the population distribution of preferences, and heavily constrained by theory regarding other aspects of this distribution.

Sign and order constraints
Sign and order constraints dogmatically express prior knowledge about the support of a distribution, e.g., that the price parameter in an indirect utility function is negative or that a consumer prefers a more fuel-efficient to a less fuel-efficient car for sure, everything else equal. So-called constrained parameter problems are relevant across academic fields, and a body of literature has dealt with this topic. Gelfand et al. (1992) provide an overview of how to impose sign and order constraints based on truncated distributions using Gibbs sampling. Allenby, Arora, and Ginter (1995) introduce this approach into marketing in the context of individual level conjoint analysis. Boatwright et al. (1999) develop a sampler in the spirit of Gelfand et al. (1992), but for a hierarchical sales response regression model. However, the implementation of sign and order restrictions in hierarchical Bayesian models is still without a generally accepted solution. In principle, one could adjust the sampler outlined by Gelfand et al. (1992) to hierarchical settings. However, as we show next, any truncation applied to the prior (and hence to the posterior) of individual level coefficients in a hierarchical setting leads to a so-called "doubly intractable" inference problem in the hierarchical prior. Doubly intractable problems are characterized by a normalization constant that depends on target parameters (e.g., Möller, Pettitt, Reeves, and Berthelsen, 2006; Murray, Ghahramani, and MacKay, 2006). Consider the following truncated normal hierarchical prior for consumers' demand parameters:
$$p(\beta_i|\bar\beta, V_\beta) = \frac{\phi(\beta_i|\bar\beta, V_\beta)\,\mathbb{1}(\beta_i \in \mathbb{R}^k_c)}{Z(\bar\beta, V_\beta)} \tag{5}$$
where $\mathbb{R}^k_c$ denotes the truncation region of a $k$-dimensional demand parameter vector $\beta$, $\phi$ denotes the multivariate normal density, and $Z(\bar\beta, V_\beta)$ the corresponding normalizing constant:
$$Z(\bar\beta, V_\beta) = \int_{\mathbb{R}^k_c} \phi(\beta|\bar\beta, V_\beta)\,d\beta \tag{6}$$
The conditional posterior distribution of parameters indexing the hierarchical prior then becomes:
$$p(\bar\beta, V_\beta|\{\beta_i\}) \propto \prod_{i=1}^{N} \frac{\phi(\beta_i|\bar\beta, V_\beta)}{Z(\bar\beta, V_\beta)}\; p(\bar\beta, V_\beta) \tag{7}$$
where $p(\bar\beta, V_\beta)$ denotes the subjective prior for hierarchical prior parameters.
Equation 7 is an example of a doubly intractable inference problem because even after dropping the normalization constant $d\{\bar\beta, V_\beta\}$ of the posterior giving rise to the proportionality, we are left with the intractable expression $Z(\bar\beta, V_\beta)$. This expression normalizes the multivariate normal density to the region of support defined by $\mathbb{R}^k_c$ and cannot be dropped because it depends on the target parameters $\bar\beta$ and $V_\beta$.² As a consequence of truncation, we lose the convenience of conditionally conjugate updates of hierarchical prior parameters $\bar\beta$ and $V_\beta$ regardless of what subjective prior distributions we employ.
More generally, all estimation and sampling techniques that require the evaluation of the conditional posterior $p(\bar\beta, V_\beta|\{\beta_i\})$, including standard Metropolis-Hastings sampling, are hamstrung by the intractability of $Z(\bar\beta, V_\beta)$.³ Boatwright et al. (1999) propose to numerically approximate $Z(\bar\beta, V_\beta)$ at each MCMC iteration using the GHK algorithm (Hajivassiliou, McFadden, and Ruud, 1996). While this seems reasonable in their application, which involves sign constraints on at most four parameters in a model with five parameters in total, numerical approximations will be problematic in the high-dimensional parameter spaces, potentially involving a multiplicity of constraints, that have become common in more recent applications.
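In one dimension the normalizing constant is available in closed form, which makes its dependence on the target parameters easy to see. The sketch below assumes a single coefficient truncated to the negative half-line, so that $Z(\bar\beta, V_\beta) = \Phi(-\bar\beta/\sqrt{V_\beta})$; the parameter values are hypothetical. It also shows what happens when $Z$ is simply ignored: the sample mean of the truncated draws, which is what an untruncated normal update would center on, lies well below the location parameter that generated the data.

```python
import random
from statistics import NormalDist, mean

def Z(beta_bar, v_beta):
    """Normal mass on the truncation region (-inf, 0) for a scalar coefficient."""
    return NormalDist(beta_bar, v_beta ** 0.5).cdf(0.0)

# Z changes with the hierarchical prior parameters, so it cannot be dropped.
for beta_bar in (-2.0, -0.5, 0.5):
    print(beta_bar, round(Z(beta_bar, 1.0), 4))

# Draws from the truncated normal with location -0.5 and scale 1 (rejection sampling).
rng = random.Random(0)
true_location, true_var = -0.5, 1.0
draws = []
while len(draws) < 50_000:
    b = rng.gauss(true_location, true_var ** 0.5)
    if b < 0.0:
        draws.append(b)

# An update that ignores Z treats the draws as untruncated and centers on their mean.
naive_location = mean(draws)
print("location parameter:", true_location, "naive estimate:", round(naive_location, 3))
```

In higher dimensions with several sign and order constraints no such closed form exists, which is where the difficulty described above arises.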
The log-normal hierarchical prior avoids this difficulty. The basic idea of using log-normal distributions to implement sign and order constraints is not new. For example, Allenby et al. (2014) use the exponential transformation $\beta_p = -\exp(\beta_p^*)$, with $\beta_p^* \in \mathbb{R}$ distributed according to a hierarchical normal mixture prior, to enforce that the model has zero support for positive price coefficients. In this specification, the problem is that $\beta_p^*$ is measured on the log scale and standard diffuse subjective prior settings imply absurdly large and small values of the transformed coefficients $\beta_p$ (e.g., Allenby et al., 2014). Thus, the problem is how to specify differentially informative subjective priors for constrained coefficients and unconstrained coefficients. The standard Normal-Inverse-Wishart (NIW) subjective prior for means and covariance matrices in the hierarchical prior distribution is limited in this regard, mostly because the prior concentration of the IW-prior is controlled by a single parameter (the prior degrees of freedom, also known as the prior shape).
Next, we present a solution to this problem that re-parameterizes the hierarchical prior. Our contributions in this context are, first, a marginal-conditional decomposition of the hierarchical prior distribution that enables the analyst to be differentially informative about the distribution of constrained and unconstrained parameters in the population a priori,⁵ and second, the generalization of the pre-tuning of proposal densities in Rossi et al. (2005) to this hierarchical prior.
The proposed marginal-conditional decomposition becomes essential whenever the hierarchical prior comprises both constrained and unconstrained parameters, for example in simple hierarchical choice models that feature brand coefficients and a price coefficient. The proposed generalization of pre-tuned proposal densities (Rossi et al., 2005) is particularly important in high-dimensional models that feature a multiplicity of constraints.
2 Note that without truncation, i.e., when $\mathbb{R}^k_c = \mathbb{R}^k$, $Z(\bar\beta, V_\beta) = 1$ for all regular $\bar\beta$ and $V_\beta$.
3 Some researchers simply ignore $Z(\bar\beta, V_\beta)$ in the update of upper level parameters, i.e., use standard updates based on $p(\bar\beta, V_\beta|\{\beta_i\}) \propto \prod_{i=1}^{N} \phi(\beta_i|\bar\beta, V_\beta)\,p(\bar\beta, V_\beta)$. This "solution" results in an incoherent model in the sense that data generating parameters may not be recovered, even from infinitely large samples.
5 See McCulloch, Polson, and Rossi (2000) for another example of specifying flexible priors for covariance matrices.

Marginal-conditional decomposition
Our hierarchical prior starts with a standard normal distribution:
$$\beta_i^* \sim N(\bar\beta^*, V_{\beta^*}) \tag{8}$$
Unconstrained coefficients have a normal hierarchical prior, while sign and order constraints are imposed through exponential transformations of normal variates, resulting in log-normally distributed coefficients. Vice versa, we can log-transform from sign and order constrained parameters that enter the likelihood to unconstrained, a priori conditionally normally distributed variates. We formulate subjective priors over this unconstrained space but use a marginal-conditional decomposition to implement vastly different subjective priors for parameters that are exponentiated and those that are not.
We denote $g: \mathbb{R}^k \to \mathbb{R}^k_c$ as the function that maps normally distributed variates $\beta_i^*$ to sign and order constrained coefficients $\beta_i$ that enter multinomial likelihoods explaining individual choice data $y_i$. We distinguish $k_c$ "constrained" coefficients $\beta_i^{*c}$, i.e., coefficients to be transformed to obey sign and order constraints, and $k_{uc}$ unconstrained coefficients $\beta_i^{*uc}$ in the hierarchical prior.
With the goal of formulating rather different subjective priors for the parameters governing the distribution of $\beta_i^{*c}$ and $\beta_i^{*uc}$, we re-express the multivariate normal distribution in Equation 8 in the form of a multivariate regression model that regresses unconstrained coefficients $\beta_i^{*uc}$ on "constrained" coefficients $\beta_i^{*c}$:
$$B^{*uc} = \iota z' + B^{*c}\,\Gamma + U, \qquad u_i \sim N(0, \Sigma) \tag{9}$$
Here, $B^{*uc}$ and $B^{*c}$ are matrices with $k_{uc}$ and $k_c$ columns, respectively, and $N$ rows each, collecting unconstrained and "constrained" coefficients from individuals in the sample, $U$ is the matrix of regression errors with rows $u_i'$, and $\iota$ is a $(N \times 1)$-vector of 1's; $\Gamma$ is a $(k_c \times k_{uc})$ matrix of regression coefficients, $z$ a column vector of intercept coefficients of length $k_{uc}$, and $\Sigma$ is the $(k_{uc} \times k_{uc})$ conditional variance-covariance of unconstrained coefficients in the population.
The first two moments of the distribution of "constrained" coefficients are obtained from yet another multivariate regression model that regresses "constrained" coefficients on a vector of constants:
$$B^{*c} = \iota\,\mu^{*c\prime} + E, \qquad e_i \sim N(0, V^*) \tag{10}$$
Here, $\iota$ is again a $(N \times 1)$-vector of 1's, $E$ is the matrix of errors with rows $e_i'$, and $V^*$ is the marginal variance-covariance matrix of constrained coefficients. The multivariate regression models in Equations 9 and 10 imply the following re-parameterization of the joint distribution of $\beta_i^*$ from Equation 8:
$$\beta_i^* = \begin{pmatrix} \beta_i^{*c} \\ \beta_i^{*uc} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu^{*c} \\ z + \Gamma'\mu^{*c} \end{pmatrix},\ \begin{pmatrix} V^* & V^*\Gamma \\ \Gamma' V^* & \Sigma + \Gamma' V^* \Gamma \end{pmatrix} \right) \tag{11}$$
The advantage of the re-parameterization in Equation 11 relative to the more standard parameterization in Equation 8 is that we can now specify arbitrarily informative subjective priors for the hierarchical prior distribution of "constrained" coefficients, i.e., for the parameters $\mu^{*c}$ and $V^*$, without restricting the prior of unconstrained coefficients. That is, if we a priori set $V^*$ to a "small" covariance matrix, we can nevertheless elect to be minimally informative about the distribution of unconstrained parameters through $\Sigma$. Coupled with weakly informative priors for $\Gamma$ and $z$, neither the correlation between "constrained" and unconstrained nor the marginal mean of unconstrained coefficients is directly affected by informative prior specifications for $\mu^{*c}$ and $V^*$.
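The joint moments implied by the two regressions can be checked by simulation. The sketch below assumes the scalar case $k_c = k_{uc} = 1$ with hypothetical parameter values: it draws a "constrained" variate from its marginal normal, draws the unconstrained variate from the regression on it, and compares empirical moments with the implied mean $z + \Gamma\mu^{*c}$, variance $\Sigma + \Gamma V^* \Gamma$, and covariance $V^*\Gamma$.

```python
import random
from statistics import mean, pvariance

mu_c, V_star = -1.0, 0.25          # marginal mean and variance of the "constrained" variate
z, Gamma, Sigma = 0.5, 2.0, 1.0    # intercept, slope, conditional variance (all hypothetical)

rng = random.Random(0)
pairs = []
for _ in range(200_000):
    b_c = rng.gauss(mu_c, V_star ** 0.5)               # marginal model
    b_uc = rng.gauss(z + Gamma * b_c, Sigma ** 0.5)    # conditional model
    pairs.append((b_c, b_uc))

m_c = mean(p[0] for p in pairs)
m_uc = mean(p[1] for p in pairs)
emp_var_uc = pvariance([p[1] for p in pairs])
emp_cov = mean((c - m_c) * (u - m_uc) for c, u in pairs)

implied_mean_uc = z + Gamma * mu_c
implied_var_uc = Sigma + Gamma * V_star * Gamma
implied_cov = V_star * Gamma

print("mean uc:", round(m_uc, 3), "implied:", implied_mean_uc)
print("var uc: ", round(emp_var_uc, 3), "implied:", implied_var_uc)
print("cov:    ", round(emp_cov, 3), "implied:", implied_cov)
```

Shrinking `V_star` towards zero in this sketch leaves the conditional variance `Sigma` untouched, which is the separation between informative and weakly informative components that the decomposition is designed to deliver.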
However, the role of the prior on Γ in the implied prior for the covariance of unconstrained coefficients (see the lower right block of the covariance matrix in Equation 11) requires additional discussion. A priori, an increasing number of constrained coefficients coupled with a diffuse prior on Γ implies a marginal prior for the variance of unconstrained coefficients that may appear as favoring larger variances. In this context, it is important to keep in mind that the variance contribution through Γ is through the covariance between "constrained" and unconstrained coefficients (see the upper right and lower left block of the covariance matrix in Equation 11). Thus, the prior implication of large marginal variances of unconstrained coefficients stems from "mixing" over strong and qualitatively different (positive or negative) dependencies between constrained and unconstrained coefficients. However, strong dependence between "constrained" and unconstrained coefficients constitutes an extremely informative hierarchical prior. Hence, "mixing" over strong and qualitatively different (positive or negative) dependencies between constrained and unconstrained coefficients is not a possibility a posteriori, not even in small data sets. For example, even smallish data sets will enforce a choice between the two highly informative opposites of strong positive and strong negative dependence between a constrained and an unconstrained coefficient. In sum, large variances of unconstrained coefficients through Γ a posteriori result from strong dependence between "constrained" and unconstrained coefficients as per the likelihood.
Before going into more detail about suggested subjective choices, we illustrate the problem of formulating sensible priors for constrained coefficients in the smallest possible example where
$$\beta_i = -\exp(\beta_i^*), \qquad \beta_i^* \sim N(\bar\beta^*, V_{\beta^*}) \tag{12}$$
Here, the subjective prior is on parameters $\bar\beta^*$ and $V_{\beta^*}$ in the normal distribution that generates $\beta_i^*$. Under what is widely considered a weakly informative subjective prior setting⁷ for $\bar\beta^*$ and $V_{\beta^*}$, we obtain that a priori 25% of the constrained coefficients $\{\beta_i\}$ are larger than −.001, i.e., very close to zero, and another 25% are smaller than −1054 (see the right column in Table 1).
7 We follow Rossi et al. (2005) in the specification of a weakly informative subjective prior setting. Specifically, for a single parameter problem and based on the parameterization in Equation 12: $A_{\mu^{*c}} = 0.01$, $\nu_{V^*} = 6$, as well as $\bar V^* = \nu_{V^*}$.
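The order of magnitude of these quartiles can be reproduced with a few lines of simulation. The sketch below is an approximation under stated assumptions rather than a replication of Table 1: it takes a prior mean of zero for $\bar\beta^*$ with prior precision 0.01, draws the scalar variance as the prior scale 6 divided by a chi-square variate with 6 degrees of freedom, and treats mean and variance as a priori independent.

```python
import math
import random
from statistics import quantiles

rng = random.Random(0)
A, nu, V_bar = 0.01, 6, 6.0      # prior precision, degrees of freedom, and scale

beta = []
for _ in range(200_000):
    beta_bar = rng.gauss(0.0, (1.0 / A) ** 0.5)         # prior mean of 0 is an assumption
    V = V_bar / rng.gammavariate(nu / 2.0, 2.0)         # scalar inverse-Wishart draw
    beta_star = rng.gauss(beta_bar, V ** 0.5)           # log-scale coefficient
    beta.append(-math.exp(beta_star))                   # sign-constrained coefficient

lower_q, median, upper_q = quantiles(beta, n=4)
print("lower quartile:", round(lower_q, 1))
print("median:        ", round(median, 3))
print("upper quartile:", round(upper_q, 5))
```

A prior standard deviation of about 10 on the log scale is what pushes half of the prior mass to coefficients that are either practically zero or several hundred in absolute value.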
This concentration of mass in the tails of the prior is undesirable and counter to what one would expect from a weakly informative prior for $\beta_i$. The prior for $\beta_i$ in the column on the left in Table 1 has lower (upper) quartiles of −8.977 (−.113) and appears to be much more reasonable for, say, the population distribution of price coefficients in a heterogeneous multinomial logit model. However, this marginal prior distribution requires subjective priors for $\bar\beta^*$ and $V_{\beta^*}$, discussed next, that in most applications would be considered unduly informative as a prior for unconstrained coefficients where $\beta_i = \beta_i^*$. We use the fully conjugate prior for $(\Gamma_z, \Sigma)$, where $\Gamma_z := (z, \Gamma')'$, and the conditionally conjugate prior for $(\mu^{*c}, V^*)$. The conditionally conjugate prior for $(\mu^{*c}, V^*)$ enables the researcher to directly express prior beliefs about the distribution of "constrained" coefficients in the population. We set $\bar\mu^{*}$ [...] (Allenby et al., 2014). Especially the choice of prior degrees of freedom $\nu_{V^*}$, i.e., the shape parameter in the IW prior for $V^*$, would be considered unduly informative as a default value in the context of only unconstrained parameters. However, our marginal-conditional decomposition of the hierarchical prior enables the analyst to be arbitrarily informative about the hierarchical prior for "constrained" coefficients, essentially without affecting the marginal hierarchical prior for unconstrained coefficients.
The fully conjugate prior for $(\Gamma_z, \Sigma)$ adjusts the influence of the subjective prior on $\Gamma_z$ as a function of the conditional variance-covariance $\Sigma$, which is desirable in situations without much prior knowledge. We use standard weakly informative, "barely proper" priors for parameters in the conditional hierarchical prior of unconstrained coefficients, $\bar\gamma_z$, $A_{\Gamma_z}$, $\nu_\Sigma$, $\bar\Sigma$. Our marginal-conditional decomposition corresponds to the directed acyclic graph in Figure 1, which shows that the hierarchical prior for "constrained" coefficients, $(\mu^{*c}, V^*)$, and that of unconstrained coefficients, $(\Gamma_z, \Sigma)$, are independent conditional on draws of "constrained" coefficients, $B^{*c}$. This conditional independence relationship gives rise to a Gibbs-sampler for the two-stage update of parameters in the hierarchical prior. In step 1, we use a random walk Metropolis-Hastings (RW-MH) step to draw individual level parameters $\{\beta_i^*\}$ based on multinomial logit likelihoods, similar to Rossi et al. (2005). However, as described in detail in the following Section 4, we need to account for the change of variables in $g: \mathbb{R}^k \to \mathbb{R}^k_c$ when tuning the MH-proposal using information from the likelihood. In step 2, we use a Gibbs-sampler to update $\Gamma_z$ and $\Sigma$, i.e., parameters in a fully conjugate multivariate regression model, conditional on both "constrained" and unconstrained coefficients and subjective prior parameters (omitted for simplification). Step 3 employs another Gibbs-step to update $(\mu^{*c}, V^*)$, i.e., parameters in a conditionally conjugate multivariate regression model, conditional on "constrained" coefficients and subjective prior parameters. Appendix A.2 details the posterior distributions associated with steps two and three.
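For intuition about step 3, the following sketch implements the textbook conditionally conjugate update for a single "constrained" coefficient. It is not taken from Appendix A.2: it assumes a prior $\mu^{*c} \sim N(\bar\mu, A^{-1})$ that is independent of a scalar inverse-Wishart prior on $V^*$ with $\nu$ degrees of freedom and scale $\bar V$, and it conditions on synthetic values that stand in for draws of $B^{*c}$. The function name and all numerical settings are hypothetical.

```python
import random
from statistics import mean

def update_mu_V(b, V, mu_bar, A, nu, V_bar, rng):
    """One Gibbs sweep for (mu, V) given scalar values b_i ~ N(mu, V)."""
    n = len(b)
    # mu | V, b: normal with posterior precision A + n / V.
    prec = A + n / V
    mu = rng.gauss((A * mu_bar + sum(b) / V) / prec, prec ** -0.5)
    # V | mu, b: scalar inverse-Wishart, i.e., updated scale over a chi-square draw.
    ss = sum((x - mu) ** 2 for x in b)
    V = (V_bar + ss) / rng.gammavariate((nu + n) / 2.0, 2.0)
    return mu, V

rng = random.Random(0)
# Synthetic stand-in for current draws of one "constrained" coefficient across consumers.
b_c = [rng.gauss(-1.0, 0.5) for _ in range(2_000)]

mu, V = 0.0, 1.0
trace = []
for it in range(600):
    mu, V = update_mu_V(b_c, V, mu_bar=0.0, A=0.01, nu=6, V_bar=6.0, rng=rng)
    if it >= 100:                       # discard burn-in
        trace.append((mu, V))

post_mu = mean(m for m, _ in trace)
post_V = mean(v for _, v in trace)
print("posterior mean of mu:", round(post_mu, 3), "of V:", round(post_V, 3))
```

Because the update conditions only on the "constrained" coefficients, making `nu` and `V_bar` more informative here does not enter the update of the regression parameters for the unconstrained coefficients.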

Efficient MH-sampling
Next we discuss efficient sampling of individual level part-worth coefficients $\{\beta_i^*\}$ based on pre-tuned proposal densities in a MH-sampler conditional on draws of hierarchical prior parameters (Rossi et al., 2005). Our algorithmic implementation is for a MNL model at the individual level, but the approach obviously generalizes to other likelihoods. The pre-tuning in Rossi et al. (2005) employs a normal approximation to the likelihood. The MNL-likelihood information about $\{\beta_i\}$ can be computed in closed form. However, our hierarchical prior is on the distribution of $\{\beta_i^*\}$; therefore, we need to account for the change-of-variables in $g: \mathbb{R}^k \to \mathbb{R}^k_c$.
Following Rossi et al. (2005), we specify the proposal density of the RW-MH sampler as follows, where $r \in \{1, \dots, R\}$ is the $r$-th iteration of the MCMC chain, $c$ denotes a fixed scaling factor, and $H_i^*$ is the Hessian information about $\beta_i^*$ in individual $i$'s data, evaluated at the maximum of the following fractional likelihood: This fractional likelihood is defined as a $w$-weighted combination of the individual specific likelihood and the pooled likelihood, as in Rossi et al. (2005). Here $J_g$ is the $k \times k$ Jacobian of the function $g(\beta_i^*)$ that maps conditionally normally distributed variates $\beta_i^*$ to their sign and order constrained counterparts $\beta_i$. $H_i$ and $J_g$ are evaluated at $\hat\beta_i$ and $g^{-1}(\hat\beta_i) = \hat\beta_i^*$, respectively, i.e., at the parameter value that maximizes the fractional likelihood in Equation 14.
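One way to see how the Jacobian enters is the chain rule at a maximizer: if the gradient with respect to $\beta$ vanishes at $\hat\beta$, then the second derivative with respect to $\beta^*$ equals $J_g' H J_g$. The sketch below checks this identity numerically for a single price coefficient with $\beta = -\exp(\beta^*)$ in a binary logit. It is an illustration under assumptions: the data are made up, the plain individual likelihood is used instead of a fractional likelihood, and the identity is the standard calculus result, not necessarily the exact expression used in the paper.

```python
import math

prices = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]   # hypothetical prices faced by one consumer
chosen = [1, 1, 1, 0, 1, 0]               # hypothetical buy (1) / no-buy (0) decisions
intercept = 2.0                            # brand utility, held fixed for simplicity

def prob(beta, x):
    return 1.0 / (1.0 + math.exp(-(intercept + beta * x)))

def loglik(beta):
    return sum(math.log(prob(beta, x) if y else 1.0 - prob(beta, x))
               for x, y in zip(prices, chosen))

def grad(beta):
    return sum((y - prob(beta, x)) * x for x, y in zip(prices, chosen))

def hess(beta):
    return -sum(prob(beta, x) * (1.0 - prob(beta, x)) * x * x for x in prices)

# Newton iterations for the maximizer on the constrained (beta) scale.
beta_hat = -1.0
for _ in range(50):
    beta_hat -= grad(beta_hat) / hess(beta_hat)

# Map to the unconstrained scale and apply the chain rule at the maximizer.
beta_star_hat = math.log(-beta_hat)
jac = -math.exp(beta_star_hat)             # d beta / d beta_star for beta = -exp(beta_star)
hess_star_chain = jac * hess(beta_hat) * jac

# Finite-difference second derivative of the log-likelihood in beta_star.
h = 1e-4
f = lambda s: loglik(-math.exp(s))
hess_star_fd = (f(beta_star_hat + h) - 2.0 * f(beta_star_hat) + f(beta_star_hat - h)) / h ** 2

print("chain rule:", hess_star_chain, "finite difference:", hess_star_fd)
```

Away from the maximizer an additional term involving the gradient and the second derivative of $g$ appears, which is why the point of evaluation matters.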
Appendix A.4 illustrates the value of the proposed tuning in the MH-update of β * i in a small simulation that only involves choices of one individual. We find that the proposed tuning results in a sampler that is on average about 3.7 times more efficient than one using a simpler and more standard tuning (see Table 15). We note that these differences can magnify substantially in a hierarchical setting.

Simulation study
Next we illustrate the benefits of our proposed marginal-conditional decomposition in the presence of sign and order constraints using simulations. First, we compare prior distributions in the prototypical setting that combines constrained and unconstrained coefficients. Second, we analyze the posterior from simulated data under different priors and elaborate on the numerical properties of the proposed methodology.

Drawing from prior distributions
Suppose a hypothetical setting with two attributes, A1 and A2, at two levels, L1 and L2, each, yielding four possible product configurations. Both levels of the first attribute provide positive utility to every consumer, and its second level is weakly preferred to the first, again by all consumers. To reflect these sign and order restrictions, we denote the respective coefficients as {β +,i } and {β ++,i }, where i = 1, . . . , N indexes simulated consumers. Preferences for the levels of the second attribute are heterogeneous but without a uniform prior direction or ordering, such as the preferences for colors or flavors in applications. We denote the respective coefficients as {β uc 1 ,i } and {β uc 2 ,i }. The price coefficient, β p,i , is negative. We thus have the following set of constraints for every consumer i = 1, . . . , N :

$$\beta_{+,i} > 0, \qquad \beta_{++,i} \geq \beta_{+,i}, \qquad \beta_{uc_1,i},\ \beta_{uc_2,i} \in \mathbb{R}, \qquad \beta_{p,i} < 0. \tag{16}$$

First, we compare (implied) marginal priors for coefficients β = g(β * ) based on the marginal-conditional decomposition in Equation 11 and a more standard parameterization (Equation 8) coupled with the more informative subjective prior settings suggested in Allenby et al. (2014). Allenby et al. (2014) propose to adjust the standard weakly informative prior settings to k + 15 (from k + 5) prior degrees of freedom for the IW-prior (where k denotes the dimension of individual demand parameters) in the standard one-component model, and to set the diagonal elements in the prior scale matrix to 0.5 for constrained coefficients and to 1 for unconstrained coefficients. In addition, the subjective prior information for β * is increased to A µ * = .1 (from .01).
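The implied marginal priors under the standard parameterization can be explored by simulation. The sketch below is one possible reading of the settings described above and is not the code behind Figure 2. It assumes the coefficient order (+, ++, p, uc 1 , uc 2 ), a prior mean of zero for the mean of β * , the convention V ∼ IW(ν, ν · D) with D holding the stated diagonal elements, and the mean prior N(0, V / A µ * ). All of these conventions and all names are assumptions.

```python
import numpy as np

def draw_inv_wishart(nu, S, rng):
    """Draw V ~ IW(nu, S) by inverting a Wishart(nu, S^{-1}) draw (integer nu)."""
    L = np.linalg.cholesky(np.linalg.inv(S))
    Z = rng.standard_normal((nu, S.shape[0])) @ L.T
    V = np.linalg.inv(Z.T @ Z)
    return 0.5 * (V + V.T)

def g(b):
    """Sign and order constraints for columns (+, ++, p, uc1, uc2)."""
    out = b.copy()
    out[..., 0] = np.exp(b[..., 0])                  # beta_+  > 0
    out[..., 1] = out[..., 0] + np.exp(b[..., 1])    # beta_++ >= beta_+
    out[..., 2] = -np.exp(b[..., 2])                 # beta_p  < 0
    return out                                       # uc1, uc2 stay untransformed

def draw_implied_prior(n_draws, rng, nu=20, diag=(0.5, 0.5, 0.5, 1.0, 1.0), a_mu=0.1):
    """Draws of beta = g(beta*) from the hierarchical prior:
    V ~ IW(nu, nu * diag), mu | V ~ N(0, V / a_mu), beta* | mu, V ~ N(mu, V).
    nu = 20 corresponds to k + 15 with k = 5."""
    k = len(diag)
    S = nu * np.diag(diag)
    draws = np.empty((n_draws, k))
    for r in range(n_draws):
        C = np.linalg.cholesky(draw_inv_wishart(nu, S, rng))
        mu = C @ rng.standard_normal(k) / np.sqrt(a_mu)
        draws[r] = mu + C @ rng.standard_normal(k)
    return g(draws)
```

Kernel density estimates of the columns of `draw_implied_prior(10000, np.random.default_rng(1))` could then be compared across prior settings, for instance by changing `nu`, `diag`, or `a_mu`.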
However, as described before, the problem with the standard parameterization is that these more informative subjective settings now apply both to constrained, i.e., to be transformed, and to unconstrained coefficients. While these settings yield much more sensible priors for constrained coefficients, they may be unduly informative for unconstrained coefficients. Indeed, the standard parameterization coupled with the more informative settings discussed above implies a much more informative marginal prior for unconstrained coefficients than usual. At first sight, the comparison in the right panel of Figure 2 seems to suggest that the standard parameterization coupled with the more informative settings from above simply implies less heterogeneity in β uc 1 a priori. However, it is important to realize that the increase in prior degrees of freedom in the IW prior will similarly fail to accommodate markets that are much more homogeneous than what is implied by the prior settings. In fact, it is the joint possibility of extremely homogeneous and extremely heterogeneous markets under our suggested prior that causes the pronounced peak at zero together with the fat, sub-exponential tails in the right panel of Figure 2.

Population distribution and data generation
We generate heterogeneous consumer preferences obeying the sign and order constraints in Equation 16 using a transformation of normally distributed coefficients β * with mean vector (0.5, −0.5, 0.8, 2.5, 2.5)′ and a variance-covariance matrix with the following properties. Preferences for the two levels of A2 have the same expected value, but are more heterogeneous for the second level. Preferences for the first and second level of A1 correlate positively. Furthermore, consumers who prefer the second level of A1 are less price sensitive on average, Cov(β * ++ , β * p ) = −0.15. Similarly, consumers who prefer the first level of A2 are less price sensitive, while preferences for the second level correlate positively with the absolute value of the price coefficient.
We generate a sample of N = 1000 consumers with preferences {β i } from this population distribution as input to generating discrete choice data Y . Each choice is from the full set of product alternatives at different, randomly drawn prices from a uniform distribution with support in [0.5, 3], plus an outside good. Consequently, there are p = 5 alternatives in each choice set. We fix the amount of individual level information at T = 4. Recall that many discrete choice studies in marketing barely reach one choice task per parameter to estimate at the individual level. The sparse individual level data scenario assumed in this simulation is therefore representative of applications in practice.
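The data generation just described can be sketched as follows. The dummy coding in `DESIGN`, the coefficient order (+, ++, p, uc 1 , uc 2 ), and the diagonal covariance used in the usage lines are assumptions for illustration only; in particular, the covariance is a placeholder and not the data generating matrix of this study.

```python
import numpy as np

# Rows: the four inside goods; columns: dummies for (A1-L1, A1-L2, A2-L1, A2-L2).
DESIGN = np.array([[1, 0, 1, 0],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [0, 1, 0, 1]], dtype=float)

def simulate_choices(beta, T, rng, price_range=(0.5, 3.0)):
    """Simulate MNL choices among four inside goods and an outside good.
    beta: (N, 5) constrained coefficients, columns (+, ++, p, uc1, uc2).
    Returns prices of shape (N, T, 4) and choices of shape (N, T) coded 0..4,
    where 4 denotes the outside good with utility normalized to zero."""
    N = beta.shape[0]
    prices = rng.uniform(*price_range, size=(N, T, 4))
    part_worths = beta[:, [0, 1, 3, 4]] @ DESIGN.T            # (N, 4) product utilities
    v_inside = part_worths[:, None, :] + beta[:, 2][:, None, None] * prices
    v = np.concatenate([v_inside, np.zeros((N, T, 1))], axis=2)
    eps = rng.gumbel(size=v.shape)                            # type I extreme value errors
    return prices, np.argmax(v + eps, axis=2)

# Usage with a placeholder population: mean vector from the text,
# hypothetical covariance 0.25 * I.
rng = np.random.default_rng(0)
mean_star = np.array([0.5, -0.5, 0.8, 2.5, 2.5])
beta_star = mean_star + 0.5 * rng.standard_normal((1000, 5))
beta = beta_star.copy()
beta[:, 0] = np.exp(beta_star[:, 0])
beta[:, 1] = beta[:, 0] + np.exp(beta_star[:, 1])
beta[:, 2] = -np.exp(beta_star[:, 2])
prices, choices = simulate_choices(beta, T=4, rng=rng)
```

With T = 4 choices per consumer, the individual level likelihood carries little information, so the hierarchical prior largely determines what can be learned about any single consumer.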
We remove the column pertaining to the first level of A1 from the design matrix for identification. 10 Table 3 shows the mapping between data generating and identified parameters that follows from deleting the first level of A1 from the design.

10 In principle, the MCMC sampler could navigate the unidentified model at the individual level based on a proper (hierarchical) prior. Non-identification implies that two different vectors of preferences β 1 and β 2 with β 1 ≠ β 2 can achieve the exact same likelihood maximum. In an unidentified model, the sampler then generates from the infinite number of different states of the same (high) likelihood for any individual i. However, this interferes with measuring preference heterogeneity in the population. Consider the case of two brands offered in a choice set without an outside option. Only the relative brand preference is likelihood-identified. Now consider two different individuals i and j having the exact same relative preferences. We could set β i = (β i1 − ε, β i2 − ε)′ as well as β j = (β j1 + ε, β j2 + ε)′ and create arbitrarily large preference heterogeneity for ε → ∞, while the likelihood of observed choices remains constant.

Estimates of heterogeneity
11 Note that the normal hierarchical prior for β id uc 1 and β id uc 2 used in estimation no longer exactly corresponds to the data generating heterogeneity distribution in this example. The data generating marginal distributions of β id uc 1 and β id uc 2 are sums of normally and log-normally distributed random variables, as per our identification constraint, and a mixture of normals prior may further improve generalizations to the population based on the (posterior of) the hierarchical prior.

Preferences for fresh hen's eggs
Our first empirical application analyzes Nielsen data on purchases of fresh hen's eggs by German households (see Kotschedoff and Pachali, 2020; henceforth KP). It illustrates the empirical relevance of the proposed marginal-conditional decomposition of the hierarchical prior. In Germany, eggs are differentiated in terms of animal welfare as summarized in Table 4.

The demand model in KP assumes that households have full information about the egg products offered by the ten retail chains included in the sample. Accordingly, household i's indirect utility from egg product g in chain l at period t is given by Equation 18, where g ∈ {Battery, Barn, Free-range, Organic} and l ∈ {1, . . . , 10}. The indicator variable, 1{·}, denotes whether egg label g has the package size six instead of ten eggs. The price is given by p glt and the mean utility of the outside option is normalized to zero, u iglt = 0. The error term ε iglt is assumed to follow a type I extreme value distribution, as is standard in the literature.
KP state that flexible estimation of the retail chain preference coefficients {ψ i,l } is particularly important in their demand specification, alleviating a potential bias from the full information assumption implicit in Equation 18: It is crucial that retail chain preference coefficients become very negative, potentially approaching negative infinity, for those chains a household never or very infrequently purchased eggs from. If a retail chain is estimated to be extremely unattractive to a consumer, the egg prices charged at this chain will not affect this consumer's egg purchasing decisions, independent of the consumer's actual price knowledge set. In addition, KP rely on the inferred information about {ψ i,l } when modeling competition among retail chains in a supply side model.
Here, we rely on the simplified demand framework in Equation 18 to illustrate the benefits of our marginal-conditional decomposition model as developed in Section 3. 13 The model is an example of the typical application featuring a mix of constrained and unconstrained coefficients in the context of a hierarchical model. While we cannot a priori constrain preferences for the retail chains and the battery egg taste coefficient, which measures preferences for battery eggs over the outside good, it seems meaningful and actually important to constrain the remaining parameters. This is because the amount of price variation across quality tiers in this data vastly exceeds the amount of temporal price variation within quality tiers. As a consequence, a household who is only observed to purchase the highest price alternative (organic eggs) could be rationalized as exhibiting positive preferences for high prices in a model without economically motivated constraints. Similarly, an unconstrained model could misleadingly rationalize the choice pattern of a household who only purchased the lowest price alternative (battery eggs) based on higher (direct utility) preferences for battery eggs than for qualitatively superior alternatives.
12 Furthermore, they only consider purchases at the top ten retail chains and define boiled and painted eggs as well as eggs from other types of poultry, e.g., quails and geese, as the outside good.
13 KP in addition control for seasonality and regime changes. However, these controls are irrelevant for the purpose of the illustration here.
We thus employ the constraints summarized in Table 5. Preferences for the four different egg labels should satisfy the quality ordering implied by Table 4 to identify the price coefficient. Everything else equal, for example, a household should not be worse off consuming an organic egg instead of a battery egg. Furthermore, the coefficient for the smaller package size and the price coefficient are constrained to be negative.

The results in Figure 4 confirm the finding from the simulation study in Section 5: By imposing an informative prior on all coefficients (that is really needed for the constrained coefficients only), the standard formulation results in the dashed-dotted densities in green, which underestimate heterogeneity in these unconstrained coefficients. This is particularly apparent in the right panel of Figure 4, where the marginal posterior from the standard parameterization of the hierarchical prior (see Equation 8), when coupled with the informative subjective priors needed to "discipline" the distribution of constrained coefficients, fails to accommodate extremely negative preferences for retail chain 5 in the left tail.

[Table 5: restricted attributes and their constraints]
14 We run both MCMC samplers for R = 120,000 iterations and keep every 40th draw. We then discard the first 2000 draws as burn-in and perform our analysis based on the remaining 1000 draws from the converged posterior distribution. We assess convergence by inspecting time-series plots of draws, both at the level of individual respondents and in the hierarchical prior.
15 We estimate individual retail chain preferences relative to a baseline chain for likelihood identification, i.e., for l ≠ 1, the identified coefficient ψ i,l − ψ i,1 measures household i's preference for the l-th retailer relative to the first as the baseline level.

Tablet PC preferences
Our second empirical application uses data from a commercial discrete-choice conjoint study investigating demand for tablet PCs ("tablets"). Here, we focus on the drawbacks of relying on individual level posterior means, ∫ β i p(β i |Y, y i ) dβ i , for market simulation (as defined in Section 2), and estimate implied losses in profits when relying on this method for decision-making. For estimation, we rely on the marginal-conditional decomposition of the hierarchical prior (see Section 3). We show how using posterior means translates into systematic overestimation of preferences for sign- and order-constrained attribute levels. Finally, we show empirically how relying on individual level posterior means reduces sign and order violations in the absence of a theoretically constrained hierarchical prior, arguably a major reason for the popularity of this approach in practice.

The original goal of this study was to help optimize brand A's product design given a fixed set of competitor offerings. As is typical of industry grade discrete-choice conjoint studies, the number of parameters at the individual level (36 coefficients after imposing identification constraints) by far exceeds the number of individual level observations. As a consequence, a hierarchical model is required, the hierarchical prior's specification becomes critically important, and, in the likely scenario of heterogeneous preferences, individual level posterior distributions will reflect large amounts of posterior uncertainty about a specific respondent's preferences.
In combination with the ordinal nature of many of the attributes in this study, a standard hierarchical prior specification leads to questionable results. For instance, Figure 11 and Table 17 in Appendix A.6 showcase that posterior predictive distributions from an unconstrained hierarchical prior specification coupled with weakly informative subjective priors (e.g., Rossi et al., 2005) clearly violate basic economic intuition. Inferred preferences for levels of cash back refer to the amount of money a customer receives after purchase upon submitting the sales receipt to the manufacturer. According to Table 17 (Appendix A.6), more than 25% of draws from the posterior of the hierarchical prior imply that consumers dislike tablets with larger amounts of cash back. Perhaps even more problematic, the posterior of the hierarchical prior suggests that consumers in the market prefer a tablet with 100€ cash back over the same tablet with 150€ cash back (as indicated by the stochastic dominance of 100€ cash back across all quantiles of the marginal posterior predictive distribution). In a market simulation, this could give rise to the odd outcome that tablets with smaller levels of cash back will be offered at higher prices, everything else equal. Finally, Table 18 (Appendix A.6) shows that the collection of individual level posterior means cuts the support for negative preferences for, e.g., 50€ cash back by about 50%. Recall that what may appear as a benefit here is the consequence of measuring heterogeneity inconsistently. These observations call for a diligently constrained hierarchical prior distribution of heterogeneity in the population.
The majority of attributes and levels in Table 9 are such that one can expect every respondent to strictly prefer one level over another level, everything else equal. Table 10 collects all ordinal and sign constraints we thus impose in the hierarchical prior distribution, based on (direct) utility considerations. We constrain preferences for eleven out of the fourteen attributes. We do not impose constraints on brand, operating system, and display size. Although some brands may be preferred on average, it would be wrong to impose the average preference ordering on every respondent, and the same holds for operating systems. Display size may appear to be an ordinal attribute at first, but it is not once the inconvenience of larger displays in some usage situations, or when transporting the tablet, is taken into account. As a consequence, we face a mix of constrained and unconstrained coefficients that we argue is characteristic of most applications of hierarchical models, at least in marketing and economics.
We leverage the marginal-conditional decomposition of the hierarchical prior distribution developed in Section 3 to specify suitable subjective priors. We run the MCMC sampler using the tuned random walk proposal from Section 4 for R = 500,000 iterations and keep every 50th draw. We then discard the first 8000 draws as burn-in and perform our analysis based on the remaining 2000 draws from the converged posterior distribution. We assess convergence by inspecting time-series plots of draws, both at the level of individual respondents and in the hierarchical prior. Here, we only report results for a model with a fully parametric, one-component hierarchical prior. 16

[Table 10: restricted attributes and their constraints]

The coefficient for no cash back is normalized to zero for identification, and individual preferences for 50€, 100€, and 150€ cash back are obtained as β CB 50 ,i = exp(β * CB 50 ,i ), β CB 100 ,i = β CB 50 ,i + exp(β * CB 100 ,i ), and β CB 150 ,i = β CB 100 ,i + exp(β * CB 150 ,i ), respectively. This way, the coefficient measuring the preference for 50€ relative to no cash back is constrained to be positive, and coefficients associated with more cash back are constrained to be weakly larger than those associated with less cash back.
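The cash back transformation above amounts to a cumulative sum of exponentiated coefficients. A minimal sketch, with the function name `cash_back_part_worths` chosen for illustration:

```python
import numpy as np

def cash_back_part_worths(beta_star):
    """Map unconstrained coefficients for 50, 100, and 150 euro cash back
    (last axis of length 3) to ordered part worths. The returned last axis has
    length 4: no cash back (normalized to zero), 50, 100, and 150 euro."""
    increments = np.exp(beta_star)                    # strictly positive increments
    constrained = np.cumsum(increments, axis=-1)      # each level adds to the previous one
    zero = np.zeros(constrained.shape[:-1] + (1,))
    return np.concatenate([zero, constrained], axis=-1)

# Example: all unconstrained coefficients equal to zero give unit increments.
example = cash_back_part_worths(np.array([0.0, 0.0, 0.0]))   # [0., 1., 2., 3.]
```

Because every increment is an exponential, the ordering holds for each individual draw, not only on average in the population.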

Predictive Performance and losses in profits
Next we illustrate the implications of these biases for predictive performance. We use the holdout log-likelihood as our measure of predictive performance.
We evaluate the predictive performance of the population preference distributions inferred from the collection of individual level posterior means and the posterior of the hierarchical prior using five-fold cross-validation. K-fold cross-validation is a common approach to compare the predictive performance of different models for model choice (see e.g., Bishop, 2006). We split the complete set of N = 1046 choice vectors randomly into five disjoint subsets of approximately the same size. Y k train and Y k hold denote the k-th training and holdout sample, containing the data from about 800 (4 folds) and 200 (1 fold) respondents, respectively. The cross-validation estimator for the holdout log-likelihood is defined as the average of the holdout log-likelihoods across the five disjoint holdout data sets (Bengio and Grandvalet, 2004), where HLL(A(Y k train ), y h ) denotes the predictive log-likelihood for holdout individual h in the k-th fold computed conditional on training data Y k train as input. The computations always use the same hierarchical Bayes model re-estimated using the respective training data, but summarized either using the collection of individual level posterior means or the posterior of the hierarchical prior.

Comparing predictions relying on the collection of individual level posterior means and the posterior of the hierarchical prior, the latter outperforms the former not only on average but also in every single fold. 18

Next we investigate the optimal product configuration for brand A. There are 460,800 product opportunities for brand A in this study. We assume that brand A a priori fixes the levels of some attributes in order to make this problem manageable in the context of varying cost scenarios. We assume that brand A only offers tablets with operating system A, 8-inch display, no SD slot, a 32GB memory card, no smartphone synchronization, and 50€ cash back. These assumptions reduce the action space to 360 unique product possibilities.
For a market scenario, we assume that brands C, D, and G are already in the market in fixed configurations.

To more generally capture differences between optimal actions implied by the different approaches of generalizing to the market, we specify a grid of possible costs. This grid comprises 20 different cost settings and is constructed as follows. First, costs are assumed to be the same for the weakest level of each attribute within each scenario. Within attributes, we assume that the cost difference between the baseline and (weakly) preferred levels is determined by a constant factor f for the levels of a priori ordered attributes, where L 1 is the least preferred level. We set f = 3 in this example and obtain 20 different scenarios by changing the cost of producing the least preferred levels {c L1 } of the ordinal attributes to be optimized.

Table 13 summarizes the distribution of product-specific costs across the 360 product opportunities for the first, fifth, tenth, fifteenth, and twentieth cost scenario. As can be seen, the grid includes both small and large absolute cost differences. In the first cost scenario, it is straightforward for brand A to offer a tablet combining the most attractive attribute levels, i.e., high resolution, 128GB, 2.2 GHz, 8-12 hours battery, WLAN + LTE (4G), and a value pack, from the attributes to be optimized. As cost differences between attribute levels increase, it becomes less and less profitable to offer this high quality combination of attributes, and we compute the expected loss caused by relying on a suboptimal form of generalization each time.
18 The five-fold cross-validation log-likelihoods using the unconstrained model are −2997 and −2894 based on posterior means and the posterior of the hierarchical prior, respectively. Constraining the hierarchical prior therefore improves the predictive performance of the model, regardless of how the model is translated into posterior predictions.

Table 14 summarizes the distribution of brand A's expected percentage losses incurred by relying on the collection of individual level posterior means relative to inferred actions based on the posterior of the hierarchical prior, a hp , across cost scenarios. We find that optimization results that rely on the collection of individual level posterior means to represent market preferences are clearly inferior, and the average percentage loss of 6.68% from using this method seems substantial.

Table 14: Percentage losses from using posterior means across cost scenarios relative to optimal actions from the posterior of the hierarchical prior.

                  Minimum   Mean    Maximum
Posterior Means   1.162     6.683   12.193

Discussion
Models of consumer heterogeneity play a pivotal role in marketing and economics. Typical applications are random coefficients or mixed logit models for aggregate or panel data and hierarchical Bayesian models. Historically, statistical efficiency or computational arguments motivate the choice of heterogeneity model (e.g., Lenk, DeSarbo, Green, and Young, 1996).
However, what can be learned about and subsequently extrapolated from the inferred heterogeneity distribution is limited by functional form assumptions such as the assumption of multivariate normally distributed preferences. For example, consistent estimates of the first and second moments and of correlations in the heterogeneity distribution, all of which can be accomplished based on a multivariate normal prior, will fail to translate into useful market simulators in the context of highly non-normal distributions, e.g., distributions that are highly asymmetric.
Various semi-parametric formulations have been advanced (e.g., Lenk and DeSarbo, 2000; Li and Ansari, 2014; Rossi, 2014).

The problem with standard, statistically motivated prior population distributions has been recognized in the academic literature early on (see the pioneering contribution by Boatwright et al., 1999), but no general solution has emerged. Recently, Allenby et al. (2014) introduced an informative subjective prior specification for log-normal hierarchical priors. These priors are easily implemented (compared to the truncated normal in Boatwright et al. (1999)), but require the analyst to depart from the standard weakly informative subjective prior settings in hierarchical models (e.g., Rossi et al., 2005).
In the common situation where the heterogeneity distribution comprises both constrained and unconstrained coefficients (e.g., brand and price coefficients), the choice of subjective prior parameters is an unresolved problem for which this paper proposes a solution.
The contribution of this paper is a marginal-conditional decomposition of the population distribution that allows researchers to be informative about constrained parameters, on a logarithmic scale, while retaining maximal flexibility regarding the (conditional) hierarchical prior of unconstrained coefficients.
The suggested specification is easily implemented and the additional computational effort is minimal.
Our specification becomes essential whenever the heterogeneity distribution comprises both constrained and unconstrained coefficients, such as in heterogeneous or mixed choice models that feature brand coefficients and a price coefficient. Finally, we show how to tune individual level proposal densities for numerically efficient MCMC inference in the presence of sign and order constraints.
This generalization of pre-tuned proposal densities (Rossi et al., 2005) is particularly important in high dimensional models that feature a multiplicity of constraints.
We thus overcome the choice between a mis-specified heterogeneity distribution and the common ad hoc use of the collection of individual level means that fail to measure heterogeneity consistently. The marginal-conditional decomposition developed in this paper facilitates the formulation of more economically faithful heterogeneity distributions based on prior constraints, broadening the applicability of hierarchically formulated choice and demand models in marketing and economics.
An aspect of the subjective prior for order constrained coefficients that we have not explored in this paper, but plan to investigate in future research, is that of prior scale differences and dependence between coefficients for an ordinally constrained attribute. It is easy to verify by simulation that prior scale differences and dependence can be used to express structured beliefs about heterogeneity in ordinal preferences. For example, the population could be heterogeneous in their valuation of a lower level of an ordinal attribute but relatively homogeneous in incremental preferences for the next higher level. Alternatively, the population could exhibit substantial heterogeneity in the incremental valuation of the next higher level. Finally, the amount of heterogeneity in the increment could be correlated with the valuation of the lower level, such that low, medium, or high valuations of the lower level co-occur with relatively more heterogeneity in the incremental valuation of the higher level.
Last but not least, it could be interesting to compare (a mixture of) multivariate truncated normal distributions to the log-normal prior formulation used in this paper. The recently proposed exchange algorithm can handle the "double-intractability" due to the intractable normalization of the truncated multivariate normal (Möller et al. (2006); Murray et al. (2006); see Kosyakova, Otter, Misra, and Neuerburg (forthcoming) for a recent adaptation of the exchange algorithm in marketing).

A.2 Posterior distributions: log-normal prior
We set B * c z := ι B * c in what follows. The posteriors associated with the priors in Equation 12 follow standard conjugate results (see e.g., Rossi et al., 2005).

19 Data generating part-worths are from a multivariate normal distribution, $\beta \sim N(\bar{\beta}, V_\beta)$, with mean $\bar{\beta} = (0, 0.1, 0.2, 0.3, 0.4)'$ and variance-covariance matrix $V_\beta = \mathrm{diag}(1, 1.5, 2, 2.5, 3)$ representing the population of consumers.
20 Densities are estimated using a hierarchical Bayesian MNL model over the identifiable parameters with standard weakly informative subjective prior settings as described in e.g., Rossi et al. (2005).

A.4 Illustrating the value of the proposed tuning
Our small illustration only involves choices by one individual, i.e., no unobserved heterogeneity. Inside goods are characterized by one five-level, ordinal attribute:

$$\beta^* = g^{-1}(\beta) = \big(\beta_1,\ \ln(\beta_2 - \beta_1),\ \ln(\beta_3 - \beta_2),\ \ln(\beta_4 - \beta_3),\ \ln(\beta_5 - \beta_4)\big)' = (-1,\ 0.2,\ 0.5,\ -0.1,\ -0.5)'$$

The individual chooses repeatedly (T = 20 and T = 1000) from choice sets that contain all five possible inside goods and an outside good with utility normalized to zero according to an MNL model. We compare the numerical performance of our tuned MCMC chain to a simpler, more standard tuning with $\beta_i^{*,\mathrm{cand}} \sim N(\beta_i^*, c^2 I)$. Our target quantities are numerical standard errors of posterior means, denoted numSE, from MCMC chains of length 1,000,000 initialized at data generating values. The numerical standard error approximates the variation in posterior means across different, independent runs of the MCMC of the same length, after convergence. The tuning parameter $c^2$ in the simpler, more standard proposal density is optimized targeting the average of numerical standard errors across the five parameters on the grid {0.01, 0.06, 0.11, . . . , 1.46}. In the proposed tuning, this parameter is set to its default value of $c^2 = 1$ (see Rossi et al., 2005).

A.6 Tablet PC preferences in an unconstrained model