A comparison of variational approximations for fast inference in mixed logit models

Abstract

Variational Bayesian methods aim to address some of the weaknesses (computation time, storage costs and convergence monitoring) of mainstream Markov chain Monte Carlo based inference, at the cost of a biased but more tractable approximation to the posterior distribution. We investigate the performance of variational approximations in the context of the mixed logit model, which is one of the most widely used models for discrete choice data. A typical treatment using the variational Bayesian methodology is hindered by the fact that the expectation of the so-called log-sum-exponential function has no explicit expression. Therefore, additional approximations are required to maintain tractability. In this paper we compare seven different bounds or approximations. We found that quadratic bounds are not sufficiently accurate. A recently proposed non-quadratic bound did perform well. We also found that the Taylor series approximation used in a previous study of variational Bayes for mixed logit models is only accurate for specific settings. Our proposed approximation based on quasi-Monte Carlo sampling performed consistently well across all simulation settings while remaining computationally tractable.

References

  1. Bishop CM (2006) Pattern recognition and machine learning. Springer, New York

  2. Blei DM, Lafferty JD (2007) A correlated topic model of science. Ann Appl Stat 1(1):17–35

  3. Böhning D (1992) Multinomial logistic regression algorithm. Ann Inst Stat Math 44(1):197–200

  4. Böhning D, Lindsay BG (1988) Monotonicity of quadratic-approximation algorithms. Ann Inst Stat Math 40(4):641–663

  5. Bouchard G (2007) Efficient bounds for the softmax function and applications to approximate inference in hybrid models. In: NIPS 2007 workshop on approximate inference in hybrid models, pp 1–9

  6. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge

  7. Braun M, McAuliffe J (2010) Variational inference for large-scale models of discrete choice. J Am Stat Assoc 105(489):324–335

  8. Consonni G, Marin J-M (2007) Mean-field variational approximate Bayesian inference for latent variable models. Comput Stat Data Anal 52(2):790–798

  9. Girolami M, Rogers S (2006) Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Comput 18(8):1790–1817

  10. Gupta M, Srivastava S (2010) Parametric Bayesian estimation of differential entropy and relative entropy. Entropy 12(4):818–843

  11. Hall P, Ormerod JT, Wand MP (2011) Theory of Gaussian variational approximation for a Poisson mixed model. Stat Sin 21:369–389

  12. Hickernell FJ, Hong HS, L’Écuyer P, Lemieux C (2000) Extensible lattice sequences for quasi-Monte Carlo quadrature. SIAM J Sci Comput 22(3):1117–1138

  13. Jaakkola TS, Jordan MI (2000) Bayesian parameter estimation via variational methods. Stat Comput 10:25–37

  14. Jebara T, Choromanska A (2012) Majorization for CRFs and latent likelihoods. Adv Neural Inf Process Syst 25:557–565

  15. Khan ME, Marlin BM, Bouchard G, Murphy K (2010) Variational bounds for mixed-data factor analysis. Adv Neural Inf Proc Syst 23:1–9

  16. Knowles DA, Minka T (2011) Non-conjugate variational message passing for multinomial and binary regression. Adv Neural Inf Process Syst 24:1701–1709

  17. Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86

  18. Lawrence ND, Milo M, Niranjan M, Rashbass P, Soullier S (2004) Reducing the variability in cDNA microarray image processing by Bayesian inference. Bioinformatics 20(4):518–526

  19. Levin DA, Peres Y, Wilmer EL (2009) Markov chains and mixing times. American Mathematical Society, Providence

  20. McFadden D (1974) Conditional logit analysis of qualitative choice behaviour. In: Zarembka P (ed) Frontiers of econometrics. Academic Press, New York, pp 105–142

  21. Ormerod JT, Wand MP (2010) Explaining variational approximations. Am Stat 64(2):140–153

  22. Ormerod JT, Wand MP (2012) Gaussian variational approximate inference for generalized linear mixed models. J Comput Graph Stat 21(1):2–17

  23. Parisi G (1988) Statistical field theory. Addison-Wesley, New York

  24. R Core Team (2014) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. URL http://www.R-project.org/

  25. Rossi PE, Allenby GM, McCulloch RE (2005) Bayesian statistics and marketing. Wiley, Hoboken

  26. Rossi P (2015) bayesm: Bayesian inference for marketing/micro-econometrics. http://CRAN.R-project.org/package=bayesm. R package version 3.0-2

  27. Rue H, Martino S, Chopin N (2009) Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J R Stat Soc Ser B Methodol 71:319–392

  28. Titterington DM (2011) The EM algorithm, variational approximations and expectation propagation for mixtures. In: Mengersen KL, Robert CP, Titterington DM (eds) Mixtures: estimation and applications. Wiley, Chichester, pp 1–30

  29. Train K (2009) Discrete choice methods with simulation, 2nd edn. Cambridge University Press, Cambridge

  30. Train K, Hudson K (2000) The impact of information on vehicle choices and the demand for electric vehicles in California. Technical report, Toyota and General Motors

  31. Train K, Weeks M (2005) Discrete choice models in preference space and willingness-to-pay space. In: Scarpa R, Alberini A (eds) Applications of simulation methods in environmental and resource economics. Springer, Dordrecht, pp 1–16

  32. Wang B, Titterington DM (2005) Inadequacy of interval estimates corresponding to variational Bayesian approximations. In: Cowell RG, Ghahramani Z (eds) Proceedings of the tenth international workshop on artificial intelligence and statistics, January 6–8, 2005, Savannah Hotel, Barbados, pp. 373–380. Society for Artificial Intelligence and Statistics

  33. Wang B, Titterington DM (2006) Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Anal 1(3):625–650

  34. Winn J, Bishop CM (2005) Variational message passing. J Mach Learn Res 6:661–694

  35. Yu J, Goos P, Vandebroek M (2010) Comparing different sampling schemes for approximating the integrals involved in the efficient design of stated choice experiments. Transp Res Part B Methodol 44(10):1268–1289

Acknowledgments

Nicolas Depraetere was funded by project G.0385.10N of the Flemish Research Foundation (FWO Flanders), Belgium.

Author information

Corresponding author

Correspondence to Nicolas Depraetere.

Appendices

Appendix 1: Variational Bayes for the mixed logit model

In this section the development of the variational lower bound in Eq. 8 is shown in more detail. In order to avoid confusion about different parameterizations, we define the following unnormalized forms for the multivariate normal and inverse-Wishart distributions:

$$\begin{aligned} q\left( \varvec{\zeta }; \varvec{\mu }_{\varvec{\zeta }}, \varvec{{\varSigma }}_{\varvec{\zeta }}\right)&\propto \left| \varvec{{\varSigma }}_{\varvec{\zeta }}\right| ^{-\frac{1}{2}} e^{-\frac{1}{2}\left( \varvec{\zeta } - \varvec{\mu }_{\varvec{\zeta }}\right) ^T\varvec{{\varSigma }}_{\varvec{\zeta }}^{-1}\left( \varvec{\zeta } - \varvec{\mu }_{\varvec{\zeta }}\right) }, \\ q\left( \varvec{{\varOmega }}; \varvec{{\varUpsilon }}^{-1}, \omega \right)&\propto \left| \varvec{{\varOmega }}\right| ^{-\frac{\omega + K + 1}{2}} e^{-\frac{1}{2}tr\left( \varvec{{\varUpsilon }}^{-1} \varvec{{\varOmega }}^{-1}\right) }. \end{aligned}$$

As the distributions of \(q\left( \varvec{\beta }_h\right) , h = 1,\ldots ,H\), have the same parametric form as the distribution of \(\varvec{\zeta }\), only details for the latter are shown. The log joint probability of the mixed logit model is, up to a constant:

$$\begin{aligned}&\sum _{h = 1}^H \sum _{t = 1}^{T} \left\{ \varvec{y}^T_{ht} \varvec{X}_{ht}\varvec{\beta }_h - \log \left[ \sum _{j=1}^J e^{\varvec{x}_{htj}^T\varvec{\beta }_h} \right] \right\} \nonumber \\&\quad + \sum _{h=1}^H \left\{ -\frac{1}{2}\log \left| \varvec{{\varOmega }}\right| -\frac{1}{2}\varvec{\beta }_h^T\varvec{{\varOmega }}^{-1}\varvec{\beta }_h -\frac{1}{2}\varvec{\zeta }^T \varvec{{\varOmega }}^{-1}\varvec{\zeta } + \varvec{\beta }_h^T \varvec{{\varOmega }}^{-1}\varvec{\zeta }\right\} \nonumber \\&\quad -\frac{1}{2}\varvec{\zeta }^T\varvec{{\varOmega }}_0^{-1}\varvec{\zeta } + \varvec{\zeta }^T\varvec{{\varOmega }}_0^{-1}\varvec{\beta }_0 -\frac{\nu + K + 1}{2} \log \left| \varvec{{\varOmega }}\right| -\frac{1}{2} tr\left( \varvec{S}^{-1}\varvec{{\varOmega }}^{-1}\right) + \text {Constant}. \end{aligned}$$
(12)

Hyperparameters from priors are set before estimation and are thus constants. Terms that only contain constants required for normalization of the normal and inverse-Wishart distributions are dropped here as they are irrelevant for maximization. Because the assumed approximate posterior distribution is factorized, we only require moments of the normal and the inverse-Wishart distribution to evaluate the expected value of Eq. 12, which are relatively simple to obtain. In the following sets of equations all the expectations are taken with respect to the approximate posterior distributions of the respective variables. The typical expectations and entropy involving the parameter \(\varvec{\zeta }\) are:

$$\begin{aligned} {\mathbb {E}}\left[ \varvec{\zeta }\right]&= \varvec{\mu }_{\varvec{\zeta }}, \\ {\mathbb {E}}\left[ \varvec{\zeta }^T\varvec{{\varOmega }}_0\varvec{\zeta }\right]&= {\mathbb {E}}\left[ tr\left( \varvec{{\varOmega }}_0\varvec{\zeta }\varvec{\zeta }^T \right) \right] = tr\left[ \varvec{{\varOmega }}_0 \left( \varvec{{\varSigma }}_{\varvec{\zeta }} + \varvec{\mu }_{\varvec{\zeta }}\varvec{\mu }_{\varvec{\zeta }}^T\right) \right] , \\ H\left[ \varvec{\zeta }\right]&= \frac{K}{2}\log \left( 2\pi e\right) + \frac{1}{2}\log \left| \varvec{{\varSigma }}_{\varvec{\zeta }}\right| = \frac{1}{2}\log \left| \varvec{{\varSigma }}_{\varvec{\zeta }}\right| + \text {Constant}. \end{aligned}$$
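As a quick numerical sanity check of these multivariate normal identities, the following minimal numpy sketch compares the closed-form expressions with Monte Carlo estimates; the dimensions and matrices are arbitrary illustrations.

```python
# Minimal sanity check of the normal-distribution identities above (illustrative inputs).
import numpy as np

rng = np.random.default_rng(0)
K = 3
mu = rng.normal(size=K)                          # variational mean of zeta
A = rng.normal(size=(K, K))
Sigma = A @ A.T + K * np.eye(K)                  # variational covariance of zeta
M = rng.normal(size=(K, K)); M = M @ M.T         # plays the role of Omega_0

draws = rng.multivariate_normal(mu, Sigma, size=200_000)

# E[zeta^T M zeta] = tr[M (Sigma + mu mu^T)]
mc = np.mean(np.einsum('ni,ij,nj->n', draws, M, draws))
exact = np.trace(M @ (Sigma + np.outer(mu, mu)))
print(mc, exact)                                 # agree up to Monte Carlo error

# Entropy of a K-variate normal: K/2 log(2 pi e) + 1/2 log|Sigma|
print(0.5 * K * np.log(2 * np.pi * np.e) + 0.5 * np.linalg.slogdet(Sigma)[1])
```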

The same type of expectations are required to evaluate the expectations for the \(\left( \varvec{\beta }_{1:H}\right) \) parameters. The expectations and entropy involving the parameter \(\varvec{{\varOmega }}\) are (Gupta and Srivastava 2010):

$$\begin{aligned} {\mathbb {E}}\left[ \varvec{{\varOmega }}^{-1}\right]&= \omega \varvec{{\varUpsilon }},\\ {\mathbb {E}}\left[ \log \left| \varvec{{\varOmega }}\right| \right]&= -\log \left| \varvec{{\varUpsilon }}\right| - \sum _{k=1}^K \psi \left( \frac{\omega + 1 - k}{2}\right) + \text {Constant}, \\ H\left[ \varvec{{\varOmega }}\right]&= \sum _{k=1}^K\log {\varGamma }\left( \frac{\omega +1-k}{2}\right) +\frac{\omega K}{2} - \frac{K+1}{2} \log \left| \varvec{{\varUpsilon }}\right| \\&\quad - \frac{\omega + K + 1}{2} \sum _{k=1}^K \psi \left( \frac{\omega +1-k}{2}\right) + \text {Constant}, \end{aligned}$$

where \(\psi \left( .\right) \) represents the digamma function, \(\psi \left( x\right) = \frac{d}{dx}\log {\varGamma }\left( x\right) \), and \({\varGamma }\left( x\right) \) represents the gamma function, \({\varGamma }\left( x\right) = \int _0^{\infty }t^{x-1}e^{-t}dt\). When these expectations are inserted into Eq. 12 we obtain the following expected joint log probability of the mixed logit model, again up to a constant:

$$\begin{aligned}&\sum _{h = 1}^H \sum _{t = 1}^{T} \left\{ \varvec{y}^T_{ht} \varvec{X}_{ht}\varvec{\mu }_h - {\mathbb {E}}_{q\left( \varvec{\beta }_h\right) }\left[ \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T\varvec{\beta }_h} \right) \right] \right\} \nonumber \\&\quad +\sum _{h=1}^H \left\{ \frac{1}{2}\log \left| \varvec{{\varUpsilon }}\right| +\frac{1}{2} \sum _{k=1}^K \psi \left( \frac{\omega + 1 - k}{2}\right) \right. \nonumber \\&\quad \left. -\frac{1}{2}tr\left[ \omega \varvec{{\varUpsilon }} \left( \varvec{{\varSigma }}_h + \varvec{{\varSigma }}_{\varvec{\zeta }} + \varvec{\mu }_h \varvec{\mu }_h^T + \varvec{\mu }_{\varvec{\zeta }} \varvec{\mu }_{\varvec{\zeta }}^T\right) \right] + \omega \varvec{\mu }_h^T \varvec{{\varUpsilon }}\varvec{\mu }_{\varvec{\zeta }} \right\} \nonumber \\&\quad -\frac{1}{2}tr\left[ \varvec{{\varOmega }}_0^{-1} \left( \varvec{{\varSigma }}_{\varvec{\zeta }} + \varvec{\mu }_{\varvec{\zeta }}\varvec{\mu }_{\varvec{\zeta }}^T\right) \right] + \varvec{\mu }^T_{\varvec{\zeta }} \varvec{{\varOmega }}_0^{-1}\varvec{\beta }_0 + \frac{\nu + K + 1}{2} \log \left| \varvec{{\varUpsilon }}\right| \nonumber \\&\quad + \frac{\nu + K + 1}{2} \sum _{k=1}^K \psi \left( \frac{\omega + 1 - k}{2}\right) \nonumber \\&\quad -\frac{\omega }{2}tr\left[ \varvec{S}^{-1} \varvec{{\varUpsilon }}\right] + \text {Constant}. \end{aligned}$$
(13)

The entropy of the variational posterior is, up to a constant:

$$\begin{aligned}&\sum _{h=1}^H \left\{ \frac{1}{2}\log \left| \varvec{{\varSigma }}_h\right| \right\} + \frac{1}{2}\log \left| \varvec{{\varSigma }}_{\varvec{\zeta }}\right| + \sum _{k=1}^K\log {\varGamma }\left( \frac{\omega +1-k}{2}\right) + \frac{\omega K}{2} \nonumber \\&\qquad - \frac{K+1}{2} \log \left| \varvec{{\varUpsilon }}\right| - \frac{\omega + K + 1}{2} \sum _{k=1}^K \psi \left( \frac{\omega +1-k}{2}\right) + \text {Constant}. \end{aligned}$$
(14)

We can now calculate derivatives of the lower bound, i.e. the sum of Eqs. 13 and 14, with respect to \(\varvec{\mu }_{\varvec{\zeta }}, \varvec{{\varSigma }}_{\varvec{\zeta }}, \omega \) and \(\varvec{{\varUpsilon }}\), and set them to \(\varvec{0}\):

$$\begin{aligned} \nabla _{\varvec{{\varSigma }}_{\varvec{\zeta }}}&= -\frac{1}{2} \left( H \omega \varvec{{\varUpsilon }} + \varvec{{\varOmega }}_0^{-1} - \varvec{{\varSigma }}^{-1}_{\varvec{\zeta }} \right) , \\ \nabla _{\varvec{\mu }_{\varvec{\zeta }}}&= - \left( H \omega \varvec{{\varUpsilon }} + \varvec{{\varOmega }}_0^{-1} \right) \varvec{\mu }_{\varvec{\zeta }} + \omega \varvec{{\varUpsilon }} \sum _{h=1}^H \varvec{\mu }_h + \varvec{{\varOmega }}_0^{-1} \varvec{\beta }_0, \\ \nabla _{\varvec{{\varUpsilon }}}&= \frac{\nu + H}{2} \varvec{{\varUpsilon }}^{-1} - \frac{\omega }{2} \left\{ \varvec{S}^{-1} + H \varvec{{\varSigma }}_{\varvec{\zeta }} + \sum _{h=1}^H \left[ \varvec{{\varSigma }}_h + \left( \varvec{\mu }_h - \varvec{\mu }_{\varvec{\zeta }}\right) \left( \varvec{\mu }_h - \varvec{\mu }_{\varvec{\zeta }} \right) ^T\right] \right\} , \\ \frac{\partial \left( 13 + 14\right) }{\partial \omega }&= \frac{K}{2} + \frac{H + \nu - \omega }{2} \sum _{k=1}^K \frac{\partial \psi \left( \frac{\omega + 1 - k}{2}\right) }{\partial \omega } \\&\qquad - \frac{1}{2} tr \left\{ \varvec{{\varUpsilon }} \left( \varvec{S}^{-1} + H \varvec{{\varSigma }}_{\varvec{\zeta }} + \sum _{h=1}^H \left[ \varvec{{\varSigma }}_h + \left( \varvec{\mu }_h - \varvec{\mu }_{\varvec{\zeta }}\right) \left( \varvec{\mu }_h - \varvec{\mu }_{\varvec{\zeta }} \right) ^T\right] \right) \right\} . \end{aligned}$$

Solving for the variational parameters we get the following closed form update equations:

$$\begin{aligned} \varvec{{\varSigma }}_{\varvec{\zeta }}&=\left( H \omega \varvec{{\varUpsilon }} + \varvec{{\varOmega }}_0^{-1}\right) ^{-1}, \\ \varvec{\mu }_{\varvec{\zeta }}&= \varvec{{\varSigma }}_{\varvec{\zeta }} \left( \omega \varvec{{\varUpsilon }} \sum _{h = 1}^H \varvec{\mu }_h + \varvec{{\varOmega }}_0^{-1} \varvec{\beta }_0 \right) , \\ \omega&= \nu + H , \\ \varvec{{\varUpsilon }}&= \left\{ \varvec{S}^{-1} + H \varvec{{\varSigma }}_{\varvec{\zeta }} + \sum _{h = 1}^H \left[ \varvec{{\varSigma }}_h + \left( \varvec{\mu }_h - \varvec{\mu }_{\varvec{\zeta }}\right) \left( \varvec{\mu }_h - \varvec{\mu }_{\varvec{\zeta }}\right) ^T \right] \right\} ^{-1}. \end{aligned}$$

The degrees of freedom parameter \(\omega \) of the approximate posterior of \(\varvec{{\varOmega }}\) is not data dependent and can be fixed at its optimal value. The only unspecified parts of the estimation algorithm are the updates with respect to \(\varvec{\mu }_h\) and \(\varvec{{\varSigma }}_h\), for all \(h=1, \ldots , H\).
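For concreteness, a minimal numpy sketch of these closed-form updates is given below. It assumes that the subject-level quantities \(\varvec{\mu }_h\) and \(\varvec{{\varSigma }}_h\) have already been produced by one of the bounds or approximations of Appendix 2; all variable names are illustrative.

```python
# Sketch of the closed-form updates for the population-level variational parameters.
# mu_h: (H, K) subject-level means; Sigma_h: (H, K, K) subject-level covariances.
import numpy as np

def update_zeta(mu_h, omega, Upsilon, beta0, Omega0_inv):
    H, K = mu_h.shape
    Sigma_zeta = np.linalg.inv(H * omega * Upsilon + Omega0_inv)
    mu_zeta = Sigma_zeta @ (omega * Upsilon @ mu_h.sum(axis=0) + Omega0_inv @ beta0)
    return mu_zeta, Sigma_zeta

def update_Omega(mu_h, Sigma_h, mu_zeta, Sigma_zeta, nu, S_inv):
    H, K = mu_h.shape
    omega = nu + H                                  # fixed at its optimal value
    dev = mu_h - mu_zeta                            # (H, K) deviations from mu_zeta
    scatter = Sigma_h.sum(axis=0) + dev.T @ dev     # sum_h [Sigma_h + dev_h dev_h^T]
    Upsilon = np.linalg.inv(S_inv + H * Sigma_zeta + scatter)
    return omega, Upsilon
```

In a full run these updates would be iterated with the subject-level step until the lower bound stops improving.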

Appendix 2: Derivation of bounds and approximations

Taylor series

Consider the second order Taylor series expansion of the LSE function \(f\left( \varvec{\beta }_h\right) = \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T \varvec{\beta }_h} \right) \) around the current mean \(\varvec{\mu }_h\), which results in

$$\begin{aligned} f\left( \varvec{\beta }_h\right)= & {} \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T \varvec{\beta }_h} \right) \approx \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T \varvec{\mu }_h} \right) + \left( \varvec{\beta }_h - \varvec{\mu }_h \right) ^T \nabla \left( \varvec{\mu }_h\right) \\&+ \frac{1}{2} \left( \varvec{\beta }_h - \varvec{\mu }_h\right) ^T H\left( \varvec{\mu }_h \right) \left( \varvec{\beta }_h - \varvec{\mu }_h \right) \end{aligned}$$

with gradient \(\nabla \left( \varvec{\mu }_h\right) = \varvec{X}_{ht}^T \varvec{p}_{ht}\), Hessian \(H\left( \varvec{\mu }_h \right) = \varvec{X}_{ht}^T \left[ \text {diag}\left( \varvec{p}_{ht}\right) - \varvec{p}_{ht}\varvec{p}_{ht}^T \right] \varvec{X}_{ht}\) and where \(\varvec{p}_{ht}\) is a J-dimensional vector with entries \(\frac{e^{\varvec{x}_{htj}^T\varvec{\mu }_h}}{\sum _{j^{'}=1}^J e^{\varvec{x}_{htj^{'}}^T\varvec{\mu }_h}}, \forall j = 1, \ldots , J\). Taking expectations with respect to \(\varvec{\beta }_h\) we obtain

$$\begin{aligned} {\mathbb {E}}_{q\left( \varvec{\beta }_h\right) }\left[ \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T\varvec{\beta }_h} \right) \right] \approx \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T\varvec{\mu }_h}\right) + \frac{1}{2} tr\left[ \varvec{{\varSigma }}_h H\left( \varvec{\mu }_h\right) \right] . \end{aligned}$$

If we insert this approximation into Eq. 13 and collect all terms that depend only on \(\varvec{\mu }_h\) and \(\varvec{{\varSigma }}_h\) from Eqs. 13 and 14, we obtain, for each \(h=1, \ldots , H\), the following maximization problem:

$$\begin{aligned}&\underset{\varvec{\mu }_h, \varvec{{\varSigma }}_h}{\arg \max } \sum _{t=1}^{T} \left\{ \varvec{y}_{ht}^T \varvec{X}_{ht} \varvec{\mu }_h - \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T \varvec{\mu }_h} \right) - \frac{1}{2} tr\left[ \varvec{{\varSigma }}_h H\left( \varvec{\mu }_h\right) \right] \right\} \\&\qquad - \frac{\omega }{2} tr \left[ \varvec{{\varUpsilon }}\varvec{{\varSigma }}_h \right] - \frac{\omega }{2} \varvec{\mu }_h^T \varvec{{\varUpsilon }}\varvec{\mu }_h + \omega \varvec{\mu }_h^T \varvec{{\varUpsilon }} \varvec{\mu }_{\varvec{\zeta }} + \frac{1}{2} \log \left| \varvec{{\varSigma }}_h \right| . \end{aligned}$$

This approach was denoted by BM and \(BM_D\), where the latter restricts \(\varvec{{\varSigma }}_h\) to a diagonal matrix.
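As an illustration, the sketch below evaluates this Taylor approximation of the expected LSE for a single choice set and compares it with a brute-force Monte Carlo estimate; the design matrix and dimensions are hypothetical.

```python
# Second-order Taylor (BM-style) approximation of E[log-sum-exp] vs. Monte Carlo.
import numpy as np
from scipy.special import logsumexp, softmax

def taylor_expected_lse(X, mu, Sigma):
    p = softmax(X @ mu)                                  # choice probabilities at mu
    H = X.T @ (np.diag(p) - np.outer(p, p)) @ X          # Hessian of the LSE at mu
    return logsumexp(X @ mu) + 0.5 * np.trace(Sigma @ H)

rng = np.random.default_rng(1)
J, K = 4, 3
X = rng.normal(size=(J, K))                              # hypothetical design matrix
mu = rng.normal(size=K)
L = rng.normal(size=(K, K)); Sigma = 0.1 * (L @ L.T + np.eye(K))

beta = rng.multivariate_normal(mu, Sigma, size=100_000)  # Monte Carlo reference
print(taylor_expected_lse(X, mu, Sigma), logsumexp(beta @ X.T, axis=1).mean())
```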

Quasi Monte Carlo

The maximization objective function for the QMC approach can be found by collecting all terms that depend only on \(\varvec{\mu }_h\) and \(\varvec{{\varSigma }}_h\) from Eqs. 13 and 14, and is given by, for each \(h=1, \ldots , H\):

$$\begin{aligned}&\underset{\varvec{\mu }_h, \varvec{{\varSigma }}_h}{\arg \max } \sum _{t=1}^{T} \left\{ \varvec{y}_{ht}^T \varvec{X}_{ht} \varvec{\mu }_h - \frac{1}{R} \sum _{r=1}^R \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T \varvec{\beta }_{hr}} \right) \right\} \\&\qquad - \frac{\omega }{2} tr \left[ \varvec{{\varUpsilon }}\varvec{{\varSigma }}_h \right] - \frac{\omega }{2} \varvec{\mu }_h^T \varvec{{\varUpsilon }}\varvec{\mu }_h + \omega \varvec{\mu }_h^T \varvec{{\varUpsilon }} \varvec{\mu }_{\varvec{\zeta }} + \frac{1}{2} \log \left| \varvec{{\varSigma }}_h \right| . \end{aligned}$$

The method to obtain the QMC sample \(\left( \varvec{\beta }_{h, 1:R}\right) \) is explained in Appendix 3. This approach was also considered with an unrestricted and with a diagonally restricted covariance matrix \(\varvec{{\varSigma }}_h\), denoted by QMC and \(QMC_D\) respectively.
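A minimal sketch of the sampled expectation appearing in this objective is given below; it draws \(\varvec{\beta }_{hr} = \varvec{\mu }_h + \varvec{L}\varvec{z}_r\) with \(\varvec{L}\) a Cholesky factor of \(\varvec{{\varSigma }}_h\). The standard normal points \(\varvec{z}_r\) would come from the lattice construction of Appendix 3; here they are simply passed in as an argument.

```python
# Sampling-based approximation of E[log-sum-exp] used in the (Q)MC objective.
import numpy as np
from scipy.special import logsumexp

def sampled_expected_lse(X, mu, Sigma, z):
    """X: (J, K) design matrix, z: (R, K) standard normal (Q)MC points."""
    L = np.linalg.cholesky(Sigma)
    beta = mu + z @ L.T                     # (R, K) draws from N(mu, Sigma)
    return logsumexp(beta @ X.T, axis=1).mean()
```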

Jensen’s inequality

As \(\log \left( .\right) \) is a concave function, one can apply Jensen’s inequality to obtain:

$$\begin{aligned} {\mathbb {E}}_{q\left( \varvec{\beta }_h\right) }\left[ \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T\varvec{\beta }_h} \right) \right]\le & {} \log \left( \sum _{j=1}^J {\mathbb {E}}_{q\left( \varvec{\beta }_h\right) }\left[ e^{\varvec{x}_{htj}^T\varvec{\beta }_h}\right] \right) \\= & {} \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T\varvec{\mu }_h + \frac{1}{2}\varvec{x}_{htj}^T \varvec{{\varSigma }}_h \varvec{x}_{htj}} \right) . \end{aligned}$$

The equality follows because the middle expectation is the moment generating function of a multivariate normal distribution. If we insert this bound into Eq. 13 and collect all terms that depend only on \(\varvec{\mu }_h\) and \(\varvec{{\varSigma }}_h\) from Eqs. 13 and 14, we obtain the following maximization problem for each \(h= 1, \ldots , H\):

$$\begin{aligned}&\underset{\varvec{\mu }_h, \varvec{{\varSigma }}_h}{\arg \max } \sum _{t=1}^{T}\left\{ \varvec{y}_{ht}^T \varvec{X}_{ht} \varvec{\mu }_h - \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T \varvec{\mu }_h + \frac{1}{2} \varvec{x}_{htj}^T \varvec{{\varSigma }}_h \varvec{x}_{htj}} \right) \right\} \\&\quad - \frac{\omega }{2} tr \left[ \varvec{{\varUpsilon }}\varvec{{\varSigma }}_h \right] - \frac{\omega }{2} \varvec{\mu }_h^T \varvec{{\varUpsilon }}\varvec{\mu }_h + \omega \varvec{\mu }_h^T \varvec{{\varUpsilon }} \varvec{\mu }_{\varvec{\zeta }} + \frac{1}{2} \log \left| \varvec{{\varSigma }}_h \right| . \end{aligned}$$

This approach was denoted by JI and \(JI_D\), where the latter restricts \(\varvec{{\varSigma }}_h\) to a diagonal matrix. To obtain the KM and \(KM_D\) methods, we need to introduce additional variational parameters. We start from the identity \(\log \sum _{j=1}^J e^{\varvec{x}_{htj}^T \varvec{\beta }_h} = \sum _{j=1}^J a_{htj} \varvec{x}_{htj}^T \varvec{\beta }_h + \log \sum _{j=1}^J e^{\left( \varvec{x}_{htj} - \sum _{j^{'}=1}^J a_{htj^{'}} \varvec{x}_{htj^{'}} \right) ^T\varvec{\beta }_h}\). Taking expectations and once again applying Jensen’s inequality leads to

$$\begin{aligned}&{\mathbb {E}}_{q\left( \varvec{\beta }_h\right) }\left[ \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T\varvec{\beta }_h} \right) \right] \le \sum _{j=1}^J a_{htj}\varvec{x}_{htj}^T\varvec{\mu }_h \\&\quad + \log \left( \sum _{j=1}^J e^{\left( \varvec{x}_{htj} - \sum _{j^{'}=1}^J a_{htj^{'}} \varvec{x}_{htj^{'}}\right) ^T \varvec{\mu }_h + \frac{1}{2} \left( \varvec{x}_{htj} - \sum _{j^{'}=1}^J a_{htj^{'}} \varvec{x}_{htj^{'}}\right) ^T \varvec{{\varSigma }}_h \left( \varvec{x}_{htj} - \sum _{j^{'}=1}^J a_{htj^{'}} \varvec{x}_{htj^{'}}\right) }\right) . \end{aligned}$$

Inserting this expression into Eq. 13, we obtain a maximization problem similar to the previous one. However, we have introduced extra variational parameters \(\varvec{a} = \left( a_{1:H, 1:T, 1:J} \right) \) that also need to be optimized. Taking derivatives and equating them to 0 results in the following set of fixed point update equations

$$\begin{aligned} a_{htj} = \frac{e^{\varvec{x}_{htj}^{T}\varvec{\mu }_h + \frac{1}{2}\left( \varvec{x}_{htj} - 2 \sum _{j^{'}=1}^J a_{htj^{'}}\varvec{x}_{htj^{'}}\right) ^T\varvec{{\varSigma }}_h \varvec{x}_{htj}}}{ \sum _{j^{'}=1}^J e^{\varvec{x}_{htj^{'}}^{T}\varvec{\mu }_h + \frac{1}{2}\left( \varvec{x}_{htj^{'}} - 2 \sum _{j^{''}=1}^J a_{htj^{''}}\varvec{x}_{htj^{''}}\right) ^T\varvec{{\varSigma }}_h \varvec{x}_{htj^{'}}}}, \quad \forall h, t, j. \end{aligned}$$

This approach was denoted by KM and \(KM_D\), where the latter restricts \(\varvec{{\varSigma }}_h\) to a diagonal matrix.
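The JI bound itself is available in closed form. The sketch below evaluates it and, as a check, prints a Monte Carlo estimate of the expected LSE, which should lie below the bound; all inputs are hypothetical.

```python
# Jensen's-inequality (JI) upper bound on E[log-sum-exp].
import numpy as np
from scipy.special import logsumexp

def ji_bound(X, mu, Sigma):
    quad = np.einsum('jk,kl,jl->j', X, Sigma, X)   # x_j^T Sigma x_j for each alternative
    return logsumexp(X @ mu + 0.5 * quad)

rng = np.random.default_rng(2)
J, K = 4, 3
X = rng.normal(size=(J, K)); mu = rng.normal(size=K)
L = rng.normal(size=(K, K)); Sigma = 0.2 * (L @ L.T + np.eye(K))

beta = rng.multivariate_normal(mu, Sigma, size=100_000)
print(ji_bound(X, mu, Sigma), logsumexp(beta @ X.T, axis=1).mean())   # bound >= estimate
```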

Böhning–Lindsay

If we take a second order Taylor series expansion of the LSE function \(f\left( \varvec{X}_{ht}\varvec{\beta }_h\right) = \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T \varvec{\beta }_h} \right) \) around some vector \(\varvec{{\varPsi }}_{ht}\), we know that for some specific vector \(\varvec{{\varPsi }}^*_{ht}\), we get the following equality

$$\begin{aligned} f\left( \varvec{X}_{ht} \varvec{\beta }_h\right)&= \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T \varvec{\beta }_h} \right) = \log \left( \sum _{j=1}^J e^{{\varPsi }_{htj}}\right) + \left( \varvec{X}_{ht}\varvec{\beta }_h -\varvec{{\varPsi }}_{ht}\right) ^T \nabla \left( \varvec{{\varPsi }}_{ht}\right) \\&\qquad + \frac{1}{2} \left( \varvec{X}_{ht}\varvec{\beta }_h- \varvec{{\varPsi }}_{ht}\right) ^T H \left( \varvec{{\varPsi }}^*_{ht} \right) \left( \varvec{X}_{ht}\varvec{\beta }_h -\varvec{{\varPsi }}_{ht} \right) , \end{aligned}$$

where \(\nabla \left( \varvec{{\varPsi }}_{ht}\right) \) and \(H \left( \varvec{{\varPsi }}^*_{ht} \right) \) are the gradient and Hessian of \(f\left( \varvec{X}_{ht} \varvec{\beta }_h\right) \), evaluated at \(\varvec{{\varPsi }}_{ht}\) and \(\varvec{{\varPsi }}_{ht}^*\) respectively. Define \(\varvec{A} = \frac{1}{2}\left( \varvec{I}_J - \varvec{1}_J\varvec{1}_J^T / J\right) \), where \(\varvec{I}_J\) is the J-dimensional identity matrix and \(\varvec{1}_J\) is a J-dimensional vector of ones. Böhning and Lindsay (1988) and Böhning (1992) show that \(\varvec{A} \ge \varvec{H}\) with respect to the Loewner ordering, which means that \(\varvec{A}-\varvec{H}\) is positive semi-definite. Replacing \(H \left( \varvec{{\varPsi }}^*_{ht} \right) \) by \(\varvec{A}\) and taking expectations over \(\varvec{\beta }_h\), we can obtain the following bound:

$$\begin{aligned}&{\mathbb {E}}_{q\left( \varvec{\beta }_h\right) }\left[ \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T\varvec{\beta }_h} \right) \right] \le \log \left( \sum _{j=1}^J e^{{\varPsi }_{htj}}\right) + \left( \varvec{X}_{ht}\varvec{\mu }_h-\varvec{{\varPsi }}_{ht}\right) ^T \nabla \left( \varvec{{\varPsi }}_{ht}\right) \\&\quad +\frac{1}{2} tr \left[ \varvec{X}_{ht}^T\varvec{A}\varvec{X}_{ht}\varvec{{\varSigma }}_h\right] +\frac{1}{2}\left( \varvec{X}_{ht}\varvec{\mu }_h- \varvec{{\varPsi }}_{ht}\right) ^T\varvec{A}\left( \varvec{X}_{ht}\varvec{\mu }_h-\varvec{{\varPsi }}_{ht}\right) . \end{aligned}$$

From this bound it is possible to generate explicit update equations for the subject specific parameters by inserting it into Eq. 13, collecting all terms that depend only on \(\varvec{\mu }_h\) and \(\varvec{{\varSigma }}_h\) from Eqs. 13 and 14, and equating derivatives with respect to \(\varvec{\mu }_h\) and \(\varvec{{\varSigma }}_h\) to \(\varvec{0}\), which results in:

$$\begin{aligned} \varvec{{\varSigma }}_h&= \left( \omega \varvec{{\varUpsilon }} + \sum _{t=1}^{T}\varvec{X}_{ht}^T\varvec{A}\varvec{X}_{ht}\right) ^{-1}, \quad h = 1, \ldots , H, \\ \varvec{\mu }_h&= \varvec{{\varSigma }}_h \left\{ \omega \varvec{{\varUpsilon }}\varvec{\mu }_{\varvec{\zeta }} + \sum _{t=1}^{T} \varvec{X}_{ht}^T \left[ \varvec{y}_{ht} - \nabla \left( \varvec{{\varPsi }}_{ht}\right) + \varvec{A}\varvec{{\varPsi }}_{ht}\right] \right\} ,\quad h = 1, \ldots , H. \end{aligned}$$

Using derivatives again, it can be seen that the update for the extra variational parameters \(\varvec{{\varPsi }}_{ht}\), \(\forall h, t\), turns out to be \(\varvec{{\varPsi }}_{ht} = \varvec{X}_{ht}\varvec{\mu }_h\). This approach was denoted by BL.
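A small numerical check of the curvature result underlying BL is sketched below: for any probability vector \(\varvec{p}\), the matrix \(\varvec{A}\) should dominate \(\text {diag}\left( \varvec{p}\right) - \varvec{p}\varvec{p}^T\) in the Loewner ordering. The random probability vectors are purely illustrative.

```python
# Boehning-Lindsay curvature bound: A - H should be positive semi-definite.
import numpy as np
from scipy.special import softmax

J = 5
A = 0.5 * (np.eye(J) - np.ones((J, J)) / J)

rng = np.random.default_rng(3)
for _ in range(1000):
    p = softmax(rng.normal(scale=3.0, size=J))       # arbitrary probability vector
    H = np.diag(p) - np.outer(p, p)                  # Hessian of the LSE in the linear predictor
    assert np.linalg.eigvalsh(A - H).min() > -1e-10  # A >= H up to numerical error
```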

Bouchard

Bouchard (2007) observed that \(\sum _{j=1}^J e^{x_j} \le \prod _{j=1}^J \left( 1 + e^{x_j}\right) \). Replacing \(x_j\) by \(\varvec{x}_{htj}^T \varvec{\beta }_h - \alpha _{ht}\) and taking logarithms we obtain \(\log \left( \sum _{j=1}^Je^{\varvec{x}_{htj}^T \varvec{\beta }_h} \right) \le \alpha _{ht} + \sum _{j=1}^J \log \left( 1 + e^{\varvec{x}_{htj}^T \varvec{\beta }_h - \alpha _{ht}} \right) \). Jaakkola and Jordan (2000) derived the well known tangential bound \(\log \left( 1 + e^{x}\right) \le \frac{x - t}{2} + \frac{1}{4t} \tanh \left( \frac{t}{2}\right) \left( x^2-t^2\right) + \log \left( 1 + e^{t}\right) \). Combining these two results and taking expectations with respect to \(\varvec{\beta }_h\), we obtain the following quadratic lower bound:

$$\begin{aligned}&{\mathbb {E}}_{q\left( \varvec{\beta }_h\right) }\left[ \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T\varvec{\beta }_h} \right) \right] \le \alpha _{ht} + \sum _{j=1}^J \frac{\varvec{x}_{htj}^T \varvec{\mu }_h - \alpha _{ht} - t_{htj}}{2} \\&\qquad + \sum _{j=1}^J \left\{ \lambda \left( t_{htj}\right) \left[ \left( \varvec{x}_{htj}^T\varvec{\mu }_h - \alpha _{ht}\right) ^2 - t_{htj}^2 + \varvec{x}_{htj}^T\varvec{{\varSigma }}_h\varvec{x}_{htj}\right] + \log \left( 1 + e^{t_{htj}}\right) \right\} , \end{aligned}$$

where \(\lambda \left( t\right) = \frac{1}{4t} \tanh \left( \frac{t}{2}\right) \). From this bound it is possible to generate explicit update equations for the subject specific parameters by inserting it into Eq. 13, collecting all terms that depend only on \(\varvec{\mu }_h\) and \(\varvec{{\varSigma }}_h\) from Eqs. 13 and 14, and equating derivatives with respect to \(\varvec{\mu }_h\) and \(\varvec{{\varSigma }}_h\) to \(\varvec{0}\), which results in:

$$\begin{aligned} \varvec{{\varSigma }}_h&= \left( \omega \varvec{{\varUpsilon }} + 2 \sum _{t=1}^{T} \sum _{j=1}^J \lambda \left( t_{htj}\right) \varvec{x}_{htj}\varvec{x}_{htj}^T\right) ^{-1}, \quad \forall h = 1, \ldots , H, \\ \varvec{\mu }_h&= \varvec{{\varSigma }}_h \left[ \omega \varvec{{\varUpsilon }} \varvec{\mu }_{\varvec{\zeta }} + \sum _{t=1}^{T}\sum _{j=1}^J \left( y_{htj} - \frac{1}{2} + 2 \alpha _{ht} \lambda \left( t_{htj}\right) \right) \varvec{x}_{htj}\right] , \quad \forall h = 1, \ldots , H. \end{aligned}$$

The extra variational parameters can be updated by the following fixed point equations:

$$\begin{aligned} \alpha _{ht}&= \frac{J/2 - 1 + 2 \sum _{j=1}^J \lambda \left( t_{htj}\right) \varvec{x}_{htj}^T \varvec{\mu }_h}{2 \sum _{j=1}^J \lambda \left( t_{htj}\right) }, \quad \quad \forall h, t,\\ t_{htj}&= \sqrt{\left( \varvec{x}_{htj}^T \varvec{\mu }_h - \alpha _{ht} \right) ^2 + \varvec{x}_{htj}^T \varvec{{\varSigma }}_h \varvec{x}_{htj}}, \quad \quad \forall h, t, j. \end{aligned}$$
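The sketch below evaluates Bouchard's bound for a single subject and choice set and iterates the fixed-point updates for \(\alpha _{ht}\) and \(t_{htj}\). The design matrix and starting values are hypothetical, and \(t\) must start strictly positive so that \(\lambda \left( t\right) \) is defined.

```python
# Bouchard's quadratic bound for one choice set, with the fixed-point updates.
import numpy as np

def lam(t):
    return np.tanh(t / 2.0) / (4.0 * t)

def bouchard_bound(X, mu, Sigma, alpha, t):
    m = X @ mu - alpha                                    # x_j^T mu - alpha
    quad = np.einsum('jk,kl,jl->j', X, Sigma, X)          # x_j^T Sigma x_j
    return alpha + np.sum((m - t) / 2.0 + lam(t) * (m**2 - t**2 + quad) + np.log1p(np.exp(t)))

def update_alpha_t(X, mu, Sigma, t, iters=50):
    J = X.shape[0]
    quad = np.einsum('jk,kl,jl->j', X, Sigma, X)
    for _ in range(iters):
        alpha = (J / 2.0 - 1.0 + 2.0 * np.sum(lam(t) * (X @ mu))) / (2.0 * np.sum(lam(t)))
        t = np.sqrt((X @ mu - alpha) ** 2 + quad)
    return alpha, t

rng = np.random.default_rng(4)
J, K = 4, 3
X = rng.normal(size=(J, K)); mu = rng.normal(size=K)
L = rng.normal(size=(K, K)); Sigma = 0.2 * (L @ L.T + np.eye(K))
alpha, t = update_alpha_t(X, mu, Sigma, t=np.ones(J))   # t initialised strictly positive
print(bouchard_bound(X, mu, Sigma, alpha, t))
```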

Jebara–Choromanska

Jebara and Choromanska (2012) developed an algorithm to find a quadratic bound

$$\begin{aligned} \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T\varvec{\beta }_h}\right) \le \log z_{ht} + \frac{1}{2}\left( \varvec{\beta }_h - \tilde{\varvec{\beta }}\right) ^T\varvec{S}_{ht}\left( \varvec{\beta }_h - \tilde{\varvec{\beta }}\right) + \left( \varvec{\beta }_h - \tilde{\varvec{\beta }}\right) ^T\varvec{m}_{ht} \end{aligned}$$

around some \(\tilde{\varvec{\beta }}\). The algorithm outputs \(z_{ht}\), \(\varvec{m}_{ht}\) and \(\varvec{S}_{ht}\) and is given in Algorithm 2. After taking expectations this bound leads to

$$\begin{aligned} {\mathbb {E}}_{q\left( \varvec{\beta }_h\right) }\left[ \log \left( \sum _{j=1}^J e^{\varvec{x}_{htj}^T\varvec{\beta }_h} \right) \right]\le & {} \log z_{ht} + \frac{1}{2}\left( \varvec{\mu }_h - \tilde{\varvec{\beta }}\right) ^T\varvec{S}_{ht} \left( \varvec{\mu }_h - \tilde{\varvec{\beta }}\right) \\&+ \frac{1}{2}tr\left[ \varvec{{\varSigma }}_h\varvec{S}_{ht}\right] + \left( \varvec{\mu }_h - \tilde{\varvec{\beta }}\right) ^T\varvec{m}_{ht}. \end{aligned}$$

This quadratic bound again leads to explicit updates of the subject specific variational parameters given by

$$\begin{aligned} \varvec{{\varSigma }}_h&= \left( \omega \varvec{{\varUpsilon }} + \sum _{t=1}^{T} \varvec{S}_{ht} \right) ^{-1}, \\ \varvec{\mu }_h&= \varvec{{\varSigma }}_h \left( \omega \varvec{{\varUpsilon }} \varvec{\mu }_{\varvec{\zeta }} + \sum _{t=1}^{T} \left\{ \varvec{X}_{ht}^T \varvec{y}_{ht} + \varvec{S}_{ht}\tilde{\varvec{\beta }} - \varvec{m}_{ht} \right\} \right) . \end{aligned}$$

We chose \(\tilde{\varvec{\beta }} = \varvec{\mu }_{\varvec{\zeta }}\). This method was denoted by JC.

Appendix 3: Generating quasi Monte Carlo samples

In this section we briefly show how we constructed the QMC samples. We constructed them according to Hickernell et al. (2000), as so-called extensible shifted lattice points (ESLP). For more details on the properties and optimal construction of such samples we refer the reader to that reference. The goal of QMC sampling is to sample from the K-dimensional unit cube \([0, 1)^K\) such that the discrepancy between the empirical distribution of the QMC sample and the continuous uniform distribution is small. If this goal is attained, relatively precise high-dimensional integration can be performed with a relatively small number of QMC samples, which benefits the computational efficiency of the integration. Suppose that we require R samples, where R is an integer power of an integer base, i.e. \(R = b^m\) with integers \(b \ge 2\) and \(m \ge 1\). We also require a generating vector \(\varvec{h}\) of dimension K. Following Hickernell et al. (2000) we use the generating vector \(\varvec{h} = \left( 1, \eta , \eta ^2, \ldots , \eta ^{K-1}\right) ^T\). The next step is to write the integers \(0, 1, 2, \ldots , b^m - 1\) in base b. For instance, if \(b=2\) and \(m=3\), we have \(R = 2^3 = 8\) samples; the integer 0 is written as \(0\times 2^0 + 0\times 2^1 + 0\times 2^2\), 1 as \(1\times 2^0 + 0\times 2^1 + 0\times 2^2\), \(\ldots \), and 7 as \(1\times 2^0 + 1\times 2^1 + 1\times 2^2\). The integers \(0, \ldots , b^m - 1\) can thus be expressed as

$$\begin{aligned} i = \sum _{k=0}^{m - 1} i_k b^k = i_0 b^0 + i_1 b^1 + \cdots + i_{m-1} b^{m-1}. \end{aligned}$$

Define now the function \(\phi _b\left( i\right) \) as

$$\begin{aligned} \phi _b \left( i\right) = \sum _{k=0}^{m - 1} i_k b^{-\left( k+1\right) } = i_0 b^{-1} + i_1 b^{-2} + \cdots + i_{m-1} b^{-m} \end{aligned}$$

and a random shift vector \(\varvec{u} = \left( u_1, \ldots , u_K\right) ^T\), which is a random element of the unit cube \([0, 1)^K\). The ith QMC sample is now defined as \(\left( \left\{ \phi _b\left( i\right) h_1 + u_1 \right\} , \ldots , \left\{ \phi _b\left( i\right) h_K + u_K \right\} \right) ^T\), where \(\left\{ x\right\} \) denotes the fractional part of x, i.e. \(\left\{ x\right\} = x \ (\text {mod}\ 1)\). Hickernell et al. (2000) used a periodizing transformation on the final QMC samples as this appeared to increase the accuracy of the method. We also used this transformation, which is defined as \(x^{'} = \left| 2x - 1 \right| \). Finally, as we are interested in samples from a multivariate normal distribution rather than from a multivariate uniform distribution, we apply the inverse of the standard normal distribution function to all coordinates. This results in a QMC sample from a standard K-dimensional normal distribution. In our simulations we used base \(b=2\) and tried several exponents \(m = 6, \ldots , 12\). Furthermore, we used \(\eta = 1571\) from Hickernell et al. (2000, Table 4.1), which is an appropriate value for exponents in the set \(\left\{ 6, 7, \ldots , 12\right\} \) and up to \(K = 33\) dimensions.
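The construction above can be sketched in a few lines of code. The sketch assumes base \(b = 2\); the reduction of the generating vector modulo R and the clipping before the inverse-normal map are numerical safeguards added here (the reduction leaves the lattice unchanged because \(\phi _b\left( i\right) \) is a multiple of 1/R), and the function name is illustrative.

```python
# Sketch of the extensible shifted lattice point (ESLP) construction with baker transform.
import numpy as np
from scipy.stats import norm

def eslp_normal_sample(m, K, eta=1571, b=2, seed=0):
    R = b ** m
    rng = np.random.default_rng(seed)
    # generating vector (1, eta, eta^2, ...), reduced mod R for numerical safety
    h = np.array([pow(eta, k, R) for k in range(K)], dtype=float)
    # radical inverse phi_b(i): reflect the base-b digits of i about the radix point
    i = np.arange(R)
    phi = np.zeros(R)
    w = 1.0 / b
    while i.any():
        phi += (i % b) * w
        i = i // b
        w /= b
    u = rng.random(K)                                    # random shift in [0, 1)^K
    x = (phi[:, None] * h[None, :] + u[None, :]) % 1.0   # shifted lattice points
    x = np.abs(2.0 * x - 1.0)                            # periodizing (baker) transform
    return norm.ppf(np.clip(x, 1e-12, 1.0 - 1e-12))      # map to standard normal draws

z = eslp_normal_sample(m=6, K=3)                         # 2^6 = 64 QMC points in 3 dimensions
```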

Cite this article

Depraetere, N., Vandebroek, M. A comparison of variational approximations for fast inference in mixed logit models. Comput Stat 32, 93–125 (2017). https://doi.org/10.1007/s00180-015-0638-y

Keywords

  • Bayesian statistics
  • Variational Bayes
  • Discrete choice