Generation of synthetic datasets for discrete choice analysis

Garrow, Laurie A.; Bodea, Tudor D.; Lee, Misuk

doi:10.1007/s11116-009-9228-6

Published: 14 October 2009

Transportation, Volume 37, pages 183–202 (2010)

Laurie A. Garrow¹,
Tudor D. Bodea² &
Misuk Lee¹

Abstract

Despite the widespread use of synthetic data in discrete choice analysis, little is known about how the methodology used to generate synthetic datasets influences the properties of parameter estimates and the validity of results based on these estimates. That is, there are two potential sources of biases when using synthetic discrete choice data: (1) bias due to the method used to generate the dataset; and, (2) bias due to parameter estimation. The primary objective of this study is to examine bias due to the underlying data generation method. This study compares three methods for generating synthetic datasets and uses design of experiments and analysis of variance methods to investigate the ability to recover estimates for “true” logsum parameters for nested logit models. The method that uses nested logit probabilities to generate the chosen alternative results in unbiased parameter estimates. The method that is based on Gumbel error component approximations reveals that while the error components themselves are unbiased, subtle empirical identification problems can arise when these error components are combined with synthetically generated utility functions. The method that is based on normal error component approximations reveals that all logsum coefficients are biased upwards; the bias dramatically increases for those nests that have a low choice frequency and is most pronounced for those nests with high correlations among alternatives. Based on the results of the analysis, several recommendations for the generation of synthetic datasets for discrete choice analyses are provided.

Notes

If μ = 0.8 and Δ = 0.10, then μ₁ = 0.80, μ₂ = μ₁ − Δ = 0.70 and μ₃ = μ₁ − 2Δ = 0.60.
The correlation ρ_m in nest m, characterized by the logsum coefficient μ_m, is calculated as ρ_m = 1 − μ_m². For a treatment defined by μ and Δ, the correlation for m ≠ 1 is therefore $ \rho_{m} = 1 - [\mu - (m - 1)\Delta]^{2} = \rho_{1} + (m - 1)\Delta\,[2\mu - (m - 1)\Delta] $, which for the μ and Δ given in Table 1 is always greater than ρ₁.
We are grateful to Chandra Bhat and Juan de Dios Ortuzar for bringing this to our attention, as the methodologies used to generate synthetic discrete choice datasets are typically not reported in the literature.
Here, the scale parameter is defined as μ where previously the scale was defined as γ.
The probability density function of a C(λ)-distributed variable ν is $ P_{\lambda}(\nu) = \frac{1}{\lambda}\sum_{n = 0}^{\infty} \frac{(-1)^{n}\exp(-n\nu)}{n!\,\Gamma(-\lambda n)} $. If ν is distributed as C(λ) and δ is a fixed scalar, then δ·ν is said to be distributed as C(λ, δ).
Because ρ_m = 1 − μ_m², overestimated logsum coefficients imply that the correlation among the alternatives within each nest is underestimated.
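
The relation between logsum coefficients and within-nest correlations used in the notes above can be checked with a short Python sketch. The function names and the three-nest default are illustrative assumptions and are not taken from the paper; only the formulas μ_m = μ − (m − 1)Δ and ρ_m = 1 − μ_m² come from the notes.

```python
def nest_logsums(mu, delta, n_nests=3):
    """Logsum coefficients mu_m = mu - (m - 1) * delta for m = 1..n_nests."""
    return [mu - (m - 1) * delta for m in range(1, n_nests + 1)]


def nest_correlation(mu_m):
    """Within-nest correlation implied by a logsum coefficient: rho_m = 1 - mu_m^2."""
    return 1.0 - mu_m ** 2


def correlation_from_expansion(mu, delta, m):
    """Same correlation via rho_1 + (m - 1) * delta * [2 * mu - (m - 1) * delta]."""
    rho_1 = 1.0 - mu ** 2
    return rho_1 + (m - 1) * delta * (2.0 * mu - (m - 1) * delta)


if __name__ == "__main__":
    # Example treatment from the notes: mu = 0.8, delta = 0.10.
    for m, mu_m in enumerate(nest_logsums(0.8, 0.10), start=1):
        print(f"nest {m}: mu = {mu_m:.2f}, rho = {nest_correlation(mu_m):.2f}")
```

For the example treatment this gives correlations of about 0.36, 0.51 and 0.64, so nests with smaller logsum coefficients have higher correlations. It also shows why an upward-biased logsum estimate understates correlation: `nest_correlation(0.85)` is smaller than `nest_correlation(0.80)`.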

References

Aptech Systems Inc.: GAUSS Mathematical and Statistical System. Aptech Systems, Inc, Maple Valley (2006)
Banks, J., Carson, J.S., Nelson, B.L., Nicol, D.: Discrete-Event System Simulation, 4th edn. Prentice-Hall, Inc., Upper Saddle River (2005)
Cardell, N.S.: Variance components structures for the extreme value and logistic distributions. Econ. Theory 13(2), 185–213 (1997)
Cherchi, E., Ortuzar, J.D.D.: Predicting best with mixed logit models: understanding some confounding effects. In: Inweldi, P.O. (ed.) Transportation Research Trends, pp. 215–235. Nova Science Publishers, Inc, New York (2008)
Chiou, L., Walker, J.L.: Masking identification of discrete choice models under simulation methods. J. Econ. 141, 683–703 (2007)
Daganzo, C.: Multinomial Probit: the Theory and its Applications to Demand Forecasting. Academic Press, New York (1979)
Daly, A., Bierlaire, M.: A general and operational representation of generalised extreme value models. Transp. Res. Part B 40(4), 285–305 (2006)
de Oliveria, T.: Decision and Modeling for Extremes. Some Recent Advances in Statistics, pp. 101–110. Academic Press, New York (1982)
Garrow, L.A., Bodea, T.D.: A Rigorous Framework for Empirically Comparing Mixed GEV and Mixed MNL with Error Components Models. Paper presented at the European Transport Conference, Strasbourg, France (2005)
Gopinath, D., Schofield, M.L., Walker, J.L., Ben-Akiva, M.: Comparative Analysis of Discrete Choice Models with Flexible Substitution Patterns. Paper presented at the 84th Annual Meeting of the Transportation Research Board, Washington, DC (2005)
Hess, S., Bierlaire, M., Polak, J.W.: Capturing Taste Heterogeneity and Correlation Structure with Mixed GEV models. Paper presented at the 84th Annual Meeting of the Transportation Research Board, Washington, DC (2005)
Kotz, S., Balakrishnan, N., Johnson, N.L.: Continuous Multivariate Distributions, Vol. 1, 2nd edn. Wiley, New York (2000)
Law, A.M., Kelton, W.D.: Simulation Modeling and Analysis, 3rd edn. McGraw Hill, Boston (2000)
McFadden, D.: Conditional logit analysis of qualitative choice behavior. In: Zarembka, P. (ed.) Frontiers in Econometrics, pp. 105–142. Academic Press, New York (1974)
McFadden, D.: Modeling the choice of residential location. In: Karlquist, A., et al. (eds.) Spatial Interaction Theory and Residential Location, pp. 75–96. North-Holland, Amsterdam (1978)
Munizaga, M.A., Alvarez-Daziano, R.: Mixed Logit vs. Nested Logit and Probit Models. Paper presented at the 5th Tri-Annual Invitational Choice Symposium Workshop: Hybrid Choice Models, Formulation and Practical Issues, Asilomar (2001)
Munizaga, M.A., Alvarez-Daziano, R.: Evaluation of Mixed Logit as a Practical Modelling Alternative. Paper presented at the European Transport Conference, Cambridge, UK (2002)
R Development Core Team: R: A Language and Environment for Statistical Computing. Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org (2006)
Schruben, L., Kulkarni, R.: Some consequences of estimating parameters for the M/M/1 queue. Oper. Res. Lett. 1(2), 75–78 (1983)
Sivakumar, A., Bhat, C.R., Okten, G.: Simulation estimation of mixed discrete choice models with the use of randomized quasi-Monte Carlo sequences: a comparative study. Transp. Res. Rec. 1921, 112–122 (2005)
Wen, C.-H., Koppelman, F.: The generalized nested logit model. Transp. Res. Part B 35(7), 627–641 (2001)
Williams, H.C.W.L.: On the formulation of travel demand models and economic measures of user benefit. Environ Plan 9A(3), 285–344 (1977)
Williams, H.C.W.L., Ortuzar, J.D.: Behavioural theories of dispersion and the mis-specification of travel demand models. Transp. Res. Part B 16(3), 167–219 (1982)

Acknowledgments

We are grateful to Chandra Bhat and Juan de Dios Ortuzar for their helpful comments and suggestions related to data generation procedures, and to Dave Goldsman and Melike Meterelliyoz for their constructive comments on how to relate our problem to the simulation literature. We are also grateful to the graduate students in Dr. Garrow’s Spring 2008 statistics course for running the NL models based on the probability method. Partial support for this project was provided by the National Science Foundation (SBE-0624269).

Author information

Authors and Affiliations

School of Civil and Environmental Engineering, Georgia Institute of Technology, 790 Atlantic Drive, Atlanta, GA, 30332-0355, USA
Laurie A. Garrow & Misuk Lee
InterContinental Hotels Group, Atlanta, GA, USA
Tudor D. Bodea

Corresponding author

Correspondence to Laurie A. Garrow.

Appendix

Generation of correlation for multivariate Gumbel distributions

Proposition 1. Let X₀ and Z_i, i = 1,…,n, be independent standard Gumbel random variables. Define, for i = 1,…,n,

$$ X_{i} = \max \{ X_{0} + \log (\phi_{i} ),\; Z_{i} + \log (1 - \phi_{i} )\} ,\quad 0 < \phi_{i} \le 1. $$

The parameter ϕ_i determines the dependence between X₀ and X_i. Setting ϕ_i = 1 gives perfect dependence (X_i = X₀), so by convention ϕ₀ = 1.

Then,

1. (X₀,…,X_n) is an (n + 1)-dimensional Gumbel vector, each marginal of which is Gumbel distributed with mode 0 and dispersion 1.
2. The two-dimensional CDF of (X_i, X_j) is
$$ F_{X_{i},X_{j}}(x,y) = \exp \left( - \max \left\{ \mathrm{e}^{-x} + (1 - \phi_{j})\mathrm{e}^{-y},\; (1 - \phi_{i})\mathrm{e}^{-x} + \mathrm{e}^{-y} \right\} \right). $$

Also, the (n + 1)-dimensional CDF is
$$ F_{X_{0},\ldots,X_{n}}(x_{0},\ldots,x_{n}) = \exp \left( - \max_{i = 0,\ldots,n} \left\{ \mathrm{e}^{-x_{i}} + \sum\limits_{j = 1,\, j \ne i}^{n} (1 - \phi_{j})\mathrm{e}^{-x_{j}} \right\} \right). $$
3. The correlation coefficient between X_i and X_j is
$$ \rho_{ij} = - 6\pi^{-2} \left( \int\limits_{-\infty}^{\log (\phi_{j}/\phi_{i})} \log \left( \frac{1 + (1 - \phi_{i})\mathrm{e}^{w}}{1 + \mathrm{e}^{w}} \right) \mathrm{d}w + \int\limits_{\log (\phi_{j}/\phi_{i})}^{\infty} \log \left( \frac{(1 - \phi_{j}) + \mathrm{e}^{w}}{1 + \mathrm{e}^{w}} \right) \mathrm{d}w \right). $$
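
The construction in Proposition 1 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' implementation. The function names, the seed and the example ϕ values are assumptions chosen for the sketch; the standard Gumbel draws use the usual inverse-CDF transform −log(−log U).

```python
import math
import random


def standard_gumbel(rng):
    """Draw one standard Gumbel variate (mode 0, dispersion 1) by inverting the CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))


def correlated_gumbel_vector(phis, rng):
    """Return [X_0, X_1, ..., X_n] built as in Proposition 1.

    phis holds phi_1..phi_n, each in (0, 1]; phi_i = 1 makes X_i equal to X_0.
    """
    x0 = standard_gumbel(rng)
    xs = [x0]
    for phi in phis:
        if not 0.0 < phi <= 1.0:
            raise ValueError("each phi_i must lie in (0, 1]")
        if phi == 1.0:
            xs.append(x0)  # log(1 - phi) is -infinity, so the max is X_0
        else:
            z = standard_gumbel(rng)
            xs.append(max(x0 + math.log(phi), z + math.log(1.0 - phi)))
    return xs


if __name__ == "__main__":
    rng = random.Random(2009)  # arbitrary seed, for reproducibility only
    draws = [correlated_gumbel_vector([0.3, 0.6, 1.0], rng) for _ in range(20_000)]
    for i in range(4):
        column = [row[i] for row in draws]
        mean = sum(column) / len(column)
        # A standard Gumbel variable has mean of about 0.577 (Euler's constant).
        print(f"X_{i}: sample mean = {mean:.3f}")
```

Each component shares the common draw X₀, which is what induces the correlation. The sample means give a rough check on part 1 of the proposition, which states that every marginal stays standard Gumbel regardless of ϕ_i.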

Proof

(1)

$$ \begin{aligned} P(X_{i} \le x) = & P\left( {X_{0} \le x - \log (\phi_{i} ),Z_{i} \le x - \log (1 - \phi_{i} )} \right) \\ = & \exp \left( { - {\text{e}}^{{ - (x - \log (\phi_{i} ))}} } \right)\exp \left( { - {\text{e}}^{{ - (x - \log (1 - \phi_{i} ))}} } \right) \\ = & \exp \left( { - {\text{e}}^{ - x} {\text{e}}^{{\log (\phi_{i} )}} } \right)\exp \left( { - {\text{e}}^{ - x} {\text{e}}^{{\log (1 - \phi_{i} )}} } \right) \\ = & \exp \left( { - \phi_{i} {\text{e}}^{ - x} - \left( {1 - \phi_{i} } \right){\text{e}}^{ - x} } \right) \\ = & \exp \left( { - {\text{e}}^{ - x} } \right) \\ \end{aligned} $$

(2)

$$ \begin{aligned} F_{X_{i},X_{j}}(x,y) = & P\left( X_{0} \le x - \log (\phi_{i}),\; Z_{i} \le x - \log (1 - \phi_{i}),\; X_{0} \le y - \log (\phi_{j}),\; Z_{j} \le y - \log (1 - \phi_{j}) \right) \\ = & P\left( X_{0} \le \min \left\{ x - \log (\phi_{i}),\, y - \log (\phi_{j}) \right\} \right) P\left( Z_{i} \le x - \log (1 - \phi_{i}) \right) P\left( Z_{j} \le y - \log (1 - \phi_{j}) \right) \\ = & \exp \left( - \mathrm{e}^{-\min \left\{ x - \log (\phi_{i}),\, y - \log (\phi_{j}) \right\}} \right) \exp \left( - \mathrm{e}^{-(x - \log (1 - \phi_{i}))} \right) \exp \left( - \mathrm{e}^{-(y - \log (1 - \phi_{j}))} \right) \\ \end{aligned} $$

If $ y - x \ge \log \left( {\phi_{j} /\phi_{i} } \right) $, i.e. $ \phi_{i} e^{ - x} \ge \phi_{j} e^{ - y} $,

$$ \begin{aligned} F_{{X_{i} ,X_{j} }} (x,y) = & \exp \left( { - e^{{ - \left( {x - \log (\phi_{i} )} \right)}} } \right)\exp \left( { - e^{{ - \left( {x - \log (1 - \phi_{i} )} \right)}} } \right)\exp \left( { - e^{{ - \left( {y - \log (1 - \phi_{j} )} \right)}} } \right) \\ = & \exp \left( { - \phi_{i} e^{ - x} - \left( {1 - \phi_{i} } \right)e^{ - x} } \right)\exp \left( { - e^{{ - \left( {y - \log (1 - \phi_{j} )} \right)}} } \right) \\ = & \exp \left( { - e^{ - x} - \left( {1 - \phi_{j} } \right)e^{ - y} } \right). \\ \end{aligned} $$

If $ y - x < \log (\phi_{j} /\phi_{i} ) $,

$$ F_{{X_{i} ,X_{j} }} (x,y) = \exp \left( { - (1 - \phi_{i} ){\text{e}}^{ - x} - {\text{e}}^{ - y} } \right) $$

Thus, $ F_{{X_{i} ,X_{j} }} (x,y) = \exp \left( { - \max \left\{ {{\text{e}}^{ - x} + (1 - \phi_{j} ){\text{e}}^{ - y} ,(1 - \phi_{i} ){\text{e}}^{ - x} + {\text{e}}^{ - y} } \right\}} \right). $

Now consider the (n + 1)-dimensional CDF. Recall that ϕ₀ = 1.

$$ \begin{aligned} F_{X_{0},\ldots,X_{n}}(x_{0},\ldots,x_{n}) = & P\left( X_{0} \le x_{0},\, X_{1} \le x_{1},\, \ldots,\, X_{n-1} \le x_{n-1},\, X_{0} \le x_{n} - \log (\phi_{n}) \right) P\left( Z_{n} \le x_{n} - \log (1 - \phi_{n}) \right) \\ = & P\left( X_{0} \le \min_{i = 0,\ldots,n} \{ x_{i} - \log (\phi_{i}) \} \right) \prod\limits_{j = 1}^{n} P\left( Z_{j} \le x_{j} - \log (1 - \phi_{j}) \right) \\ = & \exp \left( - \mathrm{e}^{-\min_{i = 0,\ldots,n} \{ x_{i} - \log (\phi_{i}) \}} - \sum\limits_{j = 1}^{n} (1 - \phi_{j})\mathrm{e}^{-x_{j}} \right) \\ = & \exp \left( - \max_{i = 0,\ldots,n} \left\{ \phi_{i}\mathrm{e}^{-x_{i}} \right\} - \sum\limits_{j = 1}^{n} (1 - \phi_{j})\mathrm{e}^{-x_{j}} \right) \\ = & \exp \left( - \max_{i = 0,\ldots,n} \left\{ \mathrm{e}^{-x_{i}} + \sum\limits_{j = 1,\, j \ne i}^{n} (1 - \phi_{j})\mathrm{e}^{-x_{j}} \right\} \right) \\ \end{aligned} $$

(3) Note that

$$ \begin{aligned} F_{X_{i},X_{j}}(x,y) = & \exp \left( - \max \left\{ \mathrm{e}^{-x} + (1 - \phi_{j})\mathrm{e}^{-y},\; (1 - \phi_{i})\mathrm{e}^{-x} + \mathrm{e}^{-y} \right\} \right) \\ = & \exp \left( - \left( \mathrm{e}^{-x} + \mathrm{e}^{-y} \right) k(y - x) \right) \\ \end{aligned} $$

with

$$ k(w;\phi_{i} ,\phi_{j} ) = \left\{ {\begin{array}{*{20}c} {{\frac{{(1 - \phi_{j} ) + {\text{e}}^{w} }}{{1 + {\text{e}}^{w} }}}} & {{\text{if}}\,w \ge { \log }(\phi_{j} /\phi_{i} )} \\ {{\frac{{1 + (1 - \phi_{i} ){\text{e}}^{w} }}{{1 + {\text{e}}^{w} }}}} & {{\text{otherwise}}.} \\ \end{array} } \right. $$

The correlation coefficient between X _i and X _j is

$$ \begin{aligned} \rho_{ij} = & - 6\pi^{-2} \int\limits_{-\infty}^{\infty} \log (k(w))\, \mathrm{d}w \\ = & - 6\pi^{-2} \left( \int\limits_{-\infty}^{\log (\phi_{j}/\phi_{i})} \log \left( \frac{1 + (1 - \phi_{i})\mathrm{e}^{w}}{1 + \mathrm{e}^{w}} \right) \mathrm{d}w + \int\limits_{\log (\phi_{j}/\phi_{i})}^{\infty} \log \left( \frac{(1 - \phi_{j}) + \mathrm{e}^{w}}{1 + \mathrm{e}^{w}} \right) \mathrm{d}w \right) \\ \end{aligned} $$

Since ϕ₀ = 1, the correlation coefficient between X₀ and X_i reduces to

$$ \rho_{0i} = - 6\pi^{-2} \int\limits_{0}^{\phi_{i}} (1 - t)^{-1} \log (t)\, \mathrm{d}t. $$
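
The closed-form expression for the correlation between X₀ and X_i has no elementary antiderivative, but it is easy to evaluate numerically. The Python sketch below approximates the integral with a midpoint rule and then inverts it by bisection to find the ϕ_i that yields a target correlation. The function names, step count and tolerance are assumptions made for illustration, and the paper does not state that the authors chose ϕ_i this way. Bisection is valid here because the integrand is negative on (0, 1), so the correlation increases monotonically in ϕ_i.

```python
import math


def rho_closed_form(phi, steps=20_000):
    """Midpoint-rule estimate of -6/pi^2 * integral_0^phi log(t) / (1 - t) dt."""
    if not 0.0 < phi <= 1.0:
        raise ValueError("phi must lie in (0, 1]")
    h = phi / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h  # midpoints avoid the endpoints t = 0 and t = 1
        total += math.log(t) / (1.0 - t)
    return -6.0 / math.pi ** 2 * total * h


def solve_phi(target_rho, tol=1e-6):
    """Find phi in (0, 1) whose implied correlation matches target_rho, by bisection."""
    if not 0.0 < target_rho < 1.0:
        raise ValueError("target_rho must lie strictly between 0 and 1")
    lo, hi = 1e-9, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rho_closed_form(mid) < target_rho:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for phi in (0.1, 0.5, 0.9, 1.0):
        print(f"phi = {phi:.1f} -> rho = {rho_closed_form(phi):.4f}")
    # Example target: a within-nest correlation of 0.36 (logsum coefficient 0.8).
    print(f"phi for rho = 0.36: {solve_phi(0.36):.4f}")
```

At ϕ_i = 1 the integral equals −π²/6, so the formula returns a correlation of 1, consistent with perfect dependence.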

About this article

Cite this article

Garrow, L.A., Bodea, T.D. & Lee, M. Generation of synthetic datasets for discrete choice analysis. Transportation 37, 183–202 (2010). https://doi.org/10.1007/s11116-009-9228-6

Issue Date: March 2010
DOI: https://doi.org/10.1007/s11116-009-9228-6
