Abstract
Despite the widespread use of synthetic data in discrete choice analysis, little is known about how the methodology used to generate synthetic datasets influences the properties of parameter estimates and the validity of results based on these estimates. That is, there are two potential sources of bias when using synthetic discrete choice data: (1) bias due to the method used to generate the dataset; and (2) bias due to parameter estimation. The primary objective of this study is to examine bias due to the underlying data generation method. This study compares three methods for generating synthetic datasets and uses design of experiments and analysis of variance methods to investigate how well each method recovers the “true” logsum parameters of nested logit models. The method that uses nested logit probabilities to generate the chosen alternative results in unbiased parameter estimates. The method that is based on Gumbel error component approximations reveals that, while the error components themselves are unbiased, subtle empirical identification problems can arise when these error components are combined with synthetically generated utility functions. The method that is based on normal error component approximations reveals that all logsum coefficients are biased upwards; the bias increases dramatically for nests that have a low choice frequency and is most pronounced for nests with high correlations among alternatives. Based on the results of the analysis, several recommendations for the generation of synthetic datasets for discrete choice analyses are provided.
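The first of the three generation methods, drawing the chosen alternative directly from nested logit probabilities, can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes a two-level nested logit in the utility-maximizing form with logsum coefficients in (0, 1], and the function names, utilities and nest structure are hypothetical.

```python
import math
import random


def nested_logit_probs(utilities, nests, logsums):
    """Return choice probabilities for a two-level nested logit model.

    utilities: dict alternative -> systematic utility V (hypothetical values)
    nests:     dict nest -> list of alternatives in that nest
    logsums:   dict nest -> logsum coefficient mu_m, assumed to lie in (0, 1]
    """
    # Inclusive value of each nest: log of the sum of exp(V / mu_m).
    inclusive = {
        m: math.log(sum(math.exp(utilities[a] / logsums[m]) for a in alts))
        for m, alts in nests.items()
    }
    denom = sum(math.exp(logsums[m] * inclusive[m]) for m in nests)

    probs = {}
    for m, alts in nests.items():
        p_nest = math.exp(logsums[m] * inclusive[m]) / denom
        for a in alts:
            # Conditional probability of alternative a given nest m.
            p_cond = math.exp(utilities[a] / logsums[m] - inclusive[m])
            probs[a] = p_nest * p_cond
    return probs


def draw_choice(probs, rng):
    """Draw one chosen alternative by inverting the cumulative probabilities."""
    u = rng.random()
    cumulative = 0.0
    chosen = None
    for chosen, p in probs.items():
        cumulative += p
        if u < cumulative:
            return chosen
    return chosen  # guards against rounding when u is very close to 1


# Hypothetical example: four alternatives in two nests.
V = {"a": 0.5, "b": 0.0, "c": -0.5, "d": 0.2}
NESTS = {1: ["a", "b"], 2: ["c", "d"]}
MU = {1: 0.8, 2: 0.6}

example_probs = nested_logit_probs(V, NESTS, MU)
example_rng = random.Random(2010)
example_choices = [draw_choice(example_probs, example_rng) for _ in range(5)]
print(example_probs)
print(example_choices)
```

Setting every logsum coefficient to 1 in this sketch collapses the probabilities to the multinomial logit form, which gives a quick check on a generator of this kind.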
Notes
If \( \mu = 0.8 \) and \( \Updelta = 0.10 \), then \( \mu_{1} = 0.80 \), \( \mu_{2} = \mu_{1} - \Updelta = 0.70 \) and \( \mu_{3} = \mu_{1} - 2\cdot\Updelta = 0.60 \).
The correlation \( \rho_{m} \) in nest m, characterized by the logsum coefficient \( \mu_{m} \), is calculated as \( \rho_{m} = 1 - \mu_{m}^{2} \). This means that for a treatment defined by μ and Δ, the correlation \( \rho_{m} ,m \ne 1 \), is given as \( \rho_{m} = 1 - [\mu - (m - 1) \times \Updelta ]^{2} = \rho_{1} + (m - 1) \times \Updelta \times [2\mu - (m - 1) \times \Updelta ] \), which for the μ and Δ given in Table 1 is always greater than \( \rho_{1} \).
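As a worked illustration of the two notes above, the short Python sketch below tabulates the nest-level logsum coefficients and the implied correlations for one treatment. The function name `nest_parameters` and the default of three nests are assumptions made for illustration (three nests matches the example with μ and Δ above); only the relations between μ, Δ, the nest-level logsum coefficients and the correlations come from the notes.

```python
def nest_parameters(mu, delta, n_nests=3):
    """Return (m, mu_m, rho_m) for nests m = 1..n_nests.

    mu_m  = mu - (m - 1) * delta   (logsum coefficient of nest m)
    rho_m = 1 - mu_m ** 2          (implied correlation within nest m)
    """
    rows = []
    for m in range(1, n_nests + 1):
        mu_m = mu - (m - 1) * delta
        rho_m = 1.0 - mu_m ** 2
        rows.append((m, mu_m, rho_m))
    return rows


# Treatment used in the note: mu = 0.8, delta = 0.10.
for m, mu_m, rho_m in nest_parameters(0.8, 0.10):
    print(f"nest {m}: logsum = {mu_m:.2f}, correlation = {rho_m:.2f}")
```

Because the logsum coefficient falls as m increases, the implied correlation rises from nest to nest within a treatment.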
We are grateful to Chandra Bhat and Juan de Dios Ortuzar for bringing this to our attention, as the methodologies used to generate synthetic discrete choice datasets are typically not reported in the literature.
Here, the scale parameter is defined as μ where previously the scale was defined as γ.
The probability density function of a C(λ) distributed variable ν is: \( P_{\lambda } (\nu ) = {\frac{1}{\lambda }}\sum\nolimits_{n = 0}^{\infty } {\frac{{( - 1)^{n} \,{ \exp }( - n\nu )}}{{n!\,\Upgamma ( - \lambda n)}}} \). If ν is distributed as C(λ) and δ is a fixed scalar, then δ·ν is said to be distributed as C(λ, δ).
Overestimated logsum coefficients imply that the correlation among alternatives in each nest is underestimated, because the correlation in a nest decreases as its logsum coefficient increases.
Acknowledgments
We are grateful to Chandra Bhat and Juan de Dios Ortuzar for their helpful comments and suggestions related to data generation procedures, and to Dave Goldsman and Melike Meterelliyoz for their constructive comments on how to relate our problem to the simulation literature. We are also grateful to the graduate students in Dr. Garrow’s Spring 2008 statistics course for running the NL models based on the probability method. Partial support for this project was provided by the National Science Foundation (SBE-0624269).
Appendix
Generation of correlation for multivariate Gumbel distributions
Proposition 1. Let \( X_{0} \) and \( Z_{i} \), \( i = 1, \ldots ,n \), be independent standard Gumbel random variables. Define for \( i = 1, \ldots ,n \),
\( \phi_{i} \) determines the dependency between \( X_{0} \) and \( X_{i} \). \( \phi_{i} = 1 \) means perfect dependency, and thus \( \phi_{0} = 1 \).
Then,
1. \( (X_{0} , \ldots ,X_{n} ) \) is an (n + 1)-dimensional Gumbel vector, each marginal of which is Gumbel distributed with mode 0 and dispersion 1.
2. The two-dimensional CDF of \( (X_{i} ,X_{j} ) \) is
$$ F_{{X_{i} ,X_{j} }} (x,y) = \exp \left( { - \max \left\{ {{\text{e}}^{ - x} + (1 - \phi_{j} ){\text{e}}^{ - y} ,(1 - \phi_{i} ){\text{e}}^{ - x} + {\text{e}}^{ - y} } \right\}} \right). $$
Also, the (n + 1)-dimensional CDF is
$$ F_{{X_{0} , \ldots ,X_{n} }} \left( {x_{0} , \ldots ,x_{n} } \right) = \exp \left( { - \max_{i = 0, \ldots ,n} \left\{ {{\text{e}}^{{ - x_{i} }} + \sum\limits_{j = 1,j \ne i}^{n} {\left( {1 - \phi_{j} } \right){\text{e}}^{{ - x_{j} }} } } \right\}} \right). $$
3. The correlation coefficient between \( X_{i} \) and \( X_{j} \) is
$$ \rho_{ij} = - 6\pi^{ - 2} \left( {\int\limits_{ - \infty }^{{\log (\phi_{j} /\phi_{i} )}} {\log \left( {{\frac{{1 + (1 - \phi_{i} ){\text{e}}^{w} }}{{1 + {\text{e}}^{w} }}}} \right)} {\text{d}}w + \int\limits_{{\log (\phi_{j} /\phi_{i} )}}^{\infty } {\log \left( {{\frac{{(1 - \phi_{j} ) + {\text{e}}^{w} }}{{1 + {\text{e}}^{w} }}}} \right)} {\text{d}}w} \right). $$
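The proposition can be checked numerically. The sketch below assumes that each dependent variable is built as the maximum of two shifted Gumbel variables, namely the shared variable plus log φ and an independent variable plus log(1 − φ), with 0 < φ ≤ 1. This construction is an assumption, chosen because it reproduces the standard Gumbel marginals in item 1 and the bivariate CDF in item 2. The code simulates a pair, computes its sample correlation, and evaluates the integral in item 3 with a simple midpoint rule, reading the numerator of the second integrand as (1 − φ_j) + e^w. All function names, the sample size and the integration limits are illustrative.

```python
import math
import random


def gumbel(rng):
    """Draw a standard Gumbel variate (mode 0, dispersion 1) by inversion."""
    u = rng.random()
    while u <= 0.0:  # random() lies in [0, 1); avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def draw_pair(phi_i, phi_j, rng):
    """Draw (X_i, X_j) sharing one common Gumbel component X_0.

    Assumed construction: X = max(X_0 + log(phi), Z + log(1 - phi)),
    with X = X_0 when phi = 1.
    """
    x0 = gumbel(rng)

    def build(phi):
        if phi >= 1.0:
            return x0
        return max(x0 + math.log(phi), gumbel(rng) + math.log(1.0 - phi))

    return build(phi_i), build(phi_j)


def sample_correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def theoretical_rho(phi_i, phi_j, lo=-40.0, hi=40.0, steps=20000):
    """Evaluate the correlation integral of item 3 with a midpoint rule."""
    cut = math.log(phi_j / phi_i)

    def integrand(w):
        ew = math.exp(w)
        if w < cut:
            numerator = 1.0 + (1.0 - phi_i) * ew
        else:
            numerator = (1.0 - phi_j) + ew
        return math.log(numerator / (1.0 + ew))

    h = (hi - lo) / steps
    total = h * sum(integrand(lo + (k + 0.5) * h) for k in range(steps))
    return -6.0 / math.pi ** 2 * total


# Illustrative comparison for X_0 (phi = 1) and one dependent variable.
demo_rng = random.Random(2010)
demo_pairs = [draw_pair(1.0, 0.5, demo_rng) for _ in range(20000)]
print("simulated:", round(sample_correlation(demo_pairs), 3))
print("integral: ", round(theoretical_rho(1.0, 0.5), 3))
```

Under these assumptions the simulated and integrated values should agree up to sampling error, and the integral tends to 1 as both dependency parameters approach 1.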
Proof
(1)
(2)
If \( y - x \ge \log \left( {\phi_{j} /\phi_{i} } \right) \), i.e. \( \phi_{i} e^{ - x} \ge \phi_{j} e^{ - y} \),
If \( y - x < \log (\phi_{j} /\phi_{i} ) \),
Thus, \( F_{{X_{i} ,X_{j} }} (x,y) = \exp \left( { - \max \left\{ {{\text{e}}^{ - x} + (1 - \phi_{j} ){\text{e}}^{ - y} ,(1 - \phi_{i} ){\text{e}}^{ - x} + {\text{e}}^{ - y} } \right\}} \right). \)
Let us consider the (n + 1)-dimensional CDF.
(3) Note that
with
The correlation coefficient between \( X_{i} \) and \( X_{j} \) is
Since \( \phi_{0} = 1 \), the correlation coefficient between \( X_{0} \) and \( X_{i} \) is derived as
Garrow, L.A., Bodea, T.D. & Lee, M. Generation of synthetic datasets for discrete choice analysis. Transportation 37, 183–202 (2010). https://doi.org/10.1007/s11116-009-9228-6
DOI: https://doi.org/10.1007/s11116-009-9228-6