Generation of synthetic datasets for discrete choice analysis

Published in Transportation

Abstract

Despite the widespread use of synthetic data in discrete choice analysis, little is known about how the methodology used to generate synthetic datasets influences the properties of parameter estimates and the validity of results based on these estimates. There are two potential sources of bias when using synthetic discrete choice data: (1) bias due to the method used to generate the dataset; and (2) bias due to parameter estimation. The primary objective of this study is to examine bias due to the underlying data generation method. This study compares three methods for generating synthetic datasets and uses design of experiments and analysis of variance methods to investigate how well each method recovers the “true” logsum parameters of nested logit models. The method that uses nested logit probabilities to generate the chosen alternative results in unbiased parameter estimates. The method that is based on Gumbel error component approximations reveals that, while the error components themselves are unbiased, subtle empirical identification problems can arise when these error components are combined with synthetically generated utility functions. The method that is based on normal error component approximations reveals that all logsum coefficients are biased upwards; the bias increases dramatically for nests that have a low choice frequency and is most pronounced for nests with high correlations among alternatives. Based on the results of the analysis, several recommendations for the generation of synthetic datasets for discrete choice analyses are provided.
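To make the first approach more concrete, the following Python sketch draws chosen alternatives directly from nested logit probabilities. It is not the authors' code: it assumes a two-level nested logit in the usual utility-maximizing form with one logsum coefficient per nest in (0, 1], and the alternatives, nest names, utility values and function names are all hypothetical.

```python
import math
import random

def nested_logit_probs(utilities, nests, logsums):
    """Choice probabilities for a two-level nested logit (assumed form).

    utilities: dict alternative -> systematic utility V_i (hypothetical values)
    nests:     dict nest -> list of alternatives in that nest
    logsums:   dict nest -> logsum coefficient mu_m, with 0 < mu_m <= 1
    """
    # Inclusive value (logsum) of each nest: log of sum of exp(V_j / mu_m).
    inclusive = {}
    for m, alts in nests.items():
        mu = logsums[m]
        inclusive[m] = math.log(sum(math.exp(utilities[a] / mu) for a in alts))
    denom = sum(math.exp(logsums[m] * inclusive[m]) for m in nests)
    probs = {}
    for m, alts in nests.items():
        mu = logsums[m]
        p_nest = math.exp(mu * inclusive[m]) / denom            # P(nest m)
        for a in alts:
            p_cond = math.exp(utilities[a] / mu - inclusive[m])  # P(a | m)
            probs[a] = p_nest * p_cond
    return probs

def draw_choice(probs, rng):
    """Draw one chosen alternative by inverting the cumulative probabilities."""
    u = rng.random()
    cumulative = 0.0
    for alt, p in probs.items():
        cumulative += p
        if u < cumulative:
            return alt
    return alt  # guard against floating-point round-off in the last step

# Hypothetical example: four alternatives in two nests.
rng = random.Random(0)
utilities = {"a": 0.5, "b": 0.2, "c": 0.0, "d": -0.3}
nests = {"n1": ["a", "b"], "n2": ["c", "d"]}
logsums = {"n1": 0.8, "n2": 0.7}
probs = nested_logit_probs(utilities, nests, logsums)
choices = [draw_choice(probs, rng) for _ in range(20000)]
shares = {a: choices.count(a) / len(choices) for a in probs}
```

In this sketch the simulated choice shares should approach the model probabilities as the number of draws grows, and setting every logsum coefficient to 1 collapses the probabilities to the multinomial logit form.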

Notes

  1. If \( \mu = 0.8 \) and \( \Delta = 0.10 \), then \( \mu_{1} = 0.80 \), \( \mu_{2} = \mu_{1} - \Delta = 0.70 \) and \( \mu_{3} = \mu_{1} - 2\cdot\Delta = 0.60 \).

  2. The correlation \( \rho_{m} \) in nest m, characterized by the logsum coefficient \( \mu_{m} \), is calculated as \( \rho_{m} = 1 - \mu_{m}^{2} \). This means that for a treatment defined by \( \mu \) and \( \Delta \), the correlation \( \rho_{m} \), \( m \ne 1 \), is given as \( \rho_{m} = 1 - [\mu - (m - 1) \times \Delta ]^{2} = \rho_{1} + (m - 1) \times \Delta \times [2\mu - (m - 1) \times \Delta ] \), which for the values of \( \mu \) and \( \Delta \) given in Table 1 is always greater than \( \rho_{1} \).

  3. We are grateful to Chandra Bhat and Juan de Dios Ortuzar for bringing this to our attention, as the methodologies used to generate synthetic discrete choice datasets are typically not reported in the literature.

  4. Here, the scale parameter is defined as μ where previously the scale was defined as γ.

  5. The probability density function of a C(λ) distributed variable ν is: \( P_{\lambda } (\nu ) = \frac{1}{\lambda }\sum\nolimits_{n = 0}^{\infty } \frac{( - 1)^{n} \exp ( - n\nu )}{n!\,\Gamma ( - \lambda n)} \). If ν is distributed as C(λ) and δ is a fixed scalar, then δ·ν is said to be distributed as C(λ, δ).

  6. Overestimated logsum coefficients imply that the correlation among alternatives within each nest is underestimated, because the correlation in Note 2 decreases as the logsum coefficient increases.
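The arithmetic in Notes 1 and 2 can be checked with a short Python sketch. The function names are hypothetical, and the default of three nests is an assumption taken from the three coefficients listed in Note 1.

```python
def logsum_coefficients(mu, delta, n_nests=3):
    """Logsum coefficient of nest m: mu_m = mu - (m - 1) * delta."""
    return [mu - (m - 1) * delta for m in range(1, n_nests + 1)]

def nest_correlation(mu_m):
    """Correlation among alternatives in a nest: rho_m = 1 - mu_m ** 2."""
    return 1.0 - mu_m ** 2

def nest_correlation_from_rho1(mu, delta, m):
    """Same correlation, written as rho_1 plus an increment in delta."""
    rho_1 = 1.0 - mu ** 2
    step = (m - 1) * delta
    return rho_1 + step * (2.0 * mu - step)

# Treatment from Note 1: mu = 0.8 and delta = 0.10.
mus = logsum_coefficients(0.8, 0.10)
rhos = [nest_correlation(mu_m) for mu_m in mus]
rhos_alt = [nest_correlation_from_rho1(0.8, 0.10, m) for m in (1, 2, 3)]
```

Up to floating-point rounding, the two ways of computing the correlation agree, and the correlation rises from nest 1 to nest 3 as the logsum coefficient falls.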

Acknowledgments

We are grateful to Chandra Bhat and Juan de Dios Ortuzar for their helpful comments and suggestions related to data generation procedures, and to Dave Goldsman and Melike Meterelliyoz for their constructive comments on how to relate our problem to the simulation literature. We are also grateful to the graduate students in Dr. Garrow’s Spring 2008 statistics course for running the NL models based on the probability method. Partial support for this project was provided by the National Science Foundation (SBE-0624269).

Author information

Correspondence to Laurie A. Garrow.

Appendix

Generation of correlation for multivariate Gumbel distributions

Proposition 1. Let \( X_{0} \) and \( Z_{i} \), \( i = 1, \ldots ,n \), be independent standard Gumbel random variables. Define, for \( i = 1, \ldots ,n \),

$$ X_{i} = \max \{ X_{0} + \log (\phi_{i} ),\,Z_{i} + \log (1 - \phi_{i} )\} ,\quad 0 < \phi_{i} \le 1. $$

The parameter \( \phi_{i} \) determines the dependency between \( X_{0} \) and \( X_{i} \); \( \phi_{i} = 1 \) means perfect dependency, and thus \( \phi_{0} = 1 \).

Then,

  1. \( (X_{0} , \ldots ,X_{n} ) \) is an (n + 1)-dimensional Gumbel vector, each marginal of which is Gumbel distributed with mode 0 and dispersion 1.

  2. The two-dimensional CDF of \( (X_{i} ,X_{j} ) \) is

    $$ F_{{X_{i} ,X_{j} }} (x,y) = \exp \left( { - \max \left\{ {{\text{e}}^{ - x} + (1 - \phi_{j} ){\text{e}}^{ - y} ,(1 - \phi_{i} ){\text{e}}^{ - x} + {\text{e}}^{ - y} } \right\}} \right). $$

    Also, the (n + 1)-dimensional CDF is

    $$ F_{{X_{0} , \ldots ,X_{n} }} \left( {x_{0} , \ldots ,x_{n} } \right) = \exp \left( { - \max_{i = 0, \ldots ,n} \left\{ {{\text{e}}^{{ - x_{i} }} + \sum\limits_{j = 1,j \ne i}^{n} {\left( {1 - \phi_{j} } \right){\text{e}}^{{ - x_{j} }} } } \right\}} \right). $$

  3. The correlation coefficient between \( X_{i} \) and \( X_{j} \) is

    $$ \rho_{ij} = - 6\pi^{ - 2} \left( {\int\limits_{ - \infty }^{{\log (\phi_{j} /\phi_{i} )}} {\log \left( {{\frac{{1 + (1 - \phi_{i} ){\text{e}}^{w} }}{{1 + {\text{e}}^{w} }}}} \right)} {\text{d}}w + \int\limits_{{\log (\phi_{j} /\phi_{i} )}}^{\infty } {\log \left( {{\frac{{(1 - \phi_{j} ) + {\text{e}}^{w} }}{{1 + {\text{e}}^{w} }}}} \right)} {\text{d}}w} \right). $$
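The construction in Proposition 1 translates fairly directly into a sampler. The Python sketch below is one possible reading, not the authors' implementation: the function names, the seed and the example dependency parameters are hypothetical, and standard Gumbel draws are obtained by inverting the CDF exp(-exp(-x)).

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # mean of a standard Gumbel variable

def standard_gumbel(rng):
    """Draw a standard Gumbel variate (mode 0, dispersion 1) by inversion."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))

def correlated_gumbels(phis, rng):
    """Return [X_0, X_1, ..., X_n] built as in Proposition 1."""
    x0 = standard_gumbel(rng)
    xs = [x0]
    for phi in phis:
        if not 0.0 < phi <= 1.0:
            raise ValueError("each phi must lie in (0, 1]")
        shared = x0 + math.log(phi)
        if phi == 1.0:
            # log(1 - phi) is minus infinity, so the max is X_0 itself.
            xs.append(shared)
        else:
            own = standard_gumbel(rng) + math.log(1.0 - phi)
            xs.append(max(shared, own))
    return xs

# Hypothetical dependency parameters for X_1, X_2, X_3.
rng = random.Random(42)
phis = [0.7, 0.3, 1.0]
draws = [correlated_gumbels(phis, rng) for _ in range(50000)]

# Sample mean and variance of X_1, to compare with the standard Gumbel
# values (Euler's constant and pi^2 / 6) implied by item 1.
x1 = [d[1] for d in draws]
mean_x1 = sum(x1) / len(x1)
var_x1 = sum((v - mean_x1) ** 2 for v in x1) / (len(x1) - 1)
```

If item 1 of the proposition holds, the sample mean and variance of any component should be close to Euler's constant and to π²/6, regardless of the dependency parameter chosen for that component.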

Proof

(1)

$$ \begin{aligned} P(X_{i} \le x) = & P\left( {X_{0} \le x - \log (\phi_{i} ),Z_{i} \le x - \log (1 - \phi_{i} )} \right) \\ = & \exp \left( { - {\text{e}}^{{ - (x - \log (\phi_{i} ))}} } \right)\exp \left( { - {\text{e}}^{{ - (x - \log (1 - \phi_{i} ))}} } \right) \\ = & \exp \left( { - {\text{e}}^{ - x} {\text{e}}^{{\log (\phi_{i} )}} } \right)\exp \left( { - {\text{e}}^{ - x} {\text{e}}^{{\log (1 - \phi_{i} )}} } \right) \\ = & \exp \left( { - \phi_{i} {\text{e}}^{ - x} - \left( {1 - \phi_{i} } \right){\text{e}}^{ - x} } \right) \\ = & \exp \left( { - {\text{e}}^{ - x} } \right) \\ \end{aligned} $$

(2)

$$ \begin{aligned} F_{{X_{i} ,X_{j} }} (x,y) = & P\left( {X_{0} \le x - \log (\phi_{i} ),Z_{i} \le x - \log (1 - \phi_{i} ),X_{0} \le y - \log (\phi_{j} ),Z_{j} \le y - \log (1 - \phi_{j} )} \right) \\ = & P\left( {X_{0} \le \min \left\{ {x - \log (\phi_{i} ),y - \log (\phi_{j} )} \right\}} \right)P\left( {Z_{i} \le x - \log (1 - \phi_{i} )} \right)P\left( {Z_{j} \le y - \log (1 - \phi_{j} )} \right) \\ = & \exp \left( { - {\text{e}}^{{ - \min \left\{ {x - \log (\phi_{i} ),y - \log (\phi_{j} )} \right\}}} } \right)\exp \left( { - {\text{e}}^{{ - (x - \log (1 - \phi_{i} ))}} } \right)\exp \left( { - {\text{e}}^{{ - (y - \log (1 - \phi_{j} ))}} } \right) \\ \end{aligned} $$

If \( y - x \ge \log \left( {\phi_{j} /\phi_{i} } \right) \), i.e. \( \phi_{i} e^{ - x} \ge \phi_{j} e^{ - y} \),

$$ \begin{aligned} F_{{X_{i} ,X_{j} }} (x,y) = & \exp \left( { - e^{{ - \left( {x - \log (\phi_{i} )} \right)}} } \right)\exp \left( { - e^{{ - \left( {x - \log (1 - \phi_{i} )} \right)}} } \right)\exp \left( { - e^{{ - \left( {y - \log (1 - \phi_{j} )} \right)}} } \right) \\ = & \exp \left( { - \phi_{i} e^{ - x} - \left( {1 - \phi_{i} } \right)e^{ - x} } \right)\exp \left( { - e^{{ - \left( {y - \log (1 - \phi_{j} )} \right)}} } \right) \\ = & \exp \left( { - e^{ - x} - \left( {1 - \phi_{j} } \right)e^{ - y} } \right). \\ \end{aligned} $$

If \( y - x < \log (\phi_{j} /\phi_{i} ) \),

$$ F_{{X_{i} ,X_{j} }} (x,y) = \exp \left( { - (1 - \phi_{i} ){\text{e}}^{ - x} - {\text{e}}^{ - y} } \right) $$

Thus, \( F_{{X_{i} ,X_{j} }} (x,y) = \exp \left( { - \max \left\{ {{\text{e}}^{ - x} + (1 - \phi_{j} ){\text{e}}^{ - y} ,(1 - \phi_{i} ){\text{e}}^{ - x} + {\text{e}}^{ - y} } \right\}} \right). \)

Now consider the (n + 1)-dimensional CDF. With \( \phi_{0} = 1 \),

$$ \begin{aligned} F_{{X_{0} , \ldots ,X_{n} }} (x_{0} , \ldots ,x_{n} ) = & P\left( {X_{0} \le x_{0} ,X_{1} \le x_{1} , \ldots ,X_{n - 1} \le x_{n - 1} ,X_{0} \le x_{n} - \log (\phi_{n} )} \right)P\left( {Z_{n} \le x_{n} - \log (1 - \phi_{n} )} \right) \\ = & P\left( {X_{0} \le \min_{i = 0, \ldots ,n} \{ x_{i} - \log (\phi_{i} )\} } \right)\prod\limits_{j = 1}^{n} {P\left( {Z_{j} \le x_{j} - \log (1 - \phi_{j} )} \right)} \\ = & \exp \left( { - {\text{e}}^{{ - \min_{i = 0, \ldots ,n} \{ x_{i} - \log (\phi_{i} )\} }} - \sum\limits_{j = 1}^{n} {(1 - \phi_{j} ){\text{e}}^{{ - x_{j} }} } } \right) \\ = & \exp \left( { - \max_{i = 0, \ldots ,n} \left\{ {{\text{e}}^{{ - x_{i} }} + \sum\limits_{j = 1,j \ne i}^{n} {(1 - \phi_{j} ){\text{e}}^{{ - x_{j} }} } } \right\}} \right) \\ \end{aligned} $$

(3) Note that

$$ \begin{aligned} F_{{X_{i} ,X_{j} }} (x,y) = & \exp \left( { - \max \left\{ {{\text{e}}^{ - x} + (1 - \phi_{j} ){\text{e}}^{ - y} ,(1 - \phi_{i} ){\text{e}}^{ - x} + {\text{e}}^{ - y} } \right\}} \right) \\ = & \exp \left( { - \left( {{\text{e}}^{ - x} + {\text{e}}^{ - y} } \right)k(y - x)} \right) \\ \end{aligned} $$

with

$$ k(w;\phi_{i} ,\phi_{j} ) = \left\{ {\begin{array}{*{20}c} {{\frac{{(1 - \phi_{j} ) + {\text{e}}^{w} }}{{1 + {\text{e}}^{w} }}}} & {{\text{if}}\,w \ge { \log }(\phi_{j} /\phi_{i} )} \\ {{\frac{{1 + (1 - \phi_{i} ){\text{e}}^{w} }}{{1 + {\text{e}}^{w} }}}} & {{\text{otherwise}}.} \\ \end{array} } \right. $$

The correlation coefficient between X i and X j is

$$ \begin{aligned} \rho_{ij} = & - 6\pi^{ - 2} \int\limits_{ - \infty }^{\infty } {\log (k(w))} {\text{d}}w \\ = & - 6\pi^{ - 2} \left( {\int\limits_{ - \infty }^{{\log (\phi_{j} /\phi_{i} )}} {\log \left( {{\frac{{1 + (1 - \phi_{i} ){\text{e}}^{w} }}{{1 + {\text{e}}^{w} }}}} \right)} {\text{d}}w + \int\limits_{{\log (\phi_{j} /\phi_{i} )}}^{\infty } {\log \left( {{\frac{{(1 - \phi_{j} ) + {\text{e}}^{w} }}{{1 + {\text{e}}^{w} }}}} \right)} {\text{d}}w} \right) \\ \end{aligned} $$

Since \( \phi_{0} = 1 \), the correlation coefficient between \( X_{0} \) and \( X_{i} \) reduces to

$$ \rho_{0i} = - 6\pi^{ - 2} \int\limits_{0}^{{\phi_{i} }} {(1 - t)^{ - 1} \log (t)} {\text{d}}t. $$
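As a rough numerical check of this last expression, the Python sketch below evaluates the integral with a simple midpoint rule and compares it with the sample correlation between X_0 and X_i obtained from the max construction of Proposition 1. The function names, the seed, the number of draws and the example value of the dependency parameter are hypothetical choices, and the midpoint rule is only one convenient way to handle the integrable logarithmic singularity at zero.

```python
import math
import random

def rho_0i(phi, steps=20000):
    """Midpoint-rule value of -6/pi^2 * integral_0^phi log(t)/(1 - t) dt."""
    h = phi / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h  # midpoints never hit t = 0 or t = 1
        total += math.log(t) / (1.0 - t)
    return -6.0 / math.pi ** 2 * total * h

def standard_gumbel(rng):
    """Draw a standard Gumbel variate by inverting exp(-exp(-x))."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulated_rho_0i(phi, n_draws, rng):
    """Sample correlation of X_0 and X_i for 0 < phi < 1."""
    x0s, xis = [], []
    for _ in range(n_draws):
        x0 = standard_gumbel(rng)
        z = standard_gumbel(rng)
        x0s.append(x0)
        xis.append(max(x0 + math.log(phi), z + math.log(1.0 - phi)))
    return pearson(x0s, xis)

phi = 0.5  # hypothetical dependency parameter
theory = rho_0i(phi)
empirical = simulated_rho_0i(phi, 100000, random.Random(1))
```

The two values should agree up to simulation noise, and the integral tends to 1 as the dependency parameter approaches 1, consistent with perfect dependency in that limit.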

About this article

Cite this article

Garrow, L.A., Bodea, T.D. & Lee, M. Generation of synthetic datasets for discrete choice analysis. Transportation 37, 183–202 (2010). https://doi.org/10.1007/s11116-009-9228-6
