An unbiased model comparison test using cross-validation

Quality & Quantity

Abstract

Social scientists often consider multiple empirical models of the same process. When these models are parametric and non-nested, the null hypothesis that two models fit the data equally well is commonly tested using methods introduced by Vuong (Econometrica 57(2):307–333, 1989) and Clarke (Am J Political Sci 45(3):724–744, 2001; J Confl Resolut 47(1):72–93, 2003; Political Anal 15(3):347–363, 2007). The objective of each is to compare the Kullback–Leibler divergence (KLD) of the two models from the true model that generated the data. Here we show that both tests rely on the individual log-likelihood contributions, which yield a biased estimator of the KLD, and that the Clarke test is not proven to be consistent for the difference in KLDs. As a solution, we derive the cross-validated difference in means (CVDM) test, which is based upon cross-validated log-likelihood contributions and therefore on an unbiased KLD estimate. We demonstrate the CVDM test’s superior performance via simulation, then apply it to two empirical examples from political science. We find that the test’s selections can diverge from those of the Vuong and Clarke tests and that this can ultimately lead to differences in substantive conclusions.
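
The idea of comparing models through cross-validated log-likelihood contributions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two intercept-only models (normal and Laplace), the function names, and the plain one-sample t statistic with a normal-approximation p-value are all assumptions made for illustration, and the exact test statistic used by the CVDM test is not given in this section.

```python
import math
import numpy as np


def loo_loglik(y, model):
    """Leave-one-out log-likelihood contributions for an intercept-only model.

    model="normal":  ML fit is the mean and the (ddof=0) standard deviation.
    model="laplace": ML fit is the median and the mean absolute deviation from it.
    """
    y = np.asarray(y, dtype=float)
    out = np.empty(y.size)
    for i in range(y.size):
        rest = np.delete(y, i)  # training data without observation i
        if model == "normal":
            loc, scale = rest.mean(), rest.std()
            out[i] = (-0.5 * math.log(2 * math.pi * scale**2)
                      - (y[i] - loc) ** 2 / (2 * scale**2))
        elif model == "laplace":
            loc = np.median(rest)
            scale = np.mean(np.abs(rest - loc))
            out[i] = -math.log(2 * scale) - abs(y[i] - loc) / scale
        else:
            raise ValueError("model must be 'normal' or 'laplace'")
    return out


def cv_difference_test(y):
    """Mean difference in held-out log-likelihoods (normal minus Laplace).

    Returns the mean difference, a one-sample t statistic, and a two-sided
    p-value from the normal approximation (an illustrative stand-in only).
    """
    d = loo_loglik(y, "normal") - loo_loglik(y, "laplace")
    t = d.mean() / (d.std(ddof=1) / math.sqrt(d.size))
    p = math.erfc(abs(t) / math.sqrt(2))
    return d.mean(), t, p


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sample = rng.laplace(0.0, 1.0, size=500)  # data truly Laplace distributed
    print(cv_difference_test(sample))
```

A negative mean difference favors the Laplace model and a positive one favors the normal model. Because each contribution is evaluated on an observation that was held out of estimation, the average does not share the in-sample optimism discussed in the abstract.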

Notes

  1. According to Google Scholar, Vuong (1989) has been cited approximately 2,400 times and the relatively more recent work by Clarke (2001, 2003, 2007) has garnered a combined 229 citations.

  2. Moreover, unlike other information-theoretic model comparison criteria such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), the Vuong and Clarke tests can be used to test hypotheses about the equivalence of model fit in the same way one would use the \(F\) or likelihood ratio tests with nested models.

  3. It should be noted that Vuong (1989) is very clear that all of his results are in the limit, focusing on consistency rather than bias.

  4. We discovered this example by starting with two misspecified models and varying parameters in the data generating process until we arrived at a simulation-based proof that it is possible for the signs of \(\tilde{\mu }(l^{(d)}_i)\) and \(E[l^{(d)}_i]\) to be different.

  5. It may seem odd to see partial degrees of freedom because the \(t\)-distribution is often used with reference to the number of observations less the number of parameters estimated (i.e., an integer). However, the \(t\)-distribution is a valid probability distribution for any \(df > 0\). This interval for \(df\) is chosen to produce the divergence in the sign of \(\tilde{\mu }(l^{(d)}_i)\) and \(E[l^{(d)}_i]\).

  6. The Laplace distribution is a symmetric, unbounded continuous distribution that has significantly heavier tails than the normal distribution (Clarke 2007). The MLE of the regression parameters with a Laplace distributed error term is equivalent to the estimate of the coefficients in median regression (Koenker 2005).

  7. We attempted to depict 95% confidence intervals around the mean estimates of \(\tilde{\mu }(l^{(d)}_i)\) and \(E[l^{(d)}_i]\) over the 10,000 iterations, but it was impossible to distinguish them on the graph.

  8. Though this may seem somewhat restrictive, note that the general method of cross-validation can be used to conduct model comparison outside of ML estimators (see Diebold and Mariano 2002).

  9. For examples using the Vuong test, see Mebane and Sekhon (2002), Abbe et al. (2003), Mondak and Sanders (2005), Bailey (2007), Shellman and Stewart (2007), and Konisky and Woods (2009). For those employing the Clarke test see Souva (2005), Boockmann (2006), and Travis (2010).

  10. Another possibility is MR, which appears in one of our simulation examples above. For simplicity we only focus on the choice between OLS and RR here, though results do not change if we compare OLS and MR.

  11. Following Lange et al. (1989) and Western (1995), we set \(\nu \) = 4. However, this parameter could also be estimated from the data (Western 1995). In fact, analysts could use our CVDM test to compare a model in which \(\nu \) is estimated to a model setting \(\nu \) a priori.

  12. The distributions are fit to the data by ML. The estimated parameters are the mean and variance of the normal distribution and the median and dispersion parameter for the \(t\).

  13. We replicated each model exactly. All coefficients are standardized to allow ease of presentation.

  14. More formally, the skewness of these values is a statistically significant 1.05. The individual cross-validated log-likelihoods also exhibit this skewness.

  15. The Clarke test, however, makes the same selection as the CVDM test in this case.

  16. It may sound odd to state the “expectation of the expected likelihood”, but this conveys the fact that the expected log-likelihood varies with the sample mean, resulting in the need for an outer expectation taken over the sampling distribution of the mean.

References

  • Abbe, O.G., Goodliffe, J., Herrnson, P.S., Patterson, K.D.: Agenda setting in Congressional elections: the impact of issues and campaigns on voting behavior. Political Res. Q. 56(4), 419–430 (2003)

  • Achen, C.H.: Let’s put garbage-can regressions and garbage-can probits where they belong. Confl. Manag. Peace Sci. 22(4), 327–339 (2005)

  • Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19(6), 716–723 (1974)

  • Amaral, M.A., Dunsmore, I.R.: Optimal estimates of predictive distributions. Biometrika 67(3), 685–689 (1980)

  • Bailey, M.A.: Comparable preference estimates across time and institutions for the court, Congress, and presidency. Am. J. Political Sci. 51(3), 433–448 (2007)

  • Boockmann, B.: Partisan politics and treaty ratification: the acceptance of international labour organisation conventions by industrialised democracies, 1960–1996. Eur. J. Political Res. 45(1), 153–180 (2006)

  • Chaffin, W.W., Rhiel, S.G.: The effect of skewness and kurtosis on the one-sample \(t\) test and the impact of knowledge of the population standard deviation. J. Stat. Comput. Simul. 46(1), 79–90 (1993)

  • Clarke, K.A.: Testing nonnested models of international relations: reevaluating realism. Am. J. Political Sci. 45(3), 724–744 (2001)

  • Clarke, K.A.: Nonparametric model discrimination in international relations. J. Confl. Resolut. 47(1), 72–93 (2003)

  • Clarke, K.A.: A simple distribution-free test for nonnested hypotheses. Political Anal. 15(3), 347–363 (2007)

  • Diebold, F.X., Mariano, R.S.: Comparing predictive accuracy. J. Bus. Econ. Stat. 20(1), 134–144 (2002)

  • Gilula, Z., Haberman, S.J.: Density approximation by summary statistics: an information-theoretic approach. Scand. J. Stat. 27(3), 521–534 (2000)

  • Greene, W.H.: Econometric Analysis, 6th edn. Prentice Hall, Upper Saddle River (2008)

  • Hall, P.: On Kullback–Leibler loss and density estimation. Ann. Stat. 15(4), 1491–1519 (1987)

  • Johnson, N.J.: Modified \(t\) tests and confidence intervals for asymmetrical populations. J. Am. Stat. Assoc. 73(363), 536–544 (1978)

  • Joshi, M., Mason, T.D.: Between democracy and revolution: peasant support for insurgency versus democracy in Nepal. J. Peace Res. 45(6), 765–782 (2008)

  • Koenker, R.: Quantile Regression. Cambridge University Press, New York (2005)

  • Konishi, S., Kitagawa, G.: Generalised information criteria in model selection. Biometrika 83(4), 875–890 (1996)

  • Konisky, D.M., Woods, N.D.: Exporting air pollution? Regulatory enforcement and environmental free riding in the United States. Political Res. Q. 63(4), 771–782 (2010)

  • Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)

  • Lange, K.L., Little, R.J.A., Taylor, J.M.G.: Robust statistical modeling using the \(t\) distribution. J. Am. Stat. Assoc. 84(408), 881–896 (1989)

  • Mebane, W.R., Sekhon, J.S.: Coordination and policy moderation at midterm. Am. Political Sci. Rev. 96(1), 141–157 (2002)

  • Mondak, J.J., Sanders, M.S.: The complexity of tolerance and intolerance judgments: a response to Gibson. Political Behav. 27(4), 325–337 (2005)

  • Palazzolo, D.J., Moscardelli, V.G.: Policy crisis and political leadership: election law reform in the states after the 2000 presidential election. State Politics Policy Q. 6(3), 300–321 (2006)

  • Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423; 27(4), 623–656 (1948)

  • Shellman, S.M., Stewart, B.M.: Political persecution or economic deprivation? A time-series analysis of Haitian exodus, 1990–2004. Confl. Manag. Peace Sci. 24(2), 121–137 (2007)

  • Smyth, P.: Model selection for probabilistic clustering using cross-validated likelihood. Stat. Comput. 10(1), 63–72 (2000)

  • Souva, M.: Foreign policy determinants: comparing realist and domestic-political models of foreign policy. Confl. Manag. Peace Sci. 22(2), 149–163 (2005)

  • Travis, R.: Problems, politics, and policy streams: a reconsideration US foreign aid behavior toward Africa. Int. Stud. Q. 54(3), 797–821 (2010)

  • Vuong, Q.H.: Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57(2), 307–333 (1989)

  • Ward, M.D., Greenhill, B.D., Bakke, K.M.: The perils of policy by \(p\)-value: predicting civil conflicts. J. Peace Res. 47(4), 363–375 (2010)

  • Western, B.: Concepts and suggestions for robust regression analysis. Am. J. Political Sci. 39(3), 786–817 (1995)

Author information

Corresponding author

Correspondence to Jeffrey J. Harden.

Appendix: Proof of Vuong test finite sample bias

Here we derive the inequality given in Eq. 3 of the main text. Suppose \(\varvec{y}\) is a sample of \(n\) independent observations from a normal distribution with zero mean and variance \(\tau ^2\). Also, let \(g\) be a normal probability density function with the mean estimated as the sample mean of \(\varvec{y}\), and the variance fixed at \(\sigma ^2\). Then the expected value of the observed log-likelihood is

$$\begin{aligned} E[ll_o]&= E_{\varvec{y}}\left[ \frac{1}{n} \ln \left( \prod _{i=1}^n \frac{1}{\sqrt{2\pi \sigma ^2}} \exp \left[ \frac{-1}{2\sigma ^2} \left( y_i - \frac{1}{n}\sum _{j=1}^n y_j \right) ^2\right] \right) \right] \nonumber \\&= -\ln \left( \sqrt{2 \pi \sigma ^2} \right) - \frac{1}{2\sigma ^2n} E_{\varvec{y}} \left[ \sum _{i=1}^n \left( y_i - \frac{1}{n}\sum _{j=1}^n y_j \right) ^2 \right] \nonumber \\&= -\ln \left( \sqrt{2 \pi \sigma ^2} \right) - \frac{\tau ^2(n-1)}{2\sigma ^2n}. \end{aligned}$$
(8)

The expected value of the expected log-likelihood is (see Note 16)

$$\begin{aligned} E[ll_e]&= E_{\bar{y}}\left[ E_{\varvec{y}}\left( \frac{1}{n} \sum _{i=1}^n \left[ \ln \frac{1}{\sqrt{2 \pi \sigma ^2}} - \frac{(y_i - \bar{y})^2}{2\sigma ^2} \right] \right) \right] \\&= -\ln \left( \sqrt{2 \pi \sigma ^2} \right) - \frac{1}{2\sigma ^2} E_{\bar{y}}\left[ E_{\varvec{y}}\left( \frac{1}{n} \sum _{i=1}^n \left( y_i^2 -2y_i\bar{y}+\bar{y}^2 \right) \right) \right] \\&= -\ln \left( \sqrt{2 \pi \sigma ^2} \right) - \frac{1}{2\sigma ^2} E_{\bar{y}}\left[ \tau ^2 + \bar{y}^2 \right] \\&= -\ln \left( \sqrt{2 \pi \sigma ^2} \right) - \frac{\tau ^2+\frac{\tau ^2}{n}}{2\sigma ^2}. \end{aligned}$$
(9)
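
Eqs. 8 and 9 can be checked numerically. The following Python sketch is an illustration rather than part of the original derivation; the function names, the seed, and the simulation settings are assumptions. It evaluates both closed forms and compares them with Monte Carlo averages. Subtracting Eq. 9 from Eq. 8 gives \(\tau ^2/(\sigma ^2 n)\), the amount by which the observed log-likelihood overstates the expected log-likelihood in this example.

```python
import numpy as np


def expected_observed_ll(tau2, sigma2, n):
    """Closed form of Eq. 8: expected in-sample mean log-likelihood."""
    return -0.5 * np.log(2 * np.pi * sigma2) - tau2 * (n - 1) / (2 * sigma2 * n)


def expected_expected_ll(tau2, sigma2, n):
    """Closed form of Eq. 9: expected log-likelihood of a fresh observation."""
    return -0.5 * np.log(2 * np.pi * sigma2) - (tau2 + tau2 / n) / (2 * sigma2)


def simulate(tau2, sigma2, n, reps=200_000, seed=0):
    """Monte Carlo averages corresponding to Eqs. 8 and 9."""
    rng = np.random.default_rng(seed)
    # Each row is one sample of n draws from N(0, tau2).
    y = rng.normal(0.0, np.sqrt(tau2), size=(reps, n))
    ybar = y.mean(axis=1, keepdims=True)
    const = -0.5 * np.log(2 * np.pi * sigma2)
    # Observed: the same data are used to estimate the mean and to evaluate fit.
    ll_o = const - ((y - ybar) ** 2).mean(axis=1) / (2 * sigma2)
    # Expected: for a fresh draw, E[(y_new - ybar)^2 | ybar] = tau2 + ybar^2.
    ll_e = const - (tau2 + ybar[:, 0] ** 2) / (2 * sigma2)
    return ll_o.mean(), ll_e.mean()


if __name__ == "__main__":
    tau2, sigma2, n = 1.0, 1.5, 10
    print("closed form:", expected_observed_ll(tau2, sigma2, n),
          expected_expected_ll(tau2, sigma2, n))
    print("simulation: ", simulate(tau2, sigma2, n))
```

The simulated averages should agree with the closed forms up to Monte Carlo error, and the gap between the two shrinks at rate \(1/n\).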

Now, considering two different values of \(\sigma ^2\), \(\sigma _1^2\) and \(\sigma _2^2\) with \(\sigma _1^2 < \sigma _2^2\), \(E_1[ll_o] > E_2[ll_o]\) iff

$$\begin{aligned} -\ln \left( \sqrt{2 \pi \sigma _1^2} \right) - \frac{\tau ^2(n-1)}{2\sigma _1^2n}&> -\ln \left( \sqrt{2 \pi \sigma _2^2} \right) - \frac{\tau ^2(n-1)}{2\sigma _2^2n}\nonumber \\ \tau ^2&< \frac{\ln \left( \sigma ^2_2 \right) -\ln \left( \sigma _1^2 \right) }{\frac{n-1}{n}\left( \frac{1}{\sigma _1^2}-\frac{1}{\sigma _2^2}\right) }, \end{aligned}$$
(10)

and \(E_1[ll_e] < E_2[ll_e]\) iff

$$\begin{aligned} -\ln \left( \sqrt{2 \pi \sigma _1^2} \right) - \frac{\tau ^2+\frac{\tau ^2}{n}}{2\sigma _1^2}&< -\ln \left( \sqrt{2 \pi \sigma _2^2} \right) - \frac{\tau ^2+\frac{\tau ^2}{n}}{2\sigma _2^2}\nonumber \\ \frac{\ln \left( \sigma ^2_2 \right) -\ln \left( \sigma _1^2 \right) }{\left[ 1+\frac{1}{n}\right] \left( \frac{1}{\sigma _1^2} -\frac{1}{\sigma _2^2}\right) }&< \tau ^2. \end{aligned}$$
(11)

Combining these two conditions gives the interval from Eq. 3.
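
Written out, the conditions in Eqs. 10 and 11 hold simultaneously when

$$\begin{aligned} \frac{\ln \left( \sigma _2^2 \right) -\ln \left( \sigma _1^2 \right) }{\left[ 1+\frac{1}{n}\right] \left( \frac{1}{\sigma _1^2}-\frac{1}{\sigma _2^2}\right) }< \tau ^2 < \frac{\ln \left( \sigma _2^2 \right) -\ln \left( \sigma _1^2 \right) }{\frac{n-1}{n}\left( \frac{1}{\sigma _1^2}-\frac{1}{\sigma _2^2}\right) }. \end{aligned}$$

Eq. 3 itself is not reproduced in this appendix, so the display above is only the direct combination of the two inequalities. A short Python sketch, with hypothetical function names and example values chosen purely for illustration, computes the bounds and confirms that a value of \(\tau ^2\) inside the interval reverses the ranking of the two models.

```python
import math


def tau2_interval(sigma1_sq, sigma2_sq, n):
    """Bounds on tau^2 implied by Eq. 11 (lower) and Eq. 10 (upper)."""
    if not 0 < sigma1_sq < sigma2_sq:
        raise ValueError("requires 0 < sigma1_sq < sigma2_sq")
    if n < 2:
        raise ValueError("requires n >= 2")
    base = (math.log(sigma2_sq) - math.log(sigma1_sq)) / (1 / sigma1_sq - 1 / sigma2_sq)
    lower = base / (1 + 1 / n)      # Eq. 11
    upper = base / ((n - 1) / n)    # Eq. 10
    return lower, upper


def expected_observed_ll(tau2, sigma_sq, n):
    """Eq. 8."""
    return -0.5 * math.log(2 * math.pi * sigma_sq) - tau2 * (n - 1) / (2 * sigma_sq * n)


def expected_expected_ll(tau2, sigma_sq, n):
    """Eq. 9."""
    return -0.5 * math.log(2 * math.pi * sigma_sq) - (tau2 + tau2 / n) / (2 * sigma_sq)


if __name__ == "__main__":
    s1, s2, n = 1.0, 2.0, 10        # illustrative values, not from the article
    lo, hi = tau2_interval(s1, s2, n)
    tau2 = 0.5 * (lo + hi)          # any value strictly inside the interval
    print("interval:", lo, hi)
    # Model 1 looks better in-sample ...
    print(expected_observed_ll(tau2, s1, n) > expected_observed_ll(tau2, s2, n))
    # ... but model 2 is better in expectation.
    print(expected_expected_ll(tau2, s1, n) < expected_expected_ll(tau2, s2, n))
```

The interval narrows as \(n\) grows, since both bounds converge to the same limit, which is consistent with the bias being a finite-sample phenomenon.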

Cite this article

Desmarais, B.A., Harden, J.J. An unbiased model comparison test using cross-validation. Qual Quant 48, 2155–2173 (2014). https://doi.org/10.1007/s11135-013-9884-7
