
On Model Selection Algorithms in Multi-dimensional Contingency Tables

Chapter in: Statistical Modelling in Biostatistics and Bioinformatics

Part of the book series: Contributions to Statistics (CONTRIB.STAT.)


Abstract

We present a review focussed on model selection in log-linear models and contingency tables. Sparsity and high-dimensionality have grown in importance, for example, in the context of high-throughput genetic data. In particular, we describe recently developed automatic search algorithms in R for finding optimal hierarchical log-linear models (HLLMs) in sparse multi-dimensional contingency tables, together with some LASSO-type penalized likelihood model selection approaches. The methods rely, in part, on a new result which identifies non-existent maximum likelihood estimators in high-dimensional tables and thus permits their rapid elimination.
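To give a feel for why non-existent maximum likelihood estimators matter during a model search, the following Python sketch may help. It does not reproduce the chapter's new result, which is not stated in this abstract. It only illustrates a classical sufficient condition: if a marginal total fitted by a hierarchical log-linear model is zero, a strictly positive MLE cannot exist for that model. The function name `zero_margins`, the table, and the generator sets are all hypothetical and chosen purely for illustration.

```python
from itertools import product


def zero_margins(counts, shape, generators):
    """Return (generator, margin_cell) pairs whose marginal total is zero.

    counts: dict mapping full cell index tuples to observed counts
            (cells that are absent are treated as zero).
    shape: number of levels of each variable.
    generators: tuples of variable indices, e.g. (0, 1) for the AB margin.
    """
    empty = []
    for gen in generators:
        # Start every cell of this marginal table at zero.
        margin = {cell: 0 for cell in product(*(range(shape[v]) for v in gen))}
        # Sum the observed counts over the variables not in the generator.
        for cell, n in counts.items():
            margin[tuple(cell[v] for v in gen)] += n
        empty.extend((gen, cell) for cell, n in margin.items() if n == 0)
    return empty


# Hypothetical sparse 2 x 2 x 2 table (variables A, B, C indexed 0, 1, 2).
counts = {(0, 0, 0): 5, (0, 1, 0): 3, (1, 1, 0): 2, (1, 1, 1): 4, (0, 1, 1): 1}
shape = (2, 2, 2)

# Generators of the no-three-factor-interaction model [AB][AC][BC].
pairwise = [(0, 1), (0, 2), (1, 2)]
# Generators of the mutual independence model [A][B][C].
independence = [(0,), (1,), (2,)]

flagged_pairwise = zero_margins(counts, shape, pairwise)
flagged_independence = zero_margins(counts, shape, independence)
print(flagged_pairwise)
print(flagged_independence)
```

In this toy table the pairwise model has empty cells in two of its fitted margins, while the independence model has none, so a search could discard the former before any fitting is attempted. The condition is only sufficient: a model with all fitted margins positive can still fail to have an MLE, which is why sharper existence results are of interest in sparse tables.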



Acknowledgements

The work in this paper was conducted in the Centre of Biostatistics, Limerick, Ireland, and supported by the Science Foundation Ireland (SFI, www.sfi.ie), project grant number 07/MI/012 (BIO-SI project, www3.ul.ie/bio-si). The first author’s Ph.D. scholarship was supported by GlaxoSmithKline, England, UK.


Corresponding author

Correspondence to Susana Conde.


Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Conde, S., MacKenzie, G. (2014). On Model Selection Algorithms in Multi-dimensional Contingency Tables. In: MacKenzie, G., Peng, D. (eds) Statistical Modelling in Biostatistics and Bioinformatics. Contributions to Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-04579-5_15
