On Model Selection Algorithms in Multi-dimensional Contingency Tables

Conde, Susana; MacKenzie, Gilbert

doi:10.1007/978-3-319-04579-5_15

Part of the book series: Contributions to Statistics ((CONTRIB.STAT.))

Abstract

We present a review focussed on model selection in log-linear models and contingency tables. Sparsity and high-dimensionality have grown in importance, for example in the analysis of high-throughput genetic data. In particular, we describe recently developed automatic search algorithms, implemented in R, for finding optimal hierarchical log-linear models (HLLMs) in sparse multi-dimensional contingency tables, together with some LASSO-type penalized likelihood approaches to model selection. The methods rely, in part, on a new result which identifies non-existent maximum likelihood estimators in high-dimensional tables and thus permits their rapid elimination.
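
To make the idea of a non-existent maximum likelihood estimator concrete, the sketch below applies the classical zero-margin check. It is not the new result referred to in the abstract, which is not stated on this page. If any marginal total fitted by a hierarchical log-linear model is zero, the corresponding fitted values are zero and the log-linear parameters have no finite maximum likelihood estimates. The function name `zero_margins`, the dictionary representation of the table and the example counts are all assumptions made for illustration, and Python is used here only for convenience.

```python
from itertools import product


def zero_margins(cells, generators):
    """Return (generator, levels) pairs whose marginal total is zero.

    cells: dict mapping a tuple of factor levels to an observed count;
           every cell of the table must be present, including zeros.
    generators: tuples of factor indices forming the generating class
                of a hierarchical log-linear model.
    """
    n_factors = len(next(iter(cells)))
    levels = [sorted({cell[i] for cell in cells}) for i in range(n_factors)]
    empty = []
    for gen in generators:
        # Accumulate the marginal table fitted by this generator.
        totals = {combo: 0 for combo in product(*(levels[i] for i in gen))}
        for cell, count in cells.items():
            totals[tuple(cell[i] for i in gen)] += count
        empty.extend((gen, combo) for combo, total in totals.items() if total == 0)
    return empty


# Hypothetical 2x2x2 table (factors A, B, C) with several sampling zeros.
table = {cell: 0 for cell in product((0, 1), repeat=3)}
table.update({(0, 0, 0): 12, (0, 1, 0): 5, (1, 0, 0): 7, (1, 0, 1): 3, (0, 0, 1): 4})

# Model [AB][BC]: both fitted margins contain an empty cell.
print(zero_margins(table, [(0, 1), (1, 2)]))
# Main-effects model [A][B][C]: no fitted margin is empty.
print(zero_margins(table, [(0,), (1,), (2,)]))
```

The check is only a sufficient condition for non-existence: a table whose fitted margins are all positive can still fail to have a maximum likelihood estimate, which is why sharper results are needed when searching over many models in a sparse table.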

Acknowledgements

The work in this paper was conducted in the Centre of Biostatistics, Limerick, Ireland, and supported by the Science Foundation Ireland (SFI, www.sfi.ie), project grant number 07/MI/012 (BIO-SI project, www3.ul.ie/bio-si). The first author’s Ph.D. scholarship was supported by GlaxoSmithKline, England, UK.

Author information

Authors and Affiliations

Department of Mathematics, Imperial College, London, UK
Susana Conde
The Centre for Biostatistics, University of Limerick, Limerick, Ireland
Gilbert MacKenzie

Corresponding author

Correspondence to Susana Conde.

Editor information

Editors and Affiliations

Centre of Biostatistics, University of Limerick, Limerick, Ireland
Gilbert MacKenzie
Centre of Biostatistics, University of Limerick, Limerick, Ireland
Defen Peng

About this chapter

Cite this chapter

Conde, S., MacKenzie, G. (2014). On Model Selection Algorithms in Multi-dimensional Contingency Tables. In: MacKenzie, G., Peng, D. (eds) Statistical Modelling in Biostatistics and Bioinformatics. Contributions to Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-04579-5_15

DOI: https://doi.org/10.1007/978-3-319-04579-5_15
Published: 13 March 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04578-8
Online ISBN: 978-3-319-04579-5
eBook Packages: Mathematics and Statistics (R0)
