Abstract
We review model selection in log-linear models and contingency tables. Sparsity and high dimensionality have grown in importance, for example in the analysis of high-throughput genetic data. In particular, we describe recently developed automatic search algorithms, implemented in R, for finding optimal hierarchical log-linear models (HLLMs) in sparse multi-dimensional contingency tables, together with some LASSO-type penalized likelihood approaches to model selection. The methods rely, in part, on a new result that identifies cases in which the maximum likelihood estimators do not exist and thus permits their rapid elimination in high-dimensional tables.
Acknowledgements
The work in this paper was conducted in the Centre of Biostatistics, Limerick, Ireland, and supported by Science Foundation Ireland (SFI, www.sfi.ie), project grant number 07/MI/012 (BIO-SI project, www3.ul.ie/bio-si). The first author’s Ph.D. scholarship was supported by GlaxoSmithKline, England, UK.
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Conde, S., MacKenzie, G. (2014). On Model Selection Algorithms in Multi-dimensional Contingency Tables. In: MacKenzie, G., Peng, D. (eds) Statistical Modelling in Biostatistics and Bioinformatics. Contributions to Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-04579-5_15
DOI: https://doi.org/10.1007/978-3-319-04579-5_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04578-8
Online ISBN: 978-3-319-04579-5
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)