Misspecification Resistant Model Selection Using Information Complexity with Applications

  • Hamparsum Bozdogan
  • J. Andrew Howe
  • Suman Katragadda
  • Caterina Liberati
Conference paper
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)


In this paper, we address two issues that have long plagued researchers in statistical modeling and data mining. The first is well known as the “curse of dimensionality”. Very large datasets are increasingly common, as ever more quantities are measured ever more frequently, and statistical analysis techniques developed even 50 years ago can founder on data of this size. The second issue is model misspecification, specifically an incorrectly assumed functional form. We address both issues in the context of multivariate regression modeling. To drive dimension reduction and model selection, we use the newly developed form of Bozdogan’s ICOMP, introduced in Bozdogan and Howe (2012), which penalizes models with a complexity measure of the “sandwich” model covariance matrix. The genetic algorithm uses this information criterion as its objective function in a two-step hybrid dimension reduction process. First, we use probabilistic principal component analysis to reduce the number of response variables and the number of predictor variables independently. Then, we use the genetic algorithm with the multivariate Gaussian regression model to identify the best subset regression model. We apply these methods to identify a substantially reduced multivariate regression relationship for a dataset on Italian high school students: the 29 response variables are reduced to 4, and the 46 regressors to 1.
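
The subset-selection step of the abstract can be illustrated with a rough sketch. This is not the authors' implementation: it assumes a single response variable rather than a multivariate one, omits the probabilistic PCA step, and penalizes only the heteroskedasticity-robust ("sandwich") covariance of the regression coefficients instead of the full model covariance. The complexity measure is Bozdogan's C1, C1(Σ) = (s/2) log(tr(Σ)/s) − (1/2) log|Σ|. The function names `c1_complexity`, `icomp_like_score`, and `ga_select`, and all tuning parameters, are hypothetical.

```python
import numpy as np

def c1_complexity(cov):
    """C1 complexity of a covariance matrix: (s/2) log(tr/s) - (1/2) log det."""
    s = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * s * np.log(np.trace(cov) / s) - 0.5 * logdet

def icomp_like_score(X, y, mask):
    """-2 log-likelihood + 2 * C1(sandwich covariance); lower is better.

    Simplifying assumption: univariate Gaussian regression, with the
    penalty applied to the coefficient covariance only.
    """
    n = len(y)
    Xs = np.column_stack([np.ones(n), X[:, mask]])  # intercept always kept
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    sigma2 = resid @ resid / n  # maximum likelihood variance estimate
    neg2_loglik = n * (np.log(2 * np.pi * sigma2) + 1)
    bread = np.linalg.inv(Xs.T @ Xs)
    meat = Xs.T @ (Xs * resid[:, None] ** 2)
    sandwich = bread @ meat @ bread
    return neg2_loglik + 2 * c1_complexity(sandwich)

def ga_select(X, y, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Simple genetic algorithm over boolean column masks (illustrative only)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    pop = rng.random((pop_size, p)) < 0.5
    best_mask, best_score = None, np.inf
    for _ in range(generations):
        scores = np.array([icomp_like_score(X, y, m) for m in pop])
        i = scores.argmin()
        if scores[i] < best_score:
            best_mask, best_score = pop[i].copy(), scores[i]
        # Binary tournament selection: the lower score wins each pairing.
        a, b = rng.integers(pop_size, size=(2, pop_size))
        parents = np.where((scores[a] < scores[b])[:, None], pop[a], pop[b])
        # Uniform crossover with a randomly shuffled mate.
        mates = parents[rng.permutation(pop_size)]
        cross = rng.random((pop_size, p)) < 0.5
        children = np.where(cross, parents, mates)
        # Bit-flip mutation, then elitism: keep the best mask seen so far.
        children ^= rng.random((pop_size, p)) < p_mut
        children[0] = best_mask
        pop = children
    return best_mask, best_score
```

Calling `ga_select(X, y)` on an `n × p` predictor array and a length-`n` response returns the best boolean column mask found and its score. A small tournament-and-elitism scheme is used here purely for brevity; the operators and settings used in the paper are not stated in the abstract.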


Genetic Algorithm, Multivariate Regression, Multivariate Regression Model, Model Misspecification, Statistical Analysis Technique
These keywords were added by machine and not by the authors.


  1. Box, G., & Cox, D. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B (Methodological), 26, 211–246.
  2. Bozdogan, H. (1988). ICOMP: A new model-selection criterion. In H. Bock (Ed.), Classification and related methods of data analysis (pp. 599–608). Amsterdam: Elsevier.
  3. Bozdogan, H. (2004). Intelligent statistical data mining with information complexity and genetic algorithms. In H. Bozdogan (Ed.), Statistical data mining and knowledge discovery (pp. 15–56). Boca Raton: Chapman and Hall/CRC.
  4. Bozdogan, H., & Howe, J. (2009). The curse of dimensionality in large-scale experiments using a novel hybridized dimension reduction approach (Tech. Rep. 1). The University of Tennessee.
  5. Bozdogan, H., & Howe, J. (2012). Misspecification resistant multivariate regression models using the genetic algorithm and information complexity as the fitness function. European Journal of Pure and Applied Mathematics, 5(2), 211–249.
  6. Goldberg, D. (1989). Genetic algorithms in search, optimization and machine learning. Boston: Addison-Wesley.
  7. Haupt, R., & Haupt, S. (2004). Practical genetic algorithms. Hoboken: Wiley.
  8. Holland, J. (1975). Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence. Ann Arbor: The University of Michigan Press.
  9. Kullback, S., & Leibler, R. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.
  10. Magnus, J. (2007). The asymptotic variance of the pseudo maximum likelihood estimator. Econometric Theory, 23, 1022–1032.
  11. Magnus, J., & Neudecker, H. (1988). Matrix differential calculus with applications in statistics and econometrics. New York: Wiley.
  12. Mardia, K. (1974). Applications of some measures of multivariate skewness and kurtosis in testing normality and robustness studies. Sankhya, B36, 115–128.
  13. Tipping, M., & Bishop, C. (1997). Probabilistic principal component analysis (Tech. Rep. NCRG/97/010). Neural Computing Research Group, Aston University.
  14. Van Emden, M. (1971). An analysis of complexity. In Mathematical centre tracts (Vol. 35). Amsterdam: Mathematisch Centrum.
  15. Vose, M. (1999). The simple genetic algorithm: Foundations and theory. Cambridge: MIT.

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Hamparsum Bozdogan (1)
  • J. Andrew Howe (2)
  • Suman Katragadda (3)
  • Caterina Liberati (4)

  1. Department of Statistics, Operations, and Management Science, University of Tennessee, Knoxville, USA
  2. Business Analytics, Trans Atlantic Petroleum Istanbul, Istanbul, Turkey
  3. Advanced Analytics, Express-Scripts Company St. Louis, St. Louis, USA
  4. Economics Department, Università degli Studi Milano-Bicocca, Milan, Italy
