Abstract
A supersaturated design is a factorial design in which the number of effects to be estimated exceeds the number of runs. Such designs are used in many experiments for screening purposes, i.e., for studying a large number of factors and identifying the active ones. In this paper, we propose a method for screening out the important factors from a large set of potentially active variables using the symmetrical uncertainty measure combined with the information gain measure. We develop an information-theoretic analysis method that applies Shannon entropy and other entropy measures, such as Rényi entropy, Havrda–Charvát entropy, and Tsallis entropy, to the data, assuming generalized linear models for a Bernoulli response. The method is advantageous because it enables supersaturated designs to be used for analyzing data under generalized linear models. An empirical study demonstrates that the method performs well, giving low Type I and Type II error rates whichever entropy measure is used. Moreover, in terms of error rates, the proposed method is more efficient than the existing ROC methodology for identifying the significant factors for a dichotomous response.
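To make the two measures named in the abstract concrete, the following is a minimal sketch using the standard Shannon-entropy definitions from the feature selection literature: information gain IG(Y|X) = H(Y) - H(Y|X) and symmetrical uncertainty SU(X, Y) = 2 IG(Y|X) / (H(X) + H(Y)). It is not the paper's algorithm, whose details are not given in the abstract. The function names and the toy data (two two-level factors and a binary response) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(y, x):
    """H(Y|X): entropy of y within each level of x, weighted by level frequency."""
    n = len(x)
    total = 0.0
    for level, count in Counter(x).items():
        subset = [yi for xi, yi in zip(x, y) if xi == level]
        total += (count / n) * shannon_entropy(subset)
    return total


def information_gain(y, x):
    """IG(Y|X) = H(Y) - H(Y|X)."""
    return shannon_entropy(y) - conditional_entropy(y, x)


def symmetrical_uncertainty(y, x):
    """SU(X, Y) = 2 * IG(Y|X) / (H(X) + H(Y)), which lies in [0, 1]."""
    denom = shannon_entropy(x) + shannon_entropy(y)
    return 0.0 if denom == 0 else 2.0 * information_gain(y, x) / denom


# Hypothetical toy data: 8 runs, two-level factors coded -1/+1, binary response.
response = [0, 0, 0, 0, 1, 1, 1, 1]
factor_a = [-1, -1, -1, -1, 1, 1, 1, 1]   # determines the response exactly
factor_b = [-1, 1, -1, 1, -1, 1, -1, 1]   # carries no information about it

su_a = symmetrical_uncertainty(response, factor_a)
su_b = symmetrical_uncertainty(response, factor_b)
print(f"SU(factor_a, response) = {su_a:.3f}")
print(f"SU(factor_b, response) = {su_b:.3f}")
```

In a screening setting, factors would be ranked by such a score and those with the largest values flagged as potentially active. Substituting another entropy (Rényi, Havrda–Charvát, or Tsallis) would amount to replacing `shannon_entropy` in this sketch; how the paper itself does this is not stated in the abstract.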
Cite this article
Balakrishnan, N., Koukouvinos, C. & Parpoula, C. An information theoretical algorithm for analyzing supersaturated designs for a binary response. Metrika 76, 1–18 (2013). https://doi.org/10.1007/s00184-011-0373-5
DOI: https://doi.org/10.1007/s00184-011-0373-5