Abstract
A supersaturated design is a factorial design in which the number of factors to be estimated is larger than the available number of experimental runs. The cost and time required for many industrial experiments can be reduced by using supersaturated designs, since the main goal of such a design is to identify, at minimal cost, the few factors under consideration that have dominant effects. While most of the literature on supersaturated designs has focused on the construction of designs and their optimality properties, the data analysis of such designs has not been developed to a great extent. In this paper, we propose an analysis method for supersaturated designs that assumes generalized linear models for discrete responses; it analyzes main effects designs and simultaneously identifies the effects that are significant. An empirical study demonstrates that this method performs well, with low Type I and Type II error rates. The proposed method is therefore useful, as it enables supersaturated designs to be used for analyzing data under discrete response regression models.
Acknowledgments
The research of the third author was financially supported by a scholarship awarded by the Secretariat of the Research Committee of National Technical University of Athens. The authors would like to thank the Associate Editor and the referees for their constructive and useful suggestions which resulted in an improvement on an earlier version of this manuscript.
Appendix
For each of the 44 models presented in Table 1, 1,000 datasets were generated for each Scenario considered. The results obtained after applying our method are presented in Tables 6, 7, 8 and 9 for Scenarios I, II, III and IV, respectively, for each of the threshold values examined. The first column gives the number of the model used. The columns named “Type I” and “Type II” present the averages, over 1,000 simulations, of the Type I and Type II error rates for each threshold value. The last line of each table presents the average Type I and Type II error values over the 44 models considered.
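The averaging described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the usual factor-screening definitions (Type I rate as the share of inactive factors declared active, Type II rate as the share of active factors missed), which this appendix does not spell out, and the function names `screening_error_rates` and `average_error_rates` are hypothetical.

```python
from typing import List, Set, Tuple


def screening_error_rates(active: Set[int], declared: Set[int], k: int) -> Tuple[float, float]:
    """Return (type_i, type_ii) for one simulated dataset with k factors.

    Assumed definitions (not taken from the paper):
      type_i  = inactive factors declared active / number of inactive factors
      type_ii = active factors not declared active / number of active factors
    """
    inactive = set(range(k)) - active
    false_positives = len(declared & inactive)
    false_negatives = len(active - declared)
    type_i = false_positives / len(inactive) if inactive else 0.0
    type_ii = false_negatives / len(active) if active else 0.0
    return type_i, type_ii


def average_error_rates(active: Set[int], declared_per_run: List[Set[int]], k: int) -> Tuple[float, float]:
    """Average the two error rates over all simulated datasets of one model."""
    rates = [screening_error_rates(active, declared, k) for declared in declared_per_run]
    n = len(rates)
    return sum(r[0] for r in rates) / n, sum(r[1] for r in rates) / n


# Toy usage: 10 factors, factors 0 and 1 truly active, two simulated runs.
true_active = {0, 1}
runs = [{0, 2, 3}, {0, 1}]
avg_type_i, avg_type_ii = average_error_rates(true_active, runs, k=10)
print(avg_type_i, avg_type_ii)
```

In a study like the one described here, `declared_per_run` would hold one set of selected factors per simulated dataset, and the two averages would fill one row of a results table.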
Comparative results for each scenario are as follows:
- Scenario I: From Table 2, we observe that the average Type I and Type II error values for the 44 models considered are 0.23 and 0.06, respectively, for \(SU\) taken to be median\((\mathbf{su})\) and \(w=\frac{k}{2}\). With this choice of thresholds, the proposed method has significantly lower Type II error values and slightly higher Type I error values (see Table 6 for comparison).
- Scenario II: From Table 3, we observe that the average Type I and Type II error values for the 44 models considered are 0.29 and 0.04, respectively, for \(SU\) taken to be median\((\mathbf{su})\) and \(w=\frac{k}{2}\). With this choice of thresholds, the proposed method has significantly lower Type II error values and slightly higher Type I error values (see Table 7 for comparison).
- Scenario III: From Table 4, we observe that the average Type I and Type II error values for the 44 models considered are 0.29 and 0.05, respectively, for \(SU\) taken to be median\((\mathbf{su})\) and \(w=\frac{k}{2}\). With this choice of thresholds, the proposed method has significantly lower Type II error values and slightly higher Type I error values (see Table 8 for comparison).
- Scenario IV: From Table 5, we observe that the average Type I and Type II error values for the 44 models considered are 0.30 and 0.07, respectively, for \(SU\) taken to be median\((\mathbf{su})\) and \(w=\frac{k}{2}\). With this choice of thresholds, the proposed method has significantly lower Type II error values and slightly higher Type I error values (see Table 9 for comparison).
These results suggest that the proposed method, with \(SU\) taken to be median\((\mathbf{su})\) and \(w=\frac{k}{2}\), is quite robust for count response modelling. The average Type I and Type II error values are almost identical under Scenarios II, III and IV.
In general, we conclude that, for each Scenario considered, the proposed method performs well when \(SU\) is taken to be median\((\mathbf{su})\) and \(w=\frac{k}{2}\). With this choice of thresholds, the proposed method achieves the lowest Type II error values for all Scenarios considered. The slightly higher Type I error values that result from this choice are not troublesome, since supersaturated designs (SSDs) are used mainly to screen the factors that should be considered for further investigation. Low Type II error rates are therefore especially desirable, even though both Type I and Type II error rates are important and should be kept as low as possible. Moreover, under the effect sparsity that holds in SSDs, only a few factors are active, so some Type I errors are quite likely to occur.
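To make the median\((\mathbf{su})\) threshold more concrete, the following Python sketch computes the standard symmetrical uncertainty, \(SU(X,Y)=2\,IG(X;Y)/(H(X)+H(Y))\), between each factor column and a discrete response, and keeps the factors whose value exceeds the median. It is a minimal illustration, not the procedure of the paper: treating the response values as plain categories, using a strict inequality at the median, and the names `symmetrical_uncertainty` and `screen_by_median_su` are all assumptions, and the second threshold \(w\) is not modelled because its definition does not appear in this appendix.

```python
import math
from collections import Counter
from statistics import median
from typing import List, Sequence, Tuple


def entropy(values: Sequence) -> float:
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x: Sequence, y: Sequence) -> float:
    """SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    joint = entropy(list(zip(x, y)))
    info_gain = hx + hy - joint  # mutual information between x and y
    return 2.0 * info_gain / (hx + hy)


def screen_by_median_su(columns: List[Sequence], y: Sequence) -> Tuple[List[float], List[int]]:
    """Return all SU values and the indices of factors with SU above the median.

    Assumption: a factor is kept when its SU is strictly greater than median(su).
    """
    su = [symmetrical_uncertainty(col, y) for col in columns]
    cutoff = median(su)
    selected = [j for j, value in enumerate(su) if value > cutoff]
    return su, selected


# Toy usage: three two-level factor columns and a discrete response.
response = [1, 1, 0, 0]
factors = [
    [1, 1, -1, -1],   # perfectly informative about the response
    [1, -1, 1, -1],   # carries no information about the response
    [1, 1, 1, -1],    # partially informative
]
su_values, kept = screen_by_median_su(factors, response)
print(su_values, kept)
```

A higher cutoff would keep fewer factors, which tends to lower Type I errors at the expense of Type II errors; that is the trade-off discussed above.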
Balakrishnan, N., Koukouvinos, C. & Parpoula, C. Analyzing supersaturated designs for discrete responses via generalized linear models. Stat Papers 56, 121–145 (2015). https://doi.org/10.1007/s00362-013-0569-z
DOI: https://doi.org/10.1007/s00362-013-0569-z
Keywords
- Entropy
- Error rates
- Factor screening
- Discrete response regression models
- Information gain
- Symmetrical uncertainty