Smooth Tests of Fit for Gaussian Mixtures
Model-based clustering and classification are often based on a finite mixture distribution. The most popular choice for the mixture component distribution is the Gaussian distribution (Fraley and Raftery, J Stat Softw 18(6):1–13, 2007). Many tests, for example those based on goodness-of-fit measures, focus on detecting the order of the mixture. Diagnostic tests that check the distributional assumptions, however, are often neglected, and this may invalidate the conclusions of the cluster analysis.
Smooth tests (Rayner et al., Smooth tests of goodness of fit: using R, 2nd edn. Wiley, Singapore, 2009) can be used to test the distributional assumptions against the so-called general smooth alternatives in the sense of Neyman (Skandinavisk Aktuarietidskr 20:150–99, 1937). To test for a mixture distribution, we present smooth tests that have the additional advantage of permitting sub-hypotheses to be tested through their components. The test statistics are asymptotically chi-squared distributed. A simulation study shows that bootstrapping is needed for small to medium sample sizes to keep the type I error rate at the nominal level, and that the proposed tests have high power against various alternatives. Finally, the tests are illustrated on a data set of the average precipitation, in inches, for each of 70 cities in the United States and Puerto Rico (Mcneil, Interactive data analysis. Wiley, New York, 1977).
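To make the idea of a smooth test and its components concrete, the following is a minimal sketch and not the procedure developed in the paper. It assumes the simplest textbook case: a univariate Gaussian mixture whose weights, means and standard deviations are fully specified under the null hypothesis. The data are mapped to [0, 1] by the probability integral transform, and each component is a scaled sum of an orthonormal shifted Legendre polynomial; in this fully specified case the sum of squared components is asymptotically chi-squared with `order` degrees of freedom. All function names are hypothetical. The sketch does not handle estimated parameters, where the components and the null distribution change and where a bootstrap would be the natural replacement for the chi-squared p-value.

```python
import math
import random


def mixture_cdf(x, weights, means, sds):
    """CDF of a univariate Gaussian mixture with fully specified parameters."""
    return sum(
        w * 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))
        for w, m, s in zip(weights, means, sds)
    )


def legendre_orthonormal(u, order):
    """Orthonormal shifted Legendre polynomials h_1..h_order evaluated at u in [0, 1]."""
    z = 2.0 * u - 1.0
    p_prev, p_curr = 1.0, z  # P_0(z) and P_1(z)
    values = []
    for r in range(1, order + 1):
        values.append(math.sqrt(2 * r + 1) * p_curr)
        # Bonnet recursion: (r + 1) P_{r+1} = (2r + 1) z P_r - r P_{r-1}
        p_prev, p_curr = p_curr, ((2 * r + 1) * z * p_curr - r * p_prev) / (r + 1)
    return values


def smooth_test(data, weights, means, sds, order=4):
    """Return (statistic, components) for a fully specified mixture null."""
    n = len(data)
    sums = [0.0] * order
    for x in data:
        u = mixture_cdf(x, weights, means, sds)  # probability integral transform
        for r, h in enumerate(legendre_orthonormal(u, order)):
            sums[r] += h
    components = [s / math.sqrt(n) for s in sums]
    return sum(v * v for v in components), components


def chi2_sf_even(x, df):
    """Chi-squared upper tail probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def sample_mixture(n, weights, means, sds, rng):
    """Draw n observations from the mixture (illustrative data only)."""
    draws = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        draws.append(rng.gauss(means[k], sds[k]))
    return draws


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical null parameters, chosen only for illustration.
    weights, means, sds = (0.4, 0.6), (0.0, 3.0), (1.0, 0.5)
    data = sample_mixture(200, weights, means, sds, rng)
    statistic, components = smooth_test(data, weights, means, sds, order=4)
    print("components:", [round(v, 3) for v in components])
    print("statistic:", round(statistic, 3))
    print("asymptotic p-value:", round(chi2_sf_even(statistic, 4), 3))
```

Each squared component is itself asymptotically chi-squared with one degree of freedom in this setting, which is what allows a significant overall statistic to be traced back to a particular kind of departure from the hypothesised mixture.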
- Fraley, C., & Raftery, A. E. (2007). Model-based methods of classification: Using the mclust software in chemometrics. Journal of Statistical Software, 18(6), 1–13. http://www.jstatsoft.org/.
- Mcneil, D. R. (1977). Interactive data analysis. New York: Wiley.
- Neyman, J. (1937). Smooth test for goodness of fit. Skandinavisk Aktuarietidskr, 20, 150–99.