Annals of Operations Research, Volume 216, Issue 1, pp 129–144

Clustering noise-included data by controlling decision errors



Cluster analysis is an unsupervised learning technique for partitioning objects into several clusters. Assuming that the data include noisy objects, we propose a soft clustering method that assigns objects that differ significantly from noise to one of a specified number of clusters, controlling decision errors through multiple testing. The parameters of the Gaussian mixture model are estimated by the EM algorithm. Using the estimated probability density function, we formulate the clustering problem as a multiple hypothesis testing problem and calculate the positive false discovery rate (pFDR) as the decision error. The proposed procedure simultaneously classifies each object as significant data or noise according to a specified target pFDR level. When applied to real and artificial data sets, it controlled the target pFDR reasonably well and offered satisfactory clustering performance.
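The abstract does not say how the pFDR is estimated or how the per-object p-values are derived from the fitted mixture density. As a rough illustration only, the sketch below assumes that one p-value per object is already available (null hypothesis: the object is noise) and applies a Storey-style point estimate of the pFDR to pick a rejection threshold for a target level. The function names `estimate_pfdr` and `label_objects`, the tuning parameter `lam`, and the toy p-values are all hypothetical and are not taken from the paper.

```python
import math


def estimate_pfdr(pvalues, threshold, lam=0.5):
    """Storey-style point estimate of the pFDR when rejecting every p <= threshold."""
    m = len(pvalues)
    if m == 0 or not 0.0 < threshold <= 1.0:
        raise ValueError("need at least one p-value and a threshold in (0, 1]")
    # Estimated share of true nulls (noise objects), read off the p-values above lam.
    pi0 = min(1.0, sum(p > lam for p in pvalues) / ((1.0 - lam) * m))
    # Number of rejections, floored at 1 so the ratio is always defined.
    rejected = max(1, sum(p <= threshold for p in pvalues))
    # Probability of at least one rejection if every null were true: 1 - (1 - t)^m.
    if threshold >= 1.0:
        prob_any = 1.0
    else:
        prob_any = -math.expm1(m * math.log1p(-threshold))
    return min(1.0, pi0 * m * threshold / (rejected * prob_any))


def label_objects(pvalues, target_pfdr, lam=0.5):
    """Pick the largest observed p-value whose estimated pFDR meets the target.

    Returns (threshold, labels). The threshold is None when no candidate
    meets the target, in which case every object is labelled as noise.
    """
    best = None
    for t in sorted(set(pvalues)):
        if 0.0 < t <= 1.0 and estimate_pfdr(pvalues, t, lam) <= target_pfdr:
            best = t
    if best is None:
        return None, ["noise"] * len(pvalues)
    return best, ["significant" if p <= best else "noise" for p in pvalues]


if __name__ == "__main__":
    # Toy p-values (assumed): 20 objects far from noise, 80 noise-like objects.
    signal = [(i + 1) / 10000 for i in range(20)]
    noise = [(j + 0.5) / 80 for j in range(80)]
    threshold, labels = label_objects(signal + noise, target_pfdr=0.05)
    print(threshold, labels.count("significant"), labels.count("noise"))
```

In a full pipeline along the lines of the abstract, the objects labelled significant would then be assigned to clusters using the fitted Gaussian mixture; that step is omitted here because the abstract does not describe it in detail.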


Keywords: Clustering, Gaussian mixture, Multiple testing, p-value



We would like to thank the Guest Editor, Dr. Victoria Chen, and the anonymous reviewers for their helpful comments. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea, funded by the Ministry of Education, Science and Technology (Project No. 2011-0012879).



Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  1. Department of Industrial and Management Engineering, Pohang University of Science and Technology, Pohang, South Korea
