Annals of Operations Research, Volume 216, Issue 1, pp 129–144

Clustering noise-included data by controlling decision errors

Article

DOI: 10.1007/s10479-012-1238-7

Cite this article as:
Park, HS., Lee, J. & Jun, CH. Ann Oper Res (2014) 216: 129. doi:10.1007/s10479-012-1238-7

Abstract

Cluster analysis is an unsupervised learning technique for partitioning objects into several clusters. Assuming that the data include noisy objects, we propose a soft clustering method that assigns each object that differs significantly from noise to one of a specified number of clusters, controlling decision errors through multiple testing. The parameters of the Gaussian mixture model are estimated with the EM algorithm. Using the estimated probability density function, we formulate the clustering problem as a multiple hypothesis testing problem and calculate the positive false discovery rate (pFDR) as the decision error. The proposed procedure simultaneously classifies objects as significant data or noise according to a specified target pFDR level. When applied to real and artificial data sets, it controlled the target pFDR reasonably well and offered satisfactory clustering performance.
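
To make the pFDR-based decision step concrete, the following is a minimal sketch and not the authors' procedure. The abstract does not say how the pFDR is estimated, so the sketch assumes a Storey-type plug-in estimator, and it assumes that each object already has a p-value for the null hypothesis "this object is noise". The function names (`estimate_pfdr`, `select_threshold`), the tuning parameter `lam`, and the synthetic p-values are all hypothetical and serve only as illustration.

```python
import random


def estimate_pfdr(p_values, threshold, lam=0.5):
    """Plug-in pFDR estimate for the rule 'reject when p <= threshold'.

    Assumption: a Storey-type estimator, used as a stand-in because the
    abstract does not specify the estimator.
    """
    m = len(p_values)
    # Estimated proportion of true nulls (noise objects), capped at 1.
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1.0 - lam)))
    # Number of rejections, floored at 1 to avoid dividing by zero.
    rejected = max(1, sum(p <= threshold for p in p_values))
    # Probability of at least one rejection if every hypothesis were null.
    prob_any_rejection = 1.0 - (1.0 - threshold) ** m
    if prob_any_rejection == 0.0:
        return 0.0  # threshold of 0: no null p-value is expected below it
    estimate = pi0 * threshold / ((rejected / m) * prob_any_rejection)
    return min(1.0, estimate)


def select_threshold(p_values, target, lam=0.5):
    """Largest observed p-value whose estimated pFDR is within the target.

    Returns None when no candidate threshold meets the target.
    """
    best = None
    for t in sorted(set(p_values)):
        if estimate_pfdr(p_values, t, lam) <= target:
            best = t
    return best


# Synthetic illustration only: 200 noise objects with uniform p-values and
# 800 clustered objects with small p-values.
random.seed(0)
p_values = [random.random() for _ in range(200)]
p_values += [random.random() * 0.01 for _ in range(800)]

threshold = select_threshold(p_values, target=0.05)
significant = [p for p in p_values if p <= threshold]
print(f"threshold={threshold:.4f}, objects kept={len(significant)}")
```

Objects whose p-values fall at or below the selected threshold would be treated as significant data and passed on to cluster assignment, while the rest would be labeled noise. Scanning every observed p-value keeps the sketch short at the cost of quadratic running time.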

Keywords

Clustering, Gaussian mixture, Multiple testing, p-value

Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  1. Department of Industrial and Management Engineering, Pohang University of Science and Technology, Pohang, South Korea