Clustering noise-included data by controlling decision errors

Park, Hae-Sang; Lee, Jeonghwa; Jun, Chi-Hyuck

doi:10.1007/s10479-012-1238-7

Published: 16 November 2012

Volume 216, pages 129–144, (2014)

Annals of Operations Research

Abstract

Cluster analysis is an unsupervised learning technique for partitioning objects into several clusters. Assuming that the data include noisy objects, we propose a soft clustering method that assigns objects that differ significantly from noise to one of a specified number of clusters, controlling decision errors through multiple testing. The parameters of the Gaussian mixture model are estimated by the EM algorithm. Using the estimated probability density function, we formulate the clustering problem as a multiple hypothesis testing problem and calculate the positive false discovery rate (pFDR) as the decision error. The proposed procedure simultaneously classifies objects as significant data or noise according to the specified target pFDR level. When applied to real and artificial data sets, it controlled the target pFDR reasonably well and offered satisfactory clustering performance.
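As an illustration of how a target pFDR level can drive the significant-versus-noise decision, the following is a minimal sketch of a plug-in pFDR estimate in the style of Storey's direct approach. It assumes that a p-value against the noise hypothesis is already available for every object. The function names `estimate_pfdr` and `largest_threshold` and the tuning parameter `lam` are hypothetical, and the estimator actually used in the paper is not shown in this excerpt and may differ.

```python
def estimate_pfdr(p_values, gamma, lam=0.5):
    """Plug-in pFDR estimate when every p-value <= gamma is declared significant.

    Assumption: p-values of noise objects are uniform on [0, 1].
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    m = len(p_values)
    # Estimated proportion of noise objects: p-values above lam are
    # assumed to come mostly from the null hypothesis.
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1.0 - lam)))
    # Number of objects declared significant (at least 1 to avoid dividing by zero).
    rejections = max(sum(p <= gamma for p in p_values), 1)
    # Probability of at least one rejection, the "positive" part of pFDR.
    prob_positive = 1.0 - (1.0 - gamma) ** m
    return min(1.0, pi0 * gamma * m / (rejections * prob_positive))


def largest_threshold(p_values, target):
    """Largest observed p-value whose estimated pFDR does not exceed the target."""
    candidates = sorted({p for p in p_values if p > 0.0})
    feasible = [p for p in candidates if estimate_pfdr(p_values, p) <= target]
    return max(feasible) if feasible else None


# Hypothetical p-values: four objects far from noise, four noise-like objects.
p_values = [0.001, 0.002, 0.003, 0.004, 0.6, 0.7, 0.8, 0.9]
threshold = largest_threshold(p_values, target=0.3)
print("threshold:", threshold)
```

Objects with p-values at or below the returned threshold would be treated as significant data and passed to the clustering step; the rest would be labelled noise. The p-values in the usage example are made up for illustration only.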



Acknowledgements

We would like to thank the Guest Editor, Dr. Victoria Chen, and the anonymous reviewers for their helpful comments. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea, funded by the Ministry of Education, Science and Technology (Project No. 2011-0012879).

Author information

Authors and Affiliations

Department of Industrial and Management Engineering, Pohang University of Science and Technology, San 31 Hyoja-dong, Pohang, 790-784, South Korea
Hae-Sang Park, Jeonghwa Lee & Chi-Hyuck Jun

Corresponding author

Correspondence to Chi-Hyuck Jun.

Appendix: Validity of the method for calculating p-value

To check the validity of the method for calculating p-values proposed in Sect. 2.2, we consider the univariate normal case. Assume that each object follows a standard normal distribution. Then the true p-value is given by (20).

$$ \mbox{true}\ p\mbox{-value} = 2 \times(1 - \mbox{cdf}) $$

(20)

where cdf is the cumulative distribution function of the standard normal distribution.
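Equation (20) can be evaluated directly, as in the short sketch below. It assumes a two-sided test, so the cdf is evaluated at the absolute value of the observation; the excerpt does not state this explicitly. The function name `true_p_value` is hypothetical.

```python
import math


def true_p_value(x):
    """Two-sided p-value of x under the standard normal distribution, as in (20)."""
    # Standard normal cdf at |x| (evaluating at |x| is an assumption).
    cdf = 0.5 * (1.0 + math.erf(abs(x) / math.sqrt(2.0)))
    return 2.0 * (1.0 - cdf)


for x in (0.0, 1.0, 1.96, -1.96):
    print(f"x = {x:5.2f}  true p-value = {true_p_value(x):.4f}")
```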

To compare the performance, 1000 observations were randomly generated from the standard normal distribution. Table 18 shows a partial result of the proposed method as compared with the true p-value. The mean absolute error (MAE) between the true p-values and the estimated p-values for 1000 values is 0.0032, which seems to be acceptably small.

Table 18 Estimated p-values from the proposed Monte Carlo method (10 values are listed)

Although the results are not shown here, the proposed method for calculating p-values was also tested on other distributions, such as the F and chi-square distributions, and performed well.
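The comparison for the standard normal case described above can be sketched as follows. The Monte Carlo procedure of Sect. 2.2 is not reproduced in this excerpt, so the sketch assumes one plausible form: the p-value of an observation is estimated as the fraction of simulated reference draws whose density is no larger than the density at the observation. The function names, the reference sample size `n_reference`, and the seeds are all hypothetical, and the resulting MAE need not match the 0.0032 reported above.

```python
import bisect
import math
import random


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def true_p_value(x):
    """2 * (1 - cdf(|x|)) for the standard normal, written with erfc."""
    return math.erfc(abs(x) / math.sqrt(2.0))


def monte_carlo_p_values(observations, n_reference=20000, seed=0):
    """Assumed estimator: share of reference draws at least as unlikely as x."""
    rng = random.Random(seed)
    ref_densities = sorted(
        normal_pdf(rng.gauss(0.0, 1.0)) for _ in range(n_reference)
    )
    # bisect_right counts reference densities <= the observed density.
    return [
        bisect.bisect_right(ref_densities, normal_pdf(x)) / n_reference
        for x in observations
    ]


# 1000 observations from the standard normal, as in the appendix.
rng = random.Random(1)
observations = [rng.gauss(0.0, 1.0) for _ in range(1000)]
estimated = monte_carlo_p_values(observations)
mae = sum(
    abs(e - true_p_value(x)) for e, x in zip(estimated, observations)
) / len(observations)
print(f"MAE = {mae:.4f}")
```

For the standard normal, a lower density is equivalent to a larger absolute value, so this density-based estimate targets the same quantity as (20). The same construction extends to distributions without a closed-form tail probability, which is where a Monte Carlo estimate is useful.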

Cite this article

Park, HS., Lee, J. & Jun, CH. Clustering noise-included data by controlling decision errors. Ann Oper Res 216, 129–144 (2014). https://doi.org/10.1007/s10479-012-1238-7

Issue Date: May 2014