Improving K-Means by Outlier Removal

  • Ville Hautamäki
  • Svetlana Cherednichenko
  • Ismo Kärkkäinen
  • Tomi Kinnunen
  • Pasi Fränti
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3540)

Abstract

We present an Outlier Removal Clustering (ORC) algorithm that performs outlier detection and data clustering simultaneously. The method uses both clustering and outlier discovery to improve the estimation of the centroids of the generative distribution. The proposed algorithm consists of two stages. The first stage is a plain K-means process, while the second stage iteratively removes the vectors that are far from their cluster centroids. We provide experimental results on three different synthetic datasets and on three map images that were corrupted by lossy compression. The results indicate that the proposed method has a lower error than the competing methods on datasets with overlapping clusters.
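The two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not state how "far from their cluster centroids" is measured, so the removal rule used here (distance to the nearest centroid divided by the largest such distance, compared against a threshold) is an assumption for illustration only. The function names and the parameters `threshold`, `rounds`, and `seed` are hypothetical as well, and the sketch should not be read as the authors' implementation.

```python
import math
import random


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    dists = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=dists.__getitem__)
    return idx, dists[idx]


def kmeans(points, centroids, iterations=10):
    """Plain K-means refinement starting from the given centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            idx, _ = nearest(p, centroids)
            groups[idx].append(p)
        # Keep the old centroid when a cluster ends up empty.
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def orc_sketch(points, k, threshold=0.9, rounds=3, seed=0):
    """Stage 1: K-means. Stage 2: repeatedly drop far vectors and re-cluster."""
    rng = random.Random(seed)
    kept = list(points)
    centroids = kmeans(kept, rng.sample(kept, k))
    removed = []
    for _ in range(rounds):
        if not kept:
            break
        dists = [nearest(p, centroids)[1] for p in kept]
        d_max = max(dists)
        if d_max == 0:
            break
        # Assumed rule (not from the abstract): a vector counts as an
        # outlier when dist / d_max exceeds the threshold.
        removed += [p for p, d in zip(kept, dists) if d / d_max > threshold]
        kept = [p for p, d in zip(kept, dists) if d / d_max <= threshold]
        centroids = kmeans(kept, centroids)
    return centroids, kept, removed
```

Under this assumed rule the farthest remaining vector always has a ratio of 1, so every round removes at least one vector whenever `threshold` is below 1. The `rounds` and `threshold` parameters therefore jointly control how aggressively the data is trimmed before the centroids are re-estimated.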

Keywords

  • Outlier Detection
  • Mean Absolute Error
  • Cluster Centroid
  • Lossy Compression
  • Code Vector


Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Ville Hautamäki (1)
  • Svetlana Cherednichenko (1)
  • Ismo Kärkkäinen (1)
  • Tomi Kinnunen (1)
  • Pasi Fränti (1)

  1. Speech and Image Processing Unit, Department of Computer Science, University of Joensuu, Joensuu, Finland
