Enhanced Centroid-Based Classification Technique by Filtering Outliers

  • Kwangcheol Shin
  • Ajith Abraham
  • SangYong Han
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4188)


Document clustering, or unsupervised document classification, has been used to enhance information retrieval, and it has recently become an intense area of research because of its practical importance. Outliers are the elements whose similarity to the centroid of their category falls below some threshold value. In this paper, we show that excluding outliers from noisy training data significantly improves the performance of the centroid-based classifier, which is the best known method. The proposed method performs about 10% better than the centroid-based classifier.
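
The outlier definition above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which similarity measure, centroid definition, or threshold the paper uses, so cosine similarity, a mean-of-normalized-vectors centroid, the threshold value 0.5, and all function names and toy data below are assumptions made for illustration only.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(a, b):
    """Cosine similarity, assumed here as the similarity measure."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def centroid(vectors):
    """Component-wise mean of the unit-normalized vectors (an assumed definition)."""
    unit = [normalize(v) for v in vectors]
    return [sum(col) / len(unit) for col in zip(*unit)]

def filter_outliers(docs_by_category, threshold):
    """Drop documents whose similarity to their category centroid is below threshold."""
    kept = {}
    for category, vectors in docs_by_category.items():
        c = centroid(vectors)
        inliers = [v for v in vectors if cosine(v, c) >= threshold]
        # Assumption: if everything would be filtered out, keep the category as-is.
        kept[category] = inliers or list(vectors)
    return kept

def train(docs_by_category, threshold):
    """Recompute one centroid per category from the filtered training data."""
    filtered = filter_outliers(docs_by_category, threshold)
    return {cat: centroid(vecs) for cat, vecs in filtered.items()}

def classify(vec, centroids):
    """Assign the category whose centroid is most similar to the document."""
    return max(centroids, key=lambda cat: cosine(vec, centroids[cat]))

# Hypothetical toy term-frequency vectors; the third "sports" document is noise.
training = {
    "sports": [[3, 0, 1], [2, 0, 0], [0, 3, 0]],
    "finance": [[0, 2, 1], [0, 3, 0]],
}
filtered = filter_outliers(training, threshold=0.5)
centroids = train(training, threshold=0.5)
label = classify([0, 4, 1], centroids)
```

With this toy data, the noisy third "sports" vector falls below the assumed threshold and is excluded before the final centroids are computed. In practice the threshold would need to be tuned on held-out data.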


Genetic Programming, Data Item, Vector Space Model, Document Cluster, Training Category





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Kwangcheol Shin (1)
  • Ajith Abraham (1)
  • SangYong Han (1)
  1. School of Computer Science and Engineering, Chung-Ang University, Seoul, Korea
