Instance Selection in Text Classification Using the Silhouette Coefficient Measure

  • Debangana Dey
  • Thamar Solorio
  • Manuel Montes y Gómez
  • Hugo Jair Escalante
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7094)


We propose the Silhouette Coefficient (SC) as a ranking measure for instance selection in text classification. Our selection criterion is to keep instances with mid-range SC values and to remove instances with high or low SC values. We evaluated this hypothesis on three well-known datasets with various machine learning algorithms. The results show that our method helps to achieve the best trade-off between classification accuracy and training time.
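The selection rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the class labels play the role of clusters, that Euclidean distance is used (text data would more commonly use cosine distance over term vectors), and that the cut-offs `low` and `high` are hypothetical values chosen for the toy data. The function names `silhouette_scores` and `select_mid_range` are likewise illustrative.

```python
import math


def silhouette_scores(points, labels):
    """Silhouette coefficient of each instance, using class labels as clusters.

    s(i) = (b - a) / max(a, b), where a is the mean distance to the other
    members of the instance's own class and b is the smallest mean distance
    to the members of any other class.
    """
    by_label = {}
    for idx, lab in enumerate(labels):
        by_label.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in by_label[lab] if j != i]
        # Convention: singletons (or a single class overall) score 0.
        if not own or len(by_label) < 2:
            scores.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in by_label.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def select_mid_range(scores, low, high):
    """Indices of instances whose score lies in [low, high] (assumed cut-offs)."""
    return [i for i, s in enumerate(scores) if low <= s <= high]


# Toy data: two classes, plus one instance of class "a" lying near class "b".
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (3, 3)]
labels = ["a", "a", "a", "b", "b", "b", "a"]

scores = silhouette_scores(points, labels)
kept = select_mid_range(scores, low=0.0, high=0.8)  # hypothetical thresholds
```

In this sketch, instances with very high scores sit deep inside their class and are treated as redundant, while instances with low or negative scores are treated as likely outliers; only the instances in between are kept for training. How the actual thresholds were set is not stated in the abstract.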


Instance Selection, Outlier Elimination, Text Classification, Supervised Machine Learning





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Debangana Dey (1)
  • Thamar Solorio (1)
  • Manuel Montes y Gómez (1, 2)
  • Hugo Jair Escalante (2, 3)
  1. Department of Computer and Information Sciences, University of Alabama at Birmingham, Birmingham, USA
  2. National Institute of Astrophysics, Optics and Electronics, Puebla, Mexico
  3. Universidad Autonoma de Nuevo Leon, San Nicolas de los Garza, Mexico
