Instance Selection in Text Classification Using the Silhouette Coefficient Measure

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7094)

Abstract

The paper proposes the use of the Silhouette Coefficient (SC) as a ranking measure to perform instance selection in text classification. Our selection criterion was to keep instances with mid-range SC values while removing the instances with high and low SC values. We evaluated our hypothesis across three well-known datasets and various machine learning algorithms. The results show that our method helps to achieve the best trade-off between classification accuracy and training time.
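
The selection criterion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that class labels play the role of cluster assignments, that Euclidean distance is used, and that "mid-range" means dropping a fixed fraction of instances from each end of the SC ranking. The function names and the `low_frac` and `high_frac` parameters are hypothetical.

```python
from math import dist


def silhouette_scores(points, labels):
    """Return the silhouette coefficient s(i) of every instance.

    Assumption: class labels are treated as cluster assignments and
    distances are Euclidean.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            # Singleton cluster (or only one class): use 0 by convention.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same class.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other class.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def select_mid_range(scores, low_frac=0.2, high_frac=0.2):
    """Return sorted indices whose SC rank is in neither tail.

    Assumption: the cut-offs are fractions of the ranked list; the
    values 0.2 and 0.2 are illustrative only.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    start = int(n * low_frac)
    stop = n - int(n * high_frac)
    return sorted(order[start:stop])


if __name__ == "__main__":
    # Toy 2-D data standing in for document vectors.
    points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (2.5, 2.5), (3, 3)]
    labels = [0, 0, 0, 1, 1, 1, 0, 1]
    sc = silhouette_scores(points, labels)
    kept = select_mid_range(sc, low_frac=0.25, high_frac=0.25)
    print("kept instance indices:", kept)
```

The reduced training set would then be the instances at the returned indices. In practice the two fractions would need to be tuned per dataset, since they control the trade-off between accuracy and training time that the abstract refers to.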

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dey, D., Solorio, T., Montes y Gómez, M., Escalante, H.J. (2011). Instance Selection in Text Classification Using the Silhouette Coefficient Measure. In: Batyrshin, I., Sidorov, G. (eds) Advances in Artificial Intelligence. MICAI 2011. Lecture Notes in Computer Science, vol 7094. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25324-9_31

  • DOI: https://doi.org/10.1007/978-3-642-25324-9_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25323-2

  • Online ISBN: 978-3-642-25324-9

  • eBook Packages: Computer Science, Computer Science (R0)
