Abstract
This paper proposes using the Silhouette Coefficient (SC) as a ranking measure for instance selection in text classification. Our selection criterion keeps instances with mid-range SC values and removes instances with high or low SC values. We evaluated this hypothesis on three well-known datasets with various machine learning algorithms. The results show that our method helps achieve the best trade-off between classification accuracy and training time.
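The selection criterion described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the class labels play the role of cluster assignments, that instances are dense numeric vectors compared with Euclidean distance, and that the mid-range is defined by two hypothetical thresholds, `low` and `high`. The abstract does not state the distance measure, the thresholds, or how the ranking is cut, so all of these are placeholders.

```python
import math
from collections import defaultdict


def silhouette_values(points, labels):
    """Per-instance silhouette coefficient s = (b - a) / max(a, b).

    a: mean distance from the instance to the other members of its own class.
    b: smallest mean distance from the instance to the members of another class.
    Assumption: class labels are used as the cluster assignment.
    """
    groups = defaultdict(list)
    for idx, lab in enumerate(labels):
        groups[lab].append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in groups[lab] if j != i]
        # Convention: score 0 for a singleton class or when only one class exists.
        if not own or len(groups) < 2:
            scores.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in groups.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def select_mid_range(scores, low, high):
    """Indices of instances whose score lies in [low, high].

    low and high are hypothetical thresholds chosen for illustration only.
    """
    return [i for i, s in enumerate(scores) if low <= s <= high]
```

A training set would then be reduced by keeping only the instances at the returned indices before fitting a classifier. For text data, a cosine distance over term-weight vectors could be substituted for `math.dist`; that choice is likewise an assumption.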
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dey, D., Solorio, T., Montes y Gómez, M., Escalante, H.J. (2011). Instance Selection in Text Classification Using the Silhouette Coefficient Measure. In: Batyrshin, I., Sidorov, G. (eds) Advances in Artificial Intelligence. MICAI 2011. Lecture Notes in Computer Science, vol 7094. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25324-9_31
DOI: https://doi.org/10.1007/978-3-642-25324-9_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25323-2
Online ISBN: 978-3-642-25324-9
eBook Packages: Computer Science, Computer Science (R0)