Instance Selection in Text Classification Using the Silhouette Coefficient Measure

Dey, Debangana; Solorio, Thamar; Montes y Gómez, Manuel; Escalante, Hugo Jair

doi:10.1007/978-3-642-25324-9_31

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7094)

Abstract

The paper proposes the use of the Silhouette Coefficient (SC) as a ranking measure to perform instance selection in text classification. Our selection criterion was to keep instances with mid-range SC values while removing the instances with high and low SC values. We evaluated our hypothesis across three well-known datasets and various machine learning algorithms. The results show that our method helps to achieve the best trade-off between classification accuracy and training time.
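
The selection rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that class labels play the role of clusters, that documents are dense numeric vectors compared with Euclidean distance, and that selection drops fixed fractions from both ends of the SC ranking. The function names and the `drop_low` and `drop_high` parameters are hypothetical and do not come from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_values(X, labels, dist=euclidean):
    """Per-instance silhouette s(i) = (b - a) / max(a, b).

    a: mean distance from instance i to the other members of its own class.
    b: smallest mean distance from instance i to the members of any other class.
    Assumption: class labels are used in place of cluster assignments.
    """
    classes = sorted(set(labels))
    scores = []
    for i, xi in enumerate(X):
        mean_dist = {}
        for c in classes:
            d = [dist(xi, xj) for j, xj in enumerate(X)
                 if labels[j] == c and j != i]
            if d:
                mean_dist[c] = sum(d) / len(d)
        a = mean_dist.get(labels[i])
        others = [v for c, v in mean_dist.items() if c != labels[i]]
        if a is None or not others:
            # Singleton class or single-class data: use s(i) = 0.
            scores.append(0.0)
            continue
        b = min(others)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def select_mid_range(X, labels, drop_low=0.25, drop_high=0.25):
    """Return indices of instances whose SC rank falls in the middle band.

    drop_low and drop_high are illustrative fractions of the ranking to
    discard at the low and high ends; they are not values from the paper.
    """
    scores = silhouette_values(X, labels)
    ranked = sorted(range(len(X)), key=lambda i: scores[i])
    lo = int(len(ranked) * drop_low)
    hi = len(ranked) - int(len(ranked) * drop_high)
    return sorted(ranked[lo:hi])


if __name__ == "__main__":
    # Toy two-class data set; real use would pass document vectors.
    X = [[0, 0], [0, 1], [1, 0], [5, 5], [10, 10], [10, 11], [11, 10], [6, 6]]
    y = [0, 0, 0, 0, 1, 1, 1, 1]
    kept = select_mid_range(X, y)
    print("kept indices:", kept)
```

Instances with low SC lie close to, or inside, another class, while instances with high SC sit deep inside their own class. Ranking by SC and keeping the middle band is one straightforward reading of the criterion; a threshold on the SC value itself would be an equally plausible variant.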

Author information

Authors and Affiliations

Department of Computer and Information Sciences, University of Alabama at Birmingham, 1300 University Boulevard, Birmingham, AL, 35294, USA
Debangana Dey, Thamar Solorio & Manuel Montes y Gómez
National Institute of Astrophysics, Optics and Electronics, Luis Enrique Erro No. 1, Sta. Maria Tonantzintla, Puebla, CP: 72840, Mexico
Manuel Montes y Gómez & Hugo Jair Escalante
Universidad Autonoma de Nuevo Leon, San Nicolas de los Garza, 66450, N. L., Mexico
Hugo Jair Escalante

Editor information

Editors and Affiliations

Mexican Petroleum Institute (IMP), Eje Central Lazaro Cardenas Norte, 152, Col. San Bartolo Atepehuacan, CP 07730, Mexico DF, Mexico
Ildar Batyrshin
National Polytechnic Institute (IPN), Center for Computing Research (CIC), Av. Juan de Dios Bátiz, s/n, Col. Nueva Industrial Vallejo, CP 07738, Mexico D.F., Mexico
Grigori Sidorov

About this paper

Cite this paper

Dey, D., Solorio, T., Montes y Gómez, M., Escalante, H.J. (2011). Instance Selection in Text Classification Using the Silhouette Coefficient Measure. In: Batyrshin, I., Sidorov, G. (eds) Advances in Artificial Intelligence. MICAI 2011. Lecture Notes in Computer Science, vol 7094. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25324-9_31

DOI: https://doi.org/10.1007/978-3-642-25324-9_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25323-2
Online ISBN: 978-3-642-25324-9
eBook Packages: Computer Science, Computer Science (R0)
