Using Thesaurus to Improve Multiclass Text Classification

Maghsoodi, Nooshin; Homayounpour, Mohammad Mehdi

doi:10.1007/978-3-642-19437-5_20

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6609)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

With the growing amount of textual information available on the Internet, the importance of automatic text classification has increased over the last decade. This paper presents a system for multi-class classification of Farsi documents that uses a Support Vector Machine (SVM) classifier. The new idea proposed here is to extend the feature vector with words extracted from a thesaurus. The goal is to assist the classifier when the training dataset is not comprehensive for some categories. The corpus was prepared from the Farsi Wikipedia website and from articles in archived newspapers and magazines. The results indicate that this approach improves classification performance: a micro-averaged F-measure of 0.89 was achieved for the classification of 10 categories of Farsi texts.
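The idea of extending a feature vector with thesaurus words, and the micro-averaged F-measure used to report the result, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the toy English thesaurus, the helper names `expand_tokens`, `to_feature_vector`, and `micro_f1`, and the plain term-count weighting are not taken from the paper, whose exact expansion rule and term weighting are not given in the abstract. The SVM training step is omitted.

```python
from collections import Counter

# Hypothetical toy thesaurus: each head word maps to related words.
# The entries are illustrative English stand-ins, not data from the paper.
TOY_THESAURUS = {
    "football": ["soccer", "sport"],
    "election": ["vote", "politics"],
}


def expand_tokens(tokens, thesaurus):
    """Return the original tokens followed by the thesaurus entries of each token."""
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(thesaurus.get(token, []))
    return expanded


def to_feature_vector(tokens, vocabulary):
    """Map tokens to raw term counts over a fixed, ordered vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def micro_f1(gold, predicted):
    """Micro-averaged F-measure: pool TP, FP, and FN over all classes first."""
    tp = fp = fn = 0
    for label in set(gold) | set(predicted):
        for g, p in zip(gold, predicted):
            if p == label and g == label:
                tp += 1
            elif p == label:
                fp += 1  # predicted this label, gold was another
            elif g == label:
                fn += 1  # gold was this label, predicted another
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


vocabulary = ["election", "football", "soccer", "sport", "vote"]
document = ["football", "match"]

plain_vector = to_feature_vector(document, vocabulary)
expanded_vector = to_feature_vector(
    expand_tokens(document, TOY_THESAURUS), vocabulary
)
score = micro_f1(["a", "b", "a", "c"], ["a", "b", "c", "c"])
```

In this sketch the expanded vector gains non-zero entries for "soccer" and "sport", so a document can share features with a category even when the training data for that category never used the word "football". That is the kind of gap the abstract says the thesaurus is meant to cover.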

Author information

Authors and Affiliations

Laboratory of Intelligent Signal and Speech Processing, Faculty of Computer Engineering, Amirkabir University of Technology, Tehran, Iran
Nooshin Maghsoodi & Mohammad Mehdi Homayounpour

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Maghsoodi, N., Homayounpour, M.M. (2011). Using Thesaurus to Improve Multiclass Text Classification. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6609. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19437-5_20

DOI: https://doi.org/10.1007/978-3-642-19437-5_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19436-8
Online ISBN: 978-3-642-19437-5
eBook Packages: Computer Science, Computer Science (R0)
