Using Thesaurus to Improve Multiclass Text Classification

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2011)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6609)

Abstract

With the growing amount of textual information available on the Internet, automatic text classification has become increasingly important over the last decade. This paper presents a system for multi-class classification of Farsi documents that uses a Support Vector Machine (SVM) classifier. The new idea proposed here is to extend the feature vector with words extracted from a thesaurus. The goal is to assist the classifier when the training dataset is not comprehensive for some categories. The corpus was prepared from the Farsi Wikipedia website and articles from archived newspapers and magazines. The results indicate that this approach improves classification performance: a micro F-measure of 0.89 was achieved for the classification of Farsi texts into 10 categories.
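
The idea of extending a feature vector with thesaurus words can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the toy English thesaurus, the function name `expand_features`, and the down-weighting factor `weight` are all hypothetical, and the abstract does not say how added words are selected or weighted.

```python
from collections import Counter

# Hypothetical toy thesaurus mapping a word to related words.
# The paper's Farsi thesaurus and its lookup rules are not reproduced here.
TOY_THESAURUS = {
    "football": ["sport", "soccer"],
    "goal": ["sport"],
    "election": ["politics", "vote"],
}


def expand_features(tokens, thesaurus, weight=0.5):
    """Build a term-weight dict from raw counts plus thesaurus terms.

    Each related word receives `weight` times the count of the document
    word that triggered it. The value 0.5 is an assumed setting, not one
    taken from the paper.
    """
    counts = Counter(tokens)
    # Start from the plain bag-of-words counts.
    weights = {term: float(count) for term, count in counts.items()}
    # Add (or reinforce) the related terms found in the thesaurus.
    for term, count in counts.items():
        for related in thesaurus.get(term, []):
            weights[related] = weights.get(related, 0.0) + weight * count
    return weights


doc = ["football", "goal", "goal", "match"]
vec = expand_features(doc, TOY_THESAURUS)
print(vec)
```

In this sketch the document never contains the word "sport", yet the expanded vector gives it a non-zero weight, which is the kind of signal that might help a classifier when a category has few training documents. The resulting vectors would then be passed to an SVM in the usual way.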

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Maghsoodi, N., Homayounpour, M.M. (2011). Using Thesaurus to Improve Multiclass Text Classification. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6609. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19437-5_20

  • DOI: https://doi.org/10.1007/978-3-642-19437-5_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19436-8

  • Online ISBN: 978-3-642-19437-5

  • eBook Packages: Computer Science, Computer Science (R0)