Abstract
With the growing amount of textual information available on the Internet, automatic text classification has become increasingly important over the last decade. This paper presents a system for multi-class classification of Farsi documents that uses a Support Vector Machine (SVM) classifier. The new idea proposed here is to extend the feature vector by adding words extracted from a thesaurus. The goal is to assist the classifier when the training dataset is not comprehensive for some categories. The corpus was prepared from the Farsi Wikipedia website and from articles of archived newspapers and magazines. The results indicate that classification efficiency improves with this approach: a micro-averaged F-measure of 0.89 was achieved for the classification of Farsi texts into 10 categories.
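To make the two technical ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a bag-of-words representation, a toy English thesaurus (`TOY_THESAURUS`), and a down-weighting factor (`weight`) for thesaurus words; all three are hypothetical choices that the abstract does not specify. The second function computes the micro-averaged F-measure by pooling true positives, false positives, and false negatives over all categories.

```python
from collections import Counter

# Hypothetical toy thesaurus mapping a word to related words.
# Illustrative English placeholders only, not the Farsi thesaurus of the paper.
TOY_THESAURUS = {
    "football": ["sport", "match"],
    "election": ["vote", "politics"],
}


def expand_with_thesaurus(tokens, thesaurus, weight=0.5):
    """Return term weights: raw counts plus down-weighted thesaurus words.

    The weighting scheme is an assumption made for this sketch.
    """
    counts = Counter(tokens)
    features = dict(counts)
    for token, count in counts.items():
        for related in thesaurus.get(token, []):
            features[related] = features.get(related, 0.0) + weight * count
    return features


def micro_f1(y_true, y_pred):
    """Micro-averaged F-measure for single-label multi-class predictions."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1  # correctly assigned to this category
            elif pred == label:
                fp += 1  # wrongly assigned to this category
            elif truth == label:
                fn += 1  # missed member of this category
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    doc = ["football", "football", "goal"]
    print(expand_with_thesaurus(doc, TOY_THESAURUS))
    print(micro_f1(["a", "b", "c", "a"], ["a", "b", "a", "a"]))
```

In this sketch, a document that mentions "football" also receives non-zero weight for "sport" and "match", which is one way a classifier could be helped when a category has few training documents. For single-label classification the micro-averaged F-measure coincides with accuracy, since every misclassified document counts once as a false positive and once as a false negative.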
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Maghsoodi, N., Homayounpour, M.M. (2011). Using Thesaurus to Improve Multiclass Text Classification. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6609. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19437-5_20
DOI: https://doi.org/10.1007/978-3-642-19437-5_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19436-8
Online ISBN: 978-3-642-19437-5
eBook Packages: Computer Science (R0)