Topic Classification Problem Solving for Morphologically Complex Languages

  • Conference paper
Information and Software Technologies (ICIST 2016)

Abstract

In this paper we present a topic classification task for the morphologically complex Lithuanian and Russian languages, solved with popular supervised machine learning techniques. We experimentally investigated two text classification methods and a wide variety of feature types covering different levels of abstraction: character, lexical, and morpho-syntactic. To obtain comparable results for both languages, we kept the experimental conditions as similar as possible: the datasets were composed of normative texts taken from news portals, contained similar topics, and had the same number of texts in each topic.

The best results (an accuracy of ~0.86) were achieved with the Support Vector Machine method and token lemmas as the feature representation type. The character feature type, which captures relevant patterns of the complex inflectional morphology without any external morphological tools, was the second best. Since these findings hold for both Lithuanian and Russian, we assume that they should also hold for the entire group of Baltic and Slavic languages.
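To illustrate why character-level features can stand in for a morphological analyzer, the following minimal Python sketch counts character n-grams inside tokens and shows that two inflected forms of one Lithuanian noun share most of their n-grams. The function names, the choice of n = 4, and the example word are assumptions made for illustration only; the abstract does not state the n-gram lengths or the exact feature extraction setup used in the experiments.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 4) -> Counter:
    """Count overlapping character n-grams inside each whitespace-separated token."""
    counts = Counter()
    for token in text.lower().split():
        # Tokens shorter than n contribute no n-grams.
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    return counts


def shared_ngrams(a: str, b: str, n: int = 4) -> set:
    """Return the set of n-grams that occur in both texts."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))


# Two inflected forms of the Lithuanian noun "universitetas" (university).
# As whole tokens they are two unrelated features; as character n-grams
# they overlap on every n-gram that lies inside the common stem.
nominative = "universitetas"
genitive = "universiteto"
common = shared_ngrams(nominative, genitive)
```

Here `common` contains the stem n-grams such as `univ` and `itet`, while the ending-specific n-grams (`teta`, `etas`, `teto`) remain distinct. A classifier trained on such counts therefore sees related evidence for both word forms without any lemmatizer.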

Notes

  1. http://tekstynas.vdu.lt/page.xhtml?id=morphological-annotator.

  2. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.

  3. http://www.cs.waikato.ac.nz/ml/weka/.

Acknowledgments

This research is funded by ESFA (DADA, VP1-3.1-ŠMM-10-V-02-025).

Author information

Corresponding author

Correspondence to Jurgita Kapočiūtė-Dzikienė.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Kapočiūtė-Dzikienė, J., Krilavičius, T. (2016). Topic Classification Problem Solving for Morphologically Complex Languages. In: Dregvaite, G., Damasevicius, R. (eds) Information and Software Technologies. ICIST 2016. Communications in Computer and Information Science, vol 639. Springer, Cham. https://doi.org/10.1007/978-3-319-46254-7_41

  • DOI: https://doi.org/10.1007/978-3-319-46254-7_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46253-0

  • Online ISBN: 978-3-319-46254-7

  • eBook Packages: Computer Science (R0)
