Categorization of Multilingual Scientific Documents by a Compound Classification System

  • Jarosław Protasiewicz
  • Marcin Mirończuk
  • Sławomir Dadas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10246)

Abstract

The aim of this study was to propose a classification method for documents that simultaneously contain text parts in several languages. For this purpose, we constructed a three-level classification system. On the first level, a data processing module prepares a suitable vector space model. Next, in the middle tier, a set of monolingual or multilingual classifiers assigns to each document, or to its parts, the probabilities of belonging to each possible category. These models are trained using the Multinomial Naïve Bayes and Long Short-Term Memory algorithms. Finally, in the last component, a multilingual decision module assigns a target class to each document. The module is built on a logistic regression classifier that takes as inputs the probabilities produced by the middle-tier classifiers. The system has been verified experimentally. The reported results suggest that the proposed system can handle textual documents whose content is written in several languages at the same time. Therefore, the system can be useful for automatically organizing multilingual publications or other documents.
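
The three levels described in the abstract can be illustrated with a minimal sketch, assuming scikit-learn and NumPy are available. It is not the authors' implementation: the toy bilingual corpus, the labels, and all variable and function names are hypothetical, only Multinomial Naïve Bayes is shown in the middle tier (the Long Short-Term Memory classifiers are omitted), and the decision module is fitted on the same documents as the base classifiers purely for brevity.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: every document has an English and a Polish part.
docs = [
    {"en": "neural network training", "pl": "uczenie sieci neuronowej"},
    {"en": "deep learning model", "pl": "model głębokiego uczenia"},
    {"en": "protein cell biology", "pl": "biologia komórki i białka"},
    {"en": "gene expression in cells", "pl": "ekspresja genów w komórkach"},
]
labels = ["cs", "cs", "bio", "bio"]
languages = ["en", "pl"]

# Levels 1 and 2: one bag-of-words vector space model and one
# monolingual Multinomial Naive Bayes classifier per language.
vectorizers, classifiers = {}, {}
for lang in languages:
    texts = [d[lang] for d in docs]
    vectorizers[lang] = CountVectorizer()
    features = vectorizers[lang].fit_transform(texts)
    classifiers[lang] = MultinomialNB().fit(features, labels)


def class_probabilities(documents):
    """Stack the per-language class probabilities of each document side by side."""
    blocks = []
    for lang in languages:
        features = vectorizers[lang].transform([d[lang] for d in documents])
        blocks.append(classifiers[lang].predict_proba(features))
    return np.hstack(blocks)


# Level 3: a logistic regression decision module whose inputs are the
# probabilities produced by the monolingual classifiers.
meta_features = class_probabilities(docs)
decision = LogisticRegression().fit(meta_features, labels)

# Classify an unseen document that contains text in both languages.
new_docs = [{"en": "neural network model", "pl": "model sieci neuronowej"}]
prediction = decision.predict(class_probabilities(new_docs))
print(prediction[0])
```

With two languages and two categories, each document is described to the decision module by four probabilities. In a realistic setting the decision module would be trained on probabilities obtained from held-out data, so that it does not learn from over-confident predictions on documents the base classifiers have already seen.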

Keywords

Multilingual text classification, Compound classification system, Multinomial Naïve Bayes, Long Short-Term Memory

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Jarosław Protasiewicz (1)
  • Marcin Mirończuk (1)
  • Sławomir Dadas (1)
  1. National Information Processing Institute, Warsaw, Poland
