Categorization of Multilingual Scientific Documents by a Compound Classification System

Part of the Lecture Notes in Computer Science book series (LNAI, volume 10246)

Abstract

The aim of this study was to propose a classification method for documents that simultaneously contain text in several languages. For this purpose, we constructed a three-level classification system. On the first level, a data processing module prepares a suitable vector space model. Next, in the middle tier, a set of monolingual or multilingual classifiers estimates the probability that each document, or each of its parts, belongs to each possible category. These models are trained using the Multinomial Naïve Bayes and Long Short-Term Memory algorithms. Finally, in the last component, a multilingual decision module assigns a target class to each document. The module is built on a logistic regression classifier, which takes as input the probabilities produced by the middle-tier classifiers. The system has been verified experimentally. The reported results suggest that the proposed system can handle textual documents whose content is written in several languages at once. The system can therefore be useful for automatically organizing multilingual publications and other documents.
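The three levels described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy documents, the two languages, the two categories ("cs" and "bio"), and every name in the code (`TinyMultinomialNB`, `fit_logreg`, `features`, `classify`) are assumptions made for illustration. The sketch uses only a bag-of-words representation and Multinomial Naïve Bayes in the middle tier (the Long Short-Term Memory models are omitted), and the decision module is a hand-written binary logistic regression fed with the per-language class probabilities.

```python
import math
from collections import Counter

def tokenize(text):
    # Level 1 (simplified): lowercase bag-of-words tokens.
    return text.lower().split()

class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes with Laplace smoothing."""
    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.vocab = {w for t in texts for w in tokenize(t)}
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        return self

    def predict_proba(self, text):
        # An empty part (language missing) falls back to the class priors.
        logs = []
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c])
            for w in tokenize(text):
                if w in self.vocab:  # ignore words unseen in training
                    score += math.log((self.counts[c][w] + 1) / total)
            logs.append(score)
        top = max(logs)
        exps = [math.exp(s - top) for s in logs]
        return [e / sum(exps) for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logreg(X, y, lr=0.5, epochs=1000):
    # Level 3: binary logistic regression trained by stochastic gradient descent.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Hypothetical toy corpus: each document has an English and/or a Polish part.
LANGS = ["en", "pl"]
docs = [
    {"en": "neural network training", "pl": "uczenie sieci neuronowych"},
    {"en": "protein cell genome", "pl": "białko komórka genom"},
    {"en": "algorithm graph complexity", "pl": ""},
    {"en": "", "pl": "gen komórka enzym"},
]
labels = ["cs", "bio", "cs", "bio"]

# Level 2: one monolingual classifier per language, trained on non-empty parts.
models = {}
for lang in LANGS:
    pairs = [(d[lang], y) for d, y in zip(docs, labels) if d[lang]]
    models[lang] = TinyMultinomialNB().fit([p[0] for p in pairs],
                                           [p[1] for p in pairs])

def features(doc):
    # Concatenate the class probabilities produced by every language model.
    return [p for lang in LANGS
            for p in models[lang].predict_proba(doc.get(lang, ""))]

w, b = fit_logreg([features(d) for d in docs],
                  [1 if y == "cs" else 0 for y in labels])

def classify(doc):
    z = sum(wi * xi for wi, xi in zip(w, features(doc))) + b
    return "cs" if sigmoid(z) >= 0.5 else "bio"

print([classify(d) for d in docs])
```

Feeding probabilities rather than raw text to the final classifier lets the decision module weigh each language's evidence, so a document with a missing language part can still be classified from the remaining parts. A real system would need a multi-class decision module and far more training data than this toy corpus.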

Keywords

  • Multilingual text classification
  • Compound classification system
  • Multinomial Naïve Bayes
  • Long Short-Term Memory

Author information

Corresponding author

Correspondence to Jarosław Protasiewicz.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Protasiewicz, J., Mirończuk, M., Dadas, S. (2017). Categorization of Multilingual Scientific Documents by a Compound Classification System. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2017. Lecture Notes in Computer Science, vol 10246. Springer, Cham. https://doi.org/10.1007/978-3-319-59060-8_51

  • DOI: https://doi.org/10.1007/978-3-319-59060-8_51

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59059-2

  • Online ISBN: 978-3-319-59060-8

  • eBook Packages: Computer Science, Computer Science (R0)