KEMMRL: Knowledge Extraction Model for Morphologically Rich Languages

  • Conference paper
Augmented Intelligence and Intelligent Tutoring Systems (ITS 2023)

Abstract

There is growing interest in automatic text processing and knowledge extraction from text repositories, which often requires building new language resources and technologies. We present KEMMRL, a model designed for Croatian, an under-resourced but morphologically rich language. The model combines natural language processing techniques, state-of-the-art deep learning algorithms and a rule-based approach to generate knowledge representations. The output of the newly developed HRtagger and HRparser methods, in combination with the KEMMRL model, is knowledge represented as an ordered recursive hypergraph. Because the performance of KEMMRL depends heavily on the applied deep learning methods, we evaluated them on the hr500k reference corpus, used in the training and testing phases, and on the manually designed out-of-domain Semantic Hypergraph Corpus (SemCro). On standard evaluation metrics, HRtagger and HRparser achieved significantly better results than other state-of-the-art methods. These methods also gave the best results for the structural similarity of hypergraphs, with the highest average similarity to the manually annotated semantic hypergraphs and the highest number of semantic hyperedges correctly annotated by the model. The semantic hypergraph proved to be an ideal structure for capturing and representing knowledge from more complex sentences without information loss. Researchers and developers working with similar morphologically rich languages can customize and extend KEMMRL to their requirements. This article highlights the potential benefits of integrating the KEMMRL model into an Intelligent Tutoring System (ITS); future research may focus on developing and testing such implementations.
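
To make the term "ordered recursive hypergraph" in the abstract concrete, the following is a minimal Python sketch, not the KEMMRL implementation. It assumes a hyperedge can be modelled as a tuple whose members are either atoms (strings) or further hyperedges. The helper names `atoms`, `depth` and `to_text`, and the Croatian example edge for "Zagreb je glavni grad" ("Zagreb is the capital city"), are hypothetical and chosen only for illustration.

```python
# Assumption: an atom is a plain string, and a hyperedge is a tuple whose
# members are atoms or other hyperedges. Tuples keep their member order
# ("ordered") and may nest to any depth ("recursive").

def atoms(edge):
    """Return all atoms of an edge in left-to-right order."""
    if isinstance(edge, str):
        return [edge]
    result = []
    for member in edge:
        result.extend(atoms(member))
    return result


def depth(edge):
    """Nesting depth: 0 for an atom, 1 for a flat hyperedge, and so on."""
    if isinstance(edge, str):
        return 0
    return 1 + max((depth(member) for member in edge), default=0)


def to_text(edge):
    """Render an edge in a parenthesised, Lisp-like notation."""
    if isinstance(edge, str):
        return edge
    return "(" + " ".join(to_text(member) for member in edge) + ")"


# Hypothetical edge: a predicate followed by its arguments, where the
# second argument is itself a hyperedge.
example = ("je", "Zagreb", ("glavni", "grad"))

print(to_text(example))
print(atoms(example))
print(depth(example))
```

Because tuples compare position by position, reordering the members yields a different hyperedge, which is the property the word "ordered" refers to in this sketch.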

Acknowledgements

The paper is part of the work supported by the Office of Naval Research Grant No. N00014-20-1-2066.

Author information

Corresponding author

Correspondence to Daniel Vasić.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Vasić, D., Žitko, B., Grubišić, A., Gašpar, A. (2023). KEMMRL: Knowledge Extraction Model for Morphologically Rich Languages. In: Frasson, C., Mylonas, P., Troussas, C. (eds) Augmented Intelligence and Intelligent Tutoring Systems. ITS 2023. Lecture Notes in Computer Science, vol 13891. Springer, Cham. https://doi.org/10.1007/978-3-031-32883-1_19

  • DOI: https://doi.org/10.1007/978-3-031-32883-1_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-32882-4

  • Online ISBN: 978-3-031-32883-1

  • eBook Packages: Computer Science, Computer Science (R0)
