KEMMRL: Knowledge Extraction Model for Morphologically Rich Languages

Vasić, Daniel; Žitko, Branko; Grubišić, Ani; Gašpar, Angelina

doi:10.1007/978-3-031-32883-1_19

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13891)

Included in the following conference series:

International Conference on Intelligent Tutoring Systems

Abstract

There is growing interest in automatic text processing and knowledge extraction from text repositories, which often requires building new language resources and technologies. We present KEMMRL, a model designed for Croatian, an under-resourced but morphologically rich language. The proposed model combines natural language processing techniques, state-of-the-art deep learning algorithms, and a rule-based approach to generate knowledge representations. The newly developed HRtagger and HRparser methods, in combination with the KEMMRL model, output knowledge represented as an ordered recursive hypergraph. Since the performance of KEMMRL depends heavily on the applied deep learning methods, we evaluated them using the hr500k reference corpus for training and testing and a manually designed out-of-domain Semantic Hypergraph Corpus (SemCro). On standard evaluation metrics, HRtagger and HRparser achieved significantly better results than other state-of-the-art methods. These methods also showed the best results on three further measures: the structural similarity of hypergraphs, the average similarity to the manually annotated semantic hypergraphs, and the number of semantic hyperedges correctly annotated by the model. The semantic hypergraph proved to be an ideal structure for capturing and representing knowledge from more complex sentences without information loss. Researchers and developers working with similar morphologically rich languages can customize and extend KEMMRL to their requirements. This article highlights the potential benefits of implementing the KEMMRL model in an Intelligent Tutoring System (ITS); future research may focus on developing and testing such implementations.
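To make the notion of an ordered recursive hypergraph more concrete, the following is a minimal sketch in Python. It is an illustration only and not the authors' implementation: the nested-tuple encoding, the helper names `atoms` and `depth`, and the example sentence are all assumptions introduced here. A hyperedge is taken to be either an atom (a word) or an ordered tuple of hyperedges, so that hyperedges can contain other hyperedges and the order of their elements carries meaning.

```python
from typing import Tuple, Union

# Assumed encoding (not from the paper): a hyperedge is either an atom
# (a string) or an ordered tuple whose elements are themselves hyperedges.
Hyperedge = Union[str, Tuple["Hyperedge", ...]]


def atoms(edge):
    """Return the atoms of a hyperedge in left-to-right order."""
    if isinstance(edge, str):
        return [edge]
    result = []
    for child in edge:
        result.extend(atoms(child))
    return result


def depth(edge):
    """Nesting depth: 0 for an atom, 1 + the deepest child otherwise."""
    if isinstance(edge, str):
        return 0
    return 1 + max((depth(child) for child in edge), default=0)


# Hypothetical example for "Ana čita dugu knjigu" ("Ana reads a long book"):
# the predicate comes first, followed by its arguments; the object is itself
# a nested hyperedge (recursion), and argument order is significant.
sentence_edge = ("čita", "Ana", ("dugu", "knjigu"))

print(atoms(sentence_edge))
print(depth(sentence_edge))
```

Because tuples are ordered, swapping the two arguments of the outer hyperedge yields a different structure, which is the property the word "ordered" refers to in this sketch.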

Acknowledgements

The paper is part of the work supported by the Office of Naval Research, Grant No. N00014-20-1-2066.

Author information

Authors and Affiliations

Faculty of Science and Education, University of Mostar, Matice hrvatske b.b., Mostar, 88000, Bosnia and Herzegovina
Daniel Vasić
Faculty of Science, University of Split, Ruđera Boškovića 33, 21000, Split, Croatia
Branko Žitko, Ani Grubišić & Angelina Gašpar
Catholic Faculty of Theology, University of Split, Zrinsko-Frankopanska 19, 21000, Split, Croatia
Daniel Vasić, Branko Žitko, Ani Grubišić & Angelina Gašpar

Corresponding author

Correspondence to Daniel Vasić.

Editor information

Editors and Affiliations

University of Montreal, Montreal, Canada
Claude Frasson
University of West Attica, Athens, Greece
Phivos Mylonas
University of West Attica, Athens, Greece
Christos Troussas

About this paper

Cite this paper

Vasić, D., Žitko, B., Grubišić, A., Gašpar, A. (2023). KEMMRL: Knowledge Extraction Model for Morphologically Rich Languages. In: Frasson, C., Mylonas, P., Troussas, C. (eds) Augmented Intelligence and Intelligent Tutoring Systems. ITS 2023. Lecture Notes in Computer Science, vol 13891. Springer, Cham. https://doi.org/10.1007/978-3-031-32883-1_19

DOI: https://doi.org/10.1007/978-3-031-32883-1_19
Published: 22 May 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-32882-4
Online ISBN: 978-3-031-32883-1
eBook Packages: Computer Science, Computer Science (R0)
