Abstract
There is growing interest in automatic text processing and knowledge extraction from text repositories, which often requires building new language resources and technologies. We present the KEMMRL model, designed for Croatian, an under-resourced but morphologically rich language. The proposed model combines natural language processing techniques, state-of-the-art deep learning algorithms, and a rule-based approach to generate knowledge representations. The output of the newly developed HRtagger and HRparser methods, in combination with the KEMMRL model, is knowledge represented in the form of an ordered recursive hypergraph. Since the performance of KEMMRL is highly dependent on the applied deep learning methods, we evaluated them using the hr500k reference corpus in the training and testing phases and a manually designed out-of-domain Semantic Hypergraph Corpus (SemCro). On standard evaluation metrics, HRtagger and HRparser achieved significantly better results than other state-of-the-art methods. These methods also showed the best results in the structural similarity of hypergraphs, in the average similarity to the manually annotated semantic hypergraphs, and in the number of semantic hyperedges correctly annotated by the model. The semantic hypergraph proved to be an ideal structure for capturing and representing knowledge from more complex sentences without information loss. Researchers and developers working on similar morphologically rich languages can customize and extend KEMMRL to their requirements. This article highlights the potential benefits of implementing the KEMMRL model in an Intelligent Tutoring System (ITS); future research may focus on developing and testing such implementations.
Keywords
- Knowledge extraction
- natural language processing
- deep learning techniques
- morphologically rich languages
- knowledge representation
- semantic hypergraph
Acknowledgements
This paper is part of the work supported by the Office of Naval Research, Grant No. N00014-20-1-2066.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Vasić, D., Žitko, B., Grubišić, A., Gašpar, A. (2023). KEMMRL: Knowledge Extraction Model for Morphologically Rich Languages. In: Frasson, C., Mylonas, P., Troussas, C. (eds) Augmented Intelligence and Intelligent Tutoring Systems. ITS 2023. Lecture Notes in Computer Science, vol 13891. Springer, Cham. https://doi.org/10.1007/978-3-031-32883-1_19
DOI: https://doi.org/10.1007/978-3-031-32883-1_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-32882-4
Online ISBN: 978-3-031-32883-1
eBook Packages: Computer Science, Computer Science (R0)