Abstract
Wikipedia articles are hierarchically organized through categories and lists, providing one of the most comprehensive and universal taxonomies, but their open creation causes redundancies and inconsistencies. Assigning DBpedia classes to Wikipedia categories and lists can alleviate the problem, yielding a large knowledge graph that is essential for categorizing digital content through entity linking and typing. However, the existing approach of CaLiGraph produces incomplete and insufficiently fine-grained mappings. In this paper, we treat the problem as ontology alignment: structural information of knowledge graphs and lexical and semantic features of ontology class names are used to discover confident mappings, which are in turn used to fine-tune pre-trained language models in a distant supervision fashion. Our method SLHCat consists of two main parts: 1) automatically generating training data by leveraging knowledge graph structure, semantic similarities, and named entity typing; 2) fine-tuning and prompt-tuning the pre-trained language model BERT over the training data to capture semantic and syntactic properties of class names. SLHCat is evaluated on a benchmark dataset constructed by annotating 3000 fine-grained CaLiGraph-DBpedia mapping pairs. It outperforms the baseline model by a large margin of 25% in accuracy, offering a practical solution for large-scale ontology mapping.
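The first part of the method, discovering confident mappings that later serve as distantly supervised training data, can be illustrated with a minimal sketch. This is not the SLHCat implementation: the toy class hierarchy `PARENT`, the helper names, and the token-containment rule are all assumptions made for illustration, and the real method additionally uses semantic similarity and named entity typing.

```python
import re

# Hypothetical toy ontology: class name -> parent class (illustrative subset only).
PARENT = {
    "Person": None,
    "Artist": "Person",
    "MusicalArtist": "Artist",
    "Place": None,
    "City": "Place",
}

def singular(tok):
    """Crude singularization, sufficient for this toy example."""
    if tok.endswith("ies"):
        return tok[:-3] + "y"
    if tok.endswith("s") and len(tok) > 3:
        return tok[:-1]
    return tok

def tokens(name):
    """Split CamelCase or space-separated names into a set of lowercase singular tokens."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name.replace("_", " "))
    return {singular(t.lower()) for t in spaced.split()}

def depth(cls):
    """Number of ancestors of a class in the toy hierarchy."""
    d = 0
    while PARENT[cls] is not None:
        cls = PARENT[cls]
        d += 1
    return d

def confident_mapping(category):
    """Return the deepest class whose name tokens all occur in the category name.

    Returns None when no class matches lexically; such categories would be
    left for the fine-tuned model to classify.
    """
    cat_tokens = tokens(category)
    matches = [c for c in PARENT if tokens(c) <= cat_tokens]
    return max(matches, key=depth) if matches else None

categories = ["American musical artists", "Cities in Japan", "Rivers of Europe"]
# Only confidently mapped categories become (category, class) training pairs.
training_pairs = [(c, confident_mapping(c)) for c in categories if confident_mapping(c)]
```

Preferring the deepest matching class is one simple way to favor fine-grained mappings: "American musical artists" matches both `Artist` and `MusicalArtist`, and the more specific class is kept.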
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, Z., Zhang, Z., Qin, J., Iwaihara, M. (2023). SLHCat: Mapping Wikipedia Categories and Lists to DBpedia by Leveraging Semantic, Lexical, and Hierarchical Features. In: Goh, D.H., Chen, SJ., Tuarob, S. (eds) Leveraging Generative Intelligence in Digital Libraries: Towards Human-Machine Collaboration. ICADL 2023. Lecture Notes in Computer Science, vol 14457. Springer, Singapore. https://doi.org/10.1007/978-981-99-8085-7_12
DOI: https://doi.org/10.1007/978-981-99-8085-7_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8084-0
Online ISBN: 978-981-99-8085-7
eBook Packages: Computer Science, Computer Science (R0)