SLHCat: Mapping Wikipedia Categories and Lists to DBpedia by Leveraging Semantic, Lexical, and Hierarchical Features

Wang, Zhaoyi; Zhang, Zhenyang; Qin, Jiaxin; Iwaihara, Mizuho

doi:10.1007/978-981-99-8085-7_12

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14457)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

Wikipedia articles are hierarchically organized through categories and lists, providing one of the most comprehensive and universal taxonomies, but its open creation causes redundancies and inconsistencies. Assigning DBpedia classes to Wikipedia categories and lists can alleviate the problem, yielding a large knowledge graph that is essential for categorizing digital content through entity linking and typing. However, the existing approach, CaLiGraph, produces mappings that are incomplete and insufficiently fine-grained. In this paper, we treat the problem as ontology alignment, where structural information of knowledge graphs and lexical and semantic features of ontology class names are used to discover confident mappings, which are in turn used to fine-tune pre-trained language models in a distant supervision fashion. Our method, SLHCat, consists of two main parts: 1) automatically generating training data by leveraging knowledge graph structure, semantic similarities, and named entity typing; 2) fine-tuning and prompt-tuning the pre-trained language model BERT on the training data to capture semantic and syntactic properties of class names. SLHCat is evaluated on a benchmark dataset constructed by annotating 3000 fine-grained CaLiGraph-DBpedia mapping pairs. It outperforms the baseline model by a large margin of 25% in accuracy, offering a practical solution for large-scale ontology mapping.
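The first stage described in the abstract, deriving confident mappings from lexical and hierarchical cues and reusing them as distant-supervision labels, can be illustrated with a small sketch. This is not the authors' implementation: the toy class hierarchy, the token-overlap score, the threshold, and the function names (`confident_mapping`, `normalize`, and so on) are all assumptions made for illustration, and the semantic-similarity and named-entity-typing signals are omitted.

```python
import re

# Hypothetical toy slice of a DBpedia-like class hierarchy (child -> parent).
PARENT = {
    "Person": "Thing",
    "Artist": "Person",
    "MusicalArtist": "Artist",
    "Place": "Thing",
    "City": "Place",
}

def ancestors(cls):
    """Return cls followed by all of its ancestors up to the root."""
    chain = [cls]
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain

def depth(cls):
    """Number of edges between cls and the root class."""
    return len(ancestors(cls)) - 1

def normalize(token):
    """Lowercase a token and apply a crude plural-stripping rule."""
    token = token.lower()
    if token.endswith("ies"):
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def confident_mapping(category_name, parent_class, threshold=1.0):
    """Pick the most specific class that (a) lies under the class already
    assigned to the parent category (hierarchical cue) and (b) has all of
    its name tokens present in the category name (lexical cue).
    Returns None when no candidate is confident enough."""
    cat_tokens = {normalize(t) for t in category_name.split()}
    best = None
    for cls in PARENT:
        if parent_class not in ancestors(cls):
            continue  # violates the hierarchy constraint
        cls_tokens = {normalize(t) for t in re.findall(r"[A-Z][a-z]*", cls)}
        score = len(cls_tokens & cat_tokens) / len(cls_tokens)
        if score >= threshold and (best is None or depth(cls) > depth(best)):
            best = cls
    return best

# Categories paired with the class assumed for their parent category.
categories = [
    ("American musical artists", "Person"),
    ("Cities in Japan", "Thing"),
    ("1990s births", "Person"),
]

# Only confident mappings become (category, class) training pairs;
# the rest would be left for the fine-tuned model to predict.
training_pairs = [
    (name, confident_mapping(name, parent))
    for name, parent in categories
    if confident_mapping(name, parent) is not None
]
print(training_pairs)
```

In this sketch, categories without a confident match (such as "1990s births") produce no training pair. Under the distant-supervision setup the abstract describes, such cases would be handled by the language model fine-tuned on the confident pairs.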

Author information

Jiaxin Qin
Present address: United Automotive Electronic Systems Co., Ltd., Beijing, China

Authors and Affiliations

Graduate School of Information, Production, and Systems, Waseda University, Kitakyushu, 808-0135, Japan
Zhaoyi Wang, Zhenyang Zhang, Jiaxin Qin & Mizuho Iwaihara

Corresponding author

Correspondence to Mizuho Iwaihara.

Editor information

Editors and Affiliations

Nanyang Technological University, Singapore, Singapore
Dion H. Goh
Academia Sinica, Taipei, Taiwan
Shu-Jiun Chen
Mahidol University, Tambon Salaya, Amphoe Phutthamonthon, Thailand
Suppawong Tuarob

About this paper

Cite this paper

Wang, Z., Zhang, Z., Qin, J., Iwaihara, M. (2023). SLHCat: Mapping Wikipedia Categories and Lists to DBpedia by Leveraging Semantic, Lexical, and Hierarchical Features. In: Goh, D.H., Chen, SJ., Tuarob, S. (eds) Leveraging Generative Intelligence in Digital Libraries: Towards Human-Machine Collaboration. ICADL 2023. Lecture Notes in Computer Science, vol 14457. Springer, Singapore. https://doi.org/10.1007/978-981-99-8085-7_12

DOI: https://doi.org/10.1007/978-981-99-8085-7_12
Published: 30 November 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8084-0
Online ISBN: 978-981-99-8085-7
eBook Packages: Computer Science (R0)