SLHCat: Mapping Wikipedia Categories and Lists to DBpedia by Leveraging Semantic, Lexical, and Hierarchical Features

  • Conference paper
Leveraging Generative Intelligence in Digital Libraries: Towards Human-Machine Collaboration (ICADL 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14457)

Abstract

Wikipedia articles are hierarchically organized through categories and lists, providing one of the most comprehensive and universal taxonomies, but its open creation process causes redundancies and inconsistencies. Assigning DBpedia classes to Wikipedia categories and lists can alleviate the problem, yielding a large knowledge graph that is essential for categorizing digital content through entity linking and typing. However, the existing approach of CaLiGraph produces incomplete and coarse-grained mappings. In this paper, we tackle the problem as ontology alignment, where structural information of knowledge graphs and lexical and semantic features of ontology class names are utilized to discover confident mappings, which are in turn used for fine-tuning pre-trained language models in a distant supervision fashion. Our method SLHCat consists of two main parts: 1) automatically generating training data by leveraging knowledge graph structure, semantic similarities, and named entity typing; 2) fine-tuning and prompt-tuning the pre-trained language model BERT on the training data to capture semantic and syntactic properties of class names. SLHCat is evaluated on a benchmark dataset constructed by annotating 3000 fine-grained CaLiGraph-DBpedia mapping pairs. SLHCat outperforms the baseline model by a large margin of 25% in accuracy, offering a practical solution for large-scale ontology mapping.
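The idea of discovering confident mappings to use as distantly supervised training data can be illustrated with a minimal sketch. This is not the SLHCat pipeline: it scores candidate pairs with token-level Jaccard overlap only, whereas the abstract describes combining knowledge graph structure, semantic similarities, and named entity typing. The toy class list, the helper names (`tokens`, `jaccard`, `confident_mappings`), and the threshold of 0.5 are all assumptions made for illustration.

```python
from __future__ import annotations

import re

# Hypothetical toy inputs; real names would come from CaLiGraph and the DBpedia ontology.
DBPEDIA_CLASSES = ["Person", "Musical Artist", "Soccer Player", "Album", "Settlement"]


def tokens(name: str) -> set[str]:
    """Lowercase a name, split it into words, and naively strip a plural 's'."""
    words = re.findall(r"[a-z]+", name.lower())
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between two token sets (0.0 if either set is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def confident_mappings(
    categories: list[str], classes: list[str], threshold: float = 0.5
) -> list[tuple[str, str, float]]:
    """Keep, for each category, its best-scoring class if the score passes the threshold.

    The retained pairs play the role of automatically generated (noisy) training
    labels; categories with no confident match are left for a learned model.
    """
    pairs = []
    for cat in categories:
        cat_tokens = tokens(cat)
        best_score, best_cls = max((jaccard(cat_tokens, tokens(c)), c) for c in classes)
        if best_score >= threshold:
            pairs.append((cat, best_cls, best_score))
    return pairs


categories = ["Japanese soccer players", "Albums", "Lists of rivers"]
training_pairs = confident_mappings(categories, DBPEDIA_CLASSES)
for cat, cls, score in training_pairs:
    print(f"{cat} -> {cls} ({score:.2f})")
```

In this sketch the category with no lexical overlap ("Lists of rivers") receives no label, which mirrors the motivation for training a language model on the confident pairs so that it can handle cases where surface matching fails.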

Author information

Corresponding author

Correspondence to Mizuho Iwaihara.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Wang, Z., Zhang, Z., Qin, J., Iwaihara, M. (2023). SLHCat: Mapping Wikipedia Categories and Lists to DBpedia by Leveraging Semantic, Lexical, and Hierarchical Features. In: Goh, D.H., Chen, SJ., Tuarob, S. (eds) Leveraging Generative Intelligence in Digital Libraries: Towards Human-Machine Collaboration. ICADL 2023. Lecture Notes in Computer Science, vol 14457. Springer, Singapore. https://doi.org/10.1007/978-981-99-8085-7_12

  • DOI: https://doi.org/10.1007/978-981-99-8085-7_12

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8084-0

  • Online ISBN: 978-981-99-8085-7

  • eBook Packages: Computer Science, Computer Science (R0)
