KGTK: A Toolkit for Large Knowledge Graph Manipulation and Analysis

  • Conference paper
The Semantic Web – ISWC 2020 (ISWC 2020)

Abstract

Knowledge graphs (KGs) have become the preferred technology for representing, sharing and adding knowledge to modern AI applications. While KGs have become a mainstream technology, the RDF/SPARQL-centric toolset for operating with them at scale is heterogeneous, difficult to integrate and only covers a subset of the operations that are commonly needed in data science applications. In this paper we present KGTK, a data science-centric toolkit designed to represent, create, transform, enhance and analyze KGs. KGTK represents graphs in tables and leverages popular libraries developed for data science applications, enabling a wide audience of developers to easily construct knowledge graph pipelines for their applications. We illustrate the framework with real-world scenarios where we have used KGTK to integrate and manipulate large KGs, such as Wikidata, DBpedia and ConceptNet.
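
The idea of representing a graph as a table can be illustrated with a minimal sketch. It is not KGTK's own API: the column names (`node1`, `label`, `node2`), the toy edges, and the helper functions `read_edges`, `filter_edges` and `out_degree` are all assumptions made for illustration, and only the Python standard library is used.

```python
import csv
import io
from collections import defaultdict

# Hypothetical edge table: one tab-separated row per edge.
# Column names and contents are illustrative assumptions, not taken from the paper.
EDGES_TSV = (
    "node1\tlabel\tnode2\n"
    "alice\tknows\tbob\n"
    "alice\tworks_at\tacme\n"
    "bob\tworks_at\tacme\n"
)


def read_edges(text):
    """Parse a tab-separated edge table into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def filter_edges(edges, label):
    """Keep only the rows whose label column matches the given label."""
    return [edge for edge in edges if edge["label"] == label]


def out_degree(edges):
    """Count outgoing edges per source node (the node1 column)."""
    counts = defaultdict(int)
    for edge in edges:
        counts[edge["node1"]] += 1
    return dict(counts)


edges = read_edges(EDGES_TSV)
employment = filter_edges(edges, "works_at")
degrees = out_degree(edges)
```

Because each edge is an ordinary table row, graph operations such as filtering by relation or computing node degrees reduce to row filters and group-by counts, which is what makes general-purpose data science tooling applicable.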

Change history

  • 01 November 2020

    In the originally published version of chapter 18 the name of Rongpeng Li was misspelled. This has been corrected.

Notes

  1. https://github.com/fhircat/CORD-19-on-FHIR/wiki/CORD-19-Semantic-Annotation-Projects

  2. https://neo4j.com

  3. https://graphy.link/

  4. https://rdflib.readthedocs.io/en/stable/

  5. https://graph-tool.skewed.de/

  6. https://networkx.github.io/

  7. https://spacy.io/

  8. https://linux.die.net/man/1/awk

  9. https://gephi.org/

  10. http://ctdbase.org/

  11. https://blender.cs.illinois.edu/

  12. A list of such projects can be found in https://github.com/fhircat/CORD-19-on-FHIR/wiki/CORD-19-Semantic-Annotation-Projects.

  13. https://github.com/nasa-jpl-cord-19/covid19-knowledge-graph, https://github.com/GillesVandewiele/COVID-KG/

  14. http://pubannotation.org/collections/CORD-19

  15. https://scisight.apps.allenai.org/clusters

  16. https://github.com/vespa-engine/cord-19/blob/master/README.md

  17. https://kgtk.readthedocs.io/en/latest/specification/

  18. https://kgtk.readthedocs.io/en/latest

  19. https://linux.die.net/man/7/pipe

  20. https://github.com/usc-isi-i2/kgtk/tree/master/examples

  21. https://pandas.pydata.org

  22. https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2

  23. https://github.com/usc-isi-i2/CKG-COVID-19/blob/dev/build-covid-kg.ipynb

  24. https://github.com/usc-isi-i2/kgtk/blob/master/examples/CSKG.ipynb

  25. https://datamart-upload.readthedocs.io/en/latest/REST-API-tutorial/

  26. https://github.com/NCATS-Tangerine/kgx

  27. https://tools.wmflabs.org/sqid/

Acknowledgements

This material is based on research sponsored by Air Force Research Laboratory under agreement number FA8750-20-2-10002. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory or the U.S. Government.

Author information

Corresponding author

Correspondence to Filip Ilievski.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Ilievski, F. et al. (2020). KGTK: A Toolkit for Large Knowledge Graph Manipulation and Analysis. In: Pan, J.Z., et al. The Semantic Web – ISWC 2020. ISWC 2020. Lecture Notes in Computer Science, vol 12507. Springer, Cham. https://doi.org/10.1007/978-3-030-62466-8_18

  • DOI: https://doi.org/10.1007/978-3-030-62466-8_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-62465-1

  • Online ISBN: 978-3-030-62466-8

  • eBook Packages: Computer Science, Computer Science (R0)
