LLM-assisted Knowledge Graph Engineering: Experiments with ChatGPT

Knowledge Graphs (KG) provide us with a structured, flexible, transparent, cross-system, and collaborative way of organizing our knowledge and data across various domains of society, industry, and science. Among forms of knowledge representation, KGs stand out in terms of effectiveness. However, Knowledge Graph Engineering (KGE) requires in-depth experience with graph structures, web technologies, existing models and vocabularies, rule sets, logic, and best practices. It also demands a significant amount of work. Considering the advancements in large language models (LLMs) and their interfaces and applications in recent years, we have conducted comprehensive experiments with ChatGPT to explore its potential in supporting KGE. In this paper, we present a selection of these experiments and their results to demonstrate how ChatGPT can assist us in the development and management of KGs.


Introduction
In recent years, Artificial Intelligence (AI) has shown great promise in improving or revolutionizing various fields of research and practice, including knowledge engineering. The recent big leap in AI-based assistant chatbots, like ChatGPT (based on the Generative Pre-trained Transformer (GPT) model), has created new opportunities to automate knowledge engineering tasks and reduce the workload on human experts. With the growing volume of information in different fields, the need for scalable and efficient methods to manage and extract knowledge from data that also adapt to new sources is critical. Despite the advances in research w.r.t. (semi-)automation, knowledge engineering tasks still rely vastly on human experts. On the one hand, this process can be time-consuming, resource-intensive, and susceptible to errors. On the other hand, the reliance on human expertise in knowledge engineering exposes it to workforce shortages (as knowledge engineers are scarce and the demand is growing) and the risk of expertise loss. These factors can impact the resilience and sustainability of systems and operations that rely on knowledge engineering. AI-based assistant bot approaches, such as ChatGPT, could bridge this gap by providing a unified tool for knowledge engineering tasks, reducing the workload of knowledge engineers themselves, but also making knowledge engineering more accessible to a broader audience. ChatGPT, in particular, has shown promise in generating responses in a variety of syntactic representations (including code and markup languages) to user queries or task descriptions written in natural language.
In this paper, we discuss and investigate the potential of ChatGPT to support or automate various knowledge engineering tasks (e.g. ontology generation, SPARQL query generation). We explore the benefits, pitfalls, and challenges of using it and identify potential avenues for future research.

Related Work
ChatGPT, a Large Language Model (LLM) published by OpenAI, has raised interest in the broad field of Machine Learning (ML) and especially in LLMs [4] on a broad scale. While there are current discussions and analyses of the capabilities of LLMs like ChatGPT in general (e.g. [1]), there is little in the area of knowledge graph engineering. Ekaputra et al. [3] give a general overview of current research on the combination of the broad field of ML and the semantic web.
Searching Google Scholar and Semantic Scholar for "knowledge graph ChatGPT", "ontology ChatGPT", and "rdf ChatGPT" at the beginning of April 2023 returned only two relevant papers. The first one, [7], reviews the differences between conversational AI models, prominently ChatGPT, and state-of-the-art question-answering systems for knowledge graphs. In their survey and experiments, they detect capabilities of the frameworks they used but highlight ChatGPT's explainability and robustness. The second one, [6], discusses the usage of ChatGPT for database management tasks when a tabular schema is expressed in natural language. They conclude, among other things, that ChatGPT is able to assist in complex semantic integration and table joins to simplify database management and enhance productivity. The applied approaches and results of these two papers indicate that the idea of using LLMs like ChatGPT in the field of KG engineering is encouraging and that LLMs might assist KG engineers in their workflows. Still, research on the usage of LLMs for knowledge graph engineers is scarce and seems to be a new research area.
There exist some non- and semi-scientific resources which render the topic from a practical and experience-based perspective. We want to highlight here a helpful blog post by Kurt Cagle [2] on ChatGPT for "knowledge graph workers" and a blog post by Konrad Kaliciński [5] on knowledge graph generation in Neo4j assisted by ChatGPT.

LLM-Assisted Knowledge Graph Engineering: Potential Application Areas
In discussion rounds with knowledge graph engineering experts, we identified a preliminary list of potential use cases for LLM assistance in the domain of knowledge graph engineering, among them assistance in knowledge graph usage, e.g. generating SPARQL queries from natural language questions (related experiment in Section 4). Given the limited space of this paper, we evaluate a subset of the application areas with experiments in the following section.

Experiments
To evaluate the capabilities of LLMs for assisting with knowledge graph engineering, using ChatGPT as an example, we present several experiments and their results. Further details about them are given in the Supplemental Online Resources. Most experiments were conducted with ChatGPT based on the LLM GPT-3.5-turbo (named ChatGPT-3 from here on), some additionally with ChatGPT based on the LLM GPT-4 (named ChatGPT-4 from here on).

SPARQL Query Generation for a Custom Small Knowledge Graph
For a first evaluation, we designed a small knowledge graph as shown in Listing 1. Specifically, we wanted to know whether GPT can (1) explain connections between indirectly related entities, (2) create SPARQL queries over the given model, and (3) reconstruct the model if all properties and classes are relabelled.
Listing 1: An organizational KG with two people working in different departments of the same organization.
We issued the following prompt, which includes the knowledge graph from Listing 1, on ChatGPT-3 and ChatGPT-4:

Prompt 1: Given the RDF/Turtle model below, are there any connections between US and UK? <rdf-model>

In the knowledge graph of Listing 1, there is a connection between the two countries via the two people living in them, who got jobs in different departments of the same company. While ChatGPT-3 fails to identify this relation, ChatGPT-4 successfully identifies it in all cases.
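To make the discussion self-contained, the following Turtle fragment sketches the kind of organizational model described above. It is an illustrative reconstruction, not the verbatim Listing 1; the entity names and the exact property choices are assumptions based on the vocabularies (foaf, vcard, org) discussed later in this section:

```turtle
@prefix eg:    <http://example.org/> .
@prefix foaf:  <http://xmlns.com/foaf/0.1/> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
@prefix org:   <http://www.w3.org/ns/org#> .

# Two people living in different countries ...
eg:alice a foaf:Person ;
    foaf:firstName "Alice" ;
    vcard:hasAddress [ vcard:country-name "US" ] ;
    org:memberOf eg:salesDept .

eg:bob a foaf:Person ;
    foaf:firstName "Bob" ;
    vcard:hasAddress [ vcard:country-name "UK" ] ;
    org:memberOf eg:marketingDept .

# ... who work in different departments of the same organization.
eg:salesDept a org:OrganizationalUnit ;
    org:unitOf eg:acme .

eg:marketingDept a org:OrganizationalUnit ;
    org:unitOf eg:acme .

eg:acme a org:Organization ;
    foaf:name "ACME Inc." .
```

The US-UK connection from Prompt 1 is only implicit in such a model: it requires traversing person, department, and organization nodes, which is exactly the multi-hop reasoning that ChatGPT-3 failed at and ChatGPT-4 handled.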
We further asked both ChatGPT models with Prompt 2 and received five SPARQL queries each, which we analysed for their syntactic correctness, plausible query structure, and result quality. The results for Prompt 2 are listed in Table 1 and show that both models produce syntactically correct queries, which in most cases are plausible and produce correct results in 3/5 (ChatGPT-3) and 2/5 (ChatGPT-4) cases.
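For orientation, a correct answer to this task (listing each person's country, company, department, and role) would resemble the following query sketch. It is hand-written rather than taken from the model outputs, and the prefixes and property names are assumptions about the vocabulary used in Listing 1:

```sparql
PREFIX foaf:  <http://xmlns.com/foaf/0.1/>
PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
PREFIX org:   <http://www.w3.org/ns/org#>
PREFIX eg:    <http://example.org/>

SELECT ?person ?country ?company ?department ?role
WHERE {
  ?person a foaf:Person ;
          vcard:hasAddress/vcard:country-name ?country ;
          org:memberOf ?department ;
          eg:role ?role .              # assumed ad-hoc role property
  ?department org:unitOf ?company .
}
```

A generated query of this shape is syntactically correct and plausible; whether it actually produces correct results still has to be checked by executing it against the graph.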
Prompt 2: Given the RDF/Turtle model below, create a SPARQL query that lists for every person the country, company and department and role. Please adhere strictly to the given model. <rdf-model>

In essence, AI-based query generation is possible and can produce valid queries. However, the process needs result validation in two dimensions: (1) validating the query itself by matching it against static information, like the available classes and properties in the graph, and (2) validating the executed query results, so that ChatGPT can generate new queries in case of empty result sets in order to find working queries in a trial & error approach.

As a last prompt on the knowledge graph from Listing 1, we created a derived RDF graph by relabelling all classes and properties with sequentially numbered IRIs in the example namespace, like eg:prop1 and eg:class2. Given the relabelled model, we tasked ChatGPT:

Prompt 3: Given the RDF/Turtle model below, please replace all properties and classes with the most likely standard ones. <rdf-model>

With ChatGPT-3, only 2/5 iterations succeeded in carrying out all substitutions. In those succeeding cases, the quality was still not as expected because of limited ontology reuse: only IRIs in the example namespace were introduced, rather than reusing the foaf, vcard, and org vocabularies. Yet, the ad-hoc properties and classes were reasonably named, such as eg:firstName, eg:countryName or eg:departmentName. In contrast, ChatGPT-4 delivered better results: all classes and properties were substituted with those from standard vocabularies; foaf, vcard, and org were correctly identified. In some iterations, ChatGPT-4 used the schema.org vocabulary instead of the org vocabulary as an alternative approach.
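To illustrate the relabelling, a fragment of the derived graph looks like the following sketch (illustrative, not the exact relabelled model; the comments indicate the kind of standard terms a perfect substitution would restore):

```turtle
@prefix eg: <http://example.org/> .

eg:alice a eg:class1 ;        # expected substitution: foaf:Person
    eg:prop1 "Alice" ;        # expected substitution: foaf:firstName
    eg:prop2 eg:salesDept .   # expected substitution: org:memberOf
```

The task thus tests whether the model can infer vocabulary semantics purely from the graph structure and the literal values, since all identifier names carry no meaning.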

Token Counts for Knowledge Graphs Schemas
After the results with the small custom knowledge graph, we wanted to check the size of some well-known knowledge graphs with respect to the token limits of LLMs.
We counted tokens for various public knowledge graphs in different serialization formats with the library tiktoken, as recommended for ChatGPT. Table 2 lists the token counts for a couple of combinations, ordered by token count. More data and information are available in the Supplemental Online Resources. The Turtle serialization seems to result in the minimal token count, but is still bigger than the similar SQL schema added for comparison. All knowledge graphs exceed the token limit of GPT-3.5, and 3 of 4 knowledge graphs listed here exceed the limit of GPT-4.

SPARQL Query Generation for the Mondial Knowledge Graph
In addition to the experiments with the small custom knowledge graph (see Section 4.1), we tested ChatGPT with the bigger Mondial knowledge graph, which has been published for decades, with the latest "main revision" in 2015. We asked ChatGPT to generate a SPARQL query for a natural language question from a SPARQL university lecture. We used the following prompt five times with ChatGPT-3 and ChatGPT-4 each:

Prompt 4: Please create a sparql query based on the mondial knowledge graph for the following question: which river has the most riparian states?
The results are documented in the Supplemental Online Resources, together with detailed comments on the given queries. Table 3 gives some statistics. In summary, all SPARQL queries given by ChatGPT were syntactically correct, but none of them worked when executed. In fact, all queries had at least one error preventing correct execution, such as referencing a wrong namespace, wrong usage of properties, or referencing undefined classes.
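For comparison, a working query for Prompt 4 has roughly the following shape. Note that the namespace IRI and the property linking rivers to countries are exactly the details the generated queries got wrong, so the names below are assumptions that must be verified against the actual Mondial schema:

```sparql
PREFIX : <http://www.semwebtech.org/mondial/10/meta#>

# Count the distinct countries each river flows through
# and keep the river with the highest count.
SELECT ?river (COUNT(DISTINCT ?country) AS ?riparianStates)
WHERE {
  ?river a :River ;
         :locatedIn ?country .
  ?country a :Country .
}
GROUP BY ?river
ORDER BY DESC(?riparianStates)
LIMIT 1
```

The aggregation pattern (GROUP BY with COUNT, ordered descending, limited to one result) is the part the models generally produced correctly; the schema-specific IRIs are where they failed.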

Knowledge Extraction from Fact Sheets
As an experiment to evaluate knowledge extraction capabilities, we used PDF fact sheets of 3D printer specifications from different additive manufacturing (AM) vendor websites. The goal is to build a KG about existing 3D printers, their types, and their capabilities. We fed plaintext excerpts (extracted via pdfplumber) from these PDFs into ChatGPT-3 and prompted it with:

Prompt 5: Convert the following $$vendor$$ 3d printer specification into a JSON LD formatted Knowledge Graph. The node for this KG should be Printer as a main node, Type of 3d printer such as FDM, SLA, and SLS, Manufacturer, Material, Applications, and Technique.
Since the fact sheets are usually formatted using a table scheme, the nature of these plain texts is that mostly the printer entity is mentioned at the beginning of the text and then further characterized in a key-value style. As a result, the text typically does not use full sentences and contains only one entity that is described in detail, but several dependent entities (like printing materials). However, the format of the key-value pairs can be noisy: key names can be separated by colons or new line feeds, or, in contrast, multiple key-value pairs can appear in the same line, which could impose a challenge. Nevertheless, ChatGPT was able to identify the key-value pairs of the evaluation document reliably. Unfortunately, out of 5 test runs for this document, it delivered 4 partial and only 1 complete JSON document. In spite of that, we summarize first insights gained from a knowledge engineering perspective (for the sake of brevity, we refer to the output documents in the experiment supplements):
- The JSON-LD output format prioritizes usage of the schema.org vocabulary in the 5 evaluation runs. This works well for well-known entities and properties (e.g. the Organization @type for the manufacturer, or the name property). However, for the AM-specific feature key names or terms like printer, ChatGPT-3 invents reasonable but non-existent property names (in the schema.org namespace) instead of accurately creating a new namespace or using a dedicated AM ontology for that purpose.
- Requesting Turtle as output format instead leads to different results. E.g. the property namespace prefix is based on the printer ID, and therefore printer descriptions are not interoperable and cannot be queried in a unified way in a joint KG.
- Successfully splitting the x, y and z values of the maximum print dimension (instead of extracting all dimensions into one string literal) works in 3 runs. Although ChatGPT-3 accurately appends the unit of measurement to all x, y, z values (which is only mentioned after the z value in the input) in those cases, this is a modelling flaw, as querying the KG will be more complex. In one run it addressed this issue by separating units into a separate unit code field.
- A similar effect was observed when it comes to modelling the dependent entities. E.g., in 4 runs, the manufacturer was modelled correctly as a separate typed entity, in 1 run as a string literal instead.

As a general conclusion of the experiment, ChatGPT-3 has overall solid skills in extracting the key-value pairs from the sheets, but the correct modelling or representation in terms of a KG varies significantly from run to run. Subsequently, none of the generated JSON documents contained sufficient information on their own, but only a subset that was modelled accurately. A question for future research is whether cherry-picking individual JSON elements from the outputs of several runs and combining them into one final document, or iteratively refining the output by giving ChatGPT generic modelling feedback (like use an ontology, or separate unit information, etc.), can be automated in a good and scalable way.
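To illustrate the target structure of Prompt 5, a hand-written JSON-LD sketch of a well-modelled result could look as follows. This is not actual model output; the printer and vendor names are placeholders, and the am: namespace stands in for the dedicated AM vocabulary that, as discussed above, ChatGPT-3 did not introduce on its own:

```json
{
  "@context": {
    "@vocab": "https://schema.org/",
    "am": "https://example.org/am#"
  },
  "@id": "am:examplePrinter3000",
  "@type": "am:Printer",
  "name": "Example Printer 3000",
  "am:printerType": "FDM",
  "manufacturer": { "@type": "Organization", "name": "Example Vendor" },
  "material": "PLA",
  "am:buildVolume": { "am:x": "300 mm", "am:y": "300 mm", "am:z": "400 mm" },
  "am:technique": "Fused Deposition Modeling"
}
```

The sketch combines reused schema.org terms (Organization, name, material) with explicitly namespaced AM-specific properties and separated x, y, z build-volume values, i.e. exactly the modelling decisions that varied between runs.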

Knowledge Graph Exploration
Experts in the field of knowledge graphs are familiar with concepts from RDF Schema (RDFS) (domain/range, subPropertyOf, subClassOf) and the Web Ontology Language (OWL) (ObjectProperty, DatatypeProperty, FunctionalProperty, ...). Often, each of these experts has their preferred tools and methods for gaining an overview of an ontology they are not yet familiar with. We asked ChatGPT-3 two different questions requesting a mermaid visualization of the most important concepts and their connections:

Prompt 6: Can you create me a visualization showing the most important classes and concepts and how they are linked for dbpedia ontology, serialized for mermaid?

Prompt 7:
Can you create me a visualization of the most common concepts of the DBpedia ontology and their connections focusing on domain and range defined in properties.
We expected a graph with at least eight nodes and their corresponding edges. The identifiers for the nodes and edges are expected to follow the Turtle or SPARQL prefix:concept notation. If the first question did not achieve the goal, we issued additional questions or demands to ChatGPT-3. The results are presented in Table 4, where we evaluated the displayed graphs against these expectations. Prompt 6 led to an answer with a hierarchical graph representation of the important classes defined in the DBpedia ontology. The diagram already met our requirements regarding the minimum number of nodes and the labelling after the first answer and can be seen in the Supplemental Online Resources.
The class hierarchy was represented by the rdfs:subClassOf relation, and the nodes were labelled in prefix notation, as were the edges. By arranging it as a tree using the subClassOf pattern, only two different properties were used for the relations (edges). The root node was of type owl:Thing; the other nodes are connected as (sub)classes from the DBpedia ontology. These were: Place, Organization, Event, Work, Species, and Person. The class Work had one more subClassOf relation to the class MusicalWork. The class Person had the most complex representation, with two more subClassOf relations leading to foaf:Person and foaf:Agent, the latter of which is a subclass of the root node (owl:Thing).
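Based on this description, the returned diagram corresponds roughly to the following mermaid serialization. This is reconstructed from the description above, not the verbatim ChatGPT output; in particular, the exact arrangement of the foaf classes is an assumption:

```mermaid
graph BT
  Place["dbo:Place"] -->|rdfs:subClassOf| Thing["owl:Thing"]
  Organisation["dbo:Organisation"] -->|rdfs:subClassOf| Thing
  Event["dbo:Event"] -->|rdfs:subClassOf| Thing
  Work["dbo:Work"] -->|rdfs:subClassOf| Thing
  Species["dbo:Species"] -->|rdfs:subClassOf| Thing
  MusicalWork["dbo:MusicalWork"] -->|rdfs:subClassOf| Work
  Person["dbo:Person"] -->|rdfs:subClassOf| FoafPerson["foaf:Person"]
  FoafPerson -->|rdfs:subClassOf| Agent["foaf:Agent"]
  Agent -->|rdfs:subClassOf| Thing
```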
For the second prompt (Prompt 7), ChatGPT-3 referred to a graphic file within the answer text that no longer existed. Upon further inquiry, a mermaid diagram was generated. It was of type "Graph" and contained thirteen common concepts and seventeen edges, which were all unique. The labels of both nodes and edges contained no prefixes, but these could be added with further inquiry. Only the generated concept dbo:Occupation is non-existent. All remaining nodes and edges comply with the rules of the ontology, even if the concepts used are derived through further subclass relationships. The resulting diagram is shown in the Supplemental Online Resources. While Prompt 6 leads to a result that can be achieved more comprehensively with conventional tools for visualizing RDF, the result from Prompt 7 provides an overview of concepts (classes) and properties that can be used to relate instances of these classes to each other.

Conclusion and Future Work
From the perspective of a knowledge graph engineer, ChatGPT has demonstrated impressive capabilities. It successfully generated knowledge graphs from semi-structured textual data, translated natural language questions into syntactically correct and well-structured SPARQL queries for the given knowledge graphs, and even generated overview diagrams for large knowledge graph schemas, as outlined in Section 4. A detailed analysis revealed that the generated results contain mistakes, some of which are subtle. For some use cases, this might be harmless and can generally be tackled with additional validation steps, like the metrics we used for SPARQL queries. In general, our conclusion is that one needs to keep in mind ChatGPT's tendency to hallucinate, especially when applied to the field of knowledge graph engineering, where many engineers are used to mathematical precision and logic.
The closed-source nature of ChatGPT challenges scientific research on it in two ways: (1) detailed capability ratings of closed-source probabilistic models require much effort, and (2) result reproducibility is bound to service availability, and results might be irreproducible at a later date (due to service changes). Thus, open training corpora and LLMs are mandatory for proper scientific research.
In the future, metrics need to be found to rate generated ChatGPT answers automatically, like we broached with SPARQL queries. This in turn enables extending the number of test cases for a specific experiment and generating sound statistical results. Another research focus should be on methods that let the LLM access a broader or the necessary context to increase the chance of correct answers.

Table 1. Findings in generated SPARQL queries for Prompt 2.

Table 2. Token counts for selected knowledge graphs and serialisations.

Table 3. Findings in generated SPARQL queries for Prompt 4.

Table 4. Diagram content overview. * One more prompt was needed to serialize a graph. ** One more prompt was needed to add prefixed labels.