Introduction

Large language models (LLMs) have gained popularity in the NLP field due to their impressive performance [1,2,3]. Pre-trained on massive data, LLMs show far stronger natural language understanding and reasoning capabilities than other language models. Although LLMs perform well on a wide range of downstream tasks, they still struggle with knowledge-intensive challenges. Studies have shown that LLMs suffer from hallucinations and knowledge limitations, including outdated or incorrect facts and a lack of specialized knowledge [4,5,6]. Furthermore, LLMs have difficulty reasoning over long logical sequences or intricate structures [7]. These shortcomings restrict their use, especially in high-risk and high-sensitivity fields such as medicine.

Improving LLMs with external knowledge is an intuitive approach to overcoming their limitations. This is particularly useful for question-answering (QA) tasks, where the process involves retrieving correct and up-to-date knowledge relevant to the question, constructing prompts, and feeding these prompts to the LLM for analysis or summarization, as illustrated in Fig. 1. A knowledge graph (KG) is a vital source of such external knowledge: through structured data storage techniques, it can support temporal and multimodal knowledge [8, 9]. Knowledge graphs, which store comprehensive real-world information as a graph of triples, offer more robust semantic logic than plain text and are better suited to supporting logical reasoning tasks.

To enhance the performance of LLMs with a knowledge graph, retrieving and reasoning over multi-hop paths is essential. However, there are three main challenges. First, each entity in the knowledge graph is connected to a large number of relations, most of which are irrelevant to the given question. Without efficient filtering, the lengthy context and excess invalid information can lead LLMs to incorrect reasoning [10]. Second, questions often require a multi-hop search over the graph, which can cause the search space to grow exponentially, so effective retrieval and pruning methods are essential. Finally, the knowledge graph's triple structure can be difficult for general LLMs to process, as they are not typically pre-trained or fine-tuned on structured data [11]. Finding an appropriate knowledge representation is therefore crucial for effective prompting.

Fig. 1 LLM suffers from hallucination and knowledge limitation, which can be solved with external knowledge

Considering the above challenges, this paper proposes KnowledgeNavigator, a novel general framework for enhanced knowledge graph reasoning. It consists of three stages: Question Analysis, Knowledge Retrieval, and Reasoning. KnowledgeNavigator starts by predicting the retrieval scope required for the question and creating a set of similar queries. Guided by the question, it iteratively retrieves and filters relevant relations and entities at each hop within the knowledge graph, ensuring that only the knowledge necessary to answer the question is recalled. This knowledge is then synthesized and converted into natural language to minimize redundancy and circumvent the limitations of LLMs in processing triples. The refined knowledge is finally fed to the LLM for answer reasoning. In this pipeline, the knowledge graph serves as an external knowledge source, while the LLM enhances the understanding of question semantics, predicts the search direction, and performs the reasoning. Both components function as plug-ins within KnowledgeNavigator. This design allows KnowledgeNavigator to support any knowledge graph and backbone LLM, capitalizing on the timely updated knowledge and domain-specific information in the knowledge graph without the overhead of frequently retraining the LLM.

The main contributions of this paper can be summarized as follows:

  • KnowledgeNavigator is a novel framework that leverages semantic and structural information to guide LLMs in enhanced multi-hop reasoning on knowledge graphs. It effectively retrieves external knowledge to assist in generating reliable answers for KGQA tasks.

  • KnowledgeNavigator features a general process design. The iterative retrieval module deploys similar-question generation and a voting mechanism to rerank candidate knowledge, enhancing the alignment between target queries and reasoning paths. The knowledge representation module reorganizes and converts triple knowledge into LLM-friendly formats, reducing the complexity of reasoning. KnowledgeNavigator can therefore be used directly with various LLMs and KGs without retraining or fine-tuning.

  • KnowledgeNavigator is evaluated on several KGQA benchmarks to validate its effectiveness. It outperforms all LLM-based baselines and achieves competitive performance with fully supervised models.

Related work

Knowledge reasoning for KGQA

Essentially, a knowledge graph is a semantic network of entities, concepts, and the relations between them [12]. The KGQA task, an important application of knowledge graphs in the NLP field, aims to answer a given question by mining and reasoning over an existing knowledge graph [13]. Reasoning over knowledge graphs is crucial for KGQA because knowledge graphs are inherently incomplete and noisy to varying degrees. Early knowledge reasoning relied mainly on logical rules, which require experts to design grammars and rules for specific domains. These methods are highly interpretable but require extensive manual intervention and cannot be generalized efficiently [14,15,16]. With the development of representation learning, many studies consider both local and global or high-level and low-level knowledge correlations to enhance feature extraction and representation capabilities and support various downstream tasks [17, 18]. In KGQA, many studies likewise apply embeddings with rich semantic information to map entities and relations to a low-dimensional vector space and capture their latent semantic relationships to extract the optimal answer. These studies greatly improve the performance of knowledge reasoning for KGQA, but their effectiveness relies on the quality of the embedding models, and they lack interpretability [19,20,21]. To better solve complex multi-hop reasoning, more recent work applies neural networks to learn the interaction patterns between entities and relations in the knowledge graph, achieving automatic and accurate reasoning and improving the generalization of reasoning models [22, 23]. Modified metaheuristics have also been applied to improve a model's ability to represent patterns by searching for the optimal hyperparameter combination, thereby improving performance on KGQA tasks [24, 25].

Knowledge graph enhanced LLM

Knowledge graphs support the structured representation of real-world knowledge and, through temporal and personalized designs, can meet a variety of knowledge storage and usage requirements [26]. Knowledge graphs are therefore applied as an important knowledge source to enhance both LLM pre-training and LLM generation [27]. The structured information in a knowledge graph has clearer logic and reasoning paths than natural language, so many studies use entities and relations to build corpora and design training tasks that enhance LLM pre-training [28,29,30]. However, both retraining and continued pre-training of LLMs require substantial computing resources and time, making it hard to keep up with rapidly evolving knowledge applications. A more straightforward approach to addressing the lack of knowledge in LLMs is therefore to construct knowledge-enhanced prompts containing factual information. Many works retrieve knowledge related to the target question through external retrieval algorithms and incorporate this knowledge into prompts that help the LLM reason in unfamiliar domains [31, 32]. However, in scenarios involving long reasoning chains and long-tail knowledge, it is challenging to retrieve knowledge that effectively supports the LLM. To address these challenges, our work sets up a comprehensive process encompassing question analysis, knowledge retrieval, and reasoning, enabling efficient and accurate knowledge retrieval and effective knowledge expression. This approach meets the needs of LLMs when conducting complex reasoning tasks for which they lack internal knowledge, facilitated by knowledge graphs.

Method

KnowledgeNavigator is designed to support KGQA tasks by performing enhanced reasoning on knowledge graphs. The reasoning process of KnowledgeNavigator contains three stages: Question Analysis, Knowledge Retrieval, and Reasoning, as shown in Fig. 2.

Fig. 2 An overview of KnowledgeNavigator. The framework consists of three consecutive phases: Question Analysis, Knowledge Retrieval, and Reasoning. The given example comes from MetaQA, describing a 2-hop reasoning task starting from Babaloo Mandel and ending with entities including Tom Hanks. In the knowledge graph, solid lines indicate that entities or relations are retrieved as reasoning knowledge, while dashed lines indicate that entities or relations are discarded

Question analysis

The multi-hop reasoning of questions is the main challenge in KGQA tasks. The Question Analysis stage supports enhanced reasoning on the knowledge graph by pre-analyzing the given question, which helps to improve retrieval efficiency and accuracy.

To answer a question Q, KnowledgeNavigator first predicts the potential hop number \(h_Q\) of the question, i.e. the maximum reasoning depth required to retrieve all the knowledge needed, starting from the core entities. Hop number prediction is a classification task, which KnowledgeNavigator implements with a fine-tuned pre-trained language model (PLM) and a simple linear classifier:

$$\begin{aligned} V_Q = PLM(Q) \end{aligned}$$
(1)
$$\begin{aligned} h_Q = \arg \max _{h} P(h \mid V_Q), \quad h \in \{1, 2, \ldots , H\} \end{aligned}$$
(2)
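The hop predictor admits a very compact implementation. Below is a minimal sketch of Eqs. (1)-(2) in Python, assuming the bert-base-uncased encoder and linear classifier mentioned in the experiment details; the fine-tuning loop and the exact pooling strategy are simplifications, not the paper's released code.

```python
# Hop-number classifier: a BERT encoder followed by a linear head over H classes.
import torch
from transformers import AutoModel, AutoTokenizer

H = 3  # maximum hop depth considered (MetaQA questions need 1-3 hops)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, H)

def predict_hops(question: str) -> int:
    """Return the predicted hop number h_Q in {1, ..., H}."""
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    with torch.no_grad():
        v_q = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] vector, V_Q in Eq. (1)
    logits = classifier(v_q)                             # scores P(h | V_Q) up to softmax
    return int(logits.argmax(dim=-1).item()) + 1         # classes 0..H-1 map to hops 1..H

print(predict_hops("Who starred in the movies written by Babaloo Mandel?"))
```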

The reasoning logic of each question in the KGQA task is implied in the semantics of the question itself, so knowledge graph reasoning is a process of mining this reasoning logic from the question. To enhance this mining, KnowledgeNavigator uses the LLM to generate a set of similar questions \(S = \{s^Q_1, s^Q_2, \ldots , s^Q_m\}\) with the same semantics as the original question. Different phrasings of the same question can shed light on the reasoning logic from different angles, so these similar questions enrich the information available during the Knowledge Retrieval stage.
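A hedged sketch of this step follows: the prompt wording and the call_llm helper below are illustrative assumptions, not the paper's actual template or API.

```python
# Generate m semantically equivalent paraphrases of the original question.
PARAPHRASE_PROMPT = """Rewrite the question below in {m} different ways.
Keep the meaning exactly the same and mention the same entities.
Return one rewrite per line.

Question: {question}"""

def call_llm(prompt: str) -> str:
    """Placeholder: route to the backbone LLM (e.g. ChatGPT or LLama-2-70B-Chat)."""
    raise NotImplementedError

def generate_similar_questions(question: str, m: int = 2) -> list[str]:
    response = call_llm(PARAPHRASE_PROMPT.format(m=m, question=question))
    variants = [line.strip() for line in response.splitlines() if line.strip()]
    return variants[:m]  # the set S of question variants
```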

In the case of Fig. 2, KnowledgeNavigator predicts that the number of reasoning hops \(h_Q\) starting from the core entity "Babaloo Mandel" is 2, using a PLM fine-tuned on the MetaQA 2-hop dataset. It then generates S containing two variants of the original question.

Knowledge retrieval

Extracting relevant knowledge from the knowledge graph is crucial for answering a given question. The Knowledge Retrieval stage aims to extract the logical path by performing advanced reasoning on the knowledge graph, constructing a smaller, more focused subgraph that aids answer generation. The retrieval process is mainly achieved by interacting with the LLM, which avoids the expense of retraining a model for each task.

Knowledge retrieval is an iterative search process with a depth limit of \(h_Q\). In each iteration i, KnowledgeNavigator begins with a set of core entities \(E_i = \{e^1_i, e^2_i, \ldots , e^n_i\}\). It then explores all one-hop relations connected to each entity, forming a candidate relation set \(R^n_i = \{r^{n,1}_i, r^{n,2}_i, \ldots , r^{n,k}_i\}\). Since an entity may have many relations in a knowledge graph, not all of which are relevant to the question, the reasoning path must be pruned to minimize the influence of unrelated or noisy knowledge on answer generation. KnowledgeNavigator linearizes the candidate relations of each entity into a string and formats it, together with the entity and each question variant in S, into a prompt for the LLM (see the sketch below). The LLM is tasked with choosing the K relations from \(R^n_i\) most relevant to the question variant.
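As an illustration, the prompt for one (entity, question variant) pair might be assembled as follows; the exact wording is an assumption, while the "; "-separated relation serialization matches the Fig. 2 example.

```python
# Build the relation-filtering prompt for one entity and one question variant.
def build_relation_prompt(entity: str, relations: list[str],
                          question: str, k: int = 1) -> str:
    serialized = "; ".join(relations)  # linearize the candidate relation set
    return (
        f"Question: {question}\n"
        f"Entity: {entity}\n"
        f"Candidate relations: {serialized}\n"
        f"Select the {k} relation(s) most useful for answering the question. "
        f"Answer with the relation names only."
    )

print(build_relation_prompt(
    "Babaloo Mandel",
    ["birth_year", "birth_place", "written_by", "created_by"],
    "Who starred in the movies written by Babaloo Mandel?",
))
```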

Based on the results of relation filtering, a weighted voting mechanism ranks each relation linked to entity \(e^n_i\) by its selection frequency. Relations chosen for the original question are given double the weight of those chosen for the variants generated by the LLM in the first stage:

$$\begin{aligned} \text {Score}(r) = \sum _{s \in S} w(s) \cdot \mathbb {I}\left( r, LLM(e, s, R)\right) \end{aligned}$$
(3)
$$\begin{aligned} w(s) = {\left\{ \begin{array}{ll} 2 &{} \text {if } s = Q \\ 1 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(4)

Here, the indicator function \(\mathbb {I}\) denotes whether a relation r from the set R is chosen by the LLM: it takes the value 1 if the relation is selected and 0 otherwise.
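Eqs. (3)-(4) reduce to a small weighted tally. The sketch below reproduces the Fig. 2 vote {written_by: 3, created_by: 1}; the shape of the selections input is a hypothetical encoding of the LLM's per-variant choices.

```python
# Weighted vote over the relations selected for each question variant.
from collections import Counter

def score_relations(original: str, selections: dict[str, set[str]]) -> Counter:
    scores: Counter = Counter()
    for question, chosen in selections.items():
        weight = 2 if question == original else 1   # w(s) in Eq. (4)
        for relation in chosen:                     # indicator I(r, LLM(e, s, R))
            scores[relation] += weight
    return scores

votes = score_relations(
    "Who starred in the movies written by Babaloo Mandel?",
    {
        "Who starred in the movies written by Babaloo Mandel?": {"written_by"},
        "Which actors appear in films that Babaloo Mandel wrote?": {"written_by"},
        "Who acted in the movies Babaloo Mandel created?": {"created_by"},
    },
)
print(votes.most_common(1))  # keep the top-M relations, M = 1 here
```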

The relations of each entity are ranked independently, enhancing the diversity of the reasoning process. After filtering all relations in iteration i, KnowledgeNavigator selects the M optimal relations for each entity. It then queries the triples \((e^n_i, optimal\_r^n_i, tail)\) and \((head, optimal\_r^n_i, e^n_i)\) from the knowledge graph. These triples form part of the reasoning path and are added to the retrieved knowledge set RK. The untraversed entities among the tail and head entities are compiled into the core entity set \(E_{i+1}\) for the next iteration.

In this stage, KnowledgeNavigator begins with the core entities \(E_0\) extracted from the given question Q. It then iteratively filters relations and adds triples to RK until the depth \(h_Q\) is reached, as condensed in the sketch below. The triples in RK are used as reasoning knowledge in the next stage.
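Putting the pieces together, the retrieval loop can be condensed as follows. The three helper functions are placeholders for the KG queries and the LLM filtering-plus-voting steps sketched above, not real APIs.

```python
# Iterative, depth-limited knowledge retrieval from the core entities E_0.
def get_relations(entity):            # placeholder KG accessor: one-hop relations
    raise NotImplementedError

def get_triples(entity, relation):    # placeholder: (e, r, tail) and (head, r, e) triples
    raise NotImplementedError

def select_relations(entity, candidates, questions, m):  # placeholder LLM filter + vote
    raise NotImplementedError

def retrieve_knowledge(core_entities, questions, h_q, m=1):
    """Iterate h_q hops, collecting the reasoning triples RK."""
    frontier, visited, rk = set(core_entities), set(core_entities), []
    for _ in range(h_q):
        next_frontier = set()
        for entity in frontier:
            candidates = get_relations(entity)
            for relation in select_relations(entity, candidates, questions, m):
                for head, r, tail in get_triples(entity, relation):
                    rk.append((head, r, tail))
                    for e in (head, tail):          # expand with untraversed entities
                        if e not in visited:
                            visited.add(e)
                            next_frontier.add(e)
        frontier = next_frontier                    # E_{i+1} for the next iteration
    return rk
```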

Reasoning

Through several iterations, KnowledgeNavigator accumulates enough knowledge in RK to address the given question. The Reasoning stage then leverages this knowledge to generate the answer.

The knowledge retrieved from the knowledge graph is structured as triples in the format [head, relation, tail]. Each triple is an implicit expression of the reasoning path. To fully answer a question, the entities and relations from multiple triples can be linked into a reasoning path and, further, a reasoning sub-graph. Merging nodes and condensing this sub-graph through triple aggregation improves the reasoning efficiency of LLMs. For instance, KnowledgeNavigator aggregates triples T within RK that share the same head or tail entity and relation into a single consolidated triple:

$$\begin{aligned} f_{head}(T) = \left\{ (h, r, [a_1, \ldots , a_n]) \mid \forall (h, r, a_i) \in T \right\} \end{aligned}$$
(5)
$$\begin{aligned} f_{tail}(T) = \left\{ ([h_1, \ldots , h_n], r, a) \mid \forall (h_i, r, a) \in T \right\} \end{aligned}$$
(6)

This effectively reduces redundant information and enhances the knowledge representation. KnowledgeNavigator then converts the aggregated triples into natural language using templates (e.g. "The {relation} of {head} is(are): {tail}"), circumventing the limited capacity of LLMs to understand data structured as triples. Finally, the natural-language knowledge is merged into a single string and fed to the LLM along with the question Q. The LLM is prompted to generate an answer based entirely on the provided external knowledge, without using its own learned knowledge.
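A minimal sketch of the aggregation in Eqs. (5)-(6) and the template conversion, using the Fig. 2 triples; the grouping keys and template wording follow the description above, but the code itself is illustrative.

```python
# Aggregate triples sharing the same (head, relation), then verbalize them.
from collections import defaultdict

def aggregate_by_head(triples):
    grouped = defaultdict(list)
    for head, rel, tail in triples:          # f_head: merge on (head, relation)
        grouped[(head, rel)].append(tail)
    return [(h, r, tails) for (h, r), tails in grouped.items()]

def verbalize(head, rel, tails):
    # Template: "The {relation} of {head} is(are): {tail}"
    return f"The {rel.replace('_', ' ')} of {head} is(are): {', '.join(tails)}."

triples = [("Splash", "starred_actors", "Daryl Hannah"),
           ("Splash", "starred_actors", "Tom Hanks")]
for h, r, tails in aggregate_by_head(triples):
    print(verbalize(h, r, tails))
# -> The starred actors of Splash is(are): Daryl Hannah, Tom Hanks.
```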

Case study

Figure 2 is an example of KnowledgeNavigator performing a KGQA task. First, KnowledgeNavigator predicts the reasoning hop number starting from the core entity "Babaloo Mandel" with a PLM and generates two similar questions for the target question with the LLM.

Then, KnowledgeNavigator extracts all relations linked to "Babaloo Mandel" and serializes them into "birth_year; birth_place; written_by; created_by" as part of the prompt. For the only core entity, "Babaloo Mandel", the LLM selects the optimal relation for each question variant, producing the weighted voting result {written_by: 3, created_by: 1}. Since the number of optimal relations selected per entity is set to one, the triples with "written_by" as the optimal relation and "Babaloo Mandel" as the head or tail entity (i.e. [Splash, written_by, Babaloo Mandel] and [Parenthood, written_by, Babaloo Mandel]) are extracted as the first step of the reasoning path. The head entities "Splash" and "Parenthood" are selected as the core entities for the second iteration to continue the knowledge retrieval.

After KnowledgeNavigator reaches the predicted number of hops, the triples retrieved from the knowledge graph can be combined into an effective reasoning path (e.g. Babaloo Mandel - written_by - Splash - starred_actors - Dianne Wiest) and further into a reasoning sub-graph. Taking the knowledge about "Splash" as an example, KnowledgeNavigator first combines the triples into "[Splash, starred_actors, [Daryl Hannah, Tom Hanks]]" and then converts it into "The actors starred in Splash are: Daryl Hannah and Tom Hanks" with a template. All retrieved triples are finally concatenated into a string and fed to the LLM as part of the answer-generation prompt.

Experiments

Dataset

To test the ability of KnowledgeNavigator on multi-hop knowledge graph reasoning tasks, it is evaluated on two datasets: MetaQA [33] and WebQSP [34]. For the KGQA task on both datasets, Hits@1 is used as the evaluation metric for the correctness of the answers generated by the LLM, following previous works [35,36,37,38].

MetaQA is a large-scale KGQA dataset in the movie domain that provides a knowledge graph with 43k entities, 9 relations, 135k triples, and 407k questions. The question set is extracted from the Facebook MovieQA dataset, containing questions that require 1-hop to 3-hop reasoning away from the head entities. Each question consists of a head entity, a reasoning path, and the answer entities. To verify KnowledgeNavigator’s multi-hop reasoning capability, the 2-hop and 3-hop vanilla datasets in MetaQA are used for experiments.

WebQSP is a benchmark with fewer questions but a large-scale knowledge graph, which can effectively evaluate the large-scale search ability of KnowledgeNavigator. WebQSP provides questions requiring up to 2 hops of reasoning over Freebase; each question contains a topic entity, constraints, inferential chains, and a SPARQL query for finding the answer. The base knowledge graph is set up with the latest version of the Freebase data dump provided by Google, which includes 3.12B triples [39]. WebQSP provides 4737 questions; the 11 questions that have no gold answers are removed.

Table 1 The performance of KnowledgeNavigator and baselines on MetaQA and WebQSP. The best result in each block is in bold

Baselines

To evaluate the effectiveness of KnowledgeNavigator in reasoning on knowledge graphs, it is compared with a set of well-known baseline models in the field of KGQA, all of which are fully supervised. These baselines are divided into two categories based on their retrieval method: (1) Embedding-based methods: KV-Mem [35], EmbedKGQA [40], NSM [41], Transfernet [42]. (2) Retrieval-augmented methods: GraftNet [36], CBR-SUBG [43]. All of these baselines were evaluated on both MetaQA and WebQSP. In addition, UniKGQA [44], KAPING [37], TOG [38] and StructGPT [45] are included as baselines that use LLMs for KGQA tasks. These frameworks all perform knowledge retrieval and question reasoning with un-fine-tuned LLMs.

LLama-2-70B-Chat [46] and ChatGPT are also applied as pure large language model baselines. The same template is used to prompt both models; the only distinction between these baselines and KnowledgeNavigator is the external knowledge retrieved.

Experiment details

KnowledgeNavigator is decoupled from the LLM, so any LLM can be used as a plug-in reasoning component. ChatGPT and LLama-2-70B-Chat are used as the LLM components in the experiments. ChatGPT is called through the OpenAI API. LLama-2-70B-Chat is deployed locally on 4 NVIDIA A100 80 GB GPUs without quantization, thus avoiding model quality loss. The context length is set to the default 4096, and the maximum number of tokens to generate per output sequence is set to 1024. A bert-base-uncased model with a linear classifier is fine-tuned on the training set of each dataset for hop prediction.

For both datasets, KnowledgeNavigator generates two variants for each question, and hop prediction is conducted within the range of 1 to 3. On MetaQA, KnowledgeNavigator performs weighted ranking on the top-1 relation for each (question variant, entity, relations) group and selects the top-1 ranked result for the next iteration. For WebQSP, these two parameters are set to top-2. In the few-shot scenario, the prompt includes two examples from the training set of the same dataset, in the same format as the target task.

Main results

Fig. 3 Performance of KnowledgeNavigator with different numbers of similar questions on MetaQA and WebQSP

Fig. 4 Performance of KnowledgeNavigator with different knowledge formats on MetaQA and WebQSP

Table 1 shows the performance of KnowledgeNavigator and the baselines on the KGQA datasets. Through optimized knowledge retrieval and answer reasoning, KnowledgeNavigator achieves an impressive accuracy of 99.5% on MetaQA 2-hop with LLama-2-70B-Chat, and 95.0% and 83.5% on MetaQA 3-hop and WebQSP with ChatGPT. KnowledgeNavigator outperforms all baseline models on WebQSP and surpasses all other LLM-based methods on all three datasets.

First, LLMs can answer some KGQA questions without relying on external knowledge, and even outperform KV-Mem on the WebQSP benchmark. However, a significant performance gap remains between the LLMs and state-of-the-art KGQA models, suggesting that LLMs struggle to reason about and answer complex questions using only their internal knowledge.

TOG and StructGPT retrieve question-related knowledge from knowledge graphs to assist LLM reasoning and achieve better performance on KGQA tasks. However, StructGPT only applies the LLM to knowledge retrieval and directly uses the tail entity of the retrieved triples as the answer; this approach ignores the underlying reasoning logic between triples and achieves only limited performance. TOG requires the LLM to judge whether the knowledge at each hop meets the requirements of the question, which not only accumulates errors but also significantly increases the time cost. In contrast, KnowledgeNavigator effectively considers the semantic relationships between questions, entities, relations, and retrieval history through multi-hop sequential retrieval. It outperforms the best LLM-based results by 2.2%, 8%, and 7.3% on the three datasets, respectively, while keeping the retrieval time cost within an acceptable range.

The embedding-based methods achieve multi-hop reasoning mainly through the similarity of embeddings between questions and entities. Such a model must run computations over the entire knowledge graph for each new question, which results in high retrieval complexity and low accuracy. The retrieval-augmented methods reduce reasoning complexity by retrieving relevant subgraphs from the knowledge graph and reasoning over the subgraphs instead. However, these methods are fully supervised and difficult to extend to other applications without retraining. As a general framework, KnowledgeNavigator can be combined with any knowledge graph and LLM without retraining or fine-tuning. This allows it to use the latest knowledge in real time, with better versatility and generalization, while achieving performance comparable to fully supervised models.

Ablation study

An ablation study is performed on KnowledgeNavigator to analyze the impact of the number of similar questions and the form of knowledge representation. LLama-2-70B-Chat and ChatGPT are used as the backbone LLMs with the same prompt template and 2-shot examples in all cases. Figures 3 and 4 show the results of the ablation study.

Impact of number of similar questions

Figure 3 shows the accuracy when 0, 2, or 4 similar questions participate in the voting for relation selection. The performance trends with LLama-2-70B-Chat and ChatGPT as backbones are similar across different numbers of similar questions. For the relatively simple MetaQA tasks, the LLM alone can already understand the original question and select the next-hop relation correctly, even without similar questions, especially on MetaQA 2-hop. More voters nevertheless make relation selection more stable, so with both backbones the performance of KnowledgeNavigator improves as the number of similar questions participating in the voting increases. On WebQSP, the low quality of the original questions limits KnowledgeNavigator's ability to discover the correct retrieval paths, so similar questions bring greater performance improvements: with two similar questions, KnowledgeNavigator achieves accuracy improvements of 4.4% and 4.5% with LLama-2-70B-Chat and ChatGPT, respectively. However, the semantic ambiguity of the original questions also makes the voting unstable; with LLama-2-70B-Chat, additional similar questions increase voting errors.

Moreover, since the number of KnowledgeNavigator's requests to the LLM rises linearly with the number of similar questions, there is a balance to be struck between computational cost and effectiveness. In our experiments, the number of similar questions is set to 2 by default to control the computational cost.

Impact of knowledge formats

Figure 4 shows the impact of different knowledge representation forms on the performance of KnowledgeNavigator. In this part, the LLMs are prompted with different representation forms of the same knowledge. Specifically, for "w/ Individual Triples" and "w/ Linked Triples", all triples are concatenated into a string in the form [head, relation, tail] or \([head, relation, [tail_1, \ldots , tail_n]]\); for "w/ Individual Sentences", each triple is converted into a separate natural language sentence using a template and the sentences are concatenated into a string, as illustrated in the sketch below.
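For concreteness, the three formats can be built from the same two Fig. 2 triples as follows; only the presentation differs, while the content is identical.

```python
# The three knowledge formats compared in Fig. 4, built from the same triples.
triples = [("Splash", "starred_actors", "Daryl Hannah"),
           ("Splash", "starred_actors", "Tom Hanks")]

individual_triples = " ".join(f"[{h}, {r}, {t}]" for h, r, t in triples)
linked_triples = "[Splash, starred_actors, [Daryl Hannah, Tom Hanks]]"
individual_sentences = " ".join(
    f"The {r.replace('_', ' ')} of {h} is(are): {t}." for h, r, t in triples)

print(individual_triples)    # w/ Individual Triples
print(linked_triples)        # w/ Linked Triples
print(individual_sentences)  # w/ Individual Sentences
```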

Across knowledge formats, the performance of KnowledgeNavigator increases with the logical closeness of the knowledge representation. First, for both triples and sentences, aggregated knowledge effectively reduces redundant information and increases the knowledge density of prompts, thereby reducing the difficulty of reasoning. Second, for general LLMs that have not been fine-tuned, knowledge in natural language format avoids the errors caused by their limited understanding of structured data.

Of the two backbones, ChatGPT demonstrates a stronger ability to comprehend structured information, enabling effective reasoning even with low-knowledge-density triples. Moreover, with the same knowledge format, ChatGPT outperforms LLama-2-70B-Chat on MetaQA 3-hop and WebQSP thanks to its stronger reasoning ability.

Error analysis

Fig. 5 Distribution of 100 random error samples on each dataset

To analyze the causes of errors in KnowledgeNavigator, 100 error samples are randomly extracted from the results on each dataset (MetaQA 2-hop, MetaQA 3-hop, and WebQSP). The errors are manually analyzed and classified into four categories:

  1. Relation selection error: Wrong relations are selected in the Knowledge Retrieval stage, resulting in the failure to retrieve the correct knowledge.

  2. Reasoning error: KnowledgeNavigator retrieves the correct knowledge but performs wrong reasoning in answer generation.

  3. Hallucination: KnowledgeNavigator does not generate answers based on the retrieved external knowledge.

  4. Other errors: Including intermediate errors causing search interruption, excessively long context leading to knowledge truncation, etc.

Figure 5 shows the error analysis results on the three datasets, whose error distributions differ. Reasoning error is the main error type on MetaQA 2-hop, accounting for 79%, while relation selection error is the main error type on MetaQA 3-hop and WebQSP, accounting for 95% and 69% respectively. This is because the questions in MetaQA 3-hop and WebQSP are semantically more complex. For MetaQA, the 3-hop questions feature longer reasoning paths and more intricate knowledge sub-graphs, and the inherent limitations in the reasoning capabilities of LLMs restrict the accuracy of reasoning and relation selection. For WebQSP, each entity in Freebase is associated with numerous similar and imprecisely articulated relations, which complicates the LLM's task of understanding and selecting the most relevant relations for the next iteration. Meanwhile, since the reasoning logic of MetaQA 2-hop questions is more straightforward, the LLM rarely selects wrong relations or produces wrong reasoning results; hallucinations and other errors therefore do not appear in the samples.

According to the error statistics, the performance of LLMs on KGQA tasks can be further improved through targeted optimization. Specifically, relation selection can be improved by enriching the semantics of the questions or strengthening the connection between the knowledge graph and the reasoning path, and reasoning can be improved by refining the prompt and the knowledge representation.

Complexity analysis

To illustrate the ability and generalization of KnowledgeNavigator in handling different QA tasks, the complexity and cost of retrieval are summarized into the following categories:

  1. Graph density: Increased density of the knowledge graph leads to a greater number and complexity of relations between nodes. This can weaken the significance of semantic differences between relations and thereby challenge relation selection in multi-hop reasoning. This is a common difficulty in other KGQA studies as well; in KnowledgeNavigator, the voting mechanism based on similar questions and the beam-like search greatly alleviate this problem.

  2. Complexity of questions: The complexity of a question is proportional to the number of hops required for KnowledgeNavigator to retrieve knowledge in the graph. This is an inevitable cost of multi-hop reasoning.

  3. LLM response time: KnowledgeNavigator calls the LLM multiple times to generate similar questions and select the optimal relations during knowledge retrieval, so the total retrieval time depends on the response time of the LLM. However, the generality of KnowledgeNavigator allows it to be applied directly to different knowledge graphs and question types without data preprocessing or model retraining, making it time-efficient in practical applications.

Conclusion

This paper studies the challenge of knowledge limitations in LLMs and introduces KnowledgeNavigator to improve their reasoning and question-answering capabilities on knowledge graphs. KnowledgeNavigator consists of three stages: Question Analysis, Knowledge Retrieval, and Reasoning. During question analysis, KnowledgeNavigator pre-analyzes the question and generates variants of it to assist reasoning. Then, guided by the LLM, it iteratively retrieves and filters candidate entities and relations within the knowledge graph to extract relevant external knowledge. Finally, this knowledge is transformed into an effective prompt to improve the LLM's performance on knowledge-intensive tasks. KnowledgeNavigator is evaluated with KGQA metrics, and the results indicate that introducing external knowledge from the knowledge graph benefits LLMs in handling complex tasks. KnowledgeNavigator outperforms other frameworks that deploy LLMs for enhanced KGQA and achieves performance comparable to previous fully supervised models. An ablation study confirms the effectiveness of each component, and the errors are analyzed. KnowledgeNavigator is a general framework, but its performance depends on the natural language understanding and reasoning capabilities of the LLM, as well as its response latency. Reducing the complexity of reasoning to adapt to more scenarios is therefore a promising direction for future research.