What is in the KGQA Benchmark Datasets? Survey on Challenges in Datasets for Question Answering on Knowledge Graphs

Question Answering based on Knowledge Graphs (KGQA) still faces difficult challenges when transforming natural language (NL) questions into SPARQL queries. Simple questions referring to a single triple can be answered by most QA systems, but complex questions that require queries containing subqueries or several functions remain a tough challenge in this field of research. Evaluation results of QA systems therefore may also depend on the benchmark dataset on which a system has been tested. To give an overview and reveal specific characteristics, we examined the currently available KGQA datasets with regard to several challenging aspects. This paper presents a detailed look into these datasets and compares them in terms of the challenges a KGQA system faces.


Introduction
Question answering (QA) aims at answering questions formulated in natural language on data sources and, therefore, combines methods from natural language processing (NLP), linguistics, database processing, and information retrieval. Though early research activities were already conducted in the 1960s, QA has again received great attention over the last few years. The main reasons are significant progress in speech recognition and NLP, but also the public availability of knowledge bases as a primary data source to answer questions from general domains.
In general, applications that transform natural language questions into formal queries on structured data can be summarized as the class of Natural Language Interfaces to Databases (NLIDB). Approaches based on semantic knowledge bases, such as RDF knowledge graphs (which we refer to as Question Answering on Knowledge Graphs, KGQA, in the following), are very promising because they can rely on large knowledge datasets such as DBpedia and also simplify tasks such as mapping and disambiguation. Thus, numerous approaches and systems have been proposed in the past, and various datasets and challenges have been published. One of the most prominent examples for the QA-on-DBpedia community is the QALD workshop series 1 . For each edition of the challenge, new datasets for training and testing have been published. Over the years, the datasets have grown in terms of the number of questions, and of course they have been adapted to the DBpedia versions current at the time. In addition, further datasets have been created and published to evaluate KGQA systems that transform NL questions into DBpedia-based SPARQL queries.
However, the multitude of datasets makes it difficult for researchers to choose the right dataset. Thus, we present in this work a comparative survey of available datasets for KGQA. The intention of this survey is two-fold: -provide QA researchers with an overview of existing datasets, their structure, and characteristics, and -show the specific challenges KGQA systems have to overcome.
We performed an extensive analysis of 26 different datasets: the training and test datasets of all QALD challenges (18 datasets in total), LC-QuAD 1.0, and SimpleDBpediaQA. These datasets are all based on the DBpedia of 2016 2 (the latest pure DBpedia versions without migrated Wikidata information). In addition, we analyzed the WebQuestions and SimpleQuestions datasets to provide a comparison in terms of linguistic characteristics.
We analyzed and compared these datasets with regard to the following challenges for KGQA systems.
For each aspect, we introduce its characteristics, describe our analysis methods and measures, and discuss our findings.
The remainder of this paper is organized as follows: Sect. 2 introduces related work, mainly surveys of existing (KG)QA systems. We introduce the datasets and some general information about them in Sect. 3. Our analysis results are presented in Sect. 4, and we conclude our work in Sect. 5.

Related Work
In this survey, the research field of interest is Question Answering (QA). Specifically, we focus on the transformation of natural language (NL) questions into SPARQL queries, which we refer to as Question Answering over knowledge graphs (KGQA). Since Semantic Web technologies enable knowledge to be represented as RDF triples in triple stores, the access to this structured knowledge via NL has become an interesting research field. The first challenge on Question Answering over Linked Data (QALD) was organized in 2011, co-located with the Extended Semantic Web Conference (ESWC) 3 . The 9th and latest edition took place in 2018, co-located with the International Semantic Web Conference (ISWC). For all nine editions of the challenge, the organizers provided datasets as training and test data. These datasets are among the datasets under observation in this survey and are described in more detail in Sect. 3.
The first survey (to the best of our knowledge) explicitly referring to KGQA and comparing KGQA systems has been published by Höffner et al. [6]. The authors present an overview of 62 different KGQA systems. The comparison is carried out based on several challenges the authors identified: ambiguity, complexity of queries, and the lexical gap, amongst others. For our survey, we adopted the challenges listed by the authors as analysis aspects for the datasets and added a few more aspects. Section 4 presents more details on the challenges we chose for our analysis.
Bouziane et al. [4] published a survey of QA systems. The authors compare 31 different (KG)QA systems regarding specific characteristics, such as interfaces to databases, open domains, ontologies, and the focus on (web) documents. Besides a more or less detailed description of the systems, the authors present an overview of the quality of these systems in terms of success rate, respectively correct answers.
Just recently, a survey on Natural Language Interfaces for databases (NLIDB) in general has been published by Affolter et al. [1]. The authors focus on QA systems in general, but not on KGQA systems specifically. They take KGQA systems into account when comparing them to other systems that transform natural language to SQL queries. Overall, the authors present an overview of 24 (KG)QA systems and evaluate them based on a set of 10 different questions.
The surveys described above focus on the overview and comparison of NLIDB systems in general or of KGQA systems specifically. In contrast and as a supplement, we focus on the datasets that are available to evaluate KGQA systems, specifically those based on DBpedia. We analyzed several KGQA datasets and examined specific characteristics regarding the challenges researchers are facing when developing a KGQA system. For this study, we focused on datasets that provide questions to be answered via DBpedia (cf. Auer et al. [2]).

Benchmark Datasets
For the task of KGQA, a dataset should at least contain the following information: -the NL question string, -the SPARQL query that yields the relevant answers, and -a specified SPARQL endpoint and the affected graph.
In case the endpoint becomes unavailable or the knowledge graph is updated, it is helpful to have the expected results provided in the dataset. Thus, researchers are able to reproduce results retrieved on an outdated knowledge base even after the SPARQL endpoint has been updated.
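A record meeting these minimal requirements could be serialized as in the following sketch; the field names, the query, and the URIs are purely illustrative and not taken from a specific dataset:

```python
import json

# Illustrative minimal benchmark record (hypothetical field names):
# NL question, gold SPARQL query, endpoint, and materialized answers,
# so results stay reproducible after endpoint updates.
record = {
    "question": "Who is the author of Faust?",
    "query": "SELECT ?uri WHERE { <http://dbpedia.org/resource/Faust_(play)> "
             "<http://dbpedia.org/ontology/author> ?uri }",
    "endpoint": "https://dbpedia.org/sparql",
    "answers": ["http://dbpedia.org/resource/Johann_Wolfgang_von_Goethe"],
}

# Round-trip through JSON, as benchmark files are typically distributed.
restored = json.loads(json.dumps(record))
```

Storing the materialized `answers` alongside the query is what makes evaluation independent of the live endpoint.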
For this study, we analyzed the most popular KGQA datasets based on DBpedia (cf. Table 1 in Kacupaj et al. [7]): -the datasets of the QALD challenge (train and test datasets, respectively; 18 datasets overall) -LC-QuAD 1.0 (train and test dataset) -SimpleDBpediaQA (train and test dataset) Several other datasets have been published for QA, respectively NLI, on knowledge bases other than triple stores containing RDF data. Due to the missing SPARQL queries, these datasets cannot be compared to the datasets introduced above in all aspects. However, to provide a comparison regarding linguistic characteristics, we take the datasets WebQuestions and SimpleQuestions into account for the dataset analysis presented in this survey.
Overall, we analyzed 26 datasets. For our analysis process, we only utilized the English-language questions in case a dataset provides the questions in multiple languages. This means that all further analysis results and statistics refer to the English-language parts of the datasets.
The benchmark datasets are described in more detail in the next sections. The analysis results are summarized in Sect. 4.

QALD
In recent years, the QALD challenge has become a well-established competition in terms of KGQA on DBpedia facts. By now, nine challenges have been organized since 2011. For each challenge, the organizers provided a training and a test dataset 4 . In the early years, these datasets contained at least the NL question, the SPARQL query, and the relevant results. Later, keywords, the answer type, information about required aggregation functions, a knowledge base other than DBpedia, and hybrid question answering on RDF and free text were added to the datasets. For the latest edition of the challenge, the datasets contain the following fields: -answertype-values are one of Boolean, date, number, resource, string -aggregation-true or false -onlydbo-states whether only DBpedia ontology properties are required for the SPARQL query; true or false -hybrid-always set false for the QALD 8 and 9 datasets -question-each question is represented in different languages. At maximum, 12 languages are available: de, ru, pt, en, hi_IN, fa, it, pt_BR, fr, ro, es, nl. Cf. Table 1 for more details on which languages are available for the datasets. -query-the SPARQL query -answers-the result of the query provided as result bindings The datasets for challenges 1-5 have been provided in XML format and those for challenges 6-9 in JSON format. The datasets have been compiled manually or from query logs regarding the NL question. Table 2 shows the results of the challenges QALD 8/9 and highlights the best-performing systems. 4 The datasets are available here: https://github.com/ag-sc/QALD.
Overall, for the QALD challenges we analyzed 18 different datasets, each containing between 41 and 408 questions. The datasets QALD 8 and 9 are the most recent and based on the latest DBpedia version. For the benefit of clarity, we include the analysis results only for the most recent datasets (8 and 9) and provide the results for QALD 1-7 in the "Appendix."
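Based on the field list above, a QALD 8/9-style record can be processed roughly as follows. This is a simplified sketch: the exact nesting of the official JSON files may differ, and the question, query, and answer shown are illustrative.

```python
import json

# Simplified QALD-style record with the fields listed above; the real
# files nest answers as full SPARQL result bindings.
raw = """
{
  "answertype": "resource",
  "aggregation": false,
  "onlydbo": true,
  "hybrid": false,
  "question": [{"language": "en", "string": "Who developed Skype?"}],
  "query": "SELECT DISTINCT ?uri WHERE { dbr:Skype dbo:developer ?uri }",
  "answers": ["http://dbpedia.org/resource/Skype_Technologies"]
}
"""
record = json.loads(raw)

# Select the English question string, as done for the survey's analysis.
english = next(q["string"] for q in record["question"]
               if q["language"] == "en")
```

The per-language `question` array is what allows restricting the analysis to the English parts of the datasets.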

LC-QuAD
In 2017, the LC-QuAD 1.0 dataset was published; LC-QuAD 2.0 followed in early 2019. Both datasets are split into a test and a training dataset 5 . While LC-QuAD 1.0 provides SPARQL queries over pure DBpedia (version of 2016), LC-QuAD 2.0 provides SPARQL queries based on Wikidata and the Wikidata-migrated DBpedia version of 2018. As all other datasets of this survey utilize the pure DBpedia of 2016, and therefore for reasons of comparability, we provide our analysis results for the LC-QuAD 1.0 dataset. The test dataset contains 1000 and the training dataset 4000 question-query pairs. The datasets are structured using the following fields for each record: -_id, the record id -corrected_question, the actual NL question -intermediary_question, the NL question with surface forms of (named) entities enclosed in angle brackets -sparql_query, the SPARQL query based on the 04-2016 release of DBpedia -sparql_template_id, one of 37 different SPARQL template ids applicable to the respective query Both the training and the test dataset contain 37 different SPARQL template IDs. Trivedi et al. [11] describe the LC-QuAD 1.0 dataset in detail. The creators of LC-QuAD 1.0 have published evaluation results of KGQA systems against their dataset; Table 3 shows these results. Details on the competing systems can be found on the authors' website 6 .
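Given records with the fields above, the distribution over the 37 template ids can be obtained directly. A sketch with toy records (the ids and questions are made up for illustration):

```python
from collections import Counter

# Toy LC-QuAD-style records; only the fields relevant here are shown.
records = [
    {"_id": "1", "corrected_question": "Which river flows through Ilmenau?",
     "sparql_template_id": 2},
    {"_id": "2", "corrected_question": "Who founded the town of Ilmenau?",
     "sparql_template_id": 2},
    {"_id": "3", "corrected_question": "Is the Ilm a river?",
     "sparql_template_id": 151},
]

# How many questions were generated from each SPARQL template?
template_counts = Counter(r["sparql_template_id"] for r in records)
```

Such a count is a quick way to check whether train and test splits cover the same template inventory.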

SimpleDBpediaQA
The SimpleDBpediaQA (SDBQA) dataset has been introduced by Azmy et al. [3] as a derivative of the SimpleQuestions dataset. The authors created the new dataset using a mapping from Freebase to DBpedia and provided a subset of the original questions. Table 4 shows an overview of the original and the derived datasets. The dataset is formatted as JSON files in the following manner: -ID -Query-the actual NL question -Subject-the DBpedia URI of the entity required in the SPARQL query -FreebasePredicate-the URI of the Freebase property from the original SimpleQuestions dataset -PredicateList-a list of formalized SPARQL query triples, containing the following keys: -Predicate-the DBpedia URI of the required property in the triple -Direction-forward or backward-states whether the entity of the Subject field is used as subject (forward) or as object (backward) within the triple -Constraint-either null or the URI of a DBpedia ontology class If the PredicateList field contains more than one object, the objects need to be joined in the SPARQL query via the UNION operator. Figure 1 shows a sample question object and the resulting SPARQL query.
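The mapping from a PredicateList entry to a query triple can be sketched as follows. The variable naming and the exact UNION layout are assumptions based on the description above, not the dataset authors' reference implementation:

```python
def triple_for(subject_uri, entry, var="?x"):
    """Build one SPARQL triple pattern from a SimpleDBpediaQA
    PredicateList entry, honoring the Direction field."""
    if entry["Direction"] == "forward":
        pattern = f"<{subject_uri}> <{entry['Predicate']}> {var} ."
    else:  # "backward": the given entity acts as the object of the triple
        pattern = f"{var} <{entry['Predicate']}> <{subject_uri}> ."
    if entry.get("Constraint"):  # optional class restriction on the answer
        pattern += f" {var} a <{entry['Constraint']}> ."
    return pattern

def build_query(subject_uri, predicate_list):
    """Join multiple PredicateList entries via UNION (alternative
    properties), as described in the text."""
    patterns = [triple_for(subject_uri, e) for e in predicate_list]
    if len(patterns) == 1:
        body = patterns[0]
    else:
        body = " UNION ".join("{ %s }" % p for p in patterns)
    return "SELECT DISTINCT ?x WHERE { %s }" % body
```

For a backward entry, the Subject URI ends up in the object position of the generated triple, which is exactly the distinction the Direction field encodes.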
To the best of our knowledge, no KGQA system has been evaluated on the SDBQA dataset yet; accordingly, no evaluation results on this dataset have been published so far.

WebQuestions
WebQuestions consists of a test, a validation, and a training dataset. The dataset was created based on Freebase facts. It provides the answers to a question as triple facts, describing a subject-relationship-object triple as explanation for the answers. The datasets (provided in JSON format 7 ) contain the following keys: -url-the Freebase URI of the focus entity of the question -targetValue-the list of answers for the question -utterance-the actual question The training dataset contains 3778 records, and the test dataset contains 2032 records. State-of-the-art systems achieve an accuracy of 45.5% on the dataset (cf. Brown et al. [5]).

SimpleQuestions
The first version of the dataset was published in 2015. This version has been used for the creation of the SimpleDBpediaQA dataset, as described in Sect. 3.3. For our survey, we analyzed version 2.0 of the SimpleQuestions dataset 8 . Similar to WebQuestions, the SimpleQuestions facts have been extracted from Freebase. The questions were then created manually based on the extracted facts. The dataset is a tab-separated text file with four columns: -the first three columns contain the subject, property, and object of the fact triple grounded in the Freebase knowledge graph -the fourth column contains the actual NL question The training dataset contains 75,909 records, and the test dataset contains 21,686 records. The latest QA approach evaluated against the SimpleQuestions dataset achieves an accuracy of 78.1% (cf. Petrochuk and Zettlemoyer [8]). 7 Available for download at: https://worksheets.codalab.org/worksheets/0xba659fe363cb46e7a505c5b6a774dc8a.
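The four-column layout described above can be parsed with a plain TSV reader. The line below is a toy example in the dataset's shape; the Freebase identifiers are illustrative.

```python
import csv
import io

# One SimpleQuestions-style line: subject \t property \t object \t question,
# with the first three columns grounded in Freebase (toy values).
raw = ("www.freebase.com/m/0f8l9c\t"
       "www.freebase.com/location/country/capital\t"
       "www.freebase.com/m/05qtj\t"
       "what is the capital of france\n")

rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
subject, prop, obj, question = rows[0]
```

Only the fourth column carries natural language, which is why this dataset can be compared to the others solely on linguistic characteristics.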

Dataset Analysis
For the analysis of the 26 datasets, we took into account a set of aspects, respectively challenges, that (KG)QA systems are facing when utilizing a dataset. Adopted from Höffner et al. [6], we examined the datasets with regard to the aspects discussed in the following subsections. In addition, we analyzed the datasets for existing ontology types. The types of occurring named entities give a hint about the domain of a question. The results are shown in Sect. 4.6.
Another challenging aspect of KGQA systems is the identification of the question type, which determines the answer type. Here, it is analyzed whether the question asks for a date, a resource, etc. We added an answer type analysis based on the given data to our survey and analyzed the datasets with regard to 13 different types of answers. The results are shown in Sect. 4.7. Table 5 gives an overview of general statistical parameters. The table includes the number of records contained in the dataset, the number of POS sequences and normalized POS sequences retrieved from the NL questions 9 , and the minimum, maximum, and average number of words in the NL questions. For the analysis, we processed the information provided by the datasets, such as the question, the SPARQL query, and additional information where available. We also applied several NLP algorithms to the NL questions, such as Part-of-Speech (POS) tagging and Named Entity Linking (NEL). For POS tagging, we utilized the Stanford POS tagger 10 with the model english-left3words-distsim. The table gives a short overview of all analyzed datasets. We provide further statistics, the description of our analysis processes, and a detailed discussion of the results in the following sections. 9 With POS sequence, we refer to the extracted POS tags in the same order as they occur in the NL question. The purpose of this analysis is described in detail in Sect. 4.5. 10 https://nlp.stanford.edu/software/tagger.shtml.
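The notion of a (normalized) POS sequence can be illustrated with a small helper. The survey does not spell out its exact normalization; collapsing the Penn Treebank noun variants to a single N tag (as done for entity detection in Sect. 4) and merging adjacent duplicates is an assumption here.

```python
# Penn Treebank noun tags, as emitted by the Stanford POS tagger.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def pos_sequence(tags):
    """POS tags concatenated in question order."""
    return " ".join(tags)

def normalized_pos_sequence(tags):
    """One plausible normalization (an assumption, not the paper's exact
    procedure): collapse all noun variants to 'N' and merge runs of
    identical tags."""
    collapsed = ["N" if t in NOUN_TAGS else t for t in tags]
    merged = [t for i, t in enumerate(collapsed)
              if i == 0 or t != collapsed[i - 1]]
    return " ".join(merged)

# Tags as a tagger might emit for "Who developed Skype ?"
tags = ["WP", "VBD", "NNP", "."]
```

Normalization of this kind is what makes questions with differently long entity mentions fall into the same syntactic pattern.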

Ambiguity
Main findings: -Datasets contain surface forms with a high number of entity candidates. -Many required named entities are not the most popular (respectively, the easiest to disambiguate) within the candidate list of the surface form.

Topic Definition
The more ambiguous the questions in a dataset, the harder it is to retrieve the correct answer. For our analysis, we examined several aspects of ambiguity: -How many named entities are mentioned in the NL question (minimum/maximum per question)? The more entities have to be disambiguated, the harder the query generation. -How many entity candidates can be retrieved from the underlying knowledge base 11 for each mention (i.e., how ambiguous is the respective surface form)? -Is the most popular candidate (in terms of the indegree of Wikipedia links) the correct one required for the SPARQL query? The disambiguation process is easier if the most popular candidate is the relevant one; this measure indicates how hard it is to disambiguate the surface forms.

Analysis Description
For the datasets, we do not have information about which textual parts of the NL question refer to which part of the given SPARQL query (if any). Especially for the description of relationships, respectively the reference of ontology properties, this is a difficult task for a KGQA system. For the analysis of the datasets, we therefore took a detailed look at the surface forms 12 , the entity candidates, and the respective SPARQL query. For each question, we took into account only specific POS tags (cf. Table 6) to identify the mentioned named entities and considered the POS sequences listed there. These POS sequences have been derived by Steinmetz [10] from the most common POS sequences of DBpedia labels.
Here, N refers to any noun, which can be a singular or mass noun (NN), a plural noun (NNS), a singular proper noun (NNP), or a plural proper noun (NNPS). Each POS sequence might be followed by further nouns, which are also taken into account as part of the mentioned entity. For each identified sequence in the question, we retrieve potential entity candidates from DBpedia. For the dictionary, disambiguation and redirect labels are utilized. Then, this extracted list of entity candidates for the complete question is compared to the entities contained in the provided SPARQL query 13 .
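Given a candidate dictionary (surface form mapped to candidate entities with their indegrees), the ambiguity measures described above can be computed as sketched here. The data is a toy stand-in; real candidates come from the DBpedia label, redirect, and disambiguation dictionary.

```python
# Toy candidate lists: surface form -> [(entity URI, indegree), ...]
candidates = {
    "Lincoln": [("dbr:Abraham_Lincoln", 9000),
                ("dbr:Lincoln,_Nebraska", 4000),
                ("dbr:Lincoln_Motor_Company", 1500)],
}

def max_candidates(cands):
    """Largest candidate list across all surface forms: how ambiguous
    is the hardest mention?"""
    return max(len(v) for v in cands.values())

def most_popular_is_required(cands, surface, required_uri):
    """True if the candidate with the highest indegree is the entity
    actually used in the gold SPARQL query."""
    best = max(cands[surface], key=lambda c: c[1])[0]
    return best == required_uri
```

When `most_popular_is_required` is false for many questions, a popularity-only ranking cannot solve the disambiguation, which is exactly the situation reported for the SDBQA datasets.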
In addition, we utilized the Falcon 2.0 API to identify surface forms and entity candidates in the questions. The API has been introduced by Sakor et al. [9] and identifies entities and relations within short texts or questions over Wikidata and DBpedia. For the analysis, we queried the API with the following parameters: -db=1, for DBpedia entities -k=500, for the top 500 entity candidates for an identified surface form 14 Tables 7 and 8 show the results of our analysis. For each approach to identify the entities in the NL question, we determine the following measures, as shown in the tables: -the number of surface forms that reference entities-how many entities can be detected compared to the number of entities required for the SPARQL query? (Entities NL, Entities SPARQL) -the maximum number of entity candidates per surface form-how hard is it to disambiguate the entities? (max Candidates) -the number of named entities (identified in the SPARQL query) that are the most popular in terms of indegree among all candidates for a surface form (Most popular) As no DBpedia-based SPARQL query is provided in WebQuestions and SimpleQuestions, the information about entities in SPARQL queries and whether the most popular entity candidate is the correct one cannot be examined; it is marked as not applicable (n/a) in the tables.
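A request with these parameters could be constructed as sketched below. The endpoint URL and the request format are assumptions based on the published Falcon 2.0 service and may have changed; the sketch only builds the request and does not contact the service.

```python
import json
from urllib import parse, request

FALCON_URL = "https://labs.tib.eu/falcon/falcon2/api"  # assumed endpoint

def falcon_request(question, db=1, k=500):
    """Build the HTTP request used for candidate extraction:
    db=1 asks for DBpedia entities, k bounds the candidate list."""
    url = FALCON_URL + "?" + parse.urlencode(
        {"mode": "long", "db": db, "k": k})
    payload = json.dumps({"text": question}).encode("utf-8")
    return request.Request(url, data=payload,
                           headers={"Content-Type": "application/json"})

req = falcon_request("Who developed Skype?")
# urllib.request.urlopen(req) would then return entity/relation candidates.
```

Keeping k large (500 here) matters for the lexical gap analysis in Sect. 4.3: an entity missing even from a long candidate list is strong evidence for a genuine gap.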

Result Discussion
Our analysis shows that ambiguity is a serious challenge for QA systems throughout all datasets. There are mentions of named entities with more than 100 entity candidates w.r.t. DBpedia. For our approach, the most ambiguous term within all QALD datasets is Lincoln, with 215 entity candidates 15 . The most ambiguous term across all datasets is contained in the SimpleQuestions datasets: pilot, with 479 entity candidates 16 .
In general, the Falcon 2.0 API provides far fewer entity candidates per surface form. The phrase with the highest number of entity candidates is Jacob and Abraham, with 44 entity candidates; it is contained in the LC-QuAD train dataset.
We also analyzed how hard the disambiguation process for the detected entities would be. For this, we determined whether the required entity is the most popular among the candidates for the respective surface form. A disambiguation or ranking process can be considered simpler if the NL questions always mention very popular entities with the respective surface forms. However, our analysis shows that in many cases the relevant entity is not the most popular among the candidates in the list. The SDBQA datasets seem to be very hard to disambiguate, as we detected the lowest share of most-popular entities for both datasets and both entity detection approaches. For the Falcon 2.0 API, the required entity is the most popular in only 31% of the cases. According to our analysis, the QALD 8 train dataset requires the least elaborate disambiguation process, as up to 70% (for the Falcon 2.0 API; 67% for our approach) of the required entities are the most popular among the candidates.
Overall, this means a QA system must be able to disambiguate the mentioned entities, either using answer ranking or according to the given context. Alternatively, the system generates queries for all (or a subset of) relevant entities and presents the results to the user in order to receive feedback on which entity and result is the demanded one.

Lexical Gap
Main findings: -A high share of named entities and relations is hard to identify within the NL questions. -There are significant differences between the datasets regarding the severity of the lexical gap.

Topic Definition
In knowledge graphs, facts are described by subject, property, and object. Properties describe relationships between subject and object, whereas subject and object represent entities (resp. sometimes objects are literals). As natural language is very expressive, names for entities can vary and relationships can be phrased in many different ways. The lexical gap refers to missing links between an entity or relationship described in natural language and the labels available for that entity, property, or class in the underlying knowledge base.

Analysis Description
For the analysis of the extent of the lexical gap within the datasets, we used different approaches to detect entities and relations within the NL and compared the candidate lists with the entities and properties of the respective SPARQL queries. We count all entities and properties from the SPARQL query that are not found in the candidate lists from the NL question. We assume that for these entities/properties a lexical gap exists regarding labels and potential mentions in natural language. In addition to our own approach and the Falcon 2.0 API for identifying entities, we also utilized the Spotlight API 17 . For our approach and the Falcon 2.0 API, we considered all entity candidates for an identified surface form. In this way, we analyzed whether the relevant entity can be identified at all. Unfortunately, the Spotlight API returns only the most relevant entity for the given context, not a candidate list. Therefore, we could only consider this one entity for the analysis. We compared the list of entity candidates with the list of entities extracted from the SPARQL query. The query entities not contained in the candidate list are summed up over the complete dataset. Table 9 shows the results of the lexical gap analysis. The table contains the named entities extracted from the SPARQL query (Entities SPARQL), the number of entities from the SPARQL queries that were not found in the NL question (Entities not found), and the percentage of entities not found relative to the overall number of entities in the SPARQL queries (Percentage not found), for our approach, the Falcon 2.0 API, and the Spotlight API.
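The counting scheme just described can be condensed into a few lines; the example entities are taken from the question discussed later in this section, the function itself is a sketch of the measure, not the survey's exact code.

```python
def lexical_gap(query_entities, nl_candidates):
    """Count gold-query entities that appear in no candidate list of the
    NL question, and their share among all query entities."""
    found = set()
    for cand_list in nl_candidates:
        found.update(cand_list)
    missing = [e for e in query_entities if e not in found]
    return len(missing), len(missing) / len(query_entities)

# Toy example: two entities required, but only one of them ever surfaces
# in a candidate list extracted from the question.
missing, share = lexical_gap(
    ["dbr:Toyota_Verossa", "dbr:Front-engine_design"],
    [["dbr:Toyota_Verossa", "dbr:Toyota"]],
)
```

Summing `missing` over a dataset yields the "Entities not found" column, and averaging `share` the "Percentage not found" column of Table 9.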
We also analyzed the extent of the lexical gap for the required properties of the SPARQL queries. For this, we again utilized the Falcon 2.0 API. For the analysis, we extracted the DBpedia ontology properties from the SPARQL query 18 . We then compared this list to the list of relations extracted by the Falcon 2.0 API from the NL question. We counted the properties that were not found by the API in proportion to the number of properties required for the SPARQL query. Table 10 shows the number of properties extracted from the SPARQL query (Properties SPARQL) and the results of the Falcon API for the extraction of relations (Relations Not Found). The number depicts the total number of properties not found by the API. The percentage depicts the proportion of relations not found compared to the number of required properties as extracted from the SPARQL query. 17 https://www.dbpedia-spotlight.org/api. 18 As the Falcon 2.0 API only considers properties from the DBpedia ontology, we did not take into account additional properties contained in the query, such as rdfs:label, rdf:type, dc:subject, or foaf:name.

Result Discussion
Regarding the identification of the required entities in the NL question, the number of entities not found is remarkably high. With our approach, a minimum of 25% of the entities contained in the SPARQL queries of the QALD datasets could not be found. For the QALD 8 test dataset, the percentage is as high as 46%. For instance, the question What was the university of the rugby player who coached the Stanford rugby teams during 1906-1917? requires the entity dbr:1906-17_Stanford_rugby_teams. Here, different parts of the question (including numbers besides nouns) must be combined to find the label of this entity.
In comparison, the Spotlight API achieved an even lower rate of correctly detected entities for most datasets (respectively, a higher percentage of entities not identified correctly in the NL question). The Spotlight API only returns the most likely entity for each identified surface form according to the given context. But with short questions the context is meager, and the disambiguation apparently fails in many cases. This experiment shows that the disambiguation step should not be placed before the creation of the SPARQL queries in the QA pipeline. A sample question where the API fails is Does the Toyota Verossa have the front engine design platform?. The required entities here are dbr:Toyota_Verossa and dbr:Front-engine_design; the API only detects the first one.
The Falcon 2.0 API performs similarly to or slightly better than our approach on the QALD datasets. The results for the LC-QuAD datasets are very good: only 9% of the entities are not among the candidates extracted by the API. In contrast, the API performs worse than our approach on the SDBQA datasets, where the share of entities that could not be identified is as high as 40%. A sample question where the Falcon 2.0 API fails to identify the relevant entities is Which computer scientist won an oscar?. Here, the required entities are dbr:Computer_Science and dbr:Academy_Award.
As Table 10 shows, the correct identification of DBpedia ontology properties is even harder than the entity identification.
The share of properties not detected by the Falcon 2.0 API is remarkably high for all datasets, but especially for the SDBQA datasets with 73%. Mostly, this results from the fact that DBpedia facts and subgraphs are modeled along the ontology and not directly as expressed in natural language. For instance, the question Give me English actors starring in Lovesick. requires the properties dbo:country and dbo:birthPlace to express the English origin of the requested actors. Obviously, these relations cannot be deduced from the NL alone. But the API also fails to detect the property dbo:knownFor within the question What is Elon Musk famous for?.
We provide our analysis results for the properties not identified by the Falcon 2.0 API as a JSON dataset 19 . Future mapping processes to identify alternative labels for DBpedia ontology properties might benefit from this dataset.
Our analyses show that the datasets contain a high number of questions for which the correct entities and properties required for the SPARQL query cannot be detected by any of the approaches considered in our analyses. This means that for many questions the correct SPARQL query cannot be created with the correct entities, which results in incorrect answers.
Apparently, the lexical gap is a significant challenge not only for the mapping of relationship descriptions to ontology properties, but even for the identification of the correct entities mentioned in the NL question. However, there are obviously significant differences between the datasets.

Complex Queries
Main findings: -Only the QALD datasets require SPARQL operators other than UNION; the LC-QuAD datasets do not contain any SPARQL operator. -Entity detection approaches often identify more entities in the NL question than required, especially for the LC-QuAD datasets. -Except for QALD 8 test and the SDBQA datasets, all datasets contain questions whose required SPARQL query does not contain any named entity.

Topic Definition
The expressiveness of semantic knowledge bases rests on the rather simple data structure of facts stored as triples and the effective approach of using graph patterns in SPARQL queries to access the knowledge. However, SPARQL supports several operators which may lead to rather complex queries. Obviously, more complex queries result from complex questions and are certainly a challenge for developers of KGQA systems.

Analysis Description
For our analysis, we examined the datasets (that provide a SPARQL query) on the existence of the following query operators: FILTER, OFFSET, LIMIT, ORDER, GROUP, UNION, OPTIONAL, Subquery, HAVING, ASK type.
Detailed information on how often each operator occurs in each dataset is given in Table 11. As none of the datasets contains SPARQL queries with subqueries, we left that column out of the table.
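Counting the operator occurrences per dataset can be approximated by keyword matching over the query strings, as sketched below with two toy queries. Matching keywords is a simplification; a SPARQL parser would be more robust against operator names occurring inside literals.

```python
import re

# SPARQL operators examined in this section (subqueries handled separately).
OPERATORS = ["FILTER", "OFFSET", "LIMIT", "ORDER", "GROUP",
             "UNION", "OPTIONAL", "HAVING", "ASK"]

def count_operators(queries):
    """Per operator, count in how many queries it occurs."""
    counts = {op: 0 for op in OPERATORS}
    for q in queries:
        for op in OPERATORS:
            if re.search(r"\b" + op + r"\b", q, re.IGNORECASE):
                counts[op] += 1
    return counts

queries = [
    "ASK WHERE { dbr:Skype dbo:developer dbr:Skype_Technologies }",
    "SELECT ?x WHERE { ?x dbo:birthPlace dbr:Ilmenau } ORDER BY ?x LIMIT 5",
]
counts = count_operators(queries)
```

Applied to a whole dataset, such counts reproduce the kind of overview that Table 11 provides.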
As another parameter for complexity, we also counted the maximum/minimum number of entities extracted from the SPARQL query and the maximum/average/median number of triples in the SPARQL query. The results are shown in Table 12.
A further essential process step within KGQA systems is the identification of the correct focus in the NL question. The challenge here is to determine which part of the question is the subject of interest and how it relates to the rest of the question. For template-based KGQA systems, the graph patterns of the SPARQL query are constructed around this focus. Sequence-to-sequence systems can benefit from a preceding focus identification, as the trained model might make use of this information. To examine this aspect of complexity, we analyzed and compared the number of named entities in the NL question and in the SPARQL query. If the NL question contains more entities than the SPARQL query, the process of focus identification is an essential step. If the numbers of entities in the NL question and the SPARQL query are equal, this might be a hint that all entities found in the natural language can be adopted for the SPARQL query. If there are more entities in the SPARQL query than in the NL question, an analysis process might be required to deduce the additional entities from the focus(es) and the relationships extracted from the linguistics of the question. Table 13 shows the results for all datasets and contrasts the results of our approach with those of the Falcon 2.0 API. The table indicates whether more named entities have been found in the SPARQL query (More in SPARQL) compared to the NL question, more in the NL question (More in NL) compared to the SPARQL query, or whether the number of identified named entities is equal in the SPARQL query and the NL question.
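The three-way classification underlying Table 13 can be sketched as follows; the per-question counts are toy values for illustration.

```python
from collections import Counter

def compare_counts(n_nl, n_sparql):
    """Classify one question into the three cases discussed above."""
    if n_sparql > n_nl:
        return "more in SPARQL"
    if n_nl > n_sparql:
        return "more in NL"
    return "equal"

# Toy per-question pairs: (entities detected in NL, entities in gold query).
pairs = [(2, 1), (1, 1), (0, 1), (3, 1)]
distribution = Counter(compare_counts(nl, sp) for nl, sp in pairs)
```

A dataset dominated by the "more in NL" case signals that focus identification (filtering out detected but irrelevant entities) is the critical step.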

Result Discussion
Our analysis shows that, for all datasets, the test datasets reflect the complexity of the training datasets or even contain less complex queries, because some operators are not present in the test dataset although they occur in the training dataset. The HAVING and OFFSET operators are utilized only rarely. None of the SPARQL operators is present in any query of the LC-QuAD datasets. Only the QALD datasets contain all operators to some extent. The SDBQA datasets naturally do not contain ASK queries or any SPARQL operators other than UNION. The UNION queries are only utilized to model the SPARQL query with alternative properties, as described in Sect. 3.3. For almost all datasets, the minimum number of named entities contained in the SPARQL queries is zero. An example of a question resulting in zero named entities in the SPARQL query is: Which actors have the last name "Affleck"?. Here, the query only asks for entities of a specific type that contain the string "Affleck" as object of a property lastName (i.e., foaf:lastName). Figure 2 shows this sample question and the resulting SPARQL query without a named entity in it.
Regarding the maximum/minimum number of entities, the dataset QALD 8 test stands out among the others. All of its SPARQL queries comprise exactly one entity, which could be a hint that this dataset is a bit easier to process in terms of evaluation results than the others. This assumption is supported by the analysis results shown in Table 11: QALD 8 test comprises only a few SPARQL operators and no ASK question. On the other hand, our analysis process failed to detect a relatively high number of the entities (between 17 and 19 out of 41), as shown in Table 9.
In most cases, the number of entities detected in the NL question is higher than the number of entities extracted from the SPARQL query, as shown in Table 13. For our approach, this results from the detection process itself, which aims at high recall rather than high precision. As described in Sect. 4.3, we detected entities in the NL question according to several POS sequences, all of which include at least one noun. This procedure extracts all (combined) nouns from the question, even when a noun is not relevant as an entity for the query. However, the Falcon 2.0 API also extracts too many entities in many cases, especially for the LC-QuAD datasets.
An example of a question having more named entities in the NL question than in the SPARQL query is: How many gold medals did Michael Phelps win at the 2008 Olympics?. Here, both our algorithm and the Falcon 2.0 API detect Michael Phelps and 2008 Olympics as named entities, but the SPARQL query only asks for the gold medalist dbr:Michael_Phelps and filters the respective events for the strings "2008" and "Olympics":

SELECT (COUNT(?sub) AS ?c) WHERE { ?sub dbo:goldMedalist dbr:Michael_Phelps . FILTER(contains(str(?sub), "2008") && contains(str(?sub), "Olympics")) }

Nevertheless, the number of questions with more entities in the SPARQL query than in the NL question is also reasonably high for all datasets. In these cases, the additional entities must be deduced from the linguistics of the question or along the edges of the knowledge graph. A case that often occurs is the modeling of apparent type information using a property and a resource in the SPARQL query. For instance, in many cases a type constraint is expressed in the form Which [ontology class name] was [...]?. For these cases, the phrase following the word which must be used to identify the correct ontology class from the KG. But in some cases, specifically for DBpedia, such class membership is modeled using a property. For instance, the question Which professional surfers were born in Australia? might ask for instances of the class dbo:Surfer. But the given SPARQL query in the dataset models the fact using the property dbo:occupation and the resource dbr:Surfer:

SELECT DISTINCT ?uri WHERE { ?uri dbo:occupation dbr:Surfer ; dbo:birthPlace dbr:Australia }

This example shows that an apparent class membership can also be modeled as a relationship between entities. This circumstance must be taken into account when transforming NL questions to SPARQL (for DBpedia).
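The comparison behind Table 13 can be sketched as follows: extract the dbr: resources from the query string and compare their count with the entities detected in the NL question (a simplified sketch; the paper's extraction works on the parsed query, and the regular expression is an assumption):

```python
import re

def entities_in_query(query):
    """Extract named entities (dbr: resources) from a SPARQL query string."""
    return re.findall(r"dbr:[\w()',.-]+", query)

def compare_entity_counts(nl_entities, query):
    """Classify a question into the categories of Table 13."""
    q_count = len(entities_in_query(query))
    if len(nl_entities) > q_count:
        return "More in NL"
    if q_count > len(nl_entities):
        return "More in SPARQL"
    return "Equal"

query = ("SELECT (COUNT(?sub) AS ?c) WHERE { ?sub dbo:goldMedalist "
         "dbr:Michael_Phelps . FILTER(contains(str(?sub), \"2008\")) }")
print(compare_entity_counts(["Michael Phelps", "2008 Olympics"], query))  # More in NL
```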

Templates
Main findings:
- Datasets represent a high number of different POS sequences: 56,844 out of 171,487 different questions.
- We identified 22 different graph patterns within the SPARQL queries, but only a few (between 5 and 7) are frequently required.

Topic Definition
As described in [6], template-based approaches try to identify patterns within the natural language and transform them to SPARQL query templates. The relevant parts of the templates are then mapped to the underlying knowledge base, and the complete query is created. Most approaches use linguistic and syntactic parsers to identify similar natural language patterns that lead to the same SPARQL query template. For the analysis of the datasets regarding templates, we followed the assumption that the number of different patterns is limited. Of course, natural language can be very expressive (also depending on the language), but in terms of KGQA, we assumed that a SPARQL query template can only be deduced from a limited number of NL patterns. Therefore, we extracted the POS sequences of the NL questions and performed a normalization step. Furthermore, templates can also be found in the SPARQL queries. A query represents a subgraph of the complete knowledge graph, and depending on how the subjects and objects of the triples are connected, different graph patterns emerge. Therefore, we analyzed the SPARQL queries of the datasets in order to determine the number of different graph patterns.

Analysis Description
We retrieved the Part-of-Speech (POS) patterns for all questions of all datasets. That means we annotated a question with POS tags, utilizing the Stanford POS tagger, and retrieved the patterns by only using the tags in the order they occur in the question. Furthermore, we processed the POS sequences in terms of normalization. After the identification of named entities in the NL question, we replaced all POS tags that belong to an entity with the placeholder RESOURCE. Consecutive RESOURCE occurrences were merged into a single placeholder. Examples of this normalization are shown in Table 5.
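This normalization can be sketched as follows (the entity spans are assumed to come from the preceding entity detection step):

```python
def normalize_pos_sequence(pos_tags, entity_spans):
    """Replace POS tags covered by a named entity with the placeholder
    RESOURCE and collapse consecutive RESOURCE occurrences into one."""
    normalized = []
    for i, tag in enumerate(pos_tags):
        in_entity = any(start <= i < end for start, end in entity_spans)
        label = "RESOURCE" if in_entity else tag
        # skip a RESOURCE that directly follows another RESOURCE
        if not (normalized and normalized[-1] == "RESOURCE" and label == "RESOURCE"):
            normalized.append(label)
    return " ".join(normalized)

# "Who is the owner of Universal Studios?"
tags = ["WP", "VBZ", "DT", "NN", "IN", "NNP", "NNP"]
spans = [(5, 7)]  # "Universal Studios" covers tokens 5 and 6
print(normalize_pos_sequence(tags, spans))  # WP VBZ DT NN IN RESOURCE
```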
In addition to the POS sequences of the NL questions, we also analyzed what types of subgraphs need to be constructed for the SPARQL queries. Therefore, we extracted the graph patterns and counted the occurrence of the patterns per dataset. For the extraction of the graphs, we considered the following principles:
- We removed GROUP, ORDER, LIMIT, OFFSET, HAVING and FILTER restrictions. These operators do not affect the subgraph.
- As OPTIONAL triples are not necessarily required to answer a question, we also removed these clauses.
- SPARQL queries containing UNION clauses are disaggregated into all relevant graphs. As all graphs might contribute to answering the question, all graphs are assigned as graph patterns for this question.
After extraction of all graphs from the queries, we analyzed the set of graphs for isomorphism and counted the occurrence of the graph patterns for each dataset.
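For small basic graph patterns, the isomorphism check can be sketched with a brute-force canonical form: abstract all constants, try every variable renaming, and keep the lexicographically smallest triple list (a sketch under the assumption that patterns contain only a handful of variables; the sample patterns are illustrative):

```python
from itertools import permutations
from collections import Counter

def canonical_form(triples):
    """Canonical representative of a basic graph pattern: constants are
    abstracted to CONST, variables are renamed in every possible way,
    and the lexicographically smallest sorted triple list is kept."""
    variables = sorted({t for triple in triples for t in triple
                        if t.startswith("?")})
    best = None
    for perm in permutations(range(len(variables))):
        mapping = {v: "?v%d" % perm[i] for i, v in enumerate(variables)}
        renamed = sorted(tuple(mapping.get(t, "CONST") for t in triple)
                         for triple in triples)
        if best is None or renamed < best:
            best = renamed
    return tuple(best)

# two structurally identical chain patterns with different variable names
g1 = [("?x", "dbo:occupation", "?y"), ("?y", "dbo:type", "dbr:Surfer")]
g2 = [("?b", "dbo:award", "?a"), ("?a", "dbo:genre", "dbr:Jazz")]
patterns = Counter(canonical_form(g) for g in (g1, g2))
print(len(patterns))  # 1
```

Counting the canonical forms per dataset then yields the occurrence numbers of the graph patterns.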

Result Discussion
Utilizing the normalization step, the overall number of sequences is reduced to 50,455. But still, there is no normalized POS sequence that occurs in all datasets. The most frequent normalized sequence, with 2601 occurrences, is WP VBZ DT NN IN RESOURCE, which originates from questions like Who is the owner of Universal Studios? or What is the revenue of IBM?. This sequence only occurs in 4 of the 26 datasets. The most frequent sequence in terms of different datasets is WP VBD RESOURCE. This sequence occurs in 24 of the 26 datasets.
Obviously, the number of different POS sequences that must be taken into account is limited, but still very large.
Overall, we identified 22 different graph patterns for QALD 1-9, LC-QuAD and SDBQA. The patterns are shown in Fig. 3.
In addition, we analyzed by how many different normalized POS sequences the graph patterns are represented within each dataset. The results for this analysis are shown in Table  14.
As already observed for the aspect of complex queries, the SPARQL query graphs require at most 5 triples. These largest patterns are represented by two different subgraphs (graph IDs 15 and 22 in Fig. 3). They are only contained in the QALD 8 (both) and QALD 9 train datasets. All other datasets contain 4 triples at most.
Within the QALD datasets, only 5 different graphs (graph IDs 1-5) are notably frequent, while the other graph patterns are used only sparsely or not at all. The LC-QuAD datasets mainly use 7 different graph patterns for their queries; only one further pattern is used a few times. That means LC-QuAD only utilizes 8 different patterns with 3 triples at most.

Ontology Types
Main findings:
- In general, DBpedia ontology classes are mostly too general to deduce a domain focus of a dataset.
- SDBQA contains many entities assigned to DBpedia ontology classes that hint at the entertainment domain.
- For the other datasets, a domain cannot be assigned.

Topic Definition
The difficulty of identifying the correct formal query for a given NL question also depends on the specific domain of the question. For some (e.g., technical) domains, terms are nearly unique. That means the disambiguation task can be omitted and the mapping of surface forms to properties, classes and entities is straightforward. For other domains, these tasks might be much more difficult, which hinders the overall task of question answering. Therefore, we analyzed the datasets for the ontology classes assigned to the entities used in the SPARQL queries. These ontology classes give a hint about the domain of the question. For instance, if the SPARQL query contains an entity of class dbo:Athlete, the question is most likely from the sports domain.

Analysis Description
For the analysis, we extracted the entities of the given SPARQL queries and retrieved the respective ontology classes via the rdf:type information of the DBpedia knowledge graph. We took into account all assigned classes along the class hierarchy of the DBpedia ontology. Table 15 shows the top 10 DBpedia ontology classes and the frequency with which they are assigned to named entities of the SPARQL queries, across all datasets. The table also lists the top 5 classes for each dataset group (train and test dataset together) separately.
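Collecting all classes along the hierarchy can be sketched as a transitive closure over the subclass relation (the type and subclass facts shown are illustrative; in the analysis they are retrieved from the DBpedia knowledge graph):

```python
def classes_with_ancestors(entity, direct_types, superclasses):
    """Collect all ontology classes for an entity, following the class
    hierarchy upwards (transitive closure over rdfs:subClassOf)."""
    result = set()
    stack = list(direct_types.get(entity, []))
    while stack:
        cls = stack.pop()
        if cls not in result:
            result.add(cls)
            stack.extend(superclasses.get(cls, []))
    return result

direct_types = {"dbr:Michael_Phelps": ["dbo:Swimmer"]}
superclasses = {"dbo:Swimmer": ["dbo:Athlete"], "dbo:Athlete": ["dbo:Person"]}
print(classes_with_ancestors("dbr:Michael_Phelps", direct_types, superclasses))
```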

Answer Types
Main findings:
- Answers and answer types for the SDBQA datasets cannot be retrieved for a large number of questions.
- For LC-QuAD, the number answer type only results from a COUNT operator, not from a property with a number as range.
- For most datasets, the resource answer type mostly does not require a type assignment in the SPARQL query.

Topic Definition
Recently, a challenge on answer type prediction has been published as part of the International Semantic Web Conference 2020 (ISWC, https://smart-task.github.io/#). The task of this challenge is to predict the answer type of the question according to the structure of the NL question. For instance, the question Who is the heaviest player of the Chicago Bulls? requires the answer to be of type dbo:BasketballPlayer, or the question How many employees does IBM have? requires the answer to be of type xsd:integer.

Analysis Description
For the analysis of the datasets regarding answer types, we defined 10 different types:
- date
- boolean, resulting from an ASK question
- string, asking for string objects, such as last names or nicknames
- number count, a number resulting from a COUNT operator in the SPARQL query
- number property, a number resulting from a property in the SPARQL query
- resource list typed, a list of resources with a specific type
- resource list untyped, a list of resources without a specific type
- resource typed, one resource with a specific type
- resource untyped, one resource without a specific type
- unknown, the answer type could not be detected

The QALD challenge provides a hint about the answer type in the datasets, but only for the latest editions. Also, the provided answer types are more general than the types we included in our survey. Therefore, we performed an analysis regarding answer types for all KGQA datasets. Some datasets provide the answers for each question as part of the dataset. In this case, we analyzed the answer type according to the provided answers. For some datasets, the answers are not provided: both LC-QuAD datasets, the SDBQA datasets, and some test datasets of the QALD challenge. In this case, we used the SPARQL query to retrieve the answers from the respective DBpedia version. If we could not retrieve the answers, we further analyzed the question:
- if the question starts with When, the answer type is set to date
- if the query starts with ASK, the answer type is set to boolean
- if the query contains a COUNT operator for the only variable, the answer type is set to number count

If none of these analysis steps results in a proper answer type, the type is set to unknown. This applies to many of the LC-QuAD questions, because no results could be retrieved. Table 16 shows the results of our analysis. The table contains the overall numbers of occurrences of the answer types we pre-defined.
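The fallback rules can be sketched as a small classifier (a sketch; the check for a COUNT operator on the only variable is simplified to a substring test):

```python
def fallback_answer_type(question, query):
    """Heuristic answer type used when no answers could be retrieved."""
    if question.strip().lower().startswith("when"):
        return "date"
    upper = query.strip().upper()
    if upper.startswith("ASK"):
        return "boolean"
    if "COUNT(" in upper:
        return "number count"
    return "unknown"

print(fallback_answer_type("When was Ada Lovelace born?",
                           "SELECT ?d WHERE { ... }"))  # date
```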

Result Discussion
The most obvious observation is the high number of unknown answer types for both LC-QuAD datasets. This results from missing results in the datasets and missing answers when executing the SPARQL queries on the DBpedia graph of 2016-10. Overall, we had to set the answer type to unknown for a remarkably high number of questions for all datasets. This means that for these questions the answers are not available and the question cannot be answered, either because of missing facts in the knowledge graph or faulty SPARQL queries. However, a KGQA system would try to generate a query for these questions and retrieve answers; the system would fail in these cases. We provide the results of our answer type analysis as a separate dataset. The dataset contains the id from the original dataset, the name of the source dataset, the question string, and the detected answer type as a JSON file for each dataset file.

Discussion & Summary
The analysis presented in this paper gives a thorough overview of the KGQA evaluation datasets currently available. We examined 22 datasets that provide NL questions (some of them in multiple languages) and a respective SPARQL query. Additionally, four further datasets containing a reasonable number of interesting questions have been taken into account for comparison purposes. Based on several aspects, we examined essential characteristics of the datasets to be able to compare them. The performed experiments reveal the requirements that KGQA systems need to fulfill regarding SPARQL functions, disambiguating surface forms, or detecting the correct answer type. Therefore, the survey provides researchers with extensive information about which specific challenges are contained in the datasets (amongst others):
- The required entities are often hard to identify, because of very ambiguous surface forms and difficult disambiguation processes.
- The lexical gap is remarkably high for entity and relation names.
- The datasets differ in the severity of complex queries in terms of the SPARQL operators required for the SPARQL query.
In terms of comparability, researchers need a dataset that provides realistic questions, the SPARQL query and answers according to a current SPARQL endpoint.
Unfortunately, the QALD datasets for editions 1-7 are, in general, outdated regarding the DBpedia version compared to the version currently available at the public DBpedia SPARQL endpoint (as of January 2021). However, the DBpedia versions are simply outdated because in newer versions facts are missing or properties have been replaced; the general approach of how facts are modeled is maintained throughout the versions of the knowledge graph. Therefore, even the outdated datasets are a useful source for sample questions and complex queries. The LC-QuAD 1.0 datasets provide a reasonable number of records, but we identified two problems:
- compared to the QALD datasets, LC-QuAD 1.0 does not contain any SPARQL queries with additional options, such as UNION, OPTIONAL, HAVING, etc., and
- a large number of the SPARQL queries (referencing DBpedia 2016-10) do not provide any result on the respective SPARQL endpoint (as of January 2021).
SDBQA is the dataset with the highest number of questions. But similar to the LC-QuAD 1.0 datasets, it does not contain any further SPARQL options except for the UNION operator. Likewise, the dataset contains a high number of questions using properties from the GOLD ontology, which is no longer contained in the DBpedia dataset of 2016-10.
Our results show that there are indeed differences between the datasets. While the QALD datasets overall are fairly similar and only individual datasets stand out, the differences to the LC-QuAD and SDBQA datasets are significant. However, the WebQuestions and SimpleQuestions datasets show a similar structure and characteristics as the questions of the KGQA datasets. Altogether, the four QA datasets contain over 26,000 questions and might serve as a good source for further examinations of questions often asked on the internet and their structure.
With our work, we aim at a detailed insight into the KGQA datasets available for evaluation. We provide the results of our answer type analysis and the property detection failures as separate datasets for download and further examination. Overall, we examined 26 different datasets based on several challenging aspects and provided statistical numbers on ambiguity, complexity, templates, the lexical gap, ontology types, and answer types. Although the datasets show significant differences for several aspects, none of the datasets stands out in terms of a low or high difficulty level when all aspects are considered altogether. Nevertheless, our analysis results exemplify the characteristics of each dataset in detail. In this way, developers of KGQA systems are able to choose a certain training dataset when they want to focus on a specific challenging aspect. Overall, our analysis results show that (KG)QA is a sophisticated but interesting research field which deals with the diversity of natural language and the expressiveness of SPARQL queries.
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.