
1 Introduction

The task of Question Answering over Linked Data (QALD) has received increased attention in recent years (see the surveys [14, 36]). The task consists in mapping a natural language question into an executable form, typically a SPARQL query, that allows answers to the question to be retrieved from a given knowledge base. Consider the question: Who created Wikipedia?, which can be interpreted as the following SPARQL query with respect to DBpedia:

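The query listing itself is not shown here. Based on the identifiers used for this running example throughout the paper (dbr:Wikipedia, dbo:author), the query is presumably of the following form (prefix declarations and variable naming are illustrative):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT DISTINCT ?x WHERE {
  dbr:Wikipedia dbo:author ?x .
}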

An important challenge in mapping natural language questions to SPARQL queries lies in overcoming the so-called 'lexical gap' (see [13, 14]). The lexical gap makes interpreting the above question correctly challenging, as there is no surface relation between the query string created and the URI local name author. To bridge the lexical gap, systems need to infer that create should be interpreted as author in the above case.

The lexical gap is only exacerbated when considering multiple languages, as we then face a cross-lingual gap that needs to be bridged. Consider for instance the question Wer hat Wikipedia gegründet? ('Who founded Wikipedia?'), which requires mapping gründen to author in order to interpret the question successfully.

To address the lexical gap in question answering over linked data, we present a new system, AMUSE, that relies on probabilistic inference to perform structured prediction over the space of possible SPARQL queries and to predict the query that is most likely the correct interpretation of the input question. The parameters of the model are optimized on a training dataset consisting of natural language questions paired with their corresponding SPARQL queries, as provided by the QALD benchmark. The inference process builds on approximate inference techniques, Markov Chain Monte Carlo in particular, to assign knowledge base (KB) identifiers as well as meaning representations to every node in a dependency tree representing the syntactic structure of the question. On the basis of these assignments, a full semantic representation is computed by bottom-up semantic composition along the parse tree. As a novelty, our model can be trained on different languages by relying on universal dependencies. To our knowledge, this is the first system for question answering over linked data that can be trained to perform on different languages (three in our case) without the need to implement any language-specific heuristics or knowledge. To overcome the cross-lingual lexical gap, we experiment with automatically translated labels and rely on an embedding approach to retrieve similar words in embedding space. We show that using word embeddings effectively contributes to reducing the lexical gap compared to a baseline system in which only known labels are used.

2 Approach

Our intuition in this paper is that interpreting a natural language question as a SPARQL query is a compositional process in which partial semantic representations are combined with each other in a bottom-up fashion along a dependency tree representing the syntactic structure of the question. Instead of relying on hand-crafted rules to guide the composition, we rely on a learning approach that can infer such 'rules' from training data. We employ a factor graph model, trained with a ranking objective using SampleRank, that learns to prefer good over bad interpretations of a question. In essence, an interpretation of a question represented as a dependency tree consists of an assignment of several variables: (i) a KB ID and semantic type for every node in the parse tree, and (ii) an argument index (1 or 2) for every edge in the dependency tree, specifying which slot of the parent node, subject or object, the child node should fill. The input to our approach is thus a set of pairs (q, sp) of question q and SPARQL query sp. As an example, consider the following questions in English, German, and Spanish: Who created Wikipedia?, Wer hat Wikipedia gegründet?, and ¿Quién creó Wikipedia?. Independently of the language they are expressed in, the three questions can be interpreted as the same SPARQL query from the introduction.

Our approach consists of two inference layers, which we call L2KB and QC. Each layer uses a different factor graph optimized for a different subtask of the overall task. The first inference layer is trained with an entity linking objective and learns to link parts of the question to KB identifiers. In particular, this inference step assigns KB identifiers to open class words such as nouns, proper nouns, adjectives, and verbs. In our case, the knowledge base is DBpedia. We use Universal Dependencies [28] to obtain dependency parse trees for the three languages. The second inference layer is a query construction layer that takes the top k results from the L2KB layer and assigns semantic representations to closed class words such as question pronouns and determiners to yield a logical representation of the complete question. The approach is trained on the QALD-6 training dataset of English, German, and Spanish questions to optimize the parameters of the model. The model learns mappings between the dependency parse tree of a given question and RDF nodes in the SPARQL query. As output, our system produces an executable SPARQL query for a given natural language question. All data and source code are freely available. As semantic representations, we rely on DUDES, which are described in the following section.

2.1 DUDES

DUDES (Dependency-based Underspecified Discourse Representation Structures) [9] are a formalism for specifying meaning representations and their composition. They are based on Underspecified Discourse Representation Theory (UDRT) [10, 33]. Formally, a DUDE is defined as follows:

Definition 1

A DUDE is a 5-tuple \((v,\text {vs},l,\text {drs},\text {slots})\) where

  • v is the main variable of the DUDES

  • vs is a (possibly empty) set of variables, the projection variables

  • l is the label of the main DRS

  • drs is a DRS (the main semantic content of the DUDE)

  • slots is a (possibly empty) set of semantic dependencies

The core of a DUDES is thus a Discourse Representation Structure (DRS) [15]. The main variable represents the variable to be unified with variables in slots of other DUDES that the DUDE in question is inserted into. Each DUDE captures information about which semantic arguments are required for it to be complete, in the sense that all slots have been filled. These required arguments are modeled as a set of slots that are filled via (functional) application of other DUDES. The projection variables are relevant in meaning representations of questions; they specify which entity is asked for. When converting DUDES into SPARQL queries, they directly correspond to the variables in the SELECT clause of the query. Finally, slots capture information about which syntactic elements map to which semantic arguments in the DUDE.
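To make the definition concrete, the following minimal Python sketch (illustrative only, not the authors' implementation) models a DUDE as such a 5-tuple, with the DRS reduced to a set of predicate-argument conditions and slot filling realized as functional application:

from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass
class Dude:
    """A DUDE as a 5-tuple (v, vs, l, drs, slots); the DRS is simplified to a
    set of predicate-argument conditions."""
    main_var: str                      # v: the main variable
    proj_vars: Set[str]                # vs: projection variables (asked-for entities)
    label: str                         # l: label of the main DRS
    drs: Set[Tuple[str, ...]]          # conditions of the main DRS
    slots: List[Tuple[str, int]]       # (variable, argument index) pairs still to fill

    def apply(self, other: "Dude", index: int) -> "Dude":
        """Functional application: fill the slot with the given argument index by
        unifying its variable with the main variable of `other`."""
        slot_var = next(v for (v, i) in self.slots if i == index)
        remaining = [(v, i) for (v, i) in self.slots if i != index]
        rename = lambda x: other.main_var if x == slot_var else x
        drs = {tuple(rename(a) for a in cond) for cond in self.drs} | other.drs
        return Dude(rename(self.main_var), self.proj_vars | other.proj_vars,
                    self.label, drs, remaining)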

As basic units of composition, we consider 5 pre-defined DUDES types that correspond to data elements in RDF datasets. We consider Resource DUDES that represent resources or individuals denoted by proper nouns such as Wikipedia (see 1st DUDES in Fig. 1). We consider Class DUDES that correspond to sets of elements, i.e. classes, for example the class of Persons (see 2nd DUDES in Fig. 1). We also consider Property DUDES that correspond to object or datatype properties such as author (see 3rd DUDES in Fig. 1). We further consider restriction classes that represent the meaning of intersective adjectives such as Swedish (see 4th DUDES in Fig. 1). Finally, a special type of DUDES can be used to capture the meaning of question pronouns, e.g. Who or What (see 5th DUDES in Fig. 1).

Fig. 1. Examples of the 5 types of DUDES

When applying a DUDE \(d_2\) to \(d_1\), where \(d_1\) subcategorizes a number of semantic arguments, we need to indicate which argument \(d_2\) fills. For instance, applying the 1st DUDES in Fig. 1 to the 3rd DUDES in Fig. 1 at argument index 1 yields a DUDE in which the subject argument of author has been filled by Wikipedia and only the object argument (index 2) remains open.

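This composition step can be illustrated with the Dude sketch from above (KB identifiers as in the running example; variable and label names are purely illustrative):

# Property DUDES for dbo:author: author(x, y) with subject slot 1 and object slot 2.
author = Dude(main_var="y", proj_vars=set(), label="l1",
              drs={("dbo:author", "x", "y")}, slots=[("x", 1), ("y", 2)])

# Resource DUDES for dbr:Wikipedia.
wikipedia = Dude(main_var="dbr:Wikipedia", proj_vars=set(), label="l2",
                 drs=set(), slots=[])

# Apply the resource to the property at argument index 1 (subject position).
result = author.apply(wikipedia, index=1)
print(result.drs)    # {('dbo:author', 'dbr:Wikipedia', 'y')}
print(result.slots)  # [('y', 2)]  -- the object slot is still unfilled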

2.2 Imperatively Defined Factor Graphs

In this section, we introduce the concept of factor graphs [19], following the notation in [17, 41]. A factor graph \(\mathcal {G}\) is a bipartite graph that defines a probability distribution \(\pi \). The graph consists of variables V and factors \(\varPsi \). Variables can be further divided into sets of observed variables X and hidden variables Y. A factor \(\varPsi _i\) connects subsets of observed variables \(x_i\) and hidden variables \(y_i\), and computes a scalar score as the exponential of the scalar product of a feature vector \(f_i(x_i, y_i)\) and a parameter vector \(\theta _i\): \(\varPsi _i=e^{ f_i(x_i,y_i) \cdot \theta _i}\). The probability of the hidden variables given the observed variables is proportional to the product of the individual factors:

$$\begin{aligned} \pi (y|x;\theta ) = \frac{1}{Z(x)} \prod _{\varPsi _i\in \mathcal {G}} \varPsi _i(x_i,y_i)= \frac{1}{Z(x)} \prod _{\varPsi _i\in \mathcal {G}} e^{ f_i(x_i,y_i)\cdot \theta _i} \end{aligned}$$
(1)

where Z(x) is the partition function. For a given input consisting of a dependency parsed sentence, the factor graph is rolled out by applying template procedures that match over parts of the input and generate corresponding factors. The templates are thus imperatively specified procedures that roll out the graph. A template \(T_j \in \mathcal {T}\) defines the subsets of observed and hidden variables \({(x',y')}\) with \(x' \in X_j\) and \(y' \in Y_j\) for which it can generate factors and a function \(f_j(x', y')\) to generate features for these variables. Additionally, all factors generated by a given template \(T_j\) share the same parameters \(\theta _j\). With this definition, we can reformulate the conditional probability as follows:

$$\begin{aligned} \pi (y|x;\theta ) = \frac{1}{Z(x)} \prod _{T_j \in \mathcal {T}} \prod _{(x', y') \in T_j} e^{ f_j(x', y')\cdot \theta _j} \end{aligned}$$
(2)

The input to our approach is a pair (W, E) consisting of a sequence of words \(W=\{w_1, \dots , w_n\}\) and a set of dependency edges \(E \subseteq W \times W\) forming a tree. A state \((W,E,\alpha ,\beta ,\gamma )\) represents a partial interpretation of the input in terms of partial semantic representations. The partial functions \(\alpha : W \rightarrow KB\), \(\beta : W \rightarrow \{t_1,t_2,t_3,t_4,t_5\}\) and \(\gamma : E \rightarrow \{1,2\}\) map words to KB identifiers, words to the five basic DUDES types, and edges to indices of semantic arguments, with 1 corresponding to the subject of a property and 2 to the object. Figure 2 shows a schematic visualization of a question along with its factor graph. Factors measure the compatibility between different assignments of observed and hidden variables. The interpretation of a question is the one that maximizes the posterior of a model with parameters \(\theta \): \(y^* = argmax_{y}\pi (y|x;\theta )\).
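A compact sketch of the state and of the template-based scoring in Eq. (2) might look as follows (illustrative; the template and feature definitions are placeholders, not those of the actual system):

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class State:
    words: List[str]                                       # W
    edges: List[Tuple[int, int]]                           # E, as (parent, child) pairs
    alpha: Dict[int, str] = field(default_factory=dict)    # word index -> KB ID
    beta: Dict[int, str] = field(default_factory=dict)     # word index -> DUDES type t1..t5
    gamma: Dict[Tuple[int, int], int] = field(default_factory=dict)  # edge -> 1 or 2

@dataclass
class Template:
    # Maps a state to the feature vectors of all factors this template generates.
    factor_features: Callable[[State], List[Dict[str, float]]]
    theta: Dict[str, float] = field(default_factory=dict)

def log_score(state: State, templates: List[Template]) -> float:
    """Unnormalized log-probability: sum over all factors of f(x', y') . theta_j."""
    total = 0.0
    for t in templates:
        for feats in t.factor_features(state):
            total += sum(v * t.theta.get(k, 0.0) for k, v in feats.items())
    return total  # pi(y|x) is proportional to exp(total)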

Fig. 2. Factor graph for the question Who created Wikipedia?. Observed variables are depicted as bubbles with straight lines; hidden variables as bubbles with dashed lines. Black boxes represent factors.

2.3 Inference

We rely on an approximate inference procedure, Markov Chain Monte Carlo in particular [1]. The method iteratively explores the state space of possible question interpretations by proposing concrete changes to sets of variables, which define a proposal distribution. The inference procedure performs an iterative local search and can be divided into (i) generating possible successor states for a given state by applying changes, (ii) scoring the states using the model score, and (iii) deciding which proposal to accept as successor state. A proposal is accepted with a probability proportional to the likelihood assigned by the distribution \(\pi \). To compute the logical form of a question, we run two inference procedures using two different models. The first model, L2KB, is trained using a linking objective that learns to map open class words to KB identifiers. The MCMC sampling process is run for m steps for the L2KB model; the top k states are used as input for the second inference model, called QC, which assigns meanings to closed class words to yield a full-fledged semantic representation of the question. Both inference strategies generate successor states by exploring edges in the dependency parse tree. We explore only the following types of edges defined by Universal Dependencies: core arguments, non-core dependents, and nominal dependents, and only nodes that have the following POS tags: NOUN, VERB, ADJ, PRON, PROPN, DET. In both inference models, we alternate across iterations between using the probability of the state given the model and the objective score to decide which state to accept. Initially, all partial assignments \(\alpha _0, \beta _0, \gamma _0\) are empty.
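The overall inference loop can be sketched as follows (a rough simplification; `propose` and `score` are placeholders for the proposal generation and for either the model or the objective score):

import math
import random

def mcmc_inference(initial_state, propose, score, steps, top_k):
    """Iterative local search over question interpretations (a rough sketch).
    `propose` generates successor states by changing variable assignments along
    dependency edges; `score` is either the model score or the objective score,
    depending on the iteration."""
    beam = [initial_state]
    for _ in range(steps):
        successors = []
        for state in beam:
            candidates = propose(state) or [state]
            # accept a candidate with probability proportional to exp(score)
            weights = [math.exp(score(c)) for c in candidates]
            successors.append(random.choices(candidates, weights=weights, k=1)[0])
        # keep the top-k states as input to the next sampling step
        beam = sorted(successors + beam, key=score, reverse=True)[:top_k]
    return beam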

We rely on an inverted index to find all KB IDs for a given query term. The inverted index maps terms to candidate KB IDs for all three languages. It has been created from a number of resources: names of DBpedia resources, Wikipedia anchor texts and links, names of DBpedia classes, synonyms for DBpedia classes from WordNet [16, 26], as well as lexicalizations of properties and restriction classes from DBlexipedia [40]. Entries in the index are grouped by DUDES type, so that it supports type-specific retrieval. The index stores the frequency with which each mention is paired with a KB ID. During retrieval, the index returns a normalized frequency score for each candidate KB ID.
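A simplified view of such an index (a minimal sketch; the example entries and frequencies are illustrative, real entries come from DBpedia labels, Wikipedia anchors, WordNet, and DBlexipedia):

from collections import defaultdict

class InvertedIndex:
    """Maps (term, DUDES type) to candidate KB IDs with normalized frequency scores."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def add(self, term: str, dudes_type: str, kb_id: str, freq: int = 1):
        self.counts[(term.lower(), dudes_type)][kb_id] += freq

    def candidates(self, term: str, dudes_type: str):
        entry = self.counts.get((term.lower(), dudes_type), {})
        total = sum(entry.values()) or 1
        # return candidates with frequency scores normalized per term
        return sorted(((kb_id, f / total) for kb_id, f in entry.items()),
                      key=lambda x: x[1], reverse=True)

index = InvertedIndex()
index.add("created", "Property", "dbo:author", freq=3)
index.add("created", "Property", "dbo:creator", freq=7)
print(index.candidates("created", "Property"))  # dbo:creator first, then dbo:author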

L2KB: Linking to Knowledge Base. Proposal Generation: The L2KB proposal generation proposes changes to a given state by considering single dependency edges and changing: (i) the KB IDs of parent and child nodes, (ii) the DUDES types of parent and child nodes, and (iii) the argument index attached to the edge. The semantic type variables range over the 5 basic DUDES types defined above, while the argument index variable ranges over the set {1, 2}. The resulting partial semantic representation for the dependency edge is checked for satisfiability with respect to the knowledge base, and the proposal is pruned if it is not satisfiable. Figure 3 depicts the local exploration of the dobj edge between Wikipedia and created. The left image shows an initial state with empty assignments for all hidden variables. The right image shows a proposal that has changed the KB IDs and DUDES types of the nodes connected by the dobj edge. The inference process has assigned the KB ID dbo:author and the Property DUDES type to the created node. The Wikipedia node gets assigned the type Resource DUDES as well as the KB ID dbr:Wikipedia. The dependency edge gets assigned the argument index 1, representing that dbr:Wikipedia should be inserted at the subject position of the dbo:author property. The partial semantic representation represented by this edge is the one depicted at the end of Sect. 2.1. As it is satisfiable, it is not pruned. In contrast, a state in which the edge is assigned the argument index 2 would yield a non-satisfiable representation, corresponding to things that were authored by Wikipedia instead of things that authored Wikipedia.

Fig. 3. Left: initial state based on the dependency parse, where each node has an empty KB ID and Semantic Type. Right: proposal generated by the L2KB proposal generation for the question Who created Wikipedia?
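The satisfiability check used to prune proposals can be realized with a SPARQL ASK query against the knowledge base; a minimal sketch, assuming the public DBpedia endpoint and the requests library (error handling omitted):

import requests

DBPEDIA = "https://dbpedia.org/sparql"

def satisfiable(subject: str, prop: str, obj: str) -> bool:
    """Check whether a partial triple pattern has at least one solution in the KB.
    Unfilled argument slots are passed as SPARQL variables, e.g. '?x'."""
    query = ("PREFIX dbo: <http://dbpedia.org/ontology/> "
             "PREFIX dbr: <http://dbpedia.org/resource/> "
             f"ASK WHERE {{ {subject} {prop} {obj} . }}")
    resp = requests.get(DBPEDIA, params={"query": query,
                                         "format": "application/sparql-results+json"})
    return resp.json().get("boolean", False)

# Argument index 1: Wikipedia in subject position -- satisfiable, kept.
print(satisfiable("dbr:Wikipedia", "dbo:author", "?x"))
# Argument index 2: Wikipedia in object position -- not satisfiable, pruned.
print(satisfiable("?x", "dbo:author", "dbr:Wikipedia"))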

Objective Function: As objective for the L2KB model we rely on a linking objective that calculates the overlap between inferred entity links and entity links in the gold standard SPARQL query.

All generated states are ranked by the objective score. Top-k states are passed to the next sampling step. In the next iteration, the inference is performed on these k states. Following this procedure for m iterations yields a sequence of states \((s_0,\dots ,s_m)\) that are sampled from the distribution defined by the underlying factor graphs.
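A minimal sketch of such a linking objective (an F-measure over the overlap between inferred and gold KB identifiers; illustrative only):

def linking_objective(inferred_ids, gold_ids):
    """F-measure of the overlap between inferred entity links and the KB
    elements occurring in the gold standard SPARQL query."""
    if not inferred_ids or not gold_ids:
        return 0.0
    overlap = len(set(inferred_ids) & set(gold_ids))
    if overlap == 0:
        return 0.0
    precision = overlap / len(set(inferred_ids))
    recall = overlap / len(set(gold_ids))
    return 2 * precision * recall / (precision + recall)

print(linking_objective({"dbr:Wikipedia", "dbo:author"},
                        {"dbr:Wikipedia", "dbo:author"}))  # 1.0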

QC: Query Construction. Proposal Generation: Proposals in this inference layer consist of assignments of the QueryVar DUDES type to nodes for closed class words, in particular question pronouns and determiners, that could fill an argument position of a parent with unsatisfied arguments.

Objective Function: As objective we use an objective function that measures the (graph) similarity between the inferred SPARQL query and the gold standard SPARQL query.

Figure 4 shows an input state and a sampled state for the QC inference layer of our example query Who created Wikipedia?. The initial state (left) has slot 1 assigned to the dobj edge. Property DUDES have two slots by definition. The right figure shows a proposed state in which argument slot 2 has been assigned to the nsubj edge and the QueryVar DUDES type has been assigned to the node Who. This corresponds to the complete semantic representation and the resulting SPARQL query, which is the one shown in the introduction.

Fig. 4. Left: input state; Right: proposal generated by the QC proposal generation for the question Who created Wikipedia?

2.4 Features

As features for the factors, we use conjunctions of the following information: (i) lemma of parent and child nodes, (ii) KB IDs of parent and child nodes, (iii) POS tags of parent and child nodes, (iv) DUDES type of parent and child, (v) index of the argument at the edge, (vi) dependency relation of the edge, (vii) normalized frequency score for retrieved KB IDs, (viii) string similarity between the KB ID and the lemma of the node, (ix) rdfs:domain and rdfs:range restrictions for the parent KB ID (in case it is a property).
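A sketch of how such conjunctive features could be generated for a single edge factor (the feature names and node/edge attributes are illustrative, not the system's actual feature templates):

def edge_features(parent: dict, child: dict, edge: dict) -> dict:
    """Binary and real-valued features for one dependency edge factor.
    `parent`/`child` carry lemma, pos, kb_id, dudes_type; `edge` carries the
    dependency relation and the assigned argument index."""
    feats = {
        f"lemma={parent['lemma']}|{child['lemma']}": 1.0,
        f"kb={parent.get('kb_id')}|{child.get('kb_id')}": 1.0,
        f"pos={parent['pos']}|{child['pos']}": 1.0,
        f"dudes={parent.get('dudes_type')}|{child.get('dudes_type')}": 1.0,
        f"arg={edge['arg_index']}|dep={edge['relation']}": 1.0,
    }
    if "freq_score" in child:                 # normalized index frequency
        feats["kb_freq"] = child["freq_score"]
    if "string_sim" in child:                 # similarity between KB ID and lemma
        feats["kb_string_sim"] = child["string_sim"]
    return feats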

2.5 Learning Model Parameters

In order to optimize parameters \(\theta \), we use an implementation of the SampleRank [41] algorithm. The SampleRank algorithm obtains gradients for these parameters from pairs of consecutive states in the chain based on a preference function \(\mathbb {P}\) defined in terms of the objective function \(\mathbb {O}\) as follows:

$$\begin{aligned} \mathbb {P}(s', s)= {\left\{ \begin{array}{ll} 1,&{} \text {if } \mathbb {O}(s') > \mathbb {O}(s)\\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)

We have observed that accepting proposals only on the basis of the model score requires a large number of inference steps. This is due to the fact that the exploration space is huge, considering all the candidate resources, predicates, and classes in DBpedia. To guide the search towards good solutions, we switch between the model score and the objective score to compute the likelihood of accepting a proposal. Once the training procedure switches the scoring function for the next sampling step, the model uses the parameters from the previous step to score the states.
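One SampleRank update step can be sketched as follows (a simplification of [41]; `features` is assumed to return the summed feature vector of a state under the current templates):

def samplerank_update(theta, s_new, s_old, features, objective, model_score, lr=0.01):
    """Move theta towards the state preferred by the objective whenever the
    model ranks two consecutive states differently than the objective does."""
    if objective(s_new) == objective(s_old):
        return theta                                   # no preference, no update
    preferred, other = (s_new, s_old) if objective(s_new) > objective(s_old) else (s_old, s_new)
    if model_score(preferred) <= model_score(other):   # model disagrees with the objective
        f_pref, f_other = features(preferred), features(other)
        for k in set(f_pref) | set(f_other):
            theta[k] = theta.get(k, 0.0) + lr * (f_pref.get(k, 0.0) - f_other.get(k, 0.0))
    return theta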

2.6 Addressing the Lexical Gap

A key component in the proposed question answering pipeline is the L2KB layer, which is responsible for proposing possible KB identifiers for parts of the question. Consider the question Who is the writer of The Hunger Games?. It may seem trivial to link the query word writer to the appropriate identifier dbo:author; however, it still requires prior knowledge about the semantics of the query word and the KB entry (e.g. that the writer of a book is its author).

To address the lexical gap, we rely on lexicalizations of DBpedia properties as extracted by M-ATOLL [39, 40] for multiple languages. For Spanish and German in particular, however, M-ATOLL produces very sparse results. We therefore propose two complementary solutions: using machine translation to translate English labels into other languages, and using word embeddings to retrieve candidate properties for a given mention text.

Machine Translations. We rely on the online dictionary Dict.cc as our translation engine. We query the web service for each available English label and target language and store the obtained translation candidates as new labels for the respective entity and language. While these translations are prone to be noisy without proper context, they provide a reasonable starting point for generating candidate lexicalizations, especially in combination with the word embedding approach.

Word Embedding Retrieval. Many word embedding methods such as the skip-gram method [25] have been shown to encode useful semantic and syntactic properties. The objective of the skip-gram method is to learn word representations that are useful for predicting context words. As a result, the learned embeddings often display a desirable linear structure that can be exploited using simple vector addition. Motivated by the compositionality of word vectors, we propose a measure of semantic relatedness between a mention m and a DBpedia entry e using the cosine similarity between their respective vector representations \(\varvec{v}_{m}\) and \(\varvec{v}_{e}\). For this we follow the approach in [5] to derive entity embedding vectors from word vectors: we define the vector of a mention m as the sum of the vectors of its tokens, \(\varvec{v}_{m} = \sum _{t \in m} \varvec{v}_{t}\), where the \(\varvec{v}_{t}\) are raw vectors from the set of pretrained skip-gram vectors. Similarly, we derive the vector representation of a DBpedia entry e by adding the individual word vectors for the label \(l_e\) of e, thus \(\varvec{v}_{e} = \sum _{t \in l_e} \varvec{v}_{t}\).

As an example, the vector for the mention text movie director is composed as \(\varvec{v}_{movie\text { }director}\) = \(\varvec{v}_{movie} + \varvec{v}_{director}\). The DBpedia entry dbo:director has the label film director and is thus composed of \(\varvec{v}_{dbo:director}=\varvec{v}_{film} + \varvec{v}_{director}\).

To generate potential linking candidates given a mention text, we can compute the cosine similarity between \(\varvec{v}_{m}\) and each possible \(\varvec{v}_{e}\) as a measure of semantic relatedness and thus produce a ranking of all candidate entries. By pruning the ranking at a chosen threshold, we can control the produced candidate list for precision and recall.
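A minimal sketch of this additive embedding retrieval with numpy (the vectors, labels, and threshold are illustrative; pretrained skip-gram vectors are assumed):

import numpy as np

def phrase_vector(text: str, vectors: dict) -> np.ndarray:
    """Additive composition: sum the skip-gram vectors of all known tokens."""
    dim = len(next(iter(vectors.values())))
    return sum((vectors[t] for t in text.lower().split() if t in vectors),
               np.zeros(dim))

def rank_candidates(mention: str, entries: dict, vectors: dict, threshold: float = 0.7):
    """Rank KB entries by cosine similarity between mention and label vectors."""
    v_m = phrase_vector(mention, vectors)
    scored = []
    for kb_id, label in entries.items():
        v_e = phrase_vector(label, vectors)
        denom = np.linalg.norm(v_m) * np.linalg.norm(v_e)
        if denom > 0:
            cos = float(v_m @ v_e / denom)
            if cos >= threshold:
                scored.append((kb_id, cos))
    return sorted(scored, key=lambda x: x[1], reverse=True)

# e.g. rank_candidates("movie director", {"dbo:director": "film director"}, word_vectors)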

For this work, we trained three instances of the skip-gram model, each with 100 dimensions, on the English, German, and Spanish Wikipedia, respectively. Following this approach, the top-ranking DBpedia entries for the mention text total population are listed below:

Mention            DBpedia entry                     Cos. similarity
Total population   dbo:populationTotal               1.0
                   dbo:totalPopulation               1.0
                   dbo:agglomerationPopulationTotal  0.984
                   dbo:populationTotalRanking        0.983
                   dbo:PopulatedPlace/areaTotal      0.979

A more detailed evaluation is conducted in Sect. 3 where we investigate the candidate retrieval in comparison to an M-ATOLL baseline.

3 Experiments and Evaluation

We present experiments carried out on the QALD-6 dataset comprising English, German, and Spanish questions. We train and test on the multilingual subtask, which provides 350 training and 100 test instances. We train the model on the 350 training instances for each language from the QALD-6 training dataset, performing 10 iterations over the dataset with the learning rate set to 0.01 to optimize the parameters. We set k to 10. Before running a question through the pipeline, we perform a preprocessing step on its dependency parse tree that merges nodes connected by compound edges (see the sketch below). This results in a single node for compound names and reduces the traversal time and complexity for the model. The approach is evaluated on two tasks: a linking task and a question answering task. The linking task is evaluated by comparing the proposed KB links to the KB elements contained in the gold standard SPARQL query in terms of F-measure. The question answering task is evaluated by executing the constructed SPARQL query over the DBpedia KB and comparing the retrieved answers with those retrieved by the gold standard SPARQL query, again in terms of F-measure.
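A minimal sketch of the compound-merging preprocessing step (the token and edge representation is illustrative, not the system's actual data structures):

def merge_compounds(tokens, edges):
    """Merge tokens connected by a 'compound' dependency into their head token,
    so that multiword names such as 'Boston Tea Party' become a single node.
    `tokens` is a list of dicts with 'id' and 'form'; `edges` is a list of
    (head_id, child_id, relation) triples."""
    children = {}                       # head_id -> compound child ids
    for head, child, rel in edges:
        if rel == "compound":
            children.setdefault(head, []).append(child)
    tok_by_id = {t["id"]: t for t in tokens}
    merged = set()
    for head, childs in children.items():
        parts = [tok_by_id[c]["form"] for c in sorted(childs)] + [tok_by_id[head]["form"]]
        tok_by_id[head]["form"] = " ".join(parts)
        merged.update(childs)
    new_tokens = [t for t in tokens if t["id"] not in merged]
    new_edges = [(h, c, r) for (h, c, r) in edges
                 if r != "compound" and c not in merged]
    return new_tokens, new_edges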

Before evaluating the full pipeline on the QA task, we evaluate the impact of using different lexical resources, including word embeddings, to infer unknown lexical relations.

3.1 Evaluating the Lexicon Generation

We evaluate the proposed lexicon generation methods, machine translation and embeddings, with respect to a lexicon of manual annotations obtained from the training set of the QALD-6 dataset. The manual lexicon is a mapping from mentions to expected KB entries, derived from the (question, query) pairs in the QALD-6 dataset. Since M-ATOLL only provides DBpedia ontology properties, we restrict our word embedding approach to also only produce this subset of KB entities. Analogously, the manual lexicon is filtered such that it only contains word-property entries for DBpedia ontology properties to prevent unnecessary distortion of the evaluation results due to unsolvable query terms.

The evaluation is carried out with respect to the number of generated candidates per query term using the Recall@k measure. Recall is a reasonable evaluation metric here since the manual lexicon is far from exhaustive and only reflects a small subset of possible lexicalizations of KB properties in natural language questions. Furthermore, the L2KB component is responsible for producing a set of linked candidate states which act as starting points for the second layer of inference, the QC layer. Providing a component with high recall in this step of the pipeline is therefore crucial for the query construction component.
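Recall@k here simply measures how often the expected KB entry from the manual lexicon appears among the top k generated candidates (illustrative sketch):

def recall_at_k(lexicon: dict, candidate_lists: dict, k: int) -> float:
    """`lexicon` maps a mention to its expected KB entry; `candidate_lists`
    maps a mention to its ranked list of generated candidates."""
    hits = sum(1 for mention, gold in lexicon.items()
               if gold in candidate_lists.get(mention, [])[:k])
    return hits / len(lexicon) if lexicon else 0.0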

Figure 5 visualizes the retrieval performance using the Recall@k metric. We can see a large increase in recall across languages when generating candidates using the word embedding method. Combining the M-ATOLL candidates with the word embedding candidates yields the strongest recall performance. The largest absolute increase is observed for German.

Fig. 5. Retrieval performance with respect to the manual lexicon.

3.2 Evaluating Question Answering

In order to contextualise our results, we provide an upper bound for our approach, which consists of running over all test instances for one epoch and accepting states according to the objective score only, thus yielding an oracle-like approach. We report Macro F-measures for this oracle in Table 1, together with the actual results on the test set when optimizing parameters on the training data. We evaluate different configurations of our system in which we consider (i) a name dictionary derived only from DBpedia labels (DBP), (ii) additional dictionary entries derived from DBlexipedia (DBLex), (iii) a manually created dictionary (Dict), and (iv) entries inferred using cosine similarity in embedding space (Embed). It is important to note that even the oracle does not achieve perfect results, since the lexical gap still persists and some entries cannot be mapped to the correct KB IDs. Further, errors in POS tagging or in the dependency tree prevent the inference strategy from generating the correct proposals.

We see that in all configurations, results clearly improve when using additional entries from DBlexipedia (DBLex) in comparison to only using labels from DBpedia. The results further increase by adding lexical entries inferred via similarity in embedding space (+Embed), but are still far from the results obtained with the manually created dictionary (Dict), showing that addressing the lexical gap remains an important issue for increasing the performance of question answering systems over linked data.

On the linking task, while the use of embeddings increases performance as seen in the DBP + DBLex + Embed vs. DBP + DBLex condition, there is still a clear margin to the DBP + DBLex + Dict condition (English 0.16 vs. 0.22, German 0.10 vs. 0.27, Spanish 0.04 vs. 0.30).

On the QA task, adding embeddings on top of DBP + DBLex also has a positive impact, but results are still lower than in the DBP + DBLex + Dict condition (English 0.26 vs. 0.34, German 0.16 vs. 0.37, Spanish 0.20 vs. 0.42). Clearly, one can observe that the difference between the learned model and the oracle diminishes the more lexical knowledge is added to the system.

Table 1. Macro F1-scores on test data for the linking and question answering tasks using different configurations

3.3 Error Analysis

An error analysis revealed the following four common errors that prevented the system from finding the correct interpretation: (i) wrong resource (30% of test questions), as in When did the Boston Tea Party take place?, where Boston Tea Party is not mapped to any resource, (ii) wrong property (48%), as in the question Who wrote the song Hotel California?, where our system infers the property dbpedia:musicalArtist for song instead of the property dbpedia:writer, (iii) wrong slot (10%), as in How many people live in Poland?, where Poland is inferred to fill the 2nd slot instead of the 1st slot of dbpedia:populationTotal, and (iv) incorrect query type (12%), as in Where does Piccadilly start?, where our approach wrongly infers that this is an ASK query.

4 Related Work

There is a substantial body of work on semantic parsing for question answering. Earlier work addressed the problem using statistical machine translation methods [42] or by inducing synchronous grammars [43]. Recent work has framed the task as one of inducing statistical lexicalized grammars; most of this work has relied on CCG as grammar theory and on lambda calculus for semantic representation and composition [2,3,4, 18, 20,21,22, 35, 46]. In contrast to the above work, we assume that a syntactic analysis of the input in the form of a dependency tree is available, and we learn a model that assigns semantic representations to each node in the tree. Most earlier work in semantic parsing has concentrated on very specific domains with a very restricted semantic vocabulary. More recently, a number of researchers have taken up this challenge and focused on open-domain QA datasets such as WebQuestions, which relies on Freebase [6,7,8, 30,31,32, 34, 44, 45].

Our approach bears some relation to the work of Reddy et al. [31] in the sense that both start from a dependency tree (an ungrounded graph in their terminology) with the goal of grounding the ungrounded relations in a KB. We use a different learning approach and model as well as a different semantic representation formalism (DUDES vs. lambda expressions). More recently, Reddy et al. [32] have extended their method to produce general logical forms relying on Universal Dependencies, independently of the downstream application, i.e. question answering. They evaluate their approach on both WebQuestions and GraphQuestions. While the datasets they use have thousands of training examples, we have shown that we can train a model using only 350 questions as training data.

The work of Freitas et al. [12] employs a distributional structured vector space, the \(\tau \)-Space, to bridge the lexical gap between queries and the KB, mapping query terms to corresponding properties and classes in the underlying KB. Further, Freitas et al. [11] studied different distributional semantic models in combination with machine translation. Their findings suggest that combining machine translation with a Word2Vec approach achieves the best performance for measuring semantic relatedness across multiple languages.

Lukovnikov et al. [23] have proposed an end-to-end neural model for QALD. The approach works well for answering simple questions and has been trained on a dataset with 100,000 training instances. In contrast, the QALD-6 benchmark has far less data (350 instances) and includes more difficult questions requiring aggregation and comparison. Neelakantan et al. [27] have proposed a neural approach that achieves results comparable to state-of-the-art non-neural semantic parsers on the WikiTableQuestions dataset [29], which includes questions with aggregation.

The best performing system on the QALD-6 benchmark [36] was that of [24], achieving an F-measure of 89%. However, that approach relies on a controlled natural language approach in which queries were manually reformulated so that the system could parse them. The only other system that, like ours, operates on three languages is the UTQA system [38]. UTQA achieves much higher results than our system, reaching F-measures of 75% (EN), 68% (ES) and 61% (Persian). The approach relies on a pipeline of several classifiers performing keyword extraction, relation and entity linking, as well as answer-type detection; all of these steps are performed jointly in our model.

Höffner et al. [14] recently surveyed published approaches on QALD benchmarks, analysed their differences and identified seven challenges. Our approach addresses four of these seven challenges: multilingualism, ambiguity, the lexical gap, and templates. Our probabilistic model performs implicit disambiguation and carries out semantic interpretation via traditional bottom-up semantic composition with a state-of-the-art semantic representation formalism, and thus does not rely on any fixed templates. We have shown how to narrow the lexical gap by inducing lexical relations between surface mentions and entities in the knowledge base using a representation learning approach. Multilinguality is addressed by building on universal dependencies and by a methodology that allows models to be trained for different languages.

5 Conclusion

We have presented a multilingual factor graph model that maps natural language input into logical form, relying on DUDES as semantic formalism. Given dependency-parsed input, our model infers both a semantic type and a KB entity for each node in the dependency tree and computes an overall logical form by bottom-up semantic composition. We have applied our approach to the task of question answering over linked data using the QALD-6 dataset. We show that our model can learn to map questions into SPARQL queries by training on only 350 instances. We have shown that our approach works for multiple languages, English, German and Spanish in particular. We have also shown how the lexical gap can be mitigated by using word embeddings, increasing performance beyond the use of explicit lexica produced by lexicon induction approaches such as M-ATOLL. As future work, we will extend our approach to handle questions with other filtering operations. We will also make our system available on GERBIL [37] to support direct comparison with other systems.