In this work, we propose a three-step pipeline, as illustrated in Fig. 1, to address the simple question answering task:
- Step 1 (Entity detection): The objective is to tag each individual word in the question as either entity or non-entity. We apply standard Recurrent Neural Networks (RNNs) and Conditional Random Fields (CRFs) to identify entities in the question.
- Step 2 (Relation prediction): We classify the question as one of the relation types in the KB by applying standard RNNs, Convolutional Neural Networks (CNNs), and logistic regression.
- Step 3 (Entity linking): We link the identified entities to their corresponding nodes in the KB. We construct an inverted index using Python dictionaries to generate a candidate list of triples/facts with their respective scores, and finally rely on the relation from the classifier to filter the candidate list and retain the best candidate triple, whose object entity is the answer.
Our initial goal is to generate a structured query Q (subject, relation) from the natural language question q that accurately interprets the question. To generate the structured query, our approach makes two assumptions: first, we assume first-order questions that can be answered by retrieving a single fact from the KB, and second, we assume that the source/subject entity is mentioned in the question.
Step 1: Entity Detection
The goal of Step 1 is to tag each individual word in the question as either entity or non-entity. In the SimpleQuestions data set, each question is associated with a triple (subject, predicate, object) from a Freebase subset that answers the question. The subject is given as an MID; we use the names file [9], which consists of entity MIDs and their corresponding entity names, to check entity mentions in the question and annotate tokens as either entity or non-entity. In some cases, there were no exact matches, which introduced some noise. We apply fuzzy matching to project the entity mention to the n-gram sequence with the smallest edit distance to the entity name. Edit distance [27] is a common string matching metric for measuring the difference between two sequences. Informally, the edit distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. The edit distance between two strings d, e of length |d| and |e|, respectively, is given by \(D_{d,e}(|d|,|e|)\):
$$\begin{aligned} D_{d,e}(i,j)= \min {\left\{ \begin{array}{l} D_{d,e}(i-1, j)+1 \\ D_{d,e}(i, j-1)+1 \\ D_{d,e}(i-1,j-1)+1_{({d_{i} \ne e_{j}})}, \end{array}\right. } \end{aligned}$$
(1)
where \(1_{({d_{i} \ne e_{j}})}\) is the indicator function equal to 0 when \({d_{i} = e_{j}}\) and to 1 otherwise, and \(D_{d,e}(i,j)\) is the distance between the first i characters of d and the first j characters of e; i and j are 1-based indices. The first element in the minimum corresponds to deletion (from d to e), the second to insertion, and the third to match or mismatch, depending on whether the respective symbols are the same.
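For concreteness, the following minimal Python sketch computes this recurrence with standard dynamic programming; the function name and the example strings are illustrative only, and any off-the-shelf edit-distance implementation could be substituted.

```python
def edit_distance(d: str, e: str) -> int:
    """Dynamic-programming edit distance D_{d,e}(|d|, |e|) following Eq. (1)."""
    m, n = len(d), len(e)
    # D[i][j] is the distance between the first i characters of d and the first j of e.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                                    # i deletions
    for j in range(n + 1):
        D[0][j] = j                                    # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if d[i - 1] == e[j - 1] else 1    # indicator 1_{d_i != e_j}
            D[i][j] = min(D[i - 1][j] + 1,             # deletion
                          D[i][j - 1] + 1,             # insertion
                          D[i - 1][j - 1] + cost)      # match / substitution
    return D[m][n]

# Illustrative example: edit_distance("barack obama", "barak obama") == 1
```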
We identify the entity being queried by formulating entity detection as a sequence labeling problem, as shown in Fig. 2, where each word or token is tagged as entity or non-entity (I: entity, 0: non-entity). To this end, we apply both recurrent neural networks and conditional random fields.
Recurrent Neural Network (RNN) Recurrent neural networks [12] are capable of processing arbitrary sequential inputs by applying an activation function to the hidden vector recursively. In Fig. 2, we represent each question word/token with a word embedding, and this representation is then combined with the hidden layer representation from the previous time step using either BiLSTMs (Bidirectional Long Short-Term Memory) [17] or BiGRUs (Bidirectional Gated Recurrent Units) [8].
The key idea of LSTMs is the cell state. As we read the sentence from left to right, the LSTM maintains a memory variable \(\mathbf{C}^{\langle t \rangle }\), called the memory cell at time step \(\mathbf{t}\), so that when the network gets further into the sentence, it can still remember information seen earlier [16]. LSTMs have the ability to control the hidden state updates and outputs using gates. They consist of three gating functions: the update gate \({\varGamma }_\mathrm{u}\), used to control the update at each time step; the forget gate \({\varGamma }_\mathrm{f}\), which decides how much information from the previous cell state to keep or throw away; and the output gate \({\varGamma _\mathrm{o}}\), which regulates the flow of information in and out of the cell.
The current memory cell \(\mathbf{C}^{\langle t \rangle }\) is computed by interpolating the candidate state \({\hat{\mathbf{C}}}^{\langle t \rangle }\) and the previous cell state \(\mathbf{C}^{\langle t-1 \rangle }\). The equations that govern the LSTM behavior are defined as:
$$\begin{aligned} {\hat{\mathbf{C}}}^{\langle t \rangle }&= \tanh (W_{c}[{\mathbf{a}}^{\langle t-1 \rangle }, X^{\langle t \rangle }] + b_{c}) \nonumber \\ \varGamma _\mathrm{u}&= \sigma (W_\mathrm{u}[{\mathbf{a}}^{\langle t-1 \rangle }, X^{\langle t \rangle }] + b_\mathrm{u}) \nonumber \\ \varGamma _\mathrm{f}&= \sigma (W_\mathrm{f}[{\mathbf{a}}^{\langle t-1 \rangle }, X^{\langle t \rangle }] + b_\mathrm{f}) \nonumber \\ \varGamma _\mathrm{o}&= \sigma (W_\mathrm{o}[{\mathbf{a}}^{\langle t-1 \rangle }, X^{\langle t \rangle }] + b_\mathrm{o}) \nonumber \\ \mathbf{C}^{\langle t \rangle }&= \varGamma _\mathrm{u} \odot {\hat{\mathbf{C}}}^{\langle t \rangle } + \varGamma _\mathrm{f} \odot \mathbf{C}^{\langle t-1 \rangle } \nonumber \\ \mathbf{a}^{\langle t \rangle }&= \varGamma _\mathrm{o} \odot \mathbf{C}^{\langle t \rangle }, \end{aligned}$$
(2)
where \({{\hat{\mathbf{C}}}^{\langle t \rangle }}\) is the candidate cell state at time \(\mathbf{t}\), \(\mathbf{a}^{\langle t-1 \rangle }\) is the activation at the previous time step, \(X^{\langle t \rangle }\) is the current input, and W and b are parameter matrices and bias terms, respectively. \({\mathbf{C}^{\langle t \rangle }}\) is the current internal cell state, with \(\odot {}\) the element-wise vector product. \(\sigma\) denotes the sigmoid function and tanh the hyperbolic tangent.
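The following minimal NumPy sketch performs a single step following Eq. (2); the parameter dictionary layout and the function name are our assumptions, and in practice these recurrences are computed by an optimized library implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, params):
    """One LSTM step following Eq. (2); params holds W_c, W_u, W_f, W_o and biases."""
    concat = np.concatenate([a_prev, x_t])                    # [a^{<t-1>}, X^{<t>}]
    c_hat = np.tanh(params["W_c"] @ concat + params["b_c"])   # candidate cell state
    g_u = sigmoid(params["W_u"] @ concat + params["b_u"])     # update gate
    g_f = sigmoid(params["W_f"] @ concat + params["b_f"])     # forget gate
    g_o = sigmoid(params["W_o"] @ concat + params["b_o"])     # output gate
    c_t = g_u * c_hat + g_f * c_prev                          # new cell state
    a_t = g_o * c_t                                           # activation, as written in Eq. (2)
    return a_t, c_t
```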
The GRUs, on the other hand, consist of two gates: the update gate \({\varGamma _\mathrm{u}}\) and the reset gate \({\varGamma _\mathrm{r}}\). The update gate lets the model learn how much of the previous information should be carried forward. This mitigates the vanishing gradient problem, since the model does not have to remember all the information seen previously. The reset gate indicates how relevant the previous cell state is for computing the current candidate state. The GRU transition equations are defined as follows:
$$\begin{aligned} {\hat{\mathbf{C}}}^{\langle t \rangle }&= \tanh (W_{c}[\varGamma _\mathrm{r} \odot {\mathbf{C}}^{\langle t-1 \rangle }, X^{\langle t \rangle }] + b_{c}) \nonumber \\ \varGamma _\mathrm{u}&= \sigma (W_\mathrm{u}[{\mathbf{C}}^{\langle t-1 \rangle }, X^{\langle t \rangle }] + b_\mathrm{u}) \nonumber \\ \varGamma _\mathrm{r}&= \sigma (W_\mathrm{r}[{\mathbf{C}}^{\langle t-1 \rangle }, X^{\langle t \rangle }] + b_\mathrm{r}) \nonumber \\ \mathbf{C}^{\langle t \rangle }&= \varGamma _\mathrm{u} \odot {\hat{\mathbf{C}}}^{\langle t \rangle } + (1- \varGamma _\mathrm{u}) \odot \mathbf{C}^{\langle t-1 \rangle }, \end{aligned}$$
(3)
where \({\mathbf{C}}^{\langle t-1 \rangle }\) is the previous cell state and \(\sigma\) denotes the sigmoid activation function \(\sigma (x) = \frac{1}{1+e^{-x}}\), applied element-wise to the vector.
The BiLSTMs and BiGRUs apply a non-linear transformation to compute the hidden layer representation at the current time step. The final hidden representation at the current time step is then projected to the output dimensional space and normalized into a probability distribution via a softmax layer, as shown in Fig. 2.
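A minimal PyTorch sketch of such a tagger is shown below; the class name, embedding size, and hidden size are illustrative choices rather than our exact experimental settings, and replacing nn.LSTM with nn.GRU yields the BiGRU variant.

```python
import torch.nn as nn

class EntityTagger(nn.Module):
    """Minimal BiLSTM sequence tagger for entity detection (cf. Fig. 2)."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=128, num_tags=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)      # project to tag space (I / 0)

    def forward(self, token_ids):                            # (batch, seq_len)
        h, _ = self.rnn(self.embed(token_ids))               # (batch, seq_len, 2 * hidden_dim)
        logits = self.out(h)                                 # one score per tag per token
        return logits.log_softmax(dim=-1)                    # normalized tag distribution
```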
Conditional Random Fields (CRFs) The other method that we use for entity detection is conditional random fields [25], so as to compare a non-neural method with the neural network models on the entity detection task. CRFs model the probability of a hidden state sequence given some observations. For example, given \({x} =(x_1,x_2, \ldots ,x_m)\) as the input sequence and \({s} =(s_1,s_2, \ldots ,s_m)\) as the output states or CRF tags, the conditional probability is \(p(s|x)\). We define \({\varPhi }(x,s) \in R^{d}\) as a feature map that maps x paired with s to a d-dimensional feature vector. The probability is then modeled as a log-linear model:
$$\begin{aligned} p(s|x,w) = \frac{\exp (w \cdot {\varPhi }(x,s))}{\sum _{s^{'}}{\exp (w \cdot {\varPhi }(x,s^{'}))}}, \end{aligned}$$
(4)
where w \(\in R^{d}\) is a parameter vector and \({s'}\) ranges over all possible output sequences. The parameter vector w can be estimated by assuming that we have a set of n labeled samples \(\{(x^{i}, s^{i})\}_{i=1}^{n}\). The regularized log likelihood is given by:
$$\begin{aligned} {\mathcal {L}}(w) = \sum _{i=1}^{n}\log {p}(s^{i}|x^{i},w)-\frac{\lambda _{2}}{2}||w||_{2}^{2} -\lambda _{1}||w||_{1}, \end{aligned}$$
(5)
where the terms \(\frac{\lambda _{2}}{2}||w||_{2}^{2}\) and \(\lambda _{1}||w||_{1}\) force w to be small in the respective norm. We estimate the parameter vector as \(w^* = \mathrm{argmax}_{w\in R^{d}}\,{\mathcal {L}}(w)\). Given \(w^*\), we can then find the most likely tag sequence \(s^*\) for a sentence x by \(s^* = \mathrm{argmax}_{s}\,p(s|x;w^*)\).
We use the Stanford Named Entity Recognizer (Stanford NER) [13] to train the CRF. This tool labels word sequences in the sentence into four classes: person, organization, location, and non-entity. It extracts features such as the current, previous, and next word, part-of-speech (POS) tag, character n-grams, etc., and trains a CRF model. In this work, the question is tagged into these four classes, and the first three (person, organization, and location) are treated as entity. In the experiment, we trained the Stanford NER on the training set and labeled the test set questions. After classifying every token as entity (denoted by I) or non-entity (denoted by 0), entity phrases are extracted by grouping consecutive entity words, i.e., given the question “where was barack obama born?”, the entity detection output is [0 0 I I 0], from which a single entity “barack obama” is extracted.
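The grouping step can be sketched as follows; the function name is ours, and the tags follow the I/0 convention above.

```python
def extract_entity_phrases(tokens, tags):
    """Group consecutive tokens tagged 'I' (entity) into entity phrases."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "I":
            current.append(token)            # extend the current entity span
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:                              # flush a span that ends the sentence
        phrases.append(" ".join(current))
    return phrases

# extract_entity_phrases(["where", "was", "barack", "obama", "born?"],
#                        ["0", "0", "I", "I", "0"])  ->  ["barack obama"]
```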
Step 2: Relation Prediction
In step 1, we obtain the subject entity from the question words. The goal of step 2 is to identify the relation being queried in the given natural language question in order to complete the structured query. In this step, the entire natural language question is classified as one of the knowledge base relation types. The Freebase knowledge base consists of 1837 unique relation types (possible labels), so we conduct a large-scale classification to assign a relation type to the question. To achieve this, we explore several methods, including RNNs, convolutional neural networks (CNNs), and logistic regression. We include logistic regression so that we can compare the performance of neural network and non-neural network methods on the relation prediction task.
Recurrent Neural Network (RNN) A model similar to the one used for entity detection is applied, with both BiLSTM and BiGRU variants. However, relation prediction is not a tagging task, since the classification is over the entire question. The classification decision is based on the output of the hidden layer at the last token, as shown in Fig. 3.
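A minimal PyTorch sketch of this idea, where only the final forward and backward hidden states feed the output layer; the class name and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RelationRNN(nn.Module):
    """Minimal BiGRU relation classifier: classify the question from the final hidden states."""

    def __init__(self, vocab_size, num_relations, emb_dim=300, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, num_relations)

    def forward(self, token_ids):                       # (batch, seq_len)
        _, h_n = self.rnn(self.embed(token_ids))        # h_n: (2, batch, hidden_dim)
        last = torch.cat([h_n[0], h_n[1]], dim=-1)      # forward and backward final states
        return self.out(last)                           # scores over all relation types
```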
Convolutional Neural Networks (CNNs) We also apply vanilla CNNs to extract local features by sliding filters over the word embeddings.
In Fig. 4, the sentence is represented by concatenating words (with padding where necessary) as \(x_{1:n} = x_{1}\oplus x_{2}\oplus \cdots \oplus x_{n}\), and convolutional filters are applied to the input matrix of word embeddings to generate new features from a window of words, \({c_{i} = f(W \cdot x_{i:i + h-1} + b)}\). The filter is applied to each possible window of words in the sentence to produce a feature map \({c = [c_{1}, c_{2}, \ldots , c_{n-h+1}]}\), and we apply max pooling over the feature map to take the maximum value as the feature corresponding to this particular filter. The idea is to capture the most important feature, which is the one with the highest value in each feature map. Finally, these features are passed to a fully connected softmax layer whose output is the probability distribution over labels. CNNs have been shown to perform well on sentence classification [20]. In our experiment, we use the architecture of [23], changed to a single static channel, for relation classification.
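A minimal PyTorch sketch of this single static channel variant is given below; the filter widths and number of filters are illustrative assumptions rather than the exact settings used in our experiments.

```python
import torch
import torch.nn as nn

class RelationCNN(nn.Module):
    """Minimal single static channel CNN for relation classification (in the spirit of [23])."""

    def __init__(self, vocab_size, num_relations, emb_dim=300,
                 num_filters=100, window_sizes=(2, 3, 4)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # one set of filters per window width h
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, h) for h in window_sizes])
        self.out = nn.Linear(num_filters * len(window_sizes), num_relations)

    def forward(self, token_ids):                          # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)          # (batch, emb_dim, seq_len)
        # one feature map per filter, max-pooled over time (the "most important feature")
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1))          # softmax applied in the loss
```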
Logistic Regression To compare the performance of neural network and non-neural network methods on the relation prediction task, we apply logistic regression. In our experiment, we consider two types of features extracted from the question:
- (1) Term Frequency-Inverse Document Frequency (tf-idf): To extract tf-idf question features, we use the existing scikit-learn classes CountVectorizer and TfidfTransformer (see the sketch after Table 1).
- (2) Question word embeddings and one-hot encoding of relation words: To obtain the question representation, both question and relation words are used. Table 1 shows sample questions and their respective relations. Question word embeddings are averaged, and out-of-vocabulary words are assigned a zero vector. Words in the relation class are split into individual tokens to build a vocabulary of relation words.
Table 1 Sample questions and their relations; questions and relations are split into individual words to come up with their respective representations, i.e., word embeddings from question words and a bag of words from relation words

In Fig. 5, the vector representations are concatenated to come up with the question features. This vector representation combines the strengths of word embeddings and one-hot encoding. Both the neural network and the non-neural network methods classify the entire question into one of the Freebase KB relation types.
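A minimal scikit-learn sketch of the tf-idf variant (1) chained with logistic regression; variable names are illustrative and hyperparameters are left at library defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# train_questions: list of question strings; train_relations: their relation-type labels
relation_clf = Pipeline([
    ("counts", CountVectorizer()),                    # token counts per question
    ("tfidf", TfidfTransformer()),                    # re-weight counts by inverse document frequency
    ("logreg", LogisticRegression(max_iter=1000)),    # classification over relation types
])

# relation_clf.fit(train_questions, train_relations)
# predicted = relation_clf.predict(["where was barack obama born?"])
```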
Finally, a structured query consisting of the entity from step 1 and the relation from step 2 is generated. Using the same example question “where was Barack Obama born?”, the relation prediction from step 2 is [people/person/place_of_birth]. If the identified entity from step 1 is “Barack Obama”, the structured query \(\{entity, relation\}\) becomes {“Barack Obama”, people/person/place_of_birth}.
Step 3: Entity Linking
From Step 1 and Step 2, we generate the structured query that accurately represents the natural language question. In step 3, the main objective is to identify the correct entity node in the KB. The entity words detected in step 1, which represent the candidate entity in the structured query, are linked to the actual entity node in the knowledge base. We build two indexes to link the extracted entity to the actual knowledge base: first, a names index that maps all entity MIDs in the Freebase subset to their names in the names file [9]; second, an inverted index that maps any entity n-gram to all matching nodes in the knowledge base.
Figure 6 shows the entity linking process; we iterate over entity word n-grams in decreasing order of n, and if we find candidate values, early termination is applied to stop searching for smaller values of n. If, for example, the query “Barack Obama” for \(n=2\) finds candidate entities, the search is terminated and only those entities are included. This helps prune entities that would have been retrieved for the queries “barack” and “obama” at \(n=1\). After coming up with all possible entities in Freebase, candidate entity nodes are retrieved from the index and appended to a list. These candidate entities are then scored using inverse document frequency (idf) and ranked in descending order.
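A minimal sketch of this candidate generation with plain Python dictionaries; the index layout (an n-gram mapped to (MID, idf score) pairs) and the function name are simplifying assumptions.

```python
def candidate_entities(entity_tokens, inverted_index):
    """Look up entity n-grams in decreasing order of n; stop at the first n with hits."""
    max_n = len(entity_tokens)
    for n in range(max_n, 0, -1):
        candidates = []
        for i in range(max_n - n + 1):
            ngram = " ".join(entity_tokens[i:i + n])
            # inverted_index maps an n-gram to (mid, idf_score) pairs in the Freebase subset
            candidates.extend(inverted_index.get(ngram, []))
        if candidates:                                        # early termination
            return sorted(candidates, key=lambda c: c[1], reverse=True)
    return []

# candidate_entities(["barack", "obama"], index) returns candidates for "barack obama"
# only, pruning those that match just "barack" or just "obama".
```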
Once we have a list of candidate entities, each candidate node is used as a starting point to reach candidate answers. We limit our search to a single hop and retrieve all nodes that are reachable from the candidate node where the relation path is consistent with the predicted relation (in the structured query). We look at the relation types and remove all candidate entity nodes whose relation type differs from the one generated in step 2. Only candidate entity nodes with a relation type matching the one generated in step 2 are kept; from the remaining candidate list, the highest-scoring entity node is selected, and its object entity node is the answer to the question.
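A minimal sketch of this filtering step, assuming the one-hop neighborhood of each candidate subject is stored as a dictionary from subject MID to its {relation: object} edges; the data structure and names are ours.

```python
def answer_question(ranked_candidates, predicted_relation, kb_facts):
    """Keep candidates whose outgoing relation matches the prediction; return the best answer."""
    for mid, score in ranked_candidates:                  # already sorted by idf score
        # kb_facts maps a subject MID to its one-hop {relation: object} edges
        objects = kb_facts.get(mid, {}).get(predicted_relation)
        if objects:
            return objects                                # object node(s) answer the question
    return None                                           # no candidate survives the relation filter
```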
Identification of Ambiguity in the Data
One of the contributions of this work is to show that there exist multiple answers to a question that are not easy to disambiguate, which limits the performance of the question answering system. This is a common challenge in free open data sets, and such ambiguities most likely arise from the annotation process. In the SimpleQuestions data set, annotators were asked to write a natural language question for a corresponding triple [7]. In such a case, where one is only given a triple, it is difficult to anticipate possible ambiguities in the KB. We identify such ambiguities in the data.
Given a natural language question q with the corresponding triple (s, p, o), where s, p, and o are the subject, relation, and object, respectively, we aim at determining the set of all possible (s, p) pairs that accurately interpret the question q. The first step is to determine the string alias a by matching the phrase in q (i.e., the entity in the structured query) with the subject alias of s in Freebase. Next, we find all other Freebase entity MIDs that share this alias and add them to a set S.
For example, given the question “what country was the film the debt from?”, Table 2 shows three examples of subject–relation pairs with equal linguistic evidence that cannot be easily disambiguated. After generating the set of entities S, the next step is to come up with a set of potential relations P, as shown in Table 2. For this, we abstract a from the question, i.e., “what country was the film \(\langle e \rangle\) from?”, to determine an abstract relation p. An accurate semantic interpretation of the question q is defined if there exists a subject–relation pair \({(s,p) \in \mathrm{KB}}\) where \({p \in P}\) and \({s \in S}\). In a case where multiple (s, p) pairs exist, as in Table 2, the question is not answerable. To answer such a question, we predict the most likely relation \(p_{\max } \in P\) so that the answer candidate is \({(s, p_{\max }) \in \mathrm{KB}}\). If there is more than one answer candidate, i.e., more than one subject matches \(p_{\max }\), we pick the subject \({s_{\max }}\) with the most facts of type \({p_{\max }}\).
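A minimal sketch of this ambiguity check, assuming a dictionary mapping aliases to the MIDs that share them and a dictionary mapping subject MIDs to their one-hop relations; the names and structures are ours.

```python
def find_ambiguous_pairs(alias, abstract_relations, alias_to_mids, kb_facts):
    """Collect all (subject, relation) pairs in the KB that interpret the question equally well."""
    S = alias_to_mids.get(alias, set())        # all MIDs sharing the alias (the set S)
    pairs = []
    for s in S:
        for p in abstract_relations:           # candidate relations P from the abstracted question
            if p in kb_facts.get(s, {}):       # (s, p) exists in the KB
                pairs.append((s, p))
    return pairs                               # more than one pair means the question is ambiguous
```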
Table 2 Examples of ambiguity in the data