1 Introduction

Question answering (QA) systems provide a convenient interface that enables users to communicate with the system through natural language questions. QA systems can be seen as advanced information retrieval systems, where (a) users are assumed to have no knowledge of the query language or the structure of the underlying information system; (b) the QA system provides a concise answer, as opposed to search engines, where users are presented with a list of related documents. There are three types of information sources consumed by QA systems: unstructured resources (e.g. Wikipedia pages), structured resources, and hybrid sources. Given the extensive progress being made in large-scale Knowledge Graphs (KGs), we mainly focus on QA systems using KGs as their source of information, since such systems may yield more precise answers than those relying on unstructured sources of information.

Given the complexity of QA over KGs, there is a proclivity to design QA systems by breaking them into sequential subtasks such as Named Entity Disambiguation (NED), Relation Extraction (RE), and Query Building (QB), among others [2]. Since the system might end up with more than one candidate query due to uncertainty in the linked entities/relations, ambiguity of the input question, or complexity of the KG, a ranking mechanism is required in the final stage of the QA system to sort the candidate queries by their semantic similarity with respect to the given natural language question. Although considerable research has been devoted to QA over KGs, rather less attention has been paid to the query ranking subtask.
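For illustration, the following is a minimal, self-contained Python sketch of such a modular pipeline with a final ranking stage. The stub components and DBpedia identifiers are hypothetical placeholders standing in for real NED, RE, and QB modules, not the components of any particular system.

```python
# Hypothetical sketch of a modular KGQA pipeline ending in query ranking.

def ned(question):
    # Named Entity Disambiguation: link surface forms to KG entities (stub).
    return ["dbr:Send_It_On"]

def relation_extraction(question):
    # Relation Extraction: map phrases to KG relations; may be ambiguous (stub).
    return ["dbo:openingTheme", "dbp:openingTheme"]

def build_queries(entities, relations):
    # Query Building: combine linked items into candidate formal queries (stub).
    return [f"SELECT ?a WHERE {{ ?show {r} {e} . ?show dbo:artist ?a }}"
            for e in entities for r in relations]

def similarity(question, query):
    # Placeholder token-overlap scorer; in practice a learned semantic model.
    return len(set(question.lower().split()) & set(query.lower().split()))

question = "What are some artists on the show whose opening theme is Send It On?"
candidates = build_queries(ned(question), relation_extraction(question))
# Several candidates survive due to ambiguity; rank them by similarity.
ranked = sorted(candidates, key=lambda q: similarity(question, q), reverse=True)
print(ranked[0])
```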

2 Related Work

Bast et al. [3] were inspired by the learning-to-rank approach from the information retrieval community to rank candidate queries using their feature vectors, which contain 23 manually crafted features such as the number of entities in the query candidate. They cast the ranking problem as a preference learning problem, where a classifier (e.g. logistic regression) is supposed to pick the better option out of two given options. In a similar line of work, Abujabal et al. [4] hand-picked 16 features and utilized a random forest classifier to learn the preference model. Identifying the feature set requires manual intervention and depends heavily on the dataset at hand. To avoid that, Bordes et al. [5] proposed an embedding model, which learns fixed-size embedding vector representations of the input question and the candidate queries such that a score function produces a high score when the matching question and query are given. Inspired by the success of [5], Yih et al. [6] used deep convolutional neural networks to learn the embeddings and compute the semantic similarity of the generated entity/relation chains with respect to the given question. Despite avoiding manually engineered features, the models introduced in [5, 6] fail to exploit the syntactic structure of the input question or the candidate queries. In the next section, we propose to use Tree-LSTM [7] in order to take advantage of the latent information in the structure of the question and the candidate queries.
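To make the preference learning formulation of [3, 4] concrete, the sketch below trains a pairwise classifier on the difference of two candidates' feature vectors; the sign of the score then orders any pair. The feature values and labels are illustrative, not taken from those papers.

```python
# A minimal sketch of pairwise preference learning for query ranking.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: hand-crafted features of one candidate query
# (e.g. number of entities, number of relations, query length).
features_a = np.array([[2, 1, 5], [1, 2, 7], [3, 1, 4]], dtype=float)
features_b = np.array([[1, 1, 6], [2, 2, 5], [1, 3, 8]], dtype=float)
prefers_a = np.array([1, 0, 1])  # 1 if candidate A beats candidate B

# Train on feature differences, following the pairwise reduction.
model = LogisticRegression().fit(features_a - features_b, prefers_a)

new_pair = np.array([[2.0, 1.0, 5.0]]) - np.array([[1.0, 2.0, 6.0]])
print(model.predict_proba(new_pair)[0, 1])  # P(candidate A preferred)
```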

3 Deep Query Ranking

Consider the example question “What are some artists on the show whose opening theme is Send It On?” from [1]; the candidate queries of an arbitrary QA pipeline are illustrated in Fig. 3. The candidate queries are similar to each other in the sense that they are made up of a shared set of entities and relations. Motivated by the success of embedding models [5, 6], we aim to enhance them by also considering the structure of the input question and the candidate queries. In this regard, Tai et al. [7] proposed the Tree-LSTM model, which consumes a tree representation of the input, as opposed to most RNN-based models (e.g. LSTM), which take a sequence of tokens as input. The state of a Tree-LSTM unit depends on the states of its children units (Fig. 2), enabling the model to exploit the tree structure of the input. Consequently, not only does the input sequence matter, but also how the elements of the input are connected together.
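For concreteness, the following PyTorch sketch implements a Child-Sum Tree-LSTM cell following the equations of Tai et al. [7]: the node state is computed from the node's token embedding and the summed hidden states of its children, with one forget gate per child. It is a minimal illustration, not the exact implementation used in [1].

```python
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, in_dim, mem_dim):
        super().__init__()
        self.iou = nn.Linear(in_dim + mem_dim, 3 * mem_dim)   # input/output/update gates
        self.fx = nn.Linear(in_dim, mem_dim)                   # forget gate, input part
        self.fh = nn.Linear(mem_dim, mem_dim, bias=False)      # forget gate, per-child part

    def forward(self, x, child_h, child_c):
        # x: (in_dim,) node embedding; child_h, child_c: (num_children, mem_dim)
        h_tilde = child_h.sum(dim=0)                 # summed children hidden states
        i, o, u = torch.chunk(self.iou(torch.cat([x, h_tilde])), 3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.fx(x) + self.fh(child_h))  # one forget gate per child
        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c

cell = ChildSumTreeLSTMCell(in_dim=300, mem_dim=150)
# Leaf node: no children, so the child tensors are empty.
h_leaf, c_leaf = cell(torch.randn(300), torch.zeros(0, 150), torch.zeros(0, 150))
# Internal node with two children.
h, c = cell(torch.randn(300),
            torch.stack([h_leaf, h_leaf]), torch.stack([c_leaf, c_leaf]))
```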

Fig. 1. Dependency parse tree of the running example (from [1])

Fig. 2. The architecture of Tree-LSTM

In order to learn the embedding vectors, we use a similarity function [8] along with two Tree-LSTM models, one for the input question and one for the candidate queries. The input to the Question Tree-LSTM is the dependency parse tree of the question (Fig. 1), whilst the tree representation of a candidate query is fed into the Query Tree-LSTM (Fig. 3). The Tree-LSTM models are trained to map their inputs into a latent vectorized representation such that the pair consisting of the question and the correct query scores highest among all candidates.
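The sketch below shows how the two root embeddings could be scored and how a training signal could push the correct query above an incorrect one. Cosine similarity and a pairwise margin ranking loss are assumptions for illustration; the actual similarity function of [8] and the loss used in [1] may differ.

```python
import torch
import torch.nn.functional as F

# Dummy root embeddings standing in for the outputs of the two Tree-LSTMs.
h_question = torch.randn(150)   # Question Tree-LSTM root state
h_correct  = torch.randn(150)   # Query Tree-LSTM root state, correct query
h_wrong    = torch.randn(150)   # Query Tree-LSTM root state, incorrect query

pos = F.cosine_similarity(h_question, h_correct, dim=0)
neg = F.cosine_similarity(h_question, h_wrong, dim=0)

# Pairwise margin ranking loss: push the correct query's score above
# the incorrect candidate's score by at least the margin.
loss = F.relu(0.5 - pos + neg)
```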

Fig. 3. Tree representation of the queries along with their NL meaning (from [1])

4 Empirical Study

We prepared two datasets for the ranking model based on the LC-QuAD dataset [9], which consists of 5,000 question-answer pairs. Both datasets consist of questions and candidate queries. The first dataset, DS-Min, is constructed using only the correct entities/relations, while DS-Noise is generated using the correct entities/relations plus four noisy ones for each linked item in the question.
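The sketch below illustrates how DS-Noise-style candidates could be assembled: for each linked item, the correct entity/relation is kept and four distractors are drawn from a candidate vocabulary. The sampling strategy and identifiers are assumptions for illustration, not the exact procedure of [1].

```python
import random

def noisy_candidates(correct_items, vocabulary, k=4):
    # For each linked surface form, keep the correct KG item and add
    # k noisy alternatives drawn from the vocabulary (hypothetical scheme).
    candidates = {}
    for surface_form, correct in correct_items.items():
        distractors = random.sample([v for v in vocabulary if v != correct], k)
        candidates[surface_form] = [correct] + distractors
    return candidates

linked = {"Send It On": "dbr:Send_It_On", "opening theme": "dbo:openingTheme"}
vocab = ["dbr:Send_It_On", "dbr:Send_It_On_(album)", "dbo:openingTheme",
         "dbp:openingTheme", "dbo:theme", "dbr:Sonny_with_a_Chance",
         "dbo:artist", "dbp:artist", "dbo:album", "dbr:Demi_Lovato"]
print(noisy_candidates(linked, vocab))
```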

Table 1. The accuracy of Tree-LSTM vs. LSTM (from [1])

The performance of the Tree-LSTM ranking model is reported in Table 1. Tree-LSTM outperforms the vanilla LSTM on both datasets. While Tree-LSTM performs better on DS-Noise than on DS-Min, the LSTM model degrades on DS-Noise. Although DS-Noise contains more training data with a balanced distribution of correct/incorrect items, LSTM, in contrast to Tree-LSTM, is unable to benefit from the information lying in the structure of its input.

5 Conclusions

We presented the problem of ranking formal queries, with the goal of finding the query that truly captures the intent of a given question. We reviewed recent approaches to the problem and introduced our findings on using Tree-LSTM from our recent paper [1]. The model learns embedding vectors that capture the dependency parse structure of the question and the tree representation of the queries, and uses them to compute the similarity of question/query pairs for improved ranking.