1 Introduction

A Question Answering System (QAS) takes a question and automatically answers it using a knowledge information system. This paper blends the essence of Question Generation, Question Comprehension, and Question Answering to overcome the limitations of Question Answering Systems.

Question Answering Systems date back to the 1960s. The first question answering system was BASEBALL [16]. It was built from a sequence of handwritten rules, and all baseball figures were stored in a database accumulated over the year. Later, LUNAR [42] was introduced during the Apollo missions to answer questions about the moon's geological patterns and other information related to the Apollo program. The customized nature of this system led to highly accurate answers.

As research evolved, Question Answering Systems gained higher credibility due to the explosion of available data. Natural Language Processing (NLP) systems were introduced to reach realistic language understanding [18], and significant research on Question Answering Systems has been carried out with NLP over the past four decades. Early examples of primordial NLP systems are ELIZA [40] and SHRDLU [41], which were developed to mediate language understanding between humans and machines. Although ELIZA came closer to human conversation, it was much less intelligent and knew almost nothing. SHRDLU, on the other hand, was able to reason about the blocks world; although its conversation was limited to that world, and therefore not convincingly human-like, it did know what it was talking about.

Later, in 2011, IBM Watson [15] gained worldwide attention; it uses NLP to analyze human speech for meaning and syntax and was commonly referred to as a brain at the time. In recent years, search engines (Google) and chatbots (Siri, Alexa, and Cortana) have become better at going beyond document retrieval to return the exact answer to a question. The architecture of Question Answering Systems has also changed significantly over the years, from basic Recurrent Neural Networks (RNNs) to transformers [8, 12].

Question Answering Systems are classified into Open-domain and Closed-domain Question Answering Systems [24]. Open-domain question answering systems such as [10, 17] can handle nearly any question based on world knowledge; this type of system has access to more data from which to extract the answer. Closed-domain question answering systems are domain-specific [2, 9, 45] and answer from either a pre-structured database or a collection of domain-specific natural language documents.

According to the studies [32, 33], human accuracy in answering the questions is 89.45%, while the state-of-the-art Question Answering System's accuracy is 93.01%. Although system accuracy exceeds human accuracy, such Question Answering Systems lack the reasoning power that humans use [30, 34, 44] to identify and understand questions. The SQuAD 2.0 dataset [31] provides unanswerable questions with plausible answers; however, identifying the unanswerable questions remains a challenge.

The limitations of the question answering system are:

  • Unanswerable Questions: A question that is related to the context but incorrectly posed, so that it cannot be answered from the passage, is given to the Question Answering System. A Question Answering System that has outstripped human accuracy should recognize that the question is unanswerable and should not generate an answer. However, models trained on the SQuAD 1.1 dataset answer such unanswerable questions with unreliable guesses even though the correct answer is not stated, which indicates that these models lack a rational way of reasoning. Even though the SQuAD 2.0 dataset introduced unanswerable questions, identifying them remains unsolved.

  • Irrelevant Questions: When the Question Answering System is posed with an irrelevant question that is out of context, the system still generates an understandable but nonsensical answer. Humans, on the other hand, do not provide such nonsensical answers; instead, they identify that the question is irrelevant and out of context.

The contributions of this paper are as follows:

  1. We automatically generate all possible question-answer pairs for a given passage.

  2. We introduce a Question Similarity mechanism that identifies unanswerable and irrelevant questions.

  3. We combine the Question Generation System with the Question Answering System to create an application called the Automatic Question-Answer Pairs Generation System.

The rest of the paper is organized as follows. Section 2 reviews related work on Question Answering Systems. Section 3 explains the automatic question-answer pair generation and the question similarity mechanism. Section 4 provides details about the datasets used and the experiments. The experimental results are presented in Section 5, and Section 6 discusses the results. Finally, Section 7 concludes the paper.

2 Related works

In recent years, several works have been proposed to tackle world knowledge by combining search based on bi-gram hashing and TF-IDF matching [7] with machine reading comprehension [22, 29], giving Question Answering Systems a good foundation. The most recent advance in QAS is Bidirectional Encoder Representations from Transformers (BERT) [11], which pre-trains transformer-based neural models on large corpora of data. This refinement has led to remarkable gains in NLP tasks such as Question Answering, Text Summarization, and many classification problems. Beyond BERT itself, researchers have lately demonstrated, across a broad range of applications, the efficiency of neural models that use pre-trained language modeling with BERT as a base model. By combining different neural architectures with the BERT language model and exploiting its embeddings, cutting-edge results in English have been achieved [5]. As research on the BERT model advanced, a few systems were introduced, such as the end-to-end interactive chatbot system BERTserini [43], a lighter version of BERT called ALBERT [21], and an all-purpose language model called DistilBERT [36].

After pre-training on a large corpus of data, the model is trained on a specific dataset to answer questions in either an open-domain or a closed-domain question answering system. There are a few datasets for question answering systems, such as the CuratedTREC dataset [1], the WebQuestions dataset [3], which answers questions from Freebase [4], and the Stanford Question Answering Dataset (SQuAD) [33], which is based on the Wikipedia knowledge source.

Among these, SQuAD is one of the most significant general-purpose Open-domain Question Answering datasets currently available. There are two versions of the SQuAD dataset: SQuAD 1.1 [33] and SQuAD 2.0 [32]. SQuAD 2.0 adds unanswerable questions with plausible answers to the SQuAD 1.1 dataset.

However, as seen in Table 1, when unanswerable and irrelevant questions are posed to the system, the model makes unreliable and incorrect guesses when answering them.

Table 1 Examples showing unanswerable and irrelevant questions resulting in incorrect answers

Alongside the Question Answering System (QAS), the Question Generation System (QGS) plays a vital role in making the model understand the question and answer it. According to Sun et al. [39], there is a close relation between Question Answering and Question Generation. The question generation task has seen many training objectives. Works such as [13, 25, 37] concentrate on the most recent tokens and, even though they provide good results, fail to capture long-term dependencies [19, 22]. The work proposed by Qi et al. [29] uses a future n-gram training objective, providing excellent results on question generation tasks.

When we extensively tested the Question Answering System with attention to how the answer is generated, we found that Question Comprehension plays a significant role in the question answering system [38]. Systems such as [46] introduce a pair-to-sequence model that captures the interaction between the question asked and the given paragraph. Specific systems like ParaQG [20] try to generate questions from a paragraph, while systems like [35] pick up keywords from the question and the paragraph and match them using an RNN. Pota et al. [27] used Convolutional Neural Networks (CNNs) to classify questions; question classification plays a vital role in extracting the correct answer in a Question Answering System. The method proposed by Esposito et al. [14] extracts the most relevant terms from the questions, and these words are then placed in context; this document collection is later used in the QA system. Other work, such as [28], uses Part-of-Speech (POS) tagging based on a deep neural network: the POS is tagged at the character level and then fed to a Bi-LSTM. This method handles rare and out-of-vocabulary words as well as common and known words.

3 Methodology

This section introduces an automatic Question-Answer pairs generation system, a combination of a Question Answering System and a Question Generation System. To address the limitations of the Question Answering System, we propose a Question Similarity mechanism. The possible questions are generated by the state-of-the-art question generation system ProphetNet [29], and the posed questions are taken from the SQuAD 2.0 dataset. The Question Similarity mechanism calculates the cosine similarity between the questions generated from the given paragraph and the question posed.

3.1 Automatic question-answer pairs generation system

The automatic question-answer pairs generation system uses the pre-trained weights of the state-of-the-art question generation system ProphetNet [29] to generate the questions, and the BERT [11] model to generate the answers to the generated questions.

As shown in Fig. 1, we first provide the passage as input to both the question generation system and the question answering system. The question generation system generates a possible set of questions based on answer spans identified from noun and verb phrases in the passage, and these generated questions are given to the question answering system. The question answering system then generates the answers based on the passage and the set of generated questions. Finally, we obtain the Question-Answer pairs from this system.

Fig. 1 Block diagram depicting the Question-Answer pairs generation system
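To make this flow concrete, the sketch below chains a question generation model and a question answering model through the Hugging Face transformers pipeline API. It is a minimal illustration under stated assumptions: the checkpoint names, the "answer [SEP] passage" input format, and the generate_qa_pairs helper are illustrative choices, not necessarily the exact configuration used in our experiments.

```python
# Illustrative sketch of the question-answer pair generation flow (Fig. 1).
# Checkpoint names and the "<answer> [SEP] <passage>" input format are assumptions.
from transformers import pipeline

# Question generation: a ProphetNet-style seq2seq model fine-tuned for
# answer-aware question generation (hypothetical checkpoint name).
question_generator = pipeline(
    "text2text-generation",
    model="microsoft/prophetnet-large-uncased-squad-qg",
)

# Question answering: a BERT model fine-tuned on SQuAD 1.1.
question_answerer = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

def generate_qa_pairs(passage, answer_spans):
    """Generate (question, answer) pairs for a passage, given candidate answer
    spans (e.g. noun and verb phrases extracted from the passage beforehand)."""
    pairs = []
    for span in answer_spans:
        generated = question_generator(f"{span} [SEP] {passage}", max_length=64)
        question = generated[0]["generated_text"]
        answer = question_answerer(question=question, context=passage)["answer"]
        pairs.append({"question": question, "answer": answer})
    return pairs
```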

3.2 Question similarity mechanism

In addition to automatically generating Question-Answer pairs, if an additional question is posed to the system, it is identified as answerable, unanswerable, or irrelevant before being passed to the Question Answering System. To identify the questions, we introduce a Question Similarity mechanism, which calculates the cosine similarity between the generated questions and the question posed.

As shown in Fig. 2, the passage is first passed to the Question Generation System to generate the possible set of questions on the given paragraph, based on answer spans derived from the noun and verb phrases.

Fig. 2 Block diagram to identify the unanswerable or irrelevant questions

Let GQ and QP be the set of generated questions and the question posed, with |GQ| = m and |QP| = 1. The sentence embeddings for the generated questions are obtained using the Universal Sentence Encoder [6], which gives better results than pre-trained word embeddings such as those produced by GloVe [26] and word2vec [23], and they are given by,

$$ X_{SE}^{GQ}=\{E_{GQ}^{(i)}\in\mathbb{R}^{512};i=1,\ldots,m \}. $$
(1)

where,

  • \(X^{GQ}_{SE}\) is the set of Sentence Embeddings (SE) for the Generated Questions (GQ), and

  • \(E_{GQ}^{(i)}\) is the sentence embedding of each Generated Question (GQ).

Similarly, we obtain the sentence embeddings for the question posed as

$$ X_{SE}^{QP}=E_{QP}^{(i)}\in\mathbb{R}^{512};i=1. $$
(2)

where,

  • \(X_{SE}^{QP}\) is the Sentence Embedding (SE) for the Question Posed (QP), and

  • \(E_{QP}\) is the sentence embedding of the Question Posed (QP).

The cosine similarity between the generated questions and the question posed is computed as per (3).

$$ \begin{array}{@{}rcl@{}} &&\text{Cosine Similarity}(E_{GQ}^{(i)},X_{SE}^{QP})=\cos(E_{GQ}^{(i)},X_{SE}^{QP})\\&&\quad=\frac{\langle E_{GQ}^{(i)},X_{SE}^{QP}\rangle}{||E_{GQ}^{(i)}|| ||X_{SE}^{QP}||},\quad i=1,\ldots,m \end{array} $$
(3)

where \(\langle E_{GQ}^{(i)},X_{SE}^{QP}\rangle \) denotes the inner product of \(E_{GQ}^{(i)}\) and \(X_{SE}^{QP}\).

To calculate the Question Similarity Score (QSS), we identify the generated question whose cosine similarity with the posed question is highest. We call it the Highest Similarity Score Question, and it is obtained by (4).

$$ \text{Highest Similarity Score Question}=\underset{i\in\{1,\dots,m\}}{\arg\max}\ \cos(E_{GQ}^{(i)},X_{SE}^{QP}). $$
(4)

Now, the Question Similarity Score between the generated question identified by (4) and the question posed is given by,

$$ \text{Question Similarity Score}(E_{GQ}^{(j)},X_{SE}^{QP})=\cos(E_{GQ}^{(j)},X_{SE}^{QP}) $$
(5)

where \(E_{GQ}^{(j)}\) and \(X^{QP}_{SE}\) are the sentence embeddings of the j-th generated question (as obtained by (4)) and the question posed, respectively.
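A minimal sketch of (1)-(5) is given below, assuming the publicly available Universal Sentence Encoder module on TensorFlow Hub; the function name and the specific module version are illustrative assumptions.

```python
# Illustrative sketch of Eqs. (1)-(5): embed the generated questions and the posed
# question with the Universal Sentence Encoder, then take the highest cosine
# similarity as the Question Similarity Score (QSS).
import numpy as np
import tensorflow_hub as hub

# Pre-trained Universal Sentence Encoder (512-dimensional sentence embeddings);
# the module version used here is an assumption.
use_encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def question_similarity_score(generated_questions, question_posed):
    """Return (QSS, index j of the Highest Similarity Score Question)."""
    gq_embeddings = np.asarray(use_encoder(generated_questions))  # (m, 512), Eq. (1)
    qp_embedding = np.asarray(use_encoder([question_posed]))[0]   # (512,),   Eq. (2)

    # Cosine similarity between every generated question and the posed question, Eq. (3).
    similarities = gq_embeddings @ qp_embedding / (
        np.linalg.norm(gq_embeddings, axis=1) * np.linalg.norm(qp_embedding)
    )

    j = int(np.argmax(similarities))   # Highest Similarity Score Question, Eq. (4)
    return float(similarities[j]), j   # Question Similarity Score, Eq. (5)
```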

3.3 Question Posed

The Question Answering System is posed with several question types, which are classified as unanswerable, irrelevant, or answerable:

  • Unanswerable: The context is available in the passage, but the user poses the question in such a complex way that the question answering system cannot answer it; such a question is labeled unanswerable.

  • Irrelevant: When the user poses a question that is out of context with the given passage, this question is labeled as irrelevant.

  • Answerable: A question whose context is available in the given passage and which the question answering system can answer.

3.4 Question similarity score

The question similarity mechanism is used as a question filter for the Question Answering System. This mechanism identifies and filters unanswerable, irrelevant, and answerable questions based on a threshold value. The ranges of the QSS threshold and the corresponding labels of the posed question are given in Table 2.

Table 2 Labeling of posed question

In our experiment, 1000 questions each of the unanswerable, irrelevant, and answerable types are chosen from the SQuAD 2.0 dataset. We found that irrelevant questions have question similarity scores in the range 0.00 to 0.50, and unanswerable questions have scores in the range 0.50 to 0.80. We further checked the question similarity scores of the answerable questions and found them to be in the range 0.85 to 1.00. So, we set the thresholds as follows: the posed question is labeled Irrelevant for scores in 0.00 − 0.50, Unanswerable for scores in 0.50 − 0.85, and Answerable for scores in 0.85 − 1.00. If the question posed crosses the threshold, it is identified as an answerable (relevant) question and is passed to the question answering system to obtain its answer. If it does not cross the threshold, then, as per Table 2, it is identified as either irrelevant or unanswerable.
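A minimal sketch of the resulting question filter, using the thresholds reported above (0.50 and 0.85), is shown below; the function name and label strings are illustrative.

```python
# Illustrative sketch of the question filter summarised in Table 2.
def label_question(question_similarity_score):
    """Map a Question Similarity Score to a label for the posed question."""
    if question_similarity_score >= 0.85:
        return "Answerable"    # passed on to the Question Answering System
    if question_similarity_score >= 0.50:
        return "Unanswerable"  # filtered out; no answer is generated
    return "Irrelevant"        # filtered out; out of context for the given passage
```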

4 Data and Experiments

The following data are used for the experiments:

  1. We use the SQuAD 2.0 [32] dataset for our experiments. It consists of 50,000 questions in addition to the SQuAD 1.1 [33] dataset, which has 100,000 answerable questions.

  2. We use the pre-trained weights of the state-of-the-art Question Generation System ProphetNet [29] to generate the questions for a given paragraph.

  3. We use the pre-trained weights of the BERT [11] Question Answering System, which is fine-tuned on the SQuAD 1.1 dataset [33].

  4. We use the pre-trained Universal Sentence Encoder (USE) [6] to generate the sentence embeddings for the questions (Tables 3, 4, 5, 6, 7, 8, 9 and 10).

Table 3 This table illustrates possible generated questions from the passage, which is randomly taken from the SQuAD 2.0 dataset
Table 4 This table illustrates the generated question-answer pairs from the given passage
Table 5 This table illustrates how the question similarity method identifies unanswerable or irrelevant questions and addresses the limitations of QAS
Table 6 This table illustrates possible generated questions from the passage, which is randomly taken from the SQuAD 2.0 dataset
Table 7 This table illustrates the generated question-answer pairs from the given passage
Table 8 This table illustrates how the question similarity method identifies unanswerable or irrelevant questions and addresses the limitations of QAS
Table 9 This table illustrates possible generated questions from the passage, which is randomly taken from the SQuAD 2.0 dataset
Table 10 This table illustrates the generated question-answer pairs from the given passage

5 Results

5.1 Automatic question-answer pairs generation system

This subsection shows the results produced by the automatic question-answer pairs generation system. We generated question-answer pairs for 100 passages from the SQuAD 2.0 dataset [32]. Tables 3, 6 and 9 show all possible questions generated from the passages by the question generation system. These questions, along with the passage, are then given to the question answering system to generate the answers. Tables 4, 7, and 10 show the question-answer pairs generated by the automatic question-answer pairs generation system. On manual reading, we found the generated question-answer pairs to be of good quality (Table 11).

Table 11 This table illustrates how the question similarity method identifies unanswerable or irrelevant questions and addresses the limitations of QAS

5.2 Question similarity mechanism

This subsection provides the results of the proposed question similarity mechanism. When a question is posed to the question answering system, the question similarity mechanism identifies whether the question is answerable, unanswerable, or irrelevant. Both unanswerable and irrelevant questions are taken from the SQuAD 2.0 dataset [32] for the experiments.

We carried out the experiments on 100 random passages from the SQuAD 2.0 dataset [32] with unanswerable and irrelevant questions. As shown in Tables 5, 8 and 12, when the cosine similarity score between the generated question and the posed question does not exceed the threshold of 0.85, the posed question is labeled as either unanswerable or irrelevant and is not passed to the Question Answering System. Our proposed question similarity mechanism thus prevents the question-answering model from answering unanswerable or irrelevant questions by incorrect guessing. We also present the Question Similarity Scores for answerable questions from the SQuAD 2.0 dataset [32]; we found that answerable questions receive Question Similarity Scores above 0.90. We can infer that the question similarity mechanism identifies the questions on par with human judgment.

Table 12 Quantitative analysis of the BERT model trained on SQuAD 2.0 and BERT model trained on SQuAD 1.1 with Question Similarity Mechanism

We experimented with 1000 Unanswerable and 1000 Irrelevant questions. In our experiments, we used the BERT model trained on the SQuAD 1.1 dataset. A BERT model trained on SQuAD 2.0 should not predict answers for Unanswerable questions; however, it still answers a few of them. We therefore introduced the Question Similarity mechanism with the BERT model trained on SQuAD 1.1, which helps to identify unanswerable and irrelevant questions. Irrelevant questions are not included in the SQuAD 2.0 dataset, so for a particular passage in SQuAD 2.0, irrelevant questions are chosen randomly from different passages; the randomly chosen questions are therefore unrelated to the context. The efficiency of the model is calculated by,

$$ \text{Efficiency}= \frac{\text{No. of Unanswerable/Irrelevant questions not answered by the model}} {\text{Total No. of Unanswerable/Irrelevant questions}} \times 100 $$
(6)
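For example, under purely illustrative counts, if the model leaves 940 of 1000 unanswerable questions unanswered, (6) gives an efficiency of (940 / 1000) × 100 = 94%; these numbers are for illustration only and are not experimental results.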

6 Discussion

The automatic question-answer pairs generation gives an overview of how the question answering system and the question generation system work as a twin-task system to obtain satisfactory results. On manual reading, we can infer that this system generates good question-answer pairs. Question-answer pairs generated to date have been confined to 'wh' questions and their answers, and the majority of question-answer pair generation systems are rule-based, whereas our proposed application generates all possible question-answer pairs using a machine learning approach.

In the question similarity mechanism, we show the significance of the work by addressing the Question Answering System's challenge. Even though works like [1, 3, 32, 33] introduced different techniques to overcome the limitations of the Question Answering System, identifying unanswerable questions remains an open challenge. The proposed Question Similarity mechanism does not require training, and it improves the performance of question answering systems by focusing only on answerable, relevant questions. From this we can infer that the Question Similarity mechanism incorporates a human way of reasoning to identify unanswerable and irrelevant questions, and hence addresses the limitation of QAS.

7 Conclusion

In this paper, we introduce an application called the automatic question-answer pairs generation system, which combines Question Generation and Question Answering to generate all possible question-answer pairs and which has applications in various fields. We then introduce a Question Similarity mechanism that imitates human reasoning to identify whether a posed question is answerable, unanswerable, or irrelevant; existing question answering systems cannot make this distinction. If the question posed is unanswerable or irrelevant, it is not passed to the QAS. As no training process is involved, the mechanism requires few computational resources. It can be included with state-of-the-art Question Answering Systems so that the models concentrate on answerable questions and improve their performance. The automatically generated question-answer pairs can also be used as a dataset to train Question Answering models.