Automatic question-answer pairs generation and question similarity mechanism in question answering system

With the swift growth of information over the past few years, taking full advantage of it is increasingly essential. A Question Answering System is one of the most promising ways to access this information. However, Question Answering Systems lack human common sense and reasoning power and cannot identify unanswerable and irrelevant questions; such questions are answered with unreliable and incorrect guesses. In this paper, we address this limitation by proposing a Question Similarity mechanism. Before a question is posed to a Question Answering System, it is compared with the possible questions generated from the given paragraph, and a Question Similarity Score is computed. The Question Similarity mechanism effectively identifies unanswerable and irrelevant questions, incorporating a human way of reasoning, and can keep such questions from being posed to the Question Answering System at all. This helps Question Answering Systems focus only on answerable questions and improve their performance. Along with this, we introduce an application of the Question Answering System that generates question-answer pairs from a given passage and is useful in several fields.


Introduction
The Question Answering System (QAS) plays an important role in taking questions and automatically answering them using a knowledge information system. This paper blends the essence of Question Generation, Question Comprehension, and Question Answering to overcome the Question Answering System's limitations.
Early dialogue systems could reason about a simple blocks world; although the conversation was limited to that world, and hence not convincingly human-like, such a system did know what it was talking about.
Later, in 2011, IBM Watson [15] gained worldwide attention; it uses NLP to analyze human speech for meaning and syntax, and at the time it was commonly referred to as a "brain". In recent years, search engines (Google) and chatbots (Siri, Alexa, and Cortana) have become better at going beyond retrieval to answer our questions exactly. The architecture of Question Answering Systems has also changed significantly over the years, from basic Recurrent Neural Networks (RNNs) to transformers [8,12].
Question Answering Systems are classified into open-domain and closed-domain systems [24]. Open-domain question answering systems like [10,17] can handle nearly any question based on world knowledge; this type of system has access to more data from which to extract the answer. Closed-domain question answering systems are domain-specific [2,9,45] and answer from either a pre-structured database or a collection of domain-specific natural language documents.
According to the studies [32,33], human accuracy in answering the questions is 89.45%, while the state-of-the-art Question Answering System's accuracy is 93.01%. Although system accuracy exceeds human accuracy, such Question Answering Systems lack the reasoning power humans use [30,34,44] to identify and understand questions. The SQuAD 2.0 dataset [31] provides unanswerable questions with plausible answers; however, identifying the unanswerable question remains a challenge.
The limitations of the question answering system are:
- Unanswerable Questions: a question that is related to the context but cannot be answered from it is posed to the Question Answering System. A system that has outstripped human accuracy should recognize that the question is unanswerable and refrain from generating an answer. However, models trained on the SQuAD 1.1 dataset answer such unanswerable questions with unreliable guesses even when the correct answer is not stated, indicating that these models lack a rational way of reasoning. Even though the SQuAD 2.0 dataset introduced unanswerable questions, identifying them remains unsolved.
- Irrelevant Questions: when the Question Answering System is posed with irrelevant questions that are out of context, the system still generates an understandable but nonsensical answer. Humans, on the other hand, do not provide such nonsensical answers; instead, they identify that the question is irrelevant and out of context.
The contributions of this paper are as follows:
1. We automatically generate the possible question-answer pairs, given a passage.
2. We introduce a Question Similarity mechanism that identifies unanswerable and irrelevant questions.
3. We combine the Question Generation System with the Question Answering System to create an application called the Automatic Question-Answer Pairs Generation System.
The rest of the paper is organized as follows. Section 2 provides the related work on Question Answering Systems. Section 3 explains the automatic question-answer pair generation and the question similarity mechanism. Section 4 provides details about the datasets used and the experiments. The experimental results are presented in Section 5, and Section 6 discusses the results. Finally, in Section 7 we conclude this paper.

Related works
In recent years, several works have been proposed to tackle world knowledge by combining search based on bigram hashing and TF-IDF matching [7] with machine reading comprehension [22,29]. These gave the Question Answering System a good beginning. One of the most recent advances relevant to QAS is Bidirectional Encoder Representations from Transformers (BERT) [11], which uses transformer neural models pre-trained on large corpora of data. This refinement has led to remarkable gains in NLP tasks such as Question Answering, Text Summarization, and many classification problems. Beyond BERT itself, researchers have lately demonstrated the efficiency of neural models that use pre-trained language modeling with BERT as a base model for a broad range of applications. By combining different neural architectures with the BERT language model and exploiting its embeddings, cutting-edge results in English have been achieved [5]. With further research on the BERT model, systems such as the end-to-end interactive chatbot BERTserini [43], a lighter version of BERT called ALBERT [21], and an all-purpose language model called DistilBERT [36] were introduced.
After pre-training on a large corpus, the model is trained on a specific dataset to answer questions in either an open-domain or closed-domain question answering system. There are several datasets for question answering systems, such as the CuratedTREC dataset [1], the WebQuestions dataset [3] that answers questions from Freebase [4], and the Stanford Question Answering Dataset (SQuAD) [33], which is based on the Wikipedia knowledge source.
SQuAD is one of the most significant general-purpose open-domain Question Answering datasets currently available among all of these. There are two versions: SQuAD 1.1 [33] and SQuAD 2.0 [32]. However, as seen in Table 1, when unanswerable and irrelevant questions are asked of the system, the model makes unreliable and incorrect guesses in answering them.
Alongside the Question Answering System (QAS), the Question Generation System (QGS) plays a vital role in making the model understand the question and answer it. According to Sun et al. [39], there is a close relation between Question Answering and Question Generation. The question generation task has seen many training objectives. Works such as [13,25,37] concentrate on the most recent tokens and, even though they provide good results, fail to capture long-term dependencies [19,22]. The work proposed by Qi et al. [29] uses a future n-gram as a training objective, thus providing excellent results in question generation tasks.
When we extensively tested the Question Answering System, keeping in mind how the answer is generated, we found that Question Comprehension plays a significant role in question answering [38]. Systems like [46] introduce a pair-to-sequence model that captures the interaction between the question asked and the given paragraph. Specific systems like ParaQG [20] try to generate the questions from the paragraph. Systems like [35] pick keywords from the question and paragraph and match them using an RNN. Pota et al. [27] used Convolutional Neural Networks (CNNs) to classify the questions; question classification plays a vital role in extracting the correct answer in a Question Answering System. The method proposed by Esposito et al. [14] extracts the most relevant terms from the questions, and then these words are placed in context; this document collection is later used in the QA system. Other work like [28] uses Part of Speech (POS) tagging based on a deep neural network: POS tags are assigned at the character level and then fed to a Bi-LSTM. This method handles rare and out-of-vocabulary words as well as common and known words.

Methodology
This section introduces an automatic Question-Answer pairs generation system, a combination of a Question Answering System and a Question Generation System. To address the limitations of the Question Answering System, we propose a Question Similarity mechanism. The possible generated questions come from the state-of-the-art question generation system ProphetNet [29], and the posed questions come from the SQuAD 2.0 dataset. The Question Similarity mechanism calculates the cosine similarity between the possible questions generated from the given paragraph and the question posed.

Automatic question-answer pairs generation system
The automatic question-answer pairs generation system uses pre-trained weights of a state-of-the-art question generation system called ProphetNet [29] to generate the questions, and BERT [11] model to generate the answers for the generated questions.
As shown in Fig. 1, we first provide the passage as input to both the question generation system and the question answering system. The question generation system generates the possible set of questions based on answer spans, which are found from the noun and verb phrases in the passage; the generated questions are then given to the question answering system. The question answering system, based on the passage and the set of generated questions, generates the answers. Finally, we obtain the Question-Answer pairs from this system.
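The two-stage pipeline above can be sketched as a small composition of two components. The stub functions below are hypothetical placeholders for the actual ProphetNet generator and BERT reader, which are not reproduced here; only the wiring of the pipeline matches the system described.

```python
from typing import Callable, List, Tuple

def generate_qa_pairs(
    passage: str,
    question_generator: Callable[[str], List[str]],
    answer_extractor: Callable[[str, str], str],
) -> List[Tuple[str, str]]:
    """Run the two-stage pipeline: generate questions from the passage,
    then answer each generated question against the same passage."""
    questions = question_generator(passage)
    return [(q, answer_extractor(passage, q)) for q in questions]

# Hypothetical stand-ins for the QG (ProphetNet) and QA (BERT) models.
def toy_question_generator(passage: str) -> List[str]:
    # The real system derives answer spans from noun and verb phrases;
    # this toy version just emits one canned question per sentence.
    return [f"What does sentence {i + 1} say?"
            for i, s in enumerate(passage.split(".")) if s.strip()]

def toy_answer_extractor(passage: str, question: str) -> str:
    # Recover the sentence index encoded in the toy question.
    idx = int(question.split()[3]) - 1
    return passage.split(".")[idx].strip()

pairs = generate_qa_pairs("Oxygen supports combustion. Mayow studied it.",
                          toy_question_generator, toy_answer_extractor)
for q, a in pairs:
    print(q, "->", a)
```

In the real system, `question_generator` would wrap ProphetNet and `answer_extractor` would wrap the BERT reader; the composition itself is unchanged.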

Question similarity mechanism
In addition to automatically generating Question-Answer pairs, any additional question posed to the system is identified as either answerable or unanswerable/irrelevant before being passed to the Question Answering System. To identify such questions, we introduce the Question Similarity mechanism, which calculates the cosine similarity between the generated questions and the question posed.
As shown in Fig. 2, the passage is initially passed to the Question Generation System to generate the possible set of questions on the given paragraph based on the answer spans derived on the noun and verb phrases.
Let GQ and QP be the set of generated questions and the question posed, with |GQ| = m and |QP| = 1. The sentence embeddings for the generated questions are obtained using the Universal Sentence Encoder [6], which gives better results than pre-trained word embeddings such as those produced by GloVe [26] and word2vec [23], and are given by

X_{GQ}^{SE} = \{ E_{GQ}^{(i)} \}_{i=1}^{m}    (1)

where
- X_{GQ}^{SE} is the set of Sentence Embeddings (SE) for the Generated Questions (GQ), and
- E_{GQ}^{(i)} is the sentence embedding of the i-th generated question.

Similarly, we obtain the sentence embedding for the question posed as

X_{QP}^{SE} = E_{QP}    (2)

where
- X_{QP}^{SE} is the set of Sentence Embeddings (SE) for the Question Posed (QP), and
- E_{QP} is the sentence embedding of the Question Posed (QP).

The cosine similarity between each generated question and the question posed is computed as per (3):

sim(E_{GQ}^{(i)}, E_{QP}) = \frac{E_{GQ}^{(i)} \cdot E_{QP}}{\|E_{GQ}^{(i)}\| \, \|E_{QP}\|}, \quad i = 1, \ldots, m    (3)

To calculate the Question Similarity Score (QSS), we identify the generated question whose cosine similarity with the posed question is highest. We call it the Highest Similarity Score Question, and its index is obtained by (4):

j = \arg\max_{i \in \{1, \ldots, m\}} sim(E_{GQ}^{(i)}, E_{QP})    (4)

The Question Similarity Score between the generated question identified by (4) and the question posed is then given by

QSS = sim(E_{GQ}^{(j)}, E_{QP})    (5)

where E_{GQ}^{(j)} and X_{QP}^{SE} are the sentence embeddings of the j-th generated question and the question posed, respectively.
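The score computation above can be sketched in plain Python. The vectors below are toy stand-ins for Universal Sentence Encoder outputs, used only to exercise the arithmetic of Eqs. (3)-(5):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, as in Eq. (3)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def question_similarity_score(generated_embeddings, posed_embedding):
    """QSS: the highest cosine similarity between the posed question's
    embedding and any generated question's embedding (Eqs. (4)-(5))."""
    return max(cosine_similarity(e, posed_embedding)
               for e in generated_embeddings)

# Toy 3-dimensional embeddings standing in for real sentence embeddings.
gq = [[1.0, 0.0, 0.0],   # generated question 1
      [0.0, 1.0, 0.0],   # generated question 2
      [3.0, 4.0, 0.0]]   # generated question 3
qp = [3.0, 4.0, 0.0]     # question posed

print(question_similarity_score(gq, qp))  # -> 1.0 (matches question 3 exactly)
```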

Question Posed
A Question Answering System is posed with several question types. We classify the posed questions as unanswerable, irrelevant, or answerable:
- Unanswerable: the context is available in the passage, but the user poses the question in such a complex way that the question answering system cannot answer it.
- Irrelevant: the user poses a question that is out of context with respect to the given passage.
- Answerable: the context of the question is available in the given passage, and the question answering system can answer it.

Question similarity score
The question similarity mechanism is used as a question filter to the Question Answering System. This mechanism identifies and filters unanswerable, irrelevant, and answerable questions based on the threshold value. The range of the QSS threshold and the corresponding label of the posed question is given in Table 2.
In our experiments, 1000 questions each were chosen for the unanswerable, irrelevant, and answerable categories from the SQuAD 2.0 dataset. We found that irrelevant questions have Question Similarity Scores in the range 0.00 to 0.50, and unanswerable questions have scores in the range 0.50 to 0.80. We further checked the Question Similarity Scores for the answerable questions and found them to be in the range 0.85 to 1.00. So we set the ranges as 0.00-0.50 for irrelevant, 0.50-0.85 for unanswerable, and 0.85-1.00 for answerable questions. If the posed question crosses the answerable threshold, it is identified as answerable and passed to the question answering system to obtain its answer. If it does not cross the threshold, then, as per Table 2, it is identified as either irrelevant or unanswerable.
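The threshold logic above can be applied as a simple filter in front of the QAS. A minimal sketch, with the range boundaries taken from the experiments described in this section:

```python
from typing import Optional

def classify_question(qss: float) -> str:
    """Label a posed question by its Question Similarity Score (QSS),
    using the experimentally determined ranges (Table 2):
      0.00-0.50 -> irrelevant
      0.50-0.85 -> unanswerable
      0.85-1.00 -> answerable
    """
    if qss < 0.50:
        return "irrelevant"
    if qss < 0.85:
        return "unanswerable"
    return "answerable"

def filter_for_qas(question: str, qss: float) -> Optional[str]:
    """Pass the question on to the QAS only when it is answerable;
    return None so that unanswerable/irrelevant questions are dropped."""
    return question if classify_question(qss) == "answerable" else None

print(classify_question(0.30))  # irrelevant
print(classify_question(0.70))  # unanswerable
print(classify_question(0.92))  # answerable
```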

Data and Experiments
The following data are used for the experiments: we used the SQuAD 2.0 [32] dataset, which adds 50,000 questions to the SQuAD 1.1 [33] dataset's 100,000 answerable questions. When the Question Similarity Score (QSS) falls within a particular range, the corresponding label is assigned to the posed question (Table 2).

Automatic question-answer pairs generation system
This subsection shows the results produced by the automatic question-answer pairs generation system. We generated question-answer pairs for 100 passages from the SQuAD 2.0 dataset [32]. Tables 3, 6, and 9 show all the possible questions generated from the passages by the question generation system; for example, G18, "What was the effect of placing a mouse or a candle in a closed container over water?", with the answer A18, "caused the water to rise and replace one-fourteenth of the air's volume before extinguishing the subjects". The generated questions, together with the passage, are further given to the BERT [11] question answering system, which finds the answers for all the generated questions. Tables 4, 7, and 10 show the question-answer pairs generated by the automatic question-answer pairs generation system. On manual reading, we found that the generated question-answer pairs are of good quality (Table 11).

Question similarity mechanism
This subsection provides the results of the proposed question similarity mechanism. When a question is posed to the question answering system, the mechanism identifies whether the posed question is answerable, unanswerable, or irrelevant. Both unanswerable and irrelevant questions are taken from the SQuAD 2.0 dataset [32], and the experiments were carried out on 100 random passages. As shown in Tables 5, 8, and 12, when the highest cosine similarity score between the generated questions and the posed question does not exceed the threshold of 0.85, the question is labeled as either unanswerable or irrelevant and is not passed to the Question Answering System. Thus, our proposed question similarity mechanism does not allow the question answering model to answer unanswerable or irrelevant questions by incorrect guessing. We also present the Question Similarity Scores for answerable questions from the SQuAD 2.0 dataset; we found that answerable questions obtain Question Similarity Scores above 0.90. In these tables, the first column specifies the question posed; the second, the highest question similarity score between the posed and generated questions; the third, whether the question is answerable, unanswerable, or irrelevant based on comparing that score with the threshold (0.85); and the fourth, the answer generated by passing the question to the BERT question answering system.
We can infer that the question similarity mechanism identifies the questions on par with human judgment.
We experimented with 1000 questions each for the unanswerable and irrelevant categories. In these experiments, we used the BERT model trained on the SQuAD 1.1 dataset. A BERT model trained on SQuAD 2.0 should not predict answers for unanswerable questions; however, that model still answers a few of them. We therefore introduced the Question Similarity mechanism with the BERT model trained on SQuAD 1.1; this mechanism helps to identify unanswerable and irrelevant questions. Irrelevant questions are not provided in the SQuAD 2.0 dataset, so for a given passage, irrelevant questions are chosen randomly from different passages; the randomly chosen questions are thus unrelated to the context. The efficiency of the model is then calculated.

Discussion
The automatic question-answer pairs generation gives an overview of how the question answering and question generation systems work as a twin-task system to obtain satisfactory results. On manual reading, we can infer that this system generates good question-answer pairs. The question-answer pairs generated to date have been confined to 'wh' questions and their answers, and the majority of question-answer pair generation systems are rule-based, whereas our proposed application generates all possible question-answer pairs using a machine learning approach.
In the question similarity mechanism, we show the work's significance by addressing the Question Answering System's challenge. Even though the works like [1,3,32,33] introduced different techniques to overcome the limitations of the Question Answering System, the identification of the unanswerable questions remains an open challenge. The proposed Question Similarity mechanism does not require training. It improves the question answering systems' performance by focusing only on the answerable or relevant questions. By this, we can infer that the Question Similarity mechanism incorporates a human way of reasoning to identify unanswerable and irrelevant questions and hence addresses the limitation of QAS.

Conclusion
In this paper, we introduced an application called the automatic question-answer pairs generation system, which combines the Question Generation and Question Answering systems to generate all possible question-answer pairs; it has applications in various fields. We then introduced a Question Similarity mechanism that imitates human reasoning to identify whether a posed question is answerable, unanswerable, or irrelevant, an identification that existing question answering systems cannot make. If the posed question is unanswerable or irrelevant, it is not passed to the QAS. As no training process is involved, the mechanism requires few computational resources. It can be included with state-of-the-art Question Answering Systems so that the models can concentrate on answerable questions and improve their performance. The automatically generated question-answer pairs can also be used as a dataset to train Question Answering models.