Introduction

Questions play a pivotal role in both learning and teaching. Numerous studies (Song, 2016; Ebersbach et al., 2020) have demonstrated that asking questions improves educational outcomes and promotes a deeper understanding of the learning material. People ask questions to obtain information and express interest in ideas. Moreover, encouraging learners to generate questions fosters critical thinking about the learning content. Research (Cuccio-Schirripa and Steiner, 2000) showed that questioning is one of the essential thinking skills underpinning critical thinking, creative thinking, and problem-solving.

Meanwhile, with the development of online communities and smart voice assistants, many people ask questions and search for answers daily, and vast quantities of questions emerge as a result. Online question communities like Quora and Stack Overflow also encourage users to ask questions and to connect with people who share the same question or can offer unique insights and quality answers.

This abundance of queries raises a challenge for artificial intelligence: can we use machines to understand and analyse large-scale collections of questions sourced from different contexts? Furthermore, how might such models be advantageous to education?

A typical question processing module includes query formulation and answer type detection (Jurafsky and Martin, 2014). The query formulation can include part-of-speech tagging and stopword removal. Answer type detection categorises the questions according to the anticipated answer type, such as numeric, location, entity, and so on. While the primary objective of the typical question processing module is to enhance the performance of a question answering system, the vast amount of data offers the potential for extracting other valuable information from the question itself.

Therefore, we report here on a multi-task question processing network we call QBERT (Q stands for question), which is built with a leading deep neural network, BERT (Devlin et al., 2018). QBERT is built as a generalist to solve three tasks we defined in the question domain: detecting the topic of questions, detecting equivalent questions with similar meaning but different wording, and locating potential answers to these same questions. The goal of having a generalist is that one model can solve all the tasks, even if it does not excel at any single one. These three tasks help uncover information in the questions beyond the facts they ask about.

Our approach is based on a fine-tuned language model, Sentence-BERT (SBERT) (Reimers and Gurevych, 2019), a Siamese BERT that projects sentences into a high-dimensional vector space. The embeddings of sentences with similar semantic meanings are close to each other in this space. Note that our intention is not to design a new model but to fine-tune SBERT in a multi-task way for processing questions. After fine-tuning SBERT with different tasks and loss functions, the embeddings generated from input data can be used for both classification and retrieval tasks. Moreover, our approach is time-efficient in finding related sequences in a large corpus such as Wikipedia when combined with approximate nearest neighbour search (Johnson et al., 2019).

After training QBERT, we apply it to analyse a real-world dataset. In 2017, “Project What If” was started at the “We The Curious” science centre of Bristol (UK), with the stated intention of being the first exhibition all about “the curiosity of a city”. Its aim was no less than to capture the curiosity of Bristolians (and visitors) by collecting all their questions. It was focused on the questions “of real people”, and through these it aimed at understanding what Bristolians were curious about. In other words, it was not so much about the answers to individual questions as about understanding a community from the questions it asks.

Despite the clear identity of “We The Curious” as a science centre, the organisers of this project were trying to gauge a broader set of interests, about culture and society, in a time of rapid change. A collection of the spontaneous questions of thousands of people was expected to provide deep insights about the people who asked them. It was expected that through this project, “We The Curious” could learn more about how the role of a science centre could evolve in addressing these questions in collaboration with its community. Over the following three years, the project gathered over 10,000 questions, both in their “museum” venue and in initiatives around the city. That list, taken together, contained many questions, worries, doubts, and ambitions of thousands of citizens.

A “We The Curious” corpus (WTC corpus) has been constructed from these questions. By analysing the queries from the WTC corpus, we present QBERT in a practical way, demonstrating its applicability for analysing question data from other sources.

With QBERT, we perform topic classification to identify visitors’ interests. On top of the topic classification task, we also include a type classification when analysing the questions. In addition, we introduce an external knowledge source as a candidate answer set when searching for answers to a given question. In this way, we try to reveal more of the information contained in the question corpus. For example, the question taxonomy tells us about the questioners’ interests and their level of comprehension of the content; recognising equivalent questions can group common questions and help locate identical and popular questions; and the question answering task can identify the questions that can be answered by an external source, which potentially reduces the workload of answering queries. As a result, educational content providers can shift their focus to the common questions, or the questions that a machine cannot answer, based on this analysis.

In this article, we fine-tune a multi-task deep neural network to process questions. With this model, we aim to understand We The Curious visitors’ curiosity by processing and analysing a new question corpus collected by We The Curious. The article describes the WTC corpus in Section 3, the algorithm in Section 4, the content-analysis of the corpus in Section 5, and the discussion of results in Section 6.

Background

Research (Cuccio-Schirripa and Steiner, 2000) showed that questioning is one of the essential thinking skills underpinning critical thinking, creative thinking, and problem-solving. Learners’ questions play a critical role in both learning and teaching (Chin and Osborne, 2008). Students’ questions can help construct knowledge, support self-evaluation, and motivate their interest in a topic during learning. On the other hand, teachers can diagnose students’ understanding and evaluate their thinking through their questions.

Different kinds of questions can stimulate different extents of learning (Scardamalia and Bereiter, 1992). For example, knowledge-based questions generated from interest or to better understand and extend the knowledge have a higher order than text-based questions that are asked in response to given content. Scardamalia and Bereiter’s research also showed that students tend to ask questions about basic information for less familiar topics but more wondering questions for familiar topics. Thus, categorising the questions can be beneficial to understanding the questioners and tailoring the learning content.

One of the most famous methods of categorising questions for teaching is Bloom’s taxonomy (Bloom, 1956). It divides questions into knowledge, comprehension, application, analysis, synthesis, and evaluation, and was designed to help students learn.

In AI, question processing is treated as a processing step in some early question answering systems (Li and Roth, 2002; Ferrucci, 2012). This step includes query formulation, answer type detection and keyword extraction. The question processing module is intended to improve the performance of the question answering system. Nevertheless, questions are a special type of text: in education, a question reflects the student’s ability and concerns (Chin and Osborne, 2008). It is therefore interesting to define question processing with regard to data analysis and to find out what we can learn from the questions themselves.

AI research pays attention to three types of question-related tasks: question classification, measuring question similarity, and question answering. Question classification is a specific case of text classification. Question similarity is similar to measuring text similarity. Moreover, question answering is a combination of text classification, similarity measures, and information retrieval.

Question classification is widely used to improve question-answering performance. The questions are classified based on the type of answer. Early work like Lehnert’s taxonomy (Lehnert, 1977) has 13 semantic classes for questions. More recent work like TREC (Li and Roth, 2002) focuses on factual questions and creates a hierarchical taxonomy that separates the questions into 6 coarse classes (abbreviation, entity, description, human, location and numeric value) and 50 fine classes. However, to further analyse comprehension from the questions’ complexity, the question type categorisation needs to be general, covering factual as well as non-factual questions, regardless of the theme. According to the framework in Mohasseb et al. (2018), questions can be classified into 6 types based on grammar: confirmation, factoid, choice, hypothetical, causal, and list questions. In this paper, we did not train the model to predict the exact types as in Mohasseb et al. (2018) because neither the model nor the dataset is publicly accessible, and we are not able to label more data following the framework. Instead, we classified questions on the basis of which “interrogative word” they contain. An interrogative word is a function word for asking questions, such as what, which, when, where, who, whom, whose, why, whether and how.

Question similarity is a sub-field of measuring sentence similarity. This paper leverages a corpus-based approach that measures semantic similarity by utilising information from large corpora (Chandrasekaran and Mago, 2021). Vector representations learned from the corpora are used to encode the text; this process is also known as embedding. The embedding methods used to capture sentence similarity have evolved over the past decades, from word embeddings (Mikolov et al., 2013; Pennington et al., 2014) to sentence embeddings (Arora et al., 2016; Conneau et al., 2017) to deep contextualised embeddings (Reimers and Gurevych, 2019; Peters et al., 2018), culminating in deep neural network-based methods for measuring semantic similarity. Among these, pre-trained language models (Devlin et al., 2018; Lan et al., 2019; Liu et al., 2019) achieve top performance in capturing semantic similarity.

Question answering has long been a challenging research task in natural language understanding. We frame our question answering setting as open-domain, open-book answer retrieval: the system is designed to retrieve a concise sentence answering the question from knowledge sources like Wikipedia. Open-domain question answering, which relies on large-scale knowledge sources and machine comprehension, is more challenging than answering questions against a structured knowledge base. Previous research (Lee et al., 2019; Guu et al., 2020; Karpukhin et al., 2020; Chen et al., 2017; Yang et al., 2019) leveraged a retriever-reader or retriever-generator that retrieved the relevant passage from the knowledge source and extracted an answer span from it, where the passage can be a document, paragraph, sentence or fixed-length segment. However, this two-stage system is computationally expensive. Inspired by DenSPI (Seo et al., 2019), we encode all the sentences in the knowledge base and search for the sentence most relevant to the query. In addition, we perform approximate nearest neighbour search (Johnson et al., 2019) to reduce the search time.

Among all these tasks, pre-trained language models have achieved state-of-the-art results. The model we use, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), and its variants have shown leading performance in classification (McCreery et al., 2020; Sun et al., 2019) and question answering (Lee et al., 2019; Guu et al., 2020; Karpukhin et al., 2020). We use BERT to embed the questions into high-dimensional vectors and, through these embeddings, to recognise the topic, equivalence and potential answers of a given question.

However, the complex architecture of large language models requires substantial computational resources to train and use, and it becomes harder to interpret the internal processes of the network and the reasoning behind its decisions. We therefore leverage multi-task learning (MTL), which trains all tasks jointly and aims to improve generalisation by exploiting the domain information contained in related training tasks (Caruana, 1997).

MTL is not a new idea. McCann et al. (2018) designed an MTL model that solved all the tasks simultaneously without identifying the primary tasks. In this paper, we call the single model that can perform multiple tasks a “generalist”.

In natural language processing, MT-DNN (Liu et al., 2019) trains a multi-tasking model with a transformer encoder and task-specific layers so that it can be applied to classification and regression tasks. To adapt to various tasks, some researchers re-frame all the datasets into the same format: MQAN (McCann et al., 2018) formulates all the datasets as question answering over context, and T5 (Raffel et al., 2020) casts the tasks in a sequence-to-sequence format. All these models focus on general language understanding benchmarks like GLUE (Wang et al., 2018) and decaNLP (McCann et al., 2018).

In contrast, we focus on a range of different tasks for processing questions. Furthermore, we aim to reduce the task-specific layers in the model by fine-tuning a general embedding model that can be utilised to process questions in different tasks.

WTC Corpus Overview

We will call our dataset of open-domain questions “the WTC corpus”. This section describes its origin and main features.

The dataset originated from a project run in Bristol (UK) by “We The Curious” (henceforth WTC), an educational charity and science centre.

Data Collection

Between January 2017 and October 2019, We The Curious collected over 10,000 open-domain questions from a variety of sources: the We The Curious venue in Bristol, offsite events, and online. Offsite question gathering ensured questions were received from Bristol postcodes (BS1 - BS16), while general submissions covered all Bristol postcodes. The data came from people of all ages and backgrounds, and the collected questions seeded a digital database held by We The Curious. There were four main methods of question collection.

Project What If Cube

This was a space set up on the venue floor whereby visitors could enter a large wire cube, write down their question on a piece of paper and then attach it to the cube. Questions were routinely taken down and stored, after which they were entered by hand into the question database spreadsheet.

Curious Cube outings

This was a portable cube that displayed submitted questions in LED lights through a mirrored surface. The cube could be connected to an iPad through which questions could be entered, stored, and displayed on the cube. The team collected questions at various events around Bristol.

Question gathering

This was a series of events whereby We The Curious staff visited places of interest, such as schools and community centres, where they helped participants submit questions using question cards. Questions were then entered by hand into the question database spreadsheet.

Online input

An entry point was made available on the We The Curious website, whereby any user could enter a question digitally, and the question would be stored. When a question was entered through the website, no personal information was taken.

Apart from the online input, questions were collected from people visiting the We The Curious venue or taking part in events held by We The Curious. Participants were asked to leave any questions they were curious about, without being given any further instruction on content or topic.

We The Curious created a digital database of questions, which is in We The Curious’s possession. All the questions collected were represented verbatim. We The Curious is responsible and accountable for protecting the personal data of individuals submitting this information alongside their questions. All personal data is held by We The Curious in compliance with GDPR protocol (European Parliament, and Council of the European Union, 2016), and personal data is not shared with other parties, including the analysis team of this project. For the purpose of the present study, a smaller dataset was generated by removing all the personal data that was associated with the questions, and only this was shared with the analysts (Zhaozhen Xu and Nello Cristianini).

Data Pre-processing

Manual Curation of the WTC Corpus

Questions were first moderated manually by We The Curious staff. The questions in the database were also screened for any possible identifying information and for potentially offensive or inappropriate language or content, which were removed from the database. After moderation, the resulting dataset contained 10,073 questions.

Automated Pre-processing

The raw corpus contains repeated questions, questions of various types and topics, and some non-question sentences. Simple pre-processing is performed before content analysis, such as removing exactly identical questions and questions shorter than three words. After these steps, the filtered WTC dataset contains 8,600 questions. This pre-processed, anonymised and moderated textual dataset is what we will call the WTC corpus in this paper. Table 1 shows the statistics of the WTC corpus. The distribution of the lengths of the questions is illustrated in Fig. 1.

The word cloud in Fig. 2 shows that the questions cover the universe and space, the human body, energy and climate change, animals and plants, chemistry and materials, the future and some other topics outside of the typical science categories listed before. The size of the word is proportional to its frequency in the corpus.

Table 1 The statistics of the WTC corpus

QBERT: a Question Processing BERT

In this section, we propose a multi-task approach to train BERT for processing short questions on a diverse set of NLP tasks, such as multi-class classification (question topic classification), pairwise classification (equivalent question recognition), and regression (similar question mining and question answering).

Question Taxonomy (QT)

is a classification task that can identify the type and topic of a given question. The model categorises the questions into nine types and ten topics. The types are based on main question words like “what”, “how”, etc., and the topics depend on the content of the queries.

Equivalent Question Recognition (QE)

can identify similar questions in the corpus by calculating the distance between questions in a high-dimensional vector space. This task helps analyse the questions in two ways: reducing the corpus to unique questions for further study, and identifying the most common questions.

Fig. 1 The length distribution of the questions in the WTC corpus

Fig. 2 The word cloud generated from the curated and filtered WTC corpus. The words were lemmatised before generating the graph

Question Answering (QA)

in this article is defined as an open-domain open-book task, as there is no limitation on the questions, and it can search for the answer from an external source. Through this task, we intend to understand what kind of questions can be answered with one sentence from a knowledge base like Wikipedia with confidence.

Datasets and Metrics

There are 8,600 questions in the WTC corpus, which is a relatively small amount of data for training a deep neural network. Moreover, the datasets we use to pre-train our model consist of real-world questions from users of different online communities and search engines, mostly without human curation or processing, and are therefore similar to the WTC questions. As a result, before applying the model to the WTC corpus, we first trained and fine-tuned the neural network on some openly accessible large-scale question datasets. After achieving a fair performance on the existing corpora, we adapted the model to the WTC dataset.

Based on the tasks we choose for processing questions, we train the model on Yahoo! Answer (Zhang et al., 2015), Quora Question Pair (Csernai, 2017), and WikiQA (Yang et al., 2015).

Yahoo! Answer

generated by Zhang et al. is a corpus that includes queries, best answers and topics from the Yahoo! Answer website. There are 1,460,000 samples evenly distributed over ten topics. We utilised the Yahoo! Answer dataset for both topic classification and question answering training. To separate the dataset for these two tasks, we name the subset containing query-topic pairs Yahoo Topic (YT) and the subset containing question-answer pairs Yahoo Question-Answering (YQA).

Quora Question Pair (QQP)

is a question pair identification competition released in 2017. The dataset is widely applied for similar question classification. The questions are collected from user input on Quora, an online community for raising all kinds of questions. It includes 404,290 pairs of questions with annotations to identify whether the question pair is duplicated. There are 537,931 unique queries in the QQP dataset.

WikiQA

is a dataset of factual query logs from Bing with candidate answers from Wikipedia Summary, the first paragraph of each Wikipedia article. The queries were carefully selected to start with wh-words such as “when” and “which”, and each question required a minimum of 5 users to have clicked on a Wikipedia page after searching. The dataset includes 3,047 questions and 26,154 candidate sentences with human labelling for correct answers.

One of the main reasons we chose these datasets is that the questions from these datasets are all user-generated and sourced from widely used communities and platforms such as Yahoo! Answer, Quora, and Bing. Moreover, Yahoo! Answer and QQP provide a broad and varied sample for training the model. On the other hand, WikiQA contains a one-sentence answer extracted from Wikipedia, which aligns with the QA task we defined. The summary of each dataset and the evaluation metrics are illustrated in Table 2.

Note that the answers gathered from Yahoo! Answer are not limited to short phrases or sentences. Sometimes, the answers could be an article. In this study, we intend to train the model to extract a one-sentence answer for a given query. Therefore, we will not evaluate the models on YQA.

Table 2 Summary of the training datasets

Method

Recent research has shown that pre-trained language models achieve strong performance on language processing and understanding tasks. This article presents a fine-tuned question processing network based on the pre-trained model BERT (Devlin et al., 2018). BERT was originally trained on two tasks: next sentence prediction, which captures the relationship between sentences, and masked language modelling, which predicts masked tokens in the input. The training was performed on BookCorpus (Zhu et al., 2015) and English Wikipedia (text passages only).

In a previous study (Xu et al., 2021), we applied a BERT-based model to process questions. One of its limitations is that the model used two network structures for different tasks. The model performed QT with a single BERT structure because QT requires a single question as input and selects a label from multiple classes. However, in QE and QA, we used a Siamese BERT to take pairwise inputs, such as a (Question, Question) pair and a (Question, Answer) pair. This reduces the consistency of the model and makes it more complicated to use in practice.

To improve upon this previous study and perform all three tasks with one generalist model, we recast multi-class classification as pairwise classification by taking (Sequence, Class) as the pairwise input. During training, we minimise the distance between a sequence and its class; during inference, we retrieve the closest class to a sequence instead of categorising each sequence directly.

Our method is based on SBERT (Reimers and Gurevych, 2019), which projects an input sequence (a sentence in this case) into a high-dimensional space with a Siamese BERT architecture, so that sentences with similar meanings are close to each other in the vector space. SBERT was trained on top of BERT with extra data (Bowman et al., 2015; Williams et al., 2018; Cer et al., 2017) to capture sentence similarity. The idea of using the Siamese structure is that the model might capture the similarity between questions as well as the relationship between a question and its corresponding answer. Cosine similarity can be calculated between the sentence representations produced by the model. Additionally, we can apply our model to binary classification by introducing a similarity threshold \(\Theta \), so that

$$Label = 1, \quad \text{when} \quad Cosine\;Similarity > \Theta $$
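As an illustration of this thresholding rule, the sketch below labels a pair of sequences as related when the cosine similarity of their embeddings exceeds \(\Theta \). It uses the sentence-transformers library; the model name and the threshold value are placeholders rather than the fine-tuned QBERT.

```python
# Minimal sketch of the thresholding rule: two sequences are labelled as related
# when the cosine similarity of their embeddings exceeds theta. The model name and
# the threshold value are placeholders, not the fine-tuned QBERT.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def is_related(seq1: str, seq2: str, theta: float = 0.8) -> bool:
    """Return True (Label = 1) when Cosine Similarity > theta."""
    u, v = model.encode([seq1, seq2], convert_to_tensor=True)
    return util.cos_sim(u, v).item() > theta

print(is_related("How are clouds made?", "How do clouds form?"))
```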

By fine-tuning with the corpora described in Section 4.1, we develop a question-oriented SBERT, which we call QBERT. QBERT was first introduced by Xu et al. (2021). In this paper, we improve QBERT over our previous research by introducing a “generalist” that can achieve comparable results on multiple question processing tasks with a single model (Xu and Cristianini, 2023). Figure 3 illustrates the architecture of the model.

Fig. 3 Generalist QBERT architecture. The architecture is based on SBERT but trained to balance performance on various tasks. Top: training as binary classification; bottom: inference by calculating the cosine similarity between input sequences. All BERTs share the same parameters. U and V are the embeddings of “Question 1” and “Topic / Question 2 / Answer”, respectively

Input Layer

\(S = ( s_{1}, ..., s_{n})\) is an input sequence with n words. The sequence can be either a topic, question, sentence, or paragraph. The model takes a pairwise input \((S, S')\) such as a question pair, question-topic pair, or question-answer pair. The pairwise input is then passed to two BERTs that share the parameters.

BERT Layer

The shared embedding layer follows the setup of \(BERT_{base}\), which takes the input sequence as word tokens and generates an output for each token, as well as a [CLS] token at the beginning of the output sequence. \(BERT_{base}\) uses an encoder containing 12 layers and 110M parameters and is pre-trained with two unsupervised tasks: masked language modelling and next sentence prediction. The output of the BERT layer lies in the \(\mathbb {R}^{d}\) vector space, where, following BERT, \(d = 768\).

Pooling Layer

Similar to SBERT, the model leverages a mean pooling strategy that computes the mean of all output tokens (except [CLS]) of the sequence from BERT. After the pooling function, the model generates a pair of embeddings U and V as in (1), where \(U\in \mathbb {R}^{d}\) and \(V\in \mathbb {R}^{d}\).

$$\begin{aligned} Embedding = \frac{1}{n}\sum _{i=1}^{n} \Phi _{BERT}(s_{i}) \end{aligned}$$
(1)
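The following sketch shows how a shared BERT encoder with mean pooling (Eq. 1) could be implemented with the HuggingFace transformers library: both sides of a pair go through the same weights, padding is masked out, and [CLS] is excluded from the mean as described above. It is illustrative rather than the exact QBERT implementation.

```python
# Sketch of the shared BERT encoder with mean pooling (Eq. 1). Both sides of the pair
# go through the same weights (Siamese setup); padding is masked out and [CLS] is
# excluded from the mean, following the description in the text.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")     # shared parameters

def embed(sentences, max_len=35):
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        tokens = encoder(**batch).last_hidden_state               # (batch, seq, 768)
    mask = batch["attention_mask"][:, 1:].unsqueeze(-1).float()   # drop [CLS], mask padding
    return (tokens[:, 1:] * mask).sum(1) / mask.sum(1)            # mean over real tokens

U = embed(["How are clouds made?"])
V = embed(["Science & Mathematics"])
print(torch.nn.functional.cosine_similarity(U, V))
```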

We apply two different loss functions for different types of data: online contrastive loss for binary classification tasks that have both positive and negative samples, and multiple negatives ranking loss for information retrieval datasets that contain neither positive nor negative labels. The Adam optimiser (Kingma and Ba, 2015) minimises the loss based on the cosine similarity \(D_{cosine}(U,V)\).

Pairwise Classification Specific Loss

QBERT introduces the contrastive loss (Hadsell et al., 2006) for pairwise classification. It aims to gather positive pairs in the vector space while separating negative pairs. For embedding U, V, the loss is calculated as follows.

$$\begin{aligned} L_{contrastive}=\frac{1}{2}\left\{ Y\left( 1-D_{cosine} \right) ^{2} + \left( 1-Y \right) \left[ max\left( 0, m-(1-D_{cosine}) \right) \right] ^{2}\right\} \end{aligned}$$
(2)

where Y is the binary label. When \(Y = 1\), U and V are similar and the distance \(D = 1-D_{cosine}\) between them is minimised; when \(Y = 0\), the distance between U and V is pushed up until it exceeds the given margin m. In particular, we apply online contrastive loss, which only computes the loss over hard positive and hard negative pairs.
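A minimal sketch of this loss on the cosine distance \(d = 1 - D_{cosine}\), including the hard-pair selection of the online variant, is given below. It assumes the batch contains both labels; in practice we rely on the sentence-transformers implementation.

```python
# Sketch of the (online) contrastive loss of Eq. (2) on cosine distance d = 1 - D_cosine.
# u, v: (batch, dim) embedding pairs; y: (batch,) binary labels; assumes the batch
# contains both labels. Illustrative; in practice the sentence-transformers version is used.
import torch
import torch.nn.functional as F

def online_contrastive_loss(u, v, y, margin=0.5):
    d = 1.0 - F.cosine_similarity(u, v)        # distance per pair
    pos_d, neg_d = d[y == 1], d[y == 0]
    hard_pos = pos_d[pos_d > neg_d.min()]      # positive pairs that are still far apart
    hard_neg = neg_d[neg_d < pos_d.max()]      # negative pairs that are still too close
    return 0.5 * (hard_pos.pow(2).sum()
                  + F.relu(margin - hard_neg).pow(2).sum())
```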

Retrieval Specific Loss

One of the advantages of applying multiple negatives ranking loss is that the training dataset no longer requires explicit positive or negative labels. For a given positive sequence pair \((S_{i}, S_{i}')\), the function assumes that any \((S_{i}, S_{j}')\) with \(i \ne j\) is negative. For example, in question answering, for a question set \(Q = \{q_{1}, ..., q_{m}\}\) and answer set \(A = \{a_{1}, ..., a_{m}\}\), \((q_{i}, a_{i})\) is a positive pair given by the dataset, while \((q_{i}, a_{j})\) with \(i \ne j\) is a negative pair generated from the dataset. The cross-entropy loss over all sequence pairs is calculated as follows.

$$\begin{aligned} L_{multiple\_negative}=-\left( Y\log \left( D_{cosine} \right) + \left( 1-Y \right) \log \left( 1-D_{cosine}\right) \right) \end{aligned}$$
(3)
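The sketch below shows how the in-batch negatives described above can be scored: each question is compared against every answer in the batch and a cross-entropy objective pushes it towards its own answer. This follows the standard multiple negatives ranking formulation used in sentence-transformers; the scaling factor is an assumption.

```python
# Sketch of scoring with in-batch negatives: for positive pairs (q_i, a_i), every
# (q_i, a_j) with i != j acts as a negative. Illustrative; in practice we use
# sentence-transformers' MultipleNegativesRankingLoss.
import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(q_emb, a_emb, scale=20.0):
    # q_emb, a_emb: (batch, dim); row i of each forms a positive pair from the dataset
    q = F.normalize(q_emb, dim=-1)
    a = F.normalize(a_emb, dim=-1)
    scores = scale * q @ a.T                    # (batch, batch) cosine similarities
    targets = torch.arange(q.size(0))           # positives lie on the diagonal
    return F.cross_entropy(scores, targets)
```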

During inference, QBERT introduces a threshold filter. It calculates \(D_{cosine}(U, V)\), the cosine similarity between embeddings U and V, and applies a different similarity threshold for each task to determine whether two sequences are related in terms of topic, question equivalence, or corresponding answer. The best-performing threshold is selected after training.

As shown in previous research (Xu et al., 2021), the training curriculum is critical for multi-task question processing. In Xu et al. (2021), the tasks were trained one at a time, from QE to QA to QT (QT was trained with a different network architecture). However, the tasks learned in the earlier stages performed worse than the tasks learned later. To address this, we train QBERT with a fixed-order round-robin (RR) curriculum.

In this article, the training curriculum refers to the learning order of the tasks. During training, the data in each dataset are divided into batches \(Z = \{z_{1}, ..., z_{n}\}\). In each step, one batch \(z_{i}\) is selected at random, and the model parameters are updated by stochastic gradient descent.

With the RR curriculum, QBERT trains on all the tasks simultaneously. The data for each task are built as mini-batches and assigned to one of the two task-specific loss functions. In each step, the model is updated with batches using online contrastive loss and multiple negatives ranking loss. QBERT-RR alternates between tasks during training, preventing the model from forgetting the tasks learned at the beginning of training.
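A sketch of this round-robin setup using the sentence-transformers fit API, which cycles through the given (dataloader, loss) objectives one batch at a time, is shown below. The tiny in-line examples stand in for the QQP, YQA and YT training sets, and the warm-up value is illustrative.

```python
# Sketch of the fixed-order round-robin curriculum: fit() cycles through the objectives,
# taking one batch from each task per step. The in-line examples are tiny stand-ins for
# the real QQP / YQA / YT training sets described in Section 4.1.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("bert-base-uncased")   # plain BERT + mean pooling as a starting point

qe_data = [InputExample(texts=["How do magnets work?", "How does a magnet work?"], label=1),
           InputExample(texts=["How do magnets work?", "Why is the sky blue?"], label=0)]
qa_data = [InputExample(texts=["How are clouds made?",
                               "Clouds form when air is cooled to its dew point."])]
qt_data = [InputExample(texts=["How are clouds made?", "Science & Mathematics"])]

objectives = [
    (DataLoader(qe_data, shuffle=True, batch_size=32), losses.OnlineContrastiveLoss(model, margin=0.5)),
    (DataLoader(qa_data, shuffle=True, batch_size=32), losses.MultipleNegativesRankingLoss(model)),
    (DataLoader(qt_data, shuffle=True, batch_size=32), losses.MultipleNegativesRankingLoss(model)),
]

# One batch per objective per step, alternating between tasks (round robin).
model.fit(train_objectives=objectives, epochs=5, warmup_steps=100)
```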

During inference on the QE and QA tasks, the model applies a threshold filter rather than a task-specific loss function. During training, the model only minimises the distance between related input sequences; it does not explicitly decide whether two sequences are related. Thus, we introduce a similarity threshold that tells us whether a question pair is equivalent or whether a candidate sentence answers the question. The threshold with the best performance is found during training.

Implementation Details

Before learning on question-related tasks, the BERT layer in the model was pre-trained following (Devlin et al., 2018) and (Reimers and Gurevych, 2019).

The length of the input sequence was limited to 35 tokens; any sequence longer than 35 tokens was truncated at the end. Compared to the original BERT, which has a maximum sequence length of 512 tokens (including special tokens), we reduced the sequence length because questions are mostly short sentences; in the WTC corpus, 87.96% of the questions are within ten words. In addition, instead of concatenating the sequence pair into one input as BERT does, QBERT applies two identical BERTs to read and process the sequence pair.

Question topic classification is usually treated as multi-class classification over a single input sentence. To adapt the QT task to the QBERT Siamese network architecture, we convert topic classification into a topic retrieval task: QBERT embeds the questions as well as the topics and labels all question-topic pairs as related. During training, the model minimises the multiple negatives ranking loss between topics and questions. During inference, the model calculates the cosine similarity between the question and all candidate topics, and the candidate with the highest similarity is taken as the topic of the question.
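The sketch below illustrates QT inference as retrieval: the candidate topics are embedded once and each question is assigned the topic with the highest cosine similarity. The model name is a stand-in for the fine-tuned QBERT, and the tenth topic, Entertainment & Music, is assumed from the Yahoo! Answer dataset.

```python
# Sketch of QT inference as retrieval: embed the candidate topics once, then assign each
# question the topic with the highest cosine similarity. The model name is a stand-in for
# the fine-tuned QBERT; the tenth topic is assumed from the Yahoo! Answer dataset.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

TOPICS = ["Business & Finance", "Computers & Internet", "Education & Reference",
          "Family & Relationships", "Health", "Politics & Government",
          "Science & Mathematics", "Society & Culture", "Sports",
          "Entertainment & Music"]

topic_emb = model.encode(TOPICS, convert_to_tensor=True)

def predict_topic(question: str) -> str:
    q_emb = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, topic_emb)          # (1, n_topics)
    return TOPICS[int(scores.argmax())]

print(predict_topic("How are clouds made?"))         # expected: Science & Mathematics
```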

We train QBERT with the online contrastive loss for QE. We define the similarity threshold for QE based on the best accuracy on the training set. Then, we evaluate the model on both the QE classification and retrieval tasks. The QE retrieval candidate corpus is constructed from sampled queries in the QQP test set.

For QA, we train on WikiQA with the online contrastive loss, and on YQA and SQuAD with the multiple negatives ranking loss, because YQA and SQuAD only contain question-answer pairs without negative samples, whereas WikiQA has both positive and negative samples. Moreover, some questions in WikiQA have no answer, so a threshold is needed to identify whether the candidate closest to the question is a high-confidence answer. The threshold is defined as the one that gives the best precision on the WikiQA training set; using this threshold ensures that the retrieved candidates have a low false positive rate, in other words, that the answers selected by the model are more likely to be correct.

The implementation of QBERT is based on PyTorch and SBERT. The training parameters are shown in Table 3. The contrastive-loss margin between positive and negative samples is 0.5. We train the model for 5 epochs with a batch size of 32 and a learning rate of \(2e-5\), and 10% of the training data is used for warm-up.

Table 3 Training parameters for QBERT

We train QBERT with one GeForce GTX TITAN X GPU. Training a specialised QBERT takes 7 hours on the QQP dataset, 0.5 hours on WikiQA, 9 hours on YT, and 16.5 hours on YQA. In contrast, training the generalist QBERT takes 93 hours. Even though training the generalist QBERT is time-consuming, once trained, the model is much faster during inference: it takes 1.5ms, 5.44ms, 19.62ms, and 49.76ms per question on YT, QQP, WikiQA, and SQuAD, respectively.

After training the network, we find the best threshold for the QE task by evaluating performance on the training data. In the QE task, the model is trained to classify similar questions. During inference, the model calculates the cosine similarity between question pairs; if the embeddings of a question pair have a similarity larger than the defined threshold, the questions are considered similar. In an ideal situation, all question pairs with high similarity would be labelled 1 (similar). To find the best threshold, we sort the question pairs by similarity in descending order and use every similarity score observed on the training pairs as a candidate threshold, evaluating the accuracy of the model at each; the threshold with the best accuracy is kept for the QE task.
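A sketch of this threshold search is given below: every similarity observed on the training pairs is tried as a candidate threshold and the one with the best accuracy is kept. The numbers in the usage example are illustrative.

```python
# Sketch of the QE threshold search: each similarity observed on the training pairs is
# tried as a threshold, and the one with the best classification accuracy is kept.
import numpy as np

def best_threshold(similarities, labels):
    """similarities, labels: 1-D arrays over training question pairs (label 1 = similar)."""
    sims = np.asarray(similarities, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_acc, best_t = 0.0, 0.0
    for t in np.sort(sims)[::-1]:                 # candidate thresholds, descending
        acc = np.mean((sims > t) == (labels == 1))
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_t, best_acc

theta, acc = best_threshold([0.95, 0.62, 0.88, 0.31], [1, 0, 1, 0])   # toy values
print(theta, acc)
```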

The implementation of QBERT was built with PyTorch and the SBERT library (https://github.com/UKPLab/sentence-transformers). All the training data is openly accessible.

Performance

We evaluate QBERT with YT for QT classification, QQP for QE classification and QE retrieval, and WikiQA for QA.

Table 4 The performance of QBERT-RR compared with the performance of single-task SBERT models trained on QT, QA and QE

For QE, we evaluate classification and retrieval accuracy with the QQP dataset. In classification, a question pair is categorised as equivalent if its similarity is larger than the threshold. To perform similar question mining, we create a question corpus based on QQP: first, all questions relevant to the given query are included, ensuring that there is always a relevant question in the corpus; second, we fill the rest of the corpus with irrelevant questions, giving 104,033 samples in total. When mining similar questions from the corpus, the candidate with the highest similarity above the threshold is taken as the duplicate question.

The performance on classification tasks like YT and QQP is evaluated with accuracy on the test set. In QE, the label is predicted using the threshold. In WikiQA, by contrast, a question is not guaranteed to have an answer; the model therefore takes the sentence with the highest cosine similarity in the candidate set for each question and compares it with the threshold. The prediction is correct if the similarity is above the threshold and the sentence is labelled as a correct answer. The results are illustrated in Table 4.

SBERT was only trained on natural language inference and semantic textual similarity datasets containing labelled sentence pairs. It therefore manages to detect similar question pairs, albeit with poor performance. However, SBERT was not trained to group sentences by topic and is unable to identify the question topic. Although SBERT achieves accuracy similar to the other models on the WikiQA dataset, its F1 score is worse than theirs.

In Table 4, the models labelled QT, QE, and QA represent single-task training: they use the same architecture as QBERT, but a separate model is fine-tuned for each task, updating the BERT layer to minimise the task-specific loss on its dataset. The results show that QBERT-RR achieves comparable or better performance on most question datasets than the single-task models.

We also compare QBERT-RR with E5 (Wang et al., 2022), a state-of-the-art general-purpose embedding model that achieved strong performance in classification, clustering, and retrieval. Unlike QBERT, E5 trains its shared encoder with a prefix identifier added to the data. The results show that E5 obtains similar performance on QA, while QBERT-RR outperforms E5 on the QT and QE tasks.

In addition, we evaluate QA retrieval using WikiQA. We evaluate the performance of the retrieval task with accuracy@K, which measures whether a correct answer appears within the K candidates closest to the question. We create an answer corpus for each query in the WikiQA test set: first, all correct answers to the given query are included in the candidate corpus, ensuring that there is always a correct answer; second, we fill the rest of the corpus with irrelevant sentences from the WikiQA dataset. The results of retrieving answers from WikiQA are shown in Fig. 4.
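The sketch below shows how accuracy@K can be computed for this setting: a query counts as correct if at least one of its labelled answers appears among the K candidates with the highest cosine similarity. The toy arrays in the usage line are illustrative.

```python
# Sketch of accuracy@K: a query counts as correct if at least one of its labelled answers
# appears among the K candidates with the highest cosine similarity.
import numpy as np

def accuracy_at_k(query_emb, cand_emb, relevant, k=5):
    """query_emb: (n_q, d); cand_emb: (n_c, d); relevant[i]: set of correct candidate ids."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    scores = q @ c.T                                    # cosine similarities, (n_q, n_c)
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = [len(set(top_k[i]) & relevant[i]) > 0 for i in range(len(relevant))]
    return float(np.mean(hits))

queries = np.array([[1.0, 0.0]]); cands = np.array([[0.9, 0.1], [0.0, 1.0]])
print(accuracy_at_k(queries, cands, [{0}], k=1))        # 1.0: the correct answer is ranked first
```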

Fig. 4 Accuracy@K for different models on retrieving answers from WikiQA. There are 26k sentences in the candidate answer set

The results show that QBERT has a better performance in retrieving answers from a given corpus compared to SBERT and the single-task model. However, E5 achieves a leading performance in answer retrieval.

The corpus generated for QA retrieval has a size of 26k sentences, which is much smaller than the original Wikipedia Summary. However, we were not able to obtain the same dump of Wikipedia used in the WikiQA dataset. Searching in a larger corpus might therefore further affect the results.

WTC Corpus Analysis with QBERT

After training QBERT-RR, we apply the model to the WTC corpus for processing and understanding questions. In the WTC corpus, we have questions with various content from different people. The questions cover multiple topics and overlap in the content.

Through the analysis, we intend to understand the content of the questions, such as their types and topics, which in turn sheds light on the questioners’ interests and level of understanding. By detecting similar questions, we can filter out duplicates to reduce the workload of further analysis, and QBERT can link new questions to the existing data in the future. Repeated questions also reveal common doubts among visitors that can be used to better understand the questioners. Finally, QBERT identifies the questions that can be answered with high confidence from the Wikipedia Summary; in an educational scenario, people can then focus on the knowledge-based questions that the machine cannot answer.

Question Taxonomy (QT)

In the WTC corpus, there are confirmation questions such as “Are all babies born with blue eyes?”; factual questions such as “Who built the internet/electricity?”; and counterfactual questions such as “How long will the earth and humans last if we carry on damaging it and nothing changes?”. Different types of questions can indicate the questioner’s depth of thinking (Chin and Brown, 2000). Identifying the type and topic of the questions provides insight into the visitor’s understanding and thinking (White and Gunstone, 2014).

When categorising the questions by type, we classify them based on their grammatical form, namely their interrogative words. Each question is assigned to one of the following type categories: WHAT, WHO, HOW, WHEN, WHERE, WHY, WHICH, IF, and OTHER, which are further discussed below; keyword matching is used for type classification. For topic classification, on the other hand, we focus on the semantic topic of the questions, covering Business & Finance, Computers & Internet, Education & Reference, Family & Relationships, Health, Politics & Government, Science & Mathematics, Society & Culture, and Sports. The topic of a question is classified by QBERT-RR. With nine types and ten topics, there are 90 question “themes” to which we can allocate the more than 8,000 distinct questions that survived the various stages of filtering.

Identifying the interrogative words in a question helps identify the grammar-based question categories used in linguistic research, such as confirmation, factoid, hypothetical, and causal.

Confirmation questions are also known as yes-no questions. A yes-no question is a polar question, in contrast to non-polar questions such as wh-questions, and it expects the answer to be one of two choices. Since the answer to a confirmation question can be a simple “yes” or “no”, such questions are considered less complicated than others.

A factoid question is a question related to facts. Most of them contain question words like “what”, “who”, “which”, “when” and “how”. Usually, factoid questions can be answered with a concise sentence from a knowledge source. Both confirmation and factoid questions are mostly asked to help understand the conceptual knowledge of given content.

Causal questions usually begin with “why” or “how”. This category of questions usually requires further explanation in the answer and can be more challenging to answer with one sentence.

A hypothetical or counterfactual question is defined as a question of the type: “what would happen if X was true”. The understanding is that X is not a true fact, but the people who ask the question are considering the possible consequences of X being true. The counterfactual question takes its name from being “counter to the facts”, is often used in defining the notion of causality (e.g. Pearl, 2018), and indicates a mental process directed at understanding the mechanism behind observations. Causal and hypothetical questions are asked with further thinking, interest, and effort to make sense of the world.

Although there are overlaps between interrogative words and grammar-based question categories, classifying the questions into various types can still contribute to diagnosing people’s understanding.

Given a question, we assign it to the type of the first keyword from our list that is found in it, with one notable exception for type IF. For example, the question “Why do we get butterflies when we like someone?” is categorised as WHY regardless of any other keywords that are in the query after “why”. However, questions that contain the keyword “if”, such as “what if” and “How ... if ...” are classified as IF questions regardless of the position of “if”. The category OTHER includes yes/no questions or sequences that do not fit into other categories.

Defining a further class of IF questions is intended as a simple way to approximate counterfactual questions of the type “what if”, which are difficult to capture precisely by keyword matching but can be reasonably approximated in this context by checking for the use of the word “if”.
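A sketch of this keyword rule is given below. The treatment of the rarer interrogative words (e.g. whose, whom, whether) is not spelled out in our description, so the sketch restricts itself to the seven WH keywords plus the IF and OTHER rules.

```python
# Sketch of the keyword rule for type classification: a question containing "if" is typed
# IF regardless of position; otherwise the first interrogative word found determines the
# type; anything left (e.g. yes/no questions) falls into OTHER.
import re

KEYWORDS = ["what", "who", "how", "when", "where", "why", "which"]

def question_type(question: str) -> str:
    tokens = re.findall(r"[a-z']+", question.lower())
    if "if" in tokens:                      # IF overrides, regardless of position
        return "IF"
    for tok in tokens:                      # otherwise the first interrogative word wins
        if tok in KEYWORDS:
            return tok.upper()
    return "OTHER"                          # yes/no questions and everything else

print(question_type("What if we never went to sleep?"))                   # IF
print(question_type("Why do we get butterflies when we like someone?"))   # WHY
```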

After categorising the type of queries, QBERT-RR identifies the topic of each question in the WTC corpus. Since QBERT-RR is trained with the Yahoo! Answer dataset, it classifies the WTC corpus following the ten most popular question topics from Yahoo! Answer. It is worth noting that there are many non-scientific topics in this list, which reflects part of the initial intent of the overall project: to assess the scope and breadth of the curiosity of an entire community.

When classifying the topic, QBERT-RR embeds the ten topics and computes the cosine similarity between the question and each of them; the topic with the highest similarity to the question is assigned to it.

Figure 5 shows the frequency distribution of the questions across types and topics. The most “asked” topics in the WTC corpus are science & mathematics and society & culture, which make up 67.94% of the corpus. Besides, 50.90% of the questions are HOW and WHY questions. The theme “HOW + Science & Mathematics” contains 1,387 questions, which is the highest among all 90 themes.

Fig. 5 The number of topics and types in WTC. Left: all the types and topics excluding Science & Mathematics. Right: type count for Science & Mathematics

Equivalent Question Recognition (QE)

We identify the equivalent questions in the corpus by mapping the questions into the 768-dimensional embedding space with QBERT-RR. These representations can be compared using cosine similarity in the embedding space. Two questions are deemed to be equivalent if their cosine similarity is above the given threshold.

To better understand the performance of QBERT-RR in finding equivalent questions, we labelled some question pairs in the WTC corpus. To generate question pairs, we first randomly sampled 1,000 questions. The questions were embedded with an SBERT trained only on the NLI dataset (Williams et al., 2018). For each question, we selected the ten questions in the corpus with the largest cosine similarity to it, giving \(1000\times 10\) candidate question pairs. For duplicate pairs such as [Q1, Q2] and [Q2, Q1], we kept only one for annotation.

Two annotators labelled the 5,022 candidate question pairs, 2,511 pairs each. In addition, each annotator labelled another 30 question pairs for measuring inter-annotator agreement (IAA). Questions that can be answered with the same answer are considered equivalent. The question pairs were labelled with 0 or 1, where 0 represents different and 1 similar. Of the labelled question pairs, 728 are equivalent and 4,294 are different. The Cohen’s Kappa (Cohen, 1960) between the two annotators is 0.75. Table 5 provides some examples of the data we labelled.

Table 5 Examples from labelled WTC question pairs and results

When the cosine similarity threshold is 0.809, QBERT-RR obtains 90.76% accuracy on the sampled WTC data, which is similar to its performance on the QQP dataset (90.13% accuracy). This shows that QBERT-RR can be applied to a different corpus of unseen questions. We also compare the classification results with SBERT and SBERT-QE. The results are shown in Table 6.

Table 6 Results of equivalent question recognition for the WTC corpus

Another experiment we perform to recognise equivalent questions is applying a “graph community detection” method (Clauset et al., 2004; Hagberg et al., 2008) to group similar questions. A graph is built with the questions as nodes, using the cosine similarity matrix: an edge connects two nodes if the cosine similarity between the pair of questions is larger than the threshold. We find 5,930 communities in the WTC corpus, representing 5,930 distinct questions; of these, 5,337 questions do not belong to any group of similar questions.
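The sketch below illustrates this procedure with NetworkX: questions become nodes, an edge is added whenever the cosine similarity exceeds the threshold, and communities are found with the greedy modularity algorithm of Clauset et al. (2004). The embedding model is a stand-in for QBERT-RR and the example questions are illustrative.

```python
# Sketch of grouping equivalent questions: connect questions whose embedding cosine
# similarity exceeds the threshold, then find communities with the greedy modularity
# algorithm (Clauset et al., 2004) as implemented in NetworkX.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")       # stand-in for QBERT-RR
questions = ["How did the universe begin?",
             "How did the universe come about?",
             "Why do cats purr?"]

emb = model.encode(questions, convert_to_tensor=True)
sim = util.cos_sim(emb, emb)

G = nx.Graph()
G.add_nodes_from(range(len(questions)))
threshold = 0.809                                      # QE threshold from Section 5.2
for i in range(len(questions)):
    for j in range(i + 1, len(questions)):
        if sim[i, j] > threshold:
            G.add_edge(i, j)

for community in greedy_modularity_communities(G):
    print([questions[i] for i in community])
```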

Question Answering (QA)

To find out if the model can answer the WTC questions with high confidence, QBERT-RR retrieves a sentence as the answer from an unstructured knowledge base, Wikipedia Summary (Scheepers, 2017).

Wikipedia Summary includes the title and first paragraph of each Wikipedia article, extracted in September 2017. The raw text of Wikipedia initially contains 116M sentences, of which 22M are in the summaries; after embedding with QBERT-RR, around 21M distinct sentences remain. The summary provides each article’s primary information while reducing the number of sentences by about 80% compared with the full Wikipedia text.

When retrieving answers from a large-scale knowledge source, it is important to ensure the answer is both the closest to the question and of high confidence. We define confidence through the cosine similarity threshold: if the similarity between the question and the candidate sentence is larger than the threshold, the candidate is considered the answer to the given question. We use the threshold calculated from the WikiQA dataset in Section 4.4 as the confidence threshold for the WTC corpus. The best-scoring sentence from Wikipedia Summary with a cosine similarity higher than 0.795 is taken to be the correct answer for a given question.

Seven GeForce GTX TITAN X GPUs are used to embed all the sentences from the Wikipedia Summary with QBERT-RR. It takes 1.5 hours to encode all the 21M sentences. Due to the scale of the dataset, it is time-consuming to perform an exhaustive search among all candidates. Instead, we locate the answers to questions by using an approximate approach: approximate nearest neighbour (ANN), which trades off accuracy for searching speed. When searching with ANN, it takes a few hours to build the index for the candidates at the beginning. Nevertheless, it takes less than a second per query to search.

The Wikipedia Summary index is trained and built in 4 hours using an inverted file with exact post-verification (Johnson et al., 2019). Once the index is built, searching the candidate answers for all 8,600 WTC questions takes 3 minutes on one GPU, an average of 0.02s per question.
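A sketch of such an index with FAISS (Johnson et al., 2019) is shown below: an inverted-file (IVF) index over L2-normalised embeddings, so that inner product equals cosine similarity, followed by a confidence check against the 0.795 threshold. The random embeddings and the nlist/nprobe values are stand-ins, not the settings used for the 21M-sentence index.

```python
# Sketch of the ANN answer search with FAISS: an inverted-file (IVF) index over
# L2-normalised embeddings, so inner product equals cosine similarity. Embeddings are
# random stand-ins here; nlist/nprobe values are illustrative.
import faiss
import numpy as np

d = 768                                                 # embedding dimension (BERT base)
corpus = np.random.rand(100_000, d).astype("float32")   # stand-in for 21M Wikipedia sentences
faiss.normalize_L2(corpus)

nlist = 1024                                            # number of inverted lists (coarse clusters)
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(corpus)                                     # learn the coarse clustering
index.add(corpus)
index.nprobe = 32                                       # clusters visited per query

query = np.random.rand(1, d).astype("float32")          # stand-in for a question embedding
faiss.normalize_L2(query)
scores, ids = index.search(query, 1)
if scores[0, 0] > 0.795:                                # confidence threshold from Section 4.4
    print("answer sentence id:", ids[0, 0])
```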

After filtering for high-confidence answers, 463 questions in the corpus can be matched with an answer from the Wikipedia Summary using QBERT-RR. The numbers of answered questions across types and topics are shown in Fig. 6.

Fig. 6 Number of questions in the WTC corpus that can be answered with high confidence over the size of the groups

Result Discussion of the WTC Corpus

“Project What If” was launched in 2017 across Bristol and involved thousands of people. It aimed to explore the questions of Bristolians rather than the answers, to see what they said about the local community. It also aimed to observe similarities and differences in people’s curiosity regardless of age, gender, and geography. By not setting any rules or prescribing what topics to explore, it was hoped the science centre’s exhibitions and educational content might better reflect the interests of its community. In this section, we discuss the results produced by QBERT-RR and present further analysis of the data.

The automated QT analysis of the corpus revealed that more than half of the questions are in the domain of Science & Mathematics (55.66%), followed by Society & Culture (12.28%), and then by Health (7.77%). The most frequently asked type of question is the type WHY (26.34%) followed by HOW (24.56%).

It is expected that most of the questions fall under the Science & Mathematics domain, as many were collected in the science centre setting or during experiences facilitated by WTC staff, so people were already in that frame of mind. Apart from science-related subjects, the questioners are most interested in Society & Culture. The results also show that the questioners are least interested in Computers & Internet related subjects.

The word cloud in Fig. 7 illustrates the high-frequency words for each topic. We notice that words like “people” and “human” have a high frequency in most topics. Humans are naturally related to all these topics. Therefore, in order to better understand other information unique to each topic, we remove “human/people” as well as the stop words while visualising the topics. The size of the words is proportional to their frequency in the corpus.

Fig. 7 High-frequency words in each topic

Word clouds give us an insight into the key information held in different topics by showing high-frequency keywords. In Computers & Internet, the keywords include “computer”, “internet”, “game”, and “Fortnite”, which do not appear in other topics. In Family & Relationships, the most frequent word is “love”, and the topic also includes many words about feelings and relationships. Since Science & Mathematics is a broad topic, its keywords cover a wide variety, such as “earth”, “animal”, and “planet”. These observations again show that QBERT-RR is able to distinguish the topics of the questions.

On the other hand, the type classification also reveals some interesting patterns. More than 50% of the questions are of type WHY or HOW, which are causal questions, suggesting that the questioners engaged in deeper thinking when asking them. Among factoid questions, WHAT questions are far more numerous than other types such as WHO, WHEN, and WHERE.

Since the questions were collected under “Project What If”, people were inspired to ask IF questions. Browsing the IF questions, we observed that most are counterfactual, such as “What if we never went to sleep?” and “If you could hear in space, how loud would the Sun be?”. However, the corpus also contains some factual questions of type IF, as would be expected in a science centre setting, for example “I’d like to know if atoms are made up of other atoms.” Factual questions of type IF could be filtered out in future work.

Furthermore, we calculated \(P(type, topic)\) and \(P(type)*P(topic)\) to understand the associations between type and topic. Figure 8 illustrates the associations. The question type WHO is strongly associated with Education & Reference and with Sports, because \(P(WHO, Sports)\) is 3 times larger than \(P(WHO)*P(Sports)\). The topic Education & Reference is strongly associated with the types HOW, WHAT, WHEN, and WHO. The type IF associates strongly with Politics & Government, but not with Education & Reference. We also observe that people expected more explanation on the Health and Family & Relationships topics, because type WHY associates more with these two topics than with others.
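The sketch below shows how this association score, \(P(type, topic)/(P(type)*P(topic))\), can be computed from the per-question (type, topic) labels; values above 1 indicate that the combination occurs more often than the two labels would independently suggest. The toy label list is illustrative.

```python
# Sketch of the type/topic association score P(type, topic) / (P(type) * P(topic)),
# computed from per-question labels; values above 1 indicate a positive association.
from collections import Counter

def association(pairs):
    """pairs: list of (type, topic) labels, one per question."""
    n = len(pairs)
    joint = Counter(pairs)
    type_counts = Counter(t for t, _ in pairs)
    topic_counts = Counter(tp for _, tp in pairs)
    return {(t, tp): (c / n) / ((type_counts[t] / n) * (topic_counts[tp] / n))
            for (t, tp), c in joint.items()}

toy = [("WHO", "Sports"), ("WHO", "Sports"), ("WHY", "Health"),
       ("HOW", "Science & Mathematics"), ("WHY", "Health"), ("HOW", "Sports")]
print(association(toy)[("WHO", "Sports")])   # 2.0: WHO and Sports co-occur twice as often as expected
```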

Fig. 8 The associations between types and topics for the WTC corpus. The association is calculated as \(P(type, topic)/(P(type)*P(topic))\)

One limitation of using QBERT-RR for topic classification is that it only uses the top 10 topics from Yahoo! Answer, regardless of other possible topics. Indeed, we notice that many questions do not belong to any of the groups in the Yahoo! Answer dataset. This could be improved by training the model with more question topics or by introducing a threshold to identify questions that are not close to any topic.

When clustering similar questions by distance with graph community detection, we find that one of the groups has 858 questions, which are, by construction, supposed to share similar meanings within the same community. This happens because queries are connected with an edge when their cosine similarity is larger than the threshold, yet the questions are not fully connected within the group, so not all questions in the subgraph are equivalent. For example, a group may contain questions \([Q_{1}, Q_{2}, Q_{3}]\) in which \((Q_{1}, Q_{2})\) and \((Q_{2}, Q_{3})\) are each connected by an edge, but the similarity between \(Q_{1}\) and \(Q_{3}\) is smaller than the threshold, meaning they are not an equivalent pair and are not connected. In the remainder of this section, we analyse similar questions excluding this particular group of over 800 queries.

The most popular questions in the WTC corpus are related to life outside of Earth. There are 144 of them, such as “When will we find intelligent life in the universe?” and “Is there any life on any other planets or solar systems in the universe?”. These are followed by 143 questions related to the end of life or of the world, such as “What would happen if all humans were extinct and the world stopped spinning?” and “What does it feel like on the moment you die?”. Overall, 26 groups of questions have more than 10 similar queries within the dataset, and 3,263 questions in the WTC corpus have at least one similar query. Thus, applying similar question detection can help the data collector reduce the corpus size for further processing.

With QBERT-RR, we find answers to 463 out of 8,600 questions in the WTC corpus. However, we notice that the answers with the highest confidence (over 0.92 cosine similarity) are equivalent questions that QBERT-RR retrieves from the knowledge source. For example, QBERT-RR is tricked by a sentence in Wikipedia Summary, “How did the universe come about?”, and considers it the answer to the question “How did the universe begin?”. We treat such questions retrieved from Wikipedia as false positive answers.

To improve the answer retrieval, we evaluate the second closest sentence whenever the closest candidate ends with “?”. After filtering out these false positive answers, there are 300 questions that QBERT-RR can answer. The percentage and number of questions that can be answered for each type and topic are illustrated in Fig. 9, followed by a sketch of this filtering step.

Fig. 9: Percentage of the questions answered by QBERT-RR
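The filtering step described above can be sketched as follows: rank candidate sentences from the knowledge source by cosine similarity and skip any candidate that is itself a question, i.e. ends with a question mark. The retrieval call and data layout here are illustrative; the actual system pairs the embeddings with approximate nearest-neighbour search over the Wikipedia Summary sentences.

```python
# Sketch: skip retrieved candidates that are themselves questions (the false
# positives discussed above) and return the best remaining sentence.
# The retrieval call and data layout are illustrative, not the exact pipeline.
from sentence_transformers import util

def retrieve_answer(question_emb, corpus_emb, corpus_sentences, top_k=3):
    hits = util.semantic_search(question_emb, corpus_emb, top_k=top_k)[0]
    for hit in hits:                                  # hits are sorted by score
        sentence = corpus_sentences[hit["corpus_id"]]
        if sentence.rstrip().endswith("?"):
            continue                                  # an equivalent question, not an answer
        return sentence, hit["score"]
    return None, 0.0                                  # no acceptable candidate found
```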

Compared to IF questions and confirmation questions (included in type OTHER), WH questions, such as WHICH, WHAT, HOW, WHERE, and WHO, are more likely to be answered from Wikipedia Summary. By design, QBERT returns a single sentence from Wikipedia as the answer. Factoid questions, which can be answered with facts expressed in a short and concise sentence, are therefore more likely to be answered, which explains why WH questions have higher answer rates. Conversely, non-factoid questions, such as some WHY or IF questions, require more explanatory answers and are harder to match with a one-sentence answer from a knowledge source.

The questions under the topics of Education & Reference, Science & Mathematics, Sports, and Business & Finance are more likely to be answered confidently from Wikipedia Summary. On the other hand, QBERT-RR can only answer around 0.95%, 1.23%, 1.20%, and 1.65% of the questions in Society & Culture, Politics & Government, Health, and Family & Relationships, respectively.

Here are some examples of the answers given by QBERT-RR. The questions are from the WTC corpus, and the answers are extracted from the Wikipedia Summary. For each question, we include the top 3 candidate answers for comparison.

  • Q1: Hello is a invented word but what does it mean?

    • A1-1: Hello is a greeting in the English language. (Score: 0.895)

    • A1-2: Hello is a salutation or greeting in the English language. (Score: 0.867)

    • A1-3: It is similar in meaning to the English word Hello!. (Score: 0.845)

  • Q2: How are clouds made?

    • A2-1: On Earth, clouds are formed as a result of saturation of the air when it is cooled to its dew point, or when it gains sufficient moisture (usually in the form of water vapor) from an adjacent source to raise the dew point to the ambient temperature. (Score: 0.873)

    • A2-2: The precise mechanics of how a cloud forms and grows is not completely understood, but scientists have developed theories explaining the structure of clouds by studying the microphysics of individual droplets. (Score: 0.806)

    • A2-3: A cloud is a visible mass of condensed droplets or frozen crystals suspended in the atmosphere. (Score: 0.778)

  • Q3: What is a black hole?

    • A3-1: A stellar black hole (or stellar-mass black hole) is a black hole formed by the gravitational collapse of a massive star. (Score: 0.920)

    • A3-2: Black hole may also refer to: (Score: 0.846)

    • A3-3: A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing, not even particles and electromagnetic radiation such as light, can escape from inside it. (Score: 0.837)

  • Q4: Who discovered pluto?

    • A4-1: It was discovered by a team of astronomers from the Institute for Astronomy of the University of Hawaii led by David Jewitt and Scott S. Sheppard and Jan Kleyna in 2001, and given the temporary designation S/2001 J 3. (Score: 0.901)

    • A4-2: It was discovered by Nicholas Miller. (Score: 0.897)

    • A4-3: It was discovered by Scott S. Sheppard, et al. (Score: 0.895)

From the question-answer pairs retrieved by QBERT-RR, we notice that the retrieved answer often addresses a specific instance rather than the general case a human would have in mind. For Q3 in the examples, the top answer describes a “stellar black hole” instead of answering the question about black holes in general.

QBERT-RR is only able to retrieve the single most relevant sentence as the answer; however, the full answer is not always contained in one sentence. For example, for (Q4, A4-1), the answer itself might look promising. Nonetheless, the original summary containing A4-1 reads “Hermippe, or Jupiter XXX, is a natural satellite of Jupiter. It was discovered concurrently with Eurydome ...and given the temporary designation S/2001 J 3.”. In this case, QBERT-RR fails to capture the entity “Hermippe”, which is also essential for answering the question.

While performing question answering, we observed one noteworthy case. For the question “What is two plus two?”, QBERT-RR returns “2 + 2 = ?” as the answer. Even though this is not the correct answer, it is interesting to see that QBERT-RR “translates” the text into mathematical notation. This behaviour can be further investigated in future research.

Conclusion

People are always curious and full of doubts. Over the past ten years, questions have been collected from a wide variety of sources, and access to information has become easy and comprehensible through digital and online channels. Educational institutions such as schools and museums, where much of this curiosity and questioning arises, have therefore been challenged to adapt to this changing environment. What can we do with these questions, and what can we learn from them?

With these questions in mind, we introduced QBERT for question content analysis, which can perform question taxonomy, equivalent question recognition, and question answering. We applied the model to analyse a new question corpus collected by the WTC Science Centre and observed that the generalist model QBERT-RR performs similarly to the specialists on QT, QE, and QA.

By applying QBERT, we identified the common questions and answered 463 questions in the WTC corpus. The model also revealed the curiosity of the contributors by categorising their questions into 90 themes.

In the results, we see that the contributors to “Project What If” were very interested in Science, Society, and Health, and asked many questions of the WHY and HOW types. This is not surprising within the setting of WTC as a science institution. But the next finding revealed a lot more: questions of type IF tend to relate to Politics & Government topics rather than to Education & Reference topics. Curiosity about Society & Culture is also very revealing. This is an emerging theme in the science centre sector, where there is an ongoing discussion about expanding from Science Centres to Science & Cultural Centres. More generally, there is currently a movement in the sector to explore society and culture alongside traditional sciences such as Biology, Engineering, and Chemistry. This seems to be reflected in the kind of questions Bristolians have been asking.

We believe that AI models specialised in question processing can benefit educational institutions in the future. For teaching, teachers can gauge students’ comprehension level from the difficulty of the questions they ask. Teachers can also reduce their workload by letting the machine answer some of the questions and identify the most common and essential problems to work on. With AI software that identifies question difficulty, students can also use quizzes at different levels to improve their learning, from conceptual questions, to questions that require deeper comprehension of the content, to questions that extend knowledge beyond the learning material.

This study demonstrated that BERT-based models can comprehend contextual information and contribute to understanding questions. More powerful models, such as GPT (Radford et al., 2018; Brown et al., 2020) and LLaMA (Touvron et al., 2023), can be leveraged in future studies to further analyse questions. At the same time, the nature of large language models makes it more challenging to understand how these models arrive at specific conclusions or responses. Future research can also focus on explaining the decisions made by these models, especially in critical applications.