Introduction

Question classification is an essential task in the natural language understanding module of question answering (QA), which aims to classify questions into pre-defined intent categories. Previous studies [1,2,3] indicate that an efficient question classification module helps to restrict the search space for finding answers, thus reducing search costs. A typical example is “What’s the capital of China?”. The real intent of this question is “locations”. Therefore, the candidate answers are restricted to location entities related to “China” and “capital”, a much smaller space than the entire search space. Moreover, an efficient question classification module can also guide the QA system to select the optimal knowledge base and search strategy. For example, “Who is the CEO of Facebook?”, which focuses on the relation between entities, is well suited to searching for answers in a knowledge graph, whereas a question such as “Why will Facebook never charge you?”, whose intent is “reasons”, is best answered by searching a web knowledge base.

In the early stage, many rule-based methods [4, 5] tried to match questions with hand-crafted templates to determine the question category. However, plenty of rules need to be manually pre-defined for different cases, which is time-consuming and labor-intensive. With the rapid development of deep learning and lexical embedding techniques, neural networks have made great breakthroughs in natural language processing, especially convolutional neural networks (CNNs) [6, 7] and recurrent neural networks (RNNs) [8,9,10]. RNN-based models view text as a sequence of words and are intended to capture word dependencies in text. Bai et al. [11] proposed a positional RNN model that considers aspect word position information for text classification. Therasa et al. [12] introduced an adaptive RNN model with feature optimization to learn question representations with faster convergence and avoid local optima. Both achieved great improvements in results. CNN-based models are trained to identify textual patterns, such as key phrases and text structure. Soni et al. [13] presented a CNN-based architecture that applies two-dimensional multi-scale convolutional procedures to extract intra- and inter-sentence features from input text data. Tan et al. [14] proposed an adaptive CNN model that adaptively generates convolutional filters to project word embeddings into the same subspace. In addition, hybrid models [15] combining CNNs and RNNs have also been proposed. Ma et al. [16] put forward a hierarchical convolutional recurrent neural network for Chinese question classification, which combines TextCNN and Bi-LSTM to learn human-understandable concepts in a hierarchical structure. However, RNNs are susceptible to gradient vanishing or explosion and have high time complexity due to their recursive nature, while CNNs do not consider sequence order and fail to capture long-range dependencies. In recent years, the self-attention mechanism [17, 18] has been widely used in question classification for its strength in capturing global dependencies. Liu et al. [19] proposed a multi-stage attention model based on a temporal CNN, capturing context-related features at the word and concept levels. Zheng et al. [20] constructed a deep learning model combining RNNs and an attention mechanism, in which the RNNs generate semantic features and the obtained features are weighted in accordance with the attention mechanism. However, the vanilla self-attention mechanism models all signals with weighted averaging, which is prone to overlooking the relation of neighboring signals [21]. Moreover, recent studies [22, 23] have demonstrated that fully utilizing part-of-speech (POS) information yields additional semantic improvements in sentence representations.

Accordingly, in this paper, we propose a POS-aware adjacent relation attention network (POS-ARAN) for question classification, which enhances context representations with POS information and neighboring signals. Specifically, we propose an adjacent relation attention mechanism (ARAM), which revises the vanilla self-attention mechanism by incorporating a Gaussian bias via a dynamic non-symmetrical window. In this way, additional contextual information between neighboring words can be captured without affecting long-term dependencies. In addition, a POS-aware embedding layer is proposed, which helps to locate the appropriate headwords via syntactic information. Extensive experiments are conducted on the Experimental Data for Question Classification (EDQC) dataset and Yahoo! Answers Comprehensive Questions and Answers 1.0. The results demonstrate that our model significantly improves performance, achieving 95.59% coarse-grained accuracy and 92.91% fine-grained accuracy, respectively.

Our contributions are summarized as follows:

  • We propose an adjacent relation attention mechanism (ARAM). The ARAM revises the original self-attention mechanism by integrating a learnable Gaussian bias within a dynamic window, which enables it to capture additional contextual information between neighboring words.

  • We propose a POS-aware embedding layer, which helps to locate the appropriate headwords by syntactic information for textual content understanding.

  • We conduct our experiments on widely used question classification datasets, and the experimental results show that our proposed model achieves better performance than previous state-of-the-art models.

The remainder of this article is structured as follows. In Sect. “Related work”, relevant studies on this task are presented. In Sect. “Method”, we formalize the definition of question classification and fully describe the method and architecture of POS-ARAN. In Sects. “Experiments” and “Discussion”, we present the results of the comparison and ablation experiments to show that our model is competitive, together with a brief analysis. Finally, in Sect. “Conclusion”, some conclusions are summarized.

Related work

Question Answering (QA) is a vital task in natural language processing (NLP) [24, 25]. The goal is to build systems that can automatically answer questions raised by humans in natural language [26]. Question classification is a key sub-task of QA systems, which aims to map a question into a certain category. Various techniques have emerged to solve question classification issues. These techniques can be divided into three groups: rule-based techniques, machine learning techniques, and deep learning techniques.

Rule-based approaches

Initially, most question classifiers followed rule-based strategies, such as the Webclopedia QA Typology [27], which includes 276 hand-written rules corresponding to 180 answer types. Dragomir et al. [28] employed Ripper and a heuristic rule-based algorithm to identify the question type. Kwok et al. [29] introduced MULDER, which can determine the question type just by looking at the question’s interrogative pronoun. Silva et al. [30] evaluated a rule-based question classifier that either directly matched a question to a category or identified the question’s headword and mapped it to a category using WordNet. Although rule-based methods can achieve good results in specific domains without the need for large amounts of training data, they have been phased out, as varied expressions can significantly increase the number of rule templates and consume a lot of resources and time.

Machine-learning-based approaches

With the development of machine learning, question classification approaches gradually shifted away from manual and expert rules. Most of these methods are based on supervised statistical machine learning. Huang et al. [31] used a support vector machine (SVM) with a linear kernel to classify questions and achieved an accuracy of 89.2% on the Text REtrieval Conference (TREC) dataset. Zhang et al. [32] designed an SVM with a tree-like custom kernel, which enabled their model to achieve an accuracy of 90.0% on the TREC dataset. In addition to SVMs, researchers have also employed the maximum entropy model in question classification. Kocik et al. [33] proposed a maximum entropy model on TREC and achieved an accuracy of 89.8%. Le Nguyen et al. [34] put forward a subtree mining algorithm that uses subtrees of parse trees as features and combines them with a maximum entropy model to classify questions. Furthermore, the Sparse Network of Winnows (SNoW) is an alternative, which trains an independent linear function for each class by updating rules. Li et al. [35] introduced a hierarchical classifier that first assigns coarse tags to questions; these tags and other features are then passed to the subsequent level of the hierarchy for classification. They attained an accuracy of 89.3% on the TREC dataset.

Deep-learning-based approaches

Compared with traditional machine learning, one of the improvements of deep learning lies in word embedding. Traditional word embedding models such as the N-gram model are prone to requiring data smoothing due to data sparsity. Furthermore, the word vectors obtained from such models can reflect neither the diversity of words nor the connections between words. In contrast, distributed representation methods such as word2vec use low-dimensional, dense vectors to represent the semantic information of words, effectively solving the above problems. Yilmaz et al. [36] used word2vec with different deep learning architectures to study an agglutinative language, achieving better performance than traditional methods on multiple datasets. Although some studies still used n-gram features, they employed CNNs to compensate for the shortcomings. Kim et al. [37] proposed a CNN-based classification model to capture the n-gram features of text; the model is simple and consists of only five layers. Soni et al. [13] presented a CNN-based architecture that applies two-dimensional multi-scale convolutional procedures to extract intra- and inter-sentence features from input text data. Tan et al. [14] proposed an adaptive CNN model that adaptively generates convolutional filters to project word embeddings into the same subspace. RNNs are also widely used to capture word dependencies in text. Zhou et al. [38] used 2D convolution and 2D pooling layers to obtain the representation of the Bi-LSTM output; in this way, their model could capture more sequence features and vector features simultaneously. Similarly, Wu et al. [39] introduced two Bi-LSTMs to generate hidden state representations of the question and answer text, respectively. Cai et al. [40] proposed a classification model based on a CNN-LSTM network focused on the medical field; the experimental results on the Health Care Quality Indicators (HCQI) dataset proved the efficiency of their model. Bai et al. [11] proposed a positional RNN model that considers aspect word position information for text classification. Therasa et al. [12] introduced an adaptive RNN model with feature optimization to learn question representations with faster convergence and avoid local optima. With the advent of the Transformer, the self-attention mechanism has attracted considerable attention. It can generate the weights of different connections dynamically, enabling it to handle long-term dependencies in sentences. Liang et al. [41] proposed a novel SVA-CNN deep learning architecture, which leverages a multi-view representation of text to learn high-level features; spatial attention and view attention mechanisms were simultaneously proposed to preserve the latent interaction among different-granularity semantic groups. Liu et al. [19] proposed a multi-stage attention model based on a temporal CNN, capturing context-related features at the word and concept levels. Zheng et al. [20] constructed a deep learning model combining RNNs and an attention mechanism, in which the RNNs generate semantic features and the obtained features are weighted in accordance with the attention mechanism.

However, because conventional self-attention models consider all words in a sequence, the relation among neighboring words is weakened by the weighted averaging. Neighboring words frequently carry a wealth of information about the words close to them, which is crucial for models to understand the semantics of natural language. To solve this problem, POS-ARAN, which incorporates POS information and enhances the model’s comprehension of local context, is proposed in this paper.

Method

Problem definition

Question classification can be defined as follows. The initial input is a question Q

$$\begin{aligned} Q = \left\{ q_1,q_2,...,q_m\right\} , \end{aligned}$$
(1)

where m denotes the number of words in the input question. Then, POS-ARAN performs a nonlinear transformation on Q and gains the result \({\mathcal {R}}\)

$$\begin{aligned} {\mathcal {R}} =\left\{ r_1,r_2,...r_n\right\} , \end{aligned}$$
(2)

where n denotes the number of intent categories. \({\mathcal {R}}\) can be regarded as the predicted scores for n intent categories. Finally, POS-ARAN selects the category with the highest score \(r_i\) as the outputted intent category

$$\begin{aligned} r_i =max\left\{ r_1,r_2,...r_n\right\} ,i\in \left[ 1,2,...,n\right] . \end{aligned}$$
(3)
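To make the formulation concrete, the prediction step in Eq. (3) reduces to selecting the index of the maximum score in \({\mathcal {R}}\). The following minimal sketch (our own illustration, not the authors’ code; the category names are the coarse-grained EDQC labels used later) shows this selection in Python:

```python
import numpy as np

def predict_intent(scores, categories):
    """Select the intent category with the highest predicted score, cf. Eq. (3).

    scores:     array of n predicted scores r_1..r_n produced by the model
    categories: list of the n intent category names
    """
    i = int(np.argmax(scores))          # index i of the maximum score r_i
    return categories[i]

# Hypothetical score vector over the six coarse-grained EDQC categories
labels = ["ENTY", "LOC", "NUM", "DESC", "HUM", "ABBR"]
print(predict_intent(np.array([0.03, 0.87, 0.02, 0.04, 0.03, 0.01]), labels))  # -> LOC
```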

Overall architecture

In this section, we introduce the overall architecture of our POS-ARAN model, which is shown in Fig. 1. The model first receives an input question and transforms the sequence into a word vector matrix. The POS-aware embedding layer presented in this paper is based on GloVe, which has been pre-trained on Wikipedia 2014 and Gigaword 5, containing over 6 billion tokens. By calculating the similarity of the original word to the rest of the words in the sentence, alignment probabilities are generated as attention values. Among them, the word with the maximum alignment probability is defined as the headword. Then, the word vector matrix is sent to the ARAM module, the main module of POS-ARAN. It applies revised self-attention to capture the long-term dependencies of the headword, and during this process the adjacent relation of each word is encoded to strengthen the short-term dependencies of the headword. Finally, the encoded matrix is passed to a softmax layer to obtain the final category probabilities.

Fig. 1 The whole architecture of the proposed POS-ARAN model

Fig. 2 Distribution correction of symmetrical window and non-symmetrical window

Adjacent relation attention mechanism

As shown in Fig. 2, the original attention distribution treats the same word at various distances almost equally, which is inconsistent with human cognition of natural language: the neighboring words of the headword can provide richer semantic information. Namely, we hope the original word \(x_i\) can remain highly relevant to the neighboring words of the headword \(x_j\) when \(x_i\) is aligned with \(x_j\). In this way, long-term dependencies and local information can be kept at the same time.

The conventional self-attention takes all of the words in the sentence into consideration, so the weighted averaging weakens the short-term dependency between adjacent words. Hence, we introduce the idea of revising the original attention distribution to take the expected local information into account. In addition, the ideal correction should only adjust the distribution at the necessary positions, which avoids unnecessary interference with the original distribution.

In this paper, we hypothesize that the semantic contribution to the headword from tokens at different distances obeys a normal distribution. We choose the normal distribution because it is hard to statistically measure the semantic importance of one word with respect to another. Compared with other decaying mechanisms, e.g., linear decay with distance [42], or other distributions such as the Zipf distribution, studies demonstrate that the Gaussian assumption [43] works better. Specifically, we learn a pair of Gaussian biases with a non-symmetrical window instead of a symmetrical window, and add these biases to the attention distribution. As shown in Fig. 2, when self-attention aligns “Where” with “capital”, it should focus not only on “capital” but also on the words adjacent to “capital”. Figure 2a applies the traditional symmetrical window centered on “capital”. However, not all words in the window are related to the headword; for example, “is” and “the” have almost nothing to do with “capital” and should not receive much attention. Directly using a symmetrical window may therefore attend to some irrelevant words. As shown in Fig. 2b, POS-ARAN applies the non-symmetrical window to correct the distribution. It is obvious that more attention is paid to “of” and “China”, which is more reasonable.

Given an input sequence X=\(\left\{ x_1,x_2,...,x_n\right\} \), self-attention maps it to a target sequence H=\(\left\{ h_1,h_2,...,h_n\right\} \), where X \(\in \) \({\mathbb {R}}^{d\times n}\), H \(\in \) \({\mathbb {R}}^{d\times n}\), and d and n are the dimension of the hidden layers and the length of the sequence, respectively. To implement our non-symmetrical window strategy, we split the whole window D into \(D_{Left}\) and \(D_{Right}\), which represent the windows on both sides of the headword \(x_j\). In most cases, \(D_{Left}\) differs from \(D_{Right}\), and the extra distribution can be defined as Eq. (4)

$$\begin{aligned} \left\{ \begin{array}{l} {\theta _{i,j} = \left| \frac{Cov\left( P_{i},j\right) }{\sigma _{i} \cdot \sigma _{j}} \right| - 0.5} \\ {{GL}_{i,j} = max\left( 0,\theta _{i,j}\right) \frac{2\theta _{i,j} \left( {j - P_{i}}\right) ^{2}}{D_{Left}^{2}}} \\ {{GR}_{i,j} = max\left( 0,\theta _{i,j}\right) \frac{2\theta _{i,j} \left( {j - P_{i}}\right) ^{2}}{D_{Right}^{2}}}, \\ \end{array}\right. \end{aligned}$$
(4)

where \(P_i\) is the position of the predicted word adjacent to \(x_i\). Since the prediction of each headword depends on its corresponding scalar \(p_i\), it can be calculated by Eq. (5). \(\theta _{i,j} \in \left[ -0.5,0.5\right] \) is a coefficient that controls the extent of correction; when \(\theta _{i,j}>0\), the attention at the corresponding position should be corrected. \({GL}_{i,j}\) and \({GR}_{i,j}\) denote the Gaussian distributions based on the left and right windows of the headword. Traditionally, D is set as a constant [44], which means we can only improve the adjacent relation within a fixed range around the headword. However, this is hindered by two problems. First, the value of D is hard to determine: if D is too big, the improvement of the adjacent relation is pointless, and if D is too small, some important words may be overlooked. Second, the window size differs from headword to headword; in other words, this value should be determined dynamically. Therefore, ARAM itself learns a transition matrix \(W_d\) to calculate D for each headword dynamically, as shown in Eq. (6)

(5)
(6)

where \(U_d^T\) is a set of linear projection vectors, \(W_d \in R^{d\times d}\) denotes the transition matrix learned dynamically during training, R is a scaling factor that rescales the range of the window, and \(\lambda \) is a fine-tuned weight. Then, the process of the revised self-attention in our POS-ARAN model can be formalized as Eqs. (7)–(9)

$$\begin{aligned} e_{ij}&= \left( x_{i} \cdot W^{Q} \right) \cdot \left( x_{j} \cdot W^{K} + r_{ij}^{K} \right) /\sqrt{d} \end{aligned}$$
(7)
$$\begin{aligned} a_{ij}&= \frac{\exp \left( e_{ij} + {GL}_{ik} + {GR}_{ij} \right) }{\sum \limits _{k = 1}^{n}\exp \left( e_{ik} + {GL}_{ik} + {GR}_{ik} \right) } \end{aligned}$$
(8)
$$\begin{aligned} h_{i}&= {\sum \limits _{j = 1}^{n}{a_{ij} \cdot \left( x_{j} \cdot W^{V} + r_{ij}^{V} \right) }}, \end{aligned}$$
(9)
Fig. 3 The structure of the ARAM module. “Adjacent Relation Promotion” improves the adjacent relation of words, utilizing the Gaussian bias to correct the distribution of self-attention. “Concatenate” concatenates the vector matrices from the previous layer. “Multi-Head Attention” is the self-attention layer. “Add & Norm” represents the layer normalization

Fig. 4 The relation between semantic relevance and POS. The abbreviations for the part-of-speech tags correspond as follows: WP: pronoun, VB(VBZ): verb, DT: determiner, NN(NNP): noun, IN: preposition

Fig. 5 The detailed process of the revised self-attention

where \(W^Q\), \(W^K\), and \(W^{V} \in {\mathbb {R}}^{d_{x} \times d_{h}}\) are the transition matrices for the Query, Key, and Value vectors. In Eq. (8), \({GL}_{i,j}\) and \({GR}_{i,j}\) are added to the original distribution linearly. Equation (7) applies the scaled dot product of the self-attention model to define the scoring function \(e_{ij}\). This method adds a scaling factor to the traditional dot product, which avoids the low learning efficiency caused by small gradients; to scale the result to an appropriate range, we set the scaling factor to 1/\(\sqrt{d}\). Moreover, during the training process, we assume that when the distance between two elements in the same sequence exceeds a certain threshold k, the information shared by these two elements is less important. In Eqs. (7) and (9), we add a bias \(r_{ij}\) to describe the adjacent relation between \(x_i\) and \(x_j\). \(r_{ij}\) is derived from PPE, a novel positional encoding method introduced in the next subsection. Thereby, the calculation of \(r_{ij}^V\) and \(r_{ij}^K\) can essentially be attributed to training two relative position sequences \(W^V\) and \(W^K\)

$$\begin{aligned}&\left\{ \begin{array}{c} {W^{V} = \left\{ {w_{- k}^{V},\cdots ,w_{k}^{V}} \right\} , w_{i}^{V} \in R^{d_{h}},i \in \left[ {- k, k} \right] } \\ {W^{K} = \left\{ {w_{- k}^{K},\cdots ,w_{k}^{K}} \right\} , w_{i}^{K} \in R^{d_{h}},i \in \left[ {- k, k} \right] } \\ \end{array}\right. \end{aligned}$$
(10)
$$\begin{aligned}&\left\{ \begin{array}{c} {r_{ij}^{V} = PPE \cdot W_{max(-k,min\left( j-i,k\right) )}^V} \\ {r_{ij}^{K} = PPE \cdot W_{max(-k,min\left( j-i,k\right) )}^K} \\ \end{array}\right\} . \end{aligned}$$
(11)
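To illustrate how Eqs. (4) and (7)–(11) fit together, the sketch below (our own simplified illustration, not the authors’ implementation) computes one output position \(h_i\). The window parameters \(P_i\), \(D_{Left}\), \(D_{Right}\), and \(\theta \) are passed in as plain arguments rather than predicted via Eqs. (5)–(6), and the clipped relative-position embeddings rK/rV stand in for the PPE-based terms of Eq. (11):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def aram_attention_row(x, i, Wq, Wk, Wv, rK, rV, P_i, D_left, D_right, theta, k=4):
    """Sketch of h_i from Eqs. (7)-(9) with the Gaussian corrections of Eq. (4).

    x: (n, d) word vectors; Wq/Wk/Wv: (d, d) projection matrices;
    rK/rV: (2k+1, d) relative-position embeddings indexed by the clipped offset.
    """
    n, d = x.shape
    q = x[i] @ Wq
    logits = np.empty(n)
    for j in range(n):
        off = np.clip(j - i, -k, k) + k                        # Eqs. (10)-(11): clip the offset to [-k, k]
        e_ij = q @ (x[j] @ Wk + rK[off]) / np.sqrt(d)          # Eq. (7): scaled dot product with r_ij^K
        gl = max(0.0, theta) * 2 * theta * (j - P_i) ** 2 / D_left ** 2    # GL_{i,j}, Eq. (4)
        gr = max(0.0, theta) * 2 * theta * (j - P_i) ** 2 / D_right ** 2   # GR_{i,j}, Eq. (4)
        logits[j] = e_ij + gl + gr                             # Eq. (8): bias the logits before softmax
    a = softmax(logits)
    return sum(a[j] * (x[j] @ Wv + rV[np.clip(j - i, -k, k) + k]) for j in range(n))  # Eq. (9)
```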

As illustrated in Fig. 3, the ARAM module consists of three “Adjacent Relation Promotion” blocks, one “Concatenate” layer, one “Multi-Head Attention” layer, and one “Add & Norm” layer, a structure verified by experiments. First, we pass the input word vector matrix through the “Adjacent Relation Promotion” blocks in parallel to learn the features of the sequence in different subspaces. Then, we concatenate the three outputs and use the self-attention layer to capture the contextual dependencies. Finally, we apply layer normalization [45], which performs well in NLP sequence problems.

In addition, we cascade the ARAM module N times to make the network deeper for better performance. However, deeper networks are difficult to train due to the vanishing gradient problem. To avoid this problem, we apply a residual connection, denoted by the blue arrow in Fig. 3. The experimental results shown in Sect. “Experiments” prove its effectiveness.
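The cascading with residual connections can be sketched structurally as follows (an assumed tf.keras sketch, not the authors’ code; the “Adjacent Relation Promotion” branches are stubbed with plain projections, whereas the real block applies the Gaussian-bias attention described above):

```python
import tensorflow as tf
from tensorflow.keras import layers

def aram_block(x, heads=4, dim=64):
    # Three "Adjacent Relation Promotion" branches (stubbed as projections here),
    # concatenated, passed through multi-head self-attention, then Add & Norm.
    branches = [layers.Dense(dim, activation="tanh")(x) for _ in range(3)]
    h = layers.Concatenate()(branches)
    h = layers.MultiHeadAttention(num_heads=heads, key_dim=dim)(h, h)
    h = layers.Dense(x.shape[-1])(h)                 # project back so the residual shapes match
    return layers.LayerNormalization()(layers.Add()([x, h]))   # residual connection (blue arrow in Fig. 3)

def cascade_aram(x, n_layers=2):
    for _ in range(n_layers):                        # cascade the ARAM module N times
        x = aram_block(x)
    return x

# Usage with the shapes from Table 4: padded length 50, 100-dimensional word vectors.
inputs = tf.keras.Input(shape=(50, 100))
features = cascade_aram(inputs)
```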

POS-aware embedding layer

Inspired by [46], we find that the part-of-speech (POS) of a word is also a useful feature that can guide the original words to locate the appropriate headwords. As shown in Fig. 4, “what” is strongly related to “abbreviation” and “AIDs”, less related to “stand” and “for”, and barely related to “does” and “the”. In this case, the words tagged “WP”, “NN”, and “NNP” have a stronger relation. This is not a particular case: by analyzing plenty of samples, a strong semantic relationship can be inferred from words with certain POS tags (e.g., a pronoun and a noun in a question usually have a strong relevance). Generally, conventional word embedding transforms a word into a vector representation that contains no POS information. Accordingly, we introduce a POS-aware embedding layer as an alternative to the word embedding layer. As depicted in Fig. 5, our model employs GloVe [47] to implement the word embedding process, which generates a matrix \(W^R\) with a shape of \(m\times n\), where m and n denote the dimension of the word vector and the sequence length, respectively. Similar to self-attention, given the POS tags of the input sentence \(Y=\left\{ y_1,y_2,...,y_n\right\} \in {\mathbb {R}}^{n\times 1}\) and a learnable parameter \(W^S\in {\mathbb {R}}^{n\times m}\), the POS vector \(S\in {\mathbb {R}}^{m\times 1}\) can be defined as

$$\begin{aligned} S = \left( W^S\right) ^TY. \end{aligned}$$
(12)

Then, to integrate the original word representation with the POS information, the POS vector \(S\in {\mathbb {R}}^{m\times 1}\) is extended to a matrix \(S^{'}\in {\mathbb {R}}^{m\times n}\). The final word representation \(\left( W^R\right) ^{'}\) can be illustrated as

$$\begin{aligned} W^{R^{'}} = W^R\cdot S^{'}, \end{aligned}$$
(13)

where \(\cdot \) denotes the element-wise product (consistent with the \(m\times n\) shapes of \(W^R\) and \(S^{'}\)), and the final word representation consists of the basic word embedding and the POS information of each word.
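A small sketch (our own, not the authors’ code) of Eqs. (12)–(13), with the product in Eq. (13) taken element-wise as noted above:

```python
import numpy as np

def pos_aware_embedding(W_R, Y, W_S):
    """Sketch of Eqs. (12)-(13).

    W_R: (m, n) GloVe embedding matrix of the sentence (m = vector dim, n = length)
    Y:   (n, 1) numeric POS tags of the sentence
    W_S: (n, m) learnable POS projection
    """
    S = W_S.T @ Y                                    # Eq. (12): (m, 1) POS vector
    S_prime = np.repeat(S, W_R.shape[1], axis=1)     # extend S to an (m, n) matrix S'
    return W_R * S_prime                             # Eq. (13): element-wise product
```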

Different from other POS-based solutions [48, 49], we combine the POS embedding with positional encoding. In a self-attention network, positional encoding is indispensable, because the network processes all words in a sentence simultaneously, rather than one by one as in RNNs.

In other words, such self-attention networks do not inherently utilize position information. Vaswani et al. [50] proposed positional encoding to embed positional information into the self-attention network. Their method is defined in Eqs. (14) and (15)

$$\begin{aligned}&{PE}_{(p,2i)} = {\sin \left( p/10000^{2i/d_{model}} \right) } \end{aligned}$$
(14)
$$\begin{aligned}&{PE}_{(p,2i + 1)} = {\cos \left( p/10000^{2i/d_{model}} \right) }, \end{aligned}$$
(15)

where p denotes the position of the word in the sentence, i represents the ith element of the positional vector, and \(d_{model}\) is the dimension of the word vector. We consider that the effect of the same POS on the attention distribution may vary from one position to another. Moreover, positional encoding requires no training parameters, and its output is naturally bounded to [-1,1] by the trigonometric functions. Hence, we combine the POS embedding with positional encoding. The detailed process is formalized in Eq. (16)

$$\begin{aligned}&P P E_{(P O S, p, t)}\nonumber \\&\quad = \left\{ \begin{array}{ll} \sin \left( p\cdot e^{- \frac{4t\log 10}{d_{model}}+POS\cdot t}\right) , &{} t = 0,2,4,\ldots \\ \cos \left( p\cdot e^{- \frac{4(t-1)\log 10}{d_{model}}+POS\cdot t}\right) , &{} t = 1,3,5,\ldots \\ \end{array}\right. , \end{aligned}$$
(16)

where POS is the POS feature of the word. PPE denotes the POS-aware embedding with positional encoding, which is a matrix of shape \(m\times n\), where m and n denote the dimension of the word vector and the sequence length, respectively. \(PPE_{(p+k)}\) can be represented as a linear function of \(PPE_{(p)}\) according to Eqs. (17) and (18), which means the positional relation between two words can be judged from the value of k

$$\begin{aligned}&\sin {\left( {\alpha + \beta } \right) = {\sin {\alpha {\cos \beta } + {\cos \alpha }{\sin \beta }}}} \end{aligned}$$
(17)
$$\begin{aligned}&\cos {\left( {\alpha + \beta } \right) = {\cos \alpha }{\cos \beta } - {\sin \alpha }{\sin \beta }}. \end{aligned}$$
(18)

As the POS features are embedded in the PPE, their influence can vary with relative position. Additionally, the computation introduces almost no training parameters and therefore incurs little extra cost.
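The following sketch (our own reading, not the authors’ code) computes a PPE matrix; the exponent uses the standard decomposition \(1/10000^{t/d_{model}} = e^{-4t\log 10/d_{model}}\) underlying Eqs. (14)–(15), and the POS feature is assumed to be a small normalized value per word:

```python
import numpy as np

def ppe(pos_codes, n, d_model):
    """Sketch of the POS-aware positional encoding in Eq. (16), of shape (d_model, n).

    pos_codes: length-n array of POS features, assumed to be small normalized values
               (e.g. tag index divided by the tag-set size).
    """
    out = np.zeros((d_model, n))
    for p in range(n):                               # word position in the sentence
        for t in range(d_model):                     # element of the positional vector
            t_even = t - (t % 2)                     # use t for even rows, t-1 for odd rows
            freq = np.exp(-4.0 * t_even * np.log(10.0) / d_model + pos_codes[p] * t)
            out[t, p] = np.sin(p * freq) if t % 2 == 0 else np.cos(p * freq)
    return out
```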

Then, we split the word representation into h heads and use several transition matrices to calculate the Q/K/V matrices for each head’s attention output \(Z_i\), which is calculated by Eq. (19)

$$\begin{aligned} Z_{i} = softmax\left( \frac{{\mathcal {Q}}_{i}^{T} K_{i}}{\sqrt{d_{k}}} \right) V,i \in \left[ 0,h - 1 \right] . \end{aligned}$$
(19)

Finally, all \(Z_i\) are concatenated and multiplied by a transition matrix to calculate the final Z, which has the same dimension as the initial word representation.
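A compact sketch (ours, written in the standard \(QK^T\) orientation) of the multi-head step around Eq. (19), including the final concatenation and projection:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=axis, keepdims=True)

def multi_head(X, Wq, Wk, Wv, Wo):
    """X: (n, d) word representations; Wq/Wk/Wv: lists of h per-head (d, d_k)
    projections; Wo: (h*d_k, d) output projection so Z keeps the input dimension."""
    heads = []
    for q_proj, k_proj, v_proj in zip(Wq, Wk, Wv):
        Q, K, V = X @ q_proj, X @ k_proj, X @ v_proj
        heads.append(softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V)   # Eq. (19) per head
    return np.concatenate(heads, axis=-1) @ Wo                      # concatenate heads, project back to d
```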

Table 1 Experimental environment

Experiments

Experimental environment

The experimental environment is listed in Table 1. We use PyCharm and Keras to implement our experiments on a Linux physical machine with 64 GB RAM, an E5-2630@2.40 GHz CPU, and a GTX2080Ti GPU.

Table 2 Coarse-grained data distribution
Table 3 Fine-grained data distribution

Datasets

We have evaluated our POS-ARAN on two different question classification datasets: (i) Experimental Data for Question Classification (EDQC) and (ii) Yahoo! Answers Comprehensive Questions and Answers 1.0.

The EDQC dataset is a public question classification dataset [51], which can be obtained from https://cogcomp.seas.upenn.edu/Data/QA/QC/. The dataset contains 15,447 questions in total, and all questions are labeled at both the coarse-grained and fine-grained levels. At the coarse-grained level, the intent of each question falls into one of 6 categories: “ENTY” (entity), “LOC” (location), “NUM” (number), “DESC” (description), “HUM” (human), and “ABBR” (abbreviation). In addition, all questions are labeled with 47 fine-grained categories. The category information and data distribution are listed in Tables 2 and 3.

The Yahoo! Answers topic classification dataset is constructed from the 10 largest main categories, published by Cornell University: Society & Culture, Science & Mathematics, Health, Education & Reference, Computers & Internet, Sports, Business & Finance, Entertainment & Music, Family & Relationships, and Politics & Government. Each class comprises 140,000 training samples and 5000 testing samples. The dataset is a corpus of Yahoo! Answers questions and their corresponding answers collected up to October 25, 2007. In the question classification experiments, we only use the question text and the main category information.

Evaluation metrics

Multiple criteria are used to evaluate the performance of the proposed model. Following prior work, we adopt Accuracy, Precision (P), Recall (R), and F1 score. The formulas are stated in Eqs. (20)–(22)

$$\begin{aligned} Acc&=\frac{T P+T N}{T P+T N+F P+F N} \end{aligned}$$
(20)
$$\begin{aligned} \hbox {F} 1&=\frac{2 \times P \times R}{P+R} \end{aligned}$$
(21)
$$\begin{aligned} P&=\frac{T P}{T P+F P}, R=\frac{T P}{T P+F N}, \end{aligned}$$
(22)
Table 4 Hyperparameter configuration
Fig. 6 Visualization of the training accuracy

Fig. 7 The association of each word in the sentence

Fig. 8 The visualized result of the improved self-attention. “[CLS]” and “[SEP]” are two delimiters mentioned in [52] and have no semantic information

where TP is the number of samples correctly predicted as the positive class, FP the number incorrectly predicted as positive, TN the number correctly predicted as negative, and FN the number incorrectly predicted as negative. The same holds for multi-class classification, as long as all categories other than the current one are treated as negative. Higher values denote better performance for all metrics.
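As a quick illustration (hypothetical counts, not results from the paper), the per-class metrics of Eqs. (20)–(22) can be computed from one-vs-rest confusion counts:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from one-vs-rest counts, Eqs. (20)-(22)."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for one category treated as the positive class
print(classification_metrics(tp=90, fp=5, tn=880, fn=25))
```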

Parameter configuration

The hyperparameters of POS-ARAN are listed in Table 4. We assume that the length of the input sequence does not exceed 50; shorter sequences are padded to length 50. After word embedding, the shape of the input sequence is \(n\times \)100, where n is the sequence length. To avoid overfitting, we set the dropout rate to 0.5. The final layer of POS-ARAN is a dense layer whose dimension matches the number of categories. We select categorical cross-entropy as our loss function, which is formulated in Eq. (23)

$$\begin{aligned} c = - \frac{1}{n}{\sum _{x}{y{\ln a} + \left( 1 - y \right) {\ln \left( 1 - a \right) }}}. \end{aligned}$$
(23)

To update parameters effectively, we use the ADAM algorithm to optimize the POS-ARAN model, setting its hyperparameters to \(\beta _1\) = 0.9, \(\beta _2\) = 0.999, and \(\epsilon \) = \(10^{-8}\). 70% of the data is used as the training set and the remaining 30% as the validation set. We set the learning rate to 0.001 and train the model for 200 epochs; we observe that the performance of the POS-ARAN model remains stable after 50 epochs.
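The training setup described above can be sketched with the tf.keras API as follows (a hedged illustration with a placeholder backbone; the actual POS-ARAN layers are not reproduced here):

```python
import tensorflow as tf

# Placeholder backbone: padded length 50, 100-dimensional word vectors (Table 4).
inputs = tf.keras.Input(shape=(50, 100))
x = tf.keras.layers.GlobalAveragePooling1D()(inputs)
x = tf.keras.layers.Dropout(0.5)(x)                          # dropout rate 0.5 against overfitting
outputs = tf.keras.layers.Dense(6, activation="softmax")(x)  # dense layer sized to the category count
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                       beta_1=0.9, beta_2=0.999, epsilon=1e-8),
    loss="categorical_crossentropy",                         # Eq. (23)
    metrics=["accuracy"],
)
# model.fit(x_train, y_train, validation_split=0.3, epochs=200)  # 70/30 split, 200 epochs
```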

Experimental results and analysis

In this section, detailed experimental results and analysis are presented. As depicted in Fig. 6, the accuracy on the coarse-grained and fine-grained classification tasks is 95.59% and 92.91%, respectively. POS-ARAN clearly performs better on the coarse-grained task. The main reason is that there are far more fine-grained categories than coarse-grained ones, so the number of samples per fine-grained category is insufficient. As a result, the difficulty of fine-grained classification is greatly increased, which makes the difference in validation accuracy reasonable. Correspondingly, the training loss of the coarse-grained task is lower than that of the fine-grained task, and the training accuracy of the coarse-grained task converges more quickly. The experimental results show that POS-ARAN converges successfully within 200 epochs and achieves satisfactory performance on both the coarse-grained and fine-grained classification tasks.

Fig. 9 The comparison results of various models

To understand how the ARAM module assigns its attention, we visualize the attention weight distribution in Fig. 7, where an attention weight matrix demonstrates the correlation between any two words in the sentence. The darker an element in the matrix, the greater the contribution of the corresponding word pair when calculating the attention weights. In this way, neighboring words can contribute as much as headwords. This mechanism helps capture the local semantic information in sentences, which contributes to correctly judging the intent of questions.

As shown in Fig. 8, the visualization results of ARAM are composed of the Query q, the Key k, and the element-wise product \(q\times k\), which gives a clear picture of the intermediate results of queries and keys and helps demonstrate the robustness of the POS-ARAN model in more detail. According to the output of the softmax layer, “what” is most related to “term”, while keeping a certain relevance to “is”, “the”, “young”, and “fox”. This suggests that our POS-ARAN model can capture short-term and long-term dependencies simultaneously by capturing the adjacent relation.

Table 5 Comparison of the proposed POS-ARAN against State-of-the-Art Benchmark Algorithms

Comparison results and ablation studies

In this section, we compare our POS-ARAN model with traditional machine learning methods and other mainstream deep learning networks on EDQC and Yahoo! Answers. The comparison results are shown in Table 5. In our experiments, SNoW [53] is an improved hierarchical classification model. The maximum entropy model is the one proposed by Kocik et al. [33]. SVM is the support vector machine with a linear kernel proposed by Huang et al. [31]. TextCNN [37] and TextRNN [54] are common models applied to sequence problems. GRU [55] and Bi-LSTM [56] are two variants of the RNN that perform well in NLP tasks. Word2vec pre-trained word embeddings [36] are used with deep learning architectures, including CNN and RNN, for comparison. Bidirectional Encoder Representations from Transformers (BERT) [57] achieves state-of-the-art performance, since it provides better representations by capturing bi-directional context with Transformers. ALBERT [58] is an extension of BERT that improves its parameter efficiency by incorporating two parameter reduction techniques. A multi-granularity fusion neural network (MGF) [59] is also used for comparison, which has recently achieved optimal results in medical question classification. More intuitive results can be observed in Fig. 9. In particular, since the POS-ARAN model is essentially a deep learning model, we also compare the accuracy trend of POS-ARAN with other deep learning methods, as shown in Fig. 10.

According to Fig. 9 and Table 5, deep learning methods are generally better than traditional machine learning methods. Specifically, our POS-ARAN model performs well on both the coarse-grained and fine-grained tasks. Although GRU and Bi-LSTM reach accuracy similar to POS-ARAN on the coarse-grained task, POS-ARAN still outperforms them on the fine-grained task. Compared to BERT, a state-of-the-art approach for NLP tasks, our POS-ARAN still achieves competitive performance. As shown in Tables 6 and 7, POS-ARAN classifies the NUM category in EDQC and the Sports category in Yahoo! Answers particularly well. Meanwhile, we find that the recall of the ABBR category is lower than that of the other categories, which may be related to its insufficient sample size.

Table 6 Results of coarse-grained category in EDQC of POS-ARAN
Table 7 Results of POS-ARAN on Yahoo! Answers

Since the architectures of GRU and Bi-LSTM are essentially RNNs, they are trained serially. As shown in Fig. 10, RNN-based models take a long time to train and converge more slowly than POS-ARAN. Figure 11 illustrates the training procedure of the various models and the differences in training duration. As can be seen from Fig. 11, TextCNN is the fastest, followed by POS-ARAN; GRU and TextRNN take nearly three times longer than POS-ARAN, and Bi-LSTM is the worst, with a time expenditure around six times that of POS-ARAN. Thus, although the accuracy of GRU and Bi-LSTM is on par with our POS-ARAN model, POS-ARAN has a significant advantage in training speed. In contrast, TextCNN, although faster than POS-ARAN, is far less accurate. In brief, POS-ARAN achieves the best overall results across multiple evaluation metrics.

Moreover, to obtain the best performance, we try several different combinations of parameters and compare them on EDQC, as shown in Table 8 and Figs. 12 and 13. The results show that the performance of the POS-ARAN model increases continuously as the number of ARAM layers, the number of attention heads, and the attention dimension grow. At the same time, we also need to take the number of parameters into consideration: blindly increasing parameters leads to excessive computational expenditure. Finally, we obtain the best architecture of the POS-ARAN model in our experiments, which has 2 ARAM layers, 4 attention heads, and an attention dimension of 64.

Fig. 10 The coarse-grained and fine-grained accuracy trend

Fig. 11 The difference of various models’ time expenditure

Table 8 Accuracy of the POS-ARAN model with different configurations. “ARAM layers” represents the number of ARAM layers. “Attention heads” (Attr-H) is the number of heads in the self-attention mechanism. “Attention dimension” (Attr-D) denotes the dimension of the hidden layer in the attention network
Fig. 12 Heatmap of accuracy (%) on EDQC of different configurations

Fig. 13 Results on EDQC with different Attr-H and Attr-D

Discussion

According to the final qualitative results, we find that almost all coarse-grained categories are correctly classified; however, some questions are misclassified in the fine-grained task. We see two possible reasons for this. On the one hand, the number of fine-grained categories is around eight times that of coarse-grained categories, which leads to a more complex situation. On the other hand, the sample size of each fine-grained category is unbalanced, which causes significant differences in per-category accuracy. Therefore, although we select categorical cross-entropy as the loss function, it is still necessary to balance the sample size of each fine-grained category during training. Nevertheless, such a data preprocessing approach is only theoretically feasible, because the gap in question quantity between categories is extremely large in the existing datasets, and removing questions from certain categories to balance the data would result in inaccurate classification. POS-ARAN is also satisfactory from the point of view of generalization, as its accuracy in real scenarios is almost identical to its accuracy on the test set.

Conclusion

In this paper, we propose a POS-aware adjacent relation attention network (POS-ARAN) for question classification, which can capture both long-term dependencies and local representations in text. To enhance the local representation of adjacent relations among words, we introduce a learnable Gaussian bias which obtains an adaptive self-attention distribution via a dynamic window to revise the original attention distribution. Furthermore, a novel POS-aware embedding layer is proposed, which helps to locate the appropriate headwords via syntactic information. POS-ARAN applies this revised self-attention mechanism to classify questions and addresses the problem that the relation of neighboring words is weakened when calculating the weighted average of attention. The experiments on EDQC demonstrate that our POS-ARAN model exceeds most traditional and deep learning models in terms of performance and time, achieving a coarse-grained accuracy of 95.59% and a fine-grained accuracy of 92.91%, which shows that POS-ARAN is a competitive model for question classification.

In the future, we plan to extend our research in the following aspects: (i) introducing SOTA language models into question classification to boost the overall performance with adequate contextual representation; (ii) integrating more syntactic cues into the question component extraction task to better capture semantic information; and (iii) investigating multi-task learning for question classification to apply to more complex situations.