1 Introduction

Text classification is a common task in Natural Language Processing (NLP), widely used in news classification, information retrieval, machine reading comprehension, and other applications. In legal intelligent question answering, judging the legal category of the facts a user describes is itself a text classification task, and a necessary prerequisite for realizing intelligent question answering.

With the development of deep learning, researchers began to apply deep learning models to text classification. Kim [1] proposed using Convolutional Neural Networks (CNNs) for text classification and achieved results surpassing previous supervised models. Later, Du et al. [2] and Huang et al. [3] applied Recurrent Neural Networks (RNNs) and Graph Neural Networks (GNNs), respectively, to text classification tasks. Deep learning models have gradually become the mainstream approach to text classification because they require no manual feature engineering and classify effectively. However, they need the support of large-scale labeled datasets: when annotated data is scarce, they tend to overfit [4]. In specialized application fields such as medicine and finance, annotators must have a high level of expertise, which makes annotation expensive and limits the use of deep learning in these fields. Achieving efficient text classification under sparse annotation has therefore become an open problem for researchers.

Currently, researchers have begun to study methods based on Few-Shot Learning (FSL) [5], such as model-agnostic meta-learning [6], relation networks [7], and matching networks [8]. These methods, however, are mostly applied in computer vision and are rarely found in NLP. Recently, Gao et al. [9] proposed hybrid attention-based prototypical networks for few-shot relation classification and successfully applied FSL to text classification. In legal intelligent question-answering systems, however, classification faces serious obstacles: legal consulting texts are short and highly colloquial, and applicable datasets are lacking. For these reasons, we construct a few-shot dataset for legal consulting question classification and propose a classification model based on multi-attention prototypical networks. Our contributions are summarized as follows:

  • We construct a few-shot dataset for legal consulting questions. This dataset contains 46 categories and each category contains 30 to 300 instances.

  • Given the particular characteristics of the dataset constructed in this paper, we propose a multi-attention prototypical networks model based on instance-dimension level attention.

  • Experimental results on our dataset show that our model achieves state-of-the-art performance. We also evaluate our model on other few-shot datasets, and the results show that it generalizes well.

2 Related Works

Currently, with the development of deep learning, researchers have begun to apply it to text classification and achieved good results. Kim [1] proposed using CNNs for text classification: pre-trained word vectors represent the input text, and the model is trained on labeled data to perform classification. The results prove that CNNs are suitable not only for image classification but also for text classification. Because text is sequential and exhibits long-term dependencies, Zhou et al. [10] proposed applying Long Short-Term Memory (LSTM) networks to text classification, which solved the problem that CNNs cannot model contextual information. Whether CNN or LSTM, the sentence serves as the model input, sentence features are extracted by the deep learning model, and a classifier performs the classification. But these models ignore the relevance of information between sentences. Yang et al. [11] proposed a hierarchical attention model for document classification, including word-level and sentence-level attention, so that the model can assign different attention to each word and each sentence in a document.

In application fields where annotated data is scarce, existing deep learning models perform poorly. In response to this problem, few-shot learning methods have gradually attracted researchers’ attention and were first applied in computer vision [4, 8, 12,13,14]. Gao et al. [9] argued that text differs from images, with problems such as information diversity and heavy noise; they therefore proposed a prototypical networks model based on a hybrid attention mechanism for relation classification, designing instance-level and feature-level attention to alleviate the influence of noisy data and sparse features. Sun et al. [15] proposed hierarchical attention prototypical networks for few-shot text classification, designing feature-level, word-level, and instance-level cross attention to enhance the expressiveness of the semantic space. Geng et al. [16] proposed an induction network for few-shot text classification, using the dynamic routing algorithm within meta-learning to learn generalized class-wise representations so that the model generalizes better. Subsequently, Geng et al. [17] proposed dynamic memory induction networks for few-shot text classification; the model uses dynamic routing to give memory-based FSL more flexibility to adapt to the support set, a critical capacity of few-shot classification models. Bao et al. [18] proposed a meta-learning model for few-shot text classification, using an attention generator and a ridge regressor to make the model transfer well across categories.

Fig. 1 Prototypical networks based on multi-attention

3 Prototypical Networks Based on Multi-attention

3.1 Basic Concepts for Few-Shot Learning

The purpose of FSL is to generalize from limited prior knowledge and extend it to new tasks; it addresses the problem of improving model performance when training data is scarce. FSL involves the following concepts:

Support set: at each iteration, N categories are randomly selected from the dataset, and K instances are selected from each category, forming a support set of \(N \times K\) instances. Generally speaking, at each iteration the model is trained once on the support set to learn features.

Query set: the query set is constructed similarly to the support set. Q instances are selected from the remaining instances of each of the N classes to form the query set. During training, after one pass over the support set, the model computes the loss on the query set and updates its parameters.

Fig. 2 Feature extraction module. The red rectangle in (a) represents a convolution kernel, and the red rectangle in (b) represents the result of a convolution operation performed by the convolution kernel on the sentence embedding vector; the green rectangle in (b) represents the filter size of the max-pooling operation, and the green rectangle in (c) represents the result of performing the max-pooling operation on a local feature vector

N-way K-shot: N-way refers to N categories, and K-shot means that each category contains K instances. An N-way K-shot task requires the model to learn features from the support set so that it can distinguish the N categories.

Training set and test set: in an FSL task, the training set and test set have the same data composition, but their categories do not overlap; both contain a support set and a query set. To adapt to new tasks quickly, the model first trains its parameters on the training set, then uses the support set of the test set to adjust them, and finally evaluates performance on the query set of the test set.
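To make episode construction concrete, here is a minimal Python sketch of N-way K-shot sampling with Q query instances per class. The dict-of-lists dataset layout and the function name `sample_episode` are illustrative assumptions, not part of the paper.

```python
import random

def sample_episode(data, n_way, k_shot, q_query):
    """Sample one N-way K-shot episode from {category: [instances]}.

    Returns a support set of N*K labeled instances and a query set of
    N*Q labeled instances, with no overlap between the two.
    """
    categories = random.sample(list(data.keys()), n_way)
    support, query = [], []
    for label, cat in enumerate(categories):
        picked = random.sample(data[cat], k_shot + q_query)
        support += [(text, label) for text in picked[:k_shot]]
        query += [(text, label) for text in picked[k_shot:]]
    return support, query
```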

3.2 Model Framework

In this paper, we propose a Legal consulting questions Classification model based on Multi-Attention Prototypical Networks (MAPN-LC). The model uses prototypical networks as its basic framework and realizes classification by calculating the distance between a consulting question and each class prototype. In the classification process, we introduce an instance-dimension level attention mechanism: we first assign different weights to each instance within a category, and then assign different weights to each dimension of the weighted instances’ feature vectors, thereby increasing the contribution of key instances and key features.

We define legal consulting question classification as predicting the category of a query instance q given a support set S, where S is defined as follows:

$$\begin{aligned} S = \left\{ \left( s^1_1,s^2_1,\ldots ,s^K_1 \right) ,\ldots ,\left( s^1_N,s^2_N,\ldots ,s^K_N \right) \right\} , \end{aligned}$$
(1)

where \(s_i^j\) denotes the j-th instance of the i-th category, and S is expressed in the form of N-way K-shot.

Table 1 Three instances in “Joining rights protection” randomly selected from the legal consultation question dataset

The framework of the model is shown in Fig. 1. For each instance in the support set S, each word is first mapped to a word vector by the word embedding module, and the instance’s feature vector is then obtained through the feature extraction module. Next, instance-dimension level attention is applied to the instances’ feature vectors to obtain the dimension weight vector \(\beta _i\). For the query instance q, the feature vector \(x_q\) is obtained through the same word embedding and feature extraction modules. Finally, the class prototype \(c_i\) of each class is computed by the prototypical networks, and a distance function incorporating the weight vector \(\beta _i\) predicts the legal category of q.

3.3 Instance Encoder

3.3.1 Embedding Layer

Since text cannot participate directly in computation, we use pre-trained word vectors to map each word of an instance to a word vector. Each word \(w_i\) in the instance \(s=(w_1,w_2,\ldots ,w_n )\) is represented as a \(d_k\)-dimensional vector \(e_i\), giving the embedding matrix \(s'=(e_1,e_2,\ldots ,e_n )\) of instance s, where \(s'\in \mathbb {R}^{n \times d_k }\) and n denotes the maximum length of an instance.
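As a rough illustration, the embedding layer can be a lookup into the pre-trained word2vec matrix; in the PyTorch sketch below, the class name and the `freeze` flag are illustrative assumptions.

```python
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    """Map a padded instance of word ids to its n x d_k embedding matrix."""

    def __init__(self, pretrained, freeze=False):
        super().__init__()
        # pretrained: a (vocab_size, d_k) tensor of word2vec vectors
        self.embedding = nn.Embedding.from_pretrained(pretrained, freeze=freeze)

    def forward(self, word_ids):         # word_ids: (batch, n)
        return self.embedding(word_ids)  # (batch, n, d_k)
```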

3.3.2 Feature Extraction Layer

Since the input of the model is a sentence and the words within a sentence are strongly correlated, we generate sentence feature vectors by extracting the semantic information of all words, which better expresses the complete meaning of the sentence. In this paper, we apply a CNN over the word embeddings to extract this semantic information and obtain the corresponding sentence embedding. The feature extraction module is shown in Fig. 2. First, we apply h convolution kernels of size \(t \times d_k\) (\(t<n\)) to the \(n \times d_k\) embedding matrix \(s'\), obtaining h local feature vectors of size \(m \times 1\), where \(m = n - t + 1\). Then we use max-pooling with a window size of \(m \times 1\) to extract the maximum feature of each convolution kernel. Finally, we obtain a \(1 \times h\) feature vector, so instance \(s'\) is represented as \(x \in \mathbb {R}^{1 \times h}\).
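A minimal PyTorch sketch of this encoder follows. Sliding a \(t \times d_k\) kernel over the \(n \times d_k\) matrix is equivalent to a 1-D convolution over the word dimension; the default values `h=230` and `t=3` are illustrative assumptions, not the paper's tuned settings.

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """h kernels of size t x d_k, then max-pooling over the m = n - t + 1 positions."""

    def __init__(self, d_k, h=230, t=3):
        super().__init__()
        self.conv = nn.Conv1d(d_k, h, kernel_size=t)

    def forward(self, s):                                 # s: (batch, n, d_k)
        feats = torch.relu(self.conv(s.transpose(1, 2)))  # (batch, h, m)
        x, _ = feats.max(dim=2)                           # max over the m positions
        return x                                          # (batch, h)
```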

3.4 Prototypical Networks

The main idea of prototypical networks is to represent each category by a class prototype vector. We compute the class prototype of each category as the average of its instances’ feature vectors, as in Eq. (2):

$$\begin{aligned} c_i =\frac{1}{K}\sum _{j=1}^{K} x_i^j, \end{aligned}$$
(2)

where \(x_i^j\) denotes the j-th instance’s feature vector of the i-th category, K denotes the number of instances of the i-th category in the support set, and \(c_i\) denotes the class prototype of the i-th category.
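In code, Eq. (2) reduces to a single mean over the instance axis; the sketch below assumes the support features are already batched as an (N, K, h) PyTorch tensor.

```python
def class_prototypes(support_feats):
    """Eq. (2): the prototype of each category is the mean of its K
    instance feature vectors.

    support_feats: (N, K, h) tensor -> returns (N, h) prototypes.
    """
    return support_feats.mean(dim=1)
```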

Fig. 3 Instance-dimension level attention module

3.4.1 Instance-Dimension Level Attention

Our dataset has certain particularities: its instances are short, and instances of the same category are semantically similar. As Table 1 shows, although the three instances are expressed differently, they have similar semantics. Moreover, the support set in FSL contains few instances, so the features extracted from it suffer from data sparsity. We therefore introduce instance-dimension level attention, composed of instance-level and dimension-level attention, as shown in Fig. 3.

The instance-level attention draws on the idea of the self-attention mechanism [19] and performs cross-attention among the K instances in the support set to assign a different weight to each instance. If an instance has higher similarity with the other instances, it is more representative of its category and should be given a higher weight. In this way, an interdependence relationship is established among the K instances of the same category.

The instance-level attention module first multiplies the feature vector set \(X_i\) of one category in the support set by its projection matrices \(W^Q\), \(W^K\), and \(W^V\), respectively; these linear transformations yield the three matrices \(\mathrm{Query}_i\), \(\mathrm{Key}_i\), and \(\mathrm{Value}_i\). From these we compute the self-attention score of each instance: the more representative an instance is of its category, the higher its score. This yields the weight matrix \(\gamma _i\), composed of the similarity scores between each instance and the others. Here \(X_i=\{x_i^1,x_i^2,\ldots ,x_i^K \}\in \mathbb {R}^{K \times h}\), i denotes the i-th category, \(x_i^j\) denotes the j-th instance’s feature vector of the i-th category, \(\mathrm{Query}_i\), \(\mathrm{Key}_i\), \(\mathrm{Value}_i \in \mathbb {R}^{K \times h}\), and the weight matrix \(\gamma _i\in \mathbb {R}^{K \times K}\) is computed as follows:

$$\begin{aligned} \gamma _i =\mathrm{softmax} \left( \frac{\mathrm{Query}_i \cdot \mathrm{Key}^T_i}{\sqrt{h}}\right) . \end{aligned}$$
(3)

Then, as shown in Eq. (4), the weight matrix \(\gamma _i\) is multiplied with \(\mathrm{Value}_i\) to obtain the weighted feature vector set \(Z_i=\left\{ z_i^1,z_i^2,\ldots ,z_i^K \right\} \in \mathbb {R}^{K \times h}\), where \(z_i^j\) denotes the j-th instance’s weighted feature vector of the i-th category.

$$\begin{aligned} Z_i =\gamma _i \cdot \mathrm{Value}_i. \end{aligned}$$
(4)
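A compact PyTorch sketch of Eqs. (3) and (4) follows; batching all N categories into one (N, K, h) tensor and using bias-free linear projections are our assumptions.

```python
import math
import torch
import torch.nn as nn

class InstanceLevelAttention(nn.Module):
    """Self-attention across the K instances of each category (Eqs. 3-4)."""

    def __init__(self, h):
        super().__init__()
        self.w_q = nn.Linear(h, h, bias=False)  # W^Q
        self.w_k = nn.Linear(h, h, bias=False)  # W^K
        self.w_v = nn.Linear(h, h, bias=False)  # W^V
        self.h = h

    def forward(self, X):                       # X: (N, K, h)
        query, key, value = self.w_q(X), self.w_k(X), self.w_v(X)
        gamma = torch.softmax(                  # Eq. (3): (N, K, K)
            query @ key.transpose(-2, -1) / math.sqrt(self.h), dim=-1)
        return gamma @ value                    # Eq. (4): Z, (N, K, h)
```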

When distinguishing particular categories in the feature space, certain dimensions are more discriminative than others. Therefore, after obtaining weighted feature vectors that encode the interdependence among instances, we introduce dimension-level attention to increase the weight of these dimensions. Following Gao et al. [9], we apply a CNN-based feature attention mechanism to realize dimension-level attention: convolution operations over each dimension of the weighted feature vector set \(Z_i\) yield the weight vector \(\beta _i\) over the feature dimensions.
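Because the paper defers the design to Gao et al. [9], the following is only a simplified sketch of CNN-based dimension-level attention: it convolves across the K weighted instance vectors of each category and emits one weight per feature dimension. The two-layer architecture and channel count are illustrative assumptions, not the exact design of [9].

```python
import torch.nn as nn

class DimensionLevelAttention(nn.Module):
    """Produce a weight beta_i over the h feature dimensions of each category."""

    def __init__(self, k_shot, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            # (K, 1) kernel mixes the K instances separately for each dimension
            nn.Conv2d(1, channels, kernel_size=(k_shot, 1)),
            nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=(1, 1)),  # collapse to one score
        )

    def forward(self, Z):                  # Z: (N, K, h)
        beta = self.net(Z.unsqueeze(1))    # (N, 1, 1, h)
        return beta.squeeze(1).squeeze(1)  # (N, h)
```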

3.4.2 Instance Prediction

To predict the category of query instance q, we calculate the distance from \(x_q\) to each class prototype \(c_i\). The most commonly used distance function is the Euclidean distance. For the multi-attention prototypical networks proposed in this paper, the dimension-level weight vector \(\beta _i\) of the support set is obtained through instance-dimension level attention, and we construct the distance function on top of \(\beta _i\) to predict instance labels [9]. This allows the model to better adapt to the given categories and instances, and alleviates the impact of feature sparsity on model performance. The distance function is shown in Eq. (5):

$$\begin{aligned} d_i =\beta _i \left( c_i-x_q \right) ^2. \end{aligned}$$
(5)

Finally, we use the cross-entropy loss function to evaluate the gap between the predicted and actual categories of instance q, and use stochastic gradient descent to update the parameters.
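Putting Eq. (5) and the training objective together, the sketch below scores a batch of query features against the weighted prototypes and feeds the negative distances into cross-entropy; the batched tensor shapes are our assumptions.

```python
import torch.nn.functional as F

def predict_logits(prototypes, beta, x_q):
    """Eq. (5): dimension-weighted squared Euclidean distance.

    prototypes, beta: (N, h); x_q: (Q, h). Returns (Q, N) logits,
    where a smaller distance yields a larger logit.
    """
    diff = prototypes.unsqueeze(0) - x_q.unsqueeze(1)   # (Q, N, h)
    dist = (beta.unsqueeze(0) * diff ** 2).sum(dim=-1)  # (Q, N)
    return -dist

# One training step on the query set:
#   logits = predict_logits(c, beta, x_q)
#   loss = F.cross_entropy(logits, query_labels)
#   loss.backward(); optimizer.step()
```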

4 Experiments

To verify that the proposed model better classifies legal consulting questions, we construct a few-shot consulting question classification dataset in the legal field and compare our model with other FSL models.

4.1 Datasets

We evaluate our approach on the legal consulting questions classification dataset, the Amazon product dataset, the HuffPost headlines dataset, and FewRel.

The legal consulting questions dataset is obtained from real legal intelligent question-answering websites. It contains 46 categories, such as medical disputes, insurance claims, and bond mortgages, with a total of 10,402 legal consulting questions. Each category contains 30–300 instances of varying text length, with a maximum length of 40. Following the definition of FSL, we divide the dataset into a training set and a test set with no overlap between their categories. The statistics of the dataset are shown in Table 2.

Table 2 Dataset of legal consulting questions

The Amazon product dataset contains customer reviews from 24 product categories. Since the original dataset is too large, we generate a subset by sampling 1000 reviews from each category and split it according to Bao et al. [18].

The HuffPost headlines dataset [20] consists of news headlines published on HuffPost between 2012 and 2018, split among 41 classes.

FewRel [21] is a relation classification dataset developed for FSL. Each instance is a sentence annotated with a head entity, a tail entity, and their relation. The task is to classify the relation between the head and tail entities based on the semantic information of the sentence.

4.2 Baselines

In this section, the baseline models in our experiments are introduced as follows:

  • SNAIL [13] is a meta-learning model that uses temporal convolutional neural networks and attention modules to integrate information from past experience and achieve rapid learning.

  • GNN [14] embeds the support set instances and their label information into graph nodes and propagates information between the nodes. The query instances receive information from the support set, realizing few-shot classification.

  • Siamese neural networks [12] were proposed for image classification. During training, the model combines samples into pairs, feeds them into the siamese network to extract features, and then computes the distance between the features to classify.

  • Prototypical networks [4] were applied to few-shot image classification. The model maps samples into a metric space and represents each class prototype by the mean of that class’s feature vectors. To predict an unknown sample, the Euclidean distance between the sample and each class prototype determines the label.

  • Proto-HATT [9] is a hybrid attention-based prototypical network for noisy few-shot relation classification.

4.3 Parameter Settings

In this experiment, Google’s open-source toolkit word2vec is used to obtain word embedding vectors by training on the dataset constructed in this paper. Hyper-parameters are determined by grid search over the CNN encoding window size \(t\in \{2,3,4,5\}\), the batch size \(\in \{3,4,5,6\}\), and the learning rate lr \(\in \{0.001,0.01,0.1\}\). We train the model for 30,000 iterations. The parameters used by the model are shown in Table 3.
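The grid search amounts to an exhaustive sweep over the three hyper-parameter sets above. In the sketch below, `train_and_eval` is a hypothetical placeholder (a random stub here) standing in for one full training-and-validation run of MAPN-LC.

```python
import random
from itertools import product

def train_and_eval(t, batch, lr):
    """Hypothetical stand-in: train MAPN-LC with these settings and
    return validation accuracy (random stub for illustration)."""
    return random.random()

best_cfg, best_acc = None, 0.0
for t, batch, lr in product([2, 3, 4, 5], [3, 4, 5, 6], [0.001, 0.01, 0.1]):
    acc = train_and_eval(t, batch, lr)
    if acc > best_acc:
        best_cfg, best_acc = (t, batch, lr), acc
print("best (t, batch, lr):", best_cfg)
```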

Table 3 Parameter settings
Table 4 Accuracy comparison between different models on our dataset (%)
Table 5 Ablation study of prototypical Networks based on multi-attention on our dataset (%)

4.4 Experimental Results

4.4.1 Overall Performance

To verify the superiority of the proposed model, we compare it with current mainstream FSL models.

As shown in Table 4: (1) In all four settings, the accuracy of MAPN-LC is higher than that of the baseline models, exceeding Proto-HATT, the strongest baseline, by 2–3%. (2) Across the four settings, MAPN-LC’s accuracy is highest in 5-way 10-shot and lowest in 10-way 5-shot: with the number of categories fixed, performance improves as the number of training instances grows, while with the number of instances fixed, classification becomes harder as the number of categories grows. (3) MAPN-LC outperforms Proto-HATT because, although Proto-HATT also uses feature-level attention, our model additionally considers the effect of instance similarity when extracting features, increasing the weight of instances that are more similar to the others in their category. Taken together, these results show that MAPN-LC better classifies legal consulting questions.

To verify that the attention proposed in this paper, i.e., the instance-level and dimension-level attention applied on top of the prototypical networks, helps improve model performance, we conduct an ablation study on the multi-attention prototypical networks. The results are shown in Table 5, where Prototypical networks denotes the basic prototypical networks model, DAPN-LC denotes the model with only dimension-level attention, and MAPN-LC denotes the model integrating instance-dimension level attention.

It can be seen from Table 5: (1) Across the four settings, DAPN-LC is on average 5% higher than the basic prototypical networks, which shows that weighting the extracted features to express category characteristics improves model performance. (2) MAPN-LC, which applies instance-level attention on top of DAPN-LC, is a further 1–2% higher, which shows that the dimensional features extracted after weighting the instances are more representative. (3) These results confirm that adding instance-dimension level attention to the prototypical networks improves the performance of the model.

Table 6 Accuracy comparison of few-shot text classification on HuffPost (%)
Table 7 Accuracy comparison of relation classification on FewRel validation set (%)

4.4.2 Generalization Ability Verification

To verify that the proposed model also applies to other public datasets, we experiment on HuffPost, FewRel, and Amazon. The accuracy comparisons are shown in Tables 6, 7 and 8. MAPN-LC outperforms Proto-HATT in most cases, which shows that our model suits not only the legal consulting questions dataset but also other general few-shot datasets. However, the improvement over Proto-HATT is modest, indicating that the model is best suited to the legal consulting questions dataset.

Table 8 Accuracy comparison of few-shot text classification on Amazon (%)

5 Conclusion and Future Work

In this paper, we propose a classification model based on multi-attention prototypical networks to solve the few-shot classification task in legal intelligent question answering. We first construct a few-shot consulting question classification dataset in the legal field, and then realize classification by integrating multi-attention, i.e., instance-dimension level attention, into the prototypical networks. Instance-level attention weights the instances of each category so that the local features extracted by dimension-level attention better represent the category; dimension-level attention captures the semantic information of instances and alleviates feature sparsity. In our experiments, comparisons with current mainstream FSL methods show that the proposed model outperforms the baseline models.

In the future, we will refine the few-shot legal consulting question classification dataset constructed in this paper and explore models better suited to legal consulting question classification, so as to further improve performance.