1 Introduction

A recent series of pneumonia outbreaks caused by a novel coronavirus has spread across China and the rest of the world. During the pandemic, many criminals took the opportunity to commit crimes, endangering people’s health and safety and hampering the inspecting authorities’ efforts to prevent and control the pandemic. Because the outbreak was unprecedented and its impact wide-ranging, the inspecting authorities initially had to handle each case against the prevention and control of the pandemic on a case-by-case basis. The increasing number of criminal cases poses a significant challenge to the work of the authorities. Meanwhile, people’s knowledge of the laws related to the pandemic is also lacking. In artificial intelligence, much research has explored the possibility of using AI tools to aid legal judgments (Huang and Luo 2018). Such tools can not only make legal work more efficient but also help ordinary people who lack legal knowledge.

For example, Weber et al. (1998), Zhong et al. (2019), and Liu et al. (2019) develop three case-based reasoning systems on legal documents to solve real-world problems. Liu and Hsieh (2006) use the K-nearest neighbour algorithm as a classifier to process Chinese judicial documents. With the success of deep learning in natural language processing, deep neural networks have also been used to predict the crime type in legal cases (Hu et al. 2018). Many researchers have studied the application of these methods in the legal field with significant results. However, few attempts have been made to apply these methods to the legal cases arising from the COVID-19 pandemic and to develop an operational system.

To date, the lack of open, high-quality data on legal cases regarding the pandemic, the diversity of charges, and the complexity of cases make it challenging to build a reliable information retrieval model. To this end, in this paper, we use the small amount of available data to implement a WeChat-based information retrieval system for legal knowledge. The system can identify similar cases regarding outbreaks. Based on the description of the case entered by the user, the system returns either the most similar case or the reference legal gist that applies to the description. We collect and construct the relevant dataset ourselves. We also propose a semantic matching network based on the siamese structure as the basis for the system’s functional implementation. Moreover, we propose a relation learning module to help the network better model the relationship between the two sentences.

Figure 1 shows how we implement our system. Firstly, we use crawlers to collect data from the target websites and pre-process the data. Then, we use the data collected in the first step to train a neural network model that supports our system’s core functions. Finally, we deploy the model trained in the second step to the developer server, receive user input, and provide the user with the corresponding result. Because of the small amount of available data and the diversity of crimes, we provide the reference legal gist through text matching rather than case classification: the legal gist applied in the most similar case is returned as the target output.

Fig. 1 The flow chart of our system

The main contributions of this paper are as follows. (1) We propose an effective semantic matching network for implementing our system. (2) We design and implement a system to assist legal practitioners and ordinary people during the pandemic. (3) The system is deployed on the WeChat public platform, so it has a low access threshold and is user-friendly.

The rest of this paper is organised as follows. Section 2 presents the definition of the problem we are going to solve in this paper. Section 3 describes the procedure of data acquisition and preprocessing. Section 4 gives the details of the text encoder of our network. Section 5 discusses the relation learning module in our network. Section 6 presents the experimental analysis of our model. Section 7 discusses the deployment of our system. Section 8 compares our work with the related work. Finally, Sect. 9 concludes this paper with future work.

2 Problem definition

For each case entered by a user, we treat the factual description as a sequence of words, i.e.,

$$\begin{aligned} X= \{x_1,\ldots ,x_N\}, \end{aligned}$$

where N is the number of words. Our system’s input is a sequence of words, and its output is a reference case that is most similar to the input case or the reference legal gist applicable to that case. We implement the two functions based on similar case matching (we will detail their implementations in later sections).

3 Data preparation

This section describes the data acquisition and preprocessing.

3.1 Data acquisition

Fig. 2 Typical flow of crawling process

figure a

We use a crawler to acquire the relevant data online. Figure 2 shows a typical flow of the crawling process. Algorithm 1 shows the data acquisition and preparation steps based on this crawling process. On its lines 1–3, we derive the URL list used in this work from the online release office page of the Supreme People’s Procuratorate of the People’s Republic of China. Specifically, on line 1, we treat the online release office page as homepage L. Then, on line 2, we keep the URLs whose titles contain the keyword “typical cases of epidemic prevention and control crimes” to form the URL list. Finally, on line 3, we initialise the dataset file S for the subsequent storage of dataset information. On lines 4–9, we iterate over each URL in the URL list to construct our structured dataset file. Specifically, on line 6, we use the requests library in Python to retrieve the page source of each URL in the URL list. On line 7, we use the BeautifulSoup library in Python to parse the content of these page sources and save the parsed contents as HTML records for the URL list. On line 8, we use a simple rule-based filter to extract the case information we want from each case in the HTML records, including the facts and the type of crime.
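The title-based URL filtering described for line 2 of Algorithm 1 can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard-library `html.parser` instead of BeautifulSoup so that it runs without third-party packages, it operates on an HTML string rather than a downloaded page, and the names `TitleLinkFilter`, `filter_urls`, the sample markup, and the example paths are all hypothetical.

```python
from html.parser import HTMLParser


class TitleLinkFilter(HTMLParser):
    """Collect the href of every <a> tag whose link text contains a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.urls = []      # hrefs whose link text matched the keyword
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if self.keyword in "".join(self._text):
                self.urls.append(self._href)
            self._href = None


def filter_urls(page_source, keyword):
    """Return the links in page_source whose title contains keyword."""
    parser = TitleLinkFilter(keyword)
    parser.feed(page_source)
    parser.close()
    return parser.urls


# Hypothetical page source standing in for the downloaded homepage L.
SAMPLE_PAGE = """
<ul>
  <li><a href="/cases/1.html">Typical cases of epidemic prevention and control crimes (batch 1)</a></li>
  <li><a href="/news/2.html">Work report</a></li>
  <li><a href="/cases/3.html">Typical cases of epidemic prevention and control crimes (batch 2)</a></li>
</ul>
"""

url_list = filter_urls(SAMPLE_PAGE, "epidemic prevention and control crimes")
print(url_list)
```

In a real crawl, the page source would come from an HTTP request and the matched relative links would still need to be resolved against the site's base URL.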

Table 1 The statistics of the raw dataset
Table 2 Examples of formatted data after our processing

Table 1 shows the statistics of the collected data. Based on the stored information, we manually annotate each case with a legal gist according to the law applicable to its crime type. The constructed structured dataset consists of 97 items, which are stored in the format of “Crime”, “Law Gist”, and “Fact” (see Table 2).

figure b

Since we collected a relatively small number of case samples, we follow a slightly different approach to constructing positive and negative samples, i.e., we use the relative similarity between cases for classification. More specifically, we assume that a case is more similar to cases of the same crime type than to cases of a different crime type. We verify the validity of this hypothesis in Sect. 6.5.

Algorithm 2 shows the method we use to construct the final training data. On line 1, we initialise the list that stores the final training data. On line 2, we initialise a list of indexes used to filter crime types during data augmentation. On lines 3–19, we use data augmentation techniques to obtain the final training data. Specifically, on line 4 of the algorithm, we select all the cases from F that match crime type C[i]. On line 5, we remove index i from the list of indexes to exclude crime type C[i] from the subsequent data augmentation process. On line 6, we create a temporary copy of the index list to facilitate the subsequent data augmentation operations. On line 7, we obtain the number of crime types remaining after this exclusion. On line 8, we iterate through the cases and begin the data augmentation operation. On lines 9–16, for each case cs matching C[i], we construct negative samples by combining cs with one case randomly selected from each of the crime types ranked after C[i]. On lines 17–19, we construct positive samples by combining the current cs with each of the cases ranked after cs.
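The pairing scheme described above can be illustrated with the following minimal sketch. It is not a transcription of Algorithm 2: the function name `build_pairs`, the dictionary layout (crime type mapped to a list of fact descriptions), the fixed random seed, and the toy data are all assumptions made for illustration.

```python
import random


def build_pairs(cases_by_type, seed=0):
    """Build (y, X1, X2) tuples from a mapping: crime type -> list of facts."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    crime_types = list(cases_by_type)
    pairs = []
    for i, crime in enumerate(crime_types):
        cases = cases_by_type[crime]
        later_types = crime_types[i + 1:]  # crime types ranked after C[i]
        for j, cs in enumerate(cases):
            # Negative samples: cs with one random case of every later type.
            for other_type in later_types:
                pairs.append((0, cs, rng.choice(cases_by_type[other_type])))
            # Positive samples: cs with every later case of the same type.
            for other_case in cases[j + 1:]:
                pairs.append((1, cs, other_case))
    return pairs


# Toy data: three hypothetical crime types with 3, 2 and 1 cases.
toy = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2"],
    "C": ["c1"],
}
pairs = build_pairs(toy)
positives = [p for p in pairs if p[0] == 1]
negatives = [p for p in pairs if p[0] == 0]
print(len(positives), len(negatives))
```

Because each case is only paired with cases and crime types ranked after it, no unordered pair is generated twice.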

The following is a piece of the final training data:

figure c

As shown above, a piece of the final training data is a 3-tuple \((y, X_1, X_2)\), where \(X_1\) is a description of one case, \(X_2\) is a description of another case, and y is the similarity label. If \(X_{1}\) is similar to \(X_{2}\), label y is 1 (see line 18 of Algorithm 2); otherwise, \(y=0\) (see line 15 of Algorithm 2).

Fig. 3 The process of data augmentation operation

Fig. 4 The number distributions of two categories before and after data augmentation

3.2 Data augmentation by back translation

To increase the number of samples and the intra-class sample diversity, we use a complementary data augmentation technique, back translation, to extend our dataset. The method uses two translation tools: one translates Chinese into English, and the other translates English back into Chinese. To keep the method from affecting the sample balance of the dataset, we only apply it to the categories with fewer than nine samples. As shown in Table 1, these categories are: “Selling Fake and Inferior Products”, “Illegal Business”, “Illegal Hunting and Killing of Rare and Endangered Wildlife Protected By The State”, and “Illegal Acquisition and Sale of Rare and Endangered Wildlife”.

In this work, we employ Google Translate and Baidu Translate as translation tools for our data augmentation operations. The former is a well-known translation engine worldwide, while the latter is a top-ranked translation engine in China. We use Google Translate as our Chinese-English system and Baidu Translate as our English-Chinese system. The process of data augmentation is illustrated in Fig. 3, where Chinese is the source language. The number distribution of the two categories is shown in Fig. 4. We can see that our data construction method yields a more balanced training dataset, which helps us train a better-performing model.
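The back-translation step can be sketched as below. The sketch makes several assumptions: the two translators are passed in as plain callables (no real Google Translate or Baidu Translate API is called or implied), each original sample yields at most one paraphrase, duplicates are skipped, and the names `back_translate` and `augment_small_classes` and the toy translators are hypothetical.

```python
def back_translate(text, zh_to_en, en_to_zh):
    """Translate Chinese -> English -> Chinese with the two given callables."""
    return en_to_zh(zh_to_en(text))


def augment_small_classes(cases_by_type, zh_to_en, en_to_zh, threshold=9):
    """Add back-translated facts to crime types with fewer than `threshold` samples."""
    augmented = {}
    for crime, facts in cases_by_type.items():
        new_facts = list(facts)
        if len(facts) < threshold:
            for fact in facts:
                candidate = back_translate(fact, zh_to_en, en_to_zh)
                # Assumption: a paraphrase identical to an existing fact is skipped.
                if candidate not in new_facts:
                    new_facts.append(candidate)
        augmented[crime] = new_facts
    return augmented


# Toy stand-ins for the two translation engines (not real translators).
def toy_zh_to_en(text):
    return text.upper()


def toy_en_to_zh(text):
    return text.lower() + " (paraphrased)"


toy = {
    "Illegal Business": ["fact one", "fact two"],
    "Fraud": ["fact %d" % i for i in range(9)],
}
augmented = augment_small_classes(toy, toy_zh_to_en, toy_en_to_zh)
print({crime: len(facts) for crime, facts in augmented.items()})
```

Passing the translators as arguments keeps the selection rule (fewer than nine samples) separate from whichever translation service is actually used.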

Fig. 5 The architecture of proposed network

As shown in Fig. 5, our network consists of two modules: text encoder and relation learning module. We will discuss them in turn in the following two sections.

4 Text encoder

This section will present the text encoder of our network model.

Fig. 6 The architecture of a CNN for s word input sentence

Suppose there is a pair of case descriptions:

$$\begin{aligned} X_1&=\{w_{1,1},\ldots ,w_{1,m} \}, \end{aligned}$$
(1)
$$\begin{aligned} X_2&=\{w_{2,1},\ldots ,w_{2,n} \}, \end{aligned}$$
(2)

where \(w_{i,j}\) indicates the jth word in the ith sequence, and m and n are the lengths of the two cases, respectively. Then we convert the words in the text to their corresponding word embeddings:

$$\begin{aligned} E_1&= \{e_{1,1}, \ldots , e_{1,m}\} = W_e[V(X_1)], \end{aligned}$$
(3)
$$\begin{aligned} E_2&= \{e_{2,1}, \ldots , e_{2,n}\} = W_e[V(X_2)], \end{aligned}$$
(4)

where \(V( \cdot )\) denotes the function that converts the sequence to its indices in the corresponding vocabulary, \(W_e\) is the corresponding embedding matrix, and \(e_{i,j}\) is the vector corresponding to the jth word in the ith sentence.

The trainable word embedding is initialised from 300-dimension pre-trained Chinese Word Vectors (Li et al. 2018). For Chinese, the accuracy of word segmentation affects the subsequent semantic parsing process. Therefore, in practice, we use character embeddings, which can also convey strong semantics. All out-of-vocabulary tokens are mapped to a <UNK> token, which is randomly initialised to a 300-dimension vector.
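The index function \(V(\cdot)\) with character-level tokens and the <UNK> fallback can be sketched as follows. The function names, the choice of index 0 for <UNK>, and the toy texts are assumptions for illustration; the lookup of each index in the embedding matrix \(W_e\) is not shown.

```python
UNK = "<UNK>"


def build_vocab(texts):
    """Assign an index to every character seen in texts; index 0 is <UNK>."""
    vocab = {UNK: 0}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab


def to_indices(text, vocab):
    """Map a text to character indices, using <UNK> for unseen characters."""
    return [vocab.get(ch, vocab[UNK]) for ch in text]


# Toy corpus: "口罩" (mask) and "销售口罩" (selling masks).
vocab = build_vocab(["口罩", "销售口罩"])
# "假" (fake) was never seen, so it falls back to <UNK>.
indices = to_indices("销售假口罩", vocab)
print(indices)
```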

We then use a Convolutional Neural Network (CNN) (Krizhevsky et al. 2017) as a shared text encoder to produce the high-level semantic representation vector of all words for the two cases. We tried CNN, Bi-LSTM (Graves and Schmidhuber 2005), Bi-GRU (Cao et al. 2019) and BERT (Devlin et al. 2019) as the encoder in our experiments (see Sect. 6.4) and found that CNN performs best, so we choose CNN as the encoder in our system.

Figure 6 shows the CNN structure we used.

  1. Convolution. Suppose the input sentence is \(X_1\) (see formula (1)) and its embedding \(E_1\) (see formula (3)) is the sentence matrix. Thus, we have a sentence matrix of size \(m \times d\), where d is the embedding dimension. The convolution operation applies a filter \(W_k \in {\mathbb {R}}^{k \times d}\) to a window of k words to produce a new feature. For example, consider the subsequence \(S_j = [e_{1,j}; \cdots ; e_{1,{j+k-1}}]\), where [; ] denotes the concatenation operation. The convolution operation takes the dot product of \(W_k\) with \(S_j\), followed by a non-linear function, to generate a new feature \(c_{k, j}\), i.e.,

    $$\begin{aligned} c_{k, j} = \mathrm{{ReLU}}(W_k \cdot S_j + b_k), \end{aligned}$$
    (5)

    where \(b_k\) is the bias. Thus, the filter is applied to each possible window of k words in \(E_1\) to produce a feature map:

    $$\begin{aligned} \vec {c}_{k} = (c_{k,1}, \ldots , c_{k, m-k+1} ). \end{aligned}$$
    (6)

    To capture the different features, we use l filters (i.e., \(W=\{ w_{1,k}\), \(\ldots \), \(w_{l,k}\}\)) in the convolution operation. Then, the convolution result is a matrix:

    $$\begin{aligned} C_k = [c_{k,1}; \cdots ; c_{k,l}] \in {\mathbb {R}}^{l \times (m-k+1)}. \end{aligned}$$

    According to our experimental results, in practice, we use three kinds of convolution kernels with widths of 2, 3 and 4 (i.e., \(k=2, 3, 4\)). So we end up with three convolution results of \(C_2\), \(C_3\), and \(C_4\).

  2. Max pooling. After obtaining the features, we perform the max pooling operation on each feature map in the convolution result, i.e.,

    $$\begin{aligned} \overline{c}_{k}= {\max }\{c_{k,i}\mid i \in \{1, \ldots , l \}\}, \end{aligned}$$
    (7)

    meaning that the maximum value is used as the feature for that region. Putting them together, we have:

    $$\begin{aligned}&\overline{C} = \{ \overline{c}_1, \ldots , \overline{c}_l \}. \end{aligned}$$
    (8)

    The idea behind max pooling is to capture the essential features in each feature mapping.

Finally, we concatenate all the results obtained after max-pooling as the output of the CNN:

$$\begin{aligned} \vec {v}_1 = [\overline{C}_2;\; \overline{C}_3; \;\overline{C}_4]. \end{aligned}$$
(9)

Similarly, we perform the above operations on input \(X_2\) to obtain its output \(\vec {v}_2\).
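To make the convolution and pooling steps concrete, the following is a minimal pure-Python sketch of a text CNN encoder. It assumes the standard max-over-time pooling (the maximum over all window positions of each feature map) and uses hypothetical names (`conv_feature_map`, `encode`) and tiny hand-picked weights instead of trained 300-dimensional embeddings and 128 filters.

```python
def relu(x):
    return max(0.0, x)


def conv_feature_map(E, W, b):
    """Slide one filter W (k x d) over the sentence matrix E (m x d).

    Returns the feature map (c_1, ..., c_{m-k+1}) of formula (6).
    """
    k, m = len(W), len(E)
    feature_map = []
    for j in range(m - k + 1):
        total = b
        for r in range(k):
            # Dot product of filter row r with the embedding of word j + r.
            total += sum(w * e for w, e in zip(W[r], E[j + r]))
        feature_map.append(relu(total))
    return feature_map


def encode(E, filters_by_width):
    """Concatenate the max-pooled feature of every filter of every width."""
    v = []
    for k in sorted(filters_by_width):
        for W, b in filters_by_width[k]:
            v.append(max(conv_feature_map(E, W, b)))
    return v


# Toy sentence matrix with m = 4 tokens and d = 2 embedding dimensions.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
filters = {
    2: [([[1.0, 0.0], [0.0, 1.0]], 0.0)],               # one filter of width 2
    3: [([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]], -4.5)],  # one filter of width 3
}
v1 = encode(E, filters)
print(v1)
```

With l filters for each of the three widths, the same routine would return a vector with 3l entries, one per filter.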

5 Relation learning

This section discusses relation learning (the second part of our network model). After passing through the text encoder, the original input sequence pair \(\langle X_1, X_2\rangle \) is transformed into a new representation \(\langle v_1,v_2\rangle \). To enhance the model’s ability to recognise the relationship between the input sentences, we propose two modules: semantic matching (Sect. 5.1) and contrastive learning (Sect. 5.2). We use joint learning (Sect. 5.3) to train these two modules together.

5.1 Semantic matching

The semantic matching module learns the semantic relations between sentence pairs. Specifically, we first perform a nonlinear transformation on \(\vec {v}_1\) and \(\vec {v}_2\) as follows:

$$\begin{aligned}&\vec {v}'_{1}= \mathrm{{ReLU}}(W_s \vec {v}_1 + b_s), \\&\vec {v}'_{2} = \mathrm{{ReLU}}(W_s \vec {v}_2 + b_s), \end{aligned}$$

where ReLU (Rectified Linear Unit) (Nair and Hinton 2010) is a non-linear activation function as follows:

$$\begin{aligned} f(x) = \max \{0, x\}. \end{aligned}$$

Then, we use a heuristic matching technique (Wang et al. 2018) to model the relationship between sentence pairs:

$$\begin{aligned} \vec {z} = [\vec {v}'_{1}; \;\vec {v}'_{2}; \;(\vec {v}'_1 \circ \vec {v}'_2); \;(\vec {v}'_1 - \vec {v}'_2)], \end{aligned}$$

where [;] denotes the concatenation operation, and \(\circ \) denotes the element-wise product. Heuristic matching is considered more effective when combining multiple representations than simple concatenation and addition operations (Mou et al. 2016). The element-wise product and difference can be viewed as capturing different aspects of the relationship between the two sentences. Next, we employ an MLP (Multi-Layer Perceptron) (Gardner and Dorling 1998) with one hidden layer for label prediction:

$$\begin{aligned}&P(\hat{y} \mid X_{1}, X_{2}) = \mathrm{{MLP}}(\vec {z}). \end{aligned}$$
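The heuristic matching vector \(\vec{z}\) can be illustrated with a few lines of Python. The function name `heuristic_match` and the toy vectors are hypothetical, and the preceding ReLU transformation and the MLP are omitted.

```python
def heuristic_match(v1, v2):
    """Return z = [v1; v2; v1 * v2 (element-wise); v1 - v2]."""
    if len(v1) != len(v2):
        raise ValueError("both vectors must have the same dimension")
    product = [a * b for a, b in zip(v1, v2)]
    difference = [a - b for a, b in zip(v1, v2)]
    return list(v1) + list(v2) + product + difference


z = heuristic_match([1.0, 2.0], [3.0, 0.5])
print(z)
```

For input vectors of dimension n, the result has dimension 4n, which fixes the input size of the MLP.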

5.2 Contrastive learning

To enhance the model’s ability to distinguish between sentences, we propose an auxiliary task, contrastive learning. In this task, the model first measures the Euclidean distance between vectors \(\vec {v}_{1}\) and \(\vec {v}_{2}\) by

$$\begin{aligned} D(\vec {v}_1, \vec {v}_2) = \sqrt{(v_{1,1} - v_{2,1})^2 + \cdots + (v_{1,n} - v_{2,n})^2}. \end{aligned}$$
(10)

Then, to achieve the goal of this task, we use the loss function of contrastive learning (Hadsell et al. 2006), which ensures that semantically similar samples are closer in the embedding space. As a distance-based loss function, the contrastive loss aims to maximise the distance between features from different classes and minimise the distance between features from the same class. The contrastive loss is calculated as follows:

$$\begin{aligned} L_c=\frac{1}{K} \sum _{i=1}^{K} L(y_{i},\vec {v}_{i,1},\vec {v}_{i,2}), \end{aligned}$$
(11)

where K is the number of training pairs, and

$$\begin{aligned} L(y_{i},\vec {v}_{i,1},\vec {v}_{i,2}) = y_i (D(\vec {v}_{i,1}, \vec {v}_{i,2}))^2+(1 - y_i)\left( \max \{\alpha -D(\vec {v}_{i,1}, \vec {v}_{i,2}), 0\}\right) ^2, \end{aligned}$$

where \((y_{i},\vec {v}_{i,1},\vec {v}_{i,2})\) is the ith labelled sample pair, \(y_i \in \{0,1\}\) is the similarity label of the ith sample pair, \(D(\vec {v}_{i,1}, \vec {v}_{i,2})\) is the Euclidean distance of the ith sample pair, and \(\alpha \) is a margin value that represents our tolerance for the distance between dissimilar features.

The Euclidean distance is affected by the vector dimensionality and fluctuates widely in its range of values. Therefore, to avoid these problems, we first perform L2-normalisation of the vectors \(\vec {v}_1\) and \(\vec {v}_2\) before calculating the distance as follows:

$$\begin{aligned}&\vec {u}_1 = \frac{\vec {v}_1}{\sqrt{\sum ^n_{i=1}(v_{1,i})^2}}, \\&\vec {u}_2 = \frac{\vec {v}_2}{\sqrt{\sum ^n_{i=1}(v_{2,i})^2}}. \end{aligned}$$

Then, we replace vectors \(\vec {v}_1\) and \(\vec {v}_2\) in formula (10) by \(\vec {u}_1\) and \(\vec {u}_2\).
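The contrastive loss with L2-normalisation can be sketched in plain Python as follows. This is an illustration of formulas (10) and (11) only: the function names, the handling of all-zero vectors, and the toy sample pairs are assumptions, and no gradient computation is shown.

```python
import math


def l2_normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)  # assumption: an all-zero vector is left unchanged
    return [x / norm for x in v]


def euclidean(u1, u2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u1, u2)))


def contrastive_loss(samples, alpha=1.0):
    """Average contrastive loss over (y, v1, v2) tuples with margin alpha."""
    total = 0.0
    for y, v1, v2 in samples:
        d = euclidean(l2_normalise(v1), l2_normalise(v2))
        total += y * d ** 2 + (1 - y) * max(alpha - d, 0.0) ** 2
    return total / len(samples)


samples = [
    (1, [3.0, 4.0], [6.0, 8.0]),  # similar pair, same direction: d = 0
    (0, [1.0, 0.0], [0.0, 1.0]),  # dissimilar pair beyond the margin
    (0, [1.0, 0.0], [2.0, 0.0]),  # dissimilar pair mapped to the same point
]
loss = contrastive_loss(samples)
print(loss)
```

After normalisation both vectors have unit length, so the squared distance equals \(2 - 2\cos \theta \), where \(\theta \) is the angle between them, and the distance always lies in [0, 2] regardless of the vector dimensionality.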

5.3 Joint learning

For the semantic matching module, we use Cross-Entropy as the loss function:

$$\begin{aligned} L_s = -y \log P(\hat{y} \mid X_1, X_2). \end{aligned}$$

Then we integrate the above loss function with the contrastive loss function (i.e., formula (11)) as follows:

$$\begin{aligned} {\mathcal {L}} = (1-\lambda ) L_s + \lambda L_c, \end{aligned}$$
(12)

where \(\lambda \) is the hyper-parameter that controls the weight of auxiliary loss.

Finally, the whole network is trained by minimising the joint objective function (12).
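The weighting in formula (12) can be illustrated as follows. The sketch assumes the usual binary form of the cross-entropy for a two-class similarity label, and the function names, the clipping constant `eps`, and the example values of \(\lambda \) and of the two losses are hypothetical.

```python
import math


def cross_entropy(y, p_similar, eps=1e-12):
    """Binary cross-entropy for label y and predicted probability of y = 1."""
    p = min(max(p_similar, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def joint_loss(l_s, l_c, lam):
    """Joint objective (1 - lambda) * L_s + lambda * L_c of formula (12)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda is expected to lie in [0, 1]")
    return (1.0 - lam) * l_s + lam * l_c


l_s = cross_entropy(1, 0.5)  # an uncertain prediction for a similar pair
total = joint_loss(l_s, l_c=0.2, lam=0.25)
print(l_s, total)
```

Setting \(\lambda = 0\) recovers the pure semantic matching objective, which corresponds to removing the contrastive term.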

Table 3 The statistics of our dataset

6 Experiment

This section will present the experimental evaluations on our system.

6.1 Datasets and experimental settings

The whole dataset contains 1350 items. Table 3 shows the detailed statistics of the dataset. In our experiments, the maximum sequence length is set to 512. The model is implemented with the PyTorch framework. For the BERT model, we use the BertAdam optimiser with a learning rate of \(5 \times 10^{-4}\) to minimise the loss during training and set the batch size to 16. For the other models, we use the Adam optimiser with a learning rate of \(1 \times 10^{-3}\) and set the batch size to 16. We use 128-dimensional filters for the CNN with widths of 2, 3, and 4, respectively. We set the hidden size to 300 for all the GRU and LSTM layers and the number of layers to 1. The margin \(\alpha \) is set to 1. We use the F1-score on the validation set as the early-stopping criterion for all experiments.

6.2 Baselines

We compare our network with the following four semantic matching models on the dataset we have constructed.

  • ConvNet (Severyn and Moschitti 2015): ConvNet uses a convolutional network to derive features from text pairs. It then uses various pooling techniques to map the text pair features. A trainable matrix is used to compute similarity features. Finally, sentence pair features, similarity features and additional manual features are concatenated and passed into the fully connected layer for label prediction.

  • MatchPyramid (Pang et al. 2016): MatchPyramid models text matching task as image recognition. It first uses input text pairs to obtain a Matching Matrix and then uses Hierarchical Convolution to capture rich information about matching patterns. Finally, an MLP is used for label prediction.

  • ESIM (Chen et al. 2017): ESIM is a hybrid neural inference model containing four components: Input Encoding, Local Inference Modelling, Inference Composition and Prediction. The Input Encoding module is used to encode the input text. Local inference modelling computes the differences between the representations obtained from Input Encoding to achieve local information inference for the sequence. Inference Composition performs combinatorial learning of the local inference information and its context. Finally, the final vector v is used to do label prediction.

  • Sentence-BERT (SBERT) (Reimers and Gurevych 2019): SBERT is a variant of the pre-trained BERT network, which uses siamese structures to obtain sentence embeddings with semantic information. It performs a pooling operation on each of the sentence representations obtained from BERT to obtain two fixed-size sentence vectors u and v. Then the vectors u, v, and \(|u-v|\) are concatenated for label prediction.

6.3 Evaluation criteria

We evaluate the performance of the above models in terms of Accuracy, Precision, Recall, and F1-score, which are calculated as follows:

$$\begin{aligned}&{Accuracy} = \frac{T\!P+T\!N}{T\!P+F\!P+F\!N+T\!N}, \end{aligned}$$
(13)
$$\begin{aligned}&\quad { Precision} = \frac{T\!P}{T\!P+F\!P}, \end{aligned}$$
(14)
$$\begin{aligned}&\quad {Recall} = \frac{T\!P}{T\!P+F\!N}, \end{aligned}$$
(15)
$$\begin{aligned}&\quad { F1}= \frac{2 \times Recall \times Precision}{Recall+Precision}, \end{aligned}$$
(16)

where

  • \(T\!P\) denotes the number of samples for which both the predicted value and the true value are 1;

  • \(T\!N\) denotes the number of samples for which both the predicted value and the true value are 0;

  • \(F\!P\) denotes the number of samples for which the predicted value is 1 and the true value is 0; and

  • \(F\!N\) denotes the number of samples for which the predicted value is 0 and the true value is 1.
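Formulas (13)–(16) can be checked on a small example with the sketch below. The function names, the convention of returning 0 when a denominator is zero, and the toy labels are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN and TN for binary labels (1 = similar, 0 = dissimilar)."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(tp, fp, fn, tn):
    """Compute Accuracy, Precision, Recall and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Assumption: a metric with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * recall * precision / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
result = scores(*counts)
print(counts, result)
```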

Furthermore, we compute the inference time for each model to compare the efficiency of the models. In this paper, we use the time taken by the models to process each pair of input texts as the inference time in milliseconds.

Table 4 Experimental result of different models on our test set, where CL stands for Contrastive Learning
Fig. 7 Experimental result of different models on our test set

Fig. 8 Ablation result of our model on test set

6.4 Results and analysis

Table 4 shows the overall experimental results of the comparison. Figure 7 shows that our proposed network significantly outperforms the baseline models with respect to almost all the evaluation criteria, which suggests that the relation learning module in our network is effective. Although SBERT outperforms our network with respect to Recall, our network has better overall performance and takes much less time to infer than SBERT. We also perform an ablation study to analyse the impact of contrastive learning on network performance, and the experimental result is shown in Fig. 8. The results show that the introduction of contrastive learning leads to performance improvements. We also analyse the model performance when using Bi-LSTM, Bi-GRU and BERT as encoders and find that the best results are obtained using CNN.

Fig. 9 Cosine similarity of positive and negative samples in the test set (the higher similarity the more similar the two samples)

Table 5 Mapping relationships between the crime types involved and the letters in Fig. 9

The above experimental observations may imply:

  1. When applying deep neural networks to real-world scenarios, some state-of-the-art (SOTA) neural networks may not perform as well as expected. Problems should be dealt with case by case.

  2. With a few-shot dataset, contrastive learning may be a good way to improve the representation learning ability of the model.

6.5 Distance distribution of samples

Figure 9 visualises the cosine similarity of intra-class and inter-class samples for all the crime types of the test data. The bars with stars indicate the similarity of intra-class (similar) samples, and the bars with slashes indicate the similarity of inter-class (dissimilar) samples. Specifically, for \(X_1\) (one case) in each 3-tuple \((y, X_1, X_2)\) of each type of criminal case, we calculate its average similarity with all the cases of the same crime type and its average similarity with all the cases of different crime types. For the convenience of visualisation, we map each crime type to a letter. Table 5 shows the correspondence.

The similarity between \(\vec {u}_1\) and \(\vec {u}_2\) is calculated by:

$$\begin{aligned} \text{ similarity }(\vec {u}_1, \vec {u}_2)&= \frac{\sum ^n_{i=1} u_{1,i} \cdot u_{2,i}}{\sqrt{\sum ^n_{i=1} u_{1,i}^2} \times \sqrt{\sum ^n_{i=1} u^2_{2,i}}}, \end{aligned}$$
(17)

which is cosine similarity. We take the average of the cosine similarities calculated for each class’s positive and negative samples separately to represent the intra-class and inter-class sample similarity. The results show that the hypothesis that the intra-class samples are more similar than the inter-class samples is reasonable, and our method can distinguish the two texts well in most cases.
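The averaging of intra-class and inter-class cosine similarities can be sketched as follows. The function names, the zero-vector convention, and the toy two-dimensional vectors are hypothetical; real sentence vectors would come from the trained encoder.

```python
import math


def cosine_similarity(u1, u2):
    """Cosine similarity of formula (17)."""
    dot = sum(a * b for a, b in zip(u1, u2))
    norm1 = math.sqrt(sum(a * a for a in u1))
    norm2 = math.sqrt(sum(b * b for b in u2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: similarity with a zero vector is taken as 0
    return dot / (norm1 * norm2)


def mean_similarity(query, others):
    """Average cosine similarity between one case vector and a list of others."""
    return sum(cosine_similarity(query, o) for o in others) / len(others)


# Toy vectors for one query case, cases of its own type, and cases of other types.
query = [1.0, 0.0]
same_type = [[1.0, 0.1], [0.9, 0.0]]
other_types = [[0.0, 1.0], [0.1, 1.0]]

intra = mean_similarity(query, same_type)
inter = mean_similarity(query, other_types)
print(intra, inter)
```

The hypothesis of Sect. 3.1 holds for a crime type when its intra-class average is higher than its inter-class average, as in this toy example.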

Fig. 10 The tips for new followers in Chinese and English. When new users follow our WeChat Official Account, the system will give relevant instructions on using the system

figure d
Fig. 11 An example that a real user uses our system. A real user entered a case about selling masks at an excessive price, and the system responded with a similar case for that user

7 Deployment

This section will present the details of how we deploy our system.

figure e

Once the model is trained, we deploy it on the developer server to provide the service to users. As shown in Fig. 10, we customised the “Followed Reply” function of the WeChat Official Account. When new users enter the dialogue window, the system automatically gives a brief introduction to the system functions together with usage examples. Users can then enter content according to the system’s prompts and get the appropriate responses.

Algorithm 3 shows how the system processes the user’s input and provides the corresponding result. Specifically, we first use string operations to extract, from the user’s input, the case description and a character representing the type of response the user wants. If the first character of the user’s original input is 1, we return the matching case with the highest similarity as the result (see lines 1–4). Otherwise, the matching legal gist is returned to the user as the reference result (see lines 5–8). We then pre-process the user’s case description and encode it using the trained model to obtain a vector representation. Next, we obtain the desired result by matching the vector representation of the input case against the vectors in the matching database. The whole matching database can be found in the Appendix. Figure 11 shows an example of using our system.
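The string handling at the start of this procedure might look like the sketch below. The exact message format is not specified here, so the sketch assumes that the first character is the response-type flag and the remainder is the case description; the function name `parse_user_input` and the mode label `"case"` are hypothetical.

```python
def parse_user_input(message):
    """Split a user message into (mode, case description).

    Assumption: the first character selects the response type and the rest
    of the message is the case description.
    """
    message = message.strip()
    if not message:
        raise ValueError("empty input")
    # "1" asks for the most similar case; anything else asks for the legal gist.
    mode = "case" if message[0] == "1" else "gist"
    description = message[1:].strip()
    return mode, description


print(parse_user_input("1 A shop sold masks at an excessive price."))
print(parse_user_input("2 A shop sold masks at an excessive price."))
```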

As shown in Algorithm 4, to speed up the response of the system, our matching process is divided into two parts: coarse-sorting and fine-sorting. We first perform the coarse-sorting process. On lines 1 and 2, we take all the vectors corresponding to the coarse fields in the matching database as the set of vectors to be retrieved and retrieve the index number idxCoarse of the nearest-neighbour vector of the query vector Inp. The index number idxCoarse reflects the crime type of the input case. On line 3, we use idxCoarse to construct the primary key key so that we can query the matching database. On line 4, we extract the corresponding crime type from the matching database based on key. If parameter mode is equal to ‘gist’, on lines 5–7, we simply extract the corresponding case gist from the matching database based on key as the return result, without entering the fine-sorting process. Otherwise, on lines 8–13, we proceed with the fine-sorting process. On line 9, based on key, we extract the retrieval range of the local doc vectors corresponding to crime type c from the matching database. On line 10, we extract the new set of vectors to be retrieved from the local doc vectors. On line 11, we obtain the new index number idxFine of the nearest-neighbour vector of the query vector. On line 12, we extract all cases corresponding to crime type c from the matching database. On line 13, using the index number idxFine, we get the most similar case from these cases as the return result.
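The two-stage matching can be sketched as below. This is not Algorithm 4 itself: the layout of the matching database (one entry per crime type with a single coarse vector, a gist, a list of cases and their document vectors), the use of Euclidean distance for the nearest-neighbour search, and all names and toy values are assumptions made for illustration.

```python
import math


def nearest(query, vectors):
    """Return the index of the vector closest to query (Euclidean distance)."""
    best_idx, best_dist = -1, math.inf
    for idx, vec in enumerate(vectors):
        dist = math.sqrt(sum((q - x) ** 2 for q, x in zip(query, vec)))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def match_case(inp, database, mode):
    """Coarse-sort by crime type, then optionally fine-sort within that type."""
    # Coarse-sorting: pick the crime type whose coarse vector is nearest.
    idx_coarse = nearest(inp, [entry["coarse"] for entry in database])
    entry = database[idx_coarse]
    if mode == "gist":
        return entry["gist"]
    # Fine-sorting: search only the document vectors of that crime type.
    idx_fine = nearest(inp, entry["doc_vectors"])
    return entry["cases"][idx_fine]


# Hypothetical matching database with two crime types.
database = [
    {"crime": "A", "gist": "gist A", "coarse": [1.0, 0.0],
     "cases": ["a1", "a2"], "doc_vectors": [[1.0, 0.1], [0.9, -0.2]]},
    {"crime": "B", "gist": "gist B", "coarse": [0.0, 1.0],
     "cases": ["b1", "b2"], "doc_vectors": [[0.1, 1.0], [-0.3, 0.8]]},
]
query = [0.0, 0.9]
print(match_case(query, database, "gist"))
print(match_case(query, database, "case"))
```

The coarse stage compares the query with one vector per crime type, so the fine stage only has to scan the cases of a single type instead of the whole database.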

8 Related work

This section will discuss the related work to show how our work advances the state of the art in relevant research areas.

8.1 Legal information retrieval

Legal IR has become one of the essential topics in the research area of Artificial Intelligence and Law (AI & Law) (Westermann et al. 2019; Šavelka and Ashley 2022).

  • Some researchers focus on developing easy-to-use legal IR systems. For example, Nejadgholi et al. (2017) developed a semi-supervised learning-based legal IR system for immigration cases. The system can find cases in which the fact description is the most similar to the query. Sugathadasa et al. (2018) also developed a legal IR system. It uses a page rank graph network with TF-IDF (Mao and Chu 2002), which builds document embeddings by constructing a vector space to perform document retrieval tasks.

  • Some emphasise proposing effective IR methods for specific legal cases. For example, Savelka et al. (2019) and Novotná (2020) employ static word embeddings and cosine similarity to calculate the similarity between a judgement and the query. Šavelka and Ashley (2022) further leverage static word embeddings and various topic models to retrieve proper sentences from a case law database based on the phrase in a given clause.

However, our work in this paper is different from the above work in the following aspects.

  (1) These methods usually retrieve specific cases (e.g., immigration, veterans’ claims, or others) because they are trained on specific case datasets. As a result, few can retrieve legal cases of the COVID-19 pandemic. Instead, our system focuses on retrieving legal cases related to the COVID-19 pandemic.

  (2) Furthermore, although these methods are effective and interpretable, they cannot utilise the semantic information of sentences to identify similarities between sentences because they use Boolean rules or word embedding techniques to calculate the similarity between sentences. Unlike them, our system uses a convolutional neural network model trained on the COVID-19 pandemic dataset to capture the semantic information between the data.

In addition, unlike our work in this paper, few existing studies developed datasets on legal cases of the COVID-19 pandemic. For example, Ma et al. (2021c) developed a Chinese legal case retrieval dataset on general criminal cases. Rabelo et al. (2022) produced a common law case retrieval dataset based on the Federal Court of Canada case law database. They are both well-known large-scale legal case retrieval datasets. However, neither of them involves legal cases related to COVID-19.

8.2 CNN-based methods for legal information retrieval

Some researchers use CNNs to improve the retrieval performance of legal IR, but our work in this paper differs from theirs. For example, Tran et al. (2019) used CNNs to implement document and sentence-level pooling, achieving the state-of-the-art result on the COLIEE dataset (Kano et al. 2018). Wan and Ye (2021) proposed a TinyBERT-based Siamese-CNN model to calculate the similarity between judgment documents. Specifically, they first used TinyBERT to obtain the embedding of documents and then extracted document features with a CNN to calculate the similarity of judgment documents. Sampath and Durairaj (2022) used CNNs to obtain similarity features of substructures (i.e. parts of the whole legal document) and used these features to further improve the retrieval performance of the model. Although we also use CNNs to complete the pooling step, unlike them, we propose a CNN-based semantic matching network for pairwise relationship learning and use contrastive learning to help the network distinguish between sentences.
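As a rough illustration of what a contrastive objective does, the sketch below computes a generic InfoNCE-style loss for a single anchor sentence embedding. It is a common textbook formulation chosen for illustration and should not be read as the exact auxiliary objective of our network; the function name `info_nce`, the temperature value, and the toy two-dimensional embeddings are all assumptions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax of the positive's score.

    Inputs are assumed to be unit-length embeddings (an assumption of this sketch).
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(value - m) for value in logits))
    return log_denominator - logits[0]


# The loss is small when the anchor is close to its positive and far from the negative,
# and large when the roles are swapped.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
swapped = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned, swapped)
```

Minimising such a loss pulls matching sentence pairs together in the embedding space and pushes non-matching pairs apart.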

8.3 Pre-trained language model based methods for legal information retrieval

Some researchers apply pre-trained language models to legal IR. Rossi and Kanoulas (2019) used the pre-trained language model BERT to reformulate the ranking problem in retrieval as a pairwise relevance scoring problem to improve legal IR. Shao et al. (2020) used BERT to build a paragraph-level interaction model to calculate the relevance between two cases and to complete document matching. Xiao et al. (2021) developed a pre-trained language model, called Lawformer, for long legal document understanding. The model is based on the pre-trained language model Longformer and integrates local and global attention to capture long-range dependencies between words in a lengthy legal document. The model can be used for similar legal case retrieval. Askari et al. (2021) integrate lexical and neural ranking models for similar legal case retrieval. They used multiple methods (including term extraction and automatic summarisation based on the Longformer Encoder-Decoder) to create a shorter query document from a longer one. Their experiments show the excellent performance of their model on the COLIEE retrieval benchmarks. Wehnert et al. (2021) fine-tuned a BERT classifier for the statute law retrieval task and combined it with TF-IDF to vectorise the documents. Their approach outperforms most baseline methods. Zhu et al. (2022) proposed a two-stage BERT-based ranking method and integrated it with multi-task learning to further improve the retrieval performance of the model. Abolghasemi et al. (2022) used multi-task learning to fine-tune the BERT model to learn document-level representations. Their approach improves the efficiency of the BERT re-ranker in similar legal case retrieval.

Although these methods based on pre-trained language models have improved legal IR performance, they require substantial computational resources, which makes online deployment difficult. In contrast, the core of our system is a lightweight CNN-based retrieval model that is easily deployed online and responds in real time.

8.4 COVID-19 information retrieval

With the impact of the COVID-19 pandemic in recent years, there has been an explosion of information about COVID-19. To this end, some researchers study information retrieval of COVID-19. For example, Wise et al. (2020) first proposed a heterogeneous knowledge graph on COVID-19 for extracting and visualising the relationships among COVID-19 scientific articles. Then they developed a document similarity engine by combining graph topology information with semantic information, which utilises low-dimensional graph embeddings and semantic embeddings from the knowledge graph for similar article retrieval. Esteva et al. (2021) designed a semantic and multi-stage search engine for the COVID-19 literature, which helps healthcare workers to find scientific answers and avoid misinformation in times of crisis. Alzubi et al. (2021) developed a retriever-reader dual algorithm called COBERT, which can retrieve literature on COVID-19 pneumonia. Tran et al. (2021) developed a scientific paper retrieval system that helps users efficiently access knowledge in the large number of rapidly published COVID-19 scientific papers. Aonillah et al. (2022) developed a COVID-19 question answering system in Bahasa Indonesia using recognising question entailment.

However, these systems support only general knowledge retrieval and do not involve legal information. Unlike them, the system we developed mainly provides users with legal knowledge related to COVID-19 pneumonia and retrieves specific real legal cases for users’ reference.

8.5 Deep neural network based information retrieval

Due to the successful application of deep neural networks in natural language processing, many deep neural network based information retrieval methods have been proposed for document retrieval. Liu et al. (2019) used a hybrid approach to calculate the similarity between texts. They first extract keywords from the text to obtain the corresponding word vectors. Then, they combine the obtained word vectors with the statistical vectors of predefined text feature words to compute the similarity between two legal cases. Khattab and Zaharia (2020) proposed a ranking model called ColBERT, which uses BERT to encode queries and documents independently to accomplish end-to-end retrieval of millions of documents. Liu et al. (2021) developed a four-tower BERT model that uses the distance between hard negative instances and simple negative instances to enhance the performance of the retriever during training. The studies mentioned above are based on the siamese structure, consisting of two or more consistent sub-networks (e.g., CNN, LSTM or BERT). These sub-networks extract features from two or more inputs separately during training. Since each input is encoded independently, this structure offers high retrieval efficiency and is suitable for deployment in actual production environments.
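The efficiency of the siamese structure can be seen in a minimal sketch: one shared encoder is applied to every input, so document vectors can be computed once offline and only the query is encoded at retrieval time. The normalised bag-of-words encoder below is merely a stand-in for a CNN, LSTM, or BERT sub-network, and the names `encode`, `similarity`, and `SiameseIndex` are hypothetical.

```python
import math
from collections import Counter


def encode(text):
    """Shared encoder used for every input (the twin branches share it).

    A unit-length bag-of-words stands in for a learned sub-network.
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {token: c / norm for token, c in counts.items()} if norm else {}


def similarity(u, v):
    """Dot product of two sparse vectors (cosine, since both have unit length)."""
    return sum(weight * v.get(token, 0.0) for token, weight in u.items())


class SiameseIndex:
    """Encodes the documents once offline; only the query is encoded online."""

    def __init__(self, documents):
        self.documents = list(documents)
        self.vectors = [encode(d) for d in self.documents]

    def search(self, query, k=1):
        q = encode(query)
        scores = [similarity(q, v) for v in self.vectors]
        order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        return [(self.documents[i], scores[i]) for i in order[:k]]


index = SiameseIndex([
    "selling counterfeit masks during the outbreak",
    "refusing quarantine after returning from abroad",
])
print(index.search("counterfeit masks sold online"))
```

A cross-encoder, by contrast, would have to process every query-document pair jointly at query time, which is why the siamese design is the usual choice for online retrieval.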

Similar to them, we also employ a siamese structure for online retrieval tasks. However, they have not yet used self-supervised learning based on the siamese structure to improve task performance. In contrast, we use self-supervised learning in this paper to enhance the semantic correlation between legal cases. We also analyse the effects of the self-supervised learning network based on the siamese structure and demonstrate that the network performs better than the baseline model.

8.6 BERT-based information retrieval

Since the BERT model was introduced in 2018, researchers have presented many BERT-based methods for various IR tasks and achieved impressive results. These methods fall into three categories according to the size of the datasets. The first category is large-scale IR, which uses enormous data resources to train models for optimal performance. For example, the BERT-based CogLTX framework proposed by Ding et al. (2020) shows good performance on four large datasets (NewsQA (Trischler et al. 2017), HotpotQA (Wharton et al. 1994), 20NewsGroups (Lang 1995), and Alibaba), each of which contains tens of thousands of documents. Li and Gaussier (2021) utilise 500,000 news articles and 25 million web documents to train a standard BERT model for long document retrieval. Liu et al. (2021) proposed a four-tower model focusing on the retrieval phase and demonstrated its effectiveness in large-scale retrieval. Guo et al. (2022) adopt the same transformer architecture as BERT and utilise nearly three million web pages with HTML sources and their tree structure to pre-train an information retrieval model called Webformer. Their model outperforms existing pre-trained models in web page retrieval tasks. However, these methods require large amounts of training data for feature extraction and representation learning; such data are difficult to acquire and prohibitively expensive to annotate manually. Instead, our work investigates the similarity computation performance of deep learning networks for complex datasets with small samples and explores a feasible solution in practice, especially in the legal domain.

The second category of BERT-based IR methods is few-shot IR. These methods use the small amount of available data to fine-tune BERT to accomplish the retrieval task. Maillard et al. (2021) developed a single general-purpose “universal” BERT-based retrieval model that uses a multi-task approach for training and performs comparably to, or even better than, specialised retrievers in few-sample retrieval. Yu et al. (2021) proposed a conversational dense retrieval system, ConvDR, which retrieves documents by the dot product of learned embeddings. They also use a BERT-based teacher-student framework that enables ConvDR to learn from very few samples, addressing the problem of data hunger. Mokrii et al. (2021) investigated the transferability of BERT-based ranking models and comprehensively compared transfer learning with models trained on pseudo-labels generated using the BM25 scoring function. Their study shows that fine-tuning models trained on pseudo-labels with a small number of annotated queries can match or exceed the performance of transferred models. All the above methods primarily use BERT to improve retrieval performance in scenarios with a small amount of training data. These models show that BERT is good at dealing with few-sample scenarios. However, in our ablation experiments, we replaced the CNN encoder with BERT and found that BERT led not to a performance improvement but to a degradation. Therefore, BERT is not always effective in few-sample real-world IR scenarios. Furthermore, large language models tend to overfit when the data for fine-tuning are insufficient. Therefore, the choice of model should be made according to the specific task.
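For reference, the BM25 scoring function mentioned above as a source of pseudo-labels can be sketched as follows. The code uses one common variant of the formula (with a non-negative inverse document frequency and the conventional defaults k1 = 1.2 and b = 0.75); turning the scores into labels by marking the `top_k` highest-scoring documents as positives is an illustrative assumption, not the procedure of Mokrii et al. (2021).

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenised = [d.lower().split() for d in documents]
    n_docs = len(tokenised)
    avg_len = sum(len(tokens) for tokens in tokenised) / n_docs
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = 1.0 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def pseudo_labels(query, documents, top_k=1):
    """Mark the top_k BM25-ranked documents as pseudo-positives (1), the rest as 0."""
    scores = bm25_scores(query, documents)
    ranked = sorted(range(len(documents)), key=scores.__getitem__, reverse=True)
    positives = set(ranked[:top_k])
    return [1 if i in positives else 0 for i in range(len(documents))]


cases = ["mask price gouging case", "quarantine violation case", "mask fraud"]
print(pseudo_labels("mask fraud", cases))
```

Labels produced this way are noisy because BM25 relies on exact term overlap, which is why a small number of annotated queries is still needed for fine-tuning.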

The third category of BERT-based IR methods is zero-shot IR, such as cross-language zero-shot IR (MacAvaney et al. 2020; Wang et al. 2021; Nair et al. 2022), zero-shot self-supervised learning (Assareh 2022) and zero-shot neural paragraph retrieval (Ma et al. 2021a). However, these methods are not as effective as the first two above in practical applications. Therefore, we do not discuss them here.

8.7 Intelligent legal system

Benefiting from a large number of available high-quality textual datasets (Locke and Zuccon 2018; Duan et al. 2019; Zhong et al. 2020b) and web-based public data in the legal field, researchers have recently studied extensively how natural language processing can empower the legal field on many topics, such as:

  (1) Legal opinion generation. Ye et al. (2018) proposed a model for generating legal opinions based on factual descriptions and used the generated legal opinions to improve the interpretability of their crime prediction system.

  (2) Legal judgment prediction. Zhong et al. (2020a) proposed a reinforcement learning-based model for predicting interpretable judgments. Ma et al. (2021b) used a real legal dataset to reasonably predict the court judgment of a case according to the plaintiff’s claims and their court debate. Feng et al. (2022) proposed an event-based prediction model with constraints for legal judgments, which addresses the failure of existing methods to locate the critical event information that determines the judgment.

  (3) Legal document summarisation. Nguyen et al. (2021) used reinforcement learning to train a model for legal document summarisation. Specifically, they used a proximal policy optimisation approach and proposed a new reward function that encourages the generation of summarisation candidates that satisfy lexical and semantic criteria. Klaus et al. (2022) fine-tuned a transformer-based extractive summarisation model to simplify the statutes, so non-jurists can easily understand them.

Although our system is also an intelligent legal system, it differs from the above systems. They explore task-specific improvements on large datasets and demonstrate the feasibility of the methods under specific designs. In contrast, in this paper, we aim to explore the usability of neural network models, including the currently popular pre-trained language models, in implementing a matching-style legal IR system for the COVID-19 pandemic with small samples.

9 Conclusion

We propose a semantic matching network for pairwise relation learning. Moreover, we introduce auxiliary contrastive learning to help the network better distinguish between sentences. Our experiments show that our method leads to performance improvements under a variety of encoder designs.

Based on the proposed network, we design and implement a legal IR system for the COVID-19 pandemic. Given a crime case described by the user, the system can (1) find the most similar cases and push them to the user as answers, and (2) give the reference legal gist applicable to the case. Meanwhile, the study could benefit the development of more comprehensive legal IR systems and similar systems. In the future, it is worth conducting more experiments on more datasets to analyse the effect of various neural models on such tasks and to improve the system accordingly.