PRILJ: an efficient two-step method based on embedding and clustering for the identification of regularities in legal case judgments

In an era characterized by fast technological progress that introduces new, unpredictable scenarios every day, working in the law field may appear very difficult without the support of the right tools. In this respect, several systems based on Artificial Intelligence methods have been proposed in the literature to support tasks in the legal sector. Following this line of research, in this paper we propose a novel method, called PRILJ, that identifies paragraph regularities in legal case judgments, to support legal experts during the drafting of legal documents. Methodologically, PRILJ adopts a two-step approach that first groups documents into clusters, according to their semantic content, and then identifies regularities in the paragraphs of each cluster. Embedding-based methods are adopted to represent documents and paragraphs in a semantic numerical feature space, and an Approximated Nearest Neighbor Search method is adopted to efficiently retrieve the paragraphs most similar to those of a document under preparation. Our extensive experimental evaluation, performed on a real-world dataset provided by EUR-Lex, proves the effectiveness and the efficiency of the proposed method. In particular, its ability to model different topics of legal documents, as well as to capture the semantics of the textual content, appears very beneficial for the considered task, and makes PRILJ very robust to the possible presence of noise in the data.


Introduction
The actions of members within a community are usually regulated by an enforced system of rules that aims to ensure equality, fairness, and justice within the community. When the community is a nation, these binding rules of conduct are usually referred to as the law.
Unlike most fields where Computer Science acts as a boost for daily activities, the law may appear to be a static system that often responds and adapts too slowly. To alleviate this issue, researchers are putting significant effort into designing advanced (also automated) solutions to improve the efficiency of processes in the legal sector. In this context, a strong contribution may come from the Artificial Intelligence (AI) field. Among the attempts made in this direction, we can mention the work by Mandal et al. (2017), where the authors applied AI techniques to measure the similarity among legal case documents, which can be useful to speed up the identification and analysis of judicial precedents. Another relevant example is the work by Medvedeva et al. (2020), where the authors consider the semi-automation of some legal tasks, such as the prediction of judicial decisions of the European Court of Human Rights.
Following this line of research, in this paper we propose an AI method that can support human legal experts during their activity of writing legal case judgments, by exploiting lexical and semantic similarity at two different degrees of granularity. Indeed, legal case judgments can be considered, represented, and analyzed as whole documents, or according to their summaries, paragraphs, sentences, or reasons for citation. In this work, we analyze them both as whole documents and according to their paragraphs, with the goal of identifying paragraph regularities among legal case judgments. In particular, given a (possibly incomplete or under preparation) document, henceforth called the target document, our system supports the retrieval of paragraphs that are semantically similar to those of the target document and that appear in the set of reference documents related to previously transcribed legal case judgments. Therefore, paragraph regularities refer to the set of retrieved paragraphs, which can significantly support and facilitate the drafting of the target document at hand. Indeed, such paragraphs may provide useful indications about aspects, clauses, or citations that have been reported in contexts similar to that of the target document, and that are expected to be reported in the target document as well. In other words, accessing a set of similar paragraphs provides the legal expert with possible clues on missing pieces of information, which may deserve to be added to the document under preparation.
Although in the literature we can find several document similarity measures implemented through (a) network-based approaches (Kumar et al. 2011; Minocha et al. 2015), (b) text-based methods (Kumar et al. 2013; Mandal et al. 2017), or (c) hybrid approaches (Kumar et al. 2013), the estimation of the similarity between two legal case documents is still considered a challenging task. Indeed, different themes in legal case documents form different networks of rules that, if considered as a single collection of documents, may lead to inaccurate estimations. In order to overcome this issue, in this paper we combine the embedding of legal case judgments with a clustering approach, to effectively identify regularities among paragraphs. In particular, we (i) pre-process documents through standard Natural Language Processing (NLP) approaches; (ii) represent them in a multidimensional semantic feature space, through a document embedding approach based on Neural Networks (NN); (iii) group them through a clustering method, in order to capture similar documents; (iv) learn an embedding model for each cluster, from the paragraphs belonging to its documents. The specific paragraph embedding model, learned from the subset of documents falling into a given cluster, can then be adopted to represent paragraphs (belonging to reference documents or to the target document) in a semantic feature space. Finally, we exploit an efficient strategy to identify paragraph regularities, based on Approximated Nearest Neighbor Search (ANNS).
Our two-step approach has the main advantage of learning a different semantic representation for each group of documents (rather than one single model), that allows us to capture peculiarities of paragraphs according to the specific topic. We argue that such peculiarities would not be easily identifiable through a unique representation learned from the whole set of paragraphs. This aspect also allows the proposed two-step approach to be robust to the presence of noise in the data. Note that noise can be in the form of misleading words, e.g., homonyms, or single words that are strongly related to a topic, appearing in paragraphs that are related to a totally different topic. Therefore, it is of utmost importance to be able to capture the right topics of paragraphs, without being affected by the presence of such words.
The rest of the paper is structured as follows. In Sect. 2 we describe previous work related to the present paper, while in Sect. 3 we describe in detail the proposed method. In Sect. 4 we describe our experimental evaluation, and we show and discuss the results. Finally, in Sect. 5 we draw some conclusions and outline possible future work.

Related work
In the following subsections, we briefly discuss existing works that support the retrieval of legal information as well as the identification of regularities in legal case judgments by exploiting clustering-based approaches.

Retrieval of legal information
The retrieval of legal information from existing collections of legal documents can be supported by Information Retrieval (IR) techniques. Indeed, they have already been profitably used in the literature for different tasks. Existing approaches can mainly be categorized into methods that adopt manual knowledge engineering procedures (Brüninghaus and Ashley 2001; Silveira and Ribeiro-neto 2004) and methods that exploit NLP (Biagioli et al. 2005; Tomlinson et al. 2007). In general, NLP-based approaches applied to the legal sector are deemed superior, since they combine data-driven methods and embedding models when analyzing legal case judgments to directly identify legal concepts or different representations, such as tagged feature-value pairs or logical predicates (Maxwell and Schafer 2008; Zhong et al. 2020).
More recently, the big data paradigm, as well as the availability of big data analytics tools, has been influencing legal authorities towards the publication of legal case documents through their online databases. On the other hand, researchers in the AI field are seizing this opportunity to enhance existing studies and contribute to the legal informatics field. For example, the works by Kumar et al. (2011), Trompper and Winkels (2016), Shulayeva et al. (2017), and Mandal et al. (2017) demonstrate the effort devoted to the development of complex intelligent solutions to solve tasks such as legal document review, precedent analysis, or document similarity evaluation. Specifically, since various judgments lack semantic annotations, Trompper and Winkels (2016) proposed a method to assign a section hierarchy to Dutch legal case judgments: given a set of unstructured legal case judgments, the authors apply tokenization with linear-chain conditional random fields (Sutton and McCallum 2012) to identify and label the roles of textual elements in a legal case judgment. Probabilistic context-free grammars are finally used to organize the text elements into a section hierarchy. Shulayeva et al. (2017) applied a machine learning method to automatically annotate sentences containing legal facts and principles in common law reports. The proposed approach relies on a feature selection step and on a Naïve Bayes multinomial classifier, to classify a given sentence as a principle, a fact, or neutral text. Although satisfactory results were achieved, experiments were limited to a corpus of only 50 reports. Kumar et al. (2011) and Mandal et al. (2017) also contributed to the enhancement of the precedent analysis task. Both approaches were based on measuring the similarity among legal case judgments. In particular, Kumar et al. (2011) applied TF-IDF to represent documents, considering all the terms or only legal terms.
Moreover, they also adopted the so-called bibliographic coupling similarity (that considers common out-citations), and co-citation similarity (that considers common in-citations). The authors reported that only the cosine similarity based on legal terms and the bibliographic coupling similarity provided accurate results when identifying similar legal case judgments, but generally observed a limited scalability of the proposed approach. On the other hand, Mandal et al. (2017) applied four different methodologies, namely, TF-IDF, Word2Vec, Doc2Vec, and Latent Dirichlet Allocation, to represent legal case judgments at different levels of granularity, such as summaries, paragraphs, sentences, or reasons for citation.
The adoption of embedding techniques to represent the textual content of legal documents has also been explored in LEGAL-BERT, proposed by Chalkidis et al. (2020). LEGAL-BERT uses Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. 2019) to obtain contextual representations from legal documents. BERT uses a Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. Chalkidis et al. (2020) compared the performance of a general purpose pre-trained BERT model (Devlin et al. 2019), a BERT model fine-tuned with domain-specific corpora, and a BERT model trained from scratch on domain-specific corpora. The experiments performed by Chalkidis et al. (2020) confirmed that the fine-tuned version and the model trained from scratch on domain-specific documents show the best performance.
The same line of research has been explored in the Competition On Legal Information Extraction/Entailment (COLIEE), during which several tasks related to the legal domain have been solved with the support of embedding techniques. Among the approaches related to the present paper, it is worth mentioning BERT-PLI (Shao et al. 2020), that adopts BERT to capture the semantic relationships at the paragraph-level and then infers the relevance between two cases by aggregating paragraph-level interactions. Analogously to LEGAL-BERT, the BERT model in BERT-PLI is fine-tuned with a dataset related to the legal field.
It is noteworthy that, although the exploitation of embedding techniques has already been explored in the literature, to the best of the authors' knowledge, the method proposed in this paper is innovative in the following aspects: i) it is the first method in the literature that exploits a two-step approach, based on clustering, to analyze textual content at the document and paragraph levels in the legal domain; ii) it exploits an efficient Approximated Nearest Neighbor Search (ANNS) method to identify paragraph regularities; iii) it is much more robust to the presence of noise in the data than both baseline and state-of-the-art solutions, as we will show in our empirical evaluation (see Sect. 4.3).

Clustering-based approaches in the legal sector
Clustering generally refers to an unsupervised task consisting of grouping similar objects into clusters: similar objects should fall into the same cluster, while dissimilar objects should fall into different clusters. Together with the design of advanced clustering algorithms (Berkhin 2002; Ester et al. 1996; Pio et al. 2012; Corizzo et al. 2019), the most critical research aspects of clustering are the design of a proper representation of the objects/items at hand (Mikolov et al. 2013; Le and Mikolov 2014), as well as of similarity measures that allow the algorithms to quantify how similar (or dissimilar) two objects are.
Clustering techniques have also been adopted in the legal field. Despite the fact that legal documents are highly abstract, researchers managed to group similar legal documents on the basis of their topics, or on the basis of case citations and legal citations (Conrad et al. 2005; Lu et al. 2011; Raghav et al. 2015; Kachappilly and Wagh 2018). Conrad et al. (2005) adopted a clustering tool to apply both hard and soft clustering on three large heterogeneous datasets, to effectively generate a taxonomy while supporting legal firms in their knowledge management processes. Particularly promising results were achieved when adopting hierarchical clustering methods, where clusters are organized in a hierarchy, or overlapping clustering methods, where each document can possibly belong to multiple clusters. Lu et al. (2011) report a successful and scalable implementation of a soft clustering algorithm based on topic segmentation. Contrary to typical approaches based on lexical similarity, the authors exploit topics, document citations, and click-stream data from user behavior databases, to obtain a high-quality classification similar to that achieved by human legal experts. Raghav et al. (2015) exploited citations and paragraph links to cluster legal case judgments from the Supreme Court of India, aiming to build efficient search engines. The authors used regular expressions to extract citations. Moreover, they defined links between pairs of paragraphs belonging to different judgments that show a cosine similarity, computed on the basis of their TF-IDF representation, higher than a given threshold. A new clustering algorithm was also proposed, based on the Jaccard coefficient, and cluster prototypes were defined by selecting the legal case judgment exhibiting the highest similarity with the other legal case judgments within the cluster. Analogously, Kachappilly and Wagh (2018) used case citations when clustering legal case judgments related to the Indian Constitution.
The proposed approach transforms the dataset into a binary matrix, indicating the presence or the absence of a citation of each case. Subsequently, the dataset is partitioned into groups through the classical k-means algorithm, using the Euclidean distance.
Although there are several works in the literature that considered the task of measuring the similarity among legal documents, and possibly identifying clusters thereof, the method proposed in this paper can be considered the first that simultaneously exploits advanced embedding techniques to capture the semantics and the context from the text. As mentioned in Sect. 1, these techniques are then used both in a two-step model to analyze the textual content at both document and paragraph level, and in an ANNS method to efficiently retrieve the most similar paragraphs with respect to a document at hand.

Methodology
In this section, we describe our method, called PRILJ (Paragraph Regularities Identification in Legal Judgments), that combines a clustering technique applied at a document level with an embedding approach applied at both document and paragraph levels. Embedding techniques have found several applications in the literature (Grover and Leskovec 2016;Corizzo et al. 2020;Pio et al. 2020;Ceci et al. 2020), and actually aim at representing any kind of structured and unstructured data as a numerical feature vector, so that existing retrieval, data mining and machine learning methods that work on classical feature vector representation can be adopted.
The methodological contribution of our approach comes from its ability to identify regularities by also exploiting common themes/topics of groups of legal case documents, as well as their possibly common/similar case citations or legal citations. Moreover, once we represent documents and paragraphs in a semantic feature space through embedding techniques, we exploit a smart strategy to identify the most similar paragraphs, which overcomes the bottleneck usually introduced by cosine-based pairwise similarity comparisons (Mandal et al. 2017; Thenmozhi et al. 2017). This strategy makes our approach not only accurate but also very efficient.
Before describing PRILJ in detail, we provide some useful definitions to ease the understanding:
• Training set D T : a collection of legal case judgments, represented as textual documents, adopted to train our models;
• Reference set D R : a collection of legal case judgments, represented as textual documents, from which we are interested in identifying paragraph regularities;
• Target document d: a legal case judgment (possibly incomplete or under preparation) for which we are interested in identifying paragraph regularities from the reference set.
The training set and the reference set may fully (or partially) overlap, i.e., D T = D R (or D T ∩ D R ≠ ∅), namely, the set of documents adopted to train our models may be the same as (or overlap with) the collection from which we want to identify paragraph regularities with respect to the target document. Note that PRILJ is fully unsupervised and the target document d is never contained in either the training set or the reference set (i.e., d ∉ (D T ∪ D R )).
Our method PRILJ consists of three main phases, namely:
• Training phase (see Fig. 1 and Algorithm 1), during which PRILJ i) trains a document embedding model from D T , that is able to represent documents into a semantic feature space; ii) identifies k groups of documents in D T according to their semantic representation, through a clustering method; iii) learns k paragraph embedding models, one for each cluster, that are able to represent paragraphs into a semantic feature space.
• Paragraph embedding of the reference set (see Fig. 2 and Algorithm 2), that exploits both the document embedding model and the k paragraph embedding models learned during the training phase, to identify a semantic representation of all the paragraphs of the reference set.
• Identification of paragraph regularities (see Fig. 3 and Algorithm 3), that exploits the identified document clusters, the document embedding model and the k paragraph embedding models to evaluate, through an efficient strategy, the similarity among paragraphs. The purpose is to identify paragraphs from the reference set that appear related to those of the target document (possibly under preparation).
In the remainder of this section, we will describe these three phases.
Following Algorithm 1, we now provide the details of the training phase. The algorithm starts with the application of some pre-processing steps to the documents in D T (line 2). In detail, the pre-processing consists of: i) lowercasing the text, ii) removing punctuation and digits, iii) applying lemmatization, and iv) removing rare words. The pre-processed documents are then used to train a document embedding model M (line 3), that is subsequently exploited to represent each document of the training set D T in the latent feature space, obtaining the set of embedded training documents E T (lines 4-7). Such documents are then partitioned into k clusters [C 1 , C 2 , ..., C k ] by adopting the k-means clustering algorithm (line 8). Each cluster of documents becomes the input for a further learning step at the paragraph level: documents falling in the same cluster will contribute to the learning of a specific paragraph embedding model. Algorithmically, for each document cluster C i , 1 ≤ i ≤ k , we extract the paragraphs from the documents falling into C i (line 11) and train a paragraph embedding model P i (line 12).
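As an illustration, the pre-processing steps i)-iv) can be sketched in Python as follows (a minimal sketch: the function names are ours, not part of PRILJ's actual implementation, and the lemmatization step is stubbed out, since it would rely on an external NLP library):

```python
import re
from collections import Counter

def lemmatize(token):
    # Identity stub: a real pipeline would plug in a lemmatizer
    # from an NLP library such as spaCy or NLTK.
    return token

def preprocess(documents, min_df=3):
    """Apply steps i)-iv): lowercasing, punctuation/digit removal,
    lemmatization (stubbed), and rare-word removal."""
    tokenized = []
    for doc in documents:
        text = doc.lower()                       # i) lowercasing
        text = re.sub(r"[^\w\s]|\d", " ", text)  # ii) punctuation and digits
        tokenized.append([lemmatize(t) for t in text.split()])  # iii)
    # iv) drop words whose document frequency is below min_df
    df = Counter(t for tokens in tokenized for t in set(tokens))
    return [[t for t in tokens if df[t] >= min_df] for tokens in tokenized]
```

The document-frequency threshold of 3 mirrors the setting used in our experiments (see Sect. 4.1).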
The embedding models, both at the document level and at the paragraph level, are learned through neural network architectures based on Word2Vec (Mikolov et al. 2013) and Doc2Vec (Le and Mikolov 2014). Such approaches, originally proposed to embed words and documents, respectively, can fruitfully be adopted to represent legal court case documents and their paragraphs. Previous works demonstrated the superiority of Word2Vec and Doc2Vec over classical counting-based approaches, since they take into account both the syntax and semantics of the text (Donghwa et al. 2018; Mandal et al. 2017). In addition, their ability to catch the semantics and the context of single words and paragraphs allows them to properly represent new (previously unseen) documents whose features have not been explicitly observed during the training phase. On the contrary, a purely counting-based syntactic approach would fail to represent a document whose words have never been observed in the training collection. Further details on the Word2Vec and Doc2Vec architectures implemented in PRILJ are provided in Sects. 3.1 and 3.2.
Following Algorithm 2, we now describe the embedding of the reference set. Analogously to the training phase, we pre-process the documents of the reference set D R (line 2). Then, each document of the reference set is embedded using the previously learned document embedding model M (line 5). The embedded representation of the document is then used to identify the closest document cluster id (i.e., c at line 6), which corresponds to the optimal paragraph embedding model (i.e., P c ) to adopt in the embedding of its paragraphs (lines 7-10). The set of all the embedded paragraphs E R , from which we are interested in identifying regularities for a given target document d, is finally returned by the algorithm. We stress the fact that this two-step strategy allows us to model both general patterns at the document level and specific patterns at the paragraph level, possibly leading to an improved representation and, accordingly, to a more accurate identification of paragraph regularities.
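The cluster selection at line 6 of Algorithm 2 amounts to a nearest-centroid lookup in the document embedding space. A minimal sketch (the centroids and the embedded document are toy values, standing in for those produced by the training phase):

```python
import numpy as np

def closest_cluster(doc_embedding, centroids):
    """Return the id of the closest k-means centroid; this id selects
    which of the k paragraph embedding models to apply (P_c)."""
    return int(np.argmin(np.linalg.norm(centroids - doc_embedding, axis=1)))

# Toy example: k = 3 cluster centroids in a 2-dimensional embedding space.
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
c = closest_cluster(np.array([4.5, 5.2]), centroids)  # → 1
```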
The identification of paragraph regularities, described in Algorithm 3, starts by following the same steps mentioned in Algorithm 2 to represent each paragraph of the target document d in the paragraph embedding space. Namely, the most appropriate paragraph embedding model, selected according to the closest document cluster with respect to d, is adopted to embed its paragraphs. For each embedded paragraph, we finally identify the top-n most similar paragraphs from the set of embedded paragraphs E R belonging to the reference set (line 9). Their identification could straightforwardly be based on the computation of vector-based similarity/distance measures (e.g., cosine similarity, Euclidean distance, etc.) between the embedded paragraphs of the target document d and all the embedded paragraphs of the reference set E R . However, such a pairwise comparison would be computationally intensive and would lead to inefficiencies during the adoption of the proposed system in a real-world scenario. To overcome this issue, we adopt a more advanced method for the identification of the top-n most similar paragraphs, based on random projections.
In the following subsections, we provide additional details about the two different models that we adopt for document and paragraph embedding (Word2Vec and Doc2Vec), as well as about the approach we propose to efficiently identify the top-n most similar paragraphs.

Learning document and paragraph embedding models through Word2Vec
Word2Vec (Mikolov et al. 2013) is a word embedding method, namely a method to represent words as numerical feature vectors. Word2Vec learns a model to embed words by analyzing a collection of documents and by exploiting two different neural network architectures: Continuous-Bag-of-Words (CBOW) and Skip-gram (SG). Both architectures can capture rich syntactic and semantic relationships between words, but they are based on two different techniques: CBOW adopts a feed-forward neural network to predict a target (central) word from a given (surrounding) context, while SG aims to predict the surrounding words of a given target word. The former is generally faster to train and usually provides slightly better accuracy for frequent words, while the latter is more accurate in the representation of rare words, at the price of a generally higher running time. However, previous experiments provided discordant conclusions, depending on the specific datasets (Jin and Schuler 2015; Miñarro-Giménez et al. 2015).
In PRILJ, we adopt the variant based on CBOW, also because its learning process is conceptually closer to our final goal. In fact, a feed-forward neural network that can predict a target word from a given (surrounding) context well adapts to the task of identifying words and paragraphs to suggest while writing a document (according to the current context). This is not the case for SG, where the task is rather different (predicting the context given a word).
Methodologically, given a sequence of words ⟨w t−j , ..., w t , ..., w t+j ⟩, representing the target word w t and its context of size 2j, Word2Vec first maps each context word w i to a one-hot vector representation of size V, where V corresponds to the size of the vocabulary observed in the collection of documents. Each element of the vector corresponds to one word of the vocabulary: a generic word is represented by a vector whose values are all 0, except the value corresponding to the specific word, which is 1.
The neural network architecture aims to learn the optimal matrix S ∈ ℝ V×E , where E is the desired size of the embedding space. The one-hot vectors of the context words are then multiplied by S. The obtained 2j vectors in the space ℝ E are averaged by the hidden layer to obtain the embedding of the target word w t . Formally, denoting by x i the (row) one-hot vector of the context word w i , the hidden layer is computed as:

h = (1 / 2j) · Σ i∈{t−j,...,t+j}, i≠t (x i S).

The output layer, obtained by multiplying the embedding of the target word w t by S ⊤ (i.e., h S ⊤ ), corresponds to the one-hot vector of w t (see Fig. 4). This means that the neural network is learned so that it accurately reconstructs the one-hot vector of the target word w t , given the one-hot vectors of the context words.
Once the neural network has been trained using the training set, the obtained matrix S can be used to embed a given word into a numerical feature space of size E.
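The CBOW forward pass described above can be sketched with NumPy as follows (toy dimensions; S is randomly initialized here rather than trained, so the resulting vectors are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, j = 6, 4, 2            # vocabulary size, embedding size, context radius
S = rng.normal(size=(V, E))  # the matrix S in R^{V x E} (random, not trained)

def one_hot(idx, size):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

# One-hot vectors of the 2j context words surrounding the target word w_t.
context_ids = [1, 2, 4, 5]
X = np.stack([one_hot(i, V) for i in context_ids])

# Hidden layer: average of the context vectors projected through S.
h = (X @ S).mean(axis=0)

# Output layer: multiplying by S^T yields a score for each vocabulary word;
# training pushes these scores towards the one-hot vector of w_t.
scores = h @ S.T
```

Note that multiplying a one-hot vector by S simply selects the corresponding row of S, which is why the trained rows of S can directly be used as word embeddings.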
In our case, we learn a document embedding model from the training documents D T (Algorithm 1, line 3), as well as k paragraph embedding models, one for each group of documents identified through k-means (Algorithm 1, line 12). An embedding model learned by Word2Vec can be queried for several purposes, but it natively provides an embedding for single words, which represents the semantics of the word based on its context. In order to obtain an embedding for sequences of words (which, in our case, may correspond to paragraphs or whole documents), different aggregation strategies can be adopted, including sum and mean, as suggested by Le and Mikolov (2014). In PRILJ, we obtain an embedding for the documents of the reference set (Algorithm 2, line 5) and for the target document (Algorithm 3, line 3), as well as for the paragraphs of the reference set (Algorithm 2, line 9) and of the target document (Algorithm 3, line 8). For these purposes, we adopt the mean of the embeddings of the words.
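Concretely, the mean-based aggregation we adopt reduces to averaging the rows of S that correspond to the words of the sequence. A minimal sketch with a toy embedding matrix:

```python
import numpy as np

def embed_sequence(word_ids, S):
    """Embed a paragraph or document as the mean of the embeddings
    of its words (rows of the trained matrix S)."""
    return S[word_ids].mean(axis=0)

# Toy 3-word vocabulary embedded in a 2-dimensional space.
S = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
vec = embed_sequence([0, 1], S)  # → array([0.5, 0.5])
```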

Learning document and paragraph embedding models through Doc2Vec
Although Word2Vec can in principle be used to represent sequences of words, by adopting the mentioned aggregation strategies, it was not originally designed for this purpose. Consequently, Le and Mikolov (2014) proposed Doc2Vec, which is natively able to generate a vector representation of word sequences, where a sequence can be either a paragraph or a whole document. Methodologically, Doc2Vec can exploit two different architectures, namely distributed memory (PV-DM) and distributed bag of words (PV-DBOW). Similarly to CBOW, PV-DM aims to predict the word w t , given its context. However, the context is represented not only by the one-hot vector representations of its surrounding words, but also by a C-dimensional one-hot vector representation x seq of the unique sequence ID (seq), where C is the total number of sequences. This vector encapsulates the topic shared by words in the same sequence. Conversely, PV-DBOW makes use of the SG architecture, where the one-hot vector representation of the unique ID associated with the sequence is fed to the input layer instead of the one-hot vector of w t .
For the same motivations behind the adoption of CBOW in Word2Vec, in PRILJ we adopt the PV-DM architecture in Doc2Vec, as shown in Fig. 5. The main differences with respect to the architecture shown in Fig. 4 are i) the presence of the C-dimensional one-hot vector x seq in the input layer, associated to the sequence ID (seq), and ii) the additional matrix D ∈ ℝ C×E , whose values are optimized together with those of the matrix S. Formally, in this case, the hidden layer is computed as:

h = (1 / (2j + 1)) · (x seq D + Σ i∈{t−j,...,t+j}, i≠t (x i S)),

where x i denotes the one-hot vector of the context word w i . Analogously to the adoption of Word2Vec, we learn a document embedding model and k paragraph embedding models, and exploit them to embed documents and paragraphs, respectively. The main difference is that, in this case, we do not need any aggregation step to obtain the embedding of sequences of words from the embedding of single words.
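Under the same toy setting used for CBOW, the PV-DM hidden layer can be sketched as follows (S and D are random placeholders for the jointly trained matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
V, C, E, j = 6, 3, 4, 2      # vocabulary size, number of sequences,
                             # embedding size, context radius
S = rng.normal(size=(V, E))  # word matrix (random here, not trained)
D = rng.normal(size=(C, E))  # sequence matrix, optimized together with S

context_ids, seq_id = [1, 2, 4, 5], 0

# Hidden layer: the sequence vector (the row of D selected by the one-hot
# vector x_seq) is averaged together with the 2j projected context words.
h = (D[seq_id] + S[context_ids].sum(axis=0)) / (2 * j + 1)

# Output layer, as in CBOW: scores over the vocabulary for the target word.
scores = h @ S.T
```

After training, the rows of D directly provide the embeddings of the sequences, which is why no aggregation step is needed in this variant.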

Approximated Nearest Neighbor Search (ANNS) for the identification of paragraph regularities
In this subsection, we describe the strategy adopted in PRILJ to efficiently identify the top-n most similar paragraphs of the reference set, with respect to the paragraphs of the target document. A straightforward approach would consist in the computation of the pairwise cosine similarity between the vector representation of the paragraphs. However, such an approach would be computationally intensive. Namely, its time complexity would be O(n r ) for each paragraph of the target document, where n r is the number of paragraphs of the reference set.
To deal with this computational issue, we adopt an approach based on Annoy (Bernhardsson 2015), whose idea is to perform an approximated nearest neighbor search (ANNS). Methodologically, we perform two phases, i.e., index construction, carried out on the paragraphs of the reference set, and search, which occurs when we actually need to identify the top-n most similar paragraphs with respect to a paragraph of the target document. During the index construction, we build T binary trees. Each tree is built by recursively partitioning the input set of vectors, by randomly selecting two vectors and defining a hyperplane that is equidistant from them (see Fig. 6). It is noteworthy that, even if based on random partitioning, vectors that are close to each other in the feature space are more likely to appear close to each other in the tree. It can be proved that this indexing step has a computational cost of O(T × log 2 (n r )) = O(log 2 (n r )).
During the search process, we traverse the binary trees by exploiting a priority queue. Specifically, each tree is recursively traversed, and the priority of each split node is defined according to its distance to the query vector (that is, a paragraph of the target document, in our case). This process leads to the identification of the T leaf nodes into which the query vector falls. The distance between the query vector and the set of vectors falling into the identified leaves is finally exploited to return the top-n most similar paragraphs (Li et al. 2016). Computationally, the search process also takes O(T · log₂(n_r)) = O(log₂(n_r)).
Although the results may not be identical to those of an exact search, previous experiments showed that the ability of ANNS to mitigate the curse of dimensionality leads to high-quality approximations, together with much higher efficiency.
In PRILJ, the index construction is performed once, on the set of all the paragraphs belonging to the reference set E_R, while the search process is carried out for each paragraph of the target document (Algorithm 3, line 9).
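The index-construction and search phases described above can be sketched with a tiny random-projection forest in pure NumPy. This is a simplified stand-in for Annoy, not the actual library: it descends each tree to a single leaf instead of using a priority queue, and all sizes, the number of trees, and the noise level are illustrative:

```python
# Simplified Annoy-style ANNS: T random trees built by hyperplane splits,
# then a union of the leaves reached by the query, ranked by cosine similarity.
import numpy as np

def build_tree(vecs, ids, leaf_size, rng):
    """Recursively split the vectors by a hyperplane equidistant from two
    randomly chosen vectors (Annoy-style index construction)."""
    if len(ids) <= leaf_size:
        return ids                                   # leaf: store vector ids
    a, b = rng.choice(len(ids), size=2, replace=False)
    normal = vecs[ids[a]] - vecs[ids[b]]             # hyperplane normal
    midpoint = (vecs[ids[a]] + vecs[ids[b]]) / 2     # point on the hyperplane
    side = (vecs[ids] - midpoint) @ normal > 0       # side of the plane
    if side.all() or (~side).all():                  # degenerate split: stop
        return ids
    return (build_tree(vecs, ids[side], leaf_size, rng),
            build_tree(vecs, ids[~side], leaf_size, rng),
            normal, midpoint)

def search_tree(node, q):
    """Descend to the leaf the query vector falls into."""
    while isinstance(node, tuple):
        left, right, normal, midpoint = node
        node = left if (q - midpoint) @ normal > 0 else right
    return node

rng = np.random.default_rng(42)
reference = rng.normal(size=(200, 16))      # stand-in reference-set vectors
ids = np.arange(200)
trees = [build_tree(reference, ids, leaf_size=10, rng=rng) for _ in range(5)]

# Query: a slightly perturbed copy of reference vector 7.
query = reference[7] + 0.001 * rng.normal(size=16)
candidates = np.unique(np.concatenate([search_tree(t, query) for t in trees]))
sims = (reference[candidates] @ query) / (
    np.linalg.norm(reference[candidates], axis=1) * np.linalg.norm(query))
top = candidates[np.argsort(-sims)][:5]     # should rank vector 7 first
```

Only the candidate vectors in the visited leaves are scored, which is the source of both the speed-up and the approximation.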

Experiments
In this section, we describe the experimental evaluation we performed to assess the effectiveness of the proposed method PRILJ. Specifically, in Sect. 4.1, we describe the considered real-world dataset, while in Sect. 4.2 we describe in detail the experimental setting and the considered comparative evaluations. Finally, in Sect. 4.3, we show and discuss the obtained results.

Dataset
In our experiments, we use a dataset made available by EUR-Lex which, excluding empty documents, consists of 4,181 official public EU legal documents, having an average length of 2,739 words, related to both unconsolidated and finalized legal case judgments from 2008 to 2018. The total number of paragraphs in the dataset is 530,744, with an average of 22 words per paragraph. Each legal case judgment has a unique CELEX number, whose components are the EUR-Lex sector, the year, the document type and the document number. The CELEX number was used to fetch the legal case judgments. We extracted the textual content by ignoring HTML elements, while the <p> tag was used to identify the paragraphs. We applied the preprocessing steps mentioned in Sect. 3. We ignored words having a document frequency lower than 3, and we retained only paragraphs having at least 10 words.
Note that the considered dataset falls within the case-law sector and includes only legal case judgments by the Courts of Justice. Therefore, it does not include views delineated by the Advocate General and opinions on draft agreements given by the European Court.

Experimental setting
All the experiments were performed in a 10-fold cross-validation (10-fold CV) setting, where 90% of the dataset is used as training set and the remaining 10% as testing set, in turn, over 10 iterations. All the documents of the testing set were considered as target documents, while the reference set was built by constructing 20 replicas of each paragraph of the documents in the testing set, perturbed by introducing a controlled amount of noise. In particular, the noise was introduced by replacing a given percentage of words of each paragraph with random words selected from the Oxford dictionary. In our experiments, we considered different levels of noise, namely 10%, 20%, 30%, 40%, 50%, and 60%, in order to evaluate the robustness of the proposed approach to different amounts of noise. We stress the importance of specifically evaluating this aspect, since noise (e.g., homonyms or misleading words) can easily be present in textual documents, and a robust approach should provide accurate results also when input documents are affected by potentially high amounts of noise.
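The perturbation procedure described above can be sketched as follows. The word list RANDOM_WORDS is a tiny placeholder for the Oxford dictionary used in the paper, and the example paragraph is invented:

```python
# Sketch of the reference-set construction: replace a given percentage of the
# words of a paragraph with random dictionary words, producing noisy replicas.
import random

RANDOM_WORDS = ["apple", "river", "window", "sudden", "cloth"]  # placeholder

def perturb(paragraph, noise_level, rng):
    """Replace round(noise_level * len(words)) words at random positions."""
    words = paragraph.split()
    n_replace = round(len(words) * noise_level)
    for i in rng.sample(range(len(words)), n_replace):
        words[i] = rng.choice(RANDOM_WORDS)
    return " ".join(words)

rng = random.Random(0)
p = "the court dismisses the appeal and orders the applicant to pay costs"
replicas = [perturb(p, 0.30, rng) for _ in range(20)]   # 20 noisy replicas
```

Each replica keeps the original length and differs from the source paragraph in exactly the replaced positions, so the amount of noise is fully controlled.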
In order to assess the specific contribution of the adopted embedding strategies, we compared the results obtained through Word2Vec and Doc2Vec with those achieved using a baseline strategy, i.e., the classical TF-IDF approach. In all the cases, we adopted a 50-dimensional feature vector. For TF-IDF, we selected the top-50 words showing the highest frequency across the set of legal case judgments.
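A 50-dimensional TF-IDF baseline of this kind can be obtained with scikit-learn, whose max_features parameter keeps only the terms with the highest corpus frequency. The tiny corpus below is illustrative:

```python
# Sketch of the TF-IDF baseline: a vocabulary restricted to the top-50
# most frequent terms yields (at most) 50-dimensional paragraph vectors.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the court dismisses the appeal",
    "the applicant shall pay the costs",
    "the appeal is well founded",
]
vectorizer = TfidfVectorizer(max_features=50)  # top-50 terms by frequency
X = vectorizer.fit_transform(corpus)           # one row per paragraph
```

Unlike the embedding strategies, this representation ignores word order and semantics, which is what the comparison in the paper is designed to quantify.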
We evaluated the performance of the two-step model implemented in PRILJ with different numbers of clusters, i.e., k ∈ {√|D_T|/2, √|D_T|, √|D_T|·2}. Note that k = √|D_T| is generally considered a default value for the number of clusters, when this has to be manually specified. In our analysis, we also evaluate the sensitivity of PRILJ to the value of k.
Moreover, we compared the observed performance with that obtained by a baseline strategy that does not group training documents into clusters (henceforth denoted as one-step model).
We also performed an additional comparative analysis with state-of-the-art competitor systems. Specifically, we compared PRILJ with:
• LEGAL-BERT-BASE, i.e., the LEGAL-BERT model fine-tuned by Chalkidis et al. (2020) using a wide set of legal documents related to EU, UK and US law;
• LEGAL-BERT-SMALL, i.e., the LEGAL-BERT model fine-tuned by Chalkidis et al. (2020) using the same set of documents adopted for LEGAL-BERT-BASE, but with a lower-dimensional embedding space;
• LEGAL-BERT-EURLEX, i.e., the LEGAL-BERT model fine-tuned by Chalkidis et al. (2020) using the EUR-LEX dataset;
• BERT-PLI, i.e., the BERT-based system fine-tuned with a small set of legal documents, proposed by Shao et al. (2020) in the Competition On Legal Information Extraction/Entailment (COLIEE).
Note that the above-mentioned competitors are able to represent paragraphs as feature vectors (i.e., they are embedding models), taking into account the semantics and the context of the textual content. Specifically, LEGAL-BERT-BASE, LEGAL-BERT-EURLEX and BERT-PLI represent paragraphs in a 768-dimensional feature space, while LEGAL-BERT-SMALL represents paragraphs in a 512-dimensional feature space. The embedding of each paragraph was computed as the mean of the embedding of its tokens. Finally, we evaluated the effectiveness and the efficiency of the approach implemented in PRILJ for the identification of the top-n most similar paragraphs based on ANNS, with T = 100 (number of trees). Specifically, we performed an additional comparative analysis against a non-approximated solution based on the cosine similarity, on a subset of 100 documents randomly selected from the dataset. This analysis was performed considering the best configuration in terms of the number of clusters k, and also focused on evaluating the advantages in terms of computational efficiency.
As evaluation measures, we collected precision@n, recall@n and f1-score@n, averaged over the paragraphs of target documents and over the 10 folds, with n ∈ {5, 10, 15, 20, 50, 100} . Specifically, for each paragraph of a target document in the testing set, we considered as True Positives the number of correctly retrieved (perturbed) replicas from the reference set.
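Given that the 20 perturbed replicas of a paragraph are its only relevant items in the reference set, the per-paragraph measures reduce to counting retrieved replicas. A minimal sketch, with invented item identifiers:

```python
# Precision@n, recall@n and f1-score@n for a single query paragraph:
# TP = number of its (perturbed) replicas among the top-n retrieved items.
def precision_recall_at_n(retrieved, relevant, n):
    top = retrieved[:n]
    tp = sum(1 for item in top if item in relevant)
    return tp / n, tp / len(relevant)

relevant = {f"replica_{i}" for i in range(20)}          # the 20 noisy replicas
retrieved = ([f"replica_{i}" for i in range(15)]        # a hypothetical ranking
             + ["other_1", "other_2"]
             + [f"replica_{i}" for i in range(15, 20)])

p, r = precision_recall_at_n(retrieved, relevant, 20)   # 18 replicas in top-20
f1 = 2 * p * r / (p + r)
```

In the paper, these values are then averaged over all paragraphs of the target documents and over the 10 folds.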
In summary, our experimental evaluation was performed along multiple dimensions of analysis, i.e., on the evaluation of: (i) the effect of different amounts of noise in the data, to evaluate the robustness of PRILJ to the presence of noise; (ii) the contribution of the embedding approaches implemented in PRILJ that also catch the semantics, with respect to the adoption of TF-IDF; (iii) the contribution provided by the two-step model with different numbers of clusters, with respect to the one-step model that does not exploit document clustering; (iv) the effect of the approximated nearest neighbor approach implemented in PRILJ, both in terms of effectiveness and in terms of efficiency.

Results
In Tables 1, 2 and 3, we report the precision@n, the recall@n, and the f1-score@n results, respectively, measured with different embedding strategies and different levels of noise introduced in the dataset. The upper-left subtable shows the results obtained with the one-step model, while the other subtables show the results obtained by PRILJ with different numbers of clusters.
As expected, we can observe that, in all the configurations, the presence of noise negatively affects the results: the higher the amount of noise introduced, the lower the precision@n, the recall@n, and the f1-score@n. It is noteworthy that the considered approaches do not differ in terms of conservativeness: approaches obtaining a higher precision@n also obtain a higher recall@n and, accordingly, a higher f1-score@n. This interesting result allows us to outline clear conclusions (reported in the remainder of this section) about the most effective approaches (and their parameters) for the different phases implemented in PRILJ.
First, we can observe that, although the baseline based on TF-IDF obtained acceptable results, the adoption of the embedding methods implemented in PRILJ is significantly beneficial. Specifically, when adopting Doc2Vec, we observe an average improvement of 17.68% for the precision@n, 17.71% for the recall@n, and 17.70% for the f1-score@n. When adopting Word2Vec, we observe an average improvement of 24.22% for the precision@n, 24.27% for the recall@n, and 24.29% for the f1-score@n. This result confirms our initial intuition that catching the context and the semantics leads to significant improvements. Moreover, although Doc2Vec is natively able to work with word sequences, Word2Vec achieves better results. (In Tables 1, 2 and 3, the best result observed for a given n in each subtable is shown in boldface, while the absolute best result is underlined.) From the results, it is possible to clearly identify the contribution of the two-step architecture we propose.
Indeed, the results show that the proposed two-step model outperforms the one-step model in all situations and for all the considered evaluation measures. We can also observe that the two-step model is much more robust to the presence of noise: although we can still observe a decrease of precision@n, recall@n, and f1-score@n when the amount of noise increases, its impact is much less evident. In Fig. 7, we report a histogram that graphically shows the impact of the noise on the f1-score@20, with the two-step model (with different values of k) and with the one-step model. From the results, we can also observe that, in general, the number of extracted clusters k does not seem to significantly affect the results, even if the best results are observed with k = √|D_T|·2. This means that the documents are distributed among several topics, and that learning a different (more specialized) paragraph embedding model for each of them is helpful to retrieve significant paragraph regularities.
Focusing on the comparison with state-of-the-art systems, in Table 4 we report the f1-score@n results obtained by PRILJ (two-step model, k = √|D_T|·2, Word2Vec) and by the considered competitor systems, with different levels of noise. From the results, we can easily observe that PRILJ always outperforms all the competitors, independently of the value of n of the f1-score@n measure and of the amount of noise in the data. Moreover, as we can also observe in Fig. 8, the impact of noise is very evident on competitor systems. On the contrary, PRILJ appears very robust to the noise and, thus, adoptable in real contexts even when the amount of noise in the data is high. The significantly lower f1-score@n results achieved by the competitors, when documents are affected by high levels of noise, are consistent with the findings of Kumar et al. (2020), where the sensitivity of BERT-based models to the presence of noise has been investigated. Finally, we specifically analyzed the performance of the ANNS approach implemented in PRILJ. We recall that we adopt an approximated approach for the identification of paragraph regularities to overcome computational bottlenecks. Since approximated approaches may lead to a loss in terms of accuracy, it is important to show that the high efficiency achieved by PRILJ does not come at the price of significantly worse results than those achievable through an exact search. As anticipated in Sect. 4.2, for this purpose, we performed a comparison with the exact computation of the top-n most similar paragraphs using the cosine similarity on a subset of 100 documents, with the PRILJ configuration that provided the best results (i.e., the two-step model with k = √|D_T|·2, see Tables 1, 2, 3). The f1-score results of this comparison are shown in Table 5, and graphically summarized in Fig. 9. The exact search based on cosine similarity leads to better results mainly when adopting TF-IDF with high levels of noise.
Overall, the observed average improvement in terms of f1-score@n with respect to the adopted ANNS approach is 0.6%, which can be considered negligible. On the other hand, the advantage in terms of efficiency is significant: the exact search required up to 1000x the time taken by the ANNS approach implemented in PRILJ (see Table 6). This advantage is empirically evident even with the small subset of documents that we used, and the difference between the theoretical computational complexities (i.e., O(log₂(n_r)) vs O(n_r), for each paragraph of the target document) clearly favours ANNS for large document collections. Indeed, while we were able to complete one run of the experiments on the full dataset on average in 1.5 hours, the adoption of the cosine similarity would have required some months on our server, equipped with a 6-core CPU @ 3.2 GHz and 64 GB of RAM. (In Table 5, the configurations in which the cosine similarity achieves better results are emphasized in italics.) The obtained results allow us to conclude that: i) PRILJ based on Word2Vec, the two-step model and k = √|D_T|·2 provides the best overall results for the identification of paragraph regularities in legal case judgments; ii) the two-step model based on clustering implemented in PRILJ provides clear advantages, since it is able to properly model the different topics in the document collection and is very robust to the presence of noise; iii) the efficient ANNS strategy adopted by PRILJ provides results comparable to those achieved by an exact search, in a fraction of the time. These conclusions make PRILJ a useful tool that can be adopted in real-world scenarios, for the accurate and efficient identification of paragraph regularities from large collections of legal case judgments, which can be profitably exploited in the redaction of similar legal documents.

Conclusions
In this work, we proposed PRILJ, a novel approach to identify paragraph regularities in legal case judgments. PRILJ represents documents and their paragraphs in a numerical feature space by exploiting embedding methods able to catch the context and the semantics. Moreover, PRILJ is based on a two-step model that groups similar documents into clusters and, for each of them, learns a specific paragraph embedding model. This approach allows us to properly catch peculiarities exhibited by paragraphs and documents of similar topics, and to handle the presence of noise in a robust manner. Finally, PRILJ is able to identify paragraph regularities with respect to target documents very efficiently. Our extensive experimental evaluation on real data has proved the accuracy and the efficiency of the proposed approach, which can be considered a useful tool in real-world scenarios, also when large collections of documents have to be analyzed. PRILJ has also been able to outperform four existing state-of-the-art competitor systems, achieving significantly better performance when the amount of noise in the data increases.
For future work, we will extend the capabilities of PRILJ in providing, in addition to retrieval functionalities, also suggestions during the preparation of new legal documents. Specifically, we will exploit process mining methods to identify frequent patterns observed in the sequences of paragraphs of legal documents. This would allow us to suggest the next (type of) paragraph to include in a legal document under preparation, as well as to perform conformance checking on a legal document, i.e., to verify if it has been properly written in accordance with the patterns observed on other, similar legal documents.
Funding Information Open access funding provided by Università degli Studi di Bari Aldo Moro within the CRUI-CARE Agreement. GP acknowledges the support of the Ministry of Universities and Research through the project "Big Data Analytics", AIM 1852414-1 (line 1).

Data Availability Statement
The adopted dataset and all the detailed results are available at: https://osf.io/2jum9/?view_only=ea9c9294999746ccb2af62eddacb8d9a. The source code repository of the system will be made publicly available after the publication of the manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.