1 Introduction

We are witnessing a growing need for the digitisation of our society, which requires great interdisciplinary efforts in law, information technology and engineering. This need has led to the birth of institutions such as the Ministry of Digital Governance of Greece and the Australian Digital Transformation Agency, or long-term plans like the European Digital Transition Action Plan and many others.

In the literature of AI, answering questions using an extensive collection of documents on diversified topics (e.g., Private International Law) is called open-domain Question Answering (QA). Modern open-domain QA systems usually combine traditional information retrieval techniques and neural reading comprehension models. Nevertheless, neural reading comprehension of legal texts (e.g., European legislation) is challenging because legalese is rarer, mercurial and in many ways different from commonly used natural language. Hence, the difference between legal and ordinary language creates technical issues when applying or fine-tuning general-purpose language models for open-domain question answering on legal resources. This is especially true when the meaning of a legal document is encoded in its (discourse) structure in a way that is different from the spoken language. For example, long sentences or more “formal” writing may be preferred in legislative documents (e.g., Brussels I bis Regulation EU 1215/2012) to reduce potential ambiguities and improve comprehensibility. However, the noise introduced by excessively long sentences or their unusual structure can distract a language model trained on ordinary English, pushing it to commit more errors.

As a result, standard neural reading comprehension models may only be able to represent the semantics of a legal text if they are adequately specialised to do it. This is because legalese is not repetitive. It is canonical and has semantic terminology that tends to avoid polysemy and to be used precisely in particular contexts, as if the sentences it forms were governed by formal rules. Hence, applying these formal rules impacts the discourse structure, as suggested by Sovrano et al. (2022).

Here, we expand the work published by Sovrano et al. (2020), investigating some mechanisms to perform “zero-shot” legal question answering. More specifically, “zero-shot” means that question answering is performed through pre-trained language models (e.g., a model that is trained on generic non-legal documents) without fine-tuning them on the downstream legal task of question answering. In this sense, zero-shot legal question answering can be a necessary solution for all those tasks characterised by a paucity of data (e.g., European hard laws, the resolutions of the United Nations General Assembly) and for which we want to train AI-based solutions through machine learning without having enough information for effective fine-tuning. Conversely, zero-shot legal question answering might be less helpful whenever data are abundant (e.g., American case law or privacy policies).

In this article, we investigate the role of discourse structure in legalese, trying to understand and exploit its importance in encoding the meaning of legal documents. The goal of this investigation is also practical, not just theoretical. Understanding how legalese differs from its spoken counterpart can help solve the data scarcity problem in legalese processing/comprehension. This would allow us to better exploit generic language models not calibrated to a downstream legal task or even not trained on legal documents, as shown throughout the paper.

Specifically, we use open-domain QA systems based on information retrieval and neural reading comprehension and study what happens when changing the type of information to consider for retrieval. These QA systems encode all the possible answers (e.g., parts of articles, recitals) with a general-purpose neural model and then use the encoding for fast similarity-based retrieval. Usually, these answers are just a short part (a grammatical sub-tree) of one sentence or paragraph, especially if the whole is very long. If the neural model is not specialised in legalese, it will likely fail to identify and capture the importance of grammatical sub-trees that are uncommon in the spoken language. Hence, by selecting only those grammatical sub-trees deemed the most important, we should be able to help the information retriever and the QA system by partially hiding the noise within answers. To identify these important grammatical sub-trees, we used the theory of Elementary Discourse Units (EDUs) (Prasad et al. 2008) and the theory of Abstract Meaning Representations (AMRs) (Banarescu et al. 2013).

In other words, we show how to produce more effective answer retrieval tools by capturing discourse structure, leveraging existing QA tools specialised in ordinary natural language. Therefore, to shed more (empirical) light on what constitutes meaning in legalese, we decided to design an experiment focused on understanding whether there is a benefit in using EDUs or AMRs as triplets in the knowledge graphs extracted by the pipeline proposed by Sovrano et al. (2020). We devised a simple experiment where we study what happens to the baseline QA system when using EDUs or AMRs during information retrieval.

In particular, to evaluate our results, we present a new dataset called Q4EU that extends Q4PIL (Sovrano et al. 2021) with 3 more European norms, for a total of 72 unique questions and 225 expected answers (in the form of articles and recitals) on 6 heterogeneous European norms spanning from Private International Law to Human Rights Law (i.e., the General Data Protection Regulation, EU 2016/679), and from the regulation of electronic signatures to the European arrest warrant.

The results of our experiments show that the versions using EDUs are overall the best, leading to state-of-the-art top-k precision and F1 scores for all the values of k we considered. Our instances of DiscoLQA were able to generalise across the different legal sub-domains tested, even though the deep language models involved were not pre-trained on legal corpora.

However, we only tested and evaluated DiscoLQA on specific European norms and a relatively small dataset, without using deep language models pre-trained on legal corpora.

Our contribution is threefold:

  1. We show where a general-purpose language model may fail when applied to legal documents, hinting at how to intervene for effective fine-tuning or re-training. In other words, we show that legalese’s semantics may be encoded differently from those of ordinary language. Identifying the sources of meaning may be beneficial for effectively improving the state-of-the-art neural reading comprehension of legal documents.

  2. We show a way to effectively use discourse analysis for legal question answering, improving the state of the art without fine-tuning or re-training the language models on the regulations at hand.

  3. We publish Q4EU, a new evaluation dataset for legal question-answer retrieval that extends the work by Sovrano et al. (2021).

For reproducibility purposes, we also publish on GitHub the source code of DiscoLQA.Footnote 1

This paper is structured as follows. In Sect. 2, we discuss the related work on (legal) QA, while in Sect. 3 we give all the necessary background information to understand the pipeline of algorithms presented in Sect. 4. In Sect. 5 we describe our experiment, and in Sect. 6 we present the Q4EU dataset. Finally, we analyse the results in Sect. 7 and point to future work in Sect. 8.

2 Related work

Legal QA is a relatively recent field of study in the context of AI and Law, with many exciting solutions available today. Some of these solutions follow end-to-end approaches, exploiting existing language models. In contrast, some others try to exploit ontologies and knowledge graphs by framing QA as a task of information retrieval.

On the one hand, there is a paucity of generic end-to-end solutions to legal QA: the existing ones are usually focused only on particular and narrow applications for which large enough datasets are available. When no large dataset is available for training, using deep language models pre-trained on ordinary English does not always produce good results; indeed, Zheng et al. (2021) showed that the more complex the legal reasoning task, the less effective fine-tuning can be. An example of an end-to-end QA system is the work by Kim et al. (2015), where a deep neural network is trained on a dataset of Boolean questions from Japanese legal bar exams. Another interesting example is the work by Ravichander et al. (2019), proposing an end-to-end question-answering solution for privacy policies.

On the other hand, an example of an answer retrieval system specific to Private International Law is the one proposed by Sovrano et al. (2020). It consists of a combination of TF-IDF and some deep language models to retrieve pertinent answers from an automatically extracted knowledge graph of contextualised grammatical sub-trees. In particular, the knowledge graph is aligned to a legal ontology based on Ontology Design Patterns (i.e., agent, role, event, temporal parameter, action) to mirror the legal significance of the relationships within and among the provisions. In this sense, we extend the work by Sovrano et al. (2020), trying to overcome some of the issues of using language models not trained in legalese.

Another example of an answer retrieval system is the work by Vold and Conrad (2021), which compares the performance of a deep learning-based solution with that of a traditional SVM. In particular, Vold and Conrad (2021) fine-tuned a deep language model (RoBERTa) on a dataset of questions about privacy policies (which usually use a language closer to spoken English than to legalese), obtaining better results than with an SVM.

3 Background

In this section, we provide all the necessary background information to understand state-of-the-art automated question-answering and the relationship between discourse theory and legalese.

3.1 Question answering and law

Natural language processing/understanding is of utmost importance in the intersection of AI and Law. This is why many works in this field have focused on general-purpose state-of-the-art language models for the generation of word/sentence embeddings (Shao et al. 2020; Condevaux et al. 2019; Vink et al. 2020).

For example, Bommarito et al. (2018) published a framework for natural language processing and information extraction for legal and regulatory texts, while Chalkidis and Kampas (2019) proposed one of the first models for legal word embeddings. Also, the Incorporated Council of Law Reporting for England and Wales (ICLR 2019) published Blackstone, a library meant to allow researchers and engineers to automatically extract information from long, unstructured legal texts (such as judgements, skeleton arguments, scientific articles, Law Commission reports, and pleadings). More generally, natural language processing for legal texts has recently raised a lot of interest, highlighting “the need to create a bridge between conceptual questions, such as the role of legal interpretation in mining and reasoning, as well as computational and engineering challenges, such as the handling of big legal data and the complexity of regulatory compliance” (Robaldo et al. 2019).

Automating legal reasoning is not a trivial task, as it requires a deep understanding of language, non-monotonic logic and the theory of interpretation, as well as sufficient flexibility to handle the plethora of changes to which law and hermeneutics are subject over time. Current state-of-the-art AI for reasoning is divided into two approaches: the symbolic and the sub-symbolic. The symbolic approach draws from formal languages and logic. It requires every component of the reasoning to be an abstract symbol with a pre-defined and context-independent interpretation of its meaning, making the AI based on this approach hardly compatible with natural languages such as English, Chinese, and Spanish. On the other hand, the sub-symbolic approach draws from recent advancements in deep learning. Exploiting large amounts of data, it can “understand” natural language and visual inputs in a scalable and highly effective way. However, it loses transparency by working on non-symbolic representations (i.e., arbitrary numerical vectors) that are frequently not interpretable.

Non-monotonic reasoners based on Defeasible Logic (Lam and Governatori 2009), Deontic Logic (Hage 2000) and Argumentation (Gordon and Walton 2009) are famous examples of symbolic AI applied to the legal domain. All require legal documents to be translated (manually) from their original natural language into some particular formal language upon which classical logical reasoning can be applied. This type of reasoner usually struggles to scale when handling natural language (e.g., English) inputs such as documents and questions.

On the other hand, the sub-symbolic approach is more versatile and well-known to be more easily applied directly to natural language documents. Famous sub-symbolic approaches to (legal) reasoning are the so-called QA algorithms. As suggested by Xie et al. (2020); Cao et al. (2019); Zhang et al. (2018); Hudson and Manning (2019) and others, in many cases, question answering can be seen as an instance of reasoning. These QA algorithms are usually trained end-to-end to extract short (i.e., 2–3 words) answers from a whole document (text or image) to match a given question.

The most common end-to-end QA algorithms, i.e. those collected by Wolf et al. (2020), rely on Transformers (Vaswani et al. 2017). Hence, they have quadratic complexity in the size of the whole document to be searched for an answer. This characteristic makes end-to-end QA based on Transformers fail in all those situations where collections of large documents on diversified topics (e.g., Private International Law) are involved, or where parts of the same answer are scattered across multiple documents. A solution to this problem is Question-Answer Retrieval, also known as Dense Passage Retrieval or open-domain QA (Chen and Yih 2020). Modern open-domain QA systems usually combine traditional information retrieval techniques and neural reading comprehension models. These QA systems encode all the identified possible answers (e.g., parts of articles, recitals) with a general-purpose neural model. Then they use the encoding for fast similarity-based retrieval. Therefore, unlike end-to-end QA, Question-Answer Retrieval requires the a priori identification of the possible snippets of text functioning as answers, but it is much faster. In fact, it has a complexity that is usually proportional to the product of the size of the context (normally a small paragraph) and the size of the answer (commonly smaller than the context).
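To make this pattern concrete, the minimal sketch below (ours, not taken from any of the cited systems) illustrates the typical Question-Answer Retrieval workflow: candidate answers are encoded once, off-line, and each incoming question only requires one encoding plus a fast similarity search over the pre-computed index. The toy `encode` function is a stand-in for a neural sentence encoder; a real system would use a deep language model instead.

```python
import zlib
import numpy as np

def encode(text: str, dim: int = 128) -> np.ndarray:
    """Toy stand-in for a neural sentence encoder: a bag-of-words random
    projection. A real open-domain QA system would call a deep language
    model here instead."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        rng = np.random.default_rng(zlib.crc32(token.encode("utf-8")))
        vec += rng.standard_normal(dim)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Off-line step: encode every candidate answer once and build an index.
candidate_answers = [
    "The applicable law is determined by the habitual residence of the seller.",
    "Member States shall execute any European arrest warrant.",
]
index = np.stack([encode(a) for a in candidate_answers])

# On-line step: one encoding per question, then a similarity search whose
# cost is linear in the number of indexed candidates.
question = "Which law applies to the sale of goods?"
scores = index @ encode(question)      # cosine similarities (unit vectors)
top_k = np.argsort(-scores)[:2]
print([candidate_answers[i] for i in top_k])
```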

Among the most important Question-Answer Retrieval models, we distinguish between those that use the answer’s context for the generation of embeddingsFootnote 2 (Yang et al. 2020; Karpukhin et al. 2020; Roy et al. 2020) and those that do not (Chen et al. 2020).

3.2 Discourse theory and legal language

The relation between discourse theory and legalese is complicated and still open to discussion. Discourse theory is a branch of linguistics that studies how coherence and cohesive relations can be the threads that make up a text to form a discourse. A discourse is said to be coherent if all of its pieces belong together, while it is said to be cohesive if its elements have some common thread. Sanders et al. (1992) identified two requirements for a theory of discourse:

  • Descriptive adequacy: A theory of discourse structure should make it possible to describe the structure of all kinds of (natural) texts.

  • Psychological plausibility: A theory of discourse structure should at least generate plausible hypotheses on the role of discourse structure in constructing cognitive representation.

In recent years, many different theories of discourse have been spelt out, each with different pros and cons. Among them, we cite the Rhetorical Structure Theory (Mann and Thompson 1988), assuming that discourse is structured as a tree, the Segmented Discourse Representation Theory (Lascarides and Asher 2007), assuming that discourse is structured as a graph (therefore allowing long-distance attachments), and the theory of EDUs (Miltsakaki et al. 2004; Prasad et al. 2008; Webber et al. 2019), making no assumption on the text structure. Common to all of them is arguably the identification of something that may be called an Elementary Discourse Unit (EDU). EDUs are spans of text denoting a single event serving as a complete, distinct unit of information that the surrounding discourse may connect to (Stede 2013). EDUs can be combined to form many different types of discourse (Fludernik 2000; D’Angelo 1984), including argumentation, exposition, description, and narration.

The theory of EDUs encoded by the Penn Discourse Treebank (PDTB) model is considered one of the most generic theories of discourse. Indeed, PDTB is data-driven (based on lexically grounded relations) and makes few assumptions about the underlying language. As a result, with little or no change in annotative style, PDTB appears to be usable for modelling discourses of natural languages belonging to different families (Zufferey and Degand 2017), e.g., Chinese, Arabic, and Hindi. In particular, PDTB is based on the assumption that “the meaning and coherence of a discourse result partly from how its constituents relate to each other”. Therefore, discourse relations are defined as semantic relations between abstract objects (or EDUs) mentioned in discourse and connected by explicit (e.g., “but”, “then”, “for example”, and “although”) or implicit relations. According to PDTB, discourse relations can be of one of 4 main types: temporal, contingency (causality, purpose, etc.), expansion, and comparison. PDTB-style annotations and the other theories of discourse have inspired an ISO standard (Prasad and Bunt 2015).

The application of PDTB to legalese has been explored by some (Robaldo et al. 2008; Cabrio et al. 2013), but has yet to receive much follow-up. The point is that ordinary discourse theory is better suited to judgments, Hansard reports, testimonies and reports of debates. Instead, it seems unsuited to legislative texts and contracts, for which a specific vocabulary (e.g., definitions) or textual structure (e.g., hierarchy) is used to identify meaning through interpretation theory. Indeed, legislative texts have a deeper structure than common sentences. For example, a list legally encodes a set of conditions linked together by specific semantics. Furthermore, the classical linguistic structures based on discourse connectives tend to be used differently in law. Legal connectives do not have the same semantic value as in everyday discourse. They are operators of deontic rules with multiple meanings (e.g., “xor”, “or”, “and”). Also, some discourse structures tend not to be used at all because they are not considered good practice in legal drafting (e.g., “but” and “for example”).

4 DiscoLQA: discourse theory for legal question answering

This paper proposes a novel pipeline of algorithms called DiscoLQA, short for Discourse-based Legal Question Answering. DiscoLQA is based on the automatic extraction of special knowledge graphs designed to address Legal QA through general-purpose deep language models that are not specifically trained on legal documents. In particular, DiscoLQA is composed of the baseline tool of Sovrano et al. (2020), extended with a new component responsible for extracting special information units representing EDUs and AMRs.

The baseline tool described by Sovrano et al. (2020) is composed of a pipeline of algorithms for efficient Question-Answer Retrieval through the extraction of a knowledge graph from a set of information units. In this sense, the main difference between DiscoLQA and the baseline is (as shown in Fig. 1) the type of information units considered by the knowledge graph extractor. The baseline uses as information units all the clausesFootnote 3 of the source documents.Footnote 4 Instead, DiscoLQA can use as information units not only the clauses but also the AMRs and discourse relations extracted from the clauses.

In other words, DiscoLQA supports more types of information units and allows the retrieval of answers from any combination of clauses, AMRs and discourse relations. Specifically, discourse relations are meant to capture how EDUs are connected, while AMRs are meant to capture the informative components within the EDUs, possibly supporting the answering of basic questions such as “who did what to whom, when or where”. For example, from the sentence “The existence and validity of a contract, or any term of a contract, shall be determined by the law which would govern it under this Regulation if the contract or term were valid” it is possible to extract the following discourse relation about contingency (which we represent as a pair of question and answer for convenience and clarity): “In what case would the law govern it under this Regulation? If the contract or term were valid”, and the following AMR question-answer: “By what is the existence and validity of a contract determined? The law that would govern it under this Regulation if the contract or clause were valid”. So, a discourse relation identifies two EDUs: the first encoded in the question and the second in the answer.
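For illustration, the minimal sketch below (our own simplification, not DiscoLQA's actual data model) shows how the two information units extracted from the sentence above could be represented as question-answer pairs, which is also the convention adopted by the fine-tuning datasets discussed in Sect. 4.1; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class InformationUnit:
    """A discourse relation (pair of EDUs) or an AMR, stored as a QA pair."""
    kind: str      # "EDU" (discourse relation) or "AMR"
    source: str    # the sentence the unit was extracted from
    question: str  # implicit question (for EDUs: the first discourse unit)
    answer: str    # answer span (for EDUs: the second discourse unit)

source = ("The existence and validity of a contract, or any term of a contract, "
          "shall be determined by the law which would govern it under this "
          "Regulation if the contract or term were valid")

units = [
    InformationUnit(
        kind="EDU",  # contingency relation between two discourse units
        source=source,
        question="In what case would the law govern it under this Regulation?",
        answer="If the contract or term were valid",
    ),
    InformationUnit(
        kind="AMR",  # "who did what to whom"-style unit
        source=source,
        question="By what is the existence and validity of a contract determined?",
        answer=("The law that would govern it under this Regulation "
                "if the contract or term were valid"),
    ),
]
```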

In this section, we discuss the system implementation of DiscoLQA, starting from the proposed mechanism for extracting EDUs and AMRs.

4.1 Information units extraction: discourse relations and abstract meaning representations

Fig. 1
figure 1

Sketch of the pipeline used in the baseline and DiscoLQA. The baseline extracts only clauses from the source texts (articles, recitals, commission statements, etc.). DiscoLQA also extracts discourse relations and AMRs as information units. The information units are then passed to the knowledge graph extractor, which produces a graph used by the Question-Answer Retriever

The AMRs and EDUs used by DiscoLQA are extracted from sentences and paragraphs through a deep language model based on T5Footnote 5 (Raffel et al. 2020), pre-trained on a multi-task mixture of unsupervised and supervised tasks.

Vanilla T5 is not trained to recognise AMRs or EDUs. Therefore, we fine-tuned T5 on public datasets designed for these tasks, namely QAMR (Michael et al. 2018) for extracting AMRs and QADiscourse (Pyatkin et al. 2020) for EDUs and discourse relations. Interestingly, both datasets encode AMRs and EDUs as question-answer pairs; this is done for convenience only. Indeed, as pointed out by Michael et al. (2018); Pyatkin et al. (2020); Roit et al. (2020) and others, the question-answer format is more natural, making it easier for humans to make changes, correct errors and suggest improvements, even without knowing in detail all the underlying linguistic theories.

Most importantly, the QAMR and QADiscourse datasets are not related to any of the technical domains covered by Q4EU. They do not contain legal documents or text fragments written in legalese. In other words, by fine-tuning T5 on QAMR and QADiscourse, we do not refine T5 on legal texts. Legal fine-tuning would require the costly extraction of a dataset of AMRs and EDUs from legal texts, also considering ad hoc adaptations of discourse theories and abstract meaning representation to legal language.

In particular, the QAMR dataset is made of 107,880 different questions (and answers) that map AMR theory to the following wh-phrases:

  • What (60.9% of the dataset),

  • Who (17.5%),

  • How (6.9%),

  • Where (5.0%),

  • When (4.3%),

  • Which (2.9%),

  • Whose (1.9%),

  • Why (0.6%).

On the other hand, the QADiscourse dataset is made of 16,613 different questions (and answers) that map PDTB relations (mainly contingency and temporal ones) to the following wh-phrases:

  • In what manner (25% of the dataset),

  • What is the reason (19%),

  • What is the result (16%),

  • What is an example (11%),

  • After what (7%),

  • While what (6%),

  • In what case (3%),

  • Despite what (3%),

  • What is contrasted with it (2%),

  • Before what (2%),

  • Since when (2%),

  • What is similar (1%),

  • Until when (1%),

  • Instead of what (1%),

  • What is an alternative (\(\le 1\%\)),

  • Except when (\(\le 1\%\)),

  • Unless what (\(\le 1\%\)).

Both datasets consist of tuples \(<s, q, a>\), where s is a source sentence, q is a question (implicitly) expressed in s, and a is an answer expressed in s. T5 is therefore fine-tuned to tackle the following four tasks per dataset at once:

  1. Extract a, given s and q;

  2. Extract q, given s and a;

  3. Extract all the possible q, given s;

  4. Extract all the possible a, given s.

Specifically, we fine-tuned the T5 model on QAMR and QADiscourse for five epochs.Footnote 6 The objective of the fine-tuning was to minimise a loss function measuring the difference between the expected output (i.e., a for the 1st task, q for the 2nd task, etc.) and the output given by T5. A mathematical definition of the loss function is given by Raffel et al. (2020).

At the end of the training, the average loss was 0.4098, meaning that our fine-tuned T5 model cannot perfectly extract AMRs or EDUs from the text composing the training set. On the one hand, this is a good thing because it is likely that the model did not over-fit on the training set. On the other hand, this points to the fact that the AMRs and EDUs extracted by our T5 model can be imperfect, containing errors that could propagate to the answer retrieval system. Regardless, in the following sections, we show that even if the language models we rely on are imperfect, we can still outperform the baseline information retrieval system.
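For concreteness, the sketch below shows how a fine-tuned T5 checkpoint of this kind could be queried at inference time to extract such question-answer pairs with the Hugging Face transformers library; the checkpoint path and the task-prefix convention are our assumptions and do not necessarily match DiscoLQA's actual configuration.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical checkpoint: a T5 model fine-tuned on QAMR and QADiscourse.
MODEL_PATH = "path/to/t5-finetuned-qamr-qadiscourse"

tokenizer = T5Tokenizer.from_pretrained(MODEL_PATH)
model = T5ForConditionalGeneration.from_pretrained(MODEL_PATH)

sentence = ("The existence and validity of a contract shall be determined by "
            "the law which would govern it under this Regulation if the "
            "contract were valid.")

# Task 3 above: extract all the possible questions q, given s.
# The "generate questions:" prefix is an illustrative convention only.
inputs = tokenizer("generate questions: " + sentence, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, num_beams=4,
                         num_return_sequences=4)
questions = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
print(questions)
```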

4.2 System implementation: knowledge graph extraction and answer retrieval

DiscoLQA, similarly to the baseline tool described by Sovrano et al. (2020), consists of a pipeline of AI algorithms capable of extracting, from a set of information units, a knowledge graph that an information retrieval system can exploit to answer a given question. In particular, this knowledge graph is extracted by detecting, with a dependency parser, all the possible phrases and sub-phrases within the information units, so that each phrase stands for an edge of the knowledge graph. In practice, these phrases are represented as special triplets of subjects, templates and objects called template-triplets. Specifically, the templates are composed of the ordered sequence of tokens connecting a subject and an object. The subject and the object are represented in such templates with the placeholders “{subj}” and “{obj}”.

Hence, the resulting template-triplets are a sort of function, where the template is the body and the subject and the object are the parameters. Obtaining a natural language representation of these template-triplets is straightforward by design: it suffices to replace the parameters in the body with their instances. This natural language representation is then used as a possible answer for retrieval by measuring the similarity between its embedding and the embedding of a question. An example of a template-triplet is shown below, followed by a minimal code sketch of its natural language realisation:

  • Subject: “the applicable law”

  • Template: “Surprisingly {subj} is considered to be clearly more related to {obj} rather than to something else”

  • Object: “that Member State”
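The following sketch (ours; the class name and implementation details do not necessarily mirror DiscoLQA's source code) shows how such a template-triplet can be represented and realised in natural language by filling the placeholders.

```python
from dataclasses import dataclass

@dataclass
class TemplateTriplet:
    """An edge of the knowledge graph: a template with two placeholders."""
    subject: str
    template: str  # contains the "{subj}" and "{obj}" placeholders
    object: str

    def to_text(self) -> str:
        """Natural language realisation: fill the placeholders in the body."""
        return self.template.format(subj=self.subject, obj=self.object)

triplet = TemplateTriplet(
    subject="the applicable law",
    template=("Surprisingly {subj} is considered to be clearly more related "
              "to {obj} rather than to something else"),
    object="that Member State",
)
print(triplet.to_text())
# -> Surprisingly the applicable law is considered to be clearly more
#    related to that Member State rather than to something else
```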

Because of the adopted extraction procedure, the resulting knowledge graph is not perfect: it may contain mistakes caused by wrongly identified grammatical dependencies or other issues.

To increase the interoperability of the extracted knowledge graph with external resources, we formatted it as an RDF graph. RDF is a standard model for data interchange on the Web (Allemang and Hendler 2011). In particular, RDF has features that facilitate data merging even if the underlying schemas differ. To format a graph of template-triplets into an RDF graph, we performed the following steps (sketched in code after the list):

  • We assigned a Uniform Resource Identifier (URI) to every node (i.e., subject and object) and edge (i.e., template) of the graph by lemmatising the associated text. To each URI, we assigned an RDFS label corresponding to the associated text.

  • We added special triplets to keep track of the sources from which the template-triplets were extracted, so that for each node and edge it is possible to go back to the source document or paragraph.

  • We added sub-class relations between composite concepts (syntagms) and the simplest concepts (if any) composing the syntagm. For example, “contractual obligation” is a sub-class of “obligation”.
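The sketch below illustrates these steps with rdflib; the namespace, the simplified lemmatisation, and the property names are assumptions of ours rather than the exact schema used by DiscoLQA.

```python
import re
from rdflib import Graph, Literal, Namespace, RDFS, URIRef

EX = Namespace("http://example.org/discolqa/")  # illustrative namespace

def uri(text: str) -> URIRef:
    # Stand-in for lemmatisation: a real system would lemmatise the text first.
    return EX[re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")]

g = Graph()

subj, obj = "the applicable law", "that Member State"
template = ("Surprisingly {subj} is considered to be clearly more related "
            "to {obj} rather than to something else")

# Nodes and edge, each with an RDFS label holding the associated text.
for text in (subj, obj, template):
    g.add((uri(text), RDFS.label, Literal(text)))

# The template-triplet itself, plus provenance and a sub-class relation.
g.add((uri(subj), uri(template), uri(obj)))
g.add((uri(subj), EX.extractedFrom, EX["rome_i/article_4"]))  # source tracking
g.add((uri("contractual obligation"), RDFS.subClassOf, uri("obligation")))

print(g.serialize(format="turtle"))
```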

For more technical details about how we performed all the steps mentioned above to convert the template-triplets into an RDF graph, please refer to Sovrano et al. (2020) or the source code of DiscoLQA.

Finally, the algorithm to retrieve answers from the extracted knowledge graph is based on the following steps. Let C be the set of concepts in a question q, let \(m=<s,t,o>\) be a template-triplet, let \(u=t(s,o)\) be the natural language representation of m, also called an information unit, and let z be its source paragraph. DiscoLQA performs answer retrieval by finding the most similar concepts to C within the knowledge graph, retrieving all their related template-triplets m (including those of the sub-classes), and selecting amongst the natural language representations u of the retrieved template-triplets those that are likely to be an answer to q. The probability that u pertinently answers q can be estimated through SyntagmTuner (Sovrano et al. 2022) as the numerical similarity between the embedding of \(u + z\) (i.e., u concatenated with z) and the embedding of q. If \(u + z\) is similar enough to q, then z is said to be an answer to q for the information unit u. Therefore, the algorithm can retrieve any arbitrary number of answers, given that enough information units are available.

In particular, the embeddings of \(u + z\) and q are obtained through a deep language model specialised in QA retrieval and pre-trained on ordinary English to associate similar vectorial representations to a question and its correct answers. The pre-trained deep language models we considered for our implementation of DiscoLQA and our experiments are the Universal Sentence Encoder (Yang et al. 2020), MiniLM (Wang et al. 2021), and MPNet (Song et al. 2020).
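A minimal sketch of this ranking step with the sentence-transformers library is shown below; the checkpoint name is one of the publicly available MiniLM QA-retrieval models and, like the example texts, is an assumption of ours rather than DiscoLQA's exact configuration.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoint: a MiniLM model tuned for question-answer retrieval.
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

question = "Which law governs the validity of a contract?"

# Each candidate is an information unit u paired with its source paragraph z.
candidates = [
    ("The existence and validity of a contract is determined by the law that "
     "would govern it under this Regulation",                      # u
     "The existence and validity of a contract, or of any term of a contract, "
     "shall be determined by the law which would govern it under this "
     "Regulation if the contract or term were valid."),            # z
    ("Member States shall execute any European arrest warrant",    # u
     "Member States shall execute any European arrest warrant on the basis of "
     "the principle of mutual recognition."),                      # z
]

q_emb = model.encode(question, convert_to_tensor=True)
a_emb = model.encode([u + " " + z for u, z in candidates], convert_to_tensor=True)

scores = util.cos_sim(q_emb, a_emb)[0]   # similarity of q to each u + z
best = int(scores.argmax())
print(candidates[best][1])               # the source paragraph z is returned as the answer
```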

5 Experiment

Given all the premises stated in Sects. 1 and 3, we designed an experiment to better understand the role of discourse relations in legalese and to determine how to exploit existing state-of-the-art general-purpose natural language models for QA to automatically and effectively answer questions on legal documents (e.g., Private International Law). Indeed, legalese is a technical language in many ways similar to its related natural language, but with important differences in how the meaning is encoded in the text. Legalese is not repetitive. It is canonical and has semantic terminology that tends to avoid polysemy and to be used precisely in particular contexts, as if the sentences it forms were governed by very formal rules.

We hypothesise that applying these formal rules affects the syntagmatic relationships within sentences and the discourse structure. If this hypothesis were correct, it would in principle be possible to specialise general-purpose natural language models to legalese simply by integrating them with external information about the discourse structure of legal texts, without costly training procedures otherwise hampered by the scarcity of data. This is why we decided to design an experiment focused on understanding whether there is a benefit in using discourse relations and AMRs instead of plain sentences when performing Question-Answer Retrieval on the bodyFootnote 7 of articles and recitals. The overall idea is that using discourse relations and AMRs as information units would help to partly crystallise into the retrieval system the structure of discourse used by the legal texts. This would make it invariant, preventing the answer retriever from falling back on the discourse schemes learned from ordinary language.

Hence, we designed DiscoLQA, which, as described in Sect. 4, extends the baseline Question-Answer Retrieval system proposed by Sovrano et al. (2020) by supporting different combinations of information units, i.e., AMRs and discourse relations. So, for the experiment, we can compare the performance of different information units on the same answer retrieval algorithm. More precisely, we want to study the following instances of DiscoLQA:

  • Clause: equivalent to the QA tool by Sovrano et al. (2020). This is DiscoLQA which uses only clauses as information units.

  • Clause+EDU+AMR: DiscoLQA which uses clauses, discourse relations and AMRs as information units, all together.

  • Clause+EDU: DiscoLQA using clauses and discourse relations but not AMRs.

  • Clause+AMR: DiscoLQA using clauses and AMRs.

  • EDU+AMR: DiscoLQA using only discourse relations and AMRs.

  • EDU: DiscoLQA using only discourse relations.

  • AMR: DiscoLQA using only AMRs.

As a result, if one type/combination of information units performs better than the others, the gain in performance can be imputed to the only difference between the tools: the type/combination of adopted information units. Therefore, if DiscoLQA were better than the baseline (Sovrano et al. 2020), we would have some evidence supporting our initial hypothesis, obtained by measuring the effects of discourse structure on the performance of information retrievers trained on general-purpose natural language.

We consider as a baseline only the answer retrieval system by Sovrano et al. (2020) mainly for two reasons:

  1. It is the only system we know of that can perform legal question answering without any ad-hoc fine-tuning or training procedure. We do not have an extensive enough dataset to train an end-to-end QA system on specific European legislation; our focus is on zero-shot legal QA (as defined in Sect. 1).

  2. It is the only legal question-answer retrieval system we know of that has been tested on European legislation. Therefore, it is the most suitable baseline for us.

To show that the results generalise across different deep language models, we decided to run the experiments on different state-of-the-art deep neural networks for answer retrieval:

  • The Universal Sentence Encoder Q&A model (USE, for short), by TensorFlow (Yang et al. 2020, Google);

  • MiniLM (Wang et al. 2021, Microsoft);

  • MPNet (Song et al. 2020, Microsoft).

In particular, the last two models were fine-tuned on 215 million question-answer pairsFootnote 8 by SBERT (Reimers and Gurevych 2019).

We decided to consider only the models mentioned above because: (i) they are some of the best general-purpose models for the task available on TensorFlow Hub and SBERT (two state-of-the-art repositories of deep neural networks, easily accessible through user-friendly APIs); (ii) deep neural networks for answer retrieval (i.e., models for generating vectorial representations of questions and answers) are different from, and less common than, models for question answering or answer extraction.

Unfortunately, we do not know of any general-purpose open-source deep language model trained specifically on legal answer retrieval. The only exception could be the work by Vold and Conrad (2021), though their language model was trained on privacy policies, which are usually written in plainer English than European legislation (Table 1).

Finally, in order to evaluate DiscoLQA and perform the experiment, we need a dataset of at least 50Footnote 9 relevant questions on European legislation, with known expected answers. Considering that Q4PIL (Sovrano et al. 2021) comprises only 17 questions on Private International Law, we decided to build a larger test set called Q4EU, which includes more questions on different European norms, as described in Sect. 6.

6 Q4EU: a test set for legal answer retrieval

Q4EU contains 72 unique questions and 225 expected answers (i.e., articles and recitals). For simplicity of exposition, Q4EU can be divided into the following sub-sets:

  • Q4PIL (see Table 2): containing questions about three Private International Law regulations: Rome I Regulation EC 593/2008; Rome II Regulation EC 864/2007; Brussels I bis Regulation EU 1215/2012. These regulations are, respectively, on the law applicable to contractual obligations, on the law applicable to non-contractual obligations, and on jurisdiction and the recognition and enforcement of judgements in civil and commercial matters. In particular, they aim to provide a tool for identifying the applicable law and the jurisdiction in cases where two or more legal systems are connected and generate complex relationships (e.g., a sale of goods contract between an Italian and a German citizen regarding commodities situated in Spain).

  • Q4EAW (see Table 3): containing questions about the Council Framework Decision (CFD) of 13 June 2002 on the European arrest warrant and the surrender procedures between Member States.Footnote 10 In particular, this framework decision increases the efficiency of extradition procedures for crime suspects. Furthermore, it also determines the abolition of formal extradition procedures between member states of the EU for persons who are fugitives from justice after being finally convicted. The framework decision represents the first concretisation of the principle of free movement of judicial decisions in criminal matters, encompassing both pre-sentence and final decisions by fostering judicial cooperation and the development of a single area of freedom, security and justice in the EU.

  • Q4GDPR (see Table 4): containing questions about the General Data Protection Regulation (GDPR),Footnote 11 the most relevant piece of legislation in the EU legal framework with regard to data protection law. Its goal is to foster the fundamental right to data protection, enshrined in the Charter of Fundamental Rights of the European Union (art. 8), while harmonising rules on data processing, profiling, and risk management.

  • Q4eIDAS (see Table 5): containing questions about Regulation (EU) No 910/2014 of the European Parliament and of the Council of 23 July 2014 on electronic identification and trust services for electronic transactions in the internal market and repealing Directive 1999/93/EC,Footnote 12 also known as eIDAS Regulation. This legislation tackles several issues in electronic identification, electronic signature, electronic seals, and trust services. Its goal is to provide legal certainty for cross-border transactions in the EU Single Market.

Some statistics on the datasets mentioned above are shown in Table 1.

To build the Q4EU dataset and, in the first place, the Q4PIL dataset, the pieces of legislation (i.e., the norms) taken into account are conceived as self-contained legal environments. While legal interpretation is often grounded on external legal factors (e.g., jurisprudence, scholars’ opinions), we opted for a “black letter” approach to the law that only considers the legislative legal formant. Therefore, the point of view assumed in our analysis is the perspective of the lawmakers. This has a twofold implication for question-and-answer drafting.

On the one hand, questions have been modelled to be answered solely within the legal text under scrutiny. They do not refer to legal concepts, such as the hierarchy of legal sources or competence, that are not explicitly mentioned in the regulations. Moreover, not all the (legal) questions are the same. While some accept as an answer a provision that exactly matches the question, others rely on more complex interpretations (i.e., legal reasoning) to be answered. Therefore, questions have been classified depending on their context specificity, which can either be low, normal, or high.

First, questions whose answer falls precisely within the domain of the regulations and is provided in the “black letter” of the law were labelled as highly specific. An example of a question with high specificity is “In what court can an employee sue its employer?” because it perfectly falls within the scope and goals of Regulation Brussels I-bis and finds its exact answer in the provisions of Articles 21 and 23.

Questions whose answer falls within the scope of the regulations while requiring an abstraction of multiple legal provisions were labelled as normally specific. For instance, “What is the applicable rule to protect the weaker party of a contract?” was labelled as normally specific since the answer also relies on the concept of “weaker party” mentioned across two regulations (Recital 23 Rome I and Recital 18 Brussels I) concerning any contract (as a legal concept) rather than specific contractual types.

Finally, broad questions whose tentative answer is found through an articulate combination of articles and recitals were labelled as having low specificity. For instance, a question with low specificity is “Can the parties choose a different applicable law for different parts of the contract?”. While the Rome I Regulation provides for a discipline on the law applicable to contracts, it does not contain any provision concerning individual parts of a contract. For such a question, the answer is ultimately open to interpretation, although the Regulation suggests norms that could serve as a reference point.

Since such classification might be subjective and dependent on each jurist, three legal experts independently evaluated the level of context specificity and decided by the majority about the final level.

On the other hand, the answers to the questions provided by legal experts, which constitute the dataset used to observe the performance of deep language models, are obtained by mirroring the question-drafting methodology. Three legal experts, different from the question-drafters, provided answers to the legal questions by looking for the following:

  1. Specific, punctual, and explicit answers in the case of highly specific questions;

  2. General and conceptual, yet text-based, answers to normally specific questions; and

  3. Prima facie textual references to be used as interpretative points of reference in the case of questions with low specificity.

These experts only provided textual references in the legislation at the article or recital level (e.g., Rome I art. 8; B Rec. 18). When at least two experts agreed on a given answer, their response was considered valid without further enquiry. If an expert provided a diverging answer, a further expert validated this response. In drafting the validation answers, no other articles or recitals were considered except those provided by the original validators.

Table 1 Statistics on Q4EU: the column “Art./Rec.” counts the number of recitals and articles. The column “Questions” counts the number of different questions, and the column “Tokens per Art./Rec.” counts the mean number of tokens per article/recital, and so on. Please note that Q4EU is the sum of Q4PIL, Q4EAW, Q4GDPR and Q4eIDAS
Table 2 Q4PIL subset: here, “B” stands for Brussels I bis Regulation EU 1215/2012, “RI” for Rome I Regulation EC 593/2008 and “RII” for Rome II Regulation EC 864/2007
Table 3 Q4EAW subset: here, “W” stands for the CFD of 13 June 2002 on the European arrest warrant and the surrender procedures between Member States
Table 4 Q4GDPR subset: here, “G” stands for GDPR
Table 5 Q4eIDAS subset: here, “E” stands for the eIDAS Regulation

7 Results and error analysis

Considering that, with the Q4EU dataset, a single answer is not sufficientFootnote 13 to fully respond to a test query, we relied on top-k precision, F1, Normalised Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR) as evaluation metrics. In particular, the top-k precision, or P@k, is measured as the fraction of expected answers amongst the top-k retrieved instances. The top-k F1 score, or F1@k, is given by \(2\frac{R@k \cdot P@k}{R@k + P@k}\), where the top-k recall, or R@k, is measured as the fraction of correct answers retrieved in the top-k instances. In contrast, the top-k NDCG (Sakai 2007) is a measure of ranking quality normalised in [0, 1] that measures the usefulness, or gain, of an answer based on its position in the result list. Instead, the top-k MRR (Voorhees 1999) only considers the single highest-ranked relevant item: it indicates which system does the best job of placing a relevant document/passage at the highest rank.

It is important to note that the main difference between precision, F1, MRR and NDCG is that the last two are used to assess the ability of an answer retrieval system to rank correct answers first. Conversely, the other metrics measure the system’s precision and accuracy. For these reasons, all selected metrics are considered complementary measurements that offer different lenses on the problem of understanding answer retrieval systems (Dato et al. 2022).
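For clarity, the sketch below shows how these per-question metrics can be computed from a ranked list of retrieved answers and the set of expected answers; it is a simplified re-implementation of the standard definitions (with binary relevance for NDCG), not DiscoLQA's evaluation code.

```python
import math

def topk_metrics(ranked, expected, k=10):
    """P@k, R@k, F1@k, MRR and binary-relevance NDCG@k for one question.
    `ranked` is the ordered list of retrieved answers, `expected` the set of
    correct answers (e.g., article/recital identifiers)."""
    top = ranked[:k]
    hits = [1 if a in expected else 0 for a in top]
    p = sum(hits) / k
    r = sum(hits) / len(expected) if expected else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    # MRR: reciprocal rank of the first relevant answer (0 if none retrieved).
    mrr = next((1 / (i + 1) for i, h in enumerate(hits) if h), 0.0)
    # NDCG@k with binary gains, normalised by the ideal ranking.
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(expected), k)))
    ndcg = dcg / ideal if ideal > 0 else 0.0
    return {"P@k": p, "R@k": r, "F1@k": f1, "MRR": mrr, "NDCG@k": ndcg}

# Toy example: two of the three retrieved answers are expected ones.
print(topk_metrics(ranked=["RI art. 10", "B rec. 18", "RI art. 4"],
                   expected={"RI art. 10", "RI art. 4"}, k=3))
```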

Table 6 Q4EU—scores of universal sentence encoder
Table 7 Q4EU—scores of MiniLM
Table 8 Q4EU—scores of MPNet

In Tables 6, 7 and 8 we show the macroFootnote 14 top-k evaluation scores for \(k \in \{5, 10\}\),Footnote 15 studying how different types of information units and deep language models affect answer retrieval. In particular, we show two different evaluations in these tables. The first one is performed by running the answer retrieval algorithm on all 6 norms of Q4EU (we will refer to it as “all norms search”), even though the questions in Q4EU usually target only 1 or 2 norms. Instead, the second one (we will refer to it as “target norms search”) is performed by considering only the legal acts targeted by every question (e.g., Q4GDPR targets only the GDPR, Q4eIDAS only eIDAS), filtering out all the answers coming from unrelated norms.

As expected, all the scores obtained with a “target norms search” are higher than those obtained with an “all norms search”. Interestingly, the difference between the two evaluations clearly shows how much the incorrect selection of the target document weighs on DiscoLQA’s performance on Q4EU. Nonetheless, these results show that, regardless of the choice of k, using discourse relations (EDUs) as information units gives the best precision, especially in combination with clauses and AMRs.

Despite their differences, MPNet, MiniLM (the best) and the Universal Sentence Encoder behave very similarly, suggesting that the information units we considered may play a role independent from the underlying language model used for retrieval. DiscoLQA using only discourse relations and AMRs as information units (i.e., EDU+AMR) outperforms the baseline in terms of precision. This happens with all the language models considered, except MPNet. This fact suggests that EDUs and AMRs can retain most of the relevant information of the corpus of technical documents, supporting our hypothesis. Moreover, as shown in Table 9, the average length of EDUs and AMRs is smaller than that of normal clauses, further corroborating the hypothesis and suggesting that the deep language models considered can be distracted by longer clauses.

Table 9 Q4EU—average length of information units by type
Table 10 Statistical tests
Table 11 Statistical tests

In light of the similarities and differences observed across different algorithms and information units, a statistical test was essential to ascertain the significance of these findings. Since the data samples considered are not independent, we opted for the Wilcoxon signed-rank test (Woolson 2007), a non-parametric alternative to the paired t-test that is suitable for paired samples. Indeed, the same questions are tested across all algorithms (i.e., EDU, Clause, etc.).
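As an illustration, this paired comparison can be run with SciPy as in the sketch below, where the two arrays hold the per-question top-10 precision of two system variants (the numbers are made up for the example, not the actual Q4EU scores).

```python
from scipy.stats import wilcoxon

# Per-question P@10 of two paired system variants on the same questions
# (illustrative values only).
p10_clause_edu_amr = [0.4, 0.6, 0.3, 0.5, 0.7, 0.2, 0.6, 0.4]
p10_clause_only    = [0.3, 0.5, 0.3, 0.4, 0.5, 0.2, 0.4, 0.3]

# One-sided test: does the Clause+EDU+AMR variant score higher than Clause alone?
stat, p_value = wilcoxon(p10_clause_edu_amr, p10_clause_only,
                         alternative="greater")
print(f"W={stat:.2f}, p={p_value:.4f}")
```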

The results of the one-sided statistical tests on the top10 scores (“target norms search”) of the Universal Sentence Encoder and MiniLM are shownFootnote 16 respectively in Tables 10 and 11. Statistically significant improvements were generally seen in the precision and F1 scores when using a combination of EDUs, AMRs and clauses. MiniLM showed significant gains mainly in precision, whereas the Universal Sentence Encoder displayed more widespread improvements, particularly in F1 scores. Neither answer retriever exhibited statistically significant changes in NDCG and MRR metrics.

Overall, these findings support our hypothesis. They show that it is possible to improve a general-purpose language model, making it perform better with legal texts. This is possible by better capturing syntagmatic relationships and by using less noisy information units, i.e., by decomposing a generic clause into one or more discourse relations or AMRs.

Table 12 Q4EU—examples of correct answers

In other words, as expected, the information units representing the (generic) clauses carry enough noise to distract the answer retriever. By breaking the sentences into EDUs and explicitly keeping their relations, we can crystallise the discourse structure into the knowledge graph, making it invariant. Therefore the answer retriever is forced to “reason” over the discourse patterns, minimising the chances of relying on common-sense discourse schemes instead.

Examples of how EDUs and AMRs are important for some questions of the Q4EU dataset are shown in Table 12. In particular, a qualitative analysis of the algorithm’s responses shows that it can identify useful normative references to ensure the completeness of the answer and develop an overview. For example, among the answers to the question “Who decides precedence in the event of a conflict between a European arrest warrant and a request for extradition from a third country?” the algorithm identifies Article 16.3 (the most relevant answer) and suggests Recital 8, which helps interpret Article 16.3. Furthermore, for the same question, the algorithm also suggests Article 10.6, which, while not suitable for answering the question, leads the jurist to complementary points of reference for more holistic reasoning and interpretation.

Table 13 Q4EU—examples of wrong answers: this table shows a few examples of answers wrongly given by the baseline and DiscoLQA (EDU+AMR)

Both Tables 12 and 13 show errors committed by the answer retrievers and the extractor of information units. These examples clearly reveal at least two different types of errors. The first type occurs when an extracted information unit is semantically or grammatically incorrect, as in the first and fourth rows of Table 12. This type of error is relatively minor since, in some cases, the underlying language model is resistant to inaccuraciesFootnote 17, still allowing a correct answer to be retrieved, as shown in Table 12.

In particular, this first type of error is usually caused by the automatic extraction of AMRs and EDUs by a neural network, as described in Sect. 4.1. For this reason, it is possible to see in both Tables 12 and 13 examples of information units that do not perfectly overlap with the text of a response. On the other hand, the second type of error is due to mistakes in the deep language model for answer retrieval. As shown in Table 13, this type of error can be rather severe, causing wrong answers to be selected by the retriever.

Table 14 Q4EU—P@10 by context specificity
Table 15 Q4EU—percentage of answers more/less precise than the baseline

As in the evaluation carried out by Sovrano et al. (2021), we studied how (top-10) precision scores vary when the context specificity changes. The results partly confirm our expectations: we can see a trend where mean top-10 precision increases with the context specificity. This is clear in all instances of DiscoLQA, except AMR. In particular, as shown in Table 14, AMRs only contribute to better answering questions with low and normal specificity. Furthermore, we also show in Table 15 the percentage of queries for which DiscoLQA made a positive/negative difference with respect to the baseline in terms of top-10 precision, grouped by specificity.

Our expectations were based on the fact that:

  • The specificity of a question is low when it asks something that cannot be explicitly found in the Regulations but requires a holistic analysis of principles, competence rules, and so forth;

  • Questions with low specificity usually tend to have more expected answers, and it may be harder to find all of them;

  • Multi-hop reasoning is usually required to answer questions with low specificity, but the considered answer retrievers are not equipped for that kind of reasoning (yet).

For example, the question “How should a contract be interpreted according to Regulation Rome I?” has a very low specificity. It requires pinpointing both recitals and articles, and therefore several distinct and distant paragraphs, for a proper answer. Most questions regarding hermeneutics would probably require a similarly broad view of the subject, having low specificity with respect to the Regulation and therefore requiring multi-hop reasoning.

8 Discussion and conclusion

With this paper, we empirically investigated the role of discourse structure in legalese, trying to understand its importance in encoding the meaning of legal documents. Ours is a first attempt to exploit more sophisticated linguistic theories such as PDTB. To this end, we devised a simple experiment on legal question answering, designed to shed more light on whether Elementary Discourse Units (EDUs) and Abstract Meaning Representations (AMRs) are the fundamental information units in legislative texts as well.

As a result of these experiments, we found that EDUs and AMRs seem to be useful for better capturing long-distance relations between information units, as shown in Table 15. This leads to an overall improvement of our DiscoLQA over the baseline in terms of precision, F1, NDCG and MRR. In particular, EDU+AMR (the version of DiscoLQA using AMRs and EDUs) was able to produce \(23.61\%\) more precise top-10 answers than the baseline, using MiniLM with a “target norms search”. This percentage rises to \(25\%\) and \(27.27\%\) when considering only questions with normal and low specificity, respectively.

The goal of our experiments was also practical, not just theoretical. Understanding how legalese differs from its ordinary natural language counterpart can help us address the problem of data scarcity in legalese processing/understanding by allowing us to exploit general-purpose language models not specifically trained on legal documents. However, such generic language models are not the only option available. Indeed, in the literature, it is possible to find several examples of training data for legal domains, or at least training data that can be exploited via transfer learning paradigms. Nonetheless, transfer learning is challenging, and different legal domains or documents may deploy different discourse structures, requiring different language models. For example, privacy policies can be considered legal documents, though their language is usually closer to plain English than legalese, to help consumers understand the policy. In other words, transfer learning can be an alternative solution to zero-shot question answering. However, neither of the two approaches can be considered a one-size-fits-all solution for all possible problems.

We tested and evaluated DiscoLQA on specific European norms and a relatively small dataset without comparing our results with deep language models pre-trained on legal corpora, as explained at the end of Sect. 5. Nonetheless, even though Q4EU covers different legal sub-domains (respectively: Private International Law, the European arrest warrant, data protection and electronic signatures), our instances of DiscoLQA were able to generalise well across them, outperforming the baseline in all the cases. Notably, this result occurred even though we built DiscoLQA to perform zero-shot question answering without any training procedure involving European legislation or (more generally) legal documents. Therefore, DiscoLQA can potentially be used in various domains where data scarcity is unavoidable. To apply DiscoLQA, it is not necessary to manually create a new dataset such as Q4EU, a time-consuming effort.

Another point worth discussing is the scalability of DiscoLQA. Indeed, DiscoLQA introduces some extra overhead over the baseline, but this overhead affects neither the asymptotic time complexity of answer retrieval nor that of pre-processing. More precisely, the time complexity of pre-processing changes only by a constant factor. This is because EDUs and AMRs are extracted in polynomial time from paragraphs (and not documents) by a pre-trained deep neural network that does not need to be retrained in order to work. Furthermore, the time complexity of retrieval can only increase by a constant factor, i.e., when EDUs and AMRs are combined with normal clauses. This is because the number and size of EDUs and AMRs normally never exceed those of clauses. When only EDUs or AMRs are considered instead of clauses, the number of information units to be searched is even smaller, reducing retrieval time.

In most of today’s deep learning applications, the test and training sets are much larger than those used in these experiments. For example, the MS Marco (Nguyen et al. 2016) collection (partly also used for training MiniLM and MPNet) consists of over 1 million questions whose answers are extracted from 3.5 million web documents. These large datasets only make sense for training and evaluating generic language models on tasks that do not suffer from data scarcity. In these cases, due to bandwidth and scalability issues, a pre-processing strategy such as that employed by DiscoLQA and the baseline could introduce a significant memory overhead into the information retrieval system. Instead, due to the small size of the Q4EU dataset (less than 300 items per sub-collection), we can easily implement an extractor of knowledge graphs (and other relationship identifiers).

On the one hand, working with less data poses several technical challenges that sometimes require paradigm shifts. On the other hand, it can also open the way for several technological solutions previously considered impractical. In this article, we have shown only a few examples of how deep learning strategies can be rethought to adapt to smaller data and problems. We have only seen the tip of an iceberg that may be uncovered by emerging ideas from joint efforts in the field of AI and law. For instance, as future work, we point to the possibility of specialising the algorithm for extracting EDUs and AMRs to legislative texts, taking into account what we already know about legal connectives and discourse structures.