1 Introduction

We are witnessing a growing need for the digitisation of our society, which requires great interdisciplinary efforts in law, information technology and engineering. This need has led to the birth of institutions such as the Ministry of Digital Governance of Greece and the Australian Digital Transformation Agency, or long-term plans like the European Digital Transition Action Plan and many others.

In the literature of AI, answering questions using an extensive collection of documents on diversified topics (e.g., Private International Law) is called open-domain Question Answering (QA). Modern open-domain QA systems usually combine traditional information retrieval techniques and neural reading comprehension models. Nevertheless, neural reading comprehension of legal texts (e.g., European legislation) is challenging because legalese is rarer, mercurial and in many ways different from commonly used natural language. Hence, the difference between legal and ordinary language creates technical issues when applying or fine-tuning general-purpose language models for open-domain question answering on legal resources. This is especially true when the meaning of a legal document is encoded in its (discourse) structure in a way that is different from the spoken language. For example, long sentences or more “formal” writing may be preferred in legislative documents (e.g., Brussels I bis Regulation EU 1215/2012) to reduce potential ambiguities and improve comprehensibility. However, the noise introduced by excessively long sentences or their unusual structure can distract a language model trained on ordinary English, pushing it to commit more errors.

As a result, standard neural reading comprehension models may only be able to represent the semantics of a legal text if they are adequately specialised to do it. This is because legalese is not repetitive. It is canonical and has semantic terminology that tends to avoid polysemy and to be used precisely in particular contexts, as if the sentences it forms were governed by formal rules. Hence, applying these formal rules impacts the discourse structure, as suggested by Sovrano et al. (2022).

Here, we expand the work published by Sovrano et al. (2020), investigating some mechanisms to perform “zero-shot” legal question answering. More specifically, “zero-shot” means that question answering is performed through pre-trained language models (e.g., a model that is trained on generic non-legal documents) without fine-tuning them on the downstream legal task of question answering. In this sense, zero-shot legal question answering can be a necessary solution for all those tasks characterised by a paucity of data (e.g., European hard laws, the resolutions of the United Nations General Assembly) and for which we want to train AI-based solutions through machine learning without having enough information for effective fine-tuning. Conversely, zero-shot legal question answering might be less helpful whenever data are abundant (e.g., American case law or privacy policies).

In this article, we investigate the role of discourse structure in legalese, trying to understand and exploit its importance in encoding the meaning of legal documents. The goal of this investigation is also practical, not just theoretical. Understanding how legalese differs from its spoken counterpart can help solve the data scarcity problem in legalese processing/comprehension. This would allow us to better exploit generic language models not calibrated to a downstream legal task or even not trained on legal documents, as shown throughout the paper.

Specifically, we use open-domain QA systems based on information retrieval and neural reading comprehension and study what happens when changing the type of information to consider for retrieval. These QA systems encode all the possible answers (e.g., parts of articles, recitals) with a general-purpose neural model and then use the encoding for fast similarity-based retrieval. Usually, these answers are just a short part (a grammatical sub-tree) of one sentence or paragraph, especially if the whole is very long. If the neural model is not specialised in legalese, it will likely fail to identify and capture the importance of grammatical sub-trees that are uncommon in the spoken language. Hence, by selecting only those grammatical sub-trees deemed the most important, we should be able to help the information retriever and the QA system by partially hiding the noise within answers. To identify these important grammatical sub-trees, we used the theory of Elementary Discourse Units (EDUs) (Prasad et al. 2008) and the theory of Abstract Meaning Representations (AMRs) (Banarescu et al. 2013).

In other words, we show how to produce more effective answer retrieval tools by capturing discourse structure, leveraging existing QA tools specialised in ordinary natural language. Therefore, to shed more (empirical) light on what constitutes meaning in legalese, we decided to design an experiment focused on understanding whether there is a benefit in using EDUs or AMRs as triplets in the knowledge graphs extracted by the pipeline proposed by Sovrano et al. (2020). We devised a simple experiment where we study what happens to the baseline QA system when using EDUs or AMRs during information retrieval.

In particular, to evaluate our results, we present a new dataset called Q4EU that extends Q4PIL (Sovrano et al. 2021) with 3 more European norms, for a total of 72 unique questions and 225 expected answers (in the form of articles and recitals) on 6 heterogeneous European norms spanning from Private International Law to Human Rights Law (i.e., the General Data Protection Regulation, EU 2016/679), and from the regulation of electronic signatures to the European arrest warrant.

The results of our experiments show that the versions using EDUs are overall the best, leading to state-of-the-art top-k precision and F1 scores for all the values of k we considered. Our instances of DiscoLQA were able to generalise across the different legal sub-domains tested, even though the deep language models involved were not pre-trained on legal corpora.

However, we only tested and evaluated DiscoLQA on specific European norms and a relatively small dataset, without using deep language models pre-trained on legal corpora.

Our contribution is threefold:

  1. We show where a general-purpose language model may fail when applied to legal documents, hinting at how to intervene for effective fine-tuning or re-training. In other words, we show that legalese’s semantics may be encoded differently from those of ordinary language. Identifying the sources of meaning may be beneficial for effectively improving the state-of-the-art neural reading comprehension of legal documents.

  2. We show a way to effectively use discourse analysis for legal question answering, improving the state of the art without fine-tuning or re-training the language models on the regulations at hand.

  3. We publish Q4EU, a new evaluation dataset for legal question-answer retrieval that extends the work by Sovrano et al. (2021).

For reproducibility purposes, we also publish on GitHub the source code of DiscoLQA.Footnote 1

This paper is structured as follows. In Sect. 2, we discuss the related work on (legal) QA, while in Sect. 3 we give all the necessary background information to understand the pipeline of algorithms presented in Sect. 4. In Sect. 5 we describe our experiment, and in Sect. 6 we present the Q4EU dataset. Finally, we analyse the results in Sect. 7 and point to future work in Sect. 8.

2 Related work

Legal QA is a relatively recent field of study in the context of AI and Law, with many exciting solutions available today. Some of these solutions follow end-to-end approaches, exploiting existing language models. In contrast, some others try to exploit ontologies and knowledge graphs by framing QA as a task of information retrieval.

On the one hand, there is a paucity of generic end-to-end solutions to legal QA: the existing ones are usually focused only on particular and narrow applications for which large enough datasets are available. When no large dataset is available for training, using deep language models pre-trained on ordinary English does not always produce good results; indeed, Zheng et al. (2021) showed that the more complex the legal reasoning task, the less effective fine-tuning can be. An example of an end-to-end QA system is the work by Kim et al. (2015), where a deep neural network is trained on a dataset of Boolean questions from Japanese legal bar exams. Another interesting example is the work by Ravichander et al. (2019), proposing an end-to-end question-answering solution for privacy policies.

On the other hand, an example of an answer retrieval system specific to Private International Law is the one proposed by Sovrano et al. (2020). It consists of a combination of TF-IDF and some deep language models to retrieve pertinent answers from an automatically extracted knowledge graph of contextualised grammatical sub-trees. In particular, the knowledge graph is aligned to a legal ontology based on Ontology Design Patterns (i.e., agent, role, event, temporal parameter, action) to mirror the legal significance of the relationships within and among the provisions. In this sense, we extend the work by Sovrano et al. (2020), trying to overcome some of the issues of using language models not trained in legalese.

Another example of an answer retrieval system is the work by Vold and Conrad (2021), which compares the performance of a deep learning-based solution with that of a traditional SVM. In particular, Vold and Conrad (2021) fine-tuned a deep language model (RoBERTa) on a dataset of questions about privacy policies (which usually use a language closer to spoken English than to legalese), obtaining better results than with an SVM.

3 Background

In this section, we provide all the necessary background information to understand state-of-the-art automated question-answering and the relationship between discourse theory and legalese.

3.1 Question answering and law

Natural language processing/understanding is of utmost importance in the intersection of AI and Law. This is why many works in this field have focused on general-purpose state-of-the-art language models for the generation of word/sentence embeddings (Shao et al. 2020; Condevaux et al. 2019; Vink et al. 2020).

For example, Bommarito et al. (2018) published a framework for natural language processing and information extraction for legal and regulatory texts, while Chalkidis and Kampas (2019) proposed one of the first models for legal word embeddings. Also, the Incorporated Council of Law Reporting for England and Wales (ICLR 2019) published Blackstone, a library meant to allow researchers and engineers to automatically extract information from long, unstructured legal texts (such as judgements, skeleton arguments, scientific articles, Law Commission reports, and pleadings). More generally, natural language processing for legal texts has recently raised a lot of interest, highlighting “the need to create a bridge between conceptual questions, such as the role of legal interpretation in mining and reasoning, as well as computational and engineering challenges, such as the handling of big legal data and the complexity of regulatory compliance” (Robaldo et al. 2019).

Automating legal reasoning is not a trivial task, as it requires a deep understanding of language, non-monotonic logic and the theory of interpretation, as well as sufficient flexibility to handle the plethora of changes to which law and hermeneutics are subject over time. Current state-of-the-art AI for reasoning is divided into two approaches: the symbolic and the sub-symbolic. The symbolic approach draws from formal languages and logic. It requires every component of the reasoning to be an abstract symbol with a pre-defined and context-independent interpretation of its meaning, making the AI based on this approach hardly compatible with natural languages such as English, Chinese, and Spanish. On the other hand, the sub-symbolic approach draws from recent advancements in deep learning. Exploiting large amounts of data, it can “understand” natural language and visual inputs in a scalable and highly effective way. However, it loses transparency by working on non-symbolic representations (i.e., arbitrary numerical vectors) that are frequently not interpretable.

Non-monotonic reasoners based on Defeasible Logic (Lam and Governatori 2009), Deontic Logic (Hage 2000) and Argumentation (Gordon and Walton 2009) are famous examples of symbolic AI applied to the legal domain. All require legal documents to be translated (manually) from their original natural language into some particular formal language upon which classical logical reasoning can be applied. This type of reasoner usually struggles to scale when handling natural language (e.g., English) inputs such as documents and questions.

On the other hand, the sub-symbolic approach is more versatile and well-known to be more easily applied directly to natural language documents. Famous sub-symbolic approaches to (legal) reasoning are the so-called QA algorithms. As suggested by Xie et al. (2020); Cao et al. (2019); Zhang et al. (2018); Hudson and Manning (2019) and others, in many cases, question answering can be seen as an instance of reasoning. These QA algorithms are usually trained end-to-end to extract short (i.e., 2–3 words) answers from a whole document (text or image) to match a given question.

The most common end-to-end QA algorithms, i.e. those collected by Wolf et al. (2020), rely on Transformers (Vaswani et al. 2017). Hence, they have quadratic complexity in the size of the whole document to be searched for an answer. This characteristic makes end-to-end QA based on Transformers fail in all those situations where collections of large documents on diversified topics (e.g., Private International Law) are involved, or where parts of the same answer are scattered across multiple documents. A solution to this problem is Question-Answer Retrieval, also known as Dense Passage Retrieval or open-domain QA (Chen and Yih 2020). Modern open-domain QA systems usually combine traditional information retrieval techniques and neural reading comprehension models. These QA systems encode all the identified possible answers (e.g., parts of articles, recitals) with a general-purpose neural model. Then they use the encoding for fast similarity-based retrieval. Therefore, unlike end-to-end QA, Question-Answer Retrieval requires the a priori identification of the possible snippets of text functioning as answers, but it is much faster. In fact, it has a complexity that is usually proportional to the product of the size of the context (normally a small paragraph) and the size of the answer (commonly smaller than the context).
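To make this pattern concrete, the minimal sketch below (ours, not taken from any of the cited systems) illustrates the typical Question-Answer Retrieval workflow: candidate answers are encoded once, off-line, and each incoming question only requires one encoding plus a fast similarity search over the pre-computed index. The toy `encode` function is a stand-in for a neural sentence encoder; a real system would use a deep language model instead.

```python
import zlib
import numpy as np

def encode(text: str, dim: int = 128) -> np.ndarray:
    """Toy stand-in for a neural sentence encoder: a bag-of-words random
    projection. A real open-domain QA system would call a deep language
    model here instead."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        rng = np.random.default_rng(zlib.crc32(token.encode("utf-8")))
        vec += rng.standard_normal(dim)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Off-line step: encode every candidate answer once and build an index.
candidate_answers = [
    "The applicable law is determined by the habitual residence of the seller.",
    "Member States shall execute any European arrest warrant.",
]
index = np.stack([encode(a) for a in candidate_answers])

# On-line step: one encoding per question, then a similarity search whose
# cost is linear in the number of indexed candidates.
question = "Which law applies to the sale of goods?"
scores = index @ encode(question)      # cosine similarities (unit vectors)
top_k = np.argsort(-scores)[:2]
print([candidate_answers[i] for i in top_k])
```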

Among the most important Question-Answer Retrieval models, we distinguish between those that use the answer’s context for the generation of embeddingsFootnote 2 (Yang et al. 2020; Karpukhin et al. 2020; Roy et al. 2020) and those that do not (Chen et al. 2020).

3.2 Discourse theory and legal language

The relation between discourse theory and legalese is complicated and still open to discussion. Discourse theory is a branch of linguistics that studies how coherence and cohesive relations can be the threads that make up a text to form a discourse. A discourse is said to be coherent if all of its pieces belong together, while it is said to be cohesive if its elements have some common thread. Sanders et al. (1992) identified two requirements for a theory of discourse:

  • Descriptive adequacy: A theory of discourse structure should make it possible to describe the structure of all kinds of (natural) texts.

  • Psychological plausibility: A theory of discourse structure should at least generate plausible hypotheses on the role of discourse structure in constructing cognitive representation.

In recent years, many different theories of discourse have been spelt out, each with different pros and cons. Among them, we cite the Rhetorical Structure Theory (Mann and Thompson 1988), assuming that discourse is structured as a tree, the Segmented Discourse Representation Theory (Lascarides and Asher 2007), assuming that discourse is structured as a graph (therefore allowing long-distance attachments), and the theory of EDUs (Miltsakaki et al. 2004; Prasad et al. 2008; Webber et al. 2019), making no assumption on the text structure. Common to all of them is arguably the identification of something that may be called an Elementary Discourse Unit (EDU). EDUs are spans of text denoting a single event serving as a complete, distinct unit of information that the surrounding discourse may connect to (Stede 2013). EDUs can be combined to form many different types of discourse (Fludernik 2000; D’Angelo 1984), including argumentation, exposition, description, and narration.

The theory of EDUs encoded by the Penn Discourse Treebank (PDTB) model is considered one of the most generic theories of discourse. Indeed, PDTB is data-driven (based on lexically grounded relations) and makes few assumptions about the underlying language. As a result, with little or no change in annotative style, PDTB appears to be usable for modelling discourses of natural languages belonging to different families (Zufferey and Degand 2017), e.g., Chinese, Arabic, and Hindi. In particular, PDTB is based on the assumption that “the meaning and coherence of a discourse result partly from how its constituents relate to each other”. Therefore, discourse relations are defined as semantic relations between abstract objects (or EDUs) mentioned in discourse and connected by explicit (e.g., “but”, “then”, “for example”, and “although”) or implicit relations. According to PDTB, discourse relations can be of one of 4 main types: temporal, contingency (causality, purpose, etc.), expansion, and comparison. PDTB-style annotations and the other theories of discourse have inspired an ISO standard (Prasad and Bunt 2015).

The application of PDTB to legalese has been explored by some (Robaldo et al. 2008; Cabrio et al. 2013), but has yet to receive much follow-up. The point is that ordinary discourse theory is better suited to judgments, Hansard reports, testimonies and reports of debates. Instead, it seems unsuited to legislative texts and contracts, for which a specific vocabulary (e.g., definitions) or textual structure (e.g., hierarchy) is used to identify meaning through interpretation theory. Indeed, legislative texts have a deeper structure than common sentences. For example, a list legally encodes a set of conditions linked together by specific semantics. Furthermore, the classical linguistic structures based on discourse connectives tend to be used differently in law. Legal connectives do not have the same semantic value as in everyday discourse. They are operators of deontic rules with multiple meanings (e.g., “xor”, “or”, “and”). Also, some discourse structures tend not to be used at all because they are not considered good practice in legal drafting (e.g., “but” and “for example”).

4 DiscoLQA: discourse theory for legal question answering

This paper proposes a novel pipeline of algorithms called DiscoLQA, short for Discourse-based Legal Question Answering. DiscoLQA is based on the automatic extraction of special knowledge graphs designed to address Legal QA through general-purpose deep language models that are not specifically trained on legal documents. In particular, DiscoLQA is composed of the baseline tool of Sovrano et al. (2020), extended with a new component responsible for extracting special information units representing EDUs and AMRs.

The baseline tool described by Sovrano et al. (2020) is composed of a pipeline of algorithms for efficient Question-Answer Retrieval through the extraction of a knowledge graph from a set of information units. In this sense, the main difference between DiscoLQA and the baseline is (as shown in Fig. 1) the type of information units considered by the knowledge graph extractor. The baseline uses as information units all the clausesFootnote 3 of the source documents.Footnote 4 Instead, DiscoLQA can use as information units not only the clauses but also the AMRs and discourse relations extracted from the clauses.

In other words, DiscoLQA supports more types of information units and allows the retrieval of answers from any combination of clauses, AMRs and discourse relations. Specifically, discourse relations are meant to capture how EDUs are connected, while AMRs are meant to capture the informative components within the EDUs, possibly supporting the answering of basic questions such as “who did what to whom, when or where”. For example, from the sentence “The existence and validity of a contract, or any term of a contract, shall be determined by the law which would govern it under this Regulation if the contract or term were valid” it is possible to extract the following discourse relation about contingency (which we represent as a pair of question and answer for convenience and clarity): “In what case would the law govern it under this Regulation? If the contract or term were valid”, and the following AMR question-answer: “By what is the existence and validity of a contract determined? The law that would govern it under this Regulation if the contract or clause were valid”. So, a discourse relation identifies two EDUs: the first encoded in the question and the second in the answer.
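For illustration, the minimal sketch below (our own simplification, not DiscoLQA's actual data model) shows how the two information units extracted from the sentence above could be represented as question-answer pairs, which is also the convention adopted by the fine-tuning datasets discussed in Sect. 4.1; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class InformationUnit:
    """A discourse relation (pair of EDUs) or an AMR, stored as a QA pair."""
    kind: str      # "EDU" (discourse relation) or "AMR"
    source: str    # the sentence the unit was extracted from
    question: str  # implicit question (for EDUs: the first discourse unit)
    answer: str    # answer span (for EDUs: the second discourse unit)

source = ("The existence and validity of a contract, or any term of a contract, "
          "shall be determined by the law which would govern it under this "
          "Regulation if the contract or term were valid")

units = [
    InformationUnit(
        kind="EDU",  # contingency relation between two discourse units
        source=source,
        question="In what case would the law govern it under this Regulation?",
        answer="If the contract or term were valid",
    ),
    InformationUnit(
        kind="AMR",  # "who did what to whom"-style unit
        source=source,
        question="By what is the existence and validity of a contract determined?",
        answer=("The law that would govern it under this Regulation "
                "if the contract or term were valid"),
    ),
]
```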

In this section, we discuss the system implementation of DiscoLQA, starting from the proposed mechanism for extracting EDUs and AMRs.

4.1 Information units extraction: discourse relations and abstract meaning representations

Fig. 1
figure 1

Sketch of the pipeline used in the baseline and DiscoLQA. The baseline extracts only clauses from the source texts (articles, recitals, commission statements, etc.). DiscoLQA also extracts discourse relations and AMRs as information units. The information units are then passed to the knowledge graph extractor, which produces a graph used by the Question-Answer Retriever

The AMRs and EDUs used by DiscoLQA are extracted from sentences and paragraphs through a deep language model based on T5Footnote 5 (Raffel et al. 2020), pre-trained on a multi-task mixture of unsupervised and supervised tasks.

Vanilla T5 is not trained to recognise AMRs or EDUs. Therefore, we fine-tuned T5 on public datasets designed for these tasks, namely QAMR (Michael et al. 2018) for extracting AMRs and QADiscourse (Pyatkin et al. 2020) for EDUs and discourse relations. Interestingly, both datasets encode AMRs and EDUs as question-answer pairs; this is done for convenience only. Indeed, as pointed out by Michael et al. (2018); Pyatkin et al. (2020); Roit et al. (2020) and others, the question-answer format is more natural, making it easier for humans to make changes, correct errors and suggest improvements, even without knowing in detail all the underlying linguistic theories.

Most importantly, the QAMR and QADiscourse datasets are not related to any of the technical domains covered by Q4EU. They do not contain legal documents or text fragments written in legalese. In other words, by fine-tuning T5 on QAMR and QADiscourse, we do not refine T5 on legal texts. Legal fine-tuning would require the costly extraction of a dataset of AMRs and EDUs from legal texts, also considering ad hoc adaptations of discourse theories and abstract meaning representation to legal language.

In particular, the QAMR dataset is made of 107,880 different questions (and answers) that map AMR theory to the following wh-phrases:

  • What (60.9% of the dataset),

  • Who (17.5%),

  • How (6.9%),

  • Where (5.0%),

  • When (4.3%),

  • Which (2.9%),

  • Whose (1.9%),

  • Why (0.6%).

On the other hand, the QADiscourse dataset is made of 16,613 different questions (and answers) that map PDTB relations (mainly contingency and temporal ones) to the following wh-phrases:

  • In what manner (25% of the dataset),

  • What is the reason (19%),

  • What is the result (16%),

  • What is an example (11%),

  • After what (7%),

  • While what (6%),

  • In what case (3%),

  • Despite what (3%),

  • What is contrasted with it (2%),

  • Before what (2%),

  • Since when (2%),

  • What is similar (1%),

  • Until when (1%),

  • Instead of what (1%),

  • What is an alternative (\(\le 1\%\)),

  • Except when (\(\le 1\%\)),

  • Unless what (\(\le 1\%\)).

Both datasets consist of tuples \(<s, q, a>\), where s is a source sentence, q is a question (implicitly) expressed in s, and a is an answer expressed in s. T5 is therefore fine-tuned to tackle the following four tasks per dataset at once:

  1. Extract a, given s and q;

  2. Extract q, given s and a;

  3. Extract all the possible q, given s;

  4. Extract all the possible a, given s.

Specifically, we fine-tuned the T5 model on QAMR and QADiscourse for five epochs.Footnote 6 The objective of the fine-tuning was to minimise a loss function measuring the difference between the expected output (i.e., a for the 1st task, q for the 2nd task, etc.) and the output given by T5. A mathematical definition of the loss function is given by Raffel et al. (2020).

At the end of the training, the average loss was 0.4098, meaning that our fine-tuned T5 model cannot perfectly extract AMRs or EDUs from the text composing the training set. On the one hand, this is a good thing because it is likely that the model did not over-fit on the training set. On the other hand, this points to the fact that the AMRs and EDUs extracted by our T5 model can be imperfect, containing errors that could propagate to the answer retrieval system. Regardless, in the following sections, we show that even if the language models we rely on are imperfect, we can still outperform the baseline information retrieval system.
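For concreteness, the sketch below shows how a fine-tuned T5 checkpoint of this kind could be queried at inference time to extract such question-answer pairs with the Hugging Face transformers library; the checkpoint path and the task-prefix convention are our assumptions and do not necessarily match DiscoLQA's actual configuration.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical checkpoint: a T5 model fine-tuned on QAMR and QADiscourse.
MODEL_PATH = "path/to/t5-finetuned-qamr-qadiscourse"

tokenizer = T5Tokenizer.from_pretrained(MODEL_PATH)
model = T5ForConditionalGeneration.from_pretrained(MODEL_PATH)

sentence = ("The existence and validity of a contract shall be determined by "
            "the law which would govern it under this Regulation if the "
            "contract were valid.")

# Task 3 above: extract all the possible questions q, given s.
# The "generate questions:" prefix is an illustrative convention only.
inputs = tokenizer("generate questions: " + sentence, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, num_beams=4,
                         num_return_sequences=4)
questions = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
print(questions)
```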

4.2 System implementation: knowledge graph extraction and answer retrieval

DiscoLQA, similarly to the baseline tool described by Sovrano et al. (2020), consists of a pipeline of AI algorithms capable of extracting, from a set of information units, a knowledge graph that an information retrieval system can exploit to answer a given question. In particular, this knowledge graph is extracted by detecting, with a dependency parser, all the possible phrases and sub-phrases within the information units, so that each phrase stands for an edge of the knowledge graph. In practice, these phrases are represented as special triplets of subjects, templates and objects called template-triplets. Specifically, the templates are composed of the ordered sequence of tokens connecting a subject and an object. The subject and the object are represented in such templates with the placeholders “{subj}” and “{obj}”.

Hence, the resulting template-triplets are a sort of function, where the template is the body and the subject and the object are the parameters. Obtaining a natural language representation of these template-triplets is straightforward by design: it suffices to replace the parameters in the body with their instances. This natural language representation is then used as a possible answer for retrieval by measuring the similarity between its embedding and the embedding of a question. An example of a template-triplet is shown below, followed by a minimal code sketch of its natural language realisation:

  • Subject: “the applicable law”

  • Template: “Surprisingly {subj} is considered to be clearly more related to {obj} rather than to something else”

  • Object: “that Member State”
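The following sketch (ours; the class name and implementation details do not necessarily mirror DiscoLQA's source code) shows how such a template-triplet can be represented and realised in natural language by filling the placeholders.

```python
from dataclasses import dataclass

@dataclass
class TemplateTriplet:
    """An edge of the knowledge graph: a template with two placeholders."""
    subject: str
    template: str  # contains the "{subj}" and "{obj}" placeholders
    object: str

    def to_text(self) -> str:
        """Natural language realisation: fill the placeholders in the body."""
        return self.template.format(subj=self.subject, obj=self.object)

triplet = TemplateTriplet(
    subject="the applicable law",
    template=("Surprisingly {subj} is considered to be clearly more related "
              "to {obj} rather than to something else"),
    object="that Member State",
)
print(triplet.to_text())
# -> Surprisingly the applicable law is considered to be clearly more
#    related to that Member State rather than to something else
```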

Because of the adopted extraction procedure, the resulting knowledge graph is not perfect: it may contain mistakes caused by wrongly identified grammatical dependencies or other issues.

To increase the interoperability of the extracted knowledge graph with external resources, we formatted it as an RDF graph. RDF is a standard model for data interchange on the Web (Allemang and Hendler 2011). In particular, RDF has features that facilitate data merging even if the underlying schemas differ. To format a graph of template-triplets into an RDF graph, we performed the following steps (sketched in code after the list):

  • We assigned a Uniform Resource Identifier (URI) to every node (i.e., subject and object) and edge (i.e., template) of the graph by lemmatising the associated text. To each URI, we assigned an RDFS label corresponding to the associated text.

  • We added special triplets to keep track of the sources from which the template-triplets were extracted, so that for each node and edge it is possible to go back to the source document or paragraph.

  • We added sub-class relations between composite concepts (syntagms) and the simplest concepts (if any) composing the syntagm. For example, “contractual obligation” is a sub-class of “obligation”.
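The sketch below illustrates these steps with rdflib; the namespace, the simplified lemmatisation, and the property names are assumptions of ours rather than the exact schema used by DiscoLQA.

```python
import re
from rdflib import Graph, Literal, Namespace, RDFS, URIRef

EX = Namespace("http://example.org/discolqa/")  # illustrative namespace

def uri(text: str) -> URIRef:
    # Stand-in for lemmatisation: a real system would lemmatise the text first.
    return EX[re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")]

g = Graph()

subj, obj = "the applicable law", "that Member State"
template = ("Surprisingly {subj} is considered to be clearly more related "
            "to {obj} rather than to something else")

# Nodes and edge, each with an RDFS label holding the associated text.
for text in (subj, obj, template):
    g.add((uri(text), RDFS.label, Literal(text)))

# The template-triplet itself, plus provenance and a sub-class relation.
g.add((uri(subj), uri(template), uri(obj)))
g.add((uri(subj), EX.extractedFrom, EX["rome_i/article_4"]))  # source tracking
g.add((uri("contractual obligation"), RDFS.subClassOf, uri("obligation")))

print(g.serialize(format="turtle"))
```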

For more technical details about how we performed all the steps mentioned above to convert the template-triplets into an RDF graph, please refer to Sovrano et al. (2020) or the source code of DiscoLQA.

Finally, the algorithm to retrieve answers from the extracted knowledge graph is based on the following steps. Let C be the set of concepts in a question q, let \(m=<s,t,o>\) be a template-triplet, let \(u=t(s,o)\) be the natural language representation of m, also called an information unit, and let z be its source paragraph. DiscoLQA performs answer retrieval by finding the most similar concepts to C within the knowledge graph, retrieving all their related template-triplets m (including those of the sub-classes), and selecting amongst the natural language representations u of the retrieved template-triplets those that are likely to be an answer to q. The probability that u pertinently answers q can be estimated through SyntagmTuner (Sovrano et al. 2022) as the numerical similarity between the embedding of \(u + z\) (i.e., u concatenated with z) and the embedding of q. If \(u + z\) is similar enough to q, then z is said to be an answer to q for the information unit u. Therefore, the algorithm can retrieve any arbitrary number of answers, given that enough information units are available.

In particular, the embeddings of \(u + z\) and q are obtained through a deep language model specialised in QA retrieval and pre-trained on ordinary English to associate similar vectorial representations to a question and its correct answers. The pre-trained deep language models we considered for our implementation of DiscoLQA and our experiments are the Universal Sentence Encoder (Yang et al. 2020), MiniLM (Wang et al. 2021), and MPNet (Song et al. 2020).
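A minimal sketch of this ranking step with the sentence-transformers library is shown below; the checkpoint name is one of the publicly available MiniLM QA-retrieval models and, like the example texts, is an assumption of ours rather than DiscoLQA's exact configuration.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoint: a MiniLM model tuned for question-answer retrieval.
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

question = "Which law governs the validity of a contract?"

# Each candidate is an information unit u paired with its source paragraph z.
candidates = [
    ("The existence and validity of a contract is determined by the law that "
     "would govern it under this Regulation",                      # u
     "The existence and validity of a contract, or of any term of a contract, "
     "shall be determined by the law which would govern it under this "
     "Regulation if the contract or term were valid."),            # z
    ("Member States shall execute any European arrest warrant",    # u
     "Member States shall execute any European arrest warrant on the basis of "
     "the principle of mutual recognition."),                      # z
]

q_emb = model.encode(question, convert_to_tensor=True)
a_emb = model.encode([u + " " + z for u, z in candidates], convert_to_tensor=True)

scores = util.cos_sim(q_emb, a_emb)[0]   # similarity of q to each u + z
best = int(scores.argmax())
print(candidates[best][1])               # the source paragraph z is returned as the answer
```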

5 Experiment

Given all the premises stated in Sects. 1 and 3, we designed an experiment to better understand the role of discourse relations in legalese and to determine how to exploit existing state-of-the-art general-purpose natural language models for QA to automatically and effectively answer questions on legal documents (e.g., Private International Law). Indeed, legalese is a technical language in many ways similar to its related natural language, but with important differences in how the meaning is encoded in the text. Legalese is not repetitive. It is canonical and has semantic terminology that tends to avoid polysemy and to be used precisely in particular contexts, as if the sentences it forms were governed by very formal rules.

We hypothesise that applying these formal rules affects the syntagmatic relationships within sentences and the discourse structure. If this hypothesis were correct, it would in principle be possible to specialise general-purpose natural language models to legalese simply by integrating them with external information about the discourse structure of legal texts, without costly training procedures otherwise hampered by the scarcity of data. This is why we decided to design an experiment focused on understanding whether there is a benefit in using discourse relations and AMRs instead of plain sentences when performing Question-Answer Retrieval on the bodyFootnote 7 of articles and recitals. The overall idea is that using discourse relations and AMRs as information units would help to partly crystallise into the retrieval system the structure of discourse used by the legal texts. This would make it invariant, preventing the answer retriever from falling back on the discourse schemes learned from ordinary language.

Hence, we designed DiscoLQA, which, as described in Sect. 4, extends the baseline Question-Answer Retrieval system proposed by Sovrano et al. (2020) by supporting different combinations of information units, i.e., AMRs and discourse relations. So, for the experiment, we can compare the performance of different information units on the same answer retrieval algorithm. More precisely, we want to study the following instances of DiscoLQA:

  • Clause: equivalent to the QA tool by Sovrano et al. (2020). This is DiscoLQA which uses only clauses as information units.

  • Clause+EDU+AMR: DiscoLQA which uses clauses, discourse relations and AMRs as information units, all together.

  • Clause+EDU: DiscoLQA using clauses and discourse relations but not AMRs.

  • Clause+AMR: DiscoLQA using clauses and AMRs.

  • EDU+AMR: DiscoLQA using only discourse relations and AMRs.

  • EDU: DiscoLQA using only discourse relations.

  • AMR: DiscoLQA using only AMRs.

As a result, if one type/combination of information units performs better than the others, the gain in performance can be imputed to the only difference between the tools: the type/combination of adopted information units. Therefore, if DiscoLQA were better than the baseline (Sovrano et al. 2020), we would have some evidence supporting our initial hypothesis, obtained by measuring the effects of discourse structure on the performance of information retrievers trained on general-purpose natural language.

We consider as a baseline only the answer retrieval system by Sovrano et al. (2020) mainly for two reasons:

  1. It is the only system we know of that can perform legal question answering without any ad-hoc fine-tuning or training procedure. We do not have an extensive enough dataset to train an end-to-end QA system on specific European legislation; our focus is on zero-shot legal QA (as defined in Sect. 1).

  2. It is the only legal question-answer retrieval system we know of that has been tested on European legislation. Therefore, it is the most suitable baseline for us.

To show that the results generalise across different deep language models, we decided to run the experiments on different state-of-the-art deep neural networks for answer retrieval:

  • The Universal Sentence Encoder Q&A model (USE, for short), by TensorFlow (Yang et al. 2020, Google);

  • MiniLM (Wang et al. 2021, Microsoft);

  • MPNet (Song et al. 2020, Microsoft).

In particular, the last two models were fine-tuned on 215 million question-answer pairsFootnote 8 by SBERT (Reimers and Gurevych 2019).

We decided to consider only the models mentioned above because: (i) they are some of the best general-purpose models for the task available on TensorFlow Hub and SBERT (two state-of-the-art repositories of deep neural networks, easily accessible through user-friendly APIs); (ii) deep neural networks for answer retrieval (i.e., models for generating vectorial representations of questions and answers) are different from, and less common than, models for question answering or answer extraction.

Unfortunately, we do not know of any general-purpose open-source deep language model trained specifically on legal answer retrieval. The only exception could be the work by Vold and Conrad (2021), though their language model was trained on privacy policies, which are usually written in plainer English than European legislation (Table 1).

Finally, in order to evaluate DiscoLQA and perform the experiment, we need a dataset of at least 50Footnote 9 relevant questions on European legislation, with known expected answers. Considering that Q4PIL (Sovrano et al. 2021) comprises only 17 questions on Private International Law, we decided to build a larger test set called Q4EU, which includes more questions on different European norms, as described in Sect. 6.

6 Q4EU: a test set for legal answer retrieval

Q4EU contains 72 unique questions and 225 expected answers (i.e., articles and recitals). For simplicity of exposition, Q4EU can be divided into the following sub-sets:

  • Q4PIL (see Table 2): containing questions about three Private International Law regulations: Rome I Regulation EC 593/2008; Rome II Regulation EC 864/2007; Brussels I bis Regulation EU 1215/2012. These regulations are, respectively, on the law applicable to contractual obligations, on the law applicable to non-contractual obligations, and on jurisdiction and the recognition and enforcement of judgements in civil and commercial matters. In particular, they aim to provide a tool for identifying the applicable law and the jurisdiction in cases where two or more legal systems are connected and generate complex relationships (e.g., a sale of goods contract between an Italian and a German citizen regarding commodities situated in Spain).

  • Q4EAW (see Table 3): containing questions about the Council Framework Decision (CFD) of 13 June 2002 on the European arrest warrant and the surrender procedures between Member States.Footnote 10 In particular, this framework decision increases the efficiency of extradition procedures for crime suspects. Furthermore, it also determines the abolition of formal extradition procedures between member states of the EU for persons who are fugitives from justice after being finally convicted. The framework decision represents the first concretisation of the principle of free movement of judicial decisions in criminal matters, encompassing both pre-sentence and final decisions by fostering judicial cooperation and the development of a single area of freedom, security and justice in the EU.

  • Q4GDPR (see Table 4): containing questions about the General Data Protection Regulation (GDPR),Footnote 11 the most relevant piece of legislation in the EU legal framework with regard to data protection law. Its goal is to foster the fundamental right to data protection, enshrined in the Charter of Fundamental Rights of the European Union (art. 8), while harmonising rules on data processing, profiling, and risk management.

  • Q4eIDAS (see Table 5): containing questions about Regulation (EU) No 910/2014 of the European Parliament and of the Council of 23 July 2014 on electronic identification and trust services for electronic transactions in the internal market and repealing Directive 1999/93/EC,Footnote 12 also known as eIDAS Regulation. This legislation tackles several issues in electronic identification, electronic signature, electronic seals, and trust services. Its goal is to provide legal certainty for cross-border transactions in the EU Single Market.

Some statistics on the datasets mentioned above are shown in Table 1.

To build the Q4EU dataset and, in the first place, the Q4PIL dataset, the pieces of legislation (i.e., the norms) taken into account are conceived as self-contained legal environments. While legal interpretation is often grounded on external legal factors (e.g., jurisprudence, scholars’ opinions), we opted for a “black letter” approach to the law that only considers the legislative legal formant. Therefore, the point of view assumed in our analysis is the perspective of the lawmakers. This has a twofold implication for question-and-answer drafting.

On the one hand, questions have been modelled to be answered solely within the legal text under scrutiny. They do not refer to legal concepts, such as the hierarchy of legal sources or competence, that are not explicitly mentioned in the regulations. Moreover, not all the (legal) questions are the same. While some accept as an answer a provision that exactly matches the question, others rely on more complex interpretations (i.e., legal reasoning) to be answered. Therefore, questions have been classified depending on their context specificity, which can either be low, normal, or high.

First, questions whose answer falls precisely within the domain of the regulations and is provided in the “black letter” of the law were labelled as highly specific. An example of a question with high specificity is “In what court can an employee sue its employer?” because it perfectly falls within the scope and goals of Regulation Brussels I-bis and finds its exact answer in the provisions of Articles 21 and 23.

Questions whose answer falls within the scope of the regulations while requiring an abstraction of multiple legal provisions were labelled as normally specific. For instance, “What is the applicable rule to protect the weaker party of a contract?” was labelled as normally specific since the answer also relies on the concept of “weaker party” mentioned across two regulations (Recital 23 Rome I and Recital 18 Brussels I) concerning any contract (as a legal concept) rather than specific contractual types.

Finally, broad questions whose tentative answer is found through an articulate combination of articles and recitals were labelled as having low specificity. For instance, a question with low specificity is “Can the parties choose a different applicable law for different parts of the contract?”. While the Rome I Regulation provides for a discipline on the law applicable to contracts, it does not contain any provision concerning individual parts of a contract. For such a question, the answer is ultimately open to interpretation, although the Regulation suggests norms that could serve as a reference point.

Since such classification might be subjective and dependent on each jurist, three legal experts independently evaluated the level of context specificity and decided by the majority about the final level.

On the other hand, the answers to the questions provided by legal experts, which constitute the dataset used to observe the performance of deep language models, are obtained by mirroring the question-drafting methodology. Three legal experts, different from the question-drafters, provided answers to the legal questions by looking for the following:

  1. Specific, punctual, and explicit answers in the case of highly specific questions;

  2. General and conceptual, yet text-based, answers to normally specific questions; and

  3. Prima facie textual references to be used as interpretative points of reference in the case of questions with low specificity.

These experts only provided textual references in the legislation at the article or recital level (e.g., Rome I art. 8; B Rec. 18). When at least two experts agreed on a given answer, their response was considered valid without further enquiry. If an expert provided a diverging answer, a further expert validated this response. In drafting the validation answers, no other articles or recitals were considered except those provided by the original validators.

Table 1 Statistics on Q4EU: the column “Art./Rec.” counts the number of recitals and articles. The column “Questions” counts the number of different questions, and the column “Tokens per Art./Rec.” counts the mean number of tokens per article/recital, and so on. Please note that Q4EU is the sum of Q4PIL, Q4EAW, Q4GDPR and Q4eIDAS
Table 2 Q4PIL subset: here, “B” stands for Brussels I bis Regulation EU 1215/2012, “RI” for Rome I Regulation EC 593/2008 and “RII” for Rome II Regulation EC 864/2007
Table 3 Q4EAW subset: here, “W” stands for the CFD of 13 June 2002 on the European arrest warrant and the surrender procedures between Member States
Table 4 Q4GDPR subset: here, “G” stands for GDPR
Table 5 Q4eIDAS subset: here, “E” stands for the eIDAS Regulation

7 Results and error analysis

Considering that, with the Q4EU dataset, a single answer is not sufficientFootnote 13 to fully respond to a test query, we relied on top-k precision, F1, Normalised Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR) as evaluation metrics. In particular, the top-k precision, or P@k, is measured as the fraction of expected answers amongst the top-k retrieved instances. The top-k F1 score, or F1@k, is given by \(2\frac{R@k \cdot P@k}{R@k + P@k}\), where the top-k recall, or R@k, is measured as the fraction of correct answers retrieved in the top-k instances. In contrast, the top-k NDCG (Sakai 2007) is a measure of ranking quality normalised in [0, 1] that measures the usefulness, or gain, of an answer based on its position in the result list. Instead, the top-k MRR (Voorhees 1999) only considers the single highest-ranked relevant item: it indicates which system does the best job of placing a relevant document/passage at the highest rank.

It is important to note that the main difference between precision, F1, MRR and NDCG is that the last two are used to assess the ability of an answer retrieval system to rank correct answers first. Conversely, the other metrics measure the system’s precision and accuracy. For these reasons, all selected metrics are considered complementary measurements that offer different lenses on the problem of understanding answer retrieval systems (Dato et al. 2022).
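For clarity, the sketch below shows how these per-question metrics can be computed from a ranked list of retrieved answers and the set of expected answers; it is a simplified re-implementation of the standard definitions (with binary relevance for NDCG), not DiscoLQA's evaluation code.

```python
import math

def topk_metrics(ranked, expected, k=10):
    """P@k, R@k, F1@k, MRR and binary-relevance NDCG@k for one question.
    `ranked` is the ordered list of retrieved answers, `expected` the set of
    correct answers (e.g., article/recital identifiers)."""
    top = ranked[:k]
    hits = [1 if a in expected else 0 for a in top]
    p = sum(hits) / k
    r = sum(hits) / len(expected) if expected else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    # MRR: reciprocal rank of the first relevant answer (0 if none retrieved).
    mrr = next((1 / (i + 1) for i, h in enumerate(hits) if h), 0.0)
    # NDCG@k with binary gains, normalised by the ideal ranking.
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(expected), k)))
    ndcg = dcg / ideal if ideal > 0 else 0.0
    return {"P@k": p, "R@k": r, "F1@k": f1, "MRR": mrr, "NDCG@k": ndcg}

# Toy example: two of the three retrieved answers are expected ones.
print(topk_metrics(ranked=["RI art. 10", "B rec. 18", "RI art. 4"],
                   expected={"RI art. 10", "RI art. 4"}, k=3))
```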

Table 6 Q4EU—scores of universal sentence encoder
Table 7 Q4EU—scores of MiniLM
Table 8 Q4EU—scores of MPNet

In Tables 6, 7 and 8 we show the macroFootnote 14 top-k evaluation scores for \(k \in \{5, 10\}\),Footnote 15 studying how different types of information units and deep language models affect answer retrieval. In particular, we show two different evaluations in these tables. The first one is performed by running the answer retrieval algorithm on all 6 norms of Q4EU (we will refer to it as “all norms search”), even though the questions in Q4EU usually target only 1 or 2 norms. Instead, the second one (we will refer to it as “target norms search”) is performed by considering only the legal acts targeted by every question (e.g., Q4GDPR targets only the GDPR, Q4eIDAS only eIDAS), filtering out all the answers coming from unrelated norms.

As expected, all the scores obtained with a “target norms search” are higher than those obtained with an “all norms search”. Interestingly, the difference between the two evaluations clearly shows how much the incorrect selection of the target document weighs on DiscoLQA’s performance on Q4EU. Nonetheless, these results show that, regardless of the choice of k, using discourse relations (EDUs) as information units gives the best precision, especially in combination with clauses and AMRs.

Despite their differences, MPNet, MiniLM (the best) and the Universal Sentence Encoder behave very similarly, suggesting that the information units we considered may play a role independent from the underlying language model used for retrieval. DiscoLQA using only discourse relations and AMRs as information units (i.e., EDU+AMR) outperforms the baseline in terms of precision. This happens with all the language models considered, except MPNet. This fact suggests that EDUs and AMRs can retain most of the relevant information of the corpus of technical documents, supporting our hypothesis. Moreover, as shown in Table 9, the average length of EDUs and AMRs is smaller than that of normal clauses, further corroborating the hypothesis and suggesting that the deep language models considered can be distracted by longer clauses.

Table 9 Q4EU—average length of information units by type
Table 10 Statistical tests
Table 11 Statistical tests

In light of the similarities and differences observed across different algorithms and information units, a statistical test was essential to ascertain the significance of these findings. Since the data samples considered are not independent, we opted for the Wilcoxon signed-rank test (Woolson 2007), a non-parametric alternative to the paired t-test that is suitable for paired samples. Indeed, the same questions are tested across all algorithms (i.e., EDU, Clause, etc.).
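As an illustration, this paired comparison can be run with SciPy as in the sketch below, where the two arrays hold the per-question top-10 precision of two system variants (the numbers are made up for the example, not the actual Q4EU scores).

```python
from scipy.stats import wilcoxon

# Per-question P@10 of two paired system variants on the same questions
# (illustrative values only).
p10_clause_edu_amr = [0.4, 0.6, 0.3, 0.5, 0.7, 0.2, 0.6, 0.4]
p10_clause_only    = [0.3, 0.5, 0.3, 0.4, 0.5, 0.2, 0.4, 0.3]

# One-sided test: does the Clause+EDU+AMR variant score higher than Clause alone?
stat, p_value = wilcoxon(p10_clause_edu_amr, p10_clause_only,
                         alternative="greater")
print(f"W={stat:.2f}, p={p_value:.4f}")
```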

The results of the one-sided statistical tests on the top10 scores (“target norms search”) of the Universal Sentence Encoder and MiniLM are shownFootnote 16 respectively in Tables 10 and 11. Statistically significant improvements were generally seen in the precision and F1 scores when using a combination of EDUs, AMRs and clauses. MiniLM showed significant gains mainly in precision, whereas the Universal Sentence Encoder displayed more widespread improvements, particularly in F1 scores. Neither answer retriever exhibited statistically significant changes in NDCG and MRR metrics.

Overall, these findings support our hypothesis. They show that it is possible to improve a general-purpose language model, making it perform better with legal texts. This is possible by better capturing syntagmatic relationships and by using less noisy information units, i.e., by decomposing a generic clause into one or more discourse relations or AMRs.

Table 12 Q4EU—examples of correct answers

In other words, as expected, the information units representing the (generic) clauses carry enough noise to distract the answer retriever. By breaking the sentences into EDUs and explicitly keeping their relations, we can crystallise the discourse structure into the knowledge graph, making it invariant. Therefore the answer retriever is forced to “reason” over the discourse patterns, minimising the chances of relying on common-sense discourse schemes instead.

Examples of how EDUs and AMRs are important for some questions of the Q4EU dataset are shown in Table 12. In particular, a qualitative analysis of the algorithm’s responses shows that it can identify useful normative references to ensure the completeness of the answer and develop an overview. For example, among the answers to the question “Who decides precedence in the event of a conflict between a European arrest warrant and a request for extradition from a third country?” the algorithm identifies Article 16.3 (the most relevant answer) and suggests Recital 8, which helps interpret Article 16.3. Furthermore, for the same question, the algorithm also suggests Article 10.6, which, while not suitable for answering the question, leads the jurist to complementary points of reference for more holistic reasoning and interpretation.

Table 13 Q4EU—examples of wrong answers: this table shows a few examples of answers wrongly given by the baseline and DiscoLQA (EDU+AMR)

Both Tables 12 and 13 show errors committed by the answer retrievers and the extractor of information units. These examples clearly reveal at least two different types of errors. The first type occurs when an extracted information unit is semantically or grammatically incorrect, as in the first and fourth rows of Table 12. This type of error is relatively minor since, in some cases, the underlying language model is resistant to inaccuraciesFootnote 17, still allowing a correct answer to be retrieved, as shown in Table 12.

In particular, this first type of error is usually caused by the automatic extraction of AMRs and EDUs by a neural network, as described in Sect. 4.1. For this reason, it is possible to see in both Tables 12 and 13 examples of information units that do not perfectly overlap with the text of a response. On the other hand, the second type of error is due to mistakes in the deep language model for answer retrieval. As shown in Table 13, this type of error can be rather severe, causing wrong answers to be selected by the retriever.

Table 14 Q4EU—P@10 by context specificity
Table 15 Q4EU—percentage of answers more/less precise than the baseline

As in the evaluation carried out by Sovrano et al. (2021), we studied how (top-10) precision scores vary when the context specificity changes. The results partly confirm our expectations: we can see a trend where mean top-10 precision increases with the context specificity. This is clear in all instances of DiscoLQA, except AMR. In particular, as shown in Table 14, AMRs only contribute to better answering questions with low and normal specificity. Furthermore, we also show in Table 15 the percentage of queries for which DiscoLQA made a positive/negative difference with respect to the baseline in terms of top-10 precision, grouped by specificity.

Our expectations were based on the fact that:

  • The specificity of a question is low when it asks something that cannot be explicitly found in the Regulations but requires a holistic analysis of principles, competence rules, and so forth;

  • Questions with low specificity usually tend to have more expected answers, and it may be harder to find all of them;

  • Multi-hop reasoning is usually required to answer questions with low specificity, but the considered answer retrievers are not equipped for that kind of reasoning (yet).

For example, the question “How should a contract be interpreted according to Regulation Rome I?” has a very low specificity. It requires pinpointing both recitals and articles, and therefore several distinct and distant paragraphs, for a proper answer. Most questions regarding hermeneutics would probably require a similarly broad view of the subject, having low specificity with respect to the Regulation and therefore requiring multi-hop reasoning.

8 Discussion and conclusion

With this paper, we empirically investigated the role of discourse structure in legalese, trying to understand its importance in encoding the meaning of legal documents. Ours is a first attempt to exploit more sophisticated linguistic theories such as PDTB. To this end, we devised a simple experiment on legal question answering, designed to shed more light on whether Elementary Discourse Units (EDUs) and Abstract Meaning Representations (AMRs) are the fundamental information units in legislative texts as well.

As a result of these experiments, we found that EDUs and AMRs seem to be useful for better capturing long-distance relations between information units, as shown in Table 15. This leads to an overall improvement of our DiscoLQA over the baseline in terms of precision, F1, NDCG and MRR. In particular, EDU+AMR (the version of DiscoLQA using AMRs and EDUs) was able to produce \(23.61\%\) more precise top-10 answers than the baseline, using MiniLM with a “target norms search”. This percentage rises to \(25\%\) and \(27.27\%\) when considering only questions with normal and low specificity, respectively.

The goal of our experiments was also practical, not just theoretical. Understanding how legalese differs from its ordinary natural language counterpart can help us address the problem of data scarcity in legalese processing/understanding by allowing us to exploit general-purpose language models not specifically trained on legal documents. However, such generic language models are not the only option available. Indeed, in the literature, it is possible to find several examples of training data for legal domains, or at least training data that can be exploited via transfer learning paradigms. Nonetheless, transfer learning is challenging, and different legal domains or documents may deploy different discourse structures, requiring different language models. For example, privacy policies can be considered legal documents, though their language is usually closer to plain English than legalese, to help consumers understand the policy. In other words, transfer learning can be an alternative solution to zero-shot question answering. However, neither of the two approaches can be considered a one-size-fits-all solution for all possible problems.

We tested and evaluated DiscoLQA on specific European norms and a relatively small dataset without comparing our results with deep language models pre-trained on legal corpora, as explained at the end of Sect. 5. Nonetheless, even though Q4EU covers different legal sub-domains (respectively: Private International Law, the European arrest warrant, data protection and electronic signatures), our instances of DiscoLQA were able to generalise well across them, outperforming the baseline in all the cases. Notably, this result occurred even though we built DiscoLQA to perform zero-shot question answering without any training procedure involving European legislation or (more generally) legal documents. Therefore, DiscoLQA can potentially be used in various domains where data scarcity is unavoidable. To apply DiscoLQA, it is not necessary to manually create a new dataset such as Q4EU, a time-consuming effort.

Another point worth discussing is the scalability of DiscoLQA. Indeed, DiscoLQA introduces some extra overhead over the baseline, but this overhead affects neither the asymptotic time complexity of answer retrieval nor that of pre-processing. More precisely, the time complexity of pre-processing changes only by a constant factor. This is because EDUs and AMRs are extracted in polynomial time from paragraphs (and not documents) by a pre-trained deep neural network that does not need to be retrained in order to work. Furthermore, the time complexity of retrieval can only increase by a constant factor, i.e., when EDUs and AMRs are combined with normal clauses. This is because the number and size of EDUs and AMRs normally never exceed those of clauses. When only EDUs or AMRs are considered instead of clauses, the number of information units to be searched is even smaller, reducing retrieval time.

In most of today’s deep learning applications, the test and training sets are much larger than those used in these experiments. For example, the MS Marco (Nguyen et al. 2016) collection (partly also used for training MiniLM and MPNet) consists of over 1 million questions whose answers are extracted from 3.5 million web documents. These large datasets only make sense for training and evaluating generic language models on tasks that do not suffer from data scarcity. In these cases, due to bandwidth and scalability issues, a pre-processing strategy such as that employed by DiscoLQA and the baseline could introduce a significant memory overhead into the information retrieval system. Instead, due to the small size of the Q4EU dataset (less than 300 items per sub-collection), we can easily implement an extractor of knowledge graphs (and other relationship identifiers).

On the one hand, working with less data poses several technical challenges that sometimes require paradigm shifts. On the other hand, it can also open the way for several technological solutions previously considered impractical. In this article, we have shown only a few examples of how deep learning strategies can be rethought to adapt to smaller data and problems. We have only seen the tip of an iceberg that may be uncovered by emerging ideas from joint efforts in the field of AI and law. For instance, as future work, we point to the possibility of specialising the algorithm for extracting EDUs and AMRs to legislative texts, taking into account what we already know about legal connectives and discourse structures.