This section describes the 59 papers and their contributions to one or more RQs. When a paper included in the SLR is relevant to several RQs, it is discussed in the subsection for the RQ that is most relevant to its content.
RQ0: What are the Main Approaches for Translating Legal Documents into Formal Specifications?
Few works address the whole process of translating legal documents into formal specifications. Among them, Sharifi et al. propose Symboleo [5], a specification language for legal contracts, and show through examples how to manually translate legal contracts into formal specifications. Contracts are represented as collections of obligations and powers that involve roles (their debtors and beneficiaries) and assets that change state, usually ownership. The validity of a formal specification can be checked, as demonstrated in subsequent work on Symboleo that presents the validation of two standard business contracts with a compliance checker [80]. Symboleo represents the most complete, detailed approach we identified in the SLR for generating formal contract specifications. The proposed approach can help in understanding the translation process conceptually, along with its main difficulties. However, little detail is provided about the translation process, and the process is entirely manual.
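To make this structure concrete, the following minimal sketch, written in Python rather than Symboleo's own syntax and with all names hypothetical, illustrates the core idea in [5] of a contract as a collection of obligations and powers tied to roles and assets:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Role:
    name: str                 # e.g., "Buyer" or "Seller"

@dataclass
class Asset:
    name: str
    owner: Role               # assets change state, typically ownership

@dataclass
class Obligation:
    debtor: Role              # the party bound to perform
    creditor: Role            # the party who benefits
    consequent: str           # the state of affairs to be brought about

@dataclass
class Power:
    holder: Role              # the party entitled to exercise the power
    effect: str               # e.g., suspend or terminate an obligation

@dataclass
class Contract:
    parties: List[Role]
    obligations: List[Obligation] = field(default_factory=list)
    powers: List[Power] = field(default_factory=list)

# A toy sale contract: the seller must deliver; the buyer may terminate
# if delivery is late.
buyer, seller = Role("Buyer"), Role("Seller")
goods = Asset("goods", owner=seller)
sale = Contract(
    parties=[buyer, seller],
    obligations=[Obligation(seller, buyer, "goods delivered by the due date")],
    powers=[Power(buyer, "terminate the contract if delivery is late")],
)
```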
Similarly, Grosof proposes SweetDeal [49], a rule-based technique for representing business contracts that supports automation along their lifecycle. Ontologies are represented in DAML+OIL, a predecessor of OWL (Web Ontology Language) that was frequently used in the Semantic Web. A complete explanatory example of how to generate specifications is provided and explained. Similarly to [5], it could serve as a reference for understanding how NL is translated into a formal specification. However, the process is mostly manual, and it does not address the identification and annotation of legal concepts and relationships.
Hashmi proposes a manual methodology to extract legal requirements from text and formalize them in the Event Calculus [52]. The proposal relies on IF–THEN rules and includes process aspects together with rule types (e.g., determinative or prescriptive). The assumption is that extracting the abstract structure of a legal document facilitates tracking the implications of business processes, tracing requirements, and checking for compliance, all issues frequently ignored in legal document analysis. The proposal relies on Ciceronian rhetorical loci, focusing on the who, why, what, when, and where of business processes identified from real-life cases. IF–THEN rules make translation easier than other formalisms do, but they are not expressive enough to capture all the nuances of legal concepts. Unlike other approaches presented in the SLR, it forces the reader to formalize the specification with a logic that is more easily translated into a programming language.
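As a flavor of the target formalism, with event, fluent, and predicate names that are illustrative rather than taken from [52], an IF–THEN rule such as "IF an invoice is issued THEN the buyer is obliged to pay" can be rendered as an Event Calculus domain axiom:

$$\mathit{Initiates}(\mathit{issue\_invoice}(s,b),\ \mathit{obliged}(b,\mathit{pay}(s)),\ t)$$

Combined with the standard Event Calculus axiom

$$\mathit{HoldsAt}(f,t_2) \leftarrow \mathit{Happens}(e,t_1) \wedge \mathit{Initiates}(e,f,t_1) \wedge t_1 < t_2 \wedge \neg\mathit{Clipped}(t_1,f,t_2)$$

the obligation holds from the issuing of the invoice until some later event (e.g., payment) terminates it.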
He et al. [53] propose a different approach, where the level of formality is decreased to enhance understandability for a non-technical audience (e.g., lawyers, business analysts). The proposal is based on SPESC, a specification language for smart contracts developed with collaborative design in mind. Here, specifications are manually derived from text and expressed in an NL-like language defined by an extended BNF grammar. SPESC specifications are more abstract than smart contract code and have a general structure consisting of parties, contract properties, terms, and data type definitions. Similarly to [52], the proposed approach can support a developer in deciding how to implement the formal specification, without being limited to the how. A recent work [39] presents a process for generating a smart contract. A multi-tier ontology is proposed to support translation into a domain-specific representation using the Smart Legal Contract Markup Language (SLCML).
Breaux et al. [30] present another intermediary approach that focuses on traceability to express legal requirements semi-formally. This is accomplished through a computational requirements document expressed in a specification language for legal requirements (LRSL). Similarly to [53], the objective is to provide legal requirements that are freely available to policymakers, business analysts, and software developers. An automated parsing tool checks LRSL requirements for syntactic and semantic errors. The parser applies deontic annotations based on a set of heuristics and creates a model to identify mandatory requirements. Specifications can be exported to different formats, including HTML, GraphML, and XML, to allow for different types of analysis. The approach could complement other proposed approaches to guarantee the completeness of the formal specification. Finally, NómosT has been implemented with the objective of building models of law semi-automatically: first, the text of a law is annotated, relying on GaiusT, a semantic annotation tool [4]; then a model is generated [99]. NómosT uses Propositional Logic, a semi-formal language that lacks quantifiers, modal operators, and other features, which significantly limits the applicability of the approach to legal contracts. The approach supports the initial steps of the translation process well, but it is less helpful for the whole process compared to other approaches, such as [5] or [52].
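To illustrate the expressiveness gap mentioned above with an example of our own (not drawn from [99]): in Propositional Logic, a clause such as "the Seller shall deliver every ordered item before the due date" collapses into a single atom, e.g., $p_{\mathit{deliver}}$, whereas capturing its normative content requires quantification and a deontic operator, e.g.,

$$\forall x\, (\mathit{ordered}(x) \rightarrow O\,\mathit{deliver}(\mathit{Seller}, x, \mathit{dueDate}))$$

where $O$ is the obligation operator.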
RQ1: What Legal Ontologies Have Been Used for the Translation?
The use of legal ontologies for modeling legal documents dates back to the 1990s; see, for example, [23]. Two main foci have been identified, sometimes pursued together: (a) the application of an existing ontology to a legal text for purposes of modeling and analysis, and (b) the derivation of an ontology from legal texts. Concerning the first, Gangemi [43] proposed an influential Core Legal Ontology (CLO) to support information systems dealing with legal matters. The ontology is an extension of DOLCE+, which is in turn an extension of DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering), a foundational ontology used with the JurWordNet lexicon. CLO has been applied to compare and verify compliance of norms and to support text mining. The work in [43] has significant pedagogical content that supports understanding how an ontology is built.
An ontology that has seen significant legal applications is the Unified Foundational Ontology (UFO), extended to develop UFO-L [6], which focuses on rights, duties, no-rights, and permissions. This work emphasizes the importance of basing legal ontologies on legal theories and foundational ontologies. It represents a conceptual basis for how to build an ontology, although it does not detail practical implementation requirements. UFO-L has been used for modeling contracts in [48], where it was explored to bridge the gap between two different types of approaches for contract representation. Some approaches, such as ArchiMate, offer an opaque representation (e.g., not revealing rights and obligations), whereas others are devoted to formal representation. A service contract ontology is presented together with an extension of ArchiMate that reflects the proposed contract ontology. The details provided in the case study offer good support for developing case studies in other domains.
Other ontologies have been created specifically for legal contracts. Among the first, Kabilan [54] proposed an ontology to efficiently link business process and contract management, improve business practices, and create alignment with the expectations of contracting parties. The proposed ontology is represented in UML and DAML and consists of three levels: upper, domain, and template. The template level is intended to support the modeling of specific types of contracts, such as rental or sales contracts.
Other approaches focus on providing tools and frameworks for building a legal ontology. Among them, Corcho et al. [37] propose a framework for building a legal ontology based on the METHONTOLOGY methodology and WebODE, a workbench for ontology engineering used in different domains. METHONTOLOGY is rooted in software and knowledge engineering methodologies. It supports legal professionals in building ontologies, by adapting a class taxonomy for the legal field, without significant involvement of knowledge engineers. The modeling process starts with the construction of a glossary representing concepts and relationships. The proposal has been applied to the development of several legal ontologies in Spain. Similarly to [48, 54], it provides a detailed case study that supports understanding of the process.
Yan et al. [98] underline the need for semantic information to enable the automatic execution of a contractual agreement. The authors use OWL to formalize concepts and relationships and the Protégé-2000 tool for the implementation. Another approach aims at facilitating the management and representation of legal documents in XML [28]. A Legal Knowledge Interchange Format (LKIF) is provided as a reusable and extensible core ontology that also serves as an interchange format for computer implementation and contract management. The ontology is based on a Description Logic. Despite the significant level of formality provided by the ontology and the MetaLex XML standard, LKIF, similarly to [98], appears better suited to ontological analysis than to serving as a reusable ontology for contracts.
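As a hint of what such a formalization can look like in practice, the following minimal sketch defines contract concepts and relationships in OWL using the owlready2 Python library; the class and property names are hypothetical, not taken from [98] or [28]:

```python
# pip install owlready2
from owlready2 import Thing, ObjectProperty, get_ontology

onto = get_ontology("http://example.org/contract.owl")  # hypothetical IRI

with onto:
    class Contract(Thing): pass
    class Party(Thing): pass
    class Obligation(Thing): pass

    class hasParty(ObjectProperty):        # Contract --hasParty--> Party
        domain = [Contract]
        range = [Party]

    class hasObligation(ObjectProperty):   # Contract --hasObligation--> Obligation
        domain = [Contract]
        range = [Obligation]

# Instantiate and serialize to RDF/XML, e.g., for inspection in Protégé.
sale = Contract("sale_001")
sale.hasParty.append(Party("acme_corp"))
onto.save(file="contract.owl", format="rdfxml")
```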
RQ2: What Annotation Approaches are Used for Semantic Annotation of Legal Text?
Semantic annotation of legal text represents the most recurring objective identified in the SLR relative to the problem at hand. Most proposals rely mainly on NLP and Machine Learning (ML) techniques. The problem of semantically annotating unrestricted NL text presents significant technical difficulties. For specific domains such as the legal one, where specialized and less ambiguous language is used, the semantic annotation problem is more manageable, with encouraging results; see, for example, [65] below.
An approach adopted in several works relies on grammar rules to annotate legal text. For example, Kiyavitskaya et al. [9] propose Cerno, a tool for semi-automatically generating annotations from regulations using a domain ontology and patterns of lexical indicators for each concept of the ontology. An experimental evaluation showed that Cerno slightly increased the quality of annotation while substantially decreasing annotation times for human annotators. In follow-up work [56], text is annotated to identify legal concepts (such as actors, rights, and obligations), and then a semantic model is constructed from the annotation and transformed into a set of functional and non-functional requirements. The first steps of the process, concerning semantic annotation, rely on heuristics and a frame-based model to identify deontic terms that can be rewritten using a controlled NL. Along similar lines, Soavi et al. [2] build ContracT, a specialization of Cerno [9] and GaiusT [4], to support human annotators in the semantic and structural annotation of legal contract text. The tool is based on an ontology for contracts derived from UFO-L, and it has been shown to improve the annotation process for concepts such as parties, assets, and temporal conditions, whereas difficulties were encountered in annotating powers and obligations. The approaches in [2, 4, 9, 56] are generally based on the definition of a grammar for semantic annotation that has to be redefined for different domains.
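The flavor of such lexical-indicator patterns can be conveyed by a toy sketch; the patterns below are purely illustrative, whereas Cerno and GaiusT rely on much richer, domain-tuned grammars:

```python
import re

# Illustrative mapping from lexical indicators to ontology concepts.
INDICATORS = {
    "Prohibition": re.compile(r"\b(shall|must)\s+not\b", re.IGNORECASE),
    "Obligation": re.compile(r"\b(shall|must)\b(?!\s+not)", re.IGNORECASE),
    "Permission": re.compile(r"\b(may|is entitled to)\b", re.IGNORECASE),
}

def annotate(sentence: str) -> list:
    """Return the ontology concepts whose indicators occur in the sentence."""
    return [concept for concept, pattern in INDICATORS.items()
            if pattern.search(sentence)]

print(annotate("The Seller shall deliver the goods within 30 days."))
# -> ['Obligation']
print(annotate("The Buyer shall not resell the goods."))
# -> ['Prohibition']
```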
OWL is used in [18] to represent linguistic information; the approach relies on a parser of structural information that represents relationships between different chunks of text. Extraction rules are formalized in a Description Logic and capture syntactic and semantic information. The significant amount of implicit knowledge involved entails lower practical usability. Lesmo et al. [65] annotate legal provisions (rights or obligations) to understand the implications arising from the amendment of laws. NLP techniques are used to generate a set of metadata that compactly describes the modifications. The process helps to identify sections of the provisions that have been modified; subsequently, syntactic analysis and semantic annotation are performed. The relationships among provisions are determined using categories based on the words identified in the provisions (e.g., synonyms for deletion or replacement of a provision). The proposal has been evaluated on several laws, with positive results for integration and substitution amendments, whereas deletions require further study. Similarly, IF–THEN rules are exploited by Mazzei et al. [70] to identify the semantic content of sentences in laws that imply a modification of an existing provision. The approach pairs deep syntactic parsing with rule-based shallow semantic analysis. The process is enhanced with the annotation of metadata and identifies candidate locations of modificatory provisions. The approaches in [65, 70] are useful for understanding the implications of modifying legal text.
Rule-based approaches perform better when a constrained language limited to a specific domain is used, e.g., for sales contracts. Quaresma [81] proposes a mixed approach, based on linguistic information (most notably morphological and syntactic) and ML, to extract information from legal texts. Top-level concepts, such as organizations and dates, are identified using a Support Vector Machine (SVM) classifier, whereas an NL parser is used for entity recognition. The approach, applicable to a very specific task in the annotation process, has been applied to different languages. Results are encouraging for concept classification and the identification of dates, mixed for locations, and poor for the identification of organizations and cross-references. Moreover, results differ depending on the NL of the text. Among the most appreciated works, and similarly relying on SVM, Biagioli et al. [25] classify provisions and extract arguments (sets of reasons supporting a certain point of view) with SVM classification and NLP techniques, with promising results. Neill et al. [75] test the use of probabilistic tools to extract deontic modalities from legal text. To avoid ambiguity, logic is commonly used; however, logical rules do not allow the level of expressivity required in many domains, such as financial regulations. Therefore, the authors test a data-driven approach to classify deontic modalities, relying on Artificial Neural Networks (ANN) as well as non-neural-network methods, which are briefly reviewed and tested. The approach shows encouraging results, particularly for the pre-trained ANN. Similarly to [81], the approaches proposed in [25, 75] apply to very specific tasks in the annotation process. A comparison of different ML approaches to information extraction is performed by Sainani et al. [87], who extract requirements from large software engineering contracts. The aim is to automate the extraction and classification of such requirements to improve contract management for companies. The authors compare different ML approaches (such as SVM, Random Forest, and Naïve Bayes) to their approach based on Bidirectional Encoder Representations from Transformers (BERT), which reaches an F-score higher than 80%. The proposal is useful for understanding the different potential uses of ML for information extraction. Chalkidis [35] explores how deep learning can support semantic extraction to identify contract concepts and structural elements. In the experiment, a Bidirectional Long Short-Term Memory (BiLSTM) network with a logistic regression layer, operating on word, POS-tag, and token embeddings, outperforms linear sliding-window classifiers, without the need for manually written rules. The approach is tested on a set of contracts with promising results, and the authors suggest it could be improved with further stacked layers. The approach is supported by the availability of 3500 annotated English contracts released by the authors.
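As a minimal illustration of the SVM-based classification used in several of these works, with a toy training set standing in for the corpora of [25, 75, 87], a scikit-learn pipeline can classify clauses by deontic modality:

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative training set; the cited works use far larger corpora.
sentences = [
    "The Supplier shall deliver monthly status reports.",
    "The Client may terminate this agreement with 30 days notice.",
    "The Contractor must not disclose confidential information.",
    "Either party may request a review of the service levels.",
]
labels = ["obligation", "permission", "prohibition", "permission"]

# TF-IDF features over unigrams and bigrams, fed to a linear SVM.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
classifier.fit(sentences, labels)

print(classifier.predict(["The vendor shall maintain insurance coverage."]))
```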
The challenge of understanding the semantics of legal text with ML approaches, the black-box problem, has been addressed with the Legal Unstructured Information Management Architecture (LUIMA) [47]. LUIMA is a law-specific extraction tool for automatic annotation that uses ML for sentence annotation and reranking, together with basic retrieval relying on Apache Lucene. The system is based on UIMA, a framework used in different contexts (e.g., IBM Watson, a question-answering computer system), to show that preprocessing to identify semantics can outperform information retrieval processes that do not account for semantics. LUIMA is the most complete tool for semantic annotation using ML identified in the SLR.
Finally, a few open-source tools with a collaborative, holistic view and detailed documentation have been proposed to manage legal documents using XML standards. As such tools do not perform semantic annotation, they should mostly be considered support for contract management after the annotation process has already been performed. Akoma Ntoso [22] supports the annotation process at three different layers: NL text, structure, and metadata. Similarly, LegalRuleML is used to verify the compliance of business processes with legal norms through semantic analysis [46]. As in Akoma Ntoso, a legal document is represented in three different layers: metadata, statements, and context. Metadata refers to information about the document, such as legal sources and temporal properties; statements are formal representations of the norms; context refers to the relationships in the document, including metadata and statements.
RQ3: What are the Main Approaches for Mining Relationships from Annotated Text?
The mining of relationships in text is accomplished through syntactic and contextual analysis. For example, the identification of the debtor relationship for obligation O1 in Table 3 is determined by looking for the subject of the verb 'shall deliver', i.e., the Buyer, while the identification of the creditor for O3 in the same table is determined by noting that two roles were identified in the contract and that Seller has already been assigned the role of debtor for O3, so it cannot also be the creditor.
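A sketch of the syntactic step, assuming the spaCy library and its standard English model (the sentence and the mapping from grammatical subject to debtor are illustrative):

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Buyer shall deliver the payment to the Seller.")

for token in doc:
    # The debtor of the obligation is the grammatical subject of the
    # deontic verb phrase ("shall deliver").
    if token.dep_ == "nsubj" and token.head.lemma_ == "deliver":
        print("debtor:", token.text)   # -> debtor: Buyer
```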
Comparably to approaches discussed for other RQs, templates have been adopted as an intermediate step to identify relationships and ease the generation of requirements. Sleimi et al. [90] emphasize that legal texts often omit relationships included in the annotation ontology, as with O3 discussed above. Templates are intended to fill three main gaps: statements with no counterpart, statements with a correlative, and statements with an implied statement. Different approaches for legal requirement templates are reviewed, and NLP-based rules for the templates are defined. Such rules improve the performance of relationship-mining algorithms. Similarly, Lee et al. [64] rely on templates to identify relationships and present a technique to extract, model, and analyze security requirements written in NL. Their analysis is based on a Problem Domain Ontology (PDO) and is applied manually with a checklist. Subsequently, the PDO is applied to a template to extract relationships among requirements and to increase the understandability of information available in different documents. The approach, tailored to security requirements but potentially useful for any legal document, is evaluated for adequacy, although it requires a time-consuming process. The approaches in [90] and [64] could be used complementarily to increase the quality of the identified requirements.
Other authors suggest identifying relationships to improve retrievability. Sleimi et al. [89] mark up text to generate semantic annotations and build Resource Description Framework (RDF) triples representing a conceptual model, which are queried with SPARQL. The toolchain, a set of complementary software components, is experimentally tested for recall and precision by requirements analysts in an industrial environment. The work suggests that the creation of a conceptual model of legal metadata could ease access to legal content. To manage contract content, Lau et al. [63] focus on retrievability, consolidating regulations into an XML format using a shallow parser; the authors rely on text-mining tools and manually defined rules to extract elements. The approach compares sections of text to identify relatedness by analyzing matching terms, features, and structure matches, relying on domain knowledge and legal corpus knowledge. A few limitations are highlighted, such as mismatches for phrases used in different contexts or with different terminologies. In [45], layout rules are applied to improve retrievability for structural elements using XML markup obtained with NLP tools, JAPE (Java Annotation Patterns Engine), and GATE (General Architecture for Text Engineering), the latter providing an open-source set of reusable algorithms and GUIs for NLP. JAPE and GATE are integrated into a tool named CLIEL (Commercial Law Information Extraction based on Layout). The system is tested on 97 commercial laws with different approaches (Layout Insensitive, Majority Sense Baseline, and the proposed Layout Sensitive strategy); the best results are obtained with the last. Approaches based on retrievability generally support the identification of relationships, although such relationships still need to be inferred.
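A minimal sketch of the RDF-plus-SPARQL idea behind [89], using Python's rdflib and a hypothetical vocabulary (the triples mirror the debtor/creditor example of Table 3):

```python
# pip install rdflib
from rdflib import Graph, Namespace, RDF

LEG = Namespace("http://example.org/legal#")  # hypothetical vocabulary
g = Graph()

# Annotations exported as triples: obligation O1 with debtor and creditor.
g.add((LEG.O1, RDF.type, LEG.Obligation))
g.add((LEG.O1, LEG.debtor, LEG.Buyer))
g.add((LEG.O1, LEG.creditor, LEG.Seller))

# SPARQL query: retrieve the debtor of every obligation.
query = """
PREFIX leg: <http://example.org/legal#>
SELECT ?obligation ?debtor WHERE {
    ?obligation a leg:Obligation ;
                leg:debtor ?debtor .
}
"""
for row in g.query(query):
    print(row.obligation, row.debtor)
```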
The identification of causal relationships in requirements text is considered by Fischbach et al. [40]. A tool-supported approach named CiRA (Causality detection in Requirement Artifacts) is tested with regular expressions, ML, and Deep Learning (DL) approaches, the last using BERT, which obtains the best results. The approach is useful for identifying cue phrases for causes, but labels for causality may also refer to deduction; causality can be inverted; and several causality relationships may coexist. These ambiguities can lead to significant differences in identifying relationships, and further ambiguities may need to be managed.
Relationships between chunks of legal text have been explored to identify arguments in sections. Notably, Moens et al. [71] identify arguments using n-grams, adverbs, modal auxiliaries, word couples, text statistics, punctuation, and keywords to annotate legal text and subsequently identify and classify arguments. The best results are obtained using a multinomial naïve Bayes classifier and a previously trained maximum entropy model, where training has a significant impact on performance. In a similar approach, the SUM project aims at identifying the relationships between parts of legal texts to automatically summarize legal documents [50]. The approach relies on NLP techniques together with a combination of rule-based and statistical methods to identify the most relevant parts of the text to be summarized. The classifier supports structural analysis to identify parts of the text as candidates for the summary. The summarization capability is tested for adequacy after different linguistic tools and statistical measures of relatedness are described. The approaches of [71] and [50] are helpful for identifying relationships among different sections of legal contracts, but they do not deal with the identification of relationships among the concepts identified through semantic annotation (i.e., ontology concepts).
RQ4: What are the Main Techniques for Formalizing Natural Language Terms into a Domain Model?
The formalization of NL terms into a domain model is frequently considered a task in the process of generating a domain ontology from legal text. Among the papers that address this problem, Saias [86] relies on NLP techniques to extract such models, defined in OWL, using syntactic, semantic, and pragmatic analyses (where information is inferred with an abductive inference mechanism) and first-order logic, leading to the identification of concepts and relationships. Relationships are identified with unsupervised ML techniques that aim at learning subcategories for heads (e.g., republic in the republic of Ireland) and modifiers (e.g., president of the republic). The methodology is based on a parser that uses a Constraint Grammar formalism, transformed into XML markup and Prolog terms. However, the approach does not define what kinds of relationships exist among concepts. Another relevant approach relies on statistical measures of similarity and relatedness [60]. After all the terms are extracted from a sample of French legislation, they are divided into syntactic categories and analyzed to support the identification of semantic relationships (e.g., book, chapter, general provisions). Given its reliance on statistical measures, the approach may work better for large legal corpora. The approaches presented in [86] and [60] are well documented and may support the understanding of most of the processes required to create a domain model.
Amardeilh et al. [16] present a method for semi-automatically building a domain model from a contract by populating an existing ontology with a knowledge management tool. A conceptual tree is derived from the text to map the extracted information to a concept of the domain ontology. Subsequently, knowledge acquisition rules are extracted to perform the mapping between linguistic annotations and ontological concepts. The rules are tested on 36 reports, and an average of 3 acquisition rules per concept is identified. The method is tested for precision and recall in the identification of topics, attributes, associations, and roles; the results highlight greater difficulties in identifying attributes and roles. In contrast, Amato et al. [17] rely on a simplified NL, based on laws that are codified into pre-defined structures, to propose a process that transforms a legal document into an ontology based on RDF. The NLP system translates a legal document into tuples for a relational database by relying on different ontological and linguistic knowledge levels. The process supports the identification of structural, lexical, and domain ontology elements. It is intended for the management of notary documents and has been experimentally tested on a collection of around 100 legal documents with encouraging results. The use of a simplified NL may imply a decrease in semantic richness in the translation of a legal document into a domain model.
Other approaches focus on building domain models that can potentially be reused across different languages. Notably, Francesconi et al. [42] suggest an approach for knowledge acquisition and ontology modeling based on an existing ontology that is refined for a specific legal document with NLP techniques. They focus on the relations between the lexical and ontological layers to define multilingual ontological requirements. The challenge of adapting ontologies or domain models to different languages is also considered by [94] in TextToOnto, using extensible language-processing frameworks such as GATE. This open-source tool has been created to support legal experts in identifying legal ontologies from text. Despite the difficulties involved in translation across languages, such efforts may support interoperability and the adoption of ontologies and domain models.
RQ5: What Kinds of Techniques Have Been Studied for Translating NL Expressions into Formal Ones for Legal Documents?
The translation of legal documents into formal expressions has received less attention than the other RQs. Research on this RQ dates as far back as 1993 [57], relying on Sowa's Conceptual Graph formalism for knowledge representation. The use of a formal logic expressive enough to represent legal contracts is a recurring challenge. Among the first attempts to translate NL expressions into a formal specification for a contract, Governatori [7] proposes the Business Contract Language (BCL), based on Propositional Deontic Logic. This work describes the process of deriving a formal system from contract provisions that accounts for the identification of ambiguities in a contract, determines the existence of missing or implied statements, and analyzes the expected behaviours of the parties and the relationships between parts of the contract (e.g., clauses). In the formal system, a contract is represented as a set of deontic terms for obligations, prohibitions, and permissions. Other approaches focus on the pre-treatment of text. Montazeri [72] supports the automatic translation of a contract in NL into a formal language using the Grammatical Framework (GF). The contract is manually rewritten in structured English that can be automatically translated into a formal language. GF has been implemented to define and manipulate grammars and to understand the implications of translating a contract into different languages. The formal language is based on deontic, dynamic, and temporal logic. Despite the significant manual effort required, the works in [7] and [72] offer a well-explained framework of reference for the generation of formal expressions. Libal [67] introduces a logical structuring tool, based on deontic logic, called Normative Detachment Structure with Ideal Conditions. The logical structure is extracted from a manually normalized text and encompasses ideal normative statements, normative conditionals, and existing relationships. The work suggests that the logical representation of contrary-to-duty obligations is consistent and reflects the logical independence of the components of the text while avoiding complexity. The ability to derive actual and ideal obligations has been only preliminarily tested and requires further exploration.
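For reference, in the Standard Deontic Logic family underlying such languages, prohibition and permission are interdefinable with obligation:

$$F(p) \equiv O(\neg p), \qquad P(p) \equiv \neg O(\neg p)$$

so that a clause such as "the Supplier shall not subcontract" (an example of our own, with subcontract as a propositional atom) becomes $O(\neg\,\mathit{subcontract})$.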
Fornara [41] relies on a domain-independent ontology that can be used to generate specifications for open interaction systems and accounts for social commitments, temporal propositions, events, agents, roles, and norms. The proposal is to monitor the variation of such commitments over time using the Event Calculus. Different axioms for the temporal propositions are presented, together with an explanatory example for a contract. Unlike the other works presented in this section, this work focuses on the implications of temporal evolution regarding starting points and deadlines.
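A hedged sketch of the deadline aspect, with an axiom and names that are illustrative rather than taken from [41], is that a commitment still pending when its deadline passes becomes violated:

$$\mathit{HoldsAt}(\mathit{violated}(c), t) \leftarrow \mathit{HoldsAt}(\mathit{pending}(c), t) \wedge \mathit{deadline}(c) < t$$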
A structured approach to managing legal documents using RuleML has been proposed to facilitate the sharing of legal information between legal documents, business processes, and software [79]. RuleML has been extended with new modules to represent and model legal phrases, including metadata. The approach is based on a Defeasible Logic and has been implemented, although it does not support the level of formality found in other approaches. The use of RuleML is an advantage of this work, as it relies on a well-tested and well-documented tool.
Formal representations have been derived from requirements extracted from legal text in [82] using a goal-oriented approach. The authors propose a method to model formal requirements based on Formal Legal_GRL (FLG) using the Goal-oriented Requirements Language (GRL). A logic-based approach is used to deal with modalities and conditionals in legal text. Legal requirements are extracted and annotated using deontic logic for obligations and permissions. Similarly, for legal requirements, Boella et al. [26] propose a logical framework for formal representation and modelling based on an extension of a Defeasible Logic to model extensive and restrictive interpretations. The running example suggests that amendments to a law may result in changes to the adopted ontology. Finally, Maxwell [69] proposes a methodology that extracts production rules from legal texts to generate a raw translation and then refactors the rules to enhance understandability. The approaches of [82] and [69] focus on deriving formal requirements from legal text and are accordingly more general than the subject matter of this review.
Table 10 summarizes, for each paper analyzed, the methodologies, tools, and resources adopted from the literature or proposed. The papers are ordered according to their appearance in this section and by the main RQ they address.
Table 10 Summary of methods, tools, and resources used and proposed by papers relevant to the SLR