1 Introduction

Named Entity Recognition (NER) is the automatic identification of named entities (NEs) in texts, typically including their assignment to a set of semantic categories [19]. The established classes (for newspaper texts) are person PER, location LOC, organization ORG and other OTH [3, 36, 37]. Research on NER spans more than 20 years and has produced approaches based on linear statistical models, e.g., Maximum Entropy Models [1, 10] and Hidden Markov Models [27], among others. Nowadays, state of the art results are produced by methods such as CRFs [2, 4, 16, 17] and BiLSTMs [9, 20, 22, 26]. For English news documents, the best models achieve approx. 90 F\(_1\) [9, 20, 22, 26, 29, 38], while the best models for German are not quite as good at approx. 80 F\(_1\) [2, 4, 16, 22]. Based on their very good performance on news documents, we examine the use of CRFs and BiLSTMs on legal documents.

1.1 Application and Project Context

The objective of the project LYNX (Building the Legal Knowledge Graph for Smart Compliance Services in Multilingual Europe), a three-year EU project that started in December 2017, is the creation of a legal knowledge graph that contains different types of legal and regulatory data. LYNX aims to help European companies, especially SMEs that already operate internationally, to offer and promote their products and services in other countries. The project will eventually offer compliance-related services, which are currently being tested and validated in three use cases. The first pilot is a legal compliance solution in which documents related to data protection are managed, analysed, and visualised across different jurisdictions. In the second pilot, LYNX supports the understanding of regulatory regimes, including norms and standards, related to energy operations. The third pilot is a compliance solution in the domain of labour law, where legal provisions, case law, administrative resolutions, and expert literature are interlinked, analysed, and compared to define legal strategies for legal practice. The LYNX services are developed for several European languages including English, Spanish and German [32].

Documents in the legal domain contain multiple references to NEs, especially NEs specific to the legal domain, e.g., jurisdictions, legal institutions, etc. Most NER solutions operate in the general or news domain, which makes them not fully suitable for the analysis of legal documents because they are unable to detect domain-specific entities. The goal is to make knowledge workers, who process and make use of these documents, more efficient and more effective in their day-to-day work; this also includes the analysis of domain-specific NEs, see [5, 31] for related approaches in the area of content curation technologies.

1.2 Research Questions

This article is dedicated to the recognition of NEs and their respective categories in German legal documents. Legal language is unique and differs greatly from newspaper language; this also relates to the use of person, location and organization NEs, which are relatively rare in legal text. Legal text does, however, contain specific entities such as designations of legal norms and references to other legal documents (laws, ordinances, regulations, decisions, etc.) that play an essential role. Despite the development of NER for other languages and domains, the legal domain has not been exhaustively addressed yet. This research also had to face the following two challenges. (1) There is no uniform typology of semantic concepts related to NEs in documents from the legal domain; correspondingly, uniform annotation guidelines for NEs in the legal domain do not exist either. (2) There are no freely available datasets consisting of documents from the legal domain in which NEs have been annotated.

Thus, the research goal is to examine NER with a specific focus on German legal documents. This includes the elaboration of the corresponding concepts, the construction of a dataset, and the development, evaluation and comparison of state of the art models for NER. We address the following research questions:

  1. Which state of the art approaches are in use for NER? Which approaches have been developed for NER in legal documents? Do these approaches correspond to the state of the art?

  2. Which NE categories are typical for legal documents? Which classes are to be identified and classified? Which legal documents can be used for a dataset?

  3. What performance do current models achieve? How well are different categories recognized? Which categories are recognized better than others?

2 Related Work

Despite its high relevance, NER in the legal domain is not a well-researched area. Existing approaches are inconsistent with regard to the applied methods, techniques, classifications and datasets, which makes it impossible to compare their results adequately. Nevertheless, the developed approaches make an important contribution and form the basis for further research.

The first work in which NER in the legal domain was explicitly defined as a task was that of Dozier et al. [13]. The authors examined NER in US case law, depositions, pleadings and other legal documents, implemented using simple lookups in lists of NEs, contextual rules, and statistical models. Taggers were developed for jurisdiction, court, title, document type (e.g., brief, memorandum), and judge. The jurisdiction tagger performed best with an F\(_1\) of 92; the scores of the other taggers were around 82–85.

Cardellino et al. developed a tool for recognizing, classifying, and linking legal NEs [8]. It uses the YAGO and LKIF ontologies and distinguishes four levels of granularity: NER, NERC, LKIF and YAGO. A Support Vector Machine, Stanford NER [17] and a neural network (NN) were trained and evaluated on Wikipedia and on decisions of the European Court of Human Rights. The best results on the Wikipedia dataset were achieved by the NN, with F\(_1\) scores of 86 and 69 for the NERC and YAGO classes, respectively. For the LKIF classes, Stanford NER was better with an F\(_1\) score of 77. The performance was significantly worse on the decisions, with F\(_1\) scores varying according to the model and the level of granularity; Stanford NER achieved a maximum F\(_1\) score of 56 on the NERC classes.

Glaser et al. tested three NER systems [18]. The first, GermaNER [4], recognized person, location, organization and other; temporal and numerical expressions were recognized using rule-based approaches, and references using the approach described in Landthaler et al. [23]. The second system was DBpedia Spotlight [11, 28], developed for the automatic annotation of DBpedia entities. The third system, Templated, was designed by Glaser et al. [18] and focused on NER in contracts created from templates. For GermaNER and DBpedia Spotlight, a manually annotated corpus was created, consisting of 500 decisions of the 8th Civil Senate of the German Federal Court of Justice relating to tenancy law. GermaNER and DBpedia Spotlight were evaluated on 20 decisions from this dataset and achieved an F\(_1\) of 80 and 87, respectively; Templated was evaluated on five different contracts and reached 92 F\(_1\).

To adapt the categories to the legal domain, the set of NE classes was redefined in the approaches described above. Dozier et al. [13] focused on legal NEs (e.g., judge, lawyer, court). Cardellino et al. [8] extended the NEs at the NERC level with document, abstraction, and act; it remains unclear what belongs to these classes and how they were separated from each other. Glaser et al. [18] added reference [23]. However, this was understood as a reference to legal norms, so that further references (to decisions, regulations, legal literature, etc.) were not covered.

Research on NER in legal documents is also complicated by the fact that there are no freely available datasets, neither for English nor for German. Datasets for newspaper texts, such as those developed for CoNLL 2003 or GermEval 2014, are not suitable in terms of the type of text and the annotated entities. In this context, the need for a manually annotated dataset consisting of legal texts is enormous, requiring the development of a classification of legal categories and uniform annotation guidelines. Such a dataset consisting of documents from the legal domain would make it possible to implement NER with state of the art architectures, i.e., CRF and BiLSTM, and to analyze their performance.

3 A Dataset of Documents from the Legal Domain

3.1 Semantic Categories

Legal documents differ from texts in other domains, and from each other, in terms of text-internal and text-external criteria [7, 12, 15, 21], which has a huge impact on their linguistic and thematic design, citation practice, structure, etc. This also applies to the NEs used in legal documents. In law texts and administrative regulations, the occurrence of typical NEs such as person, location and organization is very low. Court decisions, on the other hand, include these NEs as well as references to national or supranational laws, other decisions, and regulations. Two requirements for a typology of legal NEs emerge from these peculiarities. First, the categories used must reflect those entities that are typical of decisions. Second, the typology must cover the entities whose differentiation in decisions is highly relevant.

Domain-specific NEs in legal documents can be divided into two basic groups, namely designations and references. For legal norms (i.e., laws and ordinances), designations are the headings of their standard legal texts, which provide information on rank and content [6, Rn. 321 ff.]. Headings are uniform and usually consist of a long title, short title and abbreviation, e.g., the title of the Medicinal Products Act of 12 December 2005 ‘Gesetz über den Verkehr mit Arzneimitteln (Arzneimittelgesetz – AMG)’ (Federal Law Gazette I p. 3394). The short title ‘Arzneimittelgesetz’ and the abbreviation ‘AMG’ are given in brackets. The citation of legal norms is also fixed, with different citation rules for full and short citations [6, Rn. 168 ff.]. The designation and citation of binding individual acts such as regulations or contracts is not uniformly defined.

For our dataset consisting of court decisions, a total of 19 fine-grained classes were developed, which are based on seven coarse-grained classes (see Table 1). As a starting point for the elaboration of the typology, the well-researched newspaper domain was used. The annotation guidelines are based on the ACE guidelines [25] and NoSta-D Named-Entity [3]. The core NEs are the typical classes PER, LOC, and ORG, which are split into fine-grained classes. The coarse- and fine-grained classifications correlate such that, e.g., the coarse-grained class person PER under number 1 in Table 1 contains the fine-grained classes judge RR, lawyer AN and other person PER (plaintiffs, defendants, witnesses, appraisers, etc.) under numbers 1 to 3. The location LOC includes the fine-grained classes country LD (countries, states and city-states), city ST (cities, villages and communities), street STR (streets, squares, avenues, municipalities and attractions) and landscape LDS (continents, mountains, lakes, rivers and other geographical units). The coarse-grained class organization ORG is divided into public/social, state and economic institutions, which form the fine-grained classes organization ORG, institution INN, and company UN. Designations of the federal, supreme, provincial and local courts are summarized in the fine-grained class court GRT. Furthermore, brand MRK is a separate fine-grained category.

A fundamental peculiarity of the published decisions is that all personal information is anonymised for data privacy reasons. This applies primarily to person, location and organization NEs, which are replaced by letters (1) or dots (2).

[figure a: examples (1) and (2), anonymised NEs replaced by letters or dots]

In addition to the typical categories, classes specific to legal documents, i.e., court decisions, are also included. These are the coarse-grained classes legal norm NRM, case-by-case regulation REG, court decision RS and legal literature LIT. Legal norm and case-by-case regulation comprise both NEs (3) and references (4), while court decision and legal literature comprise only references (5). Legal norm NRM is subdivided according to legal force into the fine-grained classes law GS, ordinance VO and European legal norm EUN. Case-by-case regulation REG, on the other hand, contains binding individual acts that rank below legal norms. These include the fine-grained classes regulation VS (administrative regulations, directives, circulars and decrees) and contract VT (public service contracts, international treaties, collective agreements, etc.). The last two coarse-grained classes, court decision RS and legal literature LIT, do not have any fine-grained classes. RS covers references to decisions, and LIT summarizes references to legal commentaries, legislative materials, legal textbooks and monographs.

[figure b: examples (3)–(5), designations of and references to legal-domain NEs]
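To make the two-level typology concrete, the mapping between coarse- and fine-grained classes can be written as a simple lookup table. The following sketch is derived from the description above and Table 1 (brand MRK is listed under ORG here so that the seven coarse-grained and 19 fine-grained classes add up, an assumption based on Table 1); the helper coarsen is our own illustration, useful, e.g., for deriving the coarse-grained variant of the annotations from the fine-grained tags:

```python
# Coarse-to-fine class mapping as described above (cf. Table 1).
COARSE_TO_FINE = {
    "PER": ["RR", "AN", "PER"],                 # judge, lawyer, other person
    "LOC": ["LD", "ST", "STR", "LDS"],          # country, city, street, landscape
    "ORG": ["ORG", "INN", "UN", "GRT", "MRK"],  # organization, institution, company, court, brand
    "NRM": ["GS", "VO", "EUN"],                 # law, ordinance, European legal norm
    "REG": ["VS", "VT"],                        # regulation, contract
    "RS":  ["RS"],                              # court decision (references only)
    "LIT": ["LIT"],                             # legal literature (references only)
}

FINE_TO_COARSE = {fine: coarse
                  for coarse, fines in COARSE_TO_FINE.items()
                  for fine in fines}

def coarsen(tag):
    """Map a fine-grained IOB2 tag (e.g., 'B-GS') to its coarse-grained
    counterpart (e.g., 'B-NRM'); 'O' is passed through unchanged."""
    if tag == "O":
        return tag
    prefix, fine = tag.split("-", 1)
    return f"{prefix}-{FINE_TO_COARSE[fine]}"
```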

3.2 Dataset Statistics and Distribution of Semantic Categories

The dataset Legal Entity Recognition (LER) consists of 750 German court decisions published online in the portal ‘Rechtsprechung im Internet’. The source texts were extracted from the XML documents and split into sentences and tokens by SoMaJo [30]. The annotation was performed manually by one Computational Linguistics student using WebAnno [14]. As future work, we plan to add annotations from two to three further annotators so that we can report inter-annotator agreement. The dataset is freely available for download under the CC-BY 4.0 license, in CoNLL-2002 format: each line consists of two columns separated by a space, the first containing a token and the second a tag in IOB2 format; sentence boundaries are marked with an empty line.
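To illustrate the file format, here is a minimal reader sketch (the file name ler.conll is a placeholder):

```python
from collections import Counter

def read_conll(path):
    """Yield sentences as lists of (token, tag) pairs."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                 # empty line marks a sentence boundary
                if sentence:
                    yield sentence
                    sentence = []
            else:
                token, tag = line.split(" ")   # two space-separated columns
                sentence.append((token, tag))
    if sentence:                         # file may end without a trailing blank line
        yield sentence

# Usage: count annotated NEs per class (one count per entity, not per token)
counts = Counter(tag.split("-", 1)[1]
                 for sent in read_conll("ler.conll")
                 for _, tag in sent
                 if tag.startswith("B-"))
print(counts.most_common())
```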

The dataset consists of 66,723 sentences and 2,157,048 tokens. The percentage of annotations (on a per-token basis) is approx. 19%. Overall, the dataset includes 53,632 annotated NEs, and it comes in two variants for the classification of legal NEs (Table 1). Person, location and organization NEs make up 25.66% of all annotated instances; the remaining 74.34% fall into the domain-specific categories legal norm NRM, case-by-case regulation REG, court decision RS and legal literature LIT. The largest classes are law GS (34.53%) and court decision RS (23.46%). Other entities, i.e., ordinance, European legal norm, regulation, contract, and legal literature, are less common (between 1% and 6% of all annotations).

Table 1. Distribution of coarse- and fine-grained classes in the dataset

4 Evaluation and Results

We used two sequence labeling tools for our experiments: sklearn-crfsuite and UKPLab-BiLSTM [35]. In total, 12 models were tested: three CRF and three BiLSTM models, each with both the coarse- and the fine-grained class set. For the CRFs, the following groups of features and sources were selected and manually developed:

  1. F: features for the current word in a context window from −2 to +2, which are case and shape features, prefixes, and suffixes (illustrated in the sketch below);

  2. G: for the current word, gazetteers of persons from Benikova et al. [4]; gazetteers of countries, cities, streets, landscapes, and companies from GOVDATA, the Federal Agency for Cartography and Geodesy, and Datendieter.de; gazetteers of laws, ordinances and administrative regulations from the Federal Ministry of Justice and Consumer Protection. A detailed description of the gazetteers can be found in the GitHub project;

  3. L: lookup table for the word similarity in a context window from −2 to +2 as in Benikova et al. [4], which contains the four most similar words to the current word.

Three models were designed combining these groups of features and gazetteers: (1) CRF-F with features; (2) CRF-FG with features and gazetteers; and (3) CRF-FGL with features, gazetteers, and the lookup table; the model names reflect the three groups. As the learning algorithm, L-BFGS is used with L1 and L2 regularization, both coefficients set to 0.1. The maximum number of optimization iterations is set to 100.
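The following is a minimal sketch of this CRF setup with sklearn-crfsuite, using the training parameters stated above (L-BFGS, c1 = c2 = 0.1, 100 iterations). The feature function only illustrates the case/shape/prefix/suffix features of group F in the ±2 window; it is not the complete feature set, and all identifiers are our own:

```python
import sklearn_crfsuite

def word2features(sent, i):
    """Feature dict for token i of a sentence (a list of token strings)."""
    feats = {"bias": 1.0}
    for offset in range(-2, 3):          # context window from -2 to +2
        j = i + offset
        if 0 <= j < len(sent):
            w = sent[j]
            p = f"w{offset:+d}:"
            feats.update({
                p + "lower": w.lower(),   # case-normalized form
                p + "istitle": w.istitle(),
                p + "isupper": w.isupper(),
                p + "isdigit": w.isdigit(),
                p + "prefix3": w[:3],
                p + "suffix3": w[-3:],
            })
    return feats

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",   # L-BFGS optimization
    c1=0.1,              # L1 regularization coefficient
    c2=0.1,              # L2 regularization coefficient
    max_iterations=100,
)
# X: list of token lists, y: list of IOB2 tag sequences (e.g., from read_conll above)
# crf.fit([sent2features(s) for s in X], y)
```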

For the BiLSTMs we also use three models: (1) BiLSTM-CRF [20]; (2) BiLSTM-CRF+ with character embeddings from a BiLSTM [22]; and (3) BiLSTM-CNN-CRF with character embeddings from a CNN [26]. As hyperparameters, we used the values that achieved the best NER performance according to Reimers and Gurevych [34]: the models have two BiLSTM layers, each with 100 units and a dropout of 0.25, and the maximum number of epochs is 100. The tool uses pre-trained word embeddings for German [33].
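The UKPLab tool is Keras-based; purely to illustrate the architecture, the following is a minimal PyTorch sketch of the plain BiLSTM-CRF with the hyperparameters above (two stacked BiLSTM layers of 100 units, dropout 0.25). The CRF layer is assumed to come from the pytorch-crf package, the embedding dimension is a placeholder, and the character embeddings as well as the pre-trained German word embeddings are omitted:

```python
import torch.nn as nn
from torchcrf import CRF  # assumed dependency: the pytorch-crf package

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=300, hidden=100, dropout=0.25):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.drop = nn.Dropout(dropout)
        # Two stacked BiLSTM layers with 100 units each; dropout between layers.
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2, bidirectional=True,
                            batch_first=True, dropout=dropout)
        self.proj = nn.Linear(2 * hidden, num_tags)   # emission scores per tag
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, tokens, tags=None, mask=None):
        x, _ = self.lstm(self.drop(self.emb(tokens)))
        emissions = self.proj(self.drop(x))
        if tags is not None:
            return -self.crf(emissions, tags, mask=mask)  # training: negative log-likelihood
        return self.crf.decode(emissions, mask=mask)      # inference: best tag sequences
```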

The results were measured with micro-averaged precision, recall and F\(_1\). In order to reliably estimate performance, we evaluated the models using stratified 10-fold cross-validation: the dataset is shuffled sentence-wise and divided into ten mutually exclusive subsets of similar size. Each iteration uses one subset for validation and the remaining nine for training, so that over the ten iterations each subset is used nine times for training and once for validation. The distribution of NEs in the training and validation sets remains the same across iterations. Cross-validation guards against results that depend on a single train/validation split, and stratification prevents measurement errors on the unbalanced data.
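As a sketch of this evaluation loop (plain shuffled KFold over sentences is shown; the stratification by NE distribution described above would require a custom splitter), entity-level micro precision, recall and F\(_1\) can be computed with the seqeval package:

```python
import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import KFold

def cross_validate(sentences, labels, train_and_predict, n_splits=10, seed=42):
    """sentences/labels are parallel lists; train_and_predict(X_train, y_train,
    X_test) is any callable returning predicted IOB2 tag sequences for X_test."""
    sentences = np.asarray(sentences, dtype=object)
    labels = np.asarray(labels, dtype=object)
    scores = []
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=seed).split(sentences):
        pred = train_and_predict(list(sentences[train_idx]), list(labels[train_idx]),
                                 list(sentences[test_idx]))
        gold = list(labels[test_idx])
        scores.append([precision_score(gold, pred),
                       recall_score(gold, pred),
                       f1_score(gold, pred)])   # micro-averaged by default
    return np.mean(scores, axis=0)              # mean precision, recall, F1 over folds
```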

4.1 CRF Models

For the fine-grained classes, CRF-FGL achieved the best overall performance with an F\(_1\) score of 93.23 (Table 2). The recognition of legal NEs in the individual classes varied depending on the model. Lawyer, institution, court, contract and court decision reached the highest F\(_1\) with CRF-F. With CRF-FG, better results were achieved for judge, city, regulation and legal literature, which means that the gazetteers had a positive impact on the recognition of these NEs. The remaining classes performed best with CRF-FGL. Adding the gazetteers and the lookup table for word similarity thus improved the results, but not as much as expected.

For the coarse-grained classes, CRF-FG and CRF-FGL together achieved the best result with an F\(_1\) value of 93.22 (Table 3). However, person was recognized better with CRF-FG, and location and organization better with CRF-FGL. CRF-FG achieved the best result for case-by-case regulation and court decision; with CRF-FGL, the values for legal norm and legal literature increased. Compared to the fine-grained classes, precision and recall were better balanced, and F\(_1\) increased by at most 0.1 per model.

Table 2. Precision, recall and F\(_1\) values of CRF models for fine-grained classes
Table 3. Precision, recall and F\(_1\) values of CRF models for coarse-grained classes

4.2 BiLSTM Models

For the fine-grained classes, the two models with character embeddings achieved the best result with an F\(_1\) score of 95.46 (Table 4), confirming the positive impact of character-level information. A significant improvement, with an increase in F\(_1\) of 5–16 points compared to the BiLSTM-CRF without character embeddings, was found for organization, company, ordinance, regulation and contract. Judge and lawyer were recognized better, by about 1 point, with the BiLSTM-CRF. Person, country, city, court, brand, law, ordinance, European legal norm, regulation and contract were identified better with the BiLSTM-CRF+, and street, landscape, organization, company, institution, court decision and legal literature with the BiLSTM-CNN-CRF. The results thus also depend on whether the character embeddings are produced by a BiLSTM or a CNN: brand, ordinance and regulation benefited significantly from the BiLSTM character embeddings, whereas the recognition of street and landscape improved with the character embeddings from the CNN.

Table 4. Precision, recall and F\(_1\) values of BiLSTM models for fine-grained classes
Table 5. Precision, recall and F\(_1\) values of BiLSTM models for coarse-grained classes

For the coarse-grained classes, F\(_1\) increased by 0.3–0.9 per model, and precision and recall were also more balanced (Table 5). The best result, 95.95, was produced by the BiLSTM-CRF+, which reached more than 90 F\(_1\) in almost all classes. The exception was organization, where the BiLSTM-CNN-CRF increased F\(_1\) by 0.3.

4.3 Discussion

The BiLSTMs achieved superior performance compared to the CRFs. They produced good results even for the fine-grained classes that are poorly covered in the dataset. The CRF models, on the other hand, delivered values that were about 1–10 points lower per class. In addition, some classes showed larger differences between precision and recall, indicating certain weaknesses of the CRFs. In particular, the recognition of street and brand improved by at least 10 points with the BiLSTM models, and the values for lawyer, landscape and ordinance increased by about 5 points.

The results also show that the two model families exhibit similar behaviour, which can be traced back to the structure of the data. Both produce their best results, with 95 F\(_1\), for the fine-grained classes judge, court and law. On the one hand, this is due to the small number of types compared to tokens for judge and court. On the other hand, the precise identification of law can be explained by its good coverage in the dataset and its uniform citation. Incorrect boundary predictions occur when references take an uncommon form, such as ‘§ 7 des Gesetzes (gemeint ist das VersAnstG)’ instead of the common ‘§ 7 VersAnstG’, or ‘das zwölfte Kapitel des neunten Sozialgesetzbuches’ instead of ‘das Kapitel 12 des SGB XII’. Terms containing the word ‘law’, such as ‘federal law’, ‘law of experience’, ‘criminal law’, etc., were also incorrectly classified as NEs. The recognition of country, institution, court decision, and legal literature was also very good, with scores higher than 90 F\(_1\), again due to the small number of types for country and institution and the uniform references of court decision and legal literature.

However, the recognition of street, landscape, organization and regulation is the lowest throughout, amounting to 69–80 F\(_1\) with the CRF and 72–83 with the BiLSTM models, caused by inconsistent citation styles. The recognition of street and landscape is poor because they are covered in the dataset with only about 200 instances, which are, moreover, heterogeneously represented. The worst result, a maximum F\(_1\) value of 69.61 with the CRFs and of 79.17 with the BiLSTMs, was observed for brand. These NEs also occur in ambiguous contexts, compare the brand NE ‘Einstein’s Garage’ and the scientist Albert Einstein. It can be concluded that the differences in the recognition of certain NEs are due, first, to the unbalanced class distribution and, second, to the specifics of legal documents, in particular the coverage in the corpus, the heterogeneity of the forms of names and references, and the context.

Overall, the CRFs and BiLSTMs perform very well, producing state of the art results that are significantly better than those of comparable models for newspaper text. This can, first, be explained by the size of the dataset, which is larger than other NE datasets for German. Second, the form of legal NEs, which also includes references, differs considerably from NEs in newspaper text; designations and references are far more frequent in the dataset than person, location or organization NEs. Third, the strictly regulated linguistic and thematic design (repeated use of NEs within one decision, repeated use of formulaic, template-like sentences, etc.) and the uniform reference style have a positive impact on performance. The applied evaluation method made it possible to reliably estimate performance on unbalanced data. Unfortunately, it is not possible to compare our results with other systems for NER in legal documents because those systems are not freely available.

5 Conclusion

We have described and evaluated a set of approaches for the recognition of semantic concepts in German court decisions. In line with the research goals, the characteristic and relevant semantic categories, such as legal norm, case-by-case regulation, court decision and legal literature, were worked out, and a dataset of legal documents was built in which instances of a total of 19 semantic classes were annotated. For the experiments, CRF and BiLSTM models corresponding to the state of the art were selected and tested with the two sets of classes. The results of both model families demonstrate the superiority of the BiLSTM models with character embeddings, with an F\(_1\) score of 95.46 for the fine-grained classes and 95.95 for the coarse-grained classes. We found that the structure of the data involved in the training process strongly impacts performance. To improve NER further, it is necessary to extend or rebalance the unbalanced data; this helps to minimize the influence of the specific characteristics of legal documents on the models. Our results also show that there is no universal model that recognizes all classes best. Accordingly, an even better system could be built as an ensemble of different models that each perform well for particular classes.