Nature of relevance in LIR
The science of information retrieval is basically about ‘Relevance’: how to retrieve the most relevant documents from—in principle—an unlimited set? Before any methodology or system for retrieval can be developed or discussed, the concept of ‘relevance’ has to be examined. This seems to be a trivial undertaking since this concept has a tendency to be immediately understood by everybody. A thorough understanding though is of the utmost importance for the effectiveness of LIR systems, and hence it needs continuous consideration. The foundations of a conceptual framework can be adopted from general IR science.
Saracevic (1996) defined ‘relevance’ as: ‘pertaining to the matter at hand’, or, more extended: ‘As a cognitive notion relevance involves an interactive, dynamic establishment of a relation by inference, with intentions toward a context.’ From this definition it follows that relevance has a contextual dependency since it is measured in comparison to the ‘matter at hand’. Because of its dynamic establishment relevance may change over time and it involves some kind of selection (Saracevic 2007). From the definition it also follows that relevance is a comparative concept: it is a ratio scale of measurement, although by using a specific threshold it can be turned into a binary property (relevant or not). Because of this comparative character, information objects can be ranked as to their relevance.
Because of its visibility in many end-user LIR applications, ‘ranking’ might appear to be a crucial concept (Geist 2016), but ranking of search results is only one of the many practical applications of relevance, next to e.g.: ‘Filtering, assessing, inferring, (…) accepting, rejecting, associating, classifying… and other similar roles and processes’ (Saracevic 1996). By narrowing ‘relevance’ to ‘ranking’ one not only excludes these many other applications of relevance—which are also increasingly used in modern LIR systems—but inevitably runs into theoretical problems by mistaking a derivative function for the underlying concept.
Dimensions of relevance in LIR
To understand the concept of relevance it is important to disambiguate various ‘relevance dimensions’ (Cosijn and Ingwersen 2000). This term is comparable to ‘relevance manifestations’ as used by Saracevic (2007). We discuss these relevance dimensions here in brief, summarizing their basic features and indicating how our typology deviates from those of Saracevic and Cosijn/Ingwersen. In the remainder of the paper we elaborate these relevance dimensions for legal information retrieval in greater detail.
1. Algorithmic or system relevance. The first dimension pertains to the computational relationship between a query and information objects, based on matching or a similarity between them. Traditionally, models have been described within the context of full-text search, e.g. Boolean, probabilistic and vector-space models. Natural language processing is also perceived to be within algorithmic relevance, although in our view algorithmic relevance also covers those processes which do not take place during the actual querying, but are intended to improve algorithmic relevance at a later stage. Examples are pre-processing of documents and automatic classification. Unlike all other relevance dimensions, which can be observed and assessed without a computer, algorithmic relevance is system-dependent.
2. Topical relevance. The relationship between the ‘topic’ (concept, subject) of a request and the information objects retrieved about that topic. A topicality relation is assumed to be an objective property, independent of any particular user. ‘Aboutness’ is the traditional distinctive criterion. The topics of the information objects might be hand-coded or computed, e.g. by classification algorithms.
3. Bibliographic relevance. The relationship between a request and the bibliographic closeness of the information objects. One of the specific features of legal information, as described in Sect. 2.2 above, is its self-containment. This means that legal information systems (unlike information systems on medicine, classic cars or animals) contain the final objects themselves. Hence, ‘isness’ is the distinctive criterion. Because of the many different versions legal information objects might have, isness is not a Boolean but a relative concept, and therefore not an issue of data retrieval, but of information retrieval. This dimension does not exist in the typologies of Saracevic and Cosijn/Ingwersen.
4. Cognitive relevance or pertinence. Concerns the relation between the information needs of a user and the information objects. Unlike algorithmic, bibliographic and topical relevance, cognitive relevance is user-dependent, with criteria like informativeness, preferences, correspondence and novelty as measuring elements.
5. Situational relevance or utility. Defined as the relationship between the problem or task of the user and the information objects in the system. This dimension of relevance also depends on the specific user, but unlike cognitive relevance it does not focus on the request as formulated, but on the underlying motivation for starting the information retrieval process. Inferred criteria for situational relevance are the usefulness for decision-making, appropriateness in problem solving and reduction of uncertainty.
6. Domain relevance. As his fifth dimension Saracevic (1996) used ‘Motivational or affective relevance’, but in a critical assessment Cosijn and Ingwersen (2000) replaced this dimension by ‘socio-cognitive relevance’, which “[I]s measured in terms of the relation between the situation, work task or problem at hand in a given socio-cultural context and the information objects, as perceived by one or several cognitive agents.” Given the specific features of legal information as well as for reasons of modelling, we define this dimension as the relevance of information objects within the legal domain itself (and hence not to ‘work task or problem at hand’). For convenience we label it ‘domain relevance’.
The role of these dimensions in the interplay between user, information retrieval system and legal domain is depicted in Fig. 1.Footnote 2 It should be noted that both bibliographic and topical relevance relate to a relationship between the user request (as formulated in the user interface) and the information objects. They might be mutually exclusive (the user is looking either for the object itself or for information about it), but not necessarily: one might search for a court decision and information about that decision at the same time, but even then the user wants these results presented separately, or recognizable as ‘is’ and ‘about’ in his result list.
Already here it should be observed that relevance dimensions easily overlap and intermingle: “The effectiveness of IR depends on the effectiveness of the interplay and adaptation of various relevance manifestations, organized in a system of relevancies” (Saracevic 1996). In the design of IR systems it is hence of the utmost importance to distinguish between the various dimensions and to pay specific attention to each of them in the user interface, the retrieval engine and the document collection. Doing so can be expected to improve the user’s perception of the system’s performance in retrieving the most relevant information. This perception, or ‘criterion for success’, depends on the relevance dimension(s) invoked. These criteria are, together with the nature of the respective dimensions, summarized in Table 2.
Table 2 Dimensions of relevance compared
In the following subsections we elaborate these six relevance dimensions of LIR and discuss how they may help to classify the past and current spectrum of approaches, how they correspond to the information-seeking behaviour of legal professionals and how they might help to bridge the conceptual gap between lawyers and informaticians.
Algorithmic relevance
Algorithmic relevance concerns the computational core of information retrieval. As expressed in Fig. 1 it is the relation between the information objects and the query; this ‘query’ is to be understood as the computer processable translation of the request as entered in the user interface or any other intermediary component. Algorithmic relevance is about the capability of the engine to retrieve a given set of information objects (the ‘gold standard’) that should be retrieved with a given query (measured in ‘recall’) with a minimum of false positives (measured in ‘precision’).
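The two measures mentioned above can be made concrete with a minimal sketch. The document identifiers and the function name `precision_recall` are hypothetical and serve only as illustration; the definitions are the standard set-based ones for a single query.

```python
def precision_recall(retrieved, gold_standard):
    """Return (precision, recall) for a single query.

    retrieved:     identifiers returned by the engine
    gold_standard: identifiers that should have been returned
    """
    retrieved = set(retrieved)
    gold_standard = set(gold_standard)
    true_positives = retrieved & gold_standard
    # Precision: share of the retrieved objects that are in the gold standard.
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    # Recall: share of the gold standard that was actually retrieved.
    recall = len(true_positives) / len(gold_standard) if gold_standard else 0.0
    return precision, recall


# Hypothetical identifiers, for illustration only.
gold = {"doc1", "doc2", "doc3", "doc4"}
result = ["doc1", "doc2", "doc9"]
precision, recall = precision_recall(result, gold)
```

In this example two of the three retrieved objects belong to the gold standard, while two of the four gold standard objects are found.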
From our conceptual perspective the type of query as well as the type of retrieval framework is not relevant, but given the legal information features of volume, document size and lack of structure, textual search has long been the focus. There are various computational models for inferring similarity between query and information objects. In the early days Boolean search was the core of any legal retrieval system, and it is still an indispensable element in most LIR systems today. In a Boolean system both the user request and the documents are regarded as a set of terms, and the system will return documents containing the terms in the request. Boolean searches often result in the retrieval of a large number of documents. In addition, they provide little or no assistance to the user in formulating or refining a query and they lack domain expertise that could improve the search outcome. Relevance performance was improved by using models such as the vector space model (Salton et al. 1975) and weighting schemes such as TF-IDF (term frequency-inverse document frequency). Nevertheless, recall is often below acceptable levels because the design of full-text retrieval systems: “(I)s based on the assumption that it is a simple matter for users to foresee the exact words and phrases that will be used in the documents they will find useful, and only in those documents” (Blair and Maron 1985). Ambiguity, synonymy and complexity of legal expressions contribute substantially to this problem (Dabney 1986). Natural language processing (NLP) is gaining popularity as an addition or alternative to pure text-based search (Maxwell and Schafer 2008).
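The difference between Boolean matching and ranked retrieval can be illustrated with a small sketch. The three texts, the whitespace tokenizer and all function names are assumptions made for this illustration. The weighting used is one simple TF-IDF variant (raw term frequency times the logarithm of the inverse document frequency) combined with cosine similarity; it is not meant to reproduce any specific LIR system.

```python
import math
from collections import Counter

# Hypothetical mini-collection, for illustration only.
DOCS = {
    "d1": "tenant may terminate the lease by written notice",
    "d2": "the landlord may terminate the lease if the tenant fails to pay rent",
    "d3": "notice of appeal must be filed within thirty days",
}


def tokenize(text):
    return text.lower().split()


def boolean_and(query, docs):
    """Boolean search: return ids of documents containing every query term."""
    terms = set(tokenize(query))
    return sorted(d for d, text in docs.items() if terms <= set(tokenize(text)))


def tfidf_vectors(docs):
    """Build one TF-IDF vector (term -> weight) per document."""
    tokenized = {d: tokenize(text) for d, text in docs.items()}
    doc_freq = Counter(t for tokens in tokenized.values() for t in set(tokens))
    idf = {t: math.log(len(tokenized) / df) for t, df in doc_freq.items()}
    vectors = {
        d: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for d, tokens in tokenized.items()
    }
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Vector space search: order all documents by similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(t for t in tokenize(query) if t in idf)
    query_vector = {t: tf * idf[t] for t, tf in counts.items()}
    scores = {d: cosine(query_vector, vec) for d, vec in vectors.items()}
    return sorted(scores, key=lambda d: scores[d], reverse=True)


boolean_hits = boolean_and("tenant notice lease", DOCS)
ranked_hits = rank("tenant notice lease", DOCS)
```

The Boolean search returns only the document containing all three terms, whereas the ranked search orders all documents by degree of similarity, so that partially matching documents remain retrievable.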
Apart from text-based search also other types of algorithmic relevance can be considered, like the use of ontologies as higher level knowledge models (Casanovas et al. 2016; Saravanan et al. 2009), network statistics, especially when used for citation analysis (Fowler and Jeon 2008; van Opijnen 2013) as well as methods that combine different approaches (Koniaris et al. 2016).
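As an illustration of network statistics, the sketch below counts incoming citations (the in-degree), which is among the simplest measures that can be computed on a citation network. The citation pairs are hypothetical, and the cited studies use more refined measures; this is only meant to show the principle.

```python
from collections import Counter

# Hypothetical citation pairs: (citing decision, cited decision).
CITATIONS = [("C", "A"), ("D", "A"), ("D", "B"), ("E", "A"), ("E", "D")]


def in_degree(citations):
    """Count how often each decision is cited."""
    return Counter(cited for _citing, cited in citations)


def rank_by_citations(citations):
    """Order all decisions by number of incoming citations (ties by name)."""
    counts = in_degree(citations)
    nodes = {node for pair in citations for node in pair}
    return sorted(nodes, key=lambda node: (-counts[node], node))


citation_ranking = rank_by_citations(CITATIONS)
```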
Topical relevance
Topical relevance is about the relevancy relation between the topic as (explicitly or implicitly) formulated in the user request and the topics of the information objects. Different strategies have been explored to improve this relevance dimension.
1. Mapping and indexing terms. Using free text search and mapping the terms searched to the terms indexed from the information objects too often yields poor results, since legal concepts can be expressed in a variety of ways, while completely different concepts can be textually quite similar.
2. Manual indexing. Adding head notes and keywords from taxonomies or thesauri has been a long tradition within the legal information industry. Kuhlthau and Tama (2001) pointed to the lack of flexibility within such keyword search, as they noted that “(L)awyers seemed to require the opportunity to locate information outside the keyword range in order to spark an idea that enabled them to formulate the issues in a case.” This approach is problematic when lawyers have few or imprecise details about the area of which an overview is required. Although aboutness is assumed to be an objective property and hence independent of any particular user, manual indexing is inherently subjective, and even the same indexer may sort the same document under different terms depending on the context the document is presented in (Bing and Harvold 1977). “Manual indexing is only as good as the ability of the indexer to anticipate questions to which the indexed document might be found relevant. It is limited by the quality of its thesaurus. It is necessarily precoordinated and is thus also limited in its depth. Finally, like any human enterprise, it is not always done as well as it might be.” (Dabney 1986).
3. Semi-automated classification. For huge public databases manual tagging is hardly an option, but automated classification turns out not to perform better than human indexing (Mart 2010). A general drawback of such automated systems is the mandatory use of the classification scheme in the user interface. This forces the user to limit or to reformulate his request to align it with the classification system available, a problem that can only be solved by the time-consuming and tedious task of “Using a combination of automated and manual techniques, [constructing] a list of concepts and variations for expressing a concept” (Zhang 2015). This requires in-depth legal knowledge, analysis of search engine log files and continuous maintenance. Semi-automated classification using ontologies (Boella et al. 2016) is gaining popularity, and notwithstanding the current hype about legal AI applications like IBM’s Ross (Beck 2014), scepticism about their performance seems to be a healthy attitude (Paliwala 2016; Remus and Levy 2016).
4. Relation-based search. Meanwhile, developers of LIR systems should consider whether the investment is worth the effort: surveys have shown that classification systems are not very popular among users (Peoples 2005), contrary to searches by relationship (Lastres 2015). Many topics in law, at least in the juristic mindset and information seeking behaviour, have a strong connection (chain) to other legal documents. Typical requests may refer to a search for (everything) about a specific paragraph of law or court decision. In such requests these information objects represent a specific legal concept, but the only reason lawyers rephrase it might be related to the fact that the search engine cannot cope with their actual request. For well-known acts and codes such aboutness information is structured in treatises or loose-leaf encyclopaedias, but they are optimized for browsing, not for search. Since such works do not cover the whole legal domain, performing searches on citations might in principle be the obvious choice. In common law countries citators are very popular for such ‘topical citation search’, like LawCite.org (Mowbray et al. 2016) in the public domain and Shepard’s in the private domain (Spriggs and Hansford 2000). The latter is based on manual tagging and also contains qualifications of these relations. In continental Europe the importance of search by citation, as a type of aboutness, needs more attention from search providers. For example, in EUR-Lex, HUDOC and various national legislative databases, relations between documents are tagged and searchable/browsable, but especially in national case law databases citation search is extremely difficult. A first reason is that judges have poor citation habits: research showed that only 36% of cited EU acts were in conformity with the prescribed citation style, while the other citations were made in a wide range of other styles (van Opijnen 2010b).
Comparable problems appear when searching for case law citations, where additional complexity is added by the fact that one decision can be cited by many different identifiers (van Opijnen 2010a), like (often ambiguous) case numbers, reporter codes, commercial references or judgment identifiers like the European Case Law Identifier (ECLI)Footnote 3 (van Opijnen and Ivantchev 2015). Case names, which often contain the names of the parties to the case, are problematic since they have many different spelling variants and are less frequently used since court decisions are anonymized more often (van Opijnen 2016a). Moreover, slashes, commas and hyphens are essential elements of legal identifiers, but search engines interpret them out of the box as specific search instructions (e.g. ‘/’ means ‘near’ and ‘–’ means ‘not’). Manual tagging for large scale public databases is not feasible, so reference parsers have to be developed (Agnoloni and Bacci 2016; van Opijnen et al. 2015); as explained in Sect. 3.2.3 they can be used for recognizing the citations in the information objects as well as for understanding user requests.
Search in multilingual legal repositories (e.g. in the ECLI Search Engine on the European e-Justice portalFootnote 4) poses additional problems: the terms used in the request do not only have to be translated into the language of the information objects, but also into the specific legal terminology of the jurisdiction the information objects are about. Various building blocks to tackle this have been developed. EuroVocFootnote 5 is a large multilingual vocabulary; although it is used for tagging in the EUR-Lex database, it is too policy-oriented and insufficiently legal to be of practical use for LIR. Aligning legal vocabularies of different legal systems and/or languages has proven to be quite difficult (Francesconi and Peruginelli 2010); within the Legivoc project various national legal vocabularies have been mapped (Vibert et al. 2013), but it needs more elaboration to be of practical use.
Bibliographical relevance
Topical relevance, as discussed in the previous subsection, is about the relevancy relation between the topic as formulated in the user request and the topic of the information objects. For most information retrieval systems this topicality suffices to measure whether the documents retrieved match the information request as formulated by the user: ‘aboutness’ is used as the decisive criterion. But contrary to the information contained in many general information (retrieval) systems, the information in legal information (retrieval) systems is highly self-contained. Information retrieval systems on animals, aeroplanes or people contain information about those topics, but not the objects themselves. However, legal information retrieval systems do contain legislation, court decisions and parliamentary documents themselves—notwithstanding the fact they might also contain other documents about these objects (which might also be such legal sources themselves). The distinctive criterion for establishing this bibliographical relevance is ‘isness’: the degree to which the documents retrieved actually are those requested by the user. Probably because most academic research on information retrieval is about non-self-contained domains, bibliographical relevance is not considered to be a relevance dimension of its own [compare e.g. (Cosijn and Ingwersen 2000; Saracevic 1996)]. Contrary to topical queries or browsing, which are intended for surveying the unknown, bibliographical queries are intended for searching the known, at least from the user perspective: a specific act, court case, parliamentary document or scholarly article. Although this might look like an issue for data retrieval instead of information retrieval (Baeza-Yates and Ribeiro-Neto 1999) and hence a no-brainer (Harvold 2008), for various reasons in most legal information systems it is still a real brainteaser, and hence it is defendable to approach this as an information retrieval issue.
1. The ontological levels of FRBR
Before we elaborate this proposition, we first have to introduce the ontological topology developed within the Functional Requirements for Bibliographic Records (FRBR) of the International Federation of Library Associations and Institutions (IFLA 1998), which is also widely used for structuring, describing and identifying legal information (Boer 2009; CEN 2010). The four distinctive ontological levels of FRBR are work, expression, manifestation and item.
The work is an abstract level, defined as: ‘a distinct intellectual or artistic creation’. For e.g. a court decision the work is the judicial decision resolving the specific legal dispute brought before the court. This work level is addressed when one says: “The Google Spain decision of the Court of Justice of the European Union is a landmark decision in the realm of data protection.”
The expression is also an abstract level, defined as: ‘the intellectual or artistic realization of a work’. Note that the expression is also an intellectual or artistic product, but that it is always derived from a work. For legal documents different types of expressions exist: linguistic, temporal and editorial. Temporal expressions are especially relevant for legislation, since the law changes continuously. Editorial expressions are generally more relevant for court decisions: the authentic version of the judge, the anonymized version published on the court’s web portal or an abridged expression edited by a legal publisher.
The manifestation is a (specific) physical embodiment of an expression of a work. Printed documents and PDF, XML or Word versions are examples of manifestations. Apart from being non-abstract, the manifestation also differs from the expression in that no intellectual or artistic effort is required to create it.
Finally, the item is the single exemplar of a manifestation. It could e.g. be the digitally signed PDF version of a court decision residing in a specific directory on my computer or the most recent hardcover version of the Lithuanian criminal code lying on my desk.
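The four levels described above form a simple containment hierarchy, which can be sketched as a data structure. The class and field names below, as well as the example values, are assumptions chosen for this illustration; they are not taken from the FRBR specification or from any existing legal metadata standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A single exemplar of a manifestation, e.g. one file on one computer."""
    location: str


@dataclass
class Manifestation:
    """A physical embodiment of an expression, e.g. a PDF or XML version."""
    media_type: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """A realization of a work: linguistic, temporal or editorial."""
    language: str
    edition: str  # e.g. "authentic", "anonymized", "abridged"
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The abstract intellectual creation, e.g. a judicial decision."""
    title: str
    expressions: List[Expression] = field(default_factory=list)


# Hypothetical court decision with two editorial expressions.
decision = Work(
    title="Example court decision",
    expressions=[
        Expression("en", "authentic", [Manifestation("application/pdf")]),
        Expression(
            "en",
            "anonymized",
            [
                Manifestation("text/html"),
                Manifestation("application/pdf", [Item("/home/user/decision.pdf")]),
            ],
        ),
    ],
)


def count_manifestations(work):
    """Count all manifestations over all expressions of a work."""
    return sum(len(expression.manifestations) for expression in work.expressions)
```

A request for ‘the decision’ addresses the work level, while the system can only ever deliver a manifestation (or item) of one specific expression.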
2. The FRBR problem
Bibliographical relevance poses three interrelated problems to retrieval systems, all of them supporting our proposition that this is in the realm of information retrieval and not in the domain of data retrieval. The first hurdle is in understanding whether the user poses an ‘is request’ or an ‘about request’; the second issue is the identification problem and the third challenge is about retrieving the correct FRBR version(s) of a legal information object.
As to the first problem, information retrieval systems operating within non-self-contained domains can interpret a user request, written in natural language, always as an about request. They can process the request with the optimizations described in Sect. 3.2.1 on algorithmic relevance, but if asked ‘Jaguar E-type’ the system can be sure the user expects descriptions, pictures and manuals of the iconic car to be retrieved, but not the thing itself. But when asked for ‘Dublin Regulation’ the system must be able to understand that this might be a request for documents containing the two words, or for legal provisions applying to the Irish capital, but that first and foremost it must be understood as a request for the text of ‘Regulation (EU) No 604/2013 of the European Parliament and of the Council of 26 June 2013 establishing the criteria and mechanisms for determining the Member State responsible for examining an application for international protection lodged in one of the Member States by a third-country national or a stateless person’,Footnote 6 in which title the word ‘Dublin’ does not appear at all.
The second problem surfaces when one realizes that lawyers are not that precise in citing legal sources, and hence in formulating their search requests. The abovementioned regulation might also be cited as e.g. ‘Regulation No 604-2013’, ‘EC-reg. No 604/2013’ or ‘Reg (EEC) 604.2013’. None of these styles is compliant with the EU interinstitutional style guide (EU Publications Office 2011) and some are even incorrect, but when used in a citation they will be understood immediately by any legal professional. When used in a search engine, though, they will not lead to the desired result. For the reasons already discussed under relation-based search in Sect. 3.2.2, punctuation marks are interpreted as specific query instructions and the dozens of different formatting variants are too difficult to interpret correctly during query execution.
For this reason, as well as to understand that a user is actually searching for a legal document and not performing a topical search, many legal information retrieval systems offer a complex search screen, enabling the user to specify his request very precisely as to the title of the document, its (often compound) document identifier, publication reference, document date or abbreviation. The fact that such detailed screens are often offered as the default search mode, or are at least very prominently advertised, underlines the importance of bibliographical searches: such forms are still needed to achieve an acceptable performance on the isness criterion. At the same time, though, the existence of many different labels for a wide variety of identifiers and metadata, with many variations between the legal information retrieval systems a user has at his disposal nowadays, is a serious threat to the findability of documents and hence to the usability of these systems. This problem is often multiplied by changes in identification systems or citation habits. An example can be drawn from the EUR-Lex advanced search, where one has to split the document number into a ‘year’ part and a ‘number’ part: even a trained user can be puzzled where to put which digits when looking for ‘Directive 96/95/EC’, ‘Regulation 98/2014’ or ‘Regulation 2015/2016’.Footnote 7
One could say in general that such ‘advanced’ search forms for finding specific legal documents are too strict, whereas here too the adage “Be lenient in what you accept and strict in what you produce” (Musciano and Kennedy 2006) should apply. Reference parsers that have been developed for detecting citations in documents themselves (van Opijnen et al. 2015)Footnote 8 may also be used for pre-parsing user requests, making most of those specific input fields obsolete.
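How lenient pre-parsing of a user request might work can be sketched with a single regular expression that maps the citation variants mentioned above onto one normalized key. This is a minimal illustration and not the parser of the cited work; the pattern, the function name `parse_regulation` and the shape of the key are assumptions. In particular, the sketch assumes the ‘number/year’ order and therefore does not resolve year-first numbers such as ‘Regulation 2015/2016’.

```python
import re

# Lenient pattern for EU regulation citations (illustrative assumption).
CITATION = re.compile(
    r"""
    \b(?:EC-)?reg(?:ulation)?\.?       # 'Regulation', 'Reg', 'reg.', optional 'EC-'
    \s*(?:\((?:EU|EC|EEC)\))?          # optional '(EU)', '(EC)' or '(EEC)'
    \s*(?:No\.?)?                      # optional 'No' or 'No.'
    \s*(\d{1,4})\s*[/.\-]\s*(\d{4})\b  # number, any separator, four-digit year
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_regulation(request):
    """Return a normalized (type, year, number) key, or None if nothing matches."""
    match = CITATION.search(request)
    if match is None:
        return None
    number, year = match.groups()
    return ("regulation", int(year), int(number))


variants = [
    "Regulation (EU) No 604/2013",
    "Regulation No 604-2013",
    "EC-reg. No 604/2013",
    "Reg (EEC) 604.2013",
]
keys = {parse_regulation(v) for v in variants}
```

All four variants yield the same key, which can then be matched against the document identifiers in the index. The domain indicator (‘EU’, ‘EC’, ‘EEC’) is deliberately ignored in this sketch, since users often get it wrong.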
Even if the LIR system understands that isness will be the evaluation criterion and not aboutness, and even if it also understands which information object(s) might be requested for, it is confronted with the third problem: which FRBR version(s) of the document should be presented to the user. There is no clear-cut answer, but some aspects have to be taken into account. First, there might be a problem of ambiguity at the work level. Above, the Dublin Regulation was mentioned as an example, stating it is an alias for Regulation (EU) No 604/2013, but although this alias is used in daily legal language, it is not unambiguous. More precisely, this regulation is dubbed ‘Dublin III Regulation’, its predecessor, Regulation (EC) No 343/2003,Footnote 9 being the ‘Dublin II Regulation’, which in turn was preceded by the Dublin ConventionFootnote 10 (the namegiver of the legal doctrine all these instruments are about). Because of the amendments already made to the Dublin II Regulation by Regulation (EC) No 1103/2008Footnote 11 and additional changes that had to be made, it was decided the regulation had to be recast, making the Dublin III Regulation actually a distinct temporal expression of the same work (‘Dublin Regulation’) as the temporal expression Dublin II.Footnote 12 For Dublin II there is the promulgated expression (published in the Official Journal), the first consolidated expression,Footnote 13 and the consolidation after its amendment in 2008. Also Dublin III exists in its promulgated expression in the Official Journal, as well as in a consolidated expression.Footnote 14 EU regulations are equally authentic in all official (24)Footnote 15 languages, and most of these language expressions exist for all temporal and promulgated/consolidated expressions. 
And with regard to temporal expressions, (possible) future versions should also, if available, be retrievable.Footnote 16 Many such documents exist in different manifestations; for end-users often (X)HTML and PDF are available, for computers sometimes also e.g. (RDF/)XML or JSON.
The problem of finding and presenting the bibliographically most relevant version can be addressed by a variety of methods, e.g. taking into account the language of the user, using the metadata (e.g. on the provider of the document and its authoritativeness), offering an option for specifying the temporal expression in the request form, or the possibility to compare different linguistic or temporal expressions after a first version of the document has been retrieved. An example of the latter can be found on EUR-Lex, which can now display up to three language versions at the same time.Footnote 17 Time-travelling in legislative databases is also improving: jurists often need to know the delta between the temporal version T of an act and version T + 1. Some legislative databases nowadays not only serve version T and T + 1 in parallel, but also actually show the delta in a user-friendly way.Footnote 18 On the server side, specific ‘FRBR resolvers’ like the Akoma Ntoso resolver might help in finding the best match for a given set of input parameters, even if this best match is on a different server (Palmirani et al. 2014).
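Two of these methods, selecting an expression by language and date and showing the delta between two temporal versions, can be sketched as follows. The data, the fallback rule and the function names are assumptions for illustration only; the sketch does not describe how EUR-Lex or the Akoma Ntoso resolver actually work.

```python
import difflib
from datetime import date

# Hypothetical expressions of one work: (language, valid from, identifier).
EXPRESSIONS = [
    ("en", date(2010, 1, 1), "v1-en"),
    ("en", date(2015, 1, 1), "v2-en"),
    ("fr", date(2010, 1, 1), "v1-fr"),
]


def resolve(expressions, language, on_date, fallback_language="en"):
    """Pick the latest expression in force on on_date, preferring the user's language."""
    for lang in (language, fallback_language):
        candidates = [e for e in expressions if e[0] == lang and e[1] <= on_date]
        if candidates:
            return max(candidates, key=lambda e: e[1])[2]
    return None


def delta(version_t, version_t1):
    """Return only the removed ('- ') and added ('+ ') lines between two versions."""
    diff = difflib.ndiff(version_t.splitlines(), version_t1.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


# Hypothetical texts of version T and version T + 1.
changes = delta(
    "Article 1\nThe fee is 10 euro.",
    "Article 1\nThe fee is 12 euro.",
)
```

A request in a language for which no expression exists falls back to the default language in this sketch; a real system would have to decide whether such a fallback is acceptable given the authenticity of the language versions.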
Cognitive relevance
Cognitive relevance concerns the extent to which an information object matches the cognitive information needs of the user: the information needs as he experienced them before he had to translate them into a request in the user interface. This relevance dimension is of a subjective nature: do the retrieved documents fit the user’s state of knowledge? Are there any characteristics of the information objects retrieved that he should be aware of?
Since this dimension is of a subjective nature, the cognitive relevance performance of a LIR system depends on the ability of the system to explicitly or implicitly understand the information needs of each individual user; the many contexts in which the term ‘personalized search’ is used all have in common that they are about cognitive relevance.
Especially the possible use of recommender systems should be mentioned here. Recommender systems rely on intelligent filtering by comparing and combining document metrics, search results and user-generated data. Two types of filtering can be distinguished. ‘Collaborative filtering’ recommends documents by making use of the user’s past search behaviour and/or that of a peer group. ‘Content-based filtering’ on the other hand uses shared features of the document at hand and other documents, based on e.g. topical resemblance, comparable metadata or closeness in a citation network. Of course, collaborative filtering and content-based filtering can also be combined. Recommender mechanisms can be used to limit the number of documents retrieved (e.g. because the system knows a given user is only interested in tax law and not in criminal law) or to increase the number of documents: by offering ‘more like this’ buttons or navigable citation graphs users can be supported in serendipitous information discovery (Toms 2000). Being tailored to the individual needs of the user, recommender systems can also be used for pro-active search: notification systems informing a user about information objects that have been added to the repository and might be of interest to him, because he explicitly expressed the wish to be informed about data with those specific characteristics, or because the system reaches this conclusion based on past search behaviour. Within legal information retrieval recommender systems have not yet received much attention (Boer and Winkels 2016; Winkels et al. 2014).
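A ‘more like this’ function based on content-based filtering can be sketched by comparing the sets of sources that documents cite. The use of the Jaccard coefficient as similarity measure, the data and the function names are assumptions for this illustration; the cited recommender studies may rely on different features and measures.

```python
# Hypothetical data: for each decision, the set of sources it cites.
CITED = {
    "dec1": {"art6", "art8", "caseA"},
    "dec2": {"art6", "art8", "caseB"},
    "dec3": {"art10"},
    "dec4": {"art8", "caseA", "caseB"},
}


def jaccard(a, b):
    """Size of the overlap of two sets relative to the size of their union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def more_like_this(doc_id, cited, top_n=2):
    """Recommend the documents whose cited sources overlap most with doc_id."""
    scored = [
        (jaccard(cited[doc_id], refs), other)
        for other, refs in cited.items()
        if other != doc_id
    ]
    # Keep only documents with some overlap; order by score, then by name.
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _score, other in scored[:top_n]]


recommendations = more_like_this("dec1", CITED)
```

Collaborative filtering would follow the same pattern, with the sets of cited sources replaced by the sets of users (or sessions) that consulted each document.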
Situational relevance
While cognitive relevance is associated with search task execution, situational relevance pertains to work task execution; the relevance of documents is measured by their usefulness for the task at hand, e.g. decision-making or problem-solving (Cosijn and Bothma 2005). “The judgement of situational relevance embraces not only the user’s evaluation of whether a given information object is capable of satisfying the information need, it offers also the potential of creating new knowledge which may motivate change in the decision maker’s cognitive structures. The change may further lead to a modification of the perception of the situation and the succeeding relevance judgement, and in an update of the information need.” (Borlund 2000). It should be noted that the system is not asked to solve the problem itself; in that case it would be a legal expert system, not a legal information retrieval system.
Situational relevance in legal information retrieval comes close to, but should not be confused with, ‘legal relevance’, which usually means that information is relevant to a proposition when it affects, positively or negatively, the probability that the proposition is true (Cross and Wilkins 1964; see footnote 19).
The difference between ‘legal relevance’ and situational relevance can be understood with the help of the following definition by Jon Bing:
A legal source is relevant if:
1. The argument of the user would have been different if the user did not have any knowledge of the source, i.e. at least one argument must be derived from the source; or
2. legal meta-norms require that the user considers whether the source belongs to category (1); or
3. the user himself deems it appropriate to consider whether the source belongs to category (1). (Bing 1991)
In this definition (1) pertains to the strict notion of ‘legal relevance’, while situational relevance in legal information retrieval also covers (2) and (3).
Probably because of the relative importance of case law in the United States and other common law countries, much LIR research has concentrated on finding the (most) relevant court decisions relating to a case at hand. This can be pursued using a variety of (sometimes combined) technologies, like argumentation mining (Mochales and Moens 2011) and natural language processing (NLP) (Maxwell and Schafer 2008).
Domain relevance
We defined ‘domain relevance’ as the relevance of information objects within the legal domain itself. It is independent from any information system and independent from any user request. As can be understood from the previous paragraph we prefer to avoid the term ‘legal relevance’, but ‘legal importance’ is safe to use as a synonym for ‘domain relevance within the legal domain’ (van Opijnen 2016b).
Domain relevance can be applied in LIR systems in different ways.
1. Legal importance of classes of information objects.
This concerns categories of information objects that can be classified as to their legal importance: constitutions outweigh ordinary acts, which in turn are more important than by-laws or ministerial decrees. In a comparable way, opinions of supreme courts can be expected to have more authority than district court verdicts, but are in turn surpassed by decisions of the European Court of Human Rights. For many categories of information objects their relative legal importance can be derived from basic metadata.
2. Legal importance of individual information objects.
The concept of domain relevance can be used to classify individual information objects as to their legal importance as well. In vast repositories, separating the wheat from the chaff has long been the territory of domain experts: as publication and storage were expensive, and adding documents was itself labour-intensive, a selection was made on the input side of any paper or early digital repository. The ease with which information can now be published on the internet has shifted the selection process, at least partially, from the input side to the output side: ‘selection’ has evolved from a publisher’s issue into a challenge for information retrieval. Case law publication in the Netherlands can serve as an illustration: the public case law database in the Netherlands (footnote 20) contains only a small percentage (<1%) of decided cases, but in 15 years it has accumulated 370,000 documents. More than 75% of those are not considered important enough to be published in legal magazines (van Opijnen 2014).
An example of domain relevance applied at the document level can be observed in the HUDOC database, containing all case law documents produced by the European Court of Human Rights. To aid the user in filtering the nearly 57,000 documents as to their legal authority, four importance levels have been introduced. Except for the highest category, containing all judgments published in the Court Reports, all documents have been tagged manually. Since this importance level is an attribute of each individual document, it can easily be used in combination with other relevance dimensions.
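How a per-document importance level can be combined with another relevance dimension may be sketched as follows. This is a hypothetical illustration only: it does not use the HUDOC interface, and the numeric coding (1 for the highest importance level, larger numbers for lower levels), the tuple layout and the `topical_score` values are assumptions made for the example.

```python
def filter_and_rank(hits, max_level=2):
    """Keep hits at or above a chosen importance level, then order them.

    hits: list of (doc_id, importance_level, topical_score) tuples, where
    importance_level 1 is assumed to be the highest level and
    topical_score is some query-dependent relevance score.
    """
    # Domain relevance acts as a filter on the result set.
    kept = [hit for hit in hits if hit[1] <= max_level]
    # Within the kept set, order by importance level first,
    # then by descending topical score, then by id for stability.
    return sorted(kept, key=lambda hit: (hit[1], -hit[2], hit[0]))


# Invented search results: (doc_id, importance_level, topical_score).
hits = [("a", 3, 0.9), ("b", 1, 0.4), ("c", 2, 0.8), ("d", 1, 0.7)]
ranked_ids = [doc_id for doc_id, _, _ in filter_and_rank(hits)]
print(ranked_ids)  # ['d', 'b', 'c']
```

Sorting strictly by level before topical score is only one possible design; a weighted combination of the two would let a highly topical but less authoritative document outrank an authoritative but marginal one.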
Since manual tagging is labour-intensive, for more massive repositories a computer-aided rating is indispensable. Given the abundant use of citations between court decisions, network analysis is an obvious methodology to assess case law authority (Fowler and Jeon 2008; Winkels et al. 2011). In the ‘Model for Automated Rating of Case law’ (van Opijnen 2013) the ‘legal crowd’ (the domain specialists who rate individual court decisions as to their importance by citing them or not) is extended to legal scholars, while the model also uses other variables within a regression analysis to predict the odds that a decision rendered today will be cited in the future. One of these variables is the changing perception over time of the importance of a single court decision [see e.g. also (Tarissan and Nollez-Goldbach 2015)]. If court decisions are well-structured and citations are made to the paragraph level, importance can be calculated for the sub-document level as well (Panagis and Šadl 2015). Comparable techniques can be used for the relevance classification of legislative documents (Mazzega et al. 2009).
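The basic idea of deriving authority from a citation network can be sketched with two generic measures: the number of incoming citations (in-degree) and PageRank. This is a generic illustration and not the method of any of the works cited above; the decision identifiers and the damping factor of 0.85 are assumptions made for the example.

```python
def in_degree(citations):
    """Count incoming citations per decision.

    citations: dict mapping a citing decision to the decisions it cites.
    """
    nodes = set(citations) | {t for ts in citations.values() for t in ts}
    counts = {node: 0 for node in nodes}
    for targets in citations.values():
        for target in set(targets):  # count each citing decision once
            counts[target] += 1
    return counts


def pagerank(citations, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a citation graph."""
    nodes = set(citations) | {t for ts in citations.values() for t in ts}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by decisions that cite nothing is spread evenly.
        dangling = sum(rank[v] for v in nodes if not citations.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n
                    for v in nodes}
        for source, targets in citations.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Invented toy network: A and B cite C, C cites D, D cites nothing.
citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
degrees = in_degree(citations)
ranks = pagerank(citations)
print(degrees["C"], ranks["C"] > ranks["A"])  # 2 True
```

In-degree treats every citation alike, whereas PageRank gives more weight to citations from decisions that are themselves frequently cited. A predictive model would add further variables, which this sketch does not attempt.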
Network analysis is supported by the use of common identifiers, like the European Legislation Identifier (ELI) (footnote 21), the European Case Law Identifier (ECLI) and, possibly in the future, a European Legal Doctrine Identifier (ELDI) (van Opijnen 2017) or a global standard for legal citations (footnote 22).
Apart from establishing the bare relationship between legal information objects as can be derived from citations, added value can be created by establishing and assessing the nature of the relationship. Shepard’s Citations (Spriggs and Hansford 2000) offers an example, but it is available only by subscription, and since the classification itself is done manually, large public datasets need automated solutions (Winkels et al. 2014).