Let me continue with a discussion of core topics in AI with the AI as Law perspective in mind. My focus is on reasoning, knowledge, learning and language.
Reasoning
First reasoning. I then indeed think of argumentation where arguments and counterarguments meet (van Eemeren et al. 2014; Atkinson et al. 2017; Baroni et al. 2018). This is connected to the idea of defeasibility, where arguments become defeated when attacked by a stronger counterargument. Argumentation has been used to address the deep and old puzzles of inconsistency, incomplete information and uncertainty.
Here is an example argument about the Dutch bike owner Mary whose bike is stolen (Fig. 8). The bike is bought by John, hence both have a claim to ownership—Mary as the original owner, John as the buyer. But in this case the conflict can be resolved as John bought the bike for the low price of 20 euros, indicating that he was not a bona fide buyer. At such a price, he could have known that the bike was stolen, hence he has no claim to ownership as the buyer, and Mary is the owner.
It is one achievement of the field of AI & Law that the logic of argumentation is by now well understood, so well that it can be implemented in argumentation diagramming software that applies the logic of argumentation, for instance the ArguMed software that I implemented long ago during my postdoc period in the Maastricht law school (Verheij 2003a, 2005).Footnote 20 It implements argumentation semantics of the stable kind in the sense of Dung’s abstract argumentation that was proposed some 25 years ago (Dung 1995), a turning point and a cornerstone in today’s understanding of argumentation, with many successes. Abstract argumentation also gave new puzzles such as the lack of standardization leading to all kinds of detailed comparative formal studies, and more fundamentally the multiple formal semantics puzzle. The stable, preferred, grounded and complete semantics were the four proposed by Dung (1995), quickly thereafter extended to 6 when the labeling-based stage and semi-stable semantics were proposed (Verheij 1996). But that was only the start because the field of computational argumentation was then still only emerging.
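To make the stable semantics concrete, here is a minimal sketch in Python of Dung-style abstract argumentation applied to the bike example. It is not the ArguMed or DefLog implementation: the three argument names and the attack relation are an assumed encoding of the example of Fig. 8, and the brute-force enumeration is only meant for tiny frameworks.

```python
from itertools import combinations


def stable_extensions(arguments, attacks):
    """Return all stable extensions of an abstract argumentation framework.

    A set is stable when it is conflict-free and attacks every argument
    outside the set. Brute force over all subsets, so tiny examples only.
    """
    ordered = sorted(arguments)
    found = []
    for size in range(len(ordered) + 1):
        for subset in combinations(ordered, size):
            candidate = set(subset)
            conflict_free = not any(
                (a, b) in attacks for a in candidate for b in candidate
            )
            attacks_all_outside = all(
                any((a, b) in attacks for a in candidate)
                for b in arguments - candidate
            )
            if conflict_free and attacks_all_outside:
                found.append(candidate)
    return found


# Assumed encoding of the bike example (names are hypothetical).
arguments = {"mary_owner", "john_owner", "not_bona_fide"}
attacks = {
    ("mary_owner", "john_owner"),     # Mary's claim conflicts with John's
    ("john_owner", "mary_owner"),     # and vice versa
    ("not_bona_fide", "john_owner"),  # the low price defeats John's claim
}
extensions = stable_extensions(arguments, attacks)

# The same framework without the bona fide argument.
fewer_arguments = arguments - {"not_bona_fide"}
fewer_attacks = {(a, b) for (a, b) in attacks if a in fewer_arguments}
unresolved = stable_extensions(fewer_arguments, fewer_attacks)
```

Under this encoding the only stable extension contains Mary's ownership claim together with the argument that John was not a bona fide buyer. Dropping that third argument leaves two stable extensions, one for each claimant, which is the unresolved conflict.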
For me, it was obvious that a different approach was needed when I discovered that after combining attack and support 11 different semantics were formally possible (Verheij 2003b), but practically almost all hardly relevant. No lawyer has to think about whether the applicable argumentation semantics is the semi-stable or the stage semantics.
One puzzle in the field is the following, here included after a discussion on the plane from Amsterdam to Montreal with Trevor Bench-Capon and Henry Prakken. A key idea underlying the original abstract argumentation paper is that derivation-like arguments can be abstracted from, allowing one to focus only on attack. I know that for many this idea has helped them in their work and understanding of argumentation. For me, this was, from rather early on, more a distraction than an advantage, as it introduced a separate, seemingly spurious layer. In the way that my PhD supervisor Jaap Hage put it: ‘those cloudy formal structures of yours’ (and Jaap referred to abstract graphs in the sense of Dung) have no grounding in how lawyers think. There is no separate category of supporting arguments to be abstracted from before considering attack; instead, in the law there are only reasons for and against conclusions that must be balanced. Those were the days when Jaap Hage was working on Reason-Based Logic (1997) and I was helping him (Verheij et al. 1998). In a sense, the ArguMed software based on the DefLog formalism was my answer to removing that redundant intermediate layer (still present in its precursor the Argue! system), while sticking to the important mathematical analysis of reinstatement uncovered by Dung (see Verheij 2003a, 2005). For background on the puzzle of combining support and attack, see (van Eemeren et al. 2014, Sect. 11.5.5).
But, as I said, from around the turn of the millennium I thought a new mathematical foundation was called for, and it took me years to arrive at something that really increased my understanding of argumentation: the case model formalism (Verheij 2017a, b), but that is not for now.
Knowledge
The second topic of AI to be discussed is knowledge, so prominent in AI and in law. I then think of material, semi-formal argumentation schemes such as the witness testimony scheme, or the scheme for practical reasoning, as for instance collected in the nice volume by Walton et al. (2008).
I also think of norms, in our community often studied with a Hohfeldian or deontic logic perspective on rights and obligations as a background.Footnote 21 And then there are the ontologies that can capture large amounts of knowledge in a systematic way.Footnote 22
One lesson that I have taken home from working in the domain of law (and again, don’t forget that I started in the field of mathematics, where things are thought of as neat and clean) is that in the world of law things are always more complex than you think. One could say that it is the business of law to find the exactly right level of complexity, and that is often just a bit more complex than one’s initial idea. And if things are not yet complex now, they can become so tomorrow. Remember the dynamics of theory construction that we saw earlier (Fig. 6).
Figure 9 (left) shows how in the law different categories of juristic facts are distinguished. Here juristic facts are the kind of facts that are legally relevant, that have legal consequences. They come in two kinds: acts with legal consequences, and bare juristic facts, where the latter are intentionless events such as being born, which still have legal consequences. And acts with legal consequences are divided into, on the one hand, juristic acts aimed at a legal consequence (such as contracting), and on the other, factual acts, where, although there is no legal intention, there are still legal consequences. Here the primary example is that of unlawful acts as discussed in tort law. I am still happy that I learnt this categorization of juristic facts in the Maastricht law school, as it has relevantly expanded my understanding of how things work in the world. And of how things should be done in AI. Definitely not purely logically or purely statistically, definitely with much attention for the specifics of a situation.
Figure 9 (right) shows another categorization, prepared with Jaap Hage, that shows how we then approached the core categories of things, or ‘individuals’, that should be distinguished when analyzing the law: states of affairs, events, rules, other individuals, and then the subcategories of event occurrences, rule validities and other states of affairs. And although such a categorization does have a hint of the baroqueness of Jorge Luis Borges’ animal taxonomy (that included those animals that belong to the emperor, mermaids and innumerable animals), the abstract core ontology helped us to analyze the relations between events, rules and states of affairs that play a role when signing a contract (Fig. 10). Indeed at first sight a complex picture. For now it suffices that at the top row there is the physical act of signing (say, when the pen is going over the paper to sign), and this physical act counts as engaging in a contractual bond (shown in the second row), which implies the undertaking of an obligation (third row), which in turn leads to a duty to perform an action (at the bottom row). Not a simple picture, but as said, in the law things are often more complex than expected, and typically for good, pragmatic reasons.
The core puzzle for our field and for AI generally that I would like to mention is that of commonsense knowledge. This remains an essential puzzle, also in these days of big data; also in these days of cognitive computing. Machines simply don’t have commonsense knowledge that is nearly good enough. A knowledgeable report in the Communications of the ACM explains that progress has been slow (Davis and Marcus 2015). It goes back to 2015, but please do not believe it when it is suggested that things are very different today. The commonsense knowledge problem remains a relevant and important research challenge indeed and I hope to see more of the big knowledge needed for serious AI & Law in the future. Only brave people have the chance to make real progress here, like the people in this room.
One example of what I think is an as yet underestimated cornerstone of commonsense knowledge is the role of globally coherent knowledge structures—such as the scenarios and cases we encounter in the law. Our current program chair Floris Bex took relevant steps to investigate scenario schemes and how they are hierarchically related, in the context of murder stories and crime investigation (Bex 2011).Footnote 23 Our field would benefit from more work like this, that goes back to the frames and scripts studied by people such as Roger Schank and Marvin Minsky.
My current favorite kind of knowledge representation uses the case models mentioned before. It has for instance been used to represent how an appellate court gradually constructs its hypotheses about a murder case on the basis of the evidence, gradually testing and selecting which scenario of what has happened to believe or not (Verheij 2019), and also to represent the temporal development of the relevance of past decisions in terms of the values they promote and demote (Verheij 2016).
Learning
Then we come to the topic of learning. It is the domain of statistical analysis that shows that certain judges are more prone to supporting democrat positions than others, and that, as we saw, is no longer allowed in France. It is the domain of open data, that allows public access to legal sources and in which our community has been very active (Biagioli et al. 2005; Francesconi and Passerini 2007; Francesconi et al. 2010a, b; Sartor et al. 2011; Athan et al. 2013). And it is the realm of neural networks, back in the days called perceptrons, now referred to as deep learning.
The core theme to be discussed here is the issue of how learning and the justification of outcomes go together or, to use a contemporary term, how to arrive at an explainable AI, an explainable machine learning. We have heard it discussed at all career levels, by young PhD students and by a Turing award winner.
The issue can be illustrated by a mock prediction machine for Dutch criminal courts. Imagine a button that you can push, that once you push it always gives the outcome that the suspect is guilty as charged. And thinking of the need to evaluate systems (Conrad and Zeleznikow 2015), this system has indeed been validated by the Dutch Central Bureau of Statistics, which has the data showing that this prediction machine is correct in 91 out of 100 cases (Fig. 11). The validating data shows that the imaginary prediction machine has become a bit less accurate in recent years, presumably because of changes in society, perhaps in part caused by the attention in the Netherlands for so-called dubious cases, or miscarriages of justice, which may have made judges a little more reluctant to decide for guilt. But still: 91% for this very simple machine is quite good. And as you know, all this says very little about how to decide for guilt or not.
How hard judicial prediction really is, also when using serious machine learning techniques, is shown by some recent examples. Katz et al. (2017) report that their US Supreme Court prediction machine could achieve a 70% accuracy. That is a mild improvement over the baseline of the historical majority outcome (to always affirm a previous decision), which is 60%, and an even milder one over the 10 year majority outcome, which is 67%. The system based its predictions on features such as judge identity, month, court of origin and issue, so modest results are not surprising.
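The push-button machine can be written down in a few lines. The sketch below is illustrative only: the 100 cases are synthetic, constructed to match the 91 out of 100 figure quoted above rather than taken from any actual data set, and the function names are hypothetical.

```python
def always_guilty(case_facts):
    """The mock prediction machine: ignores the case and answers 'guilty'."""
    return "guilty"


def accuracy(predict, labelled_cases):
    """Fraction of cases for which the prediction equals the recorded outcome."""
    correct = sum(
        1 for facts, outcome in labelled_cases if predict(facts) == outcome
    )
    return correct / len(labelled_cases)


# Synthetic stand-in data: 91 guilty outcomes and 9 other outcomes.
labelled_cases = [({"case_id": i}, "guilty") for i in range(91)]
labelled_cases += [({"case_id": i}, "not guilty") for i in range(91, 100)]

baseline_accuracy = accuracy(always_guilty, labelled_cases)
```

The predictor never looks at `case_facts`, which is the point: a high score on outcome alone carries no information about the reasons for a decision.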
In another study Aletras and colleagues (2016) studied European Court of Human Rights cases. They used n-grams and topics as the starting point of their training, and used a prepared dataset to make a cleaner baseline of 50% accuracy by random guessing. They reached 79% accuracy using the whole text, and noted that by only using the part where the factual circumstances are described already an accuracy of 73% is reached.
Naively taking the ratios of 70 over 60 and of 79 over 50, one sees that factors of 1.2 and of 1.6 improvement are relevant research outcomes, but practically modest. And more importantly these systems only focus on outcome, without saying anything about how to arrive at an outcome, or about for which reasons an outcome is warranted or not.
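The naive arithmetic can be checked with a short sketch, using only the percentages quoted above; the dictionary layout and helper name are assumptions made for illustration. Rounded to one decimal the two factors are the 1.2 and 1.6 mentioned in the text.

```python
# (reported accuracy, baseline accuracy), both in percent, as quoted above.
reported = {
    "Katz et al. (2017), historical majority baseline": (70, 60),
    "Katz et al. (2017), 10 year majority baseline": (70, 67),
    "Aletras et al. (2016), random guessing baseline": (79, 50),
}


def improvement_factor(accuracy, baseline):
    """Naive ratio of reported accuracy over baseline accuracy."""
    return accuracy / baseline


factors = {
    name: round(improvement_factor(acc, base), 2)
    for name, (acc, base) in reported.items()
}
point_gains = {name: acc - base for name, (acc, base) in reported.items()}

for name in reported:
    print(f"{name}: factor {factors[name]}, gain {point_gains[name]} points")
```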
And indeed and as said before learning is hard, especially in the domain of law.Footnote 24 I am still a fan of an old paper by Trevor Bench-Capon on neural networks and open texture (Bench-Capon 1993). In an artificially constructed example about welfare benefits, he included different kinds of constraints: boolean, categorical, numeric. For instance, women were allowed the benefit after 60, and men after 65. Trevor found that after training, the neural network could achieve a high overall performance, but with somewhat surprising underlying rationales. In Fig. 12, on the left, one can see that the condition starts to be relevant long before the ages of 60 and 65 and that the difference in gender is something like 15 years instead of 5. On the right, with a more focused training set using cases with only single failing conditions, the relevance started a bit later, but still too early, while the gender difference now indeed was 5 years.
What I have placed my bets on is the kind of hybrid cases and rules systems that for us in AI & Law are normal.Footnote 25 I now represent Dutch tort law in terms of case models validating rule-based arguments (Verheij 2017b) (cf. Fig. 13 below).
Language
Then language, the fourth and final topic of AI that I would like to discuss with you. Today the topic of language is closely connected to machine learning. I think of the labeling of natural language data to allow for training; I think of prediction such as by a search engine or chat application on a smartphone, and I think of argument mining, a relevant topic with strong roots in the field of AI & Law.
The study of natural language in AI, and in fact of AI itself, got a significant boost by IBM’s Watson system that won the Jeopardy! quiz show. For instance, Watson correctly recognized the description of ‘A 2-word phrase [that] means the power to take private property for public use’. That description refers to the typically legal concept of eminent domain, the situation in which a government expropriates property for public reasons, such as the construction of a highway or windmill park. Watson’s output showed that the legal concept scored 98%, but also ‘electric company’ and ‘capitalist economy’ were considered with 9% and 5% scores, respectively. Apparently Watson sees some kind of overlap between the legal concept of eminent domain, electric companies and capitalist economy, since 98+9+5 is more than 100 percent.
And IBM continued, as Watson was used as the basis for its debating technologies. In a 2014 demonstration,Footnote 26 the system is considering the sale of violent video games to minors. The video shows that the system finds reasons for and against banning the sale of such games to minors, for instance that most children who play violent games do not have problems, but that violent video games can increase children’s aggression. The video remains impressive, and for the field of computational argumentation that I am a member of it was somewhat discomforting that the researchers behind this system were then outsiders to the field.
The success of these natural language systems leads one to think about why they can do what they do. Do they really have an understanding of a complex sentence describing the legal concept of eminent domain; can they really digest newspaper articles and other online resources on violent video games?
These questions are especially relevant since in our field of AI & Law we have had the opportunity to follow research on argument mining from the start. Early and relevant research is by Raquel Mochales Palau and Sien Moens, who studied argument mining in a paper at the 2009 ICAIL conference (2009, 2011). And as already shown in that paper, it should not be considered an easy task to perform argument mining. Indeed the field has been making relevant and interesting progress, as also shown in research presented at this conference, but no one would claim the kind of natural language understanding needed for interpreting legal concepts or online debates.Footnote 27
So what then is the basis of apparent success? Is it simply because a big tech company can make a research investment that in academia one can only dream of? Certainly that is a part of what has been going on. But there is more to it than that, as can be appreciated by a small experiment I did, this time actually an implemented online system. It is what I ironically called Poor Man’s Watson,Footnote 28 which has been programmed without much deep natural language technology, just some simple regular expression scripts using online access to the Google search engine and Wikipedia. And indeed it turns out that the simple script can also recognize the concept of eminent domain: when one types ‘the power to take private property for public use’ the answer is ‘eminent domain’. The explanation for this remarkable result is that for some descriptions the correct Wikipedia page ends up high in the list of pages returned by Google, and that happens because we, the people, have been typing in good descriptions of those concepts in Wikipedia, and indeed Google can find these pages. Sometimes the results are spectacular, but they are also brittle, since seemingly small, irrelevant changes can quickly break this simple system.
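The idea of letting human-written descriptions do the work can be illustrated offline with a toy sketch. This is not the Poor Man’s Watson code and it does not query Google or Wikipedia: the three page descriptions below are invented stand-ins, and the word-overlap score is an assumed, deliberately shallow substitute for a search engine ranking.

```python
import re

# Hypothetical stand-in for online pages: title -> opening description.
# The descriptions are invented for illustration, not quoted from any site.
PAGES = {
    "eminent domain": (
        "Eminent domain is the power to take private property for public use."
    ),
    "electric company": (
        "An electric company generates and distributes electric power "
        "to the public."
    ),
    "capitalist economy": (
        "A capitalist economy is based on private ownership of property "
        "and production."
    ),
}


def words(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def best_title(description):
    """Return the title whose description shares most words with the query."""
    query = words(description)
    return max(PAGES, key=lambda title: len(query & words(PAGES[title])))


answer = best_title("the power to take private property for public use")
```

The sketch involves no language understanding at all. It works on this query only because someone has already typed a good description into the page, and small rewordings of the query can shift the overlap counts and change the answer.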
And for the debating technology something similar holds since there are web sites collecting pros and cons of societal debates. For instance, the web site procon.org has a page on the pros and cons of violent video games.Footnote 29 Arguments it has collected include ‘Pro 1: Playing violent video games causes more aggression, bullying, and fighting’ and ‘Con 1: Sales of violent video games have significantly increased while violent juvenile crime rates have significantly decreased’. The web site Kialo has similar collaboratively created lists.Footnote 30 Concerning the issue ‘Violent video games should be banned to curb school shootings’, it lists for instance the pro ‘Video games normalize violence, especially in the eyes of kids, and affect how they see and interact with the world’ and the con ‘School shootings are, primarily, the result of other factors that should be dealt with instead’.
Surely the existence of such lists typed in, in a structured way, by humans is a central basis for what debating technology can and cannot do. It is not a coincidence that—listening carefully to the reports—the examples used in marketing concern curated lists of topics. At the same time this does not take away the bravery of IBM and how strongly it has been stimulating the field of AI by its successful demos. And that also for IBM things are sometimes hard is shown by the report from February 2019 when IBM’s technology entered into a debate with a human debater, and this time lost.Footnote 31 But who knows what the future brings.
What I believe is needed is the development of an ever closer connection between complex knowledge representations and natural language explanations, as for instance in work by Charlotte Vlek on explaining Bayesian Networks (Vlek et al. 2016), which had nice connections to the work discussed by Jeroen Keppens yesterday (2019).