Afterword: data, knowledge, and e-discovery
- Cite this article as:
- Lewis, D.D. Artif Intell Law (2010) 18: 481. doi:10.1007/s10506-010-9101-0
Research in Artificial Intelligence (AI) and the Law has maintained an emphasis on knowledge representation and formal reasoning during a period when statistical, data-driven approaches have ascended to dominance within AI as a whole. Electronic discovery is a legal application area, with substantial commercial and research interest, where there are compelling arguments in favor of both empirical and knowledge-based approaches. We discuss the cases for both perspectives, as well as the opportunities for beneficial synergies.
Keywords: Electronically stored information, ESI, Automated reasoning, Pattern recognition, Categorization, Quality control
The computational linguist Ken Church has suggested that fashion in artificial intelligence is like a pendulum, swinging between Empiricism and Rationalism on a forty-year cycle (Church 2004). If that is the case, then the field of Artificial Intelligence and the Law (AI & Law) retains the imprint of its birth year. The first International Conference on Artificial Intelligence and Law (ICAIL) took place in 1987. By the end of that year the AI Winter (really a winter of Rationalism or, more precisely, of knowledge-based systems) had begun (Hendler 2008). Empirical, statistical, data-driven approaches, which had survived their own winter hunkered down in fields such as information retrieval (IR) and electrical engineering, reemerged into AI, and in 2010 are overwhelmingly dominant.
The situation in AI & Law, as reflected in this journal and at ICAIL, is quite different. Knowledge representation and logical inference are still the dominant perspective. Philosophers are cited without embarrassment. As I write this, “defeasible” gets 61 hits on the archives of this journal, and “Dirichlet” only one.
What then of e-discovery (aka eDiscovery, EDD, ED, E-Disclosure, and eDisclosure)? It has recently come to prominence in computer science through two major venues. One is the Legal Track (Baron et al. 2006; Oard et al. 2010) at TREC, the Text REtrieval Conference sponsored by the US National Institute of Standards and Technology and one of the pinnacles of Empiricism in IR and AI. The other is the DESI (Workshop on Supporting Search and Sensemaking for Electronically Stored Information in Discovery) series, two out of three of whose workshops have been collocated with none other than ICAIL (Ashley et al. 2008). As e-discovery grows in importance, which AI will it take after? I examine the case for both parents, and suggest that e-discovery will prosper through synergies between statistical and knowledge-based approaches.
2 E-discovery and Empiricism
Consider the empiricist brief first. E-discovery is an application area extremely well-suited for statistical IR and AI. The volumes of documents to be handled are large. We have learned a great deal about how to represent text for statistical processing. Core e-discovery tasks, including review for responsiveness, review for privilege, identification of topics, and entity extraction, can be framed as classification, a framework where statistical AI has had particular success. Financial costs of e-discovery are easily measured, making statistical utility maximization natural.
Further, an application area which takes for granted rooms of people manually classifying documents is a boon for supervised learning. The review necessary to annotate a large training set can be a substantial reduction of effort in comparison with usual practices. The penetration of workflow automation in e-discovery means that integrating machine learning, even sophisticated methods such as active learning, can be done with minimal change to the experience of e-discovery personnel.
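As a rough illustration of how active learning can slot into an existing review workflow, the sketch below uses uncertainty sampling, one common active learning strategy (the text does not say which strategy any given system uses). It assumes a classifier has already assigned each unreviewed document an estimated probability of responsiveness; the function name `select_for_review` and the document ids are hypothetical.

```python
from typing import Dict, List


def select_for_review(scores: Dict[str, float], batch_size: int) -> List[str]:
    """Return the ids of the documents the classifier is least sure about."""
    # Distance from 0.5 is used as a confidence measure; ties are broken
    # by document id so the selection is deterministic.
    ranked = sorted(scores, key=lambda doc_id: (abs(scores[doc_id] - 0.5), doc_id))
    return ranked[:batch_size]


# Hypothetical classifier outputs: estimated probability of responsiveness.
scores = {"doc-a": 0.95, "doc-b": 0.52, "doc-c": 0.10, "doc-d": 0.45}

# These documents would be routed to human reviewers next; their labels
# would then be added to the training set before the classifier is refit.
next_batch = select_for_review(scores, batch_size=2)
print(next_batch)
```

From the reviewer's point of view nothing changes: documents still arrive in a queue to be coded. Only the order in which they arrive is driven by the learner.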
These observations are hardly novel. Supervised learning methods are already in use by many e-discovery software vendors and service providers (Kershaw and Howie 2010). Several systems using machine learning are discussed in this special issue (Attfield and Blandford 2010; Hogan et al. 2010; Privault et al. 2010), and many have been reported on at the TREC and DESI meetings.
3 E-discovery and Rationalism
The case for knowledge-based approaches follows from the fundamental nature of the law. Law is about rules and formalized arguments. Ideally, its regularities are deterministic, not statistical. Actions are legal or illegal; decisions must have justifications from first principles. The non-delivery of a document in discovery can be the subject of a judicial ruling specifying, with finality, its correctness or incorrectness. True, law is written in ambiguous natural language, and interpreted not by theorem provers, but by people with emotions and biases and failings. Still, the rule-like character of the law has been and continues to be a powerful argument for knowledge-based approaches.
The legal context of e-discovery ensures that simple classification effectiveness cannot be our only desideratum. An attorney must be able to justify the decision to produce or not produce a document, and “the classifier said so” may not suffice. Supervised learners exploit whatever regularities lead to the highest effectiveness on training data, no matter how accidental or transient (e.g. the “dead Jesuits” of the MUC-3 evaluations (Lewis 1991)). There might be value in classification decisions produced by an inference process which looks less strange to a judge or jury, even if they were slightly less accurate.
Similar issues arise in negotiations with opposing counsel. It may be easier to reach an agreement that all documents satisfying certain concise formal criteria be turned over than an agreement on parameter settings for a learning algorithm. Today, these formal criteria are typically lists of keywords or simple Boolean queries, but richer query languages, formally defined concepts, and the like could be used.
Another argument for knowledge engineering is that the costs of classification errors can vary widely among documents. A classifier could have 99.9% precision and recall, yet miss all 37 documents which demonstrate the truth of opposing counsel’s central assertion. Such failure modes are possible, for instance, when the smoking gun documents are similar to each other, very different from other responsive documents, and/or present representational problems (e.g. scanned handwritten notes). Further, while most parties undertake discovery faithfully, the fact that a billion-parameter statistical classifier can easily be modified to omit particular documents, while keeping high overall effectiveness, means that classifiers need to be understood as well as used. Classifiers with a structure that reflects agreed-upon concept definitions may help with this.
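The arithmetic behind the 99.9% example can be made concrete. The sketch below looks only at recall and assumes, for illustration, that the 37 smoking gun documents are the only responsive documents the classifier misses. It then asks how large the responsive set must be for recall to stay at or above 99.9%. The function names are hypothetical.

```python
import math
from fractions import Fraction


def recall(responsive_total: int, missed: int) -> Fraction:
    """Fraction of responsive documents that the classifier retrieves."""
    return Fraction(responsive_total - missed, responsive_total)


def min_responsive_total(missed: int, target_recall: Fraction) -> int:
    """Smallest responsive-set size at which `missed` misses still meet the target."""
    # (R - missed) / R >= target   is equivalent to   R >= missed / (1 - target)
    return math.ceil(Fraction(missed) / (1 - target_recall))


# Exact rational arithmetic avoids floating-point rounding at the boundary.
smallest = min_responsive_total(37, Fraction(999, 1000))
print(smallest, float(recall(smallest, 37)))
```

Under these assumptions, any matter with at least that many responsive documents can report 99.9% recall while producing none of the documents that matter most, which is why an aggregate effectiveness figure alone says little about per-document risk.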
Finally, the most important documents in a case are inevitably few in number, if only because few documents can practicably be introduced into evidence at trial. Further, these documents may be atypical within the set of all relevant documents: “most relevant” can be very different from “most likely to be relevant”. Indeed, the really important documents may be ones that neither party is sure exist, and thus cannot be trained on.
Knowledge-based approaches are to the point here. One could, for instance, provide a formal definition, in terms of more primitive concepts, of interesting classes of documents, without knowing whether any instances exist. Those concepts can be defined in terms of others, and so on, until one bottoms out at observable features, or at classes susceptible to supervised learning. Such an approach may allow not only finding highly relevant documents, but demonstrating with some confidence that certain types of documents do not exist in a collection.
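A minimal sketch of this layering follows, assuming documents are simple records with a sender, a text body, and an optional score from a learned classifier. Every concept name, field name, and example document here is hypothetical. The point is only that higher-level concepts are built from lower-level ones until they reach observable features or a learned component.

```python
import re
from typing import Any, Callable, Dict

Document = Dict[str, Any]
Concept = Callable[[Document], bool]


def has_term(term: str) -> Concept:
    """Primitive concept: the document text contains the given word."""
    return lambda doc: term in re.findall(r"[a-z0-9]+", doc["text"].lower())


def sender_domain(domain: str) -> Concept:
    """Primitive concept: the sender's address is at the given domain."""
    return lambda doc: doc["sender"].lower().endswith("@" + domain)


def scored_above(score_key: str, threshold: float) -> Concept:
    """Concept backed by a learned classifier's stored score."""
    return lambda doc: doc.get(score_key, 0.0) >= threshold


def all_of(*parts: Concept) -> Concept:
    return lambda doc: all(part(doc) for part in parts)


def any_of(*parts: Concept) -> Concept:
    return lambda doc: any(part(doc) for part in parts)


# Higher-level concepts defined in terms of more primitive ones.
mentions_pricing = any_of(has_term("price"), has_term("pricing"))
from_competitor = sender_domain("rival.example")
price_coordination = all_of(
    mentions_pricing, from_competitor, scored_above("p_responsive", 0.5)
)

collection = [
    {"id": "d1", "sender": "alice@client.example",
     "text": "Pricing update for Q3.", "p_responsive": 0.8},
    {"id": "d2", "sender": "bob@rival.example",
     "text": "Can we align on price?", "p_responsive": 0.7},
    {"id": "d3", "sender": "carol@rival.example",
     "text": "Lunch on Friday?", "p_responsive": 0.1},
]
matches = [doc["id"] for doc in collection if price_coordination(doc)]
print(matches)
```

The definition of `price_coordination` can be written and agreed on before anyone knows whether a matching document exists. An empty result is evidence of absence only to the extent that the primitive features are reliable.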
The notion of defining categories for important aspects of a case is already used in e-discovery, though it is typically not approached from a formal inference perspective. 1 Knowledge engineering approaches to building text classifiers were popular in the heyday of expert systems (Vleduts-Stokolov 1987; Hayes and Weinstein 1990), and may reemerge in conjunction with today’s more powerful learning, inference, and language processing technologies.
Reuse of classifiers and concept definitions from case to case would be desirable. Doing so across clients (say of an e-discovery service provider) raises privacy issues. Knowledge-based approaches may have advantages in verifying that information is not leaking, though work on privacy-preserving data mining (Verykios et al. 2004) is also relevant.
4 Together again for the first time
While we have rhetorically opposed Empiricism and Rationalism here, the two approaches are often used together in practical text classification systems. Manually engineered features may be combined by supervised learning, and learned classifiers may be manually edited (Lewis and Sebastiani 2001). Practitioners are rarely purists.
The promise of combining the two approaches has never been brighter. Recent years have seen advances in combining knowledge representations with statistical inference and learning. The greatest excitement surrounds statistical relational learning (Getoor and Taskar 2007), i.e. the combination of probabilistic methods with first-order logic or representations of similar power. On the other hand, statistical variants on propositional logic (Bayes nets, inference nets, ...) are a more mature technology, and might handle most of the needs of e-discovery. After all, Boolean queries, the simplest application of propositional logic to IR, are still widely used. We might see a resurgence of the once-popular inference net approach to IR (Turtle 1995), this time focused not on effectiveness at ad hoc retrieval, but on supervised learning and explanatory capability.
Under Daubert, US federal courts weighing a scientific technique consider:
Whether it can be (and has been) tested,
Whether it has been subjected to peer review and publication,
The known or potential error rate,
The existence and maintenance of standards controlling the technique’s operation, and
Whether it is generally recognized in the scientific community.
The criterion of a known or potential error rate will likely loom large. In the past, e-discovery evaluation by courts has largely been qualitative, with a focus on which sources were searched rather than on the quality of that search. However, requests by courts for statistical sampling evaluations of e-discovery effectiveness have started to appear. 3
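One simple form such a sampling evaluation could take is sketched below. It assumes a simple random sample is drawn from the documents that were not produced, that reviewers count how many sampled documents are in fact responsive, and that a Wilson score interval is used to bound the underlying rate. The choice of interval, the function name, and the sample figures are illustrative assumptions, not a method prescribed by any court.

```python
import math
from statistics import NormalDist
from typing import Tuple


def wilson_interval(hits: int, sample_size: int,
                    confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a proportion estimated from a random sample."""
    if sample_size <= 0 or not 0 <= hits <= sample_size:
        raise ValueError("need 0 <= hits <= sample_size and sample_size > 0")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = hits / sample_size
    denom = 1 + z * z / sample_size
    center = (p + z * z / (2 * sample_size)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)
    ) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical figures: 3 responsive documents found in a random sample
# of 500 drawn from the unproduced set.
low, high = wilson_interval(hits=3, sample_size=500)
print(f"estimated rate of missed responsive documents: {low:.4f} to {high:.4f}")
```

Even a sample containing no responsive documents leaves a nonzero upper bound, so the interval states a potential error rate rather than certifying that nothing was missed.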
In the research community, quantitative evaluation on large data sets has been more common for machine learning systems than for those produced by knowledge engineering. This is partly the culture of the subfields, but also an issue of whether appropriate data sets are available. It is striking that at the 25th anniversary of Blair and Maron (1985), the classic study of search in e-discovery, we still have rather few concrete results on realistic e-discovery datasets (Oard et al. 2010). Research studies in operational e-discovery contexts are made difficult by privacy and intellectual property issues. But these problems are no more difficult than those faced and routinely addressed in, say, clinical trials in medicine. In a world drowning in data, it surely is possible, with appropriate support from funders, to produce more realistic data sets.
The emergence of e-discovery as a huge, expensive, messy problem to which AI has a range of solutions is good news for the field of AI & Law (if less wonderful for society as a whole). Statistical, data-driven methods are already finding great success, but I believe both Empiricism and Rationalism have large roles to play. In this special issue, both Conrad (2010) and Ashley and Bridewell (2010) reach not dissimilar conclusions, and at least two of the systems reported can be said to combine statistical and knowledge-based approaches (Attfield and Blandford 2010; Hogan et al. 2010). E-discovery problems are likely to attract attention from many quarters of the research community, particularly if more realistic data sets can be made available.
Combinations more exotic than a marriage of Rationalism and Empiricism are possible. The field of human computation (Chandrasekar et al. 2010) is upending traditional notions of automation and decision support. Combining automated and manual review is of course common in e-discovery, but typically the personnel are trained paralegals or lawyers. It is an interesting question what role less trained, but less expensive, crowdsourced labor might play in an overall e-discovery architecture.
1 As a separate issue, the term “concept search” is widely and ambiguously used in discussions of (and marketing of) e-discovery (see Sect. 3.2 of Oard et al. (2010)).
2 Daubert v. Merrell Dow Pharmaceuticals, Inc. 509 US 579 (1993)
3 Mt. Hawley Ins. Co. v. Felman Prod., Inc., 2010 WL 1990555 (S.D. W. Va. May 18, 2010)
Many thanks to Kevin Ashley for his helpful feedback. All responsibility for errors remains with me.