We have chosen the Cochrane Review as an initial case for developing the concept of a warranting device. The more general category of systematic reviews has previously been discussed by Hitchcock (2005, p. 386) as a justified warrant for deriving a clinical guideline from a body of research, distinguished from the use of the clinical guideline itself as a justified warrant for treatment recommended to a particular patient. Here we focus on how warrants of this kind come to be justified.
A Cochrane Review is a systematic method for synthesizing a body of medical research for the purpose of informing medical practice. It is a scientific technique, but one that does not involve generating new experimental findings. The aim of a Cochrane Review is to decide what can reasonably be inferred from a body of previously generated findings. Scientific findings come in diverse forms (quantitative/qualitative, experimental/correlational, etc.), but the preferred source of data for a Cochrane Review is a kind of experiment known as a Randomized Controlled Trial (RCT) designed to evaluate quantitatively whether a particular medical treatment is effective or to assess which of several alternative treatments is most effective.
Cochrane Reviews are named for Archie Cochrane, a Scottish doctor and epidemiologist who championed the use of medical experimentation for guidance of clinical practice (Cochrane 1972). Cochrane was not a founder of the Cochrane Collaboration, nor was he an inventor of the device. He died in 1988, about 5 years before the formation of the Cochrane Collaboration that has developed and disseminated the device. Cochrane’s personal contributions to medical reasoning were of a slightly different kind, relevant to the linking of RCTs to improvements in effectiveness and efficiency of health care.
In the five subsections that follow, we (1) briefly summarize some key technical advances that made the Cochrane Review possible; (2) describe the components of the device in its present state of development; (3) explore the work required to build and maintain the material components of the device; (4) discuss external social pressures on device design; and (5) reflect briefly on the present status of the device in resetting standards for reasoning about health. Case studies of other devices will be needed before the generality of our conclusions can be assessed, but each case study of an important new warranting device has independent value on a par with analysis of an individual argumentation scheme like argument from expert opinion.
Technical Threads Leading to the Device
Central to the concept of a warranting device is the idea that new inference tools can be invented. Inventions of all kinds typically take advantage of prior work that provides foundational ideas about how some problem might be solved. Medical science has as one of its characteristic inference problems the problem of finding causal relationships between medical interventions and health outcomes. The Cochrane Review is one invention in a series of other inventions that have attempted to solve aspects of this problem. The device combines technical ideas drawn from multiple sources, woven together into a novel way of achieving an objective: the formal aggregation of multiple pieces of scientific evidence into coherent conclusions about causation. Some of these foundational ideas are abstracted from prior methodological inventions within medicine and other scientific fields, while others are inspired by significant technological changes occurring with the rise of computing and information science. Several major technical threads have converged in the design of the Cochrane Review.
Countless large and small inventions over many centuries have contributed to a broadly accepted view of what is required to demonstrate causality in the context of human health and medicine (Bradford Hill 1965). Among these, one of the most important developments is the RCT, adopted within the twentieth century as the preferred form of evidence for claims about the effectiveness of medical interventions, including drugs. The RCT is itself a warranting device, built from a large number of smaller inventions, such as the “control group,” the random assignment of observational units to treatments, double-blind procedures, and others. RCTs appeared in adjacent fields (such as agriculture and psychology) decades before they became common in medical research, but were quickly appropriated into medical science. Boring’s (1954) account of the rise of control groups in biological and psychological research (first appearing in those fields in the late nineteenth and very early twentieth century) explains how the convergence of this innovation with early twentieth century developments in inferential statistics quickly elevated the control group to the status of an evidence standard for all forms of experimental work involving animals (human or otherwise). An important point to notice in the intertwined histories of RCTs and inferential statistics is that innovations in any one field can quickly diffuse into other fields, even if the issues belonging to the various fields are substantively different. For example, control groups were essential for fields where the experimental treatment involved anything learnable, but once they appeared, they were spontaneously adopted even in fields where experimental subjects could properly act as their own controls. 
Another familiar example is the spread of “split-plot” designs in agricultural experiments to logically equivalent designs in education and psychology, where what is “split” is something entirely different from a plot of land—such as a class of students—and where an alternative approach might easily have developed based on assignment of many intact classes to each treatment condition.
Completely independently of advances in causal inference, important changes were occurring in the management of print resources: books and journals. Organizing a large library means having some principle for deciding where a given item will be located, so that the item can be found again when wanted. Organizing a literature is a slightly different problem; any given physical collection might include only a portion of the literature, and no one method of physical placement can assure that items sharing an important commonality will be located together. Solutions to this problem began to appear in the nineteenth century, with proposals for classification systems for books as well as for finding aids such as indexes that could allow readers to locate materials through conceptual search rather than through physical browsing of library shelves. By the mid-twentieth century, these finding aids were transitioning from print resources published periodically to electronic resources that were, increasingly, automated. (We discuss one example, an indexing system known originally as MEDLARS, later in this paper.) By the second half of the twentieth century, both print and electronic publications were appearing with explicit information (keywords and other metadata) included to serve the purpose of indexing.
Both advances in causal inference and advances in the management of literature are needed to account for the emergence of a new scientific practice known as meta-analysis. With appropriately conducted experiments accumulating rapidly on many specific research topics, it became obvious that drawing conclusions about these topics meant looking not at individual research results but at bodies of work (at least partially) identifiable from indexes. In fits and starts, scholars in varied fields tried various strategies for research synthesis, including just tallying up the number of experiments supporting or failing to support a given hypothesis (later pejoratively described as the vote-counting method). But no later than mid-century, dissatisfaction with these methods prompted serious theoretical work on combining statistical information. (See, e.g., the informal histories offered by Glass 1976; Rosenthal 1984).
By the 1970s, the new methods proposed for statistical aggregation had become known collectively as meta-analysis. These methods were energetically advocated by a small number of behavioral scientists—and greeted with great suspicion by a much larger number of their colleagues. Motivated by the skepticism with which these new methods were regarded, Cooper and Rosenthal (1980) pressed the case for meta-analysis by conducting an experiment in which qualified reviewers were given a stack of studies and instructed either to review them using customary narrative procedures or to review them using supplied meta-analysis procedures. Those using meta-analysis procedures were, according to Cooper and Rosenthal’s interpretation, better able to judge the strength of evidence contained in the set of studies (they were less likely to see the studies as inconclusive). Although the Cooper and Rosenthal study does not provide particularly strong evidence for the validity of the meta-analysis procedures used at the time, it served the important rhetorical function of exposing unmistakable weaknesses in the narrative and interpretive methods that were, before meta-analysis, the state of the art for aggregation of scientific findings.
Meta-analysis had such pronounced argumentative advantages over narrative reviews that attention quickly shifted away from critiquing the core ideas of meta-analysis, and toward active effort to improve the practice of meta-analysis by building a body of technique and assembling associated resources. An important additional detail is that the rise of meta-analysis as a tool for synthesis fed back into practices of primary researchers. Since the value of meta-analysis is greatly amplified when primary research is conducted with meta-analysis in mind, editorial policies began shifting toward requiring the reporting of statistical information needed for later cross-study comparison. Within a surprisingly short time, meta-analysis became the preferred method for reviewing empirical research for a number of fields, including education, psychology, communication, and other social sciences, rapidly improving and stabilizing its procedures through pre- and post-publication peer review. (For a sense of the discourse surrounding the development and justification of these procedures, see Zwahlen et al. (2008) and Hedges (1986); there are many other such articles in other fields where meta-analysis has been appropriated.) The rise of meta-analysis did not just alter the way research synthesis is conducted, but also exposed facts about variability affecting the interpretation of individual studies (O’Keefe 1999).
Relatively late in this game, in 1989, a major 2-volume synthesis of research on pregnancy and childbirth appeared (Chalmers et al. 1989), with a foreword written by Cochrane praising the work as “a real milestone in the history of randomised trials and in the evaluation of care.” This was the first major systematic review in health science, a massive undertaking involving 10 years of effort to review over 3000 controlled trials published since 1950 (Young 1990). Very soon thereafter, in 1993, the Cochrane Collaboration (now known simply as Cochrane) was formed to support the production of similar reviews across a wide range of medical topics (see Bero and Rennie 1995 for a contemporaneous account), integrating the quantitative methods of meta-analysis wherever possible with a body of technique for locating all relevant evidence within a large and diffuse literature.
Considering Cochrane Review as an invention—that is, as a technical achievement—we can trace a large number of prior achievements that made this invention feasible. These include closely related advances in causal inference and proposed improvements in evidence synthesis, but also completely unrelated advances in humanity’s ability to manage an ever more massive legacy of prior writing.
At present, the work of Cochrane includes not just the production of reviews, but also the development of standards for proper conduct of the reviewing work, coordination of information resources, methodological innovation, and more. Although not all Cochrane Reviews employ meta-analysis, the device itself is designed to avoid the problems of traditional reviews that were so clearly exposed by meta-analysis, as we explain next.
Components of the Device
A Cochrane Review is a synthesis of evidence conducted using very well-defined procedures outlined in an official handbook (Higgins and Green 2011). These reviews assemble evidence that already exists in a clinically-relevant scientific literature, typically from RCTs of health interventions. The input to the review consists of evidence that nonspecialists (including journalists) would very likely consider to be inconclusive or even inconsistent—typically, a large number of individual studies whose separate conclusions about the effect of an intervention vary in size and even in direction of the effect. For an expert, the evidence, while variable, does not appear inconsistent. A Cochrane Review treats study-to-study variation in findings from multiple RCTs as normal and unremarkable, and reviewers draw inferences from this evidence in a highly disciplined way.
The Cochrane Review has already achieved the status of a trusted warranting device, largely because its procedures are so explicitly linked to critical questions on which earlier styles of research synthesis regularly failed. These procedures include exhaustive search for relevant studies; use of scoring rubrics for evaluation of the relevance and strength of evidence in each study; prescribed methods for combining information quantitatively; preferred methods for presentation of findings; and more. Each of these procedures addresses possible vulnerabilities in any review’s synthesis of evidence.
For instance, the exhaustive search and the requirement to include all discoverable relevant evidence are a defense against any charge of cherry-picking, even though it is understood that no method will guarantee capture of all potentially relevant references (Aagaard et al. 2016). In combination with material resources to be described in Sect. 4.3, the methodical search procedures required for a Cochrane Review make it hard for a critic to object that evidence was assembled to fit the reviewer’s own hypothesis. Reviewer bias is further minimized through highly structured procedures, defined in the official Cochrane Handbook (Higgins and Green 2011). Before a review is conducted, the reviewing protocol is first defined, based on standard methods. Reviews are required to have standard contents in pre-specified categories, and they must follow the handbook’s guidelines for data and analyses. Reporting is further standardized by a suite of software tools, including templates for report generation, that must be used in authoring Cochrane Reviews. Cochrane Reviews cannot be sponsored by commercial sources that have an interest in the outcomes of a review, and authors’ conflicts of interest, including work on studies that are synthesized, must be declared (Higgins and Green 2011, Sect. 2.6).
Counter-arguing individual studies (a once-common practice in narrative reviews of literature) is replaced with careful and explicit coding decisions applied impartially to the entire corpus of potentially relevant studies. The use of scoring rubrics for evaluation of the relevance and strength of evidence in each study ensures that researchers apply the same judgmental criteria to each study, rather than scrutinizing some results very critically while accepting others without scrutiny.
Against a charge that a synthesis of research is only as good as the body of primary research available for aggregation, the Cochrane community (said at present to include more than 37,000 contributors from over 130 countries) has adopted a formal practice of “grading” the strength of the evidence base itself (Guyatt et al. 2011; Balshem et al. 2011), to reduce the risk of implying that the conclusion best supported by a current body of evidence is also, on its own merits, a strong and dependable conclusion.
Prescribed methods for combining information quantitatively, when appropriate, ensure that all evidence is taken into account in a consistent manner. Meta-analysis can be used if there are a sufficient number of studies estimating the same effect—using designs similar enough to allow for consistent measurement of effect size. The standard way to describe these study results is a “forest plot” that allows readers to inspect results on a study-by-study basis (Higgins and Green 2011, Sect. 11.1). Figure 1 is an example of a forest plot, illustrating that although the information contained is quite technical, the visual display makes the results intelligible even to motivated non-experts. Specifically, the plot shows at a glance that studies in this review are not completely uniform in their results, and at the same time shows that the treatment (a form of sex education aimed only at promoting sexual abstinence) produces worse outcomes than control conditions in most of the experimental comparisons. Forest plots make it easy to spot outliers in a set of experiments, and they make it hard for anyone to push strong claims based on a single experiment that sits at either extreme of the distribution of results.
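To make concrete the kind of aggregation a forest plot summarizes, the following sketch pools several study estimates using inverse-variance (fixed-effect) weighting, one standard method for combining effect sizes. The study values here are invented for illustration, and the choice of fixed-effect rather than random-effects pooling is a simplifying assumption; nothing below is drawn from any actual review.

```python
import math

# Hypothetical study results as (log odds ratio, standard error) pairs.
# Negative values favor the treatment; values are invented for illustration.
studies = [(-0.30, 0.15), (-0.10, 0.20), (-0.45, 0.25), (0.05, 0.30)]

def fixed_effect_pool(results):
    """Inverse-variance (fixed-effect) pooling of study estimates.

    Each study is weighted by the inverse of its variance, so more
    precise studies count for more in the pooled estimate.
    """
    weights = [1.0 / se**2 for _, se in results]
    pooled = sum(w * est for (est, _), w in zip(results, weights)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    # 95% confidence interval for the pooled estimate
    return pooled, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

pooled, (lo, hi) = fixed_effect_pool(studies)
print(f"pooled log OR = {pooled:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Note what the pooled result shows that no single study does: although one of the four invented studies points in the opposite direction, the weighted combination (and its confidence interval) still excludes zero, which is exactly the kind of across-study conclusion a forest plot displays alongside the individual estimates.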
Systematic review methods are becoming trusted inferential tools, but they are still in a period of rapid methodological innovation, and this is likely to continue for some time. As these methods gain credibility among experts, additional changes may occur in the practice of primary research as researchers attempt to anticipate the use of their primary reports in various forms of aggregation. Other related changes may occur in the standards editors and article referees apply during prepublication peer review.
Construction of Material Components for this Device
The ability of a warranting device to function as a dependable inference rule may rest on material components that have to be assembled on purpose to support the rule. This is certainly true of the Cochrane Review. The most important material components of the Cochrane Review are large curated collections of prior work available to reviewers. Two databases—MEDLINE and CENTRAL—merit further examination as technological innovations that have themselves been created through extensive efforts to curate literature.
Procedures for managing and documenting the retrieval of prior work have grown increasingly detailed over time (Lefebvre et al. 2013), and have developed into a chapter of the Cochrane Handbook (Lefebvre et al. 2011) called “Searching for Studies” that is under the stewardship of the Cochrane Information Retrieval Methods Group. The chapter provides basic information about what to search for, the importance of searching in multiple sources, and search strategies and filters appropriate for the most common databases. Above all, authors are advised to consult a “Trials Search Co-ordinator” (the information specialist associated with their Cochrane Review Group) and/or a local health librarian early in the process. Computer-based search has greatly facilitated Cochrane Reviews, but even so, review authors are admonished to search in multiple sources, because no retrospectively constructed database can guarantee comprehensiveness. Reviewers are therefore expected to search a variety of sources, including MEDLINE, EMBASE, CENTRAL, and the review group’s Specialised Register, to identify every possibly relevant item, and to examine each item for whether it meets inclusion criteria. A typical Cochrane Review will identify thousands of potentially relevant items and winnow these to a few dozen studies that actually provide relevant data on the question the review is designed to answer.
These procedures assume reliance on material resources, some created by Cochrane, and some created and maintained by other trusted parties. MEDLINE, for instance, is a selective index to the medical literature that was developed as a byproduct of the MEDLARS project, to more efficiently produce the Index Medicus (a printed periodical started in 1879). Computer typesetting of this monthly printed guide at the US National Library of Medicine gradually changed the way the literature could be searched. Starting in 1964, searchers could request information by telephone, mail, or in-person visit; “trained search analysts would access the system for the designated information,” and the requestor could expect a bibliography in 3–6 weeks (Office of Technology Assessment 1982, p. 19).
Access methods have varied as information technology evolved, and eventually end-users were able to conduct searches without the help of intermediaries. Today any Web user can search MEDLINE online, or download selected search results or even the entire contents of the database. MEDLINE is a selective database of high-quality resources, and its contents have changed over time: a committee determines which journals to index, and journals can be removed as well as added. Retrospective data loads and digitization have added some records even from before 1964, and backfiles are no longer searched separately. The rate of change has also varied: starting in June 2014, new citations could be added to MEDLINE 7 days a week (U.S. National Library of Medicine MEDLINE FactSheet).
CENTRAL, the Cochrane Central Register of Controlled Trials, was created in 1993 because of a key problem with MEDLINE: Reports of RCTs could not be systematically identified by searching the database. In fact, one study found that about half of the available trials would be missed if MEDLINE were the only source searched, even when they were contained within the MEDLINE collection, because the indexing did not include any code to distinguish trials from other kinds of studies (Dickersin et al. 1994). Initially, a collaboration between Cochrane and the National Institutes of Health was launched to improve MEDLINE indexing, by tagging two Publication Types: RCTs and also Controlled Clinical Trials—trials that may have been randomized but were not explicitly described as such (Dickersin et al. 2002; Harlan 1993; Lefebvre et al. 2013). Cochrane’s carefully constructed search filters helped winnow likely RCTs from electronic searching (for an example, see the Appendix of Dickersin et al. 2002). In addition to electronic searching, individual Cochrane members used handsearching (page-by-page manual examination of journals and conference abstracts) to identify mentions of RCTs even in items not then typically indexed in MEDLINE. Cochrane’s wide geographic range facilitated extensive checking of non-English language sources. All of these materials were later used in CENTRAL, along with records culled from Elsevier’s EMBASE database (Dickersin et al. 2002).
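The indexing problem just described can be illustrated with a toy example: once records carry an explicit Publication Type tag, trial retrieval becomes an exact match on metadata, whereas an untagged collection forces searchers back onto fallible free-text heuristics. The records and field names below are invented for illustration; MEDLINE’s real records use richer MeSH-based indexing.

```python
# Invented records standing in for database entries; "pub_type" plays the
# role of the Publication Type tag added through the Cochrane/NIH effort.
records = [
    {"title": "A randomised trial of drug A vs placebo",
     "pub_type": "Randomized Controlled Trial"},
    {"title": "Drug A pharmacokinetics in healthy adults",
     "pub_type": "Journal Article"},
    {"title": "Random effects in observational cohorts",
     "pub_type": "Journal Article"},
]

def by_tag(recs):
    """Retrieve trials via the explicit Publication Type index."""
    wanted = {"Randomized Controlled Trial", "Controlled Clinical Trial"}
    return [r for r in recs if r["pub_type"] in wanted]

def by_text(recs):
    """Fallback free-text heuristic: both over- and under-inclusive."""
    return [r for r in recs if "random" in r["title"].lower()]

# Tagged search returns exactly the one trial; the text heuristic also
# sweeps in the observational study whose title merely mentions "random".
print(len(by_tag(records)), len(by_text(records)))
```

The text heuristic here produces a false positive; in the reverse direction, a trial whose title never uses the word “random” would be a false negative, which is why explicit indexing (supplemented by handsearching) mattered so much for building CENTRAL.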
Maintenance of CENTRAL is ongoing. Each month, new records are added, drawing on systematic searches of MEDLINE and EMBASE, handsearches of approximately 2400 journals, as well as materials added to the Specialised Registers maintained by over 50 Cochrane Review Groups (Cochrane Library, CENTRAL creation details). To aid in the time-consuming task of screening database records, in 2014 Cochrane introduced a citizen science project called Cochrane Crowd (Cochrane Crowd). Anyone can sign up for the RCT classification task. After completing a 20-item training set, volunteers are presented with titles and abstracts to classify as ‘RCT/CCT’, ‘Reject’, or ‘Unsure,’ and responses across volunteers are aggregated (Cochrane Crowd; Noel-Storr et al. 2015). Disagreement (which occurs for only 6% of titles presented to volunteers) escalates the case to a more experienced ‘resolver’; otherwise materials are directly added to CENTRAL (or discarded) once three volunteers agree. Validation studies (Noel-Storr et al. 2015) have found that the crowdsourcing procedures, including the escalation procedures for cases with disagreements, result in over 99% accuracy, as compared with the normal procedure previously followed (which used the reconciled judgments of pairs of Cochrane experts).
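The screening rule just described can be modeled in a few lines. This is a deliberately simplified sketch of the decision logic, not Cochrane Crowd’s actual implementation; the function name, the treatment of ‘Unsure’ votes as grounds for escalation, and the handling of incomplete vote sets are our own assumptions.

```python
def triage(votes):
    """Simplified model of the crowd screening rule: three agreeing
    volunteer votes decide a record, while any disagreement (or, in
    this sketch, any 'Unsure' vote) escalates it to an expert resolver.
    """
    decisive = [v for v in votes if v in ("RCT/CCT", "Reject")]
    if "Unsure" in votes or len(set(decisive)) > 1:
        return "escalate to resolver"
    if len(decisive) >= 3:
        return "add to CENTRAL" if decisive[0] == "RCT/CCT" else "discard"
    return "await more votes"

print(triage(["RCT/CCT", "RCT/CCT", "RCT/CCT"]))   # unanimous: include
print(triage(["Reject", "Reject", "Reject"]))      # unanimous: discard
print(triage(["RCT/CCT", "Reject"]))               # conflict: escalate
```

Even in this stripped-down form, the design point is visible: routine agreement is resolved cheaply by volunteers, and expert attention is reserved for the small fraction of records (about 6%, per the figures above) where judgments conflict.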
The point of all this effort is to provide in advance the strongest possible assurance, for any individual review, that nothing has been overlooked due to carelessness or personal bias. Instead of leaving the thoroughness of a search to the ingenuity and perseverance of individual searchers, the expert community as a whole invests in creating a repeatable and accountable method that can be presumed to result in as complete a collection of evidence as possible. Of course it is still possible for a search to be incomplete, but the fact that reviewers report exact details of search procedures (including the exact query strings used) means that any objection to the completeness of a search would also need to specify what more could have been done (for example, by showing that additional query strings returned relevant items that the original strings did not, or by showing that the database itself systematically excluded relevant items). Of special interest here for understanding warranting devices is the mobilization of a field’s effort around material requirements for the production of strong arguments. Because these material resources exist apart from any one context of use, they are far less subject to challenge on grounds of reviewer bias in the search for relevant evidence.
Although the Cochrane Review is relatively stabilized in the sense that it has become a trusted way of arriving at conclusions, its form is by no means static. On the contrary, Cochrane has 17 methods groups, charged with addressing various ways of strengthening the device. These groups are organized around topics of several kinds, including disciplines (information retrieval, statistics), sources of evidence (qualitative evidence, non-randomized studies, individual participant data, prognosis studies, diagnostic tests, patient-reported outcomes), evidence assessments (grading evidence, risk of bias), and policy applications (priority setting, economics, equity). The remaining groups are formed around kinds of evidence synthesis; three concern types of meta-analysis (prospective meta-analysis, individual participant data meta-analysis, and network meta-analysis).
In evaluating the credibility of a warranting device like the Cochrane Review, the due diligence exercised by expert practitioners is of central importance. This is especially so when non-experts must decide whether or not to trust the conclusions of experts (Jackson 2015a). We return to this point in Sects. 6 and 7.
Social Factors in Device Development
Warranting devices change mainly to overcome discovered weaknesses in the conclusions they support, but they may also change for other reasons, such as making an enterprise more efficient overall. This echoes a familiar finding in science and technology studies (following Pinch and Bijker 1984): technologies do not always develop so as to favor the technically superior option, but often settle on options that balance technical superiority against other values. Cochrane Reviews are but one style of research synthesis, and they compete with other technological concepts (including meta-analysis and narrative reviews). Reviews have been described as an ever-evolving ‘family’ (Moher et al. 2015) comprising numerous categories (Grant and Booth 2009). Despite the unquestioned rigor of the Cochrane methods, a Cochrane Review must compete (for expert adherents and for policy consumers) against other types of evidence synthesis, including other types of review.
One important challenge is how to do more in less time. Conducting a Cochrane Review is a labor-intensive process, typically taking a team of reviewers 1 to 2 years or more. Cochrane has formed a working group (the Cochrane Rapid Reviews Methods Group, formally established in October 2015; see Garritty et al. 2016) to develop methods for answering questions more quickly and better meeting policymakers’ needs, while maintaining Cochrane’s rigorous standards. Compared to systematic reviews, rapid reviews (RR) are faster to conduct (under 6 months and perhaps just weeks rather than years; see Khangura et al. 2012). At least 29 international organizations conduct rapid reviews, but there is no standard approach, and there is limited agreement as to which standardized methods should apply to rapid reviews (Polisena et al. 2015). As the Cochrane Rapid Reviews Methods Group has noted, “While RR producers must answer the time-sensitive needs of the healthcare decision-makers they serve, they must simultaneously ensure that the scientific imperative of methodological rigor is satisfied” (Garritty et al. 2016).
The importance of noting these non-logical and non-epistemic factors in device development is to acknowledge that any warranting device, being a human invention, may find itself in competition with other proposed warranting devices. Often, warranting devices take shape around multiple competing goals, sometimes involving compromise, for example trading off timeliness against tightness of argument. Both the strategies used to build confidence in a device and those used to evaluate acceptable tradeoffs between rigor and efficiency may provide insight into “warrant-establishing arguments” (anticipated but not adequately theorized by Toulmin).
Current Status of the Device
The work invested in making Cochrane Reviews more credible has been immense. It has involved not only accumulation of vast collections of scientific reports, but also production of metadata, development of new annotation systems, invention of search tools and strategies, and much more. This device replaces (and obsolesces) styles of literature review that were common until just a few decades ago—one-off arguments about a body of literature whose credibility was nearly always tied to the personal credibility of the individual reviewer. Perhaps most intriguing is how, in changing the way a community reasons with evidence, the device also shapes how new experimental evidence itself gets produced, presented, and assessed.