Like other approaches to evidence aggregation in the literature, Hunter and Williams’ proposal can be seen to accord with a witness model, but this in itself does little to settle the quality of the resulting inferences. We now consider in more detail whether their procedure is justified or fit for purpose. We address this issue not solely for Hunter and Williams’ proposal but for automated evidence aggregation procedures more generally: What is the appropriate extent of automation for an evidence aggregator intended to help facilitate policy recommendations in, say, a medical context?
The large question left open by the ‘witness schema’ (i.e., the question that goes beyond the bare inferential framework) is how to delineate the independent ‘witnesses’ and how to assess the nature and reliability of ‘their’ findings. Reliability, for instance, is a matter of both the quality and the relevance of the experiment or ‘witness’ given the hypotheses at hand. One must determine which features of individual pieces of evidence should inform this complex reliability assessment, and how exactly reliability should be determined on the basis of these features. Assessing the reliability of the witnesses is one of the key tasks automated in automated evidence aggregators, and it can be automated to varying extents. Recall that Hunter and Williams automate the reliability assessment via meta-arguments.
In Sect. 4.1 below we consider this question first from an ideal perspective, free from any constraints on computational resources. In effect, we consider better and worse treatments of (or inferences based on) a fixed body of evidence (where, absent constraints, more evidence is always better). Here we introduce our key idea of the ability to perform an adequate robustness analysis. Then in Sect. 4.2 we go on to consider the more realistic scenario where there may be constraints on computational resources. We frame this as a further trade-off, between nuanced analysis of evidence and volume of evidence. We argue that aggregators that can handle more evidence are not necessarily better; indeed, this trade-off is pivotal in assessing and comparing aggregators. There is, we will argue, no prerogative to aggregate the total evidence available if doing so would in fact reduce overall accuracy.
Assessing aggregators in an ideal setting: no computational constraints
Absent any computational constraints, the goodness of an automated evidence aggregator is entirely a matter of ideal performance: the aim is to make the best or wisest inferences possible for a class of cases (i.e., the type of evidence aggregation problem at hand), given all the available evidence. So we want the reasoning process to be as nuanced as possible, and the more of this nuance that can be captured by an explicit algorithm, and so automated, the better. A higher degree of automation is desirable, all else being equal, because it provides transparency, removes computational error, enhances speed, and facilitates analysis of the sensitivity of results to choices of parameter values, i.e., robustness analysis. In the ideal setting, then, any part of the reasoning process that can be made explicit in advance of seeing the particular evidence at hand should indeed be made explicit. The only reason to leave some aspects of the reasoning process as a black box is that there are aspects of the procedure for which it is less advantageous to try to specify the reasoning in advance; better to leave it to experts (assuming they are experts of average competence relative to the appropriate group) to interpret and weigh the particular evidence when it arises.
It is one thing to state the goal of automating all reasoning that can be made explicit without sacrificing nuanced analysis; it is quite another to make these extremely difficult judgments. We cannot hope to provide detailed advice on how to make such judgments in this paper. For one thing, much of the detail will depend on the type of policy task at hand and the kind of evidence available. Instead, we seek strategies that a practitioner may employ to approach or frame the question in a way that assists, if only modestly, in arriving at an answer. To put it differently: What general criterion provides the best avenue for assessing optimal automation?
The main strategy or criterion we propose for assessing the degree of automation of an evidence aggregator can be stated as follows: Does the automated aggregator allow one to conduct a robustness analysis that would yield a thorough and compelling survey of the possibility space? Of course, robustness analysis has already been mentioned above as a useful by-product of automation. We first elaborate this point (Sect. 4.1.1). Our novel proposal, however, is that this consequence of automation should serve as a focal point in algorithm design (Sect. 4.1.2). This is not just a matter of wise implementation; as mentioned, we are already assuming that the implementation of the algorithm in question allows parameter values to be changed by the user. It is rather a matter of settling on the explicit algorithm structure with an eye to whether the subsequent robustness analysis will serve as a reasonable survey of the possibility space for the type of aggregation problem at hand. In short, focussing on the ability to conduct an adequate robustness analysis directs one’s priorities to what really matters: away from the precise ‘dial settings’ of an aggregator, so to speak, and towards whether we have the right dials to begin with.
Before proceeding, let us first clarify a couple of terms concerning the structure of an automated evidence aggregator that will be central to our discussion in this section. First, there are input variables describing the evidence. Recall, for instance, Hunter and Williams’ evidence table: the columns are the input variables accounting for relevant features of the evidence, and each row—an individual piece of evidence—is effectively a vector of values for these input variables.
Second, there are the parameters of the aggregation function, which dictate how the values of the input variables bear on the assessment of each piece of evidence, and ultimately on the overall aggregation or final inference concerning the hypotheses in question. For instance, the aggregation function might include a parameter ‘threshold sample size’, used to gauge the quality of a piece of evidence with respect to sample size. Each parameter is associated with a range or set of possible values, and, by design of the implementation, a parameter can be set to any value within this range, depending on initial user input.
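For concreteness, the following is a minimal sketch in Python of these two structural components. All names and values here are our own illustrative placeholders, not Hunter and Williams’ notation.

```python
# The evidence database: each row is one piece of evidence, i.e. a vector
# of values for the input variables (the 'columns' of the evidence table).
evidence_table = [
    {"study_type": "RCT",           "significant": True,  "sample_size": 120, "effect": 0.4},
    {"study_type": "observational", "significant": True,  "sample_size": 800, "effect": -0.1},
    {"study_type": "observational", "significant": False, "sample_size": 60,  "effect": 0.3},
]

# Parameters of the aggregation function: a current value together with
# the range of admissible values from which the user may choose.
parameters = {
    "threshold_sample_size": {"value": 100, "range": (10, 10_000)},
    "beta":                  {"value": 2.0, "range": (1.0, 10.0)},
}
```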
Robustness analysis in its traditional role: as a useful upshot of automation
As a general characterisation, robustness analysis involves determining the stability of a result given changes in underlying assumptions. Two forms of robustness analysis are widely distinguished in the literature. Derivational (or inferential) robustness analysis looks at the stability of model derivations (or inferences) given changes in the model (or background) assumptions (Kuorikoski et al. 2010, p. 542).Footnote 16 Measurement robustness analysis looks at the stability of empirical results given changes in empirical modes of determination (such as different types of experiments) (Wimsatt 1981, p. 128).
Here we focus on derivational robustness analysis (DR), since it is a form of error analysis, i.e., a way of exploring the sensitivity of results to choices of parameter values. This role of DR is also called its heuristic function, as it allows a transparent and traceable way of dealing with unavoidable idiosyncratic choices in the construction of a model (here, an evidence aggregator) that arise from uncertainty and/or reasonable disagreement. For DR allows one to determine the relative importance of the various components of a model with respect to the output variable of interest (Kuorikoski et al. 2010, p. 543).
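To make this role concrete, here is a schematic rendering in Python of DR as a parameter sweep. This is our own sketch, not drawn from the cited literature: given any aggregation function and a grid of candidate parameter values, the aggregation is rerun under each setting and the outputs are collected, so that the stability of the inference can be inspected.

```python
import itertools

def robustness_sweep(aggregator, evidence, grid):
    """Rerun an aggregator over a grid of parameter settings.

    grid maps parameter names to lists of candidate values; aggregator is
    any function of the form aggregator(evidence, **params). The returned
    mapping, from settings to outputs, shows how stable the inference is.
    """
    names = list(grid)
    results = {}
    for values in itertools.product(*(grid[name] for name in names)):
        results[values] = aggregator(evidence, **dict(zip(names, values)))
    return results
```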
Automated evidence aggregators, given our assumption about wise implementation, are well set up for this kind of error analysis. Hunter and Williams make this point in relation to the meta-arguments in their aggregator: they note that their procedure allows “a form of sensitivity analysis” (Hunter and Williams 2012, p. 25) via the inclusion of different meta-arguments. In effect, the reliability of the evidence can be assessed in a range of ways, and the impact of these differing assessments on the resulting inferences can be monitored.
In other contexts too, sensitivity analysis is recommended as a way to keep track of, and explore the implications of, model choices that are subject to uncertainty or reasonable disagreement. For instance, Stegenga (2011, p. 498) points out that there are many choices of this kind in statistical meta-analysis:
Meta-analysis fails to constrain intersubjective assessments of hypotheses because numerous decisions must be made when performing a meta-analysis which allow wide latitude for subjective idiosyncrasies to influence the results of a meta-analysis. (ibid.)
Robustness analysis in its new role: as a central design criterion
For all the good of robustness analysis, one might regard it as a secondary issue when it comes to assessing the degree of automation of an evidence aggregator. Surely the primary issue is whether the aggregator facilitates roughly the best inferences possible given the available evidence; error analysis is a matter of extra detail. As suggested above, robustness analysis is indeed typically treated as an ex post analysis, a way to check what confidence one should have in a model result. Here, however, we defend a more central role for robustness analysis in the construction and assessment of an evidence aggregator. In short, the prospect of what robustness analysis can be performed focuses one’s attention on what really matters: the functional form and possible inputs of the evidence aggregator, rather than the precise parameter values featuring in it. In other words, the prospect of robustness analysis helps one to assess what parts of the inference process can be made explicit and transparent.
There are at least two reasons why focussing on the capacity for robustness analysis is helpful in making these judgments about algorithm design: (a) it allows one to recognise that certain types of uncertainty/error (regarding precise parameter values) do not compromise automation, since the impact of these uncertainties can be explored via the robustness analysis; and (b) it allows one to recognise that other types of uncertainty/error do in fact compromise automation: cases where there is not only low confidence in the ‘best-guess’ estimates, but also low confidence in the entire possibility space that robustness analysis would afford. In the latter case it is not clear whether the remedy is more or less automation, but robustness analysis can guide the deliberation process.
Let us clarify the latter deliberation process by appeal to an example. It helps to imagine the incremental development of an aggregator. The starting structure might be a very basic one, where the pieces of evidence are described in terms of two input variables: ‘type of study’, with possible values ‘randomised controlled trial (RCT)’ and ‘observational study’, and ‘statistical significance’, with possible values simply ‘yes’ and ‘no’. The logic of the reliability assessments might be along the following lines: only studies that are statistically significant have positive reliability (such that they are included in the aggregation), and amongst those, RCTs are given more weight according to a parameter beta; specifically, RCTs are given beta times the reliability weighting of observational studies. (Note that Hunter and Williams introduce a crude reliability judgment of this sort by considering meta-arguments that include/exclude evidence pieces based on whether results are statistically significant or not.) Now one might reflect on the prospective robustness analysis afforded by this aggregator. The possibility space will include inferences based on a range of values for beta. But this might be regarded as too limited and misleading a set of possibilities.
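A toy implementation of this basic aggregator, again our own sketch rather than Hunter and Williams’ code, might look as follows (reusing the evidence_table from above):

```python
def reliability(study, beta=2.0):
    # Only statistically significant studies receive positive reliability;
    # among those, RCTs are weighted beta times observational studies.
    if not study["significant"]:
        return 0.0
    return beta if study["study_type"] == "RCT" else 1.0

def aggregate(evidence, beta=2.0):
    # Reliability-weighted mean of reported effects; the sign of the output
    # is read as the overall verdict on the hypothesis.
    weights = [reliability(s, beta) for s in evidence]
    if sum(weights) == 0:
        raise ValueError("no admissible evidence after the significance screen")
    return sum(w * s["effect"] for w, s in zip(weights, evidence)) / sum(weights)
```

The prospective robustness analysis then amounts to sweeping beta, e.g. robustness_sweep(aggregate, evidence_table, {"beta": [1.0, 2.0, 5.0]}).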
For the above example aggregator, the possibility space afforded by robustness analysis might be deemed more adequate if the algorithm for making reliability judgments were either more or less detailed. A relatively straightforward innovation is to convert judgments that are currently implicit, but need not be, into explicit aspects of the algorithm. For instance, with regard to our example, the judgments of statistical significance could be spelled out more explicitly: this would involve substituting the p value of the study as the input variable, and then deriving whether the study is statistically significant according to a parameter alpha, such that the study is deemed statistically significant if its p value is less than alpha. The corresponding robustness analysis would then produce a possibility space that includes a range of values for alpha, which would presumably be more thorough.Footnote 17
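In terms of our sketch, the change is small but telling: the significance judgment moves out of the input data and into the explicit, parameterised algorithm (the evidence rows would now carry a p value rather than a yes/no entry):

```python
def reliability_v2(study, alpha=0.05, beta=2.0):
    # Significance is now derived from the study's p value and a
    # user-settable alpha, rather than supplied as a yes/no input variable.
    if study["p_value"] >= alpha:
        return 0.0
    return beta if study["study_type"] == "RCT" else 1.0
```

A robustness sweep can now range over alpha as well as beta, e.g. with the grid {"alpha": [0.01, 0.05, 0.1], "beta": [1.0, 2.0, 5.0]}.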
So much for the low-hanging fruit. The more difficult judgments concern aspects of the reasoning process where it is not clear whether more or less detail in the explicit algorithm would be better. Adding detail to the explicit algorithm is a good thing provided it tracks genuine nuance of reasoning; the alternative scenario, however, is that extra detail in the explicit algorithm systematically distorts the reasoning process, making it more rigid in a way that is not rectified by robustness analysis. Returning to our example, a key reason why the initial robustness analysis might be deemed inadequate is that the reliability weightings depend purely on ‘study type’ (amongst those studies that are statistically significant), and yet it might be thought that this is not the most pertinent grouping as far as quality of evidence is concerned. One possibility is to add further dimensions to this grouping: perhaps ‘sample size’ and a measure of the ‘relevance of experimental subject’, i.e., the closeness of the experimental group to the patient class at hand, could also be included as input variables and treated in the reliability function with reference to appropriate parameters. In this case, the robustness analysis would effectively survey the possibilities associated with changing the relative weights of these more fine-grained study groupings, which would potentially be a more adequate representation of the real space of possibilities.Footnote 18 On the other hand, it might be thought that this extra detail in the explicit algorithm would only make matters worse: that the corresponding robustness analysis would yield a possibility space that is even more misleading and would not include reasonable inferences. Perhaps the reliability assessments should instead be shifted entirely to implicit ‘on the fly’ expert reasoning rather than being explicitly coded. In this case, the evidence input variable would simply be ‘reliability of study’. Of course, this means one could not explore the possibility space so readily, but one might at least be happier with the ‘best guess’ estimate. Both options are sketched below.
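The two options might be rendered, purely illustratively, as follows; the particular functional form and numbers in the first function are arbitrary choices of ours, made only to show where the new parameters would sit:

```python
def reliability_v3(study, alpha=0.05, beta=2.0,
                   size_threshold=100, small_study_discount=0.5):
    # Finer-grained grouping: sample size and relevance of the experimental
    # subject also modulate the weight, each governed by a parameter.
    if study["p_value"] >= alpha:
        return 0.0
    base = beta if study["study_type"] == "RCT" else 1.0
    size_factor = 1.0 if study["sample_size"] >= size_threshold else small_study_discount
    return base * size_factor * study["relevance"]  # relevance in [0, 1]

def reliability_expert(study):
    # The opposite move: leave the assessment implicit, taking an
    # expert-supplied 'reliability of study' as the sole input variable.
    return study["reliability"]
```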
This example of course raises many further substantive questions. But we hope the central point is clear: it is far from obvious which aspects of an evidence aggregation process are best made explicit, and focussing on the prospects for robustness analysis provides some help, however modest, in settling this question.
Handling computational constraints: when is greater volume of evidence better?
Now we turn to the scenario where there are constraints on (or costs associated with) computational and other resources. In practice, this is always the case, and no doubt Hunter and Williams have resource constraints in mind. To start, the ongoing maintenance of the database of evidential inputs demands many person-hours for identifying and recording the relevant features of each study or piece of evidence. (Moreover, it may be that only some characteristics of a study are available in the first place.) Then, once the database is in hand, running the aggregator requires processing time: the algorithms for selecting relevant pieces of evidence, assessing their reliability, and aggregating them to arrive at a conclusion about the hypotheses in question all take time to execute. In addition, there may be costs associated with the end-user interpreting the methodology and inferential output of the aggregator (an important aspect of transparency). In short, evidence aggregation typically involves a variety of costs, most importantly in person-hours and computer processing time.
One might suppose that the assessment of evidence aggregators when resources are costly is a very messy business. It seems to call for some trade-off between inferential accuracy and efficiency. And epistemology alone cannot answer the question: To what extent should we ‘cut corners’ in assessing the total body of evidence so as to arrive at conclusions more quickly or with less expenditure of other resources?
We want to stress here, however, that this is not quite the right way to think about assessing the performance of an evidence aggregator that is subject to resource constraints. There is no getting around the difficult trade-off between reducing resource costs and increasing inferential accuracy.Footnote 19 But one must be careful in thinking about how to spend any given resource budget so as to best achieve accuracy. The above subtly presupposes that all the available evidence must be taken into account. Of course, it is natural to think that all the evidence should be considered and that a way to meet the resource constraints should be found in how the evidence is analysed, the reason being a core tenet of evidential logic: the view that inference should be based on all available evidence.Footnote 20 But that tenet refers to an ideal setting where all the available evidence is taken into account in a fully appropriate and nuanced way. There is no similar logical demand to attend to all available evidence in a practical setting where, due to resource constraints, this evidence cannot all be assessed in full detail. That would indeed be an odd requirement on an evidence aggregator: that just because some apparently relevant evidence has been tabled, it must influence the inference at hand, even if in a necessarily crude fashion.Footnote 21
Our claim is the following: The assessment of the optimal degree of automation for an evidence aggregator under circumstances of resource constraints (once the budget has been settled) is not so different from the assessment of an aggregator that is free from resource constraints. In both cases it is performance, i.e., quality of inference regarding the hypotheses at hand, that matters, and this is best assessed by focussing on the capacity for robustness analysis. In the context of resource constraints, the ‘principle of total evidence’ may be better honoured by processing a subset of evidence in more detail rather than a greater amount of evidence in lesser detail. That is the underlying balancing act, in any case: Are the available resources spent in the best way possible? Should the resource budget(s) be spent on greater nuance in the description and analysis of evidence, or rather on a greater volume of evidence?
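The balancing act can be put in deliberately stylised terms (every number below is hypothetical, chosen only to make the structure of the question vivid):

```python
BUDGET_HOURS = 100   # person-hours available for evidence processing
COST_NUANCED = 5     # hours to describe and assess one study in detail
COST_CRUDE   = 1     # hours for a coarse description and assessment

n_nuanced = BUDGET_HOURS // COST_NUANCED  # 20 studies, richly analysed
n_crude   = BUDGET_HOURS // COST_CRUDE    # 100 studies, crudely analysed
```

Nothing mandates choosing the larger sample: which option yields the more accurate inference is precisely the open question posed above.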
We do not deny that this adds yet a further dimension to the question of what is the optimal extent of automation for an aggregator. In the ideal case, it was simply a matter of which aspects of the reasoning process could sensibly be made explicit, a difficult enough judgment in itself. Resource constraints introduce a new complication: it may be that a less-than-ideal treatment of evidence enables a greater volume of evidence to be processed, in the interests of accuracy. This is a further balancing act, but one that may also be informed by the capacity of an evidence aggregator for robustness analysis. Let us simply note an extreme scenario that may shed light on the more difficult non-extreme cases. The worst case, so to speak, is an evidence aggregator for which we are not confident that any of the input evidence contributes to higher-quality inference about the hypotheses at hand, regardless of the precise values of key parameters. Here it is not even the case that, conditional on using the aggregator at hand, inferences based on more evidence (requiring more resources) are better. In the more ordinary and difficult cases, by contrast, all the aggregators under assessment will be ones for which more evidence (requiring more resources) permits better inferences using that aggregator. (Here the possibility space associated with each aggregator, owing to robustness analysis, is at least deemed adequate.) The further question in this case is which aggregator allows for the best-quality inferences given the resource budget at hand, where some aggregators process less evidence with greater nuance while others process more evidence with lesser nuance.