Blinded Inference: an Opportunity for Mathematical Modelers to Lead the Way in Research Reform
In a blinded inference study, researchers are asked to analyze condition-blinded datasets and make inferences about various aspects of the data generation process, such as whether or not a manipulation targeting some cognitive process varied across conditions. This procedure directly tests researchers’ ability to draw valid conclusions about underlying processes based on data patterns and assesses the extent to which they accurately report the level of uncertainty associated with their research conclusions. As such, blinded inference studies are a valuable tool in the effort to improve research practices. In this comment, we review three recent studies in the cognitive modeling literature to highlight the benefits of blinded inference, and we make recommendations for future blinded inference studies. We conclude by encouraging modelers to champion the blinded inference method as a fundamental component of effective psychological research.
Keywords: Blinded inference · Inference crisis · Replication crisis · Mathematical modeling
Experimental psychologists are experiencing a period of deep introspection about fundamental research methods. The target article for this issue highlights questions about whether “findings are as reliable as they need to be to build a useful and cumulative body of knowledge” (Lee et al. in press, p. 1). We agree that seeking ways to improve the reproducibility of published results is an important consideration; however, “building a useful and cumulative body of knowledge” requires not only reliable empirical results but also the ability to make valid inferences based on those results. In other words, a cumulative science demands that researchers can validly interpret the empirical results of their discipline to make sound generalizations about underlying mechanisms.
The target article aptly notes that this process of interpretation is a central goal in the practice of mathematical modeling. As such, modelers have unique expertise that is well suited to the task of assessing the validity of data interpretations, and they should be at the forefront of efforts to ensure that psychological research is characterized by effective interpretive techniques in addition to high-quality data. In this commentary, we will (1) briefly describe recent efforts to assess the ability of psychological researchers to properly interpret empirical results with a blinded inference procedure, (2) discuss how blinded inference studies can play a prominent role in attempts to reform psychological research, and (3) encourage mathematical modelers to play a leading role in promoting the blinded inference methodology.
Blinding procedures have a long history in scientific research. For example, the target article describes the practice of blinding the researcher to condition labels during data analysis as a strategy that reduces the “pull” towards desirable outcomes (e.g., MacCoun and Perlmutter 2015). Similarly, one may crowd-source the analysis of condition-blinded datasets to assess whether conclusions are robust to analytical choices made by different researchers (Silberzahn et al. 2018). In general, these approaches assume that the analyst knows the design of the experiment, including the specific manipulations involved, or in the case of observational data, understands the true source of the data; only the details of how the data are mapped to conditions are hidden. For obvious reasons, these blinding strategies are usually called “blinded analysis.”
Our specific conception of the “blinded inference” method was inspired by a recent project led by Gilles Dutilh and Chris Donkin (Dutilh et al. 2018) that we will discuss below. The defining feature of blinded inference is that analysts are not only blinded to which levels of a variable correspond to which conditions, but are also unaware of whether or not the variable actually changed between conditions. The researchers are asked to infer whether the data were created with or without a change in the variable, and their performance is scored based on their ability to specify the true state of affairs. Researchers should be able to make accurate inferences if they have an appropriate theoretical understanding of the psychological processes affected by the variable and have valid analytical tools for measuring those processes. Thus, the blinded inference procedure provides a strong test of some of the fundamental requirements of sound psychological research.
We believe that blinded inference studies can help researchers:

- Sharpen the link between theoretical constructs and experimental manipulations
- More quickly identify ineffective inference procedures and zero in on effective ones
- More precisely identify the appropriate level of confidence for their research conclusions
- Explore parameter/model recovery with masked details of data generation to better simulate actual research scenarios
We will also make two recommendations for the design of future blinded inference studies:

- The accuracy of contributors’ conclusions should be objectively verifiable; in other words, there should be an uncontroversial “answer key” defined in terms of experimental manipulations or observable variables rather than theoretical constructs
- Contributors should express their theoretical inferences as probability distributions over the set of possible conclusions
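To make the second recommendation concrete: a contributor's report for a single blinded dataset is a probability distribution over the possible conclusions, which can then be collapsed into a marginal probability for any factor of interest. A minimal sketch (the category labels and numbers are hypothetical):

```python
# Hypothetical report for one blinded dataset: one probability per
# possible conclusion about which factors were manipulated.
report = {
    "discriminability only": 0.55,
    "bias only": 0.10,
    "both": 0.25,
    "neither": 0.10,
}

# A valid report is a proper probability distribution over the conclusions.
assert abs(sum(report.values()) - 1.0) < 1e-9

# Marginal probability that discriminability was manipulated at all,
# collapsing across the bias categories.
p_discriminability = report["discriminability only"] + report["both"]
print(round(p_discriminability, 2))  # prints 0.8
```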
Generally, we hope to demonstrate that blinded inference studies are an indispensable tool for improving psychological research and rehabilitating its public image. As psychological researchers, we claim to be able to draw accurate conclusions about underlying processes by analyzing data. We have an obligation to put our inferential abilities to the test in an objective and public manner in order to validate these claims (or to advance the field by learning from our failures).
Dutilh et al. (2018): Inferring Processes with Response Time Models
Gilles Dutilh and Chris Donkin conducted an innovative blinded inference study for response time (RT) modelers, and the results serve as an instructive example of this paradigm’s potential for sharpening theoretical assumptions. They collected a large dataset from participants performing a standard decision task (random dot motion) with experimental manipulations of three of the four major processing components assumed by RT models. Specifically, participants received instructions that emphasized either accuracy or speed (targeting response caution parameters in the models), saw stimuli with either a high or low proportion of dots moving coherently (targeting evidence strength parameters), and completed blocks with different prior probabilities of seeing a stimulus with leftward or rightward motion (targeting response bias parameters). Non-decision time was the only major processing component that was not targeted by an experimental manipulation. Dutilh and Donkin created 14 pairs of conditions from various levels of these factors and sent the data to RT modelers (“contributors”) with the conditions labeled simply as A and B and very sparse details on the source of the data. Basically, contributors only knew that the data were from a decision task in which any of the four processing components mentioned above might or might not have been manipulated between the paired conditions. They did not know how the components were manipulated (although they were familiar with standard ways to manipulate them in the literature), and they were not informed that none of the datasets had a manipulation of non-decision time. Contributors were asked to model the data in any way they wanted and to indicate whether the two conditions varied in each of the four processing components.
Dutilh et al. (2018) found that different contributors were fairly consistent in their inferences, although there was some variability even among contributors who applied the same RT model. Contributors also tended to make correct inferences about the experimental factors (caution, strength, bias, and non-decision time), although the overall error rate was somewhere around 20–30%. For our purposes, the most interesting discovery was the reason why error rates were difficult to define exactly: In some cases, contributors disagreed about what constituted a correct answer. The most contentious element of the design was the manipulation of speed versus accuracy instructions. All contributors agreed that this variable should manipulate response caution, but some thought it should also be interpreted as a manipulation of evidence strength (see Rae et al. 2014; Starns et al. 2012) and/or non-decision time (see Rinkenauer et al. 2004). These disagreements necessitated different ad hoc scoring rules that produced different accuracy levels.
The Dutilh et al. (2018) results demonstrate how the blinded inference procedure can advance theory by revealing the shortcomings of available models as measurement tools, thereby encouraging theorists to sharpen the link between theoretical constructs and experimental manipulations. For example, imposing a speed emphasis and degrading the stimulus information are very different experimental manipulations, so it should be a little disconcerting to blithely assume that both of these variables impact parameters that represent the strength of evidence available from the stimulus. Our understanding of decision-making would likely be improved if we developed models that can more cleanly map these manipulations onto underlying processes, and these models would be likely to outperform current models in any blinded inference task that requires discriminating evidence strength and response caution manipulations. Certainly, models with a rigorous mapping of manipulations to processes are easier to test and are desirable by that virtue alone. Blinded inference studies can help modelers do a better job of “carving nature at the joints” by cleanly mapping experimental manipulations to theoretical constructs (Plato 1952 translation).
The interpretative difficulties encountered in the Dutilh et al. (2018) study motivate our first recommendation for future blinded inference studies: Contributors should make inferences about objective, observable aspects of the study. One way to achieve this goal is to ask contributors to make inferences about experimental manipulations themselves. For example, instead of indicating whether two conditions differ in terms of response caution and/or evidence strength, contributors could indicate whether or not the conditions reflect different levels of the experimental factors (i.e., speed vs. accuracy instructions; motion coherence). This approach leaves it up to the contributors to decide how to map experimental conditions onto psychological processes, meaning that successful performance depends on having both valid inference tools to measure psychological processes and accurate theories about the link between empirical factors and underlying processes. In our view, this is a core strength of the blinded inference paradigm: If researchers lack manipulations that map cleanly onto theoretical constructs or model parameters, then it is unclear how they could ever come to understand those constructs, much less measure them accurately.
In summary, the Dutilh et al. (2018) study took a creative approach to assessing cognitive models as inference tools, and the results suggested several ways that the blinded inference procedure could be improved in future studies. We now turn our attention to one of our own recent studies that was inspired by the Dutilh and Donkin project.
Starns et al. (in press): Inferring Manipulations of Discriminability and Response Bias
The Starns et al. (in press) study was similar to Dutilh et al. (2018), but we switched to a recognition memory paradigm and simplified the inference task by manipulating only two factors: evidence strength (i.e., memory discriminability) and response bias. Measures designed to distinguish these two factors are reported in hundreds, if not thousands, of recognition memory experiments, and classic measurement models targeting these factors have been in common use for over 60 years (see Macmillan and Creelman 2005). To manipulate memory discriminability, we presented words one, two, or three times on the study list. Response bias was manipulated with instructions to specifically avoid certain types of errors, and these instructions were reinforced with performance feedback at test. Consistent with our recommendation to focus on inferences that have an “answer key,” we asked our contributors to make inferences about experimental manipulations. Specifically, for each of seven two-condition experiments, contributors judged whether the conditions represented different levels of a memory discriminability manipulation, different levels of a response bias manipulation, both, or neither. Consistent with our second recommendation, they were also required to communicate their level of uncertainty by assigning a probability to each of those four possibilities. Although we obtained inferences about response bias to emphasize the fact that it could vary across conditions, we were mainly interested in each contributor’s ability to correctly specify whether or not memory discriminability was manipulated between conditions. Accordingly, we collapsed across the bias categories to yield the reported probability that each dataset was generated by an experiment with a discriminability manipulation.1
We see no possible favorable interpretation for this pattern of performance, which suggests that recognition researchers have failed to develop consistent methods for separating discriminability and bias. Notably, most of our contributors relied on inferential tools that were specifically developed to distinguish these processes and have been in common use for this purpose for decades (Macmillan and Creelman 2005). Thus, the results suggest that the normal processes of research and peer review are not sufficient for identifying problematic inference techniques. By providing explicit feedback on inference performance, an increased emphasis on blinded inference studies can facilitate the process of zeroing in on effective methods that are consistently applied across researchers.
In addition to being highly variable, our contributors’ inferences were generally less accurate than one would expect in relation to both their reported confidence level and the level of accuracy that could be achieved with a valid inference method. Across all researchers and experiments, our contributors correctly inferred whether or not memory discriminability was manipulated across conditions for 67% of the experiments that they analyzed, but they reported an 82% chance that their inferences were correct, on average. In other words, they generally underestimated the risk that they were making an error in their research conclusions. We used Brier scores (Brier 1950) as a joint measure of the extent to which our contributors could infer the true state of the discriminability factor and express a level of uncertainty appropriate for their true error rate. Although some contributors fared well by this metric, about a third fell below chance performance, indicating that they expressed higher confidence in incorrect conclusions than correct ones. We performed recovery simulations to define the level of performance that one would expect if the only source of inference errors was sampling variability in the data (and not ineffective analysis techniques). Specifically, we created two-condition datasets like the ones we sent to contributors by simulating them from a known model (the unequal variance signal detection model), calculated a measure of memory discriminability based on the data-generating model, and performed Bayesian t tests on the discriminability measures to define the probability that discriminability varied between the two conditions. These simulations showed that it is possible to make correct inferences 79% of the time for datasets like the ones we sent to contributors, well above the 67% accuracy that they actually achieved.
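The scoring logic can be sketched briefly (this is an illustration of the Brier score, not the actual analysis code from the study, and the contributor reports below are hypothetical): the Brier score is the mean squared difference between reported probabilities and binary outcomes, so always reporting 0.5 yields the chance-level score of 0.25, and higher scores are worse.

```python
def brier_score(reported_probs, outcomes):
    """Mean squared difference between reported probabilities and outcomes.

    reported_probs: probability assigned to "discriminability was manipulated"
        for each dataset; outcomes: 1 if it truly was, 0 otherwise.
    Lower is better; always reporting 0.5 gives the chance level of 0.25.
    """
    return sum((p - o) ** 2 for p, o in zip(reported_probs, outcomes)) / len(outcomes)

# Two hypothetical contributors judging the same four datasets
# (true answers: manipulated, not, manipulated, manipulated).
truth = [1, 0, 1, 1]
well_calibrated = brier_score([0.9, 0.2, 0.8, 0.7], truth)
overconfident = brier_score([1.0, 0.9, 0.1, 0.3], truth)

print(round(well_calibrated, 4))  # 0.045  (better than chance)
print(round(overconfident, 4))    # 0.5275 (worse than chance)
```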
The Starns et al. results provide a strong justification for both of our recommendations for future blinded inference studies. First, consider the recommendation that contributors should be required to express their level of certainty in their inferences in terms of a probability distribution. Without those probability estimates, Starns et al. would have been limited to stating that some researchers thought that discriminability was manipulated and others thought it was not, a fairly mundane conclusion. Collecting the probability estimates was essential to revealing the true depth of the inconsistency across contributors. We suspect that most people would be surprised to learn that two experts in the same field can look at the same data and draw opposite conclusions that they each report with 100% certainty. Moreover, having contributors make probabilistic inferences tests the extent to which they have an appropriate impression of the uncertainty associated with their research conclusions. Our results showed that researchers were overconfident in their inferences. Granted, some of our contributors might have adopted a loose interpretation of the probability scale, but public confidence in science hinges on researchers’ ability to report valid epistemic probabilities. Scientists are often called on to discuss issues that are somewhat murky (e.g., “Does exercise meaningfully protect against age-related memory declines?”), and probabilities provide a powerful language for communicating uncertainty. This is especially true if researchers in a field can point to a record of calibrated inference performance (e.g., a record in which they make mistakes 10% of the time when they report 90% confidence, etc.).
Next, consider our recommendation that contributors should be asked to make inferences about objectively verifiable aspects of the study. Contributors in the Starns et al. study knew that they were making inferences about empirical manipulations rather than psychological processes, and this practice eliminated disagreements about how the results should be scored. That said, in hindsight we acknowledge that we could have been even more direct with our contributors. Although we told contributors that they were inferring whether or not there was a discriminability manipulation, we did not identify the exact nature of this manipulation. This ambiguity had some unanticipated effects. For example, one contributor used the REM model (Shiffrin and Steyvers 1997), which assumes different mechanisms for different types of discriminability manipulations. Thus, we could have laid out a clearer task for contributors if we had asked something more along the lines of “what is the probability that the words in these two conditions were studied a different number of times?” We doubt that a more specific question would have meaningfully changed results in our particular study, but this example highlights the need for careful consideration of the range of relevant models and analytic tools when selecting the appropriate level of manipulation blinding.
Conducting this study was an enlightening experience that makes us optimistic about researchers’ commitment to improving their craft, and the results show that the blinded inference procedure can identify inference problems more effectively than standard research practices. Many of our contributors reported that taking part in the study changed how they will approach analyzing recognition memory experiments in the future, and it would be interesting to see if their inference performance improves in a follow-up study. We think it is likely that it would, and it is reasonable to expect that explicit feedback on inference performance will help researchers identify the best analysis practices and more appropriately calibrate their confidence levels to actual error rates. Certainly, early results indicate ample room for improvement on both of these fronts, even for common inference scenarios in mature fields of study.
Boehm et al. (2018): Crowd-Sourced Parameter/Model Recovery
Another recent blinded inference study demonstrates the flexibility of the approach. Boehm et al. (2018) simulated a number of datasets using the diffusion model (Ratcliff 1978) under a range of parameter settings. RT modelers were then asked to infer the parameter values used in the data generation process. The main goal of the study was to measure how accurately researchers could estimate the model’s parameters that define across-trial variability in evidence strength (drift rate), response bias (starting point), and non-decision time. Consistent with the prior recovery simulations of these parameters (Ratcliff and Tuerlinckx 2002), Boehm et al.’s contributors showed fairly high estimation error but also tended to report appropriately wide ranges of uncertainty.
The Boehm study represents an innovative approach to crowd-sourcing traditional parameter or model recovery simulations. We agree with the target article’s call for increased reliance on recovery simulations, and we acknowledge that achieving this goal will mostly involve promoting the traditional approach in which the same researcher generates the simulated datasets and fits them with the models of interest. However, taking a blinded, crowd-sourced approach does offer a unique perspective that will be useful for some purposes, such as surveying the analysis approaches that are currently popular among researchers or assessing whether researchers have an appropriate sensitivity to the degree of uncertainty produced by sampling variation even with a known generating model. More ambitiously, by hiding aspects of the data generation process from analysts, blinded inference techniques could be used to explore recovery performance in situations that better approximate real research scenarios. For example, studies of this sort could challenge researchers to perform model recovery without exact knowledge of the set of models that might have generated the data, or to determine whether conditions differ on some factor common to multiple models (e.g., evidence strength) in a dataset in which different simulated participants are described by different models.
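For contrast, the traditional (unblinded) recovery logic can be sketched in a few lines: draw true parameters, simulate data, re-estimate the parameters, and check how well the estimates track the truth. This toy example substitutes a simple Gaussian model for the diffusion model, so all names and numbers are purely illustrative.

```python
import random

def pearson(xs, ys):
    # Pearson correlation between two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(mu, sigma, n=200):
    # Generate one synthetic dataset from known parameters.
    return [random.gauss(mu, sigma) for _ in range(n)]

def fit(data):
    # Maximum-likelihood estimates for the toy Gaussian model.
    n = len(data)
    mu_hat = sum(data) / n
    var_hat = sum((x - mu_hat) ** 2 for x in data) / n
    return mu_hat, var_hat ** 0.5

random.seed(1)
true_mu, recovered_mu = [], []
for _ in range(50):
    mu, sigma = random.uniform(-1.0, 1.0), random.uniform(0.5, 1.5)
    est_mu, _ = fit(simulate(mu, sigma))
    true_mu.append(mu)
    recovered_mu.append(est_mu)

# Good recovery: estimated values closely track the true generating values.
print(f"true-vs-recovered correlation: {pearson(true_mu, recovered_mu):.2f}")
```

The same loop structure applies to any model; the hard part in practice is that `fit` for a diffusion model is far more expensive and less well behaved than for this toy case.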
A Note on Bayesian Methods
As supporters of Bayesian methods, we feel obligated to acknowledge an awkward finding of the studies reviewed above: None of these studies show any clear indication that the researchers who applied Bayesian methods made more valid inferences than the researchers who relied on traditional frequentist methods. We believe that Bayesian methods have the potential to convincingly outperform traditional approaches in these sorts of blinded inference studies, especially when performance depends on expressing an appropriate level of uncertainty in conclusions (as we think it always should). Perhaps the most attractive property of Bayesian methods is that they return explicit distributions of uncertainty; that is, they define what one can reasonably believe about what is likely to be true given the information available. In theory, this is exactly the information needed to perform well in a blinded inference study. In practice, though, the Brier scores reported by Starns et al. (in press) reveal a disappointingly weak relationship between these theoretically derived uncertainties and researcher performance. Perhaps the early state of development for Bayesian approaches in psychology is hampering performance, in which case future blinded inference studies could help researchers identify best practices for Bayesian analysis. Another potential challenge is that setting effective prior distributions could be more difficult in blinded inference studies than in normal research settings, so future studies should consider giving contributors as much information about the research design as possible without overly simplifying the inference scenario. Time will tell, but for now, we hope that Bayesian advocates rise to the challenge of outperforming traditional approaches in settings that require inferences about objectively verifiable manipulations that are unknown to the analyst.
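As one concrete illustration (not drawn from the studies above), a Bayesian contributor can turn a Bayes factor into exactly the kind of probability a blinded inference study asks for by converting posterior odds into a posterior probability; the Bayes factor value below is hypothetical.

```python
def posterior_prob(bf10, prior_prob=0.5):
    """Posterior probability that the factor was manipulated (H1), given a
    Bayes factor BF10 for H1 over H0 and a prior probability for H1.

    posterior odds = BF10 * prior odds; probability = odds / (1 + odds).
    """
    prior_odds = prior_prob / (1.0 - prior_prob)
    post_odds = bf10 * prior_odds
    return post_odds / (1.0 + post_odds)

# With equal prior odds, a Bayes factor of 4 in favor of a manipulation
# corresponds to an 80% posterior probability.
print(posterior_prob(4.0))  # prints 0.8
```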
Public Confidence in Science
One risk in any effort to improve scientific practice is that doing so may degrade public confidence in scientific conclusions. We believe that whether this is a positive or negative outcome depends on context. If the conclusions in question are based on methods that fail to produce accurate inferences in blinded tests, then the public should lose confidence in those results. For example, if a forensic technique proves to be substantially more error prone than previously believed, then degrading public confidence in the technique is a critical service that could prevent tragic outcomes like wrongful convictions. That said, we must acknowledge the risk that people will overgeneralize findings from problematic domains and needlessly devalue well-grounded scientific conclusions. Perhaps the best way to avoid this outcome is to demonstrate that we have ways to separate the good from the bad; that is, we can test researchers’ ability to make valid conclusions in an objective and public manner. Furthermore, we can rely on preregistration tools to ensure that disappointing results cannot remain hidden. Hopefully, this level of transparency will help to appropriately fortify public confidence in the research areas that deserve it.
We will close by noting our wholehearted agreement with the target article’s contention that “the test of the usefulness of a theory or model is whether it works in practical applications” (Lee et al. in press, p. 8). Models serve many roles, so we need to have many research strategies in our toolbox to find the ones that “work” in each situation. One critical role for many psychological models is the measurement of underlying processes, and the blinded inference paradigm is the most direct way to determine if researchers can perform this task. Blinded inference has many potential benefits, such as identifying the most effective analysis techniques, uncovering implicit knowledge about how to apply a model (e.g., how to decide which parameters should be freely estimated), and sharpening the link between theoretical constructs and observable variables. We hope that mathematical modelers will play a leading role in making blinded inference paradigms a standard part of psychological research.
- Boehm, U., Annis, J., Frank, M. J., Hawkins, G. E., Heathcote, A., Kellen, D., Krypotos, A.-M., Lerche, V., Logan, G. D., Palmeri, T. J., van Ravenzwaaij, D., Servant, M., Singmann, H., Starns, J. J., Voss, A., Wiecki, T., Matzke, D., & Wagenmakers, E.-J. (2018). Estimating across-trial variability parameters of the diffusion decision model: expert advice and recommendations. Journal of Mathematical Psychology, 87, 46–75. https://doi.org/10.1016/j.jmp.2018.09.004
- Dutilh, G., Annis, J., Brown, S. D., Cassey, P., Evans, N. J., Grasman, R. P. P. P., Hawkins, G. E., Heathcote, A., Holmes, W. R., Krypotos, A. M., Kupitz, C. N., Leite, F. P., Lerche, V., Lin, Y. S., Logan, G. D., Palmeri, T. J., Starns, J. J., Trueblood, J. S., van Maanen, L., van Ravenzwaaij, D., Vandekerckhove, J., Visser, I., Voss, A., White, C. N., Wiecki, T. V., Rieskamp, J., & Donkin, C. (2018). The quality of response time data inference: a blinded, collaborative assessment of the validity of cognitive models. Psychonomic Bulletin & Review, 1–19. https://doi.org/10.3758/s13423-017-1417-2.
- Lee, M. D., Criss, A., Devezer, B., Donkin, C., Etz, A., Leite, F. P., Matzke, D., Rouder, J. N., Trueblood, J. S., White, C. N., & Vandekerckhove, J. (in press). Robust modeling in cognitive science.
- Macmillan, N. A., & Creelman, C. D. (2005). Detection theory: A user's guide (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
- Plato. (1952). Plato’s Phaedrus. Cambridge: Cambridge University Press.
- Silberzahn, R., Uhlmann, E. L., Martin, D., Anselmi, P., Aust, F., Awtrey, E. C., … Nosek, B. A. (2018). Many analysts, one dataset: making transparent how variations in analytical choices affect results. Advances in Methods and Practices in Psychological Science, 1, 337–356. https://doi.org/10.17605/OSF.IO/QKWST
- Starns, J. J., Cataldo, A. M., Rotello, C. M., et al. (in press). Assessing theoretical conclusions with blinded inference to investigate a potential inference crisis. Advances in Methods and Practices in Psychological Science.