Cognitive control refers to the set of processes involved in deliberate regulation of information processing to facilitate goal-directed behavior (Miller & Cohen, 2001). Nearly a half-century of research in cognitive psychology has been devoted to the development of experimental task paradigms designed to investigate the processes involved in cognitive control (Posner & Snyder, 1975). Examples from this literature include the Stroop, Simon, flanker, stop-signal, cued task-switching, AX-CPT, and certain variants of the Sternberg item-recognition task. Although these tasks vary along a number of dimensions, one common element to them all is that they involve the utilization of task rules or prior contextual information to resolve response conflict (i.e., competition between task-relevant and automatic response tendencies). How fast and/or accurately the conflict is resolved has been treated as an indicator of cognitive control function. Most of the research in this literature has focused on detailed investigation of individual tasks and “benchmark findings” (e.g., the Stroop interference effect), as a means of testing theories and models regarding core mechanisms of cognitive control (Braem et al., 2019; Bugg, 2012; Kiesel et al., 2010; Verbruggen & Logan, 2009). However, more recent work has focused on the question of whether cognitive control can be considered a domain-general construct, with individuals varying systematically (i.e., in a trait-like fashion) in cognitive control functioning. This shift in the literature has prompted a focus on analyses and measurement of individual differences in cognitive control tasks and batteries (von Bastian et al., 2020).

The current study is situated relative to other recent attempts to investigate the measurement of individual differences in cognitive control function (Friedman & Miyake, 2017; Frischkorn et al., 2019; Paap & Sawi, 2016; Rey-Mermet et al., 2018; Whitehead et al., 2019). Specifically, we focus on one of the key issues that has become of recent interest and controversy within this literature; namely, whether it is fundamentally problematic to utilize classic cognitive control tasks, which were developed within the tradition of experimental psychology, to assess individual differences in control functions (Cooper et al., 2017; Hedge, Powell, & Sumner, 2018b; Tucker-Drob, 2011). The cognitive control tasks developed from the experimental tradition are popular because their effects replicate under a wide variety of research settings and task conditions. This success is largely attributable to a combination of low between-subject variance and high within-subject variance. Unfortunately, an individual differences approach thrives under the opposite conditions, i.e., high between-subject and low within-subject variance. As a result, when these tasks are used in individual differences research, the measures have often been found to be inconsistent and unreliable, a problem recently termed “the reliability paradox” (Hedge, Powell, & Sumner, 2018b; Kucina et al., 2022; Rey-Mermet et al., 2018; Rouder & Haaf, 2019).

Dual mechanisms of cognitive control

The development of the Dual Mechanisms of Cognitive Control (DMCC) project and task battery (Braver et al., 2021; Tang et al., 2022) was in part motivated by this paradox. A key distinguishing feature of the DMCC battery is that its tasks were specifically designed to test the Dual Mechanisms of Control theoretical framework, which decomposes cognitive control into two qualitatively distinct modes, proactive control and reactive control, that may reflect key dimensions of individual variation in control function (Braver, 2012; Braver et al., 2007). Proactive control refers to a sustained and anticipatory mode of control that is goal-directed, allowing individuals to actively and optimally configure processing resources prior to the onset of task demands. Reactive control, by contrast, involves a transient mode of control that is stimulus-driven, and relies upon retrieval of task goals and the rapid mobilization of processing resources following the onset of a cognitively demanding event (Braver, 2012; Braver et al., 2007). In other words, proactive control is preparatory in nature, while reactive control operates in a just-in-time manner. The DMCC task battery includes conditions that are designed to experimentally and independently bias subjects towards the use of proactive and reactive control modes.

In contrast to the selection of tasks used in prior work, we explicitly developed the DMCC task battery to more closely exemplify the integrated experimental/correlational approach first advocated by Cronbach (1957). As Cronbach (1957) articulated, experimental evidence is typically utilized to inform normative models of the structure and function of cognitive abilities, while correlational/differential data are used to investigate individual differences in those abilities and their role in real-world behavior. Ideally, the experimental and differential approaches inform each other, allowing for a theoretical framework that integrates different kinds of empirical evidence and accounts for inter-individual differences in terms of intra-individual psychological processes.

Experimental companion paper

We have conducted a systematic validation of the full task battery in terms of its behavioral characteristics. In the current paper, we focus on the psychometric properties of the task battery and its utility for individual differences analyses. In contrast, in a recently published companion paper (Tang et al., 2022), we focused on group effects, testing for dissociations between behavioral markers of proactive and reactive control. The experimental companion paper provides an extensive description of the tasks, manipulations and their rationale, and data-gathering procedures (additional rationale for the tasks is also provided in Braver et al., 2021). Rather than fully duplicating this information in the current paper, we report only pertinent methodological details, along with a slightly expanded description in Appendix 3. Interested readers are thus referred to Tang et al. (2022) or the Appendix for this information.

A key element of Tang et al. (2022) was to provide a comprehensive introduction to the DMCC battery and the associated dataset acquired with it, highlighting both its convergent (cross-task) and divergent (discriminant) validity. Tang et al. (2022) reported analyses demonstrating that dependent measures show both consistent proactive and reactive effects across tasks within the battery, with 20 out of 26 of the key theoretical predictions being confirmed. Specifically, in terms of convergent validity, the experimental manipulations were generally effective in producing group-level shifts in proactive control and reactive control in each task, suggesting consistent across-task sensitivity to changes in cognitive control demands due to the experimental manipulations. In terms of divergent validity, there were clear patterns of double dissociation, in that the behavioral markers of proactive and reactive control could effectively be distinguished in all tasks.

For the current paper, we utilize the DMCC battery as a vehicle from which to evaluate whether the cognitive control tasks included in the battery can measure individual differences reliably. As the DMCC battery utilizes theoretically motivated task manipulations, a critical question is whether such manipulations impact the tasks’ sensitivity to individual variation in performance. According to classical test theory, the proportion of variability that is specifically related to the construct of interest (in this case, cognitive control demand) is referred to as “true score variance” (Novick, 1966). Tasks that have high true-score variance are also expected to exhibit stronger reliability and validity (Chapman & Chapman, 1978). Interestingly, in prior work focusing on only one task in the DMCC battery, the AX-CPT, we demonstrated differential sensitivity to individual differences in working memory capacity in the proactive control mode, relative to baseline and reactive modes (Gonthier et al., 2016; Lin et al., 2022). This finding is consistent with the hypothesis that, by isolating proactive and reactive control modes within the DMCC battery, we have increased true-score variance in the task metrics of interest. Concretely, using Bayesian linear mixed-effects models, we found that AX-CPT measures theoretically linked to proactive control (A-cue bias, BX RT interference, d′-context) were selectively stronger in the proactive condition (i.e., the condition experimentally encouraging proactive control), even when statistically controlling for variance in the baseline and reactive conditions.

Nevertheless, it is generally accepted that the most rigorous approach to assessing sensitivity to individual differences is through a comprehensive analysis of psychometric reliability. Consequently, our goal for the current paper is to provide such a comprehensive analysis for the DMCC task battery. However, this type of analysis can be particularly complex within the domain of experimental tasks assessing cognitive control. To illustrate this complexity more fully, in the sections that follow, we briefly review the literature on individual differences in cognitive control, the approaches used to assess such individual differences, and the measurement challenges associated with the evaluation of task reliability in this domain.

Measuring individual differences in cognitive control

Individual differences in cognitive control are associated with several important real-world outcomes, including psychopathology (Snyder et al., 2015), impulsivity (Sharma et al., 2014), addiction (Hester & Garavan, 2004), and age-related cognitive decline (Hasher et al., 1991). The ability to engage cognitive control is strongly linked to working memory capacity, which is associated with a broad range of outcomes, including academic achievement (Alloway & Alloway, 2010; Gathercole et al., 2003), reading comprehension (Daneman & Carpenter, 1980), mathematical ability (Ramirez et al., 2013), and multi-tasking (Redick et al., 2016). Cognitive control plays an important role in contemporary theories of intelligence. By some accounts, cognitive control is considered to be the primary source of variance in overall cognitive ability (Engle & Kane, 2004; Kovacs & Conway, 2016).

Despite these established findings, a major concern in the field is that the tasks used to measure cognitive control often show poor reliability and weak correlational results. Recently, several research groups reported low task reliabilities and/or weak between-task correlations, especially with respect to tasks thought to index aspects of inhibitory control (Hedge, Powell, & Sumner, 2018b; Rey-Mermet et al., 2018; Stahl et al., 2014). For example, in the Hedge, Powell, and Sumner (2018b) study, test–retest reliabilities across seven classic experimental effects (e.g., Stroop, flanker) were surprisingly low, with a median of .40. Similarly, across multiple studies, the correlation between flanker (Eriksen & Eriksen, 1974) and Stroop (Stroop, 1935) effects was below .20 (Draheim et al., 2020; Gärtner & Strobel, 2019; Hedge, Powell, & Sumner, 2018b; Rey-Mermet et al., 2018). Based on these and other similarly dismal correlational results, Rey-Mermet et al. (2018) concluded, “we should perhaps stop thinking about inhibition as a general cognitive construct”.

A fundamental question raised by these findings is whether classic experimental tasks are suitable for examining individual differences (Tucker-Drob, 2011). As mentioned, experimental tasks are designed to maximize variance across conditions (within-subject variance) and minimize between-subject variance. This is clearly problematic for researchers interested in studying individual differences. Also, measures of cognitive control that are obtained from experimental tasks (e.g., the Stroop effect) are often based on difference scores (e.g., incongruent RT – congruent RT). This poses a further challenge, because the reliability of difference scores is constrained by the reliability of the two condition scores and is attenuated by the correlation between the two condition scores. As a result, difference score measures of cognitive control often suffer from low reliability (Cronbach & Furby, 1970; Hedge, Powell, & Sumner, 2018b). Finally, the correlation between any two measures of cognitive control (e.g., the Stroop and flanker effects) will be constrained by the amount of between-subject variance and the reliability of each measure, so conclusions drawn from correlational studies using experimental tasks may also be inconsistent and unreliable (Nunnally Jr., 1970; Parsons et al., 2019; Spearman, 1904). Thus, based on these reliability issues, it could be argued that the examination of relationships between individual difference measures extracted from experimental tasks (i.e., between-task relationships) may be highly problematic in a foundational way (Spearman, 1910).
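For reference, both of these constraints can be written in standard classical test theory form (a sketch assuming equal variances for the two condition scores): the reliability of a difference score D = X − Y, and Spearman’s correction for attenuation, are

$$r_{DD^{\prime}}=\frac{\frac{1}{2}\left(r_{XX^{\prime}}+r_{YY^{\prime}}\right)-r_{XY}}{1-r_{XY}},\qquad r_{\textrm{obs}}=r_{\textrm{true}}\sqrt{r_{XX^{\prime}}\,r_{YY^{\prime}}}$$

where $r_{XX^{\prime}}$ and $r_{YY^{\prime}}$ are the reliabilities of the two condition scores, $r_{XY}$ is their intercorrelation, and $r_{\textrm{obs}}$ is the correlation that would be observed between two measures whose true correlation is $r_{\textrm{true}}$. As $r_{XY}$ approaches the average of the two condition reliabilities, the difference score reliability approaches zero, and any observed between-task correlation is bounded by $\sqrt{r_{XX^{\prime}}\,r_{YY^{\prime}}}$.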

The measurement and reporting of reliability

Definitional confusion

In addition to the concerns regarding the measurement of individual differences in experimental tasks, there are numerous issues related to the measurement and reporting of reliability itself. One of the most important issues is that reliability is actually only infrequently reported in cognitive experimental research (Parsons et al., 2019). As described above, part of the reason may be that experimental researchers often have less fluency and familiarity with psychometric issues, including confusion regarding the technical meaning of reliability as it is used in psychometrics. A potential source of confusion is that the term “reliable” has different meanings in experimental versus correlational psychology. An experimental manipulation is “reliable” when the intended effect is replicated across multiple studies (in different labs, with different stimuli, etc.). In contrast, an individual differences measure is considered “reliable” when it consistently gives similar rankings for individuals. This lack of concern regarding psychometric reliability may be one of the reasons it has not typically been considered as a source of poor correlational results (Flake et al., 2017; Hussey & Hughes, 2020). Moreover, based on this confusion, some results may have been erroneously reported as replicable and generalizable, perhaps propagating false standards in the field (e.g., contributing to the replication crisis).

Problems with reporting reliability: Internal consistency

A second and more fundamental issue is that there is currently no gold-standard procedure for estimating reliability, particularly for experimental tasks (Parsons et al., 2019). Consequently, even when reliability is reported for these tasks, it is not always clearly communicated which estimation approach was utilized, which can lead to erroneous assumptions regarding the reliability of a particular experimental measure. Relatedly, although many statistical software packages supply functionality for computing reliability, these packages assume that the data conform to analysis-specific assumptions which may not be valid for common experimental tasks and measures. An illuminating example can be seen in the case of Cronbach’s alpha, a measure of internal consistency, which is probably the most common and well-known index of reliability. Alpha is commonly derived by averaging the correlations between each item (trial) and the sum of the remaining items (trials). The default method offered in statistical software packages calculates alpha based on the assumption that the items, and the order of the items, are identical for all subjects. Furthermore, it is assumed that each item measures the same underlying construct, to varying degrees, as a function of item difficulty and discriminability. In survey research, this is often the case. However, in cognitive-behavioral tasks, trial order is often random. More concerning, the cognitive processes involved in task performance may change across trials, as a function of practice, fatigue, sequential effects, or strategy development/deployment. If these issues are ignored, which is typically the case, then reliability estimates may not be accurate or valid. Hence, Cronbach’s alpha is unsuitable for tasks designed to measure individual differences in cognitive control.

There are other issues with the use of Cronbach’s alpha as a measure of split-half reliability. Formally, if the assumptions above hold, Cronbach’s alpha is identical to the average of all correlations between two halves of the data. In practice, however, split-half reliability is most commonly calculated by splitting the data – once – into the first and second half or into even- and odd-numbered trials, and computing the correlation between the two resulting scores. It has been demonstrated that split-half reliabilities based on these kinds of simple splits are unstable: Enock et al. (2012) showed that the estimates vary depending on which trials are used in the partitioning. They recommend applying multiple random splits to the data to generate multiple split-half reliability estimates and then taking the average of all split-half estimates as the overall reliability estimate (Enock et al., 2012; Parsons et al., 2019). This permutation-based method for calculating split-half reliability approximates Cronbach’s alpha (Cronbach, 1951), while avoiding the pitfalls described above. A further important issue is that splitting the number of observations in half leads to underestimation of reliability, because reliability increases with the number of observations. The Spearman–Brown (prophecy) formula can be applied to correct for this underestimation (corrected reliability = [2 × reliability] / [1 + reliability]), yet this correction is not well known or frequently utilized.
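To make the procedure concrete, the following is a minimal R sketch of a permutation-based split-half estimate with the Spearman–Brown correction (the column names `subject` and `score` are placeholders for trial-level data from any single measure):

```r
# Permutation-based split-half reliability with Spearman-Brown correction.
# Input: trial-level data frame with columns `subject` and `score` (e.g., RT on
# correct trials of one condition). Each permutation randomly splits every
# subject's trials into two halves, correlates the half-scores across subjects,
# and corrects the correlation; the average over permutations is the estimate.
split_half_perm <- function(data, n_perm = 5000) {
  corrected <- replicate(n_perm, {
    halves <- lapply(split(data$score, data$subject), function(s) {
      idx <- sample(seq_along(s), floor(length(s) / 2))
      c(mean(s[idx]), mean(s[-idx]))
    })
    h <- do.call(rbind, halves)
    r <- cor(h[, 1], h[, 2])
    (2 * r) / (1 + r)  # Spearman-Brown prophecy formula
  })
  mean(corrected)
}
```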

Problems with reporting reliability: Test–retest

A third important issue is that internal consistency reliability is not the same as test–retest reliability. Test–retest reliability can be assessed when the same individuals complete the same test on two or more assessment occasions. Test–retest reliability indices estimate the degree to which the measure provides stable rankings of individuals across time. The most well-established index of test–retest reliability is the intraclass correlation coefficient (ICC), which indicates how consistently the measurements rank-order the subjects. However, one of the complexities of the ICC, which has also created some confusion in its usage, is that there are ten distinct forms available (McGraw & Wong, 1996), yet only two forms are particularly pertinent for measures from cognitive experimental tasks (for a more in-depth discussion, see Koo & Li, 2016).

A critical distinction in the use of ICC estimates is whether reliability is based on consistency or on absolute agreement between the two measurements. A consistency relationship is not affected by systematic changes (e.g., practice effects, learning between measurements); only the consistency of the rank order is assessed. An absolute agreement relationship is one in which the two measurements are expected to be identical in rank order and in value (e.g., session mean); in other words, this relationship is affected by systematic differences. For example, the two measurements {1, 2, 3} and {4, 5, 6} have a perfectly consistent relationship (ICC(3,1) = 1.00), but are far from absolute agreement (ICC(2,1) = .18). Thus, the type of relationship expected is a critical consideration when deciding which form of ICC to use when calculating test–retest reliability for cognitive-behavioral measures. If the researcher expects systematic differences between measurement occasions (e.g., practice effects), then the preferred form of ICC is the type termed ICC(3,1) in the standard terminological conventions developed by Shrout and Fleiss (1979). Conversely, if systematic differences between occasions should be considered problematic for the reliability of a measure, then the ICC(2,1) type should be selected. Importantly, it is necessary for the researcher to explicitly specify which type of ICC was used, and the rationale for its selection, so that no ambiguity exists with regard to interpretation.
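The distinction can be verified directly from the two-way ANOVA mean squares underlying the Shrout and Fleiss (1979) formulas; the following R sketch reproduces the toy example above (the toy data are the only input):

```r
# Consistency vs. absolute agreement for the toy example in the text.
# Rows = subjects, columns = measurement occasions (test, retest).
x <- cbind(test = c(1, 2, 3), retest = c(4, 5, 6))
n <- nrow(x)  # subjects
k <- ncol(x)  # occasions

grand    <- mean(x)
ss_total <- sum((x - grand)^2)
ss_sub   <- k * sum((rowMeans(x) - grand)^2)
ss_occ   <- n * sum((colMeans(x) - grand)^2)
ms_sub   <- ss_sub / (n - 1)                              # between-subjects mean square
ms_occ   <- ss_occ / (k - 1)                              # between-occasions mean square
ms_err   <- (ss_total - ss_sub - ss_occ) / ((n - 1) * (k - 1))

# ICC(3,1): consistency; the constant shift between occasions is ignored.
(ms_sub - ms_err) / (ms_sub + (k - 1) * ms_err)                                  # 1.00
# ICC(2,1): absolute agreement; the shift counts against reliability.
(ms_sub - ms_err) / (ms_sub + (k - 1) * ms_err + k * (ms_occ - ms_err) / n)      # ~.18
```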

Traditional versus hierarchical Bayesian approaches

A final issue is that traditional analytic approaches, such as ICC, may be sub-optimal, and actually even inappropriate, when calculating test–retest reliability in cognitive experimental tasks. Specifically, traditional approaches to test–retest reliability treat summary score measures (sometimes referred to as mean point-estimates; MPE) as representative indicators of performance; yet these measures do not consider trial-to-trial variability, which in itself could be an important source of individual differences (Haines et al., 2020; Lee & Webb, 2005; Rouder & Haaf, 2019; Rouder & Lu, 2005). Indeed, Rouder and Haaf (2019) have presented evidence that by ignoring trial-to-trial variability, test–retest reliability is “greatly” attenuated (see also von Bastian et al., 2020). As an alternative approach, newer analytic methods, involving hierarchical modeling (also termed multilevel or linear mixed effects modeling), have been introduced for measuring reliability, which simultaneously assess between- and within-subject (i.e., trial-to-trial) variation. Hierarchical modeling is a statistical framework for modeling data that have a natural hierarchical structure. For example, data from cognitive-behavioral tasks often have trials within subjects and subjects within groups. By restructuring a model hierarchically, all individuals are considered in two contexts: in isolation, to determine how behavior varies across trials, and as a contributing member of a group, to determine how behavior varies across the group. This increases the number of available parameters from one (i.e., MPE) to multiple (e.g., mean, standard deviation). The model can now distribute uncertainty (e.g., measurement error) that exists in the data over those multiple parameters, which results in more precise estimates at both the individual and group levels (Kupitz, 2020). In particular, hierarchical models provide the means to appropriately correct for the attenuation of reliability that may occur when using more traditional methods.
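As a minimal illustration of this idea (with hypothetical column names, and frequentist estimation via lme4 rather than the Bayesian models used later in this paper), a mixed-effects model estimates each subject’s congruency effect while partially pooling it toward the group effect:

```r
# Hierarchical (mixed-effects) view of a conflict task: trials nested in subjects.
# `trials` is a trial-level data frame with columns rt, congruency, and subject.
library(lme4)

fit <- lmer(rt ~ congruency + (1 + congruency | subject), data = trials)

fixef(fit)               # group-level (fixed) intercept and congruency effect
head(coef(fit)$subject)  # per-subject estimates, shrunken toward the group values
```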

Additionally, these recent efforts have also pointed to the advantages of hierarchical Bayesian models (HBM), relative to classic “frequentist” approaches. A key advantage of the HBM approach is that it can be used to specify a single model that jointly captures the uncertainty at both the individual and group level. Even in a typical study that involves a modest number of subjects, each performing a limited number of trials, with the observed data confounded by measurement error, HBM can provide reasonable estimates of performance by assuming that the data are generated from a population of infinite trials (Raudenbush & Bryk, 2002; Snijders & Bosker, 1999). A second advantage of HBM is that it enables explicit specification of distributions and associated parameters, which best fits a generative approach in which individual trial performance measures are thought to reflect samples drawn from these distributions. Among others, Haines et al. (2020) highlight the advantages of generative models, suggesting that such models more accurately “simulate data consistent with true behavioral observations at the level of individual participants”. In contrast to HBM, frequentist methods of accounting for hierarchical sources of variability, such as structural equation modeling or classical attenuation corrections, do not provide a natural framework for generative modeling (Kurdi et al., 2019; Westfall & Yarkoni, 2016).

Introduction summary

This brief review of the current state of research on individual differences in cognitive control function suggests that one barrier to progress is a lack of familiarity, on the part of researchers coming from the cognitive experimental tradition, with some of the psychometric complexities associated with individual differences measurement. A potential remedy is for researchers to be more explicit about the measurement assumptions they adopt. Part of this explicitness relates to the reporting of measurement reliability and the analytic approach used for its estimation. Moreover, when possible, estimates of both internal consistency (i.e., permutation-based split-half) and temporal stability (i.e., test–retest, ICC) forms of reliability should be assessed and reported. Finally, further investigation and comparison is needed between traditional frequentist and Bayesian approaches to estimation, since the use of Bayesian approaches in individual differences analyses is a relatively new development in the literature.

Current study

The overarching goal of the current study is to test whether a task battery designed in accordance with a unifying theoretical framework can more successfully bridge the divide between experimental and differential approaches in cognitive control research. Here we provide an evaluation focused on the utility of cognitive control measures for individual differences research; specifically, we examine the psychometric issues described above within the context of the DMCC battery. In particular, a key objective associated with the development of the DMCC battery was to examine how experimental manipulation of cognitive control mode affects the individual difference properties of classic cognitive control tasks (Stroop, AX-CPT, cued task-switching, and Sternberg). A central question was whether these task manipulations would allow for more reliable measurement of individual differences in cognitive control function (Cooper et al., 2017). More specifically, by employing task variants that selectively isolate proactive and reactive control modes, respectively, the reliability of mode-specific individual variation can be estimated. Moreover, it is possible that mode-specific individual variation is associated with reduced measurement error. Tang et al. (2022) provide initial support for this hypothesis by demonstrating the convergent and divergent validity of the proactive and reactive control indices, in terms of the robustness of group-average experimental effects. Here we test whether the tasks also demonstrate strong psychometric reliability as individual difference measures of cognitive control ability. Consequently, we sought to assess task reliability in a systematic and comprehensive manner.

Another important focus of the paper was to compare the traditional and newer HBM approaches described above for the assessment of psychometric reliability. The first set of analyses thus reports reliability, both internal consistency and test–retest, employing traditional approaches based on summary score measures (MPEs) from each subject. In contrast, for the second set of analyses we implement hierarchical methods that incorporate modeling of trial-to-trial variability (i.e., individual-level standard deviation) (Rouder & Haaf, 2019). Specifically, we directly compare the traditionally derived test–retest reliability measures with those derived from the HBM approach. Our second hypothesis was that traditional approaches would substantially underestimate the degree of reliability present in cognitive control tasks, replicating prior findings (Rouder & Haaf, 2019).

Our third hypothesis was that Bayesian parameter estimates, if more reliable, would also be more suitable for individual differences analyses that address the question of whether cognitive control can be considered a domain-general construct (i.e., with individuals varying in a consistent, trait-like manner). Consequently, as a final analysis, we examined the correlations present in the DMCC task battery, both within task (i.e., the relations between the baseline, proactive, and reactive variants of each task) and across tasks within the same control mode condition (i.e., the relations between the different task paradigms).

Method

Subjects

Subjects were recruited via the Amazon Mechanical Turk (MTurk) online platform. Our inclusion criteria required subjects to be physically located in the United States of America, to have an Amazon HIT approval rate of 90% or greater, and to have completed at least 100 online experiments offered on MTurk prior to our study. Subjects were excluded if they had participated in any of our other experiments with DMCC tasks, or if they were Mac OS users (due to limitations in the testing software, particularly for recording vocal reaction times in the Stroop task). After reading a description of the study that indicated its multi-session nature and time commitment, 225 interested subjects accessed a link which allowed them to review and sign the consent form. Only 128 subjects completed the entire study.

After consent was given, the web links for the first session of the study were made available on MTurk. Subjects were not restricted with regard to age range (Footnote 1), and as such a wide age range was included in the sample (N = 128; ages 22–64, M = 37.11, SD = 9.90; 82 females, 46 males).

Design and procedure

The study protocol consisted of 30 separate testing sessions that subjects completed in a sequential manner (15 for the test phase, and another 15 for retest). Subjects completed the sessions at a rate of five per week, i.e., taking 6 weeks to complete the full protocol. Baseline task variants were completed during the first and fourth weeks, reactive task variants during the second and fifth weeks, and proactive task variants during the third and sixth weeks. Each session lasted approximately 20–40 min, with the exception of the first session, which was 1 h in duration (and included a Stroop practice to validate operation of vocal response recording, along with a battery of demographic and self-report questionnaires). To both incentivize and prorate study completion, completion of the first session of each of the test and retest phases resulted in a $4 payment; each subsequent session was paid $2, with the exception of sessions 6 and 11, which were paid $4 each. Additional bonuses of $20 were paid for completion of the test phase and $30 for full study completion. Together, successful completion of the entire protocol resulted in a payment of $122.

For each completed session, the experimenter checked for overall accuracy and completion of each task and questionnaire to make sure that subjects were complying with instructions and maintaining sufficient attention to the task. A criterion of 60% accuracy and response rate was used to determine whether the data would be included, and the subject invited to remain in the study. For each task that did not meet the criterion, the experimenter attempted to communicate with the subject first to determine if they had trouble understanding the instructions or had technical difficulties. If so, the subject was given a second chance to complete the task before a designated deadline. Within each of the test and retest phases, sessions were conducted in a fixed order for all subjects.

Task paradigms

Here we present a schematic representation of the tasks and their manipulations (see Fig. 1). The full task descriptions are provided in Tang et al. (2022). Additionally, tasks, experimental design, and manipulation rationale are also provided in Appendix 3 within the current manuscript.

Fig. 1

DMCC task paradigms and overview of session manipulations. Note. PC = proportion congruency; MC/MI = mostly congruent/mostly incongruent; ITI = intertrial interval; Ang = no-go trials with an A cue; Bng = no-go trials with a B cue; NP = novel positive; NN = novel negative; RN = recent negative. For a more detailed description, see Appendix 3 and Tang et al., 2022

Data pre-processing

To facilitate comparison of results across task paradigms, subjects who failed to complete all 30 sessions were not included in the analyses reported here; data from 128 subjects entered the pre-processing stage. For all tasks, correct trials with reaction time (RT) values faster than 200 ms or slower than three standard deviations above the mean RT at the trial type level (i.e., trial type within session, phase, and subject) were removed. This resulted in the removal of 1.7% of Stroop RT trials, 2.9% of AX-CPT RT trials, 1.5% of task-switching RT trials, and 0.8% of Sternberg RT trials.
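A minimal sketch of this trimming rule in R (assumed column names; whether the cell mean and SD are computed before or after the 200-ms cutoff is an implementation detail not specified here):

```r
# Remove correct trials that are faster than 200 ms or more than 3 SDs above the
# mean RT of their cell (trial type within session, phase, and subject).
library(dplyr)

trimmed <- trials %>%
  filter(acc == 1) %>%
  group_by(subject, phase, session, trial_type) %>%
  mutate(rt_upper = mean(rt) + 3 * sd(rt)) %>%
  ungroup() %>%
  filter(rt >= 200, rt <= rt_upper)
```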

After removal of RT trials, each condition at the trial type level was inspected to ensure that no more than half of the completed correct trials had been removed during pre-processing. Subjects passing this criterion were deemed to have enough trials to enter further analyses, although the criterion itself was ultimately arbitrary. Any subject who had a condition removed based on this criterion was removed from the task-specific analyses (i.e., test–retest), but not from the experiment; hence the disparity in sample sizes reported in these analyses. This step resulted in the removal of ten subjects from the AX-CPT data, 24 from the Sternberg data, and none from the Stroop or task-switching data.

For error rates, following Gonthier et al. (2016), we set a 40% error rate cutoff for the AX-CPT task. Common error rate cutoffs for the remaining tasks were less well documented, so we utilized multiverse analyses of different cutoff values (i.e., error rates of 20, 30, 40, and 50%) to determine appropriate thresholds. Examining and visualizing the remaining data at each cutoff revealed that a 40% cutoff removed severe outliers while maximizing subject retention. Applying a 40% cutoff at the trial type level across all tasks resulted in: one subject removed from the Stroop data, no subjects removed from the AX-CPT data, two subjects removed from the task-switching data, and 24 subjects removed from the Sternberg data. The high number of subjects removed from both the RT and error rate Sternberg data is due to the low number of critical novel negative trials in the reactive session; see further comments in the Discussion section. Finally, for the correlational analyses in the current study (i.e., Figs. 3 and 4, Tables 4 and 5), complete data per bivariate analysis were used. Additionally, data entering the correlations were tested for bivariate outliers using Mahalanobis distance with a cutoff of 10.828 (alpha = .001, df = 1).
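For illustration, the bivariate outlier screen can be sketched in base R as follows (hypothetical measure names; 10.828 is the chi-square criterion used in the text):

```r
# Flag bivariate outliers by squared Mahalanobis distance, using the cutoff of
# 10.828 reported above, before computing each correlation.
pair <- na.omit(data.frame(x = measure_a, y = measure_b))  # one bivariate analysis
d2   <- mahalanobis(pair, center = colMeans(pair), cov = cov(pair))
pair_clean <- pair[d2 <= 10.828, ]
cor(pair_clean$x, pair_clean$y)
```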

Data analyses

We assessed the psychometric reliability (both split-half and test–retest) of the measures taken from the four DMCC tasks within each of the three sessions (i.e., baseline, proactive, reactive). The analyses reported in the main text focused on the critical conditions of the tasks (i.e., Stroop biased condition, task-switching biased condition, Sternberg list-length 5 condition), as defined in Tang et al. (2022). The critical conditions were designed specifically to allow for comparison across tasks and analytic methods. Full descriptive statistics and experimental results by session, task, and trial type for all conditions are reported in Tang et al. (2022). Additional reliability analyses (using traditional approaches only) of other, non-critical conditions are reported in the Appendix. In addition to examining the reliability of each critical condition measure, we also examined the strength of correlation between measures, focusing on both within-task, between-condition correlations (e.g., Stroop baseline vs. Stroop proactive) and between-task, same-condition correlations (e.g., AX-CPT reactive vs. Sternberg reactive). If reliability indeed serves as a bottleneck that attenuates the magnitude of between-measure correlations, then improving reliability should dis-attenuate true underlying correlations (given that a relationship exists between the measures).

Reliability estimates: Traditional approach

Both internal consistency and test–retest forms of reliability were calculated, based on traditional psychometric approaches. Internal consistency estimates were calculated as permutation-based split-half correlations. The data were repeatedly (5000 permutations) and randomly split into halves, which were then correlated and a Spearman–Brown correction was applied. The estimates reported here are an average of those 5000 corrected correlations. Test–retest reliabilities are reported as intraclass correlation coefficients (ICC). Because practice effects are expected to occur from session to session and from test to retest phases, the ICC relationship parameter was examined as both absolute agreement (ICC(2,1)) and consistency (ICC(3,1)), as per the Shrout and Fleiss (1979) convention. The former is sensitive to changes in the mean between repeated measures, whereas the latter appropriately corrects for such changes. Here, we report both forms for comparison purposes.

Reliability estimates: Hierarchical Bayesian model

In addition to the traditional psychometric approach to test–retest reliability estimation, HBM was also used to generatively model the reaction time difference score effects from the four tasks in the DMCC task battery. Specifically, we focused on the Stroop effect, the BX interference effect from the AX-CPT, the task-rule congruency effect (TRCE) from cued task-switching, and the recency effect from the Sternberg task. Although the HBM approach works for accuracy measures as well, given concerns regarding difference scores in psychometric analyses (which are traditionally reaction time based), we focused on these RT measures to determine the potential advantages of the HBM approach. Additionally, these measures are ones that are commonly computed for each task (for more information and rationale on task measures, see Appendix 3). Finally, specifying a generative model encapsulates the key assumptions that are shared among the tasks: (1) reaction time cannot be negative; (2) reaction time responses vary around some central tendency (this is ignored with MPE); (3) the central tendency varies per subject; (4) within-individual (i.e., trial-by-trial) variability varies per subject; and (5) reaction time distributions from cognitive-behavioral tasks tend to be right-skewed (Wagenmakers & Brown, 2007).

In the HBM approach, it is important that estimation of test–retest reliability considers trial variability at the individual-level; hence, the individual-level distribution is defined first, followed by the group-level distribution. Given the additional complexity and lower reader familiarity with the HBM approach, we elaborate on how these distributions and parameters are estimated. Individual-level reaction time response distributions are here conceptualized as coming from a lognormal distribution, satisfying the skewed distribution assumption (assumption 5). The distribution is further shaped by mean and standard deviation parameters, which both vary per subject and between each condition (satisfying assumptions 2, 3, and 4). Theoretically, the distribution parameters are not expected to vary much between the test and retest phase. However, for test–retest reliability purposes, the model assumes unique distributions for each phase as well.

$$\textrm{RT}_{i,c,p}\sim \textrm{Lognormal}\left(\mu_{i,c,p},\exp \left(\sigma_{i,c,p}\right)\right)$$
(1)

Formally, in Eq. (1), $\textrm{RT}_{i,c,p}$ is the observed reaction time data for subject i = {1, …, N}, in condition c = {control, interference} (Footnote 2), during phase p = {test, retest}.

The expression $\sim \textrm{Lognormal}\left(\mu_{i,c,p},\exp \left(\sigma_{i,c,p}\right)\right)$ signifies that the data are drawn from a generative process producing a skewed (i.e., lognormal) distribution, shaped by a mean and a standard deviation parameter for each subject, condition, and phase combination. A lognormal distribution has an asymmetrical spread; more variability is found on the right side (i.e., slow reaction times) of the central tendency than on the left side (i.e., fast reaction times). Importantly, the lognormal distribution has a property that determines how the mean and standard deviation interact, allowing the model to fit the many different shapes of reaction time distributions produced by the ~120 subjects. Wagenmakers and Brown (2007) show that this property adheres to a law of [reaction] time, which states that in reaction time performance, the standard deviation increases linearly with the mean. In other words, the slower a subject’s mean reaction time, the more individual-level variability they show. Additionally, to ensure that the individual-level standard deviation parameters are greater than 0, they are exponentially transformed.
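This mean–SD scaling follows directly from the lognormal parameterization; a quick R illustration (with arbitrary example values for the log-scale parameters):

```r
# For a fixed log-scale SD (sigma), the SD of a lognormal variable grows linearly
# with its mean: SD = mean * sqrt(exp(sigma^2) - 1), mirroring the "law of
# reaction time" described by Wagenmakers and Brown (2007).
sigma <- 0.3
mus   <- seq(log(0.4), log(1.2), length.out = 5)   # example subject means on the log scale
means <- exp(mus + sigma^2 / 2)                    # lognormal mean
sds   <- means * sqrt(exp(sigma^2) - 1)            # lognormal SD
round(cbind(mean_rt = means, sd_rt = sds, sd_over_mean = sds / means), 3)  # constant ratio
```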

Individual-level parameters are informed by group-level parameters, and vice versa. The hierarchy of the model is constructed so that the individual-level distribution parameters from Eq. (1), denoted by $\mu_{i,c,p}$ and $\sigma_{i,c,p}$, are drawn from group-level multivariate normal distributions (i.e., prior models), with unobserved (i.e., unknown) means and standard deviations:

$$\begin{aligned}\begin{bmatrix}\mu_{i,c=1,p=1}\\ \mu_{i,c=1,p=2}\\ \mu_{i,c=2,p=1}\\ \mu_{i,c=2,p=2}\end{bmatrix}&\sim \textrm{MVNormal}\left(\begin{bmatrix}\mu_{mean,c=1,p=1}\\ \mu_{mean,c=1,p=2}\\ \mu_{mean,c=2,p=1}\\ \mu_{mean,c=2,p=2}\end{bmatrix},\ \textbf{S}_{\mu}\right)\\ \begin{bmatrix}\sigma_{i,c=1,p=1}\\ \sigma_{i,c=1,p=2}\\ \sigma_{i,c=2,p=1}\\ \sigma_{i,c=2,p=2}\end{bmatrix}&\sim \textrm{MVNormal}\left(\begin{bmatrix}\sigma_{mean,c=1,p=1}\\ \sigma_{mean,c=1,p=2}\\ \sigma_{mean,c=2,p=1}\\ \sigma_{mean,c=2,p=2}\end{bmatrix},\ \textbf{S}_{\sigma}\right)\end{aligned}$$
(2)

By defining these prior models, the group-level multivariate distributions allow for the pooling of subject-level performance across the four condition and phase combinations. Each of the individual-level parameters, $\mu_{i,c,p}$ and $\sigma_{i,c,p}$, informs the group-level means and standard deviations, $\mu_{mean,c,p}$, $\mu_{sd,c,p}$ and $\sigma_{mean,c,p}$, $\sigma_{sd,c,p}$, which in turn inform all other individual-level parameters. This mutual interaction creates hierarchical pooling, regressing the individual-level parameters towards a group mean (also called shrinkage or regularization), and increases the precision of Bayesian estimation (Gelman et al., 2013). Bayesian modeling allows for such a “joint model” specification, in which the individual-level and group-level parameters are estimated simultaneously. This embodies the generative perspective (Haines et al., 2020).

Keen observers will notice that the group-level distributions are both modeled as normal, whereas the individual-level distributions are lognormal. Recall that the individual-level standard deviation parameter [Eq. (1); $\exp \left(\sigma_{i,c,p}\right)$] was exponentially transformed to force it to assume positive values only. Mathematically, when y has a normal distribution, the exponential function of y has a lognormal distribution. It follows, then, that the group-level distribution modeled on the individual-level standard deviation parameter ($\exp \left(\sigma_{i,c,p}\right)$) corresponds to a lognormal distribution.

Another key aspect of HBM is the definition of prior probability distributions, which express prior beliefs about the parameters of interest. Here, parameter estimation is relatively robust to the choice of prior models, because the priors are diffuse and the sample sizes of observed data are relatively large. The prior models for the group-level mean parameters were specified as normal:

$$\begin{aligned}\mu_{mean,c,p}&\sim \textrm{Normal}\left(0,1\right)\\ \sigma_{mean,c,p}&\sim \textrm{Normal}\left(0,1\right)\end{aligned}$$
(3)

The prior models for the group-level standard deviation parameters were specified as half-normal (i.e., if y has a normal distribution, then |y| has a half-normal distribution, folded at the mean so that it consists of only positive values). Because the individual-level standard deviation parameter is exponentially transformed, the group-level distribution assumes only positive values.

$$\begin{aligned}\mu_{sd,c,p}&\sim \textrm{Half-Normal}\left(0,1\right)\\ \sigma_{sd,c,p}&\sim \textrm{Half-Normal}\left(0,1\right)\end{aligned}$$
(4)

To estimate test–retest reliability, a difference score parameter delta (i.e., ∆) was specified in Stan’s generated quantities code section. Taking the Stroop task again as an example, the Stroop effect is incongruent (interference) minus congruent (control) performance. A delta was estimated for each of the test and retest phases.

$$\begin{aligned}\Delta_{i,test}&=\mu_{i,interference,test}-\mu_{i,control,test}\\ \Delta_{i,retest}&=\mu_{i,interference,retest}-\mu_{i,control,retest}\end{aligned}$$
(5)

Then, using the MCMC samples, we correlated delta at test ($\Delta_{i,test}$) with delta at retest ($\Delta_{i,retest}$), resulting in a posterior distribution of test–retest reliabilities. Test–retest reliability estimates for the delta parameter were calculated for each task and session combination and are shown in Fig. 3, indicated as HBM. Importantly, test–retest reliability is calculated as a Pearson r correlation between the test and retest phase estimates, $r(\Delta_1, \Delta_2)$. Here, Pearson r is chosen over an intraclass correlation coefficient (ICC). In the traditional ICC approach, within-subject variance (i.e., our model’s $\sigma_i$) remains embedded in the mean point-estimates from which the required variance components are calculated. However, we are ultimately interested in the correlation between the delta parameters, which are composed of mu parameters with much of their variance modeled out by the sigma parameter. Hence, a simple correlation suffices and fits our model. This also replicates the generative modeling approach of prior work (i.e., Haines et al., 2020; Rouder & Haaf, 2019).

All model parameters were estimated with Stan (Stan Development Team, 2020b) through an interface in R called RStan (Stan Development Team, 2020a). All models were fit with three chains of 3000 iterations after 1000 warm-up iterations. For each of the four tasks in the task battery, the model was fit three times (i.e., once for each task variant), resulting in 12 model fits. From the model fits we extracted three families of parameters: mu, sigma, and, most importantly, the delta parameters.

Furthermore, the individual-level means (i.e., $\mu_{i,c,p}$; referred to as mu) and standard deviations (i.e., $\sigma_{i,c,p}$; referred to as sigma) were extracted for each condition and phase. All R scripts and the Stan model file are available at https://osf.io/pqvga/, as are the extracted delta, mu, and sigma parameters for each task and session combination and visual summaries of all relevant convergence statistics. A graphical representation of the model is included as well (see Fig. 2).
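To make the specification concrete, the following R/Stan sketch implements Eqs. (1)–(5) and the fitting settings described above. It is illustrative only: the simulated data, the indexing of the four condition × phase cells (1 = control/test, 2 = control/retest, 3 = interference/test, 4 = interference/retest), and the LKJ priors with a non-centered parameterization for the covariance matrices $\textbf{S}_{\mu}$ and $\textbf{S}_{\sigma}$ are assumptions made here for illustration (array declarations follow the 2020-era Stan syntax cited above); the study’s own Stan model file is available at the OSF link.

```r
library(rstan)

# Simulated stand-in data for one task/session (hypothetical sizes; RT in seconds).
set.seed(1)
N <- 20; n_trial <- 40
grid <- expand.grid(subject = 1:N, cell = 1:4, trial = 1:n_trial)
true_mu <- matrix(rnorm(N * 4, mean = log(0.7), sd = 0.15), N, 4)
stan_data <- list(
  N = N, T = nrow(grid), subj = grid$subject, cell = grid$cell,
  rt = rlnorm(nrow(grid), meanlog = true_mu[cbind(grid$subject, grid$cell)], sdlog = 0.25)
)

stan_code <- "
data {
  int<lower=1> N;                      // subjects
  int<lower=1> T;                      // trials across all subjects
  int<lower=1, upper=N> subj[T];
  int<lower=1, upper=4> cell[T];       // condition x phase cell
  vector<lower=0>[T] rt;               // reaction time in seconds
}
parameters {
  vector[4] mu_mean;                   // group-level means of mu, Eq. (3)
  vector[4] sigma_mean;
  vector<lower=0>[4] mu_sd;            // group-level SDs, half-normal via constraint, Eq. (4)
  vector<lower=0>[4] sigma_sd;
  cholesky_factor_corr[4] L_mu;        // correlation structure of S_mu (assumed LKJ)
  cholesky_factor_corr[4] L_sigma;     // correlation structure of S_sigma (assumed LKJ)
  matrix[4, N] z_mu;                   // non-centered subject deviations
  matrix[4, N] z_sigma;
}
transformed parameters {
  matrix[N, 4] mu_i = rep_matrix(mu_mean', N)
                      + (diag_pre_multiply(mu_sd, L_mu) * z_mu)';           // Eq. (2)
  matrix[N, 4] sigma_i = rep_matrix(sigma_mean', N)
                         + (diag_pre_multiply(sigma_sd, L_sigma) * z_sigma)';
}
model {
  mu_mean ~ normal(0, 1);
  sigma_mean ~ normal(0, 1);
  mu_sd ~ normal(0, 1);
  sigma_sd ~ normal(0, 1);
  L_mu ~ lkj_corr_cholesky(2);
  L_sigma ~ lkj_corr_cholesky(2);
  to_vector(z_mu) ~ std_normal();
  to_vector(z_sigma) ~ std_normal();
  for (t in 1:T)                       // Eq. (1): lognormal likelihood per trial
    rt[t] ~ lognormal(mu_i[subj[t], cell[t]], exp(sigma_i[subj[t], cell[t]]));
}
generated quantities {
  vector[N] delta_test;
  vector[N] delta_retest;
  for (i in 1:N) {                     // Eq. (5): difference scores on the mu scale
    delta_test[i]   = mu_i[i, 3] - mu_i[i, 1];
    delta_retest[i] = mu_i[i, 4] - mu_i[i, 2];
  }
}
"

fit <- stan(model_code = stan_code, data = stan_data,
            chains = 3, iter = 4000, warmup = 1000)

# Posterior distribution of test-retest reliability: correlate the subject vectors
# of delta_test and delta_retest within each MCMC draw.
d1 <- extract(fit, pars = "delta_test")$delta_test      # draws x subjects
d2 <- extract(fit, pars = "delta_retest")$delta_retest
rel_draws <- sapply(seq_len(nrow(d1)), function(s) cor(d1[s, ], d2[s, ]))
c(mean = mean(rel_draws), quantile(rel_draws, c(.025, .975)))
```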

Fig. 2

A structured schematic representation of the hierarchical model. Note. i = subject; c = condition; p = phase; sd = standard deviation; μi = individual-level mean parameter; σi = individual-level variability parameter

Between-measure correlations

For computation of the comprehensive between-task correlations that are reported in the Appendix, we utilized Spearman’s rho (ρ). Spearman’s rho is a good non-parametric substitute for the parametric Pearson’s r, since Pearson’s r assumes that the relationship between two variables is both monotonic and linear (among other assumptions). The relationship between RT and error rate indices of cognitive-behavioral tasks is often monotonic, but not necessarily linear (Hedge, Powell, Bompas, et al., 2018a). Thus, Spearman’s rho likely provides a more robust alternative, since the assumptions of Pearson’s r are not likely to be met. However, for the between-task and within-task analyses discussed in the Results section below, the focus was on reaction time indices associated with common difference score measures (e.g., the RT Stroop effect). Hence, with the linearity assumption met, we employed Pearson r correlations for the latter hierarchical Bayesian within-task and between-task correlational analyses.
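The practical difference between the two coefficients can be seen with a toy simulation (arbitrary simulated variables, not study data): a monotonic but nonlinear relationship depresses Pearson’s r relative to Spearman’s rho.

```r
# Toy illustration of Pearson vs. Spearman under a monotonic, nonlinear relation.
set.seed(2)
x <- rnorm(200)
y <- exp(x) + rnorm(200, sd = 0.1)     # monotonic but strongly nonlinear
cor(x, y, method = "pearson")          # attenuated by the nonlinearity
cor(x, y, method = "spearman")         # close to 1, rank order preserved
```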

Results

Reliability estimates: Traditional approach

Due to the large number of measures, all reliability estimates are presented in Appendix 1 (Appendix Tables 6, 7, 8, 9, 10 and 11). There, a full report includes internal consistency and test–retest reliabilities for the aggregate measures (mean RT, error rate) for all trial types, across all tasks and sessions. Although the aggregate measures are briefly discussed, only the difference score results are presented here, due to their theoretical importance as measures of cognitive control within the DMCC battery (Tang et al., 2022). Table 1 presents both the split-half and test–retest reliability estimates for RT, computed separately for each control mode condition (baseline, reactive, proactive) and each task paradigm (3 × 3 × 4 = 36 estimates total). The corresponding 36 error rate estimates are shown in Table 2. In addition to the difference scores, four derived indices were also examined for the AX-CPT (A-cue bias, d′-context, and the Proactive Behavioral Index (PBI) for both RT and errors; see Table 3). These AX-CPT-derived estimates have been commonly employed as theoretically sensitive measures of cognitive control in this task, and have also been the focus of prior psychometric investigations (Boudewyn et al., 2015; Cohen et al., 1999; Lin et al., 2022; Richmond et al., 2015; Stawarczyk et al., 2014). Consequently, they were also of particular interest, to determine whether their psychometric properties were improved within the context of the DMCC battery and experimental manipulations. For ease of interpretation, estimates of test–retest reliability below .50 are considered poor; between .50 and .75, moderate; between .75 and .90, good; and above .90, excellent (Koo & Li, 2016). However, these thresholds are somewhat arbitrary; they are offered here as a guide. Of course, the qualitative description of reliability is not a substitute for understanding the numerical estimate in its context.

Table 1 Reaction time reliability across sessions
Table 2 Error rate reliability across sessions
Table 3 AX-CPT-derived indices reliability across sessions

As expected, the reliabilities of difference score measures were weaker than the reliabilities of aggregate measures. For example, the split-half reliability for Stroop incongruent RT was on average r = .99 across sessions and for Stroop congruent RT on average r = 1.00 across sessions (see Appendix 1), but the reliability of the RT Stroop effect was on average r = .82 across sessions. The same general pattern is observed for the test–retest reliability RT estimates: r = {.79, .93, .43}, respectively. This pattern is observed across all tasks, for both split-half and test–retest reliability estimates, and for both RT and accuracy measures.

For the Stroop, cued task-switching, and AX-CPT difference score estimates, the reliability results yield mixed conclusions. The split-half estimates indicate mostly moderate-to-good reliability, for both RT and error rate (x̅ = .68, range = .43–.90). However, the test–retest estimates indicate poor reliability, regardless of which ICC computation was used (ICC2,1: x̅ = .40, range = .16–.79; ICC3,1: x̅ = .42, range = .16–.98). Moreover, the session-level manipulations (i.e., proactive and reactive variants) did not produce demonstrative improvements in reliability. Although reliability was generally highest in the reactive session, the overlapping confidence intervals across sessions suggest that this was not a robust effect.

The reliability of the AX-CPT-derived indices revealed a similar pattern to the difference score measures, in that the split-half reliability estimates were stronger than the test–retest estimates. However, two novel and interesting patterns emerged. First, all four proactive session-derived indices were internally consistent, with split-half estimates ranging from .79 to .80. Second, split-half estimates for d′-context exceeded the reliability threshold of .75 in all sessions, and d′-context is thus considered internally consistent as well. This suggests that the reliability of the d′-context and proactive indices should not pose a bottleneck when used to examine between-measure correlations.

In the Sternberg task, the recency effect measure was found to be generally unreliable, in both RT and error rate. The poor reliability and high variability of the Sternberg estimates may stem from the task design (i.e., low number of observations available to calculate a difference score). To induce proactive control, recent negative (RN) trials were presented infrequently in the baseline and proactive sessions, with only eight RN trials per subject. It is therefore not advised to calculate a traditional difference score from the current Sternberg paradigm for use in individual differences research.

Overall, the reliability analyses computed in the traditional manner suggested that the DMCC cognitive control task measures were not psychometrically robust, particularly with regard to test–retest reliability, a finding consistent with other psychometric analyses of cognitive control tasks (Hedge, Powell, & Sumner, 2018b; Kucina et al., 2022; Rey-Mermet et al., 2018; Rouder & Haaf, 2019). That said, the difference between split-half and test–retest estimates of reliability is intriguing and may provide some insight into the measurement of cognitive control; we discuss this finding in more detail in the Discussion section. We next examined whether the reliability analyses produce different results when computed using HBM approaches to estimation.

Reliability estimates: Hierarchical Bayesian modeling approach

As shown in the first set of analyses, we were not able to extract reliable individual differences from experimental task difference score measures. The goal of the second set of analyses was to examine whether hierarchical Bayesian modeling improved reliability estimation in the DMCC task battery data. This also replicates the modeling approach of prior work similar to the current study (i.e., Haines et al., 2020; Rouder & Haaf, 2019). For comparison between the traditional MPE and HBM approaches, the corresponding mean point-estimate of test–retest reliability (also computed using Pearson r to increase comparability) is provided in Fig. 3 as well. As guidelines for test–retest reliability, we again follow Koo and Li’s (2016) thresholds (i.e., poor < .50, moderate .50–.75, good .75–.90, excellent > .90). Although those guidelines are for the ICC, commonly accepted test–retest correlation guidelines based on Pearson’s product-moment correlation coefficient do not exist to our knowledge.

Fig. 3

Test–retest reliability estimates of the difference score parameter. Note. Distribution of observed reliability estimates, split by analysis type for comparison. Density plots visualize the uncertainty of the HBM delta estimates; dashed lines mark the respective MPE estimates for comparison of reliability magnitude. MPE = Pearson correlation coefficient obtained from the traditional mean point-estimates approach; HBM = Pearson correlation coefficient of delta estimates obtained by hierarchical Bayesian modeling. n ranges between 104 and 122; sample sizes differ due to additional multivariate outlier removal

In contrast to the traditional psychometric approach to estimating test–retest reliability (i.e., based on mean point-estimates), which indicated poor-to-moderate test–retest reliability (x̅r = .39), the HBM-derived estimates of test–retest reliability could be classified as good to excellent (all above .75, x̅r = .85), with the only exception being the Sternberg recency effect in the reactive condition (r = .52). The strong reliability estimates obtained using the HBM approach are consistent with Haines et al. (2020) and Rouder and Haaf (2019). The test–retest estimates of the delta parameter indicate that HBM can indeed provide reliable individual differences from cognitive control tasks, even when using a difference score index (Footnote 3). An additional interesting pattern emerged when comparing test–retest reliability in the different control mode conditions. In particular, reliability was highest for the proactive conditions (x̅ = .94; vs. x̅ = .82 for baseline and x̅ = .79 for reactive), which also differed from the pattern observed in the traditional ICC analyses (for which the reactive condition tended to show the highest values).

Between-measure correlations

Within-task correlations

Next, we examined the correlations between measures, comparing the traditional MPE estimates to the HBM-derived ones. We began by focusing on correlations within the same task paradigms, between sessions (see Fig. 4). Because these are within-task correlations, and the experimental manipulations of cognitive control mode are quite subtle, we expected them to be consistently positive and relatively high overall. Thus, they provided a potentially more useful testbed from which to examine the relationship between the reliability of measures and their correlations.

Fig. 4

Within-task correlation estimates of the difference score parameter. Note. Distribution of observed correlations within task paradigms, split by analysis type for comparison. MPE = Pearson correlation coefficient obtained from traditional mean point estimates approach; HBM = Pearson correlation coefficient of delta estimates obtained by hierarchical Bayesian modeling. n = 116. a Correlations calculated on test (i.e., time 1) data only. b Correlations calculated on combined test and retest (i.e., time 1 & 2) data

Because of the potential for learning effects that might impact within-task correlations, we first conducted these analyses restricted to cognitive control estimates from the test phase only (see Fig. 4a). The average within-task correlations derived with the MPE approach were weak to moderate (x̅ = .32), with a maximum correlation (between Stroop proactive and reactive) of r = .54. In comparison, the HBM-derived correlations were on average moderate (x̅ = .48), with a maximum correlation (between Stroop baseline and proactive) of r = .76. Although the test-phase correlations were overall lower than expected, this was particularly true for the MPE task-switching and MPE Sternberg estimates. Nevertheless, a clear pattern did emerge: with the exception of two AX-CPT correlations, the HBM estimates were higher than their MPE counterparts. Indeed, a Wilcoxon signed-rank test with continuity correction indicated that the HBM correlations were significantly stronger than the MPE correlations at α = .05 (p = .01).
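The paired comparison reported above can be carried out with a standard signed-rank test; the values below are hypothetical placeholders for the paired MPE and HBM within-task correlations.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired within-task correlations (one pair per condition pairing)
mpe_r = np.array([0.32, 0.41, 0.28, 0.54, 0.19, 0.35])
hbm_r = np.array([0.48, 0.55, 0.44, 0.76, 0.31, 0.52])

# Wilcoxon signed-rank test with continuity correction on the paired values
stat, p = wilcoxon(hbm_r, mpe_r, correction=True)
print(f"W = {stat:.1f}, p = {p:.3f}")
```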

Reliability and within-task correlations

We then examined the relationship between the test–retest reliabilities and the within-task correlations. In particular, we directly tested the key psychometric principle that reliability serves as a bottleneck on individual-differences analyses. To test this principle, we examined whether the strength of within-task correlations was related to the level of reliability in the estimates. For these analyses, we used the Pearson correlation as an indicator of pattern similarity between the two sets of values (values near +1 reflect high similarity, values near 0 reflect low similarity), after first computing r-to-z (Fisher) transformations to linearize the within-task correlation values. Indeed, the results were supportive of the hypothesis (see Fig. 5). For the MPE-derived estimates, the distribution of test–retest reliabilities exhibited a highly similar pattern to the distribution of within-task correlations (r = .89); a similar relationship was found for the HBM-derived estimates (r = .65). Thus, when examining the within-task relationships, we find clear support for the hypothesis that the higher within-task correlations observed in the HBM-extracted values were closely related to their overall higher reliabilities.
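This similarity analysis amounts to Fisher-transforming both sets of correlations and then correlating them; a minimal sketch with hypothetical values follows.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical inputs: one test-retest reliability and one within-task
# correlation per measure, for a given estimation approach (MPE or HBM)
reliability = np.array([0.25, 0.40, 0.35, 0.55, 0.30, 0.45])
within_task = np.array([0.20, 0.38, 0.30, 0.54, 0.22, 0.41])

# Fisher r-to-z transformation linearizes the correlation scale
z_rel = np.arctanh(reliability)
z_within = np.arctanh(within_task)

# Pattern similarity: Pearson correlation between the transformed vectors
r_similarity, p = pearsonr(z_rel, z_within)
print(f"r = {r_similarity:.2f}")
```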

Fig. 5

Standardized relationship between reliability and within-task correlations. Note. MPE = Pearson correlation coefficient obtained from traditional mean point estimates approach; HBM = Pearson correlation coefficient of delta estimates obtained by hierarchical Bayesian modeling; n = 104. Both reliability and within-task correlations were r-to-z transformed. Green area is 95% confidence interval around linear regression line

Re-analyses with combined test and retest data

As a follow-up, we conducted the latter two analyses again, now combining the test and retest data, to observe the impact of data aggregation. Interestingly, a different pattern emerged (see Fig. 4b). The within-task correlations derived with the MPE approach increased substantially, to a moderate level (x̅ = .50 versus x̅ = .38 with test-phase-only data), with a maximum correlation (between AX-CPT baseline and proactive) of r = .68. In comparison, the HBM-derived correlations remained quite similar (x̅ = .52 versus x̅ = .48 with test-phase-only data), with a maximum correlation (between Stroop baseline and proactive) of r = .81. Unlike with the test-phase-only data, here a Wilcoxon signed-rank test with continuity correction suggested no significant difference between the test + retest MPE and HBM estimates at α = .05 (p = .41). Nevertheless, the same relationships were observed between reliability level and the strength of within-task correlations (MPE: r = .82; HBM: r = .61). Taken together, these findings are consistent with prior research suggesting that HBM-derived estimates are more stable and robust than those from the traditional MPE approach, particularly when fewer trials are available for estimation (Efron & Morris, 1977; Hox et al., 2012; Smid et al., 2020). In all cases, however, the data strongly support the psychometric principle that tasks with lower reliabilities tend to be associated with weaker individual-differences correlations.

Between-task correlations

Lastly, we conducted a more comprehensive examination of between-task correlations in the DMCC battery, first using the traditional MPE estimates. Because of the large number of tasks, conditions, and measures, we relegate full reporting of these correlations to Appendix 2 and provide only a brief summary here. In total, we examined 198 between-task correlations, with a median correlation of r = .13; of these, only 12 had an absolute magnitude of at least r = .25. These values are on par with the so-called “crud factor” in differential psychology, which refers to the idea that correlations with magnitudes between 0 and .20 should be interpreted as little more than noise (Lykken, 1968; Meehl, 1986; but see Orben & Lakens, 2020 for a recent critique).

We then focused on between-task, same-condition correlations (e.g., the correlation of Stroop baseline with AX-CPT baseline) of key difference score measures, comparing the traditional MPE and HBM approaches. Examining both approaches on the test-phase-only data (see Table 4) reveals that 33 out of 36 correlations fall between r = –.20 and r = .20, with the remaining three correlations not much higher (r = {–.23, .23, .24}). Following the analyses of the within-task correlations, we also calculated the between-task correlations on the combined test and retest phase data (see Table 5). A similar pattern emerged: 34 out of 36 correlations fall between r = –.20 and r = .20, with the remaining two correlations again not much higher (r = {–.21, .27}). Moreover, there was no consistent difference between the correlations computed from the traditional MPE and HBM estimated values, either for the test-only data (MPE: x̅ = .03; HBM: x̅ = –.01) or for the test plus retest data (MPE: x̅ = .03; HBM: x̅ = –.01). Thus, the results of this analysis do not support our hypothesis that the increased test–retest reliabilities observed for the HBM parameters would also translate into higher between-task correlations.

Table 4 Between-task correlations of test phase reaction time difference score parameters
Table 5 Between-task correlations of test + retest phase reaction time difference score parameters

Discussion

The goal of the current study was to examine psychometric reliability in experimental tasks of cognitive control. To this end, we utilized the new DMCC task battery, as it comprises classic cognitive control tasks but also includes theoretically derived task variants that can isolate effects related to engagement of proactive and reactive control modes (Braver, 2012; Braver et al., 2021). Our primary hypothesis was that psychometric reliability would not be satisfactory in these tasks when using traditional summary-statistic approaches, but that a different conclusion would be drawn when reliability was estimated with HBM approaches, which are likely to better capture individual-differences variability associated with task performance in this domain. Indeed, when using traditional statistical approaches (i.e., split-half, ICC), the psychometric analyses suggested that our theoretically optimized task battery did not improve reliability above and beyond that of existing tasks and batteries. Plainly stated, the reliability of the DMCC task battery measures, when computed with popular difference score indices, was moderate at best, which is quite consistent with prior psychometric reports using different task variants (von Bastian et al., 2020; see also Hedge, Powell, & Sumner, 2018b; Rouder & Haaf, 2019). In particular, when analyses were conducted with traditional psychometric methods, there was no evidence suggesting improved reliability associated with metrics of proactive and reactive cognitive control.

One important finding was that, with the conventional analyses, reliability estimates of internal consistency (i.e., split-half indices) were almost always higher than those of temporal stability (i.e., test–retest indices, ICC(2,1) and ICC(3,1)). Given that split-half methods are calculated on measures from a single timepoint, and test–retest methods on measures from two (or more) timepoints, this finding is not surprising. It does, however, reaffirm that the two methods cannot be treated as interchangeable indices of reliability. When possible, indices of both internal consistency and temporal stability should be reported. Importantly, the observed discrepancy indicates that our measures of cognitive control have some internal consistency, but additional work is needed to determine why temporal stability appears to be lower than desirable. In our case, that additional work meant investigating whether traditional psychometric statistics might not be appropriate, or well aligned, for estimating individual differences in experimental cognitive control tasks.
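To make the distinction between the two kinds of indices concrete, the sketch below (hypothetical inputs; the actual analyses used permutation-based split-half estimates and ICCs) contrasts an internal-consistency index computed within a single session with a temporal-stability index computed across sessions.

```python
import numpy as np
from scipy.stats import pearsonr

def split_half_reliability(scores_odd, scores_even):
    """Internal consistency: correlate difference scores from odd vs. even
    trials of one session, then apply the Spearman-Brown correction."""
    r, _ = pearsonr(scores_odd, scores_even)
    return 2 * r / (1 + r)

def test_retest_reliability(scores_t1, scores_t2):
    """Temporal stability: correlate the same measure across two sessions
    (the paper reports ICC(2,1)/ICC(3,1); Pearson r is shown for brevity)."""
    r, _ = pearsonr(scores_t1, scores_t2)
    return r
```

The question raised by the discrepancy between these two kinds of indices is what motivated the alternative estimation approach considered next.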

In particular, we utilized hierarchical Bayesian modeling (HBM) as an alternative approach, to provide another test of the hypothesis, suggested by recent work, that this approach might be better suited for reliability estimation with cognitive experimental tasks (Haines et al., 2020; Rouder & Haaf, 2019). Our results were strongly consistent with this hypothesis. Specifically, we found that with HBM estimation the cognitive control indices were actually highly reliable, even when using indices derived from difference scores. Indeed, our findings indicate that test–retest reliability estimates for the delta (difference score) parameters in our sample can almost always be classified as good, and sometimes even excellent. This finding is striking, particularly when compared to the weak-to-moderate intraclass correlation coefficients (ICCs) observed in the traditional set of analyses. The HBM analyses clearly suggest that accounting for individual-level variability and for the type and shape of the distribution can “rescue” the reliability estimation, to use the formulation of Rouder and Haaf (2019). Interestingly, it was also found that in both the traditional and HBM analyses, reliability estimates were highest for the proactive task variants, which also supports our hypothesis that theoretically motivated task manipulations may contribute to improved reliability.

One of the primary reasons for the enduring importance of, and need for attention to, reliability measures is the view, well accepted in the psychometric literature (Hedge, Powell, & Sumner, 2018b; Parsons et al., 2019; Rouder et al., 2019; Spearman, 1904), that reliability might serve as a bottleneck or constraint on the ability to detect correlations between measures of individual differences. Classically, the expected observed correlation between two measures is the true correlation attenuated by the square roots of their reliabilities (r_observed = r_true × √(rel_x × rel_y); Spearman, 1904). The key point is that, for measures with low reliability, there should be reduced sensitivity for the detection of between-measure correlations. Yet this assumption has rarely been tested directly (Cooper et al., 2017).
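This bottleneck follows from the attenuation relation noted above; the short simulation below (assumed reliabilities and true correlation, not study data) illustrates how a true correlation of .60 between two measures with reliabilities of .40 is expected to appear as roughly .24.

```python
import numpy as np

rng = np.random.default_rng(1)
n, true_r = 100_000, 0.60          # large n keeps the simulation stable
rel_x, rel_y = 0.40, 0.40          # assumed reliabilities of the two measures

# Correlated true scores
cov = [[1.0, true_r], [true_r, 1.0]]
true_x, true_y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# Add measurement error so each observed score has the stated reliability
# (reliability = true-score variance / total variance)
obs_x = true_x + rng.normal(0.0, np.sqrt(1 / rel_x - 1), n)
obs_y = true_y + rng.normal(0.0, np.sqrt(1 / rel_y - 1), n)

print(np.corrcoef(obs_x, obs_y)[0, 1])   # simulated observed correlation
print(true_r * np.sqrt(rel_x * rel_y))   # analytic expectation: 0.24
```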

Our analyses also provided direct support for this contention when examining correlations between DMCC task measures within tasks (i.e., between the baseline, proactive, and reactive control modes). Specifically, we treated the within-task correlations as “benchmarks,” since we assumed ground-truth positive correlations, given that the same subjects were performing subtle variants of the same task across sessions. Indeed, we found that not only was test–retest reliability increased for the HBM estimates relative to the traditional ICC measures, but so were the within-task correlations. Thus, the results provide clear support for the psychometric perspective, in demonstrating the importance of reliability, as well as the improved potential to estimate individual differences in cognitive control with HBM-based approaches. Moreover, we compared results when using test-phase-only data, which we assumed would be less impacted by learning or strategy effects, with those combining the test and retest data together. This comparison also revealed the advantages of the HBM approach, as the within-task correlation strengths were more stable across the two sets of analyses; conversely, with the MPE estimates, the strength of within-task correlations was considerably lower when using the test-phase-only data. This suggests that MPE-based estimates of reliability and correlation are more susceptible to the amount of data available for analysis, whereas HBM estimates are more stable and efficient.

Unfortunately, the one dimension in which the increased reliability obtained with HBM estimates did not translate into improved correlation strength was the correlations observed between DMCC tasks. Here, we observed on average near-zero correlations (the majority below r = .20), which did not differ from the difference score measures derived with the traditional approach, nor between the test-only and the test plus retest data. Thus, at least in the case of the DMCC task battery, it cannot be claimed that the weak between-task correlations are due to the unreliability of the measures.

Indeed, the contrast between the within-task and between-task correlations is striking. Moreover, it clearly points to the need for future research to understand the basis for the repeated findings of low between-task correlations among cognitive control measures (von Bastian et al., 2020), particularly given that our results argue against an interpretation in terms of low measurement reliability. As such, our findings converge strongly with those of Rouder and Haaf (2019), who also observed that HBM estimates were associated with increased test–retest reliability in cognitive control tasks but did not change the nature of cross-task correlations. We discuss this issue further below, along with other limitations of the current work and fruitful directions for further research.

Limitations and future directions

The current study design, though promising as a validation of the newly developed DMCC task battery in terms of its psychometric robustness, does come with some limitations. First, it is important to acknowledge the fully online format of the design. This format has clear and significant advantages, the foremost of which is that the multi-session nature of the study would have placed a much stronger burden on subjects if frequent laboratory visits had been required. Moreover, at the time of this writing, the SARS-CoV-2 pandemic has accelerated the shift of experimental research towards online formats. Finally, much work has validated online task administration as a viable format for cognitive tasks, with many important results replicated (Anwyl-Irvine et al., 2021; Bridges et al., 2020; Chaytor et al., 2021; Crump et al., 2013; Pronk et al., 2021). Nevertheless, the online format also has a number of drawbacks, which are well known in the literature. These include reduced experimental control over the task environment and an increased risk of potential distractions.

Another limitation of the design comes from the fact that not all the tasks were optimized to be delivered in a test–retest format. In addition to standard concerns about practice effects impacting retest sessions, the DMCC battery also includes some tasks and conditions that are likely to be more affected by prior experience than others. For example, in the Cued-TS proactive and reactive conditions, incentives are given based on performance, whereas these are not present in the baseline condition. During the initial baseline condition, subjects are not told about the potential for incentives in the subsequent proactive and reactive sessions. However, during the retest baseline session they do have this knowledge, which could impact the cognitive strategies used in that session. Likewise, in the AX-CPT proactive condition, subjects receive explicit strategy training on how to utilize the contextual cues. Again, in the preceding baseline test session, which is otherwise identical to the proactive condition, they have not yet received this strategy training; but by the retest baseline session, subjects have already had considerable experience following the strategy instructions, which could also impact their performance. Thus, in future investigations of test–retest reliability with the DMCC battery, it would be useful to reconsider the manipulations used for the proactive and reactive sessions, to minimize carry-over effects of prior practice. Conversely, however, we found that, at least for the HBM estimates, the strength of within-task correlations was not strongly impacted by whether test-phase-only data or the combined test and retest data were used. This suggests that such concerns may primarily affect traditional analyses based on MPE estimates, for which the results may be more strongly influenced by both the amount and the variability of the within-subject data.

The current study adds to a growing literature highlighting the promise and potential of HBM approaches for analyzing cognitive experimental tasks. Yet these types of Bayesian analyses are still relatively rare in the literature; consequently, there is still a poor understanding of how they differ from traditional analyses, or of how their effects might diverge. Given the lack of widespread adoption of HBM methods, we opted for a more conservative approach, first presenting results from traditional psychometric analyses of reliability before comparing them with HBM estimates. We utilized Bayesian models that estimated effects for each task variant separately, following the current literature (Haines et al., 2020; Rouder & Haaf, 2019).

However, the approach can be expanded to a single all-encompassing model. In particular, it is also possible to develop a generative model in which the different conditions, and even the different tasks, are treated as additional level(s) of the hierarchy from which the distributions arise (i.e., analogous to the way subjects are drawn from a higher-level distribution; see also Rouder et al., 2019, preprint). Our current model benefits from shared information across subjects and trial types (i.e., congruent, incongruent), but only within one variant (i.e., baseline, proactive, reactive) of each task paradigm. A complete generative model has the additional benefit of between-condition and between-task information sharing, as illustrated in the sketch below. However, building full generative models increases the complexity of the modeling endeavor, so it is worthwhile to progress in a more incremental fashion. Nevertheless, the promise of the current approach suggests that further development of Bayesian statistical approaches to task parameter estimation may be a particularly worthwhile direction for the field (Gelman et al., 2013; Lee & Wagenmakers, 2014; McElreath, 2020).
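As a purely illustrative sketch of this idea (again in PyMC, with assumed priors and a normal likelihood, not a model fit in this study), conditions can be added as an extra level of partial pooling above subjects:

```python
import pymc as pm

def fit_multilevel_delta(rt, subj_idx, cond_idx, incongruent, n_subj, n_cond):
    """Conditions (e.g., baseline/proactive/reactive) as an additional
    hierarchy level: subject-by-condition deltas are pooled both across
    subjects and across conditions of the same task."""
    with pm.Model():
        # Task-level distribution from which condition-level means are drawn
        mu_task = pm.Normal("mu_task", 0.0, 0.3)
        sigma_cond = pm.HalfNormal("sigma_cond", 0.2)
        mu_cond = pm.Normal("mu_cond", mu_task, sigma_cond, shape=n_cond)

        # Subject-by-condition deltas drawn around their condition mean
        sigma_subj = pm.HalfNormal("sigma_subj", 0.2)
        delta = pm.Normal("delta", mu_cond, sigma_subj, shape=(n_subj, n_cond))

        # Subject baseline RTs and residual trial noise
        alpha = pm.Normal("alpha", 0.7, 0.5, shape=n_subj)
        sigma_rt = pm.HalfNormal("sigma_rt", 0.5)

        mu_trial = alpha[subj_idx] + delta[subj_idx, cond_idx] * incongruent
        pm.Normal("rt_obs", mu=mu_trial, sigma=sigma_rt, observed=rt)

        return pm.sample(2000, tune=2000, target_accept=0.9)
```

In the same way, task-level means could themselves be drawn from a battery-level distribution, yielding the fully generative model described above.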

A further limitation of the current study is that analyses based solely on reaction time measures also pose challenges for interpretation. For example, Draheim et al. (2019, 2020) have argued that the use of reaction time difference scores “is the primary cause of null and conflicting results” when examining individual differences in attentional control. Their work suggests that measures based on accuracy rather than reaction time can improve reliability, intercorrelations among tasks, latent factor scores, and associations with measures of working memory and fluid intelligence. Although it was beyond the scope of the current study, it is of course possible to use HBM approaches with accuracy measures as well, which suggests another possible direction for future work (Lin et al., 2022). Other work by Hedge et al. (2021) highlights the importance of cognitive modeling for properly estimating latent processes, for example by employing a diffusion model for conflict tasks (Rey-Mermet et al., 2021; Ulrich et al., 2015; Weigard et al., 2021). In this work, it was found that when conflict processes were decomposed from non-conflict processes, only weak correlations (r < .05) were observed between conflict processes across different cognitive control tasks. In contrast, correlations between model parameters representing processing speed and strategy were consistently positive, with moderate to strong values. Future work should follow suit and use cognitive models that account for the speed–accuracy tradeoff and the multiple latent processes that underlie observed measures.

The key unresolved question from the current study concerns the low between-task correlations observed, even among the theoretically derived tasks that comprise the DMCC battery. These findings are not unprecedented; indeed, they are quite consistent with a number of prior studies that have examined correlations among cognitive control measures using task batteries and latent variable modeling (Draheim et al., 2020; Rey-Mermet et al., 2018; Rouder & Haaf, 2019; von Bastian et al., 2020). Nevertheless, the current results are quite discouraging, as they increase doubt about the domain-generality of cognitive control constructs. In some ways, however, the results are discrepant with work emerging from the neuroimaging literature, which has also become more attuned to questions of individual differences and domain-generality (Dubois & Adolphs, 2016; Elliott et al., 2020; Finn et al., 2017; Freund et al., 2021; Gratton et al., 2018).

Indeed, within the neuroimaging literature, an important emerging finding is that although lower-dimensional (e.g., univariate) descriptions may not be reliable for characterizing individual differences in brain activity, higher-dimensional (e.g., multivariate) descriptions can be quite discriminative. This can be seen most clearly in “fingerprinting” studies (Finn et al., 2015), in which pattern similarity techniques demonstrate that individuals show high test–retest reliability, such that their activation profile from a test scan can be easily discriminated from those of other individuals in a retest session (i.e., significantly higher test–retest similarity within individuals than between them). Moreover, our group has extended this approach into the domain of task fMRI and cognitive control, using twin-based study designs to demonstrate a remarkable degree of similarity among identical twin pairs relative to unrelated pairs (or even fraternal pairs) in the fronto-parietal regions most strongly associated with cognitive control functions (Tang et al., 2021). Most strikingly, these effects were only observed when utilizing multivariate activation pattern similarity, rather than univariate measures (Etzel et al., 2020), and they demonstrated clear domain-generality (i.e., cross-task effects; Tang et al., 2021). Together, this work suggests that utilizing multivariate rather than univariate descriptions of the individual might be a promising direction even for behavioral characterizations. Indeed, initial work in this direction, utilizing behavioral fingerprinting approaches, has begun (see Han & Adolphs, 2020), though much more investigation is needed.

Conclusions

We examined whether well-established experimental tasks, modified with theoretically aligned variants and task manipulations, are viable tools for measuring individual differences in cognitive control. As previously reported (Tang et al., 2022), the experimental manipulations included in this task battery were validated to be highly robust at the group level, in inducing consistent shifts towards proactive and reactive control. Yet traditional psychometric approaches suggested that the theoretically derived cognitive control indices were not highly reliable, in terms of either internal consistency (split-half) or temporal stability (test–retest), with estimates that were moderate at best. In contrast, when the test–retest data were re-examined using hierarchical Bayesian modeling, the findings were quite different, with good to excellent reliability observed for most measures. Moreover, these reliability gains translated into stronger within-task correlations.

Nevertheless, even with the reliable Bayesian estimates, between-task correlations were unaffected and remained uniformly weak; in other words, the poor between-task correlations were not due to reliability constraints. Together, these findings add to the growing literature on the importance of Bayesian generative models for estimating individual differences, and on their superior robustness to changes in the number of observations relative to traditional methods. Most importantly, however, our findings also point to the need for further investigation into the source of the low between-task correlations among experimental tasks that attempt to measure putatively domain-general cognitive control constructs. We encourage other researchers interested in cognitive individual differences to attend more closely to psychometric issues when conducting this important research.