Materials and method
Participants
A total of 137 individuals (102 females, 35 males; mean age = 19.97, SD = 1.07) with normal or corrected-to-normal vision participated in the experiment. Participants provided written informed consent, and the study was approved by the Ethics Committee at Southwest University. Participants received monetary compensation for their participation. Two participants were excluded because they had negative average capacity values, resulting in a final sample of 135 participants.
Stimuli
The stimuli were presented on monitors with a refresh rate of 75 Hz and a screen resolution of 1,024 × 768. Participants sat approximately 60 cm from the screen; because no chinrest was used, all visual angle estimates are approximate. In addition, monitor size varied slightly across testing rooms (five 16-in. CRT monitors, three 19-in. LCD monitors), leading to small variations in the size of the colored squares from monitor to monitor. Approximate ranges in degrees of visual angle are therefore reported below.
All stimuli were generated in MATLAB (The MathWorks, Natick, MA) using the Psychophysics Toolbox (Brainard, 1997). Colored squares (51 pixels; range of 1.55° to 2.0° visual angle) served as memoranda. Squares could appear anywhere within an area of the monitor subtending approximately 10.3° to 13.35° horizontally and 7.9° to 9.8° vertically. Squares could appear in any of nine distinct colors, and colors were sampled without replacement within each trial (RGB values: red = 255 0 0; green = 0 255 0; blue = 0 0 255; magenta = 255 0 255; yellow = 255 255 0; cyan = 0 255 255; orange = 255 128 0; white = 255 255 255; black = 0 0 0). Participants were instructed to fixate a small black dot (approximate range: .36° to .47° of visual angle) at the center of the display.
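As an illustration of how such displays can be constructed, a minimal Psychophysics Toolbox sketch is given below; the window setup, item positions, and background color are placeholder assumptions and this is not the original experiment code.

% Minimal Psychtoolbox sketch (illustrative, not the original experiment
% code): draw one memory array of colored squares plus a central fixation
% dot. Window setup, positions, and sizes are placeholder assumptions.
colors = [255 0 0; 0 255 0; 0 0 255; 255 0 255; 255 255 0; ...
          0 255 255; 255 128 0; 255 255 255; 0 0 0];
setSize   = 4;
sqSidePix = 51;                                  % square side length in pixels

[win, winRect] = Screen('OpenWindow', max(Screen('Screens')), [128 128 128]);
[cx, cy] = RectCenter(winRect);

colIdx = randperm(size(colors, 1), setSize);     % sample colors without replacement
xPos = cx + [-200 -100 100 200];                 % placeholder item positions
yPos = cy + [-100 100 -100 100];

for i = 1:setSize
    rect = CenterRectOnPoint([0 0 sqSidePix sqSidePix], xPos(i), yPos(i));
    Screen('FillRect', win, colors(colIdx(i), :), rect);
end
Screen('FillOval', win, [0 0 0], CenterRectOnPoint([0 0 8 8], cx, cy));  % fixation dot
Screen('Flip', win);                             % present the memory array
WaitSecs(0.150);                                 % 150-ms encoding duration
Screen('Flip', win);                             % blank retention interval begins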
Procedures
Each trial began with a blank fixation period of 1,000 ms. Participants then briefly viewed an array of four, six, or eight colored squares (150 ms) and remembered it across a blank delay period (1,000 ms). At test, one colored square was presented at one of the remembered locations. The probed square was equally likely to be the same color (no-change trial) or a different color (change trial). Participants made an unspeeded response by pressing the “z” key if the color was the same or the “/” key if the color was different. Participants completed 180 trials at each of set sizes 4, 6, and 8 (540 trials total). Trials were divided into nine blocks, and participants were given a brief rest period (30 s) after each block. To calculate capacity, change detection accuracy was transformed into a K estimate using Cowan’s (2001) formula K = N × (H − FA), where N is the set size, H is the hit rate (proportion of correct responses on change trials), and FA is the false alarm rate (proportion of incorrect responses on no-change trials). Cowan’s formula is best suited for single-probe displays like the one employed here; for change detection tasks using whole-display probes, Pashler’s (1988) formula may be more appropriate (Rouder et al., 2011).
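For illustration, a minimal MATLAB sketch of this transformation is given below; the variable names (respChange, isChange, setSize) are assumptions rather than the original analysis code.

% Minimal sketch: Cowan's (2001) K from single-probe change detection data.
% respChange = logical vector of "change" responses; isChange = logical
% vector indicating actual change trials; setSize = set size on each trial.
sizes = [4 6 8];
K = nan(1, numel(sizes));
for s = 1:numel(sizes)
    idx = (setSize == sizes(s));
    H  = mean(respChange(idx & isChange));    % hit rate on change trials
    FA = mean(respChange(idx & ~isChange));   % false alarm rate on no-change trials
    K(s) = sizes(s) * (H - FA);               % K = N x (H - FA)
end
meanK = mean(K);                              % average capacity across set sizes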
Results
Descriptive statistics for each set size condition are shown in Table 1, and data for both Experiments 1 and 2 are available online at the website of the Open Science Framework, at https://osf.io/g7txf/. We observed a significant difference in performance across set sizes, F(2, 268) = 20.6, p < .001, η_p² = .133, and polynomial contrasts revealed a significant linear trend, F(1, 134) = 36.48, p < .001, η_p² = .214, indicating that the average performance declined slightly with increased memory load.
Table 1 Descriptive statistics for Experiment 1
Reliability of the full sample: Cronbach’s alpha
We computed Cronbach’s alpha (unstandardized) using K scores from the three set sizes as items (180 trials contributing to each item), and obtained a value of α = .91 (Cronbach, 1951). We also computed Cronbach’s alpha using K scores from the nine blocks of trials (60 trials contributing to each item) and obtained a nearly identical value of α = .92. Finally, we computed Cronbach’s alpha using raw accuracy for single trials (540 items), and obtained an identical value of α = .92. Thus, change detection estimates had high internal reliability for this large sample of participants, and the precise method used to divide trials into “items” does not impact Cronbach’s alpha estimates of reliability for the full sample. Furthermore, using raw accuracy versus bias-corrected K scores did not impact reliability.
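For illustration, a minimal MATLAB sketch of the unstandardized alpha computation is given below; X stands for any of the participants × items matrices described above (three set-size K scores, nine block K scores, or 540 single-trial accuracies), and the function name is an assumption.

% Minimal sketch: unstandardized Cronbach's alpha for a participants x items
% matrix X (e.g., columns = K scores for set sizes 4, 6, and 8, or
% single-trial accuracies). Assumes no missing data.
function a = cronbachAlpha(X)
    k = size(X, 2);                       % number of items
    itemVar  = var(X, 0, 1);              % variance of each item across participants
    totalVar = var(sum(X, 2), 0);         % variance of participants' total scores
    a = (k / (k - 1)) * (1 - sum(itemVar) / totalVar);
end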
Reliability of the full sample: Split-half
The split-half correlation of the K scores for even and odd trials was reliable, r = .88, p < .001, 95% CI [.84, .91]. Correcting for attenuation yielded a split-half correlation value of r = .94 (Brown, 1910; Spearman, 1910). Likewise, the capacity scores from individual set sizes correlated with each other: r_ss4-ss6 = .84, p < .001, 95% CI [.78, .88]; r_ss6-ss8 = .79, p < .001, 95% CI [.72, .85]; r_ss4-ss8 = .76, p < .001, 95% CI [.68, .83]. Split-half correlations for individual set sizes yielded Spearman–Brown-corrected correlation values of r = .91 for set size 4, r = .86 for set size 6, and r = .76 for set size 8, respectively.
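For illustration, a minimal MATLAB sketch of the even/odd split-half computation with the Spearman–Brown correction is given below; kEven and kOdd are assumed vectors holding each participant’s K estimate computed from even and odd trials, respectively.

% Minimal sketch: even/odd split-half reliability with Spearman-Brown
% correction. kEven and kOdd are illustrative per-participant K vectors.
R = corrcoef(kEven, kOdd);
rHalf = R(1, 2);                          % uncorrected split-half correlation
rSB = (2 * rHalf) / (1 + rHalf);          % Spearman-Brown corrected reliability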
The drop in capacity from set size 4 to set size 8 has been used in the literature as a measure of filtering ability. However, the internal reliability of this difference score has typically been low (Pailian & Halberda, 2015; Unsworth et al., 2014). Likewise, we found here that the split-half reliability of the performance decline from set size 4 to set size 8 (“4–8 Drop”) was low, with a Spearman–Brown-corrected correlation value of r = .24. Although weak, this correlation is of the same strength as that reported in earlier work (Unsworth et al., 2014). The split-half reliability of the performance decline from set size 4 to set size 6 was slightly higher, r = .39, and the split-half reliability of the difference between set size 6 and set size 8 performance was very low, r = .08. The reliability of difference scores is affected both by (1) the internal reliability of each measure used to compute the difference and (2) the degree of correlation between the two measures (Rodebaugh et al., 2016). Although the internal reliability of each individual set size was high, the positive correlation between set sizes may have decreased the reliability of the set size difference scores.
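One standard psychometric expression for the reliability of a difference score D = X − Y makes both influences explicit (this formula is stated here for illustration and was not part of the original analyses):

$$ r_{DD^{\prime}} = \frac{\sigma_X^2\, r_{XX^{\prime}} + \sigma_Y^2\, r_{YY^{\prime}} - 2\, r_{XY}\,\sigma_X \sigma_Y}{\sigma_X^2 + \sigma_Y^2 - 2\, r_{XY}\,\sigma_X \sigma_Y} $$

where \( r_{XX^{\prime}} \) and \( r_{YY^{\prime}} \) are the reliabilities of the two measures, \( r_{XY} \) is their intercorrelation, and \( \sigma_X \) and \( \sigma_Y \) are their standard deviations. Because the same quantity, \( 2\, r_{XY}\,\sigma_X \sigma_Y \), is subtracted from a numerator that is already smaller than the denominator, a higher correlation between set sizes lowers the reliability of the difference score even when each set size is measured reliably.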
An iterative down-sampling approach
To investigate the effects of sample size and trial number on the reliability estimates, we used an iterative down-sampling procedure. Two reliability metrics were assessed: (1) Cronbach’s alpha using single-trial accuracy as items, and (2) split-half correlations using all trials. For the down-sampling procedure, we randomly sampled participants and trials from the full dataset. The number of participants (n) was varied from 5 to 135 in steps of 5. The number of trials (t) was varied from 5 to 540 in steps of 5. Number of participants and number of trials were factorially combined (2,916 cells total). For each cell in the design, we ran 100 sampling iterations. On each iteration, n participants and t trials were randomly sampled from the full dataset and reliability metrics were calculated for the sample.
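A minimal MATLAB sketch of this procedure is given below; for brevity it computes Spearman–Brown-corrected split-half correlations on raw single-trial accuracy, and the variable names (e.g., acc for the 135 × 540 accuracy matrix) are assumptions rather than the original analysis code.

% Minimal sketch of the iterative down-sampling procedure (illustrative
% variable names): acc is a participants x trials matrix of accuracy.
nGrid = 5:5:135;                 % numbers of participants
tGrid = 5:5:540;                 % numbers of trials
nIter = 100;
meanR  = nan(numel(nGrid), numel(tGrid));
worstR = nan(numel(nGrid), numel(tGrid));

for ni = 1:numel(nGrid)
    for ti = 1:numel(tGrid)
        rIter = nan(nIter, 1);
        for it = 1:nIter
            pIdx = randperm(size(acc, 1), nGrid(ni));   % sample participants
            tIdx = randperm(size(acc, 2), tGrid(ti));   % sample trials
            sub  = acc(pIdx, tIdx);
            evenAcc = mean(sub(:, 2:2:end), 2);         % even sampled trials
            oddAcc  = mean(sub(:, 1:2:end), 2);         % odd sampled trials
            R = corrcoef(evenAcc, oddAcc);
            rIter(it) = (2 * R(1, 2)) / (1 + R(1, 2));  % Spearman-Brown corrected
        end
        meanR(ni, ti)  = mean(rIter);                   % average "experiment"
        worstR(ni, ti) = min(rIter);                    % worst-case "experiment"
    end
end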
Figure 1 shows the results of the down-sampling procedure for Cronbach’s alpha. Figure 2 shows the results of the down-sampling procedure for split-half reliability estimates. In each plot, we show both the average reliabilities obtained across the 100 iterations (Figs. 1a and 2a) and the worst reliabilities obtained across the 100 iterations (Figs. 1b and 2b). Conceptually, we could think of each iteration of the down-sampling procedure as akin to running one “experiment,” with participants randomly sampled from our “population” of 135. Although it is good to know the average expected reliability across many experiments, the typical experimenter will run an experiment only once. Thus, considering the “worst-case scenario” is instructive for planning the number of participants and the number of trials to be collected. For a more complete picture of the breadth of the reliabilities obtained, we can also consider the variability in reliabilities across iterations (SD) and the range of reliability values (Fig. 2c and d). Finally, we repeated this iterative down-sampling approach for each individual set size. The average reliability, as well as the variability of the reliabilities, for the individual set sizes is shown in Fig. 3. Note that each set size begins with 1/3 as many trials as in Figs. 1 and 2.
Next, we examined characteristics of samples with particularly low versus high reliability. We ran 500 sampling iterations of 30 participants and 120 trials and then performed a median split on the resulting reliability values to define high- versus low-reliability samples. No significant differences emerged in the mean (p = .86), skewness (p = .60), or kurtosis (p = .70) values of high- versus low-reliability samples. There were, however, significant effects of sample range and variability. As would be expected, samples with higher reliability had larger standard deviations, t(498) = 26.7, p < .001, 95% CI [.14, .17], and wider ranges, t(498) = 15.2, p < .001, 95% CI [.52, .67], than samples with lower reliability.
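A minimal MATLAB sketch of this comparison is given below; it assumes the Statistics Toolbox for ttest2, and the per-iteration summary vectors are illustrative names rather than the original analysis code.

% Minimal sketch: compare sample characteristics of high- vs. low-reliability
% iterations. rIter, sdIter, and rangeIter are assumed vectors holding each
% iteration's split-half reliability, K-score SD, and K-score range.
isHigh = rIter > median(rIter);                        % median split on reliability
[~, pSD,  ciSD]  = ttest2(sdIter(isHigh),    sdIter(~isHigh));     % SD difference
[~, pRng, ciRng] = ttest2(rangeIter(isHigh), rangeIter(~isHigh));  % range difference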
A note for fixed capacity + attention estimates of capacity
So far, we have discussed only the most commonly used methods of estimating WM capacity (K scores and percentages correct). Other methods of estimating capacity have been used, and we briefly mention one of them here. Rouder and colleagues (2008) suggested adding an attentional lapse parameter to estimates of visual WM capacity, a model referred to as fixed capacity + attention. The lapse parameter accounts for trials on which participants are inattentive to the task at hand. Specifically, participants commonly make errors on trials that should be well within capacity limits (e.g., set size 1), and adding a lapse parameter can help explain these anomalous dips in performance. Unlike typical estimates of capacity, in which a K value is computed directly from performance at each set size and then averaged, this model uses a maximum-likelihood estimation technique that estimates a single capacity parameter by simultaneously considering performance across all set sizes and/or change-probability conditions. Critically, this model assumes that data are obtained for at least one subcapacity set size and that any error made on this set size reflects an attentional lapse. If the model is fit to data that lack at least one subcapacity set size (e.g., one or two items), then the model will fit poorly and provide nonsensical parameter estimates.
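To make the structure of this class of model concrete, a simplified sketch of a fixed capacity + attention likelihood is given below; it follows the verbal description above (a capacity parameter k, an attention parameter a, and a guessing rate g), but the function and variable names are illustrative assumptions and this is not Rouder and colleagues’ published code.

% Simplified sketch (not Rouder et al.'s actual code) of a fixed capacity +
% attention likelihood for single-probe change detection. theta = [k, a, g]:
% capacity, probability of attending, and guessing rate. hits/misses and
% fas/crs are response counts per set size in the vector N.
function nll = fixedCapAttnNLL(theta, N, hits, misses, fas, crs)
    k = theta(1); a = theta(2); g = theta(3);
    d = min(k ./ N, 1);                              % P(probed item is in memory)
    pHit = a .* (d + (1 - d) .* g) + (1 - a) .* g;   % P("change" | change trial)
    pFA  = a .* (1 - d) .* g + (1 - a) .* g;         % P("change" | no-change trial)
    nll = -sum(hits .* log(pHit) + misses .* log(1 - pHit) ...
             + fas  .* log(pFA)  + crs    .* log(1 - pFA));
end

In practice, this negative log-likelihood would be minimized simultaneously across all set sizes, which is why at least one subcapacity set size is needed to constrain the attention parameter.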
Recently, Van Snellenberg, Conway, Spicer, Read, and Smith (2014) used the fixed capacity + attention model to calculate capacity for a change detection task, and they found that the reliability of the model’s capacity parameter was low (r = .32) and that it did not correlate with other WM tasks. Critically, however, this study used only relatively high set sizes (4 and 8) and lacked a subcapacity set size, so model fits were likely poor. Using code made available by Rouder et al., we fit a fixed capacity + attention model to our data (Rouder, n.d.). We found that when this model is misapplied (i.e., used on data without at least one subcapacity set size), the internal reliability of the capacity parameter is low (uncorrected r = .35), and the parameter was negatively correlated with raw change detection accuracy, r = −.25, p = .004. If we had applied only this model to our data, we would have mistakenly concluded that change detection measures offer poor reliability and do not correlate with other measures of WM capacity.
Discussion
Here, we have shown that when sufficient numbers of trials and participants are collected, the reliability of change detection capacity is remarkably high (r > .9). On the other hand, a systematic down-sampling method revealed that insufficient trials or insufficient numbers of participants can dramatically reduce the reliability obtained in a single experiment. If researchers hope to measure the correlation between visual WM capacity and some other measure, Figs. 1 and 2 can serve as an approximate guide to the expected reliability. Because we had only a single sample of the largest n (135), we cannot make definitive claims about the reliabilities of future samples of this size. However, given the stabilization of correlation coefficients with large sample sizes and the extremely high correlation coefficient obtained, we can be relatively confident that the reliability estimate for our full sample (n = 135) would not change substantially in future samples of university students. Furthermore, we can make claims about how the reliability of small, well-defined subsamples of this “population” can systematically deviate from an empirical upper bound.
The average capacity obtained for this sample was slightly lower than other values in the literature, which are typically cited as around three or four items. This slightly lower average could raise some concern about the generalizability of these reliability values to future samples. In the present sample, the average K scores for set sizes 4 and 8 were K = 2.3 and 2.0, respectively. The largest comparable sample is the 495-participant sample reported by Fukuda, Woodman, and Vogel (2015), whose task design was nearly identical (150-ms encoding time, 1,000-ms retention interval, no color repetitions allowed, and set sizes 4 and 8); in that sample, the average K scores for set sizes 4 and 8 were K = 2.7 and 2.4, respectively. The difference of 0.3–0.4 items between the two samples is relatively small, though likely significant. However, for the purpose of estimating reliability, the variance of the distribution is more important than the mean. The variability observed in the present sample (SD = 0.7 for set size 4, SD = 0.97 for set size 8) was very similar to that observed in the Fukuda et al. sample (SD = 0.6 for set size 4 and SD = 1.2 for set size 8), although the Fukuda et al. study did not report reliability. Given the nearly identical variability of scores across these two samples, we can infer that our reliability results would generalize to other large samples for which change detection scores have been obtained.
We recommend applying an iterative down-sampling approach to other measures when expediency of task administration is valued but reliability is paramount. The stats-savvy reader may note that the Spearman–Brown prophecy formula also allows one to calculate how many observations must be added to achieve a desired reliability, according to the formula
$$ N = \frac{\rho^{*}_{xx^{\prime}}\left(1-\rho_{xx^{\prime}}\right)}{\rho_{xx^{\prime}}\left(1-\rho^{*}_{xx^{\prime}}\right)} $$
where \( \rho^{*}_{xx^{\prime}} \) is the desired correlation strength, \( \rho_{xx^{\prime}} \) is the observed correlation, and N is the number of times the test length must be multiplied to achieve the desired correlation strength. Critically, however, this formula does not account for the accuracy of the observed correlation. Thus, if one starts from an unreliable correlation coefficient obtained with a small number of participants and trials, one will obtain an unreliable estimate of the number of observations needed to improve the correlation strength. In experiments such as this one, both the number of trials and the number of participants will drastically change estimates of the number of observations needed to reach correlations of a desired strength.
Let’s take an example from our iterative down-sampling procedure. Imagine that we ran 100 experiments, each with 15 participants and 150 total trials of change detection. Doing so, we would obtain 100 different estimates of the strength of the true split-half correlation. We could then apply the Spearman–Brown formula to each of these 100 estimates in order to calculate the number of trials needed to obtain a desired reliability of r = .8. In doing so, we would find that, on average, we would need around 140 trials to obtain the desired reliability. However, because of the large variability in the observed correlation strength (r = .37 to .97), if we had run only the “best case” experiment (r = .97), we would estimate that we need only 18 trials to obtain our desired reliability of r = .8 with 15 participants. On the other hand, if we had run the “worst case” experiment (r = .37), then we would estimate that we need 1,030 trials. There are downsides to both types of estimation errors. Although a pessimistic estimate of the number of trials needed (>1,000) would certainly ensure adequate reliability, this might come at the cost of time and participants’ frustration. Conversely, an overly optimistic estimate of the number of trials needed (<20) would lead to underpowered studies that would waste time and funds.
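The arithmetic of this example can be checked with a few lines of MATLAB; the values below are the illustrative numbers from the example above, not additional data.

% Minimal sketch: Spearman-Brown prophecy calculation of the number of trials
% needed to reach a desired reliability, given an observed split-half
% reliability from a test with tObs trials.
rDesired = 0.80;
tObs = 150;
prophecyTrials = @(rObs) ceil(tObs * rDesired .* (1 - rObs) ./ (rObs .* (1 - rDesired)));
bestCase  = prophecyTrials(0.97);   % on the order of 20 trials when observed r is high
worstCase = prophecyTrials(0.37);   % on the order of 1,000 trials when observed r is low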
Finally, we investigated an alternative parameterization of capacity based on a model that assumes a fixed capacity and an attention lapse parameter (Rouder et al., 2008). Critically, this model attempts to explain errors for set sizes that are well within capacity limits (e.g., one item). If researchers inappropriately apply this model to change detection data with only large set sizes, they would erroneously conclude that change detection tasks yield poor reliability and fail to correlate with other estimates of capacity (e.g., Van Snellenberg et al., 2014).
In Experiment 2, we shifted our focus to the stability of change detection estimates. That is, how consistent are estimates of capacity from day to day? We collected an unprecedented number of sessions of change detection performance (31) spanning 60 days. We examined the stability of capacity estimates, defined as the correlation between individuals’ capacity estimates from one day to the next. Since capacity is thought to be a stable trait of the individual, we predicted that individual differences in capacity should be reliable across many testing sessions.