We conducted three experiments with the same core experimental design and procedure. The only differences across the experiments were that (1) the similarity of competitive pairmates increased slightly from Experiment 1 to Experiment 2 to Experiment 3, and (2) the minimum number of learning rounds was increased from Experiment 1 to Experiments 2 and 3 to account for the greater similarity/difficulty. Analyses and predictions for Experiment 3 were preregistered (https://osf.io/s2gnq) after analyzing data from Experiments 1 and 2. Thus, analyses are first reported for Experiments 1 and 2 and then, separately, for Experiment 3 (to test for replication). Exploratory analyses that combined data across experiments are also reported.
Participants were undergraduate students from the University of Oregon who received course credit for participation. A total of 40 participants were recruited for Experiment 1. Four participants were excluded from analyses due to technical/procedural errors (see preregistration for full exclusion criteria: https://osf.io/s2gnq), resulting in a sample of 36 participants (Mage = 19.11 ± 1.65 years, range 18–25 years; 25 females). We sought a similar sample size in Experiment 2 and recruited 41 participants (Mage = 20.49 ± 2.47 years, range 18–28 years; 28 females); no participants were excluded for technical/procedural errors. Based on the effect sizes in Experiments 1 and 2 and corresponding power analyses, we recruited a sample of 60 participants for a preregistered Experiment 3 (see https://osf.io/s2gnq). Three participants were excluded for technical/procedural errors, resulting in a sample of 57 participants (Mage = 19.00 ± 2.41 years, range 18–22 years; 40 females). Each experiment involved a single session for each participant that lasted 90–120 min. Informed consent was obtained in accordance with procedures approved by the University of Oregon Institutional Review Board. All participants who were not excluded due to technical/procedural errors were included in our analyses of associative memory test performance (see Procedure). Inclusion in all subsequent analyses was based on a set of performance-based exclusion criteria (see Performance-based exclusion criteria).
For each participant and each experiment, the same set of 12 cue words was used (farmer, dentist, lawyer, teacher, chef, tailor, plumber, actor, artist, surgeon, judge, barber). Each cue word was assigned to a unique face, with the assignment randomized for each participant. All of the cue words referred to professions, consisted of one or two syllables, and were displayed in white capital letters.
Face images appeared in color with a uniform ellipse shape with a horizontal radius of 81 pixels and a vertical radius of 120 pixels. For all experiments, face images were generated from a set of eight base faces. The base faces were derived from a separate experimental procedure in which participants sorted a corpus of 1,008 faces into “families” based on subjective assessment of the likelihood that faces were genetically related. Clustering algorithms were applied to the sorting responses to identify distinct clusters (families). Each of the eight base faces represents the mean face from a cluster, normalized for features not relevant to the grouping (see https://osf.io/6cew9/ for full details of stimulus-generation methods). Critically, because of the way in which the eight base faces were generated, the base faces were distinct from each other according to characteristics that were orthogonal to the dimensions of affect and gender (which were the dimensions manipulated in the current experiments).
For each participant in each experiment, half of the base faces (four) were assigned to a competitive condition and half (four) were assigned to a non-competitive condition. The assignment of base faces to conditions was randomized for each participant. Base faces were manipulated along two dimensions – affect and gender – in order to generate the specific faces that participants studied (studied faces). For the four base faces assigned to the competitive condition, we created pairmates by generating two studied faces from each base face, with the common base being the source of competition. For the four faces assigned to the non-competitive condition, each base face was manipulated to generate a single studied face. Thus, a total of 12 studied faces were generated and used for each experiment.
For each experiment, each studied face was manipulated to fall into one of four locations in a two x two (affect x gender) space. That is, within each experiment, each studied face had one of two affect values and one of two gender values. To manipulate these dimensions, we collected subjective affect and gender ratings for all of the 1,008 faces in the corpus (see https://osf.io/znc58/) and then used regression analyses to learn the mapping between the gender and affect ratings and face image parameters (739 parameters in total) derived from an Active Appearance Model (AAM) (Chang & Tsao, 2017; Cootes et al., 2001; Edwards et al., 1998). Thus, the regression weights allowed for different affect and gender values to be translated to the 739-parameter feature space to manipulate the base faces. In order to maximize the independence of the affect and gender dimensions, for each of the AAM parameters, the dimension (affect or gender) with the highest magnitude regression weight was retained and the regression weight for the other dimension was set to 0. Thus, each face dimension (affect, gender) was associated with a distinct set of AAM parameters.
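The weight-sparsification step described above can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the `ratings` and `aam_params` arrays are random stand-ins for the real rating data and AAM parameters, and the least-squares fit stands in for whatever regression implementation was actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
n_faces, n_params = 1008, 739

# Hypothetical stand-ins for the real data: per-face affect/gender
# ratings and AAM parameters derived from the face corpus.
ratings = rng.normal(size=(n_faces, 2))           # columns: affect, gender
aam_params = rng.normal(size=(n_faces, n_params))

# Learn a linear mapping from (affect, gender) ratings to each AAM
# parameter via least squares; weights has shape (2, n_params).
weights, *_ = np.linalg.lstsq(ratings, aam_params, rcond=None)

# For each AAM parameter, retain only the dimension (affect or gender)
# with the larger-magnitude regression weight; zero out the other.
sparse = weights.copy()
keep = np.argmax(np.abs(weights), axis=0)         # 0 = affect, 1 = gender
sparse[0, keep == 1] = 0.0
sparse[1, keep == 0] = 0.0
# Each AAM parameter is now driven by at most one dimension, so affect
# and gender manipulations act on disjoint sets of parameters.
```

Applying a desired (affect, gender) offset then reduces to multiplying that 2-vector by `sparse` and adding the result to a base face's 739-dimensional parameter vector.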
For the non-competitive condition, the four studied faces corresponded to the four locations in affect-gender space (one face per location), with the assignment of base faces to locations randomly determined for each participant. For the competitive condition, the eight studied faces again corresponded to the four locations in affect-gender space (two faces per location), with the assignment of base faces to locations randomly determined for each participant. Critically, the eight faces in the competitive condition included four sets of pairmates. For two of those sets, the pairmates within each set differed on affect and were matched on gender (i.e., diagnostic dimension = affect, non-diagnostic dimension = gender). For the other two sets, the pairmates differed on gender and were matched on affect (i.e., diagnostic dimension = gender, non-diagnostic dimension = affect) (see Fig. 1a). For the sets of pairmates that shared the same diagnostic dimension, each set corresponded to a different value on the non-diagnostic dimension, but the pairmates within each set had the same value on the non-diagnostic dimension. For example, for the two sets of pairmates for which gender was the diagnostic dimension, each set of pairmates would have a different value on the affect dimension, but the pairmates within each set would have the same value on the affect dimension.
For Experiment 1, the difference between competitive faces along the diagnostic dimension was determined based on subjective assessment of the authors and initial pilot data. The goal was for the differences to be very subtle, yet learnable (see Fig. 1a for examples). Note: The units for these differences were not meaningful and are therefore not reported. For Experiment 2, the difference between competitive pairs was reduced by 25% relative to Experiment 1 in order to slightly increase the difficulty/interference. This was motivated by evidence that repulsion is more likely to occur when discrimination is relatively more difficult (Chanales et al., 2021). For Experiment 3, the difference between competitive pairs on the gender dimension was the same as in Experiment 2, but the difference on the affect dimension was reduced by 50% relative to Experiment 1. This was motivated by evidence, from Experiment 2, that interference was somewhat lower along the affect dimension compared to the gender dimension. Note that since the differences between competitive pairs in Experiment 1 were quite small to begin with, the changes across experiments were subtle. For additional consideration of differences between affect versus gender across experiments, see Fig. S1 in the Online Supplemental Material (OSM).
Within each experiment, the difference between competing faces (pairmates) on the diagnostic dimension is described in relative terms (scaled units), with each face being 1 unit from the center of face space and, therefore, 2 units from each other. All faces were also exactly 1 unit away from the affect and gender borders in the response window (see Reconstruction phase, below). Analyses of face memory from the reconstruction phase were performed based on the distance, in units, between participants' responses and the actual locations of the studied faces.
Each experiment consisted of two main phases: a learning phase and a reconstruction phase. The purpose of the learning phase was for participants to extensively study and practice remembering the cue-face associations. The reconstruction phase served as the critical memory test for measuring bias and precision in face memory. All experiments were run in Matlab, using the Psychophysics Toolbox extensions (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997). All phases of the experiment had a gray background.
The learning phase consisted of up to 12 rounds, with each round split into two sub-rounds. Each sub-round included three blocks corresponding to the following experimental tasks, in the following order: study, recall, and associative memory test (Fig. 1b), with the exception that rounds one and two did not include the recall task. For each participant and each round of the learning phase, the 12 associations were randomly split into two groups of six associations each (four competitive, two non-competitive), with each group of six associations assigned to a separate sub-round. In other words, in each round of the learning phase, half of the associations went through study/recall/associative memory test and then the other half of the associations went through study/recall/associative memory test (with the exception, as noted above, that rounds one and two did not include the recall task). The rationale for splitting the associations into two sub-rounds was to facilitate learning by reducing the amount of information per block.
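The random split into sub-rounds might be sketched as follows. This is illustrative Python (the experiment itself was run in Matlab), and the association labels are hypothetical.

```python
import random

def split_subrounds(competitive, noncompetitive, rng):
    """Randomly split the 12 cue-face associations into two sub-rounds of
    six (four competitive + two non-competitive each); the split is
    re-drawn at the start of every learning round."""
    comp, noncomp = competitive[:], noncompetitive[:]
    rng.shuffle(comp)
    rng.shuffle(noncomp)
    return comp[:4] + noncomp[:2], comp[4:] + noncomp[2:]

# Hypothetical labels: C1-C8 competitive, N1-N4 non-competitive.
competitive = [f"C{i}" for i in range(1, 9)]
noncompetitive = [f"N{i}" for i in range(1, 5)]
sub1, sub2 = split_subrounds(competitive, noncompetitive, random.Random(0))
```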
In the study task, participants viewed and studied the cue-face pairings. On each trial (2,000 ms), a cue appeared directly above a face image. In between trials, there was a fixation cross for 200 ms. Participants were instructed to study the cue-face pairings; no response was made. In the recall task, participants attempted to recall the face associated with each cue. On each trial, a cue was presented above a blank ellipse (representing the to-be-recalled face) for 2,500 ms. Participants were instructed to recall the associated face image as vividly as possible. Although no response was made, the correct face would then appear below the cue for 1,000 ms as a way of providing feedback. In between trials, there was a 200-ms fixation cross. In the associative memory test, participants attempted to match face images with corresponding cue words. On each trial, a face image was presented for 2,000 ms and was then replaced by a set of six different cue words displayed in the bottom half of the screen (three cues in each of two rows with the position randomly determined for each trial). The cue words included all of the cues from the current sub-round. For faces in the competitive condition, the set of cues included the correct answer (target), the cue that had been paired with the current face’s pairmate (interference error), and four cues that had been paired with the other, unrelated faces (lures). For faces in the non-competitive condition, the set of cues included the correct answer (target) and five cues that had been paired with unrelated faces (lures). Participants made responses by clicking on the cue word with the mouse. After each response was registered, feedback indicated whether the response was correct (“Correct!”; 500 ms) or incorrect with the correct cue indicated (e.g., “Incorrect. This is the BARBER.”; 2,000 ms).
During the first two rounds of the learning phase, each study block presented each cue-face association three times. In subsequent rounds, each association was studied once per block. As noted above, there was no recall task in the first two rounds of the learning phase. In subsequent rounds, each association was recalled twice per recall block. Across all rounds of the learning phase, each association was tested three times per associative test block. For each task block (study/recall/associative test), the order in which each association was presented/tested was pseudo-randomly determined, with the following constraints: (1) all of the associations in each block were studied/presented once before any were repeated, (2) a given association was never presented/tested consecutively, (3) competing associations (face pairmates) were never presented/tested in consecutive trials. These constraints helped ensure that any comparisons between stimuli/associations were memory-based.
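The constrained pseudo-random ordering can be implemented by rejection sampling, as in the following illustrative Python sketch (the item labels and `pairmate` map are hypothetical; the actual experiment code was written in Matlab).

```python
import random

def block_order(items, pairmate, reps, rng, max_tries=10_000):
    """Build a block order in which every item appears once per repetition
    cycle before any repeats, no item appears on consecutive trials, and
    pairmates never appear on consecutive trials (rejection sampling).
    `pairmate` maps each item to its competitor, if it has one."""
    for _ in range(max_tries):
        order = []
        for _ in range(reps):
            cycle = items[:]
            rng.shuffle(cycle)
            order.extend(cycle)
        if all(b != a and b != pairmate.get(a)
               for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("no valid order found")

# Hypothetical sub-round: two competitive pairs plus two non-competitive
# associations, each tested three times per associative test block.
items = ["A1", "A2", "B1", "B2", "N1", "N2"]
pairmate = {"A1": "A2", "A2": "A1", "B1": "B2", "B2": "B1"}
order = block_order(items, pairmate, reps=3, rng=random.Random(0))
```

Concatenating full shuffles of the item set guarantees constraint (1); the adjacency check enforces constraints (2) and (3) both within and across repetition cycles.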
In Experiment 1, participants repeated the learning phase for at least nine rounds and until they reached 100% accuracy on the associative memory test, up to a maximum of 12 rounds. Most participants had reached perfect accuracy after nine rounds (24/36), and nearly all did so after ten rounds (31/36). Only two participants went through all 12 rounds, with one achieving perfect performance and the other being removed for continued poor performance (see below for performance-based exclusion criteria). In Experiments 2 and 3, all participants completed 12 rounds of the learning phase regardless of associative memory test performance. For each experiment, participants were given the opportunity to take a break after every two rounds, with the length of the break determined by the participant. Participants were instructed to press the space bar when ready to proceed.
After the learning phase, participants’ memories for the features of the faces were probed with a surprise reconstruction task (Fig. 1c). On each trial in the reconstruction task, participants were first shown a cue (e.g., “What does the BARBER look like?”) above a blank ellipse for 2,500 ms and were instructed to bring the target face to mind. Next, an altered version of the target face appeared in the ellipse with a response box beneath the face representing the search space (see Reconstruction search space, below, for details). Participants used a mouse to click through the box; the face image above the box changed according to the location of each mouse click in the box. Although participants were not explicitly made aware of this, the box represented a two-dimensional affect-gender space. Participants were instructed to continue searching (clicking through the box) until the face matched their memory for the target face. Participants finalized their response by pressing the space bar. There was no limit on the response time. A fixation cross appeared for 200 ms between trials. Each of the 12 studied faces was probed (reconstructed) a total of four times in the reconstruction phase (48 trials total). The rationale for probing faces multiple times was so that the precision (variability) of reconstructions for each face could be measured. Faces were reconstructed in a pseudo-random block order. In each of four consecutive blocks (with no break or demarcation between blocks), each of the 12 faces was reconstructed once. As in the learning phase, the same face was never tested consecutively and pairmate faces were never tested in consecutive trials. After the reconstruction phase, there was a short phase where participants were prompted to provide a rating on a 9-point scale for both affect and gender for each stimulus. Results from this task (which was only included for validation) are not described here.
Reconstruction search space
In the reconstruction task, the altered face presented on each trial was derived from the same base face as the target face, but the affect and gender values were randomly selected from a range of possible values. This range of possible values corresponded to the size of the two-dimensional search space (i.e., the size of the response box). Importantly, the range of the search space and the center of the search space were identical across all trials, but the mapping of the dimensions to the x and y axes (e.g., x axis = affect, y axis = gender) and the direction/orientation of the axes (e.g., left = low, right = high) were randomly varied for each trial so that participants would not learn to associate a given face with a fixed spatial position in the response box. For each experiment, the size of the search space relative to the distance between pairmate faces was identical. That is, for each experiment the height and width of the search space was exactly twice the distance between pairmate faces on the diagnostic dimension. Thus, with pairmate faces 2 units apart (in our standardized units), the height and width of the search space was 4 units. For each trial, the location of the correct answer (target face) and the location of the pairmate face (for faces in the competitive condition) always corresponded to one of four possible locations (the center of each quadrant) with all four of those locations contained in the search space (see Fig. 1a).
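The per-trial randomization of axis assignment and direction might look like the following Python sketch. The coordinate conventions (box coordinates normalized to 0-1, face-space units of -2 to 2) are assumptions for illustration, not the authors' implementation.

```python
import random

def trial_mapping(rng):
    """Per-trial random assignment of affect/gender to the x/y axes of
    the response box, each axis with a random direction."""
    dims = ["affect", "gender"]
    rng.shuffle(dims)
    flips = [rng.choice([-1, 1]), rng.choice([-1, 1])]
    return dims, flips

def click_to_face(x, y, dims, flips, box=4.0):
    """Convert a click in normalized box coordinates (0..1 per axis) to
    face-space values in units (-2..2), undoing the trial's mapping.
    The box spans 4 units: twice the 2-unit pairmate separation."""
    vals = {}
    for coord, dim, flip in zip((x, y), dims, flips):
        vals[dim] = flip * (coord - 0.5) * box
    return vals["affect"], vals["gender"]

dims, flips = trial_mapping(random.Random(1))
# A click in the exact center of the box maps to the center of face
# space regardless of the trial's random axis mapping.
affect, gender = click_to_face(0.5, 0.5, dims, flips)
```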
Performance-based exclusion criteria
For analyses that involved the reconstruction task data, we excluded a small number of participants based on performance during rounds 9–12 of the associative memory test. Participants were excluded if (a) their error rate for non-competitive trials was greater than 20% for any of these rounds or (b) they selected the lure faces on greater than 20% of the competitive trials for any of these rounds. Based on these criteria, one participant was excluded from analysis of the reconstruction task data in Experiment 1 (yielding N = 35), four were excluded from Experiment 2 (yielding N = 37), and eight were excluded from Experiment 3 (yielding N = 49) (see https://osf.io/dj6q2/ for other exclusion criteria that were established but did not apply). The rationale for having a high threshold for inclusion of participants in the reconstruction task analysis was to minimize cases where participants reconstructed an entirely wrong face and to instead focus on bias/precision in otherwise correctly remembered faces.
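The exclusion rule reduces to a simple per-round filter. In this sketch the dictionary keys are hypothetical stand-ins for per-round error rates from rounds 9-12.

```python
def include_participant(rounds):
    """rounds: one dict per round (rounds 9-12) with hypothetical keys
    'noncomp_error' (error rate on non-competitive trials) and
    'comp_lure_rate' (lure-selection rate on competitive trials).
    A participant is excluded if either rate exceeds 20% in any round."""
    return all(r["noncomp_error"] <= 0.20 and r["comp_lure_rate"] <= 0.20
               for r in rounds)

good = [{"noncomp_error": 0.00, "comp_lure_rate": 0.10}] * 4
bad = good[:3] + [{"noncomp_error": 0.30, "comp_lure_rate": 0.10}]
```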
Measuring associative memory
As noted above, the associative memory test was used to confirm that participants achieved high accuracy in associating cues with faces. The associative memory test also allowed for a manipulation check of whether the competitive condition induced interference (lower associative memory accuracy) compared to the non-competitive condition. Data from the associative memory test were first analyzed in terms of accuracy on competitive compared to non-competitive trials. We ran a separate repeated-measures ANOVA for each experiment with factors of condition (competitive, non-competitive) and learning round (1–9 for Experiment 1, 1–12 for Experiments 2 and 3). For competitive trials, we also separated errors according to whether they were attributable to competition (interference errors) or not (lures). If errors were random, interference errors would occur on one-fifth (20%) of error trials. To test whether interference errors occurred at above-chance levels, we therefore ran one-sample t-tests for each experiment, comparing the mean percentage of interference errors (across all learning rounds) to 20%.
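The chance-level comparison amounts to a one-sample t-test against 20%. The following computes the t statistic from scratch for hypothetical per-participant interference-error percentages; the reported analyses were not run with this code.

```python
import math
from statistics import mean, stdev

# Hypothetical per-participant percentages of competitive-trial errors
# that were interference errors (chance = 20%, i.e., one of the five
# incorrect options is the pairmate's cue).
interference_pct = [35.0, 42.0, 28.0, 51.0, 38.0, 45.0, 30.0, 40.0]
chance = 20.0

# One-sample t statistic: (sample mean - chance) / standard error.
n = len(interference_pct)
t = (mean(interference_pct) - chance) / (stdev(interference_pct) / math.sqrt(n))
```

With 8 participants (df = 7), any t above roughly 2.36 would be significant at the two-tailed .05 level.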
As described above, on each trial in the reconstruction task the target face was located in one of four locations (the center of the four quadrants). Thus, for both the x and y axes of the search space, the target was half-way between the center and the border of the search space (Fig. 1a). To measure for potential bias, for each experiment all responses were aligned onto a common axis and rescaled onto a common scale, separately for each feature dimension (affect, gender). For the rescaled data, the range of possible responses for each dimension was -2 to 2, with 0 being the center of the face space (i.e., the center of the search space). For the competitive condition, the location of the target face on the diagnostic dimension = 1 and the location of the pairmate face = -1 (Fig. 1c). Thus, a bias away from the pairmate face would be represented by values greater than 1, whereas a bias toward the pairmate face (or toward the center of face space) would be represented by values lower than 1. For the non-diagnostic dimension, the location of the target face and the pairmate face = 1. Although faces from the non-competitive condition were included in the reconstruction task, bias was not measured for these faces because the distinction between diagnostic versus non-diagnostic dimensions did not exist. Rather, non-competitive faces were of critical importance in the associative memory test, where they served to establish an overall memory interference effect.
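The alignment step (flipping each raw axis so the target always sits at +1 and the pairmate at -1) can be sketched as follows; this is a minimal illustration assuming responses are already centered on the middle of the search space and expressed in units, which simplifies the authors' full rescaling pipeline.

```python
def align_response(raw, target_raw, pairmate_raw):
    """Map a raw diagnostic-dimension response onto the common scale on
    which the target sits at +1 and the pairmate at -1 (center of face
    space = 0). If the target lies on the negative side of the raw
    axis, mirror the response so that 'away from the pairmate' is
    always represented by values greater than +1."""
    sign = 1.0 if target_raw > pairmate_raw else -1.0
    return sign * raw

# Target at -1 on the raw axis, pairmate at +1: a raw response of -1.4
# (beyond the target, away from the pairmate) aligns to +1.4.
aligned = align_response(-1.4, target_raw=-1.0, pairmate_raw=1.0)
```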
It is important to note that, for the reconstruction task, the response range on each trial was asymmetrically distributed around the target. If the response range had been symmetrically distributed around the target, then the correct response on each trial would have, by definition, been the center of the search space – which likely would have led participants to learn to simply respond in the center. However, the drawback of the approach we used is that, for the diagnostic dimension in the competitive condition, there was more opportunity to respond toward the pairmate face (values between -2 and 1) than away from the pairmate face (values of 1 to 2). Of course, this asymmetry works against our predicted effect of repulsion (values greater than 1). Nonetheless, in order to account for the asymmetrically restricted response range, we estimated the true mean by fitting truncated normal distributions to the data. For each participant, separate models were run for the diagnostic and non-diagnostic dimensions, with each model pooling data across faces and feature dimensions (affect, gender) in order to include a sufficient number of data points. Thus, each model included 32 data points (eight faces in the competitive condition × four reconstruction trials per face). Maximum-likelihood estimation was used to find the mean and standard deviation of a truncated normal distribution that best fit the data. The distributions were modeled using the truncnorm and MASS packages in R. We constrained the search space of the mean to a range of plausible values evenly balanced on either side of the target (± 1 unit) and constrained the standard deviation to be a maximum of 1 and a minimum of .1. Although we view the modeled means as a better estimate of the true means, there are some sources of variance that the models do not account for. For example, the models do not account for potentially unique distributions for each feature dimension and/or stimulus.
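A minimal version of the truncated-normal MLE might look like the sketch below. It uses a plain-Python grid search rather than the truncnorm/MASS implementation in R used for the reported results, and the response data are hypothetical; the mean is constrained to target ± 1 unit and the SD to [.1, 1], as in the reported models.

```python
import math

LO, HI = -2.0, 2.0  # truncation bounds: the edges of the search space

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def trunc_nll(data, mu, sigma):
    """Negative log-likelihood of a normal(mu, sigma) truncated to
    [LO, HI] (additive constants dropped)."""
    z = norm_cdf((HI - mu) / sigma) - norm_cdf((LO - mu) / sigma)
    return sum(0.5 * ((x - mu) / sigma) ** 2 + math.log(sigma) + math.log(z)
               for x in data)

def fit_truncnorm(data, mu_lo=0.0, mu_hi=2.0, sd_lo=0.1, sd_hi=1.0, steps=200):
    """Grid-search MLE for the mean and SD under the stated constraints."""
    best = (float("inf"), None, None)
    for i in range(steps + 1):
        mu = mu_lo + (mu_hi - mu_lo) * i / steps
        for j in range(steps + 1):
            sd = sd_lo + (sd_hi - sd_lo) * j / steps
            nll = trunc_nll(data, mu, sd)
            if nll < best[0]:
                best = (nll, mu, sd)
    return best[1], best[2]

# Hypothetical responses clustered just beyond the target (at +1),
# i.e., showing slight repulsion away from the pairmate.
data = [1.3, 1.1, 0.9, 1.5, 1.2, 1.4, 1.0, 1.6]
mu_hat, sd_hat = fit_truncnorm(data)
```

Because these data sit well inside the truncation bounds, the fitted mean lands close to the sample mean (1.25); the correction matters more when responses pile up near a border of the search space.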
Furthermore, there is evidence that there may be inherent, global biases in how face features are later recalled (Bülthoff & Zhao, 2020; Won et al., 2020). Critically, however, any global biases would equally influence the diagnostic and non-diagnostic dimensions. Therefore, our analysis primarily focused on differences in modeled means for the diagnostic versus non-diagnostic dimensions.
In order to measure the precision with which diagnostic and non-diagnostic features were remembered for each face, we calculated the standard deviation of responses across the four reconstruction trials for each face, separately for the diagnostic and non-diagnostic feature dimensions. We then computed the mean of these standard deviation values for each participant, separately for the diagnostic and non-diagnostic dimensions.
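This precision measure is straightforward to compute. In the sketch below the responses are hypothetical, and the use of the sample SD is an assumption (the estimator is not specified above).

```python
from statistics import mean, stdev

# Hypothetical diagnostic-dimension responses: four reconstruction
# trials per face for one participant.
responses = {
    "face1": [1.2, 1.0, 1.3, 1.1],
    "face2": [0.8, 1.4, 0.9, 1.5],
}
# Precision = SD across the four reconstruction trials for each face,
# averaged across faces; lower values indicate more precise memory.
per_face_sd = [stdev(r) for r in responses.values()]
precision = mean(per_face_sd)
```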
Measuring the relationship between reconstruction bias and associative interference
In order to determine whether bias on the diagnostic feature dimension plays an adaptive role in reducing memory interference, we ran a series of mixed-effects models that focused on the relationship between bias measured during the reconstruction task and accuracy on the associative memory test (averaged across the last four rounds in order to capture the end state of learning). Although this analysis was performed at the level of individual items (faces), the accuracy value for each face was defined as the average accuracy for that face and its pairmate. As such, both pairmates within each set had the same accuracy value. The rationale for averaging accuracy across pairmates was that if, for example, participants associate two competing faces (pairmates) with the same cue word (profession), rather than treating one of these associations as “correct” and the other as “incorrect,” it is more appropriate for the error to be shared across the two faces.
For the analyses relating reconstruction bias to associative memory accuracy, we excluded participants who had perfect accuracy, across all trials, on the final four rounds of the associative memory test. The rationale for this exclusion was that, for these participants, there was no variance in associative memory for the model to explain. Additionally, we did not run this analysis for Experiment 1 given the near-ceiling performance on the associative memory test over the last four rounds (11 participants (31%) had 100% accuracy, and the remaining participants had a mean accuracy of 95.96 ± 3.01%, with an average within-participant SD of 3.62 ± 1.70). For Experiments 2 and 3 – which used more similar pairmates – associative memory accuracy was lower and, therefore, fewer participants were excluded due to ceiling performance (seven participants (19%) in Exp. 2 and six participants (12%) in Exp. 3; mean accuracy for the remaining participants, Exp. 2: M = 92.47 ± 7.58%, Exp. 3: M = 93.56 ± 6.26%).
For these models, it was critical to compute reconstruction bias at the level of individual faces. However, the method described above of estimating the average bias for each participant by pooling across trials/faces was not feasible for this analysis given the small number of observations (four trials per face). Thus, for this analysis we simply used the mean of the reconstruction response (across the four trials per face). In order to address the concern that any observed relationship between reconstruction bias and associative memory accuracy might be driven by potential “swap errors,” our preregistered approach was to exclude any individual responses (trials) for which the scaled response fell between -2 and 0, retaining only responses between 0 and 2. For the diagnostic dimension, any responses that were closer to the competing pairmate than to the target were therefore excluded. All remaining responses were included in the mean response for each face. If a face was associated with an excluded response on all four reconstruction trials (a rare occurrence), that face was entirely excluded from analysis. For Experiment 2, this occurred for a total of four faces distributed across four participants; for Experiment 3, this occurred for a total of six faces distributed across six participants. While this preregistered approach for exclusion of potential swap errors was intended as a conservative approach for eliminating the influence of extreme errors, all of our main results remained significant when no responses were excluded. Additionally, in exploratory analyses that combined data across Experiments 2 and 3, instead of excluding extreme responses altogether, responses between -2 and 0 were capped at a value of 0, which allowed for all trials to be retained in the model, but reduced the influence of extreme responses.
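The two treatments of potential swap errors (preregistered exclusion vs. exploratory capping at 0) can be sketched as follows; the response values are hypothetical, and the handling of a response at exactly 0 is an assumption.

```python
def clean_responses(trials, mode="exclude"):
    """Handle potential swap errors among scaled responses in [-2, 2].
    'exclude' drops responses below 0 (closer to the pairmate than to
    the target); 'cap' clips them to 0 so every trial is retained.
    Returns the per-face mean, or None if no responses remain (the face
    is then dropped from the analysis entirely)."""
    if mode == "exclude":
        kept = [r for r in trials if r >= 0]
    else:  # mode == "cap"
        kept = [max(r, 0.0) for r in trials]
    return sum(kept) / len(kept) if kept else None

# One hypothetical face with a single swap-like response (-0.5):
face_mean = clean_responses([1.2, -0.5, 0.8, 1.0], mode="exclude")
capped_mean = clean_responses([1.2, -0.5, 0.8, 1.0], mode="cap")
```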
Mixed-effects models were implemented in R using the lme4 package (Bates et al., 2014). Likelihood ratio tests were used to compare models with relevant variables to null models that excluded those variables. In order to account for potential differences related to whether the diagnostic dimension was affect versus gender, all models included this categorical variable as a fixed effect. In order to allow the relationship between reconstruction bias and associative memory accuracy to vary for each participant, we modeled the relationship between bias and associative memory accuracy with random intercepts and random slopes for each participant, where possible. Our preregistered approach to dealing with models that failed to converge or that reached a singular fit was to rerun the same model with the random slope for bias removed (see Barr et al., 2013). While all of our preregistered models did converge, an exploratory model that used the difference in bias on the diagnostic versus non-diagnostic dimension as a predictor failed to converge when a random slope was included; thus, we removed the random slope. Exploratory models that included only unsigned error or precision as predictors (without bias) failed to converge when random slopes were included for these variables; thus, we removed random slopes for these variables. Finally, exploratory models that included bias along with precision and unsigned error as predictors also failed to converge when random slopes were included for all variables; when removing random slopes, we prioritized retaining a random slope for bias, which led to the exclusion of random slopes for precision and unsigned error.