The distinction between episodic and semantic memory

In his seminal chapter "Episodic and Semantic Memory," Tulving (1972) proposed two distinct stores of long-term memory—one episodic and one semantic. Whereas episodic memory involves memory for particular episodes in time (e.g., remembering that the word “apple” was presented on an earlier study list), semantic memory is more generic in nature; the representations have been abstracted away from the original episodes that formed them (e.g., knowing that an apple is a fruit and what qualities it possesses).

The episodic-semantic distinction inspired much research searching for differences between the two (Gardiner, 2001; Landauer & Dumais, 1997; Plaut, 1995; Posner & Keele, 1968; Wheeler et al., 1997) and developing separate memory models for each (e.g., Dennis & Humphreys, 2001; Smith et al., 1974). However, the extent to which episodic and semantic memory are dissociable (e.g., Devitt et al., 2017) versus closely intertwined (e.g., Bellmund, 2020; Park et al., 2020; Rajah & McIntosh, 2005) is now debatable, and a growing body of literature suggests that episodic and semantic memory are more overlapping and interdependent than originally proposed (e.g., Duff et al., 2020; Graham et al., 2000; Greenberg & Verfaellie, 2010; Irish & Vatansever, 2020; Keane et al., 2020; Kumar, 2021; McRae & Jones, 2013; Renoult et al., 2019; Solomon & Schapiro, 2020; Yee et al., 2017).

The tabula rasa assumption in episodic recognition memory simulations

Despite the burgeoning shift toward a more overlapping view of episodic and semantic memory, in practice, there often remains a separation of the two in episodic recognition memory simulations of processes theorized to take place in list-learning recognition paradigms (e.g., Arndt & Hirshman, 1998; Clark & Gronlund, 1996; Hicks & Starns, 2006; Hintzman, 1988; McDowd & Murdock, 1986; Rotello, 2017; Wixted & Mickes, 2010). The issue in recognition memory simulations is similar to the fallacy of the tabula rasa discussed by Hintzman (2011, p. 257), whereby he questions the oft-made implicit assumption that when a study list is presented to a participant, each item is being laid down in memory for the first time, as if onto a blank slate. Though Hintzman focuses on multiple-list designs in which each new study-test phase is assumed to be starting anew with no influence of the prior study-test phase, a similar issue is apparent in recognition memory simulations.

In recognition memory research, participants must discriminate between studied (old) items and unstudied (new) items on a test. Computational models usually assume that this happens through various manifestations of signal detection theory (Arndt & Hirshman, 1998; Clark & Gronlund, 1996; Hintzman, 1988; Rotello, 2017). For example, in global matching models (see Clark & Gronlund, 1996, for a review), a familiarity signal is computed as a function of the degree of match between the features present in the recognition test item and the features present in the memory traces that were encoded at study. Test items with a high degree of feature match with one or more memory traces will tend to elicit a higher familiarity signal than those with a lower degree of match. Old–new discrimination emerges through criterion placement; test items that elicit a familiarity signal that falls above the criterion are called “old” while those that do not are called “new.”

In most recognition memory simulations, there is an implicit tabula rasa assumption in that typically, no memory traces are present prior to laying down memory traces from the study list (e.g., Arndt & Hirshman, 1998; Clark & Gronlund, 1996; Hicks & Starns, 2006; Hintzman, 1988; McDowd & Murdock, 1986). Thus, test probes meant to correspond to “new” items typically correspond to zero items in the memory store (e.g., see Hintzman, 1988, Fig. 4, p. 531). Similarly, it is common in simulations to fix the mean of the new distribution at zero. For example, Spanton and Berry (2020) state, “The strength of old (studied) or new (unstudied) items are represented as two separate normal (Gaussian) distributions, with the mean of the old item distribution (mo) being typically greater than that of the new item distribution (mn: typically fixed at 0)” (p. 1). Other examples can be found in Wixted and Mickes (2010, p. 1035) and Rotello (2017, p. 202).

As many recognition memory simulation studies are primarily focused on a narrow theoretical question, the implicit tabula rasa assumption does not hinder the points being made in them. However, a full understanding of how the feature-based computation of a familiarity signal operates in the human mind in real-world settings will require a firmer grasp of how existing knowledge is involved. Because unstudied items likely still have representations in the participant’s general knowledge base (e.g., words themselves and their meaning, orthography, phonology), scientific progress will be made by considering how preexperimental knowledge representations might play a role in the mathematical computation of the familiarity signal, and how this might affect old–new discriminability as well as the variances of the old and new distributions in list-learning recognition paradigms.

The present study aims to take a step toward achieving this. Here, we focus on a unique and likely familiarity-based recognition phenomenon that involves preexisting knowledge in its operation—the semantic-feature-based recognition without recall phenomenon—and demonstrate how the MINERVA 2 model can be used to simulate the interaction between preexisting knowledge (semantic memory) and episodic familiarity in list-learning paradigms in a way that helps to mechanistically account for the phenomenon. In so doing, our simulations raise important considerations for future simulation work.

The recognition without recall phenomenon: Familiarity-detection during recall failure

The semantic-feature-based recognition without recall phenomenon that is the focus of the present study is a variant of the more general recognition without recall phenomenon (Cleary, 2004; Cleary et al., 2016; Jia et al., 2020; Ryals & Cleary, 2012). In studies of the recognition without recall phenomenon, participants study a list of items (e.g., PITCHFORK, TRANSPARENT) and then receive a test list of cues, half of which resemble studied items on some dimension (e.g., PATCHWORK, TRENSORENT) and half of which do not (e.g., ERLIGANT, FRONNEL). The task is to attempt to use each cue to recall a studied item and, even in cases of cued recall failure, to rate how familiar the cue seems on a scale of zero (completely unfamiliar) to 10 (extremely familiar). Recognition without recall is the finding that, among cues that fail to elicit recall, participants discriminate between cues that resemble studied items and cues that do not. The effect is usually measured with cue familiarity ratings (e.g., Cleary, 2004; Cleary et al., 2009, 2012, 2016; Cleary & Claxton, 2018; Ryals & Cleary, 2012), though it also occurs with yes–no judgments (Ryals & Cleary, 2012), and d' tends to be comparable whether obtained through ratings or yes–no judgments (Cleary, 2004, 2005; Ryals & Cleary, 2012).

Recognition without recall appears to be well-described by feature-matching approaches to the computation of the familiarity signal such as that which takes place in MINERVA 2 (described below and depicted in Fig. 1). Specifically, one can assume that the test cue (e.g., POTCHBORK) contains features that are matched with features stored in memory (e.g., the letters and sounds of PATCHWORK) to produce a familiarity signal. Due to this feature overlap, a cue that resembles a studied item will elicit a higher familiarity signal than a cue that does not.

Fig. 1

The feature-matching process in the MINERVA 2 model. From “Judgments of Frequency and Recognition Memory in a Multiple-Trace Memory Model,” by D. L. Hintzman, 1988, Psychological Review, 95, 528–551. Copyright 2021 by the American Psychological Association. Reproduced with permission.

Within the recognition without recall paradigm, perceived cue familiarity strength can be systematically increased in ways broadly consistent with MINERVA 2’s mechanisms. For example, Ryals and Cleary (2012) found that when the test cue (e.g., POTCHBORK) shared features with four different studied words (e.g., PATCHWORK, PITCHFORK, POCKETBOOK, PULLCORK) that all failed to be recalled in response to the cue, familiarity ratings were significantly higher than when the cue shared features with only one studied (unrecalled) item. In turn, a cue that resembled no studied items received the lowest familiarity ratings. This pattern fits the global matching assumption of the MINERVA 2 model in that the degree to which each memory trace overlaps in features with the test probe combines across memory traces to boost the strength of the familiarity signal (see Fig. 1).

Note that recognition with recall likely involves a different mechanism than recognition without recall (Ryals & Cleary, 2012), presumably one relating to the fact that participants can use a recalled study item to make their recognition decision rather than relying solely on the perceived familiarity of the cue. The present simulations focus on the recognition without recall phenomenon, whose mechanisms may be well-described by MINERVA 2’s familiarity computation process.

Overview of the MINERVA 2 model

MINERVA 2 is a multiple-trace model that Hintzman (1984) developed as a single memory system in which individual experiences (episodic memory) and abstract knowledge (semantic memory) can interact and arise from the same representations (Neath, 1998). MINERVA 2 has been used to simulate semantic memory processes—namely, responses to prototypes that were never presented to the model but that relate to the specific encoded episodes within the model (Hintzman, 1986). Although an assumption of MINERVA 2 is that semantic knowledge builds up through multiple episodes (an assumption also made by others, such as Moscovitch et al., 2005, and Cermak, 1984), preexisting knowledge is not generally incorporated into MINERVA 2 simulations of episodic recognition (e.g., Arndt & Hirshman, 1998; Clark & Gronlund, 1996; Hicks & Starns, 2006; Hintzman, 1988; McDowd & Murdock, 1986).

In MINERVA 2, every experience is laid down in a new memory trace; thus, repeated experiences have multiple memory traces. Each memory trace is a vector of feature values that can be +1, −1, or 0 (see Fig. 1); a value of 0 indicates a missing feature. Although MINERVA 2 assumes that memory traces are sets of features (and that test items are decomposed into sets of features during their processing), the model is agnostic regarding what exactly constitutes a feature in the human mind. The +1 and −1 values that are meant to represent hypothetical features could, in theory, be any number of aspects of a stimulus. For example, if the stimulus is a word, the features could theoretically be letters, phonemes, or even semantic features of the type specified by McRae et al. (2005), such as “has bark” or “has leaves” (see also Cree et al., 2006, or Pexman et al., 2003). In short, the model only assumes that stimuli are segmented into features without specifying the nature of the features themselves. This holds true for most models of episodic recognition (e.g., Clark & Gronlund, 1996; Cox & Shiffrin, 2017).

In a simulated recognition test trial, a test probe launches the feature-matching process by which the familiarity signal is computed (see Fig. 1). The test probe is matched on a feature-by-feature basis with each of the memory traces stored in memory to provide an index of its similarity to each individual trace. The similarity value for each memory trace is computed by multiplying each feature in the trace by the corresponding feature within the test probe, summing all of these products, and then dividing by the number of relevant features. This process of calculating the similarity (S) between a probe and a trace (i) is shown in Eq. 1 below. Let Pj be the value of feature j in the given test probe, Ti,j be the value of feature j in trace i, and Nr be the number of relevant features (that is, features that are nonzero in the probe, the trace, or both):

$$ {S}_i=\frac{1}{N_r}{\sum}_{j=1}^{N}{P}_j{T}_{i,j} $$
(1)

In this equation, Si functions like a correlation coefficient, ranging from −1 to +1: it takes on a value of +1 when the probe and trace are identical and a value of 0 when they are orthogonal. Once calculated, Si is cubed to obtain the activation value, Ai, preserving the sign of the similarity measure. These computations are done for every trace stored within memory, and the activation values are then summed across all traces (M) to compute the echo intensity (I):

$$ I={\sum}_{i=1}^M{A}_i $$
(2)

Echo intensity is synonymous with familiarity intensity, with large values (e.g., 0.80) signaling strong levels of familiarity and low values (e.g., 0.06) signaling weak levels.

A test item will be classified as “old” or “new” in a simple yes–no/old–new recognition test paradigm based on the criterion placement along the familiarity continuum; if the familiarity signal value exceeds the criterion, the test item will be classified as “old” whereas if it falls below the criterion, it will be classified as “new.”
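As a concrete illustration, Eqs. 1 and 2 together with the criterion rule can be sketched in a few lines of NumPy. This is our own minimal sketch, not the authors' published code; the array shapes and the criterion value are illustrative assumptions.

```python
import numpy as np

def echo_intensity(probe, traces):
    """MINERVA 2 echo intensity (Eqs. 1 and 2).

    probe:  1-D array of feature values in {-1, 0, +1}
    traces: 2-D array of memory traces, one trace per row
    """
    # N_r: features that are nonzero in the probe, the trace, or both
    n_r = ((traces != 0) | (probe != 0)).sum(axis=1)
    # Eq. 1: similarity S_i of the probe to each trace
    s = (traces * probe).sum(axis=1) / n_r
    # Cubing the similarity gives the activation A_i, preserving its sign
    a = s ** 3
    # Eq. 2: echo intensity I sums the activations over all M traces
    return a.sum()

def classify(probe, traces, criterion):
    """Old-new decision via criterion placement on the familiarity continuum."""
    return "old" if echo_intensity(probe, traces) > criterion else "new"
```

A probe identical to a stored trace yields S = 1 for that trace, so its cubed activation contributes fully to the echo intensity, whereas orthogonal traces contribute nothing.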

MINERVA 2’s account for recognition without recall

The MINERVA 2 model provides a possible mechanism for the recognition without recall phenomenon. If the probe depicted in Fig. 1 is the cue POTCHBORK, the computed familiarity signal would be higher if the study item PITCHFORK resides among the memory traces than if no memory traces share a substantial number of features with that probe. Moreover, because the activation values are summed across the memory traces, the familiarity signal will be even higher if PATCHWORK, POCKETBOOK, and PULLCORK also reside among the memory traces, as was found by Ryals and Cleary (2012). Thus, both the feature-matching aspect and the global matching aspect make the model well suited for accounting for how participants can detect increased familiarity with a test cue during recall failure of the relevant study episode(s) (e.g., Cleary, 2004; Ryals & Cleary, 2012). The cue elicits a stronger familiarity signal if there is a greater feature match in memory, and an even stronger signal if more items have a high degree of feature overlap with the cue.

Importantly, the feature-matching process in humans is likely not devoid of preexperimental knowledge representations. For example, the orthographic and phonological knowledge presumably involved in the types of features used in this example resides in the participant’s knowledge base prior to the experiment.

Semantic feature-based familiarity-detection during recall failure

The likelihood that preexperimental knowledge representations are involved in the recognition without recall phenomenon becomes further apparent in a variant of the phenomenon that involves semantic information. For example, in one experiment, Cleary (2004) presented participants with cues (e.g., jaguar) that potentially overlapped semantically with studied words (e.g., cheetah). During cued recall failure, participants gave higher ratings to cues that corresponded to a studied item than to cues that did not. This and other research (e.g., Jia et al., 2020) suggests that semantic information can lead to recognition without recall independently of orthographic or phonological features.

To investigate whether this type of recognition without recall might be based on a semantic feature-matching process, Cleary et al. (2016) drew from the literature on feature-based semantic knowledge representation (e.g., Griffiths et al., 2007; Landauer & Dumais, 1997; Plaut, 1995; Seidenberg, 2007; Smith et al., 1974; Yee et al., 2009) regarding how to manipulate and quantify semantic feature overlap. They tested whether the semantic features of the type specified by McRae et al. (2005) and traditionally studied in the context of semantic memory could participate in the type of global feature-matching process described by MINERVA 2 for episodic familiarity detection.

Cleary et al. (2016) found that when cued recall failed, cues (e.g., cougar) sharing semantic features (e.g., “has fur,” “has four legs”) with four studied items (e.g., cheetah, leopard, panther, tiger) received higher familiarity ratings than cues sharing semantic features with only two (e.g., tiger, cheetah); in turn, cues sharing semantic features with two studied items received higher familiarity ratings than cues not sharing semantic features with any studied items. Thus, semantic features, as conceived of by researchers of semantic knowledge representation (e.g., McRae et al., 2005), were able to participate in a process resembling MINERVA 2’s global feature-matching process for familiarity detection.

The fact that recognition without recall can be semantic in nature (i.e., based on semantic feature overlap between a test cue and unrecalled studied items) suggests that preexisting knowledge is involved; processing semantic aspects of a stimulus requires top-down processing stemming from an existing knowledge base. Others have used MINERVA 2 to make important points by simulating presumptive semantic-feature-based processes while retaining the aforementioned tabula rasa assumption, as when simulating the Deese–Roediger–McDermott false memory phenomenon (Deese, 1959; Roediger & McDermott, 1995; e.g., Arndt & Hirshman, 1998). In reality, however, for information to be semantic in nature, it must be present in the knowledge base prior to the experiment (or, for the present purposes, prior to the start of the simulation run). Below, we present a means of incorporating preexisting knowledge into the MINERVA 2 model in simulations of familiarity-based discrimination in list-learning paradigms. Doing so may help to shed light on the potential role of preexisting knowledge in the mechanisms behind the recognition without recall phenomenon, particularly its semantic-feature-based variant (Cleary, 2004; Cleary et al., 2016; Jia et al., 2020). Our simulation method borrows assumptions from classic signal detection theory.

A role of existing knowledge in classic signal detection theory

Classic signal detection theory (e.g., Atkinson & Juola, 1974; Crowder, 1976, p. 373) assumed a role for preexisting knowledge in the process: all items have some baseline level of familiarity in a person’s mind, the variability of which can be represented by the spread of a distribution along the familiarity continuum. This is depicted on the left-hand side of Fig. 2.

Fig. 2

Classic signal detection theory

When a select set of items is presented on a study list, their presentation boosts their familiarity. This boost in familiarity among studied items can be represented by a rightward shift of the baseline distribution, depicted on the right-hand side of Fig. 2. As in global matching models like MINERVA 2, old–new responses depend on criterion placement. In Fig. 2, the gray dotted area depicts the false alarms for the criterion shown, whereas the unshaded part of the rightward distribution to the right of the criterion depicts the hits. The unshaded part of the rightward distribution to the left of the criterion depicts the misses, and the area of the leftward distribution to the left of the criterion depicts the correct rejections.
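Under this classic equal-variance picture, the four response categories are simply areas of the two Gaussians on either side of the criterion. The following sketch uses Python's standard-library NormalDist; the distribution means and criterion are illustrative values of our own choosing, not fitted to any data:

```python
from statistics import NormalDist

# Illustrative baseline ("new") and shifted ("old") familiarity distributions
new_dist = NormalDist(mu=0.0, sigma=1.0)
old_dist = NormalDist(mu=1.0, sigma=1.0)
criterion = 0.5

hit  = 1 - old_dist.cdf(criterion)  # old items falling above the criterion
miss = old_dist.cdf(criterion)      # old items falling below the criterion
fa   = 1 - new_dist.cdf(criterion)  # new items above the criterion (false alarms)
cr   = new_dist.cdf(criterion)      # new items below the criterion (correct rejections)

# With equal unit variances, d' is simply the difference in means
d_prime = old_dist.mean - new_dist.mean
```

Moving the criterion trades hits against false alarms without changing d', which is the separation of the two distributions.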

The important takeaway from Fig. 2 for the purposes of the present study is that the leftward distribution—the one representing the baseline familiarity of the new items—assumes a role of general knowledge in familiarity-based old–new discrimination in a list-learning paradigm. There is a preexisting spread to the baseline familiarity distribution at the start of a list-learning experiment. What underlies that baseline familiarity and its variance is likely the frequency and variability of past exposure to the items in the real world. One possible indicator, for example, might be word frequency (e.g., Reder et al., 2000).

From this perspective, the semantic features participating in the feature matching process thought to be involved in the semantic feature-based familiarity detection shown in Cleary (2004) and Cleary et al. (2016) could be hypothesized to exist in the leftward distribution at the start of an experiment. The study list presentation then shifts the distribution rightward for memory traces overlapping with study list items to account for how familiarity-based discrimination can take place within the list-learning paradigm.

Incorporating preexisting knowledge in MINERVA 2’s familiarity-detection process

In the present study, we aimed to incorporate baseline familiarity levels into the MINERVA 2 model similar to the basic idea present in classic signal detection theory above. Toward this end, we created pre-simulation memory traces that varied in their number of repetitions in the memory store. This was similar to how experimental frequency was represented in Hintzman (1988), except that our varying frequencies were meant to mimic baseline exposure levels to the stimuli in the general knowledge base before the start of an experiment. One purpose of our first simulation was to illustrate the effects of having probes for unstudied items correspond to zero memory traces in the store (i.e., the tabula rasa assumption) versus having probes for unstudied items instead correspond to memory traces in a preexisting general knowledge base (similar to what was envisioned in classic signal detection theory). Another purpose was to show that the latter allows for a simulation of the basic semantic-feature-based recognition without recall phenomenon shown by Cleary (2004). Our follow-up simulations build upon this first demonstration to explore the behavior of the “signal” and “noise” distribution variances under situations of higher signal-noise discriminability, as well as to simulate the patterns shown by Cleary et al. (2016) in their investigation of a semantic feature overlap account of the semantic-feature-based recognition without recall phenomenon and to explore scaling up versus scaling down the baseline knowledge trace repetitions. Finally, our last simulation presents a possible recipe for scaling up to the scale of human word knowledge, given adequate computational resources.

All of the simulations reported here are based on Hintzman’s (1988) MINERVA 2 model. All were implemented in Python 3.7 using NumPy (Oliphant, 2006) and Jupyter (Kluyver et al., 2016), with code specifically designed to generate a set of memory traces, create test probes based on a subset of those traces, and then compute the echo intensity based on Eqs. 1 and 2. The code for the simulations is publicly available via GitHub (https://github.com/dwhite54/minerva2), and the generated data are accessible via OSF (https://osf.io/fvs3t/). For all simulations reported here, the effects of different parameter settings are reported in the Supplementary Materials.

Simulation Set 1: Creating baseline familiarity and simulating Cleary (2004)

Simulation 1a

To create a preexisting knowledge base in our first simulation, we first generated 20,000 memory traces that consisted of 200 features (this number followed from McNeely-White et al., 2021, who used 200 features per trace in attempting to simulate musical features’ involvement in familiarity-detection using MINERVA 2). The features themselves were randomly generated to be either 0, +1, or −1, as per the MINERVA 2 model (e.g., Hintzman, 1988).

To create varying degrees of baseline familiarity, we randomly assigned each memory trace to a repetition value, which dictated how many representations of that specific memory trace would be added into the model to comprise its existing knowledge base. The repetition values were randomly generated using a normal distribution centered at zero with a standard deviation of 7 (see the Supplementary Materials for different standard deviation values), rounding each value to an integer. To ensure that a repetition value was no less than 1 (as a preexisting knowledge trace cannot have zero or fewer representations), we then took the absolute value of each repetition value and added 1 (e.g., a generated repetition value of 0 would be transformed into 1, a generated repetition value of −1 would be transformed into 2). As an illustration, when applying these transformed repetition values to the memory traces, Trace1 might be randomly assigned to a repetition value of 1 while Trace2 might be randomly assigned to a repetition value of 3. When encoding these memory traces into the model, Trace1 would be added only once while Trace2 would be added three times, resulting in these traces having lower and higher degrees of baseline familiarity in the knowledge base, respectively. This process led to a preexisting knowledge repetition range of 1-22 repetitions per trace, with a mean of 7 repetitions per trace. Additionally, as elaborated on below, noise was applied to the traces such that there was a 50% probability that any feature within a memory trace would be set to 0.
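The knowledge-base construction just described (normally distributed repetition values transformed via rounding, absolute value, and adding 1, plus 50% feature-zeroing noise) can be sketched as follows. This is our own illustrative reconstruction, scaled down from the paper's 20,000 traces for speed; the seed and trace count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2021)

# The paper uses 20,000 traces x 200 features; scaled down here for illustration
N_TRACES, N_FEATURES = 2_000, 200

# Base knowledge traces with features drawn from {-1, 0, +1}
base = rng.choice([-1, 0, 1], size=(N_TRACES, N_FEATURES)).astype(np.int8)

# Repetition values: round a normal(0, 7) draw, then abs() + 1 so every
# trace is represented at least once (0 -> 1, -1 -> 2, and so on)
reps = np.abs(np.rint(rng.normal(0, 7, size=N_TRACES)).astype(int)) + 1

# Knowledge store: trace i appears reps[i] times; each stored copy is
# perturbed so that any feature has a 50% chance of being set to 0
store = np.repeat(base, reps, axis=0)
store[rng.random(store.shape) < 0.5] = 0
```

Because every copy of a repeated trace receives independent noise, no two stored copies are identical, yet copies of the same base trace overlap in roughly half of their features.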

The first simulation was designed to mimic the results of Cleary (2004), in which participants studied a list of words (e.g., cheetah) before being presented with test cues, half of which semantically resembled studied words (e.g., jaguar). During cued recall failure, participants discriminated between cues that resembled studied words and cues that did not, presumably based on familiarity detection. In the current simulations, Cleary’s methodology can be likened to adding one additional instance of a knowledge memory trace into the overall store upon study list presentation, and to having that newly added trace share a proportion of the features of a specific set of repeated traces already in the knowledge store. This was part of the logic behind applying 50% noise to the memory traces—this way, none of the traces were identical, but sets of traces overlapped in roughly half of their features, and a trace from a study list presentation would add another trace of overlapping features to the preexisting knowledge traces already there. For this simulation, we randomly selected a subset of the preexisting knowledge memory traces (25%) to be Studied1X Traces (the “old” distribution), thereby adding one more trace into the store with the same 50% noise as the preexisting knowledge traces. The remaining 75% of the preexisting knowledge traces were labeled as Unstudied Preexisting Traces (the “new” distribution).

We then probed the model to obtain the echo intensities for test probes (cues) corresponding to the Unstudied Preexisting Traces and to the Studied1X Traces, and the model’s behavioral responses (“yes, familiar” or “no, unfamiliar”). Test probes were perturbed such that each feature of the probe had a 30% chance of being set to 0. Note that the Unstudied Preexisting Traces, although not mapping onto a studied item, do still map onto a proportion of the preexisting knowledge traces in feature overlap. For comparison, we also created test probes that corresponded to zero memory traces in the store (in order to examine how the findings differ when the standard tabula rasa assumption is used for the Unstudied condition); this distribution is labeled Novel below.

Normal distributions were fit to each of the Studied1X, Unstudied Preexisting, and Novel echo intensities, as seen in Fig. 3a, from which d' was calculated between Studied1X and Unstudied Preexisting in order to mimic the recognition without recall effect of Cleary (2004). We performed an ROC analysis in which discriminability was determined at varying echo intensity thresholds, which is akin to using different confidence ratings in recognition memory research (see Fig. 3b). For further metrics, we found the echo intensity threshold that maximizes the geometric mean of sensitivity (the proportion of “old” probes correctly classified) and specificity (the proportion of “new” probes correctly classified). We computed the hit and false alarm rates using this threshold and then used these to compute d'.
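The threshold search and d' computation described above can be sketched as follows. The Gaussian "echo intensities" here are stand-ins of our own for the model's actual outputs, and the means, standard deviations, and seed are illustrative assumptions:

```python
import numpy as np
from statistics import NormalDist

def best_threshold(old_i, new_i):
    """Criterion maximizing the geometric mean of sensitivity and specificity."""
    best_c, best_g = None, -1.0
    for c in np.unique(np.concatenate([old_i, new_i])):
        sens = (old_i > c).mean()   # proportion of "old" probes above c
        spec = (new_i <= c).mean()  # proportion of "new" probes at or below c
        g = np.sqrt(sens * spec)
        if g > best_g:
            best_c, best_g = c, g
    return best_c

def d_prime(hit_rate, fa_rate, n):
    """d' = z(hit rate) - z(false alarm rate), with rates clipped away from 0 and 1."""
    z = NormalDist().inv_cdf
    clip = lambda p: min(max(p, 0.5 / n), 1 - 0.5 / n)
    return z(clip(hit_rate)) - z(clip(fa_rate))

rng = np.random.default_rng(7)
old_i = rng.normal(0.20, 0.10, 5_000)   # stand-in "old" echo intensities
new_i = rng.normal(0.00, 0.10, 5_000)   # stand-in "new" echo intensities

c = best_threshold(old_i, new_i)
dp = d_prime((old_i > c).mean(), (new_i > c).mean(), n=5_000)
```

With equal-variance Gaussians, the geometric-mean-optimal criterion lands near the midpoint of the two means, and the recovered d' approximates the true mean separation in standard deviation units (2.0 here).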

Fig. 3

Echo intensity values (a) and ROC analysis (b) for Simulation 1a

Our first simulation demonstrates that with this approach, the “old” and “new” distributions are discriminable based on the average familiarity signal difference between them, and the pattern simulates that shown by Cleary (2004). Focusing for now on the Studied1X and Unstudied Preexisting conditions shown in Fig. 3a, the distribution for the Studied1X traces is shifted slightly to the right, leading to a d' value of .15 and an area under the curve (AUC) value of .54 from the model’s yes–no responses. The overall pattern strongly resembles that reported in Cleary (2004), where the d' values for recognition without recall ranged from .10 (for semantic feature overlap) to .28 (for combined orthographic and phonological feature overlap); the ROC patterns look remarkably similar between Fig. 3b and Cleary (2004), and the variances appear to be roughly equal, as in Cleary (2004), where the z-ROC slopes approximated 1.0 (indicating equal variances).

Turning now to the Novel distribution, another goal of this simulation was to illustrate how assumptions regarding the distribution of unstudied items impact the outcomes of simulations. Specifically, how do the outcomes change as a function of whether test probes for unstudied items correspond to traces in a preexisting knowledge base versus to no specific memory traces in the store (the tabula rasa assumption)? Hintzman’s (1988) MINERVA 2 simulations on study item frequency suggest that as the number of memory traces in the store that match the test probe increases from zero up through five, the variances of the echo intensity distributions progressively increase, with less dramatic variance increases occurring at higher frequencies (see Hintzman, 1988, Fig. 4, p. 531). Therefore, one a priori expectation was that the variances of the distributions would be more similar to one another when the “new” distribution corresponds to preexisting knowledge traces than when it corresponds to zero traces. Indeed, as shown in Fig. 3a, when probes mapped onto no stored memory traces (beyond partial matches due to chance), the variance was much lower than when the probes mapped onto existing memory traces. Moreover, old–new discriminability was impacted by this assumption as well; d' for Studied1X versus Novel distributions was 1.21 (dramatically higher than the d' of .15 reported above for Studied1X versus Unstudied Preexisting) and AUC was .81 (also much higher than the AUC of .54 for Studied1X versus Unstudied Preexisting). The dramatic difference is also apparent in the ROC shown in Fig. 3b. In short, assumptions about unstudied items affect both the extent to which the variances change with study status and old–new discriminability.
Furthermore, assuming that test cues for unstudied items still correspond to preexisting traces in the knowledge base leads to a fairly straightforward manner of simulating the semantic-feature-based recognition without recall phenomenon reported by Cleary (2004).

Fig. 4

Echo intensity values (a) and ROC analysis (b) for Simulation 1b

Simulation 1b

Although the low discriminability and ROC patterns closely replicate those shown by Cleary (2004), it is worth investigating how the variances of the “old” and “new” distributions behave under circumstances of higher discriminability, as equal variances could simply be the result of highly overlapping distributions. To achieve higher discriminability in our follow-up simulation, when adding the Studied1X Traces into the model, we perturbed the features of the traces using a 40% probability that each feature would be changed to zero. In this way, the newly added study list traces still highly resemble preexisting knowledge traces and a potential later cue without being identical to them, and a recency boost is incorporated through slightly lower noise among studied traces than among preexisting knowledge traces (40% instead of 50%). This not only helps to increase old–new discriminability, but also stems from the rationale that more recently encountered items should theoretically be less noisy than long-existing knowledge traces due to less opportunity to decay over time, an important consideration for eventually scaling the present approach up to the full scale of human knowledge, as discussed in the General Discussion.

Focusing first on the Studied1X and Unstudied Preexisting distributions, as shown in Fig. 4a and b, the variances are still approximately equal, with an increased d' of .29 and an AUC of .58. Turning to the comparison of the Studied1X and the Novel echo intensity distributions, the variability of the Novel distribution is dramatically lower than that of the Studied1X distribution, and old–new discriminability was quite high; d' was 1.36 and AUC was .84, demonstrating again what was shown in the first simulation regarding modeling assumptions about unstudied items. In both of these simulations, an assumption of preexisting knowledge leads to a fairly straightforward simulation of the recognition without recall pattern reported by Cleary (2004), whereas a tabula rasa assumption for unstudied items does not.
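For reference, the two summary statistics reported throughout (d' and AUC) can be computed from two samples of echo intensities roughly as follows; the Gaussian samples here are illustrative stand-ins, not simulation output.

```python
import numpy as np

def d_prime(old, new):
    """d' using a pooled standard deviation (appropriate when the
    'old' and 'new' variances are roughly equal)."""
    pooled_sd = np.sqrt((np.var(old, ddof=1) + np.var(new, ddof=1)) / 2)
    return float((np.mean(old) - np.mean(new)) / pooled_sd)

def auc(old, new):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    probability that a random 'old' value exceeds a random 'new' value."""
    diffs = old[:, None] - new[None, :]
    return float(np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0))

rng = np.random.default_rng(2)
new = rng.normal(0.0, 1.0, 2000)   # hypothetical unstudied intensities
old = rng.normal(0.4, 1.0, 2000)   # a small equal-variance shift
```

With equal variances, a mean shift of 0.4 standard deviations yields a d' near .4 and an AUC near .61, comparable in magnitude to several of the values reported above.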

Simulation Set 2: Simulating Cleary et al. (2016)

Simulation 2a

In this next simulation, we followed the same procedure and parameters as in Simulation 1b with the following exceptions: The Studied Traces were added back into the model either two times (Studied2X) or four times (Studied4X) in order to mimic Cleary et al.’s (2016) use of semantic test cues (e.g., cedar) that either resembled two (birch, oak) or four (birch, oak, pine, willow) studied memory traces, respectively, and we no longer included a Novel (tabula rasa) condition. Cleary et al. found that participants’ ratings of cue familiarity intensity for cues that elicited no target recall increased systematically from cues that resembled no studied items (M = 2.49) to cues that resembled two unrecalled studied items (M = 2.75) to cues that resembled four unrecalled studied items (M = 3.06). Rather than using ratings, we simulated this pattern of cue familiarity detection with the model’s yes–no judgments described above. To determine the hit and false alarm rates of these three conditions (Unstudied Preexisting, Studied2X, Studied4X), we placed the criterion threshold at the median echo intensity value for the Studied2X condition.
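The criterion placement just described can be sketched as follows, with hypothetical Gaussian echo intensity samples standing in for the three conditions.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical echo intensity samples standing in for the three conditions
unstudied = rng.normal(0.00, 1.0, 10000)
studied2x = rng.normal(0.45, 1.0, 10000)
studied4x = rng.normal(0.90, 1.0, 10000)

# "Yes, familiar" criterion at the median of the Studied2X condition,
# which fixes the Studied2X hit rate at .50 by construction
criterion = np.median(studied2x)

false_alarms = float(np.mean(unstudied > criterion))
hits_2x = float(np.mean(studied2x > criterion))
hits_4x = float(np.mean(studied4x > criterion))
```

Anchoring the criterion this way makes the false alarm rate and the Studied4X hit rate directly interpretable as shifts below and above the middle condition.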

As shown in Fig. 5a, the addition of two “study” episodes shifted the Studied2X distribution rightward from the Unstudied Preexisting distribution, allowing for discriminability, with a d' of .45, and AUC of .63. The addition of yet two more “study” episodes shifted the Studied4X distribution even farther to the right. For the Unstudied Preexisting versus the Studied4X conditions, d' was .86 and AUC was .74. For the Studied2X versus Studied4X conditions, d' between the two was .41 and AUC was .62. These results are similar to those of Cleary et al. (2016).

Fig. 5

Echo intensity values (a) and ROC analysis (b) of Simulation 2a

Although the specific level of familiarity increase was not a focus of the study reported by Cleary et al. (2016), in re-examining their data, we found that the level of familiarity increase across their conditions was roughly equal and roughly additive. That is, the level of increase in familiarity ratings from their Studied0X condition to their Studied2X condition (M = .26, SD = .87) did not differ significantly from the level of increase from their Studied2X condition to their Studied4X condition (M = .31, SD = 1.06), t(96) = 0.35, SE = .17, p = .73, BF01 = 3.38. The same general pattern occurred in our simulations: The false alarm rate was .32, the hit rate for the Studied2X condition was .50, and the hit rate for the Studied4X condition was .67. Thus, the level of increase from the Studied0X to the Studied2X condition (a difference of .18) approximates the level of increase from the Studied2X condition to the Studied4X condition (a difference of .17). In short, the level of familiarity increase from the Studied0X to the Studied2X to the Studied4X conditions in both the empirical data in Cleary et al. (2016) and our present simulations was roughly equal and roughly additive.

Simulation 2b

The simulations reported thus far present a step toward incorporating preexisting knowledge (i.e., semantic memory) into the feature-based familiarity-detection mechanisms of MINERVA 2 and demonstrate some implications of doing so; they also demonstrate how doing so can enable simulating empirical phenomena presumed to result from feature-based familiarity-detection, such as the semantic-feature-based recognition without recall phenomenon (Cleary, 2004; Cleary et al., 2016). However, the preexisting knowledge that was built into these simulations was nowhere near the full scale of human knowledge. Whereas the range of repetition values in the preexisting knowledge traces of Simulation Sets 1 and 2 was 1–22, databases of word frequency estimates, such as the English Lexicon Project (Balota et al., 2007), suggest that a given word can have thousands of memory trace representations, as individuals have repeatedly encountered a given word across the life span. Undoubtedly, scaling up to the level of human word frequency knowledge would mean that recent memory traces from occurrence on a study list would become small drops in the bucket, overwhelmed by the preexisting knowledge traces. That said, it is plausible that recently encountered words benefit from their recency in list-learning paradigms, such as if they are initially less noisy than the older, preexisting memory traces in the knowledge base but eventually decay down to that level over time. Such a recency mechanism was incorporated in Simulations 1b and 2a, but would likely be needed to a greater extent when scaling up.

It is at least theoretically possible that preexisting knowledge can be built into MINERVA 2 at a scale that approximates that of human word knowledge. In Simulation 2b, we attempted to scale up the baseline trace repetitions in a manner concordant with research on human word frequency knowledge. Specifically, we used the SUBTLWF frequencies of Brysbaert and New (2009)—the frequency of each word per million words—for the actual stimulus words used by Cleary et al. (2016). All but one of the 480 words used by Cleary et al. (hip hop) were in this pool; the mean SUBTLWF for these was 19.43 and the range was 0.060–569. To approximate this, we changed Simulation 2a’s repetition range from 1–22 to a range of 1–500, but this effort exceeded our available computational resources (see the Supplementary Materials for the resource details). Staying within our available resources required scaling the repetition range down to a range of 1–200 and reducing the total number of preexisting knowledge traces from 20,000 to 10,000. The results of this simulation (with a resulting mean repetition value of 51) are below.

Indeed, as shown in Fig. 6a, the distributions are now highly overlapping, such that the study list memory traces are more of a drop in the bucket. For the Unstudied Preexisting versus the Studied2X distribution, d' was .06 and AUC was .52. For the Unstudied Preexisting versus the Studied4X distribution, d' was .13 and AUC was .54. For the Studied2X versus Studied4X distribution, d' was .08 and AUC was .52. The probability of a “yes, familiar” response went from .47 to .50 to .52 across the three distributions.

Fig. 6

Echo intensity values (a) and ROC analysis (b) of Simulation 2b

Simulation 2c

In Simulation 2c, a 20% noise level was applied to the study traces rather than the 40% used in Simulation 2b, while the preexisting knowledge trace noise remained at 50%. This served to increase the recency boost afforded to study list traces. As can be seen in Fig. 7a, discriminability increased, yet the variances remained roughly equal. For the Unstudied Preexisting versus the Studied2X distribution, d' was .16 and AUC was .55. For the Unstudied Preexisting versus the Studied4X distribution, d' was .31 and AUC was .59. For the Studied2X versus Studied4X distribution, d' was .15 and AUC was .55. The probability of a “yes, familiar” response went from .44 to .50 to .57 across the three conditions, with a roughly equal and additive increase.

Fig. 7

Echo intensity values (a) and ROC analysis (b) of Simulation 2c

Simulation 3: A recipe for scaling up

Word frequency in many languages—including English—follows an exponential power function (see Brysbaert et al., 2016, 2018; Piantadosi, 2014; van Heuven et al., 2014). For example, the majority of words in English are estimated to be low-frequency words (e.g., “gloom” or “accordion”), while very few are high-frequency words (e.g., “friend” or “money”). When words are ordered as a function of estimated word frequency, the resulting distribution follows an exponential function, such as that depicted in Fig. 8a.

Fig. 8

a The exponential distribution of log subtitle frequency from the English Lexicon Project (Balota et al., 2007). b The distribution of words from the pool of experimental stimuli from Cleary et al. (2016) by log subtitle frequency

Accordingly, we attempted to scale up the preexisting knowledge base by adding preexisting memory traces into the knowledge pool according to a log10 function estimated from the Balota et al. (2007) word frequency estimates, simulating 40,000 preexisting knowledge traces, as a typical participant is estimated to have a lexicon of approximately 40,000 words (e.g., Brysbaert et al., 2016; Brysbaert & New, 2009). Each trace was associated with a repetition value generated from the exponential distribution (see Fig. 8a). However, this effort required 729 GiB of memory, far exceeding our available 128 GiB. Therefore, in lieu of actually scaling up the preexisting knowledge distribution to this level, in Simulation 3 we present a possible recipe for such a scale-up while at the same time scaling down the baseline knowledge repetitions from the previous simulations to allow for a comparison of different preexisting repetition scales.

We generated preexisting memory traces that followed the same distribution as that thought to comprise a human participant’s word frequency knowledge base (an exponential distribution). We generated 40,000 memory traces, then generated trace repetition values using the exponential distribution function of \( f\left(x;\beta \right)=\left(\frac{1}{\beta}\right){e}^{-x/\beta } \) with β = 1 and x = 0.7 (these values were based on the Balota et al., 2007, database of log subtitle frequency, rounded to the nearest integer). This exponential distribution is depicted in Fig. 9a. Each memory trace was randomly assigned to one of these repetition values. Noise was applied to the preexisting knowledge traces, such that there was a 50% chance that each feature within the memory traces would be set to zero.
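One plausible reading of this generation step is sketched below: draw repetition values from an exponential distribution with β = 1, round to the nearest integer, and bound the result so that the counts fall in the 1–4 range reported for Fig. 9a. This is an interpretation of the recipe for illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(4)
n_traces = 40_000  # approximate size of a typical adult lexicon

# Draw repetition values from Exp(beta = 1), round to the nearest
# integer, and clip so every trace has between 1 and 4 repetitions,
# matching the range reported for the scaled-down recipe.
reps = np.rint(rng.exponential(scale=1.0, size=n_traces)).astype(int)
reps = np.clip(reps, 1, 4)
```

The resulting counts decrease sharply with repetition value, mirroring the many-low-frequency, few-high-frequency shape of the word frequency distribution.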

Fig. 9

a Based on the Balota et al. (2007) data set, the exponential distribution function of \( f\left(x;\beta \right)=\left(\frac{1}{\beta}\right){e}^{-x/\beta } \) was fit to the rounded word frequency estimates, such that β = 1 and x = 0.7, resulting in a range of repetition values between 1 and 4. b Traces of to-be-probed studied and unstudied items were selected from the simulated exponential distribution depicted in Panel a such that they followed the normal stimulus distribution derived from Cleary et al. (2016) using the function N(μ, σ2), μ = 2.5 and σ = 0.7

We also submitted the 480 stimulus words from Cleary et al. (2016) to the English Lexicon Project (Balota et al., 2007) to obtain their log subtitle frequencies, a measure that Brysbaert and New (2009) suggest is the most accurate index of relative word frequency. The 476 words with listed log subtitle frequencies were roughly medium-frequency words, falling approximately in the middle of the exponential distribution of the overall database and forming a normal distribution (see Fig. 8b).

To mimic this stimulus distribution in our scaled-down “recipe” simulation, we selected 5% of the memory traces from the exponential preexisting knowledge base to serve as the Unstudied, Studied2X, and Studied4X items, such that the distribution of the simulated experimental stimulus traces followed the same general normal distribution as the Cleary et al. (2016) rounded word frequency indices (see Fig. 9b), resulting in a memory trace repetition mean of 2.5 (SD = 0.7). Note that at this lowered scale, our preexisting knowledge trace repetitions spanned the middle frequencies examined in Hintzman’s (1988) Fig. 4 (p. 531): one, two, three, or four. At this lowered scale relative to Simulation Sets 1 and 2, we would expect the variances of the distributions to change comparably to those reported by Hintzman (1988). Furthermore, we would expect discrimination to be quite high at this low scale; thus, the level of noise for both the existing knowledge traces and the study list traces was kept constant at 50% (as in Simulation 1a). Probe noise was the same as in Simulation Sets 1 and 2 (a 30% chance of each feature being set to zero).
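The selection step can be sketched as follows, under the assumption that each simulated stimulus trace is drawn without replacement from the exponential pool so that its repetition count follows the rounded normal target distribution N(μ = 2.5, σ = 0.7); the pool construction and sampling details here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n_traces = 40_000
# Exponential pool of repetition values (rounded, bounded to 1-4)
reps = np.clip(np.rint(rng.exponential(1.0, n_traces)).astype(int), 1, 4)

# Choose 5% of traces so their repetition values follow the rounded
# normal stimulus distribution derived from Cleary et al. (2016)
n_stim = int(0.05 * n_traces)
targets = np.clip(np.rint(rng.normal(2.5, 0.7, n_stim)).astype(int), 1, 4)
pools = {r: list(np.flatnonzero(reps == r)) for r in range(1, 5)}
chosen = np.array([pools[t].pop() for t in targets])  # without replacement
```

By construction, the chosen traces' repetition counts have a mean near 2.5, while the surrounding pool retains its exponential shape.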

Indeed, similar to what was shown by Hintzman (1988), at this lowered baseline repetition scale, the variances showed a more dramatic increase with increasing study trace frequency than in the prior simulations (see Fig. 10 for the density plot and ROC analysis). Discriminability was also greater. For the Unstudied Preexisting versus the Studied2X distributions, d' was .88 and AUC was .74. For the Unstudied Preexisting versus the Studied4X distributions, d' was 1.49 and AUC was .86. For the Studied2X versus the Studied4X distributions, d' was .64 and AUC was .67. Finally, the roughly additive increase shown when the variances were approximately equal was no longer present: the probability of a “yes, familiar” response increased from .18 to .50 to .73 across the Unstudied Preexisting, Studied2X, and Studied4X distributions, respectively (increases of .32 and .23).

Fig. 10

Echo intensities (a) and ROC analysis (b) for Simulation 3

General discussion

Outcomes of the present simulations

A plausible mechanism for semantic-feature-based recognition without recall

The present simulations represent a step toward achieving a mechanistic understanding of the semantic-feature-based recognition without recall phenomenon (Cleary, 2004; Cleary et al., 2016). By incorporating preexisting knowledge into the MINERVA 2 model’s mechanisms for handling familiarity-based old–new discrimination in a list-learning paradigm, we simulated patterns found with semantic-feature-based recognition without recall (Simulation Sets 1 and 2). That said, although MINERVA 2 provides a plausible mechanism for the cue familiarity-detection process thought to enable recognition without recall, it may not be a suitable account of overall old–new recognition in standard recognition memory paradigms where familiarity-detection is not isolated, as recall or recall-like processes can presumably contribute and likely differ mechanistically from familiarity-detection (Ryals & Cleary, 2012). Nevertheless, the present simulations raise broader issues for memory research beyond identifying a plausible mechanism for the semantic-feature-based recognition without recall phenomenon.

Broader issues raised in the present simulations

At a broader level, we show how the signal and noise distribution variances and overall signal discriminability are affected by taking the more standard tabula rasa approach (of test probes for unstudied items corresponding to zero traces in the store) versus incorporating a preexisting knowledge base that includes all potential simulated experimental stimuli (with only some of them appearing at study). A noteworthy finding is that the signal variances change less dramatically with increasing study list frequency when there are a greater number of baseline memory trace repetitions in the preexisting knowledge base than when there are fewer. This suggests that the degree to which the spread of the signal and noise distributions vary from one another may itself systematically vary depending on the strength and scale of the built-in preexisting knowledge distribution. Further investigating this may be a worthy future research pursuit in its own right. In fact, as shown in the Supplementary Materials, some circumstances might lead to decreased variances with increasing study list frequency in MINERVA 2.

Another broader noteworthy finding is the roughly additive increase in the likelihood of responding “yes, familiar” going from 0 to 2 to 4 study presentations when the variances in the echo intensity distributions were approximately equal (Simulation Set 2), but not when the variances began to substantially increase (Simulation 3). McNeely-White et al. (2021) noted that the additivity principle present within MINERVA 2’s mean echo intensities (whereby activation values for separate memory traces combine additively to produce the resultant familiarity signal; see Fig. 1) may only be captured behaviorally in the form of yes–no judgments when the variances are roughly equal.

Future directions

We hope that these simulations will provide a springboard from which future research can (1) scale up to a closer approximation of the size of the human word knowledge base, and (2) attempt to represent known semantic feature information in the simulations. Although our computing resource limitations prevented us from scaling our simulations to the level of human word knowledge, Simulation 3 presents a possible “recipe” for future efforts at doing so. An important consideration for doing so is the likelihood that a preexisting knowledge base at that scale will overshadow any effects of study list traces, possibly preventing any familiarity-based signal-noise discrimination from occurring. This raises the important question of how such familiarity-based discrimination happens in the recognition without recall phenomenon found among human participants. It is possible that recently encoded memory traces are much less noisy than older memory traces, and that this recency boost enables a short-lived form of signal-noise discrimination that decays back to baseline over time (as the recent traces increase in noise over time). Whether or not the recognition without recall phenomenon adheres to such a pattern (of diminishing across a delay) is an empirical question that can be investigated in future research.

One possibility is that there is a mechanism for representing recency. For example, if the memory traces for recently studied items are initially less noisy than preexisting knowledge traces, this could help recently added memory traces contribute more strongly to the familiarity signal than preexisting traces, allowing for familiarity-based recent/nonrecent discrimination. We implemented such a mechanism in MINERVA 2 by incorporating a higher level of random noise into the preexisting knowledge base of traces than among study list traces.

That the preexisting knowledge traces might be noisy is plausible, as memory traces would be expected to be imperfect over a lifetime, and MINERVA 2’s mechanisms enable gist-like representations to emerge across multiple noisy memory traces in response to a probe (Arndt & Hirshman, 1998; Hintzman, 1986), creating a means for generic knowledge to emerge from an enormous number of noisy traces. The plausibility of lower noise in study traces than in preexisting knowledge traces likely requires the assumption that the recency boost is relatively short-lived, with the noise level increasing over time to that of the preexisting knowledge traces. This approach would predict that the recognition without recall phenomenon should itself be fairly short-lived, an empirical question that has not yet been investigated but could be with a time-delay paradigm to determine whether the effect diminishes fairly rapidly over time.

Another limitation of the present study is that we did not attempt to map the feature representations in MINERVA 2 (i.e., +1 and −1 values) to real, quantifiable features that have been identified in prior research. One way in which semantic features of words are potentially quantifiable is through the feature sets identified by McRae et al. (2005). For each word in their set, McRae et al. list the semantic features identified from their norming study (e.g., “oak” has the semantic features “a tree,” “grows in forests,” “has leaves,” “is brown,” “is hard,” “is large,” “is strong,” “is tall,” “made of wood,” “used as home by animals,” and “used for making furniture,” for a total of 11 identified semantic features). One could attempt to quantify semantic features per item in MINERVA 2 using this database, as well as to quantify the amount of semantic feature overlap between a test cue and a given memory trace (by number of semantic features shared between those two items). This might be accomplished by having each feature value (+1 or −1) in a particular vector location correspond to a single particular semantic feature (e.g., “has leaves”). Semantic feature overlap between two items would be represented by the identical feature values in identical locations across the two vector representations; nonshared features would be represented by opposite feature values in identical locations across the two vector representations. Another way might be to simulate a “feature” as a pattern of +1 and −1 values rather than as a single value.
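A minimal sketch of this proposed mapping is given below; the feature inventory and item feature sets are hypothetical stand-ins for the McRae et al. (2005) norms, used only to show how slot-wise +1/−1 agreement could quantify semantic feature overlap between a cue and a memory trace.

```python
import numpy as np

# Hypothetical feature inventory; in the proposed mapping, each vector
# slot would correspond to one named semantic feature from a norming
# set such as McRae et al. (2005).
FEATURES = ["a_tree", "grows_in_forests", "has_leaves", "is_brown",
            "is_hard", "is_tall", "made_of_wood", "has_bark",
            "is_green", "is_a_fruit"]

def to_vector(item_features):
    """+1 if the item has the feature, -1 if it does not."""
    return np.array([1 if f in item_features else -1 for f in FEATURES])

def shared_features(a, b):
    """Count of slots with identical values across the two vectors."""
    return int(np.sum(to_vector(a) == to_vector(b)))

# Hypothetical feature sets for two items
oak = {"a_tree", "grows_in_forests", "has_leaves", "is_brown",
       "is_hard", "is_tall", "made_of_wood", "has_bark"}
pine = {"a_tree", "grows_in_forests", "is_tall", "made_of_wood", "has_bark"}
```

Under this scheme, overlap counts both shared present features (+1 in the same slot) and shared absent features (−1 in the same slot), which is a design choice one would need to evaluate against the norming data.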

One challenge is that semantic features are not the only type of feature contained within a word. Words also have orthographic and phonological features, and while these features are quantifiable, directly mapping both these and semantic features to the +1 and −1 representational format used in MINERVA 2 presents many challenges, including attempting to quantify all of these features per word at the scale of the human word knowledge base for input into the model’s preexisting knowledge base. That said, a small step in this direction might be to start with only the quantifiable semantic features per item identified by McRae et al. (2005) and then, focusing only on those semantic features (and not at the scale of human word knowledge), attempt to simulate the semantic-feature-based recognition without recall patterns shown by Cleary et al. (2016) using these quantifications of cue-to-memory-trace semantic feature overlap.

Finally, future research could build on the present approach to examine how existing semantic knowledge representations affect how item-level characteristics impact episodic recognition judgments. For example, when recognition decisions are familiarity-based rather than recall-based, how does word frequency in the knowledge base interact with the item’s recent presentation on a study list to produce the sensation of familiarity?

Open practices statements

The code for the simulations is publicly available via GitHub (https://github.com/dwhite54/minerva2), and the data generated using the simulation tool are accessible via OSF (https://osf.io/fvs3t/). The simulations were not preregistered.