Since Ebbinghaus (1885), many theorists have tried to characterize the shape of learning. Knowledge about the mathematical equation of the learning curve is important for theoretical reasons—for example, to establish whether learning performance continues to improve at the same rate, or is instead constantly slowing down (Lewandowsky & Farrell, 2011). Knowing the precise shape of the learning curve also has practical advantages—for example, to optimize the learning process in training situations. With computer-based learning, it is possible to use a student’s learning history to predict future learning performance, while the learning process itself may be optimized. For these purposes, neural network models or other detailed computational models—though insightful in their own right—are often too complex. The calculations involved in processing the learning histories of individual students would be too time-consuming. For such applications, a highly abstracted, concise mathematical model is preferable. Hence, the search for the learning curve equation continues.

I do not think, however, that it is theoretically fruitful to assume the existence of one principal equation that determines the shape of learning: “the learning curve,” a hypothetical construct that would characterize learning at all levels, in all systems, across groups and individuals, and for all types of materials. The reason I do not believe in a generally valid learning curve is that the mechanisms and processes operating at each of these levels may affect the shape of learning in fundamental ways. For example, taking the average of several learning curves of individual subjects or items sounds like a neutral operation, but it is not. Simple averaging may give rise to a power function for a group’s performance, even if the individual learning curves are all exponential (see below for a brief discussion of the exponential and power function). Several studies have now established this experimentally (Heathcote, Brown, & Mewhort, 2000), computationally (R. B. Anderson, 2001; R. B. Anderson & Tweney, 1997; Brown & Heathcote, 2003), and mathematically (Murre & Chessa, 2011; Myung, Kim, & Pitt, 2000). Rather than calling such averaged power functions “spurious” or “artifactual,” we must recognize that averaging is an operation that affects the shape of learning. Instead of searching for globally applicable learning functions, I propose to focus on limited domains of application with more explicit assumptions about the processes involved. I believe that this differentiation may result in more insightful analyses and more powerful results.
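The averaging artifact is easy to demonstrate numerically. The sketch below (Python; the shape and scale values are illustrative, not taken from the studies cited) averages exponential learning curves whose rates μ follow a gamma distribution across subjects; the resulting group curve coincides with a power-type function, 1 − (1 + st)^(−a), even though every individual curve is exponential.

```python
import numpy as np
from scipy.stats import gamma
from scipy.integrate import trapezoid

# Rates mu are gamma-distributed across subjects (illustrative shape/scale values)
a, s = 2.0, 0.5
t = np.linspace(0, 10, 101)

# Numerically average the exponential curves 1 - exp(-mu*t) over the rate density
mu = np.linspace(1e-6, 30, 20001)
w = gamma.pdf(mu, a, scale=s)
w /= trapezoid(w, mu)                      # renormalize the truncated density
avg = 1 - trapezoid(w * np.exp(-np.outer(t, mu)), mu, axis=1)

# The exact average is a power-type curve: 1 - (1 + s*t)^(-a)
power = 1 - (1 + s * t) ** (-a)
print(np.max(np.abs(avg - power)))         # tiny: the two curves coincide
```

The agreement is not approximate: the gamma-mixture of exponentials has exactly this power form, which is why averaging can masquerade as a power law.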

In this article, I will focus on learning verbal material, specifically foreign language vocabulary, and analyze how we can translate characteristics of the to-be-learned material and individual subjects to parameters of the observed learning curve. I will investigate under what circumstances S-shaped curves emerge, what parameters characterize these, and which aspects of an experiment may influence them. In order to explore these issues, I will in the Theory section discuss the exponential and power function and argue why it is important that for verbal materials, learning equations must be able to accommodate both S-shaped and non-S-shaped (concave) learning curves. I will propose two such functions, which are derived as equations for learning not one thing but many. I will also discuss a framework for learning that gives rise to a third equation that resembles the exponential curve and that we feel is useful for understanding learning and memory. All three equations predict that S shapes will mainly be prominent with low learning rates, while being hard to detect with higher rates and with prolonged experience with the materials. In the Data and Modeling section, I describe two experiments and analyze these in some detail at the level of individual subjects. The goal here is not to find the “winning” model but rather to evaluate the theoretical usefulness of the three learning equations. In the Discussion section, I will consider why S-shaped curves in learning often remain unnoticed.

Theory

Concave versus S-shaped curves

Many different learning curve equations have been proposed in the literature, of which the most frequently encountered ones are the exponential function,

$$ p(t)=1-{e}^{-\mu t} $$

(e.g., Estes, 1950; Hull, 1943; Rescorla & Wagner, 1972; Spence, 1956; Thurstone, 1927), and the power function,

$$ p(t)=1-{\left(t+1\right)}^{-\mu } $$

(e.g., J. R. Anderson & Schooler, 1991; Newell & Rosenbloom, 1981), where p(t) is the probability of correct performance at time t, and μ is the learning parameter, which will be low for slow learners. In these expressions, time starts at t = 0 and may also be expressed in terms of equivalent learning trials (starting at 0).

With the exponential and power functions, performance increases most strongly in the beginning, leveling off as maximum performance is approached. This “hollow” shape is called concave, and the majority of the learning curves proposed share it. Many experimentalists, however, have reported S-shaped curves (e.g., Culler & Girden, 1951; Gallistel, Fairhurst, & Balsam, 2004; Gulliksen, 1934; Stroud, 1931, 1932), and my colleagues and I have frequently found these as well in our own research, especially with the learning of verbal material. This presents a problem for the exponential and power functions: being strictly concave, they cannot represent an S shape. Though it may seem theoretically attractive to assume that concave functions suffice to describe the shape of learning and that S-shaped curves are unimportant or “odd,” this assumption gives rise to a logical contradiction, which becomes evident if we consider various ways in which the to-be-learned material may be scored for correctness.

Suppose we have the following vocabulary-learning experiment, which may approximate the real-life learning situation faced by a student of a foreign language. We present lists of n paired-associate items—for example, English words, which the student must answer with the correct Dutch translation, as in “the man = ? (answer: de man),” “walks = ? (loopt),” and so forth. On errors, the correct Dutch translation is given immediately as feedback. The words in a specific list are presented until each translation is correct at least once. Each learning trial consists of one full run through all words in the list not yet learned. The (cumulative) score for each learning trial is the total number of words that have been translated correctly at that point. If the subject has no prior knowledge of Dutch, the score at trial 0 is likely to be 0, and at the end of learning the score will equal the list length, n, by which we normalize all scores. What would be the shape of the learning curve in this vocabulary-learning experiment?

We will first assume that for a given student, learning performance does indeed follow the exponential function above: p(t) = 1 − exp(−μt). This function is not S-shaped, because it has no inflection point (i.e., a point where it changes from a convex to a concave shape, or vice versa). Now, suppose that we change the scoring such that we arbitrarily group the list items into pairs. We only count such a pair as correct if both answers are correct. The probability of having both Item 1 and Item 2 correct is the product of the probabilities of having each individual item correct: p(t) = p_1(t)p_2(t) = [1 − exp(−μt)]², if the learning parameters μ are equal for both items, as I will assume here. This form of scoring may at first seem contrived, but the task is similar to the common task of translating short sentence fragments, such as “the man walks = ? (de man loopt).” Also, when first starting to learn a new language, its phonotactics may be unfamiliar; what seems to be a single word, such as the Dutch word groenteboer (“greengrocer”), actually constitutes a combination of individual fragments (groente = “vegetable,” boer = “farmer”), each of which must be learned and later retrieved correctly at the same time.

Hence, if there are items for which the learning curve is p(t) = 1 − exp(−μt), some compound items, consisting of c > 1 fragments, will have a compound learning probability

$$ {p}_c(t)={\left[1-\exp \left(-\mu t\right)\right]}^c. $$
(1)

I prove in the Appendix that this compound learning curve has an inflection point at t = log(c)/μ, and is, thus, S-shaped for c > 1. A similar proof is given for the compound power-based learning curve,

$$ {p}_d(t)={\left[1-{\left(t+1\right)}^{-\mu }\right]}^d, $$
(2)

which has an inflection point at t = [(1 + dμ)/(1 + μ)]^(1/μ) − 1. Hence, if one believes that the learning curve is an exponential or power function, one must of necessity also admit the existence of S-shaped (compound) learning curves. Both inflection-point equations imply that the inflection occurs at higher t if c (or d) is larger and μ is smaller. Thus, S shapes will be more prominent in the data (i.e., more visible, because they occur later in the learning process) if items are more complex and a subject has a lower learning rate.
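Both closed-form inflection points can be checked numerically. The sketch below (illustrative parameter values: a slow learner with three fragments per item) locates the sign change of the second derivative of Eqs. 1 and 2 and compares it with the formulas above.

```python
import numpy as np

mu, c, d = 0.3, 3, 3        # illustrative values: slow learner, three fragments

def inflection(f, t):
    """t at which the numerical second derivative of f changes sign."""
    d2 = np.gradient(np.gradient(f(t), t), t)
    return t[np.where(np.diff(np.sign(d2)))[0][0]]

t = np.linspace(1e-4, 40, 400001)
p_c = lambda u: (1 - np.exp(-mu * u)) ** c          # Eq. 1
p_d = lambda u: (1 - (u + 1) ** (-mu)) ** d         # Eq. 2

print(inflection(p_c, t), np.log(c) / mu)                             # both ~3.66
print(inflection(p_d, t), ((1 + d * mu) / (1 + mu)) ** (1 / mu) - 1)  # both ~2.55
```

With these values the inflection points fall well after trial 0, so the S shape would be clearly visible; raising μ pushes both toward 0, where the S shape becomes hard to detect.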

Under the above assumptions, I would not argue that S-shaped curves occur because the learning proceeds at first slowly, then speeds up, and finally slows down (but see the Discussion, where I consider these possibilities in some detail). It is perfectly possible that the underlying learning process—at the level of fragments—proceeds at a steady pace, but it is in the transformation from this process to the final score that the learning curve assumes its observed S shape. Other authors have recently presented a similar argument, although they have departed from completely different assumptions (Leibowitz, Baum, Enden, & Karniel, 2010). In the following section, this distinction between underlying process and transformation to behavior is developed further, in the so-called memory chain model, which distinguishes more explicitly between an underlying memory “intensity” and derived behavioral measures. This model leads to a learning equation that may be viewed as a generalization of the exponential function.

Learning curves in the memory chain model

The principal idea of the memory chain model is that memory is grounded in neural events and processes that operate at widely different timescales, ranging from rapidly firing neurons to axonal growth, which is extremely slow by comparison. The model was developed by my colleague Antonio Chessa and me (Chessa & Murre, 2007). Despite their varying nature, we proposed that from the perspective of learning and memory each neural process can be characterized by two basic properties: (i) the neural events or traces in them tend to decline, for example, through decay, neural noise, interference, or decrease in number, and (ii) as long as they have not declined, they tend to induce events in higher-level processes that have a lower decline rate.

In the model, we formalized these processes and their interactions by abstracting from the individual types of neural events or traces and replacing these by the general concept of intensity, which we adapted from the theory of point processes (Daley & Vere-Jones, 1988; Diggle, 1983). The intensity is the number of neural events or traces we may expect to find when searching for a memory in the brain. Such a memory representation consists of zero or more neural events or traces, any of which suffices to retrieve the memory. Copies of a memory can be more or less complete (cf. trace replicas in Nadel, Samsonovitch, Ryan, & Moscovitch, 2000), or merely a critical feature (Murdock, 1974) that allows for retrieval of a memory representation. On retrieval, a cue is given (for example, an English word whose Dutch translation is asked), and the target may or may not be found in memory. A retrieval failure occurs, for example, if the neural pathways activated by a given cue do not manage to connect to any of the memory traces.

This retrieval process is like searching for a battery in a dark apartment using only a flashlight with a narrow beam of light. Suppose that zero or more batteries are randomly scattered around a large dark room and that I need one for my second flashlight. I point my flashlight beam around on the floor randomly until I stumble upon a battery. This example highlights five aspects of the search process: (i) Cue quality: A large, broad flashlight beam will speed up the search. This may be compared to a more specific or better memory cue. (ii) Intensity and learning time: The more batteries there are, the higher the chances of finding one soon. Thus, if more traces that represent a given memory are in the brain, chances of retrieval increase. Such an increase is accomplished through longer or more efficient learning. (iii) Retrieval time: The longer we stumble around with our flashlight, the higher the chance of eventually finding a battery. This suggests that the longer we allow a subject to attempt to retrieve something, the higher the chances of eventual retrieval. (iv) Failure to retrieve: If I have only one minute to find a battery, I may not find any, even if several are present. Thus, in time-limited search, retrieval failures are common even if memory traces are present. (v) Retrieval as a chance process: This is a consequence of points (iii) and (iv) above, but it is worth noting separately. In the time-limited case, given the same number of memory traces, retrieval may sometimes be successful and at other times it may fail, even if all circumstances stay the same. Of course, if many batteries are on the floor, my chances of getting lucky are high.

Now, suppose that my second flashlight is of the type that requires not one but two batteries. This means that I need to keep searching until I have retrieved at least two: the retrieval threshold has been increased from 1 to 2. This could correspond to different aspects of a compound vocabulary item (cf. “greengrocer” = groenteboer above) or simply an increased threshold due to task manipulation, such as a high penalty on mistakes. So, whereas the default value of the retrieval process is to look for at least one point (battery, feature, neural event, etc.), this threshold may sometimes be raised. I will review this in some detail below, where I show how the threshold can often be linked to characteristics of the materials in a sensible manner.

This article concentrates on the learning curve, and I, therefore, shall not discuss forgetting or consolidation here. In fact, I shall assume that there is no significant forgetting during learning (e.g., between subsequent learning trials) and that each learning trial is equally effective. I am well aware that this is often not the case (Pavlik & Anderson, 2005; Raaijmakers, 2003), and a more complete version of the memory chain model, which addresses these aspects in the cases of print and TV advertisements, including massed versus spaced learning, is published elsewhere (Chessa & Murre, 2007).

It can easily be shown that in the memory chain model, recall probability immediately after learning follows a Poisson distribution: p(t) = 1 − Poisson(μt, b), where t is learning time, μ is the learning rate, b is the threshold parameter discussed above, and the cumulative distribution function, or “cdf,” of the Poisson distribution is used. Only if b > 1 shall I call the threshold “elevated.” In the default case, with b = 1, it can easily be shown that we have p(t) = 1 − exp(−μt). This expression is identical to the exponential learning curve discussed above.

An alternative and more concise way to express the memory chain model learning curve above is p(t) = 1 − Q(⌊b⌋, μt), which is based on a standard way of expressing the cdf of the Poisson distribution (Feller, 1966), where Q is the regularized (upper) incomplete gamma function. To interpolate between integer-valued thresholds, ⌊b⌋ (i.e., the floor of b) may be replaced by just b in the equation. What a non-integer retrieval threshold like 2.31 means will depend on the situation at hand, but it can often be understood as an average or interpolated value. Without the floor operator, the expression becomes:

$$ p(t)=1-Q\left(b,\mu t\right). $$
(3)

The memory chain model thus gives a new interpretation of the exponential curve, p(t) = 1 − exp(−μt)—namely, the probability of retrieving at least one trace from the underlying process. It generalizes this equation to threshold values higher than one. Elsewhere, we discuss how the model also accommodates retrieval cue quality and forgetting, and we show that in many cases its processes can be given a neurobiological interpretation, such as storage in the medial temporal lobe or neocortex, and can be used to model lesion data from amnesic patients and experimental animals (Murre, Chessa, & Meeter, 2013; Murre, Meeter, & Chessa, 2007).

In the Appendix, a proof is given that the memory chain model has an S-shaped learning curve with an inflection point at t = (b − 1)/μ for b > 1. A point worth emphasizing is that for all three compound learning curves discussed so far, namely Eqs. 1–3, we can identify two independent factors that promote the appearance of an S shape in learning curves: (i) a low learning parameter μ, and (ii) a high threshold b (or the equivalent parameters c and d, used in Eqs. 1 and 2, respectively).
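Equation 3 and the properties claimed for it can be verified in a few lines. The sketch below (illustrative values for μ and b) uses SciPy’s regularized lower incomplete gamma function, gammainc, which equals 1 − Q; it checks the reduction to the exponential curve at b = 1, the Poisson-tail interpretation for integer b, and the inflection point at t = (b − 1)/μ.

```python
import numpy as np
from scipy.special import gammainc   # regularized lower incomplete gamma: 1 - Q
from scipy.stats import poisson

mu, b = 0.8, 2.5                     # illustrative learning rate and threshold
t = np.linspace(0, 12, 1201)
p = gammainc(b, mu * t)              # Eq. 3, valid for non-integer b as well

# With b = 1, Eq. 3 reduces to the plain exponential learning curve
assert np.allclose(gammainc(1.0, mu * t), 1 - np.exp(-mu * t))

# For integer b it equals the Poisson tail P(N >= b), with N ~ Poisson(mu*t)
assert np.allclose(gammainc(3, mu * t), 1 - poisson.cdf(2, mu * t))

# The slope of p peaks at the inflection point t = (b - 1)/mu
i = np.argmax(np.gradient(p, t))
print(round(t[i], 2), (b - 1) / mu)  # both close to 1.875
```

The last check makes the two promoting factors concrete: lowering μ or raising b moves the inflection point (b − 1)/μ to later trials, where the S shape is visible.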

I assume that the threshold parameter reflects (lack of) metaknowledge of the materials, such as the phonotactics of a foreign language, or that it represents learning and retrieval of multiple fragments. If this assumption is correct, one would expect the parameter to decrease toward its default value of 1 in the course of prolonged learning: as a student becomes more and more familiar with the phonotactics of a language, complex words will tend to become single chunks (Gobet et al., 2001). We would therefore expect the threshold parameter to decrease over subsequent learning sessions. Also, with highly familiar materials, such as paired associates in one’s native language or other “simple” learning materials, we would not expect S-shaped curves.

Further assumptions about the learning rate

For all of the learning equations studied here, I will assume that the learning rate parameter μ is both subject-dependent and material-dependent, as follows: μ = μ_s·μ_m. This relation will be explored further below. In this article, I do not include a mechanism for time-dependent change of the learning rate itself (a few such mechanisms are mentioned in the Discussion). Although this is of great interest, I feel that such a theory extension, if indeed required, can only be developed when also taking into account the details of memory consolidation processes and learning saturation, which fall outside the scope of this article.

Data and modeling

In this section, the concepts introduced above are elaborated further by presenting two experiments and by fitting the models to the experimental data. The goal of these fitting exercises is to outline where such models may usefully be applied in the analysis of different learning processes. Straightforward fitting of a learning curve equation to an averaged learning curve is not very informative, as such curves tend to be highly similar and hence do not impose many constraints on the models fitted (Roberts & Pashler, 2000). In line with my argument about the heterogeneous nature of learning processes, I have not attempted to fit the models above to a large number of learning curves with disparate materials and subjects (e.g., Mazur & Hastie, 1978), but rather conducted two experiments in which separate curves are available for individual subjects. I believe that such experiments allow a better-constrained analysis of the learning process, suitable for an initial test of the hypotheses (see also Heathcote et al., 2000, for a similar approach). As mentioned above, I am aware that learning with intermediate forgetting, consolidation, and “saturation” of the learning process may be important factors in a full model of learning, but I consider these issues to be outside the scope of the present article and will address them elsewhere (see also Chessa & Murre, 2007).

Experiment 1: Turkish–Dutch vocabulary

Experiment 1 was conducted to assess the following research questions. (i) Does learning of foreign vocabulary produce S-shaped learning curves, and are the circumstances under which they occur captured sufficiently by the models outlined above? More specifically, (ii) is the threshold b characteristic of the to-be-learned materials, and may it thus be shared by all students, while the students themselves have varying learning parameters μ_s? I also wanted to use these data to compare the three compound learning equations introduced above: (iii) Are the three models broadly similar, or do they diverge in the ways they describe the data? This experiment was carried out at the University of Amsterdam. I will give the details before proceeding to the modeling.

Method

Subjects

A total of 141 subjects (108 female, 33 male; mean age 20.4 years, SD = 3.84) were taught 30 Turkish words in a paired-associate manner (Turkish to Dutch). The subjects were Dutch first-year psychology students who did not speak Turkish. They received either payment or research credit for participation.

Materials and procedure

In order to simplify learning, only Turkish nouns (without their articles) that were composed of letters in the English alphabet—which is the same as the Dutch alphabet—were included, without diacritics. The words selected were highly frequent in Dutch (as determined from the CELEX corpus, found at http://celex.mpi.nl). A learning trial for a single word consisted of the presentation of a Turkish word on a computer screen, a response by the subject typed at the keyboard, and immediate feedback. In case of an error, the correct response was given and the trial was counted as incorrect; otherwise it was counted as correct. After an incorrect answer, subjects had to type the correct response before they could proceed. Learning was self-paced, and all 30 words were presented at each list presentation, in the same order. Words that had already been translated correctly were still included in subsequent list presentations (see Exp. 2 below for a different approach). Training continued until all 30 responses had been correct once, with a maximum of 15 list presentations (some students did not quite manage to learn all words in 15 presentations).

Modeling

I first fitted each individual learning curve with learning Eq. 3. The learning rate μ_s was allowed to vary over the 141 subjects, while the recall threshold b was shared. The model gave the best fit for a threshold b = 2.533. The model was fitted by minimizing the goodness-of-fit X² statistic (see the Appendix). The first learning trial (i.e., an entire list presentation) was not included in the fitting procedure (or X² statistic) and was always set to 0 in the fitted model. The best-fitting (i.e., lowest) X² statistic found was 1,080.0. With 1,973 degrees of freedom (141 × 15 data points and 141 + 1 free parameters estimated from the data), this corresponds to a size α > .999 of the chi-square test. This means that the model had a good fit and was not rejected by the data (see the Appendix for a discussion of the fitting procedure). Size α stayed at a high level even when leaving out the last seven or eight data points, which have recall probabilities close to 1 and are therefore not very informative. At the level of individual learning curves, where only μ_s was varied per subject, four of the 141 models were rejected at α = .05, using 14 degrees of freedom. The average value of the X² statistic for individual curves was 7.66, and the average R² was .986 (in other words, the model explained 98.6 % of the variance).
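The nested fitting scheme (a free learning rate per subject inside a search for one shared threshold b) can be sketched on synthetic data. This is only an illustration with made-up parameter values: it minimizes plain squared error rather than the goodness-of-fit statistic described in the Appendix, and it simulates ten subjects rather than 141.

```python
import numpy as np
from scipy.special import gammainc        # Eq. 3: p(t) = 1 - Q(b, mu*t)
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
trials = np.arange(1, 16)                 # 15 list presentations
true_b = 2.5                              # shared threshold (made-up value)
true_mu = rng.uniform(0.4, 2.0, 10)       # 10 simulated subjects
data = np.array([gammainc(true_b, m * trials) for m in true_mu])
data += rng.normal(0, 0.01, data.shape)   # small observation noise

def total_error(b):
    """Sum of per-subject squared errors, with mu refit for each subject."""
    total = 0.0
    for y in data:
        res = minimize_scalar(
            lambda m: np.sum((gammainc(b, m * trials) - y) ** 2),
            bounds=(0.01, 5.0), method="bounded")
        total += res.fun
    return total

best = minimize_scalar(total_error, bounds=(1.0, 6.0), method="bounded")
print(round(best.x, 2))                   # recovered shared threshold, near 2.5
```

Because subjects with different rates constrain different parts of the curve, the shared threshold is well identified even though each subject contributes only 15 noisy points.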

The distribution of the learning parameters, which varied per subject, is shown in the inset of Fig. 1. The average learning parameter was 1.143, with a standard deviation of 0.556. Figure 1 also shows the data with model fits. I have averaged the data (and fitted curves) in four groups: the 20 subjects whose value of μ_s was lowest, then a group with the 50 lower-middle values, a group with the 50 upper-middle values, and finally a group with the 21 fastest-learning subjects. The fitted curves are the averages of the individual fitted curves (i.e., the model is not fitted to these averaged data points directly). When looking at the four curves, the reader should bear in mind that the averaging operation will often transform the shape of curves, though this effect is small if the range of the learning rate is narrow (Murre & Chessa, 2011).

Fig. 1

Data and fitted learning curves of a vocabulary-learning experiment with Turkish–Dutch noun translations. Curves are shown for proportions correct as a function of learning trials. The solid lines are averages of individually fitted curves using a single shared threshold; the dots are averaged data. The four curves are averages of quantiles of the following sizes: lowest curve, average of the 20 subjects with the lowest learning rates; lower-middle, the next 50 learning rates; upper-middle, the next 50 learning rates; upper, the 21 highest learning rates. The inset shows the distribution of the learning parameter over the 141 subjects

Both the empirical and the fitted curves differ considerably in shape, varying from a monotonically decelerating (i.e., concave) curve for the highest learning rates to a clearly S-shaped one for the lowest learning rates. I concluded that this model could adequately describe the data, with the most prominent S-shaped learning curves occurring for individual subjects who learn slowly, as predicted by the models above. A shared threshold b suffices to capture the general shape of the individual learning curves, with the exception of a few curves that had irregular shapes. I will now analyze the data with the other two models, the compound exponential and compound power learning curves, using the same approach.

The best fit for the compound exponential equation, p_c(t) = [1 − exp(−μt)]^c, was obtained with c = 2.924. The goodness-of-fit X² statistic was 1,135.4, which, though higher than the 1,080.0 above, still gives α > .999 (df = 1,973), indicating that this model also fitted the data well. Four of the 141 individual curves did not fit the data at α < .05 (these were the same subjects as above, whose learning curves were rather atypical). The average R² was .985.

I concluded that the compound exponential curve also did a good job fitting these data, with broadly similar parameters. In fact, the individual learning rates of both fitted equations showed a nearly perfect correlation of .9931. Only the three highest learning rates diverge visibly, probably owing to the fact that for very fast learners the models have much less information to constrain them, as they quickly reach 100 % correct. If I left these data points out and fitted a straight line that passed through the origin, the correlation coefficient was nearly 1 (.99965), with a slope of 0.7169. This suggests that some type of mathematical equivalence might be constructed between the two equations, but I have not been able to find this for unconstrained values of b and c.

According to the assumptions of the compound-learning curves above, the threshold values of b = 2.533 and c = 2.924 suggest that on average for each of the 30 words two to three memory representation “fragments” must be found in order to successfully recall it. In this respect, it is interesting to note that 25 out of the 30 Turkish words consisted of two syllables (four words had one syllable, and one word had three syllables). One could thus hypothesize that the memory representations retrieved contain the equivalent of two to three necessary features. More research is clearly needed to investigate this further. A few steps toward this are taken in Experiment 2, discussed below.

I also fitted the compound power equation, p_d(t) = [1 − (t + 1)^(−μ)]^d, to the data. This equation fitted the data less well and also tended to use very high threshold values, to which I could not give a sensible interpretation. The X² statistic was 1,519.8; though much higher than those for the other two models, with 1,973 degrees of freedom this still constituted a good fit, and with α > .999 the model was by no means rejected by the data. Six of the 141 individual curves were rejected at α < .05, and the R² was .980. The threshold value was d = 15.12, and the average learning rate was 3.281 (SD = 0.0856). It seems that with this function, high learning rates combine with an inflated threshold to produce the S-shaped curve necessitated by the data. Though the compound power curve fits, it does not do so in a way that is theoretically interesting.

Experiment 2: Dutch–Italian vocabulary with spaced relearning

Experiment 1 had a number of limitations, which prompted me to do a follow-up experiment with three spaced learning sessions. The following issues are addressed: (i) In Experiment 1, a list of words was studied until all words in it were correct. A more typical real-life learning scenario would be that a word in a list is studied until once correct and then dropped from the list. Can the model also be applied to such a learning situation? (ii) In Experiment 1, the Dutch translation had to be retrieved for a Turkish word. In real-life situations of active use, typically the foreign translation must be produced. Can the model accommodate this as well? (iii) Does the learning rate parameter remain more or less constant over subsequent learning sessions, or does it increase? I had no a priori expectations about this. An elevated learning rate is to be expected if the efficiency of the learning process for certain materials increases, but the memory chain model currently does not include mechanisms for this. (iv) Does the learning threshold drop on subsequent sessions? I hypothesized that this would indeed be the case, given that I assume chunk formation over time. (v) Can the learning rate parameter be applied across different (equivalent) word lists? Here, I also hypothesized this to be the case, given that the learning rate is assumed to depend jointly on the subject and materials in a multiplicative manner. I did, however, hypothesize that corrections in learning rate across lists had to be made to compensate for unintended variations in list difficulty. Finally, I had no a priori hypotheses about the forgetting process between sessions, which is outside the scope of this article. After presenting the experiment, I will apply various models to the data in order to elucidate the five issues outlined here.

In this experiment, which was also carried out at the University of Amsterdam, we studied learning by Dutch students of short lists of Dutch-to-Italian word pairs. The subjects came back the next day and during Session 2 relearned the lists. This was repeated with another learning session six days after Session 1. Subjects studied two lists of thirteen word pairs each. In Condition 1, lists were studied until all words could be translated correctly, but unlike in the previous experiment, if the correct Italian translation was given, the word pair was dropped from the list and the next learning trial (i.e., next list presentation) no longer included it. This mimics the type of vocabulary learning typically occurring when learning a foreign language, especially in a school or college setting. All subjects also participated in Condition 2, which paralleled Condition 1 (with counterbalanced lists), except that all word pairs were kept in the list at all learning trials until they were all correct (i.e., similar to Exp. 1 above).

Method

The details of the experiment were as follows.

Subjects

A total of 28 Dutch subjects (mean age 22.43 years, SD = 6.60; 20 female, 8 male) were rewarded with money or research credits. Six subjects were excluded from data analysis because they did not manage to learn the words, missed sessions, or had too much prior knowledge of Italian or a related language (Spanish), leaving 22 subjects in the experiment.

Materials

Four lists of 13 word pairs were used; the words were nouns and adjectives between four and seven letters long. The indefinite article was included with a noun (e.g., “a bed” = un letto; the article was counted as one letter). No cognates were included. High-frequency words (in Dutch) that met these criteria (taken from CELEX) were assigned to the lists in a round-robin scheme, in descending order of frequency.

Procedure

Subjects learned two of the four lists. Half learned Lists 1 and 3 and the other half Lists 2 and 4. Each list was presented word-by-word (in a fixed word order within a list) on a laptop screen with a white background. A Dutch word appeared in large black letters in the middle of the screen, with an input field below it in which the answer had to be typed. After pressing the Enter key or the OK button, feedback was given immediately. The answer was assessed as correct only when the completely correct Italian translation of the Dutch word was given (including the right article). Feedback consisted of a green signal for correct or orange for incorrect. With incorrect answers, the correct translation was given with specific feedback about the right and wrong letters in the answer. The subjects were required to correct their answer in the input field to match the standard answer before being allowed to continue. In Condition 1, correctly translated words did not reappear in the next list presentation, whereas they did in Condition 2 (as in Exp. 1).

Modeling

Condition 1

The results of Condition 1 (learning with drop-out of learned items) are shown in Fig. 2. As in Experiment 1, the hypothesis was that each individual subject could be characterized by a learning parameter. I furthermore assumed that the learning rates for a given subject were equal for both lists studied. Lists may differ somewhat in difficulty, despite efforts to prevent this during the selection of the words to be learned. To accommodate this in the model, a correction factor was included for the learning rate of the second list. This factor was shared by all subjects and multiplied with each subject’s learning rate. When students returned the next day for Session 2, they typically had retained some of the learned vocabulary. These “savings” were accommodated in the model by adding an intensity at the start of learning in Session 2 that was equivalent to the observed probability at the initial trial. This was also done for the six-day retention interval. As in Experiment 1, a hypothesis was that the threshold b would initially be higher than the default value of 1, because Dutch students are not used to the phonotactics of Italian. The b value was expected to drop over subsequent learning sessions. Because I had no clear expectations about the learning rates in Sessions 2 and 3, session-dependent factors were included that were multiplied with all subject-dependent learning rates. The same two session factors were shared by all subjects. If the efficiency of learning increased with practice, this would be reflected in these factors rising above 1.

Fig. 2
figure 2

Data and fitted learning curves of Condition 1 in Experiment 2, a vocabulary-learning experiment with Dutch–Italian word translations in three learning sessions. Shown are proportions correct as a function of learning trials. The solid lines are averages of the individually fitted curves (see the text for details). The three curves are averages of quantiles of the following sizes: lowest curve, average of 11 subjects with the lowest learning rates; middle, 22 middle learning rates; upper, 11 highest learning rates. a Session 1. b Session 2, 1 day after Session 1. c Session 3, 6 days after Session 1

The model of Eq. 3 was fitted with the following parameters. Each of the 22 subjects had a learning parameter μ_s. Two of the four lists had a “difficulty multiplier,” which could deviate from 1 and which was shared among subjects and sessions. This parameter was multiplied with the learning parameter. Each learning session i was characterized by a freely varying threshold parameter, b_i, whereas Sessions 2 and 3 also had a session-specific learning rate multiplication factor. The session-dependent parameters were shared by all subjects. The total number of parameters was 22 + 2 + 3 + 2 = 29.
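The parameter bookkeeping described above can be sketched in a few lines of Python. This is an illustration only; the function name and the explicit multiplicative form are mine, but the counts come directly from the text:

```python
# Illustrative sketch of the parameter structure (names are mine, not
# from the article). The effective learning rate for a subject on a
# given list and session is the product of the subject's rate, a
# shared list-difficulty multiplier, and a shared session factor.

def effective_rate(mu_subject, list_multiplier=1.0, session_factor=1.0):
    # The multipliers for Lists 1 and 2 and the Session 1 factor
    # are fixed at 1; the rest vary freely.
    return mu_subject * list_multiplier * session_factor

# Free parameters in Condition 1: 22 subject rates, 2 list
# multipliers, 3 thresholds b_i, and 2 session factors.
n_params = 22 + 2 + 3 + 2                 # = 29

# Degrees of freedom: 44 individual curves (22 subjects x 2 lists),
# each with 7 + 6 + 3 data points over the three sessions.
df = 44 * (7 + 6 + 3) - n_params          # = 675
```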

The model was fitted by minimizing the X² statistic (see the Appendix), which was 317.0. The number of degrees of freedom was 44 × (7 + 6 + 3) − 29 = 675, which gives a chi-square estimate of α > .999. The model is not rejected by the data. Two individual curves out of 132 (i.e., Subjects × Lists × Sessions = 22 × 2 × 3 = 132) were rejected by the data at α < .05. The average R² values for the individual fitted curves in Sessions 1 to 3 were .9742, .9702, and .9835, respectively. The threshold values b_i were 1.996, 2.156, and 1.138 for Sessions 1–3, respectively. The average learning rates without multipliers were 1.504, 1.585, and 2.380 for Sessions 1–3, respectively. It thus seems that we have gradual chunk formation while the learning rate increases. We need a more detailed model of the learning process to investigate whether this chunk formation indeed takes place at this rate and to further explore how the efficiency of learning develops during learning.

This fit of the model to the learning process suggests a slightly elevated threshold of around 2, just a little lower than in the Turkish vocabulary experiment above, which conforms to the intuition that for Dutch natives, Turkish is more difficult than Italian. After a number of spaced learning sessions, the threshold approaches its default value of 1. The list difficulty multipliers were 1.100 and 0.8472, indicating that List 3 was slightly easier (higher learning rate) than List 1, whose multiplier was fixed at 1, and that List 4 was slightly more difficult than List 2 (whose multiplier was also fixed at 1).

Condition 2

I did a completely separate fit of the model to the data of Condition 2 (learning without drop-out of correct items). The fit is similar to that of Condition 1 (X² = 316.3, df = 675, α > .999; three individual curves out of 132 rejected at α < .05). The average R² values for the individual fitted curves in Sessions 1–3 were .9600, .9625, and .9536, respectively. The average learning rates without multipliers were 1.109, 1.777, and 1.857 for Sessions 1–3, respectively. The thresholds b_i were 1.688, 1, and 1 for Sessions 1–3, respectively. The list difficulty multipliers were 1.603 (List 3) and 0.9267 (List 4), which is in the same direction as above but with a rather more extreme correction factor for List 3, which apparently was easier to learn in this condition.

If we review the research issues addressed in this experiment, we can affirm that the model of Eq. 3 captures them adequately. In addition to the modeling undertaken in Experiment 1, the model here fitted data from Experiment 2 on list learning (i) with drop-out at the first correct response (Condition 1) and (ii) using active vocabulary learning (with responses in the foreign language). Furthermore, (iii) a subject’s learning rate increased over subsequent learning sessions, although (v) we could keep it constant across lists (within one session, with a list difficulty correction factor that was shared by all subjects), whereas (iv) in subsequent learning sessions, the learning threshold dropped to its default value.

A separate fit of the data with the compound exponential function gave similar results for both conditions (Condition 1: X² = 316.7, df = 675, α > .999, two individual curves out of 132 rejected at α < .05; Condition 2: X² = 317.8, df = 675, α > .999, two individual curves out of 132 rejected at α < .05). All other effects were in the same direction. For example, the threshold values b_i in Condition 1 were 1.792, 1, and 1 for Sessions 1–3, respectively. The similarity in fit is not surprising, since for b = 1 the model of Eq. 3 is identical to the (compound) exponential model.

As in Experiment 1, the compound power function fitted less well (e.g., Condition 1: X² = 343.0, df = 675, α > .999, two individual curves out of 132 rejected at α < .05; average R² values in Sessions 1 to 3 were .9736, .9687, and .9788). Also, the threshold values b_i took on extreme values when minimizing the X² statistic, namely 100.62, 41.81, and 18.65 for Sessions 1–3 in Condition 1, respectively. Although the threshold decreases sharply with prolonged learning, I could not sensibly relate these values to aspects of the materials and therefore consider the compound power function less promising as a model of the learning curve.

Discussion

In this article, three models for the learning curve are introduced, based on a number of assumptions about the underlying memory process, distinguishing between a subject-specific learning rate and a material-specific threshold. I have argued throughout that the threshold is related to the perceived complexity of the to-be-learned item, such as the number of syllables in a word. Especially when a student is faced with a language that has unfamiliar phonotactics, the learning threshold will be elevated. With prolonged learning, fragments are integrated into a single chunk, at which point the threshold declines to its default value of 1.
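Eq. 3 itself is not reproduced in this excerpt, but the threshold idea can be sketched under one simple reading of the model: elements are stored at rate μ per trial, and recall succeeds once at least b elements are in store. Assuming Poisson-distributed storage (an assumption on my part, not a quotation of Eq. 3), an integer threshold gives:

```python
import math

def learning_curve(t, mu, b):
    """P(correct) after t trials, assuming storage is Poisson with
    mean mu*t and recall requires at least b stored elements.
    This is a sketch of the threshold idea, not Eq. 3 verbatim."""
    lam = mu * t
    return 1.0 - sum(math.exp(-lam) * lam ** k / math.factorial(k)
                     for k in range(int(b)))

# b = 1 reduces to the exponential curve 1 - exp(-mu*t), concave
# throughout. b >= 2 yields an S shape with an early inflection,
# which disappears as the threshold drops toward 1 with chunking.
```

Under this reading, the default threshold b = 1 collapses the expression to 1 − exp(−μt), consistent with the text’s remark that the model equals the exponential model at b = 1.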

As can be seen in Figs. 1 and 2, S shapes are indeed most prominent for the slow-learning subjects and seem to have disappeared for the fast learners. It should perhaps be pointed out here that the models predict that the S shapes would also be present here for the fast learners if the learning threshold were elevated. However, because for fast learners the inflection points will occur very early during learning, they may be difficult or impossible to see in the data.

When averaging over the individual learning curves from Condition 1 of Session 1 in Experiment 2, however, no S shape is discernible (see Fig. 3a). Both the memory chain model (MCM) learning curve from Eq. 3 and the compound exponential equation fit the averaged curve well and are not rejected by the chi-square test (MCM: μ = 0.9555, b = 1.438, X² = 5.246, α = .3866, R² = .9997; compound exponential: μ = 0.8557, b = 1.520, X² = 5.640, α = .3428, R² = .9996). The compound power function does not fit as well and is rejected, even though it still explains a very high 99.58 % of the variance (μ = 2.980, b = 6.590, X² = 26.58, α < .001, R² = .9958). Of interest is what happens when we restrict the thresholds to 1, that is, remove the inflection point, making the fitted curves strictly concave (Fig. 3b). In that case, the fit deteriorates considerably (MCM: μ = 0.6945, X² = 23.58, α < .001, R² = .9943; the compound exponential equals the memory chain model for b = 1), though much more so for the power function (μ = 1.413, X² = 208.52, α < .001, R² = .9448). If we apply the hierarchical chi-square test for nested models (see the Appendix), we may conclude that the addition of the threshold parameter gives a significant improvement in fit for these models. In other words, even if the S shape is not obvious, introducing it into a model by using a threshold higher than 1 significantly improves the fit to these data.
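The nested-model comparison can be made concrete with the statistics quoted above. Everything here is taken from the text except the critical value, which is the standard .95 quantile of chi-square with 1 df:

```python
# Hierarchical chi-square test for nested models, using the MCM fits
# quoted in the text: X^2 = 5.246 with b free (b = 1.438) versus
# X^2 = 23.58 with b fixed at 1. The restricted model has one fewer
# parameter, so the difference is referred to chi-square with 1 df.

x2_b_free = 5.246
x2_b_fixed = 23.58
delta_x2 = x2_b_fixed - x2_b_free        # about 18.33
delta_df = 1

CHI2_CRIT_95_DF1 = 3.841                 # standard .95 quantile, 1 df
threshold_helps = delta_x2 > CHI2_CRIT_95_DF1
```

Since 18.33 far exceeds 3.841, freeing the threshold significantly improves the fit, in line with the conclusion drawn in the text.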

Fig. 3
figure 3

Averaged curve for proportions correct as a function of learning trials in Experiment 2, Condition 1, Session 1. The data are shown as diamonds (with error bars). a The solid line is a best-fitting compound memory chain model learning curve. The dotted line is a best-fitting compound power learning curve. b As in panel a, but now fitted using strictly concave functions without S shapes (i.e., b = d = 1; see the text for details)

If I had only fitted the simple power function (i.e., with d = 1) to the averaged curves in Session 1 of Experiment 2, I would probably have noticed that it did not fit well, though it still explains a respectable 94.5 % of the variance. The (noncompound) exponential function explains 99.4 % of the variance, however, which is a seemingly good-enough fit that appears to warrant the conclusion that these learning curves were not S-shaped. This example helps to explain why S shapes are often not observed in learning curves: They are not easily observed in the averaged curve, especially when fit is assessed by an informal evaluation of the variance explained. It is of course possible that in many experiments the learning rate is high or the material is not very complex, so that we would not expect to find S-shaped curves. However, it is likely that in many experiments the slowest-learning subjects do in fact show S-shaped learning curves, which remain unanalyzed and are no longer clearly visible in the averaged learning curve.
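The masking effect of averaging can be checked numerically. Below I assume, purely for illustration, a Poisson-threshold curve with b = 2 for every learner, and I use the sign of the discrete second difference as a convexity check (positive = convex, i.e., an S-shaped onset):

```python
import math

def curve(t, mu, b=2):
    """Assumed Poisson-threshold learning curve (illustrative only):
    P(correct) = P(Poisson(mu*t) >= b)."""
    lam = mu * t
    return 1.0 - sum(math.exp(-lam) * lam ** k / math.factorial(k)
                     for k in range(b))

def second_diff(f, t):
    # Discrete curvature: positive means locally convex (S-shaped onset).
    return f(t + 1) - 2 * f(t) + f(t - 1)

rates = [0.2, 1.0, 3.0]  # a slow, a medium, and a fast learner
average = lambda t: sum(curve(t, mu) for mu in rates) / len(rates)

slow_is_convex = second_diff(lambda t: curve(t, 0.2), 2) > 0
average_is_convex = second_diff(average, 2) > 0
```

The slow learner's curve is still convex at trial 2, but the averaged curve is already concave there: the early inflections of the faster learners dominate the average, which is exactly the masking described above.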

It is beyond the scope of this article to explore the mathematical details of the functions investigated here. I only want to remark that the memory chain model function and the exponential function bear an interesting relation to the Weibull function. The latter can also produce S-shaped functions and is relevant when modeling response times (Chessa & Murre, 2006). The functions considered have in common with the Weibull function that they can approach a step function. This happens in Eqs. 1–3 when a high threshold combines with a high learning rate. One could envision learning tasks in which items are very complex, for example, learning a series of long sentences in an unfamiliar foreign language, like a poem or lyrics. Another example may be operant conditioning, in which the task may have many components that must all be executed correctly at the same time for the first reinforcements to occur. Many such tasks also occur in real life, such as when one first learns to ride a bicycle or to produce musical sounds on the trumpet or violin.

In Fig. 4, some examples of S-shaped functions are shown, in which we see that for high learning rates, the S shape becomes so steep that it approaches a step or jump. Interestingly, such step functions are often reported in operant conditioning, for example, in an article by Gallistel et al. (2004), who also proposed to fit the curves of individual subjects, not averaged curves, and who selected the Weibull function for this purpose. It is likely that operant conditioning requires additional or different assumptions about the learning mechanisms than I have made in this article, which will undoubtedly lead to learning curves with different mathematical characteristics.
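The step-function limit can be verified numerically under a Poisson-threshold reading of the threshold models (again an assumption on my part, not a quotation of Eqs. 1–3), using the parameter values of Fig. 4b:

```python
import math

def curve(t, mu, b):
    """Assumed Poisson-threshold form: P(correct) = P(Poisson(mu*t) >= b)."""
    lam = mu * t
    return 1.0 - sum(math.exp(-lam) * lam ** k / math.factorial(k)
                     for k in range(b))

# Parameters as in Fig. 4b: threshold b = 20, learning rate mu = 5.
early = curve(1, mu=5, b=20)   # essentially 0: nothing visible yet
late = curve(6, mu=5, b=20)    # close to 1: near-perfect performance
# The transition from near 0 to near 1 occurs within a few trials,
# approaching the step-like ("insightful") curves described in the text.
```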

Fig. 4
figure 4

Illustration of how the compound exponential function may produce learning curves approaching step functions. a Plots of the compound exponential function, using c = 50 and μ = 2, 1.5, 1, and 0.5 for the four curves, from left to right, respectively. b Plots of the memory chain model function, using b = 20 and μ = 5, 4, 3, and 2, from left to right, respectively

Though Ebbinghaus (1885) was firmly convinced that learning was a gradual process, other early theorists argued for the importance of sudden or step-wise learning, notably referring to the Gestalt concept of insight (Köhler, 1947; Yerkes, 1916) to explain why jumps from not-learned to perfectly learned would be expected in learning processes. Careful experiments by Rock (1957), Estes (1960), and others showed that all-or-none learning is a reasonable explanation for at least some learning processes, which could be captured for example by two-state models of learning (Bower, 1962). More encompassing reviews of the literature, however, limited the applicability of these all-or-none explanations to cases in which the lists are not too long and have only two response alternatives that are thoroughly known by the subjects (Kintsch, 1970; Restle, 1965). In an early analysis of learning based on an urn model, Thurstone (1930) derived a learning equation that also produces S-shaped curves. Interestingly, this model explains “insight” (i.e., the sudden rise to perfection) by a high subject-related learning rate and low (perceived) complexity of the task (p. 484ff), which is opposite to the approach taken here.

A clear prediction made in this article is that if beginning learners of a foreign language must correctly reproduce long sentences, as when learning lyrics in an unknown language, the learning curves will resemble those of Fig. 4. More generally, whenever the learning performance is measured in items consisting of several features (words, syllables, etc.) that must all be correct in order for the item (as a whole) to be counted as correct, we would predict an S-shaped learning curve. For example, when mastery of a poem is measured as the number of lines reproduced perfectly, we would expect learning to be S-shaped. This was indeed found by Stroud (1931; see his Fig. 1). The results above thus confirm Stroud’s interpretation that “It may be that the size of the unit of response, a whole line, was conducive to learning curves of this kind” (p. 685). In a follow-up study in which he systematically varied the length of the items from 1 to 4 words, this was exactly the conclusion: The more words per item, the more the shape of the learning curve conformed to an S shape (Stroud, 1932).
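The multi-feature prediction can be illustrated with a deliberately simple assumption (mine, not the article's): each feature independently follows an exponential learning curve, and an item counts as correct only when all of its features are correct.

```python
import math

def item_curve(t, mu, n_features):
    """Illustrative sketch: each feature follows 1 - exp(-mu*t), and an
    item is correct only if all n_features are correct (independence
    assumed). For n_features >= 2 this product is S-shaped."""
    return (1.0 - math.exp(-mu * t)) ** n_features

# n_features = 1: a concave exponential curve from the first trial.
# n_features = 4 (cf. Stroud, 1932, items of up to 4 words): the curve
# starts out convex -- the S shape predicted for longer items.
```

The same discrete-curvature check as before confirms it: with four features the second difference at the start of learning is positive (convex onset), whereas with one feature it is negative (concave throughout).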

The assumptions made here that give rise to S-shaped learning curves cover only a few of the possible mechanisms that may be involved in any case of learning. A relevant mechanism not yet discussed may be the effect of stress on memory (Het, Ramlow, & Wolf, 2005; Joëls, 2006), whereby moderate stress is most conducive to a high learning rate. It is likely that such stress plays a role in real-life learning situations—for example, with students who experience performance-limiting math anxiety (Ashcraft & Kirk, 2001). Improvements in learning performance are typically observed as the anxiety subsides with time (Hembree, 1990). Other inverse U-shaped relationships, such as the orienting response in infants (Lewkowicz & Turkewitz, 1980), may also be of interest to learning theory. Initially an infant often does not orient toward a stimulus, and we may expect little learning to occur. With repeated exposures, the infant may start to orient, but after many exposures, interest in the stimulus diminishes and orienting decreases. When we have a factor such as stress or orientation that exhibits such an inverse U-shaped relationship with the number of exposures, we could incorporate this into the model by making the value of the learning parameter μ dependent on the number of learning trials. In that case, the learning rate would be depressed in the first portion of the curve, possibly reshaping it into an S shape.

A related mechanism that can also lead to S-shaped learning curves may be initial failure to locate the cue. The difference with orientation is that there is a deliberate search for the cue in some larger stimulus. For example, we may have to search a large array for a specific small pattern and describe its shape (e.g., when learning to detect tumors in X-rays or CT scans). If subjects are still inexperienced with the task and only have a brief moment to inspect an array, during the first few presentations they may not be able to find the pattern and hence may give chance responses. Once they have located a pattern in an array, however, on every subsequent encounter they may, after recognizing the array, rapidly locate the pattern again and give a correct response. This is a three-stage process, in which the subjects first learn to recognize the array, then learn where in it the pattern is, and finally associate the pattern with the correct response. Even if the first and third stages are extremely easy, it may take many experimental trials before the pattern is found, during which time the learning curve will be close to zero. After the pattern is found, the curve will make a jump to near-perfect performance. Averaging over different arrays, each with varying difficulties of the three stages, may produce an S-shaped learning curve.

Another aspect noted in the literature that may depress the initial portion of the learning curve (and make it S-shaped) is the development of a learning set, or the process of “learning to learn.” For example, Harlow (1949) did a long-running experiment with monkeys who had to learn how to solve object-quality discrimination problems with two alternatives. The monkeys had no prior laboratory experience and learned every aspect of the task in the experiment, including operating the apparatus. The averaged learning curve for the first eight discrimination problems was S-shaped (Harlow, 1949, his Fig. 2). As the monkeys gained more experience with the task, the learning curve (per problem) changed to a concave shape. With even more experience, eventually nearly step-wise learning curves developed, which Harlow described as “indicators of ‘insightful’ learning” (p. 53). This parallels the findings in Experiment 2 of this article, in which a shift can be observed from S-shaped to concave learning curves as subjects become more familiar with the Italian vocabulary.

In this article, I have argued against attempts to find a generally valid learning curve, and have instead advocated focusing on limited domains of application, modeling not the averaged learning curves but those of individual subjects. It may be perfectly valid to select a simple mathematical expression like the power function for the purpose of capturing general aspects of the learning process—for example, as part of a more encompassing cognitive model. The search for the shape of learning, however, must not stop there, and we must not conclude that such broadly fitting curves are sufficient to characterize the mechanisms of learning. Only when zooming in on the details of the learning process will the true form of the learning curve be unraveled.