The partial-reinforcement extinction effect (PREE; Humphreys, 1939) is one of the best examples of a basic behavioral phenomenon, detected in the laboratory, with potentially important practical implications. As most introductory psychology textbooks explain, the PREE refers to the fact that learned behavior is more robust to extinction when not all responses are reinforced (partial schedules) than when 100 % of responses are reinforced in training (full schedule; see, e.g., Atkinson, Atkinson, Smith, Bem, & Nolen-Hoeksema, 1995; Baron & Kalsher, 2000). For example, Atkinson et al. stated that partial reinforcements facilitate higher performance rates, since the probability that individuals will continue responding in the absence of reinforcements is much higher under partial schedules than under full schedules.

Unfortunately, however, many empirical studies fail to support this textbook assertion. Most early demonstrations of the PREE used between-subjects laboratory designs (e.g., Grosslight & Child, 1947; Mowrer & Jones, 1945; Pavlik & Flora, 1993) to show that under partial-reinforcement schedules, individuals tend to engage in more responses during extinction than under full schedules. However, most laboratory studies using within-subjects designs (Nevin, 1988; Papini, Thomas, & McVicar, 2002; Svartdal, 2000; but see Exp. 3 of Nevin & Grace, 2005, for an exception) and field research (Latham & Dossett, 1978; Pritchard, Hollenback, & DeLeo, 1980; Yukl, Latham, & Pursell, 1976) have reported that partial reinforcements impair, rather than improve, performance. For example, Yukl et al. found that tree planters were less productive under partial- than under full-reinforcement schedules. This negative effect of partial schedules was even observed when partial reinforcements yielded higher average payoffs.

Nevin (1988; see also Nevin & Grace, 2000) proposed behavioral momentum theory to account for the mixed PREE results. According to this account, two effects compete under partial-reinforcement schedules. On the one hand, partial reinforcements have an overall negative effect on the likelihood of selecting the reinforced alternative, due to a decrease in reinforcement rates. At the same time, however, partial reinforcements have a positive local effect: They slow extinction because of a generalization decrement, which hinders detection of changes in the reinforcement schedule in some settings (see the related observations in Gershman, Blei, & Niv, 2010). Thus, momentum theory suggests that the apparent inconsistency between basic and field research on the PREE can be explained by the assertion that classical demonstrations of the PREE focused on the positive local effect of partial reinforcement in slowing extinction, whereas field studies document the overall (negative) effect of partial reinforcements.

The main goal of the present analysis was to clarify and extend Nevin and Grace’s (2000) explanation of the mixed PREE findings. We relate Nevin and Grace’s (2000) assertion to the suggestion that people tend to select the action that has led to the best outcomes in similar situations in the past (see Biele, Erev, & Ert, 2009; Gonzalez, Lerch, & Lebiere, 2003; as well as a related observation by Patalano & Ross, 2007), and elucidate the conditions under which partial reinforcements are likely to be effective and countereffective.

Experiment 1: evaluation of the positive and negative effects of partial reinforcement

In most previous demonstrations of the PREE (e.g., Grant, Hake, & Hornseth, 1951), the expected benefit from the reinforced choice was higher under full than under partial schedules, because the same magnitudes were administered at higher rates. In the present study, we avoided this confound by manipulating the size of the rewards to ensure equal sums of reinforcements under both partial and full schedules. The primary goal of Experiment 1 was to examine whether the overall negative effect of partial reinforcements could be observed, even when this condition did not imply a lower sum of reinforcements.

Method

Participants

A group of 24 undergraduates from the Faculty of Industrial Engineering and Management at the Technion served as paid participants in the experiment. They were recruited via signs posted around campus for an experiment in decision making. The sample included 12 males and 12 females (mean age 23.7 years, SD = 1.88).

Apparatus and procedure

For the experiment, we used a clicking paradigm (Erev & Haruvy, 2013), which consisted of two unmarked buttons and an accumulated payoff counter. Each selection of one of the two buttons was followed by three immediate events: a presentation of the obtained payoff (in bold, on the selected button, for 1 s), a presentation of the foregone payoff (on the unselected button, for 1 s), and a continuous update of the payoff counter (the addition of the obtained payoff to the counter). The exact payoffs were a function of the reinforcement schedule, the phase, and the choice, as explained below.

Participants were instructed to repeatedly choose a button in order to maximize their total earnings. No prior information regarding the payoff distribution was delivered. The study included 200 trials, with a 1-s interval between trials. The task lasted approximately 10 min.

Design

The participants were randomly assigned to one of the two reinforcement schedule conditions: full (n = 11) and partial (n = 13). Each participant faced two phases of 100 trials (“training” and “extinction”) under one of the schedules and selected one of the buttons for each trial. A selection of one of the two buttons, referred to as the “demoted option,” always led to a payoff of eight points. The payoff from the alternative button, referred to as the “promoted option,” depended on the phase and the reinforcement schedule, as follows:

During the first phase of 100 trials (the “training phase”), the promoted option yielded a mean payoff of nine points per selection. The two schedules differed with respect to the payoff variability around this mean. The full schedule involved no variability: Each selection of the promoted option provided a payoff of nine points. In contrast, the partial schedule involved high variability: Choices of the promoted option were rewarded with a payoff of 17 points on 50 % of the trials, and with a payoff of one point on the remaining trials.

The second phase of 100 trials simulated “extinction.” During this phase, the promoted option yielded one point per selection. Thus, the promoted option was more attractive than the demoted option during training (mean of nine relative to eight points), but less attractive during extinction (one relative to eight). The left side of Table 1 summarizes the experimental conditions of Experiment 1.
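To make the payoff structure concrete, the following minimal Python sketch encodes the payoff of the promoted option as a function of the schedule and the phase (the function name and representation are ours, added for illustration; they are not part of the experimental software):

```python
import random

def promoted_payoff(schedule, phase, rng=random):
    """Illustrative payoff rule for the promoted option in Experiment 1.

    schedule: "full" or "partial"; phase: "training" or "extinction".
    The demoted option always paid 8 points and is not modeled here.
    """
    if phase == "extinction":
        return 1                                  # extinction: 1 point per selection
    if schedule == "full":
        return 9                                  # training, full schedule: always 9 points
    return 17 if rng.random() < 0.5 else 1        # training, partial: 17 or 1, equally likely

# Both training schedules have the same expected value per selection:
# full = 9; partial = 0.5 * 17 + 0.5 * 1 = 9.
```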

Table 1 Experiment 1: Observed mean proportions of selecting the promoted option (P-prom), as a function of the phase and the reinforcement schedule. The two rightmost columns present the predictions of the contingent-sampling (CS) and fictitious-play (FP) models

To motivate the participants, we provided a monetary incentive: Points were converted to money at an exchange rate of 100 points = NIS 1.5 (about 33 cents). This resulted in an average total payoff of NIS 14 (about $3.20).

Results and discussion

The left side of Fig. 1 presents the observed choice proportions for the promoted option (P-prom) in blocks of 10 trials as a function of the reinforcement schedule and phase. Table 1 presents the mean values. During training, P-prom was higher under the full (M = .92, SD = .15) than under the partial (M = .68, SD = .19) schedule. The opposite pattern was observed during extinction: P-prom was lower under the full (M = .03, SD = .01) than under the partial (M = .06, SD = .03) schedule. A 2 × 2 repeated measures analysis of variance (ANOVA) was conducted to test the effects of phase and reinforcement schedule on P-prom. This analysis revealed significant main effects of both phase [F(1, 22) = 604.65, p < .0001, ηp² = .965] and schedule [F(1, 22) = 10.07, p < .005, ηp² = .314], as well as a significant interaction between the two factors [F(1, 22) = 22.69, p < .0001, ηp² = .508]. That is, both the negative effect of partial reinforcements during training and the positive PREE during extinction were significant.

Fig. 1 Observed and predicted proportions of selecting the promoted option (P-prom) in blocks of 10 trials, as a function of the reinforcement schedule, in Experiment 1. Extinction began at Block 11

Over the 200 trials, P-prom was significantly higher under the full (M = .48, SD = .08) than under the partial (M = .36, SD = .11) schedule, t(22) = 2.74, p < .01, d = 1.17. These results support Nevin and Grace’s (2000) resolution of the mixed PREE pattern: The effect of partial reinforcements was positive immediately after the transition to extinction, but negative when evaluated over all trials.

The contingent-sampling hypothesis

Previous studies of decisions from experience have highlighted the value of models that assume reliance on small samples of experiences in similar situations. Models of this type capture the conditions that facilitate and impair learning (Camilleri & Newell, 2011; Erev & Barron, 2005; Yechiam & Busemeyer, 2005), and have won recent choice prediction competitions (Erev, Ert, & Roth, 2010). Our attempt to capture the present results starts with a one-parameter member of this class of models. Specifically, we considered a contingent-sampling model in which similarity was defined on the basis of the relative advantage of the promoted option in the m most recent trials (m, a nonnegative integer, is the model’s free parameter). For example, the sequence G–G–L implies that the promoted option yielded a relative loss (L = lower payoff, relative to the demoted option) in the last trial, but a relative gain (G = higher payoff) in the previous two trials, and with m = 3, all trials that immediately follow this sequence are “similar.” The model assumes that the decision is made on the basis of comparing the average payoffs from the two options in all past similar trials (and of random choice, before gaining relevant experience).

To clarify this logic, consider the contingent-sampling model with the parameter m = 2, and assume that the observed outcomes in the first nine trials of the partial condition provided the sequence L–L–L–G–L–L–G–L–L. That is, the payoff from the promoted option was 17 (a relative gain) in Trials 4 and 7, and one (a relative loss) in the other seven trials. At Trial 10, the agent faces a choice after the sequence L–L. She will therefore recall all of her experiences after identical sequences (Trials 3, 4, and 7), compute the average payoff from the promoted option in this set to be (1 + 17 + 17)/3 = 11.67, and since the average payoff from the demoted option is only eight, will select the promoted option.
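The following Python sketch is our own illustration of this decision rule (not the original simulation code); it assumes, as in the experiment, that the promoted option’s payoff is observed on every trial, either as the obtained or as the foregone payoff. Running it reproduces the worked example above.

```python
import random

def cs_choice(promoted_payoffs, m, demoted_payoff=8, rng=random):
    """Contingent-sampling choice rule (illustrative sketch).

    promoted_payoffs: the promoted option's payoffs on all past trials.
    m: context length, the model's only free parameter.
    Returns "promoted" or "demoted".
    """
    t = len(promoted_payoffs)
    if t <= m:                              # no relevant experience yet: choose at random
        return rng.choice(["promoted", "demoted"])
    # Context = pattern of relative gains (G) and losses (L) over the last m trials.
    context = ["G" if p > demoted_payoff else "L" for p in promoted_payoffs[-m:]]
    # Collect the promoted option's payoffs on all past trials that followed an
    # identical context ("similar" trials).
    similar = [promoted_payoffs[i] for i in range(m, t)
               if ["G" if p > demoted_payoff else "L"
                   for p in promoted_payoffs[i - m:i]] == context]
    if not similar:                         # no similar past trial: choose at random
        return rng.choice(["promoted", "demoted"])
    mean_promoted = sum(similar) / len(similar)
    return "promoted" if mean_promoted > demoted_payoff else "demoted"

# Worked example from the text: first nine payoffs under the partial schedule,
# i.e., the sequence L-L-L-G-L-L-G-L-L.
history = [1, 1, 1, 17, 1, 1, 17, 1, 1]
print(cs_choice(history, m=2))   # similar trials 3, 4, and 7 -> mean (1 + 17 + 17)/3 = 11.67,
                                 # which exceeds 8, so the promoted option is selected
```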

The model’s predictions were derived using a computer simulation in which virtual agents, programmed to behave in accordance with the model’s assumptions, participated in a virtual replication of the experiment. Two thousand simulations were run in SAS for each of several m values (from 1 to 5). During the simulations, we recorded the same P-prom statistics as in the experiment.

The results revealed that the main experimental patterns (a negative overall effect of partial reinforcement and a small positive effect after the transition) were reproduced with all m values. The best fit (minimal mean squared distance between the observed and predicted P-prom rates) was found with the parameter m = 4. The predictions with this parameter are presented in Table 1 and Fig. 1.
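A minimal version of such a simulation might look as follows. This is our own Python sketch rather than the original SAS code; the payoff rules and the choice rule are as described above, and the number of simulated agents in the demonstration is kept small for speed (the reported analysis used 2,000 simulations). The best-fitting m is the value that minimizes the mean squared distance between the simulated and observed block means (not computed here, because the observed block means appear only in Fig. 1).

```python
import random

def run_agent(m, schedule, n_train=100, n_ext=100, demoted=8, rng=None):
    """Simulate one virtual contingent-sampling agent through training and extinction."""
    rng = rng or random.Random()
    promoted_hist, choices = [], []
    for t in range(n_train + n_ext):
        # Payoff the promoted option yields on this trial (observed even if not chosen).
        if t >= n_train:
            pay = 1                                   # extinction
        elif schedule == "full":
            pay = 9                                   # training, full schedule
        else:
            pay = 17 if rng.random() < 0.5 else 1     # training, partial schedule
        # Contingent-sampling choice based on the m most recent relative gains/losses.
        if t <= m:
            choice = rng.choice(["promoted", "demoted"])
        else:
            ctx = [p > demoted for p in promoted_hist[-m:]]
            similar = [promoted_hist[i] for i in range(m, t)
                       if [p > demoted for p in promoted_hist[i - m:i]] == ctx]
            if similar:
                choice = ("promoted" if sum(similar) / len(similar) > demoted
                          else "demoted")
            else:
                choice = rng.choice(["promoted", "demoted"])
        choices.append(choice)
        promoted_hist.append(pay)
    return choices

def p_prom_by_block(m, schedule, n_agents=100, block=10, n_trials=200):
    """Mean proportion of promoted-option choices in each block of 10 trials."""
    totals = [0.0] * (n_trials // block)
    for _ in range(n_agents):
        choices = run_agent(m, schedule)
        for b in range(len(totals)):
            totals[b] += choices[b * block:(b + 1) * block].count("promoted") / block
    return [x / n_agents for x in totals]

# Predicted learning curves for each candidate value of the free parameter m.
for m in range(1, 6):
    print(m, [round(x, 2) for x in p_prom_by_block(m, "partial")])
```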

Notice that with m = 4, the model implies reliance on small samples: Since 16 distinct sequences of four relative gains and/or losses are possible, the decisions during the 100 training trials will typically be based on about six or fewer “similar” experiences. This fact has no effect under the full schedule (in which a sample of one is sufficient to maximize), but it leads to deviations from maximization under the partial schedule (since some samples include more “1” than “17” trials). During extinction, however, after the fourth trial all decisions are made after the sequence L–L–L–L. Under the full schedule, participants never experience this sequence during training. Thus, all of their experiences after this sequence lead them to prefer the demoted option (and the first such experience occurs only on the fifth extinction trial). In contrast, typical participants under the partial schedule experience this sequence about six times during training, and these experiences can lead them to prefer the promoted option in the early trials of the extinction phase.
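The following short sketch (again ours, for illustration only) makes this argument concrete by counting how often each length-4 context occurs in one simulated 100-trial training phase of each schedule:

```python
import random
from collections import Counter

def context_counts(schedule, n_train=100, m=4, demoted=8, seed=0):
    """Count the length-m gain/loss contexts observed during one simulated training phase."""
    rng = random.Random(seed)
    if schedule == "full":
        payoffs = [9] * n_train                       # promoted option always pays 9
    else:
        payoffs = [17 if rng.random() < 0.5 else 1    # 17 or 1, equally likely
                   for _ in range(n_train)]
    symbols = ["G" if p > demoted else "L" for p in payoffs]
    return Counter("".join(symbols[i - m:i]) for i in range(m, n_train))

print(context_counts("full"))     # only "GGGG" occurs; "LLLL" is never experienced in training
print(context_counts("partial"))  # 16 possible contexts, about 6 occurrences of each,
                                  # typically including several occurrences of "LLLL"
```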

Inertia, generalization, noise, and bounded memory

The present abstraction of the contingent-sampling idea is a simplified variant of the abstraction that won Erev, Ert, and Roth’s (2010) choice prediction competition (Chen, Liu, Chen, & Lee, 2011). The assumptions that were excluded are inertia (a tendency to repeat the last choice), generalization (some sensitivity to the mean payoff), a noisy response rule, and bounded memory. A model including these assumptions yielded a slight improvement in fit in the present settings, but did not change the main, aggregate predictions. Thus, the present analysis does not rule out these assumptions; it only shows that they are not necessary in order to capture the aggregate effect of partial reinforcements documented here.

Fictitious play and reinforcement learning

In order to clarify the relationship between the contingent-sampling hypothesis and popular learning models, we also considered a two-parameter smooth fictitious-play (SFP) model (Fudenberg & Levine, 1999). SFP assumes that the propensity to select option j at trial t + 1, after observing the payoff v(j, t) at trial t, is

$$ Q(j, t+1) = (1 - w)\,Q(j, t) + w\,v(j, t), $$

where w is a free weighting parameter and Q(j, 1) = 0. The probability of selecting j over k at trial t is

$$ P(j, t) = \frac{1}{1 + e^{\sigma\left[ Q(k, t) - Q(j, t) \right]}}, $$

where σ is a free response strength parameter. SFP is an example of a two-parameter reinforcement learning model (Erev & Haruvy, 2013). Table 1 and Fig. 1 show that this model fits the aggregate choice rate slightly better than the one-parameter contingent-sampling model.
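The two equations translate directly into code. The sketch below is our own minimal Python rendering of SFP; the parameter values are illustrative placeholders rather than the estimates reported in Table 1, and we assume that both propensities are updated on every trial because the foregone payoff is also displayed in the clicking paradigm.

```python
import math
import random

def sfp_p_prom(schedule, w=0.05, sigma=2.0, n_train=100, n_ext=100,
               demoted=8, rng=None):
    """Smooth fictitious play: Q(j, t+1) = (1 - w) Q(j, t) + w v(j, t), with the
    logistic response rule P(j, t) = 1 / (1 + exp(sigma [Q(k, t) - Q(j, t)])).

    Returns the predicted probability of selecting the promoted option on each trial.
    """
    rng = rng or random.Random(0)
    q_prom = q_dem = 0.0                                # Q(j, 1) = 0
    probs = []
    for t in range(n_train + n_ext):
        probs.append(1.0 / (1.0 + math.exp(sigma * (q_dem - q_prom))))
        # Payoffs v(j, t) observed on this trial.
        if t >= n_train:
            v_prom = 1                                  # extinction
        elif schedule == "full":
            v_prom = 9
        else:
            v_prom = 17 if rng.random() < 0.5 else 1    # partial schedule
        v_dem = demoted
        # Weighted-average updating of both propensities.
        q_prom = (1 - w) * q_prom + w * v_prom
        q_dem = (1 - w) * q_dem + w * v_dem
    return probs

# Mean predicted P-prom during training and during extinction under each schedule.
for sched in ("full", "partial"):
    p = sfp_p_prom(sched)
    print(sched, round(sum(p[:100]) / 100, 2), round(sum(p[100:]) / 100, 2))
```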

Experiment 2: the relative importance of the two effects

The contingent-sampling model implies that the overall effect of partial reinforcements depends on the difference between the two alternatives during training: When the advantage of the promoted option is sufficiently large, partial reinforcement can increase the overall choice rate of this option. The fictitious-play model predicts a weaker effect: It implies that a large advantage of the promoted option will only eliminate the negative effect of partial reinforcements, not reverse it. Experiment 2 was designed to clarify and compare these predictions by examining the effect of the payoff from the demoted option. Two payoff environments were compared: The payoff from the demoted option was five points in Environment 5 and two points in Environment 2. In all other respects, Experiment 2 was identical to Experiment 1.
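To see the intuition behind the contingent-sampling prediction, a simple binomial calculation (ours, not part of the reported analyses) shows how often a small sample of partial-schedule training payoffs favors the promoted option over demoted options of different values; the probability rises as the demoted option’s payoff falls, which is one ingredient of the predicted reversal.

```python
from math import comb

def p_sample_favors_promoted(k, demoted):
    """Probability that the mean of k partial-schedule training payoffs
    (17 or 1 point, each with probability .5) exceeds the demoted payoff."""
    return sum(comb(k, g) * 0.5 ** k          # g = number of 17-point outcomes in the sample
               for g in range(k + 1)
               if (17 * g + (k - g)) / k > demoted)

for demoted in (8, 5, 2):
    print(demoted, [round(p_sample_favors_promoted(k, demoted), 2) for k in (1, 3, 6)])
# demoted = 8: [0.5, 0.5, 0.66]; demoted = 5: [0.5, 0.88, 0.89]; demoted = 2: [0.5, 0.88, 0.98]
```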

The right panels of Fig. 2a and b and Table 2 present the predictions of the contingent-sampling and fictitious-play models for Experiment 2, with the parameters that best fit the results of Experiment 1. The predictions were derived from computer simulations in which virtual agents (programmed to behave in accordance with the models’ assumptions, with the parameters estimated in Exp. 1) faced 100 training and then 100 extinction trials. The contingent-sampling model predicts that the overall effect of partial reinforcements on the choice rate of the promoted option will be negative in Environment 5, but positive in Environment 2. In contrast, the fictitious-play model predicts similar learning patterns under the two schedules.

Fig. 2 (a) Observed and predicted proportions of selecting the promoted option (P-prom) in blocks of 10 trials, as a function of the reinforcement schedule, in Environment 5. (b) Observed and predicted proportions of selecting the promoted option (P-prom) in blocks of 10 trials, as a function of the reinforcement schedule, in Environment 2. Extinction began at Block 11

Table 2 Experiment 2: Observed mean proportions of selecting the promoted option (P-prom), as a function of the environment, the reinforcement schedule, and the training phase, as well as the predictions of the contingent-sampling (CS) and fictitious-play (FP) models

Method

Participants

A group of 49 undergraduates from the Faculty of Industrial Engineering and Management at the Technion served as paid participants in the experiment. They were recruited through signs, as in the previous setting. The sample included 25 males and 24 females (mean age 24.56 years, SD = 2.7).

Apparatus and procedure

The apparatus was the same computerized clicking paradigm as in Experiment 1, and the procedure was identical.

Design

For the experiment, we used a 2 × 2 between-subjects design (see the left side of Table 2). The participants were randomly assigned to one of two environments and one of two reinforcement schedules. The environments differed with respect to the value of the demoted option: five points in Environment 5, and two in Environment 2. As in Experiment 1, we used partial and full reinforcements, and each participant faced 100 training trials first, and then 100 extinction trials. The assignment to conditions was random, with 13 participants under the full schedule in Environment 5, and 12 participants in each of the remaining conditions.

Results and discussion

The experimental results are presented in the left panels of Fig. 2 and in Table 2. The main results are consistent with the predictions of the contingent-sampling model: Partial reinforcements reduced the overall choice rate of the promoted option in Environment 5 (from .50 under the full schedule to .34 under the partial schedule), but increased it in Environment 2 (from .51 to .58). A 2 × 2 ANOVA was conducted to test the effects of the environment and the schedule on P-prom. This analysis revealed significant main effects of both the environment [F(1, 45) = 59.106, p < .0001, ηp² = .568] and the schedule [F(1, 45) = 6.974, p < .05, ηp² = .134], as well as a significant interaction between the two factors [F(1, 45) = 54.123, p < .0001, ηp² = .546].

Analysis of the learning curves revealed the robustness of the pattern documented in Experiment 1: Partial reinforcements slowed learning during training and slowed extinction. In addition, the curves show that the effect of the environment could be observed in both phases: A weak demoted option (Environment 2) reduced the negative effect of partial reinforcements during training and enhanced their positive effect during extinction. Table 2 shows that the contingent-sampling model captures this pattern more accurately than the fictitious-play model.

The effect of multiple transitions

The practical implications of these results could be questioned on the grounds that our experiments included only one transition from training to extinction (at Trial 101), whereas most applied studies have examined multiple transitions. To address this critique, we ran a replication of Experiments 1 and 2 in a multitransition setting, in which transitions occurred with a probability of 5 % after each trial. The results replicated those of Experiments 1 and 2 and are therefore only summarized here: Partial reinforcement reduced the overall choice rate of the promoted option when the demoted option was relatively attractive (Environments 8 and 5; i.e., demoted-option payoffs of eight and five points), but increased it when the demoted option was unattractive (Environment 2). In addition, these results were in close agreement with the predictions of the contingent-sampling model.

General discussion

The present analysis starts with a pessimistic observation and concludes with an optimistic one. The pessimistic observation involves the inconsistency between basic and field research: The partial-reinforcement extinction effect (PREE), one of the best-known examples of a behavioral regularity discovered in basic experimental research, appears to be inconsistent with field research. The optimistic conclusion is that the knowledge accumulated in basic decision research can help resolve this apparent inconsistency. Specifically, the notion of contingent sampling (Biele et al., 2009; Gonzalez et al., 2003), which implies reliance on small sets of experiences (e.g., Fiedler & Kareev, 2006; Hertwig, Barron, Weber, & Erev, 2004), can be used to predict the magnitude of the positive and negative effects of partial reinforcements and to clarify the conditions under which the overall effect of partial reinforcements is likely to be positive.

As was suggested by Nevin and Grace (2000), our results demonstrate that the effect of the reinforcement schedule is highly sensitive to the evaluation criteria. In many cases, partial reinforcement has a small positive effect on selecting the promoted alternative during extinction, and a larger negative effect on this behavior during training. This sensitivity can explain the apparent inconsistency between laboratory and field studies of partial reinforcements: Most laboratory studies have focused on the positive effect of partial reinforcements during extinction, whereas most field studies have focused on the fact that the overall effect of partial reinforcement tends to be negative.

In addition, our results show that when the attractiveness of the demoted option is sufficiently low, the positive effect of partial reinforcements is larger than their negative effect. This observation implies that the negative effect of partial reinforcements found in field studies might have resulted from the fact that the attractiveness of the demoted option in these experiments (slacking off) was higher than that of the promoted option (exerting high physical effort; e.g., beaver trapping in Latham & Dossett, 1978, and tree planting in Yukl et al., 1976).

In order to clarify the practical implications of the present results, we conclude with a concrete example. Assume that in a particular factory, workers must carry out a quality control test after the production of each unit. To promote this behavior, the supervisor can reinforce it: She can reinforce all of the tests that she observes (full reinforcement) or only some of them (partial reinforcement). Since she cannot see all of the workstations at all times (but only monitor them occasionally), the workers alternate between a training phase (when the supervisor monitors the workstation) and an extinction phase (when she does not). Our results suggest that the optimal reinforcement schedule depends on the relative attractiveness of the demoted option (i.e., not performing the test). When the test is short and relatively unobtrusive (and skipping it does not save much time and/or effort), partial reinforcements can be effective. However, partial reinforcements are likely to be counterproductive if the test is demanding (cf. Brewer & Ridgway, 1998).