1 Introduction

David Hume believed that we can never have rational justification for our expectations or beliefs regarding matters of fact (1975, pp. 25–39). In his sceptical solution to the problem of induction, he explained that beliefs regarding matters of fact, and expectations regarding the future in particular, were naturally learned, not rationally justified. This shift in focus from rational justification to learning has both pragmatic and naturalistic virtues.Footnote 1

Hume held that beliefs regarding expectation and matters of fact were produced from experience by means of custom or habit. Custom, in the sense in which he used the term, is a principle of our psychological nature that acts to produce and adjust propensities when presented with experience. He explained that “wherever the repetition of any particular act or operation produces a propensity to renew the same act or operation, without being impelled by any reasoning or process of the understanding ... this propensity is the effect of Custom” (1975, p. 43). Custom “is nothing but a species of instinct or mechanical power, that acts in us unknown to ourselves” (1975, p. 108). We learn just as animals do who “by the proper application of rewards and punishments, may be taught any course of action” (1975, p. 105). To learn by custom, then, is to learn by reinforcement on success and punishment on failure.

Hume regarded such reinforcement learning to be a fortunate natural endowment of human psychology:

Custom \(\ldots \) is the great guide of human life. It is that principle alone which renders our experience useful to us, and makes us expect, for the future, a similar train of events with those which have appeared in the past. Without the influence of custom, we should be entirely ignorant of every matter of fact beyond what is immediately present to the memory and senses. We should never know how to adjust means to ends, or to employ our natural powers in the production of any effect. There would be an end at once of all action, as well as of the chief part of speculation (1975, pp. 44–45).

And he took its effect to be as “unavoidable as to feel the passion of love, when we receive benefits; or hatred, when we meet with injuries” (1975, p. 46).

There is a great deal of evidence that Hume was right to believe that humans and other animals very often learn by means of some form of reinforcement with punishment.Footnote 2 That said, humans and many animals are also capable of learning in other context-specific ways.

A natural extension of Hume’s account of how we learn would consider how a reinforcement learner might develop and learn to implement other forms of learning, forms better suited to particular practical contexts. Barrett (2023) takes up this theme, offering a general Humean strategy for how an agent might use simple reinforcement to learn how to learn better. Here we focus on a narrower problem regarding natural learning. Specifically, we consider how rhesus monkeys might learn how to learn more efficiently using a form of self-tuning reinforcement learning in the context of a classic experimental study by Harry Harlow (1949).Footnote 3 This dynamics allows a reinforcement learner to learn how to learn in a way that is better suited to a particular type of problem while also learning how to apply the new form of learning to the problem. We take this form of reinforcement learning to accord well with Hume’s commitment to custom.

The argument proceeds as follows. In Sect. 2 we introduce two kinds of learning, reinforcement and win-stay/lose-shift. In Sect. 3 we explain how the learning to learn achieved by Harlow’s subjects can be thought of as a gradual transition from the former to the latter. In Sects. 4 and 5 we argue that this transition is well modeled by a kind of “heating up” where the two parameters governing a reinforcement learner’s behavioral and attentional dispositions are gradually increased over time. In Sects. 6 and 7 we present a model where the heating-up process is implemented by a form of higher-order reinforcement learning that operates on how much the learner reinforces or punishes. This second model shows that learning to learn of the kind achieved by Harlow’s monkeys can be accomplished by a self-tuning process that involves nothing more sophisticated than reinforcement of dispositions when they produce successful actions and punishment of dispositions when they produce unsuccessful actions. The two-tiered formulation of reinforcement with punishment that we describe is both simple and highly adaptable. In Sect. 8 we briefly discuss the results.

2 Two forms of learning

Edward Thorndike (1898) was one of the first to investigate in detail how animals learn by reinforcement with punishment. He summarized the results of his experiments on cats, dogs, and chicks in two laws. The first was the law of effect:

Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond.

The second was the law of exercise:

Any response to a situation will, other things being equal, be more strongly connected with the situation in proportion to the number of times it has been connected with that situation and to the average vigor and duration of the connections. (Thorndike, 1911, p. 244)

Together these laws capture the key features of reinforcement with punishment. Namely, an animal is more likely to perform an action when it has been rewarded in connection with that type of action, less likely to perform it when it has been punished, and both the magnitude and the number of rewards and punishments matter in a cumulative way to the animal’s subsequent probabilistic dispositions.Footnote 4

In its most basic form learning by reinforcement with punishment can be modeled as follows.Footnote 5 Let \(q_k(t)\) be an agent’s propensity for action k at time t. Her propensities evolve according to the update rule:

$$\begin{aligned} q_k(t+1) = \left\{ \begin{array}{ll} q_k(t) + \pi (t) &{} \text {if action } k \text { was taken} \\ q_k(t) &{} \text {otherwise.} \end{array} \right. \end{aligned}$$

Here \(\pi (t)\) is the payoff received by an agent taking the action k on round t. It may be positive for reinforcement or negative for punishment depending on the degree of success or failure resulting from the action. If one allows for punishment, then one needs to do something to prevent negative propensities. One strategy is to specify a constant \(b>0\), then to stipulate that if a punishment would result in \(q_k(t+1)<b\), then \(q_k(t+1)=b\).

An agent’s propensities, in turn, determine her probabilistic dispositions. This works by means of the response rule:

$$\begin{aligned} p_k(t) = \frac{q_k(t)}{\sum _j q_j(t)}, \end{aligned}$$

where \(p_k(t)\) is the probability that the agent takes action k at time t and j ranges over the available actions. In order to say how the process gets started, one must also specify the set of initial propensities \(q_j(0)\).
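
A minimal Python sketch may help make the update and response rules concrete. The class name, the floor value, and the payoffs in the usage example are our own illustrative choices, not parameters drawn from the models discussed below.

```python
import random

class ReinforcementLearner:
    """Basic reinforcement with punishment (illustrative sketch)."""

    def __init__(self, n_actions, initial_propensity=1.0, floor=0.01):
        # q_j(0): the initial propensities; floor plays the role of the constant b > 0.
        self.q = [initial_propensity] * n_actions
        self.floor = floor

    def probabilities(self):
        # Response rule: p_k(t) = q_k(t) / sum_j q_j(t).
        total = sum(self.q)
        return [qk / total for qk in self.q]

    def choose(self):
        # Sample an action with probability proportional to its propensity.
        return random.choices(range(len(self.q)), weights=self.q)[0]

    def update(self, action, payoff):
        # Update rule: add the payoff pi(t) (positive for reinforcement,
        # negative for punishment) to the chosen action's propensity,
        # never letting it fall below the floor b.
        self.q[action] = max(self.q[action] + payoff, self.floor)


# Illustrative use: two actions, action 0 is always correct; reinforce by +1, punish by -0.5.
learner = ReinforcementLearner(n_actions=2)
for _ in range(100):
    act = learner.choose()
    learner.update(act, 1.0 if act == 0 else -0.5)
```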

While both humans and animals often learn by this or a similar variety of reinforcement, they also learn in other ways. Sometimes an agent considering multiple possible actions begins by picking one at random. If her guess leads to success, then she repeats the same action the next time she finds herself in a similar situation. But if her attempt results in failure, then she tries a different response at random the next time around.

Win-stay/lose-shift formalizes this type of trial-and-error learning. Consider a learner who confronts a series of trials each of which results in either success or failure depending on which of a finite number of acts she chooses on that trial. As above, we will use t to denote the current time-step. At \(t = 0\), a win-stay/lose-shift learner chooses each available act with equal probability. At each subsequent step, if she chooses act a at t and that choice leads to successful action, then she chooses a again at \(t+1\); if she chooses a at t and fails on that trial, she chooses an act at random and without bias from the set of all available acts except for a at \(t+1\).
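
For comparison, here is an equally minimal sketch of a win-stay/lose-shift learner; again the names are our own. After a failure, the shift is computed at update time, which is behaviorally equivalent to choosing at random from the remaining acts on the next trial.

```python
import random

class WinStayLoseShift:
    """Win-stay/lose-shift over a finite set of acts (illustrative sketch)."""

    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.next_action = None

    def choose(self):
        if self.next_action is None:
            # At t = 0: each available act is chosen with equal probability.
            return random.randrange(self.n_actions)
        return self.next_action

    def update(self, action, success):
        if success:
            # Win-stay: repeat the successful act on the next trial.
            self.next_action = action
        else:
            # Lose-shift: choose at random, without bias, from the other acts.
            self.next_action = random.choice(
                [a for a in range(self.n_actions) if a != action])
```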

Win-stay/lose-shift does better than reinforcement in some learning problems. Harlow (1949) presented rhesus monkeys with a series of such problems and recorded their behavior. The monkeys started as reinforcement learners, then slowly learned how to learn by win-stay/lose-shift in a context-specific way that involved the coevolution of what they took to be salient as they learned. We show that the learning accomplished by Harlow’s monkeys is well modeled by a process in which they gradually shift from implementing a simple form of reinforcement learning to implementing a learning rule that closely approximates win-stay/lose-shift even as they learn by reinforcement how to apply the new form of learning to the particular type of problem they face.

One might think of this co-evolutionary process as a self-assembling discrimination game.Footnote 6 In a self-assembling game, structural features of a strategic interaction, such as the payoff structure or the players’ strategy sets, evolve alongside the strategic dispositions of the players. In the game played by Harlow’s monkeys, both the learning dynamics they use to update their dispositions and the features of the world on which they condition their actions coevolve as they play.

3 Harlow’s monkeys

Harry Harlow (1949) performed a series of experiments to determine how rhesus monkeys might learn how to learn in the context of a particular type of problem. Here we are primarily concerned with his first experiment.

In Harlow’s first experiment, the monkeys were presented with a series of discrimination problems. Each problem consisted of a different pair of objects \(O_1\) and \(O_2\) that were easily distinguishable, with one of these, say \(O_1\), always covering a small piece of food. As a concrete example, \(O_1\) might be a handkerchief and \(O_2\) a small pillow for a given problem. The two objects were then placed randomly to the left and right before the monkey. The monkey was rewarded if it chose the object covering the food. Each problem consisted of a sequence of repeated trials with the objects \(O_1\) and \(O_2\) randomly placed before the monkey and with the same object \(O_1\) always covering the food. Then the experimenter introduced a new learning problem with two new objects and with the food always under one of those. The full experiment consisted of a series of 344 such problems, each consisting of multiple trials, using 344 different pairs of stimuli (objects) run on a group of eight monkeys (1949, p. 52).

Fig. 1 Harlow’s 1949 learning set data: discrimination learning curves on successive blocks of problems

Harlow found that the monkeys initially learned to select the right object within a problem by means of a process that is closely modeled by simple reinforcement learning. But in later problems, the monkeys learned where the food was much faster and in a qualitatively different way. Figure 1 is reproduced from Harlow’s original paper. It plots, for each of a series of blocks of problems, the monkeys’ mean aggregate success rate for the first six trials of each problem in that block. The key finding is that the monkeys gradually come to learn much faster on later blocks of problems.

By learning across problems, the monkeys learned how to learn more efficiently within each problem. Instead of their default reinforcement learning, they gradually began to learn by means of a form of win-stay/lose-shift.Footnote 7 They would choose an object on their first trial blindly. If the food was there, they would stay with that object no matter where it might be located on a future trial. If the food wasn’t there, they would choose the other object regardless of where it might be located on a future trial.Footnote 8

Harlow referred to this type of acquired skill as a learning set, a way of learning in the context of a particular type of problem. In allowing for more efficient forms of learning, he said, the formation of a new learning set “delivers the animal from Thorndikian bondage” (1949, p. 59). The monkeys are no longer dependent on their usual reinforcement learning, a form of learning that does not work nearly as well as win-stay/lose-shift for the task at hand.

In learning how to learn better, the monkeys coevolve a new learning dynamics and new associated saliences that allow for the effective use of the new dynamics.Footnote 9 The monkeys’ probabilistic dispositions gradually shift from those associated with reinforcement learning to those associated with win-stay/lose-shift over subsequent problems. They also learn that objects matter and locations don’t, and they learn to use win-stay/lose-shift, not simple reinforcement, for this type of problem. In this way, the coevolved saliences provide conditions for how the new dynamics is used.

In brief, the monkeys begin as reinforcement learners who consider both position and object quality, then gradually learn to use win-stay/lose-shift on object quality.Footnote 10 In doing so, they self-assemble a new way of learning in the context of this particular type of task.

Harlow also described a series of experiments in which children were presented with a similar task. The children learned in much the same way as the monkeys but were faster in moving from reinforcement to win-stay/lose-shift with the associated saliences (1949, pp. 55 and 59).

While it is unclear precisely how the monkeys or children are learning how to learn, one can model how a reinforcement learner might learn to use win-stay/lose-shift with appropriate attendant saliences. We will consider how an agent might learn new saliences by reinforcing on what she attended to when her action was successful and how she might gradually evolve from learning by simple reinforcement to learning by win-stay/lose-shift by updating the magnitudes by which she reinforces on success and punishes on failure.

In the first model, we show that the monkeys’ gradual transition from reinforcement learning to win-stay/lose-shift can be thought of as a kind of “heating up” of the monkeys’ act- and salience-level learning, in which they are always learning by a form of reinforcement with punishment but the magnitudes by which they reinforce on success and punish on failure grow over time. The first model, however, does not consider how this heating-up process might be realized by a learning mechanism. This is addressed by the second model, which we introduce in Sect. 6. The second model shows how a shift in learning like that achieved by Harlow’s monkeys might be accomplished by a learner who implements a self-tuning form of reinforcement with punishment.Footnote 11

4 The first model

Consider an agent who learns by reinforcement with punishment over a sequence of learning problems both what to attend to and how to act. One might picture how she learns by considering a set of urns from which she might draw balls to determine her actions and add or remove balls to update her dispositions.Footnote 12 All draws from urns are random and without bias. We will start with a description of how learning occurs within a problem, then discuss how learning evolves across problems.

Each learning problem consists of a sequence of trials in which the agent chooses one of two objects \(O_1\) or \(O_2\). At the beginning of a problem, one of these objects is randomly selected as the reward object and remains the reward object for each trial of the problem. On problem n, the reward for success is \(i_n\) and the punishment for failure is \(j_n\) on each trial. The positions of the objects are randomly determined between trials.

At the beginning of a trial, the agent draws from a salience urn containing Q balls and P balls as in Fig. 2. Before the first problem, this urn contains one ball of each type. The result of the draw determines which type of stimulus the agent attends to in determining her action.

Fig. 2 The urn model

If the agent draws a Q ball from the salience urn, then she chooses an object to select on the trial by a draw from her quality urn. Before the first trial of each problem this urn contains one \(O_1\) ball and one \(O_2\) ball. If the agent chooses the reward object, then she is successful, and she returns the balls she drew from the salience urn and the quality urn and adds \(i_n\) new balls of the same type to each. If she selects the non-reward object, then she is unsuccessful, and as long as doing so will not drive the weight associated with the relevant type below a small \(l>0\), she returns the balls she drew to the urns from which she drew them, then removes \(j_n\) many balls of the corresponding type from each of the two urns. If removing \(j_n\) balls would drive the associated weight below l, then she sets the weight associated with that type to l. The weights associated with each disposition are thus bounded from below by l. This prevents initially possible strategies from being completely eliminated.Footnote 13

The process is analogous if a P ball is drawn from the salience urn. In this case, the agent determines which object to select on the trial by a draw from her position urn. Before the first trial of each problem, this urn contains one R ball and one L ball. If an R is drawn, the agent selects the object on the right; and if an L is drawn, she selects the object on the left. Reinforcement and punishment on the trial work the same way as they do on a quality draw.
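
The within-trial dynamics just described can be sketched in Python as follows. The names and the floor value are illustrative; the sketch keeps separate reinforcement and punishment magnitudes for act-level and salience-level learning, since, as noted below, only the act-level magnitudes evolve between trials.

```python
import random

FLOOR = 0.01  # the lower bound l on any urn weight (illustrative value)

def weighted_draw(urn):
    """Draw a key from a dict of weights, randomly and without bias beyond the weights."""
    keys = list(urn)
    return random.choices(keys, weights=[urn[k] for k in keys])[0]

def run_trial(salience_urn, quality_urn, position_urn, reward_object, layout,
              i_act, j_act, i_sal, j_sal):
    """One trial of the first model (illustrative sketch).

    layout maps 'L'/'R' to the object placed on that side, e.g. {'L': 'O1', 'R': 'O2'}.
    Returns True on a successful trial, False otherwise.
    """
    salience = weighted_draw(salience_urn)       # attend to quality ('Q') or position ('P')
    if salience == 'Q':
        act_urn = quality_urn
        act = weighted_draw(act_urn)             # the act is an object label
        chosen_object = act
    else:
        act_urn = position_urn
        act = weighted_draw(act_urn)             # the act is a side, 'L' or 'R'
        chosen_object = layout[act]

    success = (chosen_object == reward_object)
    if success:
        # Reinforce the drawn salience and the drawn act.
        salience_urn[salience] += i_sal
        act_urn[act] += i_act
    else:
        # Punish both, never driving a weight below the floor l.
        salience_urn[salience] = max(salience_urn[salience] - j_sal, FLOOR)
        act_urn[act] = max(act_urn[act] - j_act, FLOOR)
    return success
```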

An agent also adjusts how she learns between trials by updating the magnitude by which she reinforces on success \(i_n\) and punishes on failure \(j_n\). Specifically, we will suppose that \(+i_n\) reinforcement with \(-j_n\) punishment evolves by the following recursive rule:

$$\begin{aligned} i_{n+1}&= \alpha i_n + \beta \\ j_{n+1}&= \alpha j_n + \beta \end{aligned}$$

where \(\alpha > 0\) and \(\beta \ge 0\) are constant over the full multi-problem experiment. While a more complex model would allow for different scale and shift parameters for reinforcement and punishment, we will suppose that the two parameters are the same in both contexts. We will also suppose that the magnitudes of reinforcements and punishments for act-level learning change from trial to trial according to the recursive rule, while the magnitudes of the reinforcements and punishments are constant for the learning of saliences.
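
For reference, iterating the recursive rule yields a simple closed form (a routine calculation, not anything drawn from Harlow’s data or our simulations):

$$\begin{aligned} i_n = \left\{ \begin{array}{ll} i_1 + (n-1)\beta & \text {if } \alpha = 1, \\ \alpha ^{n-1} i_1 + \beta \,\frac{\alpha ^{n-1}-1}{\alpha -1} & \text {if } \alpha \ne 1, \end{array} \right. \end{aligned}$$

and likewise for \(j_n\). With \(\alpha = 1\) the magnitudes grow linearly in the number of updates, while with \(\beta = 0\) they grow geometrically.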

An agent’s salience urn is not reset between problems. This is so she might learn whether the series of problems she faces involves quality or position. In contrast, her object and position urns are reset at the beginning of each new problem. This corresponds to the appearance of a new set of objects for which the agent needs to evolve effective dispositions.
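
Putting these pieces together, a full run of the first model might look as follows, reusing run_trial from the earlier sketch. The problem schedule matches Harlow’s design; the choices to apply the recursive rule on every trial and to hold the salience-level magnitudes at their initial values reflect one reading of the description above and are flagged as assumptions in the comments.

```python
import random

def run_experiment(problem_lengths, i1=1.0, j1=0.25, alpha=1.0, beta=0.0004):
    """Simulate one agent over a sequence of discrimination problems (illustrative sketch)."""
    salience_urn = {'Q': 1.0, 'P': 1.0}   # not reset between problems
    i_act, j_act = i1, j1                 # act-level magnitudes: heat up trial by trial
    i_sal, j_sal = i1, j1                 # salience-level magnitudes: held constant (an assumption)
    results = []
    for n_trials in problem_lengths:
        # A new problem brings a new pair of objects, so these urns are reset.
        quality_urn = {'O1': 1.0, 'O2': 1.0}
        position_urn = {'L': 1.0, 'R': 1.0}
        reward_object = random.choice(['O1', 'O2'])
        outcomes = []
        for _ in range(n_trials):
            left = random.choice(['O1', 'O2'])   # objects placed at random each trial
            layout = {'L': left, 'R': 'O2' if left == 'O1' else 'O1'}
            outcomes.append(run_trial(salience_urn, quality_urn, position_urn,
                                      reward_object, layout, i_act, j_act, i_sal, j_sal))
            # The recursive rule: act-level magnitudes heat up between trials (assumed per-trial).
            i_act, j_act = alpha * i_act + beta, alpha * j_act + beta
        results.append(outcomes)
    return results

# Harlow's schedule: 32 fifty-trial problems, 200 six-trial problems, 112 nine-trial problems.
schedule = [50] * 32 + [6] * 200 + [9] * 112
```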

While we are interested in the quantitative fit with Harlow’s data, our primary concern is the basic structure of the model. We will start with a particularly simple set of parameters then discuss other settings that also work well.

Harlow’s experiments do not allow us to determine precisely how the monkeys learn how to learn, but they do illustrate the sort of heating up that the higher-order dynamics must accomplish. The recursive rule is designed to capture this aspect of their metalearning. Later, we will consider an explicit model designed to capture this higher-order learning.

Monotonically increasing act-level reinforcements and punishments represent a monkey’s sharpening sense of the type of learning problem it faces in Harlow’s first experiment. If the agent tries an object and succeeds, then she will reinforce more on that object than she would have in earlier problems in the degree to which she has learned that when an object works in a problem, then it will work again if she tries it again. Similarly, if the agent tries an object and fails, then she will punish more on that object than she would have in earlier problems in the degree to which she has learned that when an object doesn’t work in a problem, then it will still not work if she tries it again.

5 First model: results

Following Harlow’s experimental design, a single run of the model consists in a series of 344 problems. The first 32 problems involve 50 trials each, followed by 200 six-trial problems and 112 nine-trial problems.

The following parameters for a single simulated agent provide a close qualitative fit with Harlow’s experimental data for the mean aggregate behavior of his eight monkeys:

$$\begin{aligned} i_1&= 1 \\ j_1&= 0.25 \\ \alpha&= 1 \\ \beta&= 0.0004 \end{aligned}$$

Since \(\alpha =1\), this transformation just additively shifts the reinforcements and punishments with no rescaling between trials. And since \(\beta =0.0004\), the difference in learning dispositions between contiguous trials is small.

These parameters generate a sequence of learning curves that capture the steepening pattern across problems that Harlow reports in his experiment. This is shown in Fig. 3, which compares Harlow’s experimental data with the simulation data from the treatment in which act-level and salience learning coevolve.Footnote 14

Although we are primarily interested in the basic structure of the model and the qualitative steepening pattern reflecting the gradual transition from simple reinforcement to win-stay/lose-shift, it is worth noting that the quantitative fit with the experimental data is close. The model thus offers not only an account of how a reinforcement learner might in principle come to implement win-stay/lose-shift but also a close approximation of the aggregate learning data from a particular case of such learning to learn. That said, the closeness of the fit varies somewhat across problem blocks.

Fig. 3 Harlow’s experimental data compared to simulation data from our model on parameters \(i_1=1\), \(j_1=0.25\), \(\alpha =1\), \(\beta =0.0004\)

Figure 4 reports the mean absolute difference between the percent correct responses in Harlow’s experiment and in the simulated treatment using the parameter settings above. Only trials 2–6 are counted in each problem. Starting with trial 2, the monkeys have the chance to shift on object choice if their first guess was incorrect, and we only use data up to trial 6 as that is all Harlow reports. Averages are taken over all of the trials in each block of problems.

Fig. 4 Mean absolute difference between the percent correct responses in Harlow’s experiment (Fig. 1) and in the simulations. Only trials 2–6 are counted in each problem, and the average is taken over all of the trials in each block of problems

As indicated in Fig. 4, the worst match between simulation results and the experimental data is in the first and last problem blocks, 1–8 and 289–344, but even here the difference between the predicted success rate and the experimental data is never greater than 6%. Given the relatively low resolution of the experimental data itself, this is a very close quantitative fit.

The upshot is that the additive shifting of reinforcements and punishments by a constant between trials provides an account of how the monkeys learn how to learn that fits well with the experimental data. As they update how they reinforce and punish, they gradually come to act as win-stay/lose-shift learners. And they learn to pay attention to objects, not locations, in the context of this type of problem. In this, the simulated agents behave just as the rhesus monkeys do in aggregate in Harlow’s experiment.

The model also does surprisingly well under quite different parameter settings. Instead of increasing reinforcements and punishments by means of iterated additive shifts (\(\alpha =1\), \(\beta >0\)), one might rescale reinforcements and punishments after each trial (\(\alpha >1\), \(\beta =0\)). Starting with the same initial values for \(i_1\) and \(j_1\) as above, a pure rescaling of \(\alpha =1.0005\) and \(\beta =0\) delivers a qualitative overall fit approximately as good as the pure additive shift of \(\alpha =1\) and \(\beta =0.0004\), and it does somewhat better on the final problem block than the additive shift. As with the original parameters, the mean absolute difference between the model’s learning on the rescaling parameters and that of Harlow’s monkeys for trials 2–6 never exceeds 6% for any problem block. Unsurprisingly, there are also parameter settings that involve both an additive shift and rescaling that provide a good match with Harlow’s data.

The robustness of the model under different parameter settings means that the model gets the basic structure of the monkeys’ higher-order learning right. In particular, increasing levels of reward and punishment capture the aggregate shift in the dispositions of the monkeys as they evolve from reinforcement learners to win-stay/lose shift learners.

That different parameter settings work similarly well, however, also represents a significant limitation regarding what one can infer from Harlow’s experiments. A higher-order learning dynamics that works by simple reinforcement (additive shift) is different in kind from one that works by multiplicative reinforcement (rescaling).
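
To give a rough sense of the difference, under the reading on which the recursive rule is applied on every trial and carried across problems, the full experiment involves \(32\cdot 50 + 200\cdot 6 + 112\cdot 9 = 3808\) act-level updates. The pure additive shift (\(\alpha =1\), \(\beta =0.0004\)) then takes the reinforcement magnitude from \(i_1 = 1\) to roughly \(1 + 3808\cdot 0.0004 \approx 2.5\) by the end of the experiment, whereas the pure rescaling (\(\alpha =1.0005\), \(\beta =0\)) takes it to roughly \(1.0005^{3808} \approx 6.7\). The two settings thus trace quite different magnitude trajectories even though they produce similarly good aggregate fits.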

There are two further things to note regarding the present model. The first concerns the evolution of salience. The second concerns the form of reinforcement learning required to capture the behavior of the monkeys.

To this point, we have only considered how an agent might learn that object quality is salient to learning within a problem, but the story is much the same for location. In subsequent experiments, Harlow tried always placing the reward in the same position rather than under the same object in a problem. The monkeys were able to learn how to learn in the context of such problems by win-stay/lose-shift on position just as they had on object quality. The present model also captures this behavior. Since quality and position are symmetric, if the reward is always put in the same position, a simulated agent gradually learns how to learn by win-stay/lose-shift on position rather than object quality, agreeing well with the aggregate behavior of Harlow’s monkeys.

The second thing to note is that punishment is an essential feature of the present model. While there are parameter settings that allow for the emergence of something roughly akin to win-stay/lose-shift learning without punishment, one cannot get a good match with Harlow’s aggregate data without punishment. The reason is straightforward. Since the object urns are reset between problems, the expected success rate in trial 2 within a problem for a simple reinforcement learner without punishment but with optimal saliences is bounded from above by 0.75.Footnote 15 Hence no level of positive reinforcement alone can generate a success rate of 0.97 in trial 2, as observed in the final problems of Harlow’s first experiment.
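
The reasoning behind the bound can be made explicit with a short calculation (a sketch of the argument, assuming the learner attends to object quality with probability one). On trial 1 the reset object urn gives each object weight 1, so the learner succeeds with probability 1/2. If she succeeds, reinforcement adds \(i\) to the reward object’s weight, so she succeeds on trial 2 with probability \((1+i)/(2+i) < 1\). If she fails, then without punishment the urn is unchanged and she succeeds on trial 2 with probability 1/2. Hence

$$\begin{aligned} \Pr (\text {success on trial 2}) = \frac{1}{2}\cdot \frac{1+i}{2+i} + \frac{1}{2}\cdot \frac{1}{2} < \frac{1}{2}\cdot 1 + \frac{1}{2}\cdot \frac{1}{2} = 0.75. \end{aligned}$$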

6 The second model

The first model closely approximates the behavior of Harlow’s monkeys as their learning evolves in this type of discrimination problem. That said, there is good reason to hesitate in taking the model as illustrating how they learn how to learn. The modeled agent’s salience- and act-level learning evolves by iterated transformations of reinforcement and punishment values, but those transformations occur automatically between problems independently of the agent’s experience. In a genuine learning process, one should expect an agent’s dispositions to change over time in response to the specific content of her experience.

In this section, we consider a higher-order learning process that would lead an agent to gradually shift from implementing slower reinforcement learning to fast win-stay/lose-shift-like learning in Harlow-style discrimination problems. The meta dynamics that describes the evolution of an agent’s first-order learning parameters is a variety of reinforcement learning in which the agent reinforces and punishes those first-order parameters based on her experience. Before describing the model in detail, it will be helpful to consider the motivation.

To begin, note that for any fixed initial assignment of propensities over first-order acts, the higher a reinforcement learner’s level of reinforcement, the more likely she is to repeat successful actions; and the lower her level of reinforcement, the less likely she is to repeat successful actions. Thus, increasing the level of reinforcement can be thought of as reinforcing the disposition to stay with actions which just led to success, and decreasing the level of reinforcement can be thought of as punishing the disposition to stay with actions which just led to success. Similarly, for any fixed initial assignment of propensities over first-order acts, the lower the learner’s level of punishment, the more likely she is to repeat unsuccessful actions; and the higher the level of punishment, the less likely she is to repeat unsuccessful actions. So, decreasing the level of punishment can be thought of as punishing the disposition to switch from actions which just led to failure, and increasing the level of punishment can be thought of as reinforcing the disposition to switch from actions which just led to failure.

These observations provide the basis for the second model’s higher-order reinforcement dynamics. Suppose that an agent performs two actions \(A_1\) and \(A_2\) of precisely the same type under identical conditions at contiguous times \(n_1\) and \(n_2\). If \(A_1\) was successful and \(A_2\) was successful, then the agent would want to reinforce in the future more strongly than she did since the action succeeded when it was repeated. In contrast, if \(A_1\) was successful and \(A_2\) was unsuccessful, she would want to reinforce in the future less strongly than she did, since the action failed when it was repeated. Similarly, if \(A_1\) was unsuccessful and \(A_2\) was successful, then the agent would want to punish in the future less strongly than she did since the action succeeded when it was repeated. And if \(A_1\) was unsuccessful and \(A_2\) was unsuccessful, she would want to punish in the future more strongly than she did, since the action failed when it was repeated.

Consider an agent facing a series of m many k-trial discrimination problems, formalized precisely as in the first model. As in the first model, suppose the learner’s saliences and act-level dispositions evolve by reinforcement with punishment, as described by the urn model in Fig. 2. Let \(i_n\) represent the level of reinforcement for salience and act learning at time n, and let \(j_n\) represent the corresponding level of punishment at n, where timesteps are cumulative across problems. Let s(n) denote the salient dimension at timestep n, i.e., the feature of the stimulus objects the learner attended to at n, and let \(I_s(n,n+1)\) be a function whose value is 1 if \(s(n)=s(n+1)\) and 0 otherwise. Let o(n) denote the outcome of the trial at timestep n, where \(o(n)=0\) if the trial was unsuccessful and \(o(n)=1\) if the trial was successful. Let a(n) denote the act chosen at n, and let \(I_a(n,n+1)\) be a function whose value is 1 if \(a(n)=a(n+1)\) and 0 otherwise. \(\gamma >1\) and \(\lambda <1\) are constants. Figure 5 describes precisely how reinforcement and punishment dispositions evolve, along with qualitative descriptions relating the formal characterization to the interpretation in terms of reinforcing and punishing the stay-with and switch-from dispositions described above.
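
The four repeated-act cases just described, together with the rule that a shift in salience leaves the levels unchanged, can be collected into a small update function. This is only a sketch of the rows of Fig. 5 that the text spells out explicitly; the rows covering trials on which the act changed are left untouched here. The \(\gamma \) and \(\lambda \) values are those reported in Sect. 7.

```python
GAMMA = 1.02   # gamma > 1, used to scale a level up
LAM = 0.98     # lambda < 1, used to scale a level down

def update_levels(i_n, j_n, prev, curr):
    """Higher-order update of the reinforcement level i_n and punishment level j_n.

    prev and curr are (salience, act, outcome) triples for two contiguous trials,
    with outcome 1 for success and 0 for failure.
    """
    (s0, a0, o0), (s1, a1, o1) = prev, curr
    if s0 != s1:
        return i_n, j_n          # salience shifted: levels remain unchanged
    if a0 == a1:                 # the same act was repeated
        if o0 == 1 and o1 == 1:
            i_n *= GAMMA         # staying with a success paid off: reinforce more strongly
        elif o0 == 1 and o1 == 0:
            i_n *= LAM           # staying with a success failed: reinforce less strongly
        elif o0 == 0 and o1 == 1:
            j_n *= LAM           # repeating after a failure succeeded: punish less strongly
        else:
            j_n *= GAMMA         # repeating after a failure failed again: punish more strongly
    # Cases in which the act changed between the trials are specified in Fig. 5
    # and are not reproduced in this sketch.
    return i_n, j_n
```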

Fig. 5 The higher-order learning dynamics (the last row indicates that on any trial in which the agent’s salience shifted from the previous trial, reinforcement and punishment levels remain unchanged)

In the present model \(\gamma \) and \(\lambda \) are constants by which the learner’s reinforcement and punishment levels may be rescaled as she learns to learn from trial to trial. This involves several modifications of the original model. Most importantly, higher-order reinforcement in the present model is not automatic; rather, first-order levels of reinforcement and punishment are only modified as a result of higher-order learning on the basis of the agent’s experience. In considering results from the first model, we focused on the case in which reinforcement and punishment levels are updated by additive translations; in the second model, higher-order reinforcement is accomplished by rescaling, as this provides a natural way of avoiding the possibility of negative first-order reinforcement or punishment levels. And while the first model modifies reinforcement and punishment in lockstep, first-order reinforcement and punishment levels are never modified simultaneously in the present model. Rather, they respond independently to the learner’s experience.

Another feature of the present model is that reinforcement and punishment levels remain unchanged whenever the agent’s saliences shift between contiguous trials (i.e., whenever \(I_s(n,n+1)=0\)). The thought is that which dimension of the stimuli is salient to the agent in a given trial determines a framing of the choice problem at hand and that choices made in trials with different frames are not comparable. In particular, it is not meaningful to treat the sequence of choices made in two contiguous trials in which the learner switched saliences as an instance of staying with or switching from a given strategy.

An example may be helpful. Consider a Harlow discrimination problem in which the two stimulus objects are a red bowl and a green cup. Suppose that object quality is salient to the learner in trial n. She therefore frames the problem at hand in terms of choosing the right kind of object. On trial \(n+1\) her salience changes. Now, she attends to position, and thus sees the problem as requiring her to choose the correct location.

The example illustrates how a shift in the salience a learner uses marks a change in how she frames her options. In trial n, she faces a choice between different types of objects. In trial \(n+1\), she faces a choice between different locations. Suppose she first (at n) chooses the green cup, and then (at \(n+1\)) chooses the right-hand position, which happens to be occupied by the green cup. Of course, from an outside perspective, the same object was selected on both trials. But from the learner’s perspective, the acts “choose the right-hand position” and “choose the green cup” are not comparable in the way they would have to be in order for talk of the learner staying with the same strategy or switching to a new strategy between n and \(n+1\) to be meaningful. As a result, we suppose that the agent does not update her dispositions to stay with or switch away from successful or unsuccessful actions after two-trial sequences in which her saliences change between trials.

7 Second model: results

To investigate whether the second model can replicate the desired shift from gradual reinforcement learning to win-stay/lose-shift, we ran a series of 1000 computer simulations, each consisting of 1000 ten-trial Harlow discrimination problems. Propensities for acts and saliences were bounded from below at 0.01. All initial base-level propensities were set to 1. Initial reinforcement and punishment levels were set to \(i_0=1\) and \(j_0=0.25\), as in the earlier model. The meta-parameters \(\gamma \) and \(\lambda \) were set to 1.02 and 0.98, respectively.

In simulation, the modeled agent reliably learned that success in the type of problem she was given required her to use large reinforcement and punishment levels, which led to dispositions approximating win-stay/lose-shift. The mean cumulative success rate over all 1000 problems, averaged across the 1000 runs, was 0.91. Restricted to the last 100 problems on each run, the mean success rate across all runs was 0.929. This is close to the optimal expected success rate of 0.95 for a true win-stay/lose-shift learner in this problem, indicating that the agent typically successfully learns to attend to the task-relevant dimension and to very closely approximate win-stay/lose-shift.Footnote 16

The mean cumulative success rate on the last hundred problems was less than 0.9 on just 47 of the 1000 runs. On all but two of these runs, the agent had mistakenly learned to attend to the task-irrelevant dimension of the stimulus objects with high probability. On these runs, the agent performed approximately as well as chance, with mean success rates lying between 0.47 and 0.52.Footnote 17 It is unsurprising that the agent’s performance was close to chance in this case as the task-irrelevant dimension is uncorrelated with the location of the reward.Footnote 18

8 Conclusion

It is natural to understand learning by Humean custom as learning by means of a form of reinforcement with punishment. But for custom to provide a compelling account of natural learning, one also needs to explain how an agent might start as a reinforcement learner, then learn how to learn in a manner well suited to a particular type of problem. Here we consider one way this might work in the context of a famous type of discrimination problem.

Harlow showed that his monkeys were able to learn how to learn by win-stay/lose-shift, a learning dynamics that is better suited to the type of problem they face than their default reinforcement learning. And he showed that they were able to learn how to apply this new form of learning in a context-specific way by co-learning the saliences relevant to that type of problem. The first model illustrates how a simple reinforcement learner might learn saliences appropriate to the type of discrimination problem Harlow describes while gradually shifting to learning by means of win-stay/lose-shift. Building on this, the second model shows how a more subtle sort of reinforcement learner, one equipped with a higher-order dynamics that allows her to reinforce and punish the magnitudes of first-order reinforcements and punishments, might learn how to learn more effectively in a Harlow-style problem as she learns.

The learning dynamics in the second model is self-tuning. As an agent uses it, the higher-order dynamics provides a way for her to learn how to adjust her first-order learning to make it more effective given how well it is performing in the task at hand. We have shown that this form of reinforcement is highly effective in the context of Harlow-type problems.Footnote 19

Hume was right to believe that humans and other animals very often learn by a form of reinforcement. Here we have described a type of reinforcement learner who can learn how to learn in a way that is well suited to a type of problem for which simple reinforcement is not at all well suited. While Hume did not consider this type of self-tuning reinforcement, it is compatible with his insistence that we learn by means of custom. Here custom itself provides a mechanism for an agent to better learn by custom.