Pigeon and human performance in a multi-armed bandit task in response to changes in variable interval schedules
The tension between exploitation of the best options and exploration of alternatives is a ubiquitous problem that all organisms face. To examine this trade-off across species, pigeons and people were trained on an eight-armed bandit task in which the options were rewarded on a variable interval (VI) schedule. At regular intervals, each option’s VI changed, thus encouraging dynamic increases in exploration in response to these anticipated changes. Both species showed sensitivity to the payoffs that was often well modeled by Luce’s (1963) decision rule. For pigeons, exploration of alternative options was driven by experienced changes in the payoff schedules, not the beginning of a new session, even though each session signaled a new schedule. In contrast, people quickly learned to explore in response to signaled changes in the payoffs.
KeywordsPigeon Human learning Associative learning Acquisition
Direct interaction with the environment provides much of the information that informs subsequent actions. Rarely is choice made in the presence of perfect knowledge. In a multitude of domains, organisms begin by choosing almost blindly; what is learned about the environment varies according to which of the possibilities are experienced. The world often fails to reveal information about the utility of options not chosen—the route not taken, the career not selected, the product not purchased (Taleb, 2007). In a complex environment where options are many and/or variable, complete knowledge of prevailing contingencies may require very long-term exploration. Even after long experience with the prevailing contingencies, continued exploration of options with less utility may be necessary in order to adapt to change.
Under similar conditions, what leads some choosers to exploit their knowledge of differential utility and others to explore their options? Continued exploration may be an adaptive behavior learned through experience with changing environments (Rakow & Miler, 2009; Stahlman, Roberts, & Blaisdell 2010; Stahlman, Young, & Blaisdell 2010), or it may be that imperfect knowledge maintains exploration so that responding to changing conditions is a side effect rather than an adaptation. A complete study of the trade-off between exploration and exploitation will require the use of choice environments in which more than two options are available (cf. Rakow & Miler, 2009). We examined this trade-off in the present project by investigating human and pigeon behavior in an eight-option task.
In addition to contending with the real-world complexity related to large numbers of options, most species live in changing environments. Although researchers in foraging behavior have investigated decision-making mainly through familiar, stationary environments, such that the individuals are fully informed about the nature of the options (e.g., Lin & Batzli, 2002; Zach, 1979), there is increasing interest in how such information is acquired (e.g., Mettke-Hofmann, Wink, Winkler, & Leisler 2004; Plowright & Shettleworth, 1990). The introduction of environmental changes has often been used to study how animals gather information about their environment.
We took an approach that was inspired by the study of reinforcement-learning algorithms as applied to machine learning (Koulouriotis & Xanthopoulos, 2008; Sutton & Barto, 1998). In its simplest form, reinforcement-learning analyses often use the multi-armed (or “n-armed”) bandit task to evaluate various methods of distributing exploration and exploitation (e.g., Dimitrakakis & Lagoudakis, 2008; Sikora, 2008). This task provides an excellent platform to explore choice in stationary (with unchanging payoffs) and nonstationary (with changing payoffs) environments, and it has also been applied to the domains of human learning and cognition (e.g., Burns, Lee, & Vickers 2006; Plowright & Shettleworth, 1990), economics (e.g., Banks, Olson, & Porter 1997), marketing and management (e.g., Azoulay-Schwartz, Kraus, & Wilkenfeld 2004; Valsecchi, 2003), and math and computer science (e.g., Auer, Cesa-Bianchi, Freund, & Schapire 1995; Koulouriotis & Xanthopoulos, 2008).
The multi-armed bandit task (MABT) usually involves choosing among multiple possible actions that lead to immediate reward and about which nothing is initially known. The MABT took its name from the “one-armed bandit,” another term for the slot machine. Rather than the one arm of a slot machine, however, a MABT has n options. It can be thought of as a set of n slot machines, each with an independent payoff schedule. After each selection, the reinforcer is awarded based on an underlying schedule of reinforcement. A player must explicitly explore an environment in order to learn the expected payoffs for these n options, and then can later exploit this knowledge. In a four-armed bandit task similar to the one used in the present study, Steyvers, Lee, and Wagenmakers (2009) employed a Bayesian optimal-decision model derived from the softmax equation (Luce, 1963) to explore how humans balance exploration with exploitation. In addition, eight-stimulus arrays very similar to the one used in the present study have been used with nonhuman animals (Jensen, Miller, & Neuringer 2006) and humans (Rothstein, Jensen, & Neuringer 2008), and in both cases behavior came under the control of the prevailing contingencies. Thus, this MABT provides a decision task that is potentially both complex and challenging, yet at the same time simple enough that it can be used to study a wide range of decision-making in both humans and other animals.
Exploration versus exploitation
An arm pull is an action, and at any point an actor is expected to rely on an estimate of action values based on the sampling history with each option. Choosing the action with the highest estimated action value (the “greedy” action) is exploitation, because the actor is exploiting its current knowledge. If the actor chooses a nongreedy action, it is exploring—a behavior that potentially enhances overall knowledge by improving the estimate of a nongreedy option. Greedy actions allow the actor to maximize its chance of immediate reward for the very next action, but nongreedy actions may be preferable, in order to maximize long-term reward or value (i.e., they actually are greedy, but over an extended time horizon).
Reward may be lower in the short term when exploring, but long-term value may be greater, since the actor may discover actions that are better than the current greedy action or that provide viable alternatives if the action with the long-run highest value is currently less profitable (due to molecular aspects of the payoff schedule in which an option’s value is temporarily lower; e.g., for VI schedules) or later becomes unprofitable (due to molar changes in the payoff schedule; e.g., changing from a variable ratio 5 to variable ratio 50). Whether exploration or exploitation is best at any given choice point will depend on the expected changes in these payoffs, inter alia. For a nonstationary bandit task, option values change during the task by changing the underlying molar contingencies—as if the room full of slot machines were reprogrammed occasionally during the allotted time of play. Continued exploration is critical if an organism is to track and adapt to these changes.
The machine-learning literature provides some guidance regarding methods for action selection appropriate to the bandit task. The greedy strategy may be used to solve stationary bandit problems, and it requires that every response be made to the option with the highest value (i.e., the richest reinforcement schedule). This strategy results in quick and complete preference for one option, which is precisely what should be avoided in a nonstationary environment.
People and pigeons are not Turing machines, and their estimates of action values may be imperfect. Regardless, these action values may be based simply on an overall history with each option, such as the proportion of total responses to that option that have been reinforced, or by some more complex calculation. For example, these estimates may be weighted to more recent experience or sensitive to the changes in reinforcement probability over time that are inherent in VI schedules. For this study, we assumed these action values to be equal to the overall programmed likelihood of reinforcement represented by the VI schedule for each option. Thus, we operationally defined exploration as choosing a response that has a lower molar reinforcement rate.
The present experiments examined both pigeon and human performance using a nonstationary MABT. Each species chose from among eight response options in order to provide a complex set of choices that would constrain the theoretical analysis. We were interested in testing three hypotheses. First, could Luce’s decision rule be used to assess the balance between exploitation and exploration for pigeons and humans in our choice task? Second, would both species adaptively and quickly increase their level of exploratory behavior in response to environmental cues that signal a change in choice payoffs? For pigeons, each daily session began with a new set of choice payoffs, and thus an adaptively optimal pigeon would begin each session with maximal exploration and be unaffected by the previous day’s programmed schedules. For people, a new session began every few minutes and was signaled by a discriminative cue at the top of the display that should prompt a sudden increase in exploration. Third, would exploration continue throughout a session, or would pigeons and people exhibit a higher level of exploitation later in the session, once differential choice value had been determined?
A total of 6 experimentally naïve adult White Carneaux pigeons (Columba livia) participated in the experiment. The pigeons were individually housed in steel home cages with metal wire mesh floors in a vivarium, and a 12-h light:dark cycle was maintained. Testing was conducted 5–7 days/week during the light cycle. The pigeons were maintained at approximately 85% of their free-feeding weights, and were given free access to grit and water while in their home cages.
Testing was conducted in a flat-black Plexiglas chamber (38 cm wide × 36 cm deep × 38 cm high). All stimuli were presented by computer on a color LCD monitor (NEC MultiSync LCD1550M) visible through a 23.2 × 30.5 cm viewing window in the middle of the front panel of the chamber. Pecks to the monitor were detected by an infrared touch screen (Carroll Touch, Elotouch Systems, Fremont, CA) mounted on the front panel. A 28-V houselight located in the ceiling of the box was used for illumination, except during time outs. A food hopper (Coulbourn Instruments, Allentown, PA) was located below the monitor with an access hole situated flush with the floor. All experimental events were controlled and data recorded by a Pentium III class computer (Dell, Austin, TX). A video card controlled the monitor using the SVGA graphics mode (800 × 600 pixels).
The 6 pigeons were first trained to eat from the hopper in the chamber. Next, responses were autoshaped to a white disk that appeared in the center of the screen. Pecking to the disk resulted in the hopper rising for 3 s before lowering again. This was followed by a 60-s intertrial interval (ITI) before the next disk was displayed. Once the pigeon was consistently responding to the disk, training began.
Initial training consisted of assigning random ratios (RRs) to the disks, using the following probabilities of each peck being rewarded: .61, .37, .22, .14, .08, .05, .03, and .02. After 60 sessions of training, it became clear that the pigeons were showing strong biases to disks located in particular positions and were not pecking to disks located in other positions, even if they had the highest reward value.
One possibility was that pecking to the disks was relatively cheap to the pigeons, so the difference in reward structure was not very tangible. Another factor was that pigeons tend toward maximization (i.e., high exploitation) on RR schedules by showing nearly exclusive responding for the option with the richest experienced payoff structure (Herrnstein & Loveland, 1975). To increase sensitivity to reward and to encourage exploration by temporarily decreasing the reward value of a disk, the reward structure was changed from a random ratio to a variable-interval schedule. The variable intervals used were 3, 6, 12, 24, 48, 96, 192, and 384 s and varied by up to ±50% of the scheduled interval (e.g., for VI 3, the interval varied between 1.5 and 4.5 s). After another 60 sessions, it was clear that the pigeons were still showing strong biases to disks located in particular positions. Shifting the color assignments revealed that the bias was based on location and not color.
The pigeons completed 5 sessions in which only one disk from the display was shown, and the disk had a .61 probability of reward. In this situation, the pigeons did reliably peck to the disk, regardless of its color or position.
The pigeons then completed 40 sessions with all eight disks present, one of which had a .61 probability of reward and seven of which had no reward. The pigeons still showed a strong bias to particular disk locations, even if the locations were not associated with reward in a given session.
A final attempt to equalize the perceived reward value of the disks and encourage exploration involved presenting the pigeons again with all eight disks for 24 sessions. However, the reward schedule was made more extreme, with VIs of 3, 9, 27, 81, 243, 729, 2,187, and 6,561 s (with experienced intervals again varying up to ±50% of the scheduled interval). In addition, at the end of the 24 sessions, the disk that was most pecked was eliminated. For the subsequent 24 sessions, only the remaining seven disks were present, with the longest reward interval was not assigned to a disk. At the end of this set of 24 sessions, the most pecked disk was again eliminated along with the longest reward interval still being used. This procedure progressed until the pigeons were given 24 sessions with only the three least-pecked (by location) disks remaining. To keep the pigeons at 85% of free-feeding weight, a session was terminated once 300 rewards had been received during the session.
For the test sessions, the pigeons were presented with all eight disks for 24 sessions with VIs of 3, 9, 27, 81, 243, 729, 2,187, and 6,561 s. Assignment of VI schedule to the disks varied daily. Only the data from this final set of testing sessions were analyzed.
To analyze the data, we used two approaches. First, we will describe the frequency with which each disk was chosen as a function of its programmed payoff. This approach will provide a general assessment of the degree of control established by the reward structure. Second, we will provide an analytical assessment of the pigeons’ exploratory behavior using Luce’s decision rule (Luce, 1963).
From a reinforcement-learning perspective, low θ values indicate that a chooser either has not learned the differential payoffs or has maintained high exploration despite the differential payoffs. However, a sudden decrease in θ (when responding is not a function of previous disk value) indicates that a chooser has recognized that the payoffs have changed, thus prompting an increase in exploratory behavior.
The complicating factor in our analysis is that the programmed contingencies may not have been experienced equally by every organism. A pigeon may have undersampled a particular choice and thus obtained a biased estimate of its payoff. Pigeons frequently showed disk biases and failed to fully explore each of the options. Thus, in our second set of choice analyses for pigeons, we used disk location as an independent predictor of the best-fitting θ values and predicted lower θs (i.e., poor response differentiation as a function of payoff value) for less-preferred disks.
To fit Luce’s decision rule to behavior, we used nonlinear mixed-effects modeling and identified the maximum likelihood best-fitting parameter values (Cudeck & Harring, 2007; Davidian & Giltinan, 2003). Mixed-effects modeling is used to simultaneously generate estimates of parameter estimates for each subject and as a function of the independent variables (e.g., Laird & Ware, 1982; Pinheiro & Bates, 2004). This approach is superior to the two-stage approach, in which parameter estimates are derived independently for each subject and the estimates are used in a subsequent analysis, because the results of the first stage do not include information about uncertainty in the parameter estimates that are used in the second stage (Shkedy, Straetemans, & Molenberghs 2005).
Choice differentiation as a function of programmed payoffs
As a baseline of comparison, we initially ignored these disk biases and identified the best-fitting θ for Eq. 1 [using log (1/VI) as a proxy for payoff rate] as a function of 5-min trial block (1–12). The analysis revealed that the degree of response differentiation, θ, varied as a function of block, F(11, 8975) = 5.71, p < .0001, BIC = −4711, R2 = .40. The maximum likelihood value of θ was .10 in Block 1, reached .32 by Block 3, peaked at .34 in Block 6, and steadily decreased toward .22 in Block 12. Thus, the pigeons tended to quickly differentiate the better disks among the choice alternatives, but as the session progressed, their behavior became increasingly undifferentiated. Interestingly, this behavior was highly correlated with the number of pecks produced throughout the session: Pecking was highest during Blocks 2–4 and then gradually fell throughout the session. By Block 12, responding averaged 28% of the peak rate of responding. It appears that as the pigeons’ level of satiety increased, the motivation to differentiate among the payoff disks decreased, or the motivation to exploit abated.
The predicted disk choices for each pigeon are shown superimposed on Figure 3. Luce’s decision rule predicts that responding is a monotonic function of disk value, and thus the rule cannot account for the unusual data patterns observed in Cosmo when disk value was solely a function of programmed (not experienced) payoff. However, the other birds’ behavior was well approximated by Eq. 2.
Finally, we examined the degree to which disk value on a previous session lingered into the next session. In the first 5-min part of a session (Block 1), response likelihood was as much a function of a disk’s value on the previous session [t(6) = 4.33, p < .01] as of its value for the current session [t(6) = 4.07, p < .01]. Over the next five blocks, the effect of a disk’s previous value steadily decreased (ts of 2.52, 1.66, 1.28, and 0.51), whereas the effect of a disk’s current value was maintained (ts = 4.54, 3.74, 3.88, and 4.55).
Choice differentiation as a function of programmed payoffs and disk location
Because some pigeons were not showing sufficient exploration of all eight response disks, using the programmed payoff in fitting Luce’s decision rule is problematic. To incorporate the effect of disk location for individual birds, we assessed θ as a function of both trial block and disk location. The analysis revealed that the degree of response differentiation, θ, varied as a function of both block, F(11, 8968) = 3.74, p < .0001, and disk location, F(7, 8968) = 7.94, p < .0001, BIC = −5,048, R2 = .46. A model that included an interaction produced a poorer fit, BIC = −4,510, indicating that it was overparameterized, and thus the interaction was not included in our analysis.
In an eight-armed bandit task, pigeons’ disk choice was largely a function of the VI schedule associated with each disk. For 4 of the pigeons, their behavior was broadly consistent with that predicted by Luce’s decision rule as applied to the programmed reinforcement rate [log (1/VI)], thus suggesting that the derived θ values are good estimates of the degree of exploitation exhibited by the pigeons. Pigeons did not demonstrate high degrees of exploration at the beginning of a session that was cued by session onset, but rather their low θ values were a result of behavior being heavily influenced by carryover from the prior session’s disk values. Within 10 min, however, their responding was largely driven by the new reinforcement contingencies. Thus, increases in exploration were likely produced by adversity—only when preferred disks were no longer paying off at a high rate did the pigeons begin to explore other choices (see Gallistel, Mark, King, & Latham 2001, for an alternative interpretation of matching in nonstationary environments).
Our pigeons, which were working for primary reinforcers, showed less exploitation as a session progressed. This change could have been due to an anticipated change in disk payoffs, but the evidence suggests that exploitation decreased due to an increase in satiety. Regardless of this pattern, we did not see high degrees of exploitation at any point in a session. Averaged across every session and trial block, no pigeon chose its preferred disk more than 45% of the time (see Fig. 3). When these results were averaged across sessions but broken down by trial blocks, no pigeon chose its preferred disk more than 55% of the time (not shown). The pigeons were not adopting greedy strategies in our nonstationary environment. Despite our attempts to eliminate disk biases, the birds continued to show location preferences that were independent of a disk’s programmed reinforcement schedule. We attempted to incorporate these biases into our analysis as an independent factor that allowed less behavioral differentiation (lower θ values) for certain disk locations, but the fit was only marginally better. An alternative formulation that would retain Luce’s decision rule would be to incorporate disk location into our estimates of value, thus making a disk’s value a function of both its scheduled payoff and its location. Unfortunately, this approach would require a post hoc assessment of disk preferences for each bird.
In our second experiment, we used a similar design to examine exploration versus exploitation in humans. We anticipated rapid changes in θ and fewer location preferences that were independent of payoffs. The literature on risky choice and risk perception suggests that people might be well adapted to identifying and responding to changes in payoffs for decisions under uncertainty (for a discussion of various examples, see Rakow & Miler, 2009).
A total of 20 undergraduates (16 female, 4 male) at the University of California, Los Angeles (UCLA), received course credit for participating in the experiment.
Testing was conducted on a notebook computer with a 38-cm (diagonal) color monitor set at 1,152 × 864 pixels. Participants used a mouse to guide a cursor around a screen, and a response was recorded every time the left mouse button was clicked. A built-in speaker was used to give auditory feedback when a reward was given.
Following the sample trial, participants completed six sessions. Each session was 6 min long. We used the same reward schedule that had been used with the pigeons: VIs of 3, 9, 27, 81, 243, 729, 2,187, and 6,561 s, with ±50% variation. The assignment of variable intervals to disks was constant within a session but was rearranged from session to session. The same rearrangement from session to session was used for each participant. Counters were provided at the top of the screen giving an indication of how many points had been collected in each session, and the appropriate counter was updated every time a point was collected.
At the conclusion of each session, the participants needed to click on a button (not shown in Fig. 8) to start the next session. At the end of the fifth session, they were asked to type into the computer answers to the questions “What do you think was happening during the task?” “What strategy did you use to earn points?” “Within (not between) a given session, how did the colored discs differ from each other?” and “Was there a difference from one session to another? If so, what was the difference?” Following this, they were asked to do the final, sixth session.
As a consequence of the use of a VI schedule, most people showed a temporary decrease in the likelihood of choosing a disk after it was rewarded. Figure 6 (right column) shows the individual smoothed likelihood splines for each participant for the three richest schedules, and the vast majority of participants developed an aversion to returning to a disk that was just rewarded; the likelihood of returning to it was a function of its VI schedule. Thus, due to the temporary decrease in the efficacy of a recently rewarded response, participants were being encouraged to explore by sampling other disks.
Finally, we examined the degree to which disk value on a previous session lingered into the next session. In the first 30 s of a session (Block 1), response likelihood was largely a function of a disk’s value for the current session [t(19) = 7.75, p < .01], but there was a small, nonsignificant effect of the disk’s value from the previous session [t(19) = 1.68, p = .11]. Over the next four blocks, the effect of a disk’s previous value remained small (ts of 2.63, 0.78, 1.23, and 1.13) and was only significant in Block 2, whereas the effect of a disk’s current value increased and leveled off (ts = 10.47, 10.87, 11.95, and 11.89). By the final block, performance was entirely a function of a disk’s value for the current session [t(19) = 14.01, p < .01], with little effect of the disk’s value for the previous session [t(19) = 0.99, p = .32].
The strategy reports were largely uninformative. Six of the participants reported that points earned was somehow a function of time or delay (the correct controlling variable), 2 reported that points were a function of the number of times chosen, 1 reported a complex geometrical relationship, and the remaining participants’ reports were either vague or equivalent to reporting that they did not know. Sex, self-reported GPA, and self-reported strategy did not significantly predict the best-fitting value of θ, but our sample size was too small to identify all but the largest individual-difference effects (a prior study had found a weak negative correlation, r = −.09, between intelligence and exploratory behavior; Steyvers et al., 2009).
In our eight-armed bandit task, human disk choice was largely a function of the VI schedule associated with each disk. Behavior was generally consistent with that predicted by Luce’s decision rule as applied to the programmed reinforcement rate [log (1/VI)]. Exploration was high early in a session and was only weakly a function of a disk’s previous value. This lack of carryover, accompanied by a high degree of exploration in the first block of a session (see Fig. 10), likely occurred because the transition from session to session was clearly demarcated for the participants (Fig. 8 shows the highlighting of the current session at the top of the screen). Thus, our human participants showed an adaptive increase in exploration in the presence of a signal that indicated a change in disk payoffs, unlike the pigeons in Experiment 1. Finally, like the pigeons, our human participants did not demonstrate a greedy strategy (see Fig. 9). Instead, they continued to explore other alternatives late in a session.
Both pigeons and people produced response patterns that were often well modeled by Luce’s (1963) decision rule. Although there were some exceptions (most notably the pigeon Cosmo in Exp. 1), these deviations may have been driven by differences in the programmed and experienced disk payoffs or by idiosyncratic strategies that we have not assessed. Additionally, neither species demonstrated greedy strategies in the nonstationary environments used in the present study. Whereas exclusive choice of the highest value disk would seem adaptive once a chooser has learned that disks only change their value across sessions, the use of a VI schedule likely contributed to higher exploration by producing a temporary decrease in the value of a disk (see Fig. 6). Given the clocked nature of a VI, a disk with a leaner schedule is more likely to be rewarded than a disk with a richer schedule if the lean disk has not been chosen in a long time. For example, consider the choice between a VI 3-s and a VI 9-s disk. If the VI 3-s disk was chosen 8 s into a session, it would have an average delay of 3 s until its next reward was available (i.e., 11 s into the session). By contrast, the VI 9-s disk would have an average delay of 1 s until its next reward was available (i.e., 9 s into the session). Thus, the adoption of an optimal fully informed strategy would cause a chooser to occasionally sample the leaner schedules as a function of the elapsed time since their last reinforcement. Both the pigeons’ and people’s behavior often demonstrated a temporary decrease in the likelihood of choosing a disk that was recently rewarded, along with a rapid increase soon after (Fig. 6). After a peak in likelihood, responding gradually fell, which is largely a result of responses to a disk eventually being rewarded, thus truncating the distribution.
The greatest species differences involved (a) strong disk biases in the pigeons but not in people and (b) the weak carryover of disk value across sessions for people but the strong carryover for pigeons. The strong disk biases were quite intransigent in our pigeons. Even after extensive attempts to train out these biases, the pigeons still underexplored certain responses (see Fig. 4). We believe that there are two significant contributors to these biases. First, the upper disks may have required substantial effort to reach, thus reducing their value due to a high response cost (cf. Jensen et al., 2006). Second, the pigeons may have been content to satisfice, such that there was insufficient motivation to maximize their reward rate. Given that the response rate gradually abated later in the session, satiation may have reduced the incentive to identify the disk with the highest value.
The second large species difference involved the fact that the pigeons’ behavior early in a session was heavily influenced by the disk values from the previous session, whereas people showed little session-to-session carryover of value. This result is even more remarkable given the extensive experience that the pigeons had with daily changes during training (309 sessions) and testing (24 sessions), an ample opportunity to learn that disk value did not (except in rare instances) carry over across sessions. In contrast, our human participants received only 6 min of training before disk payoffs changed and yet showed little value carryover. Thus, the pigeons increased exploration largely in response to an experienced change in payoff rates, whereas people increased exploration when a discriminative cue dictated.
The control over performance exerted by disk values from the prior session is striking when one considers that nonstationary procedures reveal strong constraints on the duration of working memory in the pigeon. Pigeon working memory has been found to last from tens of seconds, in delayed matching-to-sample procedures (e.g., Grant, 1976; White, Ruske, & Colombo 1996), to no more than 1 or 2 h, on open-field spatial search tasks (Spetch, 1990; Spetch & Honig, 1988). This stands in stark contrast to retention of correct responses in stationary procedures, which have been shown to last for months or years (e.g., Cook, Levison, Gillett, & Blaisdell 2005; Vaughan & Greene, 1984). Above-chance retention of disk values over a 24-h interval after only a single session of exposure has previously been reported in two-choice situations (e.g., Grace & McLean, 2006; Kyonka & Grace, 2008; Schofield & Davison, 1997). These studies involving between-session changes in reinforcement schedules reveal some lasting influence of the prior session’s reinforcement contingencies at the beginning of the next session. To our knowledge, however, ours are the first results showing similar carryover effects on schedules involving more than two choice options. This suggests that pigeons acquired some memory for the distribution of values across multiple choice options from a single session, the influence of which persisted in the following session. We can only speculate that our task contained features that better tap into processes of long-term memory than have previous working memory procedures.
Although our human participants showed adaptive increases in exploratory behavior at the beginning of a session, session onset was clearly signaled. It is not known how quickly people would increase their exploratory behavior if change was not signaled. Without a signaled change in schedule, any increase in exploration would likely be a function of the magnitude of the change in disk value and of which disks (e.g., those of previously high or low value) changed their value. If a low-value, and thus undersampled, disk suddenly became the richest option, a high exploiter would be slow to discover this change. In contrast, if a high-value, and thus heavily sampled, disk suddenly decreased in value (which was typically the case in the present experiments), this change would be apparent to both high and low exploiters.
People’s sudden increase in exploratory behavior at the onset of each session suggests a level of operant control that goes beyond merely responding to changes in the payoffs of the operanda. One possibility is that this result provides further evidence of behavioral variability as an operant (Neuringer, 2002; Page & Neuringer, 1985), but the rapidity with which our human participants responded suggests insufficient time for variability to have been reinforced during the confines of our experiment. Thus, people previously must have learned the utility of exploration in the face of a rapidly changing environment. Pigeons, on the other hand, may be better adapted to more stable environments that reward perseveration over flexibility.
Although an actor always faces uncertainty about the utility of future actions, the randomness of events underlying this uncertainty extends from that conforming to well-understood linear-based Gaussian distributions to those best described by poorly understood nonlinear power laws (Taleb, 2007). It would be very interesting to understand how actors as diverse as humans and pigeons face action-making decisions in these vastly different types of stochastic contexts that characterize real-world situations. Given the importance of understanding choice and the common desire to optimize choice strategies in stationary and nonstationary environments, we hope that more researchers will consider spending less time exploiting the study of simple choice tasks with stationary payoffs, and instead allocate more effort toward exploring many-choice tasks in nonstationary environments (e.g., Davison & Baum, 2000; Ward & Odum, 2008).
- Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science (pp. 322–331). Piscataway, NJ: IEEE Press.Google Scholar
- Burns, N. R., Lee, M. D., & Vickers, D. (2006). Individual differences in problem solving and intelligence. Journal of Problem Solving, 1, 20–32.Google Scholar
- Dimitrakakis, C., & Lagoudakis, M. G. (2008). Rollout sampling approximate policy iteration. Machine Learning, 72, 157–171.Google Scholar
- Grant, D. S. (1976). Effect of sample presentation time on long-delay matching in pigeons. Learning and Motivation, 7, 580–590.Google Scholar
- Jensen, G., Miller, C., & Neuringer, A. (2006). Truly random operant responding: Results and reasons. In E. A. Wasserman & T. R. Zentall (Eds.), Comparative cognition: Experimental explorations of animal intelligence (pp. 459–480). New York: Oxford University Press.Google Scholar
- Koulouriotis, D. E., & Xanthopoulos, A. (2008). Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems. Applied Mathematics and Computation, 196, 913–922.Google Scholar
- Lin, Y. K., & Batzli, G. O. (2002). The cost of habitat selection in prairie voles: An empirical assessment using isodar analysis. Evolutionary Ecology, 16, 387–397.Google Scholar
- Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103–189). New York: Wiley.Google Scholar
- Mettke-Hofmann, C., Wink, M., Winkler, H., & Leisler, B. (2004). Exploration of environmental changes relates to lifestyle. Behavioral Ecology, 10, 2004. Google Scholar
- Pinheiro, J. C., & Bates, D. M. (2004). Mixed-effects models in S and S-PLUS. New York: Springer.Google Scholar
- Sikora, R. T. (2008). Meta-learning optimal parameter values in non-stationary environments. Knowledge Based Systems, 2(8), 800–806.Google Scholar
- Spetch, M. L. (1990). Further studies of pigeons’ spatial working memory in the open-field task. Animal Learning & Behavior, 18, 332–340.Google Scholar
- Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.Google Scholar
- Taleb, N. N. (2007). The black swan: The impact of the highly improbable. New York: Random House.Google Scholar
- Valsecchi, I. (2003). Job assignment and bandit problems. International Journal of Manpower, 24(7), 844–866.Google Scholar
- Zach, R. (1979). Shell dropping: Decision-making and optimal foraging in northwestern crows. Behaviour, 68, 106–117.Google Scholar