Abstract
The outcome of an action often occurs after a delay. One solution for learning appropriate actions from delayed outcomes is to rely on a chain of state transitions. Another solution, which does not rest on state transitions, is to use an eligibility trace (ET) that directly bridges a current outcome and multiple past actions via transient memories. Previous studies revealed that humans (Homo sapiens) learned appropriate actions in a behavioral task in which solutions based on the ET were effective but transition-based solutions were ineffective. This suggests that an ET may be involved in human learning. However, no studies have examined nonhuman animals with an equivalent behavioral task. We designed a task for nonhuman animals following a previous human study. In each trial, participants chose one of two stimuli that were randomly selected from three stimulus types: a stimulus associated with a food reward delivered immediately, a stimulus associated with a reward delivered after a few trials, and a stimulus associated with no reward. The presented stimuli did not vary according to the participants’ choices. To maximize the total reward, participants had to learn the value of the stimulus associated with a delayed reward. Five chimpanzees (Pan troglodytes) performed the task using a touchscreen. Two chimpanzees learned successfully, indicating that learning mechanisms that do not depend on state transitions were involved. The current study extends previous ET research by proposing a behavioral task and providing empirical data from chimpanzees.
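To illustrate the mechanism the abstract describes, the sketch below implements a generic eligibility-trace learner on a simplified version of the task. It is not the authors' model: the learning rate, trace decay, exploration rate, and a fixed two-trial reward delay are all assumptions chosen for illustration. Each trial, two of three stimuli are offered; choosing stimulus 0 pays immediately, stimulus 1 pays two trials later, and stimulus 2 never pays. The trace lets a reward arriving now credit stimuli chosen on recent trials, with no model of state transitions.

```python
import random

def run_task(n_trials=3000, alpha=0.1, decay=0.5, eps=0.1, seed=1):
    """Illustrative eligibility-trace learner (parameters are assumptions,
    not the published model). Stimulus 0: immediate reward; stimulus 1:
    reward delivered two trials later; stimulus 2: no reward."""
    rng = random.Random(seed)
    q = [0.0, 0.0, 0.0]       # learned value of each stimulus
    trace = [0.0, 0.0, 0.0]   # eligibility traces: transient choice memories
    pending = {}              # trial index -> delayed reward due on that trial
    for t in range(n_trials):
        pair = rng.sample(range(3), 2)            # two randomly drawn stimuli
        if rng.random() < eps:                    # epsilon-greedy choice
            choice = rng.choice(pair)
        else:
            choice = max(pair, key=lambda s: q[s])
        trace = [e * decay for e in trace]        # traces fade each trial
        trace[choice] = 1.0                       # replacing trace for the choice
        reward = pending.pop(t, 0.0)              # any delayed reward due now
        if choice == 0:
            reward += 1.0                         # immediate reward
        elif choice == 1:
            pending[t + 2] = pending.get(t + 2, 0.0) + 1.0  # delayed reward
        # credit the outcome to all recently chosen stimuli via their traces
        delta = reward - q[choice]
        for s in range(3):
            q[s] += alpha * delta * trace[s]
    return q
```

Because the delayed reward systematically arrives while stimulus 1's trace is still nonzero, its learned value rises above that of the never-rewarded stimulus without the learner ever representing the trial sequence as state transitions.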
Data availability
Data and materials are available upon reasonable request.
Code availability
Code is available upon reasonable request.
Acknowledgements
We thank the staff and researchers at Kumamoto Sanctuary for their help with the study, particularly Dr. N. Morimura and Dr. F. Kano. We thank Benjamin Knight, MSc., from Edanz (https://jp.edanz.com/ac) for editing a draft of this manuscript.
Funding
This study was supported financially by the Ministry of Education, Culture, Sports, Science and Technology/Japan Society for the Promotion of Science through grants to YSato (grant number 19J22889), SH (grant numbers 26245069, 18H05524, and 23H00494), and TM (grant number 16H06283); a Program for Leading Graduate Schools grant to TM (U04); and the Great Ape Information Network.
Author information
Authors and Affiliations
Contributions
Conceptualization: Yutaro Sato, Yutaka Sakai, Satoshi Hirata. Methodology: Yutaro Sato, Yutaka Sakai. Formal analysis, Investigation, Data Curation, and Visualization: Yutaro Sato. Writing–Original Draft: Yutaro Sato, Yutaka Sakai. Resources: Yutaro Sato, Satoshi Hirata. Writing–Review and Editing: Satoshi Hirata. Supervision: Satoshi Hirata. Project administration: Satoshi Hirata. Funding acquisition: Yutaro Sato, Satoshi Hirata.
Corresponding author
Ethics declarations
Ethics approval
Animal husbandry and research protocols complied with the Guide for Animal Research Ethics provided by the Wildlife Research Center, Kyoto University (No. WRC-2020-KS006A). For human participants (Online Supplementary Materials (OSM)), the research protocol was approved by the Ethics Committee of the Unit for Advanced Study of Mind at Kyoto University (2-P-16).
Consent to participate
Informed consent was obtained from all individual human participants included in the study (OSM).
Consent for publication
Human participants (OSM) provided written informed consent that included consent to publish their data.
Conflict of interest
The authors have no known conflicts of interest to disclose.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open practices statements
None of the data or materials for the experiments reported here have been deposited online, and none of the experiments was preregistered.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sato, Y., Sakai, Y. & Hirata, S. State-transition-free reinforcement learning in chimpanzees (Pan troglodytes). Learn Behav 51, 413–427 (2023). https://doi.org/10.3758/s13420-023-00591-3