Can Reinforcement Learning Learn Itself? A Reply to ‘Reward is Enough’

  • Conference paper
Software Engineering and Formal Methods. SEFM 2021 Collocated Workshops (SEFM 2021)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13230)

Abstract

In their paper ‘Reward is enough’, Silver et al. conjecture that the creation of sufficiently good reinforcement learning (RL) agents is a path to artificial general intelligence (AGI). We consider one aspect of intelligence that Silver et al. did not address in their paper, namely, the aspect of intelligence involved in designing RL agents. If designing RL agents is within human reach, then it should also be within AGI’s reach. This raises the question: is there an RL environment which incentivises RL agents to design RL agents?

Notes

  1. To be clear, when an agent updates its own future behavior based on training data, we do not consider this to be an instance of the agent designing a new agent, even though in some sense the agent post-training is different than the agent pre-training. In the same way, when one reads a book, one becomes, in a sense, a different human being, yet we do not say that by doing so, one has designed a human being. When we speak of an RL agent designing an RL agent, we mean it in the same sense as, e.g., when we speak of an RL agent writing a poem. An RL agent would write a poem by writing down words. In the same way, an RL agent would design an RL agent by writing down pieces of computer code.

  2. Practitioners often abuse language and refer to agent-classes as agents. For example, a Python programmer might write “from stable_baselines3 import DQN” and refer to the resulting DQN class as the deep Q-learning “agent” when, in reality, that object does not itself act. Rather, it must be instantiated (with hyperparameters), and the instance then acts. Language is further abused: underlying an agent, there is typically a model or policy (e.g., a neural network and its weights); once trained using reinforcement learning, the model is often published alone, in which capacity it merely acts in response to observations and no longer has any mechanism for learning from rewards, or even for accepting rewards as input. Practitioners sometimes abuse language and refer to such pretrained models as “RL” agents. Thus, one might say, “this camera is controlled by an RL agent”, when in reality the camera is controlled by a model obtained by training an RL agent (an expensive one-time training investment done on a supercomputer so that the resulting model can be used on consumer-grade computers to control many cameras thereafter). The model itself is not the RL agent: the weaker computer running the model does not give the model rewards or punishments. These nuances cause no confusion in practice. (The sketch following these notes illustrates the distinction.)

  3. To quote Aristotle: “For if ... one were to stretch a covering or membrane over the skin, a sensation would still arise immediately on making contact; yet it is obvious that the sense-organ was not in this membrane” [6].

  4. One might object that there could be environments which reward some other behavior, a behavior which requires RL-agent-design as an intermediate step, rather than rewarding RL-agent-design on its own. But how could we know that this other behavior requires RL-agent-design as an intermediate step? Maybe a smart enough RL agent would figure out a way to avoid the intermediate step, just as RL agents can learn to exploit video-game bugs or invent unanticipated new Go strategies, or just as image classifiers can learn to associate rulers with malignant tumors [17]. Thus, to be confident that an RL environment can incentivise RL-agent-design, it seems necessary that there be an environment that directly rewards RL-agent-design as its primary objective, not one that merely rewards some other behavior requiring RL-agent-design as an intermediate step.

  5. Foreshadowed by [15].

  6. We assume some fixed background proof system such as ZFC or Peano Arithmetic.

  7. Here we use the word “know” in the sense of “act as if it knows”. This is similar to how knowledge is treated in [5].

  8. This environment has similarities to Yampolskiy’s impossible “Disobey!” [26].
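
To make the distinction in note 2 concrete, the following is a minimal illustrative sketch using Stable Baselines3 [18] together with a Gym-style environment [7]. The environment name, hyperparameters, and file name below are placeholders chosen for illustration, and exact API details differ between library versions.

    # Illustrative sketch only: agent-class vs. agent vs. trained model.
    import gymnasium as gym            # Gym-style environment API [7]
    from stable_baselines3 import DQN  # DQN is an agent-class; it does not itself act

    env = gym.make("CartPole-v1")      # placeholder environment

    # Instantiating the class (with hyperparameters) yields an agent.
    agent = DQN("MlpPolicy", env, learning_rate=1e-3, verbose=0)

    # During training, the agent updates itself based on rewards from the environment.
    agent.learn(total_timesteps=10_000)

    # What is typically published and deployed is the trained model alone.
    agent.save("dqn_cartpole")         # placeholder file name

    # At deployment, the model merely maps observations to actions;
    # it receives no rewards and does no further learning.
    model = DQN.load("dqn_cartpole")
    obs, _ = env.reset()
    action, _ = model.predict(obs, deterministic=True)

In the terminology of note 2, DQN is the agent-class, the instantiated agent is the learning agent, and the saved artifact loaded at deployment time is the model that, e.g., controls the camera.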

References

  1. Aldini, A., Fano, V., Graziani, P.: Do the self-knowing machines dream of knowing their factivity? In: AIC, pp. 125–132 (2015)

  2. Aldini, A., Fano, V., Graziani, P.: Theory of knowing machines: revisiting Gödel and the mechanistic thesis. In: Gadducci, F., Tavosanis, M. (eds.) HaPoC 2015. IAICT, vol. 487, pp. 57–70. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47286-7_4

  3. Alexander, S.A.: A machine that knows its own code. Stud. Log. 102(3), 567–576 (2014)

  4. Alexander, S.A.: AGI and the Knight-Darwin Law: why idealized AGI reproduction requires collaboration. In: Goertzel, B., Panov, A.I., Potapov, A., Yampolskiy, R. (eds.) AGI 2020. LNCS (LNAI), vol. 12177, pp. 1–11. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-52152-3_1

  5. Alexander, S.A.: Short-circuiting the definition of mathematical knowledge for an artificial general intelligence. In: Cleophas, L., Massink, M. (eds.) SEFM 2020. LNCS, vol. 12524, pp. 201–213. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-67220-1_16

  6. Aristotle: On the soul. In: Barnes, J., et al. (eds.) The Complete Works of Aristotle. Princeton University Press (1984)

  7. Brockman, G., et al.: OpenAI gym. Preprint (2016)

  8. Davis, M.: Hilbert’s tenth problem is unsolvable. Am. Math. Mon. 80(3), 233–269 (1973)

  9. Hernández-Orallo, J., Dowe, D.L.: Measuring universal intelligence: towards an anytime intelligence test. Artif. Intell. 174(18), 1508–1539 (2010)

  10. Hernández-Orallo, J., Dowe, D.L., España-Cubillo, S., Hernández-Lloreda, M.V., Insa-Cabrera, J.: On more realistic environment distributions for defining, evaluating and developing intelligence. In: Schmidhuber, J., Thórisson, K.R., Looks, M. (eds.) AGI 2011. LNCS (LNAI), vol. 6830, pp. 82–91. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22887-2_9

  11. Hutter, M.: Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer, Heidelberg (2004)

  12. Kaliszyk, C., Urban, J., Michalewski, H., Olšák, M.: Reinforcement learning of theorem proving. In: NeurIPS (2018)

  13. Legg, S., Hutter, M.: Universal intelligence: a definition of machine intelligence. Mind. Mach. 17(4), 391–444 (2007)

  14. Legg, S., Veness, J.: An approximation of the universal intelligence measure. In: Dowe, D.L. (ed.) Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence. LNCS, vol. 7070, pp. 236–249. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-44958-1_18

  15. Maguire, P., Moser, P., Maguire, R.: Are people smarter than machines? Croatian J. Philos. 20(1), 103–123 (2020)

  16. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

  17. Narla, A., Kuprel, B., Sarin, K., Novoa, R., Ko, J.: Automated classification of skin lesions: from pixels to practice. J. Investig. Dermatol. 138(10), 2108–2110 (2018)

  18. Raffin, A., Hill, A., Ernestus, M., Gleave, A., Kanervisto, A., Dormann, N.: Stable Baselines3 (2019). https://github.com/DLR-RM/stable-baselines3

  19. Russell, S.J., Subramanian, D.: Provably bounded-optimal agents. J. Artif. Intell. Res. 2, 575–609 (1994)

  20. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. Preprint (2017)

  21. Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

  22. Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)

  23. Silver, D., Singh, S., Precup, D., Sutton, R.: Reward is enough. Artif. Intell. 299, 103535 (2021)

  24. Singh, S., Lewis, R.L., Barto, A.G., Sorg, J.: Intrinsically motivated reinforcement learning: an evolutionary perspective. IEEE Trans. Auton. Ment. Dev. 2(2), 70–82 (2010)

  25. Watkins, C.: Learning from delayed rewards. Ph.D. thesis, Cambridge (1989)

  26. Yampolskiy, R.: On controllability of artificial intelligence. Technical report (2020)

Acknowledgments

We gratefully acknowledge José Hernández-Orallo, Phil Maguire, and the reviewers for generous comments and feedback.

Author information

Corresponding author

Correspondence to Samuel Allen Alexander.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Alexander, S.A. (2022). Can Reinforcement Learning Learn Itself? A Reply to ‘Reward is Enough’. In: Cerone, A., et al. Software Engineering and Formal Methods. SEFM 2021 Collocated Workshops. SEFM 2021. Lecture Notes in Computer Science, vol 13230. Springer, Cham. https://doi.org/10.1007/978-3-031-12429-7_9

  • DOI: https://doi.org/10.1007/978-3-031-12429-7_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-12428-0

  • Online ISBN: 978-3-031-12429-7

  • eBook Packages: Computer Science, Computer Science (R0)
