Can Reinforcement Learning Learn Itself? A Reply to ‘Reward is Enough’

  • Conference paper
Software Engineering and Formal Methods. SEFM 2021 Collocated Workshops (SEFM 2021)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13230)

Abstract

In their paper ‘Reward is enough’, Silver et al. conjecture that the creation of sufficiently good reinforcement learning (RL) agents is a path to artificial general intelligence (AGI). We consider one aspect of intelligence that Silver et al. did not address in their paper, namely, the aspect of intelligence involved in designing RL agents. If designing RL agents is within human reach, then it should also be within AGI’s reach. This raises the question: is there an RL environment which incentivises RL agents to design RL agents?

Notes

  1. To be clear, when an agent updates its own future behavior based on training data, we do not consider this to be an instance of the agent designing a new agent, even though in some sense the agent post-training is different than the agent pre-training. In the same way, when one reads a book, one becomes, in a sense, a different human being, yet we do not say that by doing so, one has designed a human being. When we speak of an RL agent designing an RL agent, we mean it in the same sense as, e.g., when we speak of an RL agent writing a poem. An RL agent would write a poem by writing down words. In the same way, an RL agent would design an RL agent by writing down pieces of computer code.

  2. Practitioners often abuse language and refer to agent-classes as agents. For example, a Python programmer might write “from stable_baselines3 import DQN” and refer to the resulting DQN class as the deep Q-learning “agent” when, in reality, that object does not itself act. Rather, it must be instantiated (with hyperparameters), and the instance then acts. Language is further abused: underlying an agent, there is typically a model or policy (e.g., a neural network and its weights); once trained using reinforcement learning, the model is often published alone, in which capacity it merely acts in response to observations and no longer has any mechanism for learning from rewards, or even for accepting rewards as input. Practitioners sometimes abuse language and refer to such pretrained models as “RL” agents. Thus, one might say, “this camera is controlled by an RL agent”, when in reality the camera is controlled by a model obtained by training an RL agent (an expensive one-time training investment done on a supercomputer so that the resulting model can be used on consumer-grade computers to control many cameras thereafter). The model itself is not the RL agent: the weaker computer running the model does not give the model rewards or punishments. These nuances cause no confusion in practice. (The sketch following these notes illustrates the distinction.)

  3. To quote Aristotle: “For if ... one were to stretch a covering or membrane over the skin, a sensation would still arise immediately on making contact; yet it is obvious that the sense-organ was not in this membrane” [6].

  4. One might object that there could be environments which reward some other behavior, a behavior which requires RL-agent-design as an intermediate step, rather than rewarding RL-agent-design on its own. But how could we know that this other behavior requires RL-agent-design as an intermediate step? Maybe a smart enough RL agent would figure out a way to avoid the intermediate step, just as RL agents can learn to exploit video-game bugs or invent unanticipated new Go strategies, or just as image classifiers can learn to associate rulers with malignant tumors [17]. Thus, to be confident that an RL environment can incentivise RL-agent-design, it seems necessary that there be an environment that directly rewards RL-agent-design as its primary objective, not one that merely rewards some other behavior requiring RL-agent-design as an intermediate step.

  5. Foreshadowed by [15].

  6. We assume some fixed background proof system such as ZFC or Peano Arithmetic.

  7. Here we use the word “know” in the sense of “act as if it knows”. This is similar to how knowledge is treated in [5].

  8. This environment has similarities to Yampolskiy’s impossible “Disobey!” [26].
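
To make the distinction in note 2 concrete, the following is a minimal illustrative sketch using Stable Baselines3 [18] together with a Gym-style environment [7]. The environment name, hyperparameters, and file name below are placeholders chosen for illustration, and exact API details differ between library versions.

    # Illustrative sketch only: agent-class vs. agent vs. trained model.
    import gymnasium as gym            # Gym-style environment API [7]
    from stable_baselines3 import DQN  # DQN is an agent-class; it does not itself act

    env = gym.make("CartPole-v1")      # placeholder environment

    # Instantiating the class (with hyperparameters) yields an agent.
    agent = DQN("MlpPolicy", env, learning_rate=1e-3, verbose=0)

    # During training, the agent updates itself based on rewards from the environment.
    agent.learn(total_timesteps=10_000)

    # What is typically published and deployed is the trained model alone.
    agent.save("dqn_cartpole")         # placeholder file name

    # At deployment, the model merely maps observations to actions;
    # it receives no rewards and does no further learning.
    model = DQN.load("dqn_cartpole")
    obs, _ = env.reset()
    action, _ = model.predict(obs, deterministic=True)

In the terminology of note 2, DQN is the agent-class, the instantiated agent is the learning agent, and the saved artifact loaded at deployment time is the model that, e.g., controls the camera.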

References

  1. Aldini, A., Fano, V., Graziani, P.: Do the self-knowing machines dream of knowing their factivity? In: AIC, pp. 125–132 (2015)

  2. Aldini, A., Fano, V., Graziani, P.: Theory of knowing machines: revisiting Gödel and the mechanistic thesis. In: Gadducci, F., Tavosanis, M. (eds.) HaPoC 2015. IAICT, vol. 487, pp. 57–70. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47286-7_4

  3. Alexander, S.A.: A machine that knows its own code. Stud. Log. 102(3), 567–576 (2014)

  4. Alexander, S.A.: AGI and the Knight-Darwin Law: why idealized AGI reproduction requires collaboration. In: Goertzel, B., Panov, A.I., Potapov, A., Yampolskiy, R. (eds.) AGI 2020. LNCS (LNAI), vol. 12177, pp. 1–11. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-52152-3_1

  5. Alexander, S.A.: Short-circuiting the definition of mathematical knowledge for an artificial general intelligence. In: Cleophas, L., Massink, M. (eds.) SEFM 2020. LNCS, vol. 12524, pp. 201–213. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-67220-1_16

  6. Aristotle: On the soul. In: Barnes, J., et al. (eds.) The Complete Works of Aristotle. Princeton University Press (1984)

  7. Brockman, G., et al.: OpenAI gym. Preprint (2016)

  8. Davis, M.: Hilbert’s tenth problem is unsolvable. Am. Math. Mon. 80(3), 233–269 (1973)

  9. Hernández-Orallo, J., Dowe, D.L.: Measuring universal intelligence: towards an anytime intelligence test. Artif. Intell. 174(18), 1508–1539 (2010)

  10. Hernández-Orallo, J., Dowe, D.L., España-Cubillo, S., Hernández-Lloreda, M.V., Insa-Cabrera, J.: On more realistic environment distributions for defining, evaluating and developing intelligence. In: Schmidhuber, J., Thórisson, K.R., Looks, M. (eds.) AGI 2011. LNCS (LNAI), vol. 6830, pp. 82–91. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22887-2_9

  11. Hutter, M.: Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer, Heidelberg (2004)

  12. Kaliszyk, C., Urban, J., Michalewski, H., Olšák, M.: Reinforcement learning of theorem proving. In: NeurIPS (2018)

  13. Legg, S., Hutter, M.: Universal intelligence: a definition of machine intelligence. Mind. Mach. 17(4), 391–444 (2007)

  14. Legg, S., Veness, J.: An approximation of the universal intelligence measure. In: Dowe, D.L. (ed.) Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence. LNCS, vol. 7070, pp. 236–249. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-44958-1_18

  15. Maguire, P., Moser, P., Maguire, R.: Are people smarter than machines? Croatian J. Philos. 20(1), 103–123 (2020)

  16. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

  17. Narla, A., Kuprel, B., Sarin, K., Novoa, R., Ko, J.: Automated classification of skin lesions: from pixels to practice. J. Investig. Dermatol. 138(10), 2108–2110 (2018)

  18. Raffin, A., Hill, A., Ernestus, M., Gleave, A., Kanervisto, A., Dormann, N.: Stable Baselines3 (2019). https://github.com/DLR-RM/stable-baselines3

  19. Russell, S.J., Subramanian, D.: Provably bounded-optimal agents. J. Artif. Intell. Res. 2, 575–609 (1994)

  20. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. Preprint (2017)

  21. Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

  22. Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)

  23. Silver, D., Singh, S., Precup, D., Sutton, R.: Reward is enough. Artif. Intell. 299, 103535 (2021)

  24. Singh, S., Lewis, R.L., Barto, A.G., Sorg, J.: Intrinsically motivated reinforcement learning: an evolutionary perspective. IEEE Trans. Auton. Ment. Dev. 2(2), 70–82 (2010)

  25. Watkins, C.: Learning from delayed rewards. Ph.D. thesis, Cambridge (1989)

  26. Yampolskiy, R.: On controllability of artificial intelligence. Technical report (2020)

Acknowledgments

We gratefully acknowledge José Hernández-Orallo, Phil Maguire, and the reviewers for generous comments and feedback.

Author information

Corresponding author

Correspondence to Samuel Allen Alexander.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Alexander, S.A. (2022). Can Reinforcement Learning Learn Itself? A Reply to ‘Reward is Enough’. In: Cerone, A., et al. Software Engineering and Formal Methods. SEFM 2021 Collocated Workshops. SEFM 2021. Lecture Notes in Computer Science, vol 13230. Springer, Cham. https://doi.org/10.1007/978-3-031-12429-7_9

  • DOI: https://doi.org/10.1007/978-3-031-12429-7_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-12428-0

  • Online ISBN: 978-3-031-12429-7

  • eBook Packages: Computer Science, Computer Science (R0)
