Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda

Chapter in The Technological Singularity, part of the book series The Frontiers Collection (FRONTCOLL)

Abstract

In this chapter, we discuss a host of technical problems that we think AI scientists could work on to ensure that the creation of smarter-than-human machine intelligence has a positive impact. Although such systems may be decades away, it is prudent to begin research early: the technical challenges involved in safety and reliability work appear formidable, and uniquely consequential. Our technical agenda discusses three broad categories of research where we think foundational research today could make it easier in the future to develop superintelligent systems that are reliably aligned with human interests:

  1. Highly reliable agent designs: how to ensure that we build the right system.

  2. Error tolerance: how to ensure that the inevitable flaws are manageable and correctable.

  3. Value specification: how to ensure that the system is pursuing the right sorts of objectives.

Since little is known about the design or implementation details of such systems, the research described in this chapter focuses on formal agent foundations for AI alignment research—that is, on developing the basic conceptual tools and theory that are most likely to be useful for engineering robustly beneficial systems in the future.

Notes

  1.

    A more careful wording might be “aligned with the interests of sentient beings.” We would not want to benefit humans at the expense of sentient non-human animals—or (if we build them) at the expense of sentient machines.

  2.

    Since the Dartmouth Proposal (McCarthy et al. 1955), it has been a standard idea in AI that a sufficiently smart machine intelligence could be intelligent enough to improve itself. In 1965, I.J. Good observed that this might create a positive feedback loop leading to an “intelligence explosion” (Good 1965). Sotala and Yampolskiy (2015, Sect. 2.3, this volume) and Bostrom (2014, Chap. 14) have observed that an intelligence explosion is especially likely if the agent has the ability to acquire more hardware, improve its software, or design new hardware.

  3.

    Legg and Hutter (2007) provide a preliminary answer to this question by defining a “universal measure of intelligence” which scores how well an agent can learn the features of an external environment and maximize a reward function. This is the type of formalization we are looking for: a scoring metric which describes how well an agent would achieve some set of goals. However, while the Legg–Hutter metric is insightful, it makes a number of simplifying assumptions, and many difficult open questions remain (Soares 2015). (Their scoring rule is restated after these notes.)

  4.

    As this is a multi-agent scenario, the problem of counterfactuals can also be thought of as game-theoretic. The goal is to define a procedure which reliably identifies the best available action; the label of “decision theory” is secondary. This goal subsumes both game theory and decision theory: the desired procedure must identify the best action in all settings, even when there is no clear demarcation between “agent” and “environment.” Game theory informs, but does not define, this area of research. (A toy illustration of how the modeling of counterfactuals determines the recommended action appears after these notes.)

  5.

    Of course, if an agent reasons perfectly under logical uncertainty, it would also reason well about the construction of successor agents. However, given the fallibility of human reasoning and the fact that this path is critically important, it seems prudent to verify the agent’s reasoning methods in this scenario specifically.

  6.

    Or of all humans, or of all sapient creatures, etc. There are many philosophical concerns surrounding what sort of goals are ethical when aligning a superintelligent system, but a solution to the value learning problem will be a practical necessity regardless of which philosophical view is the correct one.
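
For reference (note 3 above), the Legg–Hutter score rates a policy \(\pi\) by its expected cumulative reward in each computable environment \(\mu\), weighting every environment by its Kolmogorov complexity \(K(\mu)\) so that simpler environments count for more, in the spirit of Solomonoff's (1964) prior. The display below is our restatement of the definition from Legg and Hutter (2007), in notation not used elsewhere in this chapter:

    \[
      \Upsilon(\pi) \;:=\; \sum_{\mu \in E} 2^{-K(\mu)}\, V^{\pi}_{\mu},
      \qquad
      V^{\pi}_{\mu} \;:=\; \mathbb{E}\!\left[\sum_{t=1}^{\infty} r_t \,\middle|\, \pi,\, \mu\right],
    \]

where \(E\) is the class of computable environments whose total reward is bounded by 1, and \(V^{\pi}_{\mu}\) is the expected total reward that policy \(\pi\) obtains when interacting with \(\mu\). Among the simplifying assumptions mentioned in note 3 is the premise that agent and environment interact only through a fixed observation–reward–action interface (Soares 2015).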
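
As a toy illustration of note 4 (our own sketch, not material from the chapter), consider a one-shot Prisoner's Dilemma played against an exact copy of oneself (Lewis 1979; Rapoport and Chammah 1965). A reasoner that evaluates counterfactuals by holding the rest of the world fixed treats the copy's move as independent of its own and defects; a reasoner that accounts for the fact that an identical copy reaches the identical decision cooperates and fares better. All names below (PAYOFF, causal_best_response, correlated_best_action) are purely illustrative:

    # Toy twin Prisoner's Dilemma: the "environment" contains an exact copy of
    # the agent, so the recommended action depends on how counterfactuals are
    # modeled. Illustrative sketch only; not an implementation from the chapter.

    PAYOFF = {  # (my_move, twin_move) -> my payoff; higher is better
        ("C", "C"): 3, ("C", "D"): 0,
        ("D", "C"): 5, ("D", "D"): 1,
    }
    ACTIONS = ("C", "D")

    def causal_best_response(assumed_twin_move):
        """Hold the twin's move fixed, as if it were causally independent of
        mine, and pick the action with the highest payoff against it.
        Defection dominates, so this returns 'D' whatever move is assumed."""
        return max(ACTIONS, key=lambda a: PAYOFF[(a, assumed_twin_move)])

    def correlated_best_action():
        """Account for the fact that an exact copy makes the same decision I
        do: evaluate each action as if the twin mirrors it. Returns 'C'."""
        return max(ACTIONS, key=lambda a: PAYOFF[(a, a)])

    if __name__ == "__main__":
        causal = causal_best_response("C")   # 'D' (and likewise if "D" is assumed)
        mirrored = correlated_best_action()  # 'C'
        # Both copies reason the same way, so each ends up facing its own choice.
        print("causal reasoner plays", causal, "-> each copy gets", PAYOFF[(causal, causal)])
        print("correlation-aware reasoner plays", mirrored, "-> each copy gets", PAYOFF[(mirrored, mirrored)])

The point of the toy example matches the note: a procedure meant to identify the best available action must handle settings in which the environment contains predictors or copies of the agent, so “vary my action while holding everything else fixed” is not an adequate notion of counterfactual. Game theory describes the payoff structure here, but it does not by itself settle which action counts as best available.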

References

  • Armstrong S (2015) AI motivated value selection, accepted to the 1st International Workshop on AI and Ethics, held within the 29th AAAI Conference on Artificial Intelligence (AAAI-2015), Austin, TX

  • Armstrong S, Sandberg A, Bostrom N (2012) Thinking inside the box: Controlling and using an oracle AI. Minds and Machines 22(4):299–324

  • Bárász M, Christiano P, Fallenstein B, Herreshoff M, LaVictoire P, Yudkowsky E (2014) Robust cooperation in the Prisoner’s Dilemma: Program equilibrium via provability logic, unpublished manuscript. Available via arXiv. http://arxiv.org/abs/1401.5577

  • Ben-Porath E (1997) Rationality, Nash equilibrium, and backwards induction in perfect-information games. Review of Economic Studies 64(1):23–46

  • Bensinger R (2013) Building phenomenological bridges. Less Wrong Blog http://lesswrong.com/lw/jd9/building_phenomenological_bridges/

  • Bird J, Layzell P (2002) The evolved radio and its implications for modelling the evolution of novel sensors. In: Proceedings of the 2002 Congress on Evolutionary Computation. Vol. 2, IEEE, Honolulu, HI, pp 1836–1841

  • Bostrom N (2014) Superintelligence: Paths, Dangers, Strategies. Oxford University Press, New York

  • Christiano P (2014a) Non-omniscience, probabilistic inference, and metamathematics. Tech. Rep. 2014–3, Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/Non-Omniscience.pdf

  • Christiano P (2014b) Specifying “enlightened judgment” precisely (reprise). Ordinary Ideas Blog http://ordinaryideas.wordpress.com/2014/08/27/specifying-enlightened-judgment-precisely-reprise/

  • de Blanc P (2011) Ontological crises in artificial agents’ value systems. Tech. rep., The Singularity Institute, San Francisco, CA, http://arxiv.org/abs/1105.3821

  • Demski A (2012) Logical prior probability. In: Bach J, Goertzel B, Iklé M (eds) Artificial General Intelligence, Springer, New York, 7716, pp 50–59, 5th International Conference, AGI 2012, Oxford, UK, December 8–11, 2012. Proceedings

  • Fallenstein B (2014) Procrastination in probabilistic logic. Working paper, Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/ProbabilisticLogicProcrastinates.pdf

  • Fallenstein B, Soares N (2014) Problems of self-reference in self-improving space-time embedded intelligence. In: Goertzel B, Orseau L, Snaider J (eds) Artificial General Intelligence, Springer, New York, 8598, pp 21–32, 7th International Conference, AGI 2014, Quebec City, QC, Canada, August 1–4, 2014. Proceedings

  • Fallenstein B, Soares N (2015) Vingean reflection: Reliable reasoning for self-improving agents. Tech. Rep. 2015–2, Machine Intelligence Research Institute, Berkeley, CA, https://intelligence.org/files/VingeanReflection.pdf

  • Gaifman H (1964) Concerning measures in first order calculi. Israel Journal of Mathematics 2(1):1–18

  • Gaifman H (2004) Reasoning with limited resources and assigning probabilities to arithmetical statements. Synthese 140(1–2):97–119

  • Gödel K, Kleene SC, Rosser JB (1934) On Undecidable Propositions of Formal Mathematical Systems. Institute for Advanced Study, Princeton, NJ

  • Good IJ (1965) Speculations concerning the first ultraintelligent machine. In: Alt FL, Rubinoff M (eds) Advances in Computers, vol 6, Academic Press, New York, pp 31–88

  • Halpern JY (2003) Reasoning about Uncertainty. MIT Press, Cambridge, MA

  • Hintze D (2014) Problem class dominance in predictive dilemmas. Tech. rep., Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/ProblemClassDominance.pdf

  • Hutter M (2000) A theory of universal artificial intelligence based on algorithmic complexity, unpublished manuscript. Available via arXiv. http://arxiv.org/abs/cs/0004001

  • Hutter M, Lloyd JW, Ng KS, Uther WTB (2013) Probabilities on sentences in an expressive logic. Journal of Applied Logic 11(4):386–420

  • Jeffrey RC (1983) The Logic of Decision, 2nd edn. Chicago University Press, Chicago, IL

  • Joyce JM (1999) The Foundations of Causal Decision Theory. Cambridge Studies in Probability, Induction and Decision Theory, Cambridge University Press, New York, NY

  • Legg S, Hutter M (2007) Universal intelligence: A definition of machine intelligence. Minds and Machines 17(4):391–444

  • Lehmann EL (1950) Some principles of the theory of testing hypotheses. Annals of Mathematical Statistics 21(1):1–26

  • Lewis D (1979) Prisoners’ dilemma is a Newcomb problem. Philosophy & Public Affairs 8(3):235–240, http://www.jstor.org/stable/2265034

  • Lewis D (1981) Causal decision theory. Australasian Journal of Philosophy 59(1):5–30

  • Łoś J (1955) On the axiomatic treatment of probability. Colloquium Mathematicae 3(2):125–137, http://eudml.org/doc/209996

  • MacAskill W (2014) Normative uncertainty. PhD thesis, St Anne’s College, University of Oxford, http://ora.ox.ac.uk/objects/uuid:8a8b60af-47cd-4abc-9d29-400136c89c0f

  • McCarthy J, Minsky M, Rochester N, Shannon C (1955) A proposal for the Dartmouth summer research project on artificial intelligence. Proposal, Formal Reasoning Group, Stanford University, Stanford, CA

  • Muehlhauser L, Salamon A (2012) Intelligence explosion: Evidence and import. In: Eden A, Søraker J, Moor JH, Steinhart E (eds) Singularity Hypotheses: A Scientific and Philosophical Assessment, Springer, Berlin, the Frontiers Collection

  • Ng AY, Russell SJ (2000) Algorithms for inverse reinforcement learning. In: Langley P (ed) Proceedings of the Seventeenth International Conference on Machine Learning (ICML-’00), Morgan Kaufmann, San Francisco, pp 663–670

  • Omohundro SM (2008) The basic AI drives. In: Wang P, Goertzel B, Franklin S (eds) Artificial General Intelligence 2008, IOS, Amsterdam, no. 171 in Frontiers in Artificial Intelligence and Applications, pp 483–492, proceedings of the First AGI Conference

  • Pearl J (2000) Causality: Models, Reasoning, and Inference, 1st edn. Cambridge University Press, New York, NY

  • Poe EA (1836) Maelzel’s chess-player. Southern Literary Messenger 2(5):318–326

  • Rapoport A, Chammah AM (1965) Prisoner’s Dilemma: A Study in Conflict and Cooperation, Ann Arbor Paperbacks, vol 165. University of Michigan Press, Ann Arbor, MI

  • Russell S (2014) Unifying logic and probability: A new dawn for AI? In: Information Processing and Management of Uncertainty in Knowledge-Based Systems: 15th International Conference, IPMU 2014, Montpellier, France, July 15–19, 2014, Proceedings, Part I, Springer, no. 442 in Communications in Computer and Information Science, pp 10–14

  • Sawin W, Demski A (2013) Computable probability distributions which converge on \(\Pi_1\) will disbelieve true \(\Pi_2\) sentences. Tech. rep., Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/Pi1Pi2Problem.pdf

  • Shannon CE (1950) XXII. Programming a computer for playing chess. Philosophical Magazine 41(314):256–275

  • Soares N (2014) Tiling agents in causal graphs. Tech. Rep. 2014–5, Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/TilingAgentsCausalGraphs.pdf

  • Soares N (2015) Formalizing two problems of realistic world-models. Tech. Rep. 2015–3, Machine Intelligence Research Institute, Berkeley, CA, https://intelligence.org/files/RealisticWorldModels.pdf

  • Soares N (2016) The value learning problem. In: Ethics for Artificial Intelligence Workshop at the 25th International Joint Conference on Artificial Intelligence (IJCAI-16). New York, NY, July 9th-15th

  • Soares N, Fallenstein B (2014) Toward idealized decision theory. Tech. Rep. 2014–7, Machine Intelligence Research Institute, Berkeley, CA, https://intelligence.org/files/TowardIdealizedDecisionTheory.pdf

  • Soares N, Fallenstein B (2015) Questions of reasoning under logical uncertainty. Tech. Rep. 2015–1, Machine Intelligence Research Institute, Berkeley, CA, https://intelligence.org/files/QuestionsLogicalUncertainty.pdf

  • Solomonoff RJ (1964) A formal theory of inductive inference. Part I. Information and Control 7(1):1–22

  • United Kingdom Ministry of Defence (1991) Requirements for the procurement of safety critical software in defence equipment. Interim Defence Standard 00-55, United Kingdom Ministry of Defence

  • United States Department of Defense (1985) Department of Defense trusted computer system evaluation criteria. Department of Defense Standard DOD 5200.28-STD, United States Department of Defense, http://csrc.nist.gov/publications/history/dod85.pdf

  • Vinge V (1993) The coming technological singularity: How to survive in the post-human era. In: Vision-21: Interdisciplinary Science and Engineering in the Era of Cyberspace, NASA Lewis Research Center, no. 10129 in NASA Conference Publication, pp 11–22, http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940022856.pdf

  • Wald A (1939) Contributions to the theory of statistical estimation and testing hypotheses. Annals of Mathematical Statistics 10(4):299–326

  • Weld D, Etzioni O (1994) The first law of robotics (a call to arms). In: Hayes-Roth B, Korf RE (eds) Proceedings of the Twelfth National Conference on Artificial Intelligence, AAAI Press, Menlo Park, CA, pp 1042–1047, http://www.aaai.org/Papers/AAAI/1994/AAAI94-160.pdf

  • Yudkowsky E (2008) Artificial intelligence as a positive and negative factor in global risk. In: Bostrom N, Ćirković MM (eds) Global Catastrophic Risks, Oxford University Press, New York, pp 308–345

  • Yudkowsky E (2011) Complex value systems in Friendly AI. In: Schmidhuber J, Thórisson KR, Looks M (eds) Artificial General Intelligence, Springer, Berlin, no. 6830 in Lecture Notes in Computer Science, pp 388–393, 4th International Conference, AGI 2011, Mountain View, CA, USA, August 3–6, 2011. Proceedings

  • Yudkowsky E (2013) The procrastination paradox. Brief technical note, Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/ProcrastinationParadox.pdf

  • Yudkowsky E (2014) Distributions allowing tiling of staged subjective EU maximizers. Tech. rep., Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/DistributionsAllowingTiling.pdf

  • Yudkowsky E, Herreshoff M (2013) Tiling agents for self-modifying AI, and the Löbian obstacle. Early draft, Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/TilingAgents.pdf

Author information

Correspondence to Nate Soares.

Copyright information

© 2017 Springer-Verlag GmbH Germany

About this chapter

Cite this chapter

Soares, N., Fallenstein, B. (2017). Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda. In: Callaghan, V., Miller, J., Yampolskiy, R., Armstrong, S. (eds) The Technological Singularity. The Frontiers Collection. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-54033-6_5
