Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda

Chapter in The Technological Singularity, part of the book series The Frontiers Collection (FRONTCOLL)

Abstract

In this chapter, we discuss a host of technical problems that we think AI scientists could work on to ensure that the creation of smarter-than-human machine intelligence has a positive impact. Although such systems may be decades away, it is prudent to begin research early: the technical challenges involved in safety and reliability work appear formidable, and uniquely consequential. Our technical agenda discusses three broad categories of research where we think foundational research today could make it easier in the future to develop superintelligent systems that are reliably aligned with human interests:

  1. Highly reliable agent designs: how to ensure that we build the right system.

  2. Error tolerance: how to ensure that the inevitable flaws are manageable and correctable.

  3. Value specification: how to ensure that the system is pursuing the right sorts of objectives.

Since little is known about the design or implementation details of such systems, the research described in this chapter focuses on formal agent foundations for AI alignment research—that is, on developing the basic conceptual tools and theory that are most likely to be useful for engineering robustly beneficial systems in the future.

Notes

  1.

    A more careful wording might be “aligned with the interests of sentient beings.” We would not want to benefit humans at the expense of sentient non-human animals—or (if we build them) at the expense of sentient machines.

  2.

    Since the Dartmouth Proposal (McCarthy et al. 1955), it has been a standard idea in AI that a sufficiently smart machine intelligence could be intelligent enough to improve itself. In 1965, I.J. Good observed that this might create a positive feedback loop leading to an “intelligence explosion” (Good 1965). Sotala and Yampolskiy (2015, Sect. 2.3, this volume) and Bostrom (2014, Chap. 14) have observed that an intelligence explosion is especially likely if the agent has the ability to acquire more hardware, improve its software, or design new hardware.

  3.

    Legg and Hutter (2007) provide a preliminary answer to this question by defining a “universal measure of intelligence” which scores how well an agent can learn the features of an external environment and maximize a reward function. This is the type of formalization we are looking for: a scoring metric which describes how well an agent would achieve some set of goals. However, while the Legg–Hutter metric is insightful, it makes a number of simplifying assumptions, and many difficult open questions remain (Soares 2015). (Their scoring rule is restated after these notes.)

  4.

    As this is a multi-agent scenario, the problem of counterfactuals can also be thought of as game-theoretic. The goal is to define a procedure which reliably identifies the best available action; the label of “decision theory” is secondary. This goal subsumes both game theory and decision theory: the desired procedure must identify the best action in all settings, even when there is no clear demarcation between “agent” and “environment.” Game theory informs, but does not define, this area of research. (A toy illustration of how the modeling of counterfactuals determines the recommended action appears after these notes.)

  5.

    Of course, if an agent reasons perfectly under logical uncertainty, it would also reason well about the construction of successor agents. However, given the fallibility of human reasoning and the fact that this path is critically important, it seems prudent to verify the agent’s reasoning methods in this scenario specifically.

  6.

    Or of all humans, or of all sapient creatures, etc. There are many philosophical concerns surrounding what sort of goals are ethical when aligning a superintelligent system, but a solution to the value learning problem will be a practical necessity regardless of which philosophical view is the correct one.
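
For reference (note 3 above), the Legg–Hutter score rates a policy \(\pi\) by its expected cumulative reward in each computable environment \(\mu\), weighting every environment by its Kolmogorov complexity \(K(\mu)\) so that simpler environments count for more, in the spirit of Solomonoff's (1964) prior. The display below is our restatement of the definition from Legg and Hutter (2007), in notation not used elsewhere in this chapter:

    \[
      \Upsilon(\pi) \;:=\; \sum_{\mu \in E} 2^{-K(\mu)}\, V^{\pi}_{\mu},
      \qquad
      V^{\pi}_{\mu} \;:=\; \mathbb{E}\!\left[\sum_{t=1}^{\infty} r_t \,\middle|\, \pi,\, \mu\right],
    \]

where \(E\) is the class of computable environments whose total reward is bounded by 1, and \(V^{\pi}_{\mu}\) is the expected total reward that policy \(\pi\) obtains when interacting with \(\mu\). Among the simplifying assumptions mentioned in note 3 is the premise that agent and environment interact only through a fixed observation–reward–action interface (Soares 2015).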
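
As a toy illustration of note 4 (our own sketch, not material from the chapter), consider a one-shot Prisoner's Dilemma played against an exact copy of oneself (Lewis 1979; Rapoport and Chammah 1965). A reasoner that evaluates counterfactuals by holding the rest of the world fixed treats the copy's move as independent of its own and defects; a reasoner that accounts for the fact that an identical copy reaches the identical decision cooperates and fares better. All names below (PAYOFF, causal_best_response, correlated_best_action) are purely illustrative:

    # Toy twin Prisoner's Dilemma: the "environment" contains an exact copy of
    # the agent, so the recommended action depends on how counterfactuals are
    # modeled. Illustrative sketch only; not an implementation from the chapter.

    PAYOFF = {  # (my_move, twin_move) -> my payoff; higher is better
        ("C", "C"): 3, ("C", "D"): 0,
        ("D", "C"): 5, ("D", "D"): 1,
    }
    ACTIONS = ("C", "D")

    def causal_best_response(assumed_twin_move):
        """Hold the twin's move fixed, as if it were causally independent of
        mine, and pick the action with the highest payoff against it.
        Defection dominates, so this returns 'D' whatever move is assumed."""
        return max(ACTIONS, key=lambda a: PAYOFF[(a, assumed_twin_move)])

    def correlated_best_action():
        """Account for the fact that an exact copy makes the same decision I
        do: evaluate each action as if the twin mirrors it. Returns 'C'."""
        return max(ACTIONS, key=lambda a: PAYOFF[(a, a)])

    if __name__ == "__main__":
        causal = causal_best_response("C")   # 'D' (and likewise if "D" is assumed)
        mirrored = correlated_best_action()  # 'C'
        # Both copies reason the same way, so each ends up facing its own choice.
        print("causal reasoner plays", causal, "-> each copy gets", PAYOFF[(causal, causal)])
        print("correlation-aware reasoner plays", mirrored, "-> each copy gets", PAYOFF[(mirrored, mirrored)])

The point of the toy example matches the note: a procedure meant to identify the best available action must handle settings in which the environment contains predictors or copies of the agent, so “vary my action while holding everything else fixed” is not an adequate notion of counterfactual. Game theory describes the payoff structure here, but it does not by itself settle which action counts as best available.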

References

  • Armstrong S (2015) AI motivated value selection, accepted to the 1st International Workshop on AI and Ethics, held within the 29th AAAI Conference on Artificial Intelligence (AAAI-2015), Austin, TX

  • Armstrong S, Sandberg A, Bostrom N (2012) Thinking inside the box: Controlling and using an oracle AI. Minds and Machines 22(4):299–324

  • Bárász M, Christiano P, Fallenstein B, Herreshoff M, LaVictoire P, Yudkowsky E (2014) Robust cooperation in the Prisoner’s Dilemma: Program equilibrium via provability logic, unpublished manuscript. Available via arXiv. http://arxiv.org/abs/1401.5577

  • Ben-Porath E (1997) Rationality, Nash equilibrium, and backwards induction in perfect-information games. Review of Economic Studies 64(1):23–46

  • Bensinger R (2013) Building phenomenological bridges. Less Wrong Blog http://lesswrong.com/lw/jd9/building_phenomenological_bridges/

  • Bird J, Layzell P (2002) The evolved radio and its implications for modelling the evolution of novel sensors. In: Proceedings of the 2002 Congress on Evolutionary Computation. Vol. 2, IEEE, Honolulu, HI, pp 1836–1841

  • Bostrom N (2014) Superintelligence: Paths, Dangers, Strategies. Oxford University Press, New York

  • Christiano P (2014a) Non-omniscience, probabilistic inference, and metamathematics. Tech. Rep. 2014–3, Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/Non-Omniscience.pdf

  • Christiano P (2014b) Specifying “enlightened judgment” precisely (reprise). Ordinary Ideas Blog http://ordinaryideas.wordpress.com/2014/08/27/specifying-enlightened-judgment-precisely-reprise/

  • de Blanc P (2011) Ontological crises in artificial agents’ value systems. Tech. rep., The Singularity Institute, San Francisco, CA, http://arxiv.org/abs/1105.3821

  • Demski A (2012) Logical prior probability. In: Bach J, Goertzel B, Iklé M (eds) Artificial General Intelligence, Springer, New York, 7716, pp 50–59, 5th International Conference, AGI 2012, Oxford, UK, December 8–11, 2012. Proceedings

  • Fallenstein B (2014) Procrastination in probabilistic logic. Working paper, Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/ProbabilisticLogicProcrastinates.pdf

  • Fallenstein B, Soares N (2014) Problems of self-reference in self-improving space-time embedded intelligence. In: Goertzel B, Orseau L, Snaider J (eds) Artificial General Intelligence, Springer, New York, 8598, pp 21–32, 7th International Conference, AGI 2014, Quebec City, QC, Canada, August 1–4, 2014. Proceedings

  • Fallenstein B, Soares N (2015) Vingean reflection: Reliable reasoning for self-improving agents. Tech. Rep. 2015–2, Machine Intelligence Research Institute, Berkeley, CA, https://intelligence.org/files/VingeanReflection.pdf

  • Gaifman H (1964) Concerning measures in first order calculi. Israel Journal of Mathematics 2(1):1–18

  • Gaifman H (2004) Reasoning with limited resources and assigning probabilities to arithmetical statements. Synthese 140(1–2):97–119

  • Gödel K, Kleene SC, Rosser JB (1934) On Undecidable Propositions of Formal Mathematical Systems. Institute for Advanced Study, Princeton, NJ

  • Good IJ (1965) Speculations concerning the first ultraintelligent machine. In: Alt FL, Rubinoff M (eds) Advances in Computers, vol 6, Academic Press, New York, pp 31–88

  • Halpern JY (2003) Reasoning about Uncertainty. MIT Press, Cambridge, MA

  • Hintze D (2014) Problem class dominance in predictive dilemmas. Tech. rep., Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/ProblemClassDominance.pdf

  • Hutter M (2000) A theory of universal artificial intelligence based on algorithmic complexity, unpublished manuscript. Available via arXiv. http://arxiv.org/abs/cs/0004001

  • Hutter M, Lloyd JW, Ng KS, Uther WTB (2013) Probabilities on sentences in an expressive logic. Journal of Applied Logic 11(4):386–420

  • Jeffrey RC (1983) The Logic of Decision, 2nd edn. Chicago University Press, Chicago, IL

  • Joyce JM (1999) The Foundations of Causal Decision Theory. Cambridge Studies in Probability, Induction and Decision Theory, Cambridge University Press, New York, NY

  • Legg S, Hutter M (2007) Universal intelligence: A definition of machine intelligence. Minds and Machines 17(4):391–444

  • Lehmann EL (1950) Some principles of the theory of testing hypotheses. Annals of Mathematical Statistics 21(1):1–26

  • Lewis D (1979) Prisoners’ dilemma is a Newcomb problem. Philosophy & Public Affairs 8(3):235–240, http://www.jstor.org/stable/2265034

  • Lewis D (1981) Causal decision theory. Australasian Journal of Philosophy 59(1):5–30

  • Łoś J (1955) On the axiomatic treatment of probability. Colloquium Mathematicae 3(2):125–137, http://eudml.org/doc/209996

  • MacAskill W (2014) Normative uncertainty. PhD thesis, St Anne’s College, University of Oxford, http://ora.ox.ac.uk/objects/uuid:8a8b60af-47cd-4abc-9d29-400136c89c0f

  • McCarthy J, Minsky M, Rochester N, Shannon C (1955) A proposal for the Dartmouth summer research project on artificial intelligence. Proposal, Formal Reasoning Group, Stanford University, Stanford, CA

  • Muehlhauser L, Salamon A (2012) Intelligence explosion: Evidence and import. In: Eden A, Søraker J, Moor JH, Steinhart E (eds) Singularity Hypotheses: A Scientific and Philosophical Assessment, Springer, Berlin, the Frontiers Collection

  • Ng AY, Russell SJ (2000) Algorithms for inverse reinforcement learning. In: Langley P (ed) Proceedings of the Seventeenth International Conference on Machine Learning (ICML-’00), Morgan Kaufmann, San Francisco, pp 663–670

  • Omohundro SM (2008) The basic AI drives. In: Wang P, Goertzel B, Franklin S (eds) Artificial General Intelligence 2008, IOS, Amsterdam, no. 171 in Frontiers in Artificial Intelligence and Applications, pp 483–492, proceedings of the First AGI Conference

  • Pearl J (2000) Causality: Models, Reasoning, and Inference, 1st edn. Cambridge University Press, New York, NY

  • Poe EA (1836) Maelzel’s chess-player. Southern Literary Messenger 2(5):318–326

  • Rapoport A, Chammah AM (1965) Prisoner’s Dilemma: A Study in Conflict and Cooperation, Ann Arbor Paperbacks, vol 165. University of Michigan Press, Ann Arbor, MI

  • Russell S (2014) Unifying logic and probability: A new dawn for AI? In: Information Processing and Management of Uncertainty in Knowledge-Based Systems: 15th International Conference, IPMU 2014, Montpellier, France, July 15–19, 2014, Proceedings, Part I, Springer, no. 442 in Communications in Computer and Information Science, pp 10–14

  • Sawin W, Demski A (2013) Computable probability distributions which converge on \(\Pi_1\) will disbelieve true \(\Pi_2\) sentences. Tech. rep., Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/Pi1Pi2Problem.pdf

  • Shannon CE (1950) XXII. Programming a computer for playing chess. Philosophical Magazine 41(314):256–275

  • Soares N (2014) Tiling agents in causal graphs. Tech. Rep. 2014–5, Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/TilingAgentsCausalGraphs.pdf

  • Soares N (2015) Formalizing two problems of realistic world-models. Tech. Rep. 2015–3, Machine Intelligence Research Institute, Berkeley, CA, https://intelligence.org/files/RealisticWorldModels.pdf

  • Soares N (2016) The value learning problem. In: Ethics for Artificial Intelligence Workshop at the 25th International Joint Conference on Artificial Intelligence (IJCAI-16). New York, NY, July 9th-15th

  • Soares N, Fallenstein B (2014) Toward idealized decision theory. Tech. Rep. 2014–7, Machine Intelligence Research Institute, Berkeley, CA, https://intelligence.org/files/TowardIdealizedDecisionTheory.pdf

  • Soares N, Fallenstein B (2015) Questions of reasoning under logical uncertainty. Tech. Rep. 2015–1, Machine Intelligence Research Institute, Berkeley, CA, https://intelligence.org/files/QuestionsLogicalUncertainty.pdf

  • Solomonoff RJ (1964) A formal theory of inductive inference. Part I. Information and Control 7(1):1–22

  • United Kingdom Ministry of Defence (1991) Requirements for the procurement of safety critical software in defence equipment. Interim Defence Standard 00-55, United Kingdom Ministry of Defence

  • United States Department of Defense (1985) Department of Defense trusted computer system evaluation criteria. Department of Defense Standard DOD 5200.28-STD, United States Department of Defense, http://csrc.nist.gov/publications/history/dod85.pdf

  • Vinge V (1993) The coming technological singularity: How to survive in the post-human era. In: Vision-21: Interdisciplinary Science and Engineering in the Era of Cyberspace, NASA Lewis Research Center, no. 10129 in NASA Conference Publication, pp 11–22, http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940022856.pdf

  • Wald A (1939) Contributions to the theory of statistical estimation and testing hypotheses. Annals of Mathematical Statistics 10(4):299–326

  • Weld D, Etzioni O (1994) The first law of robotics (a call to arms). In: Hayes-Roth B, Korf RE (eds) Proceedings of the Twelfth National Conference on Artificial Intelligence, AAAI Press, Menlo Park, CA, pp 1042–1047, http://www.aaai.org/Papers/AAAI/1994/AAAI94-160.pdf

  • Yudkowsky E (2008) Artificial intelligence as a positive and negative factor in global risk. In: Bostrom N, Ćirković MM (eds) Global Catastrophic Risks, Oxford University Press, New York, pp 308–345

  • Yudkowsky E (2011) Complex value systems in Friendly AI. In: Schmidhuber J, Thórisson KR, Looks M (eds) Artificial General Intelligence, Springer, Berlin, no. 6830 in Lecture Notes in Computer Science, pp 388–393, 4th International Conference, AGI 2011, Mountain View, CA, USA, August 3–6, 2011. Proceedings

  • Yudkowsky E (2013) The procrastination paradox. Brief technical note, Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/ProcrastinationParadox.pdf

  • Yudkowsky E (2014) Distributions allowing tiling of staged subjective EU maximizers. Tech. rep., Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/DistributionsAllowingTiling.pdf

  • Yudkowsky E, Herreshoff M (2013) Tiling agents for self-modifying AI, and the Löbian obstacle. Early draft, Machine Intelligence Research Institute, Berkeley, CA, http://intelligence.org/files/TilingAgents.pdf

Author information

Correspondence to Nate Soares.

Copyright information

© 2017 Springer-Verlag GmbH Germany

About this chapter

Cite this chapter

Soares, N., Fallenstein, B. (2017). Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda. In: Callaghan, V., Miller, J., Yampolskiy, R., Armstrong, S. (eds) The Technological Singularity. The Frontiers Collection. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-54033-6_5
