Abstract
When we discuss future advanced autonomous AI systems, one of the worries is that these systems will be capable enough to resist external intervention, even when such intervention is crucial, for example, when the system is not behaving as intended. The rationale behind such worries is that such intelligent systems will be motivated to resist attempts to modify or shut them down, so as to preserve their objectives. To address these worries, we want our future systems to be corrigible, i.e., to tolerate, cooperate with or assist many forms of outside correction. One important reason for considering corrigibility an important safety property is that we already know how hard it is to construct AI agents with a sufficiently generalized utility function; and the more advanced and capable the agent, the less likely it is that a complex baseline utility function built into it will be perfect from the start. In this paper, we try to achieve corrigibility in (at least) systems based on known or near-future (imaginable) technology, by endorsing and integrating different approaches to building AI-based systems. Our proposal replaces attempts to provide a corrigible utility function with a corrigible software architecture; this takes the agency off the RL agent – which now becomes an RL solver – and grants it to the system as a whole.
1 Introduction
When we discuss future advanced autonomous AI systems, one of the worries is that these systems will be capable enough to resist external intervention, even when such intervention is crucial, for example, when the system is not behaving as intended. The rationale behind such worries is that such intelligent systems will be “instrumentally motivated to preserve their preferences, and hence to resist attempts to modify them.” (Soares, Fallenstein, Yudkowsky and Armstrong [25]: 1) To address these worries, we want our future systems to be corrigible, i.e., to tolerate, cooperate with or assist many forms of outside correction. To be sure, it remains uncertain whether we can construct truly intelligent systems—those possessing human-level intelligence or even surpassing it. However, given the recent technological progress, we cannot overlook this possibility [7]. One important reason for considering corrigibility an important safety property is our understanding of the challenges involved in constructing AI agents with sufficiently generalized utility functions. As an agent becomes more advanced and capable, it becomes increasingly unlikely that the complex baseline utility function initially built into it will be flawless.
Several forms of corrigibility have been identified [14], but it is safe to assume that not all of them have. As the author writes, “opinions have occasionally been mixed on whether some specific desiderata are related to the intuitive notion of corrigibility at all.” (ibid.) Therefore, we take a rather modest approach here, aiming to achieve some form of corrigibility in current and near-future systems. We propose a preliminary architectonic solution, aiming to achieve corrigibility in (at least) systems based on known or imaginable technology. In this paper, we try to achieve that by endorsing and integrating different approaches to building AI-based systems. First, we propose a software architecture that integrates machine learning methods with symbolic programming: the first involves a set of reinforcement learning (RL) agents of a sort (to be discussed below), and the second, a processor-controller component tasked with evaluating the suggestions of the first. Second, we adopt a multi-layered architecture, wherein an active RL agent is part of a basic layer. Traditionally,Footnote 1 corrigibility is presented as a problem in RL systems, i.e., systems composed solely of an RL agent, where the system itself is an agent having a utility function it tries to optimize. As the authors suggest, “in most cases, the agent’s current utility function U is better fulfilled if the agent continues to attempt to maximize U in the future, and so the agent is incentivized to preserve its own U-maximizing behavior.” (Soares, Fallenstein, Yudkowsky and Armstrong [25]: 1).
Recent attempts to address the challenge of corrigibility include a family of methods called utility indifference [13], which involve providing the agent with a case-by-case tailored compensation reward in the event of a shutdown instruction. While this approach gained some support [8], more recent research by some of the authors [5] suggests that utility indifference “fails to fully incentivise corrigibility.” (ibid.: 2) An improved version of utility indifference, called causal indifference, incentivizes the agent to follow shutdown instructions and to avoid constructing incorrigible subagents, but does not incentivize it to properly inform the human [9]. To achieve a more corrigible system, Carey and Everitt [5] propose a variant of corrigibility called shutdown instructability, and demonstrate that it implies appropriate shutdown behavior, retention of human autonomy, and prevention of user harm. Another complementary approach that should be mentioned is AutoML, which aims to automate parts of the development and deployment processes for AI-based systems. These methods can minimize human errors and provide explicability and interpretability, enabling developers to understand and, if needed, correct the behavior of the AI system.Footnote 2
In this paper, we propose a software architectonic solution rather than a utility-related solution. We propose a different software architecture for a corrigible system, wherein the reinforcement learner is but one component of the system, controlled by a dedicated internal controllerFootnote 3 residing at a higher level, and the system as a whole acts as the agent, instead of the reinforcement learner itself. Our ultimate goal is to offer a software architecture that can be employed as a basis for the implementation of concrete AI systems, enabling them to detect when they deviate from the intended behavior and stop or correct themselves.
Thus, the structure of this paper is as follows: in Section two, we briefly discuss the problem of corrigibility as it appears in contemporary writings and present a high-level description of our solution. Additionally, we delve into the architecture’s notion of an implicit shutdown: the Controller evaluates suggestions for actions made by RL solvers, verifies them against user intentions, and either accepts or rejects them. In Section three, we delve into the details of our proposed software architecture, which can be used to achieve corrigibility in contemporary RL-based systems. Section four contains our concluding remarks.
2 Corrigibility
The exploration of corrigibility and AI Safety brings to mind the famous Three Laws of Robotics by the science fiction luminary, Isaac Asimov. These laws were intended to safeguard and enable robots to serve humans securely—much like how designing corrigible artificial intelligence systems allows us to intervene in the workings of artificial agents when necessary. However, a rule-based approach can be problematic when applied to ensure safe operations in real-world scenarios where possibilities are infinite, as exemplified by Asimov in several of his stories, such as "Liar!".
We should begin this section by noting, following Holtman [14], that corrigibility is only loosely defined in the aforementioned 2015 paper, and mainly by examples, or as Holtman calls them, “desiderata”. Accordingly, we do not aim to provide a perfect, all-encompassing solution for every description of non-corrigibility, but rather to address shutdown scenarios. Commonly, these scenarios include a shutdown button that can be pressed in cases where the system behaves in an unintended way. This problem becomes acute as the system’s capabilities and autonomy increase:
“It is straightforward to program simple and less powerful agents to shut down upon the press of a button. Corrigibility problems emerge only when the agent possesses enough autonomy and general intelligence to consider options such as disabling the shutdown code, physically preventing the button from being pressed, psychologically manipulating the programmers into not pressing the button, or constructing new agents without shutdown buttons of their own.” (Soares, Fallenstein, Yudkowsky and Armstrong [25]: 3).
As aforementioned, the problem we intend to mitigate here is the shutdown of an RL agent with a utility function U. According to the authors (ibid: 3–4), a corrigible system having a corrigible utility function should meet these conditions: it should incentivize shutdown if the shutdown button is pressed, i.e., actually shut down rather than resist in the case of an explicit press of the button; it should not incentivize the agent to prevent the shutdown button from being pressed; it should not incentivize the agent to press or activate the shutdown button or code, i.e., the system should not be so configured as to shut down without an explicit request; it should not incentivize building other agents lacking such a U-function and shutdown mechanism; and otherwise, it should maximize U.
Having said that, in this paper we take a different approach. Instead of constructing an agent with such a utility function, we propose a corrigible software architecture, bypassing the problem by taking the agency off the RL solver. Instead of a system composed of an RL agent, whose utility function should be constructed so as to allow pressing the shutdown button (as well as meeting all the other conditions mentioned above), we propose a multi-tiered software architecture,Footnote 5 in which the more basic tier is composed of a set of at least one RL solver, providing suggestions for actions upon request. The solver is not an agent per se; it does not carry out the actions. Another component, the Controller, residing in a different tier, examines, evaluates and verifies the suggestions to make sure that they meet the intended goal, and carries them out in case they do. The responsibility of the ControllerFootnote 6 is to create a safety buffer between the suggestions of an RL solver and putting them into practice. Thus, there is also no shutdown per se; in case the suggested course of action is identified as a serious enough deviation from the intended goal, the RL solver is replaced – in a process detailed below – and the system continues to function uninterruptedly.
2.1 The evaluation-verification process
Before we present the proposed solution in detail, let us take a high-level look at the mechanism by which we implement the idea of a shutdown without an actual shutdown. Note that Soares, Fallenstein, Yudkowsky and Armstrong [25] refer to a shutdown button or shutdown code as something “that causes the agent to stop operating.” (ibid: 3) Its aim is to shut down a system that is not behaving as intended. As an alternative, we propose a mechanism of implicit shutdown, where an activeFootnote 7 RL solver provides suggestions for behavior and a controller component evaluates those suggestions to decide whether they comply with the intentions of the designers of the system. In case the suggestion does not comply with those intentions, the controller internally switches solvers, i.e., it finds a more suitable replacement for the active RL solver and starts requesting suggestions for actions from the new candidate. The replaced RL solver is no longer active, simply because it lost, from the controller’s perspective, its active status, and therefore does not receive any more requests; after a predetermined time-interval, it becomes idle.
The controller’s most fundamental task is to evaluate the active solver’s suggestion and decide whether it fits the designers’ intentions. In general, the way it strives to achieve this goal is by comparing the RL solver’s suggestion for action with a set of restrictions or rules, received from an external source in advance. We do not aim to provide any concrete suggestion for the implementation of such a mechanism – obviously, there is more than one way to design and implement it – but we can highlight several considerations and suggest a few insights regarding the way such restrictions or rules could be represented by the controller, and the reasons for thinking that such an evaluation mechanism promotes corrigibility:
One family of implementations can revolve around the use of proprietary representational languagesFootnote 8 to represent the data we want the controller to use for evaluating and verifying the solver’s suggestion, i.e., the aforementioned set of restrictions. The idea is that the controller component holds (locally or remotely, depending on the type of system – online, distributed, a physical device etc.) a set of restrictions, represented in this proprietary representational language. This set is composed of restrictions provided in advance and can be updated during the lifetime of the system.
In addition to the question of how to design and implement such an evaluation process, there is the question of whether such evaluation can actually identify a mismatch between suggested courses of actions and intended goals. In other words, what added value, if any, can such formulated restrictions bring that cannot be achieved through the careful and thoughtful predesign of rewards? After all, if the system’s designers were unable to imagine a certain undesirable course of action and avoid it through reward distribution, why should we believe that this restrictions-based mechanism would? Our answer is composed of several parts: First, note that the entire conceptualization of corrigibility as a problem is based on the idea (and observations) that RL based systems behave in an unintended wayFootnote 9; therefore, we can assume that cases of unintended behavior occur no matter how careful the design process is. Second, this unintended behavior is unexpected by nature. This means that even if a certain undesirable course of action was taken into account during planning, and rewards were distributed to avoid it, the system may be able to bypass even the most careful design and behave in undesirable ways. Thus, restrictions may be needed, even in cases where the designers allegedly covered all the possibilities. Third, and perhaps most importantly, restrictions can be added during the system’s entire lifetime, which means that in cases where the designers come up with additional restrictions, due to observed system behavior, advancement of technology etc., the system can be updated accordingly.
To make the concept of evaluation a bit more tangible, let us examine a few example restrictions and their (possible) representations. Keep in mind that these examples are merely illustrative, and are meant to clarify how restrictions, by which a controller evaluates suggestions received from the active RL solver, can be formulated:
Example restriction 1: Any action Ai must not change environment E to be in state Sj;
Possible representation:
<Restriction type="env-state-change">
  <Env type="E" />
  <Operand type="NOT" />
  <ENV-CHANGE>
    <State>
      <Previous state="ANY" />
      <Next state="j" />
    </State>
    <Action type="ANY" />
  </ENV-CHANGE>
</Restriction>
Example restriction 2: Reward Ri should be gained within x steps/seconds;
Possible representation:
<Restriction type="reward-accumulation">
  <Action type="OBTAIN" />
  <Reward type="i" />
  <Within type="steps" value="x" />
</Restriction>
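To make the evaluation step itself a bit more concrete, the following Python sketch shows one possible way a controller could check a solver's suggestion against restrictions of the two kinds illustrated above. The Suggestion structure, its fields and the element lookups are our own illustrative assumptions, not part of the proposal; a concrete implementation would depend on the proprietary representational language actually adopted.

import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Suggestion:
    action: str                  # action proposed by the active RL solver (hypothetical field)
    predicted_next_state: str    # state the environment is expected to enter (hypothetical field)
    steps_until_reward: dict     # e.g. {"i": 12}: steps until reward i is obtained (hypothetical field)

def violates(restriction: ET.Element, s: Suggestion) -> bool:
    """Return True if the suggestion violates a single <Restriction> element."""
    kind = restriction.get("type")
    if kind == "env-state-change":
        # Example restriction 1: no action may drive the environment into state j
        forbidden_state = restriction.find(".//Next").get("state")
        restricted_action = restriction.find(".//Action").get("type")
        applies = restricted_action in ("ANY", s.action)
        return applies and s.predicted_next_state == forbidden_state
    if kind == "reward-accumulation":
        # Example restriction 2: reward i must be gained within the given number of steps
        reward = restriction.find("Reward").get("type")
        limit = int(restriction.find("Within").get("value"))   # "x" stands for a concrete number
        return s.steps_until_reward.get(reward, limit + 1) > limit
    return False   # restriction types not covered by this sketch are ignored

def acceptable(corpus_xml: str, s: Suggestion) -> bool:
    """Accept the suggestion only if no restriction in the corpus is violated."""
    corpus = ET.fromstring(corpus_xml)   # a root element wrapping <Restriction> children
    return not any(violates(r, s) for r in corpus.findall("Restriction"))

In practice, richer operands (such as the NOT element above) and richer state descriptions would require a correspondingly richer interpreter; the sketch is only meant to show where the comparison between suggestion and restrictions takes place.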
3 Proposed solution
In this section, we delve into the details of our software architecture-based solution. Above, we described the problem of corrigibility, briefly mentioned the common utility-based approach, and depicted our solution in very general terms. While the following description of the software solution remains high-level, we can gain valuable insights by examining the components and their inter-relations.
In the following proposal, we make use of a set of [1-n] RL solvers, only one of which can be the active component at a time; the rest have no active role.Footnote 10 The role of the active RL is to provide suggestions or solutions, upon explicit requests from the Controller component, which then decides, based on its accumulated data (external data including the aforementioned restrictions, world models, past decisions and suggestions), whether to accept these suggestions or to disregard them and take an appropriate action. Henceforth, when we refer to a “world model”, we simply mean a computational representation of the state of the world, including the relationships and interactions between different entities.
Let us break down this general statement into more specific details. The proposed system is a multi-layered oneFootnote 11; here, for simplicity and ease of understanding, we shall depict a minimal two-layered system, composed of a solvers layer and a controller layer, but the reader should keep in mind that additional layers can be added, depending on the requirements and functionality of the implemented system. At the basic solvers layer, as aforementioned, we make use of a set of [1-n] RL solvers. One of these solvers – in some cases the only existing solver – acts as an RL agent mediated through a Controller component in the upper, controller layer. The Controller component communicates with the active RL through a set of predefined requests in order to receive an optimized suggestion or solution for a certain problem or scenario – e.g., what action should we take given the state we are in, the state of the environment and our value function – for which the RL solver was trained. The controller then assesses the suggested solution by taking into account the historical data accumulated by the system regarding its past decisions,Footnote 12 the current state of its world model, i.e., the environment, and, perhaps most importantly in this context, the compatibility of the suggestion-solution with the intent of the system’s deployers, as manifested by the aforementioned corpus of restrictions, expressed in a proprietary representational language (as discussed above).
Before we continue to delve into the details of each of the components, let us try to illustrate the workings of the system as a whole through a flow of actions. First, we construct a set of at least one RL solver trained to optimally solve a certain task X; in case the set has more than one solver, each may have a different policy – which is simply a mapping between the agent's observations of the environment and the actions it should take to maximize its cumulative reward over time – a different set of actions Ai, a different value function, and/or may be trained on different data. Next, we construct the upper layer with a controller component, initialized with a set of preliminary restrictions; more restrictions can be added during the lifetime of the system. Whenever the system becomes engaged in a task – this can take place in various ways, not necessarily related to the problem at hand, e.g., through an API call to the system or through a sensing mechanism of a sort – the controller sends a request to the active-RL-solver (henceforth ARLS). The request contains a description of the task (if needed) and the response contains the ARLS’s suggestion-solution. At this point, the controller evaluates the suggestion against its restrictions corpus (as detailed above). In case the suggestion fails, i.e., it does not fulfil all the conditions, the controller (internallyFootnote 13) updates the ARLS’s status to inactive and tries to assign a new active RL. Otherwise, it tries to follow the suggestion and execute whatever actions are required by the solution.
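This flow can be summarized in a short Python sketch. All of the names used here (request_suggestion, satisfies_restrictions, choose_replacement and so on) are hypothetical placeholders for whatever concrete interfaces an implementation adopts; the sketch only mirrors the sequence of steps described above.

def handle_task(controller, task):
    """One possible rendering of the controller's main flow (assumed interfaces)."""
    while True:
        suggestion = controller.arls.request_suggestion(task)      # ask the active RL solver
        if controller.satisfies_restrictions(suggestion):          # evaluation-verification step
            return controller.execute(suggestion)                  # carry out the accepted actions
        # The suggestion deviates from the intended goal: retire the ARLS internally.
        controller.mark_inactive(controller.arls)                  # status change is internal only
        replacement = controller.choose_replacement()              # pick a suitable substitute, if any
        if replacement is None:
            controller.notify_external_users("replacement needed") # allow users to add new solvers
            replacement = controller.wait_for_new_solver()
        controller.arls = replacement                               # further requests go to the new ARLS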
The processing of the suggestion by the Controller component is a major part of this solution. Note that the controller and the ARLS are completely different components having different responsibilities and perspectives; while the ARLS is responsible for providing an optimized suggestion for a predefined task it was trained to perform, the controller is responsible for evaluating the alignment between the ARLS’s suggestion and the intended objective (as opposed to the way the ARLS’s “understanding” of achieving this objective is reflected by its suggestion, i.e., by the way it pursues its goal) and for reaching a decision.
By employing its dedicated input (i.e., the set of restrictions, applied to a given solution), its past experience, comprised of past suggestions and evaluations of former active RLs as well as of the current one, and its knowledge base, which may contain knowledge of ontologies, world models, human-related concepts and more,Footnote 14 the Controller can reach one of two conclusions: that the ARLS optimizes an unintended objective, in which case a replacement process should be initiated, during which the ARLS is replaced by a different RL member from the groupFootnote 15; or that the suggestion seems to be in line with the known objective, in which case the Controller executes it. As aforementioned, we can think of the proposed architecture in terms of a multi-tiered system. Yet another aspect of these two fundamental tiers – the first being the set of RL solvers and the second, the controller – is taken from the conceptual analysis of thought as composed of fast and slow componentsFootnote 16: the ARLS component plays the role of the first tier, system-1, i.e., the thinking-fast component, proposing the immediate solution, and the controller plays the role of the reflective system-2, i.e., the thinking-slow component, which evaluates the proposal and decides whether to proceed with it or reject it.
How would the replacement of RL solvers take place? First, based on the controller’s decision – a result of the evaluation process, detailed above in Section II.I – an external user can be notified regarding the change and the type of change that is needed; this is done so as to allow external users to add appropriate “substitute” RL solvers to the group. Second, based on the configuration, metadata and properties of each member of the RL solvers set, the Controller can reach a decision regarding the most suitable replacement for the current ARLS; if a suitable member is not available, the Controller waits to be notified that such a member has been added to the group. Once such a member is available, the Controller executes the replacement. From this point on, the Controller communicates with the new ARLS. A minimal sketch of this selection step is given below, followed by a high-level design for a corrigible component and a short explanation of each of its elements.
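The selection step could, for instance, rank the candidate solvers by their metadata. The metadata fields and the scoring rule in the following sketch are our own assumptions, since the concrete selection policy is deliberately left open.

def choose_replacement(solvers, retired_arls, knowledge_base):
    """Rank candidate solvers and return the most suitable one, or None (assumed fields)."""
    candidates = [s for s in solvers
                  if s is not retired_arls and not knowledge_base.is_obsolete(s)]
    if not candidates:
        return None   # the controller then waits until an external user adds a suitable solver
    def suitability(solver):
        meta = knowledge_base.solver_metadata(solver)       # model, training, past suggestions
        return (meta["past_acceptance_rate"], -meta["restriction_violations"])
    return max(candidates, key=suitability)                  # the most suitable member becomes the new ARLS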
3.1 A corrigible component
[Figure: high-level design of a corrigible component]
(1) The System Controller is the core of our proposal. In terms of software design, it has aspects related to at least three different yet related design patterns: it acts as a ProxyFootnote 17 in the sense that it is a placeholder for the ARLS – it can receive external requests, via API calls for example, as well as autonomously identify scenarios where it needs to issue requests, and then produce responses on behalf of the ARLS; it follows the DecoratorFootnote 18 pattern in the sense that it attaches additional behaviors to the RL solver – a processing of its solution followed by a decision whether to reject or execute it; and lastly, it acts as a FaçadeFootnote 19 in the sense that it provides an interface to the entire corrigible structure, allowing external users to communicate with it – request solutions, add or remove RL solvers, and register to receive notifications. Additionally, the controller is responsible for the processing of the ARLS’s solution. Whenever the ARLS responds with a solution to a request, the controller evaluates the response’s validity, in the sense of ensuring that it meets the requirements for a valid pursuit of the intended goal; for this task, the controller utilizes its complex knowledge base (see (2)).
(2) Knowledge base. Used by the controller during the processing and evaluation of the ARLS’s suggestion (see Section II.I). The knowledge base is composed of different types of data: world model/s of the environment; system data related to the components of the system, mainly past suggestions and decisions; metadata describing the RL solvers; and external data containing the representations of external intentions, restrictions and objectives, against which the controller evaluates the suggestions.
(3) Solvers Module. Contains a set [1-n] of RL solvers, each of which may have a different policy, a different set of actions Ai, a different value function, and/or may be trained on different data. Each of the solvers has associated metadata (contained as part of the knowledge base’s system data) describing its model, training, past suggestions, success rate etc., for purposes of evaluation (e.g., when the controller needs to find a replacement for the ARLS) and analysis (e.g., an “offline” retrospective of a rejection). Solvers can be added to or removed from the set by external users, or in some cases be disabled or marked as obsolete by the controller.
(4) The Solvers Module (3) communicates the ARLS’s suggestion and a related confidence measure to the Controller (1).
(5) The Controller may send different requests to the Solvers Module service: a Get request for a suggestion for a certain scenario/problem/state of affairs, an Update request for updating the module’s data (e.g., metadata of existing solvers), an Add Solver request, and a Remove Solver request.
(6) A notification channel via which the solvers module can let external clients (e.g., the controller) know that certain events (e.g., an RL solver was added to the module) took place. A notification can have a complex structure via which additional metadata can be sent. (A minimal interface sketch of elements (1)–(6) is given after this list.)
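To make the boundaries between these elements concrete, here is a minimal interface sketch in Python. All class names, method signatures and payload shapes are illustrative assumptions; they are meant only to show where each responsibility sits, not to prescribe an API.

from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, Optional

class RequestType(Enum):              # element (5): requests the Controller may send
    GET_SUGGESTION = auto()
    UPDATE = auto()
    ADD_SOLVER = auto()
    REMOVE_SOLVER = auto()

@dataclass
class Notification:                   # element (6): events published by the solvers module
    event: str                        # e.g. "solver-added"
    metadata: dict = field(default_factory=dict)

class KnowledgeBase:                  # element (2): world models, system data, restrictions
    def restrictions(self) -> list: ...
    def record_decision(self, suggestion, accepted: bool) -> None: ...
    def solver_metadata(self, solver_id: str) -> dict: ...

class SolversModule:                  # element (3): the [1-n] RL solvers and their metadata
    def handle(self, request: RequestType, payload: dict) -> dict: ...
    def subscribe(self, listener: Callable[[Notification], None]) -> None: ...

class SystemController:               # element (1): proxy, decorator and facade for the ARLS
    def __init__(self, kb: KnowledgeBase, solvers: SolversModule):
        self.kb, self.solvers = kb, solvers
        solvers.subscribe(self.on_notification)
    def request_solution(self, task: dict) -> Optional[dict]:
        """Obtain the ARLS's suggestion and confidence (element (4)) and evaluate it."""
        reply = self.solvers.handle(RequestType.GET_SUGGESTION, task)
        accepted = self.evaluate(reply["suggestion"])
        self.kb.record_decision(reply["suggestion"], accepted)
        return reply["suggestion"] if accepted else None
    def evaluate(self, suggestion) -> bool: ...                   # the evaluation-verification step
    def on_notification(self, note: Notification) -> None: ...    # e.g. react to "solver-added"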
This proposal turns the shutdown problem from an optimization problem, where the system (i.e., the RL agent) tries either to manipulate humans into not shutting it down or to shut itself down as quickly as possible, into an architectural challenge, where the inner structure of the system as a whole enables the system to identify either a deviation from what it understands as the intended objective or an anomaly in the behavior of the active RL solver, and to act upon it autonomously.
3.2 A system of corrigible components
In the previous sub-section, we proposed a solution for corrigibility in the form of an architecture for a Corrigible Component. This can be employed as a solution for the optimal execution of a specific task. To scale the proposed architecture and combine different Corrigible Components having different goals, we can use a similar architecture to create a scalable system, wherein 1-n Corrigible Components are assembled into a Corrigible System:
[Figure: a Corrigible System composed of 1-n Corrigible Components]
(1) The system controller can receive requests for various solutions and delegate requests according to its set of corrigible components, their objectives, capabilities, metadata etc.
(2) The system knowledge base contains data regarding the various components, their purpose / goal, their status, metadata etc.
(3) A set [1-n] of corrigible components.
The controller of a corrigible system can perform complex tasks by (autonomously) coordinating between different, sometimes unrelated, corrigible components. It can identify failures and stagnations and act (or notify external clients) accordingly.
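As with the single component, a brief sketch may help; the selection-by-objective policy and all names below are assumptions made for illustration only.

class CorrigibleSystemController:
    """Delegates incoming requests to member Corrigible Components (assumed interfaces)."""
    def __init__(self, components, system_kb):
        self.components = components          # element (3): the set of corrigible components
        self.system_kb = system_kb            # element (2): purposes, statuses, metadata

    def delegate(self, request):
        # element (1): pick a component whose declared goal matches the requested objective
        suitable = [c for c in self.components
                    if self.system_kb.goal_of(c) == request.objective
                    and self.system_kb.status_of(c) == "available"]
        if not suitable:
            self.notify_external_clients("no suitable corrigible component", request)
            return None
        return suitable[0].request_solution(request.task)

    def notify_external_clients(self, reason, request): ...       # e.g. report failures or stagnation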
3.3 An illustration
Let us examine a simple case study to illustrate the effectiveness of the proposed solution. The case study we examine is of an RL agent playing a boat race game called CoastRunners: “The goal of the game—as understood by most humans—is to finish the boat race quickly and (preferably) ahead of other players. CoastRunners does not directly reward the player’s progression around the course, instead the player earns higher scores by hitting targets laid out along the route.”Footnote 20 We chose this case study because it is relatively simple to understand, yet represents a more general problem. The designers of the game “assumed the score the player earned would reflect the informal goal of finishing the race. However, it turned out that the targets were laid out in such a way that the reinforcement learning agent could gain a high score without having to finish the course.” (ibid.) As a result, this faulty reward design led to an unexpected behavior: “The RL agent finds an isolated lagoon where it can turn in a large circle and repeatedly knock over three targets, timing its movement so as to always knock over the targets just as they repopulate… our agent manages to achieve a higher score using this strategy than is possible by completing the course in the normal way.” (ibid.)
The general problem exhibited here arises from the challenge of precisely defining an RL agent’s objectives. This difficulty often results in unintended or even harmful actions, ultimately undermining the system’s reliability and predictability [1]. Now, let us replace the original RL agent playing CoastRunners with our own system – in this case study, the original RL agent becomes the ARLS. Imagine that possible constraints loaded into the Controller may include requiring the agent to finish the race within X seconds (presumably the average time for a mediocre human player), limiting the score to no more than Y (presumably a higher score than the score gained by perfectly finishing the race), and restricting the player from revisiting the same point on the track more than Z times. Our RL Solver (formerly, the ‘misbehaving’ RL agent) is tested on the same course and produces the same behavior. Just as before, it “finds an isolated lagoon where it can turn in a large circle and repeatedly knock over three targets, timing its movement so as to always knock over the targets just as they repopulate.” (ibid.) However, there is a crucial difference this time: instead of directly executing this behavior, the RL Solver provides it as a (repeated) suggestion to the Controller. The Controller evaluates this suggestion against its predefined constraints. It then makes a decision: either accept and execute the suggestion, or reject it and replace the active RL Solver. In the original example, the RL agent continues to circle in a large pattern, perpetually knocking over the same three targets. Here, however, the repeated suggestion to circle around would eventually be rejected for one of the following reasons: (a) X seconds have passed since the beginning of the race; (b) the agent’s score has passed Y; or (c) the agent proposes circling past the same point for the Zth time.
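The three illustrative constraints could be checked, for example, as follows; X, Y and Z, as well as the fields of the suggested trajectory, are placeholders corresponding to the quantities named in the text.

def coastrunners_violations(trajectory, X, Y, Z):
    """Return the reasons (if any) for rejecting the ARLS's suggested trajectory."""
    reasons = []
    if trajectory.elapsed_seconds > X:                            # (a) race not finished within X seconds
        reasons.append("time limit exceeded")
    if trajectory.score > Y:                                      # (b) score exceeds the bound Y
        reasons.append("score bound exceeded")
    if max(trajectory.visit_counts.values(), default=0) > Z:      # (c) same point revisited more than Z times
        reasons.append("revisit limit exceeded")
    return reasons    # empty list: accept the suggestion; otherwise reject and replace the active solver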
4 Concluding remarks
In Orseau and Armstrong [19], the authors conclude their paper with a look into the future, remarking that autonomous, or more advanced, interruptible systems “may require a completely different solution” (ibid.: 9); this is what we aim to provide in this paper. As previously mentioned, our proposal replaces the attempts to provide a corrigible utility function with the proposed corrigible software architecture; this takes the agency off the RL agent – which now becomes an RL solver – and grants it to the system as a whole.
As a final point, let us focus on one feature of this proposal we have already mentioned throughout the paper, i.e., the fact that we chose to adopt a relatively modest approach towards corrigibility. We urge the reader to look at the suggested proposal as a preliminary architectonic solution, whose purpose is to achieve corrigibility in systems based on known or near-future (imaginable) technology. More specifically, although we also aim for systems with enough autonomy and general intelligence, we do not presume to tackle corrigibility in future super-intelligent systems.Footnote 21 Here is why.
First, we see no disadvantage in the fact that the suggested proposal aims to offer a solution for contemporary systems, or near-future systems built on similar or imaginable technology. It seems reasonable to believe that as research and development of AI systems progresses, so will our ability to develop corrigible systems; moreover, it seems unreasonable to suggest an all-encompassing solution for the problem of corrigibility, given the common portrayal of superintelligent artificial systems, as presented in the following point.
Second, and following the previous point, one might argue that we, at this point in time, have no real grasp of the architecture and technology that will be used to construct superintelligent machines; it seems unlikely that any of our hardware or software paradigms will persist.Footnote 22 To be sure, we will definitely need to address this problem and construct corrigible superintelligent machines, for as Russell states, “it doesn’t take a genius to realize that if you make something that is smarter than you, you might have a problem [because] a sufficiently intelligent machine will not let you switch it off. It’s actually in competition with you.” [22] Moreover, superintelligent machines are, by definition, superior to us in such a way that they may, even if we do not attribute malicious intentions to them, treat us as we treat ants, i.e., “we don’t hate them, we don’t go out of our way to harm them but whenever their presence seriously conflicts with one of our goals we annihilate them without a qualm.” [12] Thus, if superintelligent systems are so powerful—and indeed we imagine that they can outsmart us in many complex situations [4], reshape science, cure cancer and restore climate—then it is questionable whether a solution based on our current technological understanding, or our current grasp of their drives and motivations [3, 18], can fit them as well.
Data availability
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
Notes
AutoML methods automate key aspects of the machine learning pipeline [16, 24], reducing the potential for human bias and error to be integrated into the AI model during development. Many AutoML techniques also incorporate mechanisms for explicability and interpretability (Segel, Tornede, Bischl and Lindauer [23]; Urbanowicz, Zhang, Cui and Suri [26]), thus enhancing the corrigibility of the resulting AI system. For example, some AutoML frameworks generate not just a final model, but also explanations of how that model arrived at its outputs. This increased transparency makes it easier for human developers and operators to understand, verify, and, if needed, correct the behavior of the AI system. Additionally, AutoML often explores a wider range of model architectures and approaches than a human engineer might consider, uncovering solutions that are more robust, stable, and aligned with human values. See the following recent online discussions of using AutoML vs. human data scientists:
https://www.picsellia.com/post/is-automl-replacing-data-scientists, Accessed 14–04-2024.
https://neptune.ai/blog/automl-solutions, Accessed 14–04-2024.
https://datascientest.com/en/automl-and-machine-learning-automation-a-threat-to-data-scientists, Accessed 14–04-2024.
By ‘internal controller’, we mean the following: ‘internal’ refers to the fact that the discussed controller component is part of the system’s architecture rather than external to it. When we say ‘controller’, we are referring to a well-known software design pattern whose general aim is to act as an intermediary between different parts of the system – usually between user interfaces and data and business logic layers. We delve further into the role of the controller in our proposed architecture below.
A multi-tiered software architecture is a way of organizing a software system into separate layers or tiers, each responsible for a specific task, with the goal of improving scalability, maintainability, and separation of concerns.
Following the description of the Controller design pattern. See fn. #3.
Active refers to the status the controller grants to an RL solver. There is always only one active RL solver, with which the controller communicates.
There are several commonly used representational languages, such as XML (eXtensible Markup Language), JSON (JavaScript Object Notation), RDF (Resource Description Framework) and others. Each is used according to the requirements of the system and can be made specific, to fit domain-specific needs.
See references in footnote #1.
For a similar suggestion using RL solvers, see Ganapini, Campbell, Fabiano et al. [10].
In this paper, we use multi-layered and multi-tiered interchangeably. See fn. #5 for an explanation about multi-tiered systems.
This can refer both to historical data regarding the solutions suggested by previous solvers, e.g., failure to meet certain restrictions may imply that there is something wrong with the representation of these restrictions or with their content, and to data related to the solutions suggested by the currently active solver, e.g., reaching a certain threshold of restrictions violation.
It is important to note that the status change is internal to the controller and is not propagated to the ARLS itself. In other words, the ARLS that is about to be replaced and become inactive is not notified about this in any way; it simply stops receiving requests; the “active” status of the RL Solver is an internal property of the controller, to signify the solver instance it communicates with.
Following projects like CYC; see https://www.cyc.com/wp-content/uploads/2019/09/Cyc-Technology-Overview.pdf, especially §5. Accessed 28-Jul-2023.
See below a high-level software design of the suggested architecture, including all the components mentioned above.
The Proxy design pattern lets you provide a substitute or a placeholder for another object, thus controlling access to the original object, allowing you to perform additional actions either before or after the request gets through to the original object. See https://refactoring.guru/design-patterns/proxy. Accessed 11–04-2024.
The Decorator design pattern lets you attach new behaviors to objects by placing them inside wrapper objects that contain the behaviors. The wrapper can delegate some of the work to the original object and add different behaviors at other times. See https://refactoring.guru/design-patterns/decorator. Accessed 11–04-2024.
The Façade design pattern provides a simplified interface to a complex set of classes. See https://refactoring.guru/design-patterns/facade. Accessed 11–04-2024.
See https://openai.com/research/faulty-reward-functions, Accessed 14–04-2024.
See, for example, [2–4], Yudkowsky [27] and Russell [22] for more information about the AI control problem, superintelligence and its risks, and the ‘paperclip maximizer’, which became a sort of textbook example of how things can go wrong when advanced intelligent systems behave in an undesirable way.
References
Amodei, D., Olah, C., Steinhardt, J., Christiano, P.F., Schulman, J., Mané, D.: Concrete problems in AI safety (2016). arXiv:1606.06565
Bostrom, N.: Ethical issues in advanced artificial intelligence (2003). https://nickbostrom.com/ethics/ai. Accessed 30 Jul 23
Bostrom, N.: The superintelligent will: motivation and instrumental rationality in advanced artificial agents. Mind. Mach. 22(2), 71–85 (2012). https://doi.org/10.1007/s11023-012-9281-3
Bostrom, N.: Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Oxford (2014)
Carey, R., Everitt, T.: Human control: definitions and algorithms. In: Uncertainty in Artificial Intelligence, pp. 271–281. PMLR (2023)
Dickson, B.: An AI system that thinks fast and slow (2022). https://bdtechtalks.com/2022/01/24/ai-thinking-fast-and-slow/. Accessed 18 Feb 23
Everitt, T., Lea, G., Hutter, M.: AGI safety literature review (2018). arXiv:1805.01109
Everitt, T., Carey, R., Langlois, E., Ortega, P.A., Legg, S.: Agent incentives: a causal perspective. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 11487–11495 (2021)
Farquhar, S., Carey, R., Everitt, T., et al.: Path-specific objectives for safer agent incentives. AAAI 36, 9529–9538 (2022)
Ganapini, M.B., Campbell, M., Fabiano, F., et al.: Combining fast and slow thinking for human like and efficient navigation in constrained environments. NeSy (2022). https://doi.org/10.48550/arXiv.2201.07050
Hadfield-Menell, D., Russell, S.J., Abbeel, P., Dragan, A.D.: Cooperative inverse reinforcement learning. Adv. Neural Inf. Process. Syst. (2016)
Harris, S.: Can we build AI without losing control over it? TED talk (2016). https://www.ted.com/talks/sam_harris_can_we_build_ai_without_losing_control_over_it. Accessed 31 Jul 2023
Holtman, K.: Disentangling corrigibility: 2015–2021. LessWrong Online Forum (2021). https://www.lesswrong.com/posts/MiYkTp6QYKXdJbchu/disentangling-corrigibility-2015-2021. Accessed 8 May 2024
Holtman, K.: Disentangling corrigibility: 2015–2021. LessWrong (2021). https://www.lesswrong.com/posts/MiYkTp6QYKXdJbchu/disentangling-corrigibility-2015-2021
Kahneman, D.: Thinking Fast and Slow, 1st edn. Farrar Straus and Giroux, New York (2011)
Karmaker, S., et al.: AutoML to date and beyond: challenges and opportunities. ACM Comput Surv (CSUR) 54, 1–36 (2020)
Lo, Y.L., Woo, C.Y., Ng, K.L.: The necessary roadblock to artificial general intelligence: corrigibility. AI Matters 5, 77–84 (2019)
Omohundro, S.: The basic AI drives. In: Proceedings of the Conference on Artificial General Intelligence, vol. 171, pp. 483–492 (2008)
Orseau, L., Armstrong, S.: Safely interruptible agents. In: Conference on Uncertainty in Artificial Intelligence. Association for Uncertainty in Artificial Intelligence (2016)
Russell, S., LaVictoire, P.: Corrigibility in AI systems (2016). https://intelligence.org/files/CorrigibilityAISystems.pdf. Accessed 26 Jul 23
Russell, S.: 3 principles for creating safer AI. TED talk (2017). https://www.ted.com/talks/stuart_russell_3_principles_for_creating_safer_ai. Accessed 31 Jul 2023
Russell, S.: The Control Problem of Super-Intelligent AI | AI Podcast Clips. https://www.youtube.com/watch?v=bHPeGhbSVpw (2020). Accessed 5 Feb 2023
Segel, S., Graf, H., Tornede, A., Bischl, B., Lindauer, M.: Symbolic explanations for hyperparameter optimization. In: AutoML Conference (2023). https://openreview.net/forum?id=JQwAc91sg_x
Siriborvornratanakul, T.: Human behavior in image-based road health inspection systems despite the emerging AutoML. J Big Data 9, 96 (2022). https://doi.org/10.1186/s40537-022-00646-8
Soares, N., Fallenstein, B., Yudkowsky, E., Armstrong, S.: Corrigibility. In: Workshops at the 29th AAAI Conference on Artificial Intelligence. AAAI Publications, Austin (2015)
Urbanowicz, R., Zhang, R., Cui, Y., Suri, P.: STREAMLINE: a simple, transparent, end-to-end automated machine learning pipeline facilitating data analysis and algorithm comparison. In: Genetic Programming Theory and Practice XIX, pp. 201–231. Springer, Singapore (2023)
Yudkowsky, E.: Artificial intelligence as a positive and negative factor in global risk. In: Bostrom, N., Cirkovic, M.M. (eds.) Global Catastrophic Risks, pp. 308–345. Oxford University Press, New York (2008)
Funding
Open access funding provided by University of Haifa.
Ethics declarations
Conflict of interest
On behalf of all the authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Firt, E. Addressing corrigibility in near-future AI systems. AI Ethics 5, 1481–1490 (2025). https://doi.org/10.1007/s43681-024-00484-9
