1 Introduction

The fundamental cognitive problem for active organisms is to decide what to do next in changing environments. According to representationalists, neurocognitive systems rely on representations when solving this problem. These internal model-like states allow organisms to plan and control sequences of behavior.Footnote 1 They enable adaptive, flexible and goal-directed action.

According to critics, however, the appeal to representations distracts from, rather than guides, the study of cognitive phenomena (Chemero & Silberstein, 2008; Stepp et al., 2011; van Gelder & Port, 1995). For some, the representationalist framework rests on a misinterpretation of what the experimental research actually shows. And some, such as recent radical enactivists, argue that there is no satisfactory account of how contentful representational states drive action (Hutto & Myin, 2020). Instead, enactivists demand, we should explain action in terms of reactivations and re-enactments (Hutto & Myin, 2020).Footnote 2

In their arguments, sensorimotor action control is typically taken as a paradigmatic example of a non-representational phenomenon. For example, Myin and Hutto (2015, p. 62) write: “… acts of perceptual, motor, or perceptuomotor cognition—chasing and grasping a swirling leaf—are directed towards worldly objects and states of affairs, or aspects thereof, yet without representing them”. Cognitively, for a small and simple creature, such as a fly with little inertia, short temporal delays and few action choices, non-representational “reactivations” and “directedness” may suffice. For a human-sized complex organism with multiple possible actions, they may not.

When a human grasps a swirling object, a cognitive system must integrate information across multiple sensory sources (vision, touch, kinesthesia). It must also control multiple effectors (eyes, limbs, posture) in a goal-directed, temporally organised and purposeful way (Fiehler et al., 2019; Mischiati et al., 2015; Wolpert et al., 2011).Footnote 3 Furthermore, the environments in which agents operate are constantly changing. To succeed in such circumstances, agents cannot rely only on actions that have been effective in the past. Instead, agents must also explore options they have not tried before. Thus, they anticipate, prepare and plan. This may require surprisingly sophisticated cognitive synchronization, coordination and prediction abilities (Fiehler et al., 2019; Hayhoe, 2017).

In this paper, we analyse how reinforcement learning (RL) is used to study action control. RL is a computational framework in which the focus is on automating goal-directed learning and decision-making.Footnote 4 The heart of RL is an elegant and efficient trial-and-error algorithm. Its goal is to learn an optimal action policy, which maximizes the expected rewards in a given environment. The RL algorithm learns by exploring the possible actions and by observing their consequences.

RL-based action control models are more sophisticated than, say, the simple proportional feedback models of 1960s control theory. Today, RL has an impressive range of successful applications in AI, robotics and the cognitive sciences. It is deployed, for example, in controlling the movements of robotic arms, navigating routes in autonomous vehicles, predicting stock markets and playing various games. In computational neuro- and cognitive sciences, the framework is used to study various forms of action and motor control, memory, decision-making and learning, and it is supported by a growing body of empirical and theoretical evidence.Footnote 5

Crucially, RL-based action control is typically given a representational interpretation. It, however, differs from the portrait of representationalism drawn by recent enactivists. In enactivistic arguments, the paradigmatic case of a cognitive representation is a “percept”, a sensory-like state that is used to “represent how things are with the world” (Hutto & Myin, 2020). The task of representations is to provide information about the environment, to which they are connected via sensory contact (Hutto, 2015).

In RL, the goal of the algorithm is to detect opportunities for successful action, not to construct descriptions of the world. The algorithm’s computational objective is to learn the best possible action policy, in light of reward maximization. To do so, it utilizes representational states. These states, however, are not “percepts”.Footnote 6 Instead, they are goal-directed representations, which stand in for the estimated outcomes of action. Thus, this framework challenges the enactivist presuppositions about what action control representations are, and what they are used for.

2 Basics of reinforcement learning

Reinforcement learning (RL) is a computational framework, which focuses on automating goal-directed learning and decision-making.Footnote 7 Historically, RL is based on the work of early behaviorists. Thorndike’s (1911) “Law of Effect” described how reinforcing events (i.e., reward and punishment) affect the tendency to select actions. For Pavlov (1927) and Skinner (1937), Thorndike’s law provided a scientifically acceptable description for the mechanisms of conditioning. It offered a way to remove suspicious mentalistic concepts from psychology and neurosciences.

Later, Turing proposed an overall architecture for a “pleasure-pain system” in computers (Sutton & Barto, 2018). The actual formalisms of the RL algorithm were formulated in the 1970s and 1980s, when computer scientists found a way to combine Optimal Control Theory, Temporal Difference Learning and Learning Automata.Footnote 8 Over the years, the approach has been developed further and complemented by a number of technical and conceptual additions.Footnote 9 As it stands, it is a family of efficient and sophisticated learning algorithms, which are widely used in computational cognitive and neurosciences, artificial intelligence and robotics.

In the 1990s, when cognitive neuroscientists began to utilize RL to model higher brain function, the initial “behavioristic” interpretation of reinforcement learning changed (Niv, 2009). Namely, one of their key theoretical insights was the idea that RL can be used to describe how neurocognitive systems learn by representing value.Footnote 10 Perhaps ironically, neuroscientists thus gave a thoroughly representational reinterpretation to a framework that was originally intended to eliminate such conceptualizations from the brain and behavioral sciences.

This reinterpretation of RL, however, does not fit well with recent enactivist views on representations. In enactivistic arguments, the task of representations is to provide information about the external environment (Hutto, 2015; Hutto & Myin, 2020). Representations are states, which (should) tell how the world is. In RL, however, the goal of the algorithm is not to detect the environment “in a correct way”, but to learn the best possible action policy. That is, in RL, representations are goal-directed states.Footnote 11 Their success is not assessed by “veridicality” but by reward maximization.

To see this more clearly, we must unpack some core concepts of RL and take a look at a concrete example of an RL-based action planner.Footnote 12 Before that, some words of warning are in order. First, in this article we describe the algorithm in a simplified and vernacular way. Second, the concepts—such as an agent, an environment, learning, or rewards—are theoretical terms in computer sciences. They should not be confused with their counterparts in everyday language. For example, an agent does not refer to a whole organism, but to a formal entity specified in the formal description of the algorithm. Likewise, an environment is not synonymous with the physical, concrete world around us. Instead, in RL “environment” denotes the so-called synthetic, technical environment: the state space of the RL model. It is the formally specified world with which the algorithm interacts.

2.1 Core concepts of reinforcement learning

Generally, in RL, an agent learnsFootnote 13 what to do by exploring the possible actions and by observing the consequences of its actions. The agent is not told which actions to take. Instead, it must discover which actions yield the most reward in the long run by trying them out (Fig. 1).Footnote 14

Fig. 1 The organization of a RL-algorithm

The goal of an RL algorithm is to learn an optimal action policy that maximizes the expected cumulative reward over time in a given environment. An action policy is, roughly, a strategy that an agent uses in pursuit of its goals. The policy dictates the actions that the agent takes as a function of (the agent’s estimate of) the current state.

To learn the optimal action policy, RL algorithms must come to “know” the expected long-term cumulative reward, or value, of each state. Given estimates of each state’s value and of how actions affect state transitions, the optimal action policy is to take the action that, on average, leads to the highest value in the next state: $a_t = \arg\max_a P(s_{t+1} \mid s_t, a, M')\, V(s_{t+1})$.Footnote 15
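As a toy illustration of this selection rule, consider the following minimal Python sketch. The transition model `M`, the value table `V`, the state and action names, and the helper function are hypothetical placeholders rather than parts of any particular RL library.

```python
# Minimal sketch of one-step greedy action selection (hypothetical names).
# M[state][action] maps each possible next state to its probability,
# and V[state] holds the current estimate of that state's long-term value.

def greedy_action(state, actions, M, V):
    """Pick the action whose expected next-state value is highest."""
    def expected_next_value(action):
        return sum(prob * V[next_state]
                   for next_state, prob in M[state][action].items())
    return max(actions, key=expected_next_value)

# Made-up example values:
M = {"leaf_in_view": {"reach": {"leaf_in_reach": 0.9, "leaf_lost": 0.1},
                      "wait":  {"leaf_in_view": 0.5, "leaf_lost": 0.5}}}
V = {"leaf_in_reach": 1.0, "leaf_in_view": 0.2, "leaf_lost": 0.0}
greedy_action("leaf_in_view", ["reach", "wait"], M, V)   # -> "reach"
```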

A reward r is, roughly, the measure by which the immediate success or failure of an agent’s actions can be estimated. A reward is a simple scalar, which can be negative (a punishment) or positive (a reward). Thus, a wide variety of entities can be described in terms of rewards. For instance, a reward signal can encode different trade-offs, risk-seeking and risk-averse utilities, and many other combinations of objectives.

In RL, rewards are organism-dependent, and they are not properties of the physical world. Technically, a reward function R(s) is “a property of the organism”, but the observed rewards are completely determined by the environment, and are not directly manipulable by the agent.Footnote 16 In other words, the reward function is, in a technical sense, “external” to the agent. The only way the agent can influence the rewards is through its actions.Footnote 17

The concept of value refers to the cumulative expected long-term reward.Footnote 18 The amount of such reward is specified by the value function: the value V(s) of a state is the expected sum of the rewards ($r_t$) observed from that state onward, with future rewards typically discounted the further they lie in the future. Estimates of value take into account the states that follow, the reward accrued in those states, and their respective probabilities. For this reason, immediate high rewards do not always lead to maximal value. Conversely, a state might yield a low reward but still have a high value, if it is followed by other, high-reward states.
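In the standard discounted formulation (as in Sutton & Barto, 2018), this can be written as

$$V(s) \;=\; \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s\right], \qquad 0 \le \gamma \le 1,$$

where the discount factor $\gamma$ weights each reward less the further into the future it is received.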

2.2 Action planning in reinforcement learning

In the mainstream computational work on action, motor control is typically treated as the moment-by-moment control of movements in a simple motor task (e.g., hand movements when grasping a leaf). The control is based on internal inverse and forward models (Miall & Wolpert, 1996; Weinsteiner & Botvinick, 2018; Wolpert et al., 1995). Inverse models allow the system to determine the motor commands necessary to achieve a desired state (for example, grasping a leaf), while forward models allow the system to predict the expected sensory feedback of a motor command.Footnote 19
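As a rough, purely illustrative sketch of this division of labour (a toy one-dimensional reaching task, not any published model), the two kinds of internal model can be pictured as a pair of functions:

```python
# Toy 1-D reaching example: the "state" is a hand position along a line,
# and a "motor command" is a displacement of the hand (illustrative only).

def forward_model(hand_position: float, motor_command: float) -> float:
    """Forward model: predict the sensory consequence (the next hand position)
    of issuing a motor command from the current state."""
    return hand_position + motor_command

def inverse_model(hand_position: float, desired_position: float) -> float:
    """Inverse model: compute the motor command needed to reach a desired state."""
    return desired_position - hand_position

# To grasp a leaf at position 0.8 starting from 0.2, the inverse model yields
# the required command, and the forward model predicts its expected outcome.
command = inverse_model(0.2, 0.8)        # 0.6
predicted = forward_model(0.2, command)  # 0.8
```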

When the dynamics of action selection are approached in terms of RL,Footnote 20 the agent is thought to take an action—for example, reaching for a leaf—according to its action policy. The algorithm queries the model M′ with a state-action pair ($s_t$, $a_t$) and, in turn, receives the next state ($s_{t+1}$) and reward ($r_{t+1}$).Footnote 21 The algorithm receives the reward outcome in the form of a signal, and updates its action policy.Footnote 22 If the reward is positive, the algorithm strengthens its action policy (say, a sequence of motor movements to reach the leaf in a certain way). If the reward is negative, it weakens the policy (Fig. 2).
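The following minimal sketch illustrates this loop in Python. The toy environment, the action names, and the reward values are made-up placeholders for the formally specified synthetic environment; a real application would use function approximation and a proper update rule.

```python
# Minimal sketch of the RL interaction loop described above (hypothetical setup).

def toy_environment(state, action):
    """Stand-in for the synthetic environment: given a state-action pair,
    return the next state and a scalar reward."""
    if action == "reach_slow":
        return "leaf_grasped", 1.0     # assumed: slow reaching succeeds
    return "leaf_missed", -0.5         # assumed: fast reaching fails

preferences = {"reach_fast": 0.0, "reach_slow": 0.0}   # toy action policy
learning_rate = 0.1
state = "leaf_in_view"

for step in range(50):
    action = max(preferences, key=preferences.get)     # act on the current policy
    state, reward = toy_environment(state, action)
    # A positive reward strengthens the selected action; a negative one weakens it.
    preferences[action] += learning_rate * reward
```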

Fig. 2 Organization of value-based action selection in RL. At each timestep, the agent selects the action that produces maximal expected value in the next step

In contrast to many other computational approaches, in RL the algorithm does not only estimate the immediately next actions (as motor commands, i.e., sequences of physical movements), or focus only on what the expected sensory feedback (of a motor command) might be. Instead, it attempts to find an optimal policy by learning the values of actions in any state, that is, by estimating the expected future rewards. In other words, it anticipates the success of actions before taking them. At the same time, it is able to estimate the success of executed actions by receiving feedback in the form of rewards and punishments.

This combination of exploration and exploitation allows such systems to act in a more flexible, adaptive and proactive way. Moreover, these algorithms are able to respond to changes in the environment relatively efficiently. If the environment changes, the appropriate action changes too. For instance, when winter comes, snowflakes rather than leaves will swirl in the wind. Cognitively, this affects the goals of the agent as well. The agent must be able to update its action policy when necessary.
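One common way of implementing this trade-off (used here purely as an illustration; it is not the only option, and the names are hypothetical) is an epsilon-greedy rule: mostly exploit the action with the highest estimated value, but occasionally try something else.

```python
import random

def epsilon_greedy(actions, value_estimates, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise exploit
    the action currently estimated to be most valuable."""
    if random.random() < epsilon:
        return random.choice(actions)                        # exploration
    return max(actions, key=lambda a: value_estimates[a])    # exploitation

# If winter turns leaves into snowflakes, occasional exploration lets the agent
# discover that a different strategy now yields more reward (made-up values).
action = epsilon_greedy(["grasp_leaf", "grasp_snowflake"],
                        {"grasp_leaf": 0.8, "grasp_snowflake": 0.1})
```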

In RL, the algorithm can use a forward model M′ for such updating (Doya, 2008). Forward models mimic the possible development of the environment, or, more generally, they allow inferences to be made about how the environment will develop in response to the agent’s actions (in light of its action policy). Moreover, if such a forward model is available, together with the so-called state transition rule P(new state ∣ state, action), the agent can perform the following inference: if I take an action (a) in my current state (t), what is the next state (t + 1) I will end up in?

Technically, this planning and simulation procedure can be described as the maximization of the cumulative reward up to a time T: $\sum_{t=0}^{T} r_{t+1}$, where $t$ indexes discrete time steps up to some maximum $T$, and $r_{t+1}$ is the reward received after each step (Fig. 3).Footnote 23,Footnote 24
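A minimal sketch of such planning, as a depth-limited rollout over candidate action sequences (the forward model, state names and rewards are hypothetical stand-ins; real planners are far more sophisticated):

```python
# Sketch of planning with a forward model M': simulate candidate action
# sequences into the future and score them by cumulative reward up to horizon T.

def forward_model(state, action):
    """Stand-in for M': predict the next state and reward of taking `action`."""
    if state == "leaf_in_view" and action == "reach":
        return "leaf_in_reach", 0.0
    if state == "leaf_in_reach" and action == "grasp":
        return "leaf_grasped", 1.0
    return state, 0.0                      # other actions change nothing here

def simulated_return(initial_state, plan):
    """Sum the simulated rewards over a candidate plan of length T + 1."""
    state, total = initial_state, 0.0
    for action in plan:
        state, reward = forward_model(state, action)
        total += reward
    return total

candidate_plans = [["reach", "grasp"], ["grasp", "reach"], ["wait", "wait"]]
best_plan = max(candidate_plans, key=lambda p: simulated_return("leaf_in_view", p))
# best_plan == ["reach", "grasp"]: the planner prefers a sequence it may never
# have executed, because the simulated future reward is highest.
```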

Fig. 3 Forward models and planning in action planners

This means that, in RL, action planner systems are not limited to actions they have found effective in the past. They can also estimate the success of options that they have not tried before. Crucially, they are able to evaluate which actions may yield the maximal results in the future. Thus, planner systems are able to prepare for the future, not only to learn from the past.

3 Action planner systems and representations

RL-based planner systems provide examples of how cognitive systems utilize goal-directed representations. First, the estimations of value can be taken to represent goals in terms of expected outcomes (for example, the estimated amount of long-term reward if the agent grasps the leaf). Second, the action planners utilize forward models. These models stand in for the future states of the algorithm’s synthetic environment. That is, in RL the forward models do not represent entities in the real world (say, hands grasping leaves), or the future trajectories of real-world entities (say, the possible future trajectories of hands grasping new leaves). Instead, they represent the predicted states of the algorithm’s environment in light of its action policy.

3.1 Forward models, synthetic environments and real world environments

In RL, the agent-environment construction is a part of the algorithm’s specification, and the environment is literally a “synthetic” environment for the algorithm. Its content is specified in terms of the RL formalisms.Footnote 25 This synthetic environment can be seen as a surrogate, an abstract and idealized world model, which substitutes the real external, physical environment for the algorithm.Footnote 26

In real-world tasks, however, the success of performance often requires a sufficient correspondence between the synthetic environment and the external physical environment. To systematically select appropriate and successful policies, the action planner system must take into account a significant number of external factors. For example, if the goal of a robot hand is to pick up a leaf in a real environment, relevant factors include the leaf’s location, size, and distance to the hand, among other things.

The “sufficient correspondence”, however, can be achieved in many ways.Footnote 27 Whether, and to what extent, the synthetic environment corresponds to the real-world environment depends on the technical details of a particular application. Not all of them are representational. For example, in robotic reinforcement learning, the external environment can be designed to serve only as a source of feedback information. In this case, the control system receives information and uses it to update the parameters of the algorithm and to calibrate its actions. The feedback information, however, may play only a causal role. And, as Ramsey (2007) remarks, mere causal relations do not represent.

Or, applications can use “sensory-like” information as inputs. For example, in autonomous vehicles a variety of sensors (such as cameras, radars and lidars) can be utilized to scan the vehicle’s environment.Footnote 28 Then, the sensor data can be combined, for instance, by using a computational technique called “sensor fusion”.Footnote 29 The vehicle’s control systems can take this fused data into account when they make the driving decisions. In this way, the sensor systems can “sense” or carry information about the relevant environmental factors, and the control systems can utilize this information, when necessary.Footnote 30

The goal of an RL-based control unit is, however, to compute the driving decisions, not to track or represent the states of the environment in a correct or veridical way. The computational objective of the RL algorithm is to maximize the reward, not to produce veridical representations of the environment.

Of course, one may program an RL application, such as an intelligent camera, to mimic, say, perceptual processes and to learn world models in a more “representational” way (Silver et al., 2021).Footnote 31 For example, in a recent RL-based image recognition application the task is to learn a correct categorization of real-time video images (say, of leaves behaving in various ways).Footnote 32 During the learning process, the system receives observations (real-time videos, say, of swirling and non-swirling leaves) as input data.Footnote 33 Then, the system parametrizes the observations and transforms them into hidden states.Footnote 34 At each step, the model predicts the policy, the value function, and the immediate reward. The hidden states are thus updated iteratively by a recurrent process that receives the previous hidden state and a hypothetical next action as inputs.

As a result, as Schrittwieser et al. (2019) put it, “there is no direct … requirement for the hidden state to capture all information necessary to reconstruct the original observation, nor is there any requirement for the hidden state to match the unknown, true state of the environment; nor any other constraints on the semantics of state… they are free to represent the environment in whatever way is relevant to maximize current and future rewards.” That is, the algorithm’s goal is not to match the representational states of the system with the “true states” of the external environment. Nor is its task to use (current or past) “observations” (about the external environment) to construct veridical descriptions of the objective world as such. Instead, the algorithm’s job is to decide which actions it should take to maximize the reward. To do so, it is free to represent the states of the physical environment in whatever way is relevant for maximizing the rewards.
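Schematically, such an architecture (in the spirit of Schrittwieser et al., 2019, but heavily simplified; all function bodies below are stubs with hypothetical names, not the published model) consists of three learned mappings:

```python
# Heavily simplified schematic of a learned-model architecture of this kind.
# The three functions below stand in for learned neural networks.

def representation(observation):
    """Encode a raw observation (e.g., a video frame) into a hidden state."""
    return hash(str(observation)) % 1000                       # stub for a learned encoder

def dynamics(hidden_state, action):
    """Predict the next hidden state and immediate reward for a hypothetical action."""
    next_hidden = (hidden_state * 31 + hash(action)) % 1000     # stub
    predicted_reward = 0.0                                      # stub
    return next_hidden, predicted_reward

def prediction(hidden_state):
    """Predict an action policy and a value directly from the hidden state."""
    return {"track_leaf": 0.5, "ignore": 0.5}, 0.0              # stub policy and value

# Note: nothing here requires the hidden state to mirror the "true" state of the
# external environment; it only needs to support reward-maximizing predictions.
hidden = representation("video_frame_of_swirling_leaf")
hidden, reward = dynamics(hidden, "track_leaf")
policy, value = prediction(hidden)
```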

Furthermore, the specialty of action planner systems in RL is that they do not only use (current or past) “observations” (about the external environment) to predict future rewards. They also estimate the impact of future actions in order to predict future rewards. In RL, the forward models tell what the possible development of the algorithm’s synthetic environment may be. These future-oriented states allow the algorithm to estimate how its environment will develop in response to the agent’s actions (in light of its action policy). Thus, these estimates have no existing target systems in the real world. Instead, they stand in for anticipated possible future states.Footnote 35 Hence, in RL, forward models represent “possible worlds”, not real environments.

3.2 Value representations

Value estimations stand in for the total amount of reward an agent can expect to accumulate over the future. Their content is specified in terms of the value function and other formalisms of the algorithm. While a value is the farsighted judgement of what is good in the long run, a reward refers to the measure by which the immediate success of an action can be estimated. It indicates the intrinsic short-term desirability of environmental states for the agent.Footnote 36

Rewards, thus, are not signals that indicate the presence of states or entities per se. A reinforcing stimulus (a physical feature) presents an organism with reward or value only in virtue of the organism’s reward function. The aim of the reward function is not to represent whether the stimulus is “really” valuable or rewarding.

In animal experiments, of course, rewards are often operationalized as food or drink. For example, a capuchin monkey can be taught to do various tasks by rewarding it with a treat, such as a grapefruit (Brosnan & de Waal, 2003). Still, despite this operationalization, the reward itself is not defined as the grapefruit (or its glucose). Instead, the reward is like the “pleasure” related to getting such a thing (Sutton & Barto, 2018).

Moreover, rewards are not reducible to agent-independent physical properties or objects. They are organism-dependent features, not organism-independent causes of neural responses. Factors such as the hunger state of the organism affect the quantity and the quality of rewards. Sugar, for example, is not equally rewarding for every organism (with different reward functions). For a capuchin monkey, it usually is rewarding. For a cat, it isn’t. How rewarding sugar is depends on the current glucose metabolism, or more generally on the physiological state of the organism. Sugar is more rewarding when one is hungry, and less rewarding when one is not.
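This organism-dependence can be made explicit by writing the reward as a function of both the stimulus and the organism’s internal state. The following toy sketch uses made-up values purely to illustrate the point:

```python
# Toy illustration: the same stimulus yields different rewards depending on the
# organism's reward function and its current physiological state (made-up values).

def reward(stimulus, organism, hunger):
    """Reward as a function of the stimulus AND organism-dependent factors;
    `hunger` is assumed to lie between 0 (sated) and 1 (very hungry)."""
    if stimulus != "sugar":
        return 0.0
    if organism == "cat":
        return 0.0               # sugar is not rewarding for a cat
    return 1.0 * hunger          # for a capuchin monkey, reward scales with hunger

reward("sugar", "capuchin", hunger=0.9)   # high reward when hungry
reward("sugar", "capuchin", hunger=0.1)   # low reward when sated
reward("sugar", "cat", hunger=0.9)        # no reward, regardless of hunger
```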

4 Fly detectors and goal-directed representations

The value estimations and forward models are examples of goal-directed representations. Their role is to detect opportunities for successful action, not to construct veridical models of the objective world as such. Thus, they do not fit well with the recent enactivistic arguments, where the paradigmatic case of a cognitive representation is a “percept”, a mental state used to “represent how things are with the world” (Hutto & Myin, 2020). These representations, as Hutto (2015) puts it, are connected with the world via “sensory contact”.

From a cognitive science point of view, Hutto and Myin’s arguments continue the legacy of “fly detectors”.Footnote 37 This legacy dates back to receptive field studies from the late 1950s (Hubel & Wiesel, 1959; Lettvin et al., 1959). In these early studies of sensory mechanisms, the focus was on the signal transformation properties of frog ganglion cells, later known as “fly detectors” (Lettvin et al., 1959). The ganglion cells were found to respond to small, black, fly-like dots moving against a stationary background in the frog’s visual field. A few years later, Hubel and Wiesel (1962) proposed a way in which “pooling mechanisms” might explain how the cells of the mammalian visual cortex are able to detect more complex features by combining simple responses.

Under the influence of fly detectors, the experimental research on sensory processes focused on bottom-up feature detection for decades. This framework also influenced philosophers, and fly detectors dominated the discussion on representations for a long time. For example, in the 1980s, the analysis of representations focused almost completely on the questions of (1) whether a representation of a fly is really about flies, (2) how to make the leap from the signal transformation properties of, say, ganglion cells to the semantic properties of receptors, or (3) how to specify the content determination of these representations in a satisfactory naturalistic way (Dretske, 1981; Fodor, 1992; Millikan, 1989).

From a cognitive point of view, fly detectors are stimulus-based representations. That is, the activation of a detector representation is taken to require a causal association with preceding stimuli, a “neural” signal that triggers the representation (or causes an indicator to fire). For example, in the crude causal theory a stimulus—say, a black dot in the visual field of a frog—is a proximal cause for the activation or the triggering of the tokening of a representation “fly”. The stimulus is caused by a signal, and the external source of a signal—an entity, say, a fly or a mosquito—is assumed to exist in the physical environment. In teleosemantics or indicator semantics, the stimulus is typically taken to be responsible, for example, for the firing of indicator mechanisms, or to play a part in the causal specification of an indicator mechanism.Footnote 38

In stimulus-based representations, the source of a signal (the cause of the stimulus, or the cause responsible for the firing of an indicator) is taken to exist, somehow, as a distal object in the physical environment (including, possibly, parts of the organism’s body). Obviously, RL-based action planning representations cannot be specified in such terms. They are not caused or triggered by stimuli. For example, in value representations the content of a reward is specified by the reward function. A stimulus, however, is not part of the reward function. Rewards, simply, are not rewarding because they are triggered by entities that happen to function as reinforcers for the agent. Instead, in RL, reinforcers reinforce because they elicit reward.

In value or reward representations, then, the fly-detector-based formulation of signals and stimuli is turned on its head: in RL, a signal is not a “reward signal” because it is triggered or caused by a rewarding stimulus. Instead, a stimulus is “rewarding” because it causes a reward signal (i.e., a signal that acts the way reward acts in RL). Furthermore, the reward signal itself does not indicate the presence of some independent feature of the environment (such as the presence of, say, glucose in a grapefruit as a reinforcing stimulus). Although the reward can be operationalized as a concrete item (such as treats, say grapefruits, in animal experiments), the reward itself is not the grapefruit, but the feeling of “pleasure” related to it. Thus, the question of whether the semantic content of a reward signal is really glucose, grapefruit, fly or whatever simply does not arise.

To sum up, neither rewards nor value representations (as estimated long-term rewards) can be specified in terms of “fly detectors”, or any other framework that appeals to environmental stimuli.Footnote 39 Obviously, these organism-dependent and goal-oriented representations raise very difficult questions about the neural encoding of their formal and abstract properties. As things stand, there is no adequate philosophical account of them.

As we see it, the algorithmic methods of cognitive and computer sciences may provide some resources for analyzing them. For example, the formal descriptions of “synthetic environments” and “expected rewards” may provide a more fertile way to dissect these organism-derived, abstract representations than traditional philosophical analysis in terms of “secondary qualities”, “observer-dependent representations” or “organism-produced representations in the absence of environmental targets”. Computational methods provide not only tools for simulating these phenomena in a controlled way but, more importantly, they may also offer concepts for analyzing them in an exact and mathematically tractable way.

5 Diversity of representations

When assessing what is the most plausible empirical story for behaviors like “chasing and grasping a swirling leaf”, one should remember that we live in a complex and changing world. In dynamic environments, even a simple action—such as grasping a new swirling leaf—is a complicated challenge. To solve this challenge, cognitive systems do not only react, but also explore new strategies and learn from observing the consequences of their actions. Moreover, to select successful policies, they must take into account a significant number of external factors and integrate information from several sources in a purposeful way. Thus, successful action requires information from multiple perspectives.

From a neurocognitive point of view, cognitive systems use different representations for different purposes.Footnote 40 Perceptual and sensory systems allow the system to detect the environment and to receive feedback about the states of the body. Motor representations, as motor commands, help the agent coordinate the behavior of the body. They encode explicit instructions for sequences of physical movements (Miall & Wolpert, 1996; Mylopoulos & Pacherie, 2017; Wolpert et al., 1995). And goal-directed representations are used to detect and estimate opportunities for successful action.

Representational states vary from neural coding in sensory cells to higher-order perceptual processing, and from on-line representations to memory-based representations of previous experiences. They vary from simple motor commands to complex metacognitive simulations of possible actions. While some of them, such as fly detectors or simple encodings of visual or acoustic signals, are stimulus-based and track external targets via the sensory apparatus, other representations, such as motor representations or goal-directed representations,Footnote 41 do not. And while some representations are organism-independent, such as sensory encodings of simple physical quantities like the speed and direction of motion, others, such as estimates of long-term rewards, are organism-dependent.

Clearly, this diversity challenges the enactivist presupposition that to represent is to represent only as fly detectors, or more generally, as sensory states do. Furthermore, it pushes us to ask: to what extent can these various representational states be analyzed solely in terms of a framework that was originally developed to explore the signal transformation properties of a frog’s ganglion cells? Perhaps we should let go of the assumption that only states that track the external environment count as representational, and analyze the representational diversity as it is. Ultimately, this should not be taken as a matter of philosophical convictions or presuppositions. Instead, it should be taken as a question of the roles that representations play in explaining action control scientifically.

6 Conclusions

In this article, we have illustrated how reinforcement learning (RL) algorithms are used to study the cognitive dynamics of action control. When action control is characterized in this way, it is understood as the workings of the control system of a cognitive agent who can learn, anticipate and adapt.

In RL, the cognitive dynamics of action control are seen as forms of computationally specified learning-as-decision-making processes (Sutton & Barto, 2018). RL provides an exact and algorithmic way to describe various aspects of these processes. Furthermore, this approach is widely and successfully used in simulating and analyzing them in computational cognitive sciences, artificial intelligence and robotics.

In RL, the goal of an algorithm is to maximize the long-term reward. To do so, the algorithm utilizes internal representational states. These representational states, however, are not sensory-like “percepts”. Instead, they are goal-directed. Thus, this algorithmic framework provides examples of representations which challenge the enactivist conviction that to represent is to represent as fly detectors do. Moreover, it impugns the enactivist intuition that an action-oriented perspective implies the need for a non-representational theory of cognitive systems. Instead, in the light of RL, it implies a need to update our intuitions about what action control representations are, and what they are used for.