Introduction

Integrating efficient adaptivity and learning with behavior control, in both biological and robotic systems, has to support fast responses without sacrificing flexibility. To accomplish this, higher brains apparently have embedded adaptivity and learning within what has been termed a “dual process architecture” that combines fast behavior with slower processes, usually referred to as cognitive computation [1, 2]. With regard to learning, these two types of processes are characterized as model-based in contrast to model-free learning. Cognitive computation as a higher level process is understood as realizing flexible planning by running grounded internal simulations, which suffers from the drawbacks that it requires effort, that processing of different possibilities is done serially, and that processing is therefore presumably quite slow. Adaptive behavior, on the other hand, is considered a very fast process that allows for quick reactions (but might be shaped on a slower timescale through learning). Importantly, following the notion of embodied cognition [3], it is assumed—and has been shown—that these two modes should not be considered as disjoint or independent systems. Instead, there is evidence that the higher level cognitive system is based on and recruits the underlying adaptive control system [4,5,6]. In this article, we follow this assumption. In particular, our focus is on the organization and learning of the underlying lower level adaptive control system. We argue that these organizational characteristics are instrumental and crucial for the flexibility of the higher level planning system, meaning that the higher (cognitive) planning level has become so powerful due to the flexibility of the employed lower level building blocks.

Learning of goal-directed behavior on this lower level has often been framed as a reinforcement learning (RL) problem [7]. Reinforcement learning paradigms are rooted in behavioral science and have been studied in detail on a mechanistic level in neuroscience [8, 9]. In reinforcement learning, the space of possible actions for a given context should initially be explored. Later on, gained experiences should be exploited when facing similar situations, and actions should be recalled that have been rewarded in the past. In this way, a controller is learning from interactions with the environment (Fig. 1a; Fig. 1 provides a conceptual overview of reinforcement learning). On a functional level, such trial-and-error learning requires a high number of exploratory trajectories in order to cover the currently accessible part of the whole state space. Deep reinforcement learning (DRL) has been established as one effective approach in low- to medium-dimensional spaces as deep neural networks allow interpolation between previous experiences [10, 11]. Initial success came in the area of computer games [12] with discrete action signals. But over the last years, DRL has also been applied to robot control [13, 14]. While DRL has shown initial success in robotic tasks, these approaches still struggle when dealing with noisy or changing environments [8, 15]. In general, a very large number of experiences is needed for learning when no prior knowledge is used. The first reason is (as shown in Fig. 1b) that the sensory input space is high dimensional and learning requires experiences to cover sufficient parts of this spanned sensory space. While in DRL function approximation is used to interpolate between experiences, it is still necessary to generate experiences that are sufficiently diverse over this high-dimensional sensory space. But, usually, only a small fraction of combinations of sensory inputs in the different dimensions can reasonably be acquired. A deeper understanding of the sensory space would require prior knowledge. A second reason that makes exploration in real world settings so difficult is that the action space of the policy is often quite high dimensional (Fig. 1b, right side). This is particularly problematic for RL as during exploration random combinations of actions are produced and the number of possible actions increases exponentially with the number of action dimensions. Therefore, many RL problems are restricted to quite small action spaces in order to be tractable. Finally, a third problem arises along the temporal dimension (shown in Fig. 1c). In many tasks, only a sparse and temporally delayed reward can be provided. This leads to a credit assignment problem as it is difficult to infer (without additional knowledge) which action decision in a sequence of many decisions was crucial for a positive outcome. Hence, the gained reward is often divided among several decisions and is thus significantly diluted.
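As a minimal illustration of this interaction loop (Fig. 1a), the following Python sketch shows a generic agent–environment loop; the environment, policy, and learner objects are hypothetical placeholders following a Gym-style interface, not a specific implementation from the cited works.

```python
# Minimal sketch of the reinforcement learning interaction loop (cf. Fig. 1a).
# `env`, `policy`, and `learner` are hypothetical placeholders (Gym-style interface).

def run_episode(env, policy, learner):
    state = env.reset()
    done = False
    episode_return = 0.0
    while not done:
        action = policy(state)                        # controller selects an action
        next_state, reward, done = env.step(action)   # environment returns next state and reward
        learner.update(state, action, reward, next_state, done)  # adapt the policy from experience
        episode_return += reward
        state = next_state
    return episode_return
```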

Fig. 1

Conceptual view of reinforcement learning (upper row) and of approaches addressing current challenges (lower row). a Standard scheme of reinforcement learning. Based on a current state, a controller selects an action. During RL, this controller is optimized to maximize a returned reward signal. In panel b, a simple (neural network-based) controller as a policy is shown with a high number of input and output dimensions, which both pose a problem as they make exploration costly. In panel c, unrolling of a decision sequence over time is illustrated. When there is only a delayed reward (orange curve at the bottom), this reward is distributed over the whole sequence, making it hard to learn which decisions were beneficial. Panels d, e, and f illustrate approaches to alleviate the three discussed problems (numbers in the arrows refer to the respective subsections in “Towards Modularization and Abstraction in Deep Reinforcement Learning” that provide a detailed discussion). First, when dealing with high-dimensional input spaces (d), dimensionality reduction can facilitate faster behavior learning and transfer as uncovered relations can be exploited between tasks. Second, partitioning of the action space (e) helps in exploration, as a high number of degrees of freedom could otherwise not be sufficiently explored. Last, in hierarchical DRL (shown in f), there are different levels of abstraction on different temporal scales. In this way, RL tries to uncover smaller subsequences as behavior building blocks that can be reused and rearranged by the higher level policies

In this article, we want to address this family of problems (visualized in Fig. 1b and c; a conceptual view on possible approaches is shown in Fig. 1d to f). On the one hand, a biological perspective, exemplarily taken from the field of motor control, points in the direction of modular, decentralized, and hierarchical organization of control systems. On the other hand, we want to discuss the possible impact of these organizational characteristics on the general learning framework of deep reinforcement learning (DRL). This perspective helps to explain fundamental difficulties when scaling DRL to more complex, real world problems and provides a path for alleviating scaling as well as transfer problems.

We want to point out three different characteristics (Fig. 1d to f for visualization) of the organization of the sensorimotor system that all share a form of modularization as a common trait [16] but along different dimensions of abstraction. Firstly, we consider a form of abstraction over the sensory input space (Fig. 1d) which aims at more general control primitives and tries to make high-dimensional input spaces manageable through dimensionality reduction and learning of disentangled representations. Such abstractions integrate invariances along meaningful dimensions into building blocks while ignoring other forms of variation. Secondly, we will turn towards a partitioning of the action/control space as a form of modularization in which coordinated movements of multiple degrees of freedom are integrated into forms of synergies (Fig. 1e). As reinforcement learning amounts to learning from exploration, it scales particularly badly when there are many degrees of freedom that should be handled. Partitioning the action space dramatically decreases the number of possible actions at a given time for exploration and allows for independent exploration in parallel. Thirdly, there is a form of temporal abstraction and re-representation of behavior on different levels of temporal granularity [17], which is required to deal with problems when there is only sparse or delayed reward (shown in Fig. 1c). In such forms of hierarchical RL (Fig. 1f), the search for a behavior is distributed onto different levels operating on different temporal scales. This allows reusable building blocks of short duration that can be chained together by a higher level policy.

In this article, we briefly motivate these characteristics from a biological perspective with a look at the organization of motor control in animals. The main focus is on how these principles can facilitate reinforcement learning and can be applied to robotic motor control systems [18]. It is our goal to show how these insights into the organization of control have already been applied in reinforcement learning and where, as well as how, the robotics community could benefit from taking further inspiration [19]. Therefore, we present selected current approaches from DRL to highlight their advances as well as their drawbacks.

Biological Perspective

Animals are complex systems that face the difficult task of controlling many degrees of freedom in a coordinated manner to generate coherent goal-directed behavior [20] in quite diverse contexts. Coordination of these multiple degrees of freedom in animals is realized based on a rich sensory input space, and such behavior can aim for goals over quite long time horizons [21]. Particularly well-studied examples of such adaptive behavior are locomotion and the manipulation of objects. Behavior is characterized as adaptive as it shows robustness to variation of the situation, for example, when walking on uneven terrain or climbing through a tree. Such biological control systems provide examples of adaptive behavior that we want to understand and transfer to artificial control systems. In the following, we will briefly point out biological insights that highlight the three different dimensions introduced above: First, a characterization of sensory inputs and sensory processing. Second, work on sensorimotor loops that operate concurrently, e.g., reflexes on a fast timescale. Third, and already induced by the former two, a form of hierarchy and modularity in motor control systems. As we are aiming for a sensorimotor perspective that connects perceptual processing and control, we turn towards reinforcement learning as a framework motivated from biology.

From Sensory Inputs Towards Disentangled Representation that Provide Meaningful Information for Behavior

While we want to take sensory processing in animals as an inspiration, it is important to point out a crucial difference to technical systems. Compared to robots, animals rely on a large number of sensors. Usually, these sensory inputs are quite noisy or only provide discretized or even binary information. In many cases, redundant sensory systems are present to measure similar signals. For example, joint angles are measured in insects in a distributed fashion. There are groups of mechanoreceptors that measure bending of the surrounding surface at distinct points and provide a binary signal [22]. In some cases, entirely different sensory systems provide this information, for example, through measuring the lengthening of elastic parts of the joint. The large number of noisy and low-resolution sensory signals requires further processing and integration of information. This can be considered as constituting a form of internal model that integrates information from different sensory organs into a coherent image [23, 24]. As a further example, estimating the posture of the hand in humans is assumed to integrate visual information and proprioceptive information from joints, muscles, skin, and further information on balance, for example, from the vestibular system [25]. As a third example, it was demonstrated for neuronal processing in sensory neocortex during learning that discontinuous steps in performance increase are associated with significant reductions of the dimensionality of the neuronal dynamics state space, reflecting a newly generated semantic interpretation of stimuli and stimulus features [21, 26]. Overall, dimensionality reduction is one prominent strategy of biological systems to deal with high-dimensional input spaces (see Fig. 1d). Many control approaches have taken inspiration from such biological dimensionality reduction techniques (however, it should be mentioned that there are alternative ways to deal with high dimensionality; for more details see “Dimensionality Reduction: From Sensory Inputs to State Spaces” or, as an example, [27]).

Reduction of dimensionality is tightly connected with integration of information through different stages of processing, which induces a form of hierarchy and abstraction. Further processing of visual input into more abstract and semantic representations is well-studied in the visual system of primates, among others. Objects are analyzed in a series of stages in the visual system leading towards more and more complex features [28]. While early stages focus mostly on local relations, such as edge detectors, later stages and more complex features represent larger areas of the visual field and more conceptual knowledge. Importantly, two pathways are distinguished. The dorsal visual pathway is assumed to play a prominent role in guiding immediate action with respect to detailed information on location. In contrast, the ventral pathway is concerned with identification of objects through stage-wise processing of increasingly complex, mostly visual properties and disentangled representations [29]. Still, these higher level representations are influential in guiding behavior, but on a different temporal scale as they influence planning (see “Hierarchical Organization and Abstraction” on different temporal levels). Building up more complex representations in a series of stages has inspired the most successful area of deep learning for vision-based tasks. In particular, deeper network structures of stacked local filters—taken as a prior—combined with learning on big datasets led to the success of deep neural networks [30]. This idea of convolutions as local filters has proven to be a strong prior that can be learned in a distributed fashion and leads to more high-level invariant or equivariant representations.
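As an illustrative sketch of this prior (using PyTorch; the input size and channel numbers are arbitrary assumptions, not values from the cited work), stacking local convolutional filters with pooling yields features whose receptive fields grow stage by stage, so that later stages summarize larger areas of the visual field:

```python
import torch
import torch.nn as nn

# Sketch: stacked local filters (convolutions) as a prior for hierarchical visual features.
# Each stage only looks at a small neighborhood; pooling lets later stages cover
# increasingly large parts of the visual field. Sizes below are illustrative assumptions.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # local edges
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # small patterns
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # larger structures
)

features = encoder(torch.randn(1, 3, 64, 64))   # output shape: (1, 64, 8, 8)
```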

In our context, we are particularly interested in the connection with actions [31], which is often understood in the form of affordances, i.e., the idea that raw sensory data is translated into a meaningful representation directly valuable for guiding actions of the animal [32]. While affordances were originally considered in the context of visual processing towards representations that guide action, they can be found in other senses as well and appear as a general concept in which sensory input is transformed into a representation that facilitates certain behavior [33]. A simple example of a touch-based affordance can already be found in stick insects: Stick insects are mostly blind but are able to climb through twigs, which is difficult as there are only very few footholds. This is possible as information on footholds is shared between antennae and legs, from the front to the hind ones. This information guides movements [34]. It appears that only antennae and front legs perform searching movements while middle and hind legs can exploit the position information that is afforded.

To summarize, in biological systems, sensory processing is required to integrate a multitude of sensory inputs which individually suffer from low resolution and noise. Such representations appear to focus initially on small and locally close sensory inputs, leading in later stages to more abstract representations that summarize larger input spaces and more distant sensory signals. This induces a form of hierarchy and reduces the dimensionality of the input.

Sensorimotor Action Loops in Biology: Concurrent Processing on Different Temporal Scales

Control of adaptive behavior in animals requires coordination of a high number of degrees of freedom in order to robustly achieve a goal. Biological systems thus face high dimensionality also on the side of the actuators. We will discuss two—partially overlapping—types of strategies in biological systems to deal with this high dimensionality: On the one hand, again a form of dimensionality reduction, as discussed for dealing with large numbers of sensory inputs. On the other hand, partitioning of the action space into local and decentralized control structures (see Fig. 1e).

A large number of controllable degrees of freedom is simplified in biology through governing principles that combine multiple of these degrees of freedom, for example, into muscle groups or synergies [35, 36]. Motor synergies realize the general principle that the different degrees of freedom are not used individually, but generally can be grouped into such synergies in which activity is highly correlated [37]. Motor control is simplified in this way as, on the one hand, the number of controlled synergies is much smaller. On the other hand, it has been observed that only some specific dimensions are task relevant and tightly controlled while other dimensions do not have to be carefully “considered” and show broad variation [38]. This simplification is afforded through motor synergies as these also realize a transformation into a more goal-directed space (while muscle activations provide a difficult or ill-determined mapping to movement, a synergy integrating multiple muscles can describe a movement in a joint or even Euclidean positional space).
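A minimal numerical sketch of this idea (the numbers of muscles and synergies below are arbitrary illustrative choices, not values from the cited studies): if a fixed synergy matrix maps a few synergy activations onto many muscles, the controller only has to choose the low-dimensional synergy activations.

```python
import numpy as np

# Illustrative sketch of muscle synergies as dimensionality reduction of the control space:
# 30 muscles are driven by only 4 synergies. The controller chooses the 4-dimensional
# command c; the fixed synergy matrix W expands it into correlated muscle activations.
n_muscles, n_synergies = 30, 4                            # arbitrary illustrative sizes
rng = np.random.default_rng(0)
W = rng.uniform(0.0, 1.0, size=(n_muscles, n_synergies))  # fixed synergy patterns

c = np.array([0.8, 0.1, 0.0, 0.3])                        # low-dimensional control command
muscle_activations = W @ c                                # 30 correlated muscle activations
```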

Furthermore, neural pathways suffer from slow transmission. For many control problems, this rules out full control cycles through the nervous system up to a central control unit as this would be too slow [39]. Instead, on the lower levels we see, firstly, contributions of mechanical embodied properties, like the elasticities of muscles, and secondly the reaction of fast preflexes or reflexes [20]. Such local control circuits allow for fast reactions (Fig. 1e visualizes this partitioning of the action space into local control modules). They can be triggered by local—and low dimensional—sensory input and can act on a fast timescale. Further processing is realized in parallel on a higher level, and this induces a form of control hierarchy which will be covered in the next section. Here, we first want to stress that in biological motor control, we find decentralization into local modules that can react concurrently. This enables adaptive behavior: the system can deal with difficult control tasks that require immediate responses. We want to emphasize that such decentralized control can not only be found on a low level producing reflex movements of single joints, but is a crucial part of motor control in general. It appears that decentralization into local modules is not only a necessity to overcome transmission delays, but that it fundamentally contributes to coordinated behavior. The prime example is the control of locomotion and the contribution of—sensory driven—oscillations. Detailed work in invertebrates, as for example in insects, allows for a closer look at the underlying neuronal structures that are driving this behavior. On these lower levels, there is naturally a decentralized organization in concurrent, small control circuits [40,41,42,43,44]. The basic idea is that there are many, partially independent, decentralized and concurrently operating controllers. These are assumed to work on different levels of abstraction, for example, one controller per leg or even further distinguishing control down to the single joint level [45]. Overall, behavior emerges as a result of the interaction of all control circuits: coupled through local connections, some top-down modulations, and the loop through the environment (in the case of insects, gaits emerge freely in response to what is required by the environment [40, 41]). Importantly, this form of modularization into decentralized local control clusters is tightly connected to the organization of the body: Together, the actuated individual leg and the decentralized control circuit interact with the environment. But in return, local sensory inputs not only provide local information but furthermore sense interactions of the whole body with the environment. This not only offers the advantage of fast, local processing; it has also been shown that this decentralization is advantageous for the adaptivity of the overall system.

To summarize, biological control systems operate in a decentralized fashion. Local control modules allow for fast (re-)actions. The distribution into local control modules partitions the otherwise high-dimensional action space, and detailed movement control is handled on the local level (Fig. 1e). In addition, this induces a form of hierarchy, as will be discussed in the next section, in which top-down signals modulate lower level synergies or control modules (see as well [46]). It appears that these local building blocks not only solve a problem of temporal delays by keeping signal processing local, but that a modularization into functional synergies also breaks down the high complexity of the overall control problem into manageable smaller modules.

Hierarchical Organization and Abstraction

Dealing with the complexity of controlling many degrees of freedom depending on a high-dimensional sensed state is addressed in many animals through decomposition of behavior into a hierarchical organization (Fig. 1f visualizes such a form of abstraction with a focus on the temporal dimension and different levels operating on different temporal scales; this hierarchy is also reflected in sensory inputs and action spaces in which we observe forms of abstraction and reduction of dimensionality). This leads to a modular control system [47,48,49]. In such a hierarchical structure (see Fig. 2a for a visualization as assumed in humans), complexity is distributed over different modules, leading to a form of abstraction [47, 50, 51] as we already mentioned in the preceding sections. From a motor control perspective, higher levels activate modules on the lower level. Locomotion has been widely used to study the interactions of such different levels. In humans, higher centers project through the sensorimotor cortex onto basal ganglia and brain stem [52]. From there, descending commands are assumed to activate executive circuits in the central nervous system that provide detailed control for actuators and movement. Such an organization is known for vertebrate locomotion [53] and is assumed to be shared widely across animals [20, 54]. Importantly, behavior is not only controlled in a top-down fashion. Control is complemented by low-level reflex loops as discussed in the previous section. Overall, such hierarchies of control are tightly interconnected with subserving representations working as well in a bottom-up fashion. While on the lower level inputs describe local and sensory signals, higher levels are connected to more and more abstract representations, forming larger receptive fields or summarizing inputs into more abstract concepts as described in “From Sensory Inputs Towards Disentangled Representation that Provide Meaningful Information for Behavior.” Two important characteristics of such abstractions are, on the one hand, a form of spatial abstraction in which more and more context is integrated into higher levels of representation and more and more invariances are covered on these higher levels. On the other hand—and intimately connected—is a form of temporal abstraction, as higher level representations not only take the immediate sensory input into account, but also represent dynamic relations [55] that are extended over time. This further connects to “Sensorimotor Action Loops in Biology: Concurrent Processing on Different Temporal Scales,” in which we already discussed a partitioning of the action space and the formation of abstractions over the action space which induces a hierarchy that operates on different temporal scales [46].

Fig. 2

a Illustration of motor control hierarchies as assumed in the human brain (following [52]). The highest level deals with selection of goal-directed behavior. On an intermediate level, actions are selected, which leads to an internal competition between different actions depending on context. The lower level realizes motor control primitives [56] that are modulated by the higher levels through descending commands (lower level shown in green with the higher levels’ projections shown as descending commands). At the lower level, motor primitives allow for fast, sensory-guided adaptation to disturbances. In panel b, the standard view of interaction with the environment in reinforcement learning is shown [7] and extended to a hierarchical perspective [57] as advocated, for example, in [58]. For higher level control (shown in blue), this is in agreement with what we know about the structure of motor control in mammals (see a and, e.g., [52]) regarding descending pathways and modulation of lower level control centers (shown in green) in the spinal cord. Such structures are found not only in mammals, but also in invertebrates and insects [20]

In the preceding three sections, we briefly reviewed findings in biology related to our introduced distinction, considering the structuring of the input space, the action space, and the induction of hierarchies. This proposed distinction highlights different dimensions to focus on. We already noted in the different sections that these areas are tightly interconnected and should be considered jointly. In such a view, control is understood as sensorimotor couplings in which sensing and behaving are tightly interconnected. In connection to learning, reinforcement learning offers one framework that captures the coupling of sensory inputs and action spaces, reinforcing those couplings that lead to higher rewards. This is well understood in behavioral science and neuroscience [8], identifying, for example, the striatum as a region in which the prediction error of an assumed value appears to be represented, which can drive reinforcement learning. Furthermore, this has been extended to hierarchical forms of reinforcement learning in neuroscience [58] (see Fig. 2b for an extension of the reinforcement learning loop to a hierarchical perspective). We argue that reinforcement learning—and, in particular, its currently successful application in deep reinforcement learning (DRL)—can further benefit from biological organizational principles as introduced above.

Towards Modularization and Abstraction in Deep Reinforcement Learning

In this section, we will review current directions in DRL and highlight shortcomings of such approaches, in particular, related to scaling to complex control problems and dealing with unpredictable environments. We will follow the distinction into the three dimensions introduced in Fig. 1 and discuss how large input spaces, large action spaces, and a lack of abstraction hinder scaling of DRL towards more complex systems as well as transfer between tasks. This will also offer routes to improvement for current approaches (see bottom part of Fig. 1).

Dimensionality Reduction: From Sensory Inputs to State Spaces

We have pointed out above that in biological systems both input (feature) spaces and output (action) spaces—and in fact, spaces representing activity within subnetworks that are serially connected—are typically high dimensional. There is plenty of evidence that high dimensionality is not only a reflection of the large number of sensor and actuator components in biological systems but that it serves computational purposes. For instance, in early visual processing high dimensionality can actually be a feature that supports efficient sensory processing through sparseness, optimizing various aspects such as reducing metabolism or maximizing statistical independence, and thereby easing learning and reducing wiring [59, 60] (but see [61] for a critical discussion of this account). Furthermore, projecting decision-making problems into high-dimensional spaces can in fact facilitate the formation of a decision-boundary hyper-space by learning processes as it enables a trade-off towards simplifications of the required geometry of decision surfaces. Also, high dimensionality allows for the existence of sufficiently rich dynamics representing computational processes in a given subnetwork without influencing a serially connected subsequent network (thereby allowing effective gating without the need for global gain modulation). Such so-called null spaces [27, 62] allow, for example, for computational processes subserving movement preparation before an actual motor program is triggered in a subsequent network. However, many approaches in decision-making use low-dimensional and pre-defined feature spaces as inputs on which decisions should be based (as one different approach, for example, [63] follows the idea of transforming the sensory space into even higher dimensional spaces with the goal of simplifying the manifold structure that is used as an input space in reinforcement learning). Reinforcement learning can be easily applied in these settings and has a long tradition of working with them [7]. This requires a preprocessing step in which suitable feature spaces are generated. In contrast, when dealing with the original, high-dimensional sensory input, the problems to be solved become more difficult. In particular, as reinforcement learning aims for exploration of the space in a trial-and-error fashion, the search space to be explored increases dramatically with high-dimensional input spaces. As a consequence, reinforcement learning has been mostly restricted to cases with a state space that can be—to a certain degree—explored exhaustively and therefore should not be too high dimensional [64].

Ideally, a mapping from high-dimensional sensory input to a meaningful low-dimensional state space should be learned and developed beforehand in an unsupervised fashion [65]. Different approaches have been used to accomplish such a mapping. On the one hand, generic approaches—as for example, principal component analysis [66]—aim for dimensionality as well as redundancy reduction in sensory data streams. In the area of neural networks, such representation learning methods are often based on autoencoder-like architectures that learn a low-dimensional (bottleneck) latent space that allows reproducing the original input as well as possible [67]. More recent approaches add constraints on the latent space that, for example, allow these models to be used as generative models or drive the latent space towards a disentangled representation [68]. Many such approaches do not take the temporal development of inputs into account. In contrast, there are approaches that explicitly try to exploit temporal information. Slow feature analysis (SFA), for example, assumes temporal coherence in input data [69]. It finds a mapping that extracts from high-dimensional input those features that are changing comparatively slowly. SFA further imposes a constraint that the produced features are uncorrelated and provide meaningful information about a coherent state. In this sense, SFA is not aiming for the most faithful representation that allows for reconstruction of the original data, but it tries to uncover temporally stable information on the sensed state. This induces a form of abstraction and SFA is well suited for hierarchical representations [64, 70].
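As a hedged sketch of such a bottleneck architecture (in PyTorch; input and latent sizes are arbitrary assumptions), an autoencoder compresses the high-dimensional input into a low-dimensional latent state that is trained to still allow reconstruction:

```python
import torch
import torch.nn as nn

# Sketch of autoencoder-style representation learning: a low-dimensional bottleneck
# latent space is trained so that the high-dimensional input can be reconstructed.
# Input and latent dimensions are illustrative assumptions.
class Autoencoder(nn.Module):
    def __init__(self, input_dim=100, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 32), nn.ReLU(),
                                     nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, input_dim))

    def forward(self, x):
        z = self.encoder(x)               # low-dimensional state representation
        return self.decoder(z), z

model = Autoencoder()
x = torch.randn(64, 100)                  # batch of high-dimensional sensory inputs
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)   # reconstruction objective
```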

SFA provides a first step, taking the temporal dimension into account. Features are not simply observed as snapshots in time; instead, their development over time is taken into account. Carrying this approach one step further, different algorithms have been proposed that extract features that are maximally predictable, as that is what is actually required for planning into the future [71,72,73]. Such predictive features effectively capture the dynamics of the underlying data. However, it seems that predictability does not provide more useful features than slowness, since slowness already appears to be a robust proxy for predictability [74].

As one further crucial step, following the notion of embodied cognition—as explained in the introduction and in “From Sensory Inputs Towards Disentangled Representation that Provide Meaningful Information for Behavior”—sensory features should be considered in the context of actions. In this sense, sensory features should not only be considered over time, but should be behaviorally relevant and should relate to possible actions of agents [75]. Therefore, learning of mappings from sensory input to state spaces and learning of decision-making are tightly connected and interwoven. This has led to combined and staged approaches in which a sensory mapping is first learnt inside a typical behavioral context and then utilized for learning of behavior. For example, in [75], a mapping from sensory features to a state space was learnt for an agent that was moving around in a two-dimensional environment. Again, the mapping maximized the predictability of the uncovered sensory state transitions depending on applied motor actions. In a second step, it was shown that this learnt state space facilitated learning of navigation tasks and that it further produced a well-organized spatial representation similar to place fields [76].

In contrast to such a staged approach, the recent trend in DRL has been driven largely by the success of end-to-end learning approaches in which action selection based on high-dimensional inputs is learned directly. These successes are mostly from areas with visual input in which other strong priors (such as local convolutional filters, cf. “From Sensory Inputs Towards Disentangled Representation that Provide Meaningful Information for Behavior”) are leveraged, making such a reduction into a reasonably sized state space possible. A prominent example is the seminal work on Atari games in [12]. But it has been shown that these approaches have difficulties in transferring to changes in tasks [77] or that they tend to overfit [78]. For instance, in the game Breakout, moving the controlled paddle only a little bit further upwards leads to a complete breakdown of gameplay [79]. Or, as a second example, when inspecting what was learnt in such computer games, we found that in the well-known game of Pong the agent most of the time did not focus on the moving ball [80]. This is reasonable in a deterministic setting when the agent can infer where to move and hit the ball next just from an early glimpse. But in such cases, learned decisions were based on too much detail of the environment. The agent learned that only early information was required for successful decisions in the original setting, which makes a transfer to slightly changed Pong environments impossible. These difficulties are a form of overfitting that has become a crucial problem in DRL in general, e.g., in the transfer to robot control tasks [81].

Today, DRL has turned towards better generalizing solutions. One direction taken is broadening the behavioral context, e.g., in curriculum learning [82, 83]. Another example is turning towards open-ended learning. In [84], an iCub robot was trained to learn multiple skills, initially through DRL in a curiosity-driven fashion exploring the space of possible actions. At the same time, an additional world model was used that augments the input state space of the agent. This model was learned incrementally using an extension of slow feature analysis in order to further integrate new experiences. In the end, it was demonstrated that this model provides useful abstractions of the input space and helps in learning novel tasks on the iCub robot when tested in a pick and place task. But such ongoing learning is affected by another problem in DRL: During continuous learning, the learner often deals with an only partially observable environment, or the environment is non-stationary. In such a case, the learner is trying to estimate a probability distribution that is shifting and changing the whole time. This has an impact on continuous learning as it is aiming for a “moving target.” In [85], this has been approached through parallel learning of feature representations and action policies during curiosity-driven exploration, but in their approach these were learned in a modular fashion. Learning of such modules induces a form of hierarchy (see “Hierarchical Organization and Abstraction” and “Hierarchical Organization: Overview on Abstraction Along the Temporal Dimension”) and requires explicitly addressing how to transition between different modules. In [85], this was realized through a simple gating mechanism and was demonstrated on a simulated iCub.

A second direction of research aims at explicitly learning a general representation that is informative for different behavioral contexts. Such general representations often take inspiration from biology, e.g., considering the organization of sensory systems. For example, we have demonstrated how the hierarchical organization of an artificial network supports the transfer of learned behaviors from one behavioral context into another, modeling neurobiological data from rodents and humans in a reversal learning task [86]. As another example, which we already mentioned above, a structured spatial representation can emerge in behaving agents [75] and can be found in biology in the form of place cells. It appears as a fundamental form of representation as there are distinct areas in the brains of humans and other animals (in the hippocampus) which encode information on location or heading direction at different resolutions [87]. Leveraging a hierarchical spatial representation has been shown to be advantageous for decision-making [88]. In [89], a hierarchical spatial representation was used in learning of navigation (see as well [90]). The representation consisted of place cells of different resolutions, organized at different scales and on different layers of the spatial hierarchy. In learning navigation through reinforcement learning, Llofriu et al. [89] could demonstrate that it was advantageous to learn action decisions on the different scales of the spatial representation and integrate the decisions afterwards.

To summarize, high-dimensional input spaces pose a problem for exploration-driven reinforcement learning. While applications of simple dimensionality reduction techniques have been used successfully in DRL approaches, it has been shown that the learned representations—even though some of these are only implicit in end-to-end DRL approaches—often do not generalize well to novel tasks or even slight changes of tasks. This form of overfitting hinders transfer learning and is currently one of the major obstacles for DRL. Therefore, representation learning of more general representations, but with behavioral relevance in mind, has recently drawn more attention in the area of DRL—often taking inspiration from biology, e.g., from the organizational structure of the visual system or, more recently, from hierarchical spatial representations.

Partitioning of the Action Space: Decentralized Control

Reinforcement learning explores the space of possible selections of actions over the state space. As a consequence, the number of possible combinations of actions grows exponentially with the dimensionality of the action space. If a single central controller is tasked with choosing high-dimensional actions, the space to explore can become infeasibly large (Fig. 1b shows a centralized control network which leads to an increasing number of exploration possibilities as, during exploration, the controller has to try all combinations of possible actions along the action dimensions). As an additional problem, it becomes more difficult to interpret the obtained reward as the learner faces a credit assignment problem (Fig. 1c): Which part of the high-dimensional decision helped to gain the reward and should be reinforced? This becomes even more difficult when there is only a sparse, delayed reward as the learner has to distribute reward over long time windows and multiple action dimensions. In “Sensorimotor Action Loops in Biology: Concurrent Processing on Different Temporal Scales,” we reviewed work from biology that points out the advantages of modular approaches. In modular approaches, small numbers of action dimensions are considered jointly (Fig. 1e shows two modular controllers working concurrently), constituting higher level motor primitives and inducing a form of motor hierarchy (see next section). In addition, these circuits often operate concurrently, which allows for quick reactions and fast responses that mostly rely on local sensory input and only control a low number of motor outputs as actions.
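A simple counting example (with hypothetical numbers, chosen only to illustrate the scaling) shows the effect of partitioning the action space:

```python
# Hypothetical discretization: a walker with 4 legs, 2 joints per leg,
# and 5 discrete action values per joint.
values_per_joint = 5
joints_per_leg = 2
legs = 4

centralized = values_per_joint ** (legs * joints_per_leg)    # all joints explored jointly
per_leg = values_per_joint ** joints_per_leg                 # one leg explored locally
decentralized_total = legs * per_leg                         # four local searches in parallel

print(centralized, decentralized_total)   # 390625 vs. 100 candidate actions
```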

From our point of view, this form of modularity and partitioning of the action space—as is present in biological control in animals throughout the nervous system—offers an advantageous prior that has mostly been neglected in DRL research. But how can such a modular and decentralized organization be introduced into DRL and help guide exploration with the goal of bringing down the number of required samples dramatically? As one example, in [91], we distributed the control problem for a four-legged simulated robotic walker into multiple individual control circuits, one for each leg (as indicated in Fig. 1e; for a different approach that aims for learning a general single joint controller through weight sharing, see [92]). Three different characteristics of modularity were considered: First, concurrent processing, which means that each of the four legs was controlled by a single concurrently running and learning unit. In this way, the action and output space were kept local. Second, the reward function driving RL was furthermore localized, i.e., we restricted costs to the local costs of a controller and only shared global rewards. Third, input information was modularized and restricted to local information (which relates to “Dimensionality Reduction: From Sensory Inputs to State Spaces”). We differentiated between information only from the controlled leg and information that included neighboring legs as well. As an overall result, we found that such decentralized approaches were able to learn robust walking behavior (Fig. 3). Secondly, learning did not suffer from the scaling issues found in a standard, centralized approach. In the centralized approach, learning often broke down for long periods of time (Fig. 3, blue curves, for around 500 epochs as a mean value) and the centralized approach struggled with the high-dimensional input and output spaces (even though the action space is still considerably small with only eight dimensions), as we already discussed above. Exploration often got stuck in a local optimum which simply equated to not moving at all and avoiding any costs. Escaping from this local optimum was much more difficult for the centralized approach as it required some coincidental success from one of the random exploratory movements. But these small successes could easily get lost in the complex reward that was available to the learner or in what happened in all the other degrees of freedom. In contrast, a modularized local approach did not show this scaling issue at all. In all tested cases, we found that it learned continuously and did not get stuck in the described local optimum (Fig. 3, orange curves). Learning completed much faster (converging already after around 250 epochs, in contrast to the centralized approach which required four to five times as long) and on an optimal level. One surprising finding was that the exact number of input dimensions did not appear to be a crucial factor for learning (up to a certain degree). There was not a huge difference between very small input spaces (restricted only to that particular leg) and small input spaces (that included neighboring information). When turning towards more demanding tasks, more information appeared beneficial (for more details, see [91]). Overall, the neural networks appear to interpolate quite well over the large sensory input spaces, and learning important relations in the input space is not hindered much by adding more input dimensions.
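The following sketch shows the structure of such a decentralized controller (illustrative only; the sizes, the linear stand-in policies, and the slicing of the state are assumptions and not the exact setup of [91]): each leg controller sees only its local observation slice and outputs only its own joint commands, and all controllers act concurrently.

```python
import numpy as np

# Structural sketch of decentralized control for a four-legged walker (illustrative).
# Each leg is driven by its own controller that only sees local observations and
# only outputs its own joint actions.
N_LEGS, OBS_PER_LEG, ACT_PER_LEG = 4, 6, 2   # hypothetical sizes

def local_obs(state, leg):
    """Slice of the global state belonging to one leg (local sensory inputs)."""
    return state[leg * OBS_PER_LEG:(leg + 1) * OBS_PER_LEG]

def decentralized_action(state, leg_policies):
    """Each leg controller acts concurrently; the joint action is the concatenation."""
    return np.concatenate([leg_policies[leg](local_obs(state, leg))
                           for leg in range(N_LEGS)])

# Example with random linear policies standing in for the learned networks:
rng = np.random.default_rng(0)
leg_policies = [lambda o, W=rng.normal(size=(ACT_PER_LEG, OBS_PER_LEG)): W @ o
                for _ in range(N_LEGS)]
state = rng.normal(size=N_LEGS * OBS_PER_LEG)
action = decentralized_action(state, leg_policies)   # 8-dimensional joint action
```

During learning, each of these controllers would additionally be updated from the shared global reward combined with its own local costs, as described above.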

Fig. 3

Comparison of learning in deep neural network–based controllers in reinforcement learning applied to a locomotion task for a four-legged simulated walker (OpenAI’s ant environment). Shown are learning curves for a centralized controller (blue) and a decentralized controller (orange; each leg is controlled by an independent controller). Given are ten individual traces (dashed lines) and the average trace (solid line)

It was also demonstrated that modular control architectures generalized better and were more robust when tackling novel tasks, in our case tested on quite uneven terrain. It appears that the quite detailed information available in centralized approaches leads to a form of overfitting in which relations between sensory inputs were learned that did not actually help for transfer to a slightly changed task [81]. Our simple prior—taking the body structure as a blueprint for the control architecture and for how to share information only between neighboring controllers—showed better performance than allowing a fully connected architecture to try to figure out meaningful relations on its own. Furthermore, the decentralized and modular approaches all appeared robust to hyperparameter selection for the employed neural networks. In contrast, centralized approaches required heavy hyperparameter tuning and showed breakdowns due to overfitting, for example, when the capacity of the neural networks was chosen too large.

To summarize, modularization of controllers—as found in biological systems—along input and output dimensions appears as a promising approach to increase the robustness and flexibility of agents implementing these modules [93, 94]. It addresses and circumvents the problem of scaling to high-dimensional spaces and the exponential growth of the space that would have to be explored. Furthermore, a decentralized organization—following the structure of the body—directly offers a blueprint for how to distribute costs and rewards in a meaningful way, which is advantageous for guiding learning. The concurrently operating controllers are still able to solve complex tasks, and solutions emerge from the interaction of these multiple controllers and the interaction with an environment. In this way, reliance on overly detailed models of the surroundings or dependence on sophisticated predictive models of the environment is avoided. Importantly, this has currently only been demonstrated for locomotion as a well-studied task in biology, bringing together different insights from neuroscience, behavioral analysis, as well as modeling studies. From our point of view, the next important step is to scale this up to a hierarchical structure ranging over different temporal scales as discussed in the next section.

Hierarchical Organization: Overview on Abstraction Along the Temporal Dimension

A major focus in DRL is on how to deal with exploration in large search spaces. In many real world tasks, there is only a sparse reward and reinforcement learning faces a credit assignment problem as the reward for an action is delayed [95]. Therefore, it is difficult to determine which choice of action was causing a high reward and should be reinforced for future exploitation. This problem is amplified in reinforcement learning in general, as, on a theoretical level, the overall return of a Markov decision process, collected over a long temporal horizon, is taken as the optimization criterion [57]. In this formulation, a gained return is distributed over temporal trajectories, but increasingly dampened by a discount factor. Therefore, in the case of delayed rewards, an important early decision receives only a very weak fraction of the reinforcement signal as feedback.
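In the standard discounted formulation (with discount factor $\gamma$), the return credited to the decision at time $t$ is

$$G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}.$$

If the only non-zero reward arrives $T$ steps after an early decision, its contribution to that decision's return is scaled by $\gamma^{T}$; with, for example, $\gamma = 0.99$ and $T = 500$, this factor is $0.99^{500} \approx 0.007$, so the early decision receives only a tiny fraction of the reinforcement signal.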

Neuroscientific insights into hierarchical forms of reinforcement learning [58] have inspired control architectures (Fig. 4) that introduce hierarchically nested policies working on different timescales. These have been proposed and successfully applied to address the problem of sparse rewards, initially again in the area of game playing [57] for solving adventure-style games such as the famous Montezuma’s Revenge. In this game, only sparse rewards are given to an agent after finishing long sequences of actions, e.g., finding a key and opening a door. In a hierarchical approach (Fig. 4), the problem is distributed onto different levels of a hierarchy. This hierarchy abstracts along the temporal dimension. On the lower level, individual behaviors should be learned that are conditioned on a specific goal and rewarded when reaching this goal. A higher level is tasked with sequencing these lower level building blocks by selecting a sequence of goals for the lower level. The goals function as stepping stones [96] toward the overall goal of the system, and these (sub-)goals in particular help when there is only a late reward, as on the higher level this reward only has to be distributed among the short series of sub-goals during the search for a solution.
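The following sketch illustrates the structure of such a two-level loop (interfaces, the decision period, and the intrinsic reward are illustrative assumptions, not a specific published algorithm): the high level picks a sub-goal every K steps and is trained on the accumulated, sparse environmental reward, while the low level acts at every step conditioned on the current sub-goal and receives an intrinsic reward for approaching it.

```python
import numpy as np

# Structural sketch of a two-level hierarchical RL loop (illustrative interfaces).
K = 10  # high-level decision period (hypothetical)

def hierarchical_episode(env, high_policy, low_policy, high_learner, low_learner):
    state, done = env.reset(), False
    while not done:
        goal = high_policy(state)                  # high level: choose a sub-goal
        start_state, high_reward = state, 0.0
        for _ in range(K):
            action = low_policy(np.concatenate([state, goal]))   # goal-conditioned low level
            next_state, ext_reward, done = env.step(action)
            intrinsic = -np.linalg.norm(next_state[:goal.shape[0]] - goal)  # illustrative sub-goal reward
            low_learner.update(state, goal, action, intrinsic, next_state, done)
            high_reward += ext_reward              # sparse external reward accumulates
            state = next_state
            if done:
                break
        high_learner.update(start_state, goal, high_reward, state, done)    # slow timescale update
```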

Fig. 4

Conceptual view of hierarchical reinforcement learning: In panel a, the standard view of interaction with the environment in reinforcement learning is shown, extended to a hierarchical perspective [57] as advocated, e.g., in [58]. For higher level control (shown in light green), this is in agreement with what we know about the structure of motor control in mammals (e.g., [52]) regarding descending pathways and modulation of lower level control centers (shown in green, see as well Fig. 2). In panel b, temporal abstraction is illustrated as a process over time operating on two different levels of a hierarchy. The higher level (light green) only operates on a slow timescale, being invoked only every couple of time steps. The higher level aims for a sparse environmental reward. The lower level operates on a detailed timescale at each control step and provides detailed actions. It is conditioned on the higher level output which provides a form of goal context and, during learning, can set intrinsic rewards for learning a diverse set of lower level actions. The bottom row shows the lower level control network in a hierarchical DRL approach. Only additional input dimensions for the sub-goals have been introduced on the lower level (bottom row, c, left). Selecting specific sub-goals is realized as a form of conditioning on the goal inputs to the network d. In the neural network, this is realized as a selection of a specific part of the overall manifold

In general, we are seeing more and more of such approaches turning to more complex tasks and dealing with more diverse scenarios. Hierarchical reinforcement learning appears as a promising and biologically inspired avenue [58] for learning in robotic systems [97, 98]. Most of these hierarchical approaches already deal with the problem of large sensory input spaces and tailor the state spaces for the different layers [99]. Lower levels are restricted, for example, to rely only on proprioceptive inputs in robotic tasks. This form of restricting the input space has been shown to help exploration. The resulting sensory spaces are of lower dimensionality and can be sufficiently covered in reasonable time during exploration.

But with respect to dealing with higher dimensional action spaces, most of these hierarchical approaches still learn a mapping from inputs towards all outputs on the lower level. As this action space is usually quite high dimensional, exploration and uncovering meaningful relations during reinforcement learning would still take a long time and become very expensive. In these hierarchical approaches, the higher level selects a goal for the lower level and the lower level control system is conditioned on this goal. This is realized in hierarchical deep reinforcement learning (HDRL) by enforcing the goal in a top-down manner onto the lower level as an input. The controller is therefore restricted to only that part of the input–output mapping with a matching sub-goal (Fig. 4c and d). Conditioning on sub-goals restricts the lower level controller—realized as a neural network that tries to approximate an optimal input–output mapping—to a part of the overall spanned space between inputs and outputs. This induces a form of modularization, but only with respect to sub-goals enforced by the higher level. This modularization alleviates the problem of catastrophic forgetting [100], but it does not appear efficient for learning in general as all possible input–output relations are still considered for each sub-goal, without taking meaningful prior knowledge into account and without focusing on local concurrent modules as proposed in the preceding section (“Partitioning of the Action Space: Decentralized Control”).
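A minimal sketch of this conditioning (in PyTorch; all sizes are hypothetical): the sub-goal simply enters the lower level network as additional input dimensions, so it does not select a separate module but shifts the computation to a different region of one distributed input–output mapping.

```python
import torch
import torch.nn as nn

# Sketch of a goal-conditioned lower level policy (cf. Fig. 4c and d). The sub-goal is
# appended to the observation; the same weights are used for every sub-goal, so selecting
# a goal only restricts the mapping to a region of one distributed representation.
OBS_DIM, GOAL_DIM, ACT_DIM = 30, 3, 8        # hypothetical sizes

low_level_policy = nn.Sequential(
    nn.Linear(OBS_DIM + GOAL_DIM, 64), nn.Tanh(),
    nn.Linear(64, ACT_DIM), nn.Tanh(),
)

obs, goal = torch.randn(OBS_DIM), torch.randn(GOAL_DIM)
action = low_level_policy(torch.cat([obs, goal]))   # conditioning via concatenated goal input
```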

Furthermore, there are currently a couple of drawbacks or quite specific requirements for the application of hierarchical DRL. First, an inherent problem of such approaches is the selection of suitable sub-goals and how to come up with these during learning [101]. There are areas which offer a quite natural decomposition into sub-goals that can be used as rewards for lower levels. For example, in computer games, reaching and collecting different shown objects can be used as sub-goals [57]. Or in locomotion tasks, a spatial layout (this relates to representations as described in “From Sensory Inputs Towards Disentangled Representation that Provide Meaningful Information for Behavior” and “Dimensionality Reduction: From Sensory Inputs to State Spaces”) is used for defining sub-goals, and this has been shown to help in navigation tasks and enables fast transfer to novel environments, while the lower level motion primitives can be maintained [99].

As a second, related problem, most of these approaches employ only a quite limited number of different levels—mostly restricted to two levels. How to differentiate between different levels is still an unsolved problem that hinders dealing with deeper hierarchies. This holds in particular for the selection of sub-goals that should be discovered by higher levels. While there are approaches that aim for curiosity-driven exploration of sub-goal spaces [101], this appears difficult to scale to more, and self-emerging, levels.

A third problem is given by the fixed temporal stepping used to realize temporal abstraction, which appears too rigid to account for the flexibility that we find in cognitive processing: Using such a fixed timing between different levels works well when dealing with problems that fit these defined timing patterns, like navigation between adjacent squares in a spatial map. But it fails when considering more flexible temporal relations and when dealing with sub-goals that differ with respect to the timescales they operate on.

To summarize, when learning in the real world and in high-dimensional settings, reinforcement learning must overcome severe curses of dimensionality as it is based on exploration. This can be alleviated through forms of abstraction. In the last section, we discussed hierarchical forms of DRL that introduce a form of temporal abstraction and in this way help in assigning late or delayed rewards. In “Dimensionality Reduction: From Sensory Inputs to State Spaces” (and “From Sensory Inputs Towards Disentangled Representation that Provide Meaningful Information for Behavior”), the focus was on abstraction of the input spaces, introducing forms of more and more abstract representations that should provide integrated, summarized, and behaviorally relevant information for the learner. Furthermore, in “Partitioning of the Action Space: Decentralized Control” (motivated in “Sensorimotor Action Loops in Biology: Concurrent Processing on Different Temporal Scales”), we focused on partitioning of the action spaces. From our point of view, this appears crucial as the action space in RL can be considered the search space for exploration, in which the difficulty of finding solutions increases exponentially with the dimensionality of this space. Overall, we already mentioned how these different parts are tightly interwoven and together induce forms of abstraction as they construct invariant representations. In the next section, we want to briefly summarize the different insights into recommendations for future work and how this could be realized successfully in a modular DRL architecture.

Flexibility in Modular Learning

Classical motor control theories already put an emphasis on modularity and have taken inspiration from biology. But their emphasis led to a different starting point, and from such a modularity-driven perspective two questions are usually posed: which action to select and how to perform that action. Modular approaches address both. As one example, in the influential MOSAIC approach [102, 103], different modules are used for different behaviors, and on a higher level it is selected, or in this case actually mixed, which behavior to perform. In such cases, we see a hierarchy and a clear-cut modularization, as there are distinct and limited modules which are individually responsible for well identifiable behaviors. This leads to a form of local representation for each behavior which still facilitates a form of interpolation between behaviors as a mixture of the outputs.
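The following minimal sketch captures our reading of this mixing principle (it is not the original MOSAIC implementation): each module's forward model is scored by its prediction error, and the executed command is the responsibility-weighted mixture of the commands proposed by the individual, localist controllers. All names and the softmax weighting are illustrative assumptions.

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def mosaic_like_step(state, observed_next_state, forward_models, controllers):
    # Each module's forward model predicts the consequence of the last command;
    # modules that predict well receive a high responsibility.
    errors = np.array([np.linalg.norm(observed_next_state - fm(state))
                       for fm in forward_models])
    responsibilities = softmax(-errors)
    # The executed command is the responsibility-weighted mixture of the
    # commands proposed by the individual, localist controllers.
    commands = np.array([ctrl(state) for ctrl in controllers])
    return responsibilities @ commands, responsibilities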

In the preceding sections, we have discussed for each of the three different characteristics, or dimensions, how modularity could be introduced into DRL. In the temporal dimension, we already see hierarchical approaches that deal with a form of temporal abstraction, i.e., a behavior is considered as a sequence of smaller building blocks. It is important to understand (see Fig. 4d) that in this case the underlying controller uses a distributed form of representation: selecting a specific goal on the higher level does not select a module, but rather focuses on a specific region of the mapping provided by the policy function. This fundamentally differs from the notion of modularity as used in the MOSAIC approach. In MOSAIC, localist modules with well-defined behaviors are used. In the worst case of a completely novel situation, such an approach requires mixing behaviors, which will most probably lead to non-optimal behavior (e.g., playing tennis for the first time as a squash or table tennis player). In contrast, in the distributed representation of an HDRL approach, a novel context, induced from the top layer as a novel goal, forces the policy into unexplored terrain and relies on extrapolation in the high-dimensional space (Fig. 4d). So, even though the novel goal might be specified from known sub-goals as conditions, e.g., squash or table tennis, this conditioning forces the policy away from trained behaviors. The underlying DNNs are required to extrapolate, which neural networks are known to do poorly. From our point of view, this might explain why transfer in hierarchical approaches does not work as well as hoped for. This might seem counterintuitive when coming from a control perspective that uses clear-cut modules in a hierarchy and allows for some degree of interpolation. But the distributed nature of the HDRL approach works in an entirely different way.
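For contrast, the following minimal sketch shows the distributed, goal-conditioned case: a single shared network receives the goal as part of its input, so a novel goal does not select a module but queries a new, possibly untrained region of the same mapping. The two-layer network, its random weights, and the assumed dimensionalities are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 8 + 3))   # 8 observation dims + 3 goal dims (assumed)
W2 = rng.normal(size=(4, 64))       # 4 action dims (assumed)

def goal_conditioned_policy(observation, goal):
    # The goal is simply another part of the input to one shared mapping.
    x = np.concatenate([observation, goal])
    h = np.tanh(W1 @ x)
    return np.tanh(W2 @ h)

# A goal far outside the training distribution does not select a module; it
# forces the very same network to extrapolate.
action = goal_conditioned_policy(np.zeros(8), np.array([10.0, -7.0, 3.0]))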

Nonetheless, it is generally agreed that hierarchical DRL is an important research direction that will lead to approaches that can better deal with multiple tasks and should help with transfer learning [58]. While we agree that such a form of temporal abstraction is indispensable, we want to point out that current approaches are often limited here as they simply assume fixed control steps or frequencies for each level. While this might work for particular tasks, it is not flexible enough and is too restricted to specific durations and timing patterns. From our point of view, flexibility is required along this temporal dimension, and we need more approaches that learn frequencies and timing on the different levels (for two examples, see [104, 105]).

But in order to scale DRL to more complex tasks and control problems, we advocate further exploiting modularity as an architectural principle. In the previous section, we summarized findings on how the restriction of input spaces and the use of concurrent controllers, which only address some output dimensions, can dramatically improve learning and lead to smooth convergence. The main criterion used in selecting input and output dimensions was locality, restricting control to information close by. This was motivated by biological structures, for example, the operation of reflexes or the lower level control circuits used in locomotion. This should be combined in the future with hierarchical approaches dealing with different timescales. Furthermore, we again propose that such modular approaches will require more flexibility, e.g., to select meaningful input dimensions for the different concurrently operating controllers. Using local information as a prior appears to be a good start, but fine-tuning or enriching the shared information between controllers over time is a promising direction for future research. Selecting suitable input channels faces the dilemma of when additional input helps and improves a specific behavior and when it confuses the selection or execution of a behavior. This process of narrowing down input spaces can benefit from current continual representation learning approaches or geometric deep learning approaches that learn how to optimally share information across a given graph structure and which information to share (for an example see [92]).
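As a minimal sketch of this decentralized scheme, the following code illustrates concurrent controllers that each read only a local slice of the observation and write only a subset of the action dimensions (for example, one controller per leg of a walking machine). The class, index ranges, and placeholder policies are illustrative assumptions, not a specific published implementation.

import numpy as np

class LocalController:
    def __init__(self, obs_slice, act_slice, policy):
        self.obs_slice = obs_slice   # which sensory dimensions this module reads
        self.act_slice = act_slice   # which action dimensions it writes
        self.policy = policy

    def __call__(self, observation, action):
        action[self.act_slice] = self.policy(observation[self.obs_slice])
        return action

def decentralized_step(observation, controllers, action_dim):
    # Each controller only fills in "its" action dimensions; together they
    # produce the full action vector without any controller seeing everything.
    action = np.zeros(action_dim)
    for ctrl in controllers:
        action = ctrl(observation, action)
    return action

# Example: two leg controllers, each reading 3 local sensors and driving 3 joints.
legs = [LocalController(slice(0, 3), slice(0, 3), np.tanh),
        LocalController(slice(3, 6), slice(3, 6), np.tanh)]
action = decentralized_step(np.random.randn(6), legs, action_dim=6)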

A future goal will be to combine the different architectural characteristics discussed here and allow for more flexibility in the temporal dimension as well as in the selection of inputs and outputs. This should lead, on the one hand, to transformed representations of sensory inputs that can be exploited by different downstream tasks and provide meaningful abstractions. On the other hand, a more fine-grained temporal distinction should emerge in multiple levels of a hierarchy. Currently, in hierarchical DRL, there is a lack of approaches spanning more than two (or at most three) levels of hierarchy. For one exception see [106], who argue in the same direction: for better transfer and application to multiple tasks, decomposition over multiple timescales is required, and this is currently lacking in hierarchical DRL approaches. Operating on multiple levels requires that such approaches are able to autonomously discover meaningful outputs on the different layers. Currently, these outputs are usually termed goals or sub-goals, which develop as abstractions over tasks.

In such a hierarchy of nested structures, a schematic structure of behavior should emerge and would complement temporal abstraction. Higher levels would not only be differentiated as operating on longer timescales: as they are increasingly shared between different tasks, the policies on the higher levels would also become broader, abstracting over a wider variety of tasks (consider, for example, grasping different types of objects using different types of grasps). Central will be approaches that introduce generalizations on more abstract levels, but for which general temporal properties still hold. For example, an abstract grasp representation should map to many possible variations of the exact timing with which a grasp is performed (approaching, pre-shaping, and grasping). We require a form of equivariance along this temporal abstraction dimension, i.e., it should not matter whether we consider the outcome of an action or behavior on the lower level or on a more abstract level.
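The following minimal sketch expresses this requirement in code: unfolding a plan step by step on the detailed level and then abstracting the resulting end state should agree with planning directly on the schematic level. All functions (low_level_model, schema_model, abstract) are illustrative placeholders for learned models.

def simulate_detailed(state, plan, low_level_model):
    # Unfold every fine-grained action of the plan on the detailed level.
    for action in plan:
        state = low_level_model(state, action)
    return state

def simulate_schematic(abstract_state, schemas, schema_model):
    # Plan directly in the more abstract language of schemas / (sub-)goals.
    for schema in schemas:
        abstract_state = schema_model(abstract_state, schema)
    return abstract_state

def is_equivariant(state, plan, schemas, low_level_model, schema_model, abstract):
    # abstract(.) maps a detailed state onto its schematic description; the
    # two routes should end in the same abstract state.
    detailed_end = abstract(simulate_detailed(state, plan, low_level_model))
    schematic_end = simulate_schematic(abstract(state), schemas, schema_model)
    return detailed_end == schematic_end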

Conclusion: Internal Simulation on Different Levels of Abstraction

We have argued that future architectures for DRL should be organized in a modular fashion, allowing different levels of abstraction to emerge (this includes selecting appropriate and relevant inputs and outputs for these levels and introducing meaningful intermediate representations that can be leveraged in many tasks). This also has important implications for recruiting the dynamic representations in higher level functions. In particular, it provides the opportunity for a DRL agent to create and “run” different versions of internal models as “hypotheses”, a key mechanism for “planning ahead.” Crucially, the emerging representations should provide temporal models that are equivariant with respect to the level of abstraction. On a mechanistic level, this means that it does not matter whether we consider, or mentally simulate, a sequence of detailed movements on the lowest level or on an abstract, schematic level: independent of the considered level of abstraction, the simulations should lead to the same results. Simulating a sequence in full detail along the temporal dimension and only checking whether the resulting end state fits a given goal state on the abstract level should be equivalent to simply running the simulation on a more abstract and schematic level: first transferring the current state to the relevant abstraction on a schematic level (e.g., not considering positions in space or joint orientations, but in general reachable distinct objects), and then planning on this schematic level using the more abstract language of (sub-)goals as outcomes of a schema. This introduces great flexibility into planning, as we can move through these internal simulations and unfold them or backtrack when necessary. Importantly, this does not change the overall trajectory of where the simulation and the constructed plan are leading us; unfolding only adds detail. This requires a hierarchy of learned policies and representations, as described in the previous sections, that can serve cognitive tasks: internal simulation could be realized on different levels of complexity and could be seamlessly unfolded when needed. Of note, the functional architecture proposed here for future DRL systems, inspired by recent insights into the functional organization of various biological systems, also provides a generic framework that simultaneously implements both model-based and model-free types of processes, which are classically dissociated in traditional dual process accounts of adaptive systems.