1 Introduction

Reinforcement learning (RL) has attracted a great deal of research attention owing to a learning procedure that allows agents to interact directly with the environment. As a result, an RL agent can imitate the human learning process to achieve a designated goal, i.e., the agent can carry out trial-and-error learning (exploration) and draw on “experience” (exploitation) to improve its behaviours [1, 2]. Therefore, RL is used in a variety of domains, such as IT resource management [3], cyber-security [4], robotics [5,6,7,8], control systems [9,10,11,12], recommendation systems [13], stock trading strategies [14], bidding and advertising campaigns [15], and video games [16,17,18]. However, traditional RL methods and dynamic programming [19], which use a bootstrapping mechanism to approximate the objective function, cease to work in high-dimensional environments due to limitations in memory and computational requirements. This “curse of dimensionality” poses a major challenge to the RL principle.

Figure 1 depicts an RL problem by using a Unified Modeling Language (UML) [20] sequential diagram. Specifically, the problem includes two entities: a decision maker and an environment. The environment can be an artificial simulator or a wrapper of a real-world environment. While the environment is a passive entity, the decision maker is an active entity that periodically interacts with the environment. In the RL context, the terms decision maker and agent are interchangeable, although they can be two distinct objects from a software design perspective.

Fig. 1 A UML sequential diagram to describe an RL problem

At first, the decision maker perceives a state s_t from the environment. It then uses its internal model to select a corresponding action a_t. The environment responds to the chosen action a_t by sending a numerical reward r_{t+1} to the decision maker and moving the decision maker to a new state s_{t+1}. Finally, the decision maker uses the current transition 𝜗 = {s_t, a_t, r_{t+1}, s_{t+1}} to update its decision model. This process is iterated until t equals T, where s_T denotes the terminal state of an episode. There are different methods to develop a decision model, such as fuzzy logic [21], genetic algorithms [22], or dynamic programming [23]. In this paper, however, we consider a deep neural network as the decision model.

The diagram in Fig. 1 implies that RL is capable of online learning because the model is updated continuously with incoming data. However, RL can also be performed offline via a batch learning [24] technique. In particular, the current transition 𝜗 can be stored in an experience replay [25] and retrieved later to train the decision model. The goal of an RL problem is to maximize the expected sum of discounted rewards R_t, i.e.,

$$ R_{t} = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + ... + \gamma^{T-t-1} r_{T}, $$

where γ denotes the discount factor and 0 < γ ≤ 1.
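
To make these two ingredients concrete, the following minimal Python sketch (illustrative only; the class and function names are not part of any specific library) stores transitions in a replay buffer and computes the discounted return R_t of an episode:

import random
from collections import deque

class ReplayBuffer:
    """A minimal experience replay: store transitions, sample mini-batches."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def discounted_return(rewards, gamma=0.99):
    """Compute R_t = r_{t+1} + gamma * r_{t+2} + ... for t at the start of the episode."""
    r = 0.0
    for reward in reversed(rewards):
        r = reward + gamma * r
    return r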

In 2015, Google DeepMind [26] announced a breakthrough in RL by combining it with deep learning to create an intelligent agent that can beat a professional human player in a series of 49 Atari games. The idea was to use a deep neural network with convolutional layers [27] to directly process raw images (states) of the game screen and estimate subsequent actions. The study was highly valued because it opened a new era of RL with deep learning and partially addressed the curse of dimensionality. In other words, deep learning offers a great complement to RL in a wide range of complicated applications. For instance, Google DeepMind created a program, AlphaGo, which beat the Go grandmaster Lee Sedol in a best-of-five tournament in 2016 [28]. AlphaGo is a technology-packed AI system based on Monte Carlo Tree Search [29], a hybrid network (a policy network and a value network), and a self-taught training strategy [30]. Other applications of deep RL can be found in self-driving cars [31, 32], helicopters [33], and even NP-hard problems such as the Vehicle Routing Problem [34] and combinatorial graph optimization [35].

As stated above, deep RL is crucial owing to its appealing learning mechanism and widespread real-world applications. In this study, we delve further into practical aspects of deep RL by analyzing challenges and solutions encountered while designing a deep RL-based system. Furthermore, we consider a real-world scenario where multiple agents, multiple objectives, and human-machine interactions are involved. Firstly, if we can take advantage of multiple agents to accomplish a designated task, we can shorten the wall time, i.e., the computational time needed to execute the assigned task. Depending on the task, the agents can be cooperative or competitive. In the cooperative mode, agents work in parallel or in a pipeline to achieve the task [36]. In the competitive mode, agents scramble for resources, which raises the resource-hunting problem [37]. However, perhaps counter-intuitively, competitive learning can be fruitful. Specifically, an agent is trained continually to place its opponent in a disadvantaged position and thereby improves over time. As the opponent also improves over time, this phenomenon eventually results in a Nash equilibrium [38]. Moreover, competitive learning promotes the self-taught strategy (e.g. AlphaGo) and a series of techniques such as the Actor-Critic architecture [39], opponent modeling [40], and Generative Adversarial Networks [41]. In this respect, we note the moving-target problem [42] in multi-agent systems, which describes a scenario where the decision of an agent depends on other agents, so the optimal policy becomes non-stationary [43].

Secondly, a real-world objective is often complicated as it normally comprises multiple sub-goals. The problem is straightforward if the sub-goals are non-conflicting, because they can be treated as a single composite objective. A more challenging case is when the objectives conflict. One solution is to convert a multi-objective problem into a single-objective counterpart by applying scalarization, e.g., a linear weighted sum over the individual objectives [44] or non-linear methods [45]. These approaches are categorized as single-policy methods. In contrast, multi-policy methods [46] seek multiple optimal policies at the same time. While the number of multi-policy methods is limited, they can offer powerful solutions. For instance, the Convex Hull Value Iteration algorithm [47] computes a set of objective combinations to retrieve all deterministic optimal policies. To benchmark a multi-objective method, we can find or approximate a boundary surface, namely the Pareto front, which represents the maximum attainable performance across different weights (if scalarization is used) [48]. Recent studies have integrated multi-objective mechanisms into deep RL models [49, 50].
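
As an illustration of the scalarization (single-policy) approach above, the following sketch collapses a vector of objective rewards into a single scalar reward with a fixed weight vector; the weights and reward values are hypothetical:

def scalarize(reward_vector, weights):
    """Linear scalarization: collapse multiple objectives into one reward.
    Assumes len(reward_vector) == len(weights) and the weights sum to 1."""
    return sum(w * r for w, r in zip(weights, reward_vector))

# Example: trade off speed against energy consumption (illustrative numbers).
reward = scalarize([1.2, -0.4], weights=[0.7, 0.3])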

On the other hand, human-machine interaction is another key factor in designing a usable and useful deep RL-based system. A self-driving car, for example, should accept human intervention in emergency situations [31, 32]. Therefore, it is critical to ensure a certain level of safety when designing a hybrid system in which humans and machines work together. Owing to its importance, Google DeepMind and OpenAI have presented novel ways to encourage various innovations in this area in recent years [51]. For instance, Christiano et al. [52] proposed a novel scheme that accepts human feedback during the training process. However, the method requires an operator to constantly observe the agent’s behavior, which is an onerous and error-prone task. Recent work [53] provided a practical approach by introducing a behavioral control system, which is used to control multiple agents in real time via human dictation. Table 1 summarizes key terminologies that are widely used in RL contexts.

Table 1 Key RL terminologies

In summary, our study contributes to the following aspects:

  • We present an overall picture of contemporary deep RL (Table 2). We consider state-of-the-art deep RL methods in three key aspects pertaining to real-world applications: multi-agent learning, multi-objective problems, and human-machine interaction. The paper thereby offers a checklist for software managers, a guideline for software designers, and a technical reference for software programmers.

  • We analyse challenges and difficulties in designing a deep RL-based system, contributing towards minimising possible mistakes during its development process. In other words, software designers can inherit the proposed design, foresee difficulties, and eventually expedite the entire development procedure of an RL-based system, especially in agile software development.

  • We provide the source code of our proposed framework in [54] (Table 3). Based on this template, RL beginners can design an RL method and implement it in real-world applications within a short time span. As a result, our work contributes toward promoting the use of deep RL in a wider community. A direct use case of our study is as an educational framework for demonstrating basic RL algorithms.

Table 2 Key deep RL methods in literature

The paper has the following sections. Section 2 presents a survey on state-of-the-art deep RL methods in different research directions. Section 3 describes the proposed system architecture, which supports multiple agents, multiple objectives, and human-machine interactions. Concluding remarks are given in Section 4.

2 Literature review

Table 3 Demonstration codes of different use cases [54]

2.1 Single-agent methods

The first advent of deep RL, namely the Deep Q-Network (DQN) [26], basically used a deep neural network to estimate the values of state-action pairs via a Q-value function (a.k.a. action-value function, Q(s,a)). Thereafter, a number of variants of DQN were introduced to improve the original algorithm. Typical extensions are Double DQN [64], Dueling Network [65], Prioritized Experience Replay [66], Recurrent DQN [67], Attention Recurrent DQN [59], and an ensemble method named Rainbow [68]. These methods use an experience replay to store historical transitions and retrieve them in batches to train the network. Moreover, a separate target network can be used to mitigate the correlation of sequential data and prevent the training network from overfitting.
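
The interplay between the online network and the separate target network can be sketched as follows (a simplified PyTorch-style illustration under our own naming, not the exact DQN implementation of [26]):

import torch
import torch.nn as nn

def q_learning_step(online_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN-style update on a mini-batch of transitions.
    `batch` holds tensors (states, actions, rewards, next_states, dones), where
    `actions` is an integer tensor and `dones` holds 0/1 terminal flags as floats."""
    states, actions, rewards, next_states, dones = batch
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target(online_net, target_net):
    """Periodically copy the online weights into the target network."""
    target_net.load_state_dict(online_net.state_dict())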

Instead of estimating the action-value function, we can directly approximate the agent’s policy π(s). This approach is known as the policy gradient or policy-based approach. Asynchronous Advantage Actor-Critic (A3C) [69] is one of the first policy-based deep RL methods in the literature. A3C comprises two networks: an actor network that estimates the agent policy π(s) and a critic network that estimates the state-value function V(s). Additionally, to stabilize the learning process, A3C uses an advantage function, i.e., A(s,a) = Q(s,a) − V(s). There is a synchronous version of A3C, namely A2C [69], which has the advantage of being simpler while achieving comparable or better performance. A2C mitigates the risk of multiple learners overlapping when updating the weights of the global networks.

There have been a great number of policy gradient methods since the development of A3C. For instance, UNsupervised REinforcement and Auxiliary Learning (UNREAL) [70] used multiple unsupervised pseudo-reward signals simultaneously to improve learning efficiency in complicated environments. Rather than estimating a stochastic policy, Deterministic Policy Gradient (DPG) [71] seeks a deterministic policy, which significantly reduces the amount of data sampling required. Moreover, Deep Deterministic Policy Gradient (DDPG) [72] combined DPG with DQN to enable learning a deterministic policy in a continuous action space using an actor-critic architecture. To further stabilize the training process, Trust Region Policy Optimization (TRPO) [73] integrated the Kullback–Leibler divergence [74] into the training procedure, leading to a rather complicated method. In 2017, Wu et al. [75] proposed Actor-Critic using Kronecker-Factored Trust Region (ACKTR), which applied Kronecker-factored approximate curvature to the gradient update steps. Additionally, Actor-Critic with Experience Replay (ACER) [76] was introduced to offer an efficient off-policy sampling method based on A3C and an experience replay. To simplify the implementations of TRPO, ACKTR, and ACER, Proximal Policy Optimization (PPO) [77] was proposed to exploit a clipped “surrogate” objective function together with stochastic gradient ascent. Soft Actor-Critic (SAC) [62, 63] uses the maximum entropy reinforcement learning framework, which aims to learn a stochastic policy that maximizes both the reward and the entropy. In other words, SAC learns an actor that succeeds at the task but is as random as possible. Concurrently, Twin-Delayed DDPG (TD3) [61] focuses on fixing the value overestimation issue of DDPG by introducing three tricks: clipped double-Q learning, policy update delays, and target policy smoothing. SAC and TD3 are quite comparable, and they are currently considered state-of-the-art methods in continuous control domains. Some studies combined policy-based and value-based methods, such as [78,79,80], or on-policy and off-policy methods, such as [81, 82]. Table 2 presents a summary of key deep RL methods and their implementations. Based on the specific application domain, software managers can select a suitable deep RL method to act as a baseline for the target system.
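
As an illustration of PPO’s clipped “surrogate” objective mentioned above, the following sketch clips the probability ratio between the new and old policies before averaging (simplified; the full PPO loss also includes value and entropy terms):

import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO, written as a loss to be minimized."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()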

Recently, many studies have focused on efficient training and state-of-the-art performance in various tasks and settings in the field of RL. For instance, D4PG (Distributed Distributional Deep Deterministic Policy Gradients) [83] was proposed to improve the performance of RL in a wide array of control tasks, such as robotics control with a finite number of discrete actions. To efficiently train agents in a scalable RL system, Espeholt et al. introduced a fast and low-cost RL algorithm, namely SEED RL (Scalable and Efficient Deep RL), which can train on a very large number of environment frames per second [84]. Moreover, a self-predictive representation learning algorithm [85] was proposed for RL to exploit data augmentation and a future prediction objective. Using the algorithm, agents can learn future latent state representations with limited interactions.

2.2 Multi-agent methods

In multi-agent learning, two schemes are widely used in the literature: individual and mutual. In the individual scheme, each agent in the system is considered an independent decision maker, while the other agents are treated as part of the environment. In this way, any of the deep RL methods in the previous subsection can be used in multi-agent learning. For instance, Tampuu et al. [86] used DQN to create an independent policy for each agent. The behavioural convergence of the involved agents was analysed with respect to cooperation and competition. Similarly, Leibo et al. [37] introduced a sequential social dilemma, which basically used DQN to analyze the agents’ strategies in Markov games such as Prisoner’s Dilemma, Fruit Gathering, and Wolfpack. However, the approach limits the number of agents because the computational complexity grows with the number of policies. To overcome this obstacle, Nguyen et al. developed a behavioural control system [53] that allows homogeneous agents to share the same policy. As a result, the method is robust and scalable. Another problem in multi-agent learning is the use of an experience replay, which amplifies the non-stationarity problem due to asynchronous data sampling of different agents [43]. A lenient approach [42] can subdue the problem by mapping transitions into decaying temperature values, which basically controls the magnitude of updates to different policies.

In the mutual scheme, agents can “speak” to each other via an established communication channel. While agents are often trained in a centralized manner, they eventually operate in a decentralized fashion [87]. In other words, a multi-agent RL problem can be divided into two sub-problems: a goal-directed problem and a communication problem. For instance, Multi-agent DDPG (MADDPG) [88] was proposed to employ DDPG in a multi-agent environment. Separately, Foerster et al. [89] introduced two communication schemes based on the centralized-decentralized rationale: Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL). While RIAL reinforces agent learning by sharing parameters, DIAL allows inter-communication between agents via a shared medium. Both methods, however, operate with a discrete number of communication actions. As opposed to RIAL and DIAL, Communication Neural Net (CommNet) [90] enables communication using a continuous vector, so agents are trained to communicate by backpropagation. However, CommNet limits the number of agents due to computational complexity. To make the approach scalable, Gupta et al. [91] introduced a parameter-sharing method to handle a large number of agents, although the method only works with homogeneous systems. Nguyen et al. [53] extended the study in [91] to heterogeneous systems by designing a behavioral control system. For further reading, comprehensive reviews on multi-agent RL can be found in [92, 93].

In recent years, many algorithms have been introduced for multi-agent RL (MARL). Shu and Tian proposed M3RL (Mind-aware Multi-agent Management RL) to achieve optimal collaboration among agents by training a “super” agent to manage member agents [94]. The super-agent was trained in both policy learning and agent modeling, i.e., by combining imitation learning and RL. As a result, the super-agent could assign members to suitable tasks and maximise the overall productivity while minimizing the payments for rewarding them with bonuses. Yang et al. presented CM3 (Cooperative Multi-goal Multi-stage MARL), which uses a novel multi-stage curriculum to learn both individual goal attainment and collaboration in MARL systems [95]. In CM3, an augmentation function is used to bridge the value function and the policy function across the multi-stage curriculum. Another approach, Evolutionary Population Curriculum (EPC), was proposed by Long et al. [96] to learn well-balanced policies in large-scale MARL systems. Additionally, in many real-world MARL systems, communication between agents is required to make sequential decisions in fully collaborative multi-agent tasks. Kim et al. presented SchedNet for MARL systems to schedule inter-agent communication when the communication bandwidth is limited and the medium is shared among agents [97]. In cooperative MARL systems, common knowledge between the agents is critical for coordinating agent behaviours. Therefore, Schroeder de Witt et al. proposed MACKRL (Multi-Agent Common Knowledge RL) to learn a hierarchical policy tree [98]. MACKRL helps multiple agents learn a decentralized policy by exploring and exploiting commonly available knowledge. However, exploration in MARL is a challenging problem. Christianos et al. proposed SEAC (Shared Experience Actor-Critic) for MARL to combine the gradient information of agents and share experience among agents in an actor-critic architecture [99]. Recently, to factorize the joint value function and overcome the scalability limitation of value-based MARL, a duplex dueling multi-agent Q-learning method, namely QPLEX [100], was proposed to fully support centralized training and decentralized execution in MARL systems.

In summary, it is critical to address the following factors in multi-agent learning as they significantly impact on the target software architecture:

  • It is preferable to employ a centralized-decentralized rationale in a multi-agent RL-based system because the training process is time-consuming and computationally expensive. A working system may require hundreds to thousands of training sessions to search through the hyper-parameter space for an optimal solution.

  • Communication between agents can be realistic or imaginary. In realistic communication, agents “speak” to each other using an established communication protocol. In imaginary communication, there is no actual channel; agents are trained to collaborate using a specialized network architecture. For instance, OpenAI [88] proposed an actor-critic architecture in which the critic is augmented with the other agents’ policy information. The choice between these two modes affects how an RL-based system should be designed.

  • A partially observable environment has a great impact on designing a multi-agent system because each agent has its own unique perspective of the environment. Therefore, it is important to first carefully examine the environment and application type to avoid any malfunction in the design.

2.3 Meta-RL methods

Meta reinforcement learning, or Meta-RL, employs the principle of meta-learning in RL. The major difference from traditional RL is that the last reward r_{t−1} and the last action a_{t−1} are incorporated into the policy observation. The key components of Meta-RL are a model with memory (e.g., an RNN (Recurrent Neural Network) or LSTM (Long Short-Term Memory)), a meta-learning algorithm, and a distribution of MDPs. Meta-RL aims to help agents adapt to new tasks using a small amount of experience.
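
The observation augmentation described above can be sketched as follows; the one-hot encoding and variable names are illustrative assumptions, and the resulting vector would be fed into a recurrent (memory-based) policy:

import numpy as np

def meta_rl_input(state, last_action, last_reward, num_actions):
    """Augment the observation with a_{t-1} (one-hot) and r_{t-1} before
    feeding it to a memory-based policy."""
    one_hot = np.zeros(num_actions, dtype=np.float32)
    one_hot[last_action] = 1.0
    return np.concatenate([state, one_hot, [last_reward]]).astype(np.float32)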

Wang et al. [101] presented a novel approach named Deep Meta-RL based on meta-learning with an RNN. In Deep Meta-RL, agents learn new tasks rapidly by acquiring knowledge from previous experience. Additionally, learning from sparse and underspecified rewards is a challenging problem in RL, e.g., when an agent is required to observe a complex state and provide sequential actions simultaneously. To overcome this problem, a meta reward learning algorithm leverages meta-learning and a Bayesian strategy to optimise an auxiliary reward function [102]. Although deep Meta-RL algorithms help agents learn new tasks rapidly from a small amount of experience, the lack of a mechanism to account for task uncertainty in sparse-reward problems remains unsolved. An off-policy Meta-RL algorithm using probabilistic context variables, named PEARL (Probabilistic Embeddings for Actor-critic Reinforcement Learning) [103], was designed to improve meta-training efficiency. Another challenge in Meta-RL is the chicken-and-egg problem, which occurs when agents are required to explore and exploit relevant information in end-to-end training. Liu et al. proposed DREAM, which decouples exploration and exploitation in Meta-RL, to overcome this canonical problem and avoid local optima [104]. However, existing Meta-RL algorithms are compromised when the rewards are sparse. To solve this problem, a meta-exploration method, namely Hyper-state Exploration (HyperX) [105], was introduced to learn approximate exploration strategies based on a Bayesian RL model.

In MARL, agents interact with each other by cooperating or competing to maximize their return and the information gained from all agents. A framework named IBRL (Interactive Bayesian RL) has been used to find adaptive policies under uncertain scenarios with prior beliefs. However, the framework was intractable in most settings and restricted to lightweight tasks or simple agent models. Therefore, Zintgraf et al. proposed a meta-learning deep IBRL method for MARL to overcome this limitation [106]. Recently, many exploratory deep RL algorithms have been proposed based on task-agnostic objectives. However, it is necessary to learn effective exploration from prior experience. Gupta et al. presented MAESN (Model Agnostic Exploration with Structured Noise), which uses gradient-based meta-learning and a learned latent exploration space [107]. Another limitation of Meta-RL is that most existing Meta-RL methods are sensitive to the distribution shift of a testing task, leading to performance degradation. In this respect, a model-based adversarial Meta-RL method [108] was proposed to overcome this issue.

2.4 RL challenges

In this subsection, we discuss the major challenges in designing a deep RL-based system and the corresponding solutions. Although we keep the discussion concise, the proposed framework is not hampered by these limitations, as it is straightforward to extend the proposed architecture to support the corresponding rectifications.

Catastrophic forgetting is a problem in continual learning and multi-task learning. Consider a scenario where a network is trained on a first task and subsequently trained on a second task: the neural network gradually forgets the knowledge of the first task as it adapts to the new one. One solution is to use regularization [109, 110] or a dense neural network [111, 112]. However, these approaches are only feasible with a limited number of tasks. Recent studies have introduced more scalable approaches, such as Elastic Weight Consolidation (EWC) [113] and PathNet [114]. While EWC finds a network configuration that yields the best performance across the different tasks, PathNet uses a “super” neural network to store the knowledge of different tasks in different paths.

Policy distillation [115] or transfer learning [116, 117] can be used to train an agent on individual tasks and collectively transfer the knowledge to a single network. Transfer learning is often used when the actual experiment is expensive or intractable: the network is trained in simulation and later deployed in the target experiment. However, negative transfer may occur, where the performance of the learner ends up lower than that of the trainer. In this respect, Hierarchical Prioritized Experience Replay [116] was introduced to use high-level features of a task and select important data from the experience replay, mitigating negative transfer. One recent study [118] proposed a mutual-learning principle to achieve comparable performance between the actual experiment and simulations.

Another obstacle in RL is dealing with a long-horizon environment with sparse rewards. In such tasks, the agent hardly receives any reward and can easily become trapped in locally optimal solutions. One solution is reward shaping [119], which continuously guides the agent towards the objective. The problem can also be divided into a hierarchical tree of sub-problems, where the parent problem has a higher level of abstraction than the child problem (hierarchical RL) [120]. To encourage self-exploration, intrinsic reward signals can be introduced to reinforce the agent to make generic decisions [121]. State-of-the-art methods of intrinsic motivation can be found in [122,123,124]. Andrychowicz et al. [125] proposed Hindsight Experience Replay, which implicitly simulates curriculum learning [126] by creating imagined trajectories in the experience replay with positive rewards. In this way, an agent can learn from failures and automatically generalize a solution to successful cases.
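
The reward shaping idea mentioned above is often implemented in a potential-based form, which is known to leave the optimal policy unchanged; the potential function in this sketch is a hypothetical placeholder:

def shaped_reward(reward, state, next_state, potential, gamma=0.99, done=False):
    """Potential-based reward shaping: r' = r + gamma * phi(s') - phi(s).
    `potential` is a user-supplied function estimating progress towards the goal."""
    next_potential = 0.0 if done else potential(next_state)
    return reward + gamma * next_potential - potential(state)

# Example with a hypothetical goal-distance potential:
# potential = lambda s: -distance_to_goal(s)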

A variety of RL-related methods have been proposed to make RL feasible in large-scale applications. One approach is to augment the neural network with a “memory” to enhance sample efficiency in complicated environments [127, 128]. Additionally, to enforce scalability, many distributed methods can be employed, such as Distributed Experience Replay [129], deep RL acceleration [130], and distributed deep RL [131]. In addition, imitation learning can be used together with inverse RL to accelerate training by directly learning from expert demonstration and extracting the expert’s cost function [132].

2.5 Deep RL framework

In this subsection, we discuss the latest deep RL frameworks in the literature. We select the libraries based on different factors including Python-based implementation, clear documentation, reliability, and active community. Based on our analysis, software managers can select a suitable framework depending on the project requirements.

  • Chainer – Chainer [58] is a powerful and flexible framework for neural networks, currently supported by IBM, Intel, Microsoft, and Nvidia. It provides easy ways to manipulate neural networks, such as creating a customized network, visualizing a computational graph, and supporting a debug mode. It also implements a variety of deep RL methods. However, the Chainer architecture is complicated, and it requires great effort to develop a new deep RL method with it. The number of integrated environments is also limited, e.g., Atari [133], OpenAI Gym [134], and Mujoco [135].

  • Keras-RL – Keras-RL [136] is a friendly deep RL library, which is recommended for deep RL beginners. However, the library provides a limited number of deep RL methods and environments.

  • TensorForce – TensorForce [137] is an ambitious project that targets both industrial applications and academic research. The library has the best modular architecture we have reviewed so far. Therefore, it is convenient to use the framework to integrate customized environments, modify network configurations, and manipulate deep RL algorithms. However, the framework has a deep software stack (“pyramid” model) that includes many abstraction layers, as shown in Fig. 2. This hinders novice users in prototyping a new deep RL method.

  • OpenAI Baselines – OpenAI Baselines [57] is a high-quality framework of contemporary deep RL. In contrast to TensorForce, the library is suitable for researchers who want to reproduce original results. However, OpenAI Baselines is unstructured and incohesive. Moreover, the codebase is no longer maintained by OpenAI.

  • Stable-Baselines and Stable-Baselines3 – Stable-Baselines [138] started as a set of improved, Tensorflow-based implementations of the RL algorithms in OpenAI Baselines and gradually grew into a reliable source of baseline RL algorithms. The project has since been re-implemented in Pytorch and is actively maintained under a different repository, Stable-Baselines3 [139]. The advantage of this new codebase is the ease of modifying existing algorithms thanks to its modularized code. Moreover, because the project is actively maintained, the authors are very responsive in answering questions.

  • RLLib – RLLib [56] is a well-designed deep RL framework that allows deep RL to be deployed in distributed systems. However, RLLib is not friendly for RL beginners.

  • RLLab – RLLab [140] provides diverse deep RL models including TRPO, DDPG, the Cross-Entropy Method, and Evolutionary Strategies. While the library is friendly to use, it is not straightforward to modify.

  • PettingZoo – PettingZoo [141] is a Python library that supports a wide range of MARL environments and is accessible to both academic and non-expert researchers. However, it requires other languages to support games with more than 10,000 agents. Moreover, it does not support competitive games in which different agents compete with each other.

  • MAgent – MAgent [142] is a MARL platform that supports researchers to develop artificial collective intelligence at both individual agent and society levels. MAgent has limited algorithms and does not support continuous environments.

  • Acme – Acme is a framework for distributed RL proposed by DeepMind [143]. It is a modular, lightweight tool that helps researchers re-implement RL algorithms in both research and industrial environments. It also supports training RL algorithms in both single-actor and distributed paradigms.

  • Megaverse – Megaverse [144] is the first immersive 3D simulation framework for embodied agents and RL that supports multiple agents in immersive environments, with more than 1,000,000 actions per second on a single 8-GPU node. However, the framework still requires extensive computational resources, which limits its goal of democratizing deep RL research.

  • Tianshou - Tianshou [145] is a modularized Pytorch codebase with friendly APIs. Besides supporting standard algorithms, this codebase supports memory-based agents needed to tackle partially observable MDPs (POMDPs).

In summary, most frameworks focus on the performance of deep RL methods. As a result, those frameworks sacrifice code legibility, which restricts RL users in terms of readability and modification. In this paper, we propose a comprehensive framework that has the following properties:

  • Allow new users, including novice developers, to prototype a deep RL method in a short period of time by following a modular design. As opposed to TensorForce, we limit the number of abstraction layers and avoid the pyramid structure.

  • The framework is friendly with a simplified user interface. We provide an API based on three key concepts: policy network, network configuration, and learner.

  • Scalability and generalization are realised in our framework, which supports multiple agents, multiple objectives, and human-machine interactions.

  • A concept of unification and transparency is introduced by creating plugins. Plugins are gateways that extract learners from other libraries and plug them into our proposed framework. In this way, users can interact with different frameworks using the same interface.

Fig. 2 A “pyramid” software architecture

3 A prospective RL software architecture

In this section, we examine the core components in designing a comprehensive deep RL framework that emphasizes generality, flexibility, and interoperability. We aim to support a broad range of RL-related applications that involve multiple agents, multiple objectives, and human-agent interaction. We use the following pseudocode convention to describe a function signature:

function_name([p1, p2, ..., pN]) → A, [...], or {...}

where → denotes a return operation, A is a scalar value, [...] denotes an array, and {...} denotes a list of possible values of a single variable.

3.1 Environment

First, we create a unique interface for the environment to establish a communication channel between the framework and the agents. To reduce complexity, however, we put any human-related communication into the environment. As a result, human interaction is treated as part of the environment and is hidden from the framework, i.e., the environment provides two interfaces: one for the framework and one for humans, as shown in Fig. 3. While the framework interface is usually at the programming level (functions), the human interface operates at a higher level of abstraction, mostly in human-understandable forms such as voice dictation, gesture recognition, or a control system.

Fig. 3 A conceptual model of the environment with a human interface

The environment framework interface provides the following functions:

  • clone(): the environment can duplicate itself. The function is useful when an RL algorithm requires multiple learners simultaneously (e.g. A3C).

  • reset(): resets the environment to its initial state. The function must be called before an episode starts (or after an episode ends).

  • step([a1,a2,...,aN]) → [r1,r2,...,rM]: executes the N specified actions of the N agents in the environment. The function returns M rewards, each of which corresponds to an objective function.

  • get_state() → [s1,s2,...,sN]: retrieves the current states of the environment. If the environment is a partially observable MDP, the function returns N states, each of which represents the current state of an agent. However, if the environment is a fully observable MDP, we have s1 = s2 = ... = sN = s.

  • is_terminal() → {True, False}: checks whether an episode is terminated.

  • get_number_of_objectives() → M: is a helper function that indicates the number of objectives in the environment.

  • get_number_of_agents() → N: is a helper function that indicates the number of agents in the environment.
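
A minimal Python sketch of this environment interface could look as follows (an abstract skeleton only; concrete environments supply the actual logic):

from abc import ABC, abstractmethod

class Environment(ABC):
    """Framework-facing environment interface for N agents and M objectives."""

    @abstractmethod
    def clone(self):
        """Return a duplicate of the environment (used by multi-learner methods)."""

    @abstractmethod
    def reset(self):
        """Reset the environment to its initial state."""

    @abstractmethod
    def step(self, actions):
        """Apply [a1, ..., aN] and return the reward vector [r1, ..., rM]."""

    @abstractmethod
    def get_state(self):
        """Return the current states [s1, ..., sN]."""

    @abstractmethod
    def is_terminal(self):
        """Return True if the current episode has terminated."""

    @abstractmethod
    def get_number_of_objectives(self):
        """Return M, the number of objectives."""

    @abstractmethod
    def get_number_of_agents(self):
        """Return N, the number of agents."""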

In addition, it is important to consider the following questions during the design of an environment component, which has a significant impact on subsequent design stages:

  • Is it a simulator or a wrapper? In the case of a wrapper, the environment has already been developed and configured; our task is then to develop a wrapper interface that can interact compatibly with the framework. In contrast, developing a simulator is complicated and requires expert knowledge. In real-time applications, we typically first develop a simulator in C/C++ (for better performance) and then create a Python wrapper interface (for ease of integration); in that case, we need to develop both the simulator and the wrapper.

  • Is it stochastic or deterministic? A stochastic environment is more challenging to implement than a deterministic one, and there are many potential factors that contribute to the randomness of the environment. Consider an example where a company runs a bicycle rental service in which N bicycles are distributed equally across M potential locations. At a specific time, location A may accumulate many bicycles due to limited customers, so bicycles in location A are delivered to other locations with higher demand. The company seeks an algorithm that can balance the number of bicycles in each place over time. This is an example of a stochastic environment. We can start by building a simple stochastic model based on a Poisson distribution to represent the bicycle demand in each place, and gradually extend it into a more elaborate model based on a set of observable factors such as rush hour, weekends, or festivals (a minimal sketch of such a demand model is given after this list). Depending on the stochasticity of the model, we can decide whether to use a model-based or model-free RL method.

  • Is it complete or incomplete? A complete environment provides sufficient information at any time to construct a series of possible future moves (e.g. Chess or Go). Completeness of information helps determine an effective RL method; e.g., a complete environment can be solved with careful planning rather than a trial-and-error approach.

  • Is it fully observable or partially observable? Observability of an environment is essential when designing a deep neural network. A partially observable environment requires recurrent layers or an attention mechanism to enhance the network capacity during training. As an example, a self-driving scenario is partially observable while a board game is fully observable.

  • Is it continuous or discrete? As described in Table 1, this factor is important to determine the type of methods used, such as policy-based or value-based methods, as well as network configurations, such as actor-critic architectures.

  • How many objectives? Real-world applications often have multiple objectives. If the importance weights between objectives can be identified up front, it is reasonable to use single-policy RL methods. Otherwise, a multi-policy RL method can prioritize the importance of an objective in real time.

  • How many agents? A multi-agent RL-based system is more complicated than a single-agent counterpart. Therefore, it is essential to analyze the following factors of a multi-agent system before delving into the design: the number of agents, the type of agents, communication capabilities, cooperation strategies, and competitive potential.
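
As a minimal sketch of the stochastic bicycle-rental model discussed above (the rates and location count are purely illustrative), demand at each location can be drawn from independent Poisson distributions:

import numpy as np

rng = np.random.default_rng(0)

def sample_demand(hourly_rates):
    """Sample bicycle demand per location from independent Poisson distributions.
    `hourly_rates` holds the expected number of rentals per location per hour."""
    return rng.poisson(lam=hourly_rates)

# Illustrative rates for three locations; a richer model could modulate these
# rates using observable factors such as rush hour, weekend, or festival flags.
demand = sample_demand(np.array([4.0, 1.5, 7.0]))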

3.2 Network

The neural network is a key module of our proposed framework; it includes a network configuration and a policy network, as illustrated in Fig. 4. A network configuration defines the deep neural network architecture (e.g. a CNN (Convolutional Neural Network) or LSTM), the loss functions (e.g. Mean Square Error or Cross-Entropy Loss), and the optimization methods (e.g. Adam or SGD). Depending on the project’s requirements, a configuration can be divided into different abstraction layers, where a lower abstraction layer serves as a mapping layer for the higher abstraction layer. At the lowest abstraction level (programming language), a configuration is implemented with a deep learning library, such as Pytorch [146] (with a dynamic graph) or TensorFlow (with a static graph). The next layer uses a scripting language, such as xml or json, to describe the network configuration. This level is useful because it provides a faster and easier way to configure a network setting. For users with limited knowledge of implementation details, such as system analysts, an intuitive and easy-to-use graphical user interface is useful. However, there is a trade-off: a higher abstraction layer achieves better usability and productivity but entails a longer development cycle.

Fig. 4 A neural network module includes a network configuration and a policy network
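
To illustrate the layered configuration idea above, the following hedged sketch maps a script-level (json-style) description onto a concrete Pytorch model; the configuration keys and layer names are assumptions made for this example:

import torch.nn as nn

# Script-level description; it could equally be loaded from a json or xml file.
config = {
    "layers": [
        {"type": "linear", "in": 4, "out": 64},
        {"type": "relu"},
        {"type": "linear", "in": 64, "out": 2},
    ],
    "loss": "mse",
    "optimizer": {"name": "adam", "lr": 1e-3},
}

def build_network(cfg):
    """Map the script-level configuration onto a concrete Pytorch module."""
    modules = []
    for layer in cfg["layers"]:
        if layer["type"] == "linear":
            modules.append(nn.Linear(layer["in"], layer["out"]))
        elif layer["type"] == "relu":
            modules.append(nn.ReLU())
    return nn.Sequential(*modules)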

A policy network is a composite component comprising a number of network configurations. The dependency between a policy network and a configuration can be weak, i.e., an aggregation relationship. The policy network’s objective is twofold: it provides a high-level interface that maintains connectivity with other modules in the framework, and it initializes the network, saves the network parameters into checkpoints, and restores the network parameters from checkpoints. The neural network interface provides the following functions:

  • create_network() → [𝜃1,𝜃2,...,𝜃K]: instantiates a deep neural network by using a set of network configurations. The function returns the network parameters (references) 𝜃1,𝜃2,...,𝜃K.

  • save_model(): saves the current network parameters into a checkpoint file.

  • load_model([chk]): restores the current network parameters from a specified checkpoint file chk.

  • predict([s1,s2,...,sN]) → [a1,a2,...,aN]: given the current states of N agents s1,s2,...,sN, the function uses the network to predict the next N actions a1,a2,...,aN.

  • train_network([data_dict]): trains the network by using the given data dictionary. The data dictionary often includes the current states, current actions, next states, terminal flags, and miscellaneous information (global time step or objective weights) of N agents.
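
A hedged Python sketch of the policy network interface above, backed by Pytorch, might look as follows (checkpoint handling and action selection are deliberately simplified):

import torch

class PolicyNetwork:
    """High-level wrapper around one or more network configurations."""

    def __init__(self, config_builders):
        # Each builder is a callable that maps a configuration onto an nn.Module.
        self.config_builders = config_builders
        self.networks = []

    def create_network(self):
        """Instantiate the networks and return their parameter references."""
        self.networks = [build() for build in self.config_builders]
        return [list(net.parameters()) for net in self.networks]

    def save_model(self, chk="model.pt"):
        torch.save([net.state_dict() for net in self.networks], chk)

    def load_model(self, chk="model.pt"):
        for net, state in zip(self.networks, torch.load(chk)):
            net.load_state_dict(state)

    def predict(self, states):
        """Greedy discrete-action selection for each agent's state (illustrative)."""
        with torch.no_grad():
            return [self.networks[0](s).argmax(dim=-1).item() for s in states]

    def train_network(self, data_dict):
        raise NotImplementedError  # the algorithm-specific update goes here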

3.3 Learner

The last key module of our proposed framework is the learner, as shown in Fig. 5. While the environment module and the network module create the application shell, the learner acts as the engine that makes the system operate properly. These three modules jointly create the backbone of the system. In particular, the learner uses the environment module to generate episodes, manages the experience replay memory, and defines RL implementation details such as multi-step learning, multi-threading, or reward shaping. The learner is often created together with a monitor. The monitor manages multiple learners (if multi-threading is used) and collects data from the learners during training, such as performance information for debugging purposes and post-evaluation reports. The learner collects the necessary data and packs them into a dictionary before sending them to the network module for training.

Fig. 5 A high-level design of a learner module

Additionally, a factory pattern [147] can be used to hide the operation details between the monitor and the learner. As a result, the factory component promotes higher abstraction and usability through a simplified user API, as follows:

  • create_learner([monitor_dict, learner_dict]) → obj: The factory creates a learner by using the monitor’s data dictionary (such as batch size, the number of epochs, and the report frequency) and the learner’s data dictionary (the number of threads, epsilon values, reward clipping thresholds, etc.).

  • train(): trains the generated learner.

  • evaluate(): evaluates the generated learner.
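
A hedged sketch of this factory-based user API is given below; the registry, the dictionary keys, and the A2CLearner class in the usage comment are hypothetical:

class LearnerFactory:
    """Hides the construction details of learners behind a simple API (factory pattern)."""

    def __init__(self, registry):
        self.registry = registry  # maps method names to learner classes

    def create_learner(self, name, monitor_dict, learner_dict):
        learner_cls = self.registry[name]
        return learner_cls(**monitor_dict, **learner_dict)

# Hypothetical usage, assuming an A2CLearner class has been registered:
# factory = LearnerFactory({"a2c": A2CLearner})
# learner = factory.create_learner("a2c",
#                                  {"batch_size": 32, "report_frequency": 10},
#                                  {"num_threads": 8, "reward_clip": 1.0})
# learner.train()
# learner.evaluate()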

3.4 Plugin

Many RL methods are available in the literature, and it is impractical to implement all of them. However, we can reuse implementations from existing libraries such as TensorForce, OpenAI Baselines, or RLLab. To ensure flexibility and interoperability, we introduce the concept of unification by using plugins. A plugin is a piece of program that extracts learners or network configurations from third-party libraries and plugs them into our framework. As a result, the integrated framework provides a single user API that supports a variety of RL methods, so users do not need to learn different libraries. The concept of unification is described in Fig. 6.

Fig. 6 A unification of different RL libraries by using plugins

A plugin can also act as a conversion program that converts the environment interface of one library into the environment interface of another library. As a result, the proposed framework can work with any environment in third-party libraries and vice versa. Therefore, a plugin should provide the following functions:

  • convert_environment([source_env]) → target_env: converts the environment interface from the source library to the environment interface defined in the target library.

  • extract_learner([param_dict]) → learner: extracts the learner from the target library.

  • extract_configuration([param_dict]) → config: extracts the network configuration from the target library.
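
For example, a plugin wrapping a classic OpenAI Gym environment behind the interface of Section 3.1 could be sketched as follows (it assumes the pre-0.26 Gym API, i.e., reset() → obs and step(a) → (obs, reward, done, info)):

class GymEnvironmentPlugin:
    """Adapter exposing a classic gym.Env through the framework's environment interface."""

    def __init__(self, gym_env):
        self.env = gym_env
        self.state = None
        self.done = False

    def reset(self):
        self.state = self.env.reset()
        self.done = False

    def step(self, actions):
        # Single-agent, single-objective wrapper: one action in, one reward out.
        self.state, reward, self.done, _ = self.env.step(actions[0])
        return [reward]

    def get_state(self):
        return [self.state]

    def is_terminal(self):
        return self.done

    def get_number_of_agents(self):
        return 1

    def get_number_of_objectives(self):
        return 1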

3.5 Overall structure

Assembling everything together, we obtain a sequential diagram of the training process, as described in Fig. 7. The workflow divides the training process into smaller procedures. Firstly, the factory instantiates a specified learner (or a plugin) and sends its reference to the monitor. The monitor clones the learner into multiple learner threads. Each learner thread is executed until the number of epochs exceeds a predefined threshold K. The second loop within a learner thread is used to generate episodes. In each episode, the learner thread perceives the current states of the environment and predicts the next actions using the policy network and the network configuration. The next actions are applied to the environment, which returns the next states and a terminal flag. The policy network is trained every L steps. There are minor changes in the evaluation process, as shown in Fig. 8: firstly, the policy network’s parameters are restored from a specified checkpoint file when initializing the learner; secondly, all training procedure calls are discarded when generating the episodes.

Fig. 7 A UML sequential diagram of the training process

Fig. 8 A UML sequential diagram of the evaluation process

To enhance usability and reduce redundancy, it is advisable to implement the framework using Object-Oriented Programming (OOP). In this way, a new learner (or configuration) can easily be developed by inheriting from existing learners (configurations) in the framework, as shown in Fig. 9.

Fig. 9 An inheritance relationship between learners and configurations
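
For instance, a new learner can be derived from an existing one by overriding only the parts that change; the class names below are illustrative:

import random

class QLearner:
    """Base learner: epsilon-greedy action selection plus a standard update rule."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon

    def select_action(self, q_values):
        if random.random() < self.epsilon:
            return random.randrange(len(q_values))
        return max(range(len(q_values)), key=q_values.__getitem__)

class PrioritizedQLearner(QLearner):
    """Derived learner: inherits action selection, changes only how samples are drawn."""

    def sample_batch(self, replay, batch_size):
        # Prioritized sampling would go here; everything else is inherited.
        raise NotImplementedError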

4 Conclusions

In this paper, we have presented a review on recent advances in the RL literature with respect to multi-agent learning, multi-objective learning, and human-machine interaction. We have also examined different deep RL libraries and analysed their limitations. Importantly, we have proposed a novel deep RL framework that offers usability, flexibility, and interoperability for developing RL-based systems. We have highlighted the key concerns so that software managers can avoid possible mistakes in designing an RL-based application.

The proposed framework serves as a generic template for designing and implementing real-world RL-based applications. Because the framework is developed in OOP, it is beneficial to utilize OOP principles, such as inheritance, polymorphism, and encapsulation, to expedite the development process. We have created a flexible software layer stack in which the number of modules is minimal while a certain level of cohesion is maintained. As a result, the learning curve is not steep. By providing a simplified API, the framework is suitable for novice developers who are new to designing deep RL models, especially software engineers. The proposed framework also acts as a bridge connecting different RL communities. Future developments of the framework include employing it as an educational RL platform in universities and running a pilot program that uses the framework to demonstrate basic RL algorithms. A visual module is also developed to serve these purposes.