Multi-agent reinforcement learning for character control

Simultaneous control of multiple characters has been extensively researched for computer games and computer animation, with applications such as crowd simulation, controlling two characters carrying objects or fighting with one another, and controlling a team of characters playing collective sports. With the advance of deep learning and reinforcement learning, there is growing interest in applying multi-agent reinforcement learning (MARL) to control characters intelligently and produce realistic movements. In this paper we survey the state-of-the-art MARL techniques that are applicable to character control. We then survey papers that make use of MARL for multi-character control and discuss possible future directions of research.


Introduction
Controlling multiple agents under adversarial and cooperative scenarios has been intensively pursued by researchers in artificial intelligence, computer graphics and robotics over the past few decades. There is a wide range of applications, including autonomous driving [13], swarm robotics control in warehouses [10], NPC control in computer games [37,43] and background character control in films [35,48].
Especially in computer animation, starting from the landmark boids model by Reynolds [32] for animating flocks of birds and schools of fish, extensive research has been done on crowd animation [7] and close character interactions [12]. Before the deep learning era, researchers focused on designing hand-tuned controllers or optimization-based approaches [21,37]. Although such methods form the foundation of multi-character controllers, the intelligence of the characters is limited. Recently, there has been growing interest in building more intelligent character control by making use of algorithms such as reinforcement learning [15,16,25,36,43,47]. Especially with the introduction of deep reinforcement learning (DRL), the scale of the learnable systems in terms of generalization and data size has massively grown [33]. DRL has been applied to terrain running in 2D [26] and 3D [27], tracking human motion capture data [24], video motion tracking [28] and synthesizing an optimal series of motions for achieving a task [29].
Such algorithms are also being applied to simultaneous multi-agent control, an area known as multi-agent reinforcement learning (MARL). Impressive results have been achieved in which agents are intelligently controlled to play football [18], play hide-and-seek [1] and drive autonomously [13].
In this paper, we review the basics of MARL, especially for controlling characters in real-time scenarios. Starting from the basics of RL, we extend the discussion to MARL under the centralized training and decentralized execution setup. We then review recent papers that apply MARL to multi-character control and related topics. Finally, we discuss some future directions of research.

Reinforcement learning background
The key idea that distinguishes reinforcement learning from supervised and unsupervised learning is that it learns from interactions with an environment and approximates some return based on these interactions to help 'reinforce' its decisions in that environment. The entire system can be modelled as a Markov decision process with a discrete-time setup ($t = 0, 1, 2, 3, \ldots$), where an agent observes the environment state $s_t$ at time step $t$ and takes an action $a_t$ according to the observation. In the next time step $t + 1$, when the agent finishes the action, it receives a numerical reward $r_{t+1}$ from the environment, and the environment state transits from $s_t$ to $s_{t+1}$. Therefore, a complete interaction can be represented as a sequence, which is called a trajectory:
$$\tau = (s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots) \tag{1}$$
Mathematically, the Markov decision process is written as a tuple $\langle S, A, R, P \rangle$, where $S$ is the set of all possible states, $A$ is the set of all possible actions, $R$ is the reward function and $P$ is the transition function of the environment: $P(s' \mid s, a)$ describes the probability of transition to state $s'$ when action $a$ is taken at state $s$. The main target of reinforcement learning is to maximize the discounted sum of rewards the agent receives after having taken an action. This is defined as the return:
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \tag{2}$$
where $\gamma \in [0, 1]$ is the discount factor. An agent's choice of actions is decided by its policy $\pi$, which is a function that outputs the probability of taking action $a$ at state $s$, normally written as $\pi(a \mid s)$. In addition to the policy, a value function is used to estimate the return that will be received given a state, or a state and an action. The value function for state $s$ under policy $\pi$ is:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right] \tag{3}$$
The value function for taking action $a$ at state $s$ under policy $\pi$ is called a Q function and is defined as:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s, a_t = a \right] \tag{4}$$
Q-learning [44] is a classic algorithm in reinforcement learning which learns the Q function by moving its current estimate towards a target that always selects the optimal action in the next state:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \tag{5}$$
where $\alpha$ is the learning rate.
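As a minimal illustration of the return and of the tabular Q-learning update described above, the following Python sketch may help. The function names, the dictionary-based Q table and the toy states and actions are assumptions made for this example only and do not come from any cited work.

```python
from collections import defaultdict


def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * r_{k+1} for a finite list of rewards."""
    g = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def q_learning_update(q, s, a, r, s_next, actions, alpha, gamma):
    """Apply one tabular Q-learning step to the table q in place."""
    # The target uses the greedy (maximal) Q value in the next state.
    best_next = max(q[(s_next, b)] for b in actions)
    td_target = r + gamma * best_next
    q[(s, a)] += alpha * (td_target - q[(s, a)])
    return q[(s, a)]


# Hypothetical toy usage: all Q values start at zero.
q = defaultdict(float)
new_value = q_learning_update(
    q, s="s0", a="right", r=1.0, s_next="s1",
    actions=["left", "right"], alpha=0.5, gamma=0.9,
)
```

With an all-zero table, the target equals the immediate reward, so the updated entry moves halfway from 0 towards 1 when the learning rate is 0.5.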
Another kind of algorithm, known as policy gradient or policy optimization, learns a policy function directly by gradient ascent on an empirical estimate of the true policy gradient. We define $J(\theta)$ as the expected return under the policy parameters $\theta$:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ G(\tau) \right] = \sum_{\tau} \pi_\theta(\tau) G(\tau) \tag{6}$$
where $\tau$ is a trajectory (see Eq. (1)), $\pi_\theta(\tau)$ is the probability that $\tau$ is generated by the policy $\pi_\theta$ parameterized by $\theta$, and $G(\tau)$ is the return (see Eq. (2)) obtained by trajectory $\tau$. The goal is to learn the policy parameters $\theta$ that maximize $J(\theta)$:
$$\theta \leftarrow \theta + \eta \nabla_\theta J(\theta) \tag{7}$$
where $\eta$ is the learning rate and the derivative term can be written in a form that can be estimated with sample trajectories:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ G(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \tag{8}$$
Combining the two key components, value functions and policy functions, and learning them together is the main idea behind the most commonly used kind of algorithm nowadays, known as Actor-Critic, with a value function acting as the critic and a policy function acting as the actor. The critic learns a function that evaluates the trajectory, which can be the value function $V(s)$, the Q function $Q(s, a)$, the advantage function $Q(s, a) - V(s)$ or other variations, and the actor updates its policy function in the direction indicated by the critic.
With the help of deep learning, reinforcement learning algorithms can solve problems on much larger scales. Specifically, deep reinforcement learning algorithms learn value functions and policy functions using neural networks. For example, deep Q-learning [23], the deep learning version of Q-learning, computes a loss to update the Q function instead of updating it directly as in Eq. (5). Other deep reinforcement learning algorithms, such as proximal policy optimization (PPO) [8,33], deep deterministic policy gradient (DDPG) [17] and asynchronous advantage actor-critic (A3C) [22], are good examples of applying deep learning in actor-critic methods.
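To make the sample-based policy gradient concrete, the sketch below applies gradient ascent to a softmax policy on a hypothetical two-action, one-step problem. The reward function, step size and sample count are illustrative assumptions. This is the basic score-function estimator, not the full method of any particular paper cited here.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences (the policy parameters) into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(theta, reward_fn, eta, n_samples, rng):
    """One gradient-ascent step on J(theta) using sampled one-step trajectories."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        a = rng.choices(range(len(theta)), weights=probs)[0]
        g = reward_fn(a)  # return of this one-step trajectory
        for k in range(len(theta)):
            # d/d theta_k of log softmax(theta)[a] = 1[k == a] - probs[k]
            score = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += score * g / n_samples
    return [t + eta * d for t, d in zip(theta, grad)]


# Hypothetical problem: action 1 yields reward 1, action 0 yields reward 0.
rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(200):
    theta = policy_gradient_step(
        theta, lambda a: 1.0 if a == 1 else 0.0, eta=0.5, n_samples=16, rng=rng
    )
final_probs = softmax(theta)
```

In an actor-critic variant, the sampled return `g` would be replaced by a critic's estimate such as the advantage, which typically reduces the variance of the gradient estimate.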

Multi-agent reinforcement learning
When multiple reinforcement learning agents interact with the same environment, the environment becomes a multiagent system and multi-agent reinforcement learning algorithms are applied. It can be modelled as a

MARL setups for character control
We now discuss some factors and setups that we need to consider when applying MARL to character control.

Non-stationarity
The environment for a single reinforcement learning agent is assumed to be stationary by default, i.e. the state transition function and reward function do not change over time [11]. However, things are different for agents in multi-agent environments. As all agents are learning concurrently, their policies change over time, and the environment becomes non-stationary from each agent's perspective. An agent may reach a different next state $s'$ after taking the same action $a_i$ at the same state $s$ when the other agents follow different policies:
$$P(s' \mid s, a_i, \pi_1, \ldots, \pi_N) \neq P(s' \mid s, a_i, \pi'_1, \ldots, \pi'_N) \quad \text{when any } \pi_j \neq \pi'_j \tag{10}$$
since the transition of the true state involves the actions of all agents. This non-stationarity causes the critic function to be non-stationary as well and, hence, leads to a poorly learnt policy for any single agent.
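The effect can be seen in a small worked example. The Python sketch below marginalizes a joint transition function over another agent's policy, so that the transition probabilities seen by one agent change when its partner's policy changes. The two-agent "door" environment and all names in it are hypothetical.

```python
def marginal_transition(joint_transition, other_policy, s, a_own):
    """P(s' | s, a_own) after marginalizing over the other agent's action."""
    result = {}
    for a_other, p_other in other_policy[s].items():
        for s_next, p in joint_transition[(s, a_own, a_other)].items():
            result[s_next] = result.get(s_next, 0.0) + p_other * p
    return result


# Hypothetical environment: the door opens only if both agents push.
joint_transition = {
    ("closed", "push", "push"): {"open": 1.0},
    ("closed", "push", "wait"): {"closed": 1.0},
}

# The partner's policy early and late in training (assumed values).
early_partner = {"closed": {"push": 0.1, "wait": 0.9}}
late_partner = {"closed": {"push": 0.9, "wait": 0.1}}

p_early = marginal_transition(joint_transition, early_partner, "closed", "push")
p_late = marginal_transition(joint_transition, late_partner, "closed", "push")
```

The first agent takes the same action in the same state in both cases, yet the probability of the door opening rises from 0.1 to 0.9 purely because the partner has learnt a different policy.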

Centralized training and decentralized execution
The centralized training and decentralized execution (CTDE) setup is an effective solution to the non-stationarity problem, especially in multi-agent cooperative environments. The decentralized execution part stays the same as in the normal setup: each agent's policy outputs actions based on its own partial observations. The centralized training part "cheats" a little by giving the critics access to extra information, for example the true state of the environment and the actions of all agents, which makes the environment stationary, since:
$$P(s' \mid s, a_1, \ldots, a_N, \pi_1, \ldots, \pi_N) = P(s' \mid s, a_1, \ldots, a_N) = P(s' \mid s, a_1, \ldots, a_N, \pi'_1, \ldots, \pi'_N) \tag{11}$$
for any $\pi_i \neq \pi'_i$. Such extra information is usually accessible in computer graphics applications, which is an advantage compared to other applications, such as autonomous driving, where simulating the real world is a very difficult task. Hence, we will introduce some multi-agent reinforcement learning algorithms that use the CTDE setup.

Self-play
Self-play is a method frequently used in multi-agent competitive environments, such as tennis and chess. Instead of training against an agent with some other policy, the agent learns by competing with itself, i.e. its current best policy or a randomly chosen previous policy. The main idea behind it is to mimic how humans structure competitions. For example, it is more reasonable and efficient for a beginner tennis player to practise against other beginners than against a world champion or a small child struggling to hold the racket. Multi-agent self-play environments can use single-agent RL methods, and some interesting works will be discussed later.

Algorithms for MARL
We now describe four MARL algorithms that are considered state-of-the-art and suitable for applications in character control.

Value-decomposition networks
VDN [40] is a MARL algorithm that allows agents to learn their individual Q value functions when only a group reward is available. VDN assumes that the joint action-value function for all agents can be decomposed into individual action-value functions for each agent:
$$Q_{tot}(s, a) = \sum_{i=1}^{N} Q_i(o_i, a_i) \tag{12}$$
where $a = (a_1, \ldots, a_N)$ is the joint action and each individual action-value function $Q_i$ depends only on the local observations $o_i$ of the $i$th agent. For the centralized training, $Q_{tot}$ is trained with the deep Q-learning method, using the joint reward to back-propagate gradients into the networks, so the individual $Q_i$ does not need to learn from specifically assigned rewards. For the decentralized execution, each agent acts greedily by taking the action with the maximum $Q_i$ value given its local observations. Because $Q_{tot}$ is a sum, this is equivalent to selecting the joint action greedily, which ensures that the centralized policy and decentralized policy are consistent.
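The consistency between the decentralized and centralized greedy policies under an additive decomposition can be checked directly. In the Python sketch below, each agent's Q values for a fixed local observation are stored in a small dictionary. The tables and action names are made-up values, and real VDN agents would use neural networks rather than tables.

```python
from itertools import product


def vdn_q_tot(q_tables, joint_action):
    """Joint value as the sum of the individual Q values (additive decomposition)."""
    return sum(q[a] for q, a in zip(q_tables, joint_action))


def decentralized_greedy(q_tables):
    """Each agent independently picks the action maximizing its own Q_i."""
    return tuple(max(q, key=q.get) for q in q_tables)


def centralized_greedy(q_tables):
    """Search over all joint actions for the one maximizing Q_tot."""
    joint_actions = product(*(q.keys() for q in q_tables))
    return max(joint_actions, key=lambda ja: vdn_q_tot(q_tables, ja))


# Hypothetical Q_i values of three agents for their current local observations.
q_tables = [
    {"left": 0.2, "right": 1.0},
    {"left": 0.7, "right": 0.1},
    {"left": -0.3, "right": 0.4},
]
local_choice = decentralized_greedy(q_tables)
joint_choice = centralized_greedy(q_tables)
```

The exhaustive search grows exponentially with the number of agents, whereas the decentralized choice needs only one maximization per agent, which is what makes the decomposition practical at execution time.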

QMIX
QMIX [31] relaxes the additivity constraint of VDN (Eq. (12)) by introducing a new $Q_{tot}$ function. Instead of just taking the sum of all individual $Q_i$ values as $Q_{tot}$, QMIX uses a mixing network that takes all $Q_i$ values and the global state as input to approximate the value of $Q_{tot}$. This both enriches the relations that can be represented between $Q_{tot}$ and the individual $Q_i$ values and makes better use of centralized training by considering the global state. More specifically, as shown in Fig. 1, the hyper-networks take the global state information as input to generate the weights and biases for the mixing network, which computes $Q_{tot}$ using the individual $Q_i$ values as inputs.
To have the same greedy policy as VDN, QMIX requires that the joint action which maximizes $Q_{tot}$ is just the combination of the actions which maximize the individual $Q_i$ values:
$$\arg\max_{a} Q_{tot}(s, a) = \left( \arg\max_{a_1} Q_1(o_1, a_1), \ldots, \arg\max_{a_N} Q_N(o_N, a_N) \right) \tag{13}$$
where $o_i$ is the local observation of the $i$th agent, and this can be ensured by imposing the constraint:
$$\frac{\partial Q_{tot}}{\partial Q_i} \geq 0, \quad \forall i \tag{14}$$
i.e. $Q_{tot}$ is monotonically increasing with respect to each $Q_i$. This is achieved by applying an absolute-value function to the outputs of the weight-generating hyper-networks so that they produce non-negative weights for the mixing network.
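The monotonic mixing idea can be sketched in a few lines of plain Python. This is a deliberately simplified version: each hyper-network is assumed to be a single linear map from the state, and the parameters are random rather than learnt. All names here are hypothetical, and the published architecture is larger.

```python
import math
import random


def elu(x):
    """A monotonically increasing activation for the hidden mixing layer."""
    return x if x > 0 else math.exp(x) - 1.0


def linear(params, state):
    """Hypothetical hyper-network layer: a single linear map from the state."""
    return [sum(w * s for w, s in zip(row, state)) for row in params]


def qmix(agent_qs, state, hyper):
    """Mix individual Q values into Q_tot with state-dependent weights."""
    n, h = len(agent_qs), hyper["hidden"]
    w1 = [abs(w) for w in linear(hyper["w1"], state)]  # non-negative weights
    b1 = linear(hyper["b1"], state)                    # biases may have any sign
    w2 = [abs(w) for w in linear(hyper["w2"], state)]  # non-negative weights
    b2 = linear(hyper["b2"], state)[0]
    hidden = [
        elu(sum(w1[i * h + j] * agent_qs[i] for i in range(n)) + b1[j])
        for j in range(h)
    ]
    return sum(w2[j] * hidden[j] for j in range(h)) + b2


def random_hyper(n_agents, hidden, state_dim, rng):
    """Random hyper-network parameters, standing in for learnt ones."""
    def mat(rows):
        return [[rng.uniform(-1, 1) for _ in range(state_dim)] for _ in range(rows)]
    return {"hidden": hidden, "w1": mat(n_agents * hidden), "b1": mat(hidden),
            "w2": mat(hidden), "b2": mat(1)}


rng = random.Random(0)
hyper = random_hyper(n_agents=3, hidden=4, state_dim=5, rng=rng)
state = [rng.uniform(-1, 1) for _ in range(5)]
base = qmix([0.1, -0.5, 0.3], state, hyper)
raised = qmix([0.1, 0.5, 0.3], state, hyper)  # only the second agent's Q grows
```

Because every mixing weight is non-negative and the activation is increasing, raising any single agent's Q value can never lower the mixed value, whatever the state-dependent weights happen to be.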

Counterfactual multi-agent policy gradients
Fig. 1 Network structure for QMIX, graph adapted from [31]
When a group of cooperative agents receives a group reward, it is challenging for the individual agents to know their exact contribution. This is known as the multi-agent credit assignment problem, and COMA [6] is an Actor-Critic method that addresses it by using a counterfactual baseline. It was inspired by the idea of the difference reward [46]:
$$D_i = r(s, a) - r(s, (a_{-i}, c_i)) \tag{15}$$
where $a_{-i}$ is the joint action of all agents except for the $i$th agent, and $c_i$ is the default action for the $i$th agent. This is just the difference in reward when the action taken by the $i$th agent is replaced by its default action. An action that makes the value of $D_i$ larger also makes the value of $r(s, a)$ larger. However, it is hard to handpick the default action $c_i$ in most scenarios, so COMA designs an advantage function that avoids the problem:
$$A_i(s, a) = Q(s, a) - \sum_{a'_i} \pi_i(a'_i \mid o_i) \, Q(s, (a_{-i}, a'_i)) \tag{16}$$
where $a'_i$ ranges over the actions available to the $i$th agent, $o_i$ is its local observation, and the second term on the right-hand side is the expected Q value when the agent's current action $a_i$ is marginalized out. This advantage function indicates how good an agent's action is for the whole team compared to its other actions.
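The counterfactual advantage can be computed by hand for a small case. The Python sketch below assumes a fixed state, a tabular centralized Q function indexed by joint action, and a two-agent example with invented values. The function and variable names are hypothetical.

```python
def coma_advantage(q_joint, policy_i, joint_action, i):
    """Q(s, a) minus the policy-weighted average of Q over agent i's actions.

    The other agents' actions are held fixed, so the baseline marginalizes
    out only the action of agent i.
    """
    baseline = 0.0
    for alt_action, prob in policy_i.items():
        counterfactual = list(joint_action)
        counterfactual[i] = alt_action
        baseline += prob * q_joint[tuple(counterfactual)]
    return q_joint[tuple(joint_action)] - baseline


# Hypothetical centralized Q values for one state of a two-agent team.
q_joint = {
    ("pass", "run"): 1.0,
    ("shoot", "run"): 0.2,
    ("pass", "wait"): 0.4,
    ("shoot", "wait"): 0.1,
}
policy_0 = {"pass": 0.5, "shoot": 0.5}  # current policy of agent 0 (assumed)

adv_pass = coma_advantage(q_joint, policy_0, ("pass", "run"), i=0)
adv_shoot = coma_advantage(q_joint, policy_0, ("shoot", "run"), i=0)
```

Here the baseline for agent 0 is 0.6, so passing has an advantage of 0.4 and shooting an advantage of -0.4, even though both joint actions would receive the same kind of shared team reward signal.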

Multi-agent deep deterministic policy gradient
MADDPG [20] is the multi-agent version of the DDPG algorithm. It proposes a CTDE structure that is widely used nowadays. Unlike the three algorithms mentioned above, where there is only a single value function $Q_{tot}$ or a central critic that considers the joint actions and joint observations or states, MADDPG gives every single agent its own critic, which considers extra information such as the observations and actions of other agents, as shown in Fig. 2. Also, each agent has its own policy that only takes its local observations as inputs and generates actions in a deterministic way, which makes it capable of dealing with continuous action spaces. Moreover, since each agent learns by maximizing its individual return, the environment can be cooperative, competitive or both, which gives MADDPG a wider range of application.
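The difference between what an actor and a critic see in this kind of CTDE structure can be sketched as follows. The example only builds the input vectors. The observation and action sizes are arbitrary assumptions, and the networks that would consume these inputs are omitted.

```python
def actor_input(observations, i):
    """Decentralized execution: agent i sees only its own local observation."""
    return list(observations[i])


def critic_input(observations, actions):
    """Centralized training: a critic sees every agent's observation and action."""
    flat = []
    for obs in observations:
        flat.extend(obs)
    for act in actions:
        flat.extend(act)
    return flat


# Hypothetical three-agent example: 2-D observations and 1-D continuous actions.
observations = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
actions = [[1.0], [-1.0], [0.5]]

policy_in = actor_input(observations, 1)          # used at training and run time
critic_in = critic_input(observations, actions)   # used during training only
```

Since the critic is discarded after training, the deployed characters need nothing beyond their own observations, which is what keeps execution decentralized.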

MARL in computer graphics
In this section we review some important research applications for multi-agent reinforcement learning in graphics. Despite being a popular theoretical field for reinforcement learning researchers, we find that MARL is still in its infancy in the graphics community, perhaps due to a lack of clear applications of the framework. The applications that MARL can be considered most applicable to are crowd simulation, strategy games such as Starcraft II, two-player competitions such as martial arts or fighting, and collective sports such as football. We briefly review how characters have been controlled in such domains and some work where RL/MARL have been applied for solving these problems.

Crowd simulation
Crowd simulation has been applied to problems such as simulating fire escape scenarios for safety testing [9], large fight simulations for wars in movies [45], simulating the movements of citizens within a city [30] and finding bottlenecks for flow in architecture design [34]. All of these approaches have made use of some interesting heuristics-based navigation that combines local information of an agent's nearby interactions with some global goal information, for example, an agent making it through a target building's exit while avoiding colliding with nearby agents. The results are impressive at scale [45] when a large number of interacting agents result in interesting, emergent behaviours. But the scale of these approaches can hide the fact that these agents are not acting naturally, nor 'intelligently', and any increase in the resolution of the simulation usually shows individual agent movement artefacts.
Chen et al. [4] train a joint collision avoidance model using DRL. The model is trained in a decentralized manner using trajectories produced by ORCA [41]. The model runs much faster than ORCA and runs without any communication between the agents. Their approach assumes the agents have full information about the nearby agents, which is not realistic in real-world applications. To cope with this problem, Fan et al. [5] directly use the sensor information as the state to control mobile robots without colliding with other agents in various complex environments. The system is trained with policy-gradient-based deep reinforcement learning using training data with many agents. Haworth et al. [7] propose a hierarchical controller where the higher level controller plans the footstep patterns towards the goal while the lower level controller computes the PD targets for the full-body character to follow the planned footsteps as accurately as possible. The lower level controller is a task agnostic controller that is trained in a centralized fashion. The innovation of this research is that the characters are controlled by physical simulation based on DeepLoco [27] at the lowest level. Further development can potentially produce interesting effects, such as fighting at the frontlines at war scenes, or physical interactions at bottlenecks such as corridors.

Self-play in games
Competitive games such as Go, chess and shogi, and computer games such as Starcraft, have been test-beds for AI systems. One of the keys in such research nowadays is self-play (see Sect. 3.1.3), a scheme that lets agents play the game against themselves to improve their skills through exploration. AlphaZero [38] uses deep neural networks with Monte Carlo tree search to build an agent, which is trained by self-play. Though it consumed tremendous computing resources, its performance was extraordinary on Go, chess and shogi, and it defeated the previous state-of-the-art programs Stockfish, in chess, and Elmo, in shogi. This success shows the great potential of deep reinforcement learning in applications.
Starcraft II, a science fiction real-time strategy video game, is the new milestone for deep reinforcement learning in game playing, due to its large, partially observable state space and large action space. Both Sun et al. [39] and Lee et al. [14] demonstrate results in beating the built-in AI on some maps from Starcraft II. Sun et al. use self-play with DDQN and PPO to train the agent policies, while Lee et al. use A3C. They both reduce the action space by using macro actions, which are sequences of unit actions. Also, to decrease the learning complexity for the agent, they both divide the policy into different modules, each handling a category of actions, but with different designs. Vinyals et al. [42] go further: their AlphaStar reached Grandmaster level and can beat 99.8% of human players, including some professional players. They use prioritized fictitious self-play (PFSP) with an RL algorithm similar to A3C, where PFSP picks opponents with probabilities proportional to their win rate against the agent, allowing the agent to compete with the problematic opponents more frequently. Results show that computers have a significant advantage in micromanagement, but they are still inferior to the best professional players in strategy. This gap indicates that there is still tremendous room for improvement in complicated applications using reinforcement learning.
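The opponent-selection rule, as summarized above, can be sketched in Python. This follows only the simplified description given here (probability proportional to the opponent's win rate against the learning agent) and should not be read as the exact weighting used in [42]. The opponent names and win rates are invented.

```python
import random


def pfsp_probabilities(opponent_win_rates):
    """Selection probabilities proportional to each opponent's win rate."""
    total = sum(opponent_win_rates.values())
    if total == 0.0:
        # No opponent has beaten the agent yet: fall back to uniform sampling.
        uniform = 1.0 / len(opponent_win_rates)
        return {name: uniform for name in opponent_win_rates}
    return {name: rate / total for name, rate in opponent_win_rates.items()}


def sample_opponent(opponent_win_rates, rng):
    """Draw one opponent according to the prioritized probabilities."""
    probs = pfsp_probabilities(opponent_win_rates)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names])[0]


# Hypothetical pool: win rate of each past opponent against the current agent.
win_rates = {"league_a": 0.6, "league_b": 0.3, "league_c": 0.1}
probs = pfsp_probabilities(win_rates)
opponent = sample_opponent(win_rates, random.Random(0))
```

Opponents that the agent already beats reliably are chosen rarely, so training time concentrates on the opponents that still expose weaknesses.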

Competitive games
For competitive games between two agents, techniques based on game tree expansion [35,37] and reinforcement learning [43] have been proposed. This line of work has since been taken over by DRL-based controllers.
Researchers at OpenAI simulate competitions such as sumo [2] and hide-and-seek [1]; they extend the idea of competitive self-play (see Sect. 4.2) to the domain of continuous control in multi-character games. Bansal et al. [2] design four 1v1 competitive environments with agents using continuous control and use PPO as the reinforcement learning algorithm. When the reward is sparse in a complicated environment, it is hard for the agent to receive the reward. For example, in sumo, if the agent gets a reward only when its opponent is knocked to the ground or pushed out of the ring, the system is hard to train, as the agents need to go through a long competition before finding out that some actions are useful. To solve this problem, they use an exploration curriculum. In the beginning, the agents are rewarded for completing relatively simple tasks, such as standing or moving; the reward for these tasks is then gradually reduced so that training focuses on the target reward as it goes on. Since self-play is used, it is common for the two agents to have different skill levels during training. To mitigate the effect this would bring, they use opponent sampling: when an agent is training, instead of taking the latest policy as its opponent, it randomly chooses older versions. Experimental results show the importance of the curriculum setup and the choice of sampling strategy, especially for more complex environments.
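Both ideas can be illustrated with a short Python sketch. The linear annealing schedule, the reward values and the checkpoint names are assumptions chosen for illustration. The description above states only that the simple-task reward is gradually reduced and that older policies are sampled as opponents.

```python
import random


def curriculum_reward(dense_reward, sparse_reward, step, anneal_steps):
    """Blend an easy dense reward with the sparse target reward.

    The weight on the dense reward decays linearly (an assumed schedule)
    from 1 at step 0 to 0 at anneal_steps, and stays at 0 afterwards.
    """
    alpha = max(0.0, 1.0 - step / anneal_steps)
    return alpha * dense_reward + (1.0 - alpha) * sparse_reward


def sample_past_opponent(checkpoints, rng):
    """Pick the opponent uniformly from stored policy versions."""
    return rng.choice(checkpoints)


# Hypothetical values: a dense reward for standing, a sparse one for winning.
early = curriculum_reward(dense_reward=1.0, sparse_reward=10.0, step=0, anneal_steps=100)
middle = curriculum_reward(dense_reward=1.0, sparse_reward=10.0, step=50, anneal_steps=100)
late = curriculum_reward(dense_reward=1.0, sparse_reward=10.0, step=100, anneal_steps=100)
opponent = sample_past_opponent(["policy_v1", "policy_v2", "policy_v3"], random.Random(0))
```

Early in training the agent is paid only for the easy behaviour, halfway through it receives an even blend, and after annealing only the competition outcome matters.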
Baker et al. [1] set up a hide-and-seek environment with two hiders and two seekers, together with some boxes and ramps as tools. They use self-play to train agents against opponents that are at an appropriate level, and the CTDE setup with PPO to optimize the policies. Experimental results show that, after millions of steps of training, the agents form an auto-curriculum: the agents move randomly → the seekers catch the hiders → the hiders hide using boxes → the seekers use ramps to climb up → the hiders lock up the ramps → the seekers climb up the boxes and surf them. This shows that agents are capable of learning reasonable actions beyond what their designers expected.

Competitive sports
Competitive sports are a good representation of the complexity of human cooperative and competitive multi-agent coordination. A deep hierarchy of cognition is required to play a sport, from low-level subconscious muscle responses all the way up to high-level team planning and coordination. Video game AI in sports requires a large network of state machines [3] that cover the breadth of possible plays, yet such a system is limited by how much knowledge of the game the designer puts into the AI. Alternatively, self-learning AI systems for sports could transform the scripted or unnatural sports AI we have in games now into intelligent, surprising and adaptive agents that are as creative to play against as human players. So far, the body of existing research on the application of MARL to competitive sports games is small, but the results are impressive.
Liu et al. [18] demonstrate a self-playing multi-agent system in the game of soccer. They utilize standard actor-critic policy gradient methods and pit agents against each other in 2v2 matches. Agents are randomly sampled from a population pool of 32 agents and play a game of soccer to collect experience. They extend beyond the usual population-based sampling for competitive multi-agent environments and also adapt the hyper-parameters of the individual agents, such as the reward coefficients, the discounted return scale and the learning rates, via evolutionary perturbation. They demonstrate some interesting results showing how the importance of various reward components to the agent changes over time. Because the agents do not have information about the internal state of other agents in the game, they utilize a recurrent architecture for both actor and critic so that the behaviours of opponents and teammates can be incorporated into the decision making. This is one of the first works to present the idea that MARL algorithms can be used for cooperative and competitive AI for games and simulations.
This work is then extended to a much more difficult scenario where the agents are involved in a physically-based soccer simulation and have to actuate a humanoid body in the environment in order to play the game [19]. The players are first trained to imitate motion capture skills of running and turning. Then, they are trained by RL through mid-level drill training to learn skills such as dribbling, kicking and shooting. Finally, teams are trained to coordinate in 2v2 matches by MARL. The results are very impressive, as the agents show clear signs of learning simple real-world tactics, both in attack and defence. This work can be extended to a larger scale in the future, for example, making the competitions eleven-a-side, or adding more modern football rules, such as offside.

Discussion
Below, we discuss some MARL applications that have not received the attention in the graphics community that we believe they deserve.

Variable number of agents
One interesting scenario in multi-agent environments that has not been explored too deeply yet is a variable number of agents. Agents joining or leaving the environment change the rewards and state transitions that other agents experience after taking the same actions at the same previous state, and consequently affect their policies. This situation arises readily when training non-playable characters (NPCs) with multi-agent reinforcement learning. For example, in first-person shooting games, NPCs can be killed by the players and respawned by the system.

Heterogeneous versus homogeneous agents
Another interesting topic is heterogeneous agents in multiagent competitive environments. Homogeneous agents are basically identical, while heterogeneous agents may have different observation ranges, action spaces and learning algorithms. For cooperative environments, the CTDE setup can be used, since it does not require the agents to be homogeneous in theory. However, for competitive environments, self-play would not work since the agents cannot train against themselves. A possible way is to build up a metric system that could evaluate current levels for different agents, to help the agents train against opponents at appropriate levels.

Future application to computer graphics
Other than all the fascinating applications introduced in the previous section, we suggest another possible MARL application that covers all the elements we discussed: a large-scale war scene. Multiple kinds of soldiers are available for each troop, and they can be immobilized during the war. Agents need to learn to cooperate in order to win the battles. Also, difficulty can be increased by having multiple troops representing different tribes, and the relationship between these tribes can be allies, neutral or rivals. In this scenario, agents need to learn strategies to maximize the interest of their own tribe. Finally, collective sports with a lot of teammates competing with the other team involves a lot of tactics and allocation of players to different locations. Learning such tactics considering the speciality of the players is a very high dimensional and complex problem; handling such problems will not only benefit the game industry but also the sports industries for strategy making.

Conclusion
In this paper, we discussed some key ideas in RL and MARL and reviewed some significant algorithms in MARL with CTDE architecture, which are suitable for graphics applications. We also reviewed several exciting MARL applications in the field of computer graphics. MARL is a difficult problem and there is still much research needed to tackle general, real-world problems. This also applies to computer graphics problems as such applications usually have higher complexity. In the foreseeable future, we would like to see more breakthroughs in computer graphics applications with MARL.