1 Introduction

Since DeepMind’s AlphaGo defeated a professional player at the ancient game of Go in 2015, and with the rise of deep learning, more and more researchers have dedicated themselves to studying reinforcement learning (RL). One of the most popular fields of RL research is playing video games. Since “Street Fighter 2” was released on the arcade platform in 1991, fighting games have taken the world by storm, and the fighting genre has become an essential category of video games.

Many researchers have studied applying RL techniques to fighting games such as the 2D game “Super Smash Bros” [1], the 2.5D game “Little Fighter 2” [3], and the 3D game “Blade & Soul” [5]. In [1], the researchers used information such as the player’s location, velocity, and action state read from game memory as observations instead of raw pixels. They used two main classes of model-free RL algorithms, Q-learning and policy gradients, and both approaches were successful even though the two algorithms found qualitatively different policies. After the agent was trained to fight against the in-game AI, a self-play curriculum was used to boost its performance, making it competitive against human professionals. In [3], the main challenge is the ambiguity between distance and height, because in 2.5D games the 3D objects in a scene are orthographically projected onto the 2D screen. To tackle this problem, the location information of both the agent and the opponent, read from the environment, was used as an observation in addition to an array of four stacked, gray-scale successive screenshots. A CNN was used to extract visual features from the screenshots, and the other features were concatenated with the CNN’s output. An LSTM was also used in the network to enable the agent to learn game-related and action-order-related features. In [5], a novel self-play curriculum was utilized. Through reward shaping, agents with three different styles (Aggressive, Balanced, and Defensive) were created and trained by competing against each other for robust performance. Introducing a range of battle styles enforced diversity in the agents’ strategies, and the agents were capable of handling a variety of opponents.

Compared with classical fighting games on the arcade platform, such as the “Street Fighter” series and “The King of Fighters” (KOF) series, the command system of the games mentioned above is relatively simple: special moves are triggered either by combinations of direction keys and action keys or by dedicated keys. In KOF ’97, by contrast, the movement system is more complex. Apart from basic moves such as a punch or a kick, the commands for special and desperation moves consist of a sequence of joystick movements and button presses. A desperation move’s command is longer than a special move’s, and using a desperation move consumes a power stock, indicated by a green flashing orb at the bottom of the screen. The movement and combo systems require the RL agent to master relatively long sequences of accurate inputs.

To the best of our knowledge, there has so far been no RL research dedicated to arcade fighting games like KOF ’97, which typically feature special moves and desperation moves that are triggered by rapid sequences of carefully timed button presses and joystick movements. The aim of the study described in this paper was to develop an RL agent that can not only win the game KOF ’97 but also master a complex game movement system. To achieve this goal, special attention was paid to the effective representation of observations in the RL environment.

2 Environment Setup for KOF ’97

2.1 Interaction with Arcade Emulator

An arcade game emulator called MAME is used to set up the Python game environment. MAME can be driven externally via Lua scripts, and its event hooks allow interaction with the game at every step. An open-source Python library called MAMEToolkit [4] encapsulates the Lua script operations in Python classes. The KOF ’97 game environment, which implements the Gym env interface, is built on this Python library. Figure 1 shows a simplified communication process between the game environment and the MAME emulator. The whole source code, including the game environment, training, and evaluation, is available on GitHub (footnote 1).

Fig. 1. Simplified communication process

2.2 Action Space

There are eight basic input signals: UP, DOWN, LEFT, RIGHT, A, B, C, and D. The action space has 49 discrete actions, covering all meaningful combinations of the basic input signals. One problem is that the direction inputs in the commands depend on the character’s orientation, i.e., if the character changes its orientation, all direction inputs in the commands must be flipped horizontally. Since the agent operates the 1P (Player 1) character, the character faces right most of the time, which makes the training samples unbalanced. Moreover, the experience learned in the default orientation cannot be applied to the other side. An action-flipping trick is used to tackle these problems. In the action space, FORWARD and BACKWARD commands are used instead of LEFT and RIGHT, and the environment translates FORWARD and BACKWARD into LEFT or RIGHT according to the character’s orientation. With the help of this trick, the agent’s win rate after 30 M timesteps of training increased from 70% to over 90%.
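
The action-flipping trick can be illustrated with a minimal sketch. The function name, the tuple-based action format, and the `facing_right` flag are assumptions made for illustration; they are not taken from the released environment.

```python
# Hypothetical helper: the paper does not publish its action identifiers,
# so an action is modelled here as a tuple of key names.
def resolve_directions(action, facing_right):
    """Translate orientation-relative keys into absolute joystick keys."""
    mapping = {
        "FORWARD": "RIGHT" if facing_right else "LEFT",
        "BACKWARD": "LEFT" if facing_right else "RIGHT",
    }
    # Keys that do not depend on orientation (UP, DOWN, A, B, C, D) pass through.
    return tuple(mapping.get(key, key) for key in action)


# The same relative command yields mirrored joystick input on each side.
print(resolve_directions(("DOWN", "FORWARD", "A"), facing_right=True))
print(resolve_directions(("DOWN", "FORWARD", "A"), facing_right=False))
```

Because the policy only ever sees FORWARD and BACKWARD, experience gathered while facing one side applies unchanged when the character faces the other side.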

2.3 Observation

The observation vector from a single step consists of several parts. The first part contains basic information (such as the characters’ HP, location, POW state, and combo count). All scalar features are normalized. The second part is the characters’ ACT codes, which represent the characters’ actions. ACT is a concept from the game’s implementation mechanism. A character’s ACT code and the corresponding action can be checked in the game’s debug mode. A character’s move consists of one or more ACTs, and each ACT consists of several animation frames. As Fig. 2 shows, the character’s “100 Shiki Oni Yaki” move consists of three ACTs (with codes 129, 131, and 133), which denote the start, middle, and end of the move. Each character can have a maximum of 512 ACTs. Although some of the ACT codes are not used, a 512-bit one-hot encoding vector is used to represent the ACT code feature for compatibility. The basic information and the characters’ ACT codes can be read from the game memory. The last part is the agent’s input, which is also a one-hot encoding vector.
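
A rough sketch of how such a single-step observation could be assembled is given below. The 512 ACT slots and the 49 actions come from the text; the ordering of the parts, the use of one ACT vector per character, and the eight basic features in the example are assumptions.

```python
ACT_SLOTS = 512    # from the text: at most 512 ACTs per character
NUM_ACTIONS = 49   # from Sect. 2.2: size of the discrete action space


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} outside [0, {size})")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_step_observation(basic, act_1p, act_2p, last_action):
    """Concatenate normalized scalars with three one-hot parts (assumed layout)."""
    return (
        list(basic)
        + one_hot(act_1p, ACT_SLOTS)
        + one_hot(act_2p, ACT_SLOTS)
        + one_hot(last_action, NUM_ACTIONS)
    )


# Hypothetical example: 8 normalized scalars, 1P in ACT 129, 2P in ACT 0, action 3.
obs = build_step_observation([0.5] * 8, act_1p=129, act_2p=0, last_action=3)
print(len(obs))
```

Under these assumptions a single step already yields more than 1000 values, almost all of them zeros.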

Fig. 2. “100 Shiki Oni Yaki” Move’s ACTs

2.4 Graphical Representation of the Input and ACT Sequences

To give readers a better understanding of the approach proposed below, it is worth describing the command judgment system of the game. As mentioned, to make a special move or a desperation move, the player needs to complete a sequence of inputs within a short period. The game system determines which move the character is going to perform according to the input history. As human input varies in rhythm and precision, the judgment of input sequences has some flexibility. Take the move “hopping back” as an example. The command is “\(\leftarrow \) \(\leftarrow \)”. As Fig. 3 shows, as long as the player finishes this input pattern within 8 frames, the character performs the move successfully. In case 1 and case 2, although the keys are held for different lengths of time, the input pattern “\(\leftarrow \) \(\leftarrow \)” is finished within 8 frames, so the character performs the hopping back move. In case 3, the input pattern is not finished within the time window, so the character fails to perform the move.

Fig. 3. Hop back command’s input judgment
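
The flexibility of this judgment can be sketched as follows. This is a simplified model, not the game’s actual rule: it assumes that a command counts as finished when the second key press starts fewer than 8 frames after the first one, and the three input traces are illustrative rather than copied from Fig. 3.

```python
def hop_back_recognized(back_held, window=8):
    """back_held holds one boolean per frame, True while the back key is down."""
    # A press starts on a frame where the key is down but was up one frame earlier.
    press_frames = [
        i for i, held in enumerate(back_held)
        if held and (i == 0 or not back_held[i - 1])
    ]
    # Two consecutive presses must start within the same `window` frames.
    return any(
        second - first < window
        for first, second in zip(press_frames, press_frames[1:])
    )


# Illustrative traces (1 = back key held during that frame).
case_1 = [1, 0, 1, 0, 0, 0, 0, 0]         # quick taps
case_2 = [1, 1, 1, 0, 0, 1, 1, 0]         # longer holds, still inside the window
case_3 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]   # second press arrives too late

for case in (case_1, case_2, case_3):
    print(hop_back_recognized([bool(x) for x in case]))
```

Different traces therefore trigger the same move, which is why matching an exact input sequence would be too strict.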

The observation features from a single step are not enough to capture all the important information. On the one hand, the character’s move depends heavily on the recent input history; on the other hand, knowing which actions the characters have already performed and how long these actions have lasted allows the player to make more accurate predictions and choose an optimal response. To preserve the Markov property, it is necessary to maintain the character’s input sequence and ACT sequence within a specific time window. Simply concatenating features from several steps, however, is not a good idea. Firstly, the feature vector for each step is long (over 1000 elements per step), and features from at least 30 frames are needed since the commands for desperation moves are quite long. The network would therefore need more neurons, and training it would be slow and difficult. Besides, because of the flexibility of the command judgment system, the input sequence can vary for the same move, and it is hard for an MLP network to find such a pattern. Our idea is to stack the 1P input sequence, the 1P ACT sequence, and the 2P ACT sequence as three binary images. Figure 4 shows our approach to representing the feature sequences as binary images. These images encode the characters’ input and behavioural patterns, which are then extracted and recognized by the CNNs. Although input sequences can differ for the same move, they form a similar pattern in the graphical representation, and CNNs are good at recognizing image patterns.

Fig. 4. Graphical representation of the input sequence and ACT sequences
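
One way to maintain such a binary image is a rolling buffer, sketched below under stated assumptions: each row is one step, each column is one possible code, and the oldest row is dropped when a new one arrives. The class name and this exact layout are hypothetical; the paper’s implementation may arrange the image differently.

```python
from collections import deque


class SequenceImage:
    """Rolling binary image with one row per step and one column per code."""

    def __init__(self, steps, width):
        self.width = width
        # Start with an all-zero image; maxlen makes the deque drop old rows.
        self.rows = deque(([0] * width for _ in range(steps)), maxlen=steps)

    def push(self, code):
        """Append the one-hot row for `code`, discarding the oldest row."""
        if not 0 <= code < self.width:
            raise ValueError(f"code {code} outside [0, {self.width})")
        row = [0] * self.width
        row[code] = 1
        self.rows.append(row)

    def as_image(self):
        """Return the image as a list of rows, oldest first."""
        return [list(row) for row in self.rows]


# The ACT codes of the move in Fig. 2, with the first ACT lasting two steps.
act_image = SequenceImage(steps=36, width=512)
for code in (129, 129, 131, 133):
    act_image.push(code)
image = act_image.as_image()
print(len(image), len(image[0]))
```

A move that is held for a different number of steps produces a stretched but similarly shaped trace in the image, which is the kind of variation a CNN tolerates well.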

2.5 Reward System

Based on the goal, the following reward system inspired by the reward system described in [1] is proposed:

$$\begin{aligned} R^{total}&= R^{1P\_damage}- P^{2P\_damage}-P^{distance}-P^{time}-P^{POW} \end{aligned}$$


$$\begin{aligned} R^{1P\_damage}&= d^{1P} * c^{1P} * \gamma ^{1P} \\ P^{2P\_damage}&= d^{2P} * c^{2P} * \gamma ^{2P} \\ P^{distance}&= Max(0,g -\beta )*\gamma ^{distance}\\ P^{time}&= 1*\gamma ^{time} \\ P^{POW}&= {\left\{ \begin{array}{ll} 5 &{} \text {if }\text { a POW orb consumed}\\ 0 &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$

The first two terms are a damage reward and a damage penalty (\(R^{1P\_damage}, P^{2P\_damage}\)). The priority is to win the game, so it is natural that the agent gets a reward if its character deals damage (\(d^{1P}\)) and a penalty if the opponent deals damage (\(d^{2P}\)). To encourage the agent to perform more combos, the damage reward and penalty are amplified linearly with the combo counts (\(c^{1P}, c^{2P}\)). On top of that, two factors, \(\gamma ^{1P}\) and \(\gamma ^{2P}\), are introduced to balance the reward and penalty caused by damage. To prevent the agent from behaving too aggressively, \(\gamma ^{2P}\) is set to 1.3, and \(\gamma ^{1P}\) is set to 1.

Besides the damage reward and penalty, some other terms are introduced to adjust the agent’s behavior. The distance penalty term \(P^{distance}\) is used to make the agent keep a reasonable distance \(g\) between the two characters. If the distance exceeds a certain threshold \(\beta \), the agent gets a small penalty controlled by the factor \(\gamma ^{distance}\). The term \(P^{time}\) gives the agent a small penalty at every step. It is used to encourage the agent to deal damage and win the game as soon as possible. The last term in the formula is \(P^{POW}\). It gives a moderate penalty (specifically 5) when a POW orb is consumed. If the desperation move hits the opponent afterwards, the damage reward will offset this penalty; otherwise, the agent is punished for wasting the POW orb.
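
The formula can be written as a small function. The sketch below follows the equations term by term; the values 1 and 1.3 from the text are used as the balancing factors \(\gamma ^{1P}\) and \(\gamma ^{2P}\), which is our reading of the description, while the defaults for \(\beta \), \(\gamma ^{distance}\), and \(\gamma ^{time}\) are placeholders because the text does not state them.

```python
def total_reward(d_1p, c_1p, d_2p, c_2p, gap, pow_consumed,
                 gamma_1p=1.0, gamma_2p=1.3,
                 beta=100.0, gamma_distance=0.01, gamma_time=0.01):
    """Per-step reward; beta, gamma_distance and gamma_time are assumed values."""
    r_damage = d_1p * c_1p * gamma_1p                  # R^{1P_damage}
    p_damage = d_2p * c_2p * gamma_2p                  # P^{2P_damage}
    p_distance = max(0.0, gap - beta) * gamma_distance  # P^{distance}
    p_time = 1.0 * gamma_time                          # P^{time}
    p_pow = 5.0 if pow_consumed else 0.0               # P^{POW}
    return r_damage - p_damage - p_distance - p_time - p_pow


# The agent lands 10 damage as the second hit of a combo while staying close.
print(total_reward(d_1p=10, c_1p=2, d_2p=0, c_2p=0, gap=50, pow_consumed=False))
# The agent takes 10 damage, stands too far away, and wastes a POW orb.
print(total_reward(d_1p=0, c_1p=0, d_2p=10, c_2p=1, gap=200, pow_consumed=True))
```

With \(\gamma ^{2P} > \gamma ^{1P}\), taking a hit costs more than landing an equal hit earns, which discourages reckless trades.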

2.6 Proposed Network Structure

Figure 5 shows the overall structure of our proposed network. The network’s input consists of four parts: the basic features as a vector, the stacked input sequence (last 40 frames), the stacked ACTs of the 1P character (last 36 steps), and the stacked ACTs of the 2P character (last 36 steps). Three CNNs extract features from these images, and the outputs of the CNN extractors are concatenated with the basic features to form the input of our Action Network and Value Network. Figure 5b shows the structure of the CNN. The proposed network is referred to as Multi-CNN in the following.

Fig. 5. Proposed network structure

3 Experiment

3.1 Training Process

The game’s difficulty level (built-in AI level) was set to 8, which is the hardest setting. Built-in AI refers to the programmed action strategy that fights against players in this game. Even for intermediate-level gamers, beating the level 8 built-in AI is quite challenging. On top of that, to simplify the training process, the game mode was changed from team-play to single-play. The chosen RL algorithm was Proximal Policy Optimization [6]. The model was trained on a personal computer with a 12-core CPU and an RTX 3060 GPU, with 12 environments running in parallel.

3.2 Results

Multi-CNN Model and Transfer Learning. Each model was evaluated for 50 episodes. The first model’s agent operated the character Iori to fight against Kyo Kusanagi. As Table 1 shows, after 54 h of training, the first Multi-CNN model achieved a 100% win rate. A random agent was used as a negative baseline, and its win rate was only 18%. Based on the first Multi-CNN model, three more models were trained using transfer learning to fight against three other arbitrarily chosen characters: Goro Daiman, Mai Shiranui, and Billy. As Table 1 shows, after 30 M timesteps of training, the transfer learning models achieved similar performance. As Fig. 6 shows, the performance of the transfer learning models increases more rapidly than that of the model trained from scratch, meaning that the experience learned by the previous model can be transferred to new models. The agents can defeat their opponents in a short time with a lot of HP remaining. Besides, our agents perform more combos and desperation moves than the random agent. Overall, the Multi-CNN models are more than capable of winning the game by a considerable margin. Sample videos of the Multi-CNN model agents are available on YouTube (footnote 2).

Table 1. Test Result

Fig. 6. Mean rewards of Multi-CNN models and LSTM model

LSTM Network for Comparison. The LSTM network is well suited to tasks based on time series data since its cells can remember values over arbitrary time intervals. To illustrate the validity of our proposed network, an LSTM network was constructed for comparison. As Fig. 7 shows, its structure is similar to that of the Multi-CNN. The only difference is that the multiple CNN extractors are replaced with an LSTM module. As Table 1 shows, given the same budget (50 M timesteps), the LSTM model achieves only a 46% win rate. As Fig. 6 shows (the purple line), the performance increases steadily during the first 5 M steps of training and then begins to fluctuate without apparent improvement. In the Multi-CNN model, by contrast, the reward increases gradually as the number of steps increases. Figure 8 shows the results of training more LSTM models with different numbers of hidden layers and hidden units for 10 M timesteps; none of these models was promising. Another observation is that the training process slowed dramatically as the number of hidden layers and units increased.

Fig. 7. LSTM network structure

Fig. 8. Mean reward over timesteps of more LSTM models

3.3 Discussion

Even though our agent can perform some simple combos and defeat the built-in AI by a large margin, it failed to learn advanced combo skills. This is because performing advanced combos requires a long sequence of accurate inputs at the right time. Such opportunities are rare unless the player intentionally creates them. Furthermore, our agent has many actions to choose from at each step, so it is almost impossible for it to gain such experience purely by random exploration. Some studies, such as [2, 7], combined supervised learning and reinforcement learning to make the training process more effective. In the first stage, they used supervised learning to make the model imitate human operation, and in the second stage, they used RL to enhance the model. This would be a promising direction for further improvement.

Atari games are usually used as a testbed for RL algorithms, but the operating skill that such games demand is relatively low. In reality, manipulating things is often more complex and involves a sequence of operations that follows certain patterns. Recurrent neural networks (RNNs) are commonly used to process time series data. However, in such models, more recent data usually have a greater impact on the results, so they are not very efficient at detecting behavioral patterns that have some flexibility. The proposed graphical representation of sequences and the proposed network structure provide a new idea for detecting and simulating the complex action patterns required by more practical tasks.

4 Conclusions

In this paper, valuable experience from previous research is applied to our task: playing the arcade fighting game KOF ’97. Based on the features of this game and our goal, a novel graphical representation of the input sequence and the characters’ action sequences is proposed. The input patterns and the characters’ behavior patterns are well extracted by the proposed Multi-CNN network. Besides, a contribution to the RL community has been made by adding a new game environment for KOF ’97, which is one of the most iconic arcade fighting games.

The experiments show that the agent can not only win the game by a large margin but also learn to perform some combos and desperation moves according to the situation. They also show that the experience learned from fighting against one character can be transferred to fighting against other characters via transfer learning.