
1 Introduction

As a famous international robotics event, RoboCup attracts numerous robot enthusiasts and researchers around the world. The Small Size League (SSL) is one of the oldest leagues in RoboCup and consists of 28 teams this year. An SSL game takes place between two teams of six robots each. Each robot must conform to the specified dimensions: it must fit within a circle of 180 mm diameter and must be no higher than 150 mm. The robots play soccer with an orange golf ball on a green carpeted field that is 9 m long by 6 m wide. All objects on the field are tracked by a standardized vision system that processes the data provided by four cameras attached to a camera bar located 4 m above the playing surface. Off-field computers for each team perform the processing required for coordination and control of the robots. Communication is wireless and uses dedicated commercial radio transmitters and receivers.

In this paper we introduce our hardware and software framework. The software framework has a plugin system, which brings extensibility. For the high-level strategy, we focus on the free-kick because we want a strategy that is both more intelligent and more controllable. Controllable means that we can switch strategies in case the opponent changes theirs in the next game; intelligence and controllability are not contradictory. Much research also indicates the importance of the free-kick [1, 3].

In recent years, many applications of reinforcement learning have emerged, for instance the AI agents for StarCraft and Dota. These applications require cooperation between agents, and RoboCup is a perfect testbed for reinforcement learning research thanks to its simplified multi-agent environment and explicit goal. Our free-kick strategy was developed in this context, and the empirical results from RoboCup 2017 indicate that it performs outstandingly.

The remainder of this paper is organized as follows. Section 2 gives an overview of the robot hardware. Section 3 presents the details of the robotics framework we use. Section 4 introduces the Markov decision process (MDP) and the MAXQ method in Sect. 4.1, then illustrates their application in our free-kick strategy. Section 5 shows the results. Finally, Sect. 6 concludes the paper and points out some future work.

2 Hardware

In this section, we give an overview of the robot design. The controller board is shown in Fig. 1 and the mechanical structure in Fig. 2.

Fig. 1.
figure 1

Controller board overview

Fig. 2.
figure 2

Mechanical structure

The main processor is an STM32F407VET6 microcontroller. The components labeled in Fig. 1 (controller board) are:

(1) Colored LED interface
(2) Motor controller interface
(3) Encoder interface
(4) Infrared interface
(5) Motor interface
(6) Speaker interface
(7) LED screen interface
(8) Mode setting switch
(9) Bluetooth indicator
(10) Debug interface
(11) Joystick indicator
(12) Booster switch

The components labeled in Fig. 2 (mechanical structure) are:

(1) LED screen
(2) Charge status indicator
(3) Kicker mechanism
(4) Bluetooth speaker
(5) Battery
(6) Universal wheel
(7) Power button
(8) Energy-storage capacitor

3 Software Framework

RoboKit is a robotics framework developed by us, as shown in Fig. 3. It contains a plugin system, a communication mechanism, an event system, a service system, a parameter server, network protocols, a logging system, Lua script bindings, etc. We develop it in C++ and Lua, so it is a cross-platform framework (running on Windows, Linux, macOS, etc.). For the SSL, we developed several plugins on top of this framework, such as a vision plugin, a skill plugin and a strategy plugin. The vision plugin contains multi-camera fusion, a speed filter and trajectory prediction. The skill plugin contains all of the basic actions such as kick, intercept, chase and chip. The strategy plugin contains the defense and attack systems.
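The real framework is written in C++ with Lua bindings; the snippet below is only a minimal Python sketch of the plugin idea (a manager dispatches events to registered, named plugins). All class and method names here are illustrative assumptions, not the actual RoboKit API.

```python
# Minimal sketch of the plugin idea (illustrative only; RoboKit itself is
# C++/Lua and its real API differs).

class Plugin:
    """Base class: a plugin reacts to events dispatched by the framework."""
    name = "base"

    def on_event(self, event, payload):
        raise NotImplementedError


class VisionPlugin(Plugin):
    """Toy stand-in for the vision plugin (fusion, filtering, prediction)."""
    name = "vision"

    def on_event(self, event, payload):
        if event == "camera_frame":
            print(f"[{self.name}] fusing frame with {len(payload)} detections")


class PluginManager:
    """Registers plugins and dispatches events to all of them."""
    def __init__(self):
        self._plugins = {}

    def register(self, plugin):
        self._plugins[plugin.name] = plugin

    def dispatch(self, event, payload):
        for plugin in self._plugins.values():
            plugin.on_event(event, payload)


if __name__ == "__main__":
    manager = PluginManager()
    manager.register(VisionPlugin())
    manager.dispatch("camera_frame", [{"ball": (0.0, 0.0)}])
```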

Fig. 3.
figure 3

RoboKit structure

4 Reinforcement Learning

Reinforcement learning has become an important method in RoboCup. Stone and Veloso [3, 4, 12], Bai [2], Riedmiller [13], et al. have done much work on online learning, SMDP Sarsa(λ) and MAXQ-OP for robot planning.

The free-kick plays a significant role in offense, while the opponents' defensive formation is relatively static. Our free-kick strategy is inspired by the observation that a free-kick can itself be treated as an MDP, in which the robot learns to select the best free-kick tactic from a set of pre-defined scripts. For the learning process, we implement the MAXQ method to handle the large state space.

In this section we first briefly introduce MDPs and MAXQ; further details can be found in [9]. Then we show how to apply this method to our free-kick strategy, covering the MDP modeling and the construction of the sub-task structure.

4.1 MAXQ Decomposition

The MAXQ technique decomposes a Markov decision process \( M \) into several sub-processes hierarchically, denoted by \( \left\{ {M_{i}, i = 0, 1, \ldots, n} \right\} \). Each sub-process \( M_{i} \) is itself an MDP, defined as \( \left\langle {S_{i}, T_{i}, A_{i}, R_{i}} \right\rangle \), where \( S_{i} \) and \( T_{i} \) are the active state set and the termination set of \( M_{i} \), respectively. When the active state transitions to a state in \( T_{i} \), \( M_{i} \) is solved. \( A_{i} \) is the set of actions that can be performed within \( M_{i} \); each action is either a primitive action or another subtask. \( R_{i}(s^{\prime} \mid s, a) \) is the pseudo-reward function for transitions from active states to termination states, indicating the parent task's preference for action \( a \) during the transition from state \( s \) to state \( s^{\prime} \). If the termination state is not the expected one, a negative reward is given to discourage \( M_{i} \) from producing this termination state [9]. The action-value function of a subtask decomposes as

$$ Q_{i}^{*}(s,a) = V^{*}(a,s) + C_{i}^{*}(s,a) $$
(1)

where \( Q_{i}^{*}(s,a) \) is the expected cumulative reward of first performing the action (subtask) \( M_{a} \) in state \( s \) and then following the optimal policy until \( M_{i} \) terminates. \( V^{*}(a,s) \) is the projected value function of subtask \( M_{a} \) in state \( s \), defined as the expected cumulative reward of executing \( M_{a} \) from state \( s \) until it terminates.

$$ V^{*}(i,s) = \begin{cases} R(s,i) & \text{if } M_{i} \text{ is primitive} \\ \max_{a \in A_{i}} Q_{i}^{*}(s,a) & \text{otherwise} \end{cases} $$
(2)

\( C_{i}^{*}(s,a) \) is the completion function, which estimates the discounted cumulative reward of completing \( M_{i} \) after the action \( M_{a} \) terminates, defined as:

$$ C_{i}^{*}(s,a) = \sum_{s^{\prime}, N} \gamma^{N} \, P(s^{\prime}, N \mid s, a) \, V^{*}(i, s^{\prime}) $$
(3)
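Equations (1)–(3) form a mutual recursion: the value of a composite task is the best child's value plus a completion term, and the completion term is a discounted expectation of the task's value over the child's terminal states. As a minimal sketch (not the team's code), the recursion can be written directly against a hypothetical `model` interface supplying `is_primitive`, `children`, `reward`, `terminal_distribution` and `gamma`:

```python
# Direct transcription of Eqs. (1)-(3); `model` is a hypothetical interface,
# and this exact recursion is only tractable for tiny problems (the online,
# depth-limited approximation is sketched under Algorithms 2 and 3 below).

def v_star(task, state, model):
    """Eq. (2): projected value of `task` in `state`."""
    if model.is_primitive(task):
        return model.reward(state, task)
    return max(q_star(task, state, child, model) for child in model.children(task))

def q_star(task, state, action, model):
    """Eq. (1): value of doing child `action` in `state`, then completing `task`."""
    return v_star(action, state, model) + c_star(task, state, action, model)

def c_star(task, state, action, model):
    """Eq. (3): discounted expected value of `task` once `action` terminates."""
    return sum(prob * (model.gamma ** steps) * v_star(task, next_state, model)
               for next_state, steps, prob in model.terminal_distribution(state, action))
```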

The online planning solution is explained in [2], and here we list the main algorithms.

figure a

Here we set an initial action before the system starts updating. The initial action enables us to modify the strategy according to the opponent's defense formation.
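A rough sketch of how such an online planning loop with a seeded initial action could look; only the seeding of the first action from the scouted defense formation comes from the text, and all names (`env`, `planner`, `best_action`) are assumptions rather than the team's actual implementation:

```python
# Rough sketch of an online planning loop with a seeded initial action
# (in the spirit of Algorithm 1; names and structure are assumptions).

def online_planning(env, root_task, planner, initial_action):
    """Act until the root task (the free-kick) terminates, re-planning each step."""
    state = env.observe()
    action = initial_action    # chosen beforehand from the opponent's defense formation
    while not env.terminated(state):
        env.execute(action)
        state = env.observe()
        action = planner.best_action(root_task, state)   # re-plan from the new state
    return state
```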

figure b

Algorithm 2 summarizes the major procedure of evaluating a subtask. The procedure performs a depth-first search over an AND-OR tree. The recursion ends when one of the following holds (a sketch in code follows this list):

(1) the subtask is a primitive action;
(2) the state is a goal state or a state outside the scope of this subtask;
(3) a certain depth is reached.
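A depth-limited, depth-first evaluation with exactly these three stopping conditions might look like the following sketch; the `model` interface and the depth value are assumptions, and `estimate_completion` is sketched together with Algorithm 3 below:

```python
# Sketch of the depth-first subtask evaluation (in the spirit of Algorithm 2).
# `model` is an assumed interface (is_primitive, is_terminal, pseudo_reward,
# reward, heuristic, children); `estimate_completion` is sketched after
# Algorithm 3 below.

MAX_DEPTH = 4  # assumed value; the paper only says "a certain depth"

def evaluate(task, state, depth, model):
    """Estimate the value of `task` in `state` with a depth-limited search."""
    if model.is_primitive(task):                 # (1) primitive action
        return model.reward(state, task)
    if model.is_terminal(task, state):           # (2) goal state or out of scope
        return model.pseudo_reward(task, state)
    if depth >= MAX_DEPTH:                       # (3) depth limit reached
        return model.heuristic(task, state)
    # otherwise expand the AND-OR tree: best child value plus completion term
    return max(evaluate(child, state, depth + 1, model)
               + estimate_completion(task, state, child, depth, model)
               for child in model.children(task))
```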

figure c

Algorithm 3 shows a recursive procedure to estimate the completion function, where \( \widetilde{G}_{a} \) is a set of sampled states drawn from the prior distribution \( D_{a} \) using importance sampling techniques.
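As a companion to the previous sketch, the completion term can be approximated by averaging over sampled terminal states (the set \( \widetilde{G}_{a} \) drawn from \( D_{a} \)); the sampler interface and the weighting scheme here are assumptions:

```python
# Companion sketch of the sampled completion estimate (in the spirit of
# Algorithm 3): average the discounted value of `task` over terminal states
# sampled for `action`, weighted by their importance weights.

def estimate_completion(task, state, action, depth, model):
    samples = model.sample_terminal_states(state, action)   # [(s', steps, weight), ...]
    if not samples:
        return 0.0
    total = weight_sum = 0.0
    for next_state, steps, weight in samples:
        total += weight * (model.gamma ** steps) * evaluate(task, next_state, depth + 1, model)
        weight_sum += weight
    return total / weight_sum
```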

4.2 Application in Free-Kick

Now we apply the techniques described above to our free-kick strategy. First we model the free-kick as an MDP, specifying the state, actions, transition model and reward function.

State.

As usual, the teammates and opponents are treated as the observation of the environment. The state vector has a fixed length, covering 5 teammates and 6 opponents.
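One possible fixed-length encoding of such a state is sketched below; the exact fields (e.g. including the ball position) and coordinate conventions are assumptions, not the team's actual representation:

```python
# Illustrative fixed-length state: the ball plus 5 teammates and 6 opponents,
# each as (x, y) field coordinates. Fields and units are assumptions.
from dataclasses import dataclass
from typing import List, Tuple

Vec2 = Tuple[float, float]

@dataclass
class FreeKickState:
    ball: Vec2
    teammates: List[Vec2]   # fixed length 5, as stated in the text
    opponents: List[Vec2]   # fixed length 6, as stated in the text

    def as_vector(self) -> List[float]:
        flat = [*self.ball]
        for p in self.teammates + self.opponents:
            flat.extend(p)
        return flat         # 2 + 2*5 + 2*6 = 24 numbers
```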

Action.

For the free-kick, the actions include kick, turn and dash. They lie in a continuous action space.

Transition.

We predefined 60 scripts that tell the agent the behavior of its teammates; a script is chosen at random. For the opponents, we simply assume that they move, or kick if the ball is kickable, at random. The basic atomic actions are modeled from the robot dynamics.
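A hedged sketch of how such a scripted transition model might be sampled; the script object, the opponent model and the `dynamics` interface are all placeholders rather than the team's implementation:

```python
import random

# Sketch of a scripted transition model: teammates follow a randomly chosen
# pre-defined script, opponents move (or kick, if able) at random.
# Script contents, step sizes and the dynamics call are placeholders.

NUM_SCRIPTS = 60

def sample_transition(state, action, scripts, dynamics, rng=random):
    script = scripts[rng.randrange(NUM_SCRIPTS)]   # teammate behaviour for this rollout
    teammate_moves = script.moves_for(state)
    opponent_moves = []
    for opp in state.opponents:
        if dynamics.can_kick(opp, state.ball):
            opponent_moves.append(("kick", rng.uniform(0, 360)))
        else:
            opponent_moves.append(("move", rng.uniform(0, 360)))
    # the kicker's atomic action is propagated through the dynamics model
    return dynamics.step(state, action, teammate_moves, opponent_moves)
```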

Reward Function.

The reward function does not consider only scoring a goal, since such a sparse reward may cause the forward search process to run for a long period without receiving any reward. Considering a free-kick, a satisfying serve should never be intercepted by the opponents, so if the ball passes through the opponents we give a positive reward. Similarly, we design several reward functions for the different sub-tasks.
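A sketch of the shaped reward described above; only the positive reward for the ball passing the opponents and the goal reward come from the text, and every numeric value is an assumption:

```python
# Sketch of a shaped free-kick reward: reward a goal, and also reward the
# ball safely passing the opposing defenders so the forward search does not
# run reward-free for long horizons. All numeric values are assumptions.

GOAL_REWARD = 100.0
PASS_THROUGH_REWARD = 10.0
INTERCEPT_PENALTY = -10.0

def free_kick_reward(prev_state, state, events):
    if events.get("goal_scored"):
        return GOAL_REWARD
    if events.get("intercepted"):
        return INTERCEPT_PENALTY
    if events.get("passed_defenders"):   # ball travelled past the opponents
        return PASS_THROUGH_REWARD
    return 0.0
```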

Next, we apply MAXQ to decompose this MDP and handle the large state space. Our free-kick MAXQ hierarchy is constructed as follows:

Primitive Actions.

We define three low-level primitive actions for the free-kick process: kick, turn and dash. Each primitive action has a reward of −1 so that the learned policy reaches the goal quickly.

Subtasks.

The kickTo subtask aims to kick the ball in a given direction with a proper velocity, while the moveTo subtask is designed to move the robot to a given location. At a higher level, there are the Lob, Pass, Dribble, Shoot, Position and Formation behaviors, where:

(1) Lob kicks the ball in the air so that it lands behind the opponents;
(2) Pass gives the ball to a teammate;
(3) Dribble carries the ball for some distance;
(4) Shoot kicks the ball to score;
(5) Position maintains the formation during the free-kick.

Free-Kick.

The root of the hierarchy evaluates which sub-task the place kicker should take.

Our hierarchical structure is shown in Fig. 4. Note that some sub-tasks take parameters, which are shown in parentheses.
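The hierarchy of Fig. 4 can also be written down as a simple task tree, with the −1 primitive reward from above; the exact edges and parameter lists below are read off the description and should be taken only as an approximation of the figure:

```python
# Approximate task tree of the free-kick hierarchy (cf. Fig. 4). Parameterised
# sub-tasks carry their (assumed) parameters as strings; kick, turn and dash
# are the primitives, each with the -1 step reward mentioned above.

PRIMITIVE_REWARD = -1
PRIMITIVES = ("kick", "turn", "dash")

FREE_KICK_HIERARCHY = {
    "FreeKick":  ["Lob", "Pass", "Dribble", "Shoot", "Position", "Formation"],
    "Lob":       ["kickTo(direction, chip)"],
    "Pass":      ["kickTo(teammate)"],
    "Dribble":   ["moveTo(point)", "kickTo(direction, low_power)"],
    "Shoot":     ["kickTo(goal)"],
    "Position":  ["moveTo(formation_slot)"],
    "Formation": ["moveTo(formation_slot)"],   # named in the text but not detailed
    "kickTo":    ["kick", "turn"],
    "moveTo":    ["dash", "turn"],
}
```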

Fig. 4.
figure 4

Hierarchical structure of free-kick

5 Performance Evaluation

To evaluate the strategy's performance, we extract the defense frames from the log files of teams in RoboCup 2016. Then we summarize each team's defense strategy and write a simulator that plays defense against our team.

For each team, we run 200 free-kick attacks. Tables 1 and 2 show the test results. Compared to the primitive free-kick strategy, our new strategy scores from a free-kick at a higher rate, which is what we expected.

Table 1. Training result against log files of RoboCup 2016 (above: free-kick with the primitive routine; below: free-kick using RL)
Table 2. RoboCup 2017 round robin result

The strategy was tested at RoboCup 2017. Note that the mechanism is not ideal, so some teamwork failed frequently. Still, it can be seen from Tables 2 and 3 that our team outperformed the other teams.

Table 3. RoboCup 2017 elimination round result

Before the final, we obtained the log files of the other teams and, after analyzing the opponents' defense routines, modified the strategy by specifying the initial action of the place kicker (i.e. the kicker directly passes the ball to a teammate, since Parsian's defense robots do not stay close). The test result is shown in Table 4.

Table 4. Test result before final

During the final, our robots' shooting speed frequently exceeded the rule limit and one robot was sent off. Luckily, our team won by a narrow margin (Table 5).

Table 5. Final result

6 Conclusion

This paper presents our robot's hardware and software framework. We implement reinforcement learning in our free-kick tactic. Based on the related work, we divide the free-kick into sub-tasks and write hand-made routines for the learning process. The results of the competition demonstrate the effectiveness of our strategy. At the same time, we find that some generated policies are infeasible and can never be fully executed by the robots. Therefore, we need to consider more constraints, and the mechanical design needs to be more flexible. Our contribution lies in the realization of reinforcement learning in the SSL, which is a first step from simulation to reality. In the future, we plan to apply more artificial intelligence technologies in the SSL and work toward the competition between humans and robots in RoboCup 2050.