Ball Dribbling for Humanoid Biped Robots: A Reinforcement Learning and Fuzzy Control Approach
Abstract
In the context of humanoid soccer robotics, ball dribbling is a complex and challenging behavior that requires a proper interaction of the robot with the ball and the floor. We propose a methodology for modeling this behavior by splitting it into two subproblems: alignment and ball pushing. Alignment is achieved using a fuzzy controller in conjunction with an automatic foot selector. Ball pushing is achieved using a reinforcement-learning-based controller, which learns how to keep the robot near the ball while controlling its speed when approaching and pushing the ball. Four different models for the reinforcement learning of the ball-pushing behavior are proposed and compared. The entire dribbling engine is tested using a 3D simulator and real NAO robots. Performance indices for evaluating the dribbling speed and ball control are defined and measured. The obtained results validate the usefulness of the proposed methodology, showing asymptotic convergence in around fifty training episodes and similar performance between simulated and real robots.
Keywords
Reinforcement learning · TSK fuzzy controller · Soccer robotics · Biped robot · NAO · Behavior · Dribbling

1 Introduction
In the context of soccer robotics, ball dribbling is a complex behavior in which a robot player attempts to maneuver the ball in a very controlled way while moving towards a desired target. In the case of humanoid biped robots, the complexity of this task is very high, because it must take into account the physical interaction between the ball, the robot's feet, and the ground, which is highly dynamic, nonlinear, and influenced by several sources of uncertainty.
Very few works have addressed the dribbling behavior with biped humanoid robots. [1] presents an approach that incorporates ball dribbling as part of a closed-loop gait, combining footstep and foot-trajectory planners for integrating kicks into the walking engine. Since that work focuses on the theoretical models and controllers of the gait, no final performance evaluation of the dribbling engine is included. On the other hand, [2] presents an approach that uses imitative reinforcement learning for dribbling the ball from different positions into the empty goal, while [3] proposes an approach that uses corrective human demonstration for augmenting a hand-coded ball-dribbling task performed against stationary defender robots. Since these two works do not address the dribbling behavior explicitly, few details are given about the specific dribbling modeling or about performance evaluations of ball control or accuracy with respect to the desired target. Some teams that compete in humanoid soccer leagues, such as [4, 5], have implemented successful dribbling behaviors, but to the best of our knowledge, no publications directly describing their dribbling methods have been reported for comparison. It is not clear whether a hand-coded or a learning-based approach has been used in these cases.
The dribbling problem has been addressed more extensively for the wheeled-robot case, where approaches based on automatic control and Machine Learning (ML) have been proposed: [6, 7] apply Reinforcement Learning (RL), [8, 9] use neural networks and evolutionary computation, [10] applies PD control with linearized kinematic models, [11] uses nonlinear predictive control, and [12, 13, 14] apply heuristic methods. However, these approaches are not directly applicable to the biped humanoid case, due to its much higher complexity.
Although several strategies can be used to tackle the dribbling problem, we classify them into three main groups: (i) those based on human experience and/or hand-coded rules [2, 3]; (ii) those based on identification of the system dynamics and/or kinematics and on mathematical models [1, 10, 11]; and (iii) those based on online learning of the system dynamics [6, 7, 8, 9]. For developing the dribbling behavior, each of these alternatives has advantages and disadvantages: (i) is initially faster to implement, but vulnerable to errors and difficult to debug and retune when parameters change or as the system complexity increases; (ii) can be solved completely offline by analytical or heuristic methods, since the robot and ball kinematics are known, but identifying the interaction of the walking robot's foot with a dynamic ball and the floor can be convoluted; thus, those strategies from (iii) that are able to learn that robot-ball-floor interaction while finding an optimal policy for the ball-pushing behavior, such as RL, are a promising and attractive approach.
The main goal of this paper is to propose a methodology for learning the ball-dribbling behavior in biped humanoid robots, reducing the online training time as much as possible. In this way, the alternatives from (ii) mentioned above are used to reduce the complexity of the behaviors learned with (iii). The proposed methodology models the ball-dribbling problem by splitting it into two subproblems: alignment and ball pushing. The alignment problem consists of controlling the pose of the robot in order to obtain a proper alignment with the ball's final target. The ball-pushing problem consists of controlling the robot's speed in order to obtain, at the same time, a high ball speed and a low relative distance between the ball and the robot, i.e., both efficiency and controllability. These ideas are implemented by three modules: (i) a fuzzy logic controller (FLC) for aligning the robot when approaching the ball (designed offline), (ii) a foot selector, and (iii) a reinforcement-learning (RL) based controller for controlling the robot's speed when approaching and pushing the ball (learned online).
Performance indices for evaluating the dribbling speed and ball control are measured. In the experiments, training is performed using a 3D simulator, but validation is done using real NAO robots. The article is organized as follows: Sect. 2 describes the proposed methodology. Section 3 presents the experimental setup and the obtained results. Finally, conclusions and future work are drawn in Sect. 4.
2 A Methodology for Learning the Ball-Dribbling Behavior
2.1 Proposed Modeling
As mentioned in the previous section, the proposed methodology splits the dribbling problem into two different behaviors: alignment and ball pushing. Under this modeling, ball pushing is treated as a one-dimensional (1D) problem, since the ball must be pushed along the ball-target line whenever the robot is aligned; the alignment behavior is responsible for enforcing this assumption, continuously correcting the robot's desired direction of movement.
 i.
Alignment: in order to maintain the 1D assumption, we propose to implement an FLC that keeps the robot aligned with the ball-target line (\( \varphi = 0,\; \gamma = 0 \)) while approaching the ball. The control actions of this subsystem are applied at all times over \( v_{\theta } \) and \( v_{y} \), and over \( v_{x} \) only when the constraints of the 1D assumption are not fulfilled. This behavior also uses the foot selector for setting the foot that pushes the ball, in order to improve the ball's direction. Due to the nature of this sub-behavior, the kinematics of the robot and of the ball can be modeled individually. Thus, we propose to design and tune this task offline.
 ii.
Ball pushing: following the 1D assumption, the objective is for the robot to walk as fast as possible and hit the ball in order to change its speed, but without losing possession of the ball. This means that the ball must be kept near the robot. Modeling the robot's feet-ball-floor dynamics is complex and inaccurate, because kicking the ball can generate several unexpected transitions, due to uncertainty in the foot's shape and speed when it kicks the ball (note that the foot's speed differs from the robot's speed \( v_{x} \)). Therefore, we propose to model this behavior as a Markov Decision Process (MDP), which is solved and learned online using an RL scheme. The behavior is applied only when the constraints of the 1D assumption are fulfilled, i.e. when the robot's alignment is achieved. Figure 1(b) shows the variables used in this behavior.
2.2 Alignment Behavior
The alignment behavior is composed of two modules: (a) the FLC, which sets the robot's speeds to align it with the ball-target line; and (b) the foot selector, which, depending on the ball position and the robot pose, decides which foot must kick the ball.
In order to perform better control actions at different operating points, the constant gains of the three linear controllers can be replaced by adaptive gains. Thus, three Takagi-Sugeno-Kang Fuzzy Logic Controllers (TSK-FLCs) are proposed, which maintain the same linear controller structure in their polynomial consequents.
The \( \upsilon_{x} \) rule base (consequent gain for each combination of \( \gamma \) and \( \varphi \)):

\( \gamma \backslash \varphi \)   −H          −L          +L          +H
−H                                K_{Lxρ}     K_{Lxρ}     K_{Lxρ}     K_{Hxρ}
−L                                K_{Hxρ}     K_{Hxρ}     K_{Hxρ}     K_{Hxρ}
+L                                K_{Lxρ}     K_{Hxρ}     K_{Hxρ}     K_{Hxρ}
+H                                K_{Lxρ}     K_{Lxρ}     K_{Lxρ}     K_{Lxρ}
(a) The \( v_{y} \) controller rules:

If \( \varphi \) is Low & \( \rho \) is Low, then \( k_{y\varphi } \) is Low
If \( \varphi \) is Low & \( \rho \) is High, then \( k_{y\varphi } \) is Zero
If \( \varphi \) is High & \( \rho \) is High, then \( k_{y\varphi } \) is Zero
If \( \varphi \) is High & \( \rho \) is Low, then \( k_{y\varphi } \) is High

(b) The \( v_{\theta } \) controller rules:

If \( \rho \) is Low, then \( k_{\theta \gamma } \) is High, \( k_{\theta \varphi } \) is Low
If \( \rho \) is Med., then \( k_{\theta \gamma } \) is Low, \( k_{\theta \varphi } \) is High
If \( \rho \) is High, then \( k_{\theta \gamma } \) is High, \( k_{\theta \varphi } \) is Low
The \( v_{\theta} \) FLC's rule base is described in Table 2(b). The control action is a weighted combination of \( \gamma \) and \( \varphi \); the FLC adapts the gain \( k_{\theta \varphi } \), setting it close to zero when \( \rho \) is low or high, in which cases the robot aligns itself with the ball by minimizing \( \gamma \); when \( \rho \) is medium, the robot instead tries to approach the ball aligned with the target, minimizing \( \varphi \).
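The adaptive-gain scheme above can be illustrated with a minimal zero-order TSK sketch in Python: \( \rho \) is fuzzified into Low/Medium/High sets and the consequent gains of Table 2(b) are combined by the usual weighted average. The membership breakpoints and gain magnitudes below are illustrative assumptions, not the tuned values of the actual controller.

```python
import numpy as np

def memberships(rho):
    """Fuzzify rho (mm) into (Low, Med, High) degrees with trapezoidal
    shoulders. Breakpoints are illustrative assumptions."""
    low = np.clip((600.0 - rho) / 400.0, 0.0, 1.0)   # 1 below 200, 0 above 600
    high = np.clip((rho - 600.0) / 400.0, 0.0, 1.0)  # 0 below 600, 1 above 1000
    med = np.clip(1.0 - low - high, 0.0, 1.0)
    return np.array([low, med, high])

# Consequent gains per rule (pattern of Table 2(b)): rows = rho Low/Med/High,
# columns = (k_theta_gamma, k_theta_phi). The "High"/"Low" gain magnitudes
# (1.0 / 0.1) are hypothetical placeholders.
GAINS = np.array([[1.0, 0.1],   # rho Low  -> k_tg High, k_tp Low
                  [0.1, 1.0],   # rho Med  -> k_tg Low,  k_tp High
                  [1.0, 0.1]])  # rho High -> k_tg High, k_tp Low

def v_theta(rho, gamma, phi):
    """TSK inference: weighted average of the rule consequents, then a
    linear control law on the angles gamma and phi."""
    w = memberships(rho)
    k_tg, k_tp = (w @ GAINS) / w.sum()
    return k_tg * gamma + k_tp * phi
```

For example, near the ball (\( \rho \) small) the output is dominated by the \( \gamma \) term, matching the rule "if \( \rho \) is Low, \( k_{\theta\gamma} \) is High".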
where S is the total number of trained scenes with different initial robot and ball positions, and \( \rho_i \) is the i-th initial robot-ball distance. Please refer to [15] for further details.
The position of the virtual ball with respect to the robot's reference frame is given as V = B + S_{sideward}.
2.3 Reinforcement Learning for BallPushing Behavior
Since no footstep planners or specific kicks are used, the ball is propelled by the robot's feet while it walks, and the distance traveled by the ball depends on the speed of the robot's feet just before hitting it. Moreover, since the controlled variables are the robot's speeds relative to its center of mass, and not directly the speed of the feet, the robot's feet-ball dynamics become complex and difficult to model accurately. For this reason, RL of the ball-pushing behavior is proposed.
States and actions description for M1:

State space: \( s = \left[ \rho \right] \), a total of 11 states.
  Feature \( \rho \): min 0 mm, max 500 mm, discretization 50 mm.
Action space: \( a = \left[ {\upsilon_{x} } \right] \), a total of 5 actions.
  Action \( \upsilon_{x} \): min 20 mm/s, max 100 mm/s, discretization 20 mm/s.
There are 55 state-action pairs.
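As a sketch of how such a uniform grid can be indexed in code (a Python illustration of the table above, not code from the paper):

```python
def discretize(value, lo, hi, step):
    """Saturate a continuous reading to [lo, hi] and return the index
    of the nearest point on the uniform grid with the given step."""
    value = min(max(value, lo), hi)
    return int(round((value - lo) / step))

# Model M1: state rho in [0, 500] mm with 50 mm resolution -> 11 states;
# action v_x in [20, 100] mm/s with 20 mm/s resolution -> 5 actions.
N_STATES = discretize(500, 0, 500, 50) + 1    # 11
N_ACTIONS = discretize(100, 20, 100, 20) + 1  # 5
```

The product of the two grid sizes recovers the 55 state-action pairs of model M1.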
States and actions description for M2:

State space: \( s = \left[ {\rho \;dv_{br} \;\upsilon_{x} } \right] \), a total of 110 states.
  Feature \( \rho \): min 0 mm, max 500 mm, discretization 50 mm.
  Feature \( dv_{br} \): min −100, max 100, discretized as negative or positive.
  Feature \( \upsilon_{x} \): min 20 mm/s, max 100 mm/s, discretization 20 mm/s.
Action space: \( a = \left[ {acc_{x} } \right] \), a total of 3 actions.
  Action \( acc_{x} \): min −20 mm/s², max 20 mm/s², discretized as negative, zero, and positive.
There are 330 state-action pairs.
The SARSA(\( \lambda \)) Algorithm. The algorithm implemented for the ball-pushing behavior is tabular SARSA(λ) with the replacing-traces modification [17]. Based on previous work, and after several trials, the SARSA(λ) parameters were chosen prioritizing fast convergence: learning rate α = 0.1, discount factor γ = 0.99, eligibility-trace decay λ = 0.9, and ε-greedy exploration with ε = 0.2 and exponential decay along the training episodes.
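A compact Python sketch of tabular SARSA(λ) with replacing traces, using the parameter values listed above, might look as follows; the `env` interface (`reset`/`step`) and the exact ε-decay schedule are our assumptions, not the authors' implementation.

```python
import numpy as np

def egreedy(Q, s, eps, rng):
    """Epsilon-greedy action selection over row s of the Q-table."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_lambda(env, n_states, n_actions, episodes=50,
                 alpha=0.1, gamma=0.99, lam=0.9, eps0=0.2):
    """Tabular SARSA(lambda) with replacing traces (Sutton & Barto [17]).
    `env` is a hypothetical interface: reset() -> s, step(a) -> (s', r, done)."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for ep in range(episodes):
        eps = eps0 * np.exp(-ep / episodes)   # exponential epsilon decay
        E = np.zeros_like(Q)                  # eligibility traces
        s = env.reset()
        a = egreedy(Q, s, eps, rng)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = egreedy(Q, s2, eps, rng)
            delta = r + (0.0 if done else gamma * Q[s2, a2]) - Q[s, a]
            E[s, :] = 0.0                     # replacing traces: clear the
            E[s, a] = 1.0                     # visited state, set pair to 1
            Q += alpha * delta * E            # update all traced pairs
            E *= gamma * lam                  # decay all traces
            s, a = s2, a2
    return Q
```

In the paper's setting, `n_states` and `n_actions` would be those of model M1 or M2, and the reward would implement R1 or R2.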
3 Results
3.1 The Ball-Pushing Behavior

The following performance indices are measured:

- the episode time \( t_f \), i.e. how long the agent takes to push the ball up to the target;
- the % cumulated time of faults: the cumulated time \( t_{faults} \) during which the robot loses possession of the ball, i.e. \( \rho > \rho_{max} \); then:
$$ \%\;cumulated\;time\;of\;faults = t_{faults} / t_f $$
- a global fitness function expressed as:
$$ F = \frac{BTD + \psi_{tf}}{\psi_{0}}\int_{t = 0}^{t_f} \rho(t)/V_{rx}(t)\,dt, $$
where \( BTD \) is the total distance traveled by the ball and \( \psi_{tf} \) is the ball-target distance when the episode finishes. In the ideal case, \( BTD = \psi_0 \) and \( \psi_{tf} = 0 \).
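As a sketch, the indices above can be computed from a sampled trajectory; the sampling scheme, the units, and the trapezoidal approximation of the integral are our assumptions:

```python
import numpy as np

def dribbling_indices(t, rho, v_rx, btd, psi0, psi_tf, rho_max=500.0):
    """Episode time t_f, % cumulated time of faults, and global fitness F,
    computed from sampled signals: t (s), rho (robot-ball distance, mm),
    v_rx (robot forward speed, mm/s)."""
    t, rho, v_rx = map(np.asarray, (t, rho, v_rx))
    t_f = t[-1] - t[0]                         # episode time
    dt = np.diff(t)
    # fraction of the episode spent with rho above the fault threshold
    pct_faults = dt[rho[:-1] > rho_max].sum() / t_f
    # trapezoidal approximation of integral rho(t)/V_rx(t) dt
    y = rho / v_rx
    integral = (0.5 * (y[1:] + y[:-1]) * dt).sum()
    F = (btd + psi_tf) / psi0 * integral
    return t_f, pct_faults, F
```

In the ideal case described above (BTD = ψ₀, ψ_tf = 0), the leading factor reduces to 1 and F is simply the integral term.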
Figure 4(b) shows that M1 carries out the dribbling about 15-20 % slower than M2. However, M1-R1 takes more care of ball possession (Fig. 4(c)), which implies walking at a lower speed and, of course, taking more time. Note that M1-R1 incurs about half the fault time of the other three models, as Fig. 4(c) depicts.
Model M1 is simpler, with fewer state-action pairs, so it learns faster. Model M2 has three features and more state-action pairs, so it learns more slowly than M1 but improves the final performance. On the other hand, reward R1 is simpler and obtains better performance with both models, but since it is less explicit for a detailed task, its convergence is slower.
After carrying out several training runs, testing different types of rewards and learning parameters for the proposed ball-pushing problem, we have concluded that using parameterized, interval-based rewards such as R2 is very sensitive to small parameter changes. For example, the choice of the magnitude of each interval reward, of ρ_max, and of V_th for handling the trade-off between speed and ball control, in addition to other learning parameters such as α, γ, λ and the exploration type, can dramatically modify the learning performance.
3.2 Validation: The Full Dribbling Behavior
For the final validation of the dribbling behavior, the policy learned with model M1 and reward function R1 (M1-R1) has been selected, because of its best trade-off between performance and convergence speed. Since the resulting policy is expressed as a Q-table with 55 state-action pairs, linear interpolation is applied in order to obtain a continuous input-output function. The FLC (alignment) and the RL-based controller (ball pushing) switch the handling of v_x depending on whether or not the robot is inside the ball-pushing zone. Finally, both controllers are transferred to the physical NAO robot, and a manual parameter adjustment of the foot selector and the FLC is carried out in order to compensate for the so-called reality gap with the simulator.
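One way to realize this continuous input-output function, under our reading of the description (per-action linear interpolation of the M1 Q-table over the \( \rho \) grid, then a greedy choice), is sketched below in Python:

```python
import numpy as np

def continuous_policy(Q, rho,
                      grid=np.arange(0.0, 501.0, 50.0),       # M1 rho grid (mm)
                      actions=np.arange(20.0, 101.0, 20.0)):  # M1 v_x set (mm/s)
    """Turn the learned 11x5 Q-table into a continuous mapping rho -> v_x:
    interpolate, per action, the Q-values of the two neighbouring states,
    then pick the action with the highest interpolated value. The grids
    follow model M1; the interpolation scheme itself is an assumption."""
    rho = np.clip(rho, grid[0], grid[-1])   # saturate outside the grid
    q_interp = np.array([np.interp(rho, grid, Q[:, a])
                         for a in range(Q.shape[1])])
    return actions[int(np.argmax(q_interp))]
```

This keeps the learned Q-table untouched and only smooths the state lookup, so the same table can serve both the simulated and the physical robot.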

- the average of \( t_f \), the time the robot takes to finish the dribbling scene;
- the average % time increased: if \( t_{walk} \) is the time the robot takes to finish the path of the dribbling scene walking at maximum speed without dribbling the ball, then \( \%\,time\;increased = (t_f - t_{walk})/t_f \);
- the standard deviation of the % time increased.
Validation results of the dribbling engine with three different scenes.

           Physical NAO                                           Simulated NAO
           Dribbling time (s)   Time increased (%)   St. Dev.     Time increased (%)
Scene 1    53.71                38.56                11.48        31.91
Scene 2    49.57                49.57                11.79        37.88
Scene 3    44.38                32.39                10.06        9.98
4 Conclusions and Future Work
This paper has presented a methodology for modeling the ball-dribbling problem in the context of humanoid soccer robotics, reducing the online training time as much as possible in order to make future implementations of learning the ball-dribbling behavior with physical robots achievable.
The proposed approach splits the problem into two subproblems: the alignment behavior, which has been carried out using a TSK-FLC; and the ball-pushing behavior, which has been learned using a tabular SARSA(λ) scheme, a well-known, widely used, and computationally inexpensive TD-RL method.
The ball-pushing learning results have shown asymptotic convergence in 50 to 150 training episodes, depending on the state-action model used, which supports the feasibility of future implementations with physical robots. Unfortunately, to the best of our knowledge, no similar dribbling-engine implementations have been reported with which to compare our final performance.
From the video [18], some inaccuracies in the alignment after pushing the ball can be noticed. This could be related to the exclusion of the ball and target angles from the state space and reward function due to the 1D assumption. Thus, as future work we propose to extend the methodology in order to learn the whole dribbling behavior, avoiding the switching between the RL and FLC controllers. In this way, we plan to transfer the FLC policy of the predesigned alignment behavior and refine it using RL while learning the ball pushing. For that purpose, transfer learning for RL is a promising approach. In addition, since the current state space is continuous and will grow with the proposed improvements, RL methods with function approximation and actor-critic schemes will be considered.
Acknowledgments
This work was partially funded by FONDECYT under Project Number 1130153 and the Doctoral program in Electrical Engineering at the Universidad de Chile. D.L. Leottau was funded by CONICYT, under grant: CONICYTPCHA/Doctorado Nacional/201363130183.
References
 1. Alcaraz, J., Herrero, D., Mart, H.: A closed-loop dribbling gait for the standard platform league. In: Workshop on Humanoid Soccer Robots of the IEEE-RAS International Conference on Humanoid Robots (Humanoids), Bled, Slovenia (2011)
 2. Latzke, T., Behnke, S., Bennewitz, M.: Imitative reinforcement learning for soccer playing robots. In: Lakemeyer, G., Sklar, E., Sorrenti, D.G., Takahashi, T. (eds.) RoboCup 2006: Robot Soccer World Cup X. LNCS (LNAI), vol. 4434, pp. 47–58. Springer, Heidelberg (2007)
 3. Meriçli, Ç., Veloso, M., Akin, H.: Task refinement for autonomous robots using complementary corrective human feedback. Int. J. Adv. Robot. Syst. 8(2), 68–79 (2011)
 4. Röfer, T., Laue, T., Müller, J., Bartsch, M., Batram, M.J., Böckmann, A., Lehmann, N., Maa, F., Münder, T., Steinbeck, M., Stolpmann, A., Taddiken, S., Wieschendorf, R., Zitzmann, D.: B-Human team report and code release 2012. http://www.b-human.de/wp-content/uploads/2012/11/CodeRelease2012.pdf (2012)
 5. HTWK NAO-Team: Team Description Paper 2013. In: RoboCup 2013: Robot Soccer World Cup XVII Preproceedings. RoboCup Federation, Eindhoven, The Netherlands (2013)
 6. Carvalho, A., Oliveira, R.: Reinforcement learning for the soccer dribbling task. In: 2011 IEEE Conference on Computational Intelligence and Games (CIG), Seoul, Korea (2011)
 7. Riedmiller, M., Hafner, R., Lange, S., Lauer, M.: Learning to dribble on a real robot by success and failure. In: 2008 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Pasadena, California (2008)
 8. Ciesielski, V., Lai, S.Y.: Developing a dribble-and-score behaviour for robot soccer using neuro evolution. Work. Intell. Evol. Syst. 2001, 70–78 (2013)
 9. Nakashima, T., Ishibuchi, H.: Mimicking dribble trajectories by neural networks for RoboCup soccer simulation. In: IEEE 22nd International Symposium on Intelligent Control, ISIC 2007 (2007)
10. Li, X., Wang, M., Zell, A.: Dribbling control of omnidirectional soccer robots. In: Proceedings of the 2007 IEEE International Conference on Robotics and Automation (2007)
11. Zell, A.: Nonlinear predictive control of an omnidirectional robot dribbling a rolling ball. In: 2008 IEEE International Conference on Robotics and Automation (2008)
12. Emery, R., Balch, T.: Behavior-based control of a non-holonomic robot in pushing tasks. In: Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No.01CH37164), vol. 3 (2001)
13. Damas, B.D., Lima, P.U., Custódio, L.M.: A modified potential fields method for robot navigation applied to dribbling in robotic soccer. In: Kaminka, G.A., Lima, P.U., Rojas, R. (eds.) RoboCup 2002: Robot Soccer World Cup VI. LNCS, vol. 2752, pp. 65–77. Springer, Heidelberg (2003)
14. Tang, L., Liu, Y., Qiu, Y., Gu, G., Feng, X.: The strategy of dribbling based on artificial potential field. In: 2010 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE), vol. 2 (2010)
15. Celemin, C., Leottau, L.: Learning to dribble the ball in humanoid robotics soccer (2014). https://drive.google.com/folderview?id=0B9cesO4NvjiqdUpWaWFyLVQ3anM&usp=sharing
16. Storn, R., Price, K.: Differential Evolution - a simple and efficient adaptive scheme for global optimization over continuous spaces (1995)
17. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
18. Leottau, L., Celemin, C.: UCH Dribbling Videos. https://www.youtube.com/watch?v=HP8pRh4ic8w. Accessed 28 April 2014