Learning to Run Challenge Solutions: Adapting Reinforcement Learning Methods for Neuromusculoskeletal Environments

Conference paper
Part of The Springer Series on Challenges in Machine Learning book series (SSCML)


In the NIPS 2017 Learning to Run challenge, participants were tasked with building a controller for a musculoskeletal model to make it run as fast as possible through an obstacle course. Top participants were invited to describe their algorithms. In this work, we present eight solutions that used deep reinforcement learning approaches, based on algorithms such as Deep Deterministic Policy Gradient, Proximal Policy Optimization, and Trust Region Policy Optimization. Many solutions use similar relaxations and heuristics, such as reward shaping, frame skipping, discretization of the action space, symmetry, and policy blending. However, each of the eight teams implemented different modifications of the known algorithms.
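Of the heuristics listed above, frame skipping is the simplest to state concretely: the controller picks an action and repeats it for several simulation steps, accumulating reward, which shortens the effective decision horizon. The sketch below is a minimal, hypothetical gym-style wrapper illustrating the idea; the `reset`/`step` interface is an assumption for illustration and is not the challenge's actual osim-rl API.

```python
class FrameSkip:
    """Repeat each chosen action for `skip` environment steps.

    Wraps any environment exposing reset() and
    step(action) -> (observation, reward, done).
    """

    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip

    def reset(self):
        return self.env.reset()

    def step(self, action):
        # Apply the same action `skip` times, summing the rewards,
        # and stop early if the episode terminates mid-skip.
        total_reward = 0.0
        obs, done = None, False
        for _ in range(self.skip):
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done
```

Because the policy is queried only once per `skip` simulator steps, the (expensive) musculoskeletal simulation dominates less of the training loop, at the cost of coarser control.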


Keywords: Deep Deterministic Policy Gradient (DDPG) · Reward Shaping · Policy Blending · Frame Skipping · Trust Region Policy Optimization



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Bioengineering, Stanford University, Stanford, USA
  2. Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA
  3. École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
  4. Beijing University, Beijing, China
  5. reason8.ai, San Francisco, USA
  6. Immersive Media Computing Lab, University of Science and Technology of China, Hefei, China
  7. University of Warsaw, Warsaw, Poland
  8. Tunghai University, Taichung City, Taiwan
  9. Yandex, Moscow, Russia
  10. Institute of Fundamental Technological Research, Polish Academy of Sciences, Warsaw, Poland
  11. CITEC, Bielefeld University, Bielefeld, Germany
