Journal of Heuristics, Volume 14, Issue 2, pp 135–168

Accelerating autonomous learning by using heuristic selection of actions

  • Reinaldo A. C. Bianchi
  • Carlos H. C. Ribeiro
  • Anna H. R. Costa


This paper investigates how to improve action selection for online policy learning in robotic scenarios using reinforcement learning (RL) algorithms. Since finding control policies with any RL algorithm can be very time consuming, we propose combining RL algorithms with heuristic functions that select promising actions during the learning process. With this aim, we investigate the use of heuristics to increase the convergence rate of RL algorithms and contribute a new learning algorithm, Heuristically Accelerated Q-learning (HAQL), which incorporates heuristics for action selection into the Q-learning algorithm. Experimental results on robot navigation show that even very simple heuristic functions significantly improve the learning rate.
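The core idea described above can be sketched in a few lines. The snippet below assumes the common formulation in which a heuristic function H(s, a), scaled by a weight ξ, is added to the estimated Q-value only at action-selection time, so the heuristic biases exploration toward promising actions while the standard Q-learning update remains untouched. All names and parameter choices here are illustrative, not the paper's exact implementation.

```python
import random

def haql_action(Q, H, state, actions, xi=1.0, epsilon=0.1):
    """Pick an action: epsilon-greedy over Q(s,a) + xi * H(s,a).

    Q and H are dicts keyed by (state, action). The heuristic H only
    influences which action is chosen; it never enters the Q update.
    """
    if random.random() < epsilon:
        return random.choice(actions)  # explore, as in plain Q-learning
    return max(actions, key=lambda a: Q[(state, a)] + xi * H[(state, a)])

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Standard Q-learning update, unchanged by the heuristic."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

For example, with two equally valued actions, a heuristic that favors one of them (say, "move toward the goal" in a navigation task) breaks the tie immediately instead of waiting for Q-values to differentiate through many trial-and-error updates.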


Keywords: Reinforcement learning · Heuristic function · Robot navigation · Action selection





Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Reinaldo A. C. Bianchi¹
  • Carlos H. C. Ribeiro²
  • Anna H. R. Costa³
  1. Centro Universitário da FEI, São Bernardo do Campo, Brazil
  2. Instituto Tecnológico de Aeronáutica, São José dos Campos, Brazil
  3. Escola Politécnica da Universidade de São Paulo, São Paulo, Brazil
