Autonomous Robots, Volume 27, Issue 2, pp 93–103

A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot

  • Ruben Martinez-Cantin
  • Nando de Freitas
  • Eric Brochu
  • José Castellanos
  • Arnaud Doucet


We address the problem of online path planning for optimal sensing with a mobile robot. The objective of the robot is to learn as much as possible about its pose and the environment given time constraints. We model the finite-horizon planning problem as a POMDP with a utility function that depends on the belief state, and we replan as the robot progresses through the environment. The POMDP is high-dimensional, continuous, non-differentiable, nonlinear, non-Gaussian, and must be solved in real time; most existing techniques for stochastic planning and reinforcement learning are therefore inapplicable. To solve this extremely complex problem, we propose a Bayesian optimization method that dynamically trades off exploration (minimizing uncertainty in unknown parts of the policy space) and exploitation (capitalizing on the current best solution). We demonstrate our approach with a visually guided mobile robot. The solution proposed here is also applicable to other closely related domains, including active vision, sequential experimental design, dynamic sensing, and calibration with mobile sensors.
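The exploration-exploitation trade-off described above can be illustrated with a toy Bayesian optimization loop: a Gaussian process surrogate over a one-dimensional policy parameter, and the expected-improvement acquisition function (in the spirit of Mockus et al. 1978 and Jones et al. 1998). This is a minimal sketch, not the paper's implementation: the `objective` function below is a hypothetical stand-in for the expensive policy evaluation (e.g., simulating the robot and scoring the resulting belief), and the kernel length-scale and grid are arbitrary choices.

```python
import numpy as np
from math import erf, sqrt, pi

def objective(theta):
    # Hypothetical stand-in for the expensive policy evaluation.
    return -np.sin(3 * theta) - theta**2 + 0.7 * theta

def rbf(A, B, ls=0.3):
    # Squared-exponential kernel between two 1-D point sets.
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Standard GP regression posterior mean and std at test points Xs.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = 1.0 - np.sum(v**2, axis=0)  # prior variance is 1 for this kernel
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    # EI balances high mean (exploitation) against high variance (exploration).
    z = (mu - best) / sigma
    Phi = 0.5 * (1 + np.vectorize(erf)(z / sqrt(2)))
    phi = np.exp(-0.5 * z**2) / sqrt(2 * pi)
    return (mu - best) * Phi + sigma * phi

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 3)                      # a few initial policy trials
y = np.array([objective(x) for x in X])
grid = np.linspace(-1, 1, 200)                 # candidate policy parameters
for _ in range(10):
    mu, sigma = gp_posterior(X, y, grid)
    ei = expected_improvement(mu, sigma, y.max())
    x_next = grid[np.argmax(ei)]               # next trial chosen by EI
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

print(f"best theta {X[np.argmax(y)]:.2f}, value {y.max():.3f}")
```

Because each evaluation of the true objective is expensive, the surrogate is queried cheaply over the whole candidate set and only the single most promising parameter is actually tried; in the paper's setting the acquisition maximization itself uses a global optimizer such as DIRECT rather than a fixed grid.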


Keywords: Bayesian optimization · Online path planning · Sequential experimental design · Attention and gaze planning · Active vision · Dynamic sensor networks · Active learning · Policy search · Active SLAM · Model predictive control · Reinforcement learning




  1. Bailey, T., Nieto, J., Guivant, J., Stevens, M., & Nebot, E. (2006). Consistency of the EKF-SLAM algorithm. In Proc. of the IEEE/RSJ int. conf. on intelligent robots and systems, 2006.
  2. Baxter, J., & Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15(4), 319–350.
  3. Bergman, N. (1999). Recursive Bayesian estimation: navigation and tracking applications. PhD thesis, Linköping University.
  4. Bertsekas, D. (1995). Dynamic programming and optimal control. Nashua: Athena Scientific.
  5. Brochu, E., de Freitas, N., & Ghosh, A. (2007). Active preference learning with discrete choice data. In Advances in neural information processing systems, 2007.
  6. Bryson, M., & Sukkarieh, S. (2008). Observability analysis and active control for airborne SLAM. IEEE Transactions on Aerospace and Electronic Systems, 44(1), 261–280.
  7. Chaloner, K., & Verdinelli, I. (1995). Bayesian experimental design: a review. Statistical Science, 10, 273–304.
  8. Durrant-Whyte, H., & Bailey, T. (2006). Simultaneous localisation and mapping (SLAM): part I the essential algorithms. Robotics and Automation Magazine, 13, 99–110.
  9. Finkel, D. (2003). DIRECT optimization algorithm user guide. Center for Research in Scientific Computation, North Carolina State University.
  10. Gablonsky, J. (2001). Modification of the DIRECT algorithm. PhD thesis, Department of Mathematics, North Carolina State University, Raleigh, North Carolina.
  11. Hernandez, M. (2004). Optimal sensor trajectories in bearings-only tracking. In P. Svensson & J. Schubert (Eds.), Proc. of the seventh int. conf. on information fusion, international society of information fusion, Mountain View, CA (Vol. II, pp. 893–900).
  12. Hernandez, M., Kirubarajan, T., & Bar-Shalom, Y. (2004). Multisensor resource deployment using posterior Cramér-Rao bounds. IEEE Transactions on Aerospace and Electronic Systems, 40(2), 399–416.
  13. Howard, M., Klanke, S., Gienger, M., Goerick, C., & Vijayakumar, S. (2009). A novel method for learning policies from variable constraint data. Autonomous Robots, 27 (Special issue on Robot Learning, Part B) (this issue).
  14. Jones, D. (2001). A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21, 345–383.
  15. Jones, D., Perttunen, C., & Stuckman, B. (1993). Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1), 157–181.
  16. Jones, D., Schonlau, M., & Welch, W. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4), 455–492.
  17. Kato, H., & Billinghurst, M. (1999). Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In Proc. of the 2nd IEEE and ACM int. workshop on augmented reality (pp. 85–94), 1999.
  18. Kollar, T., & Roy, N. (2008). Trajectory optimization using reinforcement learning for map exploration. International Journal of Robotics Research, 27(2), 175–197.
  19. Konda, V., & Tsitsiklis, J. (2003). On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4), 1143–1166.
  20. Kueck, H., de Freitas, N., & Doucet, A. (2006). SMC samplers for Bayesian optimal nonlinear design. In Nonlinear statistical signal processing workshop (NSSPW), 2006.
  21. Kushner, H. (1964). A new method of locating the maximum of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86, 97–106.
  22. Leung, C., Huang, S., Dissanayake, G., & Furukawa, T. (2005). Trajectory planning for multiple robots in bearing-only target localisation. In Proc. of the IEEE/RSJ int. conf. on intelligent robots and systems, 2005.
  23. Lizotte, D. (2008). Practical Bayesian optimization. PhD thesis, Dept. of Computer Science, University of Alberta.
  24. Lizotte, D., Wang, T., Bowling, M., & Schuurmans, D. (2007). Automatic gait optimization with Gaussian process regression. In International joint conference on artificial intelligence, 2007.
  25. Locatelli, M. (1997). Bayesian algorithms for one-dimensional global optimization. Journal of Global Optimization, 10, 57–76.
  26. Maciejowski, J. (2002). Predictive control: with constraints. New York: Prentice-Hall.
  27. Martinez-Cantin, R. (2008). Active map learning for robots: insights into statistical consistency. PhD thesis, University of Zaragoza.
  28. Martinez-Cantin, R., de Freitas, N., & Castellanos, J. (2006). Analysis of particle methods for simultaneous robot localization and mapping and a new algorithm: Marginal-SLAM. In Proc. of the IEEE int. conf. on robotics & automation, 2006.
  29. Martinez-Cantin, R., de Freitas, N., & Castellanos, J. (2007a). Active policy learning for robot planning and exploration under uncertainty. In Proc. of robotics: science and systems, 2007.
  30. Martinez-Cantin, R., de Freitas, N., Doucet, A., & Castellanos, J. (2007b). Active policy learning for robot planning and exploration under uncertainty. In Robotics: science and systems (RSS), 2007.
  31. Meger, D., Marinakis, D., Rekleitis, I., & Dudek, G. (2009). Inferring a probability distribution function for the pose of a sensor network using a mobile robot. In ICRA, 2009.
  32. Metta, G., Fitzpatrick, P., & Natale, L. (2006). YARP: yet another robot platform. International Journal on Advanced Robotics Systems, 3(1), 140–151.
  33. Mockus, J., Tiesis, V., & Zilinskas, A. (1978). The application of Bayesian methods for seeking the extremum. In L. Dixon & G. Szego (Eds.), Towards global optimisation (Vol. 2, pp. 117–129). Amsterdam: Elsevier.
  34. Ng, A., & Jordan, M. (2000). PEGASUS: a policy search method for large MDPs and POMDPs. In Proc. of the sixteenth conf. on uncertainty in artificial intelligence, 2000.
  35. Paris, S., & Le Cadre, J. (2002). Planification for terrain-aided navigation. In Fusion 2002, Annapolis, Maryland (pp. 1007–1014).
  36. Peters, J., & Schaal, S. (2006). Policy gradient methods for robotics. In Proc. of the IEEE/RSJ int. conf. on intelligent robots and systems, 2006.
  37. Peters, J., & Schaal, S. (2008a). Natural actor-critic. Neurocomputing, 71(7–9), 1180–1190.
  38. Peters, J., & Schaal, S. (2008b). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 682–697.
  39. Rasmussen, C., & Williams, C. (2006). Gaussian processes for machine learning. Cambridge: The MIT Press.
  40. Riedmiller, M., Gabel, T., Hafner, R., & Lange, S. (2009). Reinforcement learning for robot soccer. Autonomous Robots, 27(1), 55–73 (Special issue on Robot Learning, Part A).
  41. Sasena, M. (2002). Flexibility and efficiency enhancement for constrained global design optimization with Kriging approximations. PhD thesis, University of Michigan.
  42. Schonlau, M., Welch, W., & Jones, D. (1998). Global versus local search in constrained optimization of computer models. In N. Flournoy, W. Rosenberger, W. Wong (Eds.), New developments and applications in experimental design (Vol. 34, pp. 11–25). Institute of Mathematical Statistics.
  43. Sim, R., & Roy, N. (2005). Global A-optimal robot exploration in SLAM. In Proc. of the IEEE int. conf. on robotics & automation, 2005.
  44. Singh, A., Krause, A., Guestrin, C., Kaiser, W., & Batalin, M. (2007). Efficient planning of informative paths for multiple robots. In Proc. of the int. joint conf. on artificial intelligence, 2007.
  45. Singh, A., Krause, A., Guestrin, C., & Kaiser, W. (2009). Efficient informative sensing using multiple robots. Journal of Artificial Intelligence Research (JAIR), 34, 707–755.
  46. Singh, S., Kantas, N., Doucet, A., Vo, B., & Evans, R. (2005). Simulation-based optimal sensor scheduling with application to observer trajectory planning. In Proc. of the IEEE conf. on decision and control and eur. control conference (pp. 7296–7301), 2005.
  47. Smallwood, R., & Sondik, E. (1973). The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21, 1071–1088.
  48. Stachniss, C., Grisetti, G., & Burgard, W. (2005). Information gain-based exploration using Rao-Blackwellized particle filters. In Proc. of robotics: science and systems, Cambridge, USA, 2005.
  49. Stolle, M., & Atkeson, C. (2009). Finding and transferring policies using stored behaviors. Autonomous Robots, 27 (Special issue on Robot Learning, Part B) (this issue).
  50. Tremois, O., & Le Cadre, J. (1999). Optimal observer trajectory in bearings-only tracking for manoeuvering sources. IEE Proceedings Radar, Sonar and Navigation, 146(1), 31–39.
  51. Vazquez, E., & Bect, J. (2008). On the convergence of the expected improvement algorithm. arXiv:0712.3744v2 [stat.CO].
  52. Vidal-Calleja, T., Davison, A., Andrade-Cetto, J., & Murray, D. (2006). Active control for single camera SLAM. In Proc. of the IEEE int. conf. on robotics & automation (pp. 1930–1936), 2006.
  53. Vlassis, N., Toussaint, G. K. M., & Piperidis, S. (2009). Learning model-free robot control using a Monte Carlo EM algorithm. Autonomous Robots, 27 (Special issue on Robot Learning, Part B) (this issue).
  54. Williams, R. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256.
  55. Zilinskas, A., & Zilinskas, J. (2002). Global optimization based on a statistical model and simplicial partitioning. Computers and Mathematics with Applications, 44, 957–967.

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Ruben Martinez-Cantin (1)
  • Nando de Freitas (2)
  • Eric Brochu (2)
  • José Castellanos (3)
  • Arnaud Doucet (2)

  1. Institute for Systems and Robotics, Instituto Superior Técnico, Lisboa, Portugal
  2. Department of Computer Science, University of British Columbia, Vancouver, Canada
  3. Department of Computer Science and System Engineering, University of Zaragoza, Zaragoza, Spain
