Model-Selection for Non-parametric Function Approximation in Continuous Control Problems: A Case Study in a Smart Energy System

  • Daniel Urieli
  • Peter Stone
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8188)


This paper investigates the application of value-function-based reinforcement learning to a smart energy control system, specifically the task of controlling an HVAC system to minimize energy consumption while satisfying residents’ comfort requirements. In theory, value-function-based reinforcement learning methods can solve control problems such as this one optimally. However, since choosing an appropriate parametric representation of the value function turns out to be difficult, we develop an alternative method, which results in a practical algorithm for value function approximation in continuous state spaces. To avoid the need to carefully design a parametric representation for the value function, we use a smooth non-parametric function approximator, specifically Locally Weighted Linear Regression (LWR). LWR is used within Fitted Value Iteration (FVI), which has met with several practical successes. However, for efficiency reasons, LWR is used with a limited sample size, which leads to poor performance unless LWR’s parameters are carefully tuned. We therefore develop an efficient meta-learning procedure that performs online model selection and tunes LWR’s parameters based on the Bellman error. Our algorithm is fully implemented and tested in a realistic simulation of the HVAC control domain, and results in significant energy savings.
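The combination described above (LWR as the value-function approximator inside FVI, with LWR's parameters chosen by Bellman error) can be illustrated with a minimal sketch. It is not the authors' implementation. The one-dimensional toy dynamics, the reward, the Gaussian kernel, the single tuned parameter (the bandwidth), the candidate values, and every function name below are assumptions made for illustration, and the abstract does not specify how the paper's online procedure works.

```python
import numpy as np

GAMMA = 0.9                             # discount factor (assumed)
ACTIONS = np.array([-0.1, 0.0, 0.1])    # hypothetical toy action set

def lwr_predict(X, y, query, bandwidth, ridge=1e-6):
    """Locally weighted linear regression at one query point.

    X has shape (n, d), query has shape (d,). A Gaussian kernel is assumed.
    """
    d2 = np.sum((X - query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    # Design matrix centered at the query, so the intercept is the prediction.
    A = np.hstack([np.ones((len(X), 1)), X - query])
    WA = A * w[:, None]
    beta = np.linalg.solve(A.T @ WA + ridge * np.eye(A.shape[1]), WA.T @ y)
    return float(beta[0])

def step(s, a):
    """Toy deterministic dynamics and reward (not the HVAC simulator)."""
    s_next = np.clip(s + a, 0.0, 1.0)
    reward = float(-(s[0] - 0.5) ** 2 - 0.01 * abs(a))
    return s_next, reward

def backup(S, V, s, bandwidth):
    """One-step Bellman backup at state s using the LWR value estimate."""
    values = []
    for a in ACTIONS:
        s_next, r = step(s, a)
        values.append(r + GAMMA * lwr_predict(S, V, s_next, bandwidth))
    return max(values)

def fitted_value_iteration(S, bandwidth, n_iters=30):
    """Repeatedly back up the values stored at the sampled states S."""
    V = np.zeros(len(S))
    for _ in range(n_iters):
        V = np.array([backup(S, V, s, bandwidth) for s in S])
    return V

def bellman_error(S, V, S_eval, bandwidth):
    """Mean absolute Bellman residual of the LWR value estimate on S_eval."""
    residuals = [abs(lwr_predict(S, V, s, bandwidth) - backup(S, V, s, bandwidth))
                 for s in S_eval]
    return float(np.mean(residuals))

def select_bandwidth(S, S_eval, candidates):
    """Pick the candidate bandwidth with the smallest Bellman error."""
    errors = {h: bellman_error(S, fitted_value_iteration(S, h), S_eval, h)
              for h in candidates}
    return min(errors, key=errors.get), errors

rng = np.random.default_rng(0)
S = np.linspace(0.0, 1.0, 25)[:, None]            # limited sample of states
S_eval = rng.uniform(0.0, 1.0, size=(15, 1))      # held-out evaluation states
best_h, errors = select_bandwidth(S, S_eval, [0.05, 0.15, 0.5])
print("selected bandwidth:", best_h)
print("Bellman errors:", errors)
```

The sketch runs a simple offline search over a handful of candidate bandwidths, which is enough to show why the Bellman residual is a usable selection signal: it needs no ground-truth value function. LWR is not guaranteed to be a non-expansion, so this toy loop carries no convergence guarantee.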


Keywords: Optimal Policy, Markov Decision Process, HVAC System, Approximate Dynamic Program, Markov Decision Process Model



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Daniel Urieli (1)
  • Peter Stone (1)
  1. Dept. of Computer Science, The University of Texas at Austin, Austin, USA
