
Reinforcement learning algorithm for non-stationary environments

Abstract

Reinforcement learning (RL) methods learn optimal decisions under the assumption of a stationary environment. This stationarity assumption is, however, restrictive: in many real-world problems, such as traffic signal control and robotic applications, one often encounters non-stationary environments, and in these scenarios standard RL methods yield sub-optimal decisions. In this paper, we therefore consider the problem of developing RL methods that obtain optimal decisions in a non-stationary environment, where the goal is to maximize the long-term discounted reward accrued while the underlying model of the environment changes over time. To achieve this, we first adapt a change point algorithm to detect changes in the statistics of the environment and then develop an RL algorithm that maximizes the long-run reward accrued. We show that our change point method detects changes in the model of the environment effectively and thus enables the RL algorithm to maximize the long-run reward. We further validate the effectiveness of the proposed solution on non-stationary random Markov decision processes, a sensor energy management problem, and a traffic signal control problem.
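The two-stage recipe the abstract describes (monitor the statistics of the environment for a change point, then let the learner re-estimate values once a change is declared) can be illustrated with a small sketch. The toy below is our own illustration, not the authors' algorithm: the paper adapts a multivariate change point detection method and handles full Markov decision processes, whereas this sketch uses a two-armed bandit and a simple sliding-window mean test. It shows why detection helps: without forgetting, stale value estimates keep favoring the pre-change action.

```python
import random

def window_means_differ(rewards, window=20, threshold=0.5):
    """Declare a change when the mean of the newest window of rewards
    differs from the mean of the window before it by more than `threshold`."""
    if len(rewards) < 2 * window:
        return False
    recent = sum(rewards[-window:]) / window
    past = sum(rewards[-2 * window:-window]) / window
    return abs(recent - past) > threshold

def run_bandit(seed=0, horizon=1000, change_at=500, eps=0.1, alpha=0.1):
    """Two-armed non-stationary bandit: arm 0 pays ~1 before `change_at`,
    arm 1 pays ~1 after it.  Epsilon-greedy value estimates are kept per
    arm; when the detector fires, the stale estimates are discarded."""
    rng = random.Random(seed)
    q = [0.0, 0.0]                      # value estimate per arm
    history, detected_at = [], None
    for t in range(horizon):
        if rng.random() < eps:          # explore
            a = rng.randrange(2)
        else:                           # exploit the current estimates
            a = 0 if q[0] >= q[1] else 1
        best_arm = 0 if t < change_at else 1
        r = (1.0 if a == best_arm else 0.0) + rng.gauss(0.0, 0.05)
        q[a] += alpha * (r - q[a])      # running-average value update
        history.append(r)
        if detected_at is None and window_means_differ(history):
            detected_at = t             # change declared: forget old values
            q = [0.0, 0.0]
            history.clear()
    return q, detected_at
```

The window length and threshold trade off detection delay against false alarms: shorter windows react faster to a model change but are noisier, which is exactly the trade-off any change point detector for RL must balance.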



Notes

  1. https://cran.r-project.org/web/packages/MDPtoolbox

  2. http://vision-traffic.ptvgroup.com/


Author information

Corresponding author: Sindhu Padakandla.



Cite this article

Padakandla, S., Prabuchandran, K.J. & Bhatnagar, S. Reinforcement learning algorithm for non-stationary environments. Appl Intell 50, 3590–3606 (2020). https://doi.org/10.1007/s10489-020-01758-5


Keywords

  • Markov decision processes
  • Reinforcement learning
  • Non-stationary environments
  • Change detection