Abstract
Reinforcement learning (RL) methods learn optimal decisions under the assumption of a stationary environment. This assumption is, however, very restrictive: in many real-world problems, such as traffic signal control and robotic applications, one often encounters non-stationary environments, in which RL methods yield sub-optimal decisions. In this paper, we therefore consider the problem of developing RL methods that obtain optimal decisions in a non-stationary environment. The goal is to maximize the long-term discounted reward accrued while the underlying model of the environment changes over time. To achieve this, we first adapt a change point algorithm to detect changes in the statistics of the environment and then develop an RL algorithm that maximizes the long-run reward accrued. We illustrate that our change point method effectively detects changes in the model of the environment and thus enables the RL algorithm to maximize the long-run reward. We further validate the effectiveness of the proposed solution on non-stationary random Markov decision processes, a sensor energy management problem, and a traffic signal control problem.
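The abstract outlines a two-part scheme: detect changes in the environment's statistics, then run an RL algorithm tuned to the current model. As a minimal illustrative sketch (not the paper's exact algorithm; here the detector is a simple CUSUM statistic on the reward stream, and the RL component is tabular Q-learning that switches to a fresh Q-table whenever a change is flagged — all parameter names and thresholds are illustrative assumptions):

```python
import random
from collections import defaultdict


class ContextQLearner:
    """Q-learning with one Q-table per detected environment context.

    Change detection is a one-sided CUSUM on the absolute deviation of
    rewards from their running mean; a detection starts a new context.
    """

    def __init__(self, n_actions, alpha=0.1, gamma=0.9, eps=0.1,
                 drift=0.5, threshold=8.0):
        self.n_actions = n_actions
        self.alpha, self.gamma, self.eps = alpha, gamma, eps
        self.drift, self.threshold = drift, threshold
        self.tables = [defaultdict(float)]  # one Q-table per context
        self.context = 0
        self.mean, self.count, self.cusum = 0.0, 0, 0.0

    def act(self, state):
        """Epsilon-greedy action w.r.t. the current context's Q-table."""
        Q = self.tables[self.context]
        if random.random() < self.eps:
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: Q[(state, a)])

    def _detect(self, r):
        """Accumulate reward deviations; flag a change past the threshold."""
        self.count += 1
        self.mean += (r - self.mean) / self.count
        self.cusum = max(0.0, self.cusum + abs(r - self.mean) - self.drift)
        if self.cusum > self.threshold:
            self.cusum, self.mean, self.count = 0.0, 0.0, 0
            return True
        return False

    def update(self, s, a, r, s2):
        """Standard Q-learning update, preceded by the change check."""
        if self._detect(r):
            # Environment model appears to have changed: new Q-table.
            self.tables.append(defaultdict(float))
            self.context = len(self.tables) - 1
        Q = self.tables[self.context]
        best = max(Q[(s2, b)] for b in range(self.n_actions))
        Q[(s, a)] += self.alpha * (r + self.gamma * best - Q[(s, a)])
        return self.context
```

In this sketch a detected change discards no old knowledge: earlier Q-tables are retained, so a previously seen context could in principle be re-entered rather than relearned from scratch.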
Padakandla, S., K. J., P. & Bhatnagar, S. Reinforcement learning algorithm for non-stationary environments. Appl Intell 50, 3590–3606 (2020). https://doi.org/10.1007/s10489-020-01758-5