Efficiently detecting switches against non-stationary opponents

Abstract

Interactions in multiagent systems are generally more complicated than single-agent ones. Game theory prescribes how to act in multiagent scenarios; however, it assumes that all agents act rationally. Moreover, some works also assume the opponent uses a stationary strategy. These assumptions usually do not hold in real-world scenarios, where agents have limited capacities and may deviate from a perfectly rational response. Our goal is still to act optimally in such cases, learning the appropriate response without any prior policies on how to act. We therefore focus on the setting in which another agent in the environment uses different stationary strategies over time. This turns the task into learning in a non-stationary environment, which poses a problem for most learning algorithms. This paper introduces DriftER, an algorithm that (1) learns a model of the opponent, (2) uses that model to obtain an optimal policy, and then (3) determines when it must re-learn because the opponent's strategy has changed. We provide theoretical results showing that DriftER detects switches with high probability, and empirical results showing that our approach outperforms state-of-the-art algorithms, first in normal-form games such as the prisoner's dilemma and then in a more realistic scenario, the Power TAC simulator.
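To make the third step concrete, the general switch-detection idea is to monitor the opponent model's prediction error rate and trigger re-learning when that rate rises significantly above the best level observed so far. The sketch below illustrates this in the spirit of error-rate drift detection (Gama et al. [23]); the class name, thresholds, and burn-in period are illustrative assumptions, not DriftER's exact rule.

```python
class SwitchDetectorSketch:
    """Illustrative error-rate drift detector (not DriftER's exact procedure).

    Tracks the running error rate p of an opponent model and its standard
    deviation s; when p + s rises well above the lowest level seen so far,
    it signals that the opponent has likely switched strategies and the
    model should be re-learned.
    """

    def __init__(self, warn_level=2.0, drift_level=3.0, burn_in=30):
        self.n = 0                    # interactions observed
        self.errors = 0               # opponent-model prediction errors
        self.p_min = float("inf")     # lowest error rate seen so far
        self.s_min = float("inf")     # its standard deviation
        self.warn_level = warn_level
        self.drift_level = drift_level
        self.burn_in = burn_in        # avoid firing on the first few errors

    def update(self, model_was_wrong):
        """Feed one prediction outcome; return 'stable', 'warning', or 'drift'."""
        self.n += 1
        self.errors += int(model_was_wrong)
        p = self.errors / self.n
        s = (p * (1.0 - p) / self.n) ** 0.5
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if self.n < self.burn_in:     # too few samples to judge reliably
            return "stable"
        if p + s > self.p_min + self.drift_level * self.s_min:
            return "drift"            # opponent likely switched: re-learn
        if p + s > self.p_min + self.warn_level * self.s_min:
            return "warning"
        return "stable"
```

For example, a model that is right about half the time stays "stable", while a stream of consecutive errors after a strategy switch pushes the error rate past the drift threshold and signals re-learning.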



Notes

  1. These can be, for example, previous actions of the agents.

  2. Where \(s_{-i}\) denotes the set of all agents except i.

  3. One model uses a fixed-size window of past interactions while the other uses all historic interactions.

  4. In an ergodic set it is possible to go from every state to every state.

  5. Other authors have observed a related behavior, referred to as observationally equivalent models [20].

  6. Power TAC takes these prices as negative since it is a buying action.

References

  1. Abdallah, S., & Lesser, V. (2008). A multiagent reinforcement learning algorithm with non-linear dynamics. Journal of Artificial Intelligence Research, 33(1), 521–549.

  2. Adams, R. P., & MacKay, D. (2007). Bayesian online changepoint detection. arXiv:0710.3742v1 [stat.ML].

  3. Albrecht, S. V., & Ramamoorthy, S. (2013). A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. In Proceedings of the 12th international conference on autonomous agents and multiagent systems (pp. 1155–1156).

  4. Almeida, A., Ramalho, G., Santana, H., Tedesco, P., Menezes, T., Corruble, V., & Chevaleyre, Y. (2004). Recent advances on multi-agent patrolling. In Advances in artificial intelligence—SBIA 2004 (pp. 474–483). Berlin: Springer.

  5. Axelrod, R., & Hamilton, W. D. (1981). The evolution of cooperation. Science, 211(27), 1390–1396.

  6. Banerjee, B., & Peng, J. (2005). Efficient learning of multi-step best response. In Proceedings of the 4th international conference on autonomous agents and multiagent systems (pp. 60–66). Utrecht: ACM.

  7. Barrett, S., & Stone, P. (2015). Cooperating with unknown teammates in complex domains: A robot soccer case study of ad hoc teamwork. In Twenty-ninth AAAI conference on artificial intelligence (pp. 2010–2016). Austin, Texas.

  8. Bellman, R. (1957). A Markovian decision process. Journal of Mathematics and Mechanics, 6(5), 679–684.

  9. Bowling, M., Burch, N., Johanson, M., & Tammelin, O. (2015). Heads-up limit hold'em poker is solved. Science, 347, 145–149.

  10. Bowling, M., & Veloso, M. (2002). Multiagent learning using a variable learning rate. Artificial Intelligence, 136(2), 215–250.

  11. Brafman, R. I., & Tennenholtz, M. (2003). R-MAX: A general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3, 213–231.

  12. Brown, G. W. (1951). Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1), 374–376.

  13. Busoniu, L., Babuska, R., & De Schutter, B. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2), 156–172.

  14. Chakraborty, D., & Stone, P. (2013). Multiagent learning in the presence of memory-bounded agents. Autonomous Agents and Multi-Agent Systems, 28(2), 182–213.

  15. Choi, S. P. M., Yeung, D. Y., & Zhang, N. L. (1999). An environment model for nonstationary reinforcement learning. In Advances in neural information processing systems (pp. 987–993). Denver, Colorado.

  16. Conitzer, V., & Sandholm, T. (2006). AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1–2), 23–43.

  17. Crandall, J. W., & Goodrich, M. A. (2011). Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning. Machine Learning, 82(3), 281–314.

  18. Da Silva, B. C., Basso, E. W., Bazzan, A. L., & Engel, P. M. (2006). Dealing with non-stationary environments using context detection. In Proceedings of the 23rd international conference on machine learning (pp. 217–224). Pittsburgh, Pennsylvania.

  19. de Cote, E. M., Chapman, A. C., Sykulski, A. M., & Jennings, N. R. (2010). Automated planning in repeated adversarial games. In Uncertainty in artificial intelligence (pp. 376–383). Catalina Island, California.

  20. Doshi, P., & Gmytrasiewicz, P. J. (2006). On the difficulty of achieving equilibrium in interactive POMDPs. In Twenty-first national conference on artificial intelligence (pp. 1131–1136). Boston, MA.

  21. Elidrisi, M., Johnson, N., Gini, M., & Crandall, J. W. (2014). Fast adaptive learning in repeated stochastic games by game abstraction. In Proceedings of the 13th international conference on autonomous agents and multiagent systems (pp. 1141–1148). Paris.

  22. Fulda, N., & Ventura, D. (2007). Predicting and preventing coordination problems in cooperative Q-learning systems. In Proceedings of the twentieth international joint conference on artificial intelligence (pp. 780–785). Hyderabad.

  23. Gama, J., Medas, P., Castillo, G., & Rodrigues, P. (2004). Learning with drift detection. In Advances in artificial intelligence—SBIA 2004 (pp. 286–295). Berlin: Springer.

  24. Gama, J., Zliobaite, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4), 44.

  25. Hernandez-Leal, P., Munoz de Cote, E., & Sucar, L. E. (2014). Exploration strategies to detect strategy switches. In Proceedings of the adaptive learning agents workshop (ALA). Paris.

  26. Hernandez-Leal, P., Rosman, B., Taylor, M. E., Sucar, L. E., & Munoz de Cote, E. (2016). A Bayesian approach for learning and tracking switching, non-stationary opponents (extended abstract). In Proceedings of the 15th international conference on autonomous agents and multiagent systems (pp. 1315–1316). Singapore.

  27. Hernandez-Leal, P., Taylor, M. E., Rosman, B., Sucar, L. E., & Munoz de Cote, E. (2016). Identifying and tracking switching, non-stationary opponents: A Bayesian approach. In Multiagent interaction without prior coordination workshop at AAAI. Phoenix, AZ.

  28. Hernandez-Leal, P., Munoz de Cote, E., & Sucar, L. E. (2014). A framework for learning and planning against switching strategies in repeated games. Connection Science, 26(2), 103–122.

  29. Hido, S., Idé, T., Kashima, H., Kubo, H., & Matsuzawa, H. (2008). Unsupervised change analysis using supervised learning. In Advances in knowledge discovery and data mining (pp. 148–159). Berlin: Springer.

  30. Ketter, W., Collins, J., & Reddy, P. P. (2013). Power TAC: A competitive economic simulation of the smart grid. Energy Economics, 39, 262–270.

  31. Ketter, W., Collins, J., Reddy, P. P., & Weerdt, M. D. (2014). The 2014 power trading agent competition. Rotterdam: Department of Decision and Information Sciences, Erasmus University.

  32. Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th international conference on machine learning (pp. 157–163). New Brunswick, NJ.

  33. Littman, M. L., & Stone, P. (2001). Implicit negotiation in repeated games. In ATAL '01: Revised papers from the 8th international workshop on intelligent agents VIII.

  34. Nash, J. F. (1950). Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1), 48–49.

  35. Nudelman, E., Wortman, J., Shoham, Y., & Leyton-Brown, K. (2004). Run the GAMUT: A comprehensive approach to evaluating game-theoretic algorithms. In Proceedings of the 3rd international conference on autonomous agents and multiagent systems (pp. 880–887). New York, NY.

  36. Papadimitriou, C. H., & Tsitsiklis, J. N. (1987). The complexity of Markov decision processes. Mathematics of Operations Research, 12(3), 441–450.

  37. Powers, R., & Shoham, Y. (2005). Learning against opponents with bounded memory. In Proceedings of the 19th international joint conference on artificial intelligence (pp. 817–822). Edinburgh: Morgan Kaufmann.

  38. Puterman, M. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley.

  39. Rosman, B., Hawasly, M., & Ramamoorthy, S. (2016). Bayesian policy reuse. Machine Learning, 104(1), 99–127.

  40. Taylor, M. E., & Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10, 1633–1685.

  41. Tesauro, G., & Bredin, J. L. (2002). Strategic sequential bidding in auctions using dynamic programming. In Proceedings of the 1st international conference on autonomous agents and multiagent systems (p. 591). Bologna: ACM.

  42. Urieli, D., & Stone, P. (2014). TacTex'13: A champion adaptive power trading agent. In Proceedings of the twenty-eighth AAAI conference on artificial intelligence (pp. 465–471). Quebec City.

  43. Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.

  44. Widmer, G., & Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1), 69–101.

  45. Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158), 209–212.

  46. Wurman, P. R., Walsh, W. E., & Wellman, M. (1998). Flexible double auctions for electronic commerce: Theory and implementation. Decision Support Systems, 24(1), 17–27.

  47. Yamada, M., Kimura, A., Naya, F., & Sawada, H. (2013). Change-point detection with feature selection in high-dimensional time-series data. In Proceedings of the 23rd international joint conference on artificial intelligence (pp. 1827–1833). Bellevue, Washington.


Acknowledgements

This research was supported partially by project CB-2012-01-183684 and scholarship 335245/234507 granted by Consejo Nacional de Ciencia y Tecnologia (CONACyT) Mexico. This research has taken place in part at the Intelligent Robot Learning (IRL) Lab, Washington State University. IRL research is supported in part by grants NSF IIS-1149917, NSF IIS-1319412, USDA 2014-67021-22174, and a Google Research Award.

Author information


Corresponding author

Correspondence to Pablo Hernandez-Leal.

Additional information

Most of this work was performed while the first author was a graduate student at INAOE.


About this article


Cite this article

Hernandez-Leal, P., Zhan, Y., Taylor, M.E. et al. Efficiently detecting switches against non-stationary opponents. Auton Agent Multi-Agent Syst 31, 767–789 (2017). https://doi.org/10.1007/s10458-016-9352-6


Keywords

  • Learning
  • Non-stationary environments
  • Switching strategies
  • Repeated games