A leader–follower partially observed, multiobjective Markov game

Chang, Yanling; Erera, Alan L.; White, Chelsea C.

doi:10.1007/s10479-015-1935-0

Published: 07 July 2015

Volume 235, pages 103–128, (2015)
Annals of Operations Research

Yanling Chang¹,
Alan L. Erera¹ &
Chelsea C. White III¹

Abstract

The intent of this research is to generate a set of non-dominated finite-memory policies from which one of two agents (the leader) can select a most preferred policy to control a dynamic system that is also affected by the control decisions of the other agent (the follower). The problem is described by an infinite horizon total discounted reward, partially observed Markov game (POMG). For each candidate finite-memory leader policy, we assume the follower, fully aware of the leader policy, determines a (perfect memory) policy that optimizes the follower’s (scalar) criterion. The leader–follower assumption allows the POMG to be transformed into a specially structured, partially observed Markov decision process that we use to determine the follower’s best response policy for a given leader policy. We then approximate the follower’s policy by a finite-memory policy. Each agent’s policy assumes that the agent knows its current and recent state values, its recent actions, and the current and recent possibly inaccurate observations of the other agent’s state. For each leader/follower policy pair, we determine the values of the leader’s criteria. We use a multi-objective genetic algorithm to create the next generation of leader policies based on the values of the leader criteria for each leader/follower policy pair in the current generation. Based on this information for the final generation of policies, we determine the set of non-dominated leader policies. We present an example that illustrates how these results can be used to support a manager of a liquid egg production process (the leader) in selecting a sequence of actions to maximize expected process productivity while mitigating the risk due to an attacker (the follower) who seeks to contaminate the process with a chemical or biological toxin.
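The final step described above, extracting the non-dominated leader policies from the last generation, can be illustrated with a minimal sketch. It is not the authors' implementation and omits the genetic algorithm itself; the policy names, the criterion values, and the convention that every criterion is maximized are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every criterion and strictly
    better on at least one (all criteria are assumed to be maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(values: Dict[str, Sequence[float]]) -> List[str]:
    """Names of the policies whose criterion vectors no other policy dominates."""
    return [
        name
        for name, v in values.items()
        if not any(dominates(w, v) for other, w in values.items() if other != name)
    ]


# Hypothetical leader criteria per policy: (expected productivity, negated risk).
leader_values = {
    "pi_1": (10.0, -3.0),
    "pi_2": (8.0, -1.0),
    "pi_3": (7.0, -2.0),   # dominated by pi_2
    "pi_4": (10.0, -4.0),  # dominated by pi_1
}

print(non_dominated(leader_values))
```

Under these made-up values only `pi_1` and `pi_2` survive, which is the kind of set from which the leader would choose a most preferred policy.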

References

Aberdeen, D. A. (2003). A (revised) survey of approximate methods for solving partially observable Markov decision processes. Technical report, Research School of Information Science and Engineering, Australian National University.
Bakir, N. O. (2011). A Stackelberg game model for resource allocation in cargo container security. Annals of Operations Research, 187, 5–22.
Bakir, N. O., & Kardes, K. (2009). A stochastic game model on overseas cargo container security. Non-published research reports, CREATE center, Paper 6.
Basilico, N., Gatti, N., & Amigoni, F. (2009). Developing a deterministic patrolling strategy for security agents. In Proceedings of the IEEE/WIC/ACM international conference on intelligent agent technology (IAT) (pp. 565–572).
Bean, J. C. (1994). Genetic algorithms and random keys for sequencing and optimization. ORSA Journal on Computing, 6, 154–160.
Becker, R., Zilberstein, S., Lesser, V., & Goldman, C. V. (2004). Solving transition independent decentralized Markov decision processes. Journal of Artificial Intelligence Research (JAIR), 22, 423–455.
Bernstein, D. S., Givan, R., Immerman, N., & Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4), 819–840.
Bernstein, D. S., Hansen, E. A., & Zilberstein, S. (2005). Bounded policy iteration for decentralized POMDPs. In Proceedings of the nineteenth international joint conference on artificial intelligence (IJCAI) (pp. 1287–1292), Edinburgh.
Bier, V. M., Oliveros, S., & Samuelson, L. (2007). Choosing what to protect: Strategic defensive allocation against an unknown attacker. Journal of Public Economic Theory, 9(4), 563–587.
Bier, V. M., Haphuriwat, N., Menoyo, J., Zimmerman, R., & Culpen, A. M. (2008). Optimal resource allocation for defense of targets based on differing measures of attractiveness. Risk Analysis, 28(3), 763–770.
Bopardikar, S. D., & Hespanha, J. P. (2011). Randomized solutions to partial information dynamic zero-sum games. In American control conference (ACC), San Francisco, CA.
Bowman, M., Briand, L. C., & Labiche, Y. (2010). Solving the class responsibility assignment problem in object-oriented analysis with multi-objective genetic algorithms. IEEE Transactions on Software Engineering (TSE), 36(6), 817–837.
Cardoso, J. M. P., & Diniz, P. C. (2009). Game theory models of intelligent actors in reliability analysis: An overview of the state of the art. Game Theoretic Risk Analysis of Security Threats, International Series in Operations Research & Management Science, 128, 1–19.
Canu, A., & Mouaddib, A. I. (2011). Collective decision-theoretic planning for planet exploration. In Proceedings of international conference on tools with artificial intelligence.
Cassandra, A. R., Kaelbling, L. P., & Littman, M. L. (1994). Acting optimally in partially observable stochastic domains. In Proceedings twelfth national conference on artificial intelligence (AAAI-94), Seattle, WA (pp. 1023–1028).
Cassandra, A. R. (1994). Optimal policies for partially observable Markov decision processes. Technical report (CS-94-14). Providence, RI: Department of Computer Science, Brown University.
Cassandra, A. R., Littman, M. L., & Zhang, N. L. (1997). Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings thirteenth annual conference on uncertainty in artificial intelligence (UAI-97), Morgan Kaufmann, San Francisco, CA (pp. 54–61).
Cavusoglu, H., & Kwark, Y. (2013). Passenger profiling and screening for aviation security in the presence of strategic attackers. Decision Analysis, 10(1), 63–81.
Chang, Y. (2015). A leader–follower partially observed Markov game. Ph.D. thesis. Atlanta: Georgia Institute of Technology (in preparation).
Cheng, H. T. (1988). Algorithms for partially observable Markov decision processes. Ph.D. thesis. Vancouver, BC: University of British Columbia.
Coello, C. A. C. (2000). An updated survey of GA-based multiobjective optimization techniques. ACM Computing Surveys, 32(2), 109–143.
Corne, D. W., Knowles, J. D., & Oates, M. J. (2000). The Pareto envelope-based selection algorithm for multiobjective optimization. In Proceedings of the parallel problem solving from nature VI conference (Vol. 1917, pp. 839–848).
Deb, K. (2001). Nonlinear goal programming using multi-objective genetic algorithms. Journal of the Operational Research Society, 52(3), 291–302.
Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182–197.
Delle Fave, F. M., Jiang, A. X., Yin, Z., Zhang, C., Tambe, M., Kraus, S., & Sullivan, J. P. (2014). Game-theoretic security patrolling with dynamic execution uncertainty and a case study on a real transit system. Journal of Artificial Intelligence Research, 50, 321–367.
Doshi, P. (2012). Decision making in complex multiagent contexts: A tale of two frameworks. AI Magazine, 33(4), 82–95.
Eagle, J. N. (1984). The optimal search for a moving target when the search path is constrained. Operations Research, 32(5), 1107–1115.
Emery-Montemerlo, R., Gordon, G., Schneider, J., & Thrun, S. (2004). Approximate solutions for partially observable stochastic games with common payoffs. In Proceedings of the third international joint conference on autonomous agents and multi-agent systems (AAMAS) (pp. 136–143).
Feng, Z., & Zilberstein, S. (2004). Region-based incremental pruning for POMDPs. In Proceedings of the twentieth conference on uncertainty in artificial intelligence (UAI-04). San Francisco: Morgan Kaufmann (pp. 146–153).
Filar, J., & Vrieze, K. (1997). Competitive Markov decision processes. Heidelberg: Springer.
Forrest, S. (1993). Genetic algorithms: Principles of natural selection applied to computation. Science, 261, 872–878.
Ghosh, M. K., McDonald, D., & Sinha, S. (2004). Zero-sum stochastic games with partial information. Journal of Optimization Theory and Applications, 121(1), 99–118.
Gmytrasiewicz, P. J., & Doshi, P. (2005). A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24, 49–79.
Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley.
Hansen, E. A. (1998a). An improved policy iteration algorithm for partially observable MDPs. In Advances in neural information processing systems (NIPS-97) (Vol. 10, pp. 1015–1021). Cambridge, MA: MIT Press.
Hansen, E. A. (1998b). Solving POMDPs by searching in policy space. In Proceedings of uncertainty in artificial intelligence (Vol. 10, pp. 211–219).
Hansen, E. A., Bernstein, D. S., & Zilberstein, S. (2004). Dynamic programming for partially observable stochastic games. In Proceedings of the nineteenth national conference on artificial intelligence (pp. 709–715), San Jose, CA.
Hausken, K., & Zhuang, J. (2011). Governments’ and terrorists’ defense and attack in a T-period game. Decision Analysis, 8(1), 46–70.
Hauskrecht, M. (1997). Planning and control in stochastic domains with imperfect information. Ph.D. thesis. Massachusetts Institute of Technology.
Hauskrecht, M. (2000). Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13, 33–94.
Hespanha, J. P., & Prandini, M. (2001). Nash equilibria in partial-information games on Markov chains. In IEEE conference on decision and control (pp. 2102–2107), Orlando, FL.
Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor: University of Michigan Press. Reprinted in 1992 by MIT Press, Cambridge, MA.
Holloway, H., & White, C. C. (2008). Question selection and resolvability for imprecise multi-attribute alternative selection. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 38(1), 162–169.
Horn, J., Nafpliotis, N., & Goldberg, D. E. (1994). A niched Pareto genetic algorithm for multiobjective optimization. In Proceedings of the 1st IEEE conference on evolutionary computation, IEEE World Congress on Computational Intelligence (Vol. 1, pp. 82–87), Orlando, FL.
Kandori, M., & Obara, I. (2010). Towards a belief-based theory of repeated games with private monitoring: An application of POMDP, manuscript.
Keeney, R. L., & Raiffa, H. (1993). Decisions with multiple objectives: Preferences and value trade-offs. Cambridge: Cambridge University Press.
Konak, A., Coit, D. W., & Smith, A. E. (2006). Multi-objective optimization using genetic algorithms: A tutorial. Reliability Engineering and System Safety, 91, 992–1007.
Kumar, A., & Zilberstein, S. (2009). Dynamic programming approximations for partially observable stochastic games. In Proceedings of the twenty-second international FLAIRS conference (pp. 547–552), Sanibel Island, FL.
Letchford, J., Macdermed, L., Conitzer, V., Parr, R., & Isbell, C. L. (2012). Computing Stackelberg strategies in stochastic games. ACM SIGecom Exchanges, 11(2), 36–40.
Lin, A. Z.-Z., Bean, J., & White, C. C. (1998). Genetic algorithm heuristics for finite horizon partially observed Markov decision problems. Technical report. Ann Arbor: University of Michigan.
Lin, A. Z.-Z., Bean, J., & White, C. C. (2004). A hybrid genetic/optimization algorithm for finite horizon partially observed Markov decision processes. INFORMS Journal on Computing, 16(1), 27–38.
Lin, C. M., & Gen, M. (2008). Multi-criteria human resource allocation for solving multistage combinatorial optimization problems using multiobjective hybrid genetic algorithm. Expert Systems with Applications, 34(4), 2480–2490.
Littman, M. L. (1994a). The witness algorithm: Solving partially observable Markov decision processes. Technical report CS-94-40. Department of Computer Science, Brown University.
Littman, M. L. (1994b). Memoryless policies: Theoretical limitations and practical results. In Proceedings of the third international conference on simulation of adaptive behavior: From animals to animats (pp. 238–245).
Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28(1), 47–65.
Manning, L., Baines, R., & Chadd, S. (2005). Deliberate contamination of the food supply chain. British Food Journal, 107(4), 225–245.
McEneaney, W. M. (2004). Some classes of imperfect information finite state-space stochastic games with finite-dimensional solutions. Applied Mathematics and Optimization, 50(2), 87–118.
Monahan, G. E. (1982). A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28(1), 1–16.
Mohtadi, H., & Murshid, A. P. (2009). Risk analysis of chemical, biological, or radionuclear threats: Implications for food security. Risk Analysis, 29, 1317–1335.
Naser-Moghadasi, M. (2012). Evaluating effects of two alternative filters for the incremental pruning algorithm on quality of POMDP exact solutions. International Journal of Intelligence Science, 2(1), 1–8.
Oliehoek, F. A., Spaan, M. T. J., & Vlassis, N. (2005). Best-response play in partially observable card game. In Proceedings of the 14th annual machine learning conference of Belgium and the Netherlands (pp. 45–50).
Oliehoek, F. A., Spaan, M. T. J., & Vlassis, N. (2008). Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32, 289–353.
Oliehoek, F. A. (2012). Decentralized POMDPs. In M. Wiering & M. V. Otterlo (Eds.), Reinforcement learning: State of the art (Vol. 12, pp. 471–503). Berlin: Springer.
Ombuki, B., Ross, B. J., & Hanshar, F. (2006). Multi-objective genetic algorithms for vehicle routing problem with time windows. Applied Intelligence, 24, 17–30.
O’Ryan, M., Djuretic, T., Wall, P., Nichols, G., Hennessy, T., Slutsker, L., et al. (1996). An outbreak of salmonella infection from ice cream. New England Journal of Medicine, 335(11), 824–825.
Paruchuri, P., Tambe, M., Ordonez, F., & Kraus, S. (2004). Towards a formalization of teamwork with resource constraints. In International joint conference on autonomous agents and multiagent systems (pp. 596–603).
Pineau, J., Gordon, G. J., & Thrun, S. (2003). Point-based value iteration: An anytime algorithm for POMDPs. In International joint conference on artificial intelligence (pp. 1025–1032).
Pita, J., Jain, M., Ordonez, F., Portway, C., Tambe, M., Western, C., et al. (2009). Using game theory for Los Angeles airport security. AI Magazine, 43–57.
Platzman, L. K. (1977). Finite memory estimation and control of finite probabilistic systems. Ph.D. thesis. Cambridge, MA: Massachusetts Institute of Technology.
Platzman, L. K. (1980). Optimal infinite-horizon undiscounted control of finite probabilistic systems. SIAM Journal on Control and Optimization, 18, 362–380.
Ponnambalam, S. G., Ramkumar, V., & Jawahar, N. (2001). A multiobjective genetic algorithm for job shop scheduling. Production Planning & Control: The Management of Operations, 12(8), 764–774.
Poupart, P., & Boutilier, C. (2004). Bounded finite state controllers. In Advances in neural information processing systems (NIPS) 16: Proceedings of the 2003 conference. MIT Press.
Poupart, P. (2005). Exploiting structure to efficiently solve large scale partially observable Markov decision processes. Ph.D. thesis. Department of Computer Science, University of Toronto.
Puterman, M. L. (1994). Markov decision processes: Discrete dynamic programming. New York: Wiley.
Rabinovich, Z., Goldman, C. V., & Rosenschein, J. S. (2003). The complexity of multiagent systems: The price of silence. In Proceedings of the second international joint conference on autonomous agents and multi-agent systems (AAMAS) (pp. 1102–1103), Melbourne.
Raghavan, T. E. S., & Filar, J. A. (1991). Algorithms for stochastic games—A survey. Methods and Models of Operations Research, 35, 437–472.
Rothschild, C., McLay, L., & Guikema, S. (2012). Adversarial risk analysis with incomplete information: A level-k approach. Risk Analysis, 32(7), 1219–1231.
Schaffer, J. D. (1985). Multiple objective optimisation with vector evaluated genetic algorithm. In Proceedings of the 1st international conference on genetic algorithms (pp. 93–100). San Mateo: Morgan Kaufmann.
Seuken, S., & Zilberstein, S. (2007). Improved memory-bounded dynamic programming for decentralized POMDPs. In Proceedings of the 23rd conference on uncertainty in artificial intelligence, Vancouver.
Shani, G., Pineau, J., & Kaplow, R. (2013). A survey of point-based POMDP solvers. Autonomous Agents and Multi-Agent Systems, 27, 151.
Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of Sciences of the U.S.A., 39, 1095–1100.
Smallwood, R. D., & Sondik, E. J. (1973). The optimal control of partially observable Markov decision processes over a finite horizon. Operations Research, 21(5), 1071–1088.
Sobel, J., Khan, A., & Swerdlow, D. (2002). Threat of a biological terrorist attack on the US food supply: The CDC perspective. The Lancet, 359(9309), 874–880.
Sondik, E. J. (1978). The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2), 282–304.
Srinivas, M., & Patnaik, L. M. (1994). Genetic algorithms: A survey. IEEE Computer, 27(6), 17–26.
Tsai, J., Rathi, S., Kiekintveld, C., Ordonez, F., & Tambe, M. (2009). IRIS a tool for strategic security allocation in transportation networks. Non-published research report, Paper 71, CREATE Research Archive.
Ummels, M. (2010). Stochastic multiplayer games: Theory and algorithms. Ph.D. thesis. RWTH Aachen University.
Vorobeychik, Y., & Singh, S. (2012). Computing Stackelberg equilibria in discounted stochastic games. In Twenty-sixth national conference on artificial intelligence.
Vorobeychik, Y., An, B., & Tambe, M. (2012). Adversarial patrolling games. In AAAI spring symposium on security, sustainability, and health.
Vorobeychik, Y., An, B., Tambe, M., & Singh, S. (2014). Computing solutions in infinite-horizon discounted adversarial patrolling games. In International conference on automated planning and scheduling.
Wang, C., & Bier, V. M. (2011). Target-hardening decisions based on uncertain multiattribute terrorist utility. Decision Analysis, 8(4), 286–302.
White, C. C. (1991). A survey of solution techniques for the partially observed Markov decision process. Annals of Operations Research, 32(1), 215–230.
White, C. C., & Scherer, W. T. (1989). Solution procedures for partially observed Markov decision processes. Operations Research, 37(5), 791–797.
White, C. C., & Scherer, W. T. (1994). Finite-memory suboptimal design for partially observed Markov decision processes. Operations Research, 42(3), 439–455.
Yildirim, M. B., & Mouzon, G. (2012). Single-machine sustainable production planning to minimize total energy consumption and total completion time using a multiple objective genetic algorithm. IEEE Transactions on Engineering Management, 59(4), 585–597.
Yu, H. (2007). Approximation solution methods for partially observable Markov and semi-Markov decision processes. Ph.D. thesis. Cambridge, MA: Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
Zhang, H. (2010). Partially observable Markov decision processes: A geometric technique and analysis. Operations Research, 58(1), 214–228.
Zhang, Y. (2013). Contributions in supply chain risk assessment and mitigation. Ph.D. thesis. Georgia Institute of Technology.
Zhuang, J., & Bier, V. M. (2007). Balancing terrorism and natural disasters defensive strategy with endogenous attacker effort. Operations Research, 55(5), 976–991.

Acknowledgments

This material is based upon work supported by the U.S. Department of Homeland Security under Grant Award Number 2010-ST-061-FD0001 through a grant awarded by the National Center for Food Protection and Defense at the University of Minnesota. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security or the National Center for Food Protection and Defense.

Author information

Authors and Affiliations

H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
Yanling Chang, Alan L. Erera & Chelsea C. White III

Corresponding author

Correspondence to Yanling Chang.

Appendix

Proof of Proposition 1

Assume $v$ and $\varGamma $ are such that

$$\begin{aligned} v({\mathscr {I}}^F(t)) = \min \left\{ \sum \gamma ({\mathscr {I}}^L(t, \tau )) P({\mathscr {I}}^L(t, \tau )| {\mathscr {I}}^F(t)){:}\; \gamma \in \varGamma (s^F(t))\right\} , \end{aligned}$$

where the sum is over all ${\mathscr {I}}^L(t, \tau )$. Following the same line of argument as in Smallwood and Sondik (1973), it can then be shown that

$$\begin{aligned} h^F({\mathscr {I}}^F(t), a^F(t), v) = \min \left\{ \sum \gamma '({\mathscr {I}}^L(t, \tau )) P({\mathscr {I}}^L(t, \tau )| {\mathscr {I}}^F(t)){:}\; \gamma ' \in \varGamma '(s^F(t), a^F(t))\right\} , \end{aligned}$$

where the sum is over all ${\mathscr {I}}^L(t, \tau )$. If $\gamma ' \in \varGamma '(s^F(t), a^F(t))$ then $\gamma '$ is of the form

$$\begin{aligned} \gamma '({\mathscr {I}}^L(t, \tau ))&= \sum _{a^L(t)}P(a^L(t)| {\mathscr {I}}^L(t, \tau ))\left[ c^F(s(t), a(t))\right. \\&\quad + \beta \sum _{s(t+1)} \sum _{z(t+1)} \gamma ^{i,j}(z^L(t+1), s^L(t+1), a^L(t), \\&\quad \left. {\mathscr {I}}^L(t, \tau - 1)) P(z(t+1), s(t+1)| s(t), a(t))\right] , \end{aligned}$$

where, for each $s^F(t+1)=i$ and $z^F(t+1)=j$, $\gamma ^{i,j}$ can be any element of $\varGamma (s^F(t+1))$, and where $\{z^L(t+1),s^L(t+1),a^L(t),{\mathscr {I}}^L(t,\tau -1)\}={\mathscr {I}}^L(t+1,\tau )$. Then,

$$\begin{aligned}{}[H^Fv]({\mathscr {I}}^F(t))=\min \left\{ \sum _{{\mathscr {I}}^L(t,\tau )} \gamma ''({\mathscr {I}}^L(t,\tau ))P({\mathscr {I}}^L(t,\tau )|{\mathscr {I}}^F(t)){:}\; \gamma '' \in \varGamma ''(s^F(t))\right\} , \end{aligned}$$

where $\varGamma ''(s^F(t))=\cup _{a^F(t)}\varGamma '(s^F(t),a^F(t)).$

The operator $H^F$ is a contraction operator on the Banach space of all functions mapping ${\mathscr {I}}^F(t)$ into the real line, equipped with the supremum norm. As a result, the sequence $\{v^{n}\}$, where $v^{n+1} = H^Fv^{n}$, converges to $v^F$ for any given $v^{0}$. The above result shows that $H^F$ preserves piecewise linearity and concavity, and that concavity is preserved in the limit. $\square $
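The structure used in this proof, a value function written as the minimum of finitely many inner products between a $\gamma$ vector and a conditional distribution, can be checked numerically. The sketch below is only an illustration: the three $\gamma$ vectors and the two beliefs are hypothetical, and the belief is treated as a plain probability vector over leader information patterns.

```python
import numpy as np


def value(belief: np.ndarray, gammas: np.ndarray) -> float:
    """Evaluate v(y) = min over gamma of <gamma, y> for a finite set of gamma vectors."""
    return float(np.min(gammas @ belief))


# Hypothetical gamma vectors (one per row) over three leader information patterns.
gammas = np.array([
    [1.0, 4.0, 2.0],
    [3.0, 1.0, 3.0],
    [2.0, 2.0, 1.0],
])

# Two hypothetical beliefs and a convex combination of them.
y1 = np.array([0.7, 0.2, 0.1])
y2 = np.array([0.1, 0.6, 0.3])
lam = 0.4
mix = lam * y1 + (1 - lam) * y2

# Concavity means v(mix) >= lam * v(y1) + (1 - lam) * v(y2).
lhs = value(mix, gammas)
rhs = lam * value(y1, gammas) + (1 - lam) * value(y2, gammas)
print(lhs, rhs)
```

A pointwise minimum of linear functions is piecewise linear and concave, which is why the inequality holds for any choice of vectors and beliefs, not just these.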

Proof of Proposition 2

We remark that since $(\pi ^L, \pi ^F)$ is assumed given, $P(s(t),a(t)|{\mathscr {I}}^F(t,\tau ), {\mathscr {I}}^L(t,\tau ))$ is well defined. Assume there is a function $g$ such that

$$\begin{aligned} v({\mathscr {I}}^L(t)) = \sum g({\mathscr {I}}^L(t,\tau ),{\mathscr {I}}^F(t,\tau )) P({\mathscr {I}}^F(t,\tau )|{\mathscr {I}}^L(t)), \end{aligned}$$

where the sum is over all ${\mathscr {I}}^F(t,\tau )$. Then it is straightforward to show that there is a function $g'$ such that

$$\begin{aligned} h^L_i({\mathscr {I}}^L(t),v) = \sum g'({\mathscr {I}}^L(t,\tau ), {\mathscr {I}}^F(t,\tau ))P({\mathscr {I}}^F(t,\tau )|{\mathscr {I}}^L(t)), \end{aligned}$$

where the sum is over all ${\mathscr {I}}^F(t,\tau )$, and

$$\begin{aligned} g'({\mathscr {I}}^L(t, \tau ), {\mathscr {I}}^F(t, \tau ))&= \sum \nolimits ^1 P(s(t),a(t)|{\mathscr {I}}^F(t, \tau ),{\mathscr {I}}^L(t, \tau ))\left\{ c^L_i(s(t),a(t))\right. \\&\quad +\beta \sum \nolimits ^2 g[({\mathfrak {z}}^L(t+1), {\mathscr {I}}^L(t, \tau -1)),\\&\quad \left. ({\mathfrak {z}}^F(t+1), {\mathscr {I}}^F(t, \tau -1))]P(z(t+1),s(t+1)|s(t),a(t)) \right\} , \end{aligned}$$

and where ${\mathfrak {z}}^k(t)=\{z^k(t),s^k(t),a^k(t-1) \}$, $\sum ^1$ is over all $s(t)$ and $a(t)$, and $\sum ^2$ is over all $z(t+1)$ and $s(t+1)$. The result follows directly from the following facts:

The operator $H^L$, where $[H^Lv]({\mathscr {I}}^L(t)) = h^L_i({\mathscr {I}}^L(t), v)$, is a contraction operator on the Banach space comprised of all functions mapping the set of all ${\mathscr {I}}^L(t)$ into the real line, having as its norm the supremum norm.
As a result, the sequence $\{v^{n}\}$, where $v^{n+1} = H^Lv^{n}$, converges to $v^L$ for any given $v^{0}$.

$\square $
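The convergence argument used in both proofs (a contraction in the supremum norm, so successive approximation converges from any starting point) can be seen on a toy problem. The sketch below is not the operator $H^L$ from the paper: it substitutes a fully observed two-state Markov reward chain with a hypothetical transition matrix `P`, reward vector `c`, and discount factor `beta`, for which the fixed point can be computed directly and compared with the iterates.

```python
import numpy as np

beta = 0.9                               # discount factor, assumed in (0, 1)
P = np.array([[0.8, 0.2], [0.3, 0.7]])   # hypothetical transition matrix
c = np.array([1.0, 2.0])                 # hypothetical one-step rewards


def H(v: np.ndarray) -> np.ndarray:
    """Toy stand-in for the operator: [Hv] = c + beta * P v."""
    return c + beta * P @ v


# Exact fixed point of H, obtained by solving (I - beta * P) v = c.
v_star = np.linalg.solve(np.eye(2) - beta * P, c)

# Successive approximation v^{n+1} = H v^n starting from v^0 = 0.
v = np.zeros(2)
errors = []
for _ in range(50):
    errors.append(float(np.max(np.abs(v - v_star))))  # sup-norm distance to v*
    v = H(v)

print(errors[0], errors[-1])
```

Each iteration shrinks the sup-norm error by at least the factor `beta`, which is the contraction property the proofs rely on.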

Determine $y^k(t+1)$, given $y^k(t), z^k(t+1), s^k(t+1)$ and $a^k(t)$: Let $\varsigma ^k(t)=\{z^k(t),s^k(t),a^k(t-1) \}$ and $\varsigma (t)=\{\varsigma ^L(t), \varsigma ^F(t) \}$. Without loss of generality, we determine $y^F(t+1)$, given $y^F(t)$ and $\varsigma ^F(t+1)$. Note,

${\mathscr {I}}^L(t+1,\tau )=\{\varsigma ^L(t+1), {\mathscr {I}}^L(t,\tau -1) \}$ and ${\mathscr {I}}^F(t+1)=\{\varsigma ^F(t+1), {\mathscr {I}}^F(t) \}$. Then,

$$\begin{aligned} P(\varsigma ^L(t+1), {\mathscr {I}}^L(t,\tau -1)|\varsigma ^F(t+1),{\mathscr {I}}^F(t)) =\sum _{\varsigma '}P(\varsigma ^L(t+1), {\mathscr {I}}^L(t,\tau )|\varsigma ^F(t+1),{\mathscr {I}}^F(t)), \end{aligned}$$

where $\varsigma '=\varsigma ^L(t-\tau +1).$

Note

$$\begin{aligned} P(\varsigma (t+1),{\mathscr {I}}^L(t,\tau )|{\mathscr {I}}^F(t)) =P(\varsigma ^L(t+1),{\mathscr {I}}^L(t,\tau )|\varsigma ^F(t+1), {\mathscr {I}}^F(t))P(\varsigma ^F(t+1)|{\mathscr {I}}^F(t)) \end{aligned}$$

and that

$$\begin{aligned} P(\varsigma ^F(t+1)|{\mathscr {I}}^F(t))=\sum _{\varsigma ''} \sum _{{\mathscr {I}}}P(\varsigma (t+1),{\mathscr {I}}^L(t,\tau )| {\mathscr {I}}^F(t)), \end{aligned}$$

where $\varsigma ''=\varsigma ^L(t+1)$ and ${\mathscr {I}}={\mathscr {I}}^L(t,\tau )$.

Now,

$$\begin{aligned} P(\varsigma (t+1),{\mathscr {I}}^L(t,\tau )|{\mathscr {I}}^F(t)) =P(\varsigma (t+1)|{\mathscr {I}}^L(t,\tau ),{\mathscr {I}}^F(t)) P({\mathscr {I}}^L(t,\tau )|{\mathscr {I}}^F(t)). \end{aligned}$$

Then,

$$\begin{aligned} P(\varsigma (t+1)|{\mathscr {I}}^L(t,\tau ),{\mathscr {I}}^F(t))&=P(z(t+1),s(t+1)|a(t),{\mathscr {I}}^L(t,\tau ),{\mathscr {I}}^F(t))\\&\quad \times P(a(t)|{\mathscr {I}}^L(t,\tau ),{\mathscr {I}}^F(t)). \end{aligned}$$

Thus, we note that $P(\varsigma ^L(t+1),{\mathscr {I}}^L(t,\tau -1)|\varsigma ^F(t+1), {\mathscr {I}}^F(t))$ is a function of $\{P({\mathscr {I}}^L(t,\tau )|{\mathscr {I}}^F(t))\}$, which is the result.
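The update derived above can be sketched for the special case $\tau = 1$, where ${\mathscr {I}}^L(t,\tau )$ reduces to $\varsigma ^L(t)$ and the marginalization over $\varsigma '$ sums out the whole previous pattern. The function below is an illustration under that assumption; the array `kernel`, which stands for $P(\varsigma (t+1)|{\mathscr {I}}^L(t,\tau ),{\mathscr {I}}^F(t))$, and all numbers are hypothetical.

```python
import numpy as np


def update_belief(y: np.ndarray, kernel: np.ndarray, f_obs: int) -> np.ndarray:
    """One belief update step for memory length tau = 1 (an assumption).

    y[i]            = P(I^L(t, 1) = i | I^F(t))
    kernel[i, l, f] = P(varsigma^L(t+1) = l, varsigma^F(t+1) = f | I^L(t, 1) = i, I^F(t))
    f_obs           = index of the realized varsigma^F(t+1)
    """
    joint = kernel[:, :, f_obs] * y[:, None]  # P(l, f_obs, i | I^F(t))
    marginal = joint.sum()                    # P(f_obs | I^F(t))
    if marginal == 0.0:
        raise ValueError("realized follower data has zero probability")
    # Sum out the previous pattern i, then condition on the realized f_obs.
    return joint.sum(axis=0) / marginal


# Hypothetical kernel: 2 patterns, 2 leader outcomes, 2 follower outcomes.
kernel = np.array([
    [[0.4, 0.1], [0.2, 0.3]],
    [[0.1, 0.2], [0.3, 0.4]],
])
y = np.array([0.5, 0.5])

y_next = update_belief(y, kernel, f_obs=0)
print(y_next)
```

For larger $\tau$ the same three steps apply (form the joint, sum out the oldest element, divide by the marginal), but the indexing of patterns becomes more involved and is not shown here.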

Example 1

The quality of finite-memory policies has been analyzed theoretically by White and Scherer (1994). Below is an example from our numerical experiment. Parameter values for this example are presented in Chang (2015). The resulting $\gamma $ vectors for $s^F=1$ are:

$$\begin{array}{l|cccc} (s^L, z^L) & (s_1,z_1) & (s_1,z_2) & (s_2,z_1) & (s_2,z_2) \\ \hline a^F = a_1 & 4.4784 & 4.1677 & 4.4784 & 4.1677 \\ a^F = a_2 & 4.7137 & 2.5929 & 4.7137 & 2.5929 \end{array}$$

At the related information pattern (see Chang 2015 for details), the follower selects action $a_2$ whenever the belief $y^F(t)=P(s^L(t),z^L(t)|{\mathscr {I}}^F(t))$ satisfies $P(s_1,z_1|{\mathscr {I}}^F(t)) + P(s_2,z_1|{\mathscr {I}}^F(t)) \ge 0.85$. Let $\tau = 1$. Drawing 500 samples of $y^F(t-1)$ from a uniform distribution over $S^L\times Z^L$ yields the approximate finite-memory policy $P(a^F(t)=a_2|{\mathscr {I}}^F(t,\tau =1)) = 0.938$.
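As a rough illustration of how the $\gamma $ vectors of Example 1 are used, the sketch below evaluates the inner product of each vector with a belief over $(s^L, z^L)$. Both listed vectors depend only on $z^L$, so each inner product is linear in $q = P(z^L = z_1)$ and the point where the two coincide can be computed in closed form. The belief used here is hypothetical, and the crossover computed from these two vectors alone need not equal the 0.85 threshold quoted above, which is stated for a specific information pattern detailed in Chang (2015).

```python
import numpy as np

# Gamma vectors for s^F = 1 from Example 1, ordered as
# (s_1, z_1), (s_1, z_2), (s_2, z_1), (s_2, z_2).
GAMMA = {
    "a_1": np.array([4.4784, 4.1677, 4.4784, 4.1677]),
    "a_2": np.array([4.7137, 2.5929, 4.7137, 2.5929]),
}


def expected_values(belief: np.ndarray) -> dict:
    """Inner product of each action's gamma vector with a belief over (s^L, z^L)."""
    return {action: float(g @ belief) for action, g in GAMMA.items()}


def indifference_point() -> float:
    """Value of q = P(z^L = z_1) at which the two inner products coincide.

    With entries (g_z1, g_z2) per action, the inner product is
    g_z2 + q * (g_z1 - g_z2); equating the two actions and solving for q
    gives the expression returned below.
    """
    (a1_z1, a1_z2), (a2_z1, a2_z2) = (GAMMA[a][:2] for a in ("a_1", "a_2"))
    return float((a1_z2 - a2_z2) / ((a2_z1 - a2_z2) - (a1_z1 - a1_z2)))


belief = np.array([0.5, 0.1, 0.3, 0.1])  # hypothetical belief with P(z_1) = 0.8
print(expected_values(belief))
print(indifference_point())
```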

About this article

Cite this article

Chang, Y., Erera, A.L. & White, C.C. A leader–follower partially observed, multiobjective Markov game. Ann Oper Res 235, 103–128 (2015). https://doi.org/10.1007/s10479-015-1935-0

Issue Date: December 2015
DOI: https://doi.org/10.1007/s10479-015-1935-0
