Interactive POMDPs with finite-state models of other agents

Abstract

We consider an autonomous agent facing a stochastic, partially observable, multiagent environment. In order to compute an optimal plan, the agent must accurately predict the actions of the other agents, since they influence the state of the environment and ultimately the agent’s utility. To do so, we propose a special case of interactive partially observable Markov decision process, in which the agent does not explicitly model the other agents’ beliefs and preferences, and instead represents them as stochastic processes implemented by probabilistic deterministic finite state controllers (PDFCs). The agent maintains a probability distribution over the PDFC models of the other agents, and updates this belief using Bayesian inference. Since the number of nodes of these PDFCs is unknown and unbounded, the agent places a Bayesian nonparametric prior distribution over the infinite-dimensional set of PDFCs. This allows the size of the learned models to adapt to the complexity of the observed behavior. Deriving the posterior distribution is in this case too complex to be amenable to analytical computation; therefore, we provide a Markov chain Monte Carlo algorithm that approximates the posterior beliefs over the other agents’ PDFCs, given a sequence of (possibly imperfect) observations about their behavior. Experimental results show that the learned models converge behaviorally to the true ones. We consider two settings, one in which the agent first learns, then interacts with other agents, and one in which learning and planning are interleaved. We show that the agent’s performance increases as a result of learning in both situations. Moreover, we analyze the dynamics that ensue when two agents are simultaneously learning about each other while interacting, showing in an example environment that coordination emerges naturally from our approach. Furthermore, we demonstrate how an agent can exploit the learned models to perform indirect inference over the state of the environment via the modeled agent’s actions.

Notes

  1. \(O_j\) also implicitly contains information about agent j’s observation set \(\varOmega _j\).

  2. Strictly speaking, the intentional I-POMDP formalization in [23] considers subintentional models side by side with intentional models. However, how to obtain the set of possible subintentional models or how to update them is not explicitly discussed.

  3. In this formulation, we assume that at level 0 the behavior of the other agent is folded into the world state’s transition function as noise; in general, it can be encoded in a more complex subintentional model.

  4. Here and in the remainder of this paper, \(\delta \) denotes the Kronecker delta function, which is equal to 1 if its arguments are equal and 0 otherwise.

  5. Here and in the rest of this paper, the notation \(x^{1:t}\) indicates the sequence \((x^1,x^2,\ldots ,x^t)\). Sometimes, a condensed notation is used for two or more sequences, i.e. \((x,y)^{1:t}\triangleq (x^{1:t},y^{1:t})\).

  6. The acronym stands for Griffiths, Engen, and McCloskey.

  7. In order not to clutter notation, we consider the initial node \(q^1\) as being part of \(\tau \).

  8. For instance, j’s behavior may be time dependent, or be encoded as a pushdown transducer.

  9. Implemented in MATLAB® and running on an Intel® Xeon® 2.27 GHz processor.

References

  1. Albrecht, S., Crandall, J., & Ramamoorthy, S. (2016). Belief and truth in hypothesised behaviours. Artificial Intelligence, 235, 63–94.

  2. Balle, B., Quattoni, A., & Carreras, X. (2011). A spectral learning algorithm for finite state transducers. In D. Gunopulos, T. Hofmann, D. Malerba, M. Vazirgiannis (Eds.) Machine learning and knowledge discovery in databases, Lecture Notes in Computer Science, vol. 6911, (pp. 156–171). Berlin, Heidelberg: Springer.

  3. Bernstein, D. S., Givan, R., Immerman, N., & Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4), 819–840.

  4. Bowling, M., & Veloso, M. (2002). Multiagent learning using a variable learning rate. Artificial Intelligence, 136, 215–250.

  5. Brown, G. W. (1951). Iterative solutions of games by fictitious play. In Activity analysis of production and allocation, (pp. 374–376). London: Wiley.

  6. Carmel, D., & Markovitch, S. (1996). Learning models of intelligent agents. In Proceedings of the 13th national conference on artificial intelligence, (pp. 62–67).

  7. Celeux, G., Hurn, M., & Robert, C. P. (2000). Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association, 95(451), 957–970.

  8. Chakraborty, D., & Stone, P. (2008). Online multiagent learning against memory bounded adversaries. In Machine learning and knowledge discovery in databases, European conference, ECML/PKDD 2008, Antwerp, Belgium, September 15–19, 2008, Proceedings, Part I, (pp. 211–226).

  9. Choi, J., & Kim, K. E. (2011). Inverse reinforcement learning in partially observable environments. Journal of Machine Learning Research, 12, 691–730.

  10. Conitzer, V., & Sandholm, T. (2007). Awesome: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1–2), 23–43.

  11. Conroy, R., Zeng, Y., Cavazza, M., & Chen, Y. (2015). Learning behaviors in agents systems with interactive dynamic influence diagrams. In Proceedings of the twenty-fourth international joint conference on artificial intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25–31, 2015, (pp. 39–45).

  12. Dennett, D. C. (1971). Intentional systems. Journal of Philosophy, 68(February), 87–106.

  13. Doshi, P., & Gmytrasiewicz, P. J. (2006). On the difficulty of achieving equilibrium in interactive POMDPs. In Proceedings of the 21st national conference on artificial intelligence, vol. 2, AAAI’06, (pp. 1131–1136). AAAI Press.

  14. Doshi, P., & Gmytrasiewicz, P. J. (2009). Monte Carlo sampling methods for approximating interactive POMDPs. Journal of Artificial Intelligence Research, 34(1), 297–337.

  15. Doshi, P., & Perez, D. (2008). Generalized point based value iteration for interactive POMDPs. In D. Fox, & C. P. Gomes (Eds.) AAAI, (pp. 63–68). AAAI Press.

  16. Doshi, P., Zeng, Y., & Chen, Q. (2009). Graphical models for interactive POMDPs: Representations and solutions. Autonomous Agents and Multi-Agent Systems, 18(3), 376–416.

  17. Doshi-Velez, F., Pfau, D., Wood, F., & Roy, N. (2013). Bayesian nonparametric methods for partially-observable reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 99(PrePrints), 1.

  18. Doucet, A., & Johansen, A. M. (2009). A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan & B. Rozovsky (Eds.), The oxford handbook of nonlinear filtering. Oxford: Oxford University Press.

  19. Escobar, M. D., & West, M. (1994). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90, 577–588.

  20. Fudenberg, D., & Levine, D. K. (1998). The theory of learning in games. MIT Press series on economic learning and social evolution. Cambridge, MA: The MIT Press.

  21. Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis (2nd ed.). London: Chapman and Hall/CRC.

  22. Gmytrasiewicz, P. J. (1995). On reasoning about other agents. In Intelligent agents II, agent theories, architectures, and languages, IJCAI ’95, workshop (ATAL), Montreal, Canada, August 19–20, 1995, Proceedings, (pp. 143–155).

  23. Gmytrasiewicz, P. J., & Doshi, P. (2005). A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24(1), 49–79.

  24. Green, P. J., & Richardson, S. (2001). Modelling heterogeneity with and without the Dirichlet process. Scandinavian Journal of Statistics, 28(2), 355–375.

  25. Hansen, E. (1998). Solving POMDPs by searching in policy space. In Proceedings of the 14th international conference on uncertainty in artificial intelligence, (pp. 211–219).

  26. Harsanyi, J. (1967). Games with incomplete information played by “Bayesian” players. Management Science, 14(3), 159–182.

  27. Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109.

  28. de la Higuera, C. (2010). Grammatical inference: Learning automata and grammars. New York, NY: Cambridge University Press.

  29. Hjort, N. L., Holmes, C., Müller, P., & Walker, S. G. (Eds.). (2010). Bayesian nonparametrics. Cambridge: Cambridge University Press.

  30. Jain, S., & Neal, R. M. (2004). A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1), 158–182.

  31. Jain, S., & Neal, R. M. (2007). Splitting and merging components of a nonconjugate Dirichlet process mixture model. Bayesian Analysis, 2(3), 445–472.

  32. Kadane, J. B., & Larkey, P. D. (1982). Subjective probability and the theory of games. Management Science, 28(2), 113–120.

  33. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99–134.

  34. Kalai, E., & Lehrer, E. (1993). Rational learning leads to Nash equilibrium. Econometrica, 61(5), 1019–1045.

  35. Kocsis, L., & Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In Proceedings of the 17th European conference on machine learning, ECML’06, (pp. 282–293). Berlin, Heidelberg: Springer.

  36. Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of 11th international conference on machine learning, (pp. 157–163). Morgan Kaufmann.

  37. Liu, M., Amato, C., Liao, X., Carin, L., & How, J. P. (2015). Stick-breaking policy learning in Dec-POMDPs. In Proceedings of the twenty-fourth international joint conference on artificial intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25–31, 2015, (pp. 2011–2018).

  38. Liu, M., Liao, X., & Carin, L. (2011). The infinite regionalized policy representation. In L. Getoor, T. Scheffer (Eds.) Proceedings of the 28th international conference on machine learning, (pp. 769–776).

  39. Lopes, H., Carvalho, C. M., Johannes, M. S., & Polson, N. G. (2011). Particle learning for sequential Bayesian computation. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, Smith, A. F. M., West, M. (Eds.) Bayesian Statistics 9, (pp. 317–360). Oxford: Oxford University Press.

  40. McCallum, A. K. (1996). Reinforcement learning with selective perception and hidden state. Ph.D. thesis, The University of Rochester.

  41. Meuleau, N., Peshkin, L., Kim, K. E., & Kaelbling, L. P. (1999). Learning finite-state controllers for partially observable environments. In Proceedings of the 15th international conference on uncertainty in artificial intelligence, (pp. 427–436).

  42. Miller, J. M., & Harrison, M. T. (2015). Mixture models with a prior on the number of components. arXiv:1502.06241v1 [stat.ME]. Preprint.

  43. Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2), 249–265.

  44. Ng, A. Y., & Russell, S. (2000). Algorithms for inverse reinforcement learning. In Proceedings of the 17th international conference on machine learning, (pp. 663–670). Morgan Kaufmann.

  45. Oncina, J., García, P., & Vidal, E. (1993). Learning subsequential transducers for pattern recognition interpretation tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(5), 448–458.

  46. Paisley, J., & Carin, L. (2009). Hidden Markov models with stick-breaking priors. IEEE Transactions on Signal Processing, 57(10), 3905–3917.

  47. Papadimitriou, C., & Tsitsiklis, J. N. (1987). The complexity of Markov decision processes. Mathematics of Operations Research, 12(3), 441–450.

  48. Pfau, D., Bartlett, N., & Wood, F. (2010). Probabilistic deterministic infinite automata. In Advances in neural information processing systems, (pp. 1930–1938).

  49. Pineau, J., Gordon, G., & Thrun, S. (2003). Point-based value iteration: an anytime algorithm for POMDPs. In Proceedings of the 18th international joint conference on artificial intelligence, IJCAI’03, (pp. 1025–1030). San Francisco, CA: Morgan Kaufmann Publishers Inc.

  50. Polich, K., & Gmytrasiewicz, P. (2007). Interactive dynamic influence diagrams. In Proceedings of the 6th international joint conference on autonomous agents and multiagent systems, AAMAS ’07, (pp. 341–343). New York, NY: ACM.

  51. Poupart, P., & Boutilier, C. (2003). Bounded finite state controllers. In Advances in neural information processing systems 16.

  52. Powers, R., & Shoham, Y. (2005). Learning against opponents with bounded memory. In Proceedings of the 19th international joint conference on artificial intelligence, IJCAI’05, (pp. 817–822). San Francisco, CA: Morgan Kaufmann Publishers Inc.

  53. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, (pp. 257–286).

  54. Ramachandran, D., & Amir, E. (2007). Bayesian inverse reinforcement learning. In Proceedings of the 20th international joint conference on artificial intelligence, vol. 51, (pp. 2586–2591).

  55. Ross, S., Draa, B. C., & Pineau, J. (2007). Bayes-adaptive POMDPs. In Proceedings of the conference on neural information processing systems.

  56. Russell, S., & Norvig, P. (2009). Artificial intelligence: A modern approach (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.

  57. Shoham, Y., & Leyton-Brown, K. (2008). Multiagent systems: Algorithmic, game-theoretic, and logical foundations. New York, NY: Cambridge University Press.

  58. Silver, D., & Veness, J. (2010). Monte-Carlo planning in large POMDPs. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems 23 (pp. 2164–2172). Curran Associates Inc.

  59. Sondik, E. J. (1978). The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2), 282–304.

  60. Sonu, E., & Doshi, P. (2012). Generalized and bounded policy iteration for interactive POMDPs. In International symposium on artificial intelligence and mathematics (ISAIM).

  61. Wright, J. R., & Leyton-Brown, K. (2012). Behavioral game theoretic models: A Bayesian framework for parameter analysis. In International conference on autonomous agents and multiagent systems, AAMAS 2012, Valencia, Spain, June 4–8, 2012 (3 Volumes), (pp. 921–930).

  62. Yoshida, W., Dolan, R. J., & Friston, K. J. (2008). Game theory of mind. PLoS Computational Biology, 4(12), e1000254.

  63. Zeng, Y., & Doshi, P. (2012). Exploiting model equivalences for solving interactive dynamic influence diagrams. Journal of Artificial intelligence Research, 43(1), 211–255.

  64. Ziebart, B. D., Maas, A., Bagnell, J. A., & Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd national conference on artificial intelligence, vol. 3, AAAI’08, (pp. 1433–1438). AAAI Press.

Author information

Correspondence to Alessandro Panella.

Appendices

Appendix 1: Derivation of the induced prior probability on the number of nodes

We want to obtain the probability of K, the number of nodes that get “instantiated” when drawing from the prior in Eq. 12, as a function of the concentration parameter \(\alpha \), the number of actions G, and the number of observations H. We can view the process of sampling a PDFC from the prior recursively: we start from a single node and draw its outgoing transitions according to Eq. 13, some of which may point to new nodes; we then do the same with the second node, if any, and so on. By “instantiated” nodes, we refer to the nodes drawn as a result of this procedure. Since the prior is an exchangeable probability distribution, there is no loss of generality in interpreting a draw from \(p(\tau |\alpha )\) sequentially as above.
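
As an illustration, the sequential draw can be sketched in a few lines of Python (a sketch of ours, not the authors’ MATLAB implementation; the function name and the bookkeeping are our own, and the CRP rule used is the one made precise by Eqs. 31–34 below). The sketch simulates the draws transition by transition and returns only the number of instantiated nodes:

```python
import random

def sample_num_nodes(alpha, Y, rng=None):
    """Simulate the CRP draws for the transition targets of a PDFC and
    return K, the number of instantiated nodes.

    alpha -- concentration parameter
    Y     -- number of outgoing transitions per node (Y = G * H)
    """
    rng = rng or random.Random()
    counts = [1]   # customers per node; the first node starts with one customer
    node = 0       # index of the node whose outgoing transitions are being drawn
    while node < len(counts):
        for _ in range(Y):
            total = sum(counts)
            if rng.random() < alpha / (total + alpha):
                counts.append(1)              # this transition creates a new node
            else:
                r = rng.uniform(0.0, total)   # otherwise pick an existing node,
                acc = 0.0                     # with prob. proportional to its count
                for q, c in enumerate(counts):
                    acc += c
                    if r <= acc:
                        counts[q] += 1
                        break
        node += 1
    return len(counts)
```

Repeated calls give an empirical distribution over K that can be compared with the analytical expressions derived next.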

Let us now derive the probability over the number of nodes K induced by this sequential drawing procedure. We observe that K is the index of the first node whose outgoing transitions \(\tau _{K\cdot \cdot }\) all point to already existing nodes (including node K itself). We start with the cases \(K=1\), \(K=2\), and \(K=3\), and then derive a general rule. Let us denote by \(Y=GH\) the number of outgoing transitions of each node. In the following, we index the transitions in the order in which they are sampled in our scheme, so that transitions \(1 \le y\le Y\) are from the first node, transitions \((Y+1)\le y\le 2Y\) are from the second node, and so on. From the description above, we know that \(K=1\) if and only if all of the first node’s outgoing transitions point back to it, i.e., no new node is generated besides the first, which is created with probability one (\(\frac{\alpha }{\alpha }\)). According to the CRP rule, the probability of this event is:

$$\begin{aligned} p(K=1|\alpha )=\frac{\alpha }{\alpha }\frac{1}{(1+\alpha )}\frac{2}{(2+\alpha )}\ldots \frac{Y}{(Y+\alpha )} = \frac{\alpha Y!}{\alpha ^{(Y+1)}}, \end{aligned}$$
(31)

where \(\alpha ^{(Y+1)}\) is the Pochhammer symbol indicating the rising factorial \(\alpha ^{(Y+1)}=\alpha (\alpha +1)(\alpha +2)\ldots (\alpha +Y)\).
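
As an illustrative check, with \(Y=2\) and \(\alpha =1\), Eq. 31 gives \(p(K=1|\alpha )=\frac{1\cdot 2!}{1\cdot 2\cdot 3}=\frac{1}{3}\).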

For \(K=2\), it must be the case that at least one of the first node’s outgoing transitions points to the second node, and the second node’s transitions all point to the first or second node. The transition from the first to the second node with the lowest index, that is, the one that “generated” the second node when sampled, can be any of the first node’s Y outgoing transitions; therefore:

$$\begin{aligned} p(K=2|\alpha ) = \frac{\alpha }{\alpha ^{(2Y+1)}}\frac{(2Y)!}{Y!}\;\big (\alpha \cdot 2\cdot \ldots \cdot Y \;+\; 1\cdot \alpha \cdot \ldots \cdot Y \;+\; \ldots \;+\; 1\cdot 2\cdot \ldots \cdot \alpha \big ). \end{aligned}$$
(32)

The sum of products in round brackets is the combinatorial quantity whose computation is critical in the general case.
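
As an illustrative check, with \(Y=2\) and \(\alpha =1\) the bracketed sum equals \(\alpha \cdot 2+1\cdot \alpha =3\), so Eq. 32 gives \(p(K=2|\alpha )=\frac{1}{120}\cdot \frac{4!}{2!}\cdot 3=\frac{36}{120}\), consistent with Eq. 34 below.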

Let us now consider \(K=3\): there must be one transition from the first node, with index say \(y\le Y\), that points to the second node (and contributes “one \(\alpha \)”), and one transition with index \(y<y'\le 2Y\) that points to the third node; this latter transition may come from either the first or the second node. The sum of products resulting from all such possible configurations of new transitions to the second and third node is needed to compute \(p(K=3|\alpha )\). For a generic K, we have to consider all the “legal” configurations of the \((K-1)\) “\(\alpha \)’s” occurring among the transitions of the previously sampled nodes. We formalize this concept by introducing some definitions.

Definition 1

A configuration for a PDFC with K nodes is a binary vector \(w^K=(w^K_1,w^K_2,\ldots ,w^K_{(K-1)Y})\) of length \((K-1)Y\), containing exactly \((K-1)\) zeros. Intuitively, the position of the first zero in this sequence identifies the first transition that was sampled to point to the second node, the second zero indicates the transition that first points to the third node, and so on. We denote as \(L_{k}\) the position of the \(k^{{\text {th}}}\) zero in a configuration. By convention, \(L_0=0\).

Therefore, \(L_{K-1}\) is the position of the first transition that points to node K in a PDFC with K nodes. We know that this transition must be drawn after the first transition to node \((K-1)\). This leads to the definition of a “legal” configuration.

Definition 2

A configuration \(w^K\) is legal if, for all \(0<k<K\), we have that \(L_{k-1}< L_k \le kY\). We denote as \(W^K\) the set of all legal configurations for a PDFC with K nodes.

Each legal configuration \(w^K\) is associated with a quantity \(z(w^K)\), the product of the positions of the ones in the configuration, i.e. \(z(w^K) = \prod _{i:\,w^K_i=1} i\). The combinatorial quantity that we need for computing the probability of having K nodes, denoted as \(\phi (K)\), is the sum of these quantities over all legal configurations, i.e.

$$\begin{aligned} \phi (K)=\sum _{w^K\in W^K} z(w^K). \end{aligned}$$
(33)

If \(\phi (K)\) is known, then the probability of having K nodes is given by

$$\begin{aligned} p(K|\alpha ) = \frac{\alpha ^{K}}{\alpha ^{(KY+1)}} \frac{(KY)!}{((K-1)Y)!} \phi (K), \end{aligned}$$
(34)

where:

  • \(\alpha ^K\) is the product of the numerators of the CRP terms corresponding to transition draws that resulted in the creation of new nodes, including the \(\alpha \) in the first vacuous term \(\frac{\alpha }{\alpha }\) that “creates” the first node;

  • \(\alpha ^{(KY+1)}=\alpha (\alpha +1)(\alpha +2)\ldots (\alpha +KY)\) is the rising factorial, resulting from the product of the denominators of the CRP conditional distributions;

  • \( \frac{(KY)!}{((K-1)Y)!} = \big (((K-1)Y+1)\cdot ((K-1)Y+2)\cdot \ldots \cdot KY\big )\) is the product of the numerators of the CRP terms for the transitions outgoing from the last node K, which did not result in the creation of any new node;

  • \(\phi (K)\) is the sum of products over legal configurations, described above (see the numerical sketch below).
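
These quantities can be checked numerically. The following sketch (ours; it relies on the characterization of legal configurations in Definition 2, \(L_{k-1}<L_k\le kY\)) computes \(\phi (K)\) by brute-force enumeration of Eq. 33 and then evaluates \(p(K|\alpha )\) via Eq. 34:

```python
from itertools import combinations
from math import factorial, prod

def phi_bruteforce(K, Y):
    """phi(K) of Eq. 33: sum, over legal configurations, of the product of the
    positions of the ones.  A configuration places its K-1 zeros at positions
    L_1 < ... < L_{K-1} in {1, ..., (K-1)Y}; it is legal when L_k <= k*Y for all k."""
    n = (K - 1) * Y
    total = 0
    for zeros in combinations(range(1, n + 1), K - 1):
        if all(L <= (k + 1) * Y for k, L in enumerate(zeros)):
            total += prod(i for i in range(1, n + 1) if i not in zeros)
    return total

def rising(a, n):
    """Rising factorial a^{(n)} = a * (a + 1) * ... * (a + n - 1)."""
    out = 1.0
    for i in range(n):
        out *= a + i
    return out

def p_num_nodes(K, alpha, Y, phi=phi_bruteforce):
    """Prior probability of K instantiated nodes, Eq. 34."""
    return (alpha ** K / rising(alpha, K * Y + 1)
            * factorial(K * Y) / factorial((K - 1) * Y) * phi(K, Y))
```

For instance, p_num_nodes(1, alpha, Y) reduces to Eq. 31, and the values can be cross-checked against the empirical frequencies produced by the sampler above.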

1.1 Efficient computation

A brute-force computation of the \(\phi \) terms in Eq. 15, according to the formula in Eq. 33, would have exponential complexity. In the following, we instead describe a way to compute \(\phi (K)\) more efficiently. Let us introduce the quantity \(\phi (K,l)\), which represents the sum of the products \(z(w^K)\) over the legal configurations having their last zero in position l, i.e. \(L_{K-1}=l\). Since the last zero in a legal configuration for a PDFC with K nodes can occur anywhere between positions \(K-1\) and \(Y(K-1)\), we have that \(\phi (K)=\sum _{l=K-1}^{Y(K-1)}\phi (K,l)\). To make its manipulation easier, we factor \(\phi (K,l)\) into \(\bar{\phi }(K,l)\), the sum of the products of the configurations truncated at index l (inclusive), and the product over the remaining positions of the configuration (which contain no zeros), i.e.:

$$\begin{aligned} \phi (K,l)=\bar{\phi }(K,l)\;(l+1)(l+2)\ldots ((K-1)Y). \end{aligned}$$
(35)

It follows that:

$$\begin{aligned} \phi (K)=\sum _{l=K-1}^{Y(K-1)}\bar{\phi }(K,l)\frac{((K-1)Y)!}{l!}. \end{aligned}$$
(36)

We can now derive a recursive relation expressing \(\bar{\phi }(K,l)\) in terms of \(\bar{\phi }(K,l-1)\). When “moving” the position of the last \(\alpha \) from \((l-1)\) to l, we multiply the previous \(\bar{\phi }\) by \((l-1)\), since in the corresponding configuration the element \(w^K_{l-1}\) switches from 0 to 1. Moreover, by shifting the position of the last \(\alpha \) to l, we must account for the configurations that become legal for the first \((K-2)\) “\(\alpha \)’s”, namely those in which \(L_{K-2}\) takes the value \(l-1\). This is possible only when \(l-1\) is a legal value for \(L_{K-2}\), i.e. when \((l-1)\le (K-2)Y\). Putting all this together, we have:

$$\begin{aligned} \bar{\phi }(K,l) = (l-1)\;\bar{\phi }(K,l-1)+ {\left\{ \begin{array}{ll} \bar{\phi }(K-1,l-1) &{} {\text {if }} (l-1)\le (K-2)Y \\ 0 &{} {\text {otherwise.}} \end{array}\right. } \end{aligned}$$
(37)

Computing the values of \(\phi \) in this way has complexity \(O(K^2)\) for fixed Y, much lower than the \(O(2^K)\) cost of evaluating Eq. 33 directly. Moreover, these values do not depend on \(\alpha \), so they can be pre-computed, stored, and reused when needed.
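
A sketch of this recursion (again ours), memoizing \(\bar{\phi }\) so that each value is computed only once, is given below; it reproduces the brute-force values of \(\phi (K)\) above (e.g., both give \(\phi (3)=33\) for \(Y=2\)):

```python
from functools import lru_cache
from math import factorial

def phi_recursive(K, Y):
    """phi(K) via Eqs. 36-37; each bar_phi(k, l) is evaluated once."""

    @lru_cache(maxsize=None)
    def bar_phi(k, l):
        # bar_phi(k, l): sum of truncated products over the legal configurations
        # of a k-node PDFC whose last zero L_{k-1} sits at position l.
        if k == 1:
            return 1 if l == 0 else 0        # empty configuration, by convention
        if l < k - 1 or l > (k - 1) * Y:     # L_{k-1} must lie in [k-1, (k-1)Y]
            return 0
        carry = bar_phi(k - 1, l - 1) if (l - 1) <= (k - 2) * Y else 0
        return (l - 1) * bar_phi(k, l - 1) + carry

    n = (K - 1) * Y
    return sum(bar_phi(K, l) * factorial(n) // factorial(l)
               for l in range(K - 1, n + 1))
```

The memoization is a convenience; an equivalent bottom-up table over \((k,l)\) with \(k\le K\) and \(l\le (k-1)Y\) gives the same \(O(K^2)\) behavior for fixed Y.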

Appendix 2: Tiger problem specifications

Table 1 Specification of the “standard” multiagent Tiger problem
Table 2 Specification of the cooperative multiagent Tiger problem with observable actions
Table 3 Alternative reward models for the multiagent Tiger problem
Table 4 Specification of the cooperative multiagent Tiger problem in the “Follow the Leader” scenario

Cite this article

Panella, A., Gmytrasiewicz, P. Interactive POMDPs with finite-state models of other agents. Auton Agent Multi-Agent Syst 31, 861–904 (2017). https://doi.org/10.1007/s10458-016-9359-z
