Abstract
We consider an autonomous agent facing a stochastic, partially observable, multiagent environment. In order to compute an optimal plan, the agent must accurately predict the actions of the other agents, since they influence the state of the environment and ultimately the agent’s utility. To do so, we propose a special case of interactive partially observable Markov decision process, in which the agent does not explicitly model the other agents’ beliefs and preferences, and instead represents them as stochastic processes implemented by probabilistic deterministic finite state controllers (PDFCs). The agent maintains a probability distribution over the PDFC models of the other agents, and updates this belief using Bayesian inference. Since the number of nodes of these PDFCs is unknown and unbounded, the agent places a Bayesian nonparametric prior distribution over the infinite-dimensional set of PDFCs. This allows the size of the learned models to adapt to the complexity of the observed behavior. In this case the posterior distribution is too complex to derive analytically; therefore, we provide a Markov chain Monte Carlo algorithm that approximates the posterior beliefs over the other agents’ PDFCs, given a sequence of (possibly imperfect) observations about their behavior. Experimental results show that the learned models converge behaviorally to the true ones. We consider two settings, one in which the agent first learns, then interacts with other agents, and one in which learning and planning are interleaved. We show that the agent’s performance increases as a result of learning in both situations. Moreover, we analyze the dynamics that ensue when two agents are simultaneously learning about each other while interacting, showing in an example environment that coordination emerges naturally from our approach.
Furthermore, we demonstrate how an agent can exploit the learned models to perform indirect inference over the state of the environment via the modeled agent’s actions.
Notes
 1.
\(O_j\) also implicitly contains information about agent j’s observation set \(\varOmega _j\).
 2.
Strictly speaking, the intentional IPOMDP formalization in [23] considers subintentional models side by side with intentional models. However, how to obtain the set of possible subintentional models or how to update them is not explicitly discussed.
 3.
In this formulation, we assume that at level 0 the behavior of the other agent is folded into the world state’s transition function as noise; in general, it can be encoded in a more complex subintentional model.
 4.
Here and in the remainder of this paper, \(\delta \) denotes the Kronecker delta function, which equals 1 if its arguments are equal and 0 otherwise.
 5.
Here and in the rest of this paper, the notation \(x^{1:t}\) indicates the sequence \((x^1,x^2,\ldots ,x^t)\). Sometimes, a condensed notation is used for two or more sequences, i.e. \((x,y)^{1:t}\triangleq (x^{1:t},y^{1:t})\).
 6.
The acronym stands for Griffiths, Engen, and McCloskey.
 7.
In order not to clutter notation, we consider the initial node \(q^1\) as being part of \(\tau \).
 8.
For instance, j’s behavior may be time dependent, or be encoded as a pushdown transducer.
 9.
Implemented in MATLAB® and running on an Intel® Xeon® 2.27 GHz processor.
References
 1.
Albrecht, S., Crandall, J., & Ramamoorthy, S. (2016). Belief and truth in hypothesised behaviours. Artificial Intelligence, 235, 63–94.
 2.
Balle, B., Quattoni, A., & Carreras, X. (2011). A spectral learning algorithm for finite state transducers. In D. Gunopulos, T. Hofmann, D. Malerba, M. Vazirgiannis (Eds.) Machine learning and knowledge discovery in databases, Lecture Notes in Computer Science, vol. 6911, (pp. 156–171). Berlin, Heidelberg: Springer.
 3.
Bernstein, D. S., Givan, R., Immerman, N., & Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4), 819–840.
 4.
Bowling, M., & Veloso, M. (2002). Multiagent learning using a variable learning rate. Artificial Intelligence, 136, 215–250.
 5.
Brown, G. W. (1951). Iterative solutions of games by fictitious play. In Activity analysis of production and allocation, (pp. 374–376). London: Wiley.
 6.
Carmel, D., & Markovitch, S. (1996). Learning models of intelligent agents. In Proceedings of the 13th national conference on artificial intelligence, (pp. 62–67).
 7.
Celeux, G., Hurn, M., & Robert, C. P. (2000). Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association, 95(451), 957–970.
 8.
Chakraborty, D., & Stone, P. (2008). Online multiagent learning against memory bounded adversaries. In Machine learning and knowledge discovery in databases, European conference, ECML/PKDD 2008, Antwerp, Belgium, September 15–19, 2008, Proceedings, Part I, (pp. 211–226).
 9.
Choi, J., & Kim, K. E. (2011). Inverse reinforcement learning in partially observable environments. Journal of Machine Learning Research, 12, 691–730.
 10.
Conitzer, V., & Sandholm, T. (2007). Awesome: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1–2), 23–43.
 11.
Conroy, R., Zeng, Y., Cavazza, M., & Chen, Y. (2015). Learning behaviors in agents systems with interactive dynamic influence diagrams. In Proceedings of the twenty-fourth international joint conference on artificial intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25–31, 2015, (pp. 39–45).
 12.
Dennett, D. C. (1971). Intentional systems. Journal of Philosophy, 68(February), 87–106.
 13.
Doshi, P., & Gmytrasiewicz, P. J. (2006). On the difficulty of achieving equilibrium in interactive POMDPs. In Proceedings of the 21st national conference on artificial intelligence, vol. 2, AAAI’06, (pp. 1131–1136). AAAI Press.
 14.
Doshi, P., & Gmytrasiewicz, P. J. (2009). Monte Carlo sampling methods for approximating interactive POMDPs. Journal of Artificial Intelligence Research, 34(1), 297–337.
 15.
Doshi, P., & Perez, D. (2008). Generalized point based value iteration for interactive POMDPs. In D. Fox, & C. P. Gomes (Eds.) AAAI, (pp. 63–68). AAAI Press.
 16.
Doshi, P., Zeng, Y., & Chen, Q. (2009). Graphical models for interactive POMDPs: Representations and solutions. Autonomous Agents and Multi-Agent Systems, 18(3), 376–416.
 17.
Doshi-Velez, F., Pfau, D., Wood, F., & Roy, N. (2013). Bayesian nonparametric methods for partially-observable reinforcement learning. In IEEE transactions on pattern analysis and machine intelligence 99(PrePrints), 1.
 18.
Doucet, A., & Johansen, A. M. (2009). A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan & B. Rozovsky (Eds.), The Oxford handbook of nonlinear filtering. Oxford: Oxford University Press.
 19.
Escobar, M. D., & West, M. (1994). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90, 577–588.
 20.
Fudenberg, D., & Levine, D. K. (1998). The theory of learning in games. In MIT Press series on economic learning and social evolution. The MIT Press, Cambridge (Mass.), London.
 21.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis (2nd ed.). London: Chapman and Hall/CRC.
 22.
Gmytrasiewicz, P. J. (1995). On reasoning about other agents. In Intelligent agents II, agent theories, architectures, and languages, IJCAI ’95, workshop (ATAL), Montreal, Canada, August 19–20, 1995, Proceedings, (pp. 143–155).
 23.
Gmytrasiewicz, P. J., & Doshi, P. (2005). A framework for sequential planning in multiagent settings. Journal of Artificial Intelligence Research, 24(1), 49–79.
 24.
Green, P. J., & Richardson, S. (2001). Modelling heterogeneity with and without the Dirichlet process. Scandinavian Journal of Statistics, 28(2), 355–375.
 25.
Hansen, E. (1998). Solving POMDPs by searching in policy space. In Proceedings of the 14th international conference on uncertainty in artificial intelligence, (pp. 211–219).
 26.
Harsanyi, J. (1967). Games with incomplete information played by “Bayesian” players. Management Science, 14(3), 159–182.
 27.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109.
 28.
de la Higuera, C. (2010). Grammatical inference: Learning automata and grammars. New York, NY: Cambridge University Press.
 29.
Hjort, N. L., Holmes, C., Müller, P., & Walker, S. G. (Eds.). (2010). Bayesian nonparametrics. Cambridge: Cambridge University Press.
 30.
Jain, S., & Neal, R. M. (2004). A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1), 158–182.
 31.
Jain, S., & Neal, R. M. (2007). Splitting and merging components of a nonconjugate Dirichlet process mixture model. Bayesian Analysis, 2(3), 445–472.
 32.
Kadane, J. B., & Larkey, P. D. (1982). Subjective probability and the theory of games. Management Science, 28(2), 113–120.
 33.
Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99–134.
 34.
Kalai, E., & Lehrer, E. (1993). Rational learning leads to Nash equilibrium. Econometrica, 61(5), 1019–1045.
 35.
Kocsis, L., & Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In Proceedings of the 17th European conference on machine learning, ECML’06, (pp. 282–293). Berlin, Heidelberg: Springer.
 36.
Littman, M. L. (1994). Markov games as a framework for multiagent reinforcement learning. In Proceedings of 11th international conference on machine learning, (pp. 157–163). Morgan Kaufmann.
 37.
Liu, M., Amato, C., Liao, X., Carin, L., & How, J. P. (2015). Stick-breaking policy learning in Dec-POMDPs. In Proceedings of the twenty-fourth international joint conference on artificial intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25–31, 2015, (pp. 2011–2018).
 38.
Liu, M., Liao, X., & Carin, L. (2011). The infinite regionalized policy representation. In L. Getoor, T. Scheffer (Eds.) Proceedings of the 28th international conference on machine learning, (pp. 769–776).
 39.
Lopes, H., Carvalho, C. M., Johannes, M. S., & Polson, N. G. (2011). Particle learning for sequential Bayesian computation. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, Smith, A. F. M., West, M. (Eds.) Bayesian Statistics 9, (pp. 317–360). Oxford: Oxford University Press.
 40.
McCallum, A. K. (1996). Reinforcement learning with selective perception and hidden state. Ph.D. Thesis, The University of Rochester.
 41.
Meuleau, N., Peshkin, L., Kim, K. E., & Kaelbling, L. P. (1999). Learning finite-state controllers for partially observable environments. In Proceedings of the 15th international conference on uncertainty in artificial intelligence, (pp. 427–436).
 42.
Miller, J. M., & Harrison, M. T. (2015). Mixture models with a prior on the number of components. CoRR arXiv:1502.06241v1 [stat.ME]. Preprint.
 43.
Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2), 249–265.
 44.
Ng, A. Y., & Russell, S. (2000). Algorithms for inverse reinforcement learning. In Proceedings of the 17th international conference on machine learning, (pp. 663–670). Morgan Kaufmann.
 45.
Oncina, J., García, P., & Vidal, E. (1993). Learning subsequential transducers for pattern recognition interpretation tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(5), 448–458.
 46.
Paisley, J., & Carin, L. (2009). Hidden Markov models with stick-breaking priors. IEEE Transactions on Signal Processing, 57(10), 3905–3917.
 47.
Papadimitriou, C., & Tsitsiklis, J. N. (1987). The complexity of Markov decision processes. Mathematics of Operations Research, 12(3), 441–450.
 48.
Pfau, D., Bartlett, N., & Wood, F. (2010). Probabilistic deterministic infinite automata. In Advances in neural information processing systems, (pp. 1930–1938).
 49.
Pineau, J., Gordon, G., & Thrun, S. (2003). Point-based value iteration: an anytime algorithm for POMDPs. In Proceedings of the 18th international joint conference on artificial intelligence, IJCAI’03, (pp. 1025–1030). San Francisco, CA: Morgan Kaufmann Publishers Inc.
 50.
Polich, K., & Gmytrasiewicz, P. (2007). Interactive dynamic influence diagrams. In Proceedings of the 6th international joint conference on autonomous agents and multiagent systems, AAMAS ’07, (pp. 341–343). New York, NY: ACM.
 51.
Poupart, P., & Boutilier, C. (2003). Bounded finite state controllers. In Advances in neural information processing systems 16.
 52.
Powers, R., & Shoham, Y. (2005). Learning against opponents with bounded memory. In Proceedings of the 19th international joint conference on artificial intelligence, IJCAI’05, (pp. 817–822). San Francisco, CA: Morgan Kaufmann Publishers Inc.
 53.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, (pp. 257–286).
 54.
Ramachandran, D., & Amir, E. (2007). Bayesian inverse reinforcement learning. In Proceedings of the 20th international joint conference on artificial intelligence, vol. 51, (pp. 2586–2591).
 55.
Ross, S., Draa, B. C., & Pineau, J. (2007). Bayes-adaptive POMDPs. In Proceedings of the conference on neural information processing systems.
 56.
Russell, S., & Norvig, P. (2009). Artificial intelligence: A modern approach (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
 57.
Shoham, Y., & Leyton-Brown, K. (2008). Multiagent systems: Algorithmic, game-theoretic, and logical foundations. New York, NY: Cambridge University Press.
 58.
Silver, D., & Veness, J. (2010). Monte-Carlo planning in large POMDPs. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems 23 (pp. 2164–2172). Curran Associates Inc.
 59.
Sondik, E. J. (1978). The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2), 282–304.
 60.
Sonu, E., & Doshi, P. (2012). Generalized and bounded policy iteration for interactive POMDPs. In International symposium on artificial intelligence and mathematics (ISAIM).
 61.
Wright, J. R., & Leyton-Brown, K. (2012). Behavioral game theoretic models: A Bayesian framework for parameter analysis. In International conference on autonomous agents and multiagent systems, AAMAS 2012, Valencia, Spain, June 4–8, 2012 (3 Volumes), (pp. 921–930).
 62.
Yoshida, W., Dolan, R. J., & Friston, K. J. (2008). Game theory of mind. PLoS Comput Biol, 4(12), e1000254.
 63.
Zeng, Y., & Doshi, P. (2012). Exploiting model equivalences for solving interactive dynamic influence diagrams. Journal of Artificial Intelligence Research, 43(1), 211–255.
 64.
Ziebart, B. D., Maas, A., Bagnell, J. A., & Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd national conference on artificial intelligence, vol. 3, AAAI’08, (pp. 1433–1438). AAAI Press.
Appendices
Appendix 1: Derivation of the induced prior probability on the number of nodes
We want to obtain the probability of K, the number of nodes that gets “instantiated” when drawing from the prior in Eq. 12, as a function of the concentration parameter \(\alpha \), the number of actions G and the number of observations H. We can view the process of sampling a PDFC from the prior recursively, starting from one single node and drawing its outgoing transitions according to Eq. 13, some of which may point to new nodes; we then do the same with the second node, if any, and so on. By “instantiated” nodes, we refer to the nodes drawn as a result of this procedure. Since the prior is an exchangeable probability distribution, there is no loss of generality in interpreting a draw from \(p(\tau \mid \alpha )\) sequentially as above.
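The sequential drawing procedure can be simulated directly. The following is a minimal sketch, not the authors' implementation: it assumes the plain CRP rule in which the first node counts as one previous draw, so that the \(y\)-th sampled transition opens a new node with probability \(\alpha /(\alpha +y)\). The function name `sample_num_nodes` is hypothetical, and the sketch tracks only the number of instantiated nodes, not the destination of each transition.

```python
import random


def sample_num_nodes(alpha, G, H, rng):
    """Draw K, the number of instantiated nodes, by sequential CRP sampling.

    Assumption: every transition, in sampling order, points to a brand-new
    node with probability alpha / (alpha + draws), where `draws` counts all
    previous draws including the vacuous one that created the first node.
    """
    Y = G * H          # outgoing transitions per node
    K = 1              # the first node exists with probability one
    draws = 1          # the vacuous draw that created the first node
    node = 1           # index of the node whose transitions are being drawn
    while node <= K:   # stop once every existing node has been processed
        for _ in range(Y):
            if rng.random() < alpha / (alpha + draws):
                K += 1  # this transition generated a new node
            draws += 1
        node += 1
    return K


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [sample_num_nodes(1.0, 1, 2, rng) for _ in range(20000)]
    # Empirical estimate of p(K = 1 | alpha = 1) for Y = 2
    print(samples.count(1) / len(samples))
```

With \(\alpha =1\) and \(Y=2\), the fraction of draws with \(K=1\) should be close to \(\frac{1}{2}\cdot \frac{2}{3}=\frac{1}{3}\), the probability that both transitions of the first node point back to it.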
Let us now derive the probability over the number of nodes K induced by this sequential drawing procedure. We observe that K is the index of the first node whose outgoing transitions \(\tau _{K\cdot \cdot }\) all point to already existing nodes (including node K itself). We first treat the cases \(K=1\), \(K=2\) and \(K=3\), and then derive a general rule. Let \(Y=GH\) denote the number of outgoing transitions from each node. In the following, we index the transitions in the order in which they are sampled in our scheme, so that transitions \(1 \le y\le Y\) are from the first node, transitions \((Y+1)\le y\le 2Y\) are from the second node, and so on. From what we described above, \(K=1\) if and only if all of the first node’s outgoing transitions point to itself, i.e., no new node is generated besides the first, which is created with probability one (\(\frac{\alpha }{\alpha }\)). According to the CRP rule, the probability of this happening is:
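(The displayed formula is absent from this copy. A reconstruction, assuming that the first node counts as one previous CRP draw so that the \(y\)-th transition points to an existing node with probability \(y/(\alpha +y)\), is:)

\[
p(K=1\mid \alpha ) \;=\; \frac{\alpha }{\alpha }\cdot \prod _{y=1}^{Y}\frac{y}{\alpha +y} \;=\; \frac{\alpha \, Y!}{\alpha ^{(Y+1)}}
\]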
where \(\alpha ^{(Y+1)}\) is the Pochhammer symbol indicating the rising factorial \(\alpha ^{(Y+1)}=\alpha (\alpha +1)(\alpha +2)\ldots (\alpha +Y)\).
For \(K=2\), it must be the case that at least one of the first node’s outgoing transitions points to the second node, and the second node’s transitions all point to the first or second node. The transition from the first to the second node with the lowest index, that is, the one that “generated” the second node when sampled, can be any of the first node’s Y outgoing transitions, therefore:
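(The displayed formula is absent from this copy. Under the same reading of the CRP rule, where the transition with index \(y\) that generates the second node contributes \(\alpha /(\alpha +y)\) and every other transition with index \(i\) contributes \(i/(\alpha +i)\), a consistent reconstruction is:)

\[
p(K=2\mid \alpha ) \;=\; \frac{\alpha ^{2}}{\alpha ^{(2Y+1)}}\cdot \frac{(2Y)!}{Y!}\cdot \Bigg (\sum _{y=1}^{Y}\;\prod _{\substack{i=1\\ i\ne y}}^{Y} i\Bigg )
\]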
The sum of products between round brackets is the combinatorial quantity whose computation is critical in the general case.
Let us now consider \(K=3\): we know that there must be one transition from the first node, having index say \(y\le Y\), that points to the second node (and contributes “one \(\alpha \)”) and one transition indexed \(y<y'\le 2Y\) that goes to the third node. This transition may come from either the first or second node. The sum of products resulting from all such possible configurations of new transitions to the second and third node is needed to compute \(p(K=3\mid \alpha )\). For a generic K, we have to consider all the “legal” configurations of the \((K-1)\) “\(\alpha \)’s” that occur in the nodes previously sampled. We formalize this concept by introducing some definitions.
Definition 1
A configuration for a PDFC with K nodes is a binary vector \(w^K=(w^K_1,w^K_2,\ldots ,w^K_{(K-1)Y})\) of length \((K-1)Y\), containing exactly \((K-1)\) zeros. Intuitively, the position of the first zero in this sequence identifies the first transition that was sampled to point to the second node, the second zero indicates the transition that first points to the third node, and so on. We denote as \(L_{k}\) the position of the \(k^{{\text {th}}}\) zero in a configuration. By convention, \(L_0=0\).
Therefore, \(L_{K-1}\) is the first transition that points to node K in a PDFC with K nodes. We know that this transition must be drawn after the first transition to node \((K-1)\) is drawn. This leads to the definition of “legal” configuration.
Definition 2
A configuration \(w^K\) is legal if, for all \(0<k<K\), we have that \(L_{k-1}< L_k \le kY\), since node \(k+1\) can only be generated by one of the \(kY\) transitions leaving the first k nodes. We denote as \(W^K\) the set of all legal configurations for a PDFC with K nodes.
Each legal configuration \(w^K\) is associated with a quantity \(z(w^K)\), which is the product of the positions of the ones in the configuration, i.e. \(z(w^K) = \prod _{i\,:\,w^K_i=1} i\), with \(1\le i\le Y(K-1)\). The combinatorial quantity that we need for computing the probability of having K nodes, denoted as \(\phi (K)\), is the sum of such quantities for all legal configurations, i.e.
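(The displayed formula is absent from this copy. From the sentence above it presumably reads as follows, and appears to be the expression cited later as Eq. 33:)

\[
\phi (K) \;=\; \sum _{w^K\in W^K} z(w^K)
\]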
If \(\phi (K)\) is known, then the probability of having K nodes is given by
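(The displayed formula is absent from this copy. Assembling the four factors itemized right after it gives the following reconstruction, which reduces to the \(K=1\) case when \(\phi (1)=1\):)

\[
p(K\mid \alpha ) \;=\; \frac{\alpha ^{K}}{\alpha ^{(KY+1)}}\cdot \frac{(KY)!}{((K-1)Y)!}\cdot \phi (K)
\]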
where:

\(\alpha ^K\) are the numerators of the CRP terms corresponding to transition draws that resulted in the creation of new nodes, including the \(\alpha \) in the first vacuous term \(\frac{\alpha }{\alpha }\) that “creates” the first node;

\(\alpha ^{(KY+1)}=\alpha (\alpha +1)(\alpha +2)\ldots (\alpha +KY)\) is the rising factorial, resulting from the product of the denominators of the CRP conditional distributions;

\( \frac{(KY)!}{((K-1)Y)!} = \big (((K-1)Y+1)\cdot ((K-1)Y+2)\cdot \ldots \cdot KY\big )\) are the numerators of the CRP terms for the transitions outgoing from the last node K, none of which resulted in the creation of a new node;

\(\phi (K)\) is the sum of products of legal configurations, described above.
Efficient computation
A brute-force computation of the \(\phi \) terms in Eq. 15, according to the formula in Eq. 33, would have exponential complexity. In the following, we instead describe a way to compute \(\phi (K)\) more efficiently. Let us introduce the quantity \(\phi (K,l)\), which represents the sum of products \(z(w^K)\) for legal configurations having the last zero in position l, i.e. \(L_{K-1}=l\). Since the last zero in a legal configuration for a PDFC with K nodes can occur between positions \(K-1\) and \(Y(K-1)\), we have that \(\phi (K)=\sum _{l=K-1}^{Y(K-1)}\phi (K,l)\). In order to make its manipulation easier, we decompose \(\phi (K,l)\) into the sum of products of the configurations truncated at index l included, denoted as \(\bar{\phi }(K,l)\), and the remaining product of the configuration (which does not contain any zero), i.e.:
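(The displayed formula is absent from this copy. Since every position after l holds a one and contributes its own index, a consistent reconstruction is:)

\[
\phi (K,l) \;=\; \bar{\phi }(K,l)\cdot \prod _{i=l+1}^{Y(K-1)} i \;=\; \bar{\phi }(K,l)\,\frac{(Y(K-1))!}{l!}
\]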
It follows that:
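(The displayed formula is absent from this copy. Summing the decomposition over the admissible positions of the last zero gives the following reconstruction:)

\[
\phi (K) \;=\; \sum _{l=K-1}^{Y(K-1)} \bar{\phi }(K,l)\,\frac{(Y(K-1))!}{l!}
\]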
We can now derive a recursive relation for \(\bar{\phi }(K,l)\) from \(\bar{\phi }(K,l-1)\). When “moving” the position of the last \(\alpha \) from \((l-1)\) to l, we have to multiply the previous \(\bar{\phi }\) by \((l-1)\), since in the corresponding configuration the element \(w^K_{l-1}\) switched from 0 to 1. Moreover, by shifting the position of the last \(\alpha \) to l, we must acknowledge that there are now potentially more configurations that are legal for the first \((K-2)\) \(\alpha \)’s, that is to say \(L_{K-2}\) can now take the value \(l-1\). This is only true when \(l-1\) is a legal value for \(L_{K-2}\), i.e. when \((l-1)\le (K-2)Y\). Putting all this together, we have:
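(The displayed formula is absent from this copy. The reconstruction below assumes that \(l-1\) is a legal value for \(L_{K-2}\) whenever \(l-1\le (K-2)Y\), in line with the \(K=3\) case where the transition to the third node may have any index up to 2Y. The base case \(\bar{\phi }(1,0)=1\) and the convention \(\bar{\phi }(K,l)=0\) for \(l<K-1\) are also assumptions made to close the recursion.)

\[
\bar{\phi }(K,l) \;=\; (l-1)\,\bar{\phi }(K,l-1) \;+\; \mathbb {1}\big [\,l-1\le (K-2)Y\,\big ]\;\bar{\phi }(K-1,l-1)
\]

As a check, for \(K=2\) this yields \(\bar{\phi }(2,l)=(l-1)!\), the product of the positions of the ones preceding a single zero in position l.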
Computing the values of \(\phi \) in this way has a complexity of \(O(K^2)\), much lower than the \(O(2^K)\) that results from direct computation of Eq. 33. Moreover, since these values do not depend on \(\alpha \), they can be precomputed, stored, and reused when needed.
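The recursive scheme can be sketched in a few lines and checked against direct enumeration. The following is a minimal sketch rather than the authors' MATLAB code: the function names are hypothetical, the legality constraint is taken to be \(L_{k-1}<L_k\le kY\), and the base case \(\bar{\phi }(1,0)=1\) is an assumption used to start the recursion. Exact rational arithmetic is used so that the two computations can be compared without rounding.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial, prod


def phi_bruteforce(K, Y):
    """Sum z(w) over all legal configurations by direct enumeration."""
    n = (K - 1) * Y  # length of a configuration
    total = 0
    for zeros in combinations(range(1, n + 1), K - 1):
        # zeros is increasing, so L_{k-1} < L_k holds; check L_k <= k*Y
        if all(pos <= k * Y for k, pos in enumerate(zeros, start=1)):
            zero_set = set(zeros)
            total += prod(i for i in range(1, n + 1) if i not in zero_set)
    return total


def phi_table(K_max, Y):
    """Return [phi(1), ..., phi(K_max)] via the recursion on phi_bar."""
    bar_prev = {0: 1}  # assumed base case: phi_bar(1, 0) = 1
    phis = [1]         # phi(1) = 1 (empty configuration, empty product)
    for K in range(2, K_max + 1):
        n = (K - 1) * Y
        bar = {}
        for l in range(K - 1, n + 1):
            # position l-1 switches from zero to one
            val = (l - 1) * bar.get(l - 1, 0)
            # L_{K-2} may now sit at l-1, if that position is legal for it
            if l - 1 <= (K - 2) * Y:
                val += bar_prev.get(l - 1, 0)
            bar[l] = val
        # multiply by the product of the all-ones tail, n! / l!
        phis.append(sum(v * (factorial(n) // factorial(l))
                        for l, v in bar.items()))
        bar_prev = bar
    return phis


def prior_num_nodes(K_max, Y, alpha):
    """p(K | alpha) for K = 1..K_max, using the reconstructed formula."""
    alpha = Fraction(alpha)
    probs = []
    for K, phi in enumerate(phi_table(K_max, Y), start=1):
        rising = prod(alpha + i for i in range(K * Y + 1))  # alpha^{(KY+1)}
        tail = factorial(K * Y) // factorial((K - 1) * Y)
        probs.append(alpha ** K * tail * phi / rising)
    return probs


if __name__ == "__main__":
    print(phi_table(3, 2))  # -> [1, 3, 33]
```

For \(Y=2\) and \(\alpha =1\), `prior_num_nodes(2, 2, 1)` gives \(\frac{1}{3}\) and \(\frac{3}{10}\), which agree with multiplying the CRP terms by hand for \(K=1\) and \(K=2\).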
Appendix 2: Tiger problem specifications
Cite this article
Panella, A., Gmytrasiewicz, P. Interactive POMDPs with finite-state models of other agents. Auton Agent Multi-Agent Syst 31, 861–904 (2017). https://doi.org/10.1007/s10458-016-9359-z
Keywords
 Multiagent systems
 Stochastic planning
 Opponent modeling