DeepFP for Finding Nash Equilibrium in Continuous Action Spaces

Conference paper
Decision and Game Theory for Security (GameSec 2019)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 11836)

Abstract

Finding Nash equilibrium in continuous action spaces is a challenging problem and has applications in domains such as protecting geographic areas from potential attackers. We present DeepFP, an approximate extension of fictitious play to continuous action spaces. DeepFP represents players’ approximate best responses via generative neural networks, which are highly expressive implicit density approximators. It additionally uses a game-model network which approximates the players’ expected payoffs given their actions, and trains the networks end-to-end in a model-based learning regime. Further, DeepFP allows using domain-specific oracles if available and can hence exploit techniques such as mathematical programming to compute best responses for structured games. We demonstrate stable convergence to Nash equilibrium on several classic games and also apply DeepFP to a large forest security domain with a novel defender best response oracle. We show that DeepFP learns strategies robust to adversarial exploitation and scales well with a growing number of players’ resources.

Notes

  1. The full oracle algorithm and the involved approximations are detailed in the appendix to keep the main text concise and continuous.

References

  1. Amin, K., Singh, S., Wellman, M.P.: Gradient methods for Stackelberg security games. In: UAI, pp. 2–11 (2016)

  2. Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., Graepel, T.: The mechanics of n-player differentiable games. In: International Conference on Machine Learning (2018)

  3. Basilico, N., Celli, A., De Nittis, G., Gatti, N.: Coordinating multiple defensive resources in patrolling games with alarm systems. In: Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, pp. 678–686 (2017)

  4. Behnezhad, S., Derakhshan, M., Hajiaghayi, M., Seddighin, S.: Spatio-temporal games beyond one dimension. In: Proceedings of the 2018 ACM Conference on Economics and Computation, pp. 411–428 (2018)

  5. Cermák, J., Bošanský, B., Durkota, K., Lisý, V., Kiekintveld, C.: Using correlated strategies for computing Stackelberg equilibria in extensive-form games. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI 2016, pp. 439–445 (2016)

  6. Fang, F., Jiang, A.X., Tambe, M.: Optimal patrol strategy for protecting moving targets with multiple mobile resources. In: AAMAS, pp. 957–964 (2013)

  7. Ferguson, T.S.: Game Theory, vol. 2 (2014). https://www.math.ucla.edu/~tom/Game_Theory/Contents.html

  8. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400 (2017)

  9. Gan, J., An, B., Vorobeychik, Y., Gauch, B.: Security games on a plane. In: AAAI, pp. 530–536 (2017)

  10. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290 (2018)

  11. Haskell, W., Kar, D., Fang, F., Tambe, M., Cheung, S., Denicola, E.: Robust protection of fisheries with COMPASS. In: IAAI (2014)

  12. Heinrich, J., Lanctot, M., Silver, D.: Fictitious self-play in extensive-form games. In: International Conference on Machine Learning, pp. 805–813 (2015)

  13. Johnson, M.P., Fang, F., Tambe, M.: Patrol strategies to maximize pristine forest area. In: AAAI (2012)

  14. Kamra, N., Fang, F., Kar, D., Liu, Y., Tambe, M.: Handling continuous space security games with neural networks. In: IWAISe: First International Workshop on Artificial Intelligence in Security (2017)

  15. Kamra, N., Gupta, U., Fang, F., Liu, Y., Tambe, M.: Policy learning for continuous space security games using neural networks. In: AAAI (2018)

  16. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  17. Korzhyk, D., Yin, Z., Kiekintveld, C., Conitzer, V., Tambe, M.: Stackelberg vs. Nash in security games: an extended investigation of interchangeability, equivalence, and uniqueness. JAIR 41, 297–327 (2011)

  18. Krishna, V., Sjöström, T.: On the convergence of fictitious play. Math. Oper. Res. 23(2), 479–511 (1998)

  19. Lanctot, M., et al.: A unified game-theoretic approach to multiagent reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 4190–4203 (2017)

  20. Leslie, D.S., Collins, E.J.: Generalised weakened fictitious play. Games Econ. Behav. 56(2), 285–298 (2006)

  21. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O.P., Mordatch, I.: Multi-agent actor-critic for mixed cooperative-competitive environments. In: Advances in Neural Information Processing Systems, pp. 6379–6390 (2017)

  22. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928–1937 (2016)

  23. Perkins, S., Leslie, D.: Stochastic fictitious play with continuous action sets. J. Econ. Theory 152, 179–213 (2014)

  24. Rosenfeld, A., Kraus, S.: When security games hit traffic: optimal traffic enforcement under one sided uncertainty. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-2017, pp. 3814–3822 (2017)

  25. Shamma, J.S., Arslan, G.: Unified convergence proofs of continuous-time fictitious play. IEEE Trans. Autom. Control 49(7), 1137–1141 (2004)

  26. Wang, B., Zhang, Y., Zhong, S.: On repeated Stackelberg security game with the cooperative human behavior model for wildlife protection. In: Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2017, pp. 1751–1753 (2017)

  27. Yang, R., Ford, B., Tambe, M., Lemieux, A.: Adaptive resource allocation for wildlife protection against illegal poachers. In: AAMAS (2014)

  28. Yin, Y., An, B., Jain, M.: Game-theoretic resource allocation for protecting large public events. In: AAAI, pp. 826–833 (2014)

  29. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., Smola, A.J.: Deep sets. In: Advances in Neural Information Processing Systems, pp. 3394–3404 (2017)


Acknowledgments

This research was supported in part by NSF Research Grant IIS-1254206, NSF Research Grant IIS-1850477 and MURI Grant W911NF-11-1-0332.

Author information

Correspondence to Nitin Kamra.

A Appendix

1.1 A.1 Approximate Best Response Oracle for Forest Protection Game

[Algorithm 2: approximate defender best response oracle; pseudocode listing omitted]

Devising a defender’s best response to the adversary’s belief distribution is non-trivial for this game, so we propose a greedy approximation to the best response (Algorithm 2). We define a capture-set for a lumberjack location l as the set of all guard locations within a radius \(R_g\) of any point on the trajectory of that lumberjack. The algorithm creates capture-sets for the lumberjack locations l encountered so far in mem and intersects these capture-sets to find those which cover multiple lumberjacks. It then greedily allocates guards to the top m such capture-sets one at a time, while updating the remaining capture-sets to account for the lumberjacks already ambushed by the guards allocated so far (a Python sketch of this greedy allocation follows the list of approximations below). Our algorithm involves the following approximations:

  1. Mini-batch approximation: Since it is computationally infeasible to compute the best response to the full set of actions in mem, we best-respond to a small mini-batch of actions sampled randomly from mem to reduce computation (line 1).

  2. Approximate capture-sets: Initial capture-sets can have arbitrary arc-shaped boundaries which can be hard to store and process. Instead, we approximate them using convex polygons for simplicity (line 5). Doing this ensures that all subsequent intersections also result in convex polygons.

  3. Bounded number of intersections: Finding all possible intersections of capture-sets can be reduced to finding all cliques in a graph with capture-sets as vertices and pairwise intersections as edges; this is an NP-hard problem whose complexity grows exponentially with the number of polygons. Instead, we compute intersections in a pairwise fashion while adding the newly intersected polygons to the list. This way the \(k\)-th round of intersection produces up to all \((k+1)\)-polygon intersections, and we stop after \(k=4\) rounds of intersection to maintain polynomial time complexity (implemented for line 8, but not shown explicitly in Algorithm 2).

  4. Greedy selection: After forming capture-set intersections, we greedily select the top m sets with the highest rewards (line 9).
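
The following is a minimal, illustrative Python sketch of the greedy allocation outlined above, assuming capture-sets have already been approximated as convex polygons (here via shapely) and that each is paired with the set of lumberjack indices it covers. The helper names and the reward bookkeeping are our own assumptions, not the authors’ implementation.

```python
# Illustrative sketch only: capture-sets are (shapely Polygon, frozenset of lumberjack ids)
# pairs built from the sampled mini-batch; `rewards` maps a lumberjack id to the reward
# obtained by ambushing it. All names here are hypothetical stand-ins for Algorithm 2.
from shapely.geometry import Polygon


def greedy_guard_allocation(capture_sets, rewards, m, k_rounds=4):
    """Return up to m capture-sets; one guard is placed inside each chosen set."""
    pool = list(capture_sets)

    # Bounded pairwise intersection rounds: round k yields (up to) all
    # (k+1)-polygon intersections while keeping the computation polynomial.
    for _ in range(k_rounds):
        seen = {ids for _, ids in pool}
        new = []
        for i in range(len(pool)):
            for j in range(i + 1, len(pool)):
                poly = pool[i][0].intersection(pool[j][0])
                ids = pool[i][1] | pool[j][1]
                if poly.area > 0 and ids not in seen:
                    new.append((poly, ids))
                    seen.add(ids)
        if not new:
            break
        pool.extend(new)

    # Greedily pick the m sets covering the largest remaining (not yet ambushed) reward.
    chosen, ambushed = [], set()
    for _ in range(m):
        best = max(pool, key=lambda c: sum(rewards[i] for i in c[1] - ambushed))
        if not (best[1] - ambushed):
            break  # no remaining lumberjack can be ambushed
        chosen.append(best)
        ambushed |= best[1]  # implicitly devalues all remaining sets covering these ids
    return chosen
```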

1.2 A.2 Supplementary Experiments with m, n > 1

Table 3. More results on forests F1 and F4 for m = n = 2.
Table 4. Demonstrating getting stuck in locally optimal strategies.

Table 3 shows more experiments with DeepFP and OptGradFP for m, n > 1. We see that DeepFP covers the regions of importance with the players’ resources, whereas OptGradFP suffers from the zero defender gradients issue caused by its logit-normal strategy assumption, which often leads to sub-optimal results and higher exploitability.

1.3 A.3 Locally Optimal Strategies

To further study the issue of getting stuck in locally optimal strategies, we show experiments with another forest, F5, in Table 4. F5 has three dense tree patches, while the rest of the forest is very sparse and mostly empty. The optimal defender strategy computed by DLP for m = n = 1 is shown in C1. In such a case, because the tree density is broken into patches, gradients for both players are zero at many locations, and hence both algorithms are expected to get stuck in locally optimal strategies depending on their initialization. This is confirmed by configurations C2, C3, C4 and C5, which show strategies for OptGradFP and DeepFP with m = n = 1 covering only a single forest patch. Once the defender gets stuck on a forest patch, the probability of coming out of it is small, since the tree density surrounding the patches is negligible. However, with more resources for the defender and the adversary (m = n = 3), DeepFP is mostly able to break out of the stagnation and both players eventually cover more than a single forest patch (see C7), whereas OptGradFP covers additional ground only due to the random initialization of the 3 player resources and otherwise remains stuck around a single forest patch (see C6). DeepFP is partially able to break out because the defender’s best response does not rely on gradients but instead comes from a non-differentiable oracle. This shows how DeepFP can break out of local optima even in the absence of gradients when a best response oracle is provided; OptGradFP, in contrast, relies purely on gradients and cannot overcome such situations.

1.4 A.4 Neural Network Architectures

All our models were trained using TensorFlow v1.5 on an Ubuntu 16.04 machine with 32 CPU cores and an Nvidia Tesla K40c GPU.

Cournot Game and Concave-Convex Game. The best response networks for the Cournot game and the Concave-convex game consist of a single fully connected layer with a sigmoid activation, directly mapping the 2-D input noise \(z \sim {\mathcal N}([0,0],I_2)\) to a scalar output \(q_p\) for player p. The best response networks are trained with the Adam optimizer [16] and a learning rate of 0.05. To estimate payoffs, we use exact reward models as the game model networks. The maximum number of games was limited to 30,000 for the Cournot game and 50,000 for the Concave-convex game.
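
As a concrete illustration, a single-layer best response network of this form can be written in a few lines. This is a sketch in tf.keras rather than the TensorFlow v1.5 code actually used, and the variable names are assumptions.

```python
# Sketch of the single-layer best response network for the Cournot / Concave-convex
# games, written with tf.keras for brevity (the paper used TensorFlow v1.5).
import tensorflow as tf

# Maps 2-D Gaussian noise z ~ N([0, 0], I_2) to a scalar action q_p in (0, 1).
best_response_net = tf.keras.Sequential(
    [tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(2,))]
)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.05)  # learning rate from the text

# Sampling a batch of actions from the implicit best response density:
z = tf.random.normal(shape=(128, 2))
q_p = best_response_net(z)  # shape (128, 1)
```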

Forest Protection Game. The action \(u_p\) of player p contains the cylindrical coordinates (radii and angles) of all resources of that player. The best response network for the Forest protection game therefore maps \(Z_A \in {\mathbb R}^{64}\) to the adversary action \(u_A \in {\mathbb R}^{n \times 2}\). It has 3 fully connected hidden layers with \(\{128,64,64\}\) units and ReLU activations. The final output comes from two parallel fully connected layers with n (number of lumberjacks) units each: (a) the first with sigmoid activations, outputting n radii \(\in [0,1]\), and (b) the second with linear activations, outputting n unbounded angles \(\in (-\infty , \infty )\), which are taken modulo \(2\pi\) everywhere so that they lie in \([0, 2\pi ]\). All layers are L2-regularized with coefficient \(10^{-2}\):

$$\begin{aligned} x_A&= relu(FC_{64}(relu(FC_{64}(relu(FC_{128}(Z_A)))))) \\ u_{A,rad}&= \sigma (FC_n(x_A));\quad u_{A,ang} = FC_n(x_A) \end{aligned}$$
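
A corresponding tf.keras sketch of this adversary best response network is shown below; the functional-API arrangement and the point at which the angles are wrapped into \([0, 2\pi]\) are our assumptions.

```python
# Sketch of the adversary best response network u_A = BR_A(Z_A) described above.
import numpy as np
import tensorflow as tf


def make_adversary_best_response(n, l2_coef=1e-2):
    reg = tf.keras.regularizers.l2(l2_coef)
    z = tf.keras.Input(shape=(64,))  # Z_A: 64-D input noise
    x = tf.keras.layers.Dense(128, activation="relu", kernel_regularizer=reg)(z)
    x = tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=reg)(x)
    x = tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=reg)(x)
    # Two parallel output heads with n units each.
    radii = tf.keras.layers.Dense(n, activation="sigmoid", kernel_regularizer=reg)(x)
    angles = tf.keras.layers.Dense(n, activation=None, kernel_regularizer=reg)(x)
    angles = tf.math.floormod(angles, 2.0 * np.pi)  # wrap into [0, 2*pi)
    u_A = tf.stack([radii, angles], axis=-1)  # shape (batch, n, 2)
    return tf.keras.Model(inputs=z, outputs=u_A)
```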

The game model network takes all players’ actions as inputs (i.e., matrices \(u_D, u_A\) of shapes (m, 2) and (n, 2), respectively) and produces two scalar rewards \(r_D\) and \(r_A\). It internally converts the angles in the second columns of these inputs to the range \([0, 2\pi ]\). Since the rewards should be invariant to permutations of the defender’s and adversary’s resources (guards and lumberjacks, respectively), we first pass the input matrices through non-linear embeddings that treat their rows as sets rather than ordered vectors (see Deep Sets [29] for details). Each embedding is shared across the rows of its input matrix and is itself a deep neural network with three fully connected hidden layers containing \(\{60, 60, 120\}\) units and ReLU activations. It maps each row of the matrix to a 120-dimensional vector and then sums these vectors, which effectively projects each player’s action into a 120-dimensional embedding invariant to the ordering of that player’s resources. The players’ embedding networks are trained jointly as part of the game model network. The players’ action embeddings are then passed through 3 fully connected hidden layers with \(\{1024, 512, 128\}\) units and ReLU activations. The final output rewards are produced by a last fully connected layer with 2 units and linear activation. All layers are L2-regularized with coefficient \(3 \times 10^{-4}\):

$$\begin{aligned} emb_p&= \sum _{dim=row}(DeepSet_{60,60,120}(u_p))\quad \forall p \in \{D,A\} \\ \hat{r}_D, \hat{r}_A&= FC_{2}(relu(FC_{128}(relu(FC_{512}(relu(FC_{1024}(emb_D, emb_A))))))) \end{aligned}$$
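
A tf.keras sketch of this permutation-invariant game model is given below; the layer sizes follow the text, while the concatenation of the two players’ embeddings and other wiring details are assumptions on our part.

```python
# Sketch of the game model network with Deep Sets-style row embeddings [29].
import tensorflow as tf


def row_embedding(l2_coef=3e-4):
    # Shared per-row embedding network with {60, 60, 120} units.
    reg = tf.keras.regularizers.l2(l2_coef)
    return tf.keras.Sequential([
        tf.keras.layers.Dense(60, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(60, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(120, activation="relu", kernel_regularizer=reg),
    ])


def make_game_model(m, n, l2_coef=3e-4):
    reg = tf.keras.regularizers.l2(l2_coef)
    u_D = tf.keras.Input(shape=(m, 2))  # defender action: m guards x (radius, angle)
    u_A = tf.keras.Input(shape=(n, 2))  # adversary action: n lumberjacks x (radius, angle)
    # Embed each row, then sum over rows -> 120-D permutation-invariant embedding.
    emb_D = tf.reduce_sum(row_embedding(l2_coef)(u_D), axis=1)
    emb_A = tf.reduce_sum(row_embedding(l2_coef)(u_A), axis=1)
    h = tf.keras.layers.Concatenate()([emb_D, emb_A])
    for units in (1024, 512, 128):
        h = tf.keras.layers.Dense(units, activation="relu", kernel_regularizer=reg)(h)
    r = tf.keras.layers.Dense(2, activation=None, kernel_regularizer=reg)(h)  # (r_D_hat, r_A_hat)
    return tf.keras.Model(inputs=[u_D, u_A], outputs=r)
```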

The models are trained with the Adam optimizer [16]. Note that the permutation-invariant embeddings are not central to the game model network; they only incorporate an inductive bias for this game. We also tested the game model network without the embedding networks and achieved similar performance with roughly a 2x increase in the number of iterations, since the game model then has to infer permutation invariance from data.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Kamra, N., Gupta, U., Wang, K., Fang, F., Liu, Y., Tambe, M. (2019). DeepFP for Finding Nash Equilibrium in Continuous Action Spaces. In: Alpcan, T., Vorobeychik, Y., Baras, J., Dán, G. (eds) Decision and Game Theory for Security. GameSec 2019. Lecture Notes in Computer Science, vol 11836. Springer, Cham. https://doi.org/10.1007/978-3-030-32430-8_15

  • DOI: https://doi.org/10.1007/978-3-030-32430-8_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-32429-2

  • Online ISBN: 978-3-030-32430-8
