DeepFP for Finding Nash Equilibrium in Continuous Action Spaces

Conference paper
Decision and Game Theory for Security (GameSec 2019)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 11836)

Abstract

Finding Nash equilibrium in continuous action spaces is a challenging problem and has applications in domains such as protecting geographic areas from potential attackers. We present DeepFP, an approximate extension of fictitious play to continuous action spaces. DeepFP represents players’ approximate best responses via generative neural networks, which are highly expressive implicit density approximators. It additionally uses a game-model network which approximates the players’ expected payoffs given their actions, and trains the networks end-to-end in a model-based learning regime. Further, DeepFP allows using domain-specific oracles if available and can hence exploit techniques such as mathematical programming to compute best responses for structured games. We demonstrate stable convergence to Nash equilibrium on several classic games and also apply DeepFP to a large forest security domain with a novel defender best response oracle. We show that DeepFP learns strategies robust to adversarial exploitation and scales well with a growing number of players’ resources.

Notes

  1. The full oracle algorithm and the involved approximations are detailed in the appendix to keep the main text concise and continuous.

References

  1. Amin, K., Singh, S., Wellman, M.P.: Gradient methods for Stackelberg security games. In: UAI, pp. 2–11 (2016)

  2. Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., Graepel, T.: The mechanics of n-player differentiable games. In: International Conference on Machine Learning (2018)

  3. Basilico, N., Celli, A., De Nittis, G., Gatti, N.: Coordinating multiple defensive resources in patrolling games with alarm systems. In: Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, pp. 678–686 (2017)

  4. Behnezhad, S., Derakhshan, M., Hajiaghayi, M., Seddighin, S.: Spatio-temporal games beyond one dimension. In: Proceedings of the 2018 ACM Conference on Economics and Computation, pp. 411–428 (2018)

  5. Cermák, J., Bošanský, B., Durkota, K., Lisý, V., Kiekintveld, C.: Using correlated strategies for computing Stackelberg equilibria in extensive-form games. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI 2016, pp. 439–445 (2016)

  6. Fang, F., Jiang, A.X., Tambe, M.: Optimal patrol strategy for protecting moving targets with multiple mobile resources. In: AAMAS, pp. 957–964 (2013)

  7. Ferguson, T.S.: Game Theory, vol. 2 (2014). https://www.math.ucla.edu/~tom/Game_Theory/Contents.html

  8. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400 (2017)

  9. Gan, J., An, B., Vorobeychik, Y., Gauch, B.: Security games on a plane. In: AAAI, pp. 530–536 (2017)

  10. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290 (2018)

  11. Haskell, W., Kar, D., Fang, F., Tambe, M., Cheung, S., Denicola, E.: Robust protection of fisheries with COMPASS. In: IAAI (2014)

  12. Heinrich, J., Lanctot, M., Silver, D.: Fictitious self-play in extensive-form games. In: International Conference on Machine Learning, pp. 805–813 (2015)

  13. Johnson, M.P., Fang, F., Tambe, M.: Patrol strategies to maximize pristine forest area. In: AAAI (2012)

  14. Kamra, N., Fang, F., Kar, D., Liu, Y., Tambe, M.: Handling continuous space security games with neural networks. In: IWAISe: First International Workshop on Artificial Intelligence in Security (2017)

  15. Kamra, N., Gupta, U., Fang, F., Liu, Y., Tambe, M.: Policy learning for continuous space security games using neural networks. In: AAAI (2018)

  16. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  17. Korzhyk, D., Yin, Z., Kiekintveld, C., Conitzer, V., Tambe, M.: Stackelberg vs. Nash in security games: an extended investigation of interchangeability, equivalence, and uniqueness. JAIR 41, 297–327 (2011)

  18. Krishna, V., Sjöström, T.: On the convergence of fictitious play. Math. Oper. Res. 23(2), 479–511 (1998)

  19. Lanctot, M., et al.: A unified game-theoretic approach to multiagent reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 4190–4203 (2017)

  20. Leslie, D.S., Collins, E.J.: Generalised weakened fictitious play. Games Econ. Behav. 56(2), 285–298 (2006)

  21. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O.P., Mordatch, I.: Multi-agent actor-critic for mixed cooperative-competitive environments. In: Advances in Neural Information Processing Systems, pp. 6379–6390 (2017)

  22. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928–1937 (2016)

  23. Perkins, S., Leslie, D.: Stochastic fictitious play with continuous action sets. J. Econ. Theory 152, 179–213 (2014)

  24. Rosenfeld, A., Kraus, S.: When security games hit traffic: optimal traffic enforcement under one sided uncertainty. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-2017, pp. 3814–3822 (2017)

  25. Shamma, J.S., Arslan, G.: Unified convergence proofs of continuous-time fictitious play. IEEE Trans. Autom. Control 49(7), 1137–1141 (2004)

  26. Wang, B., Zhang, Y., Zhong, S.: On repeated Stackelberg security game with the cooperative human behavior model for wildlife protection. In: Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2017, pp. 1751–1753 (2017)

  27. Yang, R., Ford, B., Tambe, M., Lemieux, A.: Adaptive resource allocation for wildlife protection against illegal poachers. In: AAMAS (2014)

  28. Yin, Y., An, B., Jain, M.: Game-theoretic resource allocation for protecting large public events. In: AAAI, pp. 826–833 (2014)

  29. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., Smola, A.J.: Deep sets. In: Advances in Neural Information Processing Systems, pp. 3394–3404 (2017)


Acknowledgments

This research was supported in part by NSF Research Grant IIS-1254206, NSF Research Grant IIS-1850477 and MURI Grant W911NF-11-1-0332.

Author information

Correspondence to Nitin Kamra.

A Appendix

1.1 A.1 Approximate Best Response Oracle for Forest Protection Game

[Algorithm 2: approximate defender best response oracle; pseudocode listing omitted]

Devising a defender’s best response to the adversary’s belief distribution is non-trivial for this game, so we propose a greedy approximation to the best response (Algorithm 2). We define a capture-set for a lumberjack location l as the set of all guard locations within a radius \(R_g\) of any point on the trajectory of that lumberjack. The algorithm creates capture-sets for the lumberjack locations l encountered so far in mem and intersects these capture-sets to find those which cover multiple lumberjacks. It then greedily allocates guards to the top m such capture-sets one at a time, while updating the remaining capture-sets to account for the lumberjacks already ambushed by the guards allocated so far (a Python sketch of this greedy allocation follows the list of approximations below). Our algorithm involves the following approximations:

  1. Mini-batch approximation: Since it is computationally infeasible to compute the best response to the full set of actions in mem, we best-respond to a small mini-batch of actions sampled randomly from mem to reduce computation (line 1).

  2. Approximate capture-sets: Initial capture-sets can have arbitrary arc-shaped boundaries which can be hard to store and process. Instead, we approximate them using convex polygons for simplicity (line 5). Doing this ensures that all subsequent intersections also result in convex polygons.

  3. Bounded number of intersections: Finding all possible intersections of capture-sets can be reduced to finding all cliques in a graph with capture-sets as vertices and pairwise intersections as edges; this is an NP-hard problem whose complexity grows exponentially with the number of polygons. Instead, we compute intersections in a pairwise fashion while adding the newly intersected polygons to the list. This way the \(k\)-th round of intersection produces up to all \((k+1)\)-polygon intersections, and we stop after \(k=4\) rounds of intersection to maintain polynomial time complexity (implemented for line 8, but not shown explicitly in Algorithm 2).

  4. Greedy selection: After forming capture-set intersections, we greedily select the top m sets with the highest rewards (line 9).
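
The following is a minimal, illustrative Python sketch of the greedy allocation outlined above, assuming capture-sets have already been approximated as convex polygons (here via shapely) and that each is paired with the set of lumberjack indices it covers. The helper names and the reward bookkeeping are our own assumptions, not the authors’ implementation.

```python
# Illustrative sketch only: capture-sets are (shapely Polygon, frozenset of lumberjack ids)
# pairs built from the sampled mini-batch; `rewards` maps a lumberjack id to the reward
# obtained by ambushing it. All names here are hypothetical stand-ins for Algorithm 2.
from shapely.geometry import Polygon


def greedy_guard_allocation(capture_sets, rewards, m, k_rounds=4):
    """Return up to m capture-sets; one guard is placed inside each chosen set."""
    pool = list(capture_sets)

    # Bounded pairwise intersection rounds: round k yields (up to) all
    # (k+1)-polygon intersections while keeping the computation polynomial.
    for _ in range(k_rounds):
        seen = {ids for _, ids in pool}
        new = []
        for i in range(len(pool)):
            for j in range(i + 1, len(pool)):
                poly = pool[i][0].intersection(pool[j][0])
                ids = pool[i][1] | pool[j][1]
                if poly.area > 0 and ids not in seen:
                    new.append((poly, ids))
                    seen.add(ids)
        if not new:
            break
        pool.extend(new)

    # Greedily pick the m sets covering the largest remaining (not yet ambushed) reward.
    chosen, ambushed = [], set()
    for _ in range(m):
        best = max(pool, key=lambda c: sum(rewards[i] for i in c[1] - ambushed))
        if not (best[1] - ambushed):
            break  # no remaining lumberjack can be ambushed
        chosen.append(best)
        ambushed |= best[1]  # implicitly devalues all remaining sets covering these ids
    return chosen
```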

1.2 A.2 Supplementary Experiments with m, n > 1

Table 3. More results on forests F1 and F4 for m = n = 2.
Table 4. Demonstrating getting stuck in locally optimal strategies.

Table 3 shows more experiments with DeepFP and OptGradFP for m, n > 1. We see that DeepFP covers the regions of importance with the players’ resources, whereas OptGradFP suffers from the zero defender gradients issue caused by its logit-normal strategy assumption, which often leads to sub-optimal results and higher exploitability.

1.3 A.3 Locally Optimal Strategies

To further study the issue of getting stuck in locally optimal strategies, we show experiments with another forest, F5, in Table 4. F5 has three dense tree patches, while the rest of the forest is very sparse and mostly empty. The optimal defender strategy computed by DLP for m = n = 1 is shown in C1. In such a case, because the tree density is broken into patches, gradients for both players are zero at many locations, and hence both algorithms are expected to get stuck in locally optimal strategies depending on their initialization. This is confirmed by configurations C2, C3, C4 and C5, which show strategies for OptGradFP and DeepFP with m = n = 1 covering only a single forest patch. Once the defender gets stuck on a forest patch, the probability of coming out of it is small, since the tree density surrounding the patches is negligible. However, with more resources for the defender and the adversary (m = n = 3), DeepFP is mostly able to break out of the stagnation and both players eventually cover more than a single forest patch (see C7), whereas OptGradFP covers additional ground only due to the random initialization of the 3 player resources and otherwise remains stuck around a single forest patch (see C6). DeepFP is partially able to break out because the defender’s best response does not rely on gradients but instead comes from a non-differentiable oracle. This shows how DeepFP can break out of local optima even in the absence of gradients when a best response oracle is provided; OptGradFP, in contrast, relies purely on gradients and cannot overcome such situations.

1.4 A.4 Neural Network Architectures

All our models were trained using TensorFlow v1.5 on an Ubuntu 16.04 machine with 32 CPU cores and an Nvidia Tesla K40c GPU.

Cournot Game and Concave-Convex Game. The best response networks for the Cournot game and the Concave-convex game consist of a single fully connected layer with a sigmoid activation, directly mapping the 2-D input noise \(z \sim {\mathcal N}([0,0],I_2)\) to a scalar output \(q_p\) for player p. The best response networks are trained with the Adam optimizer [16] and a learning rate of 0.05. To estimate payoffs, we use exact reward models as the game model networks. The maximum number of games was limited to 30,000 for the Cournot game and 50,000 for the Concave-convex game.
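
As a concrete illustration, a single-layer best response network of this form can be written in a few lines. This is a sketch in tf.keras rather than the TensorFlow v1.5 code actually used, and the variable names are assumptions.

```python
# Sketch of the single-layer best response network for the Cournot / Concave-convex
# games, written with tf.keras for brevity (the paper used TensorFlow v1.5).
import tensorflow as tf

# Maps 2-D Gaussian noise z ~ N([0, 0], I_2) to a scalar action q_p in (0, 1).
best_response_net = tf.keras.Sequential(
    [tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(2,))]
)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.05)  # learning rate from the text

# Sampling a batch of actions from the implicit best response density:
z = tf.random.normal(shape=(128, 2))
q_p = best_response_net(z)  # shape (128, 1)
```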

Forest Protection Game. The action \(u_p\) of player p contains the cylindrical coordinates (radii and angles) of all resources of that player. The best response network for the Forest protection game therefore maps \(Z_A \in {\mathbb R}^{64}\) to the adversary action \(u_A \in {\mathbb R}^{n \times 2}\). It has 3 fully connected hidden layers with \(\{128,64,64\}\) units and ReLU activations. The final output comes from two parallel fully connected layers with n (number of lumberjacks) units each: (a) the first with sigmoid activations, outputting n radii \(\in [0,1]\), and (b) the second with linear activations, outputting n unbounded angles \(\in (-\infty , \infty )\), which are taken modulo \(2\pi\) everywhere so that they lie in \([0, 2\pi ]\). All layers are L2-regularized with coefficient \(10^{-2}\):

$$\begin{aligned} x_A&= relu(FC_{64}(relu(FC_{64}(relu(FC_{128}(Z_A)))))) \\ u_{A,rad}&= \sigma (FC_n(x_A));\quad u_{A,ang} = FC_n(x_A) \end{aligned}$$
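
A corresponding tf.keras sketch of this adversary best response network is shown below; the functional-API arrangement and the point at which the angles are wrapped into \([0, 2\pi]\) are our assumptions.

```python
# Sketch of the adversary best response network u_A = BR_A(Z_A) described above.
import numpy as np
import tensorflow as tf


def make_adversary_best_response(n, l2_coef=1e-2):
    reg = tf.keras.regularizers.l2(l2_coef)
    z = tf.keras.Input(shape=(64,))  # Z_A: 64-D input noise
    x = tf.keras.layers.Dense(128, activation="relu", kernel_regularizer=reg)(z)
    x = tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=reg)(x)
    x = tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=reg)(x)
    # Two parallel output heads with n units each.
    radii = tf.keras.layers.Dense(n, activation="sigmoid", kernel_regularizer=reg)(x)
    angles = tf.keras.layers.Dense(n, activation=None, kernel_regularizer=reg)(x)
    angles = tf.math.floormod(angles, 2.0 * np.pi)  # wrap into [0, 2*pi)
    u_A = tf.stack([radii, angles], axis=-1)  # shape (batch, n, 2)
    return tf.keras.Model(inputs=z, outputs=u_A)
```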

The game model network takes all players’ actions as inputs (i.e., matrices \(u_D, u_A\) of shapes (m, 2) and (n, 2), respectively) and produces two scalar rewards \(r_D\) and \(r_A\). It internally converts the angles in the second columns of these inputs to the range \([0, 2\pi ]\). Since the rewards should be invariant to permutations of the defender’s and adversary’s resources (guards and lumberjacks, respectively), we first pass the input matrices through non-linear embeddings that treat their rows as sets rather than ordered vectors (see Deep Sets [29] for details). Each embedding is shared across the rows of its input matrix and is itself a deep neural network with three fully connected hidden layers containing \(\{60, 60, 120\}\) units and ReLU activations. It maps each row of the matrix to a 120-dimensional vector and then sums these vectors, which effectively projects each player’s action into a 120-dimensional embedding invariant to the ordering of that player’s resources. The players’ embedding networks are trained jointly as part of the game model network. The players’ action embeddings are then passed through 3 fully connected hidden layers with \(\{1024, 512, 128\}\) units and ReLU activations. The final output rewards are produced by a last fully connected layer with 2 units and linear activation. All layers are L2-regularized with coefficient \(3 \times 10^{-4}\):

$$\begin{aligned} emb_p&= \sum _{dim=row}(DeepSet_{60,60,120}(u_p))\quad \forall p \in \{D,A\} \\ \hat{r}_D, \hat{r}_A&= FC_{2}(relu(FC_{128}(relu(FC_{512}(relu(FC_{1024}(emb_D, emb_A))))))) \end{aligned}$$
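
A tf.keras sketch of this permutation-invariant game model is given below; the layer sizes follow the text, while the concatenation of the two players’ embeddings and other wiring details are assumptions on our part.

```python
# Sketch of the game model network with Deep Sets-style row embeddings [29].
import tensorflow as tf


def row_embedding(l2_coef=3e-4):
    # Shared per-row embedding network with {60, 60, 120} units.
    reg = tf.keras.regularizers.l2(l2_coef)
    return tf.keras.Sequential([
        tf.keras.layers.Dense(60, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(60, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(120, activation="relu", kernel_regularizer=reg),
    ])


def make_game_model(m, n, l2_coef=3e-4):
    reg = tf.keras.regularizers.l2(l2_coef)
    u_D = tf.keras.Input(shape=(m, 2))  # defender action: m guards x (radius, angle)
    u_A = tf.keras.Input(shape=(n, 2))  # adversary action: n lumberjacks x (radius, angle)
    # Embed each row, then sum over rows -> 120-D permutation-invariant embedding.
    emb_D = tf.reduce_sum(row_embedding(l2_coef)(u_D), axis=1)
    emb_A = tf.reduce_sum(row_embedding(l2_coef)(u_A), axis=1)
    h = tf.keras.layers.Concatenate()([emb_D, emb_A])
    for units in (1024, 512, 128):
        h = tf.keras.layers.Dense(units, activation="relu", kernel_regularizer=reg)(h)
    r = tf.keras.layers.Dense(2, activation=None, kernel_regularizer=reg)(h)  # (r_D_hat, r_A_hat)
    return tf.keras.Model(inputs=[u_D, u_A], outputs=r)
```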

The models are trained with the Adam optimizer [16]. Note that the permutation-invariant embeddings are not central to the game model network; they only incorporate an inductive bias for this game. We also tested the game model network without the embedding networks and achieved similar performance with roughly a 2x increase in the number of iterations, since the game model then has to infer permutation invariance from data.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Kamra, N., Gupta, U., Wang, K., Fang, F., Liu, Y., Tambe, M. (2019). DeepFP for Finding Nash Equilibrium in Continuous Action Spaces. In: Alpcan, T., Vorobeychik, Y., Baras, J., Dán, G. (eds) Decision and Game Theory for Security. GameSec 2019. Lecture Notes in Computer Science, vol 11836. Springer, Cham. https://doi.org/10.1007/978-3-030-32430-8_15

  • DOI: https://doi.org/10.1007/978-3-030-32430-8_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-32429-2

  • Online ISBN: 978-3-030-32430-8
