1 Introduction

The traditional methodology for securing systems is to rely solely on cryptographic means to ensure protection of data (via encryption) and authentication (via signatures and secure passwords). However, both of these techniques rely on some data being kept secret. The advent of attacks such as Advanced Persistent Threats (APTs) means that exfiltration of the underlying cryptographic keys, or other sensitive data, is now a real threat. To model such long-term stealthy attacks researchers have turned to game theory [1, 22, 26, 30], adding to a growing body of work applying game theory to cybersecurity in various scenarios [7, 16, 35]. Probably the most influential work applying game theory in the security area has been the FlipIt game [10], with the follow-up paper [5] demonstrating applications of FlipIt to various examples including credential management, virtual machine refresh and cloud auditing for service-level agreements. FlipIt has gained traction and popularity due to its assumption that the adversary can always get into the system, aligning with the new rhetoric in the security industry that avoiding compromise altogether is no longer realistic and that the aim is instead to limit the extent of any compromise as quickly and efficiently as possible.

In FlipIt, two players, aptly named the defender and attacker, fight over control of a single resource. Each player has their own button which, when pressed, gives them control over the resource, conferring some form of benefit. Pressing the button has a cost associated with it. The FlipIt paper examines this game as a way of modelling attacks in which a stealthy adversary is trying to control a single resource. For example, it could be used to model an adversary who compromises passwords (the adversary's button press corresponds to a break of a system password), whilst the defender resets passwords occasionally (by pressing their button in the game). FlipIt has of course been generalised in many different directions [11, 17, 19, 24, 27, 28].

However, a standard defence against having a single point of failure for secure data (be it real data or cryptographic secrets) is to use some form of distributed cryptography [9], usually based on secret sharing [31]. Such techniques can either be used directly, as in [4, 32], or be used to compute on the shared data, as in Multi-Party Computation [2, 8]. To capture this and other such situations, Laszka et al. [18] introduce a FlipThem game in which there are multiple resources, each equipped with a button as in FlipIt, and the attacker obtains benefit only by gaining control of every resource in the game. This models the full threshold situation in distributed cryptographic solutions.

The FlipThem game is extended by Leslie et al. [20] to the partial threshold case, where the attacker is not required to gain control of the whole system, but only a fraction of the resources, in order to obtain some benefit. Assuming both players select the rates of the Poisson processes controlling the pressing of their buttons, they calculate the proportion of time the attacker is in control of enough of the system to gain this benefit. Nash equilibrium rates of play are calculated, which depend on the move costs and on the threshold number of resources required for the attacker to gain any benefit.

A major downside of the analysis in [20] is that the buttons associated with all resources are given the same cost and move rate for the attacker (and similarly for the defender). In the current work we therefore introduce the more realistic setting in which a player may have different move rates and costs associated with each resource. For example, one resource may be easier to apply patches to than others, or one may be easier to attack due to the operating system it runs. We calculate Nash equilibrium rates of play for this more realistic setting.

We also introduce a framework from the learning in games literature [13] which models the situation where each player responds to the observed actions of the other player, instead of calculating an equilibrium. This adaptive framework is required once we drop the unrealistic assumption that players know their opponent's costs and reward functions, and our version makes considerably weaker assumptions on the information available to the players than the adaptive framework of [10]. In particular we introduce learning into a situation where the FlipThem game is played over a continuing sequence of epochs. We assume players repeatedly calculate the average observed rate of their opponent and respond optimally to it, resulting in a learning rule known as fictitious play [6, 13]. Performing multiple experiments, we find that when the costs result in a game with an interior equilibrium point (i.e. one in which both players play non-zero rates on all resources) the fictitious play procedure converges to this equilibrium. On the other hand, when there is no interior equilibrium, we find unstable behaviour in the learning procedure. This result is important in the real world: the fictitious play formulation assumes that players know only their own benefit functions and can observe the play of others, yet the players still manage to converge to the calculated equilibria. Thus in these situations, even if our players are unable to calculate equilibrium strategies, their naïve optimising play converges to an equilibrium and players' long-term rewards are captured by the equilibrium payoffs.

2 Model

Our Multi-rate Threshold FlipThem game has two players, an attacker and a defender, fighting for control over a set of n resources, \(\mathcal {R} = \{ \mathcal {R}_1, \ldots , \mathcal {R}_n \}\). Both players have a button for each resource that, when pressed, gives them control over that specific resource. For each resource \(\mathcal {R}_i\) the defender and attacker play exponential rates \(\mu _i\) and \(\lambda _i\), respectively, meaning that the times of a player's moves on a given resource follow a Poisson process with rate \(\mu _i\) or \(\lambda _i\). Using Markov chain theory [14] we can construct explicit values for the proportion of time each player is in control of resource \(\mathcal {R}_i\), depending on their rates of play. This stationary distribution is given by \(\pi ^i = \left( \pi ^i_0, \pi ^i_1\right) = \frac{1}{\mu _i + \lambda _i} \left( \mu _i,\lambda _i\right) \), and indicates that the defender is in control of resource \(\mathcal {R}_i\) a proportion \(\mu _i/(\lambda _i+\mu _i)\) of the time, and the attacker is in control a proportion \(\lambda _i/(\lambda _i+\mu _i)\) of the time. We assume that behaviour on each resource is independent of all the others, and hence the proportion of time that the attacker is in control of a particular set of resources C, with the defender in control of the remaining resources, is simply given by the product of the individual resource proportions: \(\prod _{i\in C} \frac{\lambda _i}{\lambda _i+\mu _i}\prod _{i\notin C} \frac{\mu _i}{\lambda _i+\mu _i}.\)
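As a concrete illustration of these proportions (our sketch, not code from the paper; resource indices are 0-based and function names are purely illustrative):

```python
# Our illustration of the stationary control proportions; indices are 0-based.
from math import prod

def control_shares(mu_i, lam_i):
    """(defender share, attacker share) of a single resource i."""
    return mu_i / (mu_i + lam_i), lam_i / (mu_i + lam_i)

def attacker_holds_exactly(C, mu, lam):
    """Proportion of time the attacker controls exactly the resources in C,
    with the defender controlling the rest (resources are independent)."""
    n = len(mu)
    return prod(lam[i] / (lam[i] + mu[i]) for i in C) * \
           prod(mu[i] / (lam[i] + mu[i]) for i in range(n) if i not in C)

print(attacker_holds_exactly({0}, mu=[2.0, 1.0], lam=[1.0, 0.5]))   # two-resource example
```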

At any point in time, the attacker will have compromised k resources, whilst the defender is in control of the remaining \(n-k\) resources, for some k. In order for the attacker to have some gain, she must compromise t or more resources. The value t is called the threshold, as used in a number of existing threshold security situations discussed in the introduction. From a game theory point of view, whenever \(k\ge t\) the attacker obtains benefit, whilst when \(k<t\) the defender obtains benefit.

From this we can construct benefit functions for both players. For the attacker it is the proportion of time she is in control of a number of resources over the threshold, such that \(k \ge t\), penalised by a cost for moving. For the defender it is the proportion of time that she is in control of at least \(n-t+1\) resources, again penalised by a cost of moving. Thus, the benefit functions for attacker and defender respectively are given by

$$\begin{aligned} \begin{aligned} \beta _D(\varvec{\mu }, \varvec{\lambda })&= 1 - \sum _{\begin{array}{c} C\subseteq \{1,\ldots ,n\} \\ |C|\ge t \end{array}} \left[ \prod _{i\in C} \frac{\lambda _i}{\lambda _i+\mu _i}\right] \cdot \left[ \prod _{i\notin C} \frac{\mu _i}{\lambda _i+\mu _i}\right] - \sum _i d_i \cdot \mu _i\\ \beta _A(\varvec{\mu }, \varvec{\lambda })&= \sum _{\begin{array}{c} C\subseteq \{1,\ldots ,n\} \\ |C|\ge t \end{array}} \left[ \prod _{i\in C} \frac{\lambda _i}{\lambda _i+\mu _i}\right] \cdot \left[ \prod _{i\notin C} \frac{\mu _i}{\lambda _i+\mu _i}\right] - \sum _i a_i \cdot \lambda _i \ \end{aligned} \end{aligned}$$
(1)

where the \(a_i\) and \(d_i\) are the (relative) move costs on resource i for attacker and defender, which are assumed fixed throughout the game, and \(\varvec{\mu }\) and \(\varvec{\lambda }\) are the vectors of rates over all resources for the defender and attacker, respectively, constrained to be non-negative. The benefit functions in (1) show that the game is non-zero-sum, meaning we are unable to use standard zero-sum results found in the literature [25].
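For concreteness, the following sketch evaluates the benefit functions in (1) by direct enumeration of all subsets C with \(|C| \ge t\). It is our illustration of the formulas above (0-based indices, illustrative names), not an implementation used in the experiments.

```python
# Our sketch of the benefit functions (1); indices are 0-based, names illustrative.
from itertools import combinations
from math import prod

def attacker_control_proportion(mu, lam, t):
    """Proportion of time the attacker controls at least t of the n resources."""
    n = len(mu)
    return sum(prod(lam[i] / (lam[i] + mu[i]) for i in C) *
               prod(mu[i] / (lam[i] + mu[i]) for i in range(n) if i not in C)
               for k in range(t, n + 1) for C in combinations(range(n), k))

def beta_D(mu, lam, d, t):
    """Defender's benefit: time below the threshold minus the cost of moving."""
    return 1.0 - attacker_control_proportion(mu, lam, t) - sum(di * mi for di, mi in zip(d, mu))

def beta_A(mu, lam, a, t):
    """Attacker's benefit: time at or above the threshold minus the cost of moving."""
    return attacker_control_proportion(mu, lam, t) - sum(ai * li for ai, li in zip(a, lam))
```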

3 Finding the Equilibria of Multi-rate (n, t)-FlipThem

We begin by finding the equilibria of the multi-rate version of the FlipThem game with n resources. This represents a more realistic scenario than previous studies, by allowing players to favour certain resources based on differential costs of attacking or defending them, for example when a company owns multiple servers located in different areas and running different versions of operating systems. We want to find a stationary point, or equilibrium, of the two benefit functions (1) expressed purely in terms of the costs of moving on each resource. At such a point both players are playing at rates that maximise their own benefit function given their opponent's play and the move costs, and so neither wishes to deviate from their current strategy. This is known as a Nash equilibrium [23]. Our challenge in this article, compared with previous works such as [20], is that neither benefit function is trivial to optimise simultaneously across the vector of rates.

3.1 Full Threshold: Multi-rate (n, n)-FlipThem

We begin with the full threshold case in which the attacker must control all resources in order to obtain some benefit. The algebra is easier in this case; the partial threshold version is addressed in Sect. 3.2. For this full threshold case, the general benefit functions (1) simplify to

$$\begin{aligned} \begin{aligned} \beta _D (\varvec{\mu }, \varvec{\lambda })&= 1 - \prod ^n_{i=1} \frac{\lambda _i}{\mu _i + \lambda _i} - \sum _{i=1}^n d_i \cdot \mu _i \\ \beta _A (\varvec{\mu }, \varvec{\lambda })&= \prod ^n_{i=1} \frac{\lambda _i}{\mu _i + \lambda _i} - \sum _{i=1}^n a_i \cdot \lambda _i. \end{aligned} \end{aligned}$$
(2)

Note that these benefit functions reduce to those of [20] if we set \(\mu _i = \mu , \lambda _i = \lambda \), \(a_i = \frac{a}{n}\) and \(d_i = \frac{d}{n}\) for all i.

We start by finding the best response function of the defender, which is a function \(\text {br}^D\) mapping attacker rates \(\varvec{\lambda }\) to the set of all defender rates \(\varvec{\mu }\) which maximise the defender payoff \(\beta _D\) when the attacker plays rates \(\varvec{\lambda }\). A necessary, though not sufficient, condition for \(\varvec{\mu }\) to maximise \(\beta _D\) is that each \(\mu _i\) maximises \(\beta _D\) conditional on the other values of \(\varvec{\mu }\). Furthermore, maxima with respect to \(\mu _i\) occur either where the partial derivative \(\frac{\partial \beta _D}{\partial \mu _i}\) is 0, or at a boundary of the parameter space. Equating this partial derivative to zero gives

$$\begin{aligned} \frac{\partial \beta _D}{\partial \mu _i} = 0 \Rightarrow -\prod ^n_{j = 1} \lambda _j + d_i \cdot (\lambda _i + \mu _i)^2 \cdot \prod ^n_{j = 1,j \ne i} (\mu _j + \lambda _j) = 0. \end{aligned}$$

Dividing through by \(d_i \prod ^n_{j=1, j \ne i} (\mu _j + \lambda _j)\) gives \((\lambda _i + \mu _i)^2 = \frac{\lambda _i}{d_i} \prod ^n_{j=1, j\ne i} \frac{\lambda _j}{\mu _j + \lambda _j}\), which is a quadratic in \(\mu _i\): fixing all attacker rates \(\varvec{\lambda }\) and all other defender rates \(\mu _j\), \(j \ne i\), the defender's benefit function (2) has only two turning points in \(\mu _i\). Since \(\beta _D\) decreases to negative infinity as \(\mu _i\) gets large, the two candidates for a maximising \(\mu _i\) are the upper root of this equation and \(\mu _i = 0\). A non-zero \(\mu _i\) must therefore satisfy

$$\begin{aligned} \mu _i = -\lambda _i + \sqrt{\frac{\lambda _i}{d_i} \cdot \prod ^n_{{j = 1,j \ne i}} \frac{ \lambda _j}{ (\mu _j + \lambda _j) }}. \end{aligned}$$
(3)

Of course, a \(\mu _i\) satisfying this equation could be negative and thus inadmissible as a rate, but all we claim for now is that any non-zero \(\mu _i\) must be of this form.

We can apply the same method to the attacker, differentiating her benefit with respect to her rate \(\lambda _i\) on resource \(\mathcal {R}_i\), equating this to zero and rearranging to give

$$\begin{aligned} \lambda _i = -\mu _i + \sqrt{\frac{\mu _i}{a_i} \cdot \prod ^n_{{j = 1,j \ne i}} \frac{ \lambda _j}{ (\mu _j + \lambda _j) }}. \end{aligned}$$
(4)

Any Nash equilibrium in the interior of the strategy space (i.e. with strictly positive rates on all resources) must be a simultaneous solution of (3) and (4). Note that both equations can be rearranged to express \(\lambda _i+\mu _i\) as a square root; equating the two square root terms we find that \(\frac{\lambda _i}{\mu _i} = \frac{d_i}{a_i}\). Substituting this relationship back into (3) and (4) gives

$$\begin{aligned} \lambda _i^* = \frac{d_i}{(a_i + d_i)^2} \cdot \prod _{\begin{array}{c} j=1 \\ j \ne i \end{array}}^n\frac{d_j}{(a_j + d_j)}, \quad \mu _i^* = \frac{a_i}{(a_i + d_i)^2} \cdot \prod _{\begin{array}{c} j=1 \\ j \ne i \end{array}}^n\frac{d_j}{(a_j + d_j)}. \end{aligned}$$
(5)

If a company were reviewing its defensive systems and were able to calculate these equilibria, it could see the rate at which it would have to move in order to defend its system optimally, under the assumption that the attacker is also playing optimally. (Of course, if the attacker were playing sub-optimally, the defender would be able to gain a larger payoff.) From these equilibria the parties can calculate the long-run proportion of time each resource within the system would be compromised, by looking at the stationary distribution shown in Sect. 2. From this information the company can also calculate the value of each player's benefit function when playing these rates, which can inform strategic decisions on the design of the system. We do this by substituting the two equilibrium rates back into the benefit functions (2). We can express these rewards in terms of the ratio between the costs on each resource, \(\rho _j = \frac{a_j}{d_j}\), giving the dimensionless expressions

$$\begin{aligned} \begin{aligned} \beta _D^*&= 1 - \prod _{i=1}^n \frac{1}{\rho _i +1}\left[ 1+\sum _{j=1}^n\frac{\rho _j}{\rho _j+1}\right] ,\\ \beta _A^*&= \prod _{i=1}^n \frac{1}{\rho _i +1}\left[ 1-\sum _{j=1}^n\frac{\rho _j}{\rho _j+1}\right] . \end{aligned} \end{aligned}$$
(6)
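As a quick numerical sanity check of (5) and (6) (our sketch; the cost values are illustrative only), one can compute the fixed-point rates, confirm the relationship \(\lambda _i/\mu _i = d_i/a_i\), and evaluate the equilibrium payoffs:

```python
# Our numerical check of (5) and (6); the cost vectors are illustrative only.
from math import prod, isclose

def full_threshold_equilibrium(a, d):
    """Interior fixed point (5): per-resource equilibrium rates (mu*, lam*)."""
    n = len(a)
    P = [prod(d[j] / (a[j] + d[j]) for j in range(n) if j != i) for i in range(n)]
    mu = [a[i] / (a[i] + d[i]) ** 2 * P[i] for i in range(n)]
    lam = [d[i] / (a[i] + d[i]) ** 2 * P[i] for i in range(n)]
    return mu, lam

def full_threshold_payoffs(a, d):
    """Equilibrium payoffs (6), written in terms of the ratios rho_j = a_j / d_j."""
    rho = [ai / di for ai, di in zip(a, d)]
    p = prod(1.0 / (r + 1.0) for r in rho)
    s = sum(r / (r + 1.0) for r in rho)
    return 1.0 - p * (1.0 + s), p * (1.0 - s)          # (beta_D*, beta_A*)

a, d = [0.3, 0.2], [1.0, 1.0]                          # illustrative move costs
mu, lam = full_threshold_equilibrium(a, d)
assert all(isclose(l / m, di / ai) for l, m, ai, di in zip(lam, mu, a, d))
print(mu, lam, full_threshold_payoffs(a, d))
```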

We have expressed the payoffs to both players at the putative interior equilibrium. If this is indeed an equilibrium, neither player will wish to deviate from these equilibrium rates. However, we have thus far ignored the potential maximising \(\mu _i\) at 0. While no individual \(\mu _i\) or \(\lambda _i\) will deviate unilaterally to zero (recall that each partial derivative has only two zeroes, so payoff decreases as \(\mu _i\) decreases from the upper root towards zero), if several rates switch to zero simultaneously the payoff to a player could increase. We therefore consider what happens when rates can switch to zero, starting with the attacker.

By considering the attacker’s benefit function (2), we can see that if the attacker plays a zero rate on any resource, then in order to maximise the payoff, she should play zero rates on the rest of the system. In other words, she should drop out of the game and receive zero reward. Thus by comparing the attacker payoff in (6) to zero we can see quickly whether this is indeed a point at which the attacker will be content.

For the defender, things are more complicated. However, we can see from the benefit function (2) that a zero payoff by withdrawing from the game is again an option, and so we compare the equilibrium payoff (6) to zero as a partial check on whether the fixed point (5) is an equilibrium. We do not explicitly discount partial dropout of the defender, but note that by dropping out the defender is effectively reducing the game to a smaller number of servers, and hence should not have invested in the additional servers in the first place. Comparing the benefits in (6) to zero it is easy to see in this full threshold case that the defender’s benefit \(\beta _D^*\) is always non-negative when playing (5) meaning dropping out is not an equilibrium point for the defender, whereas \(\beta _A^*\) can drop below 0. We use this to find a condition which must be satisfied for the point (5) to be an equilibrium. In particular, we require \(\beta _A^*\) in (6) to be positive, and therefore

$$\begin{aligned} 1-\sum _{j=1}^n\frac{\rho _j}{\rho _j+1} >0. \end{aligned}$$
(7)

Thus, we have a condition (7) that, if satisfied, means the attacker will not drop out of the game. If the condition is not satisfied, the attacker will prefer to drop out of the game entirely rather than play at the interior equilibrium. For example, with \(n=2\) and \(\rho _1 = \rho _2 = 1\) the sum in (7) equals 1 and the condition fails, whereas \(\rho _1 = \rho _2 = 1/2\) gives a sum of \(2/3\) and the interior point survives. Ensuring condition (7) is met can thus be viewed as a design criterion for system engineers when designing defensive systems.

3.2 Partial Threshold: Multi-rate (n, t)-FlipThem

So far we have extended the full threshold FlipThem game of [18] by obtaining the equilibria of benefit functions constructed from the proportion of time the attacker is in control of the whole system. To generalise this further, we return to the partial threshold case, in which the attacker gains benefit from controlling only \(t<n\) resources. The general benefit functions for both players are written in (1); the analysis is analogous to the method demonstrated in Sect. 3.1. In this (n, t)-threshold case, the best response conditions analogous to (3) and (4) are

$$\begin{aligned} \mu _i = -\lambda _i + \sqrt{\frac{\lambda _i \cdot S_i}{d_i}} \quad \text {and} \quad \lambda _i = -\mu _i + \sqrt{\frac{\mu _i S_i}{a_i }}, \end{aligned}$$
(8)

where we have introduced \(S_i\) to denote

$$\begin{aligned} \sum _{\begin{array}{c} C\subseteq \mathcal {A}_i \\ |C| = t-1 \end{array}} \left[ \prod _{j\in C} \frac{\lambda _j}{\lambda _j+\mu _j}\right] \cdot \left[ \prod _{j\in \mathcal {A}_i \setminus C} \frac{\mu _j}{\lambda _j+\mu _j}\right] \end{aligned}$$

and \(\mathcal {A}_i = \{1,\ldots , i-1, i+1, \ldots ,n\}\). Only subsets of size exactly \(t-1\) appear in \(S_i\): for any \(C \subseteq \mathcal {A}_i\) with \(|C| \ge t\) the threshold is reached regardless of who controls resource i, so the terms of the benefit function for C and \(C \cup \{i\}\) sum to a quantity independent of the rates on resource i and vanish on differentiation. Interestingly, equating the square root terms results in the same relationship \(\frac{\mu _i}{a_i} = \frac{\lambda _i}{d_i}\) as in Sect. 3.1. Finally, substituting this relationship back into the best response functions (8) gives us

$$\begin{aligned} \begin{aligned} \mu _i^*&= \frac{a_i}{(a_i + d_i)^2} \cdot \sum _{\begin{array}{c} C\subseteq \mathcal {A}_i \\ |C| = t-1 \end{array}} \left[ \prod _{j\in C} \frac{d_j}{a_j+d_j}\right] \cdot \left[ \prod _{j\in \mathcal {A}_i \setminus C} \frac{a_j}{a_j+d_j}\right] , \\ \lambda _i^*&= \frac{d_i}{(a_i + d_i)^2} \cdot \sum _{\begin{array}{c} C\subseteq \mathcal {A}_i \\ |C| = t-1 \end{array}} \left[ \prod _{j\in C} \frac{d_j}{a_j+d_j}\right] \cdot \left[ \prod _{j\in \mathcal {A}_i \setminus C} \frac{a_j}{a_j+d_j}\right] . \end{aligned} \end{aligned}$$
(9)

As in Sect. 3.1, (9) is a Nash equilibrium of the game unless one or other player can improve their payoff by dropping out of one or more resources. We can substitute these rates back into the players' benefit functions, as we did in the full threshold case in Sect. 3.1, to check that the payoffs at this putative equilibrium are non-negative. However, the resulting formulae become very complicated to write down explicitly in general, and we leave this to Sect. 4, where we deal with specific examples. Note this also means we do not have a clean condition analogous to (7) with which to test whether (9) is an equilibrium.
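The fixed point (9) is nonetheless easy to evaluate numerically. The sketch below is our illustration (names and cost values are placeholders): it computes the subset sums at the fixed point and returns the equilibrium rates; setting \(t=n\) recovers (5).

```python
# Our sketch of the interior fixed point (9); names and costs are illustrative.
from itertools import combinations
from math import prod

def partial_threshold_equilibrium(a, d, t):
    """Equilibrium rates (mu*, lam*) from (9), with 0-based resource indices."""
    n = len(a)
    x = [d[j] / (a[j] + d[j]) for j in range(n)]   # attacker's share of resource j at the fixed point
    y = [a[j] / (a[j] + d[j]) for j in range(n)]   # defender's share of resource j
    mu_star, lam_star = [], []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        S_i = sum(prod(x[j] for j in C) * prod(y[j] for j in others if j not in C)
                  for C in combinations(others, t - 1))
        mu_star.append(a[i] / (a[i] + d[i]) ** 2 * S_i)
        lam_star.append(d[i] / (a[i] + d[i]) ** 2 * S_i)
    return mu_star, lam_star

# illustrative (3, 2)-threshold costs; with t = n the sum has the single term of (5)
print(partial_threshold_equilibrium(a=[0.6, 0.6, 0.5], d=[0.8, 0.75, 0.7], t=2))
```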

4 Introducing Fictitious Play into Multi-rate (n, t)-FlipThem

While the equilibrium analysis above offers useful insight into the security game Multi-rate Threshold FlipThem, it can be viewed as an unrealistic model of real-world play. In particular it is extremely unlikely that the players have full knowledge of their opponent's payoffs and move costs, and they therefore cannot calculate the equilibrium strategies. We now introduce game-theoretic learning, in which the only knowledge a player has of their opponent is the actions that they take. When the game is played repeatedly through time, players respond to their observations and attempt to improve their payoff. In this article we focus on a method of learning known as fictitious play [3, 6, 13].

We break the game up into periods of fixed length of time. At the end of period \(\tau \) each player observes the number of times the button of each resource i was pressed by their opponent in that period. Denote by \(\lambda _i^\tau \) and \(\mu _i^\tau \) the actual rate played by attacker and defender in period \(\tau \), and use \(\widetilde{\lambda _i}^\tau \) and \(\widetilde{\mu _i}^\tau \) to denote the number of button presses by the attacker and defender respectively, normalised by the length of the time interval. After \(\mathcal {T}\) plays of the game, each player averages the observations he has made of the opponent, resulting in estimates for each resource

$$\begin{aligned} \widehat{\lambda }_i^{\mathcal {T}} = \frac{1}{\mathcal {T}} \sum _{\tau =1}^\mathcal {T} \widetilde{\lambda }_i^\tau ,\qquad \widehat{\mu }_i^{\mathcal {T}} = \frac{1}{\mathcal {T}} \sum _{\tau =1}^\mathcal {T} \widetilde{\mu }_i^\tau . \end{aligned}$$

The players select their rates for time period \(\mathcal {T}+1\) by playing a best response to their estimates;

$$\begin{aligned} \mu _i^{\mathcal {T}+1} = \text {br}_i^D(\widehat{\varvec{\lambda }}^\mathcal {T}),\qquad \lambda _i^{\mathcal {T}+1} = \text {br}_i^A(\widehat{\varvec{\mu }}^\mathcal {T}). \end{aligned}$$

where \(\widehat{\varvec{\mu }}^{\mathcal {T}}\) and \(\widehat{\varvec{\lambda }}^{\mathcal {T}}\) are the vectors of estimated defender and attacker rates on each resource. If the opponent's rates were constant, averaging the observations over time would be an optimal estimate of those rates. Since both players are learning, the rates are not constant, and averaging uniformly over time does not result in statistically optimal prediction. However, in the absence of a better-informed model of rate evolution, averaging is retained as the standard mechanism in fictitious play; see [21, 34] for attempts to move beyond this assumption.

Algorithm 1

This fictitious play process is described in Algorithm 1, which shows the simplicity of the learning process and the sparsity of the information required by the players. The only challenging step is the calculation of the best response: as observed in Sect. 3, the best response of each player is not in general a simple analytical function of the opponent's rates. From the defender's point of view we consider every subset of the resources; setting the rates on the resources in that subset to zero and solving (8) for the remaining non-zero rates gives a putative best response, and the set of rates with the highest payoff given the fixed belief \(\widehat{\varvec{\lambda }} ^\mathcal {T}\) is the best response. The attacker's best response is calculated analogously. An interesting question, which we address below, is whether this simple learning process converges to the equilibria calculated in Sect. 3.
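One period of this loop can be sketched as follows. This is our illustration, not the authors' implementation: in place of the subset-enumeration best response of Sect. 3 we use a generic numerical maximisation (scipy's L-BFGS-B over non-negative rates), which plays the same role; all function names, the starting beliefs and the period count are illustrative choices.

```python
# Our sketch of the fictitious play loop; not the authors' implementation.
import numpy as np
from itertools import combinations
from math import prod
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def attacker_control_proportion(mu, lam, t):
    # proportion of time the attacker holds at least t resources, as in (1)
    n = len(mu)
    return sum(prod(lam[i] / (lam[i] + mu[i]) for i in C) *
               prod(mu[i] / (lam[i] + mu[i]) for i in range(n) if i not in C)
               for k in range(t, n + 1) for C in combinations(range(n), k))

def best_response_defender(lam_hat, d, t):
    # numerical stand-in for the subset-enumeration best response
    obj = lambda mu: -(1.0 - attacker_control_proportion(mu, lam_hat, t) - np.dot(d, mu))
    return minimize(obj, x0=np.full(len(d), 0.5), bounds=[(0.0, None)] * len(d),
                    method="L-BFGS-B").x

def best_response_attacker(mu_hat, a, t):
    obj = lambda lam: -(attacker_control_proportion(mu_hat, lam, t) - np.dot(a, lam))
    return minimize(obj, x0=np.full(len(a), 0.5), bounds=[(0.0, None)] * len(a),
                    method="L-BFGS-B").x

def fictitious_play(a, d, t, periods=200, interval=100.0):
    a, d = np.asarray(a), np.asarray(d)
    n = len(a)
    mu_hat, lam_hat = np.zeros(n), np.zeros(n)         # beliefs (running averages)
    mu = rng.uniform(0.1, 1.0, n)                      # random starting rates
    lam = rng.uniform(0.1, 1.0, n)
    for tau in range(1, periods + 1):
        mu_obs = rng.poisson(mu * interval) / interval   # observed, normalised presses
        lam_obs = rng.poisson(lam * interval) / interval
        mu_hat += (mu_obs - mu_hat) / tau                # update running averages
        lam_hat += (lam_obs - lam_hat) / tau
        mu = best_response_defender(lam_hat, d, t)       # rates for the next period
        lam = best_response_attacker(mu_hat, a, t)
    return mu, lam
```

When the interior fixed point (9) gives both players positive benefit, one would expect the returned rates to settle close to those equilibrium values, mirroring the experiments reported below.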

The process we have defined is a discrete time stochastic process. It is in fact a stochastic fictitious play process [12]: since the number of button presses in a period is Poisson with expected value equal to the played rate multiplied by the length of the time interval, the observations satisfy

$$\begin{aligned} \widetilde{\mu }_i^{\mathcal {T}+1} = \text {br}_i^D(\widehat{\varvec{\lambda }}^\mathcal {T}) + M_{\mu ,i}^{\mathcal {T}+1}, \qquad \widetilde{\lambda }_i^{\mathcal {T}+1} = \text {br}_i^A(\widehat{\varvec{\mu }}^\mathcal {T}) + M_{\lambda ,i}^{\mathcal {T}+1}, \end{aligned}$$

where \(\mathbb {E}(M_{\cdot ,\cdot }^{\tau +1} | \mathcal {F}^\tau ) = 0\) if \(\mathcal {F}^\tau \) denotes the history of the process up to time \(\tau \). The methods of [3] then apply directly to show that the convergence (or otherwise) of the discrete stochastic process is governed by the continuous deterministic differential equations

$$\begin{aligned} \frac{\mathrm{d} \varvec{\lambda }}{\mathrm{d} t} = \text {br}(\varvec{\mu }) - \varvec{\lambda }, \qquad \frac{\mathrm{d} \varvec{\mu }}{\mathrm{d} t} = \text {br}(\varvec{\lambda }) - \varvec{\mu }. \end{aligned}$$
(10)

In standard fictitious play analyses, one might show that solutions of (10) are globally convergent to the equilibrium set. This is commonly achievable only in some classes of games, and since we do not have a zero-sum game we have not been able to show the required global convergence of (10).

4.1 Original FlipIt

The game of FlipIt can be considered a special case of our game of multi-rate (n, t)-FlipThem, obtained by setting \(n=t=1\). This has the advantage that the best responses can be written in closed form, and we can use (10) to set up a two-dimensional ordinary differential equation in terms of the players' rates and time. We start by writing the benefit functions for this special case

$$\begin{aligned} \beta _D (\mu , \lambda ) = 1 - \frac{\lambda }{\mu + \lambda } - d \mu , \qquad \beta _A (\mu , \lambda ) = \frac{\lambda }{\mu + \lambda } - a \lambda . \end{aligned}$$
(11)

Differentiating the players' benefit functions (11) with respect to their respective rates and solving gives the best response functions

$$\begin{aligned} \text {br}^D(\lambda ) = \left( -\lambda + \sqrt{\frac{\lambda }{d}}\right) ^+, \qquad \text {br}^A(\mu ) = \left( -\mu + \sqrt{\frac{\mu }{a}}\right) ^+ \end{aligned}$$

where \((x)^+ = \max (x, 0)\). The ordinary differential equation (10) becomes

$$\begin{aligned} \frac{\mathrm{d}\lambda }{\mathrm{d}t} = \left( -\mu + \sqrt{\frac{\mu }{a}}\right) ^+ - \lambda ,\qquad \frac{\mathrm{d}\mu }{\mathrm{d}t} = \left( -\lambda + \sqrt{\frac{\lambda }{d}}\right) ^+ - \mu . \end{aligned}$$
(12)

The phase portrait of (12) is shown in Fig. 1, where we have used move costs \(d=0.1\) and \(a=0.3\). It is easy to see that the arrows showing the direction of the rates over time converge on a single point. This point is the equilibrium that follows easily from the more general equilibria derived in Sect. 3.2. We can also use Algorithm 1 to plot trajectories of the system in order to view its convergence; the convergence is monotonic and uninteresting, so we omit the plots.

Fig. 1. Phase portrait of (12) with \(d=0.1,\ a=0.3\).
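For completeness, a few lines of Euler integration of (12) reproduce this behaviour (our sketch; the step size, horizon and starting rates are arbitrary), with the rates approaching the interior equilibrium \(\lambda ^* = d/(a+d)^2\), \(\mu ^* = a/(a+d)^2\) obtained from (5) with \(n=1\):

```python
# Our Euler integration of (12); step size, horizon and starting rates are arbitrary.
d, a = 0.1, 0.3
lam, mu = 2.0, 0.2                 # arbitrary starting rates
dt, steps = 0.01, 5000
for _ in range(steps):
    br_A = max(-mu + (mu / a) ** 0.5, 0.0)    # attacker best response to mu
    br_D = max(-lam + (lam / d) ** 0.5, 0.0)  # defender best response to lam
    lam += dt * (br_A - lam)
    mu += dt * (br_D - mu)
# interior equilibrium from (5) with n = 1: lam* = d/(a+d)^2 = 0.625, mu* = a/(a+d)^2 = 1.875
print(lam, mu)
```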

4.2 (n, t)-FlipThem

A multi-resource example in which we retain one rate per player (as opposed to one rate per resource for each player) is given by the situation in [18, 20]: each player chooses a single rate at which to play all resources. Whilst this keeps the system two-dimensional in the multiple-resource case, obtaining explicit best response functions is unfortunately extremely difficult. We therefore revert to Algorithm 1 using time intervals of length 100; we fix this time interval for all further experiments in this article.

Fig. 2. Mean of the defender's and attacker's rate with (3, 2)-threshold and ratio \(\rho = \frac{a}{d} = 1.3\).

As in those previous works, we consider the defender playing each resource at rate \(\mu \) and the attacker playing each resource at rate \(\lambda \). This results in a stationary distribution for the whole system, given by

$$\begin{aligned} \pi = \frac{1}{(\mu +\lambda )^n}\bigg (\mu ^{n},n \cdot \lambda \cdot \mu ^{n-1}, \ldots , \left( {\begin{array}{c}n\\ k\end{array}}\right) \cdot \mu ^{n-k}\cdot \lambda ^k,\ldots , n \cdot \mu \cdot \lambda ^{n-1}, \lambda ^n \bigg ), \end{aligned}$$

where the states correspond to the number of compromised resources, ranging from 0 to n. Benefit functions for both players are given by

$$\begin{aligned} \beta _D(\mu , \lambda )&= 1 - \sum _{i=t}^n \pi _i - d \cdot \mu = 1 - \frac{1}{(\mu + \lambda )^n} \cdot \sum _{i=t}^n \left( {\begin{array}{c}n\\ i\end{array}}\right) \cdot \mu ^{n-i}\cdot \lambda ^i - d \cdot \mu , \\ \beta _A(\mu , \lambda )&= \sum _{i=t}^n \pi _i - a \cdot \lambda = \frac{1}{(\mu + \lambda )^n} \cdot \sum _{i=t}^n \left( {\begin{array}{c}n\\ i\end{array}}\right) \cdot \mu ^{n-i}\cdot \lambda ^i - a \cdot \lambda , \end{aligned}$$

and best responses are calculated by differentiating these benefit functions, as in [20] and Sect. 3. In Fig. 2 we plot the mean rates of the attacker and defender obtained by applying Algorithm 1 to random starting rates for both players, with (3, 2)-threshold and costs \(a=0.65\) and \(d=0.5\). The straight horizontal lines represent the players' Nash equilibrium rates, which can be calculated as a special case of the general equilibrium (9). Note that we have chosen these costs in order to produce positive benefits for both players when playing the calculated Nash equilibrium. We see that both the defender's and attacker's mean rates converge to these lines.
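For reference, the single-rate benefit functions above can be evaluated directly (our sketch; the printed values are illustrative only). For equal per-resource rates the control proportion agrees with the general subset form in (1).

```python
# Our sketch of the single-rate benefit functions; the printed values are illustrative.
from math import comb

def over_threshold(mu, lam, n, t):
    """Proportion of time at least t of the n resources are compromised."""
    return sum(comb(n, i) * mu ** (n - i) * lam ** i for i in range(t, n + 1)) / (mu + lam) ** n

def beta_D_single(mu, lam, d, n, t):
    return 1.0 - over_threshold(mu, lam, n, t) - d * mu

def beta_A_single(mu, lam, a, n, t):
    return over_threshold(mu, lam, n, t) - a * lam

print(beta_D_single(0.5, 0.4, 0.5, 3, 2), beta_A_single(0.5, 0.4, 0.65, 3, 2))
```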

4.3 Multi-rate (n, t)-FlipThem

Finally, we come to the most general case of the paper, Multi-rate (n, t)-FlipThem. As observed previously, depending on a player's belief about their opponent's rates on each resource, they may choose to drop out of playing on a certain resource, or perhaps even all of them. Our best response functions must therefore iterate through all possibilities of setting resource rates to zero and choose the configuration with the highest benefit as the best response. This solution is then used as \(\mu ^{\mathcal {T}+1}\) or \(\lambda ^{\mathcal {T}+1}\) for the following period \(\mathcal {T}+1\).

We want to find a condition giving insight into whether, in this most general setting, our iterative learning rule converges to the equilibria calculated analytically in Sect. 3.2. We experimented with multiple combinations of n and t, randomly simulating the costs of both players. From these empirical results, we observe that convergence occurs whenever the internal fixed point (9) gives non-negative benefits to both players.

Specific Case \((n=3,\ t=2)\): In order to illustrate the outcomes of our fictitious play algorithm we fix the threshold \((n,t)=(3,2)\) and choose two representative cases from our randomly simulated examples. These particular examples were selected for ease of display, allowing us to illustrate their properties of convergence (or divergence) clearly. In the first, the equilibrium benefits are positive and the internal equilibrium is not ruled out; we term this case 'success'. In the second, the internal fixed point is not an equilibrium; we term this case 'failure'.

Success: Our first example is with ratios \((\rho _1, \rho _2, \rho _3) \approx (0.7833, 0.7856, 0.7023)\) and attacker costs \((a_1, a_2, a_3) \approx (0.6237, 0.5959,0.5149)\). Thus, the benefit values at equilibrium are \(\beta _D^{*} \approx 0.0359\) and \(\beta _A^{*} \approx 0.2429\). Since we have positive payoff for both players we expect convergence of the learning algorithm. This is exactly what we observe in Fig. 3; convergence of rates to the lines representing the equilibria calculated in Sect. 3.2.

Fig. 3. Mean of the defender's rates (top) and attacker's rates (bottom) on all resources with (3, 2)-threshold and ratios \((\rho _1, \rho _2, \rho _3) \approx (0.7833, 0.7856, 0.7023)\).

Failure: Our second example shows a lack of convergence when the conditions are not met. Here we choose ratios \((\rho _1, \rho _2, \rho _3) \approx (0.5152, 0.5074, 0.5010)\) and attacker costs \((a_1, a_2, a_3) \approx (0.2597, 0.2570, 0.2555)\). This gives 'equilibrium' benefits for the defender and attacker of \(\beta _D^{*} \approx -0.0354\) and \(\beta _A^{*} \approx 0.4368\). The defender's benefit in this situation is negative, so we expect a lack of convergence. Figure 4 shows the development of both players' rates over time; the rates do not approach the equilibrium. Intuitively this makes sense, as the defender will not choose to remain in a game in which she receives negative payoff when she can drop out of the game completely and receive zero payoff.

Fig. 4. Mean of the defender's rates (top) and attacker's rates (bottom) on all resources with (3, 2)-threshold and ratios \((\rho _1, \rho _2, \rho _3) \approx (0.5152, 0.5074, 0.5010)\).

We can see evidence of this dropping out in Fig. 5, which shows the rates the defender actually plays on each resource (rather than the mean over time). The defender has certain periods where she drops out of the game entirely. The attacker's mean rates then start to drop, until a point at which the defender decides to re-enter the game.

Fig. 5. Defender's actual rates on all resources with (3, 2)-threshold and ratios \((\rho _1, \rho _2, \rho _3) \approx (0.5152, 0.5074, 0.5010)\).

Fig. 6. Two snapshots of the defender's reward surface under a slight perturbation of the attacker's rates. (Left) The maximum of the payoff surface is just above the zero plane. (Right) The whole payoff surface is below the zero plane.

To see a reason for this volatility, Fig. 6 shows the defender's payoff surface for nearby attacker strategies on either side of a 'dropout' event from Fig. 5. We fix the defender's rate on resource 1 and plot the benefit surface as the rates on resources 2 and 3 vary, together with the plane of zero reward. It is easy to observe that the maximum of the reward surface is above zero in the left-hand plot, but a small perturbation of the attacker's rates pushes the maximal defender benefit below zero in the right-hand plot, forcing the defender to drop out entirely. We conjecture that as the ratios decrease (and the game therefore becomes relatively more costly for the defender) the defender drops out of the game more often within this learning environment.
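The kind of check behind Fig. 6 can be sketched as follows (our illustration; the fixed rate on resource 1, the attacker rates and the costs are placeholders rather than the values used for the figure): scan the defender's (3, 2)-threshold benefit over a grid of rates on resources 2 and 3 and ask whether any point lies above zero.

```python
# Our sketch of the check illustrated in Fig. 6; all numerical values are placeholders.
from itertools import combinations
from math import prod
import numpy as np

def beta_D(mu, lam, d, t=2):
    """Defender's benefit for given rate vectors, as in (1)."""
    n = len(mu)
    over = sum(prod(lam[i] / (lam[i] + mu[i]) for i in C) *
               prod(mu[i] / (lam[i] + mu[i]) for i in range(n) if i not in C)
               for k in range(t, n + 1) for C in combinations(range(n), k))
    return 1.0 - over - sum(di * mi for di, mi in zip(d, mu))

d = [0.5, 0.5, 0.5]                       # placeholder defender costs
lam = [0.45, 0.45, 0.45]                  # placeholder attacker rates near a dropout event
grid = np.linspace(0.0, 2.0, 81)          # defender rates on resources 2 and 3
best = max(beta_D([0.3, m2, m3], lam, d) for m2 in grid for m3 in grid)
print("defender stays in" if best > 0 else "defender drops out", best)
```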