Best Reply Player Against Mixed Evolutionarily Stable Strategy

We consider matrix games with two phenotypes (players): one following a mixed evolutionarily stable strategy (ESS) and another one that always plays a best reply against the action played by its opponent in the previous round (best reply player, BR). We focus on iterated games and on well-mixed games with repetition (that is, the mean number of repetitions is positive but finite). In both interaction schemes, there are conditions on the payoff matrix guaranteeing that the best reply player can replace the mixed ESS player. This is possible because best reply players in pairs, individually following their own selfish strategies, develop behavior cycles in which the higher payoff can compensate for their disadvantage against the ESS players. Well-mixed interaction is one of the basic assumptions of classical evolutionary matrix game theory. However, if the players repeat the game with a certain probability, then they can react to their opponents' behavior. Our main result is that the classical mixed ESS loses its general stability in well-mixed population games with repetition, in the sense that it can be overrun by the BR player.


Introduction
In game theory, the Nash equilibrium is an optimal situation in which neither player can benefit by changing her strategy while her opponent keeps hers unchanged. A more general concept of game-theoretical solution is the correlated equilibrium (Aumann 1974, 1987). It is based on the assumption that players choose their actions according to their observation of the value of the same public signal; e.g., they may keep monitoring their opponent's past performance. This raises the question whether reactive players can overcome Nash players from the viewpoint of evolutionary game theory. Consider a sufficiently large asexual population with nonoverlapping generations, where the interaction is well mixed (i.e., in each round, the probability of interactions is proportional to the relative frequency of the phenotypes). For this selection situation, based on the Darwinian tenet, Maynard Smith and Price (1973) introduced the intuitive definition of monomorphic evolutionary stability: a phenotype is evolutionarily stable if a rare enough mutant cannot invade the resident monomorphic population displaying this phenotype. For pairwise interactions, the general formalization of this verbal definition reads as follows. Let W(X, Y) denote the average benefit of phenotype X when interacting with phenotype Y. Then resident phenotype X is said to be evolutionarily stable if for an arbitrary mutant phenotype Y there exists ε₀ ∈ (0, 1) such that for every relative mutant frequency ε ∈ (0, ε₀) we have

(1 − ε)W(X, X) + εW(X, Y) > (1 − ε)W(Y, X) + εW(Y, Y).  (1)

Note that the smallness bound ε₀ may vary with Y.
Starting from a symmetric matrix game, let S = {s_1, . . . , s_d} denote the set of pure strategies; then the phenotypes are described by mixed strategies, i.e., by probability distributions over S. In this case, the phenotype using the mixed strategy p will also be denoted by p if it does not lead to misunderstanding. Suppose that the average benefit of the interaction between individuals is given by a matrix game, i.e., W(p, q) = pAq, where A ∈ R^{d×d} and p, q run over the (d − 1)-dimensional simplex Δ_d = {x ∈ R^d : x_i ≥ 0 for all i, Σ_{i=1}^d x_i = 1}. In this model, the above general definition of ESS (1) reads as follows: p* is an evolutionarily stable strategy (ESS) if a small enough size of the mutant population implies that for every possible mutant strategy q ≠ p* we have

(1 − ε) p*Ap* + ε p*Aq > (1 − ε) qAp* + ε qAq.  (2)

If the pair formation is well mixed, each pair plays a large but the same number of games with each other (i.e., the payoffs are defined as the limit of the average payoffs per round), and each player can only use a genetically fixed mixed or pure strategy in all rounds, then the definition of ESS (2) remains valid. We emphasize that there is an essential difference between the well-mixed (fixed) pair formation and the well-mixed interaction, since in the latter case each player gets a new random opponent from the whole population from round to round, so there is no possibility to either synchronize their actions or react to the partner's strategy. However, it is well known that the properties of iterated games are very different from those of one-shot games (van Damme 1987). On the one hand, Cressman (1992) discussed the twice repeated one-shot game with two strategies. By analyzing the ESSs of this eight-strategy two-round game, he found that using the one-shot ESS in each round was seldom an ESS in the two-round game. In fact, if the one-shot ESS is a mixed strategy, it is never an ESS of the two-round game (personal communication).
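The two ESS conditions can be checked mechanically. The following is a minimal sketch for the 2 × 2 anti-coordination matrix used later in the text (the parametrization A = [[c, b], [c + a, 0]] and the helper `payoff` are our own illustration, not part of the original paper):

```python
# A minimal check of the ESS conditions for an anti-coordination matrix game
# (assumed parametrization A = [[c, b], [c + a, 0]] with a = 1, b = 3, c = 6).
def payoff(p, A, q):
    """Expected payoff p A q for mixed strategies p and q."""
    return sum(p[i] * A[i][j] * q[j] for i in range(2) for j in range(2))

a, b, c = 1, 3, 6
A = [[c, b], [c + a, 0]]
p_star = [b / (a + b), a / (a + b)]   # both rows earn the same against p*

# Nash condition: every pure strategy earns p*Ap* against p*.
w = payoff(p_star, A, p_star)
assert abs(payoff([1, 0], A, p_star) - w) < 1e-12
assert abs(payoff([0, 1], A, p_star) - w) < 1e-12

# Stability condition: p*Aq > qAq for every mutant q different from p*.
for k in range(101):
    q = [k / 100, 1 - k / 100]
    if abs(q[0] - p_star[0]) > 1e-9:
        assert payoff(p_star, A, q) > payoff(q, A, q)
```

Both checks pass, so p* is a totally mixed ESS of this matrix.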
Based on that, here we consider repetitive games of randomly paired individuals, and only concentrate on the one-shot mixed ESS in the hawk-dove game.
On the other hand, pair formation can also be a behavior. Pacheco et al. (2006a, 2006b) studied the consequences of dynamical linking, where the number of repetitions of the interactions between two individuals depends on the payoff from the given interaction, and they found that natural selection favors cooperation over defection.
Furthermore, there are other types of mutants than strategy mutants. In the resident population of the standard behavior ("maximization of own payoff"), the mutants may have different other-regarding preferences (e.g., "to be better than the average," see Garay and Varga 2005), morality (e.g., "to do the right thing," see Alger and Weibull 2012; Weibull 1997), or amorality (e.g., envy, "punish the successful," see Garay and Móri 2011). In these papers, the evolution operates on the different "player types" and not on different strategists only. Now, we also consider two different types of player: the resident is able to use a mixed strategy, while the mutant can use the best reply strategy. We ask which one will win on the evolutionary time scale (i.e., we are looking at the "end of evolution" across generations), and this is the reason why we consider the uninvadability inequality (1) the foundation of the present paper.
Iterated games are widely used in economics and evolutionary game theory. Firstly, remember that the notion of mixed strategy is built on iterated games between two players. Secondly, there exist important studies on evolutionary stability in iterated games. In particular, from the literature on the iterated prisoner's dilemma game (e.g., Axelrod and Hamilton 1981; Hilbe et al. 2013), it is well known that the iteration of the one-shot ESS in the prisoner's dilemma is not an ESS in the iterated prisoner's dilemma (e.g., Bendor and Swistak 1995; Boyd and Lorberbaum 1987; Farrell and Ware 1989; García and van Veelen 2016). Another example is the iterated survival game, where altruism is evolutionarily stable (Garay and Varga 2011; Wakeley and Nowak 2019). Thirdly, the iterated hawk-dove game has already been studied (e.g., Houston and McNamara 1991; Wolf et al. 2011; Morrell and Kokko 2003, 2005; Van Doorn et al. 2003). But, to our knowledge, our selection situation, where best reply players compete with classical mixed ESS players, has not been investigated so far.
A mixed strategy is not reactive in the sense that a mixed strategy user randomly selects a pure strategy from game to game, independently of what her partner has done. Unlike a mixed strategy user, living things respond to stimuli, and if in the repetitive games a "reactive player" can consider her partner's earlier used pure strategy as a stimulus, then the reactions of a reactive player can form a deterministic time series. In other words, a reactive player does not necessarily use a mixed strategy. In a general sense, firstly, in the framework of the iterated prisoner's dilemma game, the well-known "tit for tat" strategy (for other strategies see, e.g., Kendall et al. 2007) and the social norms (e.g., Ohtsuki and Iwasa 2006) belong to the set of reactive strategies. Secondly, the reactive player can imitate or learn (e.g., Hofbauer and Sigmund 1998; Szabó and Hódsági 2016), when she can compare her payoff with other players' payoffs. In the second case, the reactive player's behavior is given by a game dynamics.¹ Now, we focus on the dynamical player as a special case of the reactive player, and our general question arises: Can a mutant dynamic player invade a population of mixed ESS users?
In population games when each player plays many rounds, there are two extreme interaction schemes, namely well-mixed interactions and iterated ones. The former is the basic assumption of the classical ESS. If the population size is large enough (say infinite), the probability that a fixed pair repeats the game is negligible. On the contrary, in an iterated game the two players keep on playing the game exclusively with each other for a very long (virtually infinite) time. However, there is another possibility halfway between these two extreme schemes, where the interaction between two players is repetitive (i.e., the mean number of repetitions between two players is positive but finite), but when the two players finish the interaction, they form new pairs with other players at random, in a well-mixed way. We will call this a well-mixed game with repetition.
One of the possible biological examples for a well-mixed game with repetition is territorial behavior (Morrell and Kokko 2005), when the neighbors are fixed while the floaters randomly arrive to fight for the owners' territory (Varga et al. 2020). Clearly, game repetition opens lots of questions if the players can collect information about their partners and have memory. For instance, if there is a physical difference between players, after a fight the loser can accept hierarchy (Van Doorn et al. 2003), which creates asymmetry in the game, based on the different fighting abilities of the players. Since we would like to avoid asymmetric conflicts, we consider the simplest kind of memory: in the next repeated game each BR player can only remember the pure strategy used by her opponent in the preceding game, but cannot recall the payoff or the outcomes of previous games.² Furthermore, all players have the same fighting ability (Morrell and Kokko 2003).
In what follows, we first concentrate on the iterated game, and then we are going to deal with well-mixed population games with repetition.

Iterated Games
Here, we consider best reply (see, e.g., Hofbauer and Sigmund 1998) as a reaction rule: this is the strategy yielding the most favorable outcome against the opponent's move in the preceding game. The present question is whether a best reply player can invade a monomorphic classical mixed ESS population. Let us consider a symmetric 2 × 2 game (anti-coordination, hawk-dove) with payoff matrix

A = ( c  b ; c + a  0 ),  a, b > 0,

whose unique ESS is the totally mixed strategy p* = (p*_1, p*_2) = (b/(a + b), a/(a + b)), with

W(p*, p*) = p*Ap* = b(a + c)/(a + b).

Now assume the game is played repeatedly, and consider the strategy for the repeated game that starts with a uniform random action and then plays in every round a best reply against the action of the opponent in the previous round. Call this strategy β. Then the (expected) payoff (per round) for β against p* (playing in every round action i with probability p*_i, independently of the past) is again

W(β, p*) = b(a + c)/(a + b) = W(p*, p*).

Recall that, against the totally mixed ESS, all mutants are neutral in the sense that mutants receive the same payoff as the ESS strategist does. Since β against p* uses (in the long run) action 1 with probability p*_2 = a/(a + b) and action 2 with probability p*_1 = b/(a + b), we get

W(p*, β) = (a² − ab + b² + ac)/(a + b).

Finally, β against β either alternates between 11 and 22, or plays 12 forever, or 21 forever, each of the four starting action profiles being equally likely. Thus

W(β, β) = (a + b + 2c)/4.

For c = 0, we have W(β, β) ≤ W(p*, β), with strict inequality if a ≠ b. In this case β cannot invade p*. Now let a < b and c > 0 large. Then W(β, β) > W(p*, β). Indeed, by elementary calculation we obtain that W(β, β) > W(p*, β) if and only if c > (3/2)(b − a). For a numerical example, let a = 1, b = 3 and c = 6. We note that if c is allowed to be negative, then b < a and c < (3/2)(b − a) will also enable β to invade p*. Hence, the memoryless ESS p* loses against the memory-one strategy β. According to (1), the mutant best reply phenotype can replace the classical mixed ESS phenotype p*, since, independently of the relative frequency ε of the mutant, the average fitness of p* is lower than that of the best reply phenotype β (cf. Cressman et al. 2020).
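These long-run per round payoffs can be reproduced exactly by iterating the best reply cycles. Below is a short sketch, assuming the parametrization A = [[c, b], [c + a, 0]] with the numerical example a = 1, b = 3, c = 6 (the variable names are ours):

```python
# Exact long-run per round payoffs in the iterated game, assuming the
# parametrization A = [[c, b], [c + a, 0]] with a = 1, b = 3, c = 6.
a, b, c = 1, 3, 6
A = [[c, b], [c + a, 0]]
br = {0: 1, 1: 0}                # in an anti-coordination game best replies swap

# beta against p*: in the long run beta plays action 0 with probability p*_2
# and action 1 with probability p*_1, so the ESS player earns p* A q*.
p = [b / (a + b), a / (a + b)]
q = [p[1], p[0]]
W_p_beta = sum(p[i] * A[i][j] * q[j] for i in range(2) for j in range(2))

# beta against beta: uniform random start, then deterministic best replies;
# average the per round payoff over the four equally likely starting profiles.
def avg_payoff(i, j, rounds=1000):
    total = 0.0
    for _ in range(rounds):
        total += (A[i][j] + A[j][i]) / 2   # average over the two players
        i, j = br[j], br[i]                # both react to the opponent's move
    return total / rounds

W_beta_beta = sum(avg_payoff(i, j) for i in (0, 1) for j in (0, 1)) / 4

assert abs(W_p_beta - (a*a - a*b + b*b + a*c) / (a + b)) < 1e-9   # = 13/4
assert abs(W_beta_beta - (a + b + 2*c) / 4) < 1e-9                # = 4
assert (W_beta_beta > W_p_beta) == (c > 1.5 * (b - a))            # invasion
```

With these parameters W(β, β) = 4 exceeds W(p*, β) = 3.25, so the BR mutant invades.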
Remark 2 What will change if a small probability of error is allowed? By error, we mean that the player fails to use the strategy she is expected to: her strategy specifies to use pure strategy s_i but she uses another s_j (j ≠ i) instead. This may have several reasons; for example, the player intends to use a certain strategy but, by error, uses another one, or the BR player may erroneously perceive the strategy used by her opponent in the preceding round. The source of error is irrelevant: the only requirements are that (i) the random event of error has to be of sufficiently small probability, and (ii) it has to be independent of the past and present of the process. In particular, it cannot depend on the strategy she otherwise would use.
Let β_δ denote the best reply strategy with an error of probability δ, i.e., in every game, independently, error occurs with the prescribed probability. In our 2 × 2 example, an erroneous best reply means changing to the ESS when playing with an ESS player, and to another cycle when the opponent is another BR player. Indeed, when her opponent is an ESS player, the BR player virtually uses the mixed strategy q* being just the opposite of the ESS in the sense that the probabilities (or equivalently, the pure strategies) are interchanged: q*_1 = p*_2 and q*_2 = p*_1. Thus, an erroneous action means that the probabilities are restored: in that anomalous game the perturbed BR player actually uses the ESS. The case of a BR opponent can be treated analogously. In the latter case W(β_δ, β_δ) = W(β, β), because the strategies of the two best reply opponents remain independent and uniform. By the properties of the mixed ESS,

W(p*, β_δ) = (1 − δ)W(p*, β) + δW(p*, p*).

This is a little bigger than W(p*, β), but still less than W(β, β) = W(β_δ, β_δ) if δ is sufficiently small. Thus, the dominance of the best reply strategy can be preserved.

Fig. 1 Well-mixed game with repetition. In the figure, γ denotes the probability that the given pair repeats the game, the game being a matrix game with payoff matrix A = (a_ij)_{2×2}, where a_11 = c, a_12 = b, a_21 = c + a, a_22 = 0. If γ = 0, then no repetition takes place, that is, pair formation is governed by random mixing. If γ = 1, then the game is iterated, that is, the pairs repeat the game arbitrarily many times. In this case, if c > (3/2)(b − a), the best reply (BR) player phenotype replaces the mixed evolutionarily stable strategist (ESS) phenotype. If the game is repetitive, that is, playing pairs can, but do not necessarily have to, repeat the game, and c > (1 + 1/(2γ))(b − a), then the BR players can replace those following the mixed ESS; in particular, for every γ ∈ (0, 1) the parameters a, b, c of the payoff matrix can be chosen in such a way that this happens.
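The sandwich in Remark 2 can be illustrated numerically. This sketch assumes the error model described there (with probability δ the BR player plays p* instead of q*) for the example a = 1, b = 3, c = 6:

```python
# Numeric illustration of the error calculation in Remark 2 (assumed model:
# with probability delta the BR player plays p* instead of q*).
a, b, c = 1, 3, 6
A = [[c, b], [c + a, 0]]
pay = lambda p, q: sum(p[i] * A[i][j] * q[j] for i in range(2) for j in range(2))
p_star = [b / (a + b), a / (a + b)]
q_star = [p_star[1], p_star[0]]       # the "opposite" of the ESS

W_pp = pay(p_star, p_star)            # 5.25
W_pb = pay(p_star, q_star)            # 3.25
W_bb = (a + b + 2 * c) / 4            # 4.0, from the iterated-game calculation

delta = 0.05
W_pb_err = (1 - delta) * W_pb + delta * W_pp
# a little bigger than W(p*, beta), but still below W(beta, beta):
assert W_pb < W_pb_err < W_bb
```

So for δ = 0.05 the ESS player's payoff against the erroneous BR player rises only to 3.35, and the dominance of the best reply strategy is preserved.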
In order that the cases of iterated and repeated games can easily be compared, we suppose that the players cannot distinguish their opponents; they only know whether the next game will be played with the same opponent and, in that case, what strategy was used by the old-new opponent in the preceding round. Thus, if a BR-BR pair splits up, but they get together again during the well-mixed pair formation process, then both BR players consider the other one a newcomer. This simplifying condition obviously favors the ESS players, as it decreases the number of cycles in BR-BR pairs. Observe that in the present case, the best reply player has two-step behavior cycles; thus, if each pair repeats the interaction with its opponent only once again, then the advantage of best reply players can already occur. Thus, if the one-shot ESS is a mixed strategy, it is not necessarily evolutionarily stable in the two-round game. This advantage may even remain in the case where the average number of (random) repetitions is arbitrarily small, as shown in the next section.
We emphasize that the BR behavior is one of the simplest action-reaction rules, since the BR player can only remember the opponent's pure strategy used in the previous game, but cannot remember who won, for example. Therefore, phenomena such as the loser-winner effect leading to some kind of "asymmetry" (like dominance) cannot emerge in our iterated game (Morrell and Kokko 2003). Consequently, our game is symmetric, since all players either use the mixed ESS or the BR strategy, and there are no more differences among them.

Well-Mixed Games with Repetition: A Particular Case
As we have already mentioned in Sect. 1, iterated games have properties very different from those of one-shot games. Now we are interested in the selection situation where well-mixing interactions and game repetitions occur in a population at the same time (see Fig. 1).
Suppose we have a population of size N , where N is a sufficiently large even number. In the population, pairwise interactions take place; in these interactions a matrix game is played by the pair (the same game at every instance). For the sake of simplicity, we focus on the 2×2 example we dealt with in Sect. 2.1. Time is considered discrete; that is, the history of the population evolves in separate turns. There are two phenotypes present in the population. They are characterized by the strategy they use in the matrix game: phenotype p * follows a mixed ESS strategy (which is assumed to exist), and phenotype β is the best reply player. There are (1 − ε)N and εN of them, respectively (these numbers are even integers). In every turn, there may be pairs who randomly stick together for a repeated game. All the other pairs split up, and the free individuals are to be organized in new pairs for the next round. Pair formation is supposed to be completely random and independent of the past. The event that a given pair keeps together is independent of the past and also of other pairs. The probability of repetition is denoted by γ ∈ (0, 1).
When a pair plays their first game, a best reply player chooses her strategy uniformly at random, but in the subsequent repetitions she applies the pure strategy being a best reply against the action of the opponent in the previous round. Even if a pair splits, they can happen to reunite in the next turn. Should this be the case, the best reply player is supposed to forget the last turn and randomize her strategy as if a new opponent were met. Note that this possibility is very unlikely for large N; thus, the effect of this choice is negligible.
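The pair-formation process described above is easy to simulate. The following Monte Carlo sketch (our own illustration, with hypothetical variable names) tracks only the phenotype composition of the pairs and estimates the long-run fraction z of (β, β) pairs, which should approach ε² by the computation in the proof below:

```python
import random
# Monte Carlo sketch of the pair-formation process: the long-run value of
# z = 2 E(Z) / N should approach eps^2 for large N (assumed parameter values).
random.seed(1)
N, eps, gamma = 1000, 0.5, 0.3
n_br = int(eps * N)

def one_run(rounds=2000):
    players = [1] * n_br + [0] * (N - n_br)   # 1 = BR, 0 = ESS
    random.shuffle(players)
    pairs = [(players[2*i], players[2*i+1]) for i in range(N // 2)]
    count = 0
    for _ in range(rounds):
        kept, free = [], []
        for pr in pairs:
            if random.random() < gamma:        # the pair repeats the game
                kept.append(pr)
            else:                              # the pair splits up
                free.extend(pr)
        random.shuffle(free)                   # well-mixed re-pairing
        pairs = kept + [(free[2*i], free[2*i+1]) for i in range(len(free) // 2)]
        count += sum(1 for x, y in pairs if x == 1 and y == 1)
    return 2 * count / (rounds * N)            # time average of 2 Z / N

z = one_run()
assert abs(z - eps**2) < 0.02                  # z -> eps^2 = 0.25
```

The estimate settles near ε² = 0.25, in line with the equilibrium equation derived in the proof of Theorem 1.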

Theorem 1 Let I denote the open interval with endpoints a and a + c, and let Ī be its closure. Define

Q = (b − a) / (2(c − b + a)).

(a) If b ∈ I and γ > Q, then the best reply strategy displaces the ESS if N is large enough. (b) On the other hand, if b ∉ Ī, or b ∈ I but γ < Q, then the ESS displaces the best reply strategy.
Note that b ∈ I implies Q > 0. With γ = 1, we get back the result of Sect. 2.1. In the numerical example of Sect. 2.1, a repetition probability as small as γ = 1/3 already ensures the best reply player's domination. This means only half a repetition on average.
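For the numerical example, the threshold is easy to evaluate. This sketch assumes the critical repetition probability has the form Q = (b − a)/(2(c − b + a)):

```python
# Threshold check for the numerical example a = 1, b = 3, c = 6, assuming the
# critical repetition probability Q = (b - a) / (2 * (c - b + a)).
a, b, c = 1, 3, 6
Q = (b - a) / (2 * (c - b + a))
assert Q == 0.25                 # BR wins for every gamma above 1/4

gamma = 1 / 3
assert gamma > Q                 # gamma = 1/3 already suffices
# mean number of repetitions at gamma = 1/3 is gamma / (1 - gamma) = 1/2
assert abs(gamma / (1 - gamma) - 0.5) < 1e-12
```

With γ = 1 the condition γ > Q reduces to c > (3/2)(b − a), the iterated-game condition of Sect. 2.1.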
Proof We will prove that in the limit as N → ∞ the average payoff of a BR player is greater or less than that of an ESS player, independently of the relative frequency ε of BR players, according as the conditions in (a) or (b) are satisfied, respectively.
Let Z_n denote the number of (β, β) pairs in round n. Then (Z_n, n = 1, 2, . . .) is a homogeneous Markov chain with state space

S = { k ∈ Z : max(0, (ε − 1/2)N) ≤ k ≤ εN/2 }.

The state space can be obtained in the following way. Clearly, 2k ≤ εN on the one hand. On the other hand, in addition to these k pairs, there are εN − 2k BR players and (1 − ε)N ESS players to be coupled. Then (1 − ε)N ≥ εN − 2k, or equivalently, (ε − 1/2)N ≤ k is needed, otherwise further (β, β) pairs must be formed.
It is thus clear that if there are k pairs of type (β, β), then the number of (p*, β) pairs is εN − 2k, and we also have (1/2 − ε)N + k pairs of type (p*, p*). Since the number of (β, β) pairs determines the composition of all pairs, and this composition determines the joint distribution of the number of different pair types in the next round, the Markov property of the sequence Z_n follows.
Formally, the transition probabilities can be found in the Appendix. (These formulas will not be needed in the sequel.) This Markov chain is irreducible and aperiodic, because all states communicate with each other in a single step: it can happen with positive probability that all pairs split up and reunite in an arbitrarily prescribed way. Thus, there exists a unique equilibrium distribution on S; this is the asymptotic distribution of the chain as the number of turns tends to infinity, irrespective of the initial distribution. For such a Markov chain the strong law of large numbers is valid: the time average of an arbitrary function f : S → R of the states tends almost surely to the expectation of f in the stationary regime.
Whenever two players meet, they play a random number of games, which is geometrically distributed with mean 1/(1 − γ). Let us denote the average per game payoff of each participant of a (β, β) type game sequence by W(β, β). Considering a pair of type (p*, β), the average per game payoff for phenotypes p* and β will be denoted by W(p*, β) and W(β, p*), respectively. Finally, let W(p*, p*) denote the average per game payoff of each player of a (p*, p*) pair during this sequence of repeated games. Then W(β, p*) = W(p*, p*), because the ESS p* is totally mixed, i.e., it assigns positive probabilities to all pure strategies, and in this case qAp* = p*Ap* must hold for all q ∈ Δ_d.
Suppose we are in the stationary regime. Let the number of (β, β) pairs be equal to Z; then there are εN − 2Z pairs of type (p*, β) and (1/2 − ε)N + Z pairs of type (p*, p*). Introduce z = 2E(Z)/N. The average payoff for an ESS player is

W(p*) = [ (ε − z)W(p*, β) + (1 − 2ε + z)W(p*, p*) ] / (1 − ε).  (4)

Furthermore, the average payoff for a best reply player is

W(β) = [ zW(β, β) + (ε − z)W(β, p*) ] / ε.  (5)

Let us compute z, at least asymptotically as N → ∞. We remark here that Benaïm and Weibull (2003) deal with the goodness of deterministic approximations of Markovian stochastic population processes that arise from individual strategy adaptation in finite but large populations. Their general results support the validity of our large N approximations. Note, however, that in their paper the term best reply strategy is used in another sense: not in the context of pairwise interactions, but rather as best reply to the current population state, which is supposed to be observable by the player.
If in a round there are Z pairs of type (β, β), an average number Nzγ/2 of them will stay together. On the other hand, N(ε − z)(1 − γ) pairs of type (p*, β) will split up on the average. Thus, there will be N(1 − γ) free individuals seeking a pair in the next round, with Nε(1 − γ) of them belonging to phenotype β. Now, the mean number of (β, β) pairs they form is asymptotically equal to Nε²(1 − γ)/2 for large N. Being in equilibrium, one obtains the following equation for z:

Nz/2 = Nzγ/2 + Nε²(1 − γ)/2, that is, z = ε².

Substituting this into (4) and (5), we get that the average payoff for an ESS player is

W(p*) = εW(p*, β) + (1 − ε)W(p*, p*),

while the same for a best reply player is

W(β) = εW(β, β) + (1 − ε)W(β, p*).  (6)

In Sect. 2.1, we have already seen that W(β, β) = (a + b + 2c)/4. In this case, the average payoff does not vary during the sequence of repeated games. The case of W(p*, β) is somewhat different. The payoff for phenotype p* in a pair of type (p*, β) is

p*Aū = (c(a + b) + a² + b²)/(2(a + b)), where ū = (1/2, 1/2),

in the first game, and

p*Aq* = (a² − ab + b² + ac)/(a + b)

in further games. Altogether this is

W(p*, β) = (1 − γ) p*Aū + γ p*Aq*

on the average. Thus, the BR strategy is more fruitful than the ESS if

(a + b + 2c)/4 > (1 − γ) p*Aū + γ p*Aq*.

Multiplying this inequality by 4(a + b) and rearranging, we get the condition

2γ(b − a)(c − b + a) > (b − a)².

If (b − a)(c − b + a) < 0, then the left-hand side is negative while the right-hand side is positive, thus the opposite inequality holds: the ESS outperforms the BR strategy. If (b − a)(c − b + a) > 0, that is, b ∈ I, then we can divide both sides by it, and the condition for the dominance of the BR strategy reads γ > Q. Again, the opposite inequality implies the dominance of the ESS.
This was obtained by replacing z with ε 2 , which, in fact, is the limit of z as N → ∞. If inequality γ > Q (or the opposite inequality) holds, then the limit of W ( p * ) is strictly less than (or greater than) that of W (β), therefore the same inequality must hold whenever N is sufficiently large.
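The rearrangement can be cross-checked numerically. This sketch assumes the rearranged condition takes the form 2γ(b − a)(c − b + a) > (b − a)² and compares it against the raw payoff comparison over a grid of repetition probabilities:

```python
# Cross-check: the BR strategy beats the ESS exactly when
# 2*gamma*(b-a)*(c-b+a) > (b-a)^2 (assumed form of the rearranged condition).
a, b, c = 1, 3, 6
A = [[c, b], [c + a, 0]]
pay = lambda p, q: sum(p[i] * A[i][j] * q[j] for i in range(2) for j in range(2))
p = [b / (a + b), a / (a + b)]   # mixed ESS p*
q = [p[1], p[0]]                 # long-run reply distribution of beta against p*
u = [0.5, 0.5]                   # uniform first action of beta
W_bb = (a + b + 2 * c) / 4       # per game payoff in a (beta, beta) pair

for k in range(1, 100):
    g = k / 100                  # repetition probability gamma
    W_pb = (1 - g) * pay(p, u) + g * pay(p, q)   # ESS player's per game payoff
    assert (W_bb > W_pb) == (2 * g * (b - a) * (c - b + a) > (b - a) ** 2)
```

For these parameters both sides of the equivalence switch exactly at γ = 1/4 = Q.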
We have performed simulations to illustrate the speed of convergence of empirical data to the theoretical results in the numerical example of Sect. 2.1. The program codes were written in R. In Fig. 2, four diagrams illustrate the effect of the parameters γ and ε on the difference between the per game average payoffs of the ESS and BR players. Two values of the repetition probability (γ = 0.3 and γ = 0.6) are combined with two values of the BR density (ε = 0.1 and ε = 0.5). In all four cases the population size is equal to 1000, and the number of rounds (games) is 20000. The payoff matrix is

A = ( 6  3 ; 7  0 ).

The green histogram illustrates the distribution of per game average payoffs for the ESS players. The red one is the same for BR players. The two histograms get further and further apart as the probability of repetition or the proportion of BR players grows, implying that the advantage of the best reply strategy becomes more and more significant. Note that a repetition probability as small as γ = 0.3 already ensures the best reply player's dominance. This means less than half a repetition on average. When γ = 0.6, the expected number of repetitions is γ/(1 − γ) = 1.5, still rather small.

Fig. 2 Simulations. In the diagrams, two values of the repetition probability γ are combined with two values of the density ε of the BR players. In all four cases the population size is equal to 1000, and the number of rounds is 20000. The payoff matrix is also fixed: a_11 = 6, a_12 = 3, a_21 = 7, a_22 = 0 (that is, a = 1, b = 3, c = 6). The green (light gray) histogram illustrates the distribution of per game average payoffs for the ESS players. The red (dark grey) one is the same for BR players. The two histograms diverge more and more as the probability of repetition or the proportion of BR players grows, implying that the advantage of the best reply strategy becomes more and more significant (Color figure online)
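The simulations reported above were written in R; the following is an independent Python re-implementation of the same scheme (a hedged sketch with our own variable names and smaller parameters, not the authors' code), which compares the empirical per game averages with the theoretical values 4.600 (ESS) and 4.625 (BR) for γ = 0.3, ε = 0.5:

```python
import random
# Monte Carlo sketch of the repetition model (independent re-implementation;
# parameters N and the number of rounds reduced for speed).
random.seed(7)
A = [[6, 3], [7, 0]]             # a = 1, b = 3, c = 6 (a21 = c + a)
p1 = 0.75                        # ESS plays action 0 with probability 3/4
br = {0: 1, 1: 0}                # best reply map
gamma, eps, N, rounds = 0.3, 0.5, 200, 5000
n_br = int(eps * N)
kinds = [0] * (N - n_br) + [1] * n_br     # 0 = ESS, 1 = BR

def act(kind, seen):
    if kind == 0:                          # ESS: mixed strategy p* every game
        return 0 if random.random() < p1 else 1
    return random.randint(0, 1) if seen is None else br[seen]   # BR player

pay = {0: 0.0, 1: 0.0}
cnt = {0: 0, 1: 0}
pairs = []                       # tuples (i, j, last action of i, last action of j)
free = list(range(N))
for _ in range(rounds):
    random.shuffle(free)                   # well-mixed pairing of free players
    pairs += [(free[2*k], free[2*k+1], None, None) for k in range(len(free)//2)]
    nxt, free = [], []
    for i, j, si, sj in pairs:
        ai, aj = act(kinds[i], sj), act(kinds[j], si)
        pay[kinds[i]] += A[ai][aj]; cnt[kinds[i]] += 1
        pay[kinds[j]] += A[aj][ai]; cnt[kinds[j]] += 1
        if random.random() < gamma:
            nxt.append((i, j, ai, aj))     # the pair repeats the game
        else:
            free += [i, j]                 # both players return to the pool
    pairs = nxt

w_ess, w_br = pay[0] / cnt[0], pay[1] / cnt[1]
# theoretical per game averages here: 4.600 for the ESS, 4.625 for the BR player
assert abs(w_ess - 4.600) < 0.05 and abs(w_br - 4.625) < 0.05
```

Even at this modest size the empirical averages land close to the theoretical values, with the BR players slightly ahead.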

Well-Mixed Games with Repetition: The General Case
Theorem 1 can easily be generalized to more complex games with larger matrices. Suppose the matrix game is defined by a d × d nonnegative invertible matrix A and there exists a totally mixed ESS p*, i.e., every coordinate of p* is positive.
For j = 1, . . . , d, let σ(j) be the row index of the maximal element in column j (one of them, if there are several), i.e., max_{1≤i≤d} a_{i,j} = a_{σ(j),j}. In other words, pure strategy s_{σ(j)} is a/the best reply to pure strategy s_j. Let σ⁰(j) = j for all j = 1, . . . , d, and σ^{k+1}(j) = σ(σ^k(j)) for k ≥ 0. When a BR player is paired with a new partner, in the first game she selects from the pure strategies uniformly at random, then in every repeated game chooses her strategy according to the function σ.
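The map σ and its iterates are straightforward to compute. A short sketch (helper names are ours; ties resolved by the smallest row index):

```python
# The best reply map sigma and its iterates for a general d x d payoff matrix
# (0-indexed; ties broken by the smallest row index).
def sigma_map(A):
    d = len(A)
    return [max(range(d), key=lambda i: A[i][j]) for j in range(d)]

def sigma_iter(sig, j, k):
    """sigma^k(j), the k-fold application of the best reply map."""
    for _ in range(k):
        j = sig[j]
    return j

A = [[6, 3], [7, 0]]        # the 2x2 example: a = 1, b = 3, c = 6
sig = sigma_map(A)
assert sig == [1, 0]        # anti-coordination: the best replies swap actions
assert sigma_iter(sig, 0, 2) == 0   # here sigma^2 is the identity
```

In the 2 × 2 example σ is the transposition of the two actions, which produces the two-step behavior cycles exploited above.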

Theorem 2 Suppose that the inequality

(1 − γ) d⁻¹ p*Ae + γ Σ_{i,j} p*_i p*_j a_{i,σ(j)} < (1 − γ) Σ_{k=0}^∞ γ^k d⁻² Σ_{i,j} [ a_{σ^k(i),σ^k(j)} + a_{σ^k(j),σ^k(i)} ] / 2  (7)

holds, where e = (1, . . . , 1). Then, the best reply strategy displaces the ESS if N is large enough. On the other hand, if in (7) the opposite (strict) inequality is valid, then the ESS displaces the best reply strategy for every sufficiently large N.
Proof The proof of Theorem 1 up to formula (6) does not utilize the particular form of the payoff matrix A; hence, (6) remains true in the present case. All we need is to compute W(p*, β) and W(β, β).
When an ESS player is paired with a BR player, in the first game the latter chooses her action uniformly at random; therefore, the average payoff of the ESS player is d⁻¹ p*Ae. In every subsequent round, if the ESS player chooses strategy s_i and in the preceding round she chose s_j, then her opponent plays s_{σ(j)}, and thus her payoff is a_{i,σ(j)}. This occurs with probability p*_i p*_j. Hence, the average payoff per game is Σ_{i,j} p*_i p*_j a_{i,σ(j)}, and the expected number of such games is γ/(1 − γ). Thus, the left-hand side of (7) is just the per game average payoff of the ESS player. When two BR players form a pair, in the first game they choose their strategies independently and uniformly at random. Suppose they choose s_i and s_j. Then, in the (k + 1)-st game they play s_{σ^k(i)} and s_{σ^k(j)}, respectively, if k is even, and s_{σ^k(j)}, s_{σ^k(i)} if k is odd. Thus, each receives an average payoff of

d⁻² Σ_{i,j} [ a_{σ^k(i),σ^k(j)} + a_{σ^k(j),σ^k(i)} ] / 2,

provided they are still together in the (k + 1)-st game. This happens with probability γ^k. The multiplier 1 − γ appearing in the right-hand side of (7) performs averaging per game. Condition (7) is not so convenient to check in general. However, in the following special case it becomes very simple. Remember that the ESS p* is supposed to be totally mixed. In this case Ap* = (p*Ap*)e; thus, the row sums of A⁻¹ are all positive and p* = (e A⁻¹ e)⁻¹ A⁻¹ e.
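Condition (7) can be evaluated numerically by truncating the geometric series. A hedged sketch for the 2 × 2 example, which should reproduce the γ > 1/4 threshold of the previous section:

```python
# Numeric evaluation of condition (7) for the 2x2 example (geometric series
# truncated at K terms; variable names are ours).
A = [[6, 3], [7, 0]]   # a = 1, b = 3, c = 6, so a21 = c + a = 7
d = 2
p = [0.75, 0.25]       # totally mixed ESS p* = (b/(a+b), a/(a+b))
sig = [1, 0]           # best reply map: the two actions swap

def lhs(g):
    """Per game average payoff of the ESS player in a (p*, beta) pair."""
    first = sum(p[i] * A[i][j] for i in range(d) for j in range(d)) / d
    later = sum(p[i] * p[j] * A[i][sig[j]] for i in range(d) for j in range(d))
    return (1 - g) * first + g * later

def rhs(g, K=200):
    """Per game average payoff in a (beta, beta) pair, series truncated at K."""
    s, sk = 0.0, list(range(d))
    for k in range(K):
        term = sum(A[sk[i]][sk[j]] + A[sk[j]][sk[i]]
                   for i in range(d) for j in range(d)) / (2 * d * d)
        s += g ** k * term
        sk = [sig[x] for x in sk]   # advance sigma^k to sigma^(k+1)
    return (1 - g) * s

# (7) holds exactly when gamma exceeds the threshold 1/4 of Theorem 1
for g in (0.2, 0.24, 0.26, 0.5):
    assert (lhs(g) < rhs(g)) == (g > 0.25)
```

Since σ is a permutation here, the right-hand side collapses to d⁻²eAe = 4 for every γ, which is why the threshold matches Q.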

Theorem 3 Suppose that

(i) σ is a permutation of {1, . . . , d};
(ii) Σ_{i,j} p*_i p*_j a_{i,σ(j)} < d⁻² eAe;
(iii) γ > ( d⁻¹ p*Ae − d⁻² eAe ) / ( d⁻¹ p*Ae − Σ_{i,j} p*_i p*_j a_{i,σ(j)} ).

Then, the BR strategy eventually replaces the ESS if N is large enough.
Proof Since p* is a totally mixed ESS and the uniform strategy ū = d⁻¹e differs from p*, we have p*Aū > ūAū, that is, d⁻¹ p*Ae > d⁻² eAe. Combining this with condition (ii), we can see that the fraction on the right-hand side of (iii) has positive numerator and denominator.
Note that the right-hand side of (iii) is less than 1 by condition (ii); hence, conditions (i)-(ii) always make it possible for the best reply player to outperform the ESS with a finite mean random number of repetitions.
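Assuming conditions of the form (i) σ is a permutation, (ii) Σ_{i,j} p*_i p*_j a_{i,σ(j)} < d⁻²eAe, and (iii) γ above the ratio (d⁻¹p*Ae − d⁻²eAe)/(d⁻¹p*Ae − Σ_{i,j} p*_i p*_j a_{i,σ(j)}) (our reading of the statement), the following sketch checks them for the 2 × 2 example; the resulting threshold should coincide with Q = 1/4 from the previous section:

```python
# Checking conditions of the form (i)-(iii) for the 2x2 example (hedged: the
# threshold obtained this way should coincide with Q = 1/4 of Theorem 1).
A = [[6, 3], [7, 0]]
d = 2
p = [0.75, 0.25]                                  # totally mixed ESS p*
sig = [max(range(d), key=lambda i: A[i][j]) for j in range(d)]
assert sorted(sig) == list(range(d))              # (i): sigma is a permutation

u = sum(p[i] * A[i][j] for i in range(d) for j in range(d)) / d     # d^-1 p*Ae
w = sum(A[i][j] for i in range(d) for j in range(d)) / d ** 2       # d^-2 eAe
v = sum(p[i] * p[j] * A[i][sig[j]] for i in range(d) for j in range(d))

assert u > w                                      # holds for any totally mixed ESS
assert v < w                                      # (ii)
assert (u - w) / (u - v) == 0.25                  # (iii): threshold equals Q
```

Here u = 4.25, w = 4.0 and v = 3.25, so both the numerator and denominator of the fraction are positive and the fraction is below 1, as the proof requires.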

Modification of the Best Reply Strategy
Finally, the following question also arises: If in the game considered above the interaction is well mixed, can the best reply player invade a monomorphic classical mixed ESS population? Clearly, when each best reply player has a randomly chosen new opponent from round to round, then the deterministic behavior cycle of a pair of best reply players does not occur, and the ESS strategy remains better than the uniform random strategy used by the best reply players.
Let us modify the best reply behavior in such a way that the best reply player's next strategy always depends on the strategy her random opponent used in the previous round, even if her opponent has changed. Let us denote this modified strategy (and the corresponding phenotype) by β̃. This time we suppose that the pair formation is well mixed: all pairs split up and the players are re-paired in a completely random way. In the very beginning the modified BR players choose their strategies uniformly at random. As we did in Sect. 2.3, let us fix σ(i) again for every pure strategy s_i ∈ S in such a way that s_{σ(i)} is a best reply to s_i. The modified BR player uses strategy s_{σ(i)} if s_i was the strategy her partner used in the previous game. The modified BR strategy β̃ is better than the ESS if the per game and player average of the payoff of β̃ strategists exceeds that of the ESS players in the long run. In that case the modified BR strategy can replace the ESS.
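The modified strategy is also easy to simulate. This Monte Carlo sketch (an independent illustration with our own variable names) pairs the whole population anew in every round and lets the modified BR players react to the action of their previous, generally different, opponent:

```python
import random
# Monte Carlo sketch of the modified best reply strategy in a fully well-mixed
# population; by Theorem 4 the ESS should keep its advantage.
random.seed(3)
A = [[6, 3], [7, 0]]             # a = 1, b = 3, c = 6
p1 = 0.75                        # ESS plays action 0 with probability 3/4
br = {0: 1, 1: 0}                # best reply map
N, eps, rounds = 200, 0.5, 5000
n_br = int(eps * N)
kinds = [0] * (N - n_br) + [1] * n_br     # 0 = ESS, 1 = modified BR
last_seen = [None] * N           # opponent's action in the previous round
pay = {0: 0.0, 1: 0.0}
cnt = {0: 0, 1: 0}

def act(x):
    if kinds[x] == 0:                      # ESS: randomize by p* every round
        return 0 if random.random() < p1 else 1
    s = last_seen[x]                       # modified BR: react to the action
    return random.randint(0, 1) if s is None else br[s]   # of the old opponent

for _ in range(rounds):
    order = list(range(N))
    random.shuffle(order)                  # completely new random pairing
    for k in range(N // 2):
        i, j = order[2 * k], order[2 * k + 1]
        ai, aj = act(i), act(j)
        pay[kinds[i]] += A[ai][aj]; cnt[kinds[i]] += 1
        pay[kinds[j]] += A[aj][ai]; cnt[kinds[j]] += 1
        last_seen[i], last_seen[j] = aj, ai

w_ess, w_br = pay[0] / cnt[0], pay[1] / cnt[1]
assert w_ess > w_br    # in the well-mixed case the ESS keeps its advantage
```

Without fixed pairs the deterministic cycles never form, and the ESS players stay ahead of the modified BR players, in line with Theorem 4.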

Theorem 4 In a sufficiently large population, the modified best reply strategy cannot displace the (totally) mixed ESS.
Proof In course of the consecutive well-mixed rounds, each best reply player's strategy series is a random sequence of not necessarily independent pure strategies. Since each well-mixed round determines the next strategy distribution of the best reply players, we can introduce a discrete time Markov chain to describe the strategy distribution of the subpopulation of best reply players.
For a formal definition, let H denote the set of indices of pure strategies s_i that are best replies to certain pure strategies, and for i ∈ H let τ_i be a pure strategy against which s_i is optimal. (They need not be all different.) Let N denote the size of the population; we suppose that N is a sufficiently large even number. The states of the discrete time stochastic process κ₂, κ₃, . . . are |H|-dimensional vectors (k_i)_{i∈H} with nonnegative integer coordinates summing up to εN, the number of modified best reply players. Formally, the state space is

K = { (k_i)_{i∈H} : k_i nonnegative integers, Σ_{i∈H} k_i = εN },

and κ_n is the random vector taking its values from K whose i-th coordinate is the number of best reply players playing strategy s_i, i ∈ H, in the n-th round. (n > 1 is needed because in the first round the best reply players do not necessarily use best reply strategies yet.) We will show that this process is a homogeneous Markov chain.
Let μ_2, μ_3, ... be another stochastic process, with d-dimensional state space M, defined as follows. Let the i-th coordinate of μ_n be the number of ESS players playing strategy s_i, 1 ≤ i ≤ d, in the n-th round. Then μ_2, μ_3, ... are identically distributed according to the d-variate multinomial distribution with parameters (1 − ε)N and p*. Moreover, μ_n is independent of κ_2, μ_2, ..., κ_{n−1}, μ_{n−1}, κ_n. The pair (κ_n, μ_n) determines the strategy frequencies in the n-th round. It is clear that the process (κ_n, μ_n), n = 2, 3, ..., is a Markov chain.
We will use the following elementary fact. Let C_1, C_2, ... be a partition of random events that is independent of the event B. Then, by the law of total probability,

P(A | B) = Σ_j P(A | B, C_j) P(C_j | B) = Σ_j P(A | B, C_j) P(C_j).

By using this formula twice, we get

P(κ_n = k_n | κ_{n−1} = k_{n−1}, ..., κ_2 = k_2)
= Σ_{m∈M} P(κ_n = k_n | κ_{n−1} = k_{n−1}, ..., κ_2 = k_2, μ_{n−1} = m) P(μ_{n−1} = m)
= Σ_{m∈M} P(κ_n = k_n | κ_{n−1} = k_{n−1}, μ_{n−1} = m) P(μ_{n−1} = m)
= P(κ_n = k_n | κ_{n−1} = k_{n−1}),

as needed. Now the question is whether this Markov chain is irreducible. In our case, when the ESS players use a totally mixed strategy (one that puts positive weight on each pure strategy) and the interaction is well mixed (that is, each pairing of individuals is equally likely), this Markov chain is irreducible and aperiodic, provided ε ≤ 1/2. Indeed, the transition probability between every pair of states is positive, because with positive probability each best reply player interacts with an ESS player, and these opponents play strategy τ_i (i ∈ H) in exactly as many cases as needed to reach the target state. Thus there exists a unique stationary distribution; let us denote it by {π_k, k ∈ K}. This is also the asymptotic distribution of the Markov chain as the number of rounds tends to infinity. {π_k} is a probability measure on the state space K, which is |H|-dimensional. As the number of rounds tends to infinity, the average number of best reply players following strategy i ∈ H is asymptotically Σ_{k∈K} k_i π_k. Hence, in the long run, the vector of average proportions of pure strategies followed by best reply players is the expectation of the stationary distribution, divided by the number εN of such players. Denote this vector by q*; thus q*_i is the average proportion of best reply players playing strategy s_i.
That is,

q*_i = (1/(εN)) Σ_{k∈K} k_i π_k, i ∈ H.

This is a probability distribution on H, because

Σ_{i∈H} q*_i = (1/(εN)) Σ_{k∈K} (Σ_{i∈H} k_i) π_k = (1/(εN)) εN Σ_{k∈K} π_k = 1.

Let us extend q* to the set of all pure strategies by setting q*_i = 0 for i ∉ H. The existence of this stationary strategy distribution gives an easy way to calculate the average payoff of the phenotypes; thus, we have W(p*, p*) = p*Ap*, W(p*, β) = p*Aq*, W(β, p*) = q*Ap* and W(β, β) = q*Aq*. Hence (2) can be applied; consequently, the mixed ESS p* remains stable. Note that in the long run, the distribution of pure strategies followed by each individual β player is also equal to q*.
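The long-run mix q* and the payoff comparison can also be estimated numerically. The sketch below is a Monte Carlo estimate under assumed parameters, none of which come from the paper: a Hawk–Dove payoff matrix with mixed ESS p* = (2/3, 1/3), N = 100, and ε = 0.2. It simulates the well-mixed rounds, records the empirical strategy mix of the β players after a burn-in, and evaluates the payoff functions W:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hawk-Dove payoff matrix with V = 4, C = 6; mixed ESS p* = (V/C, 1 - V/C).
A = np.array([[-1.0, 4.0],
              [ 0.0, 2.0]])
p_star = np.array([2/3, 1/3])

N, eps = 100, 0.2                  # assumed population size and BR share
n_br = int(eps * N)                # modified BR players are indices 0..n_br-1
sigma = np.argmax(A, axis=0)       # sigma[i] = best reply to pure strategy i

# First round: BR players choose uniformly at random, ESS players sample p*.
moves = rng.integers(0, 2, size=N)
moves[n_br:] = rng.choice(2, size=N - n_br, p=p_star)

T, burn_in = 5000, 500
counts = np.zeros(2)
for t in range(T):
    pairing = rng.permutation(N)   # well-mixed random re-pairing each round
    opponent = np.empty(N, dtype=int)
    opponent[pairing[0::2]] = pairing[1::2]
    opponent[pairing[1::2]] = pairing[0::2]
    opp_moves = moves[opponent]
    new_moves = np.empty(N, dtype=int)
    new_moves[:n_br] = sigma[opp_moves[:n_br]]        # react to last opponent
    new_moves[n_br:] = rng.choice(2, size=N - n_br, p=p_star)  # redraw p*
    moves = new_moves
    if t >= burn_in:
        counts += np.bincount(moves[:n_br], minlength=2)

q_star = counts / counts.sum()     # empirical long-run mix of the BR players
W = lambda x, y: float(x @ A @ y)  # average payoff of mix x against mix y
```

Since p* is an interior Nash equilibrium, every mix earns the same payoff against it, so W(q*, p*) = W(p*, p*); the stability asserted by the theorem then comes from the second-order ESS condition W(p*, q*) > W(q*, q*), which the simulated q* satisfies.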

Discussion and Conclusions
In summary, strictly following the intuitive definition of evolutionary stability (1) by Maynard Smith and Price (1973), we found the following results.
Firstly, we gave an example where the classical mixed ESS p*, defined by (2), is not evolutionarily stable in the framework of iterated 2 × 2 matrix games. The novelty of the present paper is that here the mixed ESS can be replaced by a BR player. The mixed ESS is a Nash equilibrium, and when playing against a best reply opponent, it receives a larger payoff in every round than another best reply player would. Nonetheless, we have shown that best reply players can achieve a larger average payoff than players following the ESS. This is possible because best reply players in pairs, each individually following her own strategy, develop cycles in which the bigger payoff can compensate for their disadvantage compared with the ESS players. This phenomenon is independent of the proportions of types in the population; that is, it is not affected by the frequencies of different pair types. It can even occur when the mean number of repetitions is arbitrarily small, and it can also tolerate a small chance of mistakes.
Secondly, based on the above observation, we also showed that in the well-mixed population game with repetition, the classical mixed ESS loses its overall evolutionary stability. In more detail, if there is but a small probability that two players repeat the matrix game, then, depending on the payoff matrix, it can occur that the BR player outperforms the classical mixed ESS.
Mixed or reactive strategy? Maynard Smith wrote in his book (1982): "Animals do not have roulette wheels in their hands", and he offered genetic and developmental mechanisms which can give rise to variable behavior. We note that individual-based simulations have called attention to the fact that if behavioral strategies are implemented either by a 1 : 1 genotype-phenotype mapping or by a simple neural network, the evolutionary outcomes are different, depending largely on the behavioral and genetic architecture (van den Berg and Weissing 2015). Moreover, it seems that humans can generate random numbers that are uniformly distributed, independent of one another and unpredictable (Persaud 2005). However, a number of studies have shown that people often deviate slightly from the predictions of classical game theory (see Barraclough et al. 2004; Wright and Leyton-Brown 2010, and the references therein). On the other hand, as we mentioned in Sect. 1, when the individuals are supposed to live in a small group, the iterated prisoner's dilemma game is often used in evolutionary biology, where the evolutionary success of reactive strategies (like "tit for tat") over the pure defector is investigated. Moreover, since real populations are finite, the possibility of game repetition cannot be neglected, so considering reactive players against a mixed ESS seems reasonable. The basic intuition behind both research lines (pure or mixed ESS) rests on the fact that in evolutionary game theory there is no biological reason to consider only nonreactive mixed strategists as phenotypes. Finally, the overwhelming majority of biologists agree that living things can respond to stimuli, so considering reactive players is biologically reasonable.
Although our paper strictly belongs to evolutionary game theory, we would like to call attention to possible connections with classical game theory. The evolutionary success of BR is based on the assumption that the games are iterated or well mixed with repetition. In iterated noncooperative games, the following question is quite important (Camerer and Ho 1999): "Which models describe human behavior best?" One possibility is best response (Camerer and Ho 1999), and another is learning (Camerer et al. 2003); both models are based on beliefs about what others will do in the future, formed from past observation. Furthermore, BR is not only a possible human behavior, but it also gives a method to answer another basic question (Camerer and Ho 1999): "How does an equilibrium arise in a noncooperative game?" BR also helps to reach the equilibrium (Ho et al. 1998). Although we only pointed out that the mixed ESS can easily lose its stability against the reactive player, we hope our result may be of interest to game theorists in the field of mathematical economics.
Based on Sects. 2.1 and 2.2, it seems that evolutionary stability of a mixed strategy occurs when the interactions are well mixed in a large enough population (see also Bendor and Swistak 1995; Boyd and Lorberbaum 1987; Farrell and Ware 1989; Garay and Varga 2011; García and van Veelen 2016). There are essentially two ways of weakening this condition. In the first way, interaction between players of the same phenotype is more likely. Observe that here the repetition of a game is not required. For instance, if the probability of interaction between players of the same phenotype is small (e.g., games between relatives; Hines and Maynard Smith 1979), then the classical ESS changes only a little. In other words, under a small change the mixed ESS is structurally stable. However, we note that if the clonal interaction rate is high enough, then the phenotype that maximizes the average fitness of its clone will win under natural selection (Garay and Varga 2011).
In the second way, the players can repeat the game with a certain probability. Our main result is that in this case the classical mixed ESS loses its overall evolutionary stability, since even a small change can make the BR player able to outperform it. This holds only if the payoff matrix is not fixed: for a fixed game, the repetition probabilities at which BR wins are bounded away from zero, but for an arbitrarily small positive probability of repetition one can find payoff matrices for which the classical mixed ESS loses its advantage. This may be called weak structural instability.
Acknowledgements The authors thank Josef Hofbauer for his generous help in preparing the paper, VC for her contribution to the programming work, and RC, ZV and IS for their valuable comments on an earlier version of the paper.
Funding Open access funding provided by ELKH Centre for Ecological Research.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A
In this Appendix, we provide the transition probabilities of the Markov chain (Z_n). Let b_i(m) = C(m, i) (1 − γ)^i γ^{m−i} denote the i-th term of the binomial distribution with parameters m and 1 − γ; this is the probability that exactly i out of m pairs split up. Let κ(m) denote the number of all possible arrangements of 2m players into m pairs, i.e., κ(m) = (2m)! / (2^m m!). Finally, let λ_m(u, v) be the probability that, when u red and v white balls (u + v even) are randomly arranged into pairs, exactly m pairs of (red, red) type are formed (m ≤ u/2). The transition probabilities can then be expressed in terms of b_i(m), κ(m) and λ_m(u, v).
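These three ingredients are easy to compute exactly. The sketch below uses exact rational arithmetic; the function names `split_prob`, `kappa` and `lam` are ours, and the closed form inside `lam` is a standard counting argument (not spelled out in the text), so treat it as an illustrative reconstruction:

```python
from fractions import Fraction
from math import comb, factorial

def split_prob(i: int, m: int, gamma: Fraction) -> Fraction:
    """i-th term of the binomial distribution with parameters m and 1 - gamma:
    the probability that exactly i out of m pairs split up, when each pair
    independently stays together with probability gamma."""
    return comb(m, i) * (1 - gamma)**i * gamma**(m - i)

def kappa(m: int) -> int:
    """Number of arrangements of 2m players into m unordered pairs."""
    return factorial(2 * m) // (2**m * factorial(m))

def lam(m: int, u: int, v: int) -> Fraction:
    """Probability that exactly m (red, red) pairs form when u red and
    v white balls (u + v even) are arranged into pairs uniformly at random.
    Counting: pick the 2m reds that pair among themselves and pair them up,
    match each remaining red with a distinct white, pair the leftover whites."""
    r = u - 2 * m              # reds that must each pair with a white
    w = v - r                  # whites left to pair among themselves (even)
    if r < 0 or w < 0:
        return Fraction(0)
    favorable = (comb(u, 2 * m) * kappa(m)
                 * (factorial(v) // factorial(w))  # ordered choice of whites
                 * kappa(w // 2))
    return Fraction(favorable, kappa((u + v) // 2))
```

For example, with two red and two white balls there are kappa(2) = 3 equally likely pairings, exactly one of which contains a (red, red) pair, so lam(1, 2, 2) = 1/3; and for fixed u, v the values lam(m, u, v) sum to 1 over m.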