Strategic experimentation with asymmetric players

We examine a two-player game with two-armed exponential bandits à la Keller et al. (Econometrica 73:39–68, 2005), where players operate different technologies for exploring the risky option. We characterise the set of Markov perfect equilibria and show that there always exists an equilibrium in which the player with the inferior technology uses a cut-off strategy. All Markov perfect equilibria imply the same amount of experimentation but differ with respect to the expected speed of the resolution of uncertainty. If and only if the degree of asymmetry between the players is high enough, there exists a Markov perfect equilibrium in which both players use cut-off strategies. Whenever this equilibrium exists, it welfare dominates all other equilibria. This contrasts with the case of symmetric players, where there never exists a Markov perfect equilibrium in cut-off strategies.


Introduction
In many instances, the information produced by one agent is interesting to other agents as well. Think, for example, of firms exploring neighbouring oil patches: if one firm strikes oil, chances are there will be oil in its neighbour's patch as well. Such games of purely informational externalities have been analysed by the strategic bandit literature, which so far has only considered the case of homogeneous agents. Often, however, agents differ: one of the oil firms, for example, might be a big multinational that has access to a superior drilling technology. In this article, we analyse the impact of asymmetries in players' exploration technologies in a game of strategic experimentation with two-armed exponential bandits.
The seminal paper by Keller et al. (2005) analyses this problem with homogeneous players. In the current paper, we generalise the analysis by introducing asymmetric players, in the sense that their pay-off arrival rates from a good risky arm differ. This implies that, given the risky arm is good, the expected time needed to learn this differs between the players. As actions and outcomes are perfectly publicly observable, and players start out with a common prior, they will always have a common posterior belief. We characterise the set of Markov perfect equilibria with the players' common posterior belief as the state variable for all ranges of asymmetry between the players. If the degree of asymmetry between the players is sufficiently high, there exists an equilibrium in cut-off strategies, i.e. where both players use a cut-off strategy. That is, either player uses the risky arm if and only if the likelihood he attributes to the option being good is greater than a certain threshold. This equilibrium is unique in the class of equilibria in cut-off strategies. Whenever only one of the players experiments and the other free rides in this equilibrium, it is always the player with the weaker technology who free rides. In the case of homogeneous players (Keller et al. 2005), by contrast, there never exists an equilibrium in cut-off strategies, and players swap the roles of pioneer and free rider at least once in any equilibrium. In our setting, aggregate pay-offs in the equilibrium in cut-off strategies are higher than in any other equilibrium. If the degree of asymmetry is low, at least one player uses a non-cut-off strategy in any equilibrium. In contrast to the homogeneous case (Keller et al. 2005), we furthermore show that more frequent switches of arms do not unambiguously improve the equilibrium welfare with asymmetric players.

Related literature
This paper contributes to the literature on strategic experimentation with bandits, a problem studied widely in economics by, amongst others, Bolton and Harris (1999), Keller et al. (2005), Keller and Rady (2010), Klein and Rady (2011), Klein (2013) and Thomas (2017). In all of these papers, players are homogeneous. Except in Thomas (2017) and Klein and Rady (2011), players' bandits are of the same type. Free riding is a common feature of all these models except Thomas (2017).
Many variants of this problem have been studied in the literature. Rosenberg et al. (2013) and Murto and Välimäki (2011), for instance, assume that switches to the safe arm are irreversible and that experimentation outcomes are private information, while Bonatti and Hörner (2011) and Heidhues et al. (2015) investigate the case of private actions. In Dong (2018), actions and outcomes are public, but one of the players receives an initial private signal. Rosenberg et al. (2007) analyse the role of the observability of outcomes and the correlation between risky-arm types in a setting in which a switch to the safe arm is irreversible. Besanko and Wu (2013) use the Keller et al. (2005) framework to study how an R&D race is impacted by market structure. Das (2019) analyses an R&D race in a strategic bandit setting in which players can learn both privately and publicly on the risky arm. Guo (2016) and Zambrano (2017) analyse the problem of a principal delegating the operation of a two-armed bandit to an agent; in Klein (2016), the bandit the agent operates has three arms. Banks et al. (1997) provide an experimental test of a single-agent two-armed bandit problem; Hoelzemann and Klein (2018) do so in a strategic setting. The paper closest to the present paper is Keller et al. (2005), who find that, with homogeneous players, there is never an equilibrium in cut-off strategies. By contrast, we show that, with heterogeneous players, an equilibrium in cut-off strategies may exist and that it is welfare maximising whenever it exists.
The rest of the paper is organised as follows. Section 2 sets out the model. Section 3 discusses the social planner's solution. A detailed analysis of equilibria for different ranges of heterogeneity is undertaken in Sect. 4. Finally, Sect. 5 concludes. Payoff functions are shown in "Appendix A", while some proofs are relegated to "Appendix B".

Two-armed bandit model with heterogeneous players
There are two players (1 and 2), each of whom faces a two-armed bandit in continuous time. One of the arms is safe, in that a player who uses it gets a flow pay-off of $s > 0$. The risky arm can be either good or bad. Both players' risky arms are of the same type. If the risky arm is good, then a player using it receives a lump sum, drawn from a time-invariant distribution with mean $h > s$, at the jumping times of a Poisson process. The Poisson process governing player 1's arrivals has intensity $\lambda_1 = 1$, while player 2's lump sums arrive according to a Poisson process with intensity $\lambda_2 \in (s/h, 1)$. Thus, a good risky arm gives player 1 (2) an expected pay-off flow of $g_1 = \lambda_1 h = h$ ($g_2 = \lambda_2 h$), with $g_1 > g_2 > s$. The parameters and the game are common knowledge.
The uncertainty in this model arises from the fact that players do not initially know whether their risky arms are good or bad. Players start with a common prior belief $p_0 \in (0, 1)$ that their risky arms are good. Players have to decide in continuous time whether to choose the safe arm or the risky arm. At each instant, players can choose only one arm. We write $k_{i,t} = 1$ ($k_{i,t} = 0$) if player $i \in \{1, 2\}$ uses his risky (safe) arm at instant $t \ge 0$. Players' actions and outcomes are publicly observable, and based on these, they update their beliefs. Players discount the future according to the common discount rate $r > 0$.
Let $p_t$ be the players' common belief that their risky arms are good at time $t \ge 0$. Given player $i$'s ($i \in \{1, 2\}$) actions $\{k_{i,t}\}_{t \ge 0}$, which are required to be progressively measurable with respect to the available information and to satisfy $k_{i,t} \in \{0, 1\}$ for all $t \ge 0$, player $i$'s expected pay-off is given by

$$E\left[\int_0^\infty r e^{-rt}\left[(1 - k_{i,t})\,s + k_{i,t}\,g_i\,p_t\right] dt\right],$$

where the expectation is taken with respect to the processes $\{k_{i,t}\}_{t \ge 0}$ and $\{p_t\}_{t \ge 0}$. As can be seen from the objective function, there are no pay-off externalities between the players. Indeed, the presence of the other player impacts a given player's pay-offs only via the information that he generates, i.e. via the belief.
As mentioned in the Introduction, we will focus our analysis on Markov perfect equilibria with the players' common posterior belief as the state variable. Formally, a Markov strategy of player $i$ is any left-continuous function $k_i : [0, 1] \to \{0, 1\}$, $p \mapsto k_i(p)$ ($i = 1, 2$), that is also piecewise continuous, i.e. continuous at all but a finite number of points.
As only a good risky arm can yield positive pay-offs in the form of lump sums, the arrival of a lump sum fully reveals the risky arm to be good. Hence, if either player receives a lump sum at a time $\tau \ge 0$, then $p_t = 1$ for all $t > \tau$. In the absence of a lump-sum arrival, the belief follows the law of motion

$$dp_t = -(\lambda_1 k_{1,t} + \lambda_2 k_{2,t})\, p_t (1 - p_t)\, dt$$

for a.a. $t$.

Planner's problem
Suppose there is a benevolent social planner, who controls the actions of both players and wants to maximise the sum of their pay-offs. Since the planner's expected pay-off at any point in time only depends on the belief at that time and the belief follows a controlled Markov process, this is a Markov decision problem. Therefore, it is without loss of generality for the planner to restrict himself to Markov strategies $(k_1(p_t), k_2(p_t))$ with the posterior belief $p_t$ as the state variable. The Bellman equation for the planner's problem is given by

$$v(p) = 2s + \max_{(k_1, k_2) \in \{0,1\}^2} \sum_{i=1}^{2} k_i \left[\lambda_i b(p, v) - c_i(p)\right],$$

where we write $v(p)$ for the planner's value function and, like Keller et al. (2005), define the myopic opportunity cost of having player $i$ play risky, $c_i(p) = s - p g_i$, and the corresponding learning benefit $\lambda_i b(p, v)$, with

$$b(p, v) = \frac{p}{r}\left[g_1 + g_2 - v(p) - (1 - p)\,v'(p)\right].$$

Note that the planner's Bellman equation is linear in both $k_1$ and $k_2$, so that our restriction to action plans $\{(k_{1,t}, k_{2,t})\}_{t \ge 0}$ with $k_{i,t} \in \{0, 1\}$ for all $(i, t)$ is without loss in the planner's problem. To state the following proposition, which describes the planner's solution, we define $g = g_1 + g_2$.

Proposition 1 [...] and the value function is [...]

Proof The proof is by a standard verification argument. Please see "Appendix B.1" for details.
By the above proposition, the belief at which player 1 switches to the safe arm in the planner's solution is higher than it would be if both players' Poisson arrival rates were equal to λ 1 = 1. This is because, as player 2's arrival rate λ 2 decreases, the benefit from player 1's experimentation decreases.
The planner's solution is depicted in Fig. 1. The planner's value function is a smooth convex curve which lies in the range $[2s, g]$. At the belief $p_2^*$ ($p_1^*$), player 2 (1) switches to the safe arm.

Non-cooperative game
We will first analyse a player's best responses to a given Markov strategy of the other player.

Best responses
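Fix player $j$'s strategy $k_j$ ($j \in \{1, 2\} \setminus \{i\}$). If the pay-off function from player $i$'s response satisfies the following Bellman equation, player $i$ is playing a best response:

$$v_i(p) = s + k_j(p)\,\lambda_j b_i(p, v_i) + \max_{k_i \in \{0,1\}} k_i \left[\lambda_i b_i(p, v_i) - c_i(p)\right], \qquad (3)$$

where $b_i(p, v_i) = \frac{p}{r}\left[g_i - v_i(p) - (1 - p)\,v_i'(p)\right]$. As before, $\lambda_i b_i(p, v_i)$ can be interpreted as the learning benefit accruing to player $i$ due to his own experimentation, while $\lambda_j b_i(p, v_i)$ is the learning benefit accruing to player $i$ from player $j$'s experimentation. The myopic opportunity cost of experimentation continues to be $c_i(p) = s - g_i p$.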
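For a given $k_j \in \{0, 1\}$, from (3) we know that player $i$'s pay-off function satisfies the Bellman equation if and only if $k_i$ maximises $k_i\left[\lambda_i b_i(p, v_i) - c_i(p)\right]$ over $\{0, 1\}$. If $\lambda_i b_i(p, v_i) > s - g_i p$, then $k_i = 1$ is the unique best response. From (3), we can conclude that this requires

$$v_i(p) > s + k_j \frac{\lambda_j}{\lambda_i}\,(s - g_i p).$$

A similar argument applies for the situations when the best responses are $k_i \in \{0, 1\}$ and $k_i = 0$, respectively. This allows us to infer that

$$k_i \begin{cases} = 0 & \text{if } v_i(p) < s + k_j \frac{\lambda_j}{\lambda_i}(s - g_i p), \\ \in \{0, 1\} & \text{if } v_i(p) = s + k_j \frac{\lambda_j}{\lambda_i}(s - g_i p), \\ = 1 & \text{if } v_i(p) > s + k_j \frac{\lambda_j}{\lambda_i}(s - g_i p). \end{cases}$$

This implies that when $k_j = 1$, player $i$ chooses the risky arm, the safe arm or is indifferent between them depending on whether his value in the $(p, v)$ plane lies above, below or on the line

$$D_i: \quad v = s + \frac{\lambda_j}{\lambda_i}\,(s - g_i p).$$

The single-agent threshold for player $i$ is given by

$$\bar p_i = \frac{\mu_i s}{(\mu_i + 1)(g_i - s) + \mu_i s},$$

where $\mu_i = r/\lambda_i$. In "Appendix A.2", we display the ODEs the players' pay-off functions satisfy, as well as their solutions, for each possible action profile. We start off by showing that, as in the homogeneous case (Keller et al. 2005), no efficient equilibrium exists.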
Proposition 2 In any MPE, both players play safe at all beliefs in $[0, \bar p_1]$. There is thus no efficient MPE.
Proof Suppose to the contrary that $p_l$, the infimum of the set of beliefs at which at least one player plays risky, satisfies $p_l < \bar p_1$. Clearly, $v_i(p_l) = s$ for both $i \in \{1, 2\}$. We shall now distinguish two cases depending on whether or not there exists an $\bar\epsilon > 0$ such that, in any $\epsilon$-right neighbourhood of $p_l$ with $\epsilon \in (0, \bar\epsilon)$, only one player $i$ plays risky. If there does not exist such an $\bar\epsilon > 0$, $i$ is not playing a best response, because $p_l < \bar p_i < s/g_i$ implies that the point $(p_l, s)$ is below the diagonal $D_i$. In the other case, player $i$ faces the same trade-off as a single agent and does not play a best response either, because $p_l < \bar p_i$.
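The single-agent thresholds that drive this argument are easy to evaluate numerically. The following is a minimal sketch, assuming the threshold takes the Keller et al. (2005) cut-off form with player-specific $\mu_i = r/\lambda_i$ and $g_i = \lambda_i h$; the function name and all parameter values are hypothetical and not taken from the paper.

```python
def single_agent_threshold(r, s, g_i, lam_i):
    """Belief at which a lone player with arrival rate lam_i stops experimenting.

    Assumed form: mu_i * s / ((mu_i + 1) * (g_i - s) + mu_i * s), mu_i = r / lam_i.
    """
    mu_i = r / lam_i
    return mu_i * s / ((mu_i + 1.0) * (g_i - s) + mu_i * s)


# Hypothetical parameter values chosen for illustration only.
r, s, h = 1.0, 1.0, 2.0
lam1, lam2 = 1.0, 0.75            # lam2 must lie in (s/h, 1)
g1, g2 = lam1 * h, lam2 * h       # expected flow pay-offs from a good risky arm

p_bar_1 = single_agent_threshold(r, s, g1, lam1)
p_bar_2 = single_agent_threshold(r, s, g2, lam2)
print(p_bar_1, p_bar_2)
```

Under these illustrative values the weaker player's threshold is the higher one, in line with the ordering of the two thresholds used throughout the text.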
In the next subsection, we will characterise the condition under which an equilibrium in cut-off strategies exists.

Equilibrium in cut-off strategies
As we have argued in the proof of Proposition 2, there is no experimentation below the belief $\bar p_1$ in any equilibrium. We will now argue that, in any equilibrium, only player 1 will experiment in some right neighbourhood of $\bar p_1$, implying that player 1 is the last player to experiment in any equilibrium.
By Proposition 2, we know that $v_1(\bar p_1) = v_2(\bar p_1) = s$, and thus, by continuity, both players' value functions must be below their respective diagonals $D_i$ in some neighbourhood of $\bar p_1$. Thus, in any equilibrium, at most one player can play risky in some right neighbourhood of $\bar p_1$. Now, suppose that player 2 is the only player to experiment in some right neighbourhood of $\bar p_1$. Then, the relevant ODE (Equation 13 in "Appendix A.2") gives us that $\lambda_2 \bar p_1 (1 - \bar p_1)\, v_2'(\bar p_1+) = \bar p_1 \lambda_2 (g_2 - s) - r c_2(\bar p_1) < 0$, as $\bar p_1 < \bar p_2$. Thus, player 2's value function drops below $s$ immediately to the right of $\bar p_1$, which contradicts his playing a best response. We can thus conclude that there exists some belief $\tilde p_1 > \bar p_1$ such that, on $(\bar p_1, \tilde p_1)$, player 2 plays safe. As either player can always guarantee himself his single-agent pay-off by ignoring the information he gets for free from the other player, his pay-off in any equilibrium is bounded below by his single-agent pay-off. Thus, in any equilibrium, $v_1 > s$ on $(\bar p_1, \tilde p_1]$, and player 1 experiments, while player 2 free rides, in this range.
In the following proposition, we will show that there exists an equilibrium in cut-off strategies if and only if the degree of asymmetry between the players is high enough.
Proposition 3 There exists a $\lambda_2^* \in (s/h, 1)$ such that there exists an equilibrium in cut-off strategies if and only if $\lambda_2 \in (s/h, \lambda_2^*]$. In this equilibrium, player 1 plays risky on $(\bar p_1, 1]$ and safe otherwise, while player 2 plays risky on $(p_2, 1]$ and safe otherwise.

Proof Please refer to "Appendix B.3".

"Appendix B.4" shows that the belief $p_2$ where player 2 switches to the safe arm in the above equilibrium is strictly greater than $p_2^*$, the threshold in the planner's solution. This shows that for $p \in (p_2^*, p_2]$, player 2 inefficiently free rides.

Footnote 5: $\bar v_1$ and $\bar v_2$ are obtained from Eqs. 14 and 16, respectively, by imposing the condition $\bar v_i(\bar p_1) = s$ ($i = 1, 2$).

Fig. 2 Equilibrium in cut-off strategies
The equilibrium in cut-off strategies is depicted in Fig. 2. In this equilibrium, both players' pay-offs are equal to $s$ for $p \le \bar p_1$. For $p > \bar p_1$, the black curve represents $v_1$ and the red curve represents $v_2$. For $p \in (\bar p_1,$ [...] Player 1's equilibrium pay-off function is (strictly) convex (on $(\bar p_1, 1)$); it is smooth, except for a kink at $p_2$. (For the particular parameter values used in Fig. 2 [...]) To depict this kink in the figure, we have magnified the area around $p = p_2$. In the magnified part, the orange curve represents $\bar v_1$ for $p > p_2$. Player 2's pay-off function is strictly concave on $(\bar p_1, p_2)$ and strictly convex on $(p_2, 1)$; it has an inflection point at $p_2$. It is smooth except for a kink at $\bar p_1$.
Experimentation decisions are strategic substitutes. Therefore in any equilibrium, at the lowest belief where some experimentation takes place, one pioneer is indifferent between choosing the safe and the risky arm, given that the other player is free riding. The free rider can determine a threshold belief p 2 where he is indifferent between choosing the safe arm and the risky arm, given that the pioneer is choosing the risky arm for all beliefs between the lowest cut-off and p 2 . This implies that for beliefs just above p 2 , the free rider finds it beneficial to experiment irrespectively of the action of the pioneer. When players are homogeneous, their free riding opportunities are the same. At p 2 , the pioneer's pay-off is less than that of the free rider as experimentation is costly. Thus, for beliefs just above p 2 , the pioneer has an incentive to free ride, given that the free rider experiments. This explains [as shown in Keller et al. (2005)] why there does not exist an equilibrium where both players use cut-off strategies. However, the free riding opportunities are different for heterogeneous players. As explained above, in any equilibrium the pioneer is always the player with the higher productivity (player 1). The lower player 2's productivity, the less player 1 has an incentive to free ride on player 2's experimentation. If player 2's productivity is very low, player 1 no longer has any incentive to free ride on 2's experimentation for beliefs right above p 2 . This intuitively explains the result of Proposition 3.
Geometrically, the diagonals $D_1$ and $D_2$ in Fig. 2 do not coincide when players are asymmetric. As the proof of Proposition 3 shows, the condition for existence of an equilibrium in cut-off strategies is precisely that player 2 will enter the region in which risky is dominant at a more optimistic belief than player 1. This is possible if and only if the region in which risky is dominant for player 2 is sufficiently small relative to that of player 1, i.e. if and only if $\lambda_2$ is small enough compared to $\lambda_1 = 1$.
In Sect. 4.5, we show that if the players' learning speeds are different while the expected flow pay-off from the good risky arm is the same, there again exists an equilibrium in cut-off strategies if and only if the difference in the learning speeds is high enough. The same qualitative result obtains for identical learning speeds but different expected pay-offs from the good risky arm. Indeed, either form of asymmetry creates differences in the players' free riding incentives. Diagrammatically, this can be seen by a gap between the best response diagonals.
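The gap between the best-response diagonals can be made concrete with a small numerical sketch. It assumes that $D_i$ is the line $v = s + (\lambda_j/\lambda_i)(s - g_i p)$, a form obtained here by rearranging the best-response condition rather than quoted from the paper; the function names and parameter values are hypothetical.

```python
def diagonal(p, s, g_i, lam_i, lam_j):
    """Height of the assumed diagonal D_i at belief p: s + (lam_j / lam_i) * (s - g_i * p)."""
    return s + (lam_j / lam_i) * (s - g_i * p)


def best_response_to_risky(v, p, s, g_i, lam_i, lam_j, tol=1e-12):
    """Player i's best response when the opponent plays risky, given the point (p, v)."""
    gap = v - diagonal(p, s, g_i, lam_i, lam_j)
    if gap > tol:
        return "risky"        # (p, v) lies above D_i
    if gap < -tol:
        return "safe"         # (p, v) lies below D_i
    return "indifferent"      # (p, v) lies on D_i


# Hypothetical parameter values chosen for illustration only.
s, h = 1.0, 2.0
lam1, lam2 = 1.0, 0.75
g1, g2 = lam1 * h, lam2 * h

p = 0.4
d1 = diagonal(p, s, g1, lam1, lam2)   # diagonal of the stronger player
d2 = diagonal(p, s, g2, lam2, lam1)   # diagonal of the weaker player
print(d1, d2)

# With identical players the two diagonals coincide.
d_sym = diagonal(p, s, g1, lam1, lam1)
```

In this illustration the weaker player's diagonal lies above the stronger player's at the chosen belief, so a common pay-off level can make risky a best response for player 1 while safe remains a best response for player 2.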

Equilibria in non-cut-off strategies
In the previous subsection, we have identified a necessary and sufficient condition for the existence of an equilibrium in cut-off strategies. In this subsection, we will analyse equilibria where at least one of the players uses a non-cut-off strategy. To begin with, we show that even for low degrees of asymmetry, there exists an equilibrium where player 2 uses a cut-off strategy.
Proposition 4 There exists an equilibrium in which only player 2 uses a cut-off strategy if and only if $\lambda_2 > \lambda_2^*$. In this equilibrium, the cut-off for player 2's strategy is $p_2$. Player 1 plays risky on $(\bar p_1, p_2] \cup (p^1_s, 1]$ and safe otherwise, where $p^1_s > p_2$ is the belief at which player 1's pay-off function and $D_1$ intersect.
Proof Please refer to "Appendix B.5".
The equilibrium where only player 2 uses a cut-off strategy is depicted in Fig. 3. The black and the orange curves depict the pay-offs to player 1 and 2, respectively. As the degree of asymmetry between the players is low, $p_1 > p_2$, and hence, an equilibrium where both players use cut-off strategies does not exist. In Fig. 3, we magnify the part around $p = p_2$. We do not show $p_1$ in the figure, but for the parameter values used in Fig. 3, we have $p^1_s = 0.4740 < 0.4754 = p_1$. At $p = p_2$, both $v_1$ and $v_2$ have a kink. To the immediate right of $p_2$, $v_1$ becomes concave and $v_2$ becomes convex. $v_2$ remains convex for all $p > p_2$, but has a kink at $p = p^1_s$. $v_1$ has an inflection point at $p = p^1_s$ and smoothly becomes convex at this belief.

Propositions 3 and 4 together imply that there always exists an equilibrium where player 2 uses a cut-off strategy with $p_2$ as the cut-off. Indeed, as argued in the previous subsection, in any equilibrium, $\bar p_1$ is the lowest belief where some experimentation takes place and only player 1 experiments at beliefs just above $\bar p_1$. By the same token, risky is player 2's best reply at all beliefs above $p_2$, given player 1 plays risky on [...] When the degree of asymmetry is low, there will exist a range of beliefs just above $p_2$ where player 1 free rides. Thus, player 1 uses a non-cut-off strategy. This explains the result of Proposition 4. In the limit $\lambda_2 \downarrow \lambda_2^*$, the range above $p_2$ where player 1 free rides vanishes, and hence, the equilibrium described in Proposition 4 coincides with the equilibrium in cut-off strategies.
Equilibria where at least one player uses a non-cut-off strategy always exist, as the following proposition shows. Together with Proposition 3, it fully characterises the set of all Markov perfect equilibria. To state the proposition, we let $v_i$ be player $i$'s equilibrium pay-off. For both players $n \in \{1, 2\}$, we define $p^n_S$ as the (unique) point of intersection of $v_n$ and $D_n$. Let $p^i_S = \min\{p^1_S, p^2_S\}$ and $p^j_S = \max\{p^1_S, p^2_S\}$.

Proposition 5 For any $\lambda_2 \in (s/h, 1)$, there exists a continuum of Markov perfect equilibria in which at least one player uses a non-cut-off strategy. For each integer $l > 1$ and each sequence of threshold beliefs $(\tilde p_i)_{i=1}^{l}$ such that $\bar p_1 < \tilde p_1 < \cdots < \tilde p_l = p^i_S$, there exists an equilibrium such that both players play safe at all beliefs $p \le \bar p_1$; player 1 plays risky and player 2 plays safe in $(\bar p_1,$ [...]

Proof That the proposed strategies are mutually best responses immediately follows from our discussion at the top of Sect. 4. That such equilibria always exist follows immediately from the continuity of players' pay-off functions and the fact that [...]

When the degree of asymmetry is low, it is easy to observe that both players have incentives for free riding just below $p_2$; i.e. safe and risky are mutually best responses in this region. Although an increase in the degree of asymmetry reduces the free riding incentives for player 1, they never vanish completely. Therefore, there will always be a range just above $\bar p_1$ where safe and risky are mutually best responses. Hence, equilibrium allows players to take turns in experimenting at arbitrary beliefs in $(\bar p_1, p_2)$. This explains the result of Proposition 5.
As $\bar p_1 < \bar p_2$, the proposition implies that there exist equilibria in which player 2 experiments below his single-agent threshold $\bar p_2$. Indeed, by being the last player to experiment on $(\bar p_1, \tilde p_1]$, player 1 provides an encouragement effect to player 2: the latter is willing to play risky just above $\tilde p_1$, below his single-agent threshold, only because he knows that, should his experimentation not be successful, he will get to free ride on player 1's experimentation once the belief has dropped to $\tilde p_1$.

Welfare rankings of equilibria
As in Keller et al. (2005), there are two potential sources of inefficiency in our model: players might not produce enough information, and/or they might produce the information too slowly. In order to analyse these different effects, we define the experimentation intensity at time $t \ge 0$ as $K_t = \lambda_1 k_{1,t} + \lambda_2 k_{2,t}$, and the integral $\int_0^T K_t\,dt$ as the amount of experimentation up to time $T$. Keller et al. (2005), by contrast, define the experimentation intensity at time $t \ge 0$ as $k_{1,t} + k_{2,t}$, and the amount of experimentation up to time $T$ as $\int_0^T (k_{1,t} + k_{2,t})\,dt$. Thus, we measure the output of players' experimentation efforts, with our measure taking into account that it matters for the information-production process which player invests time in the risky arm. The corresponding concepts in Keller et al. (2005), by contrast, measure the input, i.e. the overall resources spent on producing information. In the case of homogeneous players with productivities $\lambda$, the input, as indicated by their measure, of course corresponds to $1/\lambda$ times the output, as indicated by our measure. The following result mirrors the finding in Keller et al. (2005) (see their Lemma 3.1 in conjunction with their Propositions 5.1 and 6.1) that the amount of experimentation is the same in any Markov perfect equilibrium. This implies that the welfare ranking of equilibria is solely determined by the delay in information production.
Lemma 2 Suppose there is no success on the risky arm. Then, the amount of experimentation is the same in any Markov perfect equilibrium.
Proof As we have seen from our characterisation of equilibria, experimentation stops at $\bar p_1$ in any equilibrium. By Bayes' rule, the law of motion of the belief conditional on no success is given by $dp_t = -K_t\, p_t (1 - p_t)\, dt$, so that $\ln\frac{1 - p_t}{p_t}$ grows at rate $K_t$. Thus, conditionally on no success and for a prior $p_0 > \bar p_1$, the amount of experimentation in any Markov perfect equilibrium is given by

$$\int_0^\infty K_t\, dt = \ln\frac{1 - \bar p_1}{\bar p_1} - \ln\frac{1 - p_0}{p_0},$$

which concludes the proof.
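The accounting behind Lemma 2 can be checked numerically. The sketch below assumes only the no-success law of motion $dp_t = -K_t\, p_t (1 - p_t)\, dt$ and compares the log-odds difference it implies with an Euler simulation under two different action profiles that both stop at the same belief. The stopping belief, the switching beliefs and the arrival rate 0.75 are hypothetical illustration values, not equilibrium objects computed from the paper.

```python
import math


def log_odds_against(p):
    """ln((1 - p) / p), which grows at rate K_t while no lump sum arrives."""
    return math.log((1.0 - p) / p)


def amount_closed_form(p0, p_stop):
    """Total amount of experimentation needed to move the belief from p0 down to p_stop."""
    return log_odds_against(p_stop) - log_odds_against(p0)


def amount_simulated(p0, p_stop, intensity, dt=1e-4):
    """Euler simulation of dp = -K(p) p (1 - p) dt; intensity(p) must be positive above p_stop."""
    p, amount = p0, 0.0
    while p > p_stop:
        k_total = intensity(p)
        p -= k_total * p * (1.0 - p) * dt
        amount += k_total * dt
    return amount


# Hypothetical profiles: lam1 = 1, lam2 = 0.75, so both risky gives intensity 1.75.
def cutoff_like(p):
    return 1.75 if p > 0.5 else 1.0            # player 1 alone at low beliefs


def role_swapping(p):
    if p > 0.55:
        return 1.75                            # both players experiment
    return 0.75 if p > 0.4 else 1.0            # player 2 alone, then player 1 alone


p0, p_stop = 0.7, 1.0 / 3.0
exact = amount_closed_form(p0, p_stop)
sim_a = amount_simulated(p0, p_stop, cutoff_like)
sim_b = amount_simulated(p0, p_stop, role_swapping)
print(exact, sim_a, sim_b)
```

The two profiles use different players at intermediate beliefs, so they differ in how long it takes to reach the stopping belief, but the accumulated amount agrees with the closed form up to discretisation error.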
In the following proposition, we establish that in any equilibrium in which players swap the roles of pioneer and free rider at least once, player 1's (2's) pay-off will hit D 1 (D 2 ) at a more pessimistic (optimistic) belief than in the equilibrium in cut-off strategies.
Proposition 6 Consider any equilibrium described in Proposition 5. Suppose $p^1_S > \bar p_1$ is the belief at which the equilibrium pay-off of player 1 meets the line $D_1$ and $p^2_S > \bar p_1$ is the belief at which the equilibrium pay-off of player 2 meets the line $D_2$. Then, we have $p^1_S < p_1$. For $l > 1$, we have $p^2_S > p_2$, and for $l = 1$, $p^2_S = p_2$.
Proof Please refer to "Appendix B.6".
In the equilibrium in cut-off strategies, player 2 free rides for all beliefs in $(\bar p_1, p_2]$. However, in any other equilibrium there exists some open subset of $(\bar p_1, p_2)$ where he experiments and player 1 free rides. Thus, for all $p \in (\bar p_1, p_2]$, the equilibrium in cut-off strategies gives the highest pay-off to player 2, as he can free ride on the more productive player's experimentation. This implies that, in the range $p \in (\bar p_1, p_2]$, player 2's pay-off function in any non-cut-off equilibrium lies below his pay-off in the cut-off equilibrium and will therefore intersect the diagonal $D_2$ at a belief higher than $p_2$. This explains why we have $p^2_S > p_2$. On the other hand, for all $p \in (\bar p_1, p_1]$, player 1 experiments in the equilibrium in cut-off strategies and receives his single-agent pay-off. In any other equilibrium, however, there exists some open subset of $(\bar p_1, p_1)$ where his single-agent optimal action is not a best response, and his equilibrium pay-off is therefore higher. Thus, as player 1's pay-off is lowest in the equilibrium in cut-off strategies, we have $p^1_S < p_1$.

Suppose $\lambda_2 \in (s/h, \lambda_2^*]$. This implies that the equilibrium in cut-off strategies exists. In the following proposition, we show that the equilibrium in cut-off strategies strictly welfare dominates all other equilibria.

Proposition 7 Suppose $\lambda_2 \le \lambda_2^*$ and let $v^c_{agg}$ be the aggregate equilibrium pay-off in the equilibrium in cut-off strategies and $v^{nc}_{agg}$ be the aggregate equilibrium pay-off in an arbitrary equilibrium in non-cut-off strategies. Then, $v^c_{agg} \ge v^{nc}_{agg}$, with the inequality strict on $(\bar p_1, 1)$.
First, observe that in the equilibrium in cut-off strategies, both players experiment for beliefs greater than $p_2$. Since $p^2_S > p_2$ (by Proposition 6), the range of beliefs where both players experiment is largest in the equilibrium in cut-off strategies. Next, in the equilibrium in cut-off strategies, whenever only one player experiments, it is the player with the higher pay-off arrival rate, player 1. In any other equilibrium, however, there is a range of beliefs where player 2 plays the role of the lonely pioneer. Since player 1 is more productive and, in any equilibrium, all experimentation ceases at $\bar p_1$, information is most efficiently generated in the equilibrium in cut-off strategies. This intuitively explains the result of Proposition 7. One can observe that, since, at any belief, the intensity of experimentation is highest in the equilibrium in cut-off strategies, information generation is fastest. Thus, this equilibrium involves the least delay. As experimentation amounts are the same in all equilibria (Lemma 2), this implies that the cut-off equilibrium welfare dominates all other equilibria.

Footnote 13: Dong (2018) shows that if the players' initial beliefs are asymmetric enough, equilibrium welfare improves.

The comparison between the equilibrium in cut-off strategies and an equilibrium in which players swap roles once is depicted in Fig. 4. Figure 4a, b depicts the actions of players in the equilibrium in cut-off strategies and the equilibrium in non-cut-off strategies, respectively. These equilibria correspond to the ones depicted in Fig. 4. The thick purple curve ($v_1$) and the black curve ($v_2$) in Fig. 4 depict the pay-offs to player 1 and 2, respectively, in the equilibrium in cut-off strategies.

Footnote 14: Proposition 6 implies that the qualitative characteristics of $p^1_s$ and $p^2_s$ are the same in any equilibrium in non-cut-off strategies. For simplicity, we consider an equilibrium in non-cut-off strategies where players swap roles only once in the figure.
In the equilibrium in non-cut-off strategies, pay-offs coincide for beliefs less than or equal to $\tilde p_1$. At $\tilde p_1$, players switch arms. The thin blue curve depicts the pay-off to player 1, and the thin yellow curve depicts the pay-off to player 2 in the equilibrium in non-cut-off strategies for $p > \tilde p_1$. As argued, the blue curve meets the line $D_1$ at a belief $p^1_S$, which is strictly less than $p_1$. In the region $(\tilde p_1, p^1_S]$, player 2 experiments and player 1 free rides. At $p^1_s$, player 1 switches to the risky arm and player 2 switches to the safe arm. When the red curve meets the line $D_2$ at $p^2_s > p_2$, player 2 switches to the risky arm again. Notice that in the equilibrium in non-cut-off strategies, player 2's pay-off is negatively sloped in a right neighbourhood of $p = \tilde p_1$. Indeed, in the current example, we have $\tilde p_1 = 0.39 < 0.4054 = \bar p_2$, where $\bar p_2$ is the single-agent threshold for player 2. This means that, in the equilibrium in non-cut-off strategies, player 2 is forced to act as the lonely pioneer to the left of his single-agent cut-off, which makes his pay-off negatively sloped.

When $\lambda_2 > \lambda_2^*$, the equilibrium in cut-off strategies does not exist. However, the argument in the proof of Proposition 7 allows us to show that, on $(\bar p_1, p_2]$, the equilibrium of Proposition 4, which is the only equilibrium in which player 1 is experimenting throughout this range, strictly welfare dominates all other equilibria. Indeed, with heterogeneous players, more frequent switches have the effect of replacing experimentation by the strong player with experimentation by the weak player in some open subset of $(\bar p_1, p_2)$, thereby delaying information production in this range. Thus, even though more frequent switches can expand the range of beliefs where both players experiment, there is always a welfare loss in the range $(\bar p_1, p_2]$. Hence, if players switch the roles of pioneer and free rider more frequently, the equilibrium welfare is not unambiguously improved.
This is in contrast to the case with homogeneous players (Keller et al. 2005), where the only effect of increasing the frequency of switches is to expand the range of beliefs where both players experiment, thus unambiguously speeding up information production and improving equilibrium welfare. Yet, we have not been able to establish that the equilibrium of Proposition 4 is globally welfare maximising.
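The negative slope of the lonely pioneer's pay-off below his single-agent cut-off, noted in the discussion of Fig. 4, can be illustrated numerically. The sketch assumes the single-agent functional form $v(p) = g_i p + C(1 - p)\left(\frac{1 - p}{p}\right)^{r/\lambda_i}$ with the constant $C$ pinned down by $v = s$ at the belief where the player starts acting alone; the function names and parameter values are hypothetical, and the figures 0.39 and 0.4054 from the text are not reproduced.

```python
def lone_pioneer_value(p, p_start, r, s, g_i, lam_i):
    """Assumed pay-off g_i*p + C*(1-p)*((1-p)/p)**(r/lam_i), with C set by v(p_start) = s."""
    mu = r / lam_i

    def shape(q):
        return (1.0 - q) * ((1.0 - q) / q) ** mu

    c_const = (s - g_i * p_start) / shape(p_start)
    return g_i * p + c_const * shape(p)


def slope_at_start(p_start, r, s, g_i, lam_i, eps=1e-6):
    """Central finite-difference slope of the assumed pay-off at p_start."""
    up = lone_pioneer_value(p_start + eps, p_start, r, s, g_i, lam_i)
    down = lone_pioneer_value(p_start - eps, p_start, r, s, g_i, lam_i)
    return (up - down) / (2.0 * eps)


# Hypothetical parameter values chosen for illustration only.
r, s, h, lam2 = 1.0, 1.0, 2.0, 0.75
g2 = lam2 * h
mu2 = r / lam2
p_bar_2 = mu2 * s / ((mu2 + 1.0) * (g2 - s) + mu2 * s)   # assumed single-agent threshold

slope_at_threshold = slope_at_start(p_bar_2, r, s, g2, lam2)
slope_below = slope_at_start(0.45, r, s, g2, lam2)        # 0.45 lies below p_bar_2 here
print(p_bar_2, slope_at_threshold, slope_below)
```

At the single-agent threshold the slope vanishes, which is the usual smooth-pasting property, while value matching at a more pessimistic belief produces a strictly negative slope under these illustrative values.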

Learning rates versus pay-offs
In our baseline model, we have considered asymmetric Poisson arrival rates only. However, since the expected lump-sum pay-off from the good risky arm was the same for both players, the asymmetry in learning rates implied that the expected flow pay-off from a good risky arm was also different across the players. In this subsection, we will analyse a model where learning rates differ, but the expected flow pay-off from a good risky arm is the same for both players. Define $\hat g = \lambda_1 h_1$, where $\lambda_1 = 1$ and $h_1 > 0$. For any $\lambda_2 \in (0, 1)$, we choose an $h_2 > 0$ such that $\lambda_2 h_2 = \hat g$.
We will first analyse the social planner's problem. Please refer to "Appendix (B.10)" for the explicit form of the Bellman equation for the planner's value function w. The following proposition will show that the structure of the planner's solution is the same as in Proposition 1.

Proposition 8 The planner's optimal policy k
and the value function is

16 Mathematically, this can be seen as follows: consider a function $v = g_2 p + C(1 - p)\left(\frac{1-p}{p}\right)^{r/\lambda_2}$, such that the integration constant is derived from $v(\tilde p_1) = s$. Since $\tilde p_1 < \bar p_2$, direct computation shows that $v'(\tilde p_1) < 0$. In the equilibrium in non-cut-off strategies, to the immediate right of $\tilde p_1$, player 2's pay-off is given by $v_2 = g_2 p + c_2 (1 - p)\left(\frac{1-p}{p}\right)^{r/\lambda_2}$. The integration constant $c_2$ is determined from $v_2(\tilde p_1) = \bar v_2(\tilde p_1) > s \Rightarrow c_2 > C$. Direct computation shows that this implies that player 2's pay-off will be negatively sloped in some right neighbourhood of $\tilde p_1$.
Proof The proof is by a standard verification argument. Please see "Appendix B.8" for details.
We will now analyse the non-cooperative game. Please refer to "Appendix (B.10)" for the explicit form of the Bellman equation that player $i$'s ($i = 1, 2$) value function $w_i$ satisfies.
The single-agent thresholds are $\bar p_i = \frac{rs}{rs + (r + \lambda_i)(\hat g - s)}$. It can be verified that $\bar p_1 < \bar p_2$. As in the baseline model, we can argue that in any equilibrium, $\bar p_1$ is the lowest belief where some experimentation takes place and player 1 is the last one to experiment. This implies that, in any equilibrium, for beliefs right above $\bar p_1$, pay-offs to players 1 and 2 are given by $\bar w_1(p)$ and $\bar w_2(p)$, respectively.$^{17}$ It can be verified that $\bar w_1$ is strictly convex and $\bar w_2$ is strictly concave. By arguments similar to those in Lemma 1, we can infer that there exists a unique $\hat p_1 \in (\bar p_1, 1)$ such that $\bar w_1(\hat p_1) = D_1(\hat p_1)$ and a unique $\hat p_2 \in (\bar p_2, \frac{s}{\hat g})$ such that $\bar w_2(\hat p_2) = D_2(\hat p_2)$. In the following proposition, we establish that an equilibrium in cut-off strategies exists if and only if the degree of asymmetry is high enough.
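As a quick numerical illustration of the single-agent threshold formula stated above, the following Python sketch evaluates it for both players. The function name `single_agent_threshold` and the parameter values are assumptions made purely for illustration; they are not taken from the paper's numerical examples.

```python
def single_agent_threshold(r, s, g_hat, lam):
    """Single-agent threshold r*s / (r*s + (r + lam) * (g_hat - s))."""
    return r * s / (r * s + (r + lam) * (g_hat - s))


# Hypothetical parameters: discount rate r, safe flow pay-off s,
# common expected flow pay-off g_hat of a good risky arm (g_hat > s).
r, s, g_hat = 1.0, 1.0, 2.0
lam1, lam2 = 1.0, 0.5  # player 1 learns faster than player 2

p_bar_1 = single_agent_threshold(r, s, g_hat, lam1)
p_bar_2 = single_agent_threshold(r, s, g_hat, lam2)

print(f"threshold of player 1: {p_bar_1:.4f}")
print(f"threshold of player 2: {p_bar_2:.4f}")
print(f"myopic cut-off s / g_hat: {s / g_hat:.4f}")
```

With these values, the faster learner has the lower threshold, and both thresholds lie below the myopic cut-off $s/\hat g$, in line with the ordering claimed in the text.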
Proposition 9 There exists a $\hat\lambda_2 \in (0, 1)$ such that there exists an equilibrium in cut-off strategies if and only if $\lambda_2 \in (0, \hat\lambda_2]$. In this equilibrium, player 1 plays risky on $(\bar p_1, 1]$ and safe otherwise, while player 2 plays risky on $(\hat p_2, 1]$ and safe otherwise.

Proof Please refer to "Appendix B.9" for details.

Figure 5 depicts the equilibrium in cut-off strategies.$^{18}$ The black (red) curve depicts the pay-offs to player 1 (2). Since the flow pay-off obtained by each player from a good risky arm is fixed at $\hat g$, the point of intersection between the best response line and the horizontal line $w = s$ is the same for both players. As agents become more asymmetric, the best response lines diverge more from each other. Due to this, there emerges a range of beliefs where only player 2 can free ride. Hence, if the degree of asymmetry between the players is high enough, there exists an equilibrium in cut-off strategies.
Using similar arguments, we can establish that when the players' learning rates are equal but their flow pay-offs from a good risky arm are different, an equilibrium in cut-off strategies exists if the asymmetry between the players is high enough. As an illustration, suppose $\lambda_1 = \lambda_2 = \hat\lambda$. The lump sum received by each player from a good risky arm at the jumping times of the Poisson process with intensity $\hat\lambda$ is drawn from a time-invariant distribution. The mean of this distribution, $h_i$ ($i = 1, 2$), is such that $h_1 > h_2$ and $h_2 \ge \frac{s}{\hat\lambda}$. This implies $g_1 > g_2 \ge s$. The best response diagonal of player $i$ ($i = 1, 2$) is now given by $\check D_i : v = 2s - g_i p$. Beliefs $\check p_1$ and $\check p_2$ can be defined analogously to $p_1$ and $p_2$ above. Figure 6$^{19}$ shows an equilibrium in cut-off strategies in this framework. The black (red) curve depicts the pay-offs to player 1 (2). This equilibrium exists only when the players are highly asymmetric, and the best response diagonals are far apart from each other.
In both cases, if it exists, the equilibrium in cut-off strategies is welfare maximising. The argument is similar to the one above: player 2 free rides the most in the equilibrium in cut-off strategies, so that the range of beliefs at which both players play risky is largest. In addition, for any equilibrium that is not in cut-off strategies, there is an open set of beliefs in which the roles of experimenting pioneer and free rider are reversed as compared to the equilibrium in cut-off strategies (where only player 2 ever free rides). In the case $\lambda_1 \neq \lambda_2$, both effects lead to greater delay in information production in the non-cut-off equilibrium. In the case $\lambda_1 = \lambda_2 = \hat\lambda$, the first effect leads to greater delay, while the second effect leads to a higher opportunity cost of information production ($s - g_1 p < s - g_2 p$), in the non-cut-off equilibrium.

Fig. 6 Cut-off equilibrium when learning rates are equal but risky flow pay-offs differ

Conclusion
In this paper, we have characterised the set of Markov perfect equilibria in a two-armed bandit model with heterogeneous players. We have shown that there always exists an equilibrium in which the weaker player uses a cut-off strategy. If the heterogeneity is stark enough, there exists an equilibrium in cut-off strategies. If such an equilibrium exists, it is welfare optimal.
Thus, suppose there are two oil companies with vastly different drilling technologies, e.g. a big multinational firm and a small local enterprise. One could argue that the difference in technological capabilities between the two will be bigger in developing countries. On account of the big heterogeneity in capabilities, we should expect the equilibrium in cut-off strategies to exist. An empirically testable prediction of our model would thus be that there will be a higher frequency of instances in developing countries where the small local firm would free ride on the experimentation provided by the big multinational firm, and only enter the market after oil had been struck, even if the original level of uncertainty regarding the presence of oil was only moderate.
We have restricted players to using one arm only at any given instant $t$. By the linearity of the players' Bellman equations, our equilibria would remain equilibria if we allowed players to select experimentation intensities $k_{i,t} \in [0, 1]$. There might, however, be more equilibria in this case.
Our analysis has relied heavily on the characterisation of players' best responses via the diagonals $D_i$ [see Eq. (4)], which was pioneered by Keller et al. (2005) for the homogeneous-player case. We expect that a similar approach could, mutatis mutandis, be used to study other kinds of asymmetries, e.g. pertaining to players' safe-arm pay-offs $s_i$. We should expect a result similar to our Proposition 3 to hold in these cases, namely, that there exists an equilibrium in cut-off strategies if and only if the heterogeneity is stark enough.

This is solved by $v(p) = s + \left(\frac{g + r g_1}{1 + r} - \frac{s}{1 + r}\right) p + C u_1(p)$.

A.2 ODEs of players in the non-cooperative game
If $k_1 = k_2 = 0$, both players' pay-off functions satisfy $v_i(p) = s$. If $k_1 = k_2 = 1$ prevails on an open set of beliefs in the non-cooperative game, both players' value functions for beliefs in this set satisfy
This is solved by
where $C$ is a constant of integration.
If $k_i = 1$ and $k_j = 0$, player $i$'s pay-off function satisfies
Solving this, we get
where $u_i(p) = (1 - p)\left[\frac{1 - p}{p}\right]^{\mu_i}$ and $\mu_i = \frac{r}{\lambda_i}$. Player $j$'s pay-off function satisfies
This is solved by
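As a numerical sanity check on the function $u_i$ defined above, the following Python sketch evaluates $u_i$ and its derivative. It rests on an assumption: the homogeneous part of the lone experimenter's ODE is taken to have the standard exponential-bandit form $\lambda_i p(1-p)u'(p) + (r + \lambda_i p)u(p) = 0$ of Keller et al. (2005), which is not reproduced in this text. The function names and parameter values are hypothetical.

```python
def u(p, r, lam):
    """u_i(p) = (1 - p) * ((1 - p) / p) ** mu_i with mu_i = r / lam."""
    mu = r / lam
    return (1.0 - p) * ((1.0 - p) / p) ** mu


def u_prime(p, r, lam):
    """Analytic derivative: u'(p) = -u(p) * (mu + p) / (p * (1 - p))."""
    mu = r / lam
    return -u(p, r, lam) * (mu + p) / (p * (1.0 - p))


def finite_difference(f, p, h=1e-6):
    """Central finite-difference approximation of f'(p)."""
    return (f(p + h) - f(p - h)) / (2.0 * h)


def homogeneous_residual(p, r, lam):
    """Residual of lam*p*(1-p)*u' + (r + lam*p)*u (assumed ODE form)."""
    du = finite_difference(lambda q: u(q, r, lam), p)
    return lam * p * (1.0 - p) * du + (r + lam * p) * u(p, r, lam)


# Hypothetical parameters, for illustration only.
r, lam = 1.0, 0.5
for p in (0.2, 0.5, 0.8):
    print(f"p = {p:.1f}: u = {u(p, r, lam):.6f}, "
          f"u' = {u_prime(p, r, lam):.6f}, "
          f"residual = {homogeneous_residual(p, r, lam):.2e}")
```

Under the stated assumption, the residual vanishes up to numerical error and $u_i$ is positive and strictly decreasing on $(0, 1)$.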

B.1 Proof of Proposition 1
The , 1];$^{20}$ thus, $v$ is the pay-off function associated with the policy $k^*$.$^{21}$ We shall first show that $v$ is of class $C^1$, (strictly) increasing and (strictly) convex (on $(p_1^*, 1)$).
By monotonicity of $v$, $v \ge \frac{\lambda}{\lambda_2} s > \lambda s$ in this range, which completes the proof.

B.3 Proof of Proposition 3
Proof By our previous arguments, in any equilibrium in cut-off strategies, player 1 will play risky on $(\bar p_1, 1]$ and safe otherwise. In response, by the definition of $p_2$, player 2 must play risky on $(p_2, 1]$ and safe otherwise, if there is an equilibrium in cut-off strategies. Indeed, below $p_2$, player 2 is playing a best response to player 1's action choice by the definition of $p_2$. Since $D_2$ is decreasing, it is sufficient to show that player 2's pay-off function is increasing on $[p_2, 1]$ in order to show that he is also playing a best response at beliefs above $p_2$. Firstly, we note that the closed-form expression for player 2's pay-off function (see Eq. 12 in "Appendix A.2") implies that player 2's pay-off $v_2$ is strictly convex on $(p_2, 1)$, as $v_2(p_2) = D_2(p_2) > g_2 p_2$, where the inequality follows from $p_2 < \frac{s}{g_2}$ (see Lemma 1). Furthermore, the relevant ODEs (Eqs. 15 and 11 in "Appendix A.2") show that $v_2(p_2) = D_2(p_2)$ implies smooth pasting at $p_2$. As moreover $\bar v_2' > 0$ (as $\bar C_2 < 0$ and $u_1' < 0$), we can conclude that player 2's value function is strictly increasing on $(p_2, 1)$ as well, and hence that player 2 is playing a best response at beliefs above $p_2$.
Thus, the candidate strategy profile is indeed an equilibrium if and only if player 1's strategy is a best response to player 2's. This requires player 2 to choose the safe arm for all beliefs at which player 1's pay-off is below $D_1$. Thus, it remains to determine under what conditions $p_2 \ge p_1$.
We will first argue that $p_1$ ($p_2$) is increasing (decreasing) in $\lambda_2$. Recall that $p_1$ is the point of intersection of the function $\bar v_1$ and the line $D_1$. As $\lambda_2$ decreases, the line $D_1$ rotates anticlockwise around the point $(\frac{s}{g_1}, s)$. Since $\bar v_1$ is independent of $\lambda_2$, $p_1$ decreases as $\lambda_2$ decreases. On the other hand, as $\lambda_2$ decreases, the line $D_2$ shifts to the right and becomes steeper. By direct computation, one shows that $\bar v_2$ becomes flatter as $\lambda_2$ decreases. This implies that $p_2$ increases.
For this, it is sufficient that which follows by direct computation.

B.5 Proof of Proposition 4
If $\lambda_2 \le \lambda_2^*$, $p_1 \le p_2$, by the proof of Proposition 3. Suppose to the contrary that the equilibrium in which only player 2 uses a cut-off exists. By Proposition 6, $p_1^S < p_1 \le p_2 = p_2^S$, a contradiction to the characterisation of this equilibrium in Proposition 5. Now, suppose $\lambda_2 > \lambda_2^*$. By the proof of Proposition 3, $p_1 > p_2$. It thus remains to show that $\bar p_1^S > p_2$. Yet, as player 1's pay-off from the conjectured equilibrium strategies at $p_2$ is given by $\bar v_1(p_2) < D_1(p_2)$, the inequality being immediately implied by $p_1 > p_2$, we have $\bar p_1^S > p_2$, and, by Proposition 5, the equilibrium exists.
On $(\tilde p_k, p_j^S]$, a similar argument to the case of even (odd) $i - 1$ applies if $j = 2$ ($j = 1$), so that we can conclude that $\bar v_1 < v_1$ and $\bar v_2 > v_2$ on $(\tilde p_1, p_j^S]$, and hence $p_1^S < p_1$ and $p_2^S > p_2$. For $l = 1$, from the equilibrium characterisation we know that $p_2^S = p_2$, and the above argument to show $p_1^S < p_1$ still applies.

B.7 Proof of Proposition 7
If player $i$ ($i = 1, 2$) experiments and player $j$ ($j = 1, 2$, $j \neq i$) free rides, then the players' aggregate equilibrium pay-off is given by $v_{agg} = v_i + v_j$, with $v_i$ satisfying the ODE (13) and $v_j$ satisfying the ODE (15). If both players experiment, then $v_{agg} = v_1 + v_2$ and $v_n$ ($n = 1, 2$) satisfy the ODE (11). From Proposition 5, we know that $v^c_{agg}(\tilde p_1) = v^{nc}_{agg}(\tilde p_1)$. Suppose $v^c_{agg}(\tilde p_{i-1}) \ge v^{nc}_{agg}(\tilde p_{i-1})$ for some $i \in \{2, 3, \ldots, k\}$. Suppose first that $i - 1 \ge 1$ is odd. If there exists a $p \in (\tilde p_{i-1}, \tilde p_i]$ such that $v^c_{agg}(p) = v^{nc}_{agg}(p)$, then by the ODEs (13) and (15), we can conclude that $(v^c_{agg})'(p-) > (v^{nc}_{agg})'(p-)$. This implies there exists a $\hat p \in [\tilde p_{i-1}, p)$ such that $v^c_{agg}(\hat p) = v^{nc}_{agg}(\hat p)$ and $(v^c_{agg})'(\hat p+) < (v^{nc}_{agg})'(\hat p+)$, a contradiction to ODEs (13) and (15).
Suppose $i - 1 \ge 2$ is even. Then from the previous step we can infer that $v^c_{agg}(\tilde p_{i-1}) > v^{nc}_{agg}(\tilde p_{i-1})$. In both kinds of equilibria, if $i - 1$ is even, $(k_1, k_2) = (1, 0)$ on $(\tilde p_{i-1}, \tilde p_i]$. This implies that we have $v^c_{agg}(p) > v^{nc}_{agg}(p)$ for all $p \in (\tilde p_{i-1}, \tilde p_i]$. Thus, for all $p \in (\tilde p_1, \tilde p_k]$, $v^c_{agg}(p) > v^{nc}_{agg}(p)$. As $\lambda_2 \le \lambda_2^*$, we have $\tilde p_k = p_1^S$. An argument similar to that for even $i - 1$ shows that $v^c_{agg} > v^{nc}_{agg}$ on $(p_1^S, p_2]$. Now, suppose that there exists a $\hat p \in (p_2, p_2^S]$ such that $v^c_{agg}(\hat p) = v^{nc}_{agg}(\hat p)$. By the ODEs (13) and (11), this implies $(v^c_{agg})'(\hat p-) > (v^{nc}_{agg})'(\hat p-)$. This leads to a contradiction by the same argument as above. As $(k_1, k_2) = (1, 1)$ prevails in both equilibria on $(p_2^S, 1)$, the claim follows.
Direct computation shows that $u_1'' > 0$ and $s - \hat p_1^*\left(\frac{2\hat g + r\hat g}{1 + r} - \frac{s}{1 + r}\right) > 0$, so that $\varphi$, and hence $w|_{(\hat p_1^*, \hat p_2^*)}$, is strictly convex. One furthermore shows by direct computation that $\varphi(\hat p_1^*) = \varphi'(\hat p_1^*) = 0$, implying that $w|_{(0, \hat p_2^*)}$ is of class $C^1$.$^{22}$ As in "Appendix A.1", from (17) we can obtain the ODEs that $w$ satisfies for each range of beliefs and the corresponding general form of $w$ for that range.
Planner's problem The planner's value function $w$ satisfies $w(p) = 2s + \max$
where B r and $c(p) = s - \hat g p$.

Non-cooperative game Player $i$'s value function $w_i$ satisfies
As before, we can derive the best response diagonals as
For beliefs right above $\bar p_1$, pay-offs to players 1 and 2 are given by
respectively.