Applied Intelligence

Volume 38, Issue 4, pp 479–488

Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game

Authors

  • Ole-Christoffer Granmo
    • Department of Information and Communication Technology, University of Agder
  • Sondre Glimsdal
    • Department of Information and Communication Technology, University of Agder
Article

DOI: 10.1007/s10489-012-0346-z

Cite this article as:
Granmo, O. & Glimsdal, S. Appl Intell (2013) 38: 479. doi:10.1007/s10489-012-0346-z

Abstract

The two-armed bandit problem is a classical optimization problem where a decision maker sequentially pulls one of two arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, and thus, one must balance between exploiting existing knowledge about the arms, and obtaining new information. Bandit problems are particularly fascinating because a large class of real world problems, including routing, Quality of Service (QoS) control, game playing, and resource allocation, can be solved in a decentralized manner when modeled as a system of interacting gambling machines.

Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel scheme for decentralized decision making based on the Goore Game in which each decision maker is inherently Bayesian in nature, yet avoids computational intractability by relying simply on updating the hyper parameters of sibling conjugate priors, and on random sampling from these posteriors. We further report theoretical results on the variance of the random rewards experienced by each individual decision maker. Based on these theoretical results, each decision maker is able to accelerate its own learning by taking advantage of the increasingly more reliable feedback that is obtained as exploration gradually turns into exploitation in bandit problem based learning.

Extensive experiments, involving QoS control in simulated wireless sensor networks, demonstrate that the accelerated learning allows us to combine the benefits of conservative learning, which is high accuracy, with the benefits of hurried learning, which is fast convergence. In this manner, our scheme outperforms recently proposed Goore Game solution schemes, where one has to trade off accuracy with speed. As an additional benefit, performance also becomes more stable. We thus believe that our methodology opens avenues for improved performance in a number of applications of bandit based decentralized decision making.

Keywords

Bandit problems · Goore Game · Bayesian learning · Decentralized decision making · Quality of service control · Wireless sensor networks

1 Introduction

The conflict between exploration and exploitation is a well-known problem in reinforcement learning, and other areas of artificial intelligence. The Two-Armed Bandit (TAB) problem captures the essence of this conflict. In brief, a decision maker sequentially pulls one of two arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, and thus one must balance between exploiting existing knowledge about the arms, and obtaining new information.

1.1 Thompson sampling

In [8] we reported a Bayesian technique for solving bandit-like problems, revisiting the Thompson Sampling [21] principle pioneered in 1933. This revisit led to novel schemes for handling multi-armed, and dynamic (restless) bandit problems [9, 13, 14], and empirical results demonstrated the advantages of these techniques over established top performers. Furthermore, we provided theoretical results stating that the original technique is instantaneously self-correcting and that it converges to only pulling the optimal arm, with probability as close to unity as desired. In addition to the theoretical convergence results found in [8], May et al. recently reported an alternative proof strategy for establishing convergence properties for optimistic Bayesian sampling [16]. As a further testimony to the renewed importance of the Thompson Sampling principle, Wang et al. [24] combined so-called sparse sampling with Bayesian exploration, enabling efficient searching of the arm selection space using a sparse look-ahead tree. In [4], Dimitrakakis derived optimal decision thresholds for the multi-armed bandit problem, for both the infinite horizon discounted reward case, and for the finite horizon undiscounted case. Later on, a modern Bayesian look at the multi-armed bandit problem was also taken in [16, 20]. Promising recent application areas for Thompson Sampling include Bayesian click-through rate optimization for sponsored search advertising [7] and web site optimization [6, 20].
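To make the principle concrete, the sketch below shows Thompson Sampling for a two-armed Bernoulli bandit with Beta priors. This is a generic textbook illustration of the principle, not the scheme proposed in this paper, and the class and variable names are ours.

```python
import random

class BernoulliThompson:
    """Thompson Sampling for a two-armed Bernoulli bandit (illustrative only).

    Each arm keeps a Beta(a, b) posterior over its unknown reward
    probability; an arm is pulled by sampling one value from each posterior
    and choosing the arm with the larger draw.
    """

    def __init__(self, n_arms=2):
        self.a = [1.0] * n_arms  # prior "successes" + 1
        self.b = [1.0] * n_arms  # prior "failures" + 1

    def select_arm(self):
        samples = [random.betavariate(self.a[i], self.b[i])
                   for i in range(len(self.a))]
        return samples.index(max(samples))

    def update(self, arm, reward):
        # reward is 0 or 1 in the Bernoulli case
        self.a[arm] += reward
        self.b[arm] += 1 - reward
```

The exploration-exploitation trade-off is handled implicitly: an arm with an uncertain posterior still produces occasional large samples, while a well-estimated inferior arm is sampled less and less often.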

1.2 Decentralized decision making and multi-armed bandits

Multiple interacting bandit problems are particularly fascinating because they can be used to model, and efficiently solve, a large class of real world decentralized decision making problems, such as QoS-control in wireless sensor networks [15], routing [19], game playing [5], combinatorial optimization [1, 10], and resource allocation [11, 12]. In decentralized decision making problems, however, a certain phenomenon renders current bandit problem based solutions sub-optimal. Specifically, multiple decentralized decision makers are simultaneously exploring a collection of interacting bandits. This means that the variances of the reward distributions of each bandit problem are governed by the current level of exploration being manifested in the system as a whole. In other words, the variance of the reward distributions will fluctuate with the degree of exploration taking place. Thus, initially, when exploration typically is significant, each decision maker should be correspondingly more conservative or cautious when interpreting the received rewards. Otherwise, by being too reckless, the decision maker may be led astray early on, converging to a sub-optimal decision.

The traditional approach to dealing with the above described fluctuation of reward distribution variance is to make learning sufficiently conservative. The purpose is to minimize the chance of each decision maker converging prematurely. Obviously, the disadvantage of this approach is the corresponding loss in learning speed caused by remaining too conservative even after exploration calms down. A recent approach deals with this problem indirectly by incorporating a Kalman filter into the decision making [9], allowing each decision maker to track changing reward distributions. Thus, overly reckless learning in the initial phase is offset by the “forgetting” mechanism of the Kalman filter, which means that premature convergence is hindered. Yet, this tracking of changing reward distributions also means that exploration never stops. The decision makers will, as a result, never converge to a single optimal decision.

1.3 Paper contributions and organization

In this paper, we propose a novel scheme for solving one particular class of decentralized decision making problems, namely, the Goore Game (GG) [22]. The GG has applications within QoS control in wireless sensor networks, and we describe both the GG and recent applications in Sect. 2. We then proceed to introduce a scheme for Accelerated Decentralized Learning in Two-Armed Bandit Based Decision Making (ADL-TAB) in Sect. 3. The ADL-TAB scheme directly and specifically addresses fluctuating reward distribution variances. To achieve this, we derive theoretical results that characterize the variance of the random rewards experienced by each individual decision maker. Based on these theoretical results, each decision maker is able to accelerate its own learning as follows. When a decision maker chooses which arm to pull, it also submits a measurement of its degree of exploration, which we refer to as arm selection variance. In turn, along with the random reward it receives from the arm pull, it also receives a signal that reflects the current aggregated level of exploration being manifested in the system. Using this signal, each decision maker accelerates learning by taking advantage of the increasingly more reliable feedback that can be obtained when exploration gradually turns into exploitation. In Sect. 4, we demonstrate empirically that the accelerated learning allows us to combine the benefits of conservative learning, which is high accuracy, with the benefits of hurried learning, which is fast convergence. We also achieve significant performance benefits when applying ADL-TAB to QoS control in wireless sensor networks that are dynamically changing through a stochastic sensor birth-death process. In brief, our scheme clearly outperforms recently proposed Goore Game solution schemes, where one has to trade off accuracy with speed. Finally, in Sect. 5, we conclude and provide pointers to further research.

2 The Goore Game (GG)

The GG is one of the most fascinating games studied in the field of artificial intelligence, and our presentation here is based on the exposition found in [18]. We describe the GG using the following informal formulation given in [17]:

Imagine a large room containing N cubicles and a raised platform. One person (voter) sits in each cubicle and a Referee stands on the platform. The Referee conducts a series of voting rounds as follows. On each round the voters vote “Yes” or “No” (the issue is unimportant) simultaneously and independently (they do not see each other) and the Referee counts the fraction, λ, of “Yes” votes. The Referee has a uni-modal performance criterion G(λ), which is optimized when the fraction of “Yes” votes is exactly λ*. The current voting round ends with the Referee awarding a dollar with probability G(λ) and assessing a dollar with probability 1−G(λ) to every voter independently. On the basis of their individual gains and losses, the voters then decide, again independently, how to cast their votes on the next round.

The game has many interesting and fascinating features which render it both non-trivial and intriguing. These are listed below:
  1. The game is a non-trivial non-zero-sum game.
  2. Unlike the games traditionally studied in the AI literature (like Chess, Checkers, Lights-Out, etc.), the game is essentially a distributed game.
  3. The players of the game are ignorant of all of the parameters of the game. All they know is that they have to make a choice, for which they are either rewarded or penalized. They have no clue as to how many other players there are, how they are playing, or even of how/why they are rewarded/penalized.
  4. The stochastic function used to reward or penalize the players can be completely arbitrary, as long as it is uni-modal.

The literature concerning the GG is sparse. It was initially studied in the general learning domain, and, as far as we know, was for a long time merely considered as an interesting pathological game. Recently, however, the GG has found important applications within two main areas, namely, Quality of Service (QoS) control in wireless sensor networks [3] and within cooperative mobile robotics, as summarized in [2].

The GG has found applications within the field of wireless sensor networks, as explained briefly here. Consider a base station that collects data from a sensor network. The sensors of the network are battery driven and have been dropped from the air, leaving some of them non-functioning. The functioning sensors can either be switched on or off, and since they are battery-driven, it is expedient that they should be turned off whenever possible. The base station, on the other hand, has been set to maintain a certain resolution (i.e., QoS), and therefore requires that Q sensors are switched on. Unfortunately, it does not know the number of functioning sensors, and it is only able to contact them by means of a broadcast, leaving it unable to address them individually. This leaves us with the following challenge: How can the base station turn on exactly Q sensors, only by means of its limited broadcast capability?

Iyer et al. [15] proposed a scheme where the base station provided broadcasted QoS feedback to the sensors of the network. Using this model, the above problem was solved by modeling it as a GG [23]. From the GG perspective, a sensor is seen as a voter that chooses between transmitting data or remaining idle in order to preserve energy. Thus, in essence, each sensor takes the role of a GG player that either votes “On” or “Off”, and acts accordingly. The base station, on the other hand, is seen as the GG Referee with a uni-modal performance function G(⋅) whose maximum is found at Q normalized by the total number of sensors available. The “trick” is to let the base station (1) count the number of sensors that have turned on, and (2) use the broadcast mechanism to distribute, among the sensors, the corresponding reward based on the probability obtained from G(⋅). The application of the GG solution to the field of sensor networks is thus both straightforward and obvious.

Furthermore, Tung and Kleinrock [23] have demonstrated how the GG can be used for coordinating groups of mobile robots (also called “mobots”) that have a restricted ability to communicate. The main example application described in [23] consists of a fixed number of mobots that can either (1) collect pieces of ore from a landscape, or (2) sort already collected ore pieces. The individual mobots vary with respect to how fast they collect and how fast they sort these pieces of ore. In this context, the GG is used to make sure that the mobots choose their action so as to maximize the throughput of the overall collection and sorting system.

Other possible cooperative robotics applications include controlling a moving platform and guarding a specified perimeter [2]. In all of these cases, the solution to the problem in question would essentially utilize the solution to the GG in a plug-and-play manner.

3 Accelerated decentralized learning in two-armed bandit based decision making (ADL-TAB)

This paper proposes a novel scheme for decentralized decision making in which each decision maker is inherently Bayesian in nature, yet avoids computational intractability by relying simply on updating the hyper parameters of sibling conjugate priors, and on random sampling from these posteriors. Based on the sibling conjugate priors, we also measure the current degree of exploration and exploitation being manifested in the system as a whole. This allows each decision maker to accelerate its learning by taking advantage of the increasingly more reliable feedback that can be obtained when exploration gradually turns into exploitation.

3.1 Bayesian sampling for two-armed normal bandits (BS-TANB)

At the heart of our decentralized decision making scheme, we find a Bayesian Sampling approach to Two-Armed Normal Bandits (BS-TANB) problems. A unique feature of BS-TANB is its computational simplicity, achieved by relying implicitly on Bayesian reasoning principles. Possessing a bell-shaped probability density function with mean μ and standard deviation σ
\[ f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}, \]
the normal distribution, N(μ,σ), is central to BS-TANB. Essentially, BS-TANB uses the normal distribution for two purposes. First of all, it is used to provide a Bayesian estimate of the reward expectation associated with each of the available bandit arms. Secondly, a pertinent feature of BS-TANB is that it uses the normal distribution as the basis for a randomized arm selection mechanism. The following algorithm contains the essence of BS-TANB (see [9] for further details).
[Algorithm: the BS-TANB arm selection and Bayesian update procedure; see [9] for the full specification.]

As seen from the above BS-TANB algorithm, t is a discrete time index and the parameters ϕt=〈(μ0[t],σ0[t]),(μ1[t],σ1[t])〉 form an infinite 4-dimensional continuous state space Φ, with each pair (μi[t],σi[t]) giving the prior distribution of the unknown reward ri associated with Arm i. Within Φ, BS-TANB navigates by transforming each prior distribution into a posterior distribution, based on the rewards \(\tilde{r}_{i}\) obtained from selecting Arm i, α[t]=i, as well as the observation noise, \(\sigma_{ob}^{2}\), given as an input parameter to the algorithm. Essentially, the algorithm uses the observation noise, \(\sigma_{ob}^{2}\), to determine how much emphasis to put on the reward \(\tilde{r}_{i}\), which is a crucial property that we will now take advantage of.
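The algorithm itself is given as a figure in the original article. The following Python sketch captures its essence as described in the text: one normal prior per arm, arm selection by sampling, and the standard conjugate update for a normal likelihood with known observation noise \(\sigma_{ob}^{2}\). It is a minimal reading of the algorithm in [9], with our own class and variable names, not the authors' exact formulation.

```python
import random

class BSTANB:
    """Sketch of Bayesian Sampling for a Two-Armed Normal Bandit (BS-TANB).

    Each arm i keeps a normal prior N(mu[i], sigma[i]) over its unknown
    expected reward.  An arm is selected by drawing one value from each
    prior and playing the arm with the largest draw; the observed reward is
    then folded into that arm's prior via the conjugate update for a normal
    likelihood with known observation noise sigma_ob**2.
    """

    def __init__(self, mu0=0.0, sigma0=100.0):
        self.mu = [mu0, mu0]           # prior means, one per arm
        self.sigma = [sigma0, sigma0]  # prior standard deviations

    def select_arm(self):
        draws = [random.gauss(self.mu[i], self.sigma[i]) for i in (0, 1)]
        return 0 if draws[0] > draws[1] else 1

    def update(self, arm, reward, sigma_ob):
        var, var_ob = self.sigma[arm] ** 2, sigma_ob ** 2
        # The larger the observation noise, the less weight the reward gets.
        self.mu[arm] = (var_ob * self.mu[arm] + var * reward) / (var + var_ob)
        self.sigma[arm] = (var * var_ob / (var + var_ob)) ** 0.5
```

Note how sigma_ob enters only the update: a decision maker that is told the system is still exploring can pass a large sigma_ob and thereby discount the noisy reward, which is precisely the mechanism ADL-TAB exploits.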

In the interest of notational simplicity, let Arm 1, α[t]=1, be the arm under investigation. Then, for any parameter configuration ϕt ∈ Φ we can state, using a generic notation,1 that the probability of selecting Arm 1, α[t]=1, is equal to the probability P(X1>X0|ϕt), that is, the probability that a randomly drawn value x1 ∼ X1 is greater than another randomly drawn value x0 ∼ X0 at time step t. Since the associated stochastic variables X0 and X1 are normally distributed, with parameters (μ0[t],σ0[t]) and (μ1[t],σ1[t]), respectively, we have that:
\[ P(X_{1} > X_{0} \mid \phi_{t}) = \Phi\!\left(\frac{\mu_{1}[t]-\mu_{0}[t]}{\sqrt{\sigma_{0}[t]^{2}+\sigma_{1}[t]^{2}}}\right), \]
(1)
where Φ(⋅) denotes the standard normal cumulative distribution function.
In the following, we will let p[t] denote this latter probability.
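The closed form in Eq. (1) follows from the fact that X1−X0 is itself normally distributed. The helper below evaluates it; this quantity is also what each decision maker later needs in order to report its arm selection variance (Sect. 3.2). The function name is ours.

```python
import math

def yes_vote_probability(mu0, sigma0, mu1, sigma1):
    """p[t] = P(X1 > X0) for independent X0 ~ N(mu0, sigma0), X1 ~ N(mu1, sigma1)."""
    z = (mu1 - mu0) / math.sqrt(sigma0 ** 2 + sigma1 ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```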

3.2 BS-TANB based decentralized decision making

The overall decentralized decision making scheme that we propose is illustrated in Fig. 1. On each round t, the n decision makers Vq∈{V1,…,Vn} choose one of two arms, αq[t]=i∈{0,1}, simultaneously and independently (they do not see each other), with αq[t]=0 referring to a “No”-vote and αq[t]=1 referring to a “Yes”-vote.
Fig. 1

Decentralized decision making with accelerated learning

Let pq[t]=P(αq[t]=1) be the probability that decision maker Vq casts a “Yes” vote on round t. Then 1−pq[t] is the probability that Vq casts a “No” vote, and each voting αq[t] can be seen as a Bernoulli trial in which a “Yes” vote is a success and a “No” vote is a failure. Note that the concrete instantiation of the arm selection probability pq[t] is governed by the learning scheme applied, which in our case is BS-TANB.

Definition 1

(Arm Selection Variance)

In a two-armed bandit problem where the current arm selection probability is p, we define Arm Selection Variance, σ2, to be the variance, p(1−p), of the outcome of the corresponding Bernoulli trial.

As seen in Fig. 1, in addition to casting a vote αq[t], each decision maker Vq also submits its present Arm Selection Variance, \(\sigma_{q}^{2}[t]\), in order to signal its level of exploration. Thus, as in the traditional Goore Game setup, a Referee calculates the fraction, λ[t], of “Yes” votes. In addition, it now also calculates the variance \(\sigma_{A}^{2}[t]\) of the total number of “Yes” votes, which is simply the sum of the variances of the independently cast votes (cf. the Bienaymé formula): \(\sigma_{A}^{2}[t] = \sum_{q=1}^{n} \sigma_{q}^{2}[t]\). Note that in practice, such as in QoS control in wireless sensor networks [15], this operation is conducted by the so-called base station of the network.

The Referee has a uni-modal normally distributed performance criterion G(λ[t];μG,σG), where μG is the mean and \(\sigma_{G}^{2}\) is the variance, which is thus optimized when the fraction of “Yes” votes is exactly μG, λ[t]=μG. The current voting round ends with the Referee awarding a reward \(\tilde{r}_{i}\) to each voter, with the reward being of magnitude G(λ[t];μG,σG). Additionally, white noise N(0,σW) is independently added to the reward received by each voter.

On the basis of their individual gains, the voters then decide, again independently, how to cast their votes on the next round.
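A minimal sketch of one round of this extended game is given below, with our own function names. We write the performance criterion G(λ; μG, σG) as the normal density in λ, which is one possible reading of the notation; also, in the paper's setup each voter computes and submits its own arm selection variance, whereas the sketch computes them centrally for brevity.

```python
import math
import random

def gaussian_G(lam, mu_G, sigma_G):
    # Uni-modal performance criterion, written here as the normal density in lam
    # (our reading of G(lambda; mu_G, sigma_G)).
    return math.exp(-(lam - mu_G) ** 2 / (2 * sigma_G ** 2)) / (sigma_G * math.sqrt(2 * math.pi))

def referee_round(p_yes, mu_G, sigma_G, sigma_W):
    """One round of the extended Goore Game for voters with 'Yes'-probabilities p_yes."""
    votes = [1 if random.random() < p else 0 for p in p_yes]
    arm_variances = [p * (1 - p) for p in p_yes]   # Definition 1, per voter
    lam = sum(votes) / len(votes)                  # fraction of "Yes" votes
    sigma_A_sq = sum(arm_variances)                # aggregated exploration level
    base_reward = gaussian_G(lam, mu_G, sigma_G)
    rewards = [base_reward + random.gauss(0.0, sigma_W) for _ in votes]
    return lam, sigma_A_sq, rewards
```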

3.3 Measuring fluctuating observation noise in Goore Games

In order to develop a decentralized BS-TANB based scheme for solving the above problem, whose accuracy does not rely merely on conservative learning, it is crucial that we are able to determine the observation noise, \(\sigma_{ob}^{2}\), needed by BS-TANB for its Bayesian computations.

From the perspective of voter Vq, let \(Y_{q}=\sum_{r\ne q}\alpha_{r}[t]\) be the total number of “Yes” votes found among the n−1 votes cast by the other voters (r≠q). According to our Bayesian bandit scheme, each voter Vq, at any given iteration t of the game, casts its vote according to a Bernoulli distribution with success probability \(p_{q}[t] = P(\alpha_{q}[t] = 1) = P(X_{1} > X_{0} |\phi_{q}^{t})\), the probability of voting “Yes”. Furthermore, initially, all voters vote “Yes” with probability pq[1]=0.5, and based on Bayesian computations, gradually shift their probability of voting “Yes” towards either 0 or 1, as learning proceeds. This leads us to design a solution for the case where Yq is a sum of independent random variables of similar magnitude, in other words, where Yq is approximately normally distributed for large n, \(Y_{q}\sim N(\mu_{F}^{q}, \sigma_{F}^{q})\). Since each term in the summation is Bernoulli distributed, the mean of the sum becomes \(\mu_{F}^{q} = \sum_{r\ne q} p_{r}[t]\) while the variance becomes \({\sigma_{F}^{q}}^{2} = \sum_{r\ne q} p_{r}[t] (1 - p_{r}[t])\). The above entails that each voter, Vq, essentially decides whether or not to add an additional “Yes” vote to a random sum of “Yes” votes, \(Y_{q} \sim N(\mu_{F}^{q}, \sigma_{F}^{q})\). That is, the reward that voter Vq receives when voting either “Yes” (αq[t]=1) or “No” (αq[t]=0) becomes a function \(G(\frac{Y_{q} + \alpha_{q}[t]}{n})\) governed by the random variable \(Y_{q}\sim N(\mu_{F}^{q}, \sigma_{F}^{q})\) as well as the decision αq of voter Vq.

Thus \(E[G(\frac{Y_{q} + \alpha_{q}[t]}{n})]\) is the expected reward received by voter Vq when pulling arm αq[t], and \(\operatorname {Var}[G(\frac{Y_{q} + \alpha_{q}[t]}{n})]\) is the variance of the reward, which we will refer to as the observation noise, \(\sigma_{ob}^{2}\).
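The normal approximation of Yq and the resulting reward noise can be checked by simulation. The sketch below estimates the mean and variance of the reward seen by voter Vq for an arbitrary vector of “Yes”-probabilities, again reading G as the normal density; names and defaults are ours.

```python
import math
import random

def gaussian_G(lam, mu_G, sigma_G):
    return math.exp(-(lam - mu_G) ** 2 / (2 * sigma_G ** 2)) / (sigma_G * math.sqrt(2 * math.pi))

def reward_noise(p_others, alpha_q, mu_G=0.375, sigma_G=0.2, trials=100_000):
    """Monte Carlo estimate of the mean and variance of G((Y_q + alpha_q)/n) seen by
    voter q, together with the parameters of the normal approximation of Y_q."""
    n = len(p_others) + 1
    mu_F = sum(p_others)                                     # mean of Y_q
    sigma_F = math.sqrt(sum(p * (1 - p) for p in p_others))  # std of Y_q
    rewards = []
    for _ in range(trials):
        y_q = sum(1 for p in p_others if random.random() < p)  # other voters' "Yes" votes
        rewards.append(gaussian_G((y_q + alpha_q) / n, mu_G, sigma_G))
    mean = sum(rewards) / trials
    var = sum((r - mean) ** 2 for r in rewards) / trials
    return mean, var, mu_F, sigma_F
```

For instance, reward_noise([0.5] * 9, 1) corresponds to a ten-voter system in which the other voters still explore fully.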

Lemma 1

Let X be a normally distributed random variable, X ∼ N(μF,σF). The expected value E[G(X)] of a deterministic function G(X) ∼ N(μG,σG) of X then becomes:
\[ E[G(X)] = \frac{1}{\sqrt{2\pi(\sigma_{F}^{2}+\sigma_{G}^{2})}}\, \exp\!\left(-\frac{(\mu_{F}-\mu_{G})^{2}}{2(\sigma_{F}^{2}+\sigma_{G}^{2})}\right) \]
(2)

Proof

[Equations (3)–(7): E[G(X)] is written as the integral of the product of the two Gaussian densities; completing the square turns this product into a scaled Gaussian in x, and integrating the resulting Gaussian yields Eq. (2).]
 □

Lemma 2

A deterministic function G(X) ∼ N(μG,σG) of a normally distributed random variable, X ∼ N(μF,σF), has the variance:
\[ \operatorname{Var}[G(X)] = \frac{1}{2\pi\sigma_{G}\sqrt{\sigma_{G}^{2}+2\sigma_{F}^{2}}}\, \exp\!\left(-\frac{(\mu_{F}-\mu_{G})^{2}}{\sigma_{G}^{2}+2\sigma_{F}^{2}}\right) - \frac{1}{2\pi(\sigma_{F}^{2}+\sigma_{G}^{2})}\, \exp\!\left(-\frac{(\mu_{F}-\mu_{G})^{2}}{\sigma_{F}^{2}+\sigma_{G}^{2}}\right) \]
(8)

Proof

[Equations (9)–(15): the variance is expanded as Var[G(X)] = E[G(X)^2] − (E[G(X)])^2; each expectation is again an integral of a product of Gaussian densities which, after completing the square, reduces to a scaled Gaussian whose integral yields Eq. (8).]
 □
Figure 2 depicts the intricate behavior of \(\operatorname {Var}[G(X)]\) when seen as a function of μF and σF, and when μG=0.4 and σG=0.2. Since both the mean of G, μG, and the mean of F, μF, generally are unknown, the latter equation cannot be used directly to guide the bandit based learning. Instead, we consider the maxima of \(\operatorname {Var}[G(X)]\) with μF∈(0,1) being the free variable. By considering the maxima, learning accuracy is prioritized, at the potential cost of reduced learning speed. In the following, we will see that the maximization eliminates both μF and μG from the equation.
Fig. 2

The variance of G(X), \(\operatorname {Var}[G(X)]\), for μG=0.4 and σG=0.2. The axes depict the mean μF and the standard deviation σF, respectively

Theorem 1

The maximum of the variance \(\operatorname {Var}[G(X)]\) with respect to μF ∈ (0,1), for the function G(X) ∼ N(μG,σG), with X ∼ N(μF,σF), is:
[Equation (16): the closed-form maximum of Var[G(X)], which depends only on σF and σG.]

Proof

We find maxima and minima for \(\operatorname {Var}[G(X)]\) with respect to μF by solving the following equation:
[Equations (17)–(18): the partial derivative of Var[G(X)] in Eq. (8) with respect to μF is set equal to zero and simplified.]
The above equation has four solutions, with two symmetric maxima in the region of interest μF∈(0,1), as illustrated in Fig. 2. The first maximum is:
[Equation (19): the first of the two symmetric maximizers, expressing μF in terms of μG, σF and σG.]
while the second maximum is:
[Equation (20): the second, symmetric maximizer.]
By substituting either Eq. (19) or Eq. (20) into Eq. (8) and simplifying, we see that both μF and μG have been eliminated from the equation, which completes the proof:
[Equation (21): the resulting closed-form expression for the maximal variance, in which only σF and σG remain.]
 □

A crucial consequence of the results presented in this section is that since σF in the above equation can be approximated based on the feedback σA from the Referee (see Fig. 1), we can find the worst-case observation noise based on Theorem 1. Thus, we have found a closed-form formula for the worst-case observation noise, σob, that each voter can apply adaptively in its Bayesian computations!
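Since the closed form of Eqs. (16)–(21) is not reproduced above, the sketch below obtains the worst-case observation noise numerically: it converts the Referee feedback σA² into an estimate of σF for the vote fraction, and then maximizes Var[G(X)] over μF ∈ (0,1) by grid search. It assumes the normal-density reading of G and our own rescaling of vote counts to fractions; the closed form of Theorem 1 would replace the grid search.

```python
import math

def var_G_of_X(mu_F, sigma_F, mu_G, sigma_G):
    """Var[G(X)] for X ~ N(mu_F, sigma_F), with G taken to be the N(mu_G, sigma_G)
    density (our reading of Lemmas 1 and 2)."""
    v = sigma_F ** 2 + sigma_G ** 2
    mean = math.exp(-(mu_F - mu_G) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
    w = sigma_G ** 2 + 2 * sigma_F ** 2
    second_moment = math.exp(-(mu_F - mu_G) ** 2 / w) / (2 * math.pi * sigma_G * math.sqrt(w))
    return second_moment - mean ** 2

def worst_case_observation_noise(sigma_A_sq, own_variance, n, sigma_G, grid=1000):
    """Numerical stand-in for Theorem 1: worst-case observation noise for one voter,
    given the aggregated arm selection variance sigma_A_sq broadcast by the Referee."""
    # Remove the voter's own contribution and rescale the vote count to a fraction.
    sigma_F = math.sqrt(max(sigma_A_sq - own_variance, 0.0)) / n
    # By Theorem 1 the maximum over mu_F does not depend on mu_G, so the grid
    # search is simply centred on an arbitrary mu_G = 0.5.
    return max(var_G_of_X(i / grid, sigma_F, 0.5, sigma_G) for i in range(1, grid))
```

In ADL-TAB this value would be fed back into the BS-TANB update as sigma_ob**2, so that rewards are trusted more and more as the aggregated exploration level reported by the Referee drops.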

4 Empirical results

In this section we evaluate the ADL-TAB scheme by comparing it with the currently best performing algorithm—the family of Bayesian techniques reported in [8]. Based on our comparison with these “reference” algorithms, it should be quite straightforward to also relate the ADL-TAB performance results to the performance of other similar algorithms. We first use artificial data and then data from a simulated sensor network.

4.1 Artificial data

We have conducted numerous experiments using various reward distributions, including a wide range of G(λ)-functions and a wide range of voters, under varying degrees of observation noise. The full range of empirical results all show the same trend; here, however, we report performance on a representative subset of the experiment configurations, involving the 3, 5, and 10 player Goore Game. Performance is measured in terms of Regret: the difference between the sum of rewards expected after N successive rounds of the GG, and what would have been obtained by always casting the optimal number of “Yes” votes.

For these experiment configurations, an ensemble of 1000 independent replications with different random number streams was performed to minimize the variance of the reported results. In order to investigate the performance of the schemes under a broad spectrum of environments, we test the schemes using three different representative G(λ) functions: one sloped, with optimum close to λ=0.5, G∼N(0.375,0.2), another one also sloped, but with optimum farther from λ=0.5, G∼N(0.125,0.2), and finally, one peaked reward function, also with optimum far from λ=0.5, G∼N(0.125,0.1) (thus being the most challenging one). In Table 1, Regret is reported after 10, 100, 1000, and 10 000 iterations for both the new accelerating scheme and the traditional static scheme.
Table 1

Regret after 10, 100, 1000, and 10 000 iterations for 3, 5, and 10 players

Scheme        #Players  Function         10      100     1000    10 000
Accelerating  3         G∼N(0.125,0.1)   11.56   26.72   30.96   33.17
Static        3         G∼N(0.125,0.1)   11.63   27.27   34.88   47.20
Accelerating  3         G∼N(0.125,0.2)   5.26    8.47    10.35   11.09
Static        3         G∼N(0.125,0.2)   5.28    9.53    15.15   25.15
Accelerating  3         G∼N(0.375,0.2)   6.62    10.86   11.99   12.63
Static        3         G∼N(0.375,0.2)   6.73    12.15   14.36   17.72
Accelerating  5         G∼N(0.125,0.1)   18.37   41.60   51.78   61.65
Static        5         G∼N(0.125,0.1)   18.28   44.94   58.52   99.49
Accelerating  5         G∼N(0.125,0.2)   6.92    12.94   22.99   60.80
Static        5         G∼N(0.125,0.2)   7.01    15.39   32.94   69.86
Accelerating  5         G∼N(0.375,0.2)   6.12    20.47   22.70   25.75
Static        5         G∼N(0.375,0.2)   6.16    24.24   30.11   35.74
Accelerating  10        G∼N(0.125,0.1)   32.81   93.82   133.65  443.7
Static        10        G∼N(0.125,0.1)   32.84   99.27   143.67  549.8
Accelerating  10        G∼N(0.125,0.2)   10.19   19.57   39.58   110.53
Static        10        G∼N(0.125,0.2)   10.21   22.57   56.91   167.09
Accelerating  10        G∼N(0.375,0.2)   4.40    31.20   113.42  116.65
Static        10        G∼N(0.375,0.2)   4.41    32.03   163.40  197.31

As seen from the table, for all reported configurations, our ADL-TAB scheme not only learns faster initially, but also attains the best regret in the long run. Note that for the two bottom configurations, we use an augmented σF, \(\widehat{\sigma_{F}} = c \cdot \sigma_{F}\), with c=1.5, when the final observation noise σob is calculated. Indeed, the constant c can be used to handle the non-stationarity arising as the number of voters grows, as demonstrated in Table 2.
Table 2

Performance with σF augmented, \(\widehat{\sigma_{F}} = c \cdot \sigma_{F}\) (10 players, G∼N(0.1,0.1), σW=0.1)

Scheme/c      1.0      1.25    1.5     1.75
Accelerating  1030.0   684.7   444.7   408.7
Static        965.6    624.3   550.4   414.2

Since ADL-TAB applies the standard deviation σG of the reward function G(λ) to find the overall observation variance, it is interesting to see how robust the scheme is to distortion of σG. As summarized in Table 3, setting σG too low is better than setting it too high in the present setting. Indeed, performance improves slightly with a lower σG.
Table 3

Performance with distorted \(\widehat{\sigma}_{G}\) given to ADL-TAB (10 players, G∼N(0.125,0.2), σW=0.1)

\(\widehat{\sigma}_{G}\)   0.85⋅σG   0.90⋅σG   0.95⋅σG   1.0⋅σG   1.05⋅σG   1.10⋅σG   1.15⋅σG
Regret                     74.4      75.7      90.9      123.5    162.9     194.4     237.5

Note that the above reported performance gap shrinks as the level of white noise added to G increases, as shown in Table 4. As the variance of the white noise rises to extreme values, the white noise dominates the overall observation noise, rendering the variance introduced by the voters insignificant. However, for realistic degrees of white noise, as also seen from the table, ADL-TAB clearly outperforms the static BS-TANB scheme.
Table 4

Performance with varying degrees of white noise N(0,σW) (10 players, G∼N(0.375,0.2))

Scheme/σW     0.01    0.05    0.1     0.5     1.0     5.0
Accelerating  56.6    54.8    61.1    123.4   315.8   2012.0
Static        120.5   121.4   121.6   184.4   371.7   2013.3

Thus, based on our empirical results on artificial data, we conclude that ADL-TAB is the superior choice for the GG, both when σG is known and when it is slightly distorted, providing significantly better performance in all experiment configurations.

4.2 QoS control in wireless sensor networks

As mentioned in the introduction, the GG can be used for QoS control in wireless sensor networks. A scenario of particular interest is randomly deployed networks, whose applications include environmental monitoring and battlefield surveillance & reconnaissance [3]. An additional complexity for QoS control under such settings is sensor break down. Due to the random deployment of sensors, and due to the nature of typical applications, it is often infeasible to track down and repair broken sensors. Instead, batches of new sensors are deployed to replace broken ones, e.g., by air drop. As a result, the population of sensors is changing dynamically over time, in a stochastic manner.

To stress the ADL-TAB scheme under particularly challenging conditions, we have thus simulated the latter kind of dynamic environments. In brief, the simulated environment starts out with ten randomly deployed sensors. As in [15], both the lifetime of each sensor and the rate of deployment are governed by the same exponential distribution. Therefore, the total number of operative sensors fluctuates, but remains constant on average.
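The birth-death process can be sketched as follows, under our own modelling assumption that new sensors arrive at rate initial_sensors / mean_lifetime, which keeps the expected population constant as described above; the details of the process in [15] may differ.

```python
import heapq
import random

def simulate_population(mean_lifetime, horizon, initial_sensors=10):
    """Number of operative sensors over time for a birth-death process with
    exponential lifetimes, where new sensors arrive at rate
    initial_sensors / mean_lifetime (our modelling assumption)."""
    deaths = [random.expovariate(1.0 / mean_lifetime) for _ in range(initial_sensors)]
    heapq.heapify(deaths)
    birth_rate = initial_sensors / mean_lifetime
    t, population, history = 0.0, initial_sensors, []
    next_birth = random.expovariate(birth_rate)
    while t < horizon:
        if deaths and deaths[0] < next_birth:
            t = heapq.heappop(deaths)   # a sensor's battery dies
            population -= 1
        else:
            t = next_birth              # a new sensor is air-dropped
            population += 1
            heapq.heappush(deaths, t + random.expovariate(1.0 / mean_lifetime))
            next_birth = t + random.expovariate(birth_rate)
        history.append((t, population))
    return history
```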

The QoS function used here is G∼N(0.375,0.2), which means that optimally 37.5 % of the sensors should be active at any given instant. White noise with σW=0.3 is added to the feedback. This configuration was executed for 10 000 hours, with the average regret for 2000 sample runs summarized in Table 5 for different birth/death rates.
Table 5

Regret after 10, 100, 1000, and 10 000 hours

Scheme        Birth/Death Rate  10     100     1000     10 000
Accelerating  100               4.47   31.50   180.43   1055.23
Static        100               4.48   31.45   194.14   3003.87
Accelerating  50                4.47   31.65   195.84   1268.32
Static        50                4.45   31.58   203.19   4734.49
Accelerating  12.5              4.49   32.71   219.34   1270.76
Static        12.5              4.39   32.04   357.60   9589.84

From the table it is clear that ADL-TAB obtains significantly lower regret than the static BS-TANB scheme in these scenarios too. Indeed, the performance benefit increases with faster replacement of sensors. This can be explained by the ability of a newly deployed sensor governed by the ADL-TAB scheme to perceive any learning stability already established among the operating sensors, which in turn allows the sensor to accelerate its transition from exploration to exploitation.

Another effect worth noting is the stability of ADL-TAB. That is, the standard deviation of the sample runs for ADL-TAB after 10 000 hours is much lower than for the static scheme, as can be seen from Table 6. This performance robustness provided by ADL-TAB can be explained by the acceleration that takes place to reduce exploration.
Table 6

Standard deviation of regret after 10 000 hours

Scheme        Birth/Death Rate = 100   Birth/Death Rate = 50   Birth/Death Rate = 12.5
Accelerating  σ100 = 254.6             σ50 = 229.0             σ12.5 = 206.5
Static        σ100 = 1561.5            σ50 = 1628.9            σ12.5 = 1393.7

5 Conclusion and further work

In this paper we proposed a novel scheme, ADL-TAB, for decentralized decision making based on the Goore Game. Theoretical results concerning the variance of the observations made by each individual decision maker enabled us to accelerate learning as exploration turns into exploitation. Indeed, our empirical results demonstrated that the accelerated learning improves both learning accuracy and speed, outperforming state-of-the-art Goore Game solution schemes, both when using artificial data and when using data from a wireless sensor network simulation.

As further work, we intend to study how the Kalman filter can be incorporated into ADL-TAB, so that non-stationary behavior can be modeled and addressed in a principled manner. We are also currently investigating how the present result can be extended to other classes of decentralized decision making problems. Finally, we believe this avenue of research can lead to enhancements in application areas such as decentralized task scheduling, processing pipeline optimization, and resource allocation.

Footnotes
1

By this we mean that P is not a fixed function. Rather, it denotes the probability function for a random variable, given as an argument to P.

 

Copyright information

© Springer Science+Business Media, LLC 2012