Online Learning of Network Bottlenecks via Minimax Paths

In this paper, we study bottleneck identification in networks via extracting minimax paths. Many real-world networks have stochastic weights for which full knowledge is not available in advance. Therefore, we model this task as a combinatorial semi-bandit problem to which we apply a combinatorial version of Thompson Sampling and establish an upper bound on the corresponding Bayesian regret. Due to the computational intractability of the problem, we then devise an alternative problem formulation which approximates the original objective. Finally, we experimentally evaluate the performance of Thompson Sampling with the approximate formulation on real-world directed and undirected networks.

The aforementioned formulations assume that the network or the graph is fully specified, i.e., that all the edge weights are fully known. However, in practice, the edge weights might not be known in advance or they might include some inherent uncertainty. In this paper, we tackle such situations by developing an online learning framework to learn the edge weight distributions of the underlying network, while solving the bottleneck identification problem for different problem instances.
For example, in the transportation scenario, city governments often have access to fleets of vehicles utilized for various municipal services. These may be used to sequentially and continuously gain knowledge about traffic flow from the environment, while it is still desirable to avoid causing unnecessary inconvenience and stress [21] to the employees operating the vehicles by excessively exploring congested paths. If care is taken to spread the costs over time, exploration may be performed continuously without having a specific end time known in advance (i.e., the time horizon of the sequential decision making problem).
For this purpose, we view this as a multi-armed bandit (MAB) problem and focus on Thompson Sampling (TS) [43], a method that suits probabilistic online learning well. Thompson Sampling is an early Bayesian method for addressing the trade-off between exploration and exploitation in sequential decision making problems. It balances these by randomly sampling available actions according to their posterior probability of being optimal, given prior beliefs and observations from previously selected actions. An action is more likely to be sampled if the posterior distribution over the expected reward of that action has high uncertainty (exploration) or high mean (exploitation).
The method has only recently been thoroughly evaluated through experimental studies [8,17] and theoretical analyses [26,1,38], where it has been shown to be asymptotically optimal in the sense that it matches well-known lower bounds of these types of problems [29]. Furthermore, the algorithm does not assume knowledge of the time horizon, i.e., it is an anytime algorithm.
Among many other problem settings, Thompson Sampling has been adapted to online versions of combinatorial optimization problems with retained theoretical guarantees [44], where one application is to find shortest paths in graphs [30,16,45,2].
Another commonly used method for these problems is Upper Confidence Bound (UCB) [3], which utilizes optimism to balance exploration and exploitation. UCB has been adapted to combinatorial settings [10], and also exists in Bayesian variants [25]. Recently, a variant of UCB has been studied for bottleneck avoidance problems in a combinatorial pure exploration setting [14]. That work considers a different problem setting and method than those we present in this paper, though its bottleneck reward function is similar to the one we use in our approximation method. The main difference between that setting and the standard combinatorial semi-bandit setting lies in how agents interact with the environment: instead of being restricted to selecting sets of actions respecting combinatorial constraints, agents are allowed to sequentially try individual arms to identify the best feasible solution to the combinatorial problem. This is not applicable to our setting, since we cannot observe the feedback of an individual edge without also traversing a path containing that edge, potentially incurring cost from some other edge on that path. Moreover, the objective in a pure exploration problem is to find the best action as quickly as possible, with either a fixed time budget or confidence level, using agents dedicated to this task. While identifying the best path is desirable in our problem setting as well, we are specifically interested in the case where existing agents are utilized and where using them exclusively for exploration is too costly. For that reason, we focus on anytime methods capable of distributing exploratory actions over time.
In this paper, we model the online bottleneck identification task as a stochastic combinatorial semi-bandit problem, for which we develop a combinatorial variant of Thompson Sampling. We then derive an upper bound on the corresponding Bayesian regret that is tight up to a polylogarithmic factor, which is consistent with the existing lower bounds for combinatorial semi-bandit problems. We face the issue of computational intractability with the exact problem formulation. We thus propose an approximation scheme, along with a theoretical analysis of its properties. Finally, we experimentally investigate the performance of the proposed method on directed and undirected real-world networks from transport and collaboration domains.

Bottleneck Identification Model
In this section, we first introduce the bottleneck identification problem over a fixed network and then describe a probabilistic model to be used in stochastic and uncertain situations.

Bottleneck identification over a network
We model a network by a graph $G(V, E, w)$, where $V$ denotes the set of vertices (nodes) and each $e = (u, v) \in E$ indicates an edge between vertices $u$ and $v$, where $u, v \in V$ and $u \neq v$. Moreover, $w : E \to \mathbb{R}$ is a weight function defined for each edge of the graph, where, for convenience, we use $w_e$ to denote the weight of edge $e$. If $G$ is directed, the pair $(u, v)$ is ordered; otherwise, it is not (i.e., $(u, v) \equiv (v, u)$ for undirected graphs). A path $p$ from vertex $u$ to vertex $v$ is a sequence of vertices, starting at $u$ and ending at $v$, in which each consecutive pair is joined by an edge in $E$. It can also be seen as the corresponding sequence of edges.
As previously mentioned, a bottleneck on a path $p$ can be described as an edge with a maximal weight on that path. To find the smallest feasible bottleneck edge between the source node $u$ and the target node $v$, we consider all the paths between them. For each path, we pick an edge with a maximal weight, to obtain all path-specific bottleneck edges. We then identify the smallest path-specific bottleneck edge in order to find the best feasible bottleneck edge, i.e., such that bottleneck edges with higher weights are avoided. Therefore, given graph $G$, the bottleneck edge between $u \in V$ and $v \in V$ can be identified via extracting the minimax edge between them. With $P_{u,v}$ denoting the set of all possible paths from $u$ to $v$ over $G$, the bottleneck weight (incurred by the bottleneck edge) can be computed by
$$\min_{p \in P_{u,v}} \max_{e \in p} w_e. \tag{1}$$
The quantity in Eq. 1 satisfies the (ultra) metric properties under some basic assumptions on the edge weights such as symmetry and nonnegativity. Hence, it is sometimes used as a proper distance measure to extract manifolds and elongated clusters in a non-parametric way [18,27].
However, in our setting, such conditions do not need to be fulfilled by the edge weights. In general, we tolerate positive as well as negative edge weights, and we assume the graph might be directed, i.e., the edge weights are not necessarily symmetric. Therefore, despite the absence of (ultra) metric properties, the concept of minimax edges is still relevant for bottleneck identification.
To compute the minimax edge, one does not need to investigate all possible paths between the source and target nodes, which might be computationally infeasible. As studied in [23], minimax edges and paths over an arbitrary undirected graph are equal to the minimax edges over any minimum spanning tree (MST) computed over that graph. This equivalence simplifies the calculation of minimax edges, as there is only one path between every two vertices over an MST, whose maximal edge weight yields the minimax edge, i.e., the desired bottleneck.
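As a minimal sketch of this idea for undirected graphs (not the implementation used in the paper), the bottleneck weight can be obtained without building the full MST: if edges are added in increasing weight order, as in Kruskal's MST algorithm, the weight of the edge that first connects the source and the target is the minimax weight. The function name, the edge-list format and the integer vertex labels are assumptions made for illustration.

```python
def minimax_weight_undirected(num_vertices, edges, source, target):
    """Bottleneck (minimax) weight between source and target.

    edges: list of (weight, u, v) tuples over vertices 0..num_vertices-1
    (hypothetical input format). Returns None if no path exists.
    """
    if source == target:
        raise ValueError("source and target must differ")
    parent = list(range(num_vertices))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Process edges in increasing weight order (Kruskal order).
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
        if find(source) == find(target):
            # This edge is the largest one on the MST path between them.
            return weight
    return None


edges = [(4, 0, 1), (1, 0, 2), (2, 2, 1), (7, 1, 3), (3, 2, 3)]
print(minimax_weight_undirected(4, edges, 0, 3))
```

Negative weights pose no problem here, since only the ordering of the edge weights matters.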
For directed graphs, an MST might not represent the minimax edges in a straightforward manner. Hence, we instead rely on a modification [6] of Dijkstra's algorithm [13] to extract minimax paths rather than the shortest paths.
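One way such a modification of Dijkstra's algorithm might look is sketched below. It is an illustrative version under our own assumptions (adjacency-dictionary input, a binary heap rather than a Fibonacci heap, hypothetical function name), not necessarily the variant in [6]. The only change from the shortest-path version is that a path label is the maximum edge weight along the path instead of the sum. Since extending a path can never decrease this maximum, the greedy extraction order remains valid even with negative edge weights.

```python
import heapq


def minimax_path_directed(adj, source, target):
    """Return (bottleneck_weight, path) for a minimax path, or (None, []).

    adj: dict mapping each vertex to a list of (neighbor, weight) pairs
    (hypothetical input format); vertices must be mutually comparable.
    """
    best = {source: float("-inf")}  # smallest known bottleneck per vertex
    prev = {}
    heap = [(float("-inf"), source)]
    done = set()
    while heap:
        bottleneck, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in adj.get(u, []):
            # Path label is the maximum edge weight, not the sum.
            candidate = max(bottleneck, w)
            if candidate < best.get(v, float("inf")):
                best[v] = candidate
                prev[v] = u
                heapq.heappush(heap, (candidate, v))
    if target not in done:
        return None, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return best[target], path


adj = {
    "s": [("a", 5.0), ("b", 2.0)],
    "a": [("t", 1.0)],
    "b": [("a", 3.0), ("t", 6.0)],
    "t": [],
}
result = minimax_path_directed(adj, "s", "t")
print(result)
```

In this small example the direct routes through `a` or `b` have bottlenecks 5 and 6, whereas the longer route `s, b, a, t` has bottleneck 3.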

Probabilistic model for bottleneck identification
In this paper, we study bottleneck identification in uncertain and stochastic settings. Therefore, instead of considering the weights $w_e$ for $e \in E$ to be fixed, we view them as stochastic with fixed, albeit unknown, distribution parameters. Additionally, we assume that the weight of each edge follows a Gaussian distribution with known and finite variance. The Gaussian edge weight assumption is common for many important problem settings, like minimization of travel time [40] or energy consumption [2] in road networks. Furthermore, we assume that all edge weights are mutually independent. Hence,
$$w_e \sim \mathcal{N}(\theta^*_e, \sigma^2_e),$$
where $\theta^*_e$ denotes the unknown mean of edge $e$, and $\sigma^2_e$ is the known variance. To reduce cumbersome notation in the proofs, since the variance is assumed to be finite, we let $\sigma^2_e \le 1$ (by scaling the edge weight distributions). However, we emphasize that we do not assume that $w_e$ and $\theta^*_e$ are bounded or non-negative. It is convenient to be able to make use of prior knowledge in online learning problems where the action space is large, which motivates a Bayesian approach where we assume that the unknown mean $\theta^*_e$ is sampled from a known prior distribution: $\theta^*_e \sim \mathcal{N}(\mu_{e,0}, \varsigma^2_{e,0})$.
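We use a Gaussian prior for $\theta^*_e$ since it is conjugate to the Gaussian likelihood and allows for efficient recursive updates of posterior parameters upon a new weight observation $w_{e,t}$ at time $t$:
$$\mu_{e,t} = \frac{\varsigma^2_{e,t-1} w_{e,t} + \sigma^2_e \mu_{e,t-1}}{\varsigma^2_{e,t-1} + \sigma^2_e}, \tag{2}$$
$$\varsigma^2_{e,t} = \frac{\varsigma^2_{e,t-1} \sigma^2_e}{\varsigma^2_{e,t-1} + \sigma^2_e}. \tag{3}$$
Since our long-term objective is to find a path which minimizes the expected maximum edge weight along that path, we need a framework to sequentially select paths to update these parameters and learn enough information about the edge weight distributions.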
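The recursive posterior update for a single edge can be sketched as follows, using the standard conjugate update for a Gaussian mean with known observation variance. The function and argument names are hypothetical and chosen only for illustration.

```python
def update_posterior(mu_prev, var_prev, observation, noise_var):
    """One conjugate Gaussian update of the posterior over an edge mean.

    mu_prev, var_prev: posterior mean and variance before the observation.
    noise_var: known variance of the edge weight (sigma_e squared).
    Returns the new posterior mean and variance.
    """
    total = var_prev + noise_var
    mu_new = (var_prev * observation + noise_var * mu_prev) / total
    var_new = var_prev * noise_var / total
    return mu_new, var_new


# A standard Gaussian prior updated with one observation of unit noise:
mu, var = update_posterior(0.0, 1.0, 2.0, 1.0)
print(mu, var)
```

With equal prior and noise variances, the posterior mean lands halfway between the prior mean and the observation, and the posterior variance is halved.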
The assumptions in this section might seem restrictive, and indeed, when the edge weights represent e.g., traffic congestion in a road network, it is reasonable to believe that edges are not independent, especially for neighboring road segments. There are ways of extending this setting to capture such dependencies, while retaining similar regret guarantees for the studied methods. Such extensions include the contextual setting, where expected edge weights are assumed to follow parameterized functions of contextual features (e.g., time-of-day, local ambient temperature, precipitation) revealed to the agent in each time step, before each action is taken. We leave such extensions to future work, though we note that the proofs in this work may be extended in a straightforward manner, analogous to the analysis of linear contextual Thompson Sampling in [38]. Similarly, Thompson Sampling may be extended to the case where both the mean and variance are unknown, by assignment of a joint prior distribution over the parameters [37].

Online Bottleneck Learning Framework
Consider a stochastic combinatorial semi-bandit problem [7] with time horizon $T$, formulated as a problem of cost minimization rather than reward maximization. There is a set of base arms $\mathcal{A}$ (where we let $d := |\mathcal{A}|$) from which we may, at each time step $t \in [T]$, select a subset (or super arm) $a_t \subseteq \mathcal{A}$. The selection is further restricted such that $a_t \in \mathcal{I} \subseteq 2^{\mathcal{A}}$, where $\mathcal{I}$ is called the set of feasible super arms.
Upon selection of $a_t$, the environment reveals a feedback $X_{i,t}$ drawn from some fixed and unknown distribution for each base arm $i \in a_t$ (i.e., semi-bandit feedback). Furthermore, we then receive a super arm cost from the environment, $c(a_t) := \max_{i \in a_t} X_{i,t}$, i.e., the maximum of all base arm feedback for the selected super arm and the current time step.
The objective is to select super arms $a_t$ to minimize $\mathbb{E}\left[\sum_{t=1}^{T} c(a_t)\right]$. This objective is typically reformulated as an equivalent regret minimization problem, where the (expected) regret is defined as
$$\mathrm{Regret}(T) := \sum_{t=1}^{T} \mathbb{E}[c(a_t)] - T \cdot \min_{a \in \mathcal{I}} \mathbb{E}[c(a)]. \tag{4}$$
To connect this to the probabilistic bottleneck identification model introduced in the previous section, we let each edge $e \in E$ in the graph $G$ correspond to exactly one base arm $i \in \mathcal{A}$. For the online minimax path problem, the feasible set of super arms is then the set of all admissible paths in the graph, where the paths are directed or undirected depending on the type of graph. The feedback of each base arm $i$ is simply the Gaussian weight of the matching edge $e$, with known variance $\sigma^2_i$ and unknown mean $\theta^*_i$. We denote the expected cost of a super arm $a$ by $f_\theta(a)$, where $\theta$ is a mean vector and $f_\theta(a) := \mathbb{E}\left[\max_{i \in a} X_{i,t} \mid \theta\right]$.
For Bayesian bandit settings and algorithms, it is common to consider the notion of Bayesian regret, with an additional expectation over problem instances drawn from the prior distribution (where we denote the prior distribution $\lambda$, over mean vectors $\theta^*$):
$$\mathrm{BayesRegret}(T) := \mathbb{E}_{\theta^* \sim \lambda}\left[\mathrm{Regret}(T)\right].$$
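The interaction with the environment described above can be illustrated with a small simulation. This is a sketch under our own naming assumptions (`play_super_arm`, lists indexed by base arm), not code from the paper: playing a super arm reveals one Gaussian feedback value per contained base arm, and the incurred cost is the maximum of those values.

```python
import random


def play_super_arm(super_arm, true_means, noise_stds, rng):
    """Simulate semi-bandit feedback for one time step.

    super_arm: iterable of base arm indices (e.g., the edges on a path).
    Returns a dict of per-arm feedback and the bottleneck cost.
    """
    feedback = {i: rng.gauss(true_means[i], noise_stds[i]) for i in super_arm}
    cost = max(feedback.values())  # super arm cost c(a_t)
    return feedback, cost


rng = random.Random(0)
feedback, cost = play_super_arm([0, 2], [1.0, 5.0, 3.0], [0.1, 0.1, 0.1], rng)
print(sorted(feedback), cost)
```

Only the arms in the played super arm produce feedback; arm 1 in the example stays unobserved.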

Thompson Sampling with exact objective
It is not sufficient to find the super arm $a$ which minimizes $f_{\mu_t}(a)$ in each time step $t$, since a strategy which is greedy with respect to possibly imperfect current cost estimates may converge to a sub-optimal super arm. Thompson Sampling is one of several methods developed to address the trade-off between exploration and exploitation in stochastic online learning problems. It has been shown to exhibit good performance in many formulations, e.g., linear contextual bandits and combinatorial semi-bandits.
The steps performed in each time step $t$ by Thompson Sampling, adapted to our setting, are described in Algorithm 1. First, a mean vector $\tilde{\theta}$ is sampled from the current posterior distribution (or from the prior in the first time step). Then, an arm $a_t$ is selected which minimizes the expected cost $f_{\tilde{\theta}}(a_t)$ with respect to the sampled mean vector. These first two steps are equivalent to selecting the arm according to the posterior probability of it being optimal. In combinatorial semi-bandit problems, the method of finding the best super arm according to the sampled parameters is often called an oracle.
When the super arm $a_t$ is played, the environment reveals the feedback $X_{i,t}$ if and only if $i \in a_t$, which is a property called semi-bandit feedback. Finally, these observations are used to update the posterior distribution parameters.

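Algorithm 1 TS for minimax paths (exact)
Input: Prior parameters $\mu_0$, $\varsigma_0$
1: For each base arm, play a super arm which contains it.
2: for $t \leftarrow 1, \ldots, T$ do
3:   for $i \in \mathcal{A}$ do
4:     $\tilde{\theta}_i \leftarrow$ Sample from posterior $\mathcal{N}(\mu_{i,t-1}, \varsigma^2_{i,t-1})$
5:   end for
6:   $a_t \leftarrow \arg\min_{a \in \mathcal{I}} f_{\tilde{\theta}}(a)$
7:   Play arm $a_t$, observe feedback $X_{j,t}$ for $j \in a_t$
8:   Compute $\mu_t$, $\varsigma_t$ with feedback using Eqs. 2 and 3
9: end for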

Regret analysis of Thompson Sampling for minimax paths
We use the technique to analyze the Bayesian regret of Thompson Sampling for general bandit problems introduced by [38] and further elaborated by [42], carefully adapting it to our problem setting. This technique was originally devised to enable convenient conversion of existing UCB regret analyses to Thompson Sampling, but can also be applied to new TS applications. Here, we do a novel extension to combinatorial bandits with minimax super-arm cost functions, which includes establishing concentration properties for the mean estimates of the non-linear super-arm costs. In the rest of this section, we outline the most important steps of the proof of Theorem 1, leaving technical details to the supplementary material (Appendix A). In the analysis, for convenience, we assume that T ≥ d.
We initially define a sequence of upper and lower confidence bounds, $U_t(a)$ and $L_t(a)$, for each time step $t$, where $\hat{\theta}_{i,t}$ is the average feedback of base arm $i \in \mathcal{A}$ until time $t$, $\hat{\theta}_t$ is the average feedback vector for all arms in $\mathcal{A}$, and $N_t(i)$ is the number of times base arm $i \in \mathcal{A}$ has been played as part of a super arm until time $t$.
Lemma 2. For Algorithm 1, we have that:
This Bayesian regret decomposition is a direct application of Proposition 1 of [38]. It utilizes the fact that given the history of selected arms and received feedback until time $t$, the played super arm $a_t$ and the best possible super arm $a^* := \arg\min_{a \in \mathcal{I}} f_{\theta^*}(a)$ are identically distributed under Thompson Sampling. Furthermore, also given the history, $U_t(a)$ and $L_t(a)$ are deterministic functions of the super arm $a$. This enables the decomposition of the regret into terms of the expected confidence width, the expected overestimation of the super arm with least mean cost, and the expected underestimation of the selected super arm. By showing that $f_{\theta^*}(a) \in [L_t(a), U_t(a)]$ with high probability, we can bound the last two of these terms.
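The reasoning behind such a decomposition can be sketched for a single time step as follows. This is our reading of the argument in the cost-minimization setting and not necessarily the exact form in which Lemma 2 is stated; the symbol $H_t$ for the history up to time $t$ is an assumed notation. The second equality uses that $a_t$ and $a^*$ are identically distributed given $H_t$ while $L_t$ is deterministic given $H_t$, so that $\mathbb{E}[L_t(a_t) \mid H_t] = \mathbb{E}[L_t(a^*) \mid H_t]$. The third equality adds and subtracts $U_t(a_t)$.

```latex
\begin{align*}
\mathbb{E}\big[f_{\theta^*}(a_t) - f_{\theta^*}(a^*)\big]
  &= \mathbb{E}\big[f_{\theta^*}(a_t) - L_t(a_t)\big]
   + \mathbb{E}\big[L_t(a_t) - f_{\theta^*}(a^*)\big] \\
  &= \mathbb{E}\big[f_{\theta^*}(a_t) - L_t(a_t)\big]
   + \mathbb{E}\big[L_t(a^*) - f_{\theta^*}(a^*)\big] \\
  &= \underbrace{\mathbb{E}\big[U_t(a_t) - L_t(a_t)\big]}_{\text{confidence width}}
   + \underbrace{\mathbb{E}\big[L_t(a^*) - f_{\theta^*}(a^*)\big]}_{\text{overestimation of } a^*}
   + \underbrace{\mathbb{E}\big[f_{\theta^*}(a_t) - U_t(a_t)\big]}_{\text{underestimation of } a_t}
\end{align*}
```

Summing over $t \in [T]$ gives the three groups of terms discussed in the text.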
T . Both terms are bounded in the same way, for which we need a few intermediary results. Focusing on the underestimation of the played super arm, we can see that:
First, in Lemma 4, the difference between the true mean cost $f_{\theta^*}(a)$ of a super arm $a$ and the corresponding estimated mean $f_{\hat{\theta}_{t-1}}(a)$ is bounded. The resulting upper bound is the maximum of the differences of the true and estimated means of each individual base arm feedback, such that:
Lemma 4. For any super arm $a \in \mathcal{I}$ and time step $t \in [T]$, we have that
$$\left| f_{\theta^*}(a) - f_{\hat{\theta}_{t-1}}(a) \right| \le \max_{i \in a} \left| \theta^*_i - \hat{\theta}_{i,t-1} \right|.$$
This is achieved by decomposing the absolute value into a sum of the positive and negative portions of the difference, then bounding each individually. Focusing on the positive portion by assuming that $f_{\theta^*}(a) \ge f_{\hat{\theta}_{t-1}}(a)$, and letting we can see that:
The negative portion is bounded in the same way, directly leading to the result of Lemma 4. With this result, we can proceed with Lemma 3, where we let $[x]^+ := \max(0, x)$:
The probability in Eq. 6 is of the event that the difference between the estimated and true means of an arm $i$ exceeds the confidence radius $\sqrt{8 \log T / N_{t-1}(i)}$, while Eq. 7 is the expected difference conditional on that event. We bound Eq. 6 with Lemma 5 and Eq. 7 with Lemma 6.
It is now sufficient to show that the difference $\delta_{i,t-1}$ is small for all base arms $i \in \mathcal{A}$ with high probability, which we accomplish using a standard concentration analysis through application of Hoeffding's inequality and union bounds.
Lemma 6. For any $t \in [T]$ and $i \in \mathcal{A}$, we have
Though the rewards are unbounded, this expectation can be bounded by utilizing the fact that the mean of a truncated Gaussian distribution is increasing in the mean of the distribution before truncation, by Theorem 2 of [22]. We can see that:
We know that $\delta_{i,t-1}$ is zero-mean Gaussian with variance at most one, hence $\mathbb{E}\left[\delta_{i,t-1} \mid \delta_{i,t-1} > 0\right] \le 1$.
With the result from Lemma 3, the last two terms of the regret decomposition in Lemma 2 are bounded by constants in T . Focusing on the remaining term, we just need to show that t∈ We note that the final upper bound is tight up to a polylogarithmic factor, according to existing lower bounds for combinatorial semi-bandit problems [28].

Thompson Sampling with approximate objective
Unfortunately, exact expressions for computing the expected maximum of Gaussian random variables only exist when the variables are few. In other words, we cannot compute f θ (a) exactly for a super arm a containing many base arms, necessitating some form of approximation approach. While it is possible to approximate f θ (a) through e.g., Monte Carlo simulations, we want to be able to perform the cost minimization step using a computationally efficient oracle.
We note that, even with the capability to exactly compute f θ (a), it would not be feasible to solve the minimization problem in line 6 of Algorithm 1. The expected cost f θ (a) of a super arm a (i.e., the expected maximum base arm feedback) depends not only on the individual expected values of the base arm feedback distributions, but also on the shape of the joint distribution of all base arms in a. Due to this fact, the stochastic version of the minimization problem lacks the property of optimal substructure (i.e., an optimal path does not necessarily consist of optimal sub-paths).
For the deterministic version of the problem, as defined in Eq. 1, the presence of this property enables the usage of computationally efficient dynamic programming strategies, like Dijkstra's algorithm, which is consequently infeasible with the objective in Algorithm 1.
Therefore, we propose the approximation method outlined in Algorithm 2, where the minimization step of line 6 has been modified from Algorithm 1 with an alternative super arm cost function $\tilde{f}_{\tilde{\theta}}(a) := \max_{i \in a} \tilde{\theta}_i$. Switching objectives, from finding the super arm which minimizes the expected maximum base arm feedback to instead minimizing the maximum expected feedback, has the benefit of allowing us to utilize the efficient deterministic minimax path algorithms introduced earlier for both directed and undirected graphs. For directed graphs, the modified version of Dijkstra's algorithm in [6] has a worst-case running time of $O(|E| + |V| \log |V|)$ with an efficient implementation using Fibonacci heaps [15]. Similarly, for undirected graphs, finding an MST (and subsequently a minimax path) can be achieved using Prim's algorithm [36].
It is possible to use alternative notions of regret to evaluate combinatorial bandit algorithms with approximate oracles [10,9]. For our experimental evaluation of Algorithm 2, we introduce the following definition of approximate regret:
An alternative Bayesian bandit algorithm which can be used with the alternative objective is BayesUCB [25], which we use as a baseline for our experiments. Like Thompson Sampling, BayesUCB has been adapted to combinatorial semi-bandit settings [32,2]. Whereas Thompson Sampling in Algorithm 2 encourages exploration by applying the oracle to parameters sampled from the posterior distribution, with BayesUCB, the oracle is instead applied to optimistic estimates based on the posterior distribution. In practice, this is accomplished for our cost minimization problem by using lower quantiles of the posterior distribution of each base arm. This principle of selecting plausibly optimal arms is called optimism in the face of uncertainty and is the underlying idea of all bandit algorithms based on UCB.
We note that while in BayesUCB, as outlined in Algorithm 1 of [25], the horizon is used to calculate UCB values, the authors of that work also explain that upper quantiles of order 1 − 1/t (calculated without the horizon) achieve good results in practice. For that reason, we use lower quantiles of order 1/t in the version of BayesUCB studied in this work, making it an anytime algorithm, like Thompson Sampling.
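A minimal sketch of such an optimistic index for a single base arm is given below, assuming a Gaussian posterior with positive standard deviation. The function name is hypothetical, and restricting to $t \ge 2$ is our own choice, since the quantile of order $1/t$ is not finite for $t = 1$; how the first step is handled in the experiments is not specified here.

```python
from statistics import NormalDist


def lower_quantile_index(post_mean, post_std, t):
    """Optimistic (low) cost estimate: posterior quantile of order 1/t.

    Assumes t >= 2 and post_std > 0 (illustrative restrictions).
    """
    if t < 2:
        raise ValueError("quantile of order 1/t is only finite for t >= 2")
    return NormalDist(post_mean, post_std).inv_cdf(1.0 / t)


for t in (2, 10, 100):
    print(t, lower_quantile_index(0.0, 1.0, t))
```

The index equals the posterior median at $t = 2$ and decreases as $t$ grows, so arms with wide posteriors keep looking attractive to a cost-minimizing oracle until they have been observed often enough.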
To connect the different objectives in Algorithm 1 and Algorithm 2, we note that by Jensen's inequality, $\tilde{f}_{\theta}(a) \le f_{\theta}(a)$, and that the approximation objective consequently will underestimate super arm costs. However, we establish an upper bound on this difference through Theorem 7.
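The gap between the two objectives can be checked numerically. The sketch below is only an illustration with hypothetical function names: it estimates the exact cost (expected maximum) by Monte Carlo simulation and compares it with the approximate cost (maximum of the means) and with the $\sqrt{2 \log d}$ bound of Theorem 7, where for this toy example we take $d$ to be the number of arms in the super arm.

```python
import math
import random


def approximate_cost(means):
    """Maximum of the expected feedback over the arms in a super arm."""
    return max(means)


def monte_carlo_exact_cost(means, stds, num_samples=20000, seed=0):
    """Monte Carlo estimate of the expected maximum Gaussian feedback."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += max(rng.gauss(m, s) for m, s in zip(means, stds))
    return total / num_samples


means, stds = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
approx = approximate_cost(means)
exact_estimate = monte_carlo_exact_cost(means, stds)
bound = math.sqrt(2.0 * math.log(len(means)))
print(approx, exact_estimate, approx + bound)
```

For three independent standard Gaussians the estimate should land well above the approximate cost of 0 and well below the upper limit of roughly 1.48.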
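Algorithm 2 TS for minimax paths (approximate)
Input: Prior parameters $\mu_0$, $\varsigma_0$
1: For each base arm, play a super arm which contains it.
2: for $t \leftarrow 1, \ldots, T$ do
3:   for $i \in \mathcal{A}$ do
4:     $\tilde{\theta}_i \leftarrow$ Sample from posterior $\mathcal{N}(\mu_{i,t-1}, \varsigma^2_{i,t-1})$
5:   end for
6:   $a_t \leftarrow \arg\min_{a \in \mathcal{I}} \max_{i \in a} \tilde{\theta}_i$
7:   Play arm $a_t$, observe feedback $X_{j,t}$ for $j \in a_t$
8:   Compute $\mu_t$, $\varsigma_t$ with feedback using Eqs. 2 and 3
9: end for

Theorem 7. Given the optimal super arm $a^*$ for Algorithm 1 and the optimal super arm $\tilde{a}^*$ for Algorithm 2, we have that $f_{\theta^*}(\tilde{a}^*) - f_{\theta^*}(a^*) \le \sqrt{2 \log d}$.
Proof. For any super arm $a \in \mathcal{I}$, let $Y_i$ for $i \in a$ be Gaussian random variables with $Y_i \sim \mathcal{N}(\theta^*_i, \sigma^2_i)$. Furthermore, let $W_i := Y_i - \theta^*_i$, such that $W_i \sim \mathcal{N}(0, \sigma^2_i)$. Then, the following holds:
$$f_{\theta^*}(a) = \mathbb{E}\left[\max_{i \in a} Y_i\right] = \mathbb{E}\left[\max_{i \in a} (\theta^*_i + W_i)\right] \le \max_{i \in a} \theta^*_i + \mathbb{E}\left[\max_{i \in a} W_i\right] \le \tilde{f}_{\theta^*}(a) + \sqrt{2 \log d},$$
where the last inequality is due to Lemma 9 in [34] and since $\sigma^2_i \le 1$ for all $i \in a$. We also note that, by Jensen's inequality, $\tilde{f}_{\theta^*}(a) \le f_{\theta^*}(a)$. Hence, we can conclude that
$$f_{\theta^*}(\tilde{a}^*) - f_{\theta^*}(a^*) \le \tilde{f}_{\theta^*}(\tilde{a}^*) + \sqrt{2 \log d} - f_{\theta^*}(a^*) \le \tilde{f}_{\theta^*}(a^*) + \sqrt{2 \log d} - f_{\theta^*}(a^*) \le \sqrt{2 \log d}.$$
In other words, Theorem 7 holds and the optimal solutions of the exact Algorithm 1 and the approximate Algorithm 2 differ by at most $\sqrt{2 \log d}$. This bound is independent of the mean vector $\theta^*$, depending only on the number of base arms and that the variance is bounded.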
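To make the approximate procedure concrete, the following is a compact sketch of Thompson Sampling with the approximate objective on a small directed graph. It is an illustration under several assumptions of ours, not the simulation framework used in the experiments: all function and parameter names are hypothetical, the initialization step (line 1 of the algorithm) is omitted, the target is assumed reachable from the source, and the oracle is a heap-based minimax variant of Dijkstra's algorithm.

```python
import heapq
import random


def minimax_path(edges, weights, source, target):
    """Oracle: edge indices of a minimax path (target assumed reachable)."""
    out = {}
    for idx, (u, v) in enumerate(edges):
        out.setdefault(u, []).append((v, idx))
    best = {source: float("-inf")}
    prev = {}  # vertex -> (previous vertex, index of the edge used)
    heap = [(float("-inf"), source)]
    done = set()
    while heap:
        bottleneck, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, idx in out.get(u, []):
            candidate = max(bottleneck, weights[idx])
            if candidate < best.get(v, float("inf")):
                best[v] = candidate
                prev[v] = (u, idx)
                heapq.heappush(heap, (candidate, v))
    path, node = [], target
    while node != source:
        node, idx = prev[node]
        path.append(idx)
    return path[::-1]


def thompson_sampling(edges, source, target, true_means, noise_std,
                      prior_mu, prior_std, horizon, seed=0):
    """Run approximate TS; returns posterior means, variances, play counts."""
    rng = random.Random(seed)
    mu = list(prior_mu)
    var = [s * s for s in prior_std]
    noise_var = noise_std ** 2
    counts = [0] * len(edges)
    for _ in range(horizon):
        # Sample a mean for every edge from its current posterior.
        sampled = [rng.gauss(m, v ** 0.5) for m, v in zip(mu, var)]
        # Minimize the maximum sampled mean along a path.
        path = minimax_path(edges, sampled, source, target)
        for i in path:
            x = rng.gauss(true_means[i], noise_std)  # semi-bandit feedback
            total = var[i] + noise_var
            mu[i] = (var[i] * x + noise_var * mu[i]) / total
            var[i] = var[i] * noise_var / total
            counts[i] += 1
    return mu, var, counts


edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")]
mu, var, counts = thompson_sampling(
    edges, "s", "t", true_means=[1.0, 1.0, 3.0, 0.0], noise_std=0.4,
    prior_mu=[0.0] * 4, prior_std=[1.0] * 4, horizon=500)
print(counts)
```

In this toy instance the route through `a` has the smaller bottleneck mean, so its edges should accumulate most of the plays once the edge from `s` to `b` has been tried a few times.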

Experimental Results
In this section, we conduct bottleneck identification experiments using Algorithm 2 for two real-world applications, i) road (transport) networks, and ii) collaboration (social) networks. These experiments are performed with an extended version of the simulation framework in [39] and evaluated using our approximate definition of regret. In addition, we compare Algorithm 1 to Algorithm 2 through a toy example.

Road networks
A bottleneck in a network is a segment of a path in the network that obstructs or stops flow. Identification of bottlenecks in a road network is a vital tool for traffic planners to analyze the network and prevent congestion. In this application, our goal is to find the bottleneck between a source and a target, i.e., a road segment which is necessary to pass and also has minimal traffic flow. In the road network model, we let the nodes represent intersections and the directed edges represent road segments, with travel time divided by distance (seconds per meter) as edge weights. The bottleneck between a pair of intersections is the minimum bottleneck over all paths connecting them, where the bottleneck for each of these paths is the largest weight over all road segments along it. Note that in order for the bottleneck between a pair of intersections to have a meaning, there needs to exist at least one path connecting them.
We collect road networks of four cities, shown in Table 1, from [33], where the average travel time as well as the distance is provided for each (directed) edge. We simulate an environment with the stochastic edge weights sampled from $w_e \sim \mathcal{N}(\theta^*_e, \sigma^2_e)$, where the observation noise is $\sigma_e = 0.4$. For the experiments, the environment samples the true unknown mean $\theta^*_e$ from the known prior $\theta^*_e \sim \mathcal{N}(\mu_{e,0}, \varsigma^2_{e,0})$, where $\varsigma_{e,0} = 0.4$ s/m, and $\mu_{e,0}$ is the average travel time divided by distance provided by OpenStreetMap (OSM).
We consider one greedy agent (GR) and two $\epsilon_t$-greedy agents (e-GR) as baselines. The greedy agent (GR) always chooses the path with the lowest current estimate of expected cost. In each time step, each e-GR agent chooses a random path with probability $\epsilon_t$, which decreases with $t$ (specifically, we let $\epsilon_t = \min(1, 1/\sqrt{t})$), and acts like the greedy agent otherwise. In our experiments, we implement the two e-GR agents based on the combinatorial version of $\epsilon_t$-greedy introduced in Algorithm 1 in the Supplementary Material of [10]. The first e-GR agent chooses a path between the source and the target containing a uniformly chosen random node (e-GR-N), and the second e-GR agent chooses a path with a uniformly selected random edge (e-GR-E). We evaluate how the performance of the Thompson Sampling agent (TS) and the BayesUCB agent (B-UCB) compares to the baselines. We run the simulations with all five agents for each road network and report the cumulative regret at a given horizon $T$, averaged over five repetitions. The horizon is chosen such that the instant regret is almost stabilized for the agents.
Table 2 shows the average cumulative regrets and their corresponding standard error over five runs at the horizon $T$. For all four road networks, the TS agent incurs the lowest average cumulative regret and standard error over five runs. Then, B-UCB follows TS and yields a better result than the baselines (GR and both e-GR variants).
Figure 1 illustrates the average cumulative regret with standard error (SE) bars on the road networks of the four aforementioned cities. For Eindhoven, Figure 1a shows the average cumulative regret, where at horizon $T = 6000$ the TS agent yields the lowest cumulative regret. Then, B-UCB follows TS and achieves a better result compared to the other baselines. As time progresses, we can see that first TS and then B-UCB start saturating by performing sufficient exploration. With respect to the SE bars, there are differences between the five agents, with the TS agent having the smallest SE bars. Figure 1b visualizes the Eindhoven road network, where the paths explored by the TS agent are shown in red. The road segments explored (tried) more often by the TS agent are displayed more opaque. Figures 1c, 1e, and 1g show the average cumulative regret with SE bars for Manhattan, Oslo, and Salzburg, respectively. The results show that TS incurs the lowest cumulative regret and smallest SE bars. Then, B-UCB follows TS in both aspects and obtains a better result than the other baselines.

Collaboration network
We consider a collaboration network from computational geometry (Geom) [24] as an application of our approach to social networks. More specifically, we use the version provided by [19] and distributed among the Pajek datasets [4] where certain author duplicates, occurring in minor or major name variations, have been merged. The [19] version is based on the BibTeX bibliography [5], to which the database from [24] has been exported. The network has 9072 vertices representing the authors and 22577 edges with the edge weights representing the number of mutual works between a pair of authors.
We simulate an environment where each edge weight is sampled as $w_e \sim \mathcal{N}(\theta^*_e, \sigma^2_e)$, within which $\theta^*_e$ is regarded as the true (negative) mean number of shared publications between a pair of authors linked by the edge $e$, and the observation noise is $\sigma_e = 5$. Furthermore, in this experiment, while the true negative mean number of mutual publications is assumed (by the agent) to be distributed according to the prior $\theta^*_e \sim \mathcal{N}(\mu_{e,0}, \varsigma^2_{e,0})$ with $\varsigma_{e,0} = 10$, we instead generate the mean from a wider prior $\theta^*_e \sim \mathcal{N}(\mu_{e,0}, 20^2)$, simulating a scenario where the agent's confidence in its prior belief is too high. The assumed mean $\mu_{e,0}$ of the prior is however consistent with the distribution from which $\theta^*_e$ is sampled, and is directly determined by the pairwise negative number of mutual collaborations from the dataset in [19]. Figure 2 shows the cumulative regret, averaged over five runs for the different agents with horizon $T = 2000$, again chosen such that the regret is stabilized for all agents. One can see that the TS agent reaches the lowest cumulative regret, similar to the experimental studies on road networks.

Exact objective toy example
While it is not feasible to evaluate Algorithm 1 on graphs representing real-life transportation or social networks, it is possible for small synthetic graphs. We construct a graph consisting of 6 nodes and 10 edges, with the source and target nodes connected by four paths of length 2 and four paths of length 3. For each edge $e$, we sample the mean from a standard Gaussian prior, such that $\theta^*_e \sim \mathcal{N}(0, 1)$. The stochastic weights are then generated in each time step $t$ such that $w_{e,t} \sim \mathcal{N}(\theta^*_e, 1)$.
In order to calculate the expected cost of each path, we use existing exact expressions for the expected maximum of two [11] and three [31,12] independent Gaussian random variables. Instead of using an oracle, we simply enumerate the paths to find the one with minimum expected cost.
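For the two-variable case, a well-known closed form for the expected maximum of two independent Gaussians can be sketched as below. We assume this is the kind of expression referred to above; the function name is hypothetical, and the three-variable case is not covered. With $a = \sqrt{\sigma_1^2 + \sigma_2^2}$ and $\alpha = (\mu_1 - \mu_2)/a$, the expected maximum is $\mu_1 \Phi(\alpha) + \mu_2 \Phi(-\alpha) + a \varphi(\alpha)$, where $\Phi$ and $\varphi$ are the standard Gaussian distribution and density functions.

```python
import math
from statistics import NormalDist


def expected_max_two_gaussians(mu1, var1, mu2, var2):
    """Expected maximum of two independent Gaussian random variables."""
    a = math.sqrt(var1 + var2)
    if a == 0.0:
        return max(mu1, mu2)  # both variables are deterministic
    alpha = (mu1 - mu2) / a
    std_normal = NormalDist()
    return (mu1 * std_normal.cdf(alpha)
            + mu2 * std_normal.cdf(-alpha)
            + a * std_normal.pdf(alpha))


# Two independent standard Gaussians: the expected maximum is 1/sqrt(pi).
value = expected_max_two_gaussians(0.0, 1.0, 0.0, 1.0)
print(value, 1.0 / math.sqrt(math.pi))
```

When one mean is far above the other, the formula reduces to approximately the larger mean, which matches the intuition that the maximum is then almost always attained by that variable.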
In Figure 3, we compare Algorithm 1 (TS with exact objective) and Algorithm 2 (TS with approximate objective) using the exact notion of (cumulative) regret as defined in Eq. 4. Furthermore, we include a greedy baseline which also uses the exact objective. We use a horizon of T = 10000 and average the results over 20 experiments, wherein each algorithm is applied to a problem instance sampled from the prior.
We can see that the regret of exact TS quickly saturates, while approximate TS and the greedy method tend to end up in sub-optimal solutions. For approximate TS, this is to be expected since optimal arms for the exact and approximate problems may be different. It is worth noting, however, that approximate TS performs better than the exact greedy method on average.

Conclusion
We developed an online learning framework for bottleneck identification in networks via minimax paths. In particular, we modeled this task as a combinatorial semi-bandit problem for which we proposed a combinatorial version of Thompson Sampling. We then established an upper bound on the Bayesian regret of the Thompson Sampling method.
To deal with the computational intractability of the problem, we devised an alternative problem formulation which approximates the original objective. Finally, we investigated the framework on several directed and undirected real-world networks from transport and collaboration domains. Our experimental results demonstrate its effectiveness compared to alternatives such as greedy and B-UCB methods.
Proof. By Proposition 1 in [38], we can decompose the Bayesian regret of the algorithm in the following way: Proof.
T is done in the same way.
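Lemma 4. For any super arm $a \in \mathcal{I}$ and time step $t \in [T]$, we have that
$$\left| f_{\theta^*}(a) - f_{\hat{\theta}_{t-1}}(a) \right| \le \max_{i \in a} \left| \theta^*_i - \hat{\theta}_{i,t-1} \right|.$$
Proof. We define $v_{i,m}$ as the average feedback of base arm $i$ for the first $m$ times it has been played as part of a super arm, i.e., such that $\hat{\theta}_{i,t} = v_{i,N_t(i)}$. Then:
Proof. We know that the average feedback $\hat{\theta}_{i,t-1}$ is Gaussian with $\mathbb{E}[\hat{\theta}_{i,t-1}] = \theta^*_i$ and variance $\le 1$. Let $Z := \theta^*_i - \hat{\theta}_{i,t-1}$. Then, $Z$ is Gaussian with mean 0 and variance $\le 1$. The following holds:
We notice that $Z - \sqrt{8 \log T / N_{t-1}(i)}$ is Gaussian with mean $-\sqrt{8 \log T / N_{t-1}(i)}$, where $-\sqrt{8 \log T / N_{t-1}(i)} < 0$. Furthermore (see e.g., Theorem 2 in [22]), the expected value after truncation is increasing in $-\sqrt{8 \log T / N_{t-1}(i)}$. Hence, since $Z$ is Gaussian with mean 0 and variance $\le 1$, the expectation is at most $\varphi(0)/(1 - \Phi(0)) \le 1$.
Theorem 7. Given the optimal super arm $a^*$ for Algorithm 1 and the optimal super arm $\tilde{a}^*$ for Algorithm 2, we have that $f_{\theta^*}(\tilde{a}^*) - f_{\theta^*}(a^*) \le \sqrt{2 \log d}$.
Proof. For any super arm $a \in \mathcal{I}$, let $Y_i$ for $i \in a$ be Gaussian random variables with $Y_i \sim \mathcal{N}(\theta^*_i, \sigma^2_i)$. Let $W_i := Y_i - \theta^*_i$, such that $W_i \sim \mathcal{N}(0, \sigma^2_i)$. Then, the following holds:
$$f_{\theta^*}(a) = \mathbb{E}\left[\max_{i \in a} (\theta^*_i + W_i)\right] \le \max_{i \in a} \theta^*_i + \mathbb{E}\left[\max_{i \in a} W_i\right] \le \tilde{f}_{\theta^*}(a) + \sqrt{2 \log d}.$$
Hence, we conclude
$$f_{\theta^*}(\tilde{a}^*) - f_{\theta^*}(a^*) \le \tilde{f}_{\theta^*}(\tilde{a}^*) + \sqrt{2 \log d} - f_{\theta^*}(a^*) \le \tilde{f}_{\theta^*}(a^*) + \sqrt{2 \log d} - f_{\theta^*}(a^*) \le \sqrt{2 \log d}.$$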