On testing transitivity in online preference learning

The efficiency of state-of-the-art algorithms for the dueling bandits problem is essentially due to a clever exploitation of (stochastic) transitivity properties of pairwise comparisons: If one arm is likely to beat a second one, which in turn is likely to beat a third one, then the first is also likely to beat the third one. So far, however, there has been no way to test the validity of the corresponding assumptions, although this would be a key prerequisite for guaranteeing the meaningfulness of the results produced by an algorithm. In this paper, we investigate the problem of testing different forms of stochastic transitivity in an online manner. We derive lower bounds on the expected sample complexity of any sequential hypothesis testing algorithm for various forms of stochastic transitivity, thereby providing additional motivation to focus on weak stochastic transitivity. Accordingly, we introduce an algorithmic framework for the dueling bandits problem, in which the statistical validity of weak stochastic transitivity can be tested, either actively or passively, based on a multiple binomial hypothesis test. Moreover, by exploiting a connection between weak stochastic transitivity and graph theory, we suggest an enhancement to further improve the efficiency of the testing algorithm. In the active setting, both variants achieve an expected sample complexity that is optimal up to a logarithmic factor.


Introduction
The setting of dueling bandits (Yue and Joachims 2009; Sui et al. 2018; Bengs et al. 2021) is a variant of the standard multi-armed bandit (MAB) problem, in which the learner is allowed to compare pairs of choice alternatives (arms) in a sequential manner. Thus, instead of repeatedly pulling an arm and observing a numerical reward, the learner pulls two arms and observes the winner of the corresponding duel. As in the standard MAB problem, this feedback is assumed to be stochastic. A typical task of the learner is to find the "best" arm as quickly as possible, or, more generally, to identify a complete ranking of all arms. There is a variety of practically relevant applications for this learning scenario, such as ranking Xbox gamers according to duel outcomes (Guo et al. 2012) or rating different objects based on pairwise preferences of users, which can nowadays be gathered quite conveniently by means of crowdsourcing services such as Amazon Mechanical Turk (Chen et al. 2013).
Relaxed assumptions of transitivity, especially different types of stochastic transitivity (Fishburn 1973), play an important role in this regard: If arm a is likely to be preferred over arm b, and b is likely to be preferred over arm c, then a is also likely to be preferred over c. Assumptions of that kind are important for several reasons. First, they assure that the learning task itself is actually well defined, for example that a naturally "best" arm actually exists. Second, they underlie the design of efficient learning algorithms, which exploit generalized transitivity to reduce sample complexity (Yue and Joachims 2011; Mohajer et al. 2017; Falahatgar et al. 2018). This is comparable to how standard sorting algorithms avoid the comparison of all pairs of items and achieve a complexity of $O(n \log n)$ instead of $O(n^2)$.
Somewhat surprisingly, the problem of testing the validity of transitivity assumptions underlying various algorithms has not been considered so far. Needless to say, this would be important to guarantee the meaningfulness of the results produced by such algorithms. In fact, if the assumptions made by an algorithm are violated by the data-generating process in a concrete application, then neither its predictions nor any of its guarantees can be trusted anymore. In this paper, we therefore propose a method for testing an important form of transitivity, namely weak stochastic transitivity (WST), in an online manner. Being the weakest type of stochastic transitivity, WST is quite natural to start with. Moreover, weak stochastic transitivity of pairwise preferences (winning probabilities) is a necessary and sufficient condition for the existence of a complete ranking (strict total ordering) of all arms that is consistent with all pairwise preferences.
More specifically, we introduce an algorithmic framework consisting of two main components, namely an active sampling strategy and a sequential test. In this way, the algorithmic framework covers two conceivable scenarios of online hypothesis testing:
- The passive online testing scenario, where the sampling strategy is any dueling bandits algorithm based on a transitivity assumption, and the test component (passively) monitors the statistical validity of the transitivity assumption made by the dueling bandits algorithm. In other words, the learning and testing components work in parallel, independently of each other.
- The active online testing scenario, in which the sampling strategy is specifically constructed to support the test component, i.e., to make a test decision as quickly as possible.
In this paper, we introduce the problem of testing different types of stochastic transitivity within a dueling bandits problem (Sect. 4), both with and without the so-called low noise assumption (Korba et al. 2017). We prove that the expected sample complexity for testing different types of stochastic transitivity stronger than WST in an online manner is infinite in the worst case. These results provide an additional theoretical motivation for focusing on WST, as it is the only type of stochastic transitivity that admits finite expected sample complexity for online testing, which can be inferred by an appropriate reduction to the setting of pure exploration bandits with multiple correct answers introduced by Degenne and Koolen (2019) (Sect. 5). We improve upon the corresponding asymptotic lower bounds on the expected sample complexity for testing WST from the latter reduction by providing instance-wise lower bounds for fixed confidence levels (Sect. 6). For the passive online testing scenario, we construct a test component based on multiple binomial hypothesis tests, for which we show consistency in terms of almost sure termination time and reliability in terms of maintained error bounds under mild assumptions on the sampling strategy. For the active online testing variant, we provide a sampling strategy such that the expected sample complexity of the latter test is optimal up to a logarithmic term (Sect. 7). Moreover, by exploiting a connection between WST and graph theory, we suggest an enhancement to further improve the efficiency of the testing algorithm (Sect. 8). The superiority of this variant in the passive setting is illustrated by an empirical evaluation (Sect. 9). The paper starts with a brief account of related work (Sect. 2), followed by a refresher of the dueling bandits problem as well as different types of stochastic transitivity (Sect. 3). Detailed proofs of theoretical results are provided in the supplementary material.

Related work
The dueling bandits problem was studied under strong stochastic transitivity in (Yue et al. 2012) and relaxed stochastic transitivity in (Yue and Joachims 2011), in both cases with the goal of regret minimization. In these works, the transitivity assumption is explicitly required for the theoretical guarantees. In other approaches, transitivity properties are assumed in a more indirect way, for example through probabilistic models of the feedback process. This includes the Plackett-Luce model (Luce 1959; Plackett 1975) resp. Bradley-Terry model (Bradley and Terry 1952) considered in (Szörényi et al. 2015) resp. (Maystre and Grossglauser 2017), as well as the Mallows model (Mallows 1957) studied in (Busa-Fekete et al. 2014). Mohajer et al. (2017) consider the goals of finding the best arm as well as the (top-k-)ranking of arms under WST, while Falahatgar et al. (2017a, 2017b, 2018) investigate the impact of various transitivity assumptions on these goals in an online PAC framework. Finally, transitivity assumptions were also analyzed in batch learning scenarios, for example to estimate the underlying pairwise preference relation (Shah et al. 2016), or for the purpose of rank aggregation (Korba et al. 2017).
The literature on testing transitivity conditions is primarily rooted in the social sciences, psychology, and economics, with a special focus on experimental studies for real data. The only mathematical treatment we found is (Iverson and Falmagne 1985), where the authors provide an asymptotic likelihood-ratio test for WST. The use of Bayes factors for testing stochastic transitivity is proposed in (Cavagnaro and Davis-Stober 2014). In (McNamara and Diwadkar 1997) and (Waite 2001), multiple binomial tests are conducted to test WST of preferences in different field studies. From a methodological point of view, this is closest to the sequential testing approach put forward in this paper. Yet, all these works are set in classical hypothesis testing, assuming all the data to be available beforehand. In contrast to this, the focus of this paper is on hypothesis testing in an online setting, where data arrives sequentially, and test decisions should be taken as quickly as possible while maintaining a predefined level of confidence.
As already mentioned in the introduction, the problem of testing stochastic transitivity in an online manner can be tackled by a suitable reduction to the pure exploration bandits with multiple correct answers introduced by Degenne and Koolen (2019), which will be discussed more thoroughly in Sect. 5.

Theoretical background
In this section, we concisely recall the main theoretical foundations needed throughout the paper. In the supplementary material, we provide a list of symbols used in the paper for the sake of convenience.

Dueling bandits
Consider a finite set of $m$ arms identified by the index set $[m] := \{1, \dots, m\}$. In the setting of the dueling bandits problem, two distinct arms $i, j \in [m]$ can be compared with each other at each time step $t \in \mathbb{N}$. Querying a pairwise preference, the learner is provided with binary feedback about the winner of the duel, which is assumed to be generated by a time-stationary iid probabilistic process. The probability $\mathbb{P}(i \succ j)$ that arm $i$ wins against arm $j$ is given by some underlying (unknown) ground-truth parameter $q_{i,j} \in [0, 1]$. We suppose that ties are not possible. Thus (assuming w.l.o.g. $q_{i,i} = 1/2$ for every $i \in [m]$), we can infer that $\mathbf{Q} = (q_{i,j})_{1 \le i,j \le m}$ is a reciprocal relation on $[m]$, i.e., $\mathbf{Q}$ is an element of
$$\mathcal{Q}_m := \{ \mathbf{Q} = (q_{i,j})_{1 \le i,j \le m} \in [0,1]^{m \times m} : q_{i,j} + q_{j,i} = 1 \text{ for all } i, j \in [m] \}.$$
To record the information available at time $t \in \mathbb{N}$, let us write $(n_t)_{i,j}$ for the number of comparisons between $i$ and $j$ until time $t$, and $(w_t)_{i,j}$ for the number of times $i$ has won against $j$ until time $t$. This implies $(w_t)_{i,j} + (w_t)_{j,i} = (n_t)_{i,j}$ and $(n_t)_{i,j} = (n_t)_{j,i}$. Then, $n_t = ((n_t)_{i,j})_{1 \le i,j \le m}$ is a symmetric integer-valued matrix with zeros on its diagonal. If $w \in \mathbb{N}_0^{m \times m}$ and $n \in \mathbb{N}_0^{m \times m}$, we denote the matrix (
Definition 3.1 A sampling strategy $\pi$ is a family of random mappings, which, depending on the time $t$ and the observations $n_0, w_0, \dots, n_{t-1}, w_{t-1}$ available before time $t$, determines the two distinct arms $i(t), j(t) \in [m]$ that are to be compared at time $t \in \mathbb{N}$. Let $\Pi$ denote the set of all sampling strategies, while $\Pi_\infty$ denotes the family of sampling strategies that sample every pair $\{i, j\}$ almost surely (a.s.) infinitely often, which means that $(n_t)_{i,j} \to \infty$ a.s. as $t \to \infty$.
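To make the bookkeeping of comparison and win counts concrete, the following minimal Python sketch simulates a dueling bandits environment. The class name `DuelingBanditEnv`, its methods, and the zero-based arm indices are hypothetical choices for illustration and are not part of the framework itself.

```python
import random


class DuelingBanditEnv:
    """Simulates duels for a reciprocal relation q (hypothetical helper)."""

    def __init__(self, q, seed=0):
        m = len(q)
        # Reciprocity: q[i][j] + q[j][i] must equal 1 for all i, j.
        for i in range(m):
            for j in range(m):
                assert abs(q[i][j] + q[j][i] - 1.0) < 1e-12
        self.q = q
        self.m = m
        self.rng = random.Random(seed)
        # n[i][j]: number of duels between i and j; w[i][j]: wins of i over j.
        self.n = [[0] * m for _ in range(m)]
        self.w = [[0] * m for _ in range(m)]

    def duel(self, i, j):
        """Compare two distinct arms and return True iff arm i wins."""
        assert i != j
        i_wins = self.rng.random() < self.q[i][j]
        self.n[i][j] += 1
        self.n[j][i] += 1
        if i_wins:
            self.w[i][j] += 1
        else:
            self.w[j][i] += 1
        return i_wins
```

After any number of calls to `duel`, the matrix `n` is symmetric with zeros on its diagonal, and `w[i][j] + w[j][i] == n[i][j]` holds for every pair, mirroring the identities stated above.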
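Note that if $\pi \in \Pi \setminus \Pi_\infty$, then a sampling strategy $\hat{\pi} \in \Pi$ that chooses the same pair as $\pi$ in each time step $t$ with probability $1 - 1/t$, and otherwise (i.e., with probability $1/t$) picks a pair $\{i, j\}$ uniformly at random from $[m]_2$, fulfills $\hat{\pi} \in \Pi_\infty$, while the probability that $\hat{\pi}$ and $\pi$ choose different pairs at time $t$ is at most $1/t$ and hence vanishes as $t \to \infty$. Thus, $\hat{\pi}$ and $\pi$ behave similarly in the limit. This shows that the assumption $\pi \in \Pi_\infty$, which is required for theoretical results in our framework, is rather mild.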
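The construction of the modified strategy can be sketched in a few lines of Python. As a simplifying assumption, the base strategy is modelled here as a function of the time step only, whereas a real sampling strategy may also depend on past observations; the name `mixed_strategy` is hypothetical.

```python
import random
from itertools import combinations


def mixed_strategy(base_strategy, m, seed=0):
    """Wrap base_strategy(t) -> (i, j) so that every pair keeps being explored."""
    rng = random.Random(seed)
    pairs = list(combinations(range(m), 2))

    def pi_hat(t):
        # With probability 1/t pick a uniformly random pair, else follow the base strategy.
        if rng.random() < 1.0 / t:
            return rng.choice(pairs)
        return base_strategy(t)

    return pi_hat
```

Time steps are assumed to start at t = 1. Since the harmonic series diverges, the independent exploration events occur infinitely often almost surely, so every pair is sampled infinitely often, while the exploration probability 1/t still tends to zero.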

Stochastic transitivity
Different types of stochastic transitivity have been used in the realm of dueling bandits problems (Bengs et al. 2021), mainly because they provide a certain degree of regularity of the reciprocal relations in $\mathcal{Q}_m$, and thereby facilitate learning. In particular, the following transitivities are commonly considered in the literature. The set consisting of all reciprocal relations that are stochastically transitive of a certain type XST is denoted by $\mathcal{Q}_m(\mathrm{XST})$, and we write $\mathcal{Q}_m(\neg\mathrm{XST}) := \mathcal{Q}_m \setminus \mathcal{Q}_m(\mathrm{XST})$. The following relationships hold between the different types of stochastic transitivities:
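For orientation, the following Python sketch checks three transitivity types on a given reciprocal relation. It assumes the textbook formulations from the literature: whenever $q_{i,j} \ge 1/2$ and $q_{j,k} \ge 1/2$, WST requires $q_{i,k} \ge 1/2$, MST requires $q_{i,k} \ge \min\{q_{i,j}, q_{j,k}\}$, and SST requires $q_{i,k} \ge \max\{q_{i,j}, q_{j,k}\}$. These formulations and the function name `satisfies` are assumptions made for this illustration; relaxed stochastic transitivity is omitted because its parameterization is not fixed here.

```python
from itertools import permutations


def satisfies(q, kind):
    """Check a stochastic transitivity type on a reciprocal matrix q."""
    bound = {
        "WST": lambda a, b: 0.5,  # weak: q[i][k] >= 1/2
        "MST": min,               # moderate: q[i][k] >= min(q[i][j], q[j][k])
        "SST": max,               # strong: q[i][k] >= max(q[i][j], q[j][k])
    }[kind]
    m = len(q)
    for i, j, k in permutations(range(m), 3):
        if q[i][j] >= 0.5 and q[j][k] >= 0.5 and q[i][k] < bound(q[i][j], q[j][k]):
            return False
    return True
```

Under these assumed definitions, SST implies MST, which in turn implies WST, because the required lower bound on $q_{i,k}$ only decreases from max to min to 1/2.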

Violations of WST
To illustrate the issues that may arise in case of a violation of the WST assumption, and to highlight the importance of testing such assumptions, consider algorithms that are based on the idea of (noisy) sorting (Szörényi et al. 2015; Mohajer et al. 2017). Roughly speaking, the active sampling strategies underlying such algorithms mimic the behavior of sorting algorithms, such as merge sort or quicksort, with the main difference that, due to the assumed stochasticity, deciding the order between two arms may require repeated comparisons. Obviously, weak stochastic transitivity is the minimal assumption required by such algorithms. On the other hand, it is easy to see that a sorting-based algorithm will always return a complete ranking (with high confidence), regardless of whether the underlying relation contains preferential cycles or not. Yet, this ranking will strongly depend on the order in which the arms are compared, and will hence be more or less random and therefore meaningless.
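The order dependence can be reproduced with a small, noise-free Python sketch. Insertion sort stands in for the sorting-based algorithms mentioned above (an assumption made for brevity), and the comparator simply reports which arm has winning probability above 1/2.

```python
def insertion_sort_ranking(arms, beats):
    """Rank arms best-first using a (noise-free) pairwise comparator."""
    ranking = []
    for a in arms:
        pos = 0
        # Move past every already-ranked arm that beats the new arm.
        while pos < len(ranking) and beats(ranking[pos], a):
            pos += 1
        ranking.insert(pos, a)
    return ranking


# Cyclic preferences: 0 beats 1, 1 beats 2, 2 beats 0 (a violation of WST).
q = [[0.5, 0.9, 0.1], [0.1, 0.5, 0.9], [0.9, 0.1, 0.5]]
beats = lambda i, j: q[i][j] > 0.5

print(insertion_sort_ranking([0, 1, 2], beats))  # [2, 0, 1]
print(insertion_sort_ranking([1, 2, 0], beats))  # [0, 1, 2]
```

Both calls return a complete ranking without any indication of a problem, yet the two rankings differ solely because the arms were presented in a different order.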

Online transitivity testing
We focus on the following testing problem in the context of an underlying dueling bandits problem:
$$H_0: \mathbf{Q} \text{ satisfies XST} \qquad \text{versus} \qquad H_1: \mathbf{Q} \text{ does not satisfy XST}, \tag{1}$$
where $\mathrm{XST} \in \{\mathrm{WST}, \mathrm{MST}, \nu\text{-}\mathrm{RST}, \mathrm{SST}\}$. This test shall be conducted for different types of transitivity in an online manner.
Thus, it is natural to consider sequential hypothesis tests, in which a test decision can be provided at any time during the data-generating process. The particular choice of the null hypothesis is motivated by the passive scenario, in which a learning algorithm assumes XST to be fulfilled and the test shall detect a possible violation thereof. As we focus on tests with guarantees on both the type I and the type II error, it is possible to swap $H_0$ and $H_1$ and still obtain qualitatively the same theoretical results as below.
In the course of the paper, we focus on algorithms $A$ for the testing problem, which might be probabilistic and interact with the underlying dueling bandits environment, as stipulated by the definition of a sampling strategy (Definition 3.1). In case an algorithm $A$ terminates, it returns a decision $D(A) \in \{\mathrm{XST}, \neg\mathrm{XST}\}$, where $D(A) = \mathrm{XST}$ resp. $D(A) = \neg\mathrm{XST}$ indicates that $A$ predicts that XST holds resp. is violated. Moreover, we denote by $T_A$ the sample complexity of an algorithm $A$, i.e., the number of pairwise comparisons $A$ has made before termination.
For our theoretical analysis of the testing problem, we will consider the following set of relations:
$$\mathcal{Q}_m^h := \{ \mathbf{Q} \in \mathcal{Q}_m : |q_{i,j} - 1/2| > h \text{ for all distinct } i, j \in [m] \},$$
where $h \in [0, 1/2)$. In case $h > 0$, the relations in $\mathcal{Q}_m^h$ are said to satisfy the low noise assumption (Korba et al. 2017). Here, the parameter $h$ determines to some extent the complexity of the testing problem: For instance, the larger $h$, the easier it becomes to determine the sign of $q_{i,j} - 1/2$, which in turn facilitates checking WST. For $\mathrm{XST} \in \{\mathrm{WST}, \mathrm{MST}, \mathrm{SST}, \nu\text{-}\mathrm{RST}\}$ and any $h \in [0, 1/2)$, we define $\mathcal{Q}_m^h(\mathrm{XST}) := \mathcal{Q}_m^h \cap \mathcal{Q}_m(\mathrm{XST})$ and $\mathcal{Q}_m^h(\neg\mathrm{XST}) := \mathcal{Q}_m^h \cap \mathcal{Q}_m(\neg\mathrm{XST})$. Moreover, we may regard $\mathcal{Q}_m$ as a subset of $\mathbb{R}^{m(m-1)/2}$ and, in this way, equip it with the standard Euclidean topology of $\mathbb{R}^{m(m-1)/2}$. Therefore, for a subset $\mathcal{Q}'_m \subseteq \mathcal{Q}_m$, we use the standard notation $\partial \mathcal{Q}'_m$ for the boundary of $\mathcal{Q}'_m$ as a subset of this topological space $\mathcal{Q}_m$. The notion of a solution to the XST-testing problem is stated in the following.
Definition 4.1 For given $h \in [0, 1/2)$ and error probabilities $\alpha, \beta \in (0, 1)$, we say that an algorithm $A$ solves the XST-testing problem on $\mathcal{Q}_m^h$ for $\alpha$ and $\beta$ (in short: $A$ solves $P^{m,h,\alpha,\beta}_{\mathrm{XST}}$) if $T_A$ is almost surely finite on any instance $\mathbf{Q} \in \mathcal{Q}_m^0$ and the following holds:
Interestingly, as the following theorem reveals, the testing problem (1) for a type of stochastic transitivity stronger than WST turns out to be too difficult. Hence, we will focus on the case XST = WST in the rest of the paper.
To prove this theorem, we show that any solution $A$ to $P^{m,h,\alpha,\beta}_{\mathrm{XST}}$ may be used to test, for some $p_0 \in [0, 1]$, any $p_1 > p_0$, and with an error probability of at most $\max\{\alpha, \beta\}$, whether a coin $C \sim \mathrm{Ber}(p)$ has bias $p = p_0$ or $p = p_1$. But if $p_1$ converges to $p_0$, the number of coin flips necessary to maintain the error probability tends to infinity in expectation. A detailed proof of the theorem is provided in Section B in the supplement.

Reduction to Pure Exploration Bandits with Multiple Correct Answers
The testing problem at hand may be reduced to the Pure Exploration Bandits scenario with multiple correct answers as presented by Degenne and Koolen (2019), the details of which can be found in Section F of the supplement. This approach leads to the following results: Bernoulli distributions with success probability p resp. q. We prove in the supplement 2 (cf. Lemma F.7) that and hold for all h ∈ (0, 1∕2) . This indicates that the case h = 0 is more complex than the case h > 0 and shows that any optimal solution A( ) to P m,h, , WST or P m,0, , WST fulfills respectively, as max{m, h −1 } → ∞ . Unfortunately, these results do not yield any information on the case where is fixed. Moreover, the algorithmic solution A( ) presented by Degenne and Koolen (2019) is very inefficient for the problem of testing WST , if not infeasible in practice, which is due to a hard min-max problem that has to be solved at each time step (cf. Remark F.1). In the following, we will discuss further lower and upper bounds on the worst-case sample complexity of solutions to P m,h, , WST . Our results are to some extent stronger than (3) and (4), as they are covering the cases of a fixed confidence level , which in turn corresponds to the typical setting of (online) testing.

Lower bounds for online testing of weak stochastic transitivity
In this section, we provide lower bounds on the expected termination time of any algorithm solving $P^{m,h,\alpha,\beta}_{\mathrm{WST}}$. Similarly to Theorem 4.2, these results are obtained by reducing a testing problem for the biases of independent coins to $P^{m,h,\alpha,\beta}_{\mathrm{WST}}$. A sample complexity analysis of the latter testing problem results in the bounds stated below, the proofs of which can again be found in Section B.
In order to state an instance-wise lower bound for the case $h > 0$, let us introduce some more notation: Given $\mathbf{Q} \in \mathcal{Q}_m^0$, we write $\sigma$ for a permutation on $[m]$ which fulfills $q_{\sigma(i), \sigma(i+1)} > 1/2$ for every $i \in [m-1]$. We show in the appendix (Lemma B.1) that $\sigma$ exists for every $\mathbf{Q} \in \mathcal{Q}_m^0$, even though we only need this for every $\mathbf{Q} \in \mathcal{Q}_m^0(\mathrm{WST})$. In case $\mathbf{Q} \in \mathcal{Q}_m^0(\mathrm{WST})$, $\sigma$ is the underlying ground-truth ranking of $\mathbf{Q}$, and permuting rows and columns according to $\sigma$ results in a reciprocal relation with entries $> 1/2$ above the diagonal.
, ∶= min{ , } , and = . Then, there exists a constant $c = c(h_0, \cdot) > 0$ such that
Note that the right-hand side of (6) is of the order $m^2 h^{-2} \ln(\gamma^{-1})$, which is consistent with (5). The fact that the instance-wise bound depends only on $\binom{m-1}{2}$ instead of all $\binom{m}{2}$ entries of $\mathbf{Q}$ is due to our proof technique; the bound is nonetheless of the same order with respect to $m$.
Let us now consider the more complex case $h = 0$. As any solution to $P^{m,0,\alpha,\beta}_{\mathrm{WST}}$ is also a solution to $P^{m,h,\alpha,\beta}_{\mathrm{WST}}$ for any $h \in (0, 1/2)$, Theorem 6.1 is applicable in this case. However, we can slightly improve upon this. In the following, for functions $f, g : X \to (0, \infty)$, we
Theorem 6.2 Let $\alpha, \beta \in (0, 1/2)$ be fixed and suppose $A$ to be an algorithm that solves $P^{m,0,\alpha,\beta}_{\mathrm{WST}}$. Then, the following holds:
As we point out in the proof of this theorem, the set $\mathcal{Q}^\dagger_m$ in (a) can be chosen as the set of all $\mathbf{Q} \in \mathcal{Q}_m$ for which some permutation $\sigma$ on $[m]$ exists such that the following conditions are fulfilled:
In the proof of the theorem, to make (b) more explicit, we provide several examples for a family $\{\mathbf{Q}(h)\}_{h \in (0, 1/2)} \subseteq \mathcal{Q}_m^h(\mathrm{WST})$, for which
Regarding the occurrence of the limes superior in Lemma A.2, this is the best we may infer from Lemma A.2. At first sight, part (b) of Theorem 6.2 may appear to contradict (5), which does not involve a $\ln \ln h^{-1}$-factor. However, note that (5) only yields a bound on the worst case of the asymptotics in $\ln(\gamma^{-1})$ as $\gamma \searrow 0$, whereas our bound holds for any fixed $\gamma$. Thus, there is actually no contradiction.

Online testing of WST
Guided by our findings in Sect. 6, we now focus on the testing problem (1) for WST in the framework developed in Sect. 4. Note that weak stochastic transitivity is in any case of particular interest for the ranking problem in dueling bandits, as it is both a sufficient and a necessary condition for the existence of a ranking over the arms consistent with the preference relation $\mathbf{Q}$, in the sense that an arm $i$ is preferred over an arm $j$ if and only if $q_{i,j} \ge 1/2$.
A first naïve approach for a testing component for the passive scenario (cf. Sect. 1) is Algorithm 1, which does the following: Terminate as soon as we can decide, for every $(i, j) \in (m)_2$, each with error probability at most $\gamma' = \min\{\alpha, \beta\} \binom{m}{2}^{-1}$, whether $q_{i,j} > 1/2$ or $q_{i,j} < 1/2$ holds, and output WST if an auxiliary relation $\mathbf{Q}'$ generated during runtime is WST, and ¬WST otherwise. To construct $\mathbf{Q}'$, the value $q'_{i,j}$ is set to 1 resp. 0 whenever we are sure enough (for the first time) that $q_{i,j} > 1/2$ resp. $q_{i,j} < 1/2$ holds. Here, testing the sign of $q_{i,j} - 1/2$ with confidence level $\gamma'$ may be done by stopping as soon as
The term appropriate is specified in Definition 7.1 below.
In the initialization step of $A_{\text{naive}}$, we inform the algorithm about how often every item $i$ has already been compared to every other item $j$ before the start, denoted by $(n_0)_{i,j}$, and how often $i$ has won against $j$, denoted by $(w_0)_{i,j}$. Our setting allows us to assume that $(w_0)_{i,j} \sim \mathrm{Bin}((n_0)_{i,j}, q_{i,j})$ for all $1 \le i < j \le m$. As the theoretical results do not depend on the explicit choice of $n_0$ and $w_0$, we assume w.l.o.g. that $(n_0)_{i,j} = 1$ for all distinct $i, j \in [m]$ throughout the paper.

Definition 7.1 For any $p \in [0, 1]$, suppose $\{X_n^{(p)}\}_{n \in \mathbb{N}}$ to be a family of iid random variables with distribution $\mathrm{Ber}(p)$. We say that a function $C : \mathbb{N} \to [0, \infty]$ is $(h, \gamma)$-correct for given $h \in [0, 1/2)$ and $\gamma \in (0, 1/2)$, if the following holds:
(a) For any $p \ne 1/2$, the following stopping time is almost surely finite:
(b) For all $p > 1/2 + h$, we have and similarly for all $p < 1/2 - h$,
In case $h > 0$, a first example for an $(h, \gamma)$-correct function $C_{h,\gamma}$ can be inferred from Hoeffding's inequality, by means of
With this, the decision whether $q_{i,j} > 1/2$ or $q_{i,j} < 1/2$ is not made in a sequential manner, but instead after exactly $\lceil h^{-2} \ln(\gamma^{-1})/2 \rceil$ duels of $i$ and $j$ have been conducted. At the end of this section, we will introduce more sophisticated any-time confidence bounds admitting decisions in a sequential manner, and also treat the case $h = 0$.
By construction, the sample complexity of Algorithm 1 is exactly the number of iterations that are required for testing the signs of all $q_{i,j} - 1/2$, $(i, j) \in (m)_2$. By choosing $C$ according to (7), testing the sign of $q_{i,j} - 1/2$ requires in any case exactly $N := \lceil h^{-2} \ln(\gamma^{-1})/2 \rceil$ iid samples governed by $\mathrm{Ber}(q_{i,j})$. However, the explicit time at which a pair has been sampled at least $N$ times highly depends on the underlying sampling strategy $\pi$, so that an analysis of the sample complexity of $A_{\text{naive}}$ can only be done w.r.t. the corresponding sampling strategy $\pi$. As the testing component is working in parallel to $\pi$ in the passive setting, i.e., it has no influence on the behavior of $\pi$, the minimum requirement for a test component in the passive online test seems to be consistency in terms of an a.s. finite termination time and the adherence to predefined error bounds for a general class of sampling strategies. Both requirements are met by the test underlying $A_{\text{naive}}$ by Theorem 7.2 for the class $\Pi_\infty$ if $A_{\text{naive}}$ is instantiated with an $(h, \gamma')$-correct $C$.
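The fixed budget $N = \lceil h^{-2} \ln(\gamma^{-1})/2 \rceil$ is easy to evaluate numerically. The short Python sketch below computes it and applies a majority decision after the budget is spent; the function names are hypothetical, and resolving an exact tie in favour of the negative sign is an arbitrary choice made for this illustration.

```python
import math


def hoeffding_budget(h, gamma):
    """Number of duels N = ceil(ln(1/gamma) / (2 h^2)) used per pair."""
    return math.ceil(math.log(1.0 / gamma) / (2.0 * h * h))


def sign_after_budget(wins, n):
    """Decide the sign of q - 1/2 from `wins` successes in n duels (ties give -1)."""
    return 1 if 2 * wins > n else -1


print(hoeffding_budget(0.1, 0.05))   # 150
print(hoeffding_budget(0.05, 0.05))  # 600
```

Halving the gap $h$ roughly quadruples the budget, which reflects the $h^{-2}$ dependence. The error guarantee follows from Hoeffding's inequality: for $p > 1/2 + h$, the empirical mean of $N$ samples falls to $1/2$ or below with probability at most $\exp(-2Nh^2) \le \gamma$.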

Remark 7.3
In the passive online testing scenario, i.e., the sampling strategy $\pi$ is instantiated in a black-box fashion by some dueling bandits algorithm based on a transitivity assumption (such as those by Falahatgar et al. (2017a, 2018)), it might happen that $\pi$ terminates before the testing algorithm came to a decision, and in particular that $\pi$ is not defined any more. In this case, if one is still interested in whether transitivity was fulfilled in hindsight, one may continue sampling according to the strategy $\hat{\pi}$, which picks each query
The other way around, if the testing algorithm came to a positive decision ($D(A) = \mathrm{XST}$), although the online ranking algorithm has not yet terminated, one can simply continue the sampling strategy without the testing component.
In case of a negative decision ($D(A) = \neg\mathrm{XST}$), the online ranking algorithm should be interrupted, since its assumptions are violated.
In the active online testing scenario (cf. Sect. 1), on the other hand, we have the possibility to choose $\pi$ in a favorable way and consequently analyze the sample complexity of Algorithm 1. For this purpose, we consider a sampling strategy $\pi = \pi(m, C)$ depending on the other parameters of $A_{\text{naive}}$, which focuses on the time-dependent set consisting of all pairs $\{i, j\}$ for which it is not yet sure with confidence level $\gamma'$ whether $q_{i,j} > 1/2$ or $q_{i,j} < 1/2$ holds. Formally, the following set is considered:
In each time step $t$, the sampling strategy $\pi(m, C)$ queries $\{i, j\} \in [m]_2$ uniformly at random from $U_C(t)$ if $U_C(t)$ is non-empty, and otherwise queries $\{i, j\} \in [m]_2$ uniformly at random from $[m]_2$. Note that the second case (i.e., $U_C(t)$ is empty) is only defined in order to ensure that $\pi \in \Pi_\infty$, which in turn allows for applying Theorem 7.2. In light of this, we obtain the following corollary.
With regard to Theorem 6.1, the testing algorithm from Corollary 7.4 is already asymptotically optimal up to logarithmic factors for the WST testing problem in (1) for instances $\mathbf{Q} \in \mathcal{Q}_m^h$. Nevertheless, one may ask, firstly, whether termination is only possible once the signs of $q_{i,j} - 1/2$ are known for all of the $\binom{m}{2}$ many $\{i, j\} \in [m]_2$, and secondly, whether the rough correction term in the error probability (i.e., $\binom{m}{2}$) for the individual sign tests is optimal. In the following section, we answer both questions negatively, giving rise to more sophisticated testing procedures. Moreover, we also present a solution to $P^{m,0,\alpha,\beta}_{\mathrm{WST}}$ and develop instance-wise upper bounds for $P^{m,h,\alpha,\beta}_{\mathrm{WST}}$.
We conclude this section with a discussion of further suitable anytime confidence bounds, the proofs of which are deferred to the supplement for the sake of convenience. In the following, if $p \in [0, 1]$ and $C : \mathbb{N} \to \mathbb{R}$ are fixed, let us define $N^{(p)}(C)$ as in Definition 7.1.
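The interplay of the active sampling strategy and the naive test can be sketched as follows. This is an illustrative simplification rather than Algorithm 1 itself: it assumes the fixed Hoeffding budget per pair instead of a general $(h, \gamma')$-correct function, starts from empty counts, stops as soon as no pair is undecided, and checks WST through the score sequence of the estimated tournament. The function name and the return format are hypothetical.

```python
import math
import random
from itertools import combinations


def naive_active_wst_test(q, h, alpha, beta, seed=0):
    """Sketch of the naive test with a fixed Hoeffding budget per pair."""
    rng = random.Random(seed)
    m = len(q)
    pairs = list(combinations(range(m), 2))
    gamma = min(alpha, beta) / len(pairs)       # corrected error level per pair
    budget = math.ceil(math.log(1.0 / gamma) / (2.0 * h * h))
    n = {p: 0 for p in pairs}
    w = {p: 0 for p in pairs}
    undecided = set(pairs)                      # plays the role of U_C(t)
    duels = 0
    while undecided:
        i, j = rng.choice(sorted(undecided))    # uniform over undecided pairs
        w[(i, j)] += rng.random() < q[i][j]     # simulated duel outcome
        n[(i, j)] += 1
        duels += 1
        if n[(i, j)] >= budget:
            undecided.discard((i, j))
    # Out-degree (score) of each arm in the estimated tournament.
    score = [0] * m
    for (i, j) in pairs:
        winner = i if 2 * w[(i, j)] > n[(i, j)] else j
        score[winner] += 1
    # A tournament is acyclic iff its scores are 0, 1, ..., m-1 in some order.
    decision = "WST" if sorted(score) == list(range(m)) else "not WST"
    return decision, duels
```

With this fixed budget, the number of duels is always $\binom{m}{2}$ times the per-pair budget, independently of the instance; the sequential confidence bounds discussed next are what make instance-dependent behaviour possible.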
Inspired by the sequential probability ratio test (Wald and Wolfowitz 1948) for testing whether a coin has bias $1/2 + h$ or $1/2 - h$, we may define for any $h \in (0, 1/2)$ and $\gamma \in (0, 1/2$
leads to a sequential test, where the runtime depends on the (unknown) ground-truth $p$, which makes the question of instance-dependent bounds actually interesting. But on the other side, for any $p \in (0, 1)$, the random variable
) ≤ N a.s. for some $N \in \mathbb{N}$. However, as we also point out in Lemma A.1, the optimality of the sequential probability ratio test assures us that choosing
We now turn to the more complex case of preference relations in $\mathcal{Q}_m^0$. In the following, we write $\ln_2(\cdot) := \ln \ln(\cdot)$ and $\ln_3(\cdot) := \ln \ln \ln(\cdot)$ for the sake of convenience. From a result by Farrell (1964) we can infer that, for some appropriate value $n_0 \in \mathbb{N}$, the function
is $(0, \gamma)$-correct and fulfills
which is shown in Lemma A.3 in the supplement. With the help of $C^{\mathrm{Farrell}}_{0,\gamma}$, we will be able to present a solution $A$ to $P^{m,0,\alpha,\beta}_{\mathrm{WST}}$, in which the term $h^{-2} \ln \ln h^{-1}$ will naturally appear in the sample-complexity bound (cf. Theorem 8.6). As we have seen in Theorem 6.2, the $\ln \ln h^{-1}$-factor may not be avoided here.
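To illustrate why a sequential test has a data-dependent runtime, the following Python sketch implements the classical sequential probability ratio test for $p = 1/2 + h$ versus $p = 1/2 - h$ with Wald's approximate symmetric thresholds. It is the textbook test, shown under the assumption of equal error levels, and not necessarily identical to the confidence function used in this section; the function name and the `max_steps` safeguard are hypothetical.

```python
import math
import random


def sprt_sign_test(sample, h, gamma, max_steps=10**6):
    """Wald SPRT for p = 1/2 + h versus p = 1/2 - h with symmetric errors."""
    step = math.log((0.5 + h) / (0.5 - h))      # log-likelihood ratio of one win
    threshold = math.log((1.0 - gamma) / gamma)  # Wald's approximate boundary
    llr, n = 0.0, 0
    while abs(llr) < threshold and n < max_steps:
        llr += step if sample() else -step       # a loss contributes -step
        n += 1
    return (1 if llr > 0 else -1), n


rng = random.Random(0)
sign, n_used = sprt_sign_test(lambda: rng.random() < 0.9, h=0.1, gamma=0.05)
```

For $h = 0.1$ and $\gamma = 0.05$, the test stops once wins and losses differ by 8, so a coin with bias far from $1/2$ is typically decided after far fewer than the 150 duels of the fixed Hoeffding budget, whereas a coin with bias close to $1/2 \pm h$ needs more samples on average.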

Enhanced online WST testing
In this section, we will exploit the connection between graph theory and WST in order to improve the algorithm from Corollary 7.4. The main idea for improvement is the following: Suppose we wanted to test whether $\mathbf{Q} \in \mathcal{Q}_3$ is WST. If we are sure enough that $q_{2,1}, q_{2,3} > 1/2$ holds (corresponding to the edges $2 \to 1$, $2 \to 3$), then we can infer that $\mathbf{Q}$ is WST, since the definition of weak stochastic transitivity is fulfilled in both cases ($q_{1,3} < 1/2$ and $q_{1,3} > 1/2$). Thus, testing $q_{1,3}$ is in some sense superfluous. To generalize this kind of reasoning to the case $m > 3$, we first introduce a graph-theoretical interpretation of the problem.

Graph-theoretical considerations
Throughout this section, we let
Note that, for every $\mathbf{Q} \in \mathcal{Q}_m^0$ and every distinct $i, j \in [m]$, either $q_{i,j} > 1/2$ or $q_{j,i} > 1/2$ holds. Hence, each $\mathbf{Q} \in \mathcal{Q}_m^0$ can be identified with a tournament $G := G_{\mathbf{Q}} = ([m], E_G)$ with $E_G := \{(i, j) \in [m] \times [m] \mid i \ne j \text{ and } q_{i,j} > 1/2\}$. It can be shown that $\mathbf{Q} \in \mathcal{Q}_m^0$ is WST iff the corresponding identifying tournament $G$ is acyclic (Proposition D.2).
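The identification of a relation with its tournament, and the acyclicity check it enables, can be sketched in Python as follows. The function names and zero-based vertex labels are hypothetical; the acyclicity check repeatedly removes vertices without incoming edges and works for arbitrary digraphs.

```python
def identifying_tournament(q):
    """Edges (i, j) with q[i][j] > 1/2; assumes no off-diagonal entry equals 1/2."""
    m = len(q)
    return {(i, j) for i in range(m) for j in range(m) if i != j and q[i][j] > 0.5}


def is_acyclic(m, edges):
    """Repeatedly remove vertices without incoming edges; a cycle blocks removal."""
    remaining = set(range(m))
    edges = set(edges)
    while remaining:
        sources = [v for v in remaining if not any(j == v for (_, j) in edges)]
        if not sources:
            return False  # every remaining vertex has a predecessor, so a cycle exists
        for v in sources:
            remaining.discard(v)
        edges = {(i, j) for (i, j) in edges if i in remaining and j in remaining}
    return True
```

For a relation without entries equal to $1/2$ off the diagonal, `is_acyclic(len(q), identifying_tournament(q))` then answers whether the relation is WST, in line with the equivalence stated above.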
In the toy example above, note that the identifying tournament of $\mathbf{Q}$ is acyclic in any case, i.e., regardless of whether $q_{1,3} < 1/2$ or $q_{1,3} > 1/2$ holds, making one edge of the identifying tournament superfluous for inferring WST of $\mathbf{Q}$ and allowing a correct decision based merely on the digraph given by $2 \to 1$, $2 \to 3$. The following two definitions generalize the idea of superfluous edges to general digraphs.
Definition 8.1 A digraph $G$ is called transitive in expansion if each of its extensions to a tournament is acyclic. In other words, no tournament $\tilde{G}$ on $[m]$ with $E_G \subseteq E_{\tilde{G}}$ contains any cycle.
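Definition 8.1 can be checked directly, if inefficiently, by enumerating all extensions to a tournament. The Python sketch below does exactly this; it is a brute-force illustration whose running time is exponential in the number of missing pairs, and the function names are hypothetical.

```python
from itertools import combinations, product


def has_cycle(m, edges):
    """Detect a directed cycle by repeatedly deleting vertices with no incoming edge."""
    nodes, edges = set(range(m)), set(edges)
    while nodes:
        sources = {v for v in nodes if all(j != v for (_, j) in edges)}
        if not sources:
            return True
        nodes -= sources
        edges = {(i, j) for (i, j) in edges if i in nodes and j in nodes}
    return False


def transitive_in_expansion(m, edges):
    """Brute force: every orientation of the missing pairs must stay acyclic."""
    edges = set(edges)
    free = [(i, j) for i, j in combinations(range(m), 2)
            if (i, j) not in edges and (j, i) not in edges]
    for flips in product([False, True], repeat=len(free)):
        extra = {(j, i) if f else (i, j) for (i, j), f in zip(free, flips)}
        if has_cycle(m, edges | extra):
            return False
    return True
```

With zero-based labels, the toy example with edges $2 \to 1$ and $2 \to 3$ becomes `{(1, 0), (1, 2)}`, for which the function returns `True`. In contrast, the path `{(0, 1), (1, 2)}` is not transitive in expansion, since adding the edge `(2, 0)` closes a cycle.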

Definition 8.2 Let
In view of Proposition D.2, we may write $\mathcal{G}_m(\mathrm{WST})$ for the set of all digraphs $G$ on $[m]$ which are transitive in expansion. The following result provides a link between transitivity in expansion and the notion of negligibility.

Proposition 8.3
Let $G \in \mathcal{G}_m$. If $G$ does not contain a cycle and every $\{i, j\} \in [m]_2$ with $(i, j), (j, i) \notin E_G$ is negligible for $G$, then $G \in \mathcal{G}_m(\mathrm{WST})$ holds.
This result, together with the connection between preference relations and tournaments, brings us closer to answering the questions raised at the end of Sect. 7, as we show the following: If $G$ is transitive in expansion, then there exists some graph $\tilde{G}$, which is transitive in expansion, satisfying $E_{\tilde{G}} \subseteq E_G$ and $|E_{\tilde{G}}| = \binom{m}{2} - \lfloor \frac{m+1}{3} \rfloor$ (Proposition D.5), i.e., in particular
Thus, it is possible to infer WST of $\mathbf{Q}$ by merely considering $\binom{m}{2} - \lfloor \frac{m+1}{3} \rfloor$ edges of the identifying tournament, while a violation of WST by $\mathbf{Q}$ can be confirmed if the identifying tournament contains a cycle.
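To get a feeling for the size of the saving, the edge count $\binom{m}{2} - \lfloor \frac{m+1}{3} \rfloor$ quoted above can be tabulated for a few values of $m$. The helper name in this small Python sketch is hypothetical.

```python
from math import comb


def edges_needed(m):
    """binom(m, 2) - floor((m + 1) / 3), the edge count quoted from Proposition D.5."""
    return comb(m, 2) - (m + 1) // 3


for m in (3, 4, 5, 10):
    print(m, comb(m, 2), edges_needed(m))
# 3 3 2
# 4 6 5
# 5 10 8
# 10 45 42
```

For $m = 3$, two of the three edges suffice, which matches the toy example with the edges $2 \to 1$ and $2 \to 3$. The saving grows only linearly in $m$, whereas the total number of pairs grows quadratically.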

Exploiting transitivity in expansion
Equipped with these insights, we suggest Algorithm 2 as a testing procedure for $P^{m,h,\alpha,\beta}_{\mathrm{WST}}$. In the next theorem, we verify that this algorithm has in fact the desired theoretical guarantees; the proof is given in Section D in the supplement.

Theorem 8.4 Let $\pi \in \Pi_\infty$, $\alpha, \beta \in (0, 1)$ and $h \in [0, 1/2)$ be fixed and define
Lemma D.9 indicates that one cannot expect to choose a correction term smaller than $\binom{m}{2} - \lfloor \frac{m+1}{3} \rfloor$ for the desired type II error within the choice of $\gamma'$ in Algorithm 2. Furthermore, the fact that the graph $G \in \mathcal{G}_m$ with edges $1 \to 2 \to \dots \to m \to 1$ contains a cycle, unlike any of its proper subgraphs, demonstrates optimality of the correction term $m$ for the desired type I error within the choice of $\gamma'$. As a direct consequence of Theorem 8.4, we obtain a result analogous to the one stated in Corollary 7.4 for Algorithm 2 called with $m$, the sampling strategy $\pi$ from Corollary 7.4, and $C^{\mathrm{Hoeffding}}_{h,\gamma'}$ with $\gamma' = \min\{\alpha/m, \beta (\binom{m}{2} - \lfloor \frac{m+1}{3} \rfloor)^{-1}\}$, so that it achieves an optimal worst-case runtime (up to a logarithmic term of $m$) in the active online testing scenario as well.

Instance-wise upper bounds and exploiting negligibility of edges
We conclude this section with more sophisticated solutions to P m,h, , WST in the active setting, which take into account that queries {i, j} that are negligible with high probability are superfluous and should be avoided. To this end, we define the sampling strategy * (m, C) as the sampling strategy which, similarly to the sampling strategies (m, C) considered in Corollary 7.4, keeps track of a specific subset of [m] 2 consisting of all {i, j} for which q i,j > 1∕2 or q i,j < 1∕2 can be decided with enough confidence (with regard to C) at time t. In contrast to the latter, the subset used by * (m, C) also takes the negligibility of edges into account. Formally, * (m, C) considers the following set at time t: The sampling procedure of * (m, C) is the same as that of (m, C) , except that U C (t) is replaced by U * C (t) . Note that Ê t may be defined in terms of n 0 , w 0 , … , n t−1 , w t−1 as the set of all (i, j) ∈ [m] × [m] for which some t ′ < t exists, such that whence * (m, C) is in fact a sampling strategy as stipulated in Definition 3.1.
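The idea of tracking which pairs can already be decided "with enough confidence" can be sketched as follows. This is a generic illustration, not the paper's confidence function C or its sets U C (t) and U * C (t): it assumes an anytime Hoeffding radius with a union bound over the number of duels, and all names (`hoeffding_radius`, `decide_sign`, `undecided_pairs`) are hypothetical. Negligibility of edges is deliberately not modeled here.

```python
import math


def hoeffding_radius(n: int, delta: float) -> float:
    """Anytime radius: union bound over n with weights delta / (n * (n + 1))."""
    return math.sqrt(math.log(2.0 * n * (n + 1) / delta) / (2.0 * n))


def decide_sign(n: int, w: int, delta: float) -> int:
    """+1 if q > 1/2 is supported, -1 if q < 1/2 is supported, 0 if undecided.

    n is the number of duels of a pair, w the number of wins of its first arm.
    """
    if n == 0:
        return 0
    p_hat = w / n
    r = hoeffding_radius(n, delta)
    if p_hat - r > 0.5:
        return 1
    if p_hat + r < 0.5:
        return -1
    return 0


def undecided_pairs(counts, delta):
    """Pairs whose comparison is not yet settled and may still be queried."""
    return [pair for pair, (n, w) in counts.items()
            if decide_sign(n, w, delta) == 0]


# Hypothetical statistics: pair -> (number of duels, wins of the first arm).
counts = {(0, 1): (1000, 700), (0, 2): (10, 6), (1, 2): (1000, 300)}
still_open = undecided_pairs(counts, delta=0.05)
```

With these hypothetical counts, only the pair (0, 2) remains open, so an active sampler built on this rule would keep querying that pair alone.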
From Theorem 8.4, we immediately obtain that Algorithm 2 called with parameters m, * (m, C) and C is a solution to P m,h, , WST . But even if this guarantee holds for any (h, � )-correct function C, it is desirable to choose C in such a way that the sample complexity of the corresponding algorithm is low. According to Lemma A.1, Lemma A.3, and Lemma A.2, are to some extent optimal in this regard for the cases h > 0 resp. h = 0 . With these, we obtain the following instance-wise upper bounds on the expected termination time for solutions to P m,h, , WST . They show that the values |q i,j − 1∕2| determine the complexity of testing whether is weakly stochastic transitive or not. In comparison to the lower bound stated in Theorem 6.1, our instance-wise upper bounds depend on all m 2 instead of only m − 1 2 entries of . Needless to say, in terms of the asymptotic behavior as m → ∞ , this difference is negligible.
] ∈ (m 2 ln(m)h −2 ln( −1 )) as max{m, h −1 , −1 } → ∞ , i.e., it is asymptotically optimal up to a ln(m)-factor.
In order to compare the result of Theorem 8.5 with the instance-wise lower bound from Theorem 6.1 more thoroughly, suppose ∈ Q h m (WST) and (i, j) ∈ (m) 2 with | (i) − (j)| > 1 to be fixed for the moment and let = = for simplicity. Due to e(h, � ) ∈ (h −1 ) as h ↘ 0 , the dependency of (8) on the (i, j)-entry of is approximately h −1 i,j h −1 , whereas this dependency in (6) is of the form h −2 i,j . This suggests that the two bounds are closest in case h ≈ h i,j . Considering that the choice C = C SPRT h, � assures optimal early detection of sign(q i,j − 1∕2) only in case |q i,j − 1∕2| = h , the appearance of h −1 in (8) may not come as a surprise. Moreover, the scaling � ≈ ∕m 2 leads to an additional factor of 2 ln(m) in (8) compared to (6).
In the first experiment, we investigate the termination time of A naive and A improved for preference relations in Q 0.05 m (WST) or Q 0.05 m (¬WST). To this end, we sample uniformly at random from Q 0.05 m (WST) (resp. Q 0.05 m (¬WST) ), run both test algorithms until termination, and repeat this process 100 times. Here, both A naive and A improved , started with some , observe the same duel chosen by in each time step, as well as the same outcome of the duel. As stated in Theorem 8.4, A improved may thus terminate earlier than A naive in any case. In the following table we report the obtained average termination times (and the corresponding standard error in brackets) for varying values of m. The results reveal that A improved needs significantly fewer samples for checking WST than A naive throughout, and the effect is strongest if is not WST and m is large. In particular, if the underlying preference relation is not WST, the termination time of A improved is mostly decreasing with the number of available arms, while the termination time of A naive , on the other hand, increases rapidly with the number of arms. Moreover, neither test algorithm made any error in deciding whether WST holds or not for the underlying preference relation , i.e., the observed accuracy of both test algorithms was 100% throughout. Last but not least, it is worth mentioning that A improved (as well as A naive ) terminates for each problem scenario much earlier than the derived worst-case upper bound (1 − 2 �� ) m 2 , which is ≥ 4370 m 2 for any m ≥ 3 (cf. Theorem 8.5).
Next, we analyze the impact of the degree of violation of WST within a preference relation, measured by the number of cycles in the identifying tournament G , on the sample complexities of A naive and A improved , respectively. For this purpose, we choose 1 , 2 , 3 and 4 as respectively, where x ∶= 0.6 and y ∶= 0.4 .
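To make the notion of "cycles in the identifying tournament" concrete, the following sketch checks weak stochastic transitivity of a small preference matrix and counts its cyclic triples. The matrices here are hypothetical examples built from the entries 0.6 and 0.4; they are not the relations used in the experiment, and counting 3-cycles is only one possible reading of "number of cycles". Function names are likewise assumptions.

```python
from itertools import combinations, permutations


def is_wst(q):
    """WST: q[i][j] >= 1/2 and q[j][k] >= 1/2 imply q[i][k] >= 1/2."""
    m = len(q)
    for i, j, k in permutations(range(m), 3):
        if q[i][j] >= 0.5 and q[j][k] >= 0.5 and q[i][k] < 0.5:
            return False
    return True


def count_three_cycles(q):
    """Count triples whose edges (i -> j iff q[i][j] > 1/2) form a cycle."""
    count = 0
    for i, j, k in combinations(range(len(q)), 3):
        forward = q[i][j] > 0.5 and q[j][k] > 0.5 and q[k][i] > 0.5
        backward = q[j][i] > 0.5 and q[k][j] > 0.5 and q[i][k] > 0.5
        if forward or backward:
            count += 1
    return count


def ranking_matrix(m, x=0.6):
    """Arm i beats arm j with probability x whenever i < j."""
    return [[0.5 if i == j else (x if i < j else 1 - x) for j in range(m)]
            for i in range(m)]


q_wst = ranking_matrix(4)
# Flip a single comparison to create the cycle 0 -> 1 -> 2 -> 0.
q_cyc = ranking_matrix(4)
q_cyc[0][2], q_cyc[2][0] = 0.4, 0.6

print(is_wst(q_wst), count_three_cycles(q_wst))
print(is_wst(q_cyc), count_three_cycles(q_cyc))
```

In a tournament, any directed cycle implies the existence of a 3-cycle, so a zero count here coincides with transitivity of the identifying tournament.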
The following table shows the number of cycles in G i together with the average runtimes (as well as the empirical standard errors in brackets).
These results support the following conclusions. Firstly, the larger the number of cycles in the identifying tournament G i of the underlying preference relation i (i.e., the more severely the WST property is violated), the lower the sample complexity of A improved is on average. Secondly, the latter effect reveals an "elbow" dependency in the sense that the decrease of the termination time is rapidly declining with the number of cycles, with the strongest decline if at least one cycle is present. Thirdly, A naive does not seem to benefit from stronger violations of WST and in fact does not exploit structural properties of the current estimated preference relation for an early termination as A improved does. Finally, the results for 1 with regard to the averaged elapsed time demonstrate that checking the transitive in expansion property of the internal graph maintained by A naive (i.e., line 7 in Algorithm 2) increases the computational cost per iteration step by a factor of ≈ (2.16/25639) · (25919/0.6) ≈ 3.64 . However, the superiority of A improved over A naive in terms of sample complexity is so strong that it outperforms A naive even with regard to computational costs on 2 , 3 and 4 .
In summary, the experiments empirically confirm our theoretical results on the superiority of the enhanced testing algorithm A improved compared to A naive .

Conclusion
In this paper, we have analyzed the problem of testing stochastic transitivity assumptions within the dueling bandits framework. For various types of stochastic transitivity, we provided instance-dependent lower bounds on the expected number of samples needed by any sequential test to come to a test decision obeying predefined error bounds. These results indicate that testing a stochastic transitivity assumption stronger than weak stochastic transitivity is hopeless in worst-case scenarios.
In light of these results, we have introduced a flexible algorithmic framework, which allows one to either monitor the validity of the weak stochastic transitivity assumption made by a dueling bandit algorithm during its sampling process in a passive way, or to actively query pairs of arms in order to confirm or refute this assumption as quickly as possible. To this end, we designed a sequential testing method within the algorithmic framework and provided theoretical guarantees for its type I and type II error as well as an almost surely finite termination time within the passive testing scenario, if it is instantiated with an appropriate function to measure the confidence of pairwise probability estimates. In addition, we have provided some examples for appropriate confidence functions and have shown optimality of the resulting algorithm up to a logarithmic factor in terms of the expected runtime for a suitable sampling strategy, which is actively supporting the test component. Finally, we enhanced the testing method by incorporating graph-theoretical considerations, resulting in faster decisions on the validity or violation of WST, and provided instance-dependent upper bounds on the expected runtime of this testing procedure.
Based on our findings, it would be of interest to transfer the ideas for WST testing developed in this paper to weaker yet still practically relevant assumptions in the realm of dueling bandits, such as the existence of a Condorcet winner. Furthermore, a more thorough experimental study of the suggested algorithmic framework would be important to gain more insight into the actual degree of support provided by the testing component to already established sampling strategies for ranking problems.