Optimal Clustering from Noisy Binary Feedback

We study the problem of clustering a set of items from binary user feedback. Such a problem arises in crowdsourcing platforms solving large-scale labeling tasks with minimal effort put on the users. For example, in some of the recent reCAPTCHA systems, users' clicks (binary answers) can be used to efficiently label images. In our inference problem, items are grouped into initially unknown non-overlapping clusters. To recover these clusters, the learner sequentially presents to users a finite list of items together with a question with a binary answer selected from a fixed finite set. For each of these items, the user provides a noisy answer whose expectation is determined by the item cluster, the question, and an item-specific parameter characterizing the {\it hardness} of classifying the item. The objective is to devise an algorithm with a minimal cluster recovery error rate. We derive problem-specific information-theoretical lower bounds on the error rate satisfied by any algorithm, for both uniform and adaptive (list, question) selection strategies. For uniform selection, we present a simple algorithm built upon the K-means algorithm whose performance almost matches the fundamental limits. For adaptive selection, we develop an adaptive algorithm that is inspired by the derivation of the information-theoretical error lower bounds, and in turn allocates the budget in an efficient way. The algorithm learns to select items that are hard to cluster and relevant questions more often. We numerically compare the performance of our algorithms with and without the adaptive selection strategy, and illustrate the gain achieved by being adaptive.


Introduction
Modern Machine Learning (ML) models require a massive amount of labeled data to be efficiently trained. Humans have been so far the main source of labeled data. This data collection is often tedious and very costly. Fortunately, most of the data can be simply labeled by non-experts. This observation is at the core of many crowdsourcing platforms such as reCAPTCHA, where users receive low or no payment. In these platforms, complex labeling problems are decomposed into simpler tasks, typically questions with binary answers. In reCAPTCHAs, for example, the user is asked to click on images (presented in batches) that contain a particular object (a car, a road sign), and the system leverages users' answers to label images. As another motivating example, consider the task of classifying bird images. Users may be asked to answer binary questions like: "Is the bird grey?", "Does it have a circular tail fin?", "Does it have a pattern on its cheeks?", etc. Correct answers to those questions, if well-processed, may lead to an accurate bird classification and image labels. In both aforementioned examples, some images may be harder to label than others, e.g., due to the photographic environment, the birds' posture, etc. Some questions may be harder to answer than others, leading to a higher error rate. To build a reliable system, tasks/questions have to be carefully designed and selected, and user responses need to be smartly processed. Efficient systems must also learn the difficulty of the different tasks, and guess how informative they are when solving the complex labeling problem.
This paper investigates the design of such systems, tackling clustering problems that have to be solved using answers to binary questions. We work with a model that accounts for the varying difficulty, or heterogeneity, of clustering each item. We propose a full analysis of the problem, including information-theoretical limits that hold for any algorithm and novel algorithms with provable performance guarantees. Before giving a precise statement of our results, we provide a precise description of the problem setting and the statistical model dictating the way users answer.

Problem setting and feedback model
Consider a large set $I$ of $n$ items (e.g., images) partitioned into $K$ disjoint unknown clusters $I_1, \ldots, I_K$. Denote by $\sigma(i)$ the cluster of item $i$. To recover these hidden clusters, the learner gathers binary user feedback sequentially. Upon arrival, a user is presented a list of $w \ge 1$ items together with a question with a binary answer. The question is selected from a predefined finite set of cardinality $L$. The process of selecting the (list, question) pair for a given user can be carried out in either a nonadaptive or adaptive manner (in the latter case, the pair may depend on user feedback previously collected). Importantly, our model captures item heterogeneity: the difficulty of clustering items varies across items. We wish to devise algorithms recovering clusters as accurately as possible using the noisy binary answers collected from $T$ users.
We use the following statistical model, parametrized by a matrix $p := (p_{k\ell})_{k\in[K],\ell\in[L]}$ with entries in $[0,1]$ and by a vector $h := (h_i)_{i\in I} \in [1/2,1]^n$. These parameters are (initially) unknown. When the $t$-th user is asked a question $\ell_t = \ell \in [L]$ for a set $W_t$ of $w \ge 1$ items, she provides noisy answers: for the item $i \in I_k$ in the list $W_t$, her answer $X_{i\ell t}$ is $+1$ with probability $q_{i\ell} := h_i p_{k\ell} + \bar{h}_i \bar{p}_{k\ell}$, and $-1$ with probability $\bar{q}_{i\ell}$, where for brevity, $\bar{x}$ denotes $1-x$ for any $x \in [0,1]$. Answers are independent across items and users. Our model is simple, but general enough to include, as specific cases, crowdsourcing models recently investigated in the literature. For example, the model in [11] corresponds to our model with only one question ($L=1$), two clusters ($K=2$), and a question asked for a single item at a time ($w=1$). Note that in our model, answers are collected from a very large set of users, and a given user is very unlikely to interact with the system several times. This justifies the assumption that answers provided by the various users are statistically identical.
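As a sanity check on the feedback model, the following Python sketch simulates answers $X_{i\ell t}$; the parameter values are illustrative assumptions, not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def answer(i, ell, sigma, p, h, rng):
    """Sample one noisy +/-1 answer for item i on question ell.

    The model: q_il = h_i * p_{k,l} + (1 - h_i) * (1 - p_{k,l}), k = sigma(i).
    """
    k = sigma[i]
    q = h[i] * p[k, ell] + (1.0 - h[i]) * (1.0 - p[k, ell])
    return 1 if rng.random() < q else -1

# Toy instance: K = 2 clusters, L = 1 question, hypothetical parameters.
p = np.array([[0.8], [0.3]])       # p[k, l]
h = np.array([1.0, 0.8, 0.55])     # item hardness in [1/2, 1]
sigma = np.array([0, 0, 1])        # cluster of each item
xs = [answer(0, 0, sigma, p, h, rng) for _ in range(2000)]
print(np.mean(np.array(xs) == 1))  # empirical frequency, close to q_{0,1} = 0.8
```

With $h_i = 1/2$ the same simulation produces answers that are $+1$ with probability exactly $1/2$, regardless of the cluster, which is the degenerate case discussed next.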
Item hardness. An important aspect of our model stems from the item-specific parameter $h_i$. It can be interpreted as the hardness of clustering item $i$, whereas $p_{k\ell}$ corresponds to a latent parameter related to question $\ell$ when asked for an item in cluster $k$. Note that when $h_i = 1/2$, $q_{i\ell} = 1/2$ irrespective of the cluster of item $i$. Hence any question $\ell$ on item $i$ receives completely random responses, and this item cannot be clustered. Further note that, intuitively, the larger the hardness parameter $h_i$ of item $i$ is, the easier it can be clustered. Indeed, when asking question $\ell$, we can easily distinguish whether item $i$ belongs to cluster $k$ or $k'$ if the difference between the corresponding answer statistics $h_i p_{k\ell} + \bar{h}_i \bar{p}_{k\ell}$ and $h_i p_{k'\ell} + \bar{h}_i \bar{p}_{k'\ell}$ is large. This difference is $|p_{k\ell} - p_{k'\ell}|(2h_i - 1)$ in absolute value, an increasing function of $h_i$. We believe that introducing item heterogeneity is critical to obtain a realistic model (without $h$, all items from the same cluster would be exchangeable), but it complicates the analysis. Most theoretical results on clustering or community detection do not account for this heterogeneity; refer to Section 2 for details.
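For completeness, the one-line computation behind this separation reads:

```latex
\begin{align*}
\bigl(h_i p_{k\ell} + \bar h_i \bar p_{k\ell}\bigr)
- \bigl(h_i p_{k'\ell} + \bar h_i \bar p_{k'\ell}\bigr)
&= h_i (p_{k\ell} - p_{k'\ell})
 + (1 - h_i)\bigl((1 - p_{k\ell}) - (1 - p_{k'\ell})\bigr) \\
&= h_i (p_{k\ell} - p_{k'\ell}) - (1 - h_i)(p_{k\ell} - p_{k'\ell})
 = (2h_i - 1)(p_{k\ell} - p_{k'\ell}),
\end{align*}
```

whose absolute value is $|p_{k\ell} - p_{k'\ell}|(2h_i - 1)$ since $h_i \ge 1/2$.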
Illustrative Example. We introduce an example to illustrate the structure and characteristics of our model.
Example 1. Consider the task of classifying images into two types of birds: Mallards and Canadian Geese. Mallards (see Figure 1a for an image), a type of duck, and Canadian Geese (see Figure 1b for an image), which are not classified as ducks, present a natural classification challenge. In this case, $L = 1$, and the question posed to the users is: "Is the bird in the image a duck?". We assign cluster 1 to the Mallard images and cluster 2 to the Canadian Goose images. Assume that $p_{11} = 0.8$ and $p_{21} = 0.3$: these are the latent probabilities of answering yes to the question given an image of a Mallard and a Canadian Goose, respectively. These parameters also account for the scenario where a user, randomly selected from a large set, may not answer a question correctly due to a lack of knowledge or other reasons. For each image $i$, $h_i$ indicates the difficulty of classification.
For instance, when the image is of a Mallard (a type of duck), and the image is clear, the classification is relatively easy, and the hardness is set to $h_i = 1$. Consequently, the probability of correctly identifying the Mallard is $q_{i1} = h_i p_{11} + \bar{h}_i \bar{p}_{11} = 1 \cdot 0.8 + 0 \cdot 0.2 = 0.8$. However, when another image $j$ of a Mallard is blurred due to poor lighting or other factors, and the classification difficulty is $h_j = 0.8$, the probability of a correct answer decreases to $q_{j1} = h_j p_{11} + \bar{h}_j \bar{p}_{11} = 0.8 \cdot 0.8 + 0.2 \cdot 0.2 = 0.68$. As a result, the feedback obtained for image $j$ is more ambiguous than that for image $i$, due to the increased difficulty in classification.
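The two computations above can be checked with a tiny helper (a sketch; the function `q` is simply the answer-probability map of the model):

```python
def q(h_i, p_k):
    """Answer probability q = h*p + (1-h)*(1-p) for hardness h and latent p."""
    return h_i * p_k + (1 - h_i) * (1 - p_k)

print(round(q(1.0, 0.8), 2))  # 0.8  (clear Mallard image)
print(round(q(0.8, 0.8), 2))  # 0.68 (blurred Mallard image)
print(round(q(0.5, 0.8), 2))  # 0.5  (hardness 1/2: completely random answers)
```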
Assumptions. We make the following mild assumptions on our statistical model. Assumption (A1) excludes the cases where clustering is impossible even if all parameters were accurately estimated. Indeed, when $h_* = 0$, there exists at least one item $i$ which receives completely random responses for any question, i.e., $q_{i\ell} = 1/2$ for any $\ell \in [L]$. Observe that when $\rho_* = 0$, items in two different clusters $k$ and $k'$ can have the same value of $q_{i\ell}$ for every question. As a consequence, from the answers, we cannot determine whether $i$ is in cluster $k$ or $k'$. In Example 1, $r_{11} = 0.6$, $r_{21} = -0.4$, and the value of $\rho_*$ is $\rho_* = 0.4$. Assumption (A2) states some homogeneity among the parameters of the clusters. It implies that $q_{i\ell} \in [\eta, 1-\eta]$ for all $i \in I$ and $\ell \in [L]$.
Let Ω be the set of all models satisfying (A1) and (A2).
For convenience, we provide a table summarizing all the notations in Appendix A.

Main contributions
We study both nonadaptive and adaptive sequential (list, question) selection strategies. In the case of a nonadaptive strategy, we assume that the selection of (list, question) pairs is uniform, in the sense that the number of times a given question is asked for a given item is (roughly) $\lfloor Tw/(nL) \rfloor$. The objective is to devise a clustering algorithm taking as input the data collected over $T$ users and returning estimated clusters that are as accurate as possible. When using adaptive strategies, the objective is to devise an algorithm that sequentially selects the (list, question) pairs presented to users, and that, after having collected answers from $T$ users, returns accurate estimated clusters. Our contributions are as follows. We first derive information-theoretical performance limits satisfied by any algorithm under uniform or adaptive sequential (list, question) selection strategies. We then propose a clustering algorithm that matches our limits order-wise in the case of uniform (list, question) selection. We further present a joint adaptive (list, question) selection strategy and clustering algorithm, and illustrate, using numerical experiments on both synthetic and real data, the advantage of being adaptive.
Fundamental limits. We provide a precise statement of our lower bounds on the cluster recovery error rate. These bounds are problem-specific, i.e., they depend explicitly on the model $M = (p, h)$, and they will guide us in the design of algorithms.
(Uniform selection) In this case, we derive a clustering error lower bound for each individual item. Let $\pi$ denote a clustering algorithm, and define the clustering error rate of item $i \in I$ as $\varepsilon^\pi_i(n,T) := \mathbb{P}[i \in E^\pi]$, where $E^\pi$ denotes the set of misclassified items under $\pi$. The latter set is defined as $E^\pi := \cup_{k=1}^K (I_k \setminus S^\pi_{\gamma(k)})$, where $(S^\pi_1, \ldots, S^\pi_K)$ denotes the output of $\pi$ and $\gamma$ is a permutation of $[K]$ minimizing the cardinality of this set. When deriving problem-specific error lower bounds, we restrict our attention to so-called uniformly good algorithms. An algorithm $\pi$ is uniformly good if for all $M \in \Omega$ and $i \in I$, $\varepsilon^\pi_i(n,T) = o(1)$ as $T \to \infty$ under $T = \omega(n)$. We establish that for any $M \in \Omega$ satisfying (A1) and (A2), under any uniformly good clustering algorithm $\pi$, as $T$ grows large under $T = \omega(n)$, the error rate of any item $i$ is lower bounded in terms of a divergence $D^U_M(i)$. In the definition of the divergence $D^U_M(i)$, $\mathrm{KL}(a,b)$ is the Kullback-Leibler divergence between two Bernoulli distributions of means $a$ and $b$ ($\mathrm{KL}(a,b) := a \log\frac{a}{b} + \bar{a} \log\frac{\bar{a}}{\bar{b}}$). Note that uniformly good algorithms actually exist (see Algorithm 1 presented in Section 4).
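The Bernoulli KL divergence used throughout these bounds is straightforward to compute; here is a minimal sketch (the clamping guard is our addition, to avoid $\log 0$ at the boundary):

```python
from math import log

def kl_bern(a, b, eps=1e-12):
    """KL(a, b) = a log(a/b) + (1-a) log((1-a)/(1-b)) for Bernoulli means a, b."""
    a = min(max(a, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

# Example: the two Mallard answer probabilities from Example 1.
print(round(kl_bern(0.8, 0.68), 3))  # 0.036
print(kl_bern(0.5, 0.5))             # 0.0: identical distributions
```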
(Adaptive selection) We also provide clustering error lower bounds in the case where the algorithm sequentially selects the (list, question) pairs in an adaptive manner. Note that here a lower bound cannot be derived for each item individually, say item $i$, since an adaptive algorithm could well select this given item often so as to make no error when returning its cluster. Instead, we provide a lower bound for the cluster recovery error rate averaged over all items, i.e., $\varepsilon^\pi(n,T) := \frac{1}{n}\sum_{i \in I} \varepsilon^\pi_i(n,T)$. Under any uniformly good joint (list, question) selection and clustering algorithm $\pi$, as $T$ grows large under $T = \omega(n)$, the average error rate is lower bounded in terms of a divergence $D^A_M(i,y)$, defined through the quantities $\sum_{\ell}\big(y_{j\ell}\,\mathrm{KL}(q_{j\ell}, q_{i\ell}) + y_{i\ell}\,\mathrm{KL}(q_{i\ell}, q_{j\ell})\big)$ for items $j$ in clusters different from that of $i$. In the lower bound, the vector $y$ encodes the expected numbers of times the various questions are asked for each item. Specifically, as shown later, $y_{i\ell}\frac{Tw}{n}$ can be interpreted as the expected number of times the question $\ell$ is asked for item $i$. Maximizing over $y$ in (4) hence corresponds to an optimal (list, question) selection strategy, and to the minimal error rate. Further interpretations and discussions of the divergences $D^U_M(i)$ and $D^A_M(i,y)$ are provided later in the paper.
Algorithms. We develop algorithms with both uniform and adaptive (list, question) selection strategies.
(Uniform selection) In this case, for each item $i$ and based on the collected answers, we build a normalized vector (of dimension $L$) that concentrates (when $T$ is large) around a vector depending on the cluster id $\sigma(i)$ only. Our algorithm applies a K-means algorithm to these vectors (with an appropriate initialization) to reconstruct the clusters. We are able to establish that the algorithm almost matches our fundamental limits: when $T = \omega(n)$ and $T = o(n^2)$, under our algorithm, the error rate of item $i$ is upper bounded, for some absolute constant $C > 0$, by a bound with an optimal scaling in $T$, $w$, $L$, $n$. By deriving upper and lower bounds on $D^U_M(i)$, we further show that the scaling is also optimal in $(2h_i - 1)^2$ and almost optimal in $\rho_*$ (see Assumption (A1)).
(Adaptive selection) The design of our adaptive algorithm is inspired by the information-theoretical lower bounds. The algorithm periodically updates estimates of the model parameters and of the clusters. Based on these estimates, we further estimate lower bounds on the probability of misclassifying each item. The items we select are those with the highest lower bounds (the items that are most likely to be misclassified); we further select the question that would be the most informative about these items. We believe that our algorithm should approach the minimal possible error rate (since it follows the optimal (list, question) selection strategy). Our numerical experiments suggest that the adaptive algorithm significantly outperforms algorithms with a uniform (list, question) selection strategy, especially when items have very heterogeneous hardnesses.

Related work
To our knowledge, the model proposed in this paper has been neither introduced nor analyzed in previous work. The problem has similarities with crowdsourced classification problems, which come with a very rich literature: [2], [14], [10], [20], [7], [15], [19], [4], [13] (the Dawid-Skene model and its various extensions without clustered structure), [16], [6] (clustering without item heterogeneity). However, our model has clear differences. For instance, if we draw a parallel between our model and that considered in [11], there, tasks correspond to our items, and there are only two clusters of tasks. More importantly, the statistics of the answers for a particular task do not depend on the true cluster of the task, since the ground truth is defined by the majority of answers given by the various users. Our results also differ from those in the crowdsourcing literature from a methodological perspective. In this literature, fundamental limits are rarely investigated, and when they are, they are either in the minimax sense, postulating the worst parameter setting (e.g., [19], [11], [4]), or problem-specific but without quantifying the error rate (e.g., [13]). Here we derive more precise problem-specific lower bounds on the error rate, i.e., we provide minimum clustering error rates given the model parameters $(p, h)$. Further note that most of the classification tasks studied in the literature are simple (they can be solved using a single binary question).
Our problem also resembles cluster inference problems in the celebrated Stochastic Block Model (SBM); see [1] for a recent survey. Plain SBM models, however, assume that the statistics of observations for items in the same cluster are identical (there are no items harder to cluster than others, which corresponds to $h_i = 1$ for all $i \in I$ in our model), and observations are typically not collected in an adaptive manner. The closest work in the context of the SBM to ours is the analysis of the so-called Degree-Corrected SBM, where each node is associated with an average degree quantifying the number of observations obtained for this node. The average degree then plays the role of our hardness parameter $h_i$ for item $i$. In [5], the authors study the Degree-Corrected SBM, but deal with minimax performance guarantees only, and with non-adaptive sampling strategies.
Information-theoretical limits

Uniform selection strategy
Recall that an algorithm $\pi$ is uniformly good if for all $M \in \Omega$ and $i \in I$, $\varepsilon^\pi_i(n,T) = o(1)$ as $T \to \infty$ under $T = \omega(n)$. Assumptions (A1) and (A2) ensure the existence of uniformly good algorithms. The algorithm we present in Section 4 is uniformly good under these assumptions. The following theorem provides a lower bound on the error rate of uniformly good algorithms.
Theorem 1. If an algorithm $\pi$ with a uniform selection strategy is uniformly good, then for any $M \in \Omega$ satisfying (A1) and (A2), under $T = \omega(n)$, the error rate of each item $i$ is lower bounded in terms of the divergence $D^U_M(i)$.

The proof of Theorem 1 will be presented later in this section. Theorem 1 implies a corresponding lower bound on the global error rate of any uniformly good algorithm.

Divergence $D^U_M(i)$ and its properties. The divergence $D^U_M(i)$, defined in Section 1, quantifies the hardness of classifying item $i$. This divergence naturally appears in the change-of-measure argument used to establish Theorem 1. To get a better understanding of $D^U_M(i)$, and in particular to assess its dependence on the various system parameters, we provide useful upper and lower bounds, proved in Appendix B, in Proposition 1, stated for a fixed $i \in I$ and an appropriately chosen alternative cluster $k'$. Note that $D^U_M(i)$ vanishes as $h_i$ goes to $1/2$, which makes sense since for $h_i \approx 1/2$, item $i$ is very hard to cluster.

Application to the simpler model of [11]. Consider a model with a single question and two clusters of items. From Theorem 1, we can recover an asymptotic version of Theorem 2.4 in [11].
The proof of Corollary 1 is presented in Appendix D. Corollary 1 gives the corresponding lower bound as $T \to \infty$ under $T = \omega(n)$: smaller $h_i$ and $|p_{11} - p_{21}|$ imply that item $i$ is harder to classify. Note that Theorem 2.4 in [11] (which corresponds to $p_{21} = 1 - p_{11}$ in our Corollary 1) provides a minimax lower bound, whereas our result is problem-specific and hence more precise. Note that Corollary 1 also applies directly to Example 1 mentioned in the Introduction: the lower bound on the error probability for each item $i$ scales as $\exp(-c \frac{T}{n} (2h_i - 1)^2)$ for some constant $c > 0$.

Proof of Theorem 1. The proof leverages change-of-measure arguments, as those used in the classical multi-armed bandit problem [12] or in the Stochastic Block Model [18]. Here the proof is however complicated by the fact that we seek a lower bound on the error rate for clustering each individual item.
Let $\pi$ denote a uniformly good algorithm with a uniform selection strategy, and let $M \in \Omega$ be a model satisfying Assumptions (A1) and (A2). In our change-of-measure, we denote by $M$ the original model and by $N$ a perturbed model. Fix $i \in I$, together with an appropriately chosen alternative cluster $k'$ and hardness parameter $h'$. For these choices of $i$, $k'$, and $h'$, we construct the perturbed model $N$ as follows. Under $N$, all responses for items different from $i$ are generated as under $M$. The responses for $i$ under $N$ are generated as if $i$ were in cluster $k'$ and had difficulty $h'$. We can write the log-likelihood ratio $\mathcal{L}$ of the observations under $N$ to those under $M$ in terms of $q' := (q'_\ell)_{\ell\in[L]}$ with $q'_\ell = h' p_{k'\ell} + \bar{h}' \bar{p}_{k'\ell}$. Let $\mathbb{P}_N$ and $\mathbb{E}_N$ (resp. $\mathbb{P}_M = \mathbb{P}$ and $\mathbb{E}_M = \mathbb{E}$) denote, respectively, the probability measure and the expectation under $N$ (resp. $M$). Using the construction of $N$, a change-of-measure argument provides us with a connection, stated in (10), between the error rate on item $i$ under $M$ and the mean and the variance of $\mathcal{L}$ under $N$.

Proof of (10). The distribution of the log-likelihood $\mathcal{L}$ under $N$ satisfies a bound, stated in (11), for any $g \ge 0$. Using the definition (9) of the log-likelihood ratio, we bound the first term in (11). To bound the second term in (11), note that $(2h'-1)$ is a strictly positive constant. Hence, the perturbed model $N$ satisfies (A1). The definition of a uniformly good algorithm then yields (13). Combining (11), (12) and (13) with $g = -\log(4\varepsilon^\pi_i(n,T))$, we obtain (14). Using Chebyshev's inequality, and combining the resulting bound with (14), implies the claim (10).
Next, Lemma 1 provides upper bounds on the mean and variance of $\mathcal{L}$ under the model $N$.
Lemma 1. Assume that (A2) holds. For $i, i'$ such that $\sigma(i) = k \neq k' = \sigma(i')$, under the uniform selection strategy, the mean and variance of $\mathcal{L}$ admit the stated upper bounds.

The proof of this lemma is presented in Appendix C. Note that, in view of the above lemma, the r.h.s. of (10) is asymptotically dominated by its leading term. Thus, Theorem 1 follows from the claim in (10) and Lemma 1. □

Adaptive selection strategy
The derivation of a lower bound for the error rate under adaptive (list, question) selection strategies is similar.

Theorem 2. For any $M \in \Omega$ satisfying (A1) and (A2), and for any uniformly good algorithm $\pi$ with a possibly adaptive (list, question) selection strategy, under $T = \omega(n)$, the average error rate is lower bounded in terms of the divergence $D^A_M(i,y)$, defined through the quantities $\sum_\ell \big(y_{j\ell}\,\mathrm{KL}(q_{j\ell}, q_{i\ell}) + y_{i\ell}\,\mathrm{KL}(q_{i\ell}, q_{j\ell})\big)$.

Proof of Theorem 2. Again we use a change-of-measure argument, where we swap two items from different clusters. First, we prove the lower bound for the error rate of a fixed item $i$. Fix $i \in I$, and let $j$ be an item satisfying $\sigma(j) \neq \sigma(i)$ that minimizes $\sum_\ell \big(y_{j\ell}\,\mathrm{KL}(q_{j\ell}, q_{i\ell}) + y_{i\ell}\,\mathrm{KL}(q_{i\ell}, q_{j\ell})\big)$.
Consider a perturbed model $N'$, in which all items except $i$ and $j$ have the same response statistics as under $M$, and in which item $i$ behaves as item $j$, and item $j$ behaves as item $i$. Let $\mathbb{P}_{N'}$ and $\mathbb{E}_{N'}$ denote, respectively, the probability measure and the expectation under $N'$. The log-likelihood ratio $\mathcal{L}$ of the responses under $N'$ and under $M$ can be written explicitly. Using a slight modification of Lemma 1, the mean and variance of $\mathcal{L}$ under $N'$ are controlled in terms of $\sum_\ell \big(y_{j\ell}\,\mathrm{KL}(q_{j\ell}, q_{i\ell}) + y_{i\ell}\,\mathrm{KL}(q_{i\ell}, q_{j\ell})\big)$. By a similar argument as that used in the proof of Theorem 1, we get a bound as in (10). Note that the r.h.s. of (16) is asymptotically dominated by its leading term. We deduce the announced bound on the error rate of item $i$. Thus, from the definition of $D^A_M$, taking the logarithm of the resulting inequality concludes the proof. □

Algorithms
In this section, we describe our algorithms for both uniform and adaptive (list, question) selection strategies.

Uniform selection strategy
In this case, we assume that with a budget of $T$ users, each item receives the same number of answers for each question. After gathering these answers, we have to exploit the data to estimate the clusters.
To this aim, we propose an extension of the K-means clustering algorithm that efficiently leverages the problem structure. The pseudo-code of the algorithm is presented in Algorithm 1.
The algorithm first estimates the parameters $q_{i\ell}$: the estimator $\hat{q}_{i\ell}$ simply counts the fraction of times item $i$ has received a positive answer for question $\ell$. We denote by $\hat{q}_i = (\hat{q}_{i\ell})_\ell$ the resulting vector. By normalizing the vector $2\hat{q}_i - 1$, we can decouple the nonlinear relationship between $q$, $h$ and $p$. Let $\hat{r}_i = \frac{2\hat{q}_i - 1}{\|2\hat{q}_i - 1\|}$ be the normalized vector. Then, $\hat{r}_i$ concentrates around $\bar{r}_{\sigma(i)} := r_{\sigma(i)} / \|r_{\sigma(i)}\|$. Importantly, the normalized vector $\bar{r}_{\sigma(i)}$ does not depend on $h_i$ but on the cluster index $\sigma(i)$ only. The algorithm exploits this observation, and applies the K-means algorithm to cluster the vectors $\hat{r}_i$. By analyzing how $\hat{r}_i$ concentrates around $\bar{r}_{\sigma(i)}$ and by applying the results to our properly tuned algorithm (decision thresholds), we establish the following theorem.
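To make this step concrete, here is a minimal NumPy sketch of the normalization-plus-K-means idea (plain Lloyd iterations with random initialization; the paper's Algorithm 1 uses a more careful initialization and decision thresholds, so this is illustrative only):

```python
import numpy as np

def cluster_uniform(q_hat, K, n_iter=50, seed=0):
    """Cluster items from empirical answer frequencies q_hat (shape (n, L)).

    Builds r_hat_i = (2 q_hat_i - 1) / ||2 q_hat_i - 1||, which removes the
    hardness h_i from the direction of the vector, then runs K-means.
    """
    rng = np.random.default_rng(seed)
    r = 2.0 * q_hat - 1.0
    norms = np.linalg.norm(r, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # items with no signal stay at the origin
    r = r / norms
    centers = r[rng.choice(len(r), K, replace=False)]
    for _ in range(n_iter):          # plain Lloyd iterations
        labels = np.argmin(((r[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = r[labels == k].mean(axis=0)
    return labels
```

On a noise-free instance (each $\hat q_i$ equal to its expectation), the normalized vectors within a cluster coincide whatever the hardnesses, so the split is exact; with finite $T$, the concentration of $\hat r_i$ around $\bar r_{\sigma(i)}$ drives the error rate.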

User response collection: question $\ell$ is asked for each item $i$ exactly $\tau = \lfloor \frac{Tw}{nL} \rfloor$ times, for all $\ell \in [L]$ and $i \in I$.

We will present the proof of Theorem 3 later in this section. In view of Proposition 1 and the lower bounds derived in the previous section, we observe that the exponent for the misclassification error of item $i$ has the correct dependence in $Tw/(Ln)$ and the tightest possible scaling in the hardness of the item, namely $(2h_i - 1)^2$. Also note that, using Proposition 1, the equivalence between the $\ell_\infty$-norm and the Euclidean norm, and (A1), we can lower bound $D^U_M(i)$ in terms of $\rho_*$ for some absolute constant $C > 0$. Hence, Algorithm 1 has a performance scaling optimally w.r.t. all the model parameters.
The computational complexity of Algorithm 1 is $O(n^2)$. By choosing a small ($\log n$-sized) subset of items (and not all the items in $I$) to compute the centroids, it is possible to reduce the computational complexity to $O(n \log n)$. This would not affect the performance of the algorithm in practice, but would result in worse performance guarantees.
Proof of Theorem 3. In this proof, we let $\tau = \lfloor \frac{Tw}{nL} \rfloor$ be the number of times each question $\ell$ is asked for each item $i$. We also denote by $\alpha := (\alpha_1, \ldots, \alpha_K)$ the fractions of items that are in the various clusters, i.e., $|I_k| = \alpha_k n$. Without loss of generality, and to simplify the notation, we assume that the set of misclassified items is $E = \cup_{k=1}^K (I_k \setminus S_k)$, where we recall that $\{S_k\}_{k\in[K]}$ is the output of the algorithm (i.e., the permutation $\gamma$ in the definition of this set is just the identity).
The proof proceeds in three steps: (i) we first decompose the probability of clustering error for item $i$, using the design of the algorithm and Assumptions (A1) and (A2). We show that this probability can be upper bounded by the probabilities of events related to $\|\hat{r}_i - \bar{r}_{\sigma(i)}\|$ and $\|\xi_k - \bar{r}_k\|$ for all $k$, where we recall that $\bar{r}_k := r_k / \|r_k\|$. The remaining steps of the proof bound the probabilities of these events: (ii) in the second step, we establish a concentration result on $\|\hat{r}_i - \bar{r}_{\sigma(i)}\|$, and (iii) the last step upper bounds $\|\xi_k - \bar{r}_k\|$.
Step 1. Error probability decomposition. The algorithm assigns item $i$ to the cluster $k$ minimizing the distance between $\hat{r}_i$ and $\xi_k$. As a consequence, we obtain two inequalities by simply applying the triangle inequality. In view of Assumptions (A1) and (A2), we then deduce the announced decomposition of the error probability into the terms (a) and (b).

Step 2. Concentration of $\hat{r}_i$ and upper bound on (a). We prove Lemma 2, a consequence of a concentration result on $\hat{q}_i$, where $\bar{r}_{\sigma(i)} = r_{\sigma(i)} / \|r_{\sigma(i)}\|$ and $\bar{r}_k = r_k / \|r_k\|$. The proof of Lemma 2 is presented in Appendix E. Note that, by the definition of $\rho_*$, we can apply Lemma 2 with $\varepsilon = \frac{(2h_i - 1)\rho_*}{20}$
, we obtain an upper bound on the term (a).

Step 3. Upper bound on the term (b). Next, we establish the claim (18). To this aim, we first show that a large fraction of the items satisfy the concentration property of Step 2. Define $S$ as the number of items in $I$ that satisfy this property. Since $\hat{r}_1, \ldots, \hat{r}_n$ are independent random variables, the corresponding indicators are independent Bernoulli random variables. From the Chernoff bound, setting $\lambda = \log\frac{1}{p_{\max}}$, we obtain a high-probability bound on $S$. Therefore, to prove (18), it suffices to show that an item $v$ with $\|\hat{r}_v - \bar{r}_v\| > \frac{1}{4}(\frac{n}{T})^{1/4}$ cannot be a center node (i.e., one of the $i^*_k$ for $k = 1, \ldots, K$). This follows from distance comparisons with the items $w$ such that $\|\hat{r}_w - \bar{r}_w\| \le \frac{1}{4}(\frac{n}{T})^{1/4}$, valid when $\frac{T}{n} = \omega(1)$. Let $R_k$ denote the set of items in $S_k$ before computing $\xi_k$ ($S_k$ is used for the calculation of $\xi_k$); see the algorithm. Then, from (20) and the definition of $S_k$ before computing $\xi_k$, the concentration property holds for all $v \in R_k$. From the above inequality and Jensen's inequality, we conclude the proof of (18) when $T = \omega(n)$.
The proof of the theorem is completed by remarking that, when $T = o(n^2)$, the upper bound we derived for the term (a) dominates the upper bound of the term (b).

Adaptive selection strategy
Algorithm 2 Adaptive Clustering Algorithm.
Our adaptive (list, question) selection and clustering algorithm is described in Algorithm 2. In each round, it uses Algorithm 1 with inputs $K$ and $t$ to obtain the estimated clusters $\{S_k\}_{k=1,\ldots,K}$, estimates the statistical parameters, computes the scores $\hat{d}_i$, and presents the list $W_t$ of items with the smallest $\hat{d}_i$ (including $i^*$) together with the question $\ell^*$. The design of the adaptive selection strategy is inspired by the derivation of the information-theoretical error lower bounds. The algorithm maintains estimates of the model parameters $p$ and $h$ and of the clusters $\{I_k\}_{k\in[K]}$. These estimates, denoted by $\hat{p}$, $\hat{h}$, and $\{S_k\}_{k\in[K]}$, respectively, are updated every $\tau = T/(4\log(T/n))$ users. More precisely, we use Algorithm 1 to compute $\{S_k\}_{k\in[K]}$, and from there we update the parameter estimates, where $Y_{i\ell}$ is the number of times question $\ell$ has been asked for item $i$ so far, and where $\hat{\sigma}(i)$ corresponds to the estimated cluster of $i$ (i.e., $i \in S_{\hat{\sigma}(i)}$). Let $Y := (Y_{i\ell})_{i\in I,\ell\in[L]}$. Now, using the same arguments as those used to derive the error lower bounds, we may estimate that, after seeing the $t$-th user, a lower bound on the misclassification error for item $i$ is $\exp(-\hat{d}_i(Y))$, where $\hat{d}_i(Y)$ is defined through a minimization over alternative clusters and hardness parameters. These lower bounds are heuristic in nature, as they are based solely on estimated parameters and clusters. They are derived from the divergence $D^A_M(i,y)$ using (5), with a particular emphasis on the adjustable parameters for item $i$. This approach takes a pessimistic view of the hardness parameters, with the exception of those for item $i$.
Revisiting the scenario of Example 1, there is only one question ($L = 1$) and the adaptivity of the algorithm is principally determined by how the budget $T$ is allocated among the items. Observe that, when $h_i$ is estimated to be small, the value of $\mathrm{KL}(h'\hat{p}_{k'\ell} + \bar{h}'\bar{\hat{p}}_{k'\ell},\ \hat{h}_i\hat{p}_{\hat{\sigma}(i)\ell} + \bar{\hat{h}}_i\bar{\hat{p}}_{\hat{\sigma}(i)\ell})$ tends to be small; conversely, when $h_i$ is estimated to be large, this value tends to be large. Therefore, the more difficult item $i$ is, the larger $Y_{i1}$ needs to be, and the more often the item is selected. Analyzing the accuracy of these lower bounds is particularly challenging (it is hard to analyze the estimated item hardness $\hat{h}_i$). Using these estimated lower bounds, we select the items and the question to be asked next. We put in the list $W_t$ the $w$ items with the smallest $\hat{d}_i(Y)$. The question $\ell$ is chosen to maximize the informativeness for the item $i^* = \arg\min_{i\in I} \hat{d}_i(Y)$ (see Algorithm 2 for the details). Note that the question is selected by considering the item $i^*$ that seems to be the most difficult to classify.
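The selection rule can be sketched as follows. This is a plug-in approximation of the score $\hat d_i(Y)$ under our own simplifications: each item is scored against the empirical answer profile of each alternative cluster, and the minimization over the perturbed hardness $h'$ is omitted, so `select_next` is illustrative, not the paper's exact rule.

```python
import numpy as np
from math import log

def kl(a, b, eps=1e-9):
    """Bernoulli KL divergence, clamped away from {0, 1}."""
    a = min(max(a, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

def select_next(Y, q_hat, sigma_hat, K, w):
    """One round of adaptive (list, question) selection, sketched.

    Y[i, l]     : times question l was asked for item i so far.
    q_hat[i, l] : current estimate of the answer probability q_il.
    Small score d[i] = item looks easy to confuse with another cluster.
    """
    n, L = Y.shape
    d = np.empty(n)
    for i in range(n):
        k = sigma_hat[i]
        alts = []
        for kp in range(K):
            if kp == k:
                continue
            qp = q_hat[sigma_hat == kp].mean(axis=0)   # alternative profile
            alts.append(sum(Y[i, l] * kl(q_hat[i, l], qp[l]) for l in range(L)))
        d[i] = min(alts)
    items = np.argsort(d)[:w]                          # hardest-looking items
    i_star = items[0]
    q_alt = q_hat[sigma_hat != sigma_hat[i_star]].mean(axis=0)
    ell = int(np.argmax([kl(q_hat[i_star, l], q_alt[l]) for l in range(L)]))
    return items, ell
```

The question maximizing the per-sample divergence for $i^*$ is the one whose answers separate $i^*$ fastest from the closest alternative cluster, mirroring the role of $y$ in the lower bound.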
Note that, in order to reduce the computational complexity of the algorithm, we may replace the KL function in the definition of $\hat{d}_i$ by a simple quadratic function (as suggested in the proof of Proposition 1). This simplifies the minimization problem over $h'$ to find $h'_i$; we actually have an explicit expression for $h'_i$ with this modification. As in the uniform case, by choosing a small ($\log n$-sized) subset of items (and not all the items in $I$) to compute the centroids, one can reduce the computational complexity of the adaptive algorithm (Algorithm 2) to $O(n \log(n) \log(T/n))$. We provide experimental evidence of the superiority of our adaptive algorithm in the following sections.

Numerical experiments: Synthetic data
In this section, we evaluate the performance of our algorithms on synthetic data, considering several models. The problem investigated here differs from those found in the crowdsourcing or Stochastic Block Model literature; hence, we cannot compare our algorithms to existing algorithms developed there. Instead, we focus on comparing the performance of our non-adaptive and adaptive algorithms.
Model 1 -heterogeneous items with dummy questions.
Consider n = 1000 items and two clusters (K = 2) of equal sizes. The item hardnesses are i.i.d., picked uniformly at random in the interval [0.55, 1]. We ask each user to answer one of four questions. The answer statistics are as follows: for cluster k = 1, p_1 = (0.01, 0.99, 0.5, 0.5), and for cluster k = 2, p_2 = (0.99, 0.01, 0.5, 0.5). Note that only half of the questions (ℓ = 1, 2) are useful; the other questions (ℓ = 3, 4) generate completely random answers for both clusters. Figure 2 (top) plots the error rate averaged over all items and over 100 instances of our algorithms. Under both algorithms, the error rate decays exponentially with the budget T, as expected from the analysis. Selecting items and questions adaptively brings significant performance improvements. For example, after collecting the answers of t = 200k users, the adaptive algorithm recovers the clusters exactly for most of the instances, whereas the algorithm using uniform selection does not achieve exact recovery even with t = 1000k users. In particular, the adaptive algorithm is able to reduce the error rates on the 20% most difficult items, i.e., the items with the 20% smallest h_i. Figure 2 (bottom) presents the error rate for these items: it is significantly reduced by being adaptive. Figure 3 presents the evolution over time of the budget allocation under our adaptive algorithm. We group items and questions into four categories; for example, one category corresponds to the questions ℓ = 1, 2 and the 20% most difficult items. As expected, the adaptive algorithm learns to select relevant questions (ℓ = 1, 2) with hard items more and more often as time evolves.
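The Model 1 data can be generated as follows; this is a sketch with our own helper names, where each answer is Bernoulli with mean h_i p_kℓ + (1 − h_i)(1 − p_kℓ), as in the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_model1(n=1000, K=2, rng=rng):
    """Model 1: two equal clusters, hardness Uniform[0.55, 1],
    two informative questions (l = 1, 2) and two dummy questions (l = 3, 4)."""
    sigma = np.repeat(np.arange(K), n // K)      # true cluster labels
    h = rng.uniform(0.55, 1.0, size=n)           # item hardnesses
    p = np.array([[0.01, 0.99, 0.5, 0.5],
                  [0.99, 0.01, 0.5, 0.5]])       # cluster answer statistics
    return sigma, h, p

def answer(i, ell, sigma, h, p, rng=rng):
    """Noisy binary answer of a fresh user to question ell on item i."""
    q = h[i] * p[sigma[i], ell] + (1 - h[i]) * (1 - p[sigma[i], ell])
    return int(rng.random() < q)
```

On the dummy questions, the answer mean is exactly 1/2 regardless of the cluster or the hardness, which is why those questions carry no information.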
The performance of our algorithms under Model 2 is shown in Figure 4. Overall, compared to Model 1, the error rates are better; for example, exact cluster recovery is achieved using only 100k users for almost all instances.
Model 3 - homogeneous items with dummy questions.
Here we study the homogeneous scenario where all items have the same hardness: h_i = 1 for all i ∈ I. We still have 1000 items grouped into two clusters of equal sizes. We set p_1 = (0.3, 0.2, 0.2, 0.2) and p_2 = (0.7, 0.2, 0.2, 0.2) (questions ℓ = 2, 3, 4 are useless). The performance of the algorithms is shown in Figure 5. The adaptive algorithm exhibits better error rates than the algorithm with uniform selection, although the improvement is not as spectacular as in heterogeneous models, where adaptive algorithms can gather more information about difficult items. In homogeneous models, the adaptive algorithm remains better because it selects questions wisely.

Numerical experiments: Real-world data

Finally, we use real-world data to assess the performance of our algorithms. Finding data that fits our setting exactly (e.g., with several possible questions) is not easy. We hence restrict our attention to scenarios with a single question, but with items of different hardnesses. We use the waterbird dataset of [17]. This dataset contains 50 images of Mallards (a kind of duck) and 50 images of Canada Geese (not ducks). The dataset reports the feedback of 40 users per image, collected using Amazon Mechanical Turk: each user is asked whether the image is that of a duck. This scenario mirrors the one outlined in Example 1 in the Introduction. Each image is unique in the sense that the orientation of the animal varies, the brightness and contrast differ, etc. We hence have good heterogeneity in terms of item hardness. The classification task is actually rather difficult, and the users' answers are very noisy: overall, answers are correct 76% of the time.
From this small dataset, we generated a larger dataset containing 1000 images (by simply replicating images). To emulate the sequential nature of our clustering problem, in each round we pick a user uniformly at random (with replacement) and observe her answers to the selected images.
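The replication and user-sampling procedure can be sketched as follows; the helper names and the (images × users) matrix layout are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def replicate_dataset(answers, n_target=1000):
    """answers: (n_images, n_users) binary matrix (here 100 x 40).
    Replicate the images until the dataset holds n_target items."""
    n = answers.shape[0]
    idx = np.tile(np.arange(n), n_target // n + 1)[:n_target]
    return answers[idx], idx

def observe(big, items, rng=rng):
    """One round: a user drawn uniformly at random (with replacement)
    answers the selected items."""
    u = rng.integers(big.shape[1])
    return big[items, u]
```

Sampling users with replacement emulates an unbounded stream of users whose answer statistics match the 40 recorded annotators.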
The error rates of both algorithms are shown in Figure 6; the global error rate is averaged over 100 instances. Both algorithms have rather low performance, which can be explained by the inherent hardness of the learning task. The adaptive algorithm becomes significantly better after t = 20k users, which can be explained as follows. The adaptive algorithm needs to estimate the hardness of the items before becoming efficient: until it gathers enough answers on item i, its estimate ĥ_i remains close to 0.5, and consequently the algorithm continues to pick items uniformly at random. As soon as the algorithm obtains better estimates of the item hardnesses, it starts selecting items with strong preferences.

Conclusion
In this paper, we analyzed the problem of clustering complex items using very simple binary feedback provided by users. A key aspect of our problem is that it accounts for the fact that some items are inherently more difficult to cluster than others. Accounting for this heterogeneity is critical to obtain realistic models, but is unfortunately seldom investigated in the literature on clustering and community detection (e.g., that on the Stochastic Block Model). The item heterogeneity also significantly complicates any theoretical development.
For the case where data is collected uniformly (each item receives the same amount of user feedback), we derived a lower bound on the clustering error rate of any individual item, and we developed a clustering algorithm approaching the optimal error rate. We also investigated adaptive algorithms, under which the user feedback is received sequentially and can be adapted to past observations; being adaptive allows the learner to gather more feedback on the more difficult items. We derived a lower bound on the error rate that holds for any adaptive algorithm and, based on our lower bounds, devised an adaptive algorithm that smartly selects the items and the nature of the feedback to be collected. We evaluated our algorithms on both synthetic and real-world data. These numerical experiments support our theoretical results and demonstrate that being adaptive leads to drastic performance gains.

B Proof of Proposition 1
For given i ∈ I, let k = σ(i) and let k′ ∈ [K] be such that:

Upper bound. Recalling the definition of q_iℓ := h_i p_kℓ + h̄_i p̄_kℓ, it follows that, for any h′ ∈ [(h* + 1)/2, 1]:

where the second inequality follows from upper bounding the KL divergence by the χ²-divergence, and the third inequality follows from (A2). Taking the minimum over h′ ∈ [(h* + 1)/2, 1], we obtain the upper bound.

Lower bound. Using Pinsker's inequality, we obtain:

where for the last inequality, we again use the lower bound on the hardness. This completes the proof of Proposition 1. Note that we can further write:

using the relationship between the ℓ∞-norm and the Euclidean norm, together with (A1). □

C Proof of Lemma 1

E[L̂] can be obtained as follows:

To bound the variance of L̂, we first decompose L̂² as follows:

We then compute L̂_t² as follows:

where the last inequality follows from the fact that q_iℓ ∈ [η, 1 − η] under (A2), i.e., |log(q′_ℓ/q_iℓ)| ≤ log(1/η) and |log(q̄′_ℓ/q̄_iℓ)| ≤ log(1/η). We deduce a bound proportional to KL(q_i′ℓ, q_iℓ) + KL(q̄_i′ℓ, q̄_iℓ), where for the last inequality we used Pinsker's inequality. Moreover, we can compute the expectation of Σ_{t≠t′} L̂_t L̂_t′ as follows:
$$\mathbb{E}\Big[\sum_{t \neq t'} \hat L_t \hat L_{t'}\Big] = \sum_{t \neq t'} \sum_{\ell, \ell'} \mathbb{E}\big[\mathbb{1}[\ell_t = \ell \text{ and } \ell_{t'} = \ell']\big]\, \mathrm{KL}(q'_\ell, q_{i\ell})\, \mathrm{KL}(q'_{\ell'}, q_{i\ell'}) = T(T-1)\Big(\frac{w}{Ln}\Big)^2 \sum_{\ell, \ell'} \mathrm{KL}(q'_\ell, q_{i\ell})\, \mathrm{KL}(q'_{\ell'}, q_{i\ell'}),$$
where for the last equality, we use the expression of E[L̂] in terms of KL(q′_ℓ, q_iℓ) + KL(q̄′_ℓ, q̄_iℓ).
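For reference, the two standard comparison inequalities invoked in the proof, specialized to Bernoulli distributions with parameters a, b ∈ (0, 1), are:

```latex
% KL upper bounded by the chi-square divergence, and Pinsker's inequality
% (Bernoulli case, where the total variation distance equals |a - b|):
\mathrm{KL}(a,b) \;\le\; \chi^2(a,b) \;=\; \frac{(a-b)^2}{b(1-b)},
\qquad
\mathrm{KL}(a,b) \;\ge\; 2\,(a-b)^2 .
```

The first yields the quadratic upper bound on the divergence (and underlies the quadratic surrogate mentioned for the adaptive algorithm); the second yields the matching quadratic lower bound.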

E Proof of Lemma 2
We use Hoeffding's inequality to establish the lemma.
Proof of Lemma 3. Note that the number of times question ℓ is asked for item i is τ. Using Hoeffding's inequality (Theorem 4), it is straightforward to check that, for any ε > 0 and ℓ ∈ [L]: We conclude the proof using the union bound as follows:
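For completeness, the form of Hoeffding's inequality used here (referred to as Theorem 4; the restatement below assumes the standard version for i.i.d. bounded variables) reads:

```latex
% Hoeffding's inequality for i.i.d. X_1, ..., X_tau with values in [0,1]:
\mathbb{P}\Big( \Big| \frac{1}{\tau} \sum_{s=1}^{\tau} X_s - \mathbb{E}[X_1] \Big| \ge \varepsilon \Big)
\;\le\; 2 \exp\!\big( -2 \tau \varepsilon^2 \big).
```

Applied to the τ binary answers collected for each (item, question) pair, and combined with a union bound over the L questions, this gives the concentration of the empirical answer statistics.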

Figure 2: Model 1. (top) Global error rate vs. number of users. (bottom) Error rate for the 20% most difficult items vs. number of users. One standard deviation is shown using shaded areas.

Figure 3: Model 1. The budget allocation under the adaptive algorithm vs. number of users. Items and questions are grouped into 4 categories, e.g., (0−20%, ℓ = 1, 2) is the category containing the 20% most difficult items and questions ℓ = 1, 2. One standard deviation is shown using shaded areas.

Figure 4: Model 2. Global error rate vs. number of users. One standard deviation is shown using shaded areas.
Figure 5: Model 3. Global error rate vs. number of users.
Figure 6: Real-world data. Global error rate vs. number of users.