1 Introduction

The amount of data available in digital form is increasing every day. Given the high cost of human inspection (and annotation), unsupervised approaches to mining these data are becoming increasingly important.

Cluster analysis lies at the core of most unsupervised learning tasks. Jain et al. (1999) define clustering as “the organization of a collection of patterns […] into clusters based on similarity. Intuitively, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster”. In addition to pattern, each element to be clustered has also received the names of “object, record, point, vector, […] event, case, sample, observation, or entity” (Tan et al. 2005, ch. 2). We will stick to the term object throughout this article.

In this, the most common, setting, it is assumed that all objects belong to some cluster. Although several surveys have reviewed the vast literature on clustering methods (Dubes and Jain 1980; Jain et al. 1999; Xu and Wunsch 2005), so far they all have focused on this standard task, which can be named all-in clustering. Two of the most widely used methods to solve it are the distance-based k-means (MacQueen 1967) and the probabilistic-model-based Expectation-Maximization (Dempster et al. 1977) algorithms.

However, there are a number of situations in which the data are known not to fit neatly within this all-in assumption. In such cases, a fraction of the data is known to be similar neither to one another nor to the data within the clusters. Often, these data will correspond to a certain form of noise and should hence be separated from the sought regular clusters, which constitute the signal. Within this alternative setting, a number of different tasks can be identified according to the characteristics of the data and the aim of the task itself.

In one of these tasks, the all-in clustering goal is preserved, but the data are known to contain a small fraction of noise. This has been called the robust clustering task (Davé and Krishnapuram 1997). To solve it, some authors have proposed changes to standard clustering methods to make them more robust to the presence of noise (Kaufman and Rousseeuw 2005; Peel and McLachlan 2000). Other approaches explicitly incorporate a noise cluster, often with different properties from the regular signal clusters (Davé 1991; Banfield and Raftery 1993; Guillemaud and Brady 1997; Biernacki et al. 2000). A last family is that of algorithms specifically devised for robust clustering, such as BIRCH (Zhang et al. 1996) or DBSCAN (Ester et al. 1996).

It is worth noting that there are a number of related tasks which share this setting, such as one-class classification or learning (Moya et al. 1993; Schölkopf et al. 2001; Tax and Duin 2004) and outlier detection (Hodge and Austin 2004; Chandola et al. 2009). In both cases, there is also a dataset containing both signal and a fraction of noise objects. However, the focus of these tasks shifts away from that of clustering, becoming the estimation of a model which covers the signal objects in the former, and the detection of the objects that significantly deviate from the rest in the latter.

A different task is that in which there is only a minority of signal objects, standing against the majority of noise. Most often, the signal objects will be embedded within the noise ones, becoming respectively foreground and background objects, and the distinction between the former and the latter must be done on grounds of density criteria. In the literature, this task has been compared to “clustering needles in a haystack” (Ando 2007), and has received names such as one-class clustering (Crammer and Chechik 2004), density-based clustering (Gupta and Ghosh 2006) or minority detection (Ando and Suzuki 2006). As a catchall term, in this article we will refer to this setting and task as minority clustering. An example dataset for a minority clustering problem is depicted in Fig. 1.

Fig. 1 Sample Toy minority clustering dataset

Even though this new task is related to the previously presented ones, the reversal of the signal-to-noise ratio can make existing approaches unsuitable. For instance, Crammer and Chechik (2004) give insights into why existing one-class classification approaches, which are tailored to finding large-scale structures, may be unable to identify small and locally dense regions embedded in noise. Empirical comparisons have also shown the low performance of all-in and robust clustering methods in the minority clustering task (Gupta and Ghosh 2006).

However, to the best of our knowledge, all the methods proposed so far require as an input the distribution of the foreground clusters or both the foreground clusters and the background noise, either in the form of a probability distribution or, equivalently, of a divergence metric (see footnote 1). This can become a significant issue when facing large amounts of data coming from a new and unexplored domain, whose distribution may be completely unknown.

With the aim of providing a way to obtain distribution-free methods, a number of combination methods have appeared for all-in clustering (e.g., Strehl and Ghosh 2002; Topchy et al. 2003, 2004; Gionis et al. 2005). Among them, Topchy et al. (2003) introduced the idea of using an ensemble of weak clusterings, which “produce a partition, which is only slightly better than a random partition of the data”, to obtain a high-quality consensus clustering.

Ensemble clustering methods are known to offer a greater degree of flexibility with respect to individual algorithms. They allow the reuse of knowledge coming from multiple, heterogeneous sources, and can be used in a number of settings which are unfeasible using monolithic approaches, such as feature-distributed or privacy-preserving clustering (Strehl and Ghosh 2002). Moreover, most of them can be considered embarrassingly parallel, and as such can obtain significant speed-ups when deployed in distributed environments, using techniques such as Map/Reduce (Dean and Ghemawat 2004).

In this article, we make a three-fold proposal:

  • First, we propose an unsupervised minority clustering approach, Ensemble Weak minOrity Cluster Scoring (Ewocs), based on weak-clustering combination. In it, a number of weak clusterings are generated, and the information coming from each one of them is combined to obtain a score for each object. A threshold separating foreground from background objects is then inferred from the distribution of these scores. We provide a theoretical proof of the properties of the proposed method, and consider a number of criteria by which the threshold value can be determined.

  • Second, we propose Random Bregman Clustering (Rbc), a weak clustering algorithm based on Bregman divergences, for use within Ewocs ensembles; as well as an extension of the Random Splitting (RSplit) weak clustering algorithm of Topchy et al. (2003).

  • Third, we propose an unsupervised procedure to determine a set of suitable scaling parameters for a Gaussian kernel, to be used within Rbc.

We have implemented a number of approaches built from the proposed components, and evaluated them on a collection of datasets. The results of the evaluation show how approaches based on Ewocs are competitive with respect to—and even outperform—other minority clustering approaches in the state of the art, in terms of F1 and AUC measures of the obtained clusterings.

The Ewocs algorithm has already been used in the real-world task of relation detection, which was reduced to a minority clustering problem (Gonzàlez and Turmo 2009). However, we now provide a formalization of the approach, as a minority clustering algorithm by itself, and a study of its theoretical properties, which were both missing from our previous work.

The rest of the article is organized as follows. Sect. 2 gives an overview of related work in the fields of minority clustering and clustering combination. Next, Sect. 3 contains a description of the Ewocs approach, particularly the derivation of a minority clustering algorithm whose properties are theoretically proved under a set of conditions. The obtained algorithm has a number of components which allow different implementations: Sects. 4 and 5 give details on the specific weak clustering algorithms and threshold score determination methods we have used, respectively. Sects. 6 and 7 contain the details and results of an empirical evaluation of the proposed approaches on synthetic and real-world data, respectively. Finally, Sect. 8 draws conclusions of our work.

2 Related work

One of the first works to identify the minority clustering task in opposition to that of one-class classification is that of Crammer and Chechik (2004). The authors formalize the problem in terms of the Information Bottleneck principle (IB) (Tishby et al. 1999), and provide a sequential algorithm to solve this one-class IB problem. Given a Bregman divergence as a generalized measure of object discrepancy, and a fixed radius value, the OC-IB method outputs a centroid for a single dense cluster. The foreground cluster consists of the objects which fall inside the Bregmanian ball of given radius centered around the given centroid. More recently, Crammer et al. (2008) propose a different algorithm for the same model, based on rate-distortion theory and the Blahut-Arimoto algorithm, and extend it to allow for more than one cluster.

In a different direction, Gupta and Ghosh (2005) reformulate the problem in terms of cost, defined as the sum of divergences from the cluster centroid to each sample within it, and extend the OC-IB method to avoid local minima. A triad of methods (HOCC, BBOCC and Hyper-BB) is proposed. However, the requirement of an a priori determination of the cluster radius (or equivalently, size) is not removed, and the output remains a single ball-shaped cluster.

To overcome this second limitation, Gupta and Ghosh (2006) propose Bregman Bubble Clustering (BBC), as a generalization of BBOCC to several clusters. However, the number of such clusters must still be given a priori, as well as the desired joint cluster size. The authors also propose a soft clustering version of BBC, as well as a unified framework between all-in Bregman clustering (Banerjee et al. 2005) and BBC, in all their hard and soft versions. Ghosh and Gupta (2011) revisit all the theory of BBC, and present Density Gradient Enumeration (DGRADE), a procedure to determine the number of clusters as well as the initial centroids for BBC. However, DGRADE introduces new parameters of its own, whose tuning requires a potentially expensive exhaustive search in the space of possible values.

The work of Ando and Suzuki (2006) is similar to previous ones in that it also uses the Information Bottleneck principle as a criterion to identify a single minority cluster. However, the method is more general in the sense that it allows arbitrary distributions, not only those induced by Bregman divergences, as foreground and background. Ando (2007) extends this last proposal, allowing multiple foreground clusters, and also provides a unifying framework of which not only the task of minority clustering, but also those of outlier detection and one-class learning, are particular cases.

A last line of research is that opened by Gupta et al. (2010), who propose Hierarchical Density Shaving (HDS). HDS is built upon the Hierarchical Mode Analysis algorithm (HMA) introduced forty years earlier by Wishart (1969), and can be seen as a generalization of the robust clustering algorithm DBSCAN (Ester et al. 1996). The algorithm produces a hierarchical clustering which is an approximation of the one which would be obtained by HMA. Dense clusters are then identified in the hierarchy using a heuristic criterion. The authors propose the AutoHDS framework, in which the parameters of the algorithm are manually tuned with the help of an interactive tool. The proposed application provides a visualization of the obtained minority clustering as the parameter values are updated.

Except for HDS, which is of a more heuristic nature, all the approaches discussed so far formalize the task of minority clustering as an optimization problem, and differ in the considered objective function and in the algorithm used to optimize it. In all cases, the formalization requires the distribution of the sought clusters to be made explicit. In the case of HDS, a divergence function is required, and used throughout the algorithm. This is clearly a drawback, as the performance of these methods degrades if the distribution of the data does not match the one used by the model.

As discussed thoroughly in Sect. 3, Ewocs provides a different approach to the problem: we propose a procedure, based on the aggregation of clustering ensembles, by which a score for each object can be found, and we show how these scores correlate with whether an object belongs to the foreground clusters or to the background. This alternative approach allows the use of much less informed (weak) clustering algorithms, and is the first one to our knowledge to use ensembles for the task. In addition to providing a distribution-free clusterer, the use of ensembles also brings practical benefits, as the algorithm becomes easily parallelizable: the individual clusterings can be found in a distributed fashion, and synchronization is only needed in batches to add up the object scores.

3 Ewocs

This section presents our Ensemble Weak minOrity Cluster Scoring (Ewocs) algorithm to solve the task of minority clustering.

As mentioned in the introduction, the aim of Ewocs is to leverage the information provided by clusterings in an ensemble, combining the evidence from each one of them to obtain a minority clustering of the dataset. The central idea in the algorithm is that of object score: from each individual clustering, objects are assigned a certain score, and these scores will be aggregated across the ensemble. In order to quantify the local density of the objects, we propose the use of a score related to the size of the clusters each object is assigned to across the clusterings. In addition to being computationally cheap, we will be able to prove that this function shows interesting theoretical properties in minority clustering scenarios. As a consequence, it will be possible to separate foreground from background objects using a threshold on their aggregated scores.

The following sections detail and formalize the intuitions presented in this overview. First, Sect. 3.1 defines our setting for the task of minority clustering. Sect. 3.2 presents, from a theoretical point of view, the scoring scheme that lies at the core of our method. Sects. 3.3 and 3.4 then study the conditional probability distributions of the assigned scores: the first one on a single dataset; the second, across multiple dataset samplings. Next, Sect. 3.5 introduces the concept of consistent clustering, and shows how, when using clustering functions from a consistent family, an inequality on the score expectations for foreground and background objects can be established. This inequality will allow us to obtain as a corollary, in Sect. 3.6, a generic algorithmic procedure for minority clustering, based on the proposed scores. Finally, it is also possible to obtain a clustering model using this algorithm: its construction and application are described in the last Sect. 3.7.

3.1 Task setting

Our definitions of clustering are based on concepts from fuzzy set theory:

Definition 1

(Fuzzy set)

A fuzzy set over an ordinary set \(\mathcal{X}\) is a pair \(\tilde{\mathcal{X}} = (\mathcal{X}, f_{\tilde{X}})\), where \(f_{\tilde{X}} : \mathcal{X}\rightarrow[0, 1]\) is the membership function (or characteristic function) of \(\tilde{\mathcal{X}}\). For \(x_{i} \in\mathcal{X}\), \(f_{\tilde{X}}(x_{i})\) expresses the grade of membership of \(x_{i}\) to \(\tilde{\mathcal{X}}\), and will often be denoted as \(\mathrm{grade}(x_{i}, \tilde{\mathcal{X}})\) (Zadeh 1965).

Definition 2

(Fuzzy c-partition)

A fuzzy c-partition (or fuzzy pseudopartition) of an ordinary set \(\mathcal{X}\) is a family of fuzzy sets \(\varPi= \{ \pi_{1} \ldots\pi_{k} \}\) over \(\mathcal{X}\) such that

$$ \forall x_i \in\mathcal{X}:\quad \sum_{\pi_c \in\varPi} \mathrm{grade}(x_i, \pi_c) = 1 \qquad\text{and}\qquad \forall\pi_c \in\varPi:\quad 0 < \sum_{x_i \in\mathcal{X}} \mathrm{grade}(x_i, \pi_c) < | \mathcal{X}| $$

(Bezdek 1981; Klir and Yuan 1995).

A clustering over a dataset \(\mathcal{X}= \{ x_{1} \ldots x_{n} \}\) of size n can now be defined as:

Definition 3

(Hard partitional clustering)

A hard (partitional) clustering \(\varPi\) of dataset \(\mathcal{X}\) is a partition \(\varPi= \{ \pi_{1} \ldots\pi_{k} \}\) of \(\mathcal{X}\). Each one of the subsets \(\pi_{c} \in\varPi\) is a hard cluster.

Definition 4

(Soft partitional clustering)

A soft (partitional) clustering \(\varPi\) of dataset \(\mathcal{X}\) is a fuzzy pseudopartition \(\varPi= \{ \pi_{1} \ldots\pi_{k} \}\) of \(\mathcal{X}\). Each one of the fuzzy subsets \(\pi_{c} \in\varPi\) is a soft cluster.

Remark 1

A hard clustering can be seen as a particular case of soft clustering where the grade of membership of a certain \(x_{i}\) to the \(\pi_{c}\) is zero for all but exactly one cluster, for which the grade is one.

Assume we have a finite set of \(\hat{k}\) generative distributions or sources \(\varPsi= \{ \psi_{1} \ldots\psi_{\hat{k}} \}\), with a priori probabilities \(\{ \alpha_{1} \ldots\alpha_{\hat{k}} \}\), from which the dataset \(\mathcal{X}\) has been sampled. Each object \(x_{i}\) will be generated by one of the sources in \(\varPsi\), and we can hence consider a set \(\mathcal{Y}\) of hidden variables, with each \(y_{i} \in\varPsi\) containing the source which generated the corresponding \(x_{i}\).

The setting presented so far is common to all-in clustering and minority clustering. However, in the latter we can make additional assumptions about the sources in \(\varPsi\). In particular, and without loss of generality, we can assume the first of those sources, \(\psi_{1}\), to be a background source; and the objects generated by it, the background objects. The rest of the sources and objects shall be named the foreground sources (whose set will be denoted as \(\varPsi^{+}\)) and the foreground objects, respectively.
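As an illustration of this generative setting (with hypothetical sources and priors, not those of the Toy dataset of Fig. 1), a minority clustering dataset can be sampled as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source set Psi: one broad background source psi_1 and two
# dense, local foreground sources, with a priori probabilities alpha.
alpha = np.array([0.8, 0.1, 0.1])           # the background is the majority
n = 1000
y = rng.choice(3, size=n, p=alpha)          # hidden variables y_i

X = np.empty((n, 2))
X[y == 0] = rng.uniform(-10, 10, size=((y == 0).sum(), 2))       # background psi_1
X[y == 1] = rng.normal([-4, -4], 0.5, size=((y == 1).sum(), 2))  # foreground psi_2
X[y == 2] = rng.normal([5, 3], 0.5, size=((y == 2).sum(), 2))    # foreground psi_3
```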

The relationship between background and foreground sources satisfies two additional assumptions which can be stated as follows:

Density: Foreground sources are dense, i.e., objects generated by the same foreground source are more similar to each other than to those generated by the background source (see footnote 2).

Locality: Foreground sources are local, i.e., objects generated by different foreground sources are as similar to each other as they are to those generated by the background source.

These two assumptions are similar to those in previous works, for instance, those of atypicalness and local distribution defined by Ando (2007), and are implicitly present in others (e.g., Gupta and Ghosh (2006) look for “locally dense regions”). In fact, we consider these assumptions to define our task. Thus, minority clustering (in contrast to, for instance, all-in clustering and robust clustering) can be defined as:

Definition 5

(Minority clustering)

Minority clustering is the task of organizing a collection of objects based on similarity, when we can assume that a minority fraction of them are dense and local, and are embedded in a majority which are not.

3.2 Per-clustering scoring

Assume now we have a (possibly infinite) family of clustering functions F. From it, a sequence of functions \((f_{1} \ldots)\) is drawn independently at random, with a certain probability density. When applied to the dataset, each \(f_{r}\) will produce a soft clustering (see footnote 3) \(\varPi_{r} = \{ \pi_{r1} \ldots\pi_{rk_{r}} \}\) with a number \(k_{r}\) of clusters.

After clustering function \(f_{r}\) is applied, the cluster sizes and object scores can be calculated from the output clustering \(\varPi_{r}\).

Definition 6

(Cluster size)

The size of cluster \(\pi_{rc}\) is the sum of the grades of membership to the cluster of all objects in the dataset:

$$ \mathrm{size}(\pi_{rc}) = \sum_{x_i \in\mathcal{X}} \mathrm {grade}(x_i, \pi_{rc}) $$
(1)

Definition 7

(Object score)

The score of an object \(x_{i}\) by clustering function \(f_{r}\) is

$$\begin{aligned} s_{ri} = \sum_{\pi_{rc} \in\varPi_r} \mathrm{grade}(x_i, \pi_{rc}) \cdot\mathrm{size}(\pi_{rc}) \end{aligned}$$
(2)

i.e., the sum of the sizes of the output clusters, weighted by the grade of membership of \(x_{i}\) to each one of them.
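As a minimal sketch of Eqs. (1) and (2), assume a soft clustering \(\varPi_{r}\) is stored as an n×k membership matrix G with G[i, c] = grade(x_i, π_rc):

```python
import numpy as np

def cluster_sizes(G):
    """Eq. (1): size of each cluster, i.e. the column sums of the membership matrix."""
    return G.sum(axis=0)                 # shape (k,)

def object_scores(G):
    """Eq. (2): score of each object, i.e. the membership-weighted sum of cluster sizes."""
    return G @ cluster_sizes(G)          # shape (n,)
```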

An additional concept will turn out to be of much importance later.

Definition 8

(Co-occurrence vector)

The co-occurrence vector for object \(x_{i}\) and clustering function \(f_{r}\) is \(\mathbf{c}_{ri} = {[ c_{ri1} \ldots c_{rin} ]}^{T}\), where each component \(c_{rij}\) is

$$ c_{rij} = \sum_{\pi_{rc} \in\varPi_r} \mathrm{grade}(x_i, \pi_{rc}) \cdot \mathrm{grade}(x_j, \pi_{rc}) $$
(3)

Remark 2

Using the co-occurrence vector, the score of object \(x_{i}\) by clustering function \(f_{r}\) can be written as

$$\begin{aligned} s_{ri} = & \sum_{\pi_{rc} \in\varPi_r} \mathrm{grade}(x_i, \pi_{rc}) \cdot \mathrm{size}( \pi_{rc}) \\ = & \sum_{\pi_{rc} \in\varPi_r} \mathrm{grade}(x_i, \pi_{rc}) \cdot \sum_{x_j \in\mathcal{X}} \mathrm{grade}(x_j, \pi_{rc}) \\ = & \sum_{\pi_{rc} \in\varPi_r} \sum_{x_j \in\mathcal{X}} \mathrm {grade}(x_i, \pi_{rc}) \cdot \mathrm{grade}(x_j, \pi_{rc}) \\ = & \sum_{x_j \in\mathcal{X}} \sum_{\pi_{rc} \in\varPi_r} \mathrm {grade}(x_i, \pi_{rc}) \cdot \mathrm{grade}(x_j, \pi_{rc}) \\ = & \sum_{x_j \in\mathcal{X}} c_{rij} \end{aligned}$$

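Continuing the membership-matrix sketch above, the co-occurrence values of Eq. (3) are the entries of G Gᵀ, and Remark 2 can be checked numerically:

```python
import numpy as np

def co_occurrences(G):
    """Eq. (3): c_rij for all pairs of objects in clustering r, as an n x n matrix."""
    return G @ G.T

# Remark 2: the row sums of the co-occurrence matrix equal the scores of Eq. (2).
G = np.random.default_rng(1).dirichlet(np.ones(3), size=10)  # a random soft clustering
scores = G @ G.sum(axis=0)                                   # Eq. (2), as in the sketch above
assert np.allclose(co_occurrences(G).sum(axis=1), scores)
```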
From its definition, we can infer that the co-occurrence vector will satisfy the following property:

Proposition 1

The values of the entries \(c_{rij}\) in the co-occurrence vector belong to the interval [0,1].

Proof

By the properties of fuzzy pseudopartitions, and hence of soft clusterings, we know that

$$ \forall x_i:\quad \sum_{\pi_{rc} \in\varPi_r} \mathrm{grade}(x_i, \pi _{rc}) = 1 $$

The product of two of these terms, which will also be equal to 1, can be expressed as

$$\begin{aligned} 1 = & \biggl(\,\sum_{\pi_{rc} \in\varPi_r} \mathrm{grade}(x_i, \pi _{rc}) \biggr) \cdot \biggl(\sum_{\pi_{rc} \in\varPi_r} \mathrm{grade}(x_j, \pi _{rc}) \biggr) \\ = & \sum_{\pi_{rc}, \pi_{rc'} \in\varPi_r} \mathrm{grade}(x_i, \pi_{rc}) \cdot\mathrm{grade}(x_j, \pi_{rc'}) \\ = & \sum_{\pi_{rc} \in\varPi_r} \mathrm{grade}(x_i, \pi_{rc}) \cdot\mathrm{grade}(x_j, \pi_{rc}) + \mathop{\sum_{\pi_{rc}, \pi_{rc'} \in\varPi_r}}_{\pi_{rc} \neq \pi_{rc'}} \mathrm{grade}(x_i, \pi_{rc}) \cdot\mathrm{grade}(x_j, \pi_{rc'}) \\ = & c_{rij} + \bigtriangledown c_{rij} \end{aligned}$$

Given that the grade of membership is by definition non-negative, all pairwise products of grades will also be non-negative, and, being sums of pairwise products, both \(c_{rij}\) and \(\bigtriangledown c_{rij}\) will in turn be non-negative too: \(0 \leq c_{rij}, \bigtriangledown c_{rij}\).

Finally, given that \(c_{rij}\) and \(\bigtriangledown c_{rij}\) are two non-negative terms adding up to 1, it is clear that neither of them can exceed this value: \(c_{rij}, \bigtriangledown c_{rij} \leq1\). Hence, as we wanted to prove, \(0 \leq c_{rij} \leq1\). □

Rather than considering a single application of one clustering function \(f_{r} \in F\) on \(\mathcal{X}\), we will mainly be concerned with aggregating the results over a number \(R\) of repetitions of the process. In this context, we can define:

Definition 9

(Average co-occurrence vector)

The sequence of average co-occurrence vectors for object \(x_{i}\) is \(( \mathbf{c}^{\star}_{1i} \ldots)\), where each component of \(\mathbf{c}^{\star}_{Ri} = {[ c^{\star}_{Ri1} \ldots c^{\star}_{Rin} ]}^{T}\) is

$$ c^\star_{Rij} = \frac{1}{R} \sum_{r=1}^R c_{rij} $$
(4)

Definition 10

(Average score)

The sequence of average scores of object \(x_{i}\) is \(( s^{\star}_{1i}, s^{\star}_{2i} \ldots)\), where each \(s^{\star}_{Ri}\) is

$$ s^\star_{Ri} = \frac{1}{R} \sum_{r=1}^R s_{ri} $$
(5)

Remark 3

Using average co-occurrence vectors, the average score of object \(x_{i}\) can be expressed as

$$ s^\star_{Ri} = \frac{1}{R} \sum_{r=1}^R s_{ri} = \frac{1}{R} \sum_{r=1}^R \sum_{x_j \in\mathcal{X}} c_{rij} = \sum_{x_j \in\mathcal{X}} \frac{1}{R} \sum_{r=1}^R c_{rij} = \sum_{x_j \in\mathcal{X}} c^\star_{Rij} $$

It is interesting to note that

Proposition 2

The \(s_{ri}\) are linear transformations of \(\mathbf{c}_{ri}\), and the \(s^{\star}_{Ri}\) are linear transformations of \(\mathbf{c}^{\star}_{Ri}\).

Proof

Using an all-ones vector,

$$\begin{aligned} s_{ri} = & \mathbf{1}^T \cdot\mathbf{c}_{ri} = [1\quad 1\quad \cdots\quad 1] \cdot [c_{ri1}\quad c_{ri2}\quad \cdots\quad c_{rin}]^T = \sum_{x_j \in\mathcal{X}} c_{rij} \\ s^\star_{Ri} = & \mathbf{1}^T \cdot\mathbf{c}^\star_{Ri} = [1\quad 1\quad \cdots\quad 1] \cdot [c^\star_{Ri1}\quad c^\star_{Ri2}\quad \cdots\quad c^\star_{Rin}]^T = \sum_{x_j \in\mathcal{X}} c^\star_{Rij} \end{aligned}$$

 □

3.3 Dataset-conditioned distribution

The dataset \(\mathcal{X}\) and clustering function \(f_{r}\) uniquely determine the values of the co-occurrence vectors \(\mathbf{c}_{ri}\), and hence those of all the other quantities considered in the previous section. However, as the selection of \(f_{r}\) is not deterministic, the \(c_{rij}\) can be regarded as random variables, and their conditional distribution across clustering functions, given a certain dataset \(\mathcal{X}\), can be considered.

As the selection of each \(f_{r}\) is independent from the others, the values of the \(c_{rij}\) for different \(r\) will also be independent. The \(\mathbf{c}_{ri}\) for different \(r\) will hence be independent and identically distributed random vectors, with a common expectation vector \(\boldsymbol{\mu}_{i}\) and covariance matrix \(\varSigma_{i}\). We will refer to each element \(\mu_{ij}\) of \(\boldsymbol{\mu}_{i}\) as the affinity of \(x_{i}\) and \(x_{j}\).

Definition 11

(Object affinity)

The affinity of objects \(x_{i}\) and \(x_{j}\) is the conditional expectation of \(c_{rij}\) given \(\mathcal{X}\),

$$ \mu_{ij} = E[c_{rij} \mid\mathcal{X}] $$
(6)

Remark 4

Being the expectations of the \(c_{rij}\), with \(c_{rij} \in[0,1]\), the affinities \(\mu_{ij}\) will also fall in the [0,1] interval.

We can additionally define

Definition 12

(Object expected score)

The expected score of object \(x_{i}\) is the conditional expectation of \(s_{ri}\) given \(\mathcal{X}\),

$$ \mu_i = E[s_{ri} \mid\mathcal{X}] $$
(7)

It is then easy to successively prove that

Proposition 3

The value of the expected score \(\mu_{i}\) of object \(x_{i}\) is

$$ \mu_i = E[s_{ri} \mid\mathcal{X}] = \sum_{x_j \in\mathcal{X}} \mu_{ij} $$
(8)

Proof

As \(s_{ri}\) is the sum of the \(c_{rij}\), its conditional expectation is

$$ \mu_i = E[s_{ri} \mid\mathcal{X}] = E\biggl[\sum _{x_j \in\mathcal{X}} c_{rij} \mid\mathcal{X}\biggr] = \sum _{x_j \in\mathcal{X}} E[c_{rij} \mid\mathcal{X}] = \sum _{x_j \in\mathcal{X}} \mu_{ij} $$

 □

Remark 5

Being the sum of \(n = | \mathcal{X}|\) terms within the interval [0,1], the value of \(\mu_{i}\) will fall in the interval [0,n]. In order to make scores across differently-sized datasets comparable, we will also consider a normalized expected score \(\bar{\mu}_{i}\), defined as \(\bar{\mu}_{i} = \mu_{i} / n\).

Proposition 4

As the number of repetitions \(R\) increases, the conditional distributions of the average co-occurrence vectors \(\mathbf{c}^{\star}_{Ri}\) approach a multivariate Gaussian distribution with expectation \(\boldsymbol{\mu}_{i}\) and covariance matrix \(\varSigma_{i} / R\).

Proof

As the \(c_{rij}\) are independent and identically distributed for different \(r\), by the multivariate central limit theorem we know that the sequence

$$ \sqrt{R} \Biggl( \frac{1}{R} \sum_{r=1}^R \mathbf{c}_{ri} - \boldsymbol{\mu}_i \Biggr) = \sqrt{R} \bigl( \mathbf{c}^\star_{Ri} - \boldsymbol{\mu}_i \bigr) $$

converges in distribution to a multivariate Gaussian distribution with expectation \(\mathbf{0}\) and covariance matrix \(\varSigma_{i}\). Hence, for large enough \(R\),

$$\begin{aligned} \sqrt{R} \bigl( \mathbf{c}^\star_{Ri} - \boldsymbol{\mu}_i \bigr) \approx& \mathcal{N}(\mathbf{0},\varSigma_i) \\ \mathbf{c}^\star_{Ri} - \boldsymbol{\mu}_i \approx& \mathcal{N}(\mathbf{0},\varSigma_i / R) \\ \mathbf{c}^\star_{Ri} \approx& \mathcal{N}(\boldsymbol{\mu}_i,\varSigma_i / R) \end{aligned}$$

 □

Proposition 5

As the number of repetitions \(R\) increases, the conditional distributions of the average scores \(s^{\star}_{Ri}\) approach a Gaussian distribution with expectation \(\mu_{i}\).

Proof

Being linear transformations of random vectors \(\mathbf{c}^{\star}_{Ri}\) approaching a multivariate Gaussian distribution, the \(s^{\star}_{Ri}\) also approach a Gaussian distribution

$$\begin{aligned} s^{\star}_{Ri} = \mathbf{1}^T \cdot \mathbf{c}^{\star}_{Ri} \approx \mathcal{N}\bigl( \mathbf{1}^T \cdot\boldsymbol{\mu}_i,{\bigl( \varSigma^\star_{Ri}\bigr)}^2\bigr) \end{aligned}$$

with a certain variance \({(\varSigma^{\star}_{Ri})}^{2}\). The conditional expectation of these variables hence converges to

$$ \lim_{R \rightarrow\infty} E\bigl[s^\star_{Ri} \mid\mathcal{X}\bigr] = \mathbf{1}^T \cdot\boldsymbol{\mu}_i = \sum_{x_j \in\mathcal{X}} \mu_{ij} = \mu_i $$

 □

3.4 Sampling distribution

We can now proceed to consider the distribution of the scores across multiple samplings of the dataset \(\mathcal{X}\). In particular, we will first focus on the distribution of the affinity \(\mu_{ij}\) between objects \(x_{i}\) and \(x_{j}\), conditioned on their being generated by a certain pair of sources \(\psi_{s}\) and \(\psi_{t}\), respectively. We shall name this measure the affinity of the two sources, \(\zeta_{st}\).

Definition 13

(Source affinity)

The affinity of sources \(\psi_{s}\) and \(\psi_{t}\) is the conditional expectation of the object affinity \(\mu_{ij}\), given that \(y_{i} = \psi_{s}\) and \(y_{j} = \psi_{t}\), across all datasets \(\mathcal{X}\) sampled from \(\varPsi\):

$$ \zeta_{st} = E[\mu_{ij} \mid y_i = \psi_s, y_j = \psi_t ] $$

A particular case of affinity is that of \(\psi_{t} = \psi_{s}\), which we shall name the self-affinity \(\zeta_{ss}\) of source \(\psi_{s}\).

We can now also consider the conditional expectation of the normalized expected scores \(\bar{\mu}_{i}\) for objects from source \(\psi_{s}\).

Definition 14

(Source normalized expected score)

The normalized expected score of a source \(\psi_{s}\) is the conditional expectation of the normalized expected score \(\bar{\mu}_{i}\) of objects \(x_{i}\) generated by \(\psi_{s}\), across all datasets \(\mathcal{X}\) sampled from \(\varPsi\):

$$ \zeta_s = E[\bar{\mu}_i \mid y_i = \psi_s] $$

This newly defined score satisfies that:

Proposition 6

The value of the normalized expected score \(\zeta_{s}\) for a source \(\psi_{s}\) is

$$ \zeta_s = \sum_{\psi_t \in\varPsi} \alpha_t \cdot\zeta_{st} $$

Proof

The value of \(\bar{\mu}_{i}\) is

$$ \bar{\mu}_i = \frac{1}{n} \mu_i = \frac{1}{n} \sum_{x_j \in\mathcal{X}} \mu_{ij} $$

The conditional expectation of \(\bar{\mu}_{i}\) across samplings of \(\mathcal{X}\) for which \(| \mathcal{X}| = n\) can then be found as

$$\begin{aligned} E\bigl[\bar{\mu}_i \mid y_i = \psi_s, | \mathcal{X}| = n\bigr] = & E \biggl[\frac{1}{n} \sum _{x_j \in\mathcal{X}} \mu_{ij} \mid y_i = \psi_s, | \mathcal{X}| = n \biggr] \\ = & \frac{1}{n} E \biggl[ \sum_{x_j \in\mathcal{X}} \mu_{ij} \mid y_i = \psi_s, | \mathcal{X}| = n \biggr] \end{aligned}$$

Assuming the \(x_{j} \in\mathcal{X}\) are independent and identically distributed, and using the law of total expectation, this can be expressed as

$$\begin{aligned} E\bigl[\bar{\mu}_i \mid y_i = \psi_s, | \mathcal{X}| = n\bigr] = & \frac{1}{n} \sum_{x_j \in\mathcal{X}} E \bigl[ \mu_{ij} \mid y_i = \psi_s, | \mathcal{X}| = n \bigr] \\ = & \frac{1}{n} \sum_{x_j \in\mathcal{X}} \sum _{\psi_t \in\varPsi} P(y_j = \psi_t) \cdot E \bigl[ \mu_{ij} \,\mid\, y_i = \psi_s, y_j = \psi_t, | \mathcal{X}| = n \bigr] \\ = & \frac{1}{n} \sum_{x_j \in\mathcal{X}} \sum _{\psi_t \in\varPsi } \alpha_t \cdot E\bigl[\mu_{ij} \mid y_i = \psi_s, y_j = \psi_t, | \mathcal{X}| = n \bigr] \\ = & \frac{1}{n} \sum_{\psi_t \in\varPsi} \alpha_t \cdot E\bigl[\mu_{ij} \mid y_i = \psi_s, y_j = \psi_t, | \mathcal{X}| = n \bigr] \cdot \sum_{x_j \in\mathcal{X}} 1 \\ = & \frac{1}{n} \sum_{\psi_t \in\varPsi} \alpha_t \cdot E\bigl[\mu_{ij} \mid y_i = \psi_s, y_j = \psi_t, | \mathcal{X}| = n \bigr] \cdot n \\ = & \sum_{\psi_t \in\varPsi} \alpha_t \cdot E\bigl[ \mu_{ij} \mid y_i = \psi_s, y_j = \psi_t, | \mathcal{X}| = n \bigr] \end{aligned}$$

Finally, assuming independence of normalized expected scores and source affinities with respect to dataset size n, and plugging the definition of the latter into the above formula, we obtain the desired result:

$$\begin{aligned} E\bigl[\bar{\mu}_i \mid y_i = \psi_s, | \mathcal{X}| = n\bigr] = & \sum_{\psi_t \in\varPsi} \alpha_t \cdot E\bigl[\mu_{ij} \mid y_i = \psi_s, y_j = \psi_t, | \mathcal{X}| = n \bigr] \\ \zeta_s = E[\bar{\mu}_i \mid y_i = \psi_s] = & \sum_{\psi_t \in\varPsi} \alpha_t \cdot E[\mu_{ij} \mid y_i = \psi_s, y_j = \psi_t ] = \sum _{\psi_t \in\varPsi} \alpha_t \cdot\zeta_{st} \end{aligned}$$

 □

3.5 Consistent clustering

We will now impose some conditions on the used clustering families, with respect to how they preserve the density and locality of the sources in Ψ. We will start by considering the detectability of a source by a clustering family:

Definition 15

(Source detectability)

Given a set of sources \(\varPsi\) and a clustering family \(F\), a foreground source \(\psi_{s} \in\varPsi^{+}\) is detectable by \(F\) if its normalized expected score \(\zeta_{s}\) is larger than that of the background source \(\psi_{1}\), \(\zeta_{1}\).

Proposition 7

(Detectability criterion)

Given a set of sources \(\varPsi\) and a clustering family \(F\), a foreground source \(\psi_{s} \in\varPsi^{+}\) is detectable by \(F\) if:

$$ \alpha_s \cdot(\zeta_{ss} - \zeta_{1s}) > \alpha_1 \cdot(\zeta_{11} - \zeta_{s1}) + \mathop{\sum_{\psi_t \in\varPsi^+}}_{\psi_t \neq\psi_s} \alpha_t \cdot(\zeta_{1t} - \zeta_{st}) $$

Proof

From the definition of detectability and Proposition 6,

$$\begin{aligned} \zeta_s > & \zeta_1 \\ \sum_{\psi_t \in\varPsi} \alpha_t \cdot\zeta_{st} > & \sum_{\psi_t \in\varPsi} \alpha_t \cdot\zeta_{1t} \\ \alpha_s \cdot\zeta_{ss} + \alpha_1 \cdot\zeta_{s1} + \mathop{\sum_{\psi_t \in\varPsi^+}}_{\psi_t \neq\psi_s} \alpha_t \cdot \zeta_{st} > & \alpha_s \cdot\zeta_{1s} + \alpha_1 \cdot\zeta_{11} + \mathop{\sum_{\psi_t \in\varPsi^+}}_{\psi_t \neq\psi_s} \alpha_t \cdot \zeta_{1t} \\ \alpha_s \cdot(\zeta_{ss} - \zeta_{1s}) > & \alpha_1 \cdot(\zeta_{11} - \zeta_{s1}) + \mathop{\sum_{\psi_t \in\varPsi^+}}_{\psi_t \neq\psi_s} \alpha_t \cdot(\zeta_{1t} - \zeta_{st}) \end{aligned}$$

 □
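As a small numerical illustration of Proposition 6 and the detectability criterion, the following sketch uses hypothetical source priors α and a hypothetical source-affinity matrix ζ; in practice these quantities are unknown and can only be estimated empirically, as in Sect. 6.4.2.

```python
import numpy as np

def normalized_expected_scores(alpha, zeta):
    """Proposition 6: zeta_s = sum_t alpha_t * zeta_st (the background source is index 0)."""
    return zeta @ alpha

def detectable(alpha, zeta):
    """Definition 15: foreground sources whose normalized expected score exceeds the background's."""
    scores = normalized_expected_scores(alpha, zeta)
    return scores[1:] > scores[0]

# Hypothetical values: dense, local foreground sources have high self-affinity.
alpha = np.array([0.8, 0.1, 0.1])
zeta = np.array([[0.10, 0.10, 0.10],
                 [0.10, 0.60, 0.10],
                 [0.10, 0.10, 0.60]])
print(detectable(alpha, zeta))   # both foreground sources detectable -> F consistent with Psi
```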

Remark 6

This arrangement of the terms in the difference \(\zeta_{s} - \zeta_{1}\) is intended to reflect the degree to which the clustering family captures the density and locality properties of the data in the minority clustering setting:

  • For dense sources, self-affinity should be much larger than affinity to the background source. Therefore, the value of the left-side term should be large.

  • For local sources, the affinity of a foreground source to the background source and to the other foreground sources should not be much different from the corresponding affinities of the background source itself. Therefore, the value of the right-side term should be small.

If a clustering family captures the density and locality of all foreground sources in a set, all of them will be detectable. In this case, the family is said to be consistent with the source set:

Definition 16

(Clustering family consistency)

Given a set of sources \(\varPsi\), a clustering family \(F\) is consistent with \(\varPsi\) if and only if all foreground sources \(\psi_{s} \in\varPsi^{+}\) are detectable by \(F\).

The importance of detectable sources and consistent families lies in the fact that:

Theorem 1

Given a dataset \(\mathcal{X}\) sampled from a set of sources \(\varPsi\) and a consistent clustering family \(F\), for a sufficiently large number of repetitions \(R\), the expected value of the average score \(s^{\star}_{Ri}\) of objects \(x_{i}\) generated by a foreground source \(\psi_{s} \in\varPsi^{+}\) is larger than the expected value of the average score \(s^{\star}_{Rj}\) of objects \(x_{j}\) generated by the background source \(\psi_{1}\).

Proof

Using \(n = |\mathcal{X}|\), replacing the definitions of the quantities involved, and applying properties of the expectation, we know that, if \(\psi_{s}\) is detectable,

$$\begin{aligned} \zeta_s > & \zeta_1 \\ n \cdot\zeta_s > & n \cdot\zeta_1 \\ n \cdot E[\bar{\mu}_i \mid y_i = \psi_s] > & n \cdot E[\bar{\mu}_j \mid y_j = \psi_1] \end{aligned}$$

Assuming independence of the size of the dataset \(\mathcal{X}\),

$$\begin{aligned} n \cdot E\bigl[\bar{\mu}_i \mid y_i = \psi_s, \bigl| \mathcal{X}' \bigr| = n\bigr] > & n \cdot E\bigl[ \bar{\mu}_j \mid y_j = \psi_1, \bigl| \mathcal{X}'\bigr| = n\bigr] \\ n \cdot E\bigl[\mu_i / n \mid y_i = \psi_s, \bigl| \mathcal{X}' \bigr| = n\bigr] > & n \cdot E\bigl[\mu_j / n \mid y_j = \psi_1, \bigl| \mathcal{X}' \bigr| = n \bigr] \\ n \cdot E\bigl[E\bigl[s^\star_{Ri} \mid y_i = \psi_s, \mathcal{X}', \bigl| \mathcal{X}' \bigr| = n \bigr]\bigr] / n > & n \cdot E\bigl[E\bigl[s^\star_{Rj} \mid y_j = \psi_1, \mathcal{X}',\bigl| \mathcal{X}' \bigr| = n\bigr]\bigr] / n \\ E\bigl[s^\star_{Ri} \mid y_i = \psi_s, \mathcal{X}', \bigl| \mathcal{X}' \bigr| = n\bigr] > & E \bigl[s^\star_{Rj} \mid y_j = \psi_1, \mathcal{X}', \bigl| \mathcal{X}' \bigr| = n\bigr] \end{aligned}$$

which, assuming independence again, leads to

$$\begin{aligned} E\bigl[s^\star_{Ri} \mid y_i = \psi_s, \mathcal{X}\bigr] > & E\bigl[s^\star_{Rj} \mid y_j = \psi_1, \mathcal{X}\bigr] \end{aligned}$$

 □

3.6 Algorithm

A corollary of this last Theorem 1 is

Corollary 1

Given a dataset \(\mathcal{X}\) sampled from a set of sources Ψ, and using a clustering family F which is consistent with Ψ, we can devise an algorithmic procedure to obtain a minority clustering of  \(\mathcal{X}\).

Proof

Given a dataset \(\mathcal{X}\), we can apply a sequence of clustering functions \(f_{r}\), drawn from \(F\), and find the average score \(s^{\star}_{Ri}\) for each object \(x_{i} \in\mathcal{X}\). The expected value of the average scores of the background objects will be lower than that of the foreground ones. If a suitable threshold value is determined, we will be able to discriminate most foreground and background objects according to their score. □

Remark 7

A single threshold suffices to separate background and foreground objects because Theorem 1 ensures the scores of the former will be lower than those of objects coming from any of the foreground sources.

Remark 8

It is important to note that, whereas this procedure will allow us to separate foreground and background objects, it will not find the different clusters formed by the foreground ones. For that goal, a regular ensemble clustering algorithm, such as those of Ghosh et al. (2002) or Topchy et al. (2005), can be applied to the objects that have been deemed to belong to the foreground. We will hence focus on the foreground/background separation problem for the rest of the paper.

The resulting algorithm, which we have named Ensemble Weak minOrity Cluster Scoring (Ewocs), is described in Algorithm 1.

Algorithm 1 Ensemble Weak minOrity Cluster Scoring (Ewocs)

The first step of Ewocs is the initialization to zero of an auxiliary array, which will contain the accumulated scores \(s^{+}_{i}\) of all objects (line 1). The main loop is then entered (lines 2–6). The number of iterations of this loop, \(R\), determines the ensemble size and is a user-supplied parameter. Larger values of \(R\) will yield better results, but at the expense of a larger computational cost.

At each iteration, a clustering function \(f_{r}\) is drawn at random from family \(F\) (line 3) and is then applied to dataset \(\mathcal{X}\) to obtain a clustering \(\varPi_{r}\) (line 4). The size of each cluster \(\pi_{rc}\) in clustering \(\varPi_{r}\) is then found (line 5), and then the score of each object, as defined in Eq. (2), is found and added to the accumulated score \(s^{+}_{i}\) (line 6).

When the main loop is over, the final average score of each object, \(s^{\star}_{Ri}\), is found from the final accumulated score \(s^{+}_{i}\) and the ensemble size \(R\) (line 7). From the distribution of these scores \(s^{\star}_{Ri}\), a threshold value \(s^{\star}_{th}\) which separates the scores of the foreground and the background objects is inferred (line 8). At this point, the only steps that remain are separating the objects according to their scores into a foreground and a background cluster (line 9) and returning the resulting clustering (line 10).
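The following is a minimal sketch of Algorithm 1 (a hypothetical helper API, not the actual implementation): draw_clustering stands for drawing a function from a consistent family F and applying it to the data, returning an n×k membership matrix, and find_threshold stands for one of the criteria of Sect. 5.

```python
import numpy as np

def ewocs(X, draw_clustering, find_threshold, R=500):
    n = len(X)
    s_acc = np.zeros(n)                  # line 1: accumulated scores s+_i
    for _ in range(R):                   # lines 2-6: main loop over the ensemble
        G = draw_clustering(X)           # lines 3-4: draw f_r and apply it; G[i, c] = grade(x_i, pi_rc)
        sizes = G.sum(axis=0)            # line 5: size of each cluster (Eq. 1)
        s_acc += G @ sizes               # line 6: add the per-clustering scores (Eq. 2)
    s_avg = s_acc / R                    # line 7: average scores s*_Ri
    s_th = find_threshold(s_avg)         # line 8: threshold separating foreground from background
    foreground = s_avg > s_th            # line 9: split into foreground and background clusters
    return foreground, s_avg, s_th       # line 10: return the resulting clustering
```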

The time complexity of the algorithm prior to the determination of the threshold is dominated by the R repetitions of the main loop. Inside it, if the number of clusters in the clusterings produced by the functions f r F is bounded and not dependent on the size of the dataset \(\mathcal{X}\), the cost of each iteration is in the order of \(O(|\mathcal{X}|)\). In order to keep an overall complexity of \(O(R \cdot |\mathcal{X}|)\), linear with respect to the number of repetitions and the size of the dataset, it is thus necessary to use clustering families and threshold determination methods whose complexity is also a linear function of this size \(|\mathcal{X}|\).

The obtained Ewocs algorithm has a number of components which allow different implementations: neither the consistent clustering function family F (line 3) nor the method for the determination of the threshold score separating foreground and background objects (line 8) are specified. As mentioned in the introduction, the following two sections, Sects. 4 and 5, give insights into each one of these two issues, respectively.

3.7 Clustering model

Some algorithms are only devised to build a clustering of an input dataset, and do not provide any device to determine the hypothetical assignments of new objects to one of the obtained clusters. This is the case, for instance, of most hierarchical (including HAC) and ensemble clustering (e.g., Ghosh et al. 2002; Gionis et al. 2005) algorithms. However, most popular partitional methods—starting with k- and c-means, and continuing with all probabilistic mixture algorithms—provide, as a byproduct of the clustering process, a clustering model which may then be later used as a classification model for new data, after identifying the obtained clusters with classes.

In the case of Ewocs, if the functions in the used family F provide models together with the clusterings when applied to dataset \(\mathcal{X}\), these individual models can be extended to obtain an aggregated minority clustering model.

More specifically, if the application of \(f_{r} \in F\) to \(\mathcal{X}\) produces clustering \(\varPi_{r}\) and clustering model \(\mathcal{M}_{r}\), after Algorithm 1 an Ewocs minority clustering model \(\mathcal{M}^{E}\) can be constructed, containing:

  • the inner clustering models \(\mathcal{M}_{r}\),

  • the size of each cluster \(\pi_{rc}\) in the clusterings \(\varPi_{r}\),

  • and the threshold value \(s^{\star}_{th}\) which separates foreground and background objects.

The process of classifying a new object \(x\) using the obtained model \(\mathcal{M}^{E}\) is described in Algorithm 2. It follows the main steps of the previous Algorithm 1, but replaces the application of new clustering functions \(f_{r} \in F\) with that of the previously obtained clustering models \(\mathcal{M}_{r}\) (line 3). After all models have been applied, the average score of the object is found (line 5), and the object is deemed to belong to the foreground or background cluster according to whether its score exceeds the previously found threshold (line 6).

Algorithm 2 Classification using an Ewocs clustering model
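A corresponding sketch of Algorithm 2, under the assumption (not from the original) that each stored inner model can be called on a new object to obtain its grades of membership to the clusters of the corresponding clustering, and that the cluster sizes and threshold were stored as described above:

```python
def ewocs_classify(x, inner_models, cluster_sizes, s_th):
    """inner_models[r](x) -> membership vector over the clusters of Pi_r;
    cluster_sizes[r]      -> sizes of those clusters, stored during Algorithm 1."""
    R = len(inner_models)
    s_acc = 0.0
    for model, sizes in zip(inner_models, cluster_sizes):
        s_acc += model(x) @ sizes        # line 3: score of x by the r-th stored model
    s_avg = s_acc / R                    # line 5: average score of the new object
    return s_avg > s_th                  # line 6: foreground if above the stored threshold
```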

4 Weak clustering

As stated in Sect. 3.5, the theoretical properties of the Ewocs algorithm depend only on the used clustering family being consistent. We believe that the requirements for consistency, according to Definition 16, should be fairly loose, and that, hence, the Ewocs algorithm is suitable for use with weak clustering algorithms.

In this context, a clustering function family F is a clustering algorithm which includes elements of randomness. Each sequence of random values will determine a member function of the family. From a conceptual point of view, drawing a function f r from the family F will hence correspond to drawing a sequence of random values to be later used by the algorithm. From a computational one, it can correspond, for instance, to choosing a seed value for the algorithm’s internal random number generator.

The two weak clustering algorithms used in the work of Topchy et al. (2003) are based either on splitting the dataset using random hyperplanes, or on clustering projections of the data onto random subspaces. We found the first of them particularly convenient for our purposes, and extended it. Sect. 4.1 reviews our extension of the random splitting algorithm.

However, even if these methods have been proved to produce clusterings useful for combination within an ensemble, they both perform linear mappings of the data and, hence, are based on the notion of linear separation. Although non-linearly separable clusters can be successfully identified by linear separators, non-linear weak separators have not been thoroughly explored. Besides, linear methods depend on the data being expressible as feature vectors, and hence cannot directly deal with structured objects such as sequences or trees.

Our proposal in this direction is a new weak clustering algorithm based on Bregman divergences, which allows non-linear splitting boundaries and, through the use of kernels, can deal with structured data. This proposed Random Bregman Clustering is described in Sect. 4.2.

Later, Sect. 6.4.2 will provide an estimation of the consistency of the proposed clustering families over a number of datasets. The results shall provide an empirical assessment of the suitability of these two families for use within Ewocs.

4.1 Random splitting

The random splitting algorithm presented in Topchy et al. (2003) performs only binary splits of the objects in the dataset. Our Random Splitting algorithm (RSplit) is a generalization of this algorithm which allows an arbitrary number of clusters \(k\).

For this algorithm we require the objects in dataset \(\mathcal{X}\) to be expressible as z-dimensional real vectors (i.e., \(\mathcal{X}\subset \mathbb{R}^{z}\)). To account for multiple clusters, we have adopted the same representation of hyperplanes as in the Multi-Class Support Vector Machines of Crammer and Singer (2001): each splitting hyperplane is defined by a weight vector \(\omega_{c} = (\omega_{c1} \ldots\omega_{cz})\) and an offset \(\delta_{c}\), and objects belong to the cluster (class in the original formulation) from whose hyperplane they are separated by the largest margin.

Similarly to Topchy et al., in a clustering ensemble setting the number of clusters \(k\) does not need to be given a priori, but is rather drawn at random between 2 and a user-supplied value \(k_{max}\).

This idea leads to the simple procedure described in Algorithm 3. The algorithm takes three sequential steps. The first of them is the selection of the effective number of clusters \(k\) (line 1). Any discrete distribution between 2 and \(k_{max}\), such as the uniform distribution, can be used. For each cluster \(\pi_{c}\), random weights \(\omega_{c}\) and offsets \(\delta_{c}\) are then generated (line 2). Again, among all the possible continuous distributions within the [−1,1] range, we have stuck to the uniform one.

Algorithm 3 Random Splitting (RSplit)

Once these values are generated, the margin of each object \(x_{i}\) with respect to each hyperplane is found as the dot product between the object \(x_{i}\) and the hyperplane's weight vector \(\omega_{c}\), shifted by the latter's offset \(\delta_{c}\). Each object is assigned to the cluster induced by the hyperplane with respect to which its margin is maximal (line 3). The resulting clustering can then be returned (line 4).
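A minimal sketch of this procedure, assuming uniform distributions for both the number of clusters and the hyperplane parameters, and representing the output as a hard membership matrix:

```python
import numpy as np

def rsplit(X, k_max=5, rng=None):
    """X: (n, z) array of real vectors. Returns a hard clustering as an (n, k) membership matrix."""
    rng = rng if rng is not None else np.random.default_rng()
    n, z = X.shape
    k = rng.integers(2, k_max + 1)              # line 1: effective number of clusters
    W = rng.uniform(-1, 1, size=(k, z))         # line 2: random weight vectors omega_c
    delta = rng.uniform(-1, 1, size=k)          #         and offsets delta_c
    margins = X @ W.T + delta                   # line 3: margin of each object w.r.t. each hyperplane
    labels = margins.argmax(axis=1)             #         assign to the maximum-margin cluster
    G = np.zeros((n, k))
    G[np.arange(n), labels] = 1.0               # hard membership matrix
    return G                                    # line 4: return the resulting clustering
```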

The time complexity of this algorithm is dominated by the calculation of the margin in step (line 3), and is hence in the order of \(O(k_{max} \cdot z \cdot| \mathcal{X}|)\).

The algorithm requires as input the maximum number of clusters k max in each split. A part of Sect. 6.4.3 is devoted to the empirical study of the sensitivity of the results of Ewocs to its value.

We will henceforth refer to this algorithm as RSplit, and to its application within Ewocs as Ew-RSplit.

4.2 Random Bregman clustering

As stated in the introduction to Sect. 4, two desirable properties of weak clustering algorithms, to which little attention has been devoted so far, are, first, the ability to find non-linear boundaries in vectorial data and, second, the possibility to deal with non-vectorial and/or structured data. Kernel methods have a long history of success across a wide spectrum of machine learning tasks (Shawe-Taylor and Cristianini 2004) and, specifically, they are known for their capability to address both of these issues. The use of kernel functions makes it possible to separate non-linearly separable classes, even with linear methods (Freund and Schapire 1999); and kernels have been devised and successfully applied for non-vectorial objects such as word sequences (Cancedda et al. 2003) or parse trees (Collins and Duffy 2002).

Kernel functions induce a distance metric between objects. Any kernel function \(K_{\phi}\) is equivalent to an inner product in a high-dimensional space, onto which there exists a certain mapping \(\phi\). Hence, if \(\phi(x)\) and \(\phi(y)\) are, respectively, the images of two objects \(x\) and \(y\) in this space, \(K_{\phi}(x,y) = \phi(x) \cdot\phi(y)\). Their squared Euclidean distance in the mapped space, \(D_{\phi}(x,y)\), can then be found as:

$$\begin{aligned} D_\phi(x, y) = & \bigl\| \phi(x) - \phi(y) \bigr\|^2 \\ = & \bigl(\phi(x) - \phi(y)\bigr) \cdot\bigl(\phi(x) - \phi(y)\bigr) \\ = & \phi(x) \cdot\phi(x) + \phi(y) \cdot\phi(y) - 2 \cdot\phi(x) \cdot\phi(y) \\ = & K_\phi(x, x) + K_\phi(y, y) - 2 K_\phi(x, y) \end{aligned}$$
(9)

This transformation is the basis for existing kernel-based all-in clustering algorithms, such as kernel k-means (Girolami 2002). In our case, given that these squared Euclidean distances are, by construction, Bregman divergences, we can join Mercer kernel theory with that of Bregman clustering and devise a weak all-in clustering procedure. The idea is to randomly select a number of objects to act as seeds for the clustering, and then define clusters according to the divergence of the objects in the dataset from these seeds. The resulting Random Bregman Clustering (Rbc) method is described in Algorithm 4.

Algorithm 4 Random Bregman Clustering (Rbc)

Rbc is thus a seed-based algorithm. Given dataset \(\mathcal{X}\), a Bregman divergence \(D\) and a maximum number of clusters \(k_{max}\), the first step of Rbc is selecting the effective number of clusters in the clustering, \(k\) (line 1). Any discrete distribution between 2 and \(k_{max}\), such as the uniform distribution, can be used. A subset \(\hat{\mathcal{X}}\) of size \(k\) is then selected at random from \(\mathcal{X}\) (line 2). We shall name this subset the seed subset, and each one of its members will be a seed. Each seed will induce a cluster in the output clustering.

The output clustering is constructed following the theoretical framework provided by Bregman clustering (Banerjee et al. 2005). First, the distance of each object \(x_{i} \in\mathcal{X}\) to the seeds \(\hat{x}_{c} \in\hat{\mathcal{X}}\) is found. If a hard clustering is desired, each object is then assigned to the cluster induced by its nearest seed (line 4). If, instead, a soft clustering is desired, the grade of membership of each object to each cluster is proportional to the exponential of the negated divergence from the seed of the latter to the former (line 6). In both cases, the only remaining step is then returning the resulting (hard or soft) clustering (line 7).

The construction of the hard clustering is hence equivalent to a single assignment step of Bregman hard clustering; and that of the soft clustering is equivalent to a single expectation step of Bregman soft clustering, with a uniform a priori probability of membership to all clusters.
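A minimal sketch of Rbc, assuming the divergence is supplied as a pairwise function D(x, y); in the soft case, the memberships are normalized over clusters so that the output is a fuzzy pseudopartition, matching the uniform-prior expectation step mentioned above:

```python
import numpy as np

def rbc(X, D, k_max=5, soft=False, rng=None):
    """X: (n, z) array; D(a, b): Bregman divergence between two objects."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(X)
    k = rng.integers(2, k_max + 1)                          # line 1: effective number of clusters
    seeds = X[rng.choice(n, size=k, replace=False)]         # line 2: seed subset
    div = np.array([[D(x, s) for s in seeds] for x in X])   # divergence of each object from each seed
    if soft:
        G = np.exp(-div)                                    # line 6: soft memberships ...
        G /= G.sum(axis=1, keepdims=True)                   # ... normalized over the k clusters
    else:
        G = np.zeros((n, k))
        G[np.arange(n), div.argmin(axis=1)] = 1.0           # line 4: nearest-seed hard assignment
    return G                                                # line 7: return the clustering
```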

The time complexity of the Rbc algorithm is dominated by the clustering construction step (line 4 or 6), and, as long as the kernel computation does not depend on the maximum number of clusters k max or on the size of the dataset \(| \mathcal{X}|\), it is in the order of \(O(k_{max} \cdot| \mathcal{X}|)\). This is comparable to the cost of RSplit, so the increase in expressiveness of the algorithm does not come at the expense of an increase in computational complexity. The algorithm hence remains inexpensive, and suitable for use in a weak clustering ensemble.

In addition to the particular divergence function used, the algorithm only takes as parameter the maximum number of clusters \(k_{max}\), whose influence on Ewocs, as mentioned previously, will be considered in Sect. 6.4.3.

We will henceforth refer to the hard and soft versions of this algorithm as HRbc and SRbc, respectively, and to their application within Ewocs as Ew-HRbc and Ew-SRbc.

We have explored the use of the following families of Bregman divergences for two given objects x and y, at the core of the Rbc algorithm:

Squared Euclidean Distance (Euc):

widely used in a variety of domains because of its simplicity and good performance. It is simply:

$$ D_E(x, y) = {(x - y)}^T (x - y) $$
(10)
Squared Mahalanobis Distance (Mah):

which has specifically been reported to give the best results within previous approaches to minority clustering (Gupta and Ghosh 2005, 2006). It is a version of standard Euclidean distance normalized for a particular dataset:

$$ D_M(x, y) = {(x - y)}^T \varSigma^{-1} (x - y) $$
(11)

where Σ is the covariance matrix of the considered dataset.

Gaussian-Kernel Distance (G(α,γ)):

successfully applied in non-parametric (i.e., distribution-free) clustering algorithms, such as mean shift (Fukunaga and Hostetler 1975; Cheng 1995). The Gaussian kernel \(K_{\phi}(x,y)\) between two objects \(x\) and \(y\) is defined as the exponential of the negated squared Euclidean distance between them, with two additional scaling parameters \(\alpha\) and \(\gamma\):

$$ K_\phi(x, y) = \alpha\cdot e^{-\gamma{\| x - y \|}^2} $$
(12)

By Eq. (9), the squared Euclidean distance it induces in the mapped space, \(D_{\phi}(x,y)\), can be found as:

$$\begin{aligned} D_\phi(x, y) = & K_\phi(x, x) + K_\phi(y, y) - 2 K_\phi(x, y) \\ = & \alpha+ \alpha- 2 \alpha\cdot e^{-\gamma{\| x - y \|}^2} \\ = & 2 \alpha\bigl( 1 - e^{-\gamma{\| x - y \|}^2} \bigr) \end{aligned}$$
(13)

Gaussian kernels locally map the Euclidean space around each point into a hypersphere of radius \(\sqrt{2 \alpha}\), and the rate at which neighbouring points are pushed apart towards the edge of the hypersphere increases with the value of parameter γ. If this Gaussian-kernel distance is used in Rbc, small values of α lead to fuzzy boundaries between the clusters, whereas large values produce crisp ones. As a particular case, the limit of soft Rbc as α→∞ is equivalent to hard Rbc using squared Euclidean distance.
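Equation (13) translates directly into a divergence function that could, for instance, be plugged in as D in the Rbc sketch above (a sketch, with α and γ as plain parameters):

```python
import numpy as np

def gaussian_kernel_distance(alpha, gamma):
    """Returns D_phi(x, y) = 2 * alpha * (1 - exp(-gamma * ||x - y||^2)), as in Eq. (13)."""
    def D(x, y):
        sq = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
        return 2.0 * alpha * (1.0 - np.exp(-gamma * sq))
    return D
```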

4.3 Unsupervised tuning of Gaussian-kernel distance

The use of the presented Gaussian-kernel distance requires the choice of the values for α and γ, which model the degrees of fuzziness and locality of the output clustering, respectively. The determination of suitable values for α and γ can become a problematic issue, especially in unsupervised clustering settings.

Similar problems are to be addressed in all-in fuzzy clustering algorithms which depend on a parameter. The degree of fuzziness parameter, traditionally referred to as m, of the fuzzy c-means algorithm (Bezdek 1981) is probably the one whose tuning has received the most attention in the literature (Deer and Eklund 2003; Yu et al. 2004; Okeke and Karnieli 2006; Schwämmle and Jensen 2010).

In the approach of Schwämmle and Jensen (2010), the authors study the behaviour of the cluster centroids as the degree of fuzziness m increases, and find that, at a certain point, the clustering degrades and the clusters start collapsing onto each other. This phenomenon can be detected by monitoring the minimum distance between centroids: the moment the degradation starts, the first two clusters collapse and this distance becomes close to zero. It is interesting to note that, according to the authors, this happens regardless of how many clusters are used, even if their number does not match the actual one.

Given that “a large fuzzifier value suppresses outliers in data sets”, the authors consider that maximum fuzziness should be sought, and hence propose selecting the largest m value for which the minimum centroid distance still remains above a predefined threshold ϵ (set so as to reduce floating-point errors).

We have adapted the approach of Schwämmle and Jensen to determine the optimal values of \(\alpha\) and \(\gamma\) for Ew-SRbc. The method is particularly suitable to our needs: it does not depend on specific properties of FCM, nor does it require knowledge of the exact number of clusters in the dataset. However, as the Ew-SRbc method does not provide centroids for the found signal clusters, we have instead tuned the parameters with the SoftBBC-EM algorithm of Gupta and Ghosh (2006). Given that the optimal divergence metric for clustering is more dependent on the dataset than on the algorithm used, we believe that the parameters detected using SoftBBC-EM will provide, at least, competitive performance when used within Ew-SRbc.

For a given value of \(\gamma\), the influence of \(\alpha\) on the clustering is equivalent to that of m for FCM. When moving from \(\alpha\rightarrow\infty\) to \(\alpha\rightarrow0\), the fuzziness of the clustering increases, from a completely crisp clustering to gradually fuzzier ones. At a certain point \(\alpha_{th}\), the clustering starts degrading, and each object is eventually assigned a uniform probability of belonging to any cluster.

On the flipside, for a given value of α, the influence of γ on the clustering gives rise to two turning points: for values larger than a certain γ_h, the distance between all pairs of objects tends to 2α; whereas for those smaller than a certain γ_l, it tends to 0. Both phenomena degrade the clustering, and hence also lead to cluster collapse. However, there is an interaction between the values of α and γ: larger values of α force crisper decisions, and hence extend the feasible region for γ.

Hence, the (α,γ) plane will contain an approximately V-shaped curve, on one side of which the minimum centroid distance falls below the floating-point-precision threshold ϵ. Following the criterion of Schwämmle and Jensen, we look for maximum fuzziness, and hence the algorithm should select the vertex of this curve. At this point, the value of α is the minimum one which still avoids degradation, and for it γ_h and γ_l have become equal.

We have empirically verified that such curves actually arise across a variety of datasets. For instance, Fig. 2 shows a contour plot of the minimum centroid distance of the clusterings obtained using SoftBBC-EM on the Toy dataset. In it, the thicker curve denotes the contour level for a value of ϵ=10⁻³, and the point at its vertex corresponds to the values of α and γ detected by the algorithm.

Fig. 2 Contour plot of minimum centroid distance (Toy data)

Given that the minimum centroid distance function has to be obtained by sampling, which introduces a certain amount of experimental noise, standard numerical optimization methods cannot be used; the search is instead performed using a recursive logarithmic grid search algorithm. This allows us to increase the precision of the detection of the optimal point exponentially, without an exponential increase in the computational burden.
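A minimal sketch of such a recursive logarithmic grid search is given below. The objective function (for instance, a score measuring closeness to the vertex of the V-shaped curve, estimated from sampled SoftBBC-EM runs) is supplied by the caller; the bounds, grid resolution and recursion depth are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np

def recursive_log_grid_search(objective, alpha_range=(1e-2, 1e2),
                              gamma_range=(1e-2, 1e2), points=9, depth=4):
    """Recursively refine a logarithmic grid around the best (alpha, gamma)
    pair found so far, increasing precision exponentially with depth while
    keeping the cost at points**2 evaluations per level."""
    (a_lo, a_hi), (g_lo, g_hi) = alpha_range, gamma_range
    best = None
    for _ in range(depth):
        alphas = np.logspace(np.log10(a_lo), np.log10(a_hi), points)
        gammas = np.logspace(np.log10(g_lo), np.log10(g_hi), points)
        # Evaluate the (noisy) objective on the current grid and keep the best cell.
        _, a_best, g_best = max((objective(a, g), a, g)
                                for a in alphas for g in gammas)
        best = (a_best, g_best)
        # Zoom in: shrink each range around the best point before the next pass.
        a_span = (a_hi / a_lo) ** 0.25
        g_span = (g_hi / g_lo) ** 0.25
        a_lo, a_hi = a_best / a_span, a_best * a_span
        g_lo, g_hi = g_best / g_span, g_best * g_span
    return best
```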

We will henceforth refer to the distance induced by this automatically tuned Gaussian kernel as G(Auto).

5 Threshold determination

The last step of the Ewocs algorithm is that of determining, from the sequence of scores \(s^{\star}_{1} \ldots s^{\star}_{n}\) found by the ensemble clustering process, a threshold value \(s^{\star}_{th}\) which separates foreground from background objects. We have considered the following procedures to make this decision.

5.1 Best

In Best, the score for which the performance of the method is maximal according to a given measure is taken as the threshold. Among the metrics used in our evaluation, we have chosen the cutoff point that maximizes the F1 measure, which will be defined in Sect. 6.3. This criterion is informative as an upper bound of the performance of the other criteria, and we hence report it for our experiments.

5.2 Size

Following other works in minority clustering (Gupta and Ghosh 2005, 2006; Ghosh and Gupta 2011), in Size the number of foreground objects is assumed to be known a priori. After sorting the objects by their score, this number of highest-scored objects is taken to form the foreground cluster, whereas the rest are considered background objects. The score of the object at the cutoff point is taken as the threshold.

However, the proposers of this criterion give no hints about how the number of foreground objects can be estimated, and we believe this limits its applicability for unsupervised minority clustering. We have nevertheless included it to allow a comparison to previous approaches which use it. For our experiments, we have assumed that the exact number of foreground objects is known, and used this value. Hence, the results for Size should also be regarded as an upper bound.

5.3 Dist

Following our previous work on relation detection (Gonzàlez and Turmo 2009), Dist arises from the observation of the distribution of the sorted sequence of scores of the clustered objects (see Fig. 3 for example). A small number of instances are assigned high scores whereas a large number are assigned low ones, presumably corresponding to foreground and background objects, respectively. The cutoff point should try to separate these two regions. Intuitively, this point will lie in the region of maximum convexity of the curve, and hence close to the lower left corner of the plot. An approximate but efficient way to determine the threshold is to minimize the distance from the origin in a normalized plot of the scores.

Fig. 3 Accumulated score distribution (Ew-SRbc on Toy data)

The first step in this criterion is hence sorting the objects \(x_{i} \in \mathcal{X}\) by decreasing scores assigned to them by the Ewocs algorithm, so that, in the sequence \(s^{\star}_{1} \ldots s^{\star}_{n}\), \(\forall i: \; s^{\star}_{i} \geq s^{\star}_{i+1}\). These scores are then linearly mapped to the range [0…1], obtaining normalized versions \(\bar{s}^{\star}_{i}\):

$$\begin{aligned} \bar{s}^\star_i = \frac{s^\star_i - \min s^\star_j}{\max s^\star_j - \min s^\star_j} \end{aligned}$$
(14)

Then, the distance from the origin in the normalized plot is found for each object, and the object at the minimum distance is selected as the cutoff object x_th:

$$\begin{aligned} \mathbf{dist}(x_i) = & \sqrt{{(\bar{s}^{\star}_i)}^2 + {(i / \max i)}^2} \end{aligned}$$
(15)
$$\begin{aligned} x_{th} = & \operatorname{arg\,min}_{x_i \in\mathcal{X}} \mathbf{dist}(x_i) \end{aligned}$$
(16)
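The whole Dist criterion can be summarized in a few lines; the sketch below assumes the scores are available as a plain array and returns the threshold score (the helper name is ours).

```python
import numpy as np

def dist_threshold(scores):
    """Dist criterion of Eqs. (14)-(16): sort scores decreasingly, normalize
    both axes to [0, 1], and take the point closest to the origin."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]    # s*_1 >= ... >= s*_n
    s_norm = (s - s.min()) / (s.max() - s.min())           # Eq. (14)
    positions = np.arange(1, len(s) + 1) / len(s)          # i / max i
    d = np.sqrt(s_norm ** 2 + positions ** 2)              # Eq. (15)
    return s[np.argmin(d)]                                 # Eq. (16): threshold score
```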

5.4 nGauss

The theoretical analysis of the Ewocs method presented in Sect. 3 provides us with a new approach to automatically determine the threshold score. In particular, we can benefit greatly from the result stated in Proposition 5: the conditional distributions of the average scores \(s^{\star}_{i}\) approach a Gaussian distribution with expectation μ_i. If we assume that the value of μ_i depends mainly on the source ψ_s which produced x_i, we can try to approximate the overall distribution of the average scores \(s^{\star}_{i}\) by a mixture of Gaussian components, one for each of the sources generating the dataset.

As an example, the histogram of scores generated by the same run of Ew-SRbc on the Toy data is shown in Fig. 4a. As well as the joint distribution of scores (labeled All), the separate histograms for objects from the foreground and background sources are also plotted. Two Gaussian peaks are easily identifiable around the scores of 0.05 and 0.25, and we could expect another minor Gaussian component to explain the probability mass around the score of 0.9.

Fig. 4 Ew-SRbc on Toy data

The key to threshold selection is thus determining the number of mixture components, identifying them, and finding the boundaries between them. The cutoff point must lie at one of these boundaries. There is a wide spectrum of methods to solve this task; among them we have chosen Expectation-Maximization (EM), by far the most popular one. The determination of the number of mixture components reduces to discovering the number of clusters, and hence to a model selection problem. Given that the score distribution is always one-dimensional (regardless of the dimensionality of the input dataset), and that one-dimensional EM is fast, we have used the usual approach of running EM for increasing numbers of clusters and then using a model-selection criterion to select the best one (Fraley and Raftery 1998). More specifically, we have used the Bayesian Information Criterion (Schwartz 1978). In Fig. 4b, the arcs denote the mean, variance and a priori probability of the identified components.
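A minimal sketch of this model-selection step, using scikit-learn's Gaussian mixtures on the one-dimensional scores, could look as follows; the range of candidate component numbers and the number of EM restarts are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_score_mixture(scores, max_components=10):
    """Fit 1-D Gaussian mixtures with an increasing number of components
    and keep the model with the lowest Bayesian Information Criterion."""
    X = np.asarray(scores, dtype=float).reshape(-1, 1)
    best_model, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model
```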

Proposition 5 states that only the component with the lowest mean should contain the background objects. However, it is empirically observed that the selection criterion often chooses models which split this source (and/or others) into several components (this can be observed, for instance, in Fig. 4b). It is hence necessary to separate the found components into those corresponding to the background source and those from the foreground ones. More specifically, if k components \(\hat{\psi}_{1} \ldots\hat{\psi}_{k}\) have been identified (sorted by increasing means \(\hat{\mu}_{1} < \cdots < \hat{\mu}_{k}\)), for each c∈{1…k−1}, the possibility that \(\hat{\psi}_{1} \ldots \hat{\psi}_{c}\) contain background objects and \(\hat{\psi}_{c + 1} \ldots \hat{\psi}_{k}\) contain foreground ones needs to be considered.

The set of cutoff point candidates is hence built from the boundary scores for each c∈{1…k−1}, i.e., the scores \(s^{\star}_{c}\) for which:

$$ p\Biggl(s^\star_c \in\bigcup_{d=1}^{c} \hat{\psi}_d \mid s^\star_c\Biggr) = p\Biggl(s^\star_c \in\bigcup_{d=c+1}^{k} \hat{\psi}_d \mid s^\star_c\Biggr) $$

Moreover, and as stated in Sect. 5.3, the small number of foreground instances are assigned high scores whereas the large number of background instances are assigned low scores. As a result, the variances of the scores of the former will differ significantly from those of the latter, being much larger.

This last fact provides us with a heuristic criterion to choose a single threshold score from the candidate set: letting \(\hat{\sigma}^{2}_{1} \ldots\hat{\sigma}^{2}_{k}\) be the variances of the found components \(\hat{\psi}_{1} \ldots\hat{\psi}_{k}\), we select the boundary score that maximizes the difference between the average component variances on both sides:

$$ s^\star_{th} = \mathop{\mathrm{arg\,max}}_{s^\star_c} \Biggl| \frac{1}{c} \sum_{i=1}^c \hat{\sigma}^2_i - \frac{1}{k - c} \sum_{i=c+1}^k \hat{\sigma}^2_i \Biggr| $$
(17)

We will refer to this criterion as nGauss+Var. As an upper bound of its performance, we will also consider an nGauss+Best criterion, which selects the boundary score \(s^{\star}_{c}\) that maximizes the F1 measure. In Fig. 4b, the possible cutoff points are depicted by dashed vertical rules. The score selected as threshold by both nGauss+Best and nGauss+Var is emphasized in black.
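Building on a fitted mixture such as the one returned by the previous sketch, the nGauss+Var selection could be implemented as below. The boundary scores are located numerically on a fine grid of candidate scores (an implementation shortcut of ours), and the split maximizing the variance-difference criterion of Eq. (17) is kept.

```python
import numpy as np

def ngauss_var_threshold(gmm, scores):
    """For each split c, find the boundary score at which the posterior mass
    of the c lowest-mean components equals that of the remaining ones, and
    keep the boundary maximizing the variance-difference gap of Eq. (17)."""
    order = np.argsort(gmm.means_.ravel())                  # components by increasing mean
    variances = gmm.covariances_.ravel()[order]
    grid = np.linspace(min(scores), max(scores), 10000).reshape(-1, 1)
    post = gmm.predict_proba(grid)[:, order]                # posteriors, low-mean first
    k = len(order)
    best_threshold, best_gap = None, -np.inf
    for c in range(1, k):
        balance = post[:, :c].sum(axis=1) - post[:, c:].sum(axis=1)
        boundary = float(grid[np.argmin(np.abs(balance)), 0])   # candidate cutoff score
        gap = abs(variances[:c].mean() - variances[c:].mean())  # Eq. (17)
        if gap > best_gap:
            best_threshold, best_gap = boundary, gap
    return best_threshold
```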

Remark 9

It is important to note here that there need not be a one-to-one correspondence between mixture components and sources (as mentioned, we have often found the scores from a single source to be split across several components in the chosen mixture model), so we do not expect the number of components selected at this step to match the number of sources in the data. We have devised this procedure for threshold determination only, and cannot ascertain how well correlated the number of components and the number of sources will be.

A slightly different alternative, which sidesteps the problem of separating foreground from background components, is to simplify the possible models and perform EM with only 2 clusters. In this case, there is no ambiguity in the choice of the background and foreground components, as there must be one of each. We have named this simplified Gaussian modeling approach 2Gauss.

Finally, as an implementation-related detail, we have found that using the linearly mapped scores \(\bar{s}^{\star}_{i}\) defined in Eq. (14), instead of the actual scores \(s^{\star}_{i}\), as input to the EM algorithm for model fitting reduces floating-point rounding errors and improves the quality of the detected threshold.

6 Evaluation on synthetic data

In order to validate the proposed Ewocs algorithm and to assess the performance of Ewocs-based approaches, we have performed a series of experiments on synthetic data. In a preliminary stage, the consistency (in the sense of Definition 16) of the different weak clustering algorithms used has been empirically assessed. Afterwards, a full-fledged comparison of the performance of Ewocs-based approaches to other methods in the state of the art has been carried out.

The next sections give details about the evaluation procedure. Sect. 6.1 describes the datasets used, and Sect. 6.2 enumerates the different approaches to be evaluated or employed as references. Sect. 6.3 then describes the evaluation protocol, including the considered metrics, and, finally, Sect. 6.4 presents and discusses the obtained results.

6.1 Data

The first dataset we have used for our experiments is the sample data plotted in Fig. 1. It is a simple 2-dimensional dataset in which five foreground sources, with different shapes and variances, are scattered against a background filled with a uniform distribution. Even though evaluation on a single dataset such as Toy scarcely possesses any statistical significance, “for a 2-dimensional dataset, graphical verification is an intuitive and reliable validation of clustering” (Ando 2007), and we believe this can be useful as an illustration of most of the concepts in our work.

For a more thorough evaluation, we have prepared a number of synthetic datasets in which foreground Gaussian sources are embedded within a set of uniformly distributed background objects. Several parameters, such as the number of sources, the number of foreground and background objects, and the means and variances of the Gaussian sources, were chosen at random for each dataset. A summary of the ranges of these parameters can be found in Table 1. In total, 160 such datasets have been generated. We will refer to this collection as Synth.

Table 1 Parameter range for synthetic dataset generation
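A rough sketch of how such a dataset can be put together is shown below; the value ranges are purely illustrative and do not reproduce those of Table 1.

```python
import numpy as np

def make_minority_dataset(n_sources=3, dim=2, n_background=1000,
                          n_foreground_per_source=100, seed=0):
    """Gaussian foreground sources embedded in a uniform background.
    Label 0 marks background objects; labels > 0 identify the source."""
    rng = np.random.default_rng(seed)
    background = rng.uniform(0.0, 1.0, size=(n_background, dim))
    foreground, labels = [], []
    for s in range(n_sources):
        mean = rng.uniform(0.2, 0.8, size=dim)
        std = rng.uniform(0.01, 0.05)
        foreground.append(rng.normal(mean, std, size=(n_foreground_per_source, dim)))
        labels += [s + 1] * n_foreground_per_source
    X = np.vstack([background] + foreground)
    y = np.array([0] * n_background + labels)
    return X, y
```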

Additionally, in order to perform the preliminary experiments on method consistency, for each dataset in Synth, 9 additional samplings using the same source parameters were generated. The whole 10-dataset groups have been used for consistency estimation.

6.2 Approaches

We have implemented the Ewocs algorithm using each one of the weak clusterers proposed in Sect. 4.

Ew-RSplit::

Ewocs using the RSplit algorithm of Sect. 4.1.

Ew-HRbc::

Ewocs using the hard Rbc algorithm, HRbc, of Sect. 4.2.

Ew-SRbc::

Ewocs using the soft Rbc algorithm, SRbc, of Sect. 4.2.

The notation Ew-RSplit/R×k (resp., Ew-HRbc/R×k and Ew-SRbc/R×k) will be used to refer to the results obtained by Ewocs with an ensemble of R clusterings, each one produced by RSplit (resp., HRbc and SRbc) with k_max = k.

In order to assess the performance of Ewocs-based approaches with respect to the state of the art, we have implemented five existing methods for minority clustering:

BBOCC::

as proposed by Gupta and Ghosh (2005). We have used the actual number of foreground objects as the desired clustering size parameter.

BBCPress::

as proposed by Gupta and Ghosh (2006). Similarly to BBOCC, we have used the actual number of foreground objects as the desired clustering size parameter. The number of clusters, however, has been assumed to be given a priori, and by BBCPress/k we will refer to the runs of this algorithm with a number of clusters k.

DGRADE::

as proposed by Ghosh and Gupta (2011). Again, the actual number of foreground objects has been used as the number of dense points to be classified into clusters. Among the three strategies sketched by the authors, we have implemented the only one not requiring the number of clusters k or a maximum stability parameter from the user. This strategy has been preferred, despite its greater computational cost, because of its much lower degree of supervision. Finally, following the original paper, the output of DGRADE has been refined using the BBC algorithm.

AutoHDS::

as proposed by Gupta et al. (2010). The tuning of the smoothing and particle threshold parameters of the algorithm using the interactive approach proposed by the authors is not feasible in our case (for the Synth corpus, it would require the manual tuning of 160 sets of parameters). We have instead considered a setting in which a single set of parameter values is used across all datasets. Thus, by AutoHDS/n_eps–n_part we will refer to the runs of this algorithm with a smoothing parameter n_eps and a particle threshold n_part.

k MD::

as proposed by Ando (2007). The implementation tries to mimic that of the original paper to the maximum extent: we have used Gaussian distributions for the foreground clusters and a uniform distribution for the background. The clusters have been initialized by selecting fixed-size sets of the points most similar to a randomly chosen one. To refer to the runs of this algorithm with a certain parameter tuning, we will use the notation k MD/R×s_0–s_min, where R refers to the number of cluster detection iterations, and s_0 and s_min refer to the initial and required cluster size parameters.

For the divergence-based approaches (i.e., all but k MD), Mah has been used as the metric.

It is important to note that these methods, as well as, to our knowledge, all other existing minority clustering methods proposed so far, include critical elements of supervision, in the form of parameters such as the number of foreground objects, the number of foreground clusters, and/or the foreground cluster sizes.

Additionally, we have considered three pseudo-systems for reference, to give lower and upper bounds of the performance of the actual systems:

Random::

A random clusterer, which assigns objects to the foreground or background cluster according to a Bernoulli distribution. Among such clusterers, we have taken the one whose label proportions match the actual source size ratio in the data.

AllFG::

A blind clusterer, which assigns all objects to the foreground cluster.

Convex::

An oracle clusterer for the Synth dataset, which detects as foreground objects those that lie within the convex hull of the actual foreground sources. This Convex clusterer will hence detect all foreground objects, but will also include some background ones as false positives.

6.3 Protocol

In the preliminary evaluation of clustering consistency, for each one of the 10 samplings of the datasets in the Synth collection, 25 runs of every weak clustering algorithm were performed, and the source affinities have been estimated from the co-occurrence matrices of these 250 clusterings. We have then reported the fraction of datasets with which the considered methods are consistent (Cons), as well as, more precisely, the fraction of sources which are detectable by them—both macro- (M-Det) and micro-averaged (μ-Det) by dataset.

In order to assess and compare the performance of the different approaches in the full minority clustering evaluation, we have used the well-known measures of precision (Prc), recall (Rec) and F1, which have previously been employed for the evaluation of minority clustering (Ando 2007). As is customary, values of these metrics are reported as percentages.

Additionally, to evaluate the performance of the scoring phase in isolation from that of threshold selection, we have also included information about Receiver Operating Characteristic (ROC) curves, more specifically, the Area Under the ROC Curve (AUC) (Fawcett 2006). The relation of dominance between ROC curves has been proved equivalent to that between precision/recall curves (Davis and Goadrich 2006), and ROC curves are less sensitive to variations in class skew.
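A minimal sketch of how these measures can be computed for a given threshold, using standard scikit-learn helpers, is shown below (the function name and argument conventions are ours).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

def evaluate(y_true, scores, threshold):
    """y_true: 1 for foreground objects, 0 for background; scores: Ewocs
    scores; threshold: the value selected by one of the criteria of Sect. 5."""
    y_pred = (np.asarray(scores) >= threshold).astype(int)
    return {
        "Prc": precision_score(y_true, y_pred),
        "Rec": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, scores),  # threshold-independent
    }
```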

To reduce the impact of randomness, we have carried out 5 different runs for each method, configuration and dataset, and reported the average measures.

Finally, to compare the performance of the different methods across the synthetic datasets, we have used the Bergmann-Hommel non-parametric hypothesis test (Bergmann and Hommel 1988). Being non-parametric, the test judges the relative performances of the different methods with respect to each other, rather than their absolute scores or score differences. Recently, works such as that of Demšar (2006) have advocated non-parametric tests to assess significance in machine learning tasks, as the assumption of metric commensurability across datasets, required by usual parametric tests such as Student's t-test or ANOVA, is often broken. The use of the Bergmann-Hommel test in particular has been recommended by García and Herrera (2008).
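The sketch below illustrates only the first two ingredients of this procedure, namely the Friedman omnibus test over per-dataset ranks and the average ranks used in the plots; the Bergmann-Hommel post-hoc correction itself is more involved and is not shown. The result matrix is toy data, purely for illustration.

```python
import numpy as np
from scipy import stats

# Rows are datasets, columns are methods; entries are F1 (or AUC) values.
results = np.array([[0.82, 0.85, 0.79],
                    [0.74, 0.80, 0.71],
                    [0.90, 0.91, 0.88],
                    [0.65, 0.70, 0.66]])

# Friedman omnibus test: are the methods' rank distributions distinguishable?
statistic, p_value = stats.friedmanchisquare(*results.T)

# Average ranks across datasets (rank 1 = best, i.e., highest score).
ranks = np.apply_along_axis(stats.rankdata, 1, -results)
average_ranks = ranks.mean(axis=0)
print(p_value, average_ranks)
```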

The graphical presentation of the results is that introduced by Demšar (2006): methods are placed along the horizontal axis according to their average ranks across datasets, and those for which no statistically significant difference can be found are joined by thick bars.

6.4 Results

Sect. 6.4.1 first presents the results of the full experiments on the Toy dataset. The next two sections, 6.4.2 and 6.4.3, detail the results obtained on the Synth collection.

6.4.1 Clustering on the Toy dataset

A graphical depiction of the output of a representative subset of the compared approaches on the Toy dataset is shown in Fig. 5. The plots correspond to the parameter configurations achieving the best results.

Fig. 5 System output for the compared methods (Toy data)

The BBOCC method is unable to detect the multiple foreground sources and instead creates a single cluster covering two of them. Similarly, the BBCPress method, despite being given the correct number of sources, fails to recognize the half-moon-shaped one and instead splits it into two clusters, and rounds off the triangle-shaped one. As a result, the top right source is missed. The limitations of these two methods are well known, and come from the fixed number and shape (hyperelliptical) of the clusters they look for. Seeding BBC using DGRADE does not work in this case, either.

On the flipside, the AutoHDS, k MD and Ew-SRbc methods are able to recognize the variously shaped foreground sources. AutoHDS seems to include too many background objects into the clusters, whereas the classification of the two other methods is more accurate. For this Toy dataset, k MD produces tighter clusters, favouring precision over recall, whereas for Ew-SRbc this tendency is reversed.

The ROC curves for these approaches are plotted in Fig. 6. k MD and AutoHDS do not provide an adjustable decision threshold; instead, their output is a fixed crisp boundary, and hence their ROC curve is composed of two straight segments. On the contrary, Ew-SRbc, as all other Ewocs-based approaches, assigns a continuous score to all objects, and the separation between foreground and background ones is based on a threshold. Hence, its ROC curve, as a function of this threshold, is much smoother. For this reason, even if the differences in precision, recall and F1 score between the methods are small (see Fig. 6b), the curves for AutoHDS and k MD are missing a large fraction of the AUC, which that of Ew-SRbc is able to enclose. This fact will also be relevant to the evaluation on Synth.

Fig. 6 ROC curves for AutoHDS, k MD and Ew-SRbc (Toy data)

Regarding the proposed threshold determination approaches, Fig. 7 shows the precision, recall and F1 curves for the output of Ew-SRbc on Toy, according to the number of objects clustered as foreground. The cutoff points for the different criteria are plotted above the F1 curve. For this particular case, nGauss+Var finds the same cutoff point as nGauss+Best, and they are both plotted as nGauss.

Fig. 7 Precision, Recall and F1 curves, and cutoff point determined by different threshold detection criteria (Ew-SRbc on Toy data)

The threshold values found by Size, nGauss and 2Gauss are quite close to the optimal one, Best. Only the threshold found by Dist falls somewhat behind, trading in this case too much recall for precision.

6.4.2 Consistency on the Synth dataset collection

Table 2 contains the values of consistency and averaged source detectability of the different weak clustering algorithms, estimated over all Synth datasets. Given that higher-dimensional data exhibit a larger degree of sparsity, which may render the results not comparable with those of lower-dimensional datasets, we have opted to present the results segregated by the number of dimensions of the datasets.

Table 2 Consistency of the proposed weak clustering algorithms (Synth data)

Our hypothesis that weak clustering algorithms are consistent with data generated by dense and local sources seems corroborated by the empirical evidence from these experiments. The property holds in all tested datasets for 3-, 5- and 8-dimensional data. Only for 2-dimensional datasets do the algorithms, especially RSplit and SRbc using the Mah distance, fail to detect some of the sources (up to 7.45 % of them in the case of SRbc with Mah). Overall, for these two methods full consistency is only achieved in three fourths of the datasets, and HRbc fulfills the property in 91.67 % of the cases. On the flipside, the performance of SRbc using G(10,10) is remarkable, as it obtains perfect consistency even in these harder cases. The results also confirm the intuition that 2-dimensional datasets, being less sparse, are harder to deal with.

However, even if perfect consistency is not achieved, the fact that, in the worst of the cases, more than 94 % of the sources are detectable suggests that the lack of full consistency does not necessarily hamper the actual performance of the Ewocs algorithm. The study of the clustering results over the same Synth collection in the next section will shed light on this issue.

6.4.3 Clustering on the Synth dataset collection

Table 3 contains the AUC values for the compared methods across all datasets in the Synth collection, as well as their achievable precision, recall and F1 values, using the Best threshold selection criterion. As mentioned before, the degree of sparsity increases with the number of dimensions, which simplifies the clustering task, and so the results across datasets with different dimensionality may not be commensurable. For this reason, we have again opted to split the results according to dataset dimensionality.

Table 3 Results for Synth data

For reasons of brevity, only the parameter configurations which achieve the best results for each method are included. Later in this same section, experiments studying the sensitivity of each method to the tuning of its parameters will be presented.

Finally, Fig. 8 contains a graphical representation of the outcome of Bergmann-Hommel tests on the F1 and AUC measures across all datasets in Synth. As mentioned previously, the position on the line indicates average rank across datasets (with 1 corresponding to a method consistently obtaining the highest score); and methods without a statistically significant difference between them are joined by thick bars.

Fig. 8 Bergmann-Hommel tests for the compared approaches (Synth data)

In these experiments, Ewocs-based approaches are able to obtain results at the state of the art for minority clustering; in particular, Ew-SRbc is able to outperform the existing approaches for the task, achieving a performance close to the upper bound given by Convex. We believe this is an excellent result, and one which confirms the validity of the Ewocs algorithm.

BBOCC is the weakest approach among the compared ones. Even if its results are above the Random and AllFG baselines, the limitation to a single hyperelliptical cluster produces clusterings with a lower precision than those from other approaches. The differences are statistically significant in terms of both F1 and AUC.

Regarding Ew-RSplit, the extension from 2 to a larger number of hyperplanes improves the performance of the RSplit algorithm within the ensemble. However, the algorithm favours recall over precision too much, and even if this allows it to achieve a good AUC measure, its values of F1 are lower than those of other methods which exhibit a similar performance, such as Ew-HRbc and BBCPress. These two approaches trade some of the recall of Ew-RSplit for precision, thus obtaining lower AUC but higher F1. The differences between the three systems, nevertheless, are deemed not significant by the Bergmann-Hommel test, and the three can hence be considered similar in terms of minority clustering power.

The results of AutoHDS are also comparable in terms of F1 to those of these three methods. Hypothesis testing finds no statistically significant differences among them, either. However, the method seems unable to benefit from the increasing sparsity present in higher-dimensional datasets, obtaining the lowest F1 scores among all methods in the 8-dimensional ones. We believe the high sensitivity of the method to the tuning of its parameters (which we will consider below) can be an explanation for these poor results: it is unlikely that the same parameters produce good clusterings across all datasets in Synth. This seems a major drawback of the approach, and one which we think seriously reduces its utility in unsupervised minority clustering scenarios.

Finally, concerning DGRADE, k MD and Ew-SRbc, their performance is significantly better than that of the other methods in terms of F1, and that of Ew-SRbc is also better in terms of AUC. This is true for Ew-SRbc not only when using the G(10,10) distance, which achieves the best results on Synth with a significant difference from the competing methods, but also when using the unsupervisedly tuned G(Auto). The results for Ew-SRbc using G(Auto) are only slightly below those of k MD in terms of F1, and slightly below those obtained using G(10,10) in terms of AUC. In both cases the differences are not statistically significant. Taking into account that the determination of G(Auto) is completely unsupervised, we consider these results truly encouraging.

However, the results using the Mah distance within Ew-SRbc fall much below those obtained with the G(α,γ) family. One reason for this behaviour may lie in the fixed degree of fuzziness allowed by Mah: the standardized scale that this distance provides may not always give the most suitable fuzziness. The greater versatility offered by the G(α,γ) distances is thus a valuable property.

Note that the high F1 score of k MD comes from its precision, which is particularly high, for instance, in 3-dimensional datasets; whereas Ew-SRbc tends to favour recall over precision, and DGRADE seems to find more balanced solutions. These tendencies agree with the ones observed on the Toy dataset.

Finally, the values of AUC for AutoHDS and k MD are lower than for all other methods except BBOCC. The difference comes, as mentioned in Sect. 6.4.1, from the lack of an adjustable threshold in their output.

In light of these results, we can assert that Ewocs-based approaches perform competitively with respect to the state of the art in the minority clustering task, in terms of the AUC and F1 of the obtained clusterings. Ensemble clustering methods have hence been shown to be useful for this task.

Moreover, the fact that the Ew-SRbc method is able to outperform all other compared approaches when using the manually tuned Gaussian-kernel distance, and most of them when using the automatically tuned one, leads us to believe that, on the one hand, kernel-based distances are a serious alternative to other similarity measures used in clustering tasks; and that, on the other, the proposed Rbc algorithm can be successfully employed to construct individual clusterings suitable for combination within a clustering ensemble.

However, these conclusions require an evaluation of the sensitivity to parameter tuning of the compared approaches.

Parameter sensitivity

A number of experiments have been performed to assess the relevance of parameter tuning for the different approaches, in terms of the impact these parameters have on their performance in the minority clustering task.

Figure 9 provides two plots of the Best F1 score as a function of the parameters in Ew-SRbc: the ensemble size R, the maximum number of clusters in each individual clustering k_max, and the Gaussian-kernel distance scaling factors α and γ. The plots correspond to the 2-dimensional subset of the Synth collection, which contains the datasets where the difference in performance between approaches is the largest.

Fig. 9 Effect of parameters on Ew-SRbc (2-dimensional Synth data)

First, Fig. 9a plots the curves of F1 for a fixed distance function G(10,10). It can be seen how a change in either of the two parameters does influence the F1 score. However, the difference in performance is small, and, more importantly, the value stabilizes with increasing values of both R and k_max. In particular, the ensemble size R controls the convergence of the object scores to the source affinities. Higher values will provide more accurate clusterings, with the drawback of an increased computational cost. In these experiments, a value such as the R=k_max=100 we have used produces good-quality clusterings across a wide range of situations. Nevertheless, we will revisit the influence of these parameters on clustering performance in the evaluation on real-world collections.

However, the plot in Fig. 9b, which shows the curves of F1 for fixed values of R=k_max=100, presents a different picture. The scaling parameters of the Gaussian-kernel distance also have an impact on the F1 of the clusterings produced by Ew-SRbc, but in this case the values do not stabilize. Moreover, the curves for different α present a maximum around γ=10, and lower values of F1 are obtained on either side of these maxima. The score using G(α,γ) distances can significantly exceed that obtained using Mah (both with Ew-SRbc and Ew-HRbc), but it can also drop far below it.

The selection of suitable values for α and γ thus seems a crucial issue when using Ew-SRbc, as anticipated in Sect. 4.3. Nevertheless, the plot in Fig. 9b also shows how the value of F1 obtained using the automatically tuned G(Auto) distance provides an approximation to the optimum. We hence believe that G(Auto) can be used to perform the minority clustering task satisfactorily, even if we must also admit that fine tuning can improve the overall results.

Regarding non-Ewocs-based approaches, Fig. 10 contains plots of the Best F1 score for BBCPress, AutoHDS and k MD, as a function of their various parameters. For reference, the plots also include the value obtained by Ew-SRbc/100×100 using G(Auto).

Fig. 10 Effect of parameters on BBCPress, AutoHDS and k MD (2-dimensional Synth data)

DGRADE provides an effective alternative to BBCPress for obtaining a suitable set of parameters and seeds for BBC (Fig. 10a). However, its computational cost limits its applicability to large collections. Concerning AutoHDS and k MD (Figs. 10b and 10c), the latter seems more robust to the choice of its parameters. However, no procedure has been proposed to automate the tuning of either method, other than interactive trial and error. Ew-SRbc hence has an advantage over the compared approaches, thanks to the automatic tuning procedure of the proposed G(Auto) distance. Moreover, the results obtained using Ew-SRbc and G(Auto) are better than those of the compared approaches in terms of AUC and F1.

We believe the existence of such a tool is a significant difference with respect to other approaches, and that this makes Ew-SRbc suitable for completely unsupervised minority clustering tasks.

Threshold determination

Table 4 contains the values of precision, recall and F1 obtained when applying the different criteria to the output of each minority clustering method. Again, for brevity the table contains only the results across the 2-dimensional datasets of Synth. Concerning the statistical significance of the differences, Fig. 11 contains the graphical representation of the outcome of Bergmann-Hommel tests on precision, recall and F1 across all (not only 2-dimensional) datasets in Synth.

Fig. 11 Bergmann-Hommel tests for the compared criteria (Ew-SRbc/100×100 using G(10,10) on Synth data)

Table 4 Results for 2-dimensional Synth data

The results show there is still a gap between the maximum achievable F1 score (criterion Best) and that obtained using the different criteria. There is another gap between the F1 scores of the criteria that contain some element of supervision (Size and nGauss+Best) and those of the completely unsupervised ones (Dist, nGauss+Var and 2Gauss). These differences are present in a consistent way across all the Ewocs-based approaches.

Criterion Size obtains the results closest to Best in terms of F1 and the best figures for precision, but at the cost of giving the lowest recall. All differences are statistically significant.

However, the upper bound achievable using Gaussian modelling of the scores, that of nGauss+Best, lies quite close to the output of Size. For the Ew-SRbc/100×100 method using G(10,10) on 2-dimensional data, the difference is only 0.3 % in terms of F1. nGauss+Best also shifts the bias towards recall instead of precision, bringing the threshold much closer to the region where the optimal one (that of Best) lies.

Finally, regarding the three unsupervised criteria, nGauss+Var seems the one which comes closest in performance to nGauss+Best. Even if this does not hold for the particular subset of 2-dimensional data, overall nGauss+Var gives higher precision and lower recall than nGauss+Best. These differences are not statistically significant, but the overall difference in F1 score is. Dist and 2Gauss show a strong bias towards recall, particularly the latter, and fall well below nGauss+Best in precision. They perform worse in terms of F1 than the other proposed criteria. However, from the statistical point of view, the difference between them and nGauss+Var is not significant.

Taking these and all the obtained results into account, we can affirm that, even if elements of supervision improve the results in the minority clustering task, the proposed Ewocs algorithm allows us to obtain competitive results using an unsupervised approach: the results obtained by Ew-SRbc/100×100 using G(Auto) and one of Dist, nGauss+Var or 2Gauss are above those obtained by other, supervised approaches, such as BBOCC or BBCPress.

Regarding the elements of supervision introduced by each one of the criteria, it is remarkable that the use of nGauss+Best, which would require an a posteriori selection of the number of background Gaussian components from a small number of candidates, suffices for Ew-SRbc/100×100 using G(Auto) to outperform all other approaches, including k MD, which requires careful tuning of three parameters R, s_0 and s_min.

Even if manual determination of the most suitable G(α,γ) distance, or more informed (i.e., supervised) threshold detection criteria, such as Size or Best, allow further increases in the F1 scores obtained by Ew-SRbc, we believe that the fact that, using no or little supervision, Ew-SRbc outperforms supervised minority clustering approaches in the state of the art is an excellent result, and one which proves the validity of the whole minority clustering framework introduced by the Ewocs algorithm.

7 Evaluation on real-world data

We have carried out a number of additional experiments with the goal of extending our conclusions to larger and higher-dimensional collections coming from real-world problems.

The first dataset we have used belongs to the text classification domain. Specifically, we have used a subset of the Reuters-21578 corpus, which is a popular benchmark for the task. The Reuters corpus was previously used by Ando (2007) to evaluate their minority clustering algorithm.

Our second collection of datasets comes from the area of information extraction, within which, as mentioned previously, Gonzàlez and Turmo (2009) introduced the Ewocs algorithm. More specifically, the authors considered the problem of unsupervised relation detection (i.e., learning which pairs of entities mentioned in a document collection are linked by some relation without resorting to annotated data), and proposed a reduction of the problem to minority clustering. Ewocs was then used to find foreground objects, which in the context of the task corresponded to pairs of related entities.

The experiments presented in the following sections extend those of the previous work, and compare the results of Ewocs not only to other relation detection approaches, as in the original paper, but also to other minority clustering approaches. Sect. 7.1 describes the corpora used and the data generation procedure. Sect. 7.2 reviews the approaches that we have considered for the task. Finally, Sect. 7.3 presents the obtained results.

7.1 Data

As previously mentioned, the Reuters corpus has been used by other authors to evaluate minority clustering algorithms. Specifically, Ando (2007) assembled a dataset which contained the documents in topics oilseed, money-supply, sugar and gnp as foreground objects, and those in acq as background.

However, in preliminary experiments we found this partition not to provide a real minority clustering problem, but rather an all-in clustering one with unequal cluster sizes. For this reason, we decided to use a different subset of the collection, in which clusters have to be determined on the grounds of density. The documents belonging to the largest category, earn, have been taken as foreground objects, and the rest of the documents as background ones. To reduce the density of the background, only a random 60 % of its documents has been kept. The resulting dataset has a total of 3987 and 10507 documents belonging to each one of the two classes, respectively.

Similarly to other works in document clustering (Zhao and Karypis 2004), the text in each document has been tokenized, and numbers and stop words have been removed. Finally, the remaining tokens have been stemmed using the method of Porter (1980), and tf-idf vectors have been computed (Spärck-Jones 1972). We will refer to the obtained dataset as Reuters.
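A rough sketch of this preprocessing pipeline, assuming NLTK's Porter stemmer and scikit-learn's tf-idf implementation as stand-ins for the tools actually used, is given below.

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def tokenize(text):
    """Keep alphabetic tokens only (dropping numbers), remove stop words,
    and Porter-stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]

vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=False, token_pattern=None)
# documents would be the list of raw Reuters texts (not included here):
# X = vectorizer.fit_transform(documents)
```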

Regarding the relation detection datasets, and following Gonzàlez and Turmo (2009), a hold-out evaluation scheme has been used: minority clustering is first performed on the objects generated from a large document collection, and the obtained clustering models are then applied on additional objects from new documents, where performance is measured. We have used the year 2000 subset of the Associated Press section of the AQUAINT Corpus to perform the clustering (Graff 2002), and the hold-out datasets are generated from the annotated corpora used in the Relation Detection and Recognition task of the ACE evaluation (ACE 2008), for which ground truth is available. Specifically, we used the training data of ACE evaluations for years 2003, 2004 and 2008. The corpora add up to almost 29 million and over half a million words, respectively.

Each dataset in the collection contains binary feature vectors which capture syntactic properties of the contents of pairs of entities of two considered types (e.g., for Org-Per, one of the two entities will be an organization and the other one a person). We have considered for evaluation the 11 entity type pairs that were already used by Gonzàlez and Turmo (2009). Table 5 contains a quick overview of the entity types annotated in the corpus.

Table 5 Ace entity types

The set of syntactic features used to generate the binary vectors to be clustered is the part-of-speech–based one used in the original paper. Features occurring in fewer than ten objects are filtered out. The numbers of objects and dimensions in the resulting datasets are listed in Table 6. We will refer to this dataset collection as Apw-Ace.

Table 6 Size of Apw-Ace datasets

7.2 Approaches

In order to assess the generality of the results on the Synth collection over real-world data, we have used the same set of methods presented in Sect. 6.2 for this new set of experiments. Thus, the BBOCC, BBCPress, DGRADE, AutoHDS and k MD algorithms have been applied to Reuters, as well as the Random and AllFG baselines. For the divergence-based approaches, we have resorted to Euc rather than Mah because of the higher computational cost of the latter as the number of dimensions grows.

Regarding k MD, multinomial distributions have been used for both the foreground and background clusters. Additionally, for this particular algorithm, instead of using a tf-idf representation of the Reuters dataset, we have employed the unsupervised feature selection scheme of Slonim and Tishby (2000): documents are represented using raw term frequencies, but, to reduce data dimensionality, only the 200 stems that contribute the most to the mutual information between stems and documents are selected. This configuration mimics that used by Ando (2007) on the same corpus.
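A rough sketch of this selection step is shown below: each stem is scored by its contribution to the mutual information between stems and documents, estimated from the raw term-frequency matrix, and the 200 best-scoring stems are kept. Whether this matches every detail of the original Slonim and Tishby scheme is an assumption on our part.

```python
import numpy as np

def top_terms_by_mutual_information(tf, n_terms=200):
    """tf: (documents x terms) raw term-frequency matrix.  Returns the
    indices of the n_terms stems contributing most to I(Terms; Documents)."""
    p_td = tf / tf.sum()                              # joint p(term, document)
    p_d = p_td.sum(axis=1, keepdims=True)             # document marginal
    p_t = p_td.sum(axis=0, keepdims=True)             # term marginal
    with np.errstate(divide="ignore", invalid="ignore"):
        pointwise = p_td * np.log(p_td / (p_d * p_t))
    contribution = np.nan_to_num(pointwise).sum(axis=0)
    return np.argsort(contribution)[::-1][:n_terms]
```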

The much larger sizes of the datasets in Apw-Ace render the use of some of the previous methods, namely DGRADE and AutoHDS, impossible because of their cubic computational complexity. Nevertheless, we have kept the rest of the approaches in the comparison; the only change has been the use of Bernoulli rather than multinomial distributions in k MD, the former being more suitable for binary feature vectors.

Moreover, in order to compare the performance of minority clustering approaches with respect to other relation detection methods, we have included an additional method in our comparison:

Grams::

as proposed by Hassan et al. (2006). The method uses a combination of n-gram models and graph-based mutual reinforcement to generate POS-based patterns, sorted by confidence, which can then be applied to new data. The approach requires no additional external resources, and the acquired patterns can be applied to hold-out data, thus allowing a fair comparison within the present setting.

The authors of the Grams method do not provide a way to determine a threshold value for the confidence of the patterns, so, similarly to other methods, we take the Best value in terms of obtained F1 score. Thus, the results displayed for Grams are an upper bound of the performance of the method.

7.3 Results

The next two sections present and analyze the results of the experiments on the two considered real-world scenarios: Sect. 7.3.1 deals with those on the Reuters dataset, and Sect. 7.3.2 details the outcome of the evaluation on Apw-Ace.

7.3.1 Clustering on the Reuters dataset

Table 7 contains the AUC values for the compared methods on the Reuters dataset, as well as the precision, recall and F1 values obtained using the different threshold selection criteria. Similarly to Sect. 6.4.3, only the configurations which achieve the best results for each method are included.

Table 7 Results for Reuters data

Strikingly, the results obtained by k MD are well below those of the baseline Random and AllFG methods, contrary to the excellent performance shown on the Synth datasets. In particular, the obtained recall is extremely poor, below 1 %, and precision barely reaches 25 % for the k MD/100×800–5 setting, which obtains the best results among those tried. k MD/100×800–50, which was used by Ando (2007) on Reuters documents, achieves even lower precision, down to 18 %. Overall, F1 remains around 1.5 %, clearly indicating that the feature selection scheme, or the multinomial distribution modelling used, or both, are not suitable for the task at hand.

AutoHDS also performs poorly on this dataset, assigning almost all objects to the foreground clusters. Its results are hence virtually indistinguishable from those of the AllFG baseline.

Regarding the BBOCC, BBCPress, DGRADE and Ew-SRbc methods, their results lie well above the baselines. In fact, the best results for the task are achieved by BBOCC (equivalently, BBCPress/1), which gives an AUC value of 0.971 and an F1 of 87.7 %. These values clearly exceed the AUC of 0.902 and F1 of 74.1 % achievable by Ew-SRbc using the Best threshold. It is surprising that this method, the simplest one after the baselines, is also the one that obtains the best results, exceeding the proposed Ewocs method by such a margin.

However, there is a number of factors to take into account concerning the generality of this statement. First, by the construction of the Reuters dataset, the problem is well suited for methods looking for one (and only one) dense foreground cluster surrounded by a sparse background. One indication of this is that the second-best AUC and F1 values obtained by BBCPress are those with k=4, lying well below those for k=1, and also below those of Ew-SRbc. There is thus a strong sensitivity to the value of parameter k. In this sense, DGRADE continues to provide a better (and less supervised) starting model for BBC-style clustering, even if its results are still slightly below those of Ew-SRbc.

Moreover, methods BBOCC and BBCPress use the Size threshold selection criterion, and hence require the number of foreground objects to be known a priori. Figure 12 shows the precision, recall and F1 values achieved by BBOCC as a function of the provided number of foreground objects (expressed as a ratio of the total dataset size). The F1 value obtained by Ew-SRbc using nGauss+Var, a completely unsupervised approach, is included for reference: a poor estimate of the foreground cluster size causes the precision/recall balance to break down, and the F1 scores to drop from the optimal ones, which are located around the actual foreground ratio of 27.5 %.

Fig. 12 Effect of foreground size ratio on BBOCC performance (Size criterion on Reuters data)

There is thus an inherent brittleness in the fitting of parameters for BBOCC, BBCPress and, albeit to a lesser extent, DGRADE (despite being able to determine the number of foreground clusters, it does require the number of foreground objects as input) on Reuters, in the same way as we had found for Synth, and this can become an important drawback if the dataset characteristics change.

Concerning the different threshold selection criteria available for Ew-SRbc, it is encouraging to see how the Reuters dataset allows a much better identification of the threshold on the part of the unsupervised criteria, namely Dist, nGauss+Var and 2Gauss. The value of F1 achieved by nGauss+Var matches the upper bound of the Gaussian-based criteria, nGauss+Best, and falls only 3 points below the upper bound value obtained with Best. The gap between the results obtained with Dist and 2Gauss and those with the supervised thresholds Best and Size is also smaller in this case than it was for the 2-dimensional Synth datasets (Table 4). It is also worth noting how, for this dataset, nGauss+Var and Dist produce precision-biased clusterings, whereas 2Gauss gives more recall-favouring ones.

These results are encouraging, but we believe an analysis of the performance of Ewocs on Reuters as the ensemble size parameter R increases is required to obtain insights into its behaviour. Such an analysis is provided in the following paragraphs.

Ensemble size sensitivity

Figure 13a contains a plot of the performance of the Ew-SRbc method, in terms of AUC and F1, using a fixed value of k_max=50 and successively increasing ensemble sizes R, on the Reuters dataset.

Fig. 13 Effect of R on Ew-SRbc/R×50 performance (using Euc on Reuters data)

Increases in the ensemble size lead to an improvement of the performance of Ewocs, up until a point where the results stabilize. This matches the behaviour observed on Synth (Sect. 6.4.3). Nevertheless, given the greater complexity of the task, a larger number of individual clusterings are required for the results to converge—in fact, one several orders of magnitude larger.

This improvement is more pronounced for the unsupervised Dist and nGauss+Var criteria. Thus, we believe the observed performance boost comes more from an increase in the gap between the scores of foreground and background objects, which allows unsupervised criteria to detect the threshold more accurately, than from significant changes in the relative values of the scores. If this second phenomenon were the case, the improvements would affect unsupervised and supervised criteria equally.

Figure 13b contains a plot of the mean and standard deviation of F1 across 10 runs of Ew-SRbc, using two of the proposed criteria. One can observe how the standard deviation of the results is reduced considerably as the ensembles grow in size, almost disappearing by the time the number of clusterings reaches R=100000.

Overall, the results of this last series of experiments confirm those in Sect. 6.4.3: the parameter R does have a considerable influence on the results obtained by Ewocs-based approaches. In particular, a larger clustering ensemble increases the separation between the scores of background and foreground objects, thus improving the accuracy of the threshold detection stage. The evaluation on the larger datasets from Apw-Ace will provide more insights into the runtime trade-offs associated with the setting of the R parameter, and we hence defer further discussion to that point.

7.3.2 Hold-out clustering on the Apw-Ace dataset collection

Table 8 contains the full table of results for the compared approaches on each one of the datasets in the Apw-Ace collection. For reasons of space, the divergence used by each method has been omitted: it is Euc for BBOCC and BBCPress, and G(0.1,0.1) for Ew-SRbc. Figure 14 contains the Bergmann-Hommel tests for the F1 score using the Best criterion and the AUC metric, which summarize and assess the statistical significance of the results in the table.

Fig. 14 Bergmann-Hommel tests for the compared approaches (Apw-Ace data)

Table 8 Results for Apw-Ace data

As seen in both the table and the figure, minority clustering algorithms outperform the reference unsupervised relation detection approach Grams in terms of both AUC and F1 score. Only k MD obtains lower scores, as its behaviour degrades towards the AllFG baseline: it assigns almost all objects to the foreground. We believe the poor performance exhibited by k MD on both this and the Reuters collection, compared to the good results it obtained on Synth, where the Gaussian distributions in the data matched those in the model, casts doubt on the suitability of the method for datasets whose sources follow non-standard or unknown distributions.

Concerning the other three methods, the hypothesis test finds no significant differences between BBOCC, BBCPress and Ew-SRbc, even though the last provides the best AUC and F1 scores overall.

Regarding the detection of the threshold, Table 9 contains the results for each one of the datasets and criteria of the Ew-SRbc/50000×100 method. The results are similar to those on Synth and Reuters, with the extra supervision used by Size allowing it to stay within 2–3 % of the F1 score of Best, and the two unsupervised methods Dist and 2Gauss providing similar results, slightly below those of the supervised ones. Only the behaviour of nGauss+Var is significantly different, its figures being much lower than those of its counterparts. We will return to this issue shortly, and try to provide a likely explanation for it.

Table 9 Results for Apw-Ace data (Ew-SRbc/50000×100 using G(0.1,0.1))

With respect to the relation between the performance of the diverse threshold detection criteria and the ensemble size, the trend observed on Reuters appears again on Apw-Ace. For the Dist and 2Gauss criteria, increasing the ensemble size improves their scores and reduces the gap between them and the supervised Best and Size. However, for criterion nGauss+Var the pattern is again not so clear: whereas for a few pairs (Gpe-Per, Gpe-Veh, Org-Veh, Per-Veh) the detected threshold improves as more clusterings are added to the ensemble, in the rest of the datasets the performance metrics stagnate in the lower part of the scale. To illustrate this phenomenon, Fig. 15 contains two plots of the F1 score as a function of the ensemble size R for the Org-Veh and Fac-Loc datasets. The value achieved by BBCPress/10 using the Best threshold is also included for reference.

Fig. 15 Effect of R on Ew-SRbc/R×100 performance (using G(0.1,0.1) on Apw-Ace data)

This inconsistency may be due to the use of Eq. (17): if the variances of the diverse foreground sources, without being as large as the background one, differ significantly from one another, it is possible for the point of maximum inter-variance difference to fall among the foreground objects, thus providing a threshold with higher precision and lower recall. This could be the case in datasets generated from relation detection problems because, for a given entity type pair, some relations can be expressed using a reduced set of linguistic patterns (and thus give rise to particularly dense regions), whereas for others there can be a wider variety. The criterion may thus not be robust to foreground sources of heterogeneous density, and further exploration is required in order to improve it.

Convergence and runtime

As mentioned at the end of Sect. 7.3.1, augmenting the ensemble size R increases the separation between the scores of background and foreground objects, thus improving the accuracy of the threshold detection stage. We believe the larger datasets in Apw-Ace offer a good testbed to study the rate of convergence of these scores.

Unlike in other iterative algorithms, the fact that weak clustering algorithms are being used means that the convergence of the scores is not smooth, but alternates between larger and smaller steps. To study this process, we consider the average score change produced by the R-th clustering:

$$\Delta s_R = \frac{\sum_{x_i \in \mathcal{X}} \bigl(s^\star_{Ri} - s^\star_{(R-1)i}\bigr)^2}{|\mathcal{X}|} $$

Figure 16a contains a sample plot of the values of \(\Delta s_R\), aggregated in disjoint windows of 100 repeats, using Dist on the Fac-Loc dataset. It can be seen how the maximum, mean and minimum values show an overall descending trend, yet present continuous oscillations. In contrast, the medians exhibit a smooth decreasing behaviour, and thus seem useful as indicators of the convergence rate of the scores.
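As a concrete illustration of the measure, the sketch below (a minimal reconstruction under our own data-layout assumptions, not the paper's code) computes \(\Delta s_R\) from a matrix of accumulated per-object scores and then the medians over disjoint windows of 100 repeats:

```python
import numpy as np

def average_score_changes(scores_per_round):
    """Compute Delta s_R for each added clustering R = 2..num_rounds.

    scores_per_round: hypothetical array of shape (num_rounds, num_objects),
    where row R holds the per-object scores s*_{R,i} after R weak clusterings.
    """
    diffs = np.diff(scores_per_round, axis=0)        # s*_{R,i} - s*_{R-1,i}
    return (diffs ** 2).sum(axis=1) / scores_per_round.shape[1]

def windowed_medians(delta_s, window=100):
    """Median of Delta s_R over disjoint windows of `window` repeats."""
    n_windows = len(delta_s) // window
    trimmed = np.asarray(delta_s[: n_windows * window]).reshape(n_windows, window)
    return np.median(trimmed, axis=1)
```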

Fig. 16 Score convergence on Ew-SRbc/R×100 (using G(0.1,0.1) on Apw-Ace data)

To confirm this intuition, Fig. 16b shows the values of F1 achieved with criterion Dist after R weak clusterings (Dist being one of the methods which benefit the most from an increased ensemble size), relative to the ones obtained with the full R=50000, as a function of the median of the score changes \(\Delta s_R\) over windows of 100 clusterings. The plot shows how, for almost all collections, the values of F1 have stabilized by the point the median \(\Delta s_R\) falls below \(10^{-10}\). The only exception is Fac-Loc, which already starts with \(\mathrm{median}(\Delta s_1 \ldots \Delta s_{100}) \approx 10^{-9}\), and does not converge until the value reaches \(10^{-12}\), with an ensemble of R=2000 clusterings. This behaviour suggests replacing the parameter R with a threshold on \(\mathrm{median}(\Delta s_R)\), which in addition lends itself to a natural parallelization of the algorithm: we can obtain a batch of weak clusterings of the dataset in parallel, and then use the median of the average score changes produced by them to determine whether the scores have converged. In this direction, Fig. 16c plots the ensemble size required to reach this \(\mathrm{median}(\Delta s_R) < 10^{-10}\) level, as a function of the dataset size. The ensemble size required for convergence seems to grow proportionally to \({|\mathcal{X}|}^{0.4}\).
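A minimal sketch of the suggested stopping rule follows. It assumes a hypothetical `weak_clustering_scores` callable returning one score increment per object, and that the per-object score \(s^\star_R\) is the increment averaged over the clusterings seen so far; both are assumptions of ours, not details taken from the paper.

```python
import numpy as np

def run_until_converged(data, weak_clustering_scores, batch_size=100,
                        tol=1e-10, max_rounds=50_000):
    """Accumulate weak-clustering scores until the median of the average
    squared score change over the last batch falls below `tol`.

    Assumptions of this sketch (not taken from the paper):
      * `weak_clustering_scores(data)` returns one score increment per
        object for a single weak clustering;
      * the per-object score s*_R is the average increment over the R
        clusterings seen so far, so successive changes can vanish.
    """
    cumulative = np.zeros(len(data))
    prev_norm = np.zeros(len(data))           # s*_{R-1}
    rounds = 0
    while rounds < max_rounds:
        deltas = []
        # Each weak clustering is independent of the others, so a batch
        # could be produced by parallel workers and merged here.
        for _ in range(batch_size):
            cumulative += weak_clustering_scores(data)
            rounds += 1
            norm = cumulative / rounds         # s*_R
            deltas.append(((norm - prev_norm) ** 2).mean())   # Delta s_R
            prev_norm = norm
        if np.median(deltas) < tol:            # convergence reached
            break
    return prev_norm, rounds
```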

To inspect how the use of a convergence criterion affects the runtime of the algorithm, Fig. 17a shows the runtime per clustering (broken down into training and testing) of Ew-SRbc for each of the collections in Apw-Ace. As we can see, the runtime cost can be fitted as proportional to \({|\mathcal{X}|}^{1.1}\), only slightly above the theoretical linear complexity \(O(|\mathcal{X}|)\) we had considered in Sect. 3.6. The fact that the larger datasets we have used also have a higher number of features is likely to be the cause of this quasi-linear behaviour.

Fig. 17 Runtime for Ew-SRbc/R×100 (using G(0.1,0.1) on Apw-Ace data)

Finally, Fig. 17b plots the total runtime of Ew-SRbc up to the point where the median of \(\Delta s_R\) falls below \(10^{-10}\), against the dataset size. The points lie quite close to a \(t \propto{|\mathcal{X}|}^{1.5}\) curve, as could be expected from the previous two fits: the cost per clustering grows as \({|\mathcal{X}|}^{1.1}\) and the required ensemble size as \({|\mathcal{X}|}^{0.4}\), so their product grows roughly as \({|\mathcal{X}|}^{1.5}\). We believe this is another positive result: other approaches in the state of the art (e.g., DGRADE, AutoHDS) have computational complexities of \(O(|\mathcal{X}|^{3})\), which render them unusable for large-scale datasets. Moreover, the fact that Ewocs is easily parallelizable also makes it an attractive option in terms of runtime.
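The exponents quoted in this subsection can be recovered with an ordinary least-squares fit in log–log space; the sketch below shows this procedure on hypothetical (dataset size, runtime) measurements and is not the actual fitting code used for the figures.

```python
import numpy as np

def power_law_exponent(sizes, runtimes):
    """Fit runtime ~ c * size**alpha by least squares in log-log space and
    return (alpha, c). `sizes` and `runtimes` are hypothetical per-collection
    measurements, e.g. dataset sizes |X| and total runtimes."""
    log_x = np.log(np.asarray(sizes, dtype=float))
    log_y = np.log(np.asarray(runtimes, dtype=float))
    alpha, log_c = np.polyfit(log_x, log_y, 1)   # slope is the exponent
    return alpha, np.exp(log_c)
```

For instance, calling `power_law_exponent` on the per-collection totals would be expected to return an exponent close to 1.5 if the points indeed follow the reported curve.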

8 Conclusions

In this article, we have considered the problem of minority clustering, contrasting it with regular all-in clustering. We have identified a key limitation of existing minority clustering algorithms: the approaches proposed so far are supervised, in the sense that they require the number and distribution of the foreground clusters, as well as the background distribution, as input.

The fact that, in supervised learning and all-in clustering tasks, combination methods have been successfully applied to obtain distribution-free learners, even from the output of weak individual algorithms, has led us to make a three-fold proposal.

First, we have presented a novel ensemble minority clustering algorithm, Ewocs, suitable for weak clustering combination. The properties of Ewocs have been theoretically proved under a set of weak constraints. Second, we have presented two weak clustering algorithms: one, Rbc, based on Bregman divergences; and another, RSplit, an extension of a previously presented random splitting one. Third, we have proposed an unsupervised procedure to determine the scaling parameters for a Gaussian kernel, used within a minority clustering algorithm.

We have implemented a number of approaches built from the proposed components and evaluated them on a collection of synthetic datasets, comparing them to other minority clustering methods in the state of the art. The results of the evaluation show that approaches based on Ewocs, and especially the one built using SRbc as weak clustering algorithm and G(Auto) as object divergence function, are competitive with, and even outperform, other state-of-the-art minority clustering approaches in terms of the F1 and AUC measures of the obtained clusterings.

The completely unsupervised minority clustering approach, built from Ewocs, SRbc, G(Auto) and an unsupervised threshold detection criterion (one of Dist, nGauss+Var or 2Gauss), already outperforms other, supervised minority clustering approaches. With only the minor supervision introduced by replacing the threshold detection criterion with nGauss+Best, the resulting approach outperforms all other considered systems, including the much more supervised kMD.

The results on synthetic data have been corroborated with an evaluation on real-world data. A first dataset—more specifically, a subset of the classical text classification Reuters corpus—has allowed us to study the influence of the clustering ensemble size on the results achieved by Ewocs. Specifically, we have found larger ensembles to boost the accuracy of the unsupervised threshold detection criteria. The completely unsupervised minority clustering approach built from Ewocs, SRbc, Euc and 2Gauss obtains a performance within hundredths of the upper bound of Ewocs.

Additionally, the approach has been applied to a collection of even larger datasets coming from unsupervised relation detection problems. In that task, the use of Ewocs after reducing the problem to minority clustering allows pairs of related entities to be detected more accurately than with an approach specifically tailored to relation detection. Moreover, the fact that Ewocs builds a clustering model allows the detection of related entities in new documents not available at clustering time.

In the light of these results, we believe that the Ewocs algorithm is an effective method for ensemble minority clustering, and that it allows the construction of competitive, unsupervised approaches to the task. It is appealing because of its simplicity, flexibility and theoretical well-foundedness, and can hence be considered for clustering in a diversity of domains, where unsupervised minority clustering tasks may be the rule rather than the exception.