1 Introduction

We review the external clustering performance metric BCubed [3], indicate a flaw and propose a repair. We then evaluate the repair both theoretically and experimentally.

In essence, clustering and (single label) classification perform the same task: given a set of items E, they partition E. However, when it comes to evaluation with comparison to a gold standard, things are very different.

With classification, the number of blocks in the partition is known (the set of labels), and a mapping exists between the true blocks and the predicted blocks (namely the identity mapping on the labels). So, counting errors is straightforward by making the cross table of predicted and gold truth values (the confusion table), and computing precision and recall as the diagonal divided by the two margins, respectively.

With clustering, there is (at prediction time) no known number of blocks (as the label set is unknown), and there is no mapping between the predicted blocks and the true labels. This makes counting errors much less straightforward, witnessed by the numerous proposals on how to do this, nicely surveyed and classified by Amigó, Gonzalo, Artiles, and Verdejo [1].

The BCubed measure, proposed by Bagga and Baldwin [3], sidesteps the problem of matching true and hypothesized clusters. It does not measure errors over the clusters, but computes a precision and recall value for each element and then takes the average. For instance, the recall for element e is the fraction of the true cluster of e that is contained in the predicted cluster of e. As each element e is contained in both its true and its predicted cluster, both recall and precision of e are always larger than 0, even when the predicted and true clusterings are disjoint except for the element e. This can be repaired by leaving out e itself in the calculation of precision and recall of e. In this paper, we investigate this alternative definition of BCubed (Sect. 2), evaluate the new metric both theoretically and empirically (Sect. 3), and conclude that it retains all positive properties of BCubed, yields the minimum score of zero when it should, and can rank predicted clusterings differently than BCubed does.

2 BCubed revisited

Let E be a set and \(N^{T}\) and \(N^{H}\) two clusterings (partitions) of E, corresponding to the true and hypothesized clustering, respectively. We use \(N_{e}^{T}\) to denote the block in \(N^{T}\) containing e, and similarly for \(N^H_e\) and \(N^{H}\). Figure 1 shows how precision and recall relative to an element e are defined given the true and hypothesized clusterings \(N^{T}\) and \(N^{H}\).

Fig. 1

Comparing the elements in the true cluster \(N_{e}^{T}\) of e to those in the predicted cluster \(N_{e}^{H}\) of e. \(TP_e\), \(FP_e\), and \(FN_e\) represent the sets of True Positives, False Positives, and False Negatives for e, respectively. P(e) and R(e) are Precision and Recall relative to e

The BCubed measure for a given clustering is then the average harmonic mean (the F1-value) of the precision and recall for each element. This F1 value is what is denoted by “BCubed” or “BCubed score” in the literature, a convention also followed in this paper. This harmonic mean is usually defined as \(2PR/(P+R)\), but the equivalent direct definition is insightful here as well. Let \(A\oplus B\) denote the symmetric difference of the sets A and B. Then

$$\begin{aligned} F1(e) = \frac{\vert N_{e}^{H}\; \cap \; N_{e}^{T} \vert }{\vert N_{e}^{H}\; \cap \; N_{e}^{T}\vert +.5\cdot \vert N_{e}^{H} \oplus N_{e}^{T}\vert }. \end{aligned}$$
(1)

Fig. 1 shows that \(TP_{e}\ne \emptyset \), as e is always in \(TP_{e}\), and thus that precision, recall and F1 are always positive for each element, implying that the BCubed score of a clustering never equals 0.
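As an illustration, the per-element BCubed computation can be sketched in a few lines of Python (a minimal sketch; the function names and the list-of-sets representation are ours, not from [3]):

```python
# Sketch of per-element BCubed (function names are ours, not from [3]).
# A clustering is a list of disjoint sets covering the element universe.

def block_of(e, clustering):
    """Return the block of the partition that contains e."""
    return next(c for c in clustering if e in c)

def bcubed_f1(e, truth, hypothesis):
    """Per-element BCubed F1, as in equation (1):
    |H ∩ T| / (|H ∩ T| + 0.5 * |H ⊕ T|)."""
    t = block_of(e, truth)
    h = block_of(e, hypothesis)
    tp = len(t & h)                      # never 0: e itself is in both blocks
    return tp / (tp + 0.5 * len(t ^ h))  # t ^ h is the symmetric difference

def bcubed(truth, hypothesis, elements):
    """BCubed score: mean per-element F1."""
    return sum(bcubed_f1(e, truth, hypothesis) for e in elements) / len(elements)
```

Because `tp` is at least 1 for every element, `bcubed_f1` can never return 0, which is exactly the flaw discussed above.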

Having a meaningful zero point is a requisite for a metric measured on a ratio scale. We can say that a score of 0 is meaningful if none of the predictions was correct, that is, when all items in the contingency table are off the diagonal. Let us formulate this as a desideratum for a clustering metric:

(ZeroScore):

For every true clustering, there is a predicted clustering with score 0.

BCubed fails the ZeroScore constraint and can even give a quite high score of \(2/3\) to an absolutely wrong prediction. Consider this simple example: \(E=\{1,2\}\) and the true clustering \(N^{T}\) is \(\{E\}\). Now let the predicted clustering \(N^{H}\) be (the only other possibility) \(\{\{1\}, \{2\} \}\). It is obviously wrong, but for both elements e, \(P(e)=1\), as no mistakes are made, and \(R(e)=.5\), as half of the true elements of the block of e are in its predicted block, so \(F1(e)=2/3\). Taking the mean F1 over all elements, we get a BCubed score of \(2/3\approx .67\) for this predicted partition.

In fact, because \(TP_e\) can never be empty, BCubed fails the ZeroScore constraint in a much stronger sense: for no true clustering is there a prediction with score 0. BCubed can never equal 0.

In order to repair BCubed so that it satisfies the ZeroScore constraint, we only need to remove e from both \(N_{e}^{H}\) and \(N_{e}^{T}\). Thus \(N_{e}^{T}\) now denotes the set of all elements in the same true cluster as e except e itself, and similarly for \(N_{e}^{H}\). We call these the neighbors of e. Then \(TP_{e}=N_{e}^{H}\cap N_{e}^{T}\) can be empty, and thus all measures can be equal to 0. The price paid for this is that we may divide by 0 in the definitions of P, R and F1 and must account for that. So all definitions remain the same, but we add the following provisos:

  • If \(N_{e}^{H}=\emptyset \), then \(P(e)=1\).

  • If \(N_{e}^{T}=\emptyset \), then \(R(e)=1\).

  • If \(N_{e}^{T}=N_{e}^{H}=\emptyset \), then \(F1(e)=1\).

With these rules the new definitions yield the same scores as the original BCubed definitions on the singleton cases. In the first case, the hypothesized cluster containing e is \(\{e\}\), thus no mistakes for e can be made. In the second, recall for e is indeed perfect, and a perfect F1 score for a true singleton is of course only obtained if we exactly predict that.

It is easy to see that with these provisos the definition of \(F1(e)\) in (1) is still equivalent to the often used \(2PR/(P+R)\) formulation.
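A minimal sketch of the repaired measure with the three provisos (our own function names; clusterings are again lists of disjoint sets):

```python
# Sketch of the repaired BCubed: scores computed over neighbors (the block
# minus the element itself), with the three provisos. Names are ours.

def neighbors(e, clustering):
    """All elements sharing e's block, excluding e itself."""
    return next(c for c in clustering if e in c) - {e}

def elm_scores(e, truth, hypothesis):
    """Per-element precision, recall and F1 with the provisos."""
    t = neighbors(e, truth)
    h = neighbors(e, hypothesis)
    tp = len(t & h)
    p = 1.0 if not h else tp / len(h)   # proviso: N_e^H empty => P(e) = 1
    r = 1.0 if not t else tp / len(t)   # proviso: N_e^T empty => R(e) = 1
    if not t and not h:
        f1 = 1.0                        # proviso: both empty => F1(e) = 1
    else:
        f1 = 0.0 if p + r == 0 else 2 * p * r / (p + r)
    return p, r, f1

def elm(truth, hypothesis, elements):
    """Repaired BCubed score: mean per-element F1."""
    return sum(elm_scores(e, truth, hypothesis)[2] for e in elements) / len(elements)
```

On the two-element example above (\(N^{T}=\{\{1,2\}\}\), \(N^{H}=\{\{1\},\{2\}\}\)) this yields a score of 0, as desired.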

2.1 A new name

In the rest of the paper, we further evaluate this repair. But let us first give it a name. The BCubed measure was introduced by Bagga and Baldwin, who in a footnote attribute the idea to Bierman; hence the three (cubed) Bs. We opted for ELM, an abbreviation of Elements Like Me, which is a good mnemonic for the way the repaired BCubed measure is computed.

2.2 First impression of the differences

The following example gives a good impression of the difference between the two measures. Let E be the set of the first 15 positive integers, with the following true clustering

$$\begin{aligned} \{ \{1,2\}, \{3,4,5\}, \{6,7\}, \{8\}, \{9\}, \{10,11,12\}, \{13,14\}, \{15\} \}. \end{aligned}$$
(2)

We have generated all possible predictions under the proviso that each cluster must consist of consecutive elements. For a set of N consecutive elements, there are \(2^{N-1}\) such partitions; for 15 elements, this results in 16,384 possible predicted partitions. Figure 2 shows the distribution of the BCubed and ELM scores for all these predictions. Both scores are approximately normally distributed, and 25% of the ELM scores lie below the lowest BCubed score. The ELM scores are spread more evenly over the range of possible scores. Not only does BCubed start higher, its variance of .006 is much lower than the .02 for ELM. ELM and BCubed can also rank clustering systems differently: in this example, 18% of all (\(2^{14}\times 2^{13}\)) pairs of predictions are ranked differently by ELM and BCubed.
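The enumeration of consecutive-element predictions is a simple binary choice per gap: each of the \(N-1\) gaps between adjacent elements either is a cut point or is not, giving \(2^{N-1}\) partitions. A sketch (our own naming):

```python
from itertools import product

def consecutive_partitions(n):
    """Yield all 2**(n-1) partitions of {1, ..., n} into blocks of
    consecutive elements: each gap between adjacent elements is a cut or not."""
    for cuts in product((False, True), repeat=n - 1):
        blocks, current = [], {1}
        for element, cut in enumerate(cuts, start=2):
            if cut:                      # close the current block at this gap
                blocks.append(current)
                current = set()
            current.add(element)
        blocks.append(current)
        yield blocks
```

For \(n=15\) this yields exactly the 16,384 predictions used in the experiment.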

The Kendall-Tau \(\beta \) statistic (which accounts for tied ranks) between the ELM- and the BCubed-based ranking in this example is 0.63, also indicating that there are substantial differences between the rankings produced by the two metrics.

Fig. 2

Population density diagram of all BCubed and ELM scores for all 16,384 predictions of the model in (2) plus the main statistics

3 Evaluation

We evaluate the new ELM metric both theoretically and empirically in a number of ways:

  1. Unlike BCubed, ELM satisfies the ZeroScore constraint.

  2. ELM behaves well on extreme non-informed systems (referred to as “degenerate” by Beeferman, Berger, and Lafferty [5]).

  3. The ranking of clustering systems based on ELM can differ from the ranking based on BCubed.

  4. There are systematic differences between ELM and BCubed in synthetic experiments.

  5. There are systematic differences on a real dataset, with a state-of-the-art clustering system based on BERT.

  6. ELM satisfies the same four clustering metric constraints developed by Amigó et al. [1] that BCubed satisfies.

We distinguish the BCubed and ELM versions of P, R and F1 using subscripts: \(P_B\), \(P_{ELM}\), \(R_B\), \(R_{ELM}\), \(F1_B\) and \(F1_{ELM}\). In the proofs, the precision, recall and F1 scores over a set of elements E are defined as the average of these scores over all elements in E.

3.1 ZeroScore constraint

Theorem 1

For every set E with at least 2 elements and a gold standard partition \(N^{T}\) over E, there is a clustering \(N^{H}\) over E such that for every e in E, \(F1_{ELM}(e) = 0\).

Proof

Let E and \(N^{T}\) be as stated in the theorem, and let \(E_s\subseteq E\) be the set of elements that the gold standard clusters into singletons. Recall that \(N_{e}^{H}\) and \(N_{e}^{T}\) now denote all elements in the same cluster as e except e itself; in particular, for a true singleton cluster \(\{e\}\) we have \(N_{e}^{T}=\emptyset \). There are three cases: \(E_s\) is empty, \(E_s\) is a singleton, or \(E_s\) contains more than one element. If \(E_s=\emptyset \), simply let the predicted clustering partition E into singletons. Then for each \(e\in E\), \(N_{e}^{H}=\emptyset \) and \(\vert N_{e}^{T}\vert \ge 1\) (as e is not a true singleton). Thus \(TP_{e}=\emptyset \) and \(F1(e)=0\), because the special clause for F1 does not apply. If \(E_s\) is a singleton, say \(\{s\}\), then because E has at least two elements it contains another element t, different from s. Let \(N^{H}\) consist of the cluster \(\{s,t\}\) and, for each \(e \in E\setminus \{s,t\}\), a singleton cluster \(\{e\}\). For s we have \(N_{s}^{T}=\emptyset \) while \(N_{s}^{H}=\{t\}\ne \emptyset \); for t we have \(N_{t}^{H}=\{s\}\) while \(s\not \in N_{t}^{T}\); for all other e the reasoning is as in the first case. Hence for each \(e\in E\), \(TP_{e}=\emptyset \) and \(F1(e)=0\), as the special clause never applies. If \(E_s\) contains more than one element, let \(N^{H}\) consist of the cluster \(E_s\) and, for each \(e \in E{\setminus } E_s\), a singleton cluster \(\{e\}\). For \(e\in E_s\), \(N_{e}^{T}=\emptyset \) and \(N_{e}^{H}=E_s\setminus \{e\}\ne \emptyset \); thus \(TP_{e}=\emptyset \) and \(F1(e)=0\), because the special clause for F1 does not apply. For \(e\not \in E_s\), the reasoning is as in the first case. \(\square \)
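The case-by-case construction in the proof can be exercised programmatically. A sketch, with our own function names and a compact inline per-element F1:

```python
# Per-element ELM F1 (with the provisos) and the zero-score construction
# from the proof of Theorem 1. Function names are ours.

def elm_f1(e, truth, hyp):
    t = next(c for c in truth if e in c) - {e}
    h = next(c for c in hyp if e in c) - {e}
    tp = len(t & h)
    p = 1.0 if not h else tp / len(h)
    r = 1.0 if not t else tp / len(t)
    if not t and not h:
        return 1.0
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def zero_score_prediction(truth):
    """Build a prediction with ELM F1 = 0 for every element (Theorem 1)."""
    singles = {e for c in truth if len(c) == 1 for e in c}
    if not singles:
        # No true singletons: predict all singletons.
        return [{e} for c in truth for e in c]
    if len(singles) == 1:
        # Exactly one true singleton s: pair it with some other element t.
        s = next(iter(singles))
        t = next(e for c in truth for e in c if e != s)
        return [{s, t}] + [{e} for c in truth for e in c if e not in (s, t)]
    # More than one true singleton: merge all singletons into one cluster.
    return [singles] + [{e} for c in truth for e in c if e not in singles]
```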

3.2 ELM behaves well on degenerate clusterings

Theorem 2

Let \(N^{T}\) be a true clustering over a set E and \(N^{H}\) the clustering consisting only of singleton clusters. Then \(P(e)=1\) for all \(e\in E\), while \(R(e)=F1(e)=1\) if \(N_{e}^{T}=\emptyset \) and 0 otherwise.

An immediate corollary is that the ELM F1 for the degenerate singleton clustering is equal to the proportion of singletons in the gold standard partition.

Proof

Assume E, \(N^{T}\) and \(N^{H}\) are as in the theorem. In particular, \(N_{e}^{H}=\emptyset \) for all \(e\in E\). Then by the special clause in the definition, \(P(e)=1\) for all e, and \(R(e)=F1(e)=1\) if \(N_{e}^{T}=\emptyset \). When \(N_{e}^{T}\ne \emptyset \), still \(N_{e}^{T}\cap N_{e}^{H}=\emptyset \), and thus both \(R(e)\) and \(F1(e)\) are 0. \(\square \)

Now consider the other degenerate clustering: all elements contained in one cluster. Let \(N^{H}\) be this degenerate all-in-one predicted clustering, with \(N^{T}\) the true clustering over a set E. Then obviously \(R(e)=1\) for all \(e\in E\). Because \(N_{e}^{H}=E\setminus \{e\}\) and thus \(N_{e}^{T}\cap N_{e}^{H}=N_{e}^{T}\), the precision \(P(e)\) equals \(\frac{\vert N_{e}^{T}\vert }{\vert E\vert -1}\). And thus the mean precision equals

$$\begin{aligned} P= \frac{\sum _{e \in E} \vert N_{e}^{T}\vert }{\vert E\vert \cdot (\vert E\vert -1 )} = \frac{\sum _{c\in N^{T}}\vert c\vert \cdot (\vert c\vert -1)}{\vert E\vert \cdot ( \vert E\vert -1)} = \frac{\sum _{c\in N^{T}}\vert c\vert ^2 - \vert c\vert }{\vert E\vert ^2 - \vert E\vert }, \end{aligned}$$

where the \(c\in N^{T}\) denote the true clusters. Note that the BCubed mean precision for this degenerate clustering is equal to \(\frac{\sum _{c\in N^{T}}\vert c\vert ^2}{\vert E\vert ^2 }\). Also note that when we view the clustering as a directed network partitioned into cliques, the ELM precision equals the density of this network, which ranges from 0 when each clique is a singleton to 1 only if the network is complete and thus consists of one giant cluster.
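The closed-form mean ELM precision for the all-in-one prediction can be checked against the per-element definition (a small sketch; helper names are ours):

```python
# Check of the closed-form mean ELM precision for the all-in-one prediction.
# A clustering is a list of disjoint sets; helper names are ours.

def elm_precision_all_in_one(truth):
    """Closed form: sum_c |c|(|c|-1) / (|E|(|E|-1)) -- the density of the
    directed clique network induced by the true partition."""
    n = sum(len(c) for c in truth)
    return sum(len(c) * (len(c) - 1) for c in truth) / (n * (n - 1))

def elm_precision_all_in_one_brute(truth):
    """Per-element version: P(e) = |N_e^T| / (|E| - 1), averaged over E."""
    n = sum(len(c) for c in truth)
    return sum((len(c) - 1) / (n - 1) for c in truth for _ in c) / n
```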

We can conclude that for both degenerate clusterings, ELM gives the lowest reasonable score.

3.3 ELM can produce different rankings compared to BCubed

We give an example of a true clustering and two predicted clusterings (which can be seen as two competing systems) that are ranked differently by ELM than by BCubed. The clusterings are over the set \(E=\{1,2,3,4,5\}\) and are given in the first 3 rows of Table 1, with, for example, \(H_1\) depicting the clustering \(\{1\}, \{2\}, \{3, 4, 5\}\). The other rows compute \(F1(e)\) for each element, for each clustering, using both ELM and BCubed. System \(H_2\) is better according to ELM, while \(H_1\) is better according to BCubed.

Both \(H_1\) and \(H_2\) contain one error, but the error in \(H_2\) is in the larger cluster. Clustering intuition says that errors in smaller clusters should be penalized more than errors in larger ones, and that is what ELM does here, and BCubed does not.

Table 1 F1 scores per element and the mean, for the given true and two system clusterings over the set \(\{1,2,3,4,5\}\), according to both Bcubed and ELM
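Such a ranking swap can be reproduced with a minimal inline implementation of both measures. In the sketch below, \(H_1\) matches the clustering named in the text, but the true clustering T and \(H_2\) are hypothetical examples of ours, chosen to exhibit the same pattern (an error in a small true cluster in \(H_1\), an error in a larger one in \(H_2\)); the exact clusterings of Table 1 are in the table itself:

```python
# Minimal BCubed/ELM with a ranking swap. T and H2 are our own hypothetical
# examples; H1 is the clustering named in the text.

def mean_f1(truth, hyp, elements, exclude_self):
    """Mean per-element F1; exclude_self=False gives BCubed, True gives ELM."""
    total = 0.0
    for e in elements:
        t = next(c for c in truth if e in c)
        h = next(c for c in hyp if e in c)
        if exclude_self:                 # ELM: drop e from both blocks
            t, h = t - {e}, h - {e}
        tp = len(t & h)
        p = 1.0 if not h else tp / len(h)
        r = 1.0 if not t else tp / len(t)
        if not t and not h:
            total += 1.0
        else:
            total += 0.0 if p + r == 0 else 2 * p * r / (p + r)
    return total / len(elements)

T = [{1, 2}, {3, 4, 5}]       # hypothetical ground truth
H1 = [{1}, {2}, {3, 4, 5}]    # error: splits the small true cluster
H2 = [{1, 2}, {3, 4}, {5}]    # error: splits the larger true cluster
E = [1, 2, 3, 4, 5]
# BCubed ranks H1 above H2, while ELM ranks H2 above H1.
```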

3.4 ELM vs BCubed on synthetic data

We expand on the small synthetic experiment of Sect. 2.2 by computing the BCubed and ELM scores for all consecutive-element clusterings of a 14-element set against all other such clusterings. As there is a total of \(2^{14-1}\) possible clusterings, we thus have 8,192 experiments, each producing two rankings of the predicted clusterings, one for BCubed and one for ELM. The distribution of these scores is shown in Fig. 3. To further investigate the differences between BCubed and ELM when used to rank systems, we calculate the Kendall-Tau statistic between all rankings and also count the number of system pairs whose order is swapped between BCubed and ELM (which is part of the Kendall-Tau calculation). The number of pairs whose ranking order was swapped between BCubed and ELM was roughly 39 billion out of 274 billion cases (14%). The Kendall-Tau over all pairs of rankings is normally distributed with a mean of 0.70 and a standard deviation of 0.06, similar to the example in Sect. 2.2.

Fig. 3

Population density diagram of BCubed and ELM scores between all possible pairs of ground truth and predicted clusterings of size 14 plus the main statistics (N = \(2^{13} \cdot 2^{12}\))

Fig. 4

The fraction of the number of pairs with reversed orderings between BCubed and ELM for all rankings for all clusterings with size 14 (N = 8192) plus the main statistics

Figure 4 shows the distribution of the fraction of swaps for each of the 8,192 experiments. Each datapoint is thus the fraction of possible system pairs whose order between the BCubed and the ELM ranking is swapped for that particular experiment. The y-axis represents all 8,192 rankings and indicates what percentage of all rankings has a certain fraction of swaps. To investigate which type of ground truth clusterings results in the largest number of swaps, we compute the Pearson correlation between the entropy of the ground truth clustering and the number of pairs swapped in the ranking. The entropy for a given ground truth clustering \(N^{T}\) is given by

$$\begin{aligned} entropy(N^{T}) = -\sum \limits _{C \in N^{T}}p(C)\cdot \log _2(p(C)), \quad \text{ where } p(C) = \frac{\vert C \vert }{\vert E\vert }. \end{aligned}$$

The Pearson correlation between the entropy of a ground truth clustering and the number of swaps for that ground truth was 0.81 (\(N=8192\)). As clusterings with more small clusters have higher entropy, this means that the number of swaps tends to be higher when the ground truth has a larger number of small clusters. This is as expected, as the difference between ELM and BCubed is larger on smaller clusters.
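The entropy above is standard Shannon entropy over the cluster-size distribution; as a sketch (the function name is ours):

```python
from math import log2

def partition_entropy(clustering):
    """Shannon entropy of the cluster-size distribution of a partition,
    with p(C) = |C| / |E| as in the formula above."""
    n = sum(len(c) for c in clustering)
    return -sum((len(c) / n) * log2(len(c) / n) for c in clustering)
```

All-singleton partitions maximize this entropy, one giant cluster minimizes it.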

3.5 ELM vs BCubed on real data

We compare ELM and BCubed for three fixed cluster-size baselines and a clustering algorithm using BERT [7] on a large dataset consisting of 110 samples (separate clustering problems) with, in total, 24,180 true clusters over 89,491 elements. The mean and median cluster sizes are 4 and 2, respectively. Each sample is a sequence of pages of text divided into documents: each cluster is a document, i.e., a contiguous sequence of pages, and the elements are the pages. This scenario is common in the field of Page Stream Segmentation [20]. On average, roughly \(35\%\) of the clusters in a stream are singleton clusters.

Following Bagga and Baldwin [3] and Amigó et al. [1], we report the mean average F1 scores. Thus for every sample E in our testset, we take the average over the \(F1(e)\) for each \(e\in E\), and then we take the mean over all samples in the testset.

The dataset, together with the code for all the experiments conducted in this paper, is available on GitHub (Footnote 1). For the experiments with the BERT model, the dataset was split into a 70% train and 30% test set.

As the fixed page size baselines are not learned, we can use the entire dataset (train and test) for these experiments. The BERT model is evaluated on the test part of the dataset, consisting of 34 samples with 6347 clusters over 25,676 elements.

We first compare the precision, recall and their harmonic mean for BCubed and ELM on three fixed baselines: the two degenerate clusterings with only singleton clusters and one giant cluster, and a more sensible baseline that evenly partitions a sample into clusters of the mean true cluster size of that sample. The results are shown in Fig. 5. Note how the plots for precision for the all-singleton prediction and recall for the one-giant-cluster prediction show constant values of 1 for both ELM and BCubed.

Fig. 5

Distributions of mean average Precision, Recall and F1 for BCubed and ELM for the three fixed baselines (only singletons, one giant cluster, and each cluster having the sample's mean true cluster length) (N = 110)

The plots indicate that the smaller the cluster sizes in the predicted clustering, the larger the difference in both the mean and standard deviation of BCubed and ELM, for all 3 measures, again as expected.

Table 2 Mean BCubed and ELM precision, recall and F1 scores for the BERT based clustering model evaluated on the Page Stream Segmentation dataset (N = 34)

We will now cluster this dataset using the BERT model for Page Stream Segmentation from Guha, Alahmadi, Samanta, Khan, and Alahmadi [9]. In short, this model creates a textual representation of each page using a BERT model, and then uses this representation to classify each page as either starting a new document or not. This classification is equivalent to a clustering. We follow their experimental setup, replacing the English bert-base model with the Dutch version (Footnote 2), as the dataset is in Dutch. We train the model for 10 epochs, using a batch size of 512 and a learning rate of \(2\times 10^{-5}\).
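The equivalence between the page-level classification and a clustering can be made explicit as follows (a sketch; the 0-based page indexing and boolean label convention are our assumptions, not from [9]):

```python
def labels_to_clustering(starts):
    """Turn per-page 'starts a new document' booleans into a partition of
    page indices; each document is a maximal run of consecutive pages."""
    clusters, current = [], set()
    for page, is_start in enumerate(starts):
        if is_start and current:     # a new document begins: close the run
            clusters.append(current)
            current = set()
        current.add(page)
    if current:
        clusters.append(current)
    return clusters
```

The resulting partition can then be scored directly with BCubed or ELM against the true document boundaries.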

Table 2 shows hardly any difference in precision and recall, but still a difference of three percentage points in F1 score. The KDE plots of the differences in Fig. 6 show the same trend.

This result shows that the differences between ELM and BCubed appear not only on synthetic data and simple baseline models, but also with a well performing state-of-the-art model tested on large real data.

Fig. 6

KDE Plots of the relative differences between BCubed and ELM for Precision, Recall and F1 for the BERT clustering model (N = 34)

3.6 ELM satisfies the constraints of Amigó et al

We show that the four constraints developed by Amigó et al. [1] hold for the ELM F1 metric. The family of BCubed-like cluster evaluation metrics is the only one satisfying all these four constraints. For a thorough explanation and motivation of these constraints we refer to the original paper. We follow the same line of reasoning as Amigó et al. [1] and also use their informative pictures.

3.6.1 Homogeneity

The homogeneity constraint states that a cluster assignment \(D_1\) that splits samples into homogeneous subgroups should be scored higher than an assignment \(D_2\) that mixes samples of different subgroups together, like in Fig. 7.

The ELM recall for each element is the same in \(D_1\) and \(D_2\), but the precision is lower for the elements in the mixed cluster in \(D_2\) than in the homogeneous clusters in \(D_1\). Hence, the mean ELM F1 score of \(D_1\) is higher.

Fig. 7

Homogeneity constraint: black nodes belong to one cluster and white nodes to another. Shown are two partitions: the homogeneous \(D_1: \{ C_{1}, C_{2}, C_{3} \}\) and the mixed \(D_2: \{ C_1, C_2 \cup C_3 \}\). Figure 7 is a modification of Figure 5 from [1]

3.6.2 Completeness

The cluster completeness constraint states that a cluster assignment \(D_1\) that groups items belonging to the same cluster together should receive a higher score than a clustering \(D_2\) that subdivides items from a homogeneous cluster, like in Fig. 8.

The argument is the dual of the previous one. Here, precision is maximal for all elements in both partitions, as all clusters are homogeneous. But ELM recall is lowered for the elements in the separate \(C_2\) and \(C_3\); in fact, ELM recall is 0 for elements predicted as singletons. Thus the mean ELM F1 is higher for the partition \(D_1\) with the joined clusters.

Fig. 8

Completeness constraint: all nodes belong to the same cluster. Shown are two partitions: \(D_1 = \{ C_1, C_2 \cup C_3\}\) and \(D_2 = \{ C_1, C_2, C_3 \}\). Figure 8 is a modification of Figure 6 from [1]

3.6.3 Rag Bag

The Rag Bag constraint states that adding a singleton cluster to a cluster consisting of all differently labeled elements (a rag bag) should score higher than adding this singleton to a homogeneous cluster, as in Fig. 9. In this example, this means that \(D_1\) should score higher than \(D_2\).

First observe that all elements have the same recall in both clusterings. The element in \(C_3\) has the same precision of 0 whether it is added to \(C_1\) or to \(C_2\). The elements in the rag bag \(C_2\) also keep the same precision (namely 0) irrespective of whether \(C_3\) is joined or not. But those in the homogeneous \(C_1\) see a drop in precision (from 1 to \(\frac{3}{4}\)) when \(C_3\) is joined. Thus \(D_1\) has a higher mean ELM F1.

Fig. 9

Rag Bag constraint: black nodes belong to one cluster and all other nodes are singleton clusters. Shown are two cluster assignments: \(D_1 = \{ C_1, C_2 \cup C_3 \} \) and \(D_2 =\{ C_1 \cup C_3, C_2\} \). Figure 9 is taken from Figure 7 from [1]

3.6.4 Cluster size vs. quantity

As stated by Amigó et al., the Cluster Size vs. Quantity constraint can be loosely formulated as: small mistakes in large clusters should be penalized less than small mistakes in small clusters. Amigó et al. operationalize this constraint as follows. Let \(n>2\), let E be a set of elements with \(\vert E \vert =3n+1\), and let T, \(H_{1}\) and \(H_{2}\) be three partitions over E, where T is the ground truth and \(H_{1}\) and \(H_{2}\) are two predicted clusterings. Let T contain one cluster \(C_{1}\) of size \(n+1\) and n clusters of size 2, \(C_{2}\) through \(C_{n+1}\). Let \(H_{1}\) split \(C_{1}\) into a cluster \(C_{1}'\) of size n and a cluster \(C_{1}''\) of size 1, leaving \(C_{2}\) through \(C_{n+1}\) unaltered. Let \(H_{2}\) leave \(C_{1}\) unaltered but split \(C_{2}\) through \(C_{n+1}\) into 2n singleton clusters \(\{ C_{2}^{L}, C_{2}^{R}, \cdots , C_{n+1}^{L}, C_{n+1}^{R} \}\). An illustration of this setup for \(n=3\) is given in Fig. 10. The thus formalized constraint now says that, given T, the ELM score of \(H_1\) should be higher than that of \(H_2\).

Theorem 3

(Cluster Size Vs. Quantity) Given \(n>2\), T, \(H_{1}\) and \(H_{2}\) as described above, the ELM F1 score for \(H_1\) is higher than that for \(H_2\).

Proof

Let \(T, H_{1}\) and \(H_{2}\) be as stated in the constraint for some \(n>2\). Given that both \(H_{1}\) and \(H_{2}\) only split true clusters in T into smaller subsets, \(P(e)=1\) for every element in E for both \(H_{1}\) and \(H_{2}\), and thus proving that the mean ELM F1 is larger for \(H_1\) than for \(H_2\) simplifies to proving that this holds for the mean recall. We will show that the sum of all R(e) is higher for \(H_1\) than for \(H_2\), which proves the theorem.

For \(H_{1}\), the recall of all 2n nodes belonging to the correctly predicted clusters \(C_{2}\) through \(C_{n+1}\) equals 1, and the recall of the single node in \(C_{1}''\) is 0 (this would be \(\frac{1}{n+1}\) for BCubed). The ELM recall of all n nodes in \(C_{1}'\) equals \(\frac{n-1}{n}\) (this would be \(\frac{n}{n+1}\) for BCubed). Thus for \(H_1\), \(\Sigma _{e\in E}R(e)\) equals \(2n + n\cdot \frac{n-1}{n}=3n-1\).

For \(H_{2}\) (which correctly predicts the big cluster but splits all true two-element clusters), the ELM recall \(R(e)=0\) for all \(e \in C_i\) with \(i\ne 1\) (this would be \(\frac{1}{2}\) for BCubed). For the \(n+1\) nodes in the correctly predicted \(C_{1}\) the recall is 1, and thus for \(H_2\), \(\Sigma _{e\in E}R(e)= n+1\). For every \(n>1\), \(3n-1 > n+1\), as desired. \(\square \)
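The recall sums in the proof can be verified numerically (a sketch with our own helper names, building T, \(H_1\) and \(H_2\) as in the construction above):

```python
# Numeric check of the recall sums in the proof of Theorem 3.
# Clusterings are lists of disjoint sets; helper names are ours.

def elm_recall_sum(truth, hyp, elements):
    """Sum over E of the ELM recall R(e), with the empty-N_e^T proviso."""
    total = 0.0
    for e in elements:
        t = next(c for c in truth if e in c) - {e}
        h = next(c for c in hyp if e in c) - {e}
        total += 1.0 if not t else len(t & h) / len(t)
    return total

def constraint_instance(n):
    """Build T, H1 and H2 of the Cluster Size vs. Quantity construction."""
    big = set(range(1, n + 2))                               # C1, size n+1
    twos = [{n + 2 + 2 * i, n + 3 + 2 * i} for i in range(n)]  # n two-clusters
    truth = [big] + twos
    h1 = [set(range(1, n + 1)), {n + 1}] + twos              # split C1
    h2 = [big] + [{x} for c in twos for x in c]              # split the twos
    return truth, h1, h2
```

For every \(n\), the sums come out as \(3n-1\) for \(H_1\) and \(n+1\) for \(H_2\), matching the proof.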

Fig. 10

An illustration of the Cluster Size Vs. Quantity constraint for ELM for \(n=3\) and \(E=\{1,2,\ldots ,10\}\). The numbers in the two bottom rows are the ELM F1 scores for each element, and the mean F1 (the ELM score)

4 BCubed in the literature

We survey for which tasks BCubed has been used and discuss two other refinements of BCubed.

BCubed is used in the Machine Learning community for several clustering problems where a gold standard clustering is available, such as coreference resolution [6, 14, 15, 17, 18], Entity Linking [10, 11], and name disambiguation [2, 8]. In the case of coreference resolution, the task is to map words or short phrases that occur in a text to real-world entities. This mapping defines a clustering of all these words and phrases.

In coreference resolution in particular, BCubed is often used as a successor to the link-based metric used in MUC [19]. BCubed has two main advantages over MUC: it can score singleton clusters, and it takes the severity of clustering mistakes into account, which MUC does not. ELM obviously keeps both advantages. In both coreference resolution and Entity Linking, cluster size is likely long-tail distributed, with a few very large clusters, numerous smaller clusters, and many singletons. We have seen that BCubed overestimates especially on elements from small clusters and that ELM repairs this. As the reported F1 measure is the mean over all elements, this skewed distribution amplifies the overestimation. We thus believe that especially in these applications, ELM is preferable to BCubed.

Several refinements of BCubed have been proposed to adapt the metric to specific use cases. Moreno and Dias [13] proposed two adjustments to the BCubed F1 metric that make it better suited for highly unbalanced datasets, which occur frequently in, for example, image clustering or the clustering of results for ambiguous web search terms. They argue that the standard version of BCubed is less suited for this because the larger clusters (of the irrelevant class) have an unreasonable effect on the total score, comparable to the unreasonableness of accuracy in such cases. Both proposed alterations have the effect of weighting precision more than recall. The most straightforward one is to replace the harmonic mean F1 by a differently weighted average. The same remedy can be applied to ELM by using different weights for \(FP_e\) and \(FN_e\) in equation (1).
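One way to realize such a weighting is the standard \(F_\beta\) formulation over the per-element counts; this generalization is our own sketch, not the exact proposal of [13]:

```python
def weighted_f1(tp, fp, fn, beta=1.0):
    """F_beta over per-element counts: beta < 1 weighs false positives
    (precision) more, beta > 1 weighs false negatives (recall) more;
    beta = 1 recovers equation (1)."""
    denom = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    # The empty-neighborhood proviso: with no neighbors at all, score 1.
    return (1 + beta ** 2) * tp / denom if denom else 1.0
```

At \(\beta =1\) this reduces to \(2\,TP/(2\,TP+FP+FN)\), which equals the direct definition in (1).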

An extension of BCubed that handles overlapping clusters correctly is proposed by Amigó et al. [1]: the quality of a predicted cluster is evaluated by comparing an element with all other elements (including itself) in the ground truth cluster (for recall; the predicted cluster for precision) and comparing how many clusters they share in the prediction and in the ground truth. However, this extension may assign the maximum F1 score to a clustering that is not exactly equal to the gold standard. Rosales-Méndez and Ramírez-Cruz [16] propose CICE-BCubed, which fixes this issue for BCubed by also checking for pair occurrences in different classes. Neither the overlapping-cluster variant of Amigó et al. nor the change proposed by the authors of CICE-BCubed is straightforward to implement for ELM. The main problem is that this extended variant of BCubed must include a comparison between an element and itself in order to penalize a model for the spurious creation or deletion of singleton clusters. Consider the example where the ground truth contains two elements \(e_{1}\) and \(e_{2}\) that both belong to cluster a, and a prediction where \(e_{1}\) and \(e_{2}\) both belong to a, but \(e_{1}\) also belongs to a new cluster b. Intuitively, the precision for \(e_1\) should not be 1, as the prediction added a cluster; but under the definition of ELM this self-relation is not considered, and thus this mistake is not penalized. We leave the repair of this shortcoming of ELM in the case of overlapping clusters for future work.

5 Discussion

We have calculated the F1 scores for both BCubed and ELM on the element level, and then defined the F1 score of a predicted clustering as the average of the F1 scores of all elements. Although we believe this is closest to the original (not explicitly stated) definition given by Bagga and Baldwin [3], it is not the only way in which BCubed can be defined. Amigó et al. [1] define BCubed by averaging precision and recall over all elements and then applying the \(2PR/(P+R)\) formula to these averages. In words: we have used the average harmonic mean instead of the harmonic mean of the averages. For the main message of this article this does not matter, as neither way of defining BCubed satisfies the ZeroScore constraint.

6 Conclusion

We indicated that the BCubed F1 measure gives an overestimation of the performance of a clustering method, repaired the definition, and evaluated the result positively.

ELM satisfies a basic property of a metric: it can always attain the minimal score of 0, and it assigns that score to every prediction that gets nothing correct (i.e., produces not a single true positive). We want to emphasize that the idea and intuition behind the ELM metric are identical to those of BCubed.

We showed that the difference between ELM and BCubed is largest when the true clusters are small and numerous (e.g. when cluster size is power-law distributed). Even on a large real dataset with a well performing state-of-the-art clustering algorithm, ELM F1 was three percentage points lower than BCubed.

We end by looking at the problem from the perspective of network science [4, 12]. If we view a clustering not as a set of subsets of some domain D but as a binary relation on D, we take a network perspective. A clustering or partition then corresponds to an equivalence relation \(\equiv \). The neighbor function \(N(e)=\{e'\in D\mid e\equiv e'\}\) is then the clustering function used to define BCubed and ELM. In network science, it is customary to work with simple (that is, irreflexive) and, if possible, undirected relations. If we replace the equivalence relation with this irreflexive undirected relation, we end up with the same partition (in network science the blocks are called cliques). But on this network, the same neighbor function defines ELM, simply because no element is a neighbor of itself. We may speculate how BCubed would have been defined if one of the three B's had been a network scientist.