1 Introduction

Automatic structuring of data is a very helpful tool for improving a user’s understanding of the data and providing an overview of the information contained therein. As the amount of data increases, structuring it into a hierarchy is especially useful, since a hierarchy can group the data at different levels of granularity. This gives a more detailed insight into the data and enables a user to concentrate on interesting parts, analyzing details only when necessary. Hierarchical clustering methods were developed in the past to fulfill this task. The goal of these approaches is to group data based only on their features, i.e., information assigned to or extracted from the objects themselves. However, there is usually more than one way to create useful clusters. Without additional criteria, it cannot be decided which of these is the best one; the choice depends on various aspects, including the user’s preferences, knowledge, task, or context. Therefore, more recent research analyzes how prior knowledge about the desired final structure can be integrated into the clustering process. Pairwise constraints, as originally introduced by Wagstaff et al. (2001), have gained much attention for this purpose. While their idea is simple and effective, pairwise constraints are limited to partitional clustering (Wagstaff and Cardie 2000). Therefore, this paper discusses a hierarchical constraint type that extends the ideas of pairwise constraints to hierarchical clustering.

The main contribution of this paper is to discuss how hierarchical clustering can be constrained appropriately to integrate prior knowledge about the target cluster structure. We suggest the use of must-link-before (MLB) constraints and formally discuss their properties here. We show the effectiveness of these constraints through an approach that extends hierarchical agglomerative clustering (Hastie et al. 2009, Chap. 14.3.12) to obey MLB constraints, and through its application in a document clustering task. In former publications (Bade and Nürnberger 2006, 2008, 2009), we already presented preliminary results of this research and different ideas of constraint integration. However, those articles only briefly introduce the general concepts of MLB constraints. In Bade and Nürnberger (2006) we proposed a first formalization of MLB constraints, which is, from the current perspective, unnecessarily complex. In the current contribution we provide a much clearer formal introduction of the concept and its properties. Furthermore, in prior work we focused on learning a modified metric rather than a direct, instance-based integration of constraints into the clustering process itself. In Bade and Nürnberger (2008, 2009) we presented the basic ideas of the iHAC algorithm, but no detailed description was given. This contribution provides additional details, including an algorithm for an efficient implementation of these ideas. Furthermore, this paper includes a discussion of evaluation measures for hierarchies.

From the algorithmic point of view, we solve a hierarchical semi-supervised clustering task. The overall goal is to derive a hierarchical cluster structure guided by some kind of supervision. As with clustering under pairwise constraints, we deal with simple and general partial knowledge about the target cluster structure. Hence, our supervision is provided through a set of constraints which are specifically designed to work with the hierarchical clustering task. These constraints are formalized on the data object level, describing the (hierarchical) relationship between specific objects.

1.1 Application scenarios

As with pairwise constraints, MLB constraints naturally occur in quite diverse application scenarios. Some of those are briefly discussed in the following. Our research was mainly motivated by the task of hierarchically clustering documents in personal information management (Jones and Teevan 2007; Jones 2008). Here, a user is creating and maintaining a hierarchy of documents that s/he needs to re-access later on. An automatic method shall assist the user in this task. To take the user’s preferences on the target hierarchy into account, constraints can be used to guide the automatic clustering process. Two different modes of interaction are possible, both of which should be supported by the clustering algorithm.

The first option is an interactive feedback scenario, in which users rate selected items concerning their (hierarchical) relationship. Specifically, they would critique an initial clustering produced by the system, similar to relevance feedback in information retrieval systems (Manning et al. 2008). This feedback is translated directly into MLB constraints and applied to provide an improved clustering of the data.

In the second mode, the system analyzes an existing hierarchical structure on old data and applies it to newly available data. The existing hierarchy is translated into a (large) set of MLB constraints, which is applied for clustering the new data. Here, the goal is to extend the user’s existing information space (represented through a hierarchy filled with data) with new data. The existing hierarchy might not be sufficient to cover the new data, as new topics naturally arise over time. Hence, we again need to learn a new hierarchical cluster structure covering both old and new data. As we already know the user’s structuring preferences on the old data, this knowledge is incorporated into the clustering as constraints. From a user interaction perspective, it is also necessary that the user can easily navigate through the data. Therefore, it is crucial that the learned/adapted hierarchy is very similar to the existing one, to preserve the user’s ease of use. The strictest restriction would be to allow only hierarchy extension, which is visualized in Fig. 1. The given hierarchy on the left (which corresponds to the user collection) is extended with new nodes (A, B, C) based on the collection on the right, covering categories not yet described by the user collection. The new nodes can correspond to new classes, which are sibling nodes of existing nodes in the hierarchy (e.g., node A), or to new subclasses, which refine existing nodes in depth (e.g., nodes B and C). The unlabeled documents on the right are inserted into this refined hierarchy, i.e., either into new or existing nodes. Of course, the tasks of hierarchy refinement and insertion of items are accomplished in parallel.

Fig. 1

Semi-supervised hierarchical clustering of documents (left: given hierarchy with training examples and refinement through clustering (node A: new class; nodes B, C: new subclasses); right: collection of new documents)

Please note that this scenario can be solved in different ways. One alternative is the use of incremental algorithms that start with the existing hierarchy and incrementally add new data. Instead of reclustering the data, the existing hierarchy is modified through a certain set of operators. This is especially beneficial if data points emerge one by one in a sequence over time. Examples of such approaches can be found in McKusick and Langley (1991), Fisher (1987), Choi and Peng (2004). Fisher (1987) introduced the COBWEB algorithm to build a cluster hierarchy. It uses a heuristic measure of category utility to decide where to add a new object and uses four different operators to modify the hierarchy: classifying the object with respect to an existing class, creating a new class, combining two classes into a single class, and dividing a class into several classes. McKusick and Langley (1991) proposed ARACHNE, an algorithm based on COBWEB. It has the same underlying idea but uses different operations to modify the tree structure. Specifically, they test nodes for vertical and horizontal fit, which either moves nodes upwards in the hierarchy or merges them with a sibling. Choi and Peng (2004) split their proposed algorithm into two phases. In the first phase, an initial hierarchy is learned using a batch learning algorithm. In the second phase, new items are incrementally added to the hierarchy. This is especially interesting if a hierarchy is already given as a starting point, as in our example.

All of those algorithms could be extended to incorporate (hierarchical) constraints. So far, none of them handles constraints, and the operations for altering the current hierarchy might lead to constraint violations. In this paper, we decided on a batch learning algorithm to show the effectiveness of the proposed constraints. However, in the future, different algorithms (batch as well as incremental) should be analyzed with respect to their behavior under these constraints. This is beyond the scope of this paper. Here, we focus on introducing the constraints themselves and discussing their capabilities.

Another good source of constraints is spatial information. As also mentioned and further investigated by Wagstaff (2002), spatial information is better expressed through constraints than through straightforward methods like adding additional features. In this way, spatial constraints can be enforced by the algorithm or otherwise handled more explicitly than through a feature representation. Wagstaff (2002) worked with pairwise constraints. It is our hypothesis that hierarchical constraints as proposed in this paper are an interesting alternative. However, we have not done any experiments yet, as the focus of this paper is on document data. This is left for future work. Nevertheless, we briefly discuss how constraints could be applied to spatial data in two different scenarios.

One such application is image segmentation (for a more detailed introduction to image segmentation see, e.g., Gonzalez and Woods 2007). Here, the goal is to cluster pixels of an image to form coherent regions which are assumed to reflect certain objects. While image segmentation can mean creating a partition of pixels, a hierarchical segmentation can give a more detailed description of an image. For example, one could be interested in extracting the car as a whole as well as extracting its door or tire. A clustering approach can be used to create a partition or cluster hierarchy of pixels based on features like pixel color. However, this could lead to clusters with no spatial coherence, which is not useful for image segmentation. MLB constraints can be used to encode these spatial constraints. Those constraints can be computed automatically from the image without the need for manual intervention. Through the MLB constraints, the clustering algorithm can be modified to find spatially coherent clusters. Furthermore, additional available information can be integrated in the same manner, without requiring a different approach.

The same is possible for other clustering tasks where spatial position matters, such as approaches for the automatic learning of management zones in precision agriculture (Khosla et al. 2010; Ruß and Kruse 2011). Here, we have geo-referenced data of individual regions on large farms. Depending on their properties, different regions are handled differently by the farmer in terms of watering, fertilizing, and so on. Forming regions that should receive the same handling is the goal of management zone learning. As in image segmentation, these regions must have spatial coherence. Hence, we face the same kind of clustering problem: individual elements with certain properties plus additional spatial constraints.

1.2 Structure of the paper

The paper is structured as follows. We start with a review of pairwise constraints in the next section. Then, we introduce hierarchical constraints in Sect. 3 and discuss their properties and applicability. In Sect. 4, we focus on constrained hierarchical clustering and discuss an algorithm that integrates MLB constraints efficiently into hierarchical agglomerative clustering (HAC). Finally, we evaluate the algorithm under different scenarios of supervision in terms of clustering quality and runtime in Sect. 5.

2 Pairwise constraints

The goal of pairwise constraints is to express the relative cluster assignment of two items (Wagstaff et al. 2001). As they are defined on pairs of items, they do not require the knowledge of explicit categories. It is sufficient to know whether two items belong to the same or to different categories without the need to name or describe these categories. Nonetheless, if training data is given in the form of instances and their respective target class, a representation with pairwise constraints can always be extracted. Hence, the usage of pairwise constraints allows for a more general applicability.

There are two types of pairwise constraints, must-link constraints and cannot-link constraints. A must-link constraint expresses that two items belong to the same cluster, and a cannot-link constraint describes an item pair that should be separated through clustering. An example is given in Fig. 2.

Fig. 2

An example of must-link (left) and cannot-link constraints (right)

Several approaches were developed in the past that use pairwise constraints during clustering. They can be divided into two broad types, i.e., instance-based and metric-based approaches. Instance-based approaches use constraints to influence the cluster algorithm directly based on the constraint pairs. This is done in different ways, e.g., by using the constraints for initialization (e.g., as in Kim and Lee 2002), by enforcing them during the clustering (e.g., as in Wagstaff et al. 2001), or by integrating them in the cluster objective function (e.g., as in Basu et al. 2004a). Furthermore, instance-based approaches can be divided into approaches that require hard constraint satisfaction (e.g., as in Wagstaff et al. 2001) and those that allow violation of some constraints (e.g., as in Basu et al. 2004a).

The metric-based approaches try to generalize the given knowledge by mapping the constraints to a distance metric or similarity measure, which is then used during the clustering process (e.g., as in Bar-Hillel et al. 2003). The basic idea of most of these approaches is to weight features differently, depending on their importance for the distance computation. While the metric is usually learned in advance using only the given constraints, the approach of Bilenko et al. (2004) adapts the distance metric during clustering. Such an approach allows for the integration of knowledge from unconstrained items.

2.1 A review of selected approaches

In the following, specific approaches from the literature are presented in more detail. These papers form a representative selection of the existing literature showing the diversity of existing approaches.

Instance-based approaches

The first to propose pairwise constraints was Kiri Wagstaff. Wagstaff and Cardie (2000) first integrated them into the COBWEB clustering method, enforcing hard constraint satisfaction. Following up on this work, Wagstaff et al. (2001) proposed COP-KMEANS. This approach alters the assignment of items to cluster centers in k-means. Items are only assigned to clusters such that no constraint is violated. If such an assignment is not possible at any time, the algorithm terminates without a solution. Basu et al. (2002) used labeled data as seeds for initializing cluster centers. Additionally, the labeled data remains fixed to the same clusters during re-computation of the cluster assignment. Two years later, they defined an objective function for k-means that is a trade-off between the standard k-means objective and constraint violations (Basu et al. 2004a). They also initialize the clusters using the transitive closure of must-link constraints. Bilenko et al. (2004) combine this approach with metric learning. The Euclidean distance is weighted through a matrix, whereby each cluster adapts its own matrix. It is noteworthy that the metric is adapted during the clustering instead of before clustering, as done in most other approaches. This work is also embedded in a framework based on hidden Markov random fields (Basu et al. 2004b).

Instance-based approaches were furthermore combined with other clustering methods. Grira et al. (2004) integrated pairwise constraints into fuzzy clustering. They propose a new objective function that combines the standard objective with constraint violations, similar to the work of Basu et al. (2004a). Based on this, they derived a new update formula for the individual refinement steps of the algorithm. Ruiz et al. (2007b) extended density-based clustering. Their algorithm ensures by design that all constraints are satisfied. This, however, leads to the creation of many small clusters. After an evaluation with benchmark data, they also applied their method to query log data (Ruiz et al. 2007a).

Metric learning

Bar-Hillel et al. (2003, 2005) learn a Mahalanobis metric from must-link constraints only. They propose Relevant Component Analysis, which provides a closed-form solution for different optimization problems. However, this is only possible because cannot-links are neglected. Nevertheless, they show results comparable to Xing et al. (2003) while requiring less computation time. Xing et al. (2003) also learned a weighted metric similar to the Mahalanobis distance, using both types of constraints. A restricted problem using only a diagonal matrix is solved as an optimization problem, while a gradient descent search is used for learning a full matrix. Both Bar-Hillel et al. (2005) and Xing et al. (2003) showed the superiority of metric learning over instance-based approaches. Kim and Lee (2002) also learned a weighted metric similar to the Mahalanobis distance. They solved an optimization problem that aims at minimizing the distance between must-link pairs and maximizing the distance between cannot-link pairs, using a gradient descent search.

A different approach was taken by Klein et al. (2002). They directly modified the distance matrix, which can then be used in hierarchical agglomerative clustering. First, entries of must-link pairs are set to 0. However, this can result in a violation of the triangle inequality for some points. This problem is fixed with a shortest-path algorithm, which in effect lets the must-link constraints also influence other item pairs. Then, all cannot-link pairs are set to a maximal value, which again destroys the metric properties. However, it is argued that this does not prohibit using the matrix during clustering. In fact, it is identical to enforcing the constraints during clustering. No fixing is applied here, i.e., cannot-link pairs only influence themselves.
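The following minimal sketch (with illustrative names, and assuming a distance matrix as input) captures the idea described above: must-link pairs get distance 0, an all-pairs shortest-path pass propagates this change to other pairs, and cannot-link pairs are then set to a maximal value without any repair.

```python
import numpy as np

def constrain_distance_matrix(dist, must_links, cannot_links):
    """Sketch of the matrix modification described by Klein et al. (2002).

    dist         -- symmetric (n x n) distance matrix
    must_links   -- iterable of index pairs (i, j) that must link
    cannot_links -- iterable of index pairs (i, j) that cannot link
    """
    d = dist.copy()
    n = d.shape[0]
    # 1. Must-link pairs become maximally similar (distance 0).
    for i, j in must_links:
        d[i, j] = d[j, i] = 0.0
    # 2. Repair the triangle inequality with an all-pairs shortest path
    #    (Floyd-Warshall), which lets must-links influence other pairs.
    for k in range(n):
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    # 3. Cannot-link pairs become maximally dissimilar; no repair is applied,
    #    so these entries only influence themselves.
    max_dist = d.max() + 1.0
    for i, j in cannot_links:
        d[i, j] = d[j, i] = max_dist
    return d
```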

Cohn et al. (2003) learn a feature weighting for a probabilistic model. Clustering is done with the EM algorithm, and the weights are learned using gradient descent. Schultz and Joachims (2004) and Finley and Joachims (2005) both use SVMs to learn a metric. Schultz and Joachims (2004) use relative comparisons between pairs of similarities to learn the matrix of a Mahalanobis distance. From these comparisons, an optimization problem is formalized and solved with an SVM algorithm. Finley and Joachims (2005) directly train a binary classifier that outputs an item pair’s associated similarity value. This is done with the structural SVM algorithm. The learned similarity function is then used in correlation clustering.

2.2 Problems of constraint usage

Feasibility issues

A number of interesting theoretical results concerning constraints were obtained by Davidson and Ravi. They analyzed the feasibility problem, i.e., the question whether a partition with k clusters can be derived without violating any constraints. Even though a solution might exist, the constrained clustering might not be able to find it. They studied feasibility in connection with k-means (Davidson and Ravi 2005b, 2007a) and HAC (Davidson and Ravi 2005a, 2009). In the latter work, they also consider the problem of irreducibility, i.e., the possibility of reaching a dead end during cluster merging although a solution with a smaller number of clusters exists. In further work, they describe a heuristic to generate easy constraint sets from labeled data, i.e., constraint sets that do not suffer from the feasibility problem (Davidson and Ravi 2006). They also show that the number of clusters cannot be estimated efficiently from a constraint set and that using too small a value for k can also be problematic for the performance of metric learning (Davidson and Ravi 2007b).

Beneficial constraints

Although it is widely reported that introducing constraints largely improves clustering quality, not all constraints are equally suited to increase the effectiveness of clustering. Davidson et al. (2006) empirically show that not all constraints improve performance. They identify two properties of constraints which influence the clustering performance, namely informativeness and coherence. Informativeness describes how much information is provided that cannot be determined by the algorithm alone. Coherence is the amount of agreement between the constraints themselves, given a certain metric. Other work addresses the varying benefit of constraints through active learning, which aims at identifying the constraints that provide the highest performance gain (Basu et al. 2004a; Klein et al. 2002).

Both problems are very important and also apply to the hierarchical constraints discussed in this paper. Although it is beyond the scope of this paper to discuss and analyze these problems in detail, they need to be kept in mind during algorithm design and the evaluation of results.

2.3 Limitations for hierarchies

In a hierarchy of clusters, clusters are part of other, larger clusters. Therefore, items in fact belong to multiple clusters. As pairwise constraints are not related to a specific cluster and instead generally declare whether two items belong to the same or different clusters, they cannot distinguish between the different clusters an item can belong to in hierarchical clustering. Nevertheless, two items separated in specific clusters are always together in more general clusters, at least in the root (see Fig. 3). Therefore, pairwise constraints are not suitable in the hierarchical setting. This was already mentioned, but not solved, in the initial work on constrained clustering by Wagstaff and Cardie (2000). Kestler et al. (2006) also identified this problem. They use hierarchical agglomerative clustering and address the problem by fixing the constraints to a certain dendrogram level. More specifically, they only consider constraints for the top-most dendrogram level. However, in most cases it is not clear on which dendrogram level the constraints should be applied.

Fig. 3

Pairwise constraints for hierarchies: different constraints between the same objects on different hierarchy levels

Hierarchical learners vs. learning hierarchies

The use of a hierarchical learner like hierarchical agglomerative clustering does not necessarily mean that the goal is to learn a hierarchy. Such learners can just as well be used to derive a cluster partition. However, this simplifies the task, because a less complex target structure (a partition instead of a hierarchy) is derived and evaluated. There are many works in which hierarchical clustering and pairwise constraints are analyzed (e.g., Davidson and Ravi 2005a, 2009; Bae and Bailey 2006; Klein et al. 2002). However, none of these works aims at deriving a true cluster hierarchy. Therefore, these articles solve a different problem.

3 Hierarchical constraints

Because of the previously mentioned limitations of pairwise constraints, this work suggests another type of constraint that is suitable for hierarchical clustering tasks: the must-link-before (MLB) constraint. The goal is to capture the hierarchical order in which data objects d_i are related. To this end, item pairs are replaced with item triples:

$$ \mathit{MLB}_{xyz} = (d_x,d_y,d_z). $$
(1)

A constraint is labeled with MLB and has three indices relating to the three data objects involved. The order is important as it determines the exact hierarchical relationship according to an underlying cluster hierarchy H as defined in the following:

Definition 1

(Cluster Hierarchy)

A cluster hierarchy H is a tuple (C, ≺, root). C is the set of clusters. ≺ defines the hierarchical parent-child relation between clusters, which is a strict partial ordering between the clusters. root ∈ C is the root node of the hierarchy.

A constraint as in (1) expresses a relative relationship according to hierarchy levels of an underlying hierarchy H. This hierarchy need not be specified explicitly. It can be given implicitly through the data. MLB constraints are a means to describe those implicit (or explicit) hierarchical relations. Specifically, MLB_{xyz} states that the data objects d_x and d_y are related on a more specific hierarchy level than the data objects d_x and d_z, as shown in Fig. 4. This can be summarized in the following two properties:

Fig. 4

An MLB constraint in the hierarchy

Property 1

Given the MLB constraint MLB_{xyz}, any cluster c in the cluster hierarchy H that contains d_x and d_z also contains d_y: ∀c ∈ H: d_x ∈ c ∧ d_z ∈ c → d_y ∈ c.

Property 2

Given the MLB constraint MLB_{xyz}, there exists at least one cluster c in the cluster hierarchy H which contains d_x and d_y but not d_z: ∃c ∈ H: d_x ∈ c ∧ d_y ∈ c ∧ d_z ∉ c.

Based on the previous ideas, a must-link-before constraint is defined as follows:

Definition 2

(Must-Link-Before constraint)

A Must-Link-Before (MLB) constraint MLB_{xyz} is a triple of items MLB_{xyz} = (d_x, d_y, d_z) for which Properties 1 and 2 hold according to an underlying cluster hierarchy H.
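As an illustration of Definition 2, the following minimal Python sketch checks Properties 1 and 2 for a candidate triple against a hierarchy that is represented simply as a collection of clusters (sets of data objects); all names are illustrative and not part of any fixed interface.

```python
def is_valid_mlb(clusters, d_x, d_y, d_z):
    """Check Definition 2 for a candidate triple (d_x, d_y, d_z) against a
    hierarchy given simply as an iterable of clusters (sets of data objects)."""
    # Property 1: every cluster containing d_x and d_z also contains d_y.
    prop1 = all(d_y in c for c in clusters if d_x in c and d_z in c)
    # Property 2: at least one cluster contains d_x and d_y but not d_z.
    prop2 = any(d_x in c and d_y in c and d_z not in c for c in clusters)
    return prop1 and prop2
```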

Sources for constraints

MLB constraints can come from different sources, as shown through several examples in the introduction (see Sect. 1.1). Often they occur naturally (as item triples). Furthermore, existing hierarchical structures (like a folder hierarchy of documents, or tagged documents in combination with an ontology between the tags) are a typical source of MLB constraints. Such hierarchies can be formalized through MLB constraints to be used as a source of supervision in (clustering) algorithms. This is demonstrated in the following, where we describe how to extract the clustering constraints from a given training set and class hierarchy. A simple example is shown in Fig. 5. On the right side, some MLB constraints are shown, which were extracted from the hierarchy on the left. The first one expresses that a data object in class 4 is closer to a data object in class 5 than to a data object in class 2. The second constraint formalizes that two data objects of class 2 are closer to each other than to a data object in class 3. The third one states that a data object in class 3 is more closely related to a data object in class 5 than to a data object in class 1, which holds because classes 3 and 5 are grouped together on a more specific hierarchy level than classes 3 and 1. Of course, it is useful to extract all possible constraints in a systematic manner to gain as much information from the hierarchy as possible.

Fig. 5

Example of MLB constraint extraction from a hierarchy
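A systematic extraction as described above might be sketched as follows. We assume the training data is given as a mapping from data objects to classes and the class hierarchy as a mapping from each class to its ancestor path; both representations and all names are illustrative.

```python
from itertools import permutations

def extract_mlb_constraints(labels, ancestors):
    """Enumerate MLB constraints from labelled training data (illustrative sketch).

    labels    -- dict mapping each data object to its class
    ancestors -- dict mapping each class to the list of classes on the path
                 from the class itself up to the root (inclusive)
    A triple (x, y, z) is emitted whenever the lowest common ancestor class of
    x and y lies on a more specific level than that of x and z."""
    def lca_depth(a, b):
        common = set(ancestors[a]) & set(ancestors[b])
        # deeper classes have longer paths to the root
        return max(len(ancestors[c]) for c in common)

    constraints = []
    for x, y, z in permutations(labels, 3):   # O(n^3); fine for small training sets
        if lca_depth(labels[x], labels[y]) > lca_depth(labels[x], labels[z]):
            constraints.append((x, y, z))
    return constraints
```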

Generality

Although MLB constraints specifically target hierarchical clustering, they can also be applied to flat clustering. A partition of clusters can be viewed as a very simple hierarchy with a root node and all clusters of the partition as direct subclusters of this root node. In this case, the following MLB constraints can be formulated: two data objects belonging to one cluster c_i must-link-before a data object of a different cluster c_j, i.e., MLB_{xyz} = (d_x ∈ c_i, d_y ∈ c_i, d_z ∈ c_j). Hence, a cluster partition can be handled as a special case of a hierarchy.

However, please note the following difference between MLB constraints and pairwise constraints, which might impact the clustering behavior. MLB constraints add a further meta-relationship on top of the relative cluster assignment relationship defined by pairwise constraints. In fact, an MLB constraint compares the strength of two must-link constraints, stating that the must-link between the first and the second item is stronger than the must-link between the first and the third item. However, nothing is said about the magnitude of the difference. Hence, MLB_{xyz} can describe data objects which are just one hierarchy level apart, but also data objects which are much further apart, e.g., separated by 5 or 10 hierarchy levels. In effect, MLB constraints enforce the creation of a hierarchy that is at least detailed enough to reflect the hierarchical differences between data objects as described by the constraint set. However, they allow for an arbitrarily finer hierarchy as well.

In the case of a cluster partition, the partition (as represented through the simple hierarchy described above) is the least detailed hierarchy that can be learned. However, further sub-clusters are allowed by the constraint set and could therefore be derived by a hierarchical clustering algorithm. If a partitioning algorithm is used, however, sub-clusters are not possible and the MLB constraint set exactly describes the partition, just as is the case for pairwise constraints. In fact, MLB constraints can be transformed into pairwise constraints in this case (and only in this case). Specifically, one MLB constraint (d_x, d_y, d_z) can be transformed into one must-link constraint (d_x, d_y) and two cannot-link constraints (d_x, d_z) and (d_y, d_z). Therefore, if it is known in advance that the clustering goal is a partition rather than a hierarchy, there is no need to use MLB constraints instead of pairwise constraints.
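For this partition case, the transformation into pairwise constraints is straightforward; a minimal sketch (valid only when the target structure is a flat partition):

```python
def mlb_to_pairwise(mlb_constraints):
    """For a flat partition, each MLB constraint (x, y, z) is equivalent to
    one must-link and two cannot-link constraints (valid only in this case)."""
    must_links, cannot_links = [], []
    for x, y, z in mlb_constraints:
        must_links.append((x, y))
        cannot_links.append((x, z))
        cannot_links.append((y, z))
    return must_links, cannot_links
```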

MLB constraints have the same generality as must-link and cannot-link constraints, as they also do not require knowledge about class labels. Formulating an MLB constraint merely requires specifying a relative relatedness among three data objects. As mentioned before, no statement is made about how strong this relationship is. A constraint can involve data objects that are close together with respect to an underlying hierarchy as well as data objects that are very far apart. It might be useful to explicitly weight the relationship, similar to soft must-link and cannot-link constraints (Wagstaff 2002). A weighted MLB constraint therefore associates a weight with a document triple:

$$ \mathit{MLB}_{xyz} = (d_x,d_y,d_z,w). $$
(2)

However, constraint weighting is not considered further here. It is mentioned only as a side note for the sake of completeness and as a basis for future discussion.

Transitivity of MLB constraints

From a set of must-link constraints and a set of cannot-link constraints, one can transitively infer further constraints (Basu et al. 2008, Chap. 1). Likewise, transitive inference is possible for MLB constraints. This can increase the constraint set. Of course, such an inference requires non-conflicting constraints, i.e., no errors or noise in the given constraints. In the following, we discuss and prove three theorems describing transitive inference for MLB constraints.

Theorem 1

(Symmetry)

The first two data objects of an MLB constraint are symmetric:

$$ (d_1,d_2,d_3)\quad \Rightarrow\quad(d_2,d_1,d_3). $$

If d_1 and d_2 are related on a more specific hierarchy level than d_1 and d_3, this also implies that d_2 and d_3 are grouped together only after d_1 and d_2. This can be proven using the properties of MLB constraints given before.

Proof

Given is the constraint (d_1, d_2, d_3). Therefore, it holds according to Properties 1 and 2 that

$$ \mbox{(1)}\quad \forall c \in H: d_1 \in c \wedge d_3 \in c \rightarrow d_2 \in c \quad \mbox{and}\quad \mbox{(2)}\quad \exists c \in H: d_1 \in c \wedge d_2 \in c \wedge d_3 \notin c. $$

(d_2, d_1, d_3) is also a valid constraint if it can be shown that both properties hold, i.e.,

$$ \mbox{(I)}\quad \forall c \in H: d_2 \in c \wedge d_3 \in c \rightarrow d_1 \in c \quad \mbox{and}\quad \mbox{(II)}\quad \exists c \in H: d_2 \in c \wedge d_1 \in c \wedge d_3 \notin c. $$

The second property (II) holds as it is identical to the given fact (2). The first property (I) can be shown by contradiction. The question is: does a cluster exist that contains d_2 and d_3 but not d_1? If so, the following assumption would be true:

$$ \mbox{(a)} \ \exists c \in H: d_2 \in c \wedge d_3 \in c \wedge d_1 \notin c, $$

which contradicts (I). Data objects in a hierarchy that are assigned to the same node (at one certain hierarchy level) stay together in all ancestor nodes on more general hierarchy levels, as given through the parent-child relation ≺. Hence, they cannot be split up again. Our assumption (a) therefore implies that a cluster containing d_1 and d_2 but not d_3 cannot exist, because otherwise one would need to split up a node again:

$$ \neg\exists c \in H: d_1 \in c \wedge d_2 \in c \wedge d_3 \notin c. $$

This violates fact (2). Therefore, the assumption made is wrong and (I) holds for (d_2, d_1, d_3). Hence, the symmetry theorem is proven. □

Further transitive inference is possible if two MLB constraints are combined that overlap in two of their three items.

Theorem 2

(Transitivity inside a subtree)

Given two MLB constraints with an identical data object to identify a subtree and an identical reference data object in a different subtree (at the third position), a third MLB constraint with the same reference data object can be inferred through:

$$ (d_1,d_2,d_4) \wedge(d_2,d_3,d_4) \quad\Rightarrow\quad(d_1,d_3,d_4). $$

Theorem 3

(Transitivity over different hierarchy levels)

Given are two MLB constraints with two identical data objects. The first data object identifies the intermediate hierarchy level and is positioned third in one constraint and first or second in the second constraint. The second data object identifies the deeper hierarchy level and is positioned first or second in both constraints. Then, a third constraint can be inferred between the lower and the higher level without the intermediate level:

$$ (d_1,d_2,d_3) \wedge(d_1,d_3,d_4)\quad \Rightarrow\quad(d_1,d_2,d_4). $$

Both theorems can be proven again based on the properties of MLB constraints. Let us start with proving Theorem 2.

Proof

Given are the constraints (d_1, d_2, d_4) and (d_2, d_3, d_4). Therefore, it holds according to Properties 1 and 2 for both constraints that

$$\begin{aligned} &\mbox{(1)}\quad \forall c \in H: d_1 \in c \wedge d_4 \in c \rightarrow d_2 \in c \quad \mbox{and}\\ &\mbox{(2)}\quad \exists c \in H: d_1 \in c \wedge d_2 \in c \wedge d_4 \notin c \quad \mbox{and}\\ &\mbox{(3)}\quad \forall c \in H: d_2 \in c \wedge d_4 \in c \rightarrow d_3 \in c \quad \mbox{and}\\ & \mbox{(4)}\quad \exists c \in H: d_2 \in c \wedge d_3 \in c \wedge d_4 \notin c. \end{aligned}$$

(d_1, d_3, d_4) is also a valid constraint if it can be shown that both properties hold, i.e.,

$$\begin{aligned} &\mbox{(I)}\quad \forall c \in H: d_1 \in c \wedge d_4 \in c \rightarrow d_3 \in c \quad \mbox{and}\\ &\mbox{(II)}\quad \exists c \in H: d_1 \in c \wedge d_3 \in c \wedge d_4 \notin c. \end{aligned}$$

To prove the first part, we can combine the given facts (1) and (3):

$$ \forall c \in H: d_1 \in c \wedge d_4 \in c \stackrel{(1)}{\rightarrow} d_1 \in c \wedge d_4 \in c \wedge d_2 \in c \stackrel{(3)}{\rightarrow} d_1 \in c \wedge d_2 \in c \wedge d_3 \in c \wedge d_4 \in c, $$

which reduces to (I). To prove the second part, we again need the definition of a cluster hierarchy. Recall that items in a hierarchy that are assigned to the same node (at one certain hierarchy level) stay together in all ancestor nodes on more general hierarchy levels, as given through the parent-child relation ≺. Without loss of generality, we assume that d_2 is grouped together with d_1 earlier than with d_3. The first cluster (on the most specific hierarchy level) containing both d_2 and d_3 therefore also contains d_1. Because of (4), it cannot contain d_4, as otherwise the cluster in (4) would not exist. Therefore, this cluster looks as follows:

$$ d_1 \in c \wedge d_2 \in c \wedge d_3 \in c \wedge d_4 \notin c. $$

Property (II) applies for this cluster, which proves that such a cluster exists. Hence, the transitivity is proven. □

And finally, here is the proof for Theorem 3.

Proof

Given are the constraints (d_1, d_2, d_3) and (d_1, d_3, d_4). Therefore, it holds according to Properties 1 and 2 for both constraints that

$$\begin{aligned} & \mbox{(1)}\quad \forall c \in H: d_1 \in c \wedge d_3 \in c \rightarrow d_2 \in c \quad \mbox{and}\quad \mbox{(2)}\quad \exists c \in H: d_1 \in c \wedge d_2 \in c \wedge d_3 \notin c\quad \mbox{and}\\ & \mbox{(3)}\quad \forall c \in H: d_1 \in c \wedge d_4 \in c \rightarrow d_3 \in c \quad \mbox{and}\quad \mbox{(4)}\quad \exists c \in H: d_1 \in c \wedge d_3 \in c \wedge d_4 \notin c. \end{aligned}$$

(d_1, d_2, d_4) is also a valid constraint if it can be shown that both properties hold, i.e.,

$$ \mbox{(I)}\quad \forall c \in H: d_1 \in c \wedge d_4 \in c \rightarrow d_2 \in c\quad \mbox{and}\quad \mbox{(II)}\quad \exists c \in H: d_1 \in c \wedge d_2 \in c \wedge d_4 \notin c. $$

To prove the first part, we can combine the given facts (3) and (1):

$$ \forall c \in H: d_1 \in c \wedge d_4 \in c \stackrel{(3)}{\rightarrow} d_1 \in c \wedge d_4 \in c \wedge d_3 \in c \stackrel{(1)}{\rightarrow} d_1 \in c \wedge d_2 \in c \wedge d_3 \in c \wedge d_4 \in c, $$

which reduces to (I). To prove the second part, we start with fact (4) and apply fact (1), which yields:

$$ \exists c \in H: d_1 \in c \wedge d_3 \in c \wedge d_4 \notin c \wedge d_2 \in c, $$

which can be reduced to (II). Hence, the transitivity is proven. □

The MLB constraints introduced in this section can be used as basis for constraining hierarchical clustering. In the following approach, it is assumed that Theorems 1–3 are applied to an initially given constraint set to get the most out of it.
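Assuming a non-conflicting constraint set, the three theorems can be applied exhaustively until a fixed point is reached. The following simple (deliberately unoptimized) sketch illustrates such a closure computation:

```python
def mlb_closure(constraints):
    """Close a set of MLB triples under Theorems 1-3 (fixed-point iteration).

    Assumes the input set is non-conflicting; the quadratic pairing of
    constraints is kept simple for clarity, not for efficiency."""
    closed = set(constraints)
    changed = True
    while changed:
        changed = False
        new = set()
        # Theorem 1 (symmetry): (d1, d2, d3) => (d2, d1, d3)
        for d1, d2, d3 in closed:
            new.add((d2, d1, d3))
        # Theorems 2 and 3 combine two constraints sharing two objects.
        for a in closed:
            for b in closed:
                # Theorem 2: (d1, d2, d4) and (d2, d3, d4) => (d1, d3, d4)
                if a[2] == b[2] and a[1] == b[0]:
                    new.add((a[0], b[1], a[2]))
                # Theorem 3: (d1, d2, d3) and (d1, d3, d4) => (d1, d2, d4)
                if a[0] == b[0] and a[2] == b[1]:
                    new.add((a[0], a[1], b[2]))
        if not new <= closed:
            closed |= new
            changed = True
    return closed
```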

4 Constrained hierarchical clustering

In the following, we present an instance-based approach that directly integrates MLB constraints into hierarchical clustering. In general, this approach addresses the task of constrained hierarchical clustering as defined in Definition 3 below.

Definition 3

(Constrained Hierarchical Clustering)

Given is a data collection D that shall be structured, together with a set MLB of must-link-before constraints on data object triples from D. The constrained hierarchical clustering task is to build a cluster hierarchy H = (C, ≺, root) and to find a mapping of the data objects in D to the clusters in C such that no constraint in MLB is violated.

We extended the hierarchical agglomerative clustering (HAC) algorithm (Hastie et al. 2009, Chap. 14.3.12) for this purpose. HAC is a natural choice for the hierarchical clustering problem, as it directly extracts the hierarchical relations between data objects by building a dendrogram (see Fig. 6). However, the resulting hierarchy is very detailed, since it splits the data until only individual objects remain in the leaf nodes, and many intermediate splits are present. If a more coarse-grained structure is desirable, e.g., for user interaction, it needs to be extracted in a postprocessing step, as proposed by Bade et al. (2007) for unsupervised and semi-supervised clustering.

Fig. 6

Example of a dendrogram for the two-dimensional dataset on the right. The similarity of combined clusters is often displayed through the height of the bars that represent a grouping of two clusters

MLB constraints can be integrated naturally into HAC. In contrast to must-link and cannot-link constraints, they do not suffer from the problem that a dendrogram level needs to be specified (cf. Sect. 2.3). This allows for a straightforward integration of MLB constraints into HAC using our instance-based approach iHAC, presented in the following.

4.1 Integrating MLB constraints with iHAC

The HAC algorithm is modified in such a way that merges occur only in accordance with the given MLB constraints. HAC initially places each data object in an individual cluster and repeatedly merges the two closest clusters until a single cluster is left. The merging step is modified in iHAC such that not only similarity but also constraint violations are taken into account when determining the next two clusters to merge. In fact, the most similar pair is selected from those cluster pairs whose merge does not violate any constraints (or violates the fewest, as discussed later). For merging, this means that for every constraint (d_x, d_y, d_z), the cluster containing d_z can only be merged with a cluster that contains either both d_x and d_y or neither of them. An example is given in Fig. 7. Here, six data objects shall be clustered, given the constraint (d_x, d_y, d_z). As shown in (a), this forbids two merges in the first step, namely merging d_x with d_z and merging d_y with d_z. Assuming that d_w and d_x were most similar, the dendrogram after the first merge looks as in (b). Again, two merges are forbidden. If the newly formed cluster is merged with d_y in the next step (c), the constraint cannot be violated anymore and all merges are allowed for the rest of the clustering. This includes the merge with d_z, as shown in (d). A pseudocode description of the overall iHAC algorithm is shown in Fig. 8.

Fig. 7

Example of cluster merges in iHAC under the constraint (d_x, d_y, d_z): (a) after initialization, (b) after the first, (c) the second, and (d) the third merge

Fig. 8

The iHAC algorithm
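To complement the pseudocode in Fig. 8, the following simplified Python sketch shows the modified merge criterion without the efficient data structure of Sect. 4.2. It assumes that similarity(c_i, c_j) is a cluster-level similarity (e.g., group average over an object-level similarity); all names are illustrative.

```python
def merge_violations(c_i, c_j, constraints):
    """Count constraints violated by merging clusters c_i and c_j (sets of objects).

    A constraint (d_x, d_y, d_z) forbids merging the cluster holding d_z with a
    cluster that contains exactly one of d_x and d_y."""
    count = 0
    for d_x, d_y, d_z in constraints:
        for c_a, c_b in ((c_i, c_j), (c_j, c_i)):
            if d_z in c_a and (d_x in c_b) != (d_y in c_b):
                count += 1
                break
    return count


def ihac(objects, similarity, constraints):
    """Simplified iHAC sketch: merge the most similar pair among those with the
    fewest constraint violations, until one cluster remains."""
    clusters = [frozenset([o]) for o in objects]
    dendrogram = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # fewer violations first, then higher similarity
                key = (merge_violations(clusters[i], clusters[j], constraints),
                       -similarity(clusters[i], clusters[j]))
                if best is None or key < best[0]:
                    best = (key, i, j)
        _, i, j = best
        merged = clusters[i] | clusters[j]
        dendrogram.append((clusters[i], clusters[j], merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return dendrogram
```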

Please note that the proposed method is just one way to integrate constraints into the clustering. The strict enforcement of constraints, as done here, is based on the assumption that it is of high importance to follow user-given constraints. In a practical scenario, the user will expect to find the provided input reflected in the results. However, there might be other scenarios where strict enforcement is less important, especially if constraints are generated through an automated method and might contain errors. In this case, it might be useful to define a trade-off between constraint enforcement and item similarity. However, this is not analyzed further in this paper.

If constraints do not directly conflict with each other, there always exists a solution with a complete dendrogram that does not violate any constraints. If the constraints were created from a given class hierarchy, they are always non-conflicting. Previous work on must-link and cannot-link constraints by Davidson and Ravi (2005a, 2005b, 2007a, 2009) indicates, however, that a solution with no constraint violations might not be found by the constrained clustering approach even though it exists. This is referred to as the feasibility problem. For HAC, this would mean that the clustering stops in a dead end before reaching the root node, i.e., any merge possible at this moment would violate a constraint. This is referred to as irreducibility (cf. Sect. 2.2).

Considering MLB constraints, irreducibility can also occur. An example is given in Fig. 9. Here, six data objects shall be clustered, given the constraints (d_x, d_y, d_z) and (d_u, d_v, d_w). The top row shows an iteration of the algorithm that reaches a dead end with three clusters. Nonetheless, several solutions with a complete dendrogram exist. One is depicted at the bottom of Fig. 9. The example shows that irreducibility is possible, but how severe is this problem for MLB constraints? So far, no specific experiments were performed to further study this question. However, in all experiments we have performed with iHAC up to this point, a dead end never occurred. It is our hypothesis that the problem of irreducibility is less significant for MLB constraints than for must-link and cannot-link constraints, as a higher number of “unfavorable” merges is required to produce a dead end.

Fig. 9

Example of irreducibility in iHAC under the constraints (d_x, d_y, d_z) and (d_u, d_v, d_w); (a)–(d) iterations of iHAC ending in a dead end; (e) a solution with a complete dendrogram; please note that the item order changes for a better visualization of the dendrogram

Nevertheless, a dead end is possible. As we judge the overall improvement of the clustering to be more important than the strict enforcement of all constraints, we weakened the condition of the merging step to choose the merge with the fewest constraint violations. This ensures that a complete dendrogram is built even in the case of irreducibility.

4.2 An efficient data structure

A critical issue for the overall runtime complexity of iHAC is whether the two clusters to be merged in each iteration can be determined and merged efficiently. Clearly, the straightforward approach of counting in each iteration how many constraints are violated for all data object pairs is very time consuming. Instead, the following data structure was developed, which is inspired by the similarity matrix of standard HAC. However, instead of pure similarity values, the constrained matrix CM includes more complex data. Each matrix entry cm_{i,j} consists of the similarity value sim_{i,j} and a set cv_{i,j} for storing the constraints that would be violated by a merge of the two clusters. For this purpose, constraints between data objects are transformed to constraints between clusters. As HAC starts with clusters containing only a single data object, the initial constraints can be expressed directly by replacing data objects with clusters. During the creation of the dendrogram, only the currently highest dendrogram level is of interest. Therefore, constraints are passed on to clusters on the higher levels as long as they are relevant. This process is described below in more detail.

On that account, the set cv_{i,j} contains all constraints that have the two clusters c_i and c_j at the first and third position (i.e., both MLB_{ixj} and MLB_{jxi}). These two kinds of constraints prevent the algorithm from merging the clusters c_i and c_j, as this would violate the constraints. Rather, the merge between the cluster at the first position (either c_i or c_j) and the cluster at the second position (a third, different cluster c_x) has to be performed first by the algorithm. Please note that we can omit the exact item order in the constraint and create a symmetric matrix. For the algorithm, we only need to know how many constraints are violated by a certain cluster merge, not the specific constraints. To be able to detect which constraints are already resolved and therefore no longer active, we need to know the involved third cluster c_x. This is explained later. Therefore, we store in the set cv_{i,j} the clusters c_x of all violated constraints together with the number of constraints associated with these clusters. The latter is necessary as there can be several constraints between the same set of clusters, especially on higher dendrogram levels. As the number of violated constraints is important for the merging criterion, it is stored.

Summarizing, the constrained matrix CM can be described as in (3) below. It is furthermore important to note that it is not necessary to compute the complete matrix. The upper triangular matrix is sufficient because the constrained matrix is symmetric. Additionally, it is beneficial to have an index vector cmi that points to the best entry of each row, as shown on the right of (3) (Day and Edelsbrunner 1984; Cathey et al. 2007). For standard HAC that would be the entry of largest similarity. For iHAC, it is the entry with the fewest constraint violations and, in the case of ties, the highest similarity. As shown by Day and Edelsbrunner (1984), time complexity can be further reduced if priority queues are used for each row of the matrix, however at the cost of increased memory complexity. In the implementation for this paper, the index vector cmi was used. The constrained matrix CM looks as follows:

$$ \mathit{CM} = \left( \begin{array}{c@{\quad}c@{\quad}c@{\quad}c@{\quad}c} - & \mathit{cm}_{1,2} & \cdots& \cdots& \mathit{cm}_{1,n}\\ - & - & \ddots& & \vdots\\ \vdots& & \ddots& \ddots& \vdots\\ \vdots& & & - & \mathit{cm}_{n-1,n}\\ - & \cdots& \cdots& - & - \end{array} \right), \qquad\mathit{cmi} = \left( \begin{array}{c} \mathit{cmi}_1\\ \\ \vdots\\ \\ \mathit{cmi}_n \end{array} \right) $$
(3)

with

$$ \mathit{cm}_{i,j} = (\mathit{sim}_{i,j}, \mathit{cv}_{i,j}) $$
(4)

and

$$ \mathit{cv}_{i,j} = \bigl\{(c_x,n_x) | (\exists\mathit{MLB}_{i x j} \vee \exists\mathit{MLB}_{j x i}) \wedge n_x = \operatorname{count}(\mathit {MLB}_{i x j}) + \operatorname{count}(\mathit{MLB}_{j x i}) \bigr\}. $$
(5)

With the constrained matrix CM, the two clusters for the next cluster merge can be efficiently determined with linear complexity (O(n)), because a single iteration over the index vector cmi is sufficient. Updating the matrix to execute the merge is, however, a little more difficult. It consists of two steps: deleting the columns and rows of the two old clusters c_{old1} and c_{old2}, and adding a row and column for the new cluster c_{new}.
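The following sketch illustrates one possible in-memory layout of the constrained matrix and the linear scan over the index vector; the entries follow (4) and (5), with cv represented as a dictionary that maps each blocking cluster c_x to its constraint count n_x. The names and the symmetric storage are illustrative simplifications.

```python
# Illustrative representation of the constrained matrix (entries stored
# symmetrically here for brevity, although the upper triangle suffices):
#   cm[i][j] = (sim_ij, cv_ij)   with   cv_ij = {c_x: n_x, ...}
# cmi[i] holds, for row i, the column index of the best entry of that row.

def entry_key(entry):
    sim, cv = entry
    # fewer violated constraints first, then higher similarity
    return (sum(cv.values()), -sim)

def best_merge(cm, cmi, active_rows):
    """Single O(n) scan over the index vector to find the next pair to merge."""
    best_row = min(active_rows, key=lambda i: entry_key(cm[i][cmi[i]]))
    return best_row, cmi[best_row]
```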

For deleting a cluster from the matrix, the corresponding row must be deleted. Furthermore, all rows above the one to remove contain one entry concerning this cluster. The rows below do not include such an entry, because it would belong to the lower triangular matrix, which is not stored. Removing an entry from a row might require an update of the index vector cmi. In the worst case, an iteration over all elements of that row is needed to find the new best value. Figure 10 summarizes the algorithm for removing a cluster from the matrix. The time complexity of this algorithm is between O(1), if the first row is deleted, and O(n²), if the last row is deleted and all other rows require a re-computation of the index vector.

Fig. 10

An algorithm for removing a cluster from the constrained matrix in iHAC

For adding the merged cluster to the matrix, a new element is added to each row. This new element can be computed from the old values of the two former clusters. The similarity value is updated as in standard HAC. For the group-average linkage method (also called UPGMA), which we used in our implementation, this is done as shown in the algorithm in Fig. 11 (Manning and Schütze 1999, Chap. 14). The new set of constraint violations can basically be created by merging the two old sets. However, constraints that are no longer relevant (because the first two items were already clustered and the constraint can therefore not be violated anymore) are removed from the matrix. This is the case if the stored constraint clusters are descendant clusters of the new cluster, based on the part of the dendrogram that was already built. As the stored clusters in the cv values do not necessarily refer to the highest dendrogram level, it is necessary to update them. For this purpose, the stored clusters are replaced by their most general ancestor elements on the highest dendrogram level built so far. If the most general ancestor element is the newly formed cluster, we can remove this element from cv. Otherwise, we update cv by replacing the stored cluster with the most general ancestor element.

Fig. 11

An algorithm for adding a cluster to the constrained matrix in iHAC

The update of the index vector cmi is fairly easy, because it only changes if the new cluster has a better value than the current best one (in each row). The entire algorithm is shown in Fig. 11. This algorithm has a time complexity between O(n) and O(n²), depending on whether constraints are relevant for this merge or not. The worst case occurs if the considered clusters have constraints to all other remaining clusters on the currently highest dendrogram level.
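The add step for group-average (UPGMA) linkage, together with the merging of the constraint-violation sets, might be sketched as follows. The symmetric dict-of-dict matrix, the helper top_ancestor (the lookup of the most general ancestor on the current top dendrogram level), and all other names are assumptions for illustration; the update of the index vector cmi is omitted.

```python
def add_merged_cluster(cm, sizes, active, c_old1, c_old2, c_new, top_ancestor):
    """Sketch of the add step: build the row/column of c_new in the matrix.

    cm is a symmetric dict-of-dict constrained matrix with entries (sim, cv);
    sizes maps clusters to their number of data objects; active holds the
    remaining clusters; top_ancestor(c) returns the most general ancestor of c
    on the currently highest dendrogram level (assumed helper)."""
    n1, n2 = sizes[c_old1], sizes[c_old2]
    cm[c_new] = {}
    for c_k in active:
        sim1, cv1 = cm[c_old1][c_k]
        sim2, cv2 = cm[c_old2][c_k]
        # Group-average (UPGMA) linkage: size-weighted mean of the old similarities.
        sim_new = (n1 * sim1 + n2 * sim2) / (n1 + n2)
        # Merge the violation sets, lifting stored clusters to the current top
        # level and dropping constraints that are resolved by this merge.
        cv_new = {}
        for cv in (cv1, cv2):
            for c_x, n_x in cv.items():
                c_top = top_ancestor(c_x)
                if c_top != c_new:          # constraint still active
                    cv_new[c_top] = cv_new.get(c_top, 0) + n_x
        cm[c_new][c_k] = cm[c_k][c_new] = (sim_new, cv_new)
    sizes[c_new] = n1 + n2
```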

Based on the constrained matrix and the algorithms presented above, the overall iHAC implementation has a time complexity between O(n²) and O(n³). This is identical to the time complexity of the standard HAC implementation (Day and Edelsbrunner 1984). Hence, with the constrained matrix, a time complexity very similar to the baseline given by HAC can be achieved. However, for an increased number of constraints, an increase in runtime can be expected on average. Section 5.4 further compares runtime complexity with experimental measurements.

5 Evaluation

This section presents the experiments that were performed to evaluate the iHAC method. First, we discuss measures for evaluation in hierarchical clustering and describe the selected measures. Then, we describe the experimental setup, introducing the evaluation data used and experiment preparation. The following sections present the experimental results in terms of clustering quality and runtime.

5.1 Hierarchy evaluation measures

Evaluation of clustering methods can be divided into two different approaches: evaluation based on internal measures and evaluation based on external measures. Internal measures aim at evaluating a clustering solely based on the dataset itself and the clustering solution. These measures are usually based on a comparison between intra-cluster and inter-cluster similarity. Items assigned to the same cluster should be as similar as possible, while items assigned to different clusters should be as dissimilar as possible. In contrast to this, external measures use a given external structure that describes the underlying cluster structure in the data that shall be recovered by the clustering method. The clustering result is compared to this gold standard to measure how well the method was capable of identifying the known structure. Evaluation with external measures therefore requires benchmark datasets, as for classification. Recent discussions about evaluation measures for clustering approaches can be found in Borgelt (2005), Halkidi et al. (2002a, 2002b), Amigó et al. (2009).

Here, we use external evaluation measures, as the semi-supervised character of the approach implies that a certain structuring exists in the data and shall be uncovered. Hints on this structure are provided in the form of constraints or labeled data. Therefore, it is reasonable to measure how well this structure was uncovered. The “hidden” structure is the class structure of the benchmark dataset, on whose basis the constraints are also generated. Although several measures for comparing flat partitions exist, measures for the evaluation of hierarchies are rare. In the following, we present the measures we used and review alternatives.

5.1.1 Hierarchically applied F measure

The F measure (van Rijsbergen 1979) is a widely applied evaluation measure for classification and external clustering evaluation (Sebastiani 2002). This class-specific measure is a trade-off between precision and recall. The precision of a class c is the fraction of all items predicted to belong to c that actually belong to c. In contrast, the recall of c is the fraction of all items belonging to this class that are also predicted as such.

To be able to apply the F measure to clustering, a mapping between the created clusters and the classes in the dataset is necessary. There are two ways to obtain such a mapping: either a class is assigned to every cluster or a cluster is assigned to every class. The first case allows a class to be represented by multiple clusters. However, the goal should be that a class can be represented by a single cluster in the hierarchy. Therefore, we applied the second method, i.e., for each class in the reference hierarchy, a cluster is selected from the dendrogram, based on which the F-score of this class is computed. Specifically, from all clusters in the dendrogram the one with the highest F-score on the class at hand is chosen. For higher level (i.e., non-leaf) classes, all items contained in subclasses are also counted as belonging to this class. Please note that this simple procedure might select clusters that are inconsistent with the reference hierarchy, or select the same cluster for several classes in the case of noisy clusters. Determining the optimal, hierarchy-consistent selection has a much higher time complexity. However, the results of the simple procedure are usually sufficient for evaluation purposes, and little is gained from enforcing hierarchy consistency.
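A compact sketch of the described selection procedure is given below; dendrogram_clusters is assumed to contain every cluster of the dendrogram as a set of item identifiers, and class_items to map each class of the reference hierarchy to the set of items belonging to it or to any of its subclasses. All names are illustrative.

```python
def f_score(cluster, target):
    """F measure of one cluster (set of items) against a target class item set."""
    tp = len(cluster & target)
    if tp == 0:
        return 0.0
    precision = tp / len(cluster)
    recall = tp / len(target)
    return 2 * precision * recall / (precision + recall)

def hierarchical_f(dendrogram_clusters, class_items):
    """Select, for every class, the dendrogram cluster with the highest F-score."""
    return {cls: max(f_score(c, items) for c in dendrogram_clusters)
            for cls, items in class_items.items()}
```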

In our evaluation, we computed the F-score only on the unlabeled data, i.e., the data not used for constraint generation. This is done to estimate the gain on new data and to avoid optimistically biasing the measure. As the F measure is a class-specific value, average values are computed for overall hierarchy comparison. Here, we distinguished between the average over all leaf node classes and the average over all non-leaf node classes to evaluate the effects on different levels of the hierarchy.

Applying the F measure as described potentially evaluates only a part of the cluster hierarchy because it considers only the best cluster per class. Some branches of the learned dendrogram might not be part of any selected cluster. As an example, consider the two hierarchies in Fig. 12. Two classes exist in the reference hierarchy (which is not depicted). Both hierarchies in the figure are possible solutions of a clustering algorithm. The left branch of both hierarchies is identical. The bold circled clusters were selected for F-Score computation. Therefore, the F-Score of both hierarchies is identical. The right branch of both hierarchies is not considered in this computation. Nevertheless, the two hierarchies differ substantially in this part. While the right branch of the left hierarchy is very noisy, the right branch of the right hierarchy again nicely splits both classes. That is why the right hierarchy should be considered as structuring the data better. From a user’s perspective, it might be sufficient to focus on the best cluster for each class, as a user would focus on a single cluster that fits his or her information need anyway; and the better this single cluster, the better for the user. For comparing the performance of different algorithms, it is nonetheless interesting to evaluate the complete hierarchy to better grasp the changes to the structuring caused by different methods. For this, a different evaluation measure is needed.

Fig. 12 Example for F-Score in hierarchical clustering: both hierarchies obtain the same F-Score based on the bold, selected clusters for the two classes, although the right, non-selected branch differs

5.1.2 H-correlation

A measure that can compare two entire hierarchies is H-Correlation (Bade and Benz 2010). Similar to the Rand index (Rand 1971), which compares cluster partitions, the clustering is evaluated based on item assignments without requiring a node mapping between the hierarchies. The Rand index considers all item pairs. Two cluster partitions cluster a pair of instances identically if both partitions either assign the two instances to the same cluster or assign them to different clusters. The Rand index is the fraction of all item pairs that are identically clustered by both partitions. Instead of item pairs, H-Correlation considers item triples. More specifically, H-Correlation measures the overlap of the MLB constraint sets that can be determined from both hierarchies. Based on these two sets, both a symmetric and an asymmetric measure can be defined. In its simplest form, H-Correlation measures the overlap of the two sets. This can be extended by weighting individual triples differently. The symmetric H-Correlation \(H_{s}\) between the learned hierarchy \(\mathcal{H}_{l}\) and the reference hierarchy \(\mathcal{H}_{r}\) from the dataset is defined as:

$$ H_s(\mathcal{H}_l, \mathcal{H}_r) = \frac{\sum_{\tau\in\mathit{MLB}_l \cap\mathit{MLB}_r}{(w_l(\tau)+w_r(\tau))}}{\sum_{\tau\in \mathit{MLB}_l}{w_l(\tau)} + \sum_{\tau\in\mathit{MLB}_r}{w_r(\tau)}}, $$
(6)

whereby \(\mathit{MLB}_{l}\) and \(\mathit{MLB}_{r}\) are the sets of all must-link-before triples in the learned and the reference hierarchy, respectively, and \(w_{l}\) and \(w_{r}\) assign a weight to a triple depending on the learned and the reference hierarchy, respectively.

Furthermore, an asymmetric H-Correlation \(H_{a}\) is of interest if one hierarchy is allowed to be more detailed than the other. As a dendrogram is always a more detailed hierarchy than a given class hierarchy, this measure is especially relevant for our evaluation. In fact, we want to measure how much of the hierarchical relationship given through a hierarchy is reflected in the dendrogram, while neglecting that the dendrogram contains many more hierarchical relationships, as it contains all pairwise merges performed by the clustering. Formally, a hierarchy \(\mathcal{H}_{l} = (C_{l},\prec_{l},\mathit{root}_{l})\) is more detailed than a hierarchy \(\mathcal{H}_{r} = (C_{r},\prec_{r},\mathit{root}_{r})\), if:

$$ C_l \supset C_r \wedge(\forall c_i, c_j \in C_r : c_i \prec_r c_j \rightarrow c_i <_l c_j) \wedge\mathit{root}_l \geq_l \mathit{root}_r, $$
(7)

where ≺ refers to the direct parent–child relation and < refers to the ancestor relation. The asymmetric H-Correlation \(H_{a}\) then measures how many must-link-before triples from the reference hierarchy \(\mathcal{H}_{r}\) can also be found in the learned hierarchy \(\mathcal{H}_{l}\):

$$ H_a(\mathcal{H}_l, \mathcal{H}_r) = \frac{\sum_{\tau\in\mathit{MLB}_l \cap\mathit{MLB}_r}{w_r(\tau)}}{\sum_{\tau\in\mathit{MLB}_r}{w_r(\tau)}}. $$
(8)

The simplest triple weighting function w gives equal weight to all triples. However, there are far more triples describing relations in the upper levels of the hierarchy than triples describing more specific relations. This simple weighting therefore makes the value of H-Correlation strongly dependent on whether the top-level separations were done correctly. The following triple weight gives equal weight to each hierarchy node:

$$ w\bigl((d_1,d_2,d_3)\bigr) = \frac{1}{|\{ (d'_1,d'_2,d'_3) \mid \mathit{ca}(d_1,d_3)=\mathit{ca}(d'_1,d'_3) \}|} $$
(9)

where ca(a,b) denotes the most specific common ancestor node of a and b in the hierarchy. Through this weighting, the impact of every triple on the overall value is normalized based on the most general node passed by the triple, which is the common ancestor of its first and third item. An experimental evaluation of the behavior of H-Correlation and the different weightings can be found in the paper by Bade and Benz (2010). Based on these findings, we use the asymmetric H-Correlation with the triple weighting in (9) in this paper.
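For concreteness, the following sketch computes the asymmetric H-Correlation of (8) with the weighting of (9). The representation is an assumption made for illustration, not the authors' implementation: each hierarchy is given as a dict mapping (comparable) item ids to their root-to-leaf path of node ids, both dicts cover the same items, and the triples are enumerated exhaustively, which is cubic in the number of items and only meant for small examples.

```python
from collections import Counter
from itertools import permutations

def common_ancestor(paths, a, b):
    """Most specific common ancestor of items a and b, as a node path prefix."""
    pa, pb = paths[a], paths[b]
    k = 0
    while k < min(len(pa), len(pb)) and pa[k] == pb[k]:
        k += 1
    return tuple(pa[:k])

def mlb_triples(paths):
    """All MLB triples (d1, d2, d3), with the pair (d1, d2) in canonical order."""
    triples = set()
    for d1, d2, d3 in permutations(paths, 3):
        # d1 and d2 must be merged before d3 joins: their common ancestor is
        # strictly deeper than the common ancestor with d3.
        if d1 < d2 and len(common_ancestor(paths, d1, d2)) > len(common_ancestor(paths, d1, d3)):
            triples.add((d1, d2, d3))
    return triples

def h_asymmetric(paths_learned, paths_reference):
    mlb_l, mlb_r = mlb_triples(paths_learned), mlb_triples(paths_reference)
    # Weighting (9): one over the number of reference triples sharing the
    # common ancestor of the first and third item.
    counts = Counter(common_ancestor(paths_reference, t[0], t[2]) for t in mlb_r)
    weight = lambda t: 1.0 / counts[common_ancestor(paths_reference, t[0], t[2])]
    return sum(weight(t) for t in mlb_l & mlb_r) / sum(weight(t) for t in mlb_r)
```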

5.1.3 Alternative measures for hierarchical comparison

Bade and Benz (2010) give a recent review of existing measures for comparing hierarchies from the field of clustering as well as ontology engineering. They show that hierarchies can be compared along different dimensions, i.e., the hierarchical structure, the assignment of items to this structure, and the assignment of labels (i.e., cluster names) to this structure. Measures that depend on cluster labels are not applicable to the evaluation of clustering methods because label generation is usually not part of the clustering. However, label-based measures are predominant in ontology engineering. An exception is the OntoRand index proposed by Brank et al. (2006) and its extended version proposed by Bade and Benz (2010). This measure is also based on the Rand index (Rand 1971). The OntoRand index softens the decision between identically and differently clustered items by using a distance function between clusters for a gradual comparison, taking into account how far apart two items are placed in the hierarchy. Although this measure is in principle an alternative for hierarchy evaluation, it is not suited for our evaluation as it only allows for a symmetric hierarchy comparison. We need an asymmetric measure, as we compare a more detailed dendrogram with a given class hierarchy.

More recently, Amigó et al. (2009) proposed the extended BCubed measure, which aims at evaluating overlapping partitions in general, with hierarchies as a special case. The original BCubed measure was defined by Bagga and Baldwin (1998) as an algorithm and presented as a measure by Amigó et al. (2009). It also uses item pairs but focuses on those that were placed in the same cluster in at least one of the two partitions. It consists of a precision and a recall component. The precision is the fraction of item pairs assigned to the same cluster that are also in the same cluster in the reference hierarchy. Recall measures the opposite direction: the fraction of item pairs assigned to the same cluster in the reference hierarchy that are also clustered together by the clustering algorithm. The extension for multiple class assignments then replaces the comparison of individual clusters by a comparison of the sets of clusters in which an item pair occurs. Again, this measure compares hierarchies symmetrically and is therefore less suited for our evaluation.
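A minimal sketch of the pair-based precision/recall view described above, restricted to the flat, single-assignment case, may clarify the idea; the extension of Amigó et al. (2009) to sets of clusters per item is not reproduced here.

```python
from itertools import combinations

def pair_precision_recall(clustering, reference):
    """clustering, reference: dicts mapping item id -> cluster/class id."""
    items = sorted(set(clustering) & set(reference))
    same_c = {p for p in combinations(items, 2) if clustering[p[0]] == clustering[p[1]]}
    same_r = {p for p in combinations(items, 2) if reference[p[0]] == reference[p[1]]}
    both = len(same_c & same_r)
    precision = both / len(same_c) if same_c else 1.0
    recall = both / len(same_r) if same_r else 1.0
    return precision, recall
```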

5.2 Experimental setup

Evaluation was done with three datasets shown in Tables 1, 2, and 3. All consist of text documents structured in a class tree. These datasets are subsets of the publicly available banksearch dataset, consisting of websites, and the Reuters corpus volume 1, consisting of news articles. Tables 1–3 show the class structure as well as the number of documents directly assigned to each class. The first dataset uses the complete structure of the banksearch dataset but only the first 100 documents per class. For the Reuters 1 dataset, we selected classes and subclasses that seemed rather distinguishable, i.e., we manually picked rather distinct topics from different parts of the Reuters hierarchy. In contrast, the Reuters 2 dataset contains classes that are more alike, which was achieved by picking only topics from one branch of the Reuters hierarchy. We randomly sampled at most 100 documents per class; classes with fewer documents in the final dataset simply contained fewer than 100 documents in the source corpus.

Table 1 Banksearch dataset
Table 2 Reuters 1 dataset
Table 3 Reuters 2 dataset

All documents were preprocessed to obtain a tf×idf document vector representation (Salton and Buckley 1988). For feature selection, we removed all terms that occurred fewer than 5 times, were shorter than 3 characters, or contained numbers. From the remaining terms, we selected 5000 in an unsupervised manner as described by Borgelt and Nürnberger (2004). To determine this number, we conducted a preliminary evaluation, which showed that reducing the feature space to this size has only a small impact on the initial clustering performance, while a stronger reduction decreases performance.
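A rough preprocessing sketch along these lines could look as follows, assuming scikit-learn; the authors' own pipeline and the unsupervised term selection of Borgelt and Nürnberger (2004) are not reproduced, so `min_df` and `max_features` only approximate the described frequency cut-off and the selection of 5000 terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    token_pattern=r"(?u)\b[^\W\d_]{3,}\b",  # terms of at least 3 letters, no digits
    min_df=5,                               # drop very rare terms (by document frequency)
    max_features=5000,                      # keep 5000 terms (here by corpus frequency)
)
# `documents` is assumed to be a list of raw document strings:
# X = vectorizer.fit_transform(documents)   # sparse tf-idf document vectors
```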

As previous work on constrained clustering with pairwise constraints is not applicable to cluster hierarchies, we could not compare these methods with ours. Therefore, our experiments compare iHAC with the standard, unsupervised hierarchical agglomerative clustering (HAC) algorithm to show the effectiveness of the constraints. Hence, we show the difference between unsupervised and semi-supervised hierarchical clustering. Both HAC implementations use the group-average linkage method (UPGMA) as cluster similarity measure and the cosine similarity to determine document similarities (Manning and Schütze 1999, Chap. 14).
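The unconstrained baseline can be sketched, for instance, with SciPy (an assumption standing in for the authors' own implementation; iHAC's additional filtering of merges against the MLB constraint matrix is not part of this sketch):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def hac_average_cosine(X):
    """Group-average (UPGMA) agglomerative clustering on cosine distances.

    X: dense (n_documents, n_features) array of tf-idf vectors.
    Returns the SciPy linkage matrix encoding the dendrogram.
    """
    distances = pdist(np.asarray(X, dtype=float), metric="cosine")
    return linkage(distances, method="average")
```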

As mentioned earlier, we follow an evaluation approach with an external measure, as is usually applied for the evaluation of supervised methods. This is reasonable because the goal of our semi-supervised clustering method is to uncover a specific cluster hierarchy. In the motivating example from the beginning, this is the hierarchy intended by the user. In our evaluation, these are the hierarchies given through the datasets, which are presumed to reflect the view of a certain user. To fulfill the given clustering task, the algorithm must derive a dendrogram that reflects the underlying hierarchy as closely as possible.

The standard procedure for evaluation in such a setting is to split the data into a training and a test set. While the training set is used for learning a model (or, in our case, to define the prior knowledge on the target structure), the test set is used to evaluate the effectiveness of the algorithm (in our case, the clustering of this data in relation to each other and to the training data). We followed this approach as well. The training data corresponds to the current user hierarchy as on the left side of Fig. 1. The test data is the collection of new, yet unclustered documents as on the right side of Fig. 1. The overall class hierarchy of the complete dataset describes the final user hierarchy, i.e., the hierarchy the user would arrive at if he organized all data on his own. Hence, this is the structure the algorithm shall derive.

Both methods (HAC and iHAC) were evaluated with five different samples of training data (i.e., sets of constraints) for each specific setup. The samples were drawn at random with a uniform distribution. The same samples were used for both algorithms to allow a fair comparison. Mean and variance of the results over these five runs were computed. In the following diagrams, only the mean is shown for the sake of clarity; the variance was usually very small anyway. We sampled training data and generated all possible MLB constraints from it as described in Sect. 3, which were then used to constrain the clustering. Different numbers of labeled items per class were sampled. Specific attention was paid to small numbers of labeled data (ranging from 5 to 30 items) because, from a practical point of view, it is much more likely that labeled data is rare. Furthermore, different settings were created to reflect different distributions of constraints. Setting (1) assumes evenly distributed constraints: labeled data was sampled from all classes. Hence, the hierarchy is already completely represented in the training data and new data always belongs to topics already known to the user. However, as discussed in the beginning of this article, it is much more likely that new data also contains new topics. Therefore, two more settings were evaluated for every dataset, with parts of the hierarchy being unknown. In these settings, either a single leaf node class (setting (2)) or a whole subtree (setting (3)) of the hierarchy was not used to sample labeled data. Through this, the constraints are unevenly distributed. These scenarios are even more challenging than the first one.
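The three sampling settings can be sketched as follows; class names and the excluded subtree are placeholders, not the actual benchmark classes, and the MLB constraints would subsequently be generated from the sampled items and their class paths (cf. Sect. 3).

```python
import random

def sample_labeled(class_items, k, excluded=frozenset()):
    """Draw up to k labeled items per class, skipping the excluded classes.

    class_items: dict mapping class id -> list of item ids.
    excluded: classes left without labeled data; empty for setting (1), a
              single leaf class for setting (2), all classes of a subtree
              for setting (3).
    """
    return {c: random.sample(items, min(k, len(items)))
            for c, items in class_items.items() if c not in excluded}

# setting (1): sample_labeled(class_items, k=5)
# setting (2): sample_labeled(class_items, k=5, excluded={"some_leaf_class"})
# setting (3): sample_labeled(class_items, k=5, excluded=set(subtree_classes))
```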

5.3 Results: clustering quality

As discussed in Sect. 5.1, clustering quality was measured with H-Correlation and F-Score. The results are shown in Figs. 13 and 14. First, we discuss setting (1), where constraints were sampled equally from all classes (rows 1 and 4 in Fig. 13 and row 1 in Fig. 14). The results show that iHAC can successfully integrate domain knowledge into the hierarchical clustering, improving its performance towards a specific reference structure. Looking at the overall clustering performance as measured with H-Correlation, it can be observed that cluster quality increases with an increasing number of constraints. The gain achieved varies with the dataset: while constraints strongly improve the clustering quality of the Banksearch and the Reuters 2 dataset, the Reuters 1 data rarely benefits from them. This indicates that the nature of the data and especially its feature representation influences the effectiveness of the constraints. In fact, the features of the Reuters 1 dataset cannot explain the top-level class separation, as the documents in the different subclasses have hardly anything in common (from a bag-of-words point of view). Hence, constraints cannot compensate for an inappropriate feature set.

Fig. 13 Results for the Banksearch and the Reuters 1 data set

Fig. 14 Results for the Reuters 2 data set

Looking at the different hierarchy levels separately with the F measure, a different behavior can be observed. iHAC is very successful on the higher levels because it causes the larger, already merged subclusters to be grouped together correctly at these levels (where the unconstrained clustering would have grouped them incorrectly). A few constraints are sufficient to achieve this because the subclusters are already built; hence, an increasing number of constraints does not bring further benefit. In contrast, performance on the leaf level increases steadily with the number of constraints.

After having evaluated the case of evenly distributed constraints, let us now analyze the impact of an increasing imbalance in the available constraints, e.g., as caused by a partially known hierarchy. These results are shown in rows 2 and 5 of Fig. 13 and row 2 of Fig. 14 for setting (2), and in rows 3 and 6 of Fig. 13 and row 3 of Fig. 14 for setting (3). In general, the results with unevenly distributed constraints are similar to those with evenly distributed constraints in terms of the differences between the datasets, the number of labeled items, and the sensitivity to the hierarchy level. The overall performance gain is usually a bit smaller, which is reasonable because unknown classes also mean fewer constraints.

The most notable difference can be found in the F-Score of the higher hierarchy levels. As more parts of the hierarchy are unknown, performance on the higher hierarchy levels can drop strongly, even below the baseline given by the unconstrained HAC algorithm. Furthermore, this problem becomes more severe with more constraints. The reason lies in the erroneous clustering of some instances by the HAC algorithm. If even a single item (known to belong to a certain class) is falsely merged into a cluster representing an unknown class, the resulting cluster is subsequently merged according to this single item and not according to the dominating but unknown class. As a consequence, a whole subcluster can be merged into the wrong subtree, which drastically reduces its F-Score.

5.4 Runtime performance

Besides clustering quality, runtime and scalability are usually important criteria for the applicability of a method. As noted before, hierarchical agglomerative clustering has a rather high computational complexity. Our implementation has a complexity between \(O(n^2)\) in the best case and \(O(n^3)\) in the worst case, the latter being very unlikely (cf. Sect. 4.2). The same bounds hold for iHAC. This section shows the differences in average runtime through actual time measurements to discuss the impact of integrating the constraints.

Figure 15 shows runtime measurements for the clustering experiments on the Reuters 2 dataset. The measurements were done on a PC with an Intel Core 2 Duo E6750 CPU at 2.66 GHz, 4 GB RAM, a Windows XP 32-bit operating system, and a Java 6 virtual machine. Note, however, that the experiments were not run under completely identical conditions; most importantly, the workload of the PC varied. The displayed numbers may therefore vary to some (rather small) extent. Nevertheless, the tendency of the runtime behavior is clearly visible.

Fig. 15 Measured runtime of the clustering (in minutes)

The average runtime of iHAC increases with an increasing number of constraints. This is expected because iHAC requires additional runtime to keep track of the merges forbidden by constraint violations. The constraint matrix used (cf. Sect. 4.2) can determine the next (allowed and best) cluster merge as efficiently as the HAC implementation itself. However, this comes at the price that updating it after each merge may, but does not necessarily, require additional effort. The more constraints are involved, the more time this update takes, as can be seen from the time measurements. Nevertheless, the additional effort is very small in comparison to the number of constraints involved. Keep in mind that a linear increase in the amount of labeled data causes a roughly cubic increase in the number of constraints, since MLB constraints are triples of items. As an example, consider the hierarchy of the Reuters 1 dataset: five labeled items per class generate 37,200 constraints, which increases to 307,800 constraints for ten labeled items per class. Summing up, the developed iHAC algorithm allows for a quite efficient integration of constraints into the clustering process.

6 Conclusion

In this paper, we presented MLB constraints, which are intended to constrain the formation of a cluster hierarchy through prior knowledge on the target hierarchy. We discussed the limitations of typical pairwise constraints for this problem and formally introduced MLB constraints and their properties. In the second part of the paper, we focused on an instance-based constrained clustering approach to integrate these constraints into hierarchical agglomerative clustering. While the approach itself is rather simple, we focused on efficiency, which is very important for handling the large constraint sets typical in the hierarchical scenario. We evaluated this approach in terms of both clustering quality and runtime performance. On three different datasets, we showed that MLB constraints can successfully increase the clustering quality with the instance-based approach iHAC, while requiring only little extra runtime. However, if parts of the target hierarchy are unknown during clustering and not represented by the constraint set, small clustering errors can be amplified by the constrained clustering. To tackle this problem, we suggest the development of metric-based methods based on MLB constraints. Some ideas for such approaches can be found in our previous work on the topic (Bade and Nürnberger 2006, 2008, 2009). These approaches are not the focus of this paper. However, it should be mentioned that combining both types of approaches has proven to be most beneficial. Having an efficient instance-based method as presented in this paper is therefore a prerequisite for further experiments on combined methods.