1 Introduction

Real-world datasets often contain complex and a priori unknown patterns and structures, which require improving the basic representation. Kernel methods are commonly used for this purpose (Hofmann et al. 2008; Shawe-Taylor and Cristianini 2004). However, their applicability is limited in several respects (von Luxburg 2007; Nadler and Galun 2007; Chehreghani 2017b). (1) Finding the optimal parameter(s) of a kernel function is often nontrivial, in particular in an unsupervised learning task such as clustering where no labeled data is available for cross-validation. (2) The proper values of the parameters usually lie inside a very narrow range, which makes cross-validation difficult even in the presence of labeled data.

To overcome such challenges, several graph-based distance measures have been developed in the context of algorithmic graph theory. In this setup, each object corresponds to a node in a graph, and the edge weights are the pairwise (e.g., squared Euclidean) distances between the respective objects (nodes). Different methods then perform different types of inference on the graph to compute an effective distance measure between pairs of objects. Link-based methods (Chebotarev 2011; Yen et al. 2008) first sum the edge weights along every path to compute path-specific distances. The final distance is then obtained by summing up the path-specific distances of all paths between the two nodes. This distance measure can be obtained by inverting the Laplacian of the base distance matrix and is related to the Markov diffusion kernel (Fouss et al. 2012; Yen et al. 2008). It requires an \({\mathcal {O}}(n^3)\) runtime, with n the number of objects.

The Minimax distance measure is an alternative that, for two objects, takes the minimum over all connecting paths of the largest edge weight (gap) on the path. Several previous works study the superior performance of Minimax distances compared to metric learning or link-based choices (Farnia and Tse 2016; Fischer et al. 2003; Chehreghani 2016b; Kim and Choi 2007, 2013; Kolar et al. 2011; Li et al. 2017). Minimax distances were first used for clustering problems in two ways, either as an input in the form of a pairwise distance matrix (Chang and Yeung 2008; Pavan and Pelillo 2007), or integrated with specific clustering algorithms (Fischer and Buhmann 2003). The straightforward approach to compute the pairwise Minimax distances is to use an adapted variant of the Floyd–Warshall algorithm, whose runtime is \({\mathcal {O}}(n^3)\) (Aho and Hopcroft 1974). However, the method in Fischer and Buhmann (2003) is computationally even more demanding, as its runtime is \({\mathcal {O}}(n^2|E|+n^3\log n)\) (|E| is the number of edges in the graph). Based on the equivalence of Minimax distances over a graph and over any minimum spanning tree constructed on it, Chehreghani (2017b, 2020) propose to first compute a minimum spanning tree (e.g., using Prim’s algorithm) and then obtain the Minimax distances over it via an efficient dynamic programming algorithm. The runtime of computing pairwise Minimax distances then reduces to \({\mathcal {O}}(n^2)\). Chehreghani (2017d) analyzes computing pairwise Minimax distances in different sparse and dense settings. Zhong et al. (2015) develop an approximate minimum spanning tree algorithm and investigate it for efficient computation of pairwise Minimax distances. Yu et al. (2014) and Liu and Zhang (2019) combine Minimax distances with specific clustering methods in closed-form ways.
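As an illustration of the minimum-spanning-tree route, the sketch below computes pairwise Minimax distances by processing the edges of a minimum spanning tree in increasing weight order and merging components; when two components merge through an edge of weight w, then w is the Minimax distance for every pair split across them. This is a minimal sketch of the MST-based idea, not the dynamic programming algorithm of the cited works; the SciPy usage and function name are our own choices.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def minimax_distances(points):
    """Pairwise Minimax distances via a minimum spanning tree (Kruskal-style merging)."""
    d = squareform(pdist(points, metric='sqeuclidean'))   # base pairwise distances
    mst = minimum_spanning_tree(d).tocoo()                # MST of the complete graph
    order = np.argsort(mst.data)                          # MST edges by increasing weight
    n = d.shape[0]
    comp = [{i} for i in range(n)]                        # current components
    label = list(range(n))                                # object -> component index
    mm = np.zeros((n, n))
    for e in order:
        i, j, w = mst.row[e], mst.col[e], mst.data[e]
        ci, cj = label[i], label[j]
        for a in comp[ci]:                                # w is the Minimax distance for every
            for b in comp[cj]:                            # pair split across the two components
                mm[a, b] = mm[b, a] = w
        comp[ci] |= comp[cj]
        for b in comp[cj]:
            label[b] = ci
        comp[cj] = set()
    return mm
```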

Minimax distances have also been used for K-nearest neighbor search (Kim and Choi 2007, 2013; Chehreghani 2016b). Kim and Choi (2007) present a message passing method related to the sum–product algorithm (Kschischang et al. 2006) to perform K-nearest neighbor classification with Minimax distances. Even though its runtime is \({\mathcal {O}}(n)\), it requires computing a minimum spanning tree (MST) in advance, which can take \({\mathcal {O}}(n^2)\) time. The later algorithm in Kim and Choi (2013) computes the Minimax K nearest neighbors via space partitioning, whose runtime is \({\mathcal {O}}(\log n + K\log K)\). However, it is applicable only to sparse graphs built in Euclidean spaces. Finally, Chehreghani (2016b) has proposed an efficient Minimax K-nearest neighbor search method applicable to general graphs and dissimilarities. Its runtime, similar to standard K-nearest neighbor search, is linear in general. Moreover, the method provides an outlier detection mechanism alongside the K-nearest neighbor search, all with a linear runtime. The work in Chehreghani (2017a) investigates Minimax K-nearest neighbor search for completing a matrix of user profiles.

Besides Minimax distances, another related line of research has been developed in the context of tree preserving embedding (Shieh et al. 2011a, b), where the goal is to compute an embedding that preserves the single linkage dendrogram.

Both Minimax distances and tree preserving embedding correspond to computing a set of features that represent a single linkage dendrogram. This limitation motivates us to extend the previous work on representation learning and feature extraction based on the single linkage criterion and to develop a generalized framework that computes different distance measures from various dendrograms. In our framework, the dendrogram, i.e., the way the inter-cluster distances (called linkages) are defined, can be constructed according to different criteria. The single linkage criterion (Sneath 1957) defines the linkage as the distance between the nearest members of the two nodes. In contrast, the complete linkage criterion (Sorensen 1948; Lance and Williams 1967) defines the distance between two nodes as the distance between their farthest members, which corresponds to the maximum within-node distance of the new node. In the average linkage criterion (Sokal and Michener 1958), the average of the inter-node distances is used as the linkage between two nodes. The Ward method (Ward 1963) uses the distance between the means of the nodes, normalized by a function of the node sizes. Moseley and Wang (2017) analyze several such criteria in detail.

We study the embedding of the pairwise distances computed from a dendrogram into a new vector space, such that the squared Euclidean distances in the new space equal the dendrogram-based distances. This embedding enables us to employ dendrogram-based distances with a wide range of machine learning methods, and yields a rich family of alternative dendrogram-based distances, with Minimax distance measures and the tree preserving embedding of Shieh et al. (2011a, b) being only special instantiations.

We then encounter a model selection problem that asks for the choice of an appropriate distance measure (and dendrogram). Therefore, in the context of model averaging and ensemble methods, we first study the aggregation of the distance measures from different dendrograms in the solution space. Assuming, for example, that the different dendrogram-based distance measures are used for an unsupervised clustering task, we build a graph with positive and negative edge weights based on the (dis)agreement of the respective nodes among the different clustering solutions. Then, we employ an efficient variant of correlation clustering to obtain the final ensemble solution. Second, several recent studies demonstrate the superior performance of deep representation learning models that extract complex features by aggregating representations sequentially at different levels. Such models are highly over-parameterized and thus require huge amounts of training data to infer the parameters. However, unsupervised representation learning is expected to become far more important in the longer term, as human and animal learning is mainly unsupervised (LeCun et al. 2015). Thereby, with access to a wide range of alternative feature extraction models, we investigate the design of multi-layer deep architectures in an unsupervised manner (in representation space, instead of solution space) that does not require inferring or fixing any critical parameter. Specifically, we study the sequential aggregation of the dendrogram-based features, where, for example, the single linkage features are computed based on the features obtained from an average linkage dendrogram, instead of the original data features.

Our framework provides several options for choosing the dendrogram and the level function, where each option yields separate unsupervised representations and features. At the same time, we propose a principled way to aggregate and choose among the options (either in solution space or in representation space). The availability of such alternatives yields a rich family of unsupervised representation learning methods and is different from optimizing the free parameters of a kernel. We discuss this model selection aspect in more detail in the experiments section.

Finally, we experimentally validate the effectiveness of our framework on UCI and real-world datasets.

2 Feature extraction from dendrograms

In this section, we first introduce the setup for computing distance measures from dendrograms, and then, based on the relation between Minimax distances and single linkage agglomerative clustering, we propose a generalized approach to extract features from dendrograms.

2.1 Pairwise distances over dendrograms

We are given a dataset of n objects with indices \({\mathbf {O}}= \{1,\ldots ,n\}\) and the corresponding measurements. The measurements can be, for example, vectors in a feature space or the pairwise distances between the objects. In the former case, the measurements are represented by the \(n\times d\) matrix \({\mathbf {Y}}\), wherein the ith row (i.e., \({\mathbf {Y}}_i\)) specifies the d-dimensional vector of the ith object. In the latter form, an \(n \times n\) matrix \({\mathbf {X}}\) represents the pairwise distances between the objects. Then, we may represent the data by the graph \({\mathcal {G}}({\mathbf {O}},{\mathbf {X}})\), wherein \({\mathbf {O}}\) is the set of its vertices and \({\mathbf {X}}\) represents the edge weights. Note that the former is a specific form of the latter representation, where the pairwise distances are computed according to (squared) Euclidean distances.

A dendrogram D is defined as a rooted ordered tree such that,

  1. each node v in D includes a non-empty subset of the objects, i.e., \(v \subseteq {\mathbf {O}},|v| >0, \forall v \in D\), and

  2. the overlapping nodes are ordered, i.e., \(\forall u,v \in D, \text {if } u \cap v \ne \emptyset , \text { then either } u \subseteq v \text { or } v \subseteq u.\)

The latter condition implies that between every two overlapping nodes an ancestor-descendant relation holds, i.e., \(u \subseteq v\) indicates v is an ancestor of u, and u is a descendant of v.

The nodes at the lowest level (called the final nodes) are the singleton objects, i.e., node v is a final node if and only if \(|v|=1\). A node at a higher level contains the union of the objects of its children (direct descendants). The root of a dendrogram is defined as the node at the highest level (which has the maximum size), i.e., all other nodes are its descendants. linkage(v) returns the distance between the children of v based on the criterion used to compute the dendrogram. For simplicity of explanation, we assume each node has only two children. In the case that a parent node has multiple (more than two) child nodes, the different linkages among the children will have the same value, which is assigned to the parent node. To encode a dendrogram, we use the data structure supported by SciPy in Python, in particular in the same way as the output of the linkage function. This data structure is an \((n-1) \times 4\) matrix called \({\mathbf {Z}}\). Each individual object constitutes a separate singleton cluster whose cluster index is the object index. At each iteration i of the agglomerative algorithm, the indices of the two combined clusters are stored respectively in \({\mathbf {Z}}_{i,0}\) and \({\mathbf {Z}}_{i,1}\). The index of the new cluster is then i + n. We store the distance between the two clusters in \({\mathbf {Z}}_{i,2}\) and the size of the new cluster in \({\mathbf {Z}}_{i,3}\).
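For concreteness, the following snippet builds such a matrix with SciPy and annotates its columns; the toy data and variable names are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 2))        # toy data: n = 6 objects in 2 dimensions

# Z has shape (n - 1, 4); row i records the i-th merge of the agglomerative algorithm:
#   Z[i, 0], Z[i, 1] -> indices of the two merged clusters (the new cluster gets index n + i)
#   Z[i, 2]          -> linkage (distance) between the two merged clusters
#   Z[i, 3]          -> size of the new cluster
Z = linkage(Y, method='single')    # 'complete', 'average', 'ward' yield other dendrograms
print(Z)
```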

The level of node v, i.e., level(v), is determined by \(\max (level(c_l), level(c_r))+1\), where \(c_l\) and \(c_r\) indicate the two child nodes of v. For the final nodes, the level() function returns 0. Every connected subtree of D whose final nodes contain only singleton objects from \({\mathbf {O}}\) constitutes a dendrogram on that subset of objects. We use \(\mathcal D^D\) to refer to the set of all (sub)dendrograms derived in this form from D.

More precisely, accounting for possible ties in the linkage values, the level of node v, i.e., level(v), is determined by

$$\begin{aligned} level(v) = {\left\{ \begin{array}{ll} \max (level(c_l), level(c_r))+1, &{} \text {if } linkage(v) > \max (linkage(c_l), linkage(c_r)),\\ \max (level(c_l), level(c_r)), &{} \text {if } linkage(v) = \max (linkage(c_l), linkage(c_r)), \end{array}\right. } \end{aligned}$$
(1)

where \(c_l\) and \(c_r\) indicate the two child nodes of v. Note that in an agglomerative method we always have \(linkage(v) \ge \max (linkage(c_l), linkage(c_r))\). In particular, we usually expect \(linkage(v) > \max (linkage(c_l), linkage(c_r))\), unless there are ties, for example in the case of the single linkage method, where the new combination then does not yield a higher level node. Rather, the new node effectively has three children instead of two, where two of them are combined to make an intermediate node. Without loss of generality and for the sake of simplicity of presentation, we assume that ties do not occur, i.e., we always have

$$\begin{aligned} level(v) = \max (level(c_l), level(c_r))+1. \end{aligned}$$
(2)
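As a concrete illustration of Eq. 2, the following sketch computes the levels of all merge nodes from a SciPy-style linkage matrix Z as described above; the function name is our own and ties are ignored, consistent with the assumption of Eq. 2.

```python
import numpy as np

def levels_from_linkage(Z):
    """Levels per Eq. (2): singletons have level 0; each merge is 1 + the max level of its children."""
    n = Z.shape[0] + 1
    lev = np.zeros(2 * n - 1, dtype=int)      # one entry per node: n singletons + (n - 1) merges
    for m in range(n - 1):
        a, b = int(Z[m, 0]), int(Z[m, 1])     # children of the node created at merge m
        lev[n + m] = max(lev[a], lev[b]) + 1
    return lev[n:]                            # levels of the merge (internal) nodes only
```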

We consider a generalized variant of the level() function over a dendrogram D. Any function f(v) that satisfies the following conditions is a generalized level function.

  1. f(v) = 0 if and only if \(v \subset {\mathbf {O}}, |v| =1\).

  2. \(f(v) > f(u)\) if and only if v is an ancestor of u.

It is obvious that the basic function level() satisfies these conditions. We use \(v^*_{ij}\) to denote the node at the lowest level which contains both i and j, i.e.,

$$\begin{aligned} v^*_{ij} = \arg \min _{v\in D} f(v) \quad \text { s.t. } i,j \in v. \end{aligned}$$
(3)

Given dendrogram D, each node \(v \in D\) represents the root of a dendrogram \(D' \in \mathcal D^D\). Thereby, the dendrogram \(D'\) inherits the properties of its root node, i.e., \(f(D') = \max _{v\in D'} f(v)\) and \(linkage(D') = \max _{v\in D'} linkage(v)\), since the root node has the maximum linkage and level among the nodes of \(D'\).

In this paper, we investigate inferring pairwise distances from a dendrogram computed according to an arbitrary criterion, i.e., beyond single linkage criterion. Moreover, our framework allows one to define the level function in a very flexible and diverse way. For this purpose, we consider the following generic distance measure over dendrogram D, where \({\mathbf {D}}^D_{ij}\) indicates the pairwise dendrogram-based distance between the pair of objects (final nodes) \(i, j\in {\mathbf {O}}\).

$$\begin{aligned} {\mathbf {D}}^D_{ij} = \min f(D') \quad \text {s.t.} \quad i,j \in D', \text { and } D'\in \mathcal {D}^D. \end{aligned}$$
(4)

The level function f(v) and the distance matrix \({\mathbf {D}}^D\) provide a means to distinguish outliers at different levels. Outlier objects do not occur in the nearest neighborhood of many other clusters or objects. Thus, they join the other nodes of the dendrogram only at higher levels. Hence, the probability of object i being an outlier is proportional to the level at which it joins the other objects/clusters. Therefore, such objects will have a large dendrogram-based distance from the other objects.
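To make Eq. 4 concrete, the sketch below computes \({\mathbf {D}}^D\) from a SciPy-style linkage matrix Z for an arbitrary node function f; passing the linkage values recovers the distances of the next subsection, and passing the levels from the earlier sketch gives level-based distances. This is a minimal illustration under our own naming, not an optimized implementation.

```python
import numpy as np

def dendrogram_distances(Z, f):
    """Eq. (4): D[i, j] = f evaluated at the lowest dendrogram node containing both i and j.

    Z is a SciPy-style (n - 1) x 4 linkage matrix; f maps a merge index m (0-based)
    to a nonnegative value, e.g. the level of that merge or its linkage Z[m, 2].
    """
    n = Z.shape[0] + 1
    members = {i: [i] for i in range(n)}            # cluster index -> member objects
    D = np.zeros((n, n))
    for m in range(n - 1):
        a, b = int(Z[m, 0]), int(Z[m, 1])
        val = f(m)                                   # value assigned to the node created at merge m
        for i in members[a]:
            for j in members[b]:
                D[i, j] = D[j, i] = val              # merge m is the lowest node containing i and j
        members[n + m] = members.pop(a) + members.pop(b)
    return D

# f = linkage of the merge -> Minimax distances when Z was built with single linkage (Eq. 5)
# D_mm = dendrogram_distances(Z, lambda m: Z[m, 2])
# f = level of the merge   -> level-based distances, reusing levels_from_linkage from the earlier sketch
# lev = levels_from_linkage(Z); D_lvl = dendrogram_distances(Z, lambda m: lev[m])
```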

2.2 Minimax distances and single linkage agglomeration

We first study the relation between Minimax distances and single linkage agglomerative method. In particular, we elaborate that given the pairwise dissimilarity matrix X, the pairwise Minimax distance between objects i and j is equivalent to \({\mathbf {D}}^D_{ij}\) where the dendrogram is produced with single linkage criterion and \({\mathbf {D}}^D_{ij}\) is defined by

$$\begin{aligned} {\mathbf {D}}^D_{ij} = \min linkage(D') \quad \text {s.t.} \quad i,j\in D' \text { and } D'\in \mathcal {D}^D, \end{aligned}$$
(5)

i.e., \(f(D')\) in Eq. 4 is replaced by \(linkage(D')\).

Theorem 1

For each pair of objects \(i,j \in {\mathbf {O}}\), their Minimax distance measure over graph \({\mathcal {G}}({\mathbf {O}},{\mathbf {X}})\) is equivalent to their pairwise distance \({\mathbf {D}}^D_{ij}\) defined in Eq. 5, where the dendrogram D is obtained according to the single linkage agglomerative method.

Proof

It can be shown that the pairwise Minimax distances over an arbitrary graph are equivalent to the pairwise Minimax distances over ‘any’ minimum spanning tree computed from the graph. The proof is similar to that of the maximum capacity problem (Hu 1961). Thereby, the Minimax distances are obtained by

$$\begin{aligned} {\mathbf {D}}_{ij}^{MM} = \min _{r\in \mathcal {R}_{ij}({\mathcal {G}})}\left\{ \max _{1\le l \le |r|-1}{\mathbf {X}}_{r(l)r(l+1)}\right\} = \max _{1\le l \le |r_{ij}|-1}{\mathbf {X}}_{r_{ij}(l)r_{ij}(l+1)}, \end{aligned}$$
(6)

where \(r_{ij}\) indicates the (only) route between i and j, i.e., to obtain Minimax distances \({\mathbf {D}}^{MM}_{ij}\), we select the maximal edge weight on the only route between i and j over the minimum spanning tree.

On the other hand, the single linkage method and Kruskal’s minimum spanning tree algorithm are equivalent (Gower and Ross 1969). Thus, the dendrogram D corresponds to a minimum spanning tree of the graph. Now, we only need to show that the Minimax distances in Eq. 6 equal the distances defined in Eq. 5, i.e., that \({\mathbf {D}}^D_{ij}\) is the largest edge weight on the route between i and j in the hierarchy.

Given i, j, let \(D^* = \arg \min linkage(D') \quad \text {s.t.} \quad i,j\in D' \text { and } D'\in \mathcal {D}^D\). Then, \(D^*\) represents a minimum spanning subtree which includes a route between i and j (because the root node of \(D^*\) contains both i and j), and it is consistent with a complete minimum spanning tree on all the objects. On the other hand, we know that for each pair of nodes \(u, v \in D^*\) which have a direct or indirect parent–child relation, we have \(linkage(u) \ge linkage(v)\) iff \(f(u) \ge f(v)\). This indicates that the linkage of the root node of \(D^*\) represents the maximal edge weight on the route between i and j induced by the dendrogram D. Thus, \({\mathbf {D}}^D_{ij}\) defined in Eq. 5 represents \({\mathbf {D}}^{MM}_{ij}\) and the proof is complete.□

Notice that the Minimax distances in Eq. 5 are obtained by replacing \(f(D')\) with \(linkage(D')\) in the generic form of Eq. 4.

2.3 Vector-based representation of dendrogram-based distances

The generic distance measure defined in Eq. 4 yields an \(n \times n\) matrix of pairwise dendrogram-based distances between objects. However, many machine learning algorithms operate on a vector-based representation of the objects instead of pairwise distances. For instance, mixture density estimation methods such as Gaussian Mixture Models (GMMs) fall in this category. Vectors constitute the most basic form of data representation, since they provide a bijective map between the objects and the measurements, such that a wide range of numerical machine learning methods can be employed with them. Moreover, feature selection is more straightforward with this representation. Thereby, we compute an embedding of the objects into a new space, such that their pairwise squared Euclidean distances in the new space equal their pairwise distances obtained from the dendrogram. For this purpose, we first investigate the feasibility of this kind of embedding. Theorem 2 verifies the existence of an \(\mathcal {L}_2^2\) embedding for the general distance measure defined in Eq. 4.

Theorem 2

Given the dendrogram D computed on the input data \({\mathbf {Y}}\) or \({\mathbf {X}}\), the matrix of pairwise distances \({\mathbf {D}}^D\) obtained via Eq. 4 induces an \(\mathcal {L}_2^2\) embedding, such that there exists a new vector space for the set of objects \({\mathbf {O}}\) wherein the pairwise squared Euclidean distances equal the \({\mathbf {D}}^D_{ij}\) values in the original data space.

Proof

First, we show that the matrix \({\mathbf {D}}^{D}\) yields an ultrametric. The conditions to be satisfied are:

  1. \(\forall i,j: {\mathbf {D}}^{D}_{ij} = 0\) if and only if i = j. We investigate each direction separately. (1) First, if i = j, then \({\mathbf {D}}^{D}_{ii} = \min f(i) = 0\). (2) If \({\mathbf {D}}^{D}_{ij} = 0\), then \(v^*_{ij} = i = j\), because f(v) = 0 if and only if \(|v| = 1\). On the other hand, \(\forall i\ne j, {\mathbf {X}}_{ij} >0\), i.e., \(f(v^*_{ij}) > 0\) if \(i\ne j\).

  2. \(\forall i,j: {\mathbf {D}}^{D}_{ij} \ge 0\). We have \(\forall v, f(v) \ge 0\). Thus, \(\forall D' \in \mathcal D^D, \min f(D') \ge 0\), i.e., \({\mathbf {D}}^{D}_{ij} \ge 0\).

  3. \(\forall i,j: {\mathbf {D}}^{D}_{ij} = {\mathbf {D}}^{D}_{ji}\). We have \({\mathbf {D}}^{D}_{ij} = \{\min f(D') \quad \text {s.t.} \quad i,j \in D', \text { and } D'\in \mathcal {D}^D\} = \{\min f(D') \quad \text {s.t.} \quad j,i \in D', \text { and } D'\in \mathcal {D}^D\} = {\mathbf {D}}^{D}_{ji}\).

  4. \(\forall i,j,k: {\mathbf {D}}^D_{ij} \le \max ({\mathbf {D}}^D_{ik},{\mathbf {D}}^D_{kj})\). We first investigate \({\mathbf {D}}^D_{ik}\), where we consider the two following cases: (1) If \({\mathbf {D}}^D_{ij} \le {\mathbf {D}}^D_{ik}\) (Fig. 1a), then \({\mathbf {D}}^D_{ik}\) does not yield a contradiction. (2) If \({\mathbf {D}}^D_{ij} > {\mathbf {D}}^D_{ik}\), then i and k join earlier than i and j, i.e., \(f(v^*_{ij}) > f(v^*_{ik})\) (Fig. 1b). In this case, we have \(f(v^*_{ij}) = f(v^*_{v^*_{ik},j})\) and \(f(v^*_{kj}) = f(v^*_{v^*_{ik},j})\). Thus, we will have \(f(v^*_{ij}) = f(v^*_{kj})\), i.e., \({\mathbf {D}}^D_{ij} = {\mathbf {D}}^D_{kj} \le \max ({\mathbf {D}}^D_{ik},{\mathbf {D}}^D_{kj})\). By investigating \({\mathbf {D}}^D_{jk}\), a similar result holds. Thereby, we conclude: a) if \({\mathbf {D}}^D_{ij} > {\mathbf {D}}^D_{ik}\), then \({\mathbf {D}}^D_{ij} = {\mathbf {D}}^D_{kj}\), and b) if \({\mathbf {D}}^D_{ij} > {\mathbf {D}}^D_{kj}\), then \({\mathbf {D}}^D_{ij} = {\mathbf {D}}^D_{ik}\). Thereby, we always have \({\mathbf {D}}^D_{ij} \le \max ({\mathbf {D}}^D_{ik},{\mathbf {D}}^D_{kj})\).

On the other hand, one can show that an ultrametric induces an \(\mathcal {L}_2^2\) embedding (Deza and Laurent 1994). Therefore, \({\mathbf {D}}^{D}\) represents the pairwise squared Euclidean distances in a new vector space.□

Fig. 1: The ultrametric property of \({\mathbf {D}}^D\)

After assuring the existence of such an embedding, we can use any method to compute it. In particular, we exploit the method introduced in Young and Householder (1938) and further analyzed in Torgerson (1958). This method first centers \({\mathbf {D}}^{D}\) to obtain a Mercer kernel and then performs an eigenvalue decomposition:

  1. Center \({\mathbf {D}}^{D}\) via

    $$\begin{aligned} \mathbf {W}^{D}\leftarrow -\frac{1}{2}\mathbf {A} {\mathbf {D}}^{D} \mathbf {A}. \end{aligned}$$
    (7)

    \(\mathbf {A}\) is obtained by \(\mathbf {A} = \mathbf {I}_n - \frac{1}{n}\mathbf {e}_n\mathbf {e}_n^{T}\), where \(\mathbf {e}_n\) is an n-dimensional constant vector of 1’s and \(\mathbf {I}_n\) is an identity matrix of size \(n\times n\).

  2. With this transformation, \({\mathbf {W}}^{D}\) becomes a positive semidefinite matrix. Thus, we decompose \({\mathbf {W}}^{D}\) into its eigenbasis, i.e., \({\mathbf {W}}^{D}=\mathbf {V}\varvec{\Lambda }\mathbf {V}^{T},\) where \(\mathbf {V} = (v_1,\ldots ,v_n)\) contains the eigenvectors \(v_i\) and \(\varvec{\Lambda }=\texttt {diag}(\lambda _1,\ldots ,\lambda _n)\) is a diagonal matrix of eigenvalues \(\lambda _1\ge \cdots \ge \lambda _l\ge \lambda _{l+1}= 0 = \cdots = \lambda _n\). Note that the eigenvalues are nonnegative, since \({\mathbf {W}}^{D}\) is positive semidefinite.

  3. Calculate the \(n\times l\) matrix \(\mathbf {Y}^{D}_l=\mathbf {V}_l(\varvec{\Lambda }_l)^{1/2},\) with \(\mathbf {V}_l=(v_1,\ldots ,v_l)\) and \(\varvec{\Lambda }_l=\text {diag}(\lambda _1,\ldots ,\lambda _l)\), where l denotes the dimensionality of the new vectors.

The new dendrogram-based dimensions are ordered according to the respective eigenvalues, and one might choose only the first, most representative ones instead of taking all of them. Hence, an additional advantage of computing such an embedding is feature selection.
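A minimal NumPy sketch of these three steps (classical Torgerson scaling), assuming \({\mathbf {D}}^D\) is passed as a dense array; the function name and the tolerance for discarding numerically zero eigenvalues are our own choices.

```python
import numpy as np

def embed_from_distances(D, l=None, tol=1e-10):
    """Young-Householder / Torgerson embedding of a squared-distance matrix (Eq. 7 and the steps above)."""
    n = D.shape[0]
    A = np.eye(n) - np.ones((n, n)) / n        # centering matrix A = I_n - (1/n) e_n e_n^T
    W = -0.5 * A @ D @ A                       # centered (Mercer) kernel, Eq. (7)
    lam, V = np.linalg.eigh(W)                 # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]             # reorder descending
    keep = lam > tol                           # drop (numerically) zero eigenvalues
    lam, V = lam[keep], V[:, keep]
    if l is not None:                          # optionally keep only the top-l dimensions
        lam, V = lam[:l], V[:, :l]
    return V * np.sqrt(lam)                    # Y^D_l = V_l Lambda_l^{1/2}
```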

2.4 On the choice of level function

As mentioned before, Minimax distances, as a particular instance of the dendrogram-based representations, are widely used in clustering and classification tasks. However, such distances (and equivalently the single linkage method) do not take into account the diverse densities of the structures or classes. For example, consider the dataset shown in Fig. 2, which consists of two clusters with different densities, marked respectively in black and blue. The intra-cluster Minimax distances for the members of the blue cluster are considerably large compared to the intra-cluster Minimax distances of the black cluster, or even to the inter-cluster Minimax distances. Thereby, a clustering algorithm might split the blue cluster, instead of performing a cut on the boundary of the two clusters. According to Theorem 1, the Minimax distance between objects i and j corresponds to a linkage with maximal weight on the path between them in the dendrogram. However, the absolute value of a linkage might be biased in a way that it does not precisely reflect the real coherence of the two nodes compared to the other nodes/objects. Thereby, in order to be more adaptive with respect to the diverse densities of the underlying structures, we will investigate the following choice in our experiments.

$$\begin{aligned} {\mathbf {D}}^D_{ij} = \min _{D'} level(D') \quad \text {s.t.} \quad i,j \in D', \text { and } D'\in \mathcal {D}^D. \end{aligned}$$
(8)

Note that our analysis is generic and can be applied to any definition of dendrogram-based distance measure and to any choice of f defined in Eq. 4. It only needs to satisfy the aforementioned conditions for generalized level functions.

Fig. 2: Minimax distance measures might perform imperfectly on data with diverse densities. An adaptive approach that takes into account the variance of different classes or clusters might be more appropriate

3 Aggregation of multiple representations

3.1 Aggregation in solution space

As discussed earlier, a dendrogram can be constructed in several ways according to different criteria. Moreover, the choice of a level function and a distance function over a dendrogram adds another degree of freedom. Therefore, choosing the right method constitutes a model selection question. Let us assume such distances and features are later used in a clustering task, the most common unsupervised learning problem. We then address this problem via an ensemble method in the context of model averaging.

We follow a two-step procedure to compute an aggregated clustering that represents a given set of clustering solutions (where, e.g., each solution is the result of a particular dendrogram followed by a clustering algorithm). First, we construct a graph whose vertices represent the objects and whose edge weights can be any integer (positive, negative or zero), depending on how often the respective vertices appear in the same cluster among the M different clustering solutions. More specifically, we initialize the edge weights with zero. Then, for each clustering solution \({\mathbf {c}}^m \in \{1,\ldots ,K\}^n, 1\le m\le M\) (each obtained from a different dendrogram-based representation), we compute a co-clustering matrix whose (i, j)th entry is + 1 if \({\mathbf {c}}^m_i = {\mathbf {c}}^m_j\), and − 1 otherwise (K indicates the number of clusters). Finally, we sum up the co-clustering matrices to obtain \({\mathbf {S}}^e\). Algorithm 1 describes the procedure in detail.

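The following is a minimal sketch of this construction (Algorithm 1); the function name and the zeroed diagonal are our own choices.

```python
import numpy as np

def coassociation_matrix(solutions):
    """Sum of the per-solution (+1 / -1) co-clustering matrices, yielding S^e.

    `solutions` is an M x n array of cluster labels, one row per dendrogram-based solution.
    """
    solutions = np.asarray(solutions)
    M, n = solutions.shape
    S = np.zeros((n, n), dtype=int)
    for c in solutions:
        same = c[:, None] == c[None, :]      # +1 where i and j share a cluster, -1 otherwise
        S += np.where(same, 1, -1)
    np.fill_diagonal(S, 0)                   # self-pairs carry no information (our convention)
    return S
```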

Given the graph with positive and negative edge weights, we use correlation clustering (Bansal et al. 2004) to partition it into K clusters. This model computes a partitioning that minimizes the disagreements, i.e., the sum of the inter-cluster positive edge weights plus the sum of the magnitudes of the intra-cluster negative edge weights should be minimal. Following Bansal et al. (2004) and Chehreghani et al. (2012), the cost function for a fixed number of clusters K is written as

$$\begin{aligned} R({\mathbf {c}},{\mathbf {S}}^e) = \frac{1}{2}\sum _{k=1}^{K} \sum _{i,j\in {\mathbf {O}}_k}(|{\mathbf {S}}^e_{ij}|-{\mathbf {S}}^e_{ij}) + \frac{1}{2}\sum _{k=1}^{K} \sum _{k'=k+1}^{K} \sum _{i\in {\mathbf {O}}_k} \sum _{j\in {\mathbf {O}}_{k'}} (|{\mathbf {S}}^e_{ij}|+{\mathbf {S}}^e_{ij}), \end{aligned}$$
(9)

where \({\mathbf {O}}_k\) indicates the objects of the kth cluster, i.e., \(\forall i: i \in {\mathbf {O}}_k \text { iff } {\mathbf {c}}_i =k\). This model has been further analyzed in Thiel et al. (2019) in terms of convergence rate.

This ensemble clustering method yields a consistent aggregation of the clustering solutions obtained from different representations, i.e., in the case of M = 1 the optimal solution of Eq. 9 does not change the given clustering solution of this single representation.

Efficient optimization of correlation clustering cost function Finding the optimal solution of the cost function in Eq. 9 is NP-hard (Bansal et al. 2004; Demaine et al. 2006) and even APX-hard (Demaine et al. 2006). Therefore, we develop a local search method which computes a local minimum of the cost function. The good performance of such a greedy strategy is well studied for different clustering models, e.g., K-means (Macqueen 1967), kernel K-means (Schölkopf et al. 1998) and in particular several graph partitioning methods (Dhillon et al. 2004, 2005). We begin with a random clustering solution and then iteratively assign each object to the cluster that yields a maximal reduction in the cost function. We repeat this procedure until no further improvement is achieved, i.e., a local optimum is found.

At each step of the aforementioned procedure, one needs to evaluate the cost of assigning every object to each of the clusters. The cost function is quadratic, thus a single evaluation might take \({\mathcal {O}}(n^2)\) time. Thereby, if the local search converges after t steps, the total runtime would be \({\mathcal {O}}(tn^3)\). However, we do not need to recalculate the cost function for each individual evaluation. Let \(R({\mathbf {c}},{\mathbf {S}}^e)\) denote the cost of clustering solution \({\mathbf {c}}\), wherein the cluster label of object i is k. To obtain a more efficient cost function evaluation, we first consider the contribution of object i to \(R({\mathbf {c}},{\mathbf {S}}^e)\), i.e., \(R_i({\mathbf {c}},{\mathbf {S}}^e)\), which is given by

$$\begin{aligned} R_i({\mathbf {c}},{\mathbf {S}}^e) = \frac{1}{2}\sum _{j\in {\mathbf {O}}_k} (|{\mathbf {S}}^e_{ij}|-{\mathbf {S}}^e_{ij}) + \frac{1}{2}\sum _{q=1, q\ne k}^{K}\sum _{j\in {\mathbf {O}}_q} (|{\mathbf {S}}^e_{ij}|+{\mathbf {S}}^e_{ij}). \end{aligned}$$
(10)

Then, the cost of the clustering solution \({\mathbf {c}}'\), which is identical to \({\mathbf {c}}\) except that object i is assigned to cluster \(k' \ne k\), i.e., \(R({\mathbf {c}}',{\mathbf {S}}^e)\), is computed by

$$\begin{aligned} R({\mathbf {c}}',{\mathbf {S}}^e) = R({\mathbf {c}},{\mathbf {S}}^e) - R_i({\mathbf {c}},{\mathbf {S}}^e) + R_i({\mathbf {c}}',{\mathbf {S}}^e), \end{aligned}$$
(11)

where \(R({\mathbf {c}},{\mathbf {S}}^e)\) is already known, and \(R_i({\mathbf {c}},{\mathbf {S}}^e)\) and \(R_i({\mathbf {c}}',{\mathbf {S}}^e)\) each require an \({\mathcal {O}}(n)\) runtime. Thus, we evaluate the cost function (9) only once, for the initial random clustering. Then, iteratively and until convergence, we compute the costs of assigning objects to different clusters via Eq. 11 and assign them to the clusters that yield a minimal cost. The total runtime is then \({\mathcal {O}}(tn^2)\).
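A minimal sketch of this local search, assuming \({\mathbf {S}}^e\) is given as a dense array; the iteration cap, the single random initialization and the function name are our own choices (in the experiments, the procedure is additionally restarted from several random initializations and the best solution is kept).

```python
import numpy as np

def correlation_clustering(S, K, n_iter=100, seed=0):
    """Greedy local search for Eq. (9), using the incremental updates of Eqs. (10)-(11)."""
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    c = rng.integers(0, K, size=n)                  # random initial clustering
    for _ in range(n_iter):
        changed = False
        for i in range(n):
            # Twice the R_i of Eq. (10) for each candidate cluster q (the constant
            # factor does not affect the argmin):
            #   sum_{c_j = q} (|S_ij| - S_ij) + sum_{c_j != q} (|S_ij| + S_ij), j != i
            # = sum_j |S_ij| + sum_j S_ij - 2 * sum_{c_j = q} S_ij
            row = S[i].astype(float).copy()
            row[i] = 0.0                            # exclude the self-pair
            agree = np.array([row[c == q].sum() for q in range(K)])
            costs = np.abs(row).sum() + row.sum() - 2.0 * agree
            best = int(np.argmin(costs))            # move minimizing the cost, cf. Eq. (11)
            if best != c[i]:
                c[i] = best
                changed = True
        if not changed:                             # local optimum reached
            break
    return c
```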

3.2 Aggregation in representation space

In this section, instead of an ensemble-based approach in the solution space, we describe the aggregation of different (dendrogram-based) distances in the representation space, independent of what the subsequent task will be. The embedding phase of our general-purpose framework not only enables us to employ any numerical machine learning algorithm, but also provides an amenable way to successively combine different representations. In this approach, the features extracted from a dendrogram (e.g., single linkage) are used to build another dendrogram according to the same or a different criterion (e.g., average linkage), in order to yield more complex features. The degree of freedom (richness of the function class) can be increased by the choice of a different level or distance function over the dendrograms. Such a framework leads to a nonparametric deep architecture wherein a cascade of multiple layers of nonparametric information processing units is deployed for feature transformation and extraction. The output of each layer is a set of features, which can be fed into another layer as input. Note that in this architecture any other (nonparametric) unit can be employed at the layers, beyond the dendrogram-based feature extraction units. Each layer (dendrogram) extracts a particular type of features in the space of data representation. A two-layer sketch is shown below.
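A hedged sketch of such a cascade, reusing the helper functions sketched earlier (dendrogram_distances and embed_from_distances); the layer order (Ward followed by single linkage) mirrors the W-S combination studied in the experiments, and all names are our own.

```python
from scipy.cluster.hierarchy import linkage

def dendrogram_features(Y, method):
    """One layer: build a dendrogram on Y, derive pairwise distances (Eq. 4), embed them."""
    Z = linkage(Y, method=method)
    D = dendrogram_distances(Z, lambda m: Z[m, 2])   # here f = linkage; a level-based f also works
    return embed_from_distances(D)

# Two-layer architecture: the single linkage layer consumes the Ward features, not the raw data Y.
features = dendrogram_features(dendrogram_features(Y, 'ward'), 'single')
```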

4 Experiments

We empirically investigate the performance of dendrogram-based representations on different datasets and demonstrate the usefulness of this approach for extracting suitable features. Our methods are unsupervised and do not assume the availability of any labeled data. Thus, to fully benefit from this property, we consider an unsupervised representation learning strategy, such that no free parameter is involved in inferring the new features. Thereby, we apply our methods to clustering and density estimation problems, for which parametric feature extraction methods might be inappropriate, due to the lack of labeled data for cross-validation (to estimate the parameters). In particular, after extracting the new features, we apply the following algorithms to obtain a clustering solution: (1) Gaussian Mixture Model (GMM), (2) K-means, and (3) spectral clustering. In the case of GMM, after computing the assignment probabilities, we assign each object to the cluster (distribution) with maximal probability. We run each method, as well as correlation clustering (to obtain the ensemble solution), 100 times and pick the solution with the smallest cost or negative log-likelihood.
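For reference, one possible way to run these three algorithms on the extracted features is via scikit-learn, as sketched below; `features`, `similarities` and `K` are placeholder names, and the remaining parameters are left at their defaults (the exact settings used in the experiments are not specified here).

```python
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans, SpectralClustering

# features: the embedded dendrogram-based representation; similarities: a pairwise
# similarity matrix derived from the distances; K: the number of clusters
gmm_labels = GaussianMixture(n_components=K).fit(features).predict(features)  # GMM, hard assignment
km_labels = KMeans(n_clusters=K).fit_predict(features)                        # K-means
sp_labels = SpectralClustering(n_clusters=K, affinity='precomputed').fit_predict(similarities)
```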

UCI datasets We perform our experiments on the following datasets selected randomly from the UCI data repository.

  1. Forest Type: contains multi-temporal sensing information of 326 samples from a forested area in Japan, each described by 27 features. The dataset consists of 5 clusters.

  2. Hayes-Roth: contains 160 samples from a human subjects study, each described by 5 attributes.

  3. Lung Cancer: each instance contains 56 attributes and is categorized as cancer or non-cancer.

  4. Mammographic Mass: consists of the BI-RADS attributes of the mammographic masses of 961 samples.

  5. One-Hundred Plant: contains leaf samples of 100 plant species with 16 samples per species, each described by 64 features (1600 samples in total with 100 clusters).

  6. Perfume: contains 560 instances (odors) of 20 different perfumes measured by a handheld odor meter.

  7. Semeion Handwritten Digit: features of 1593 handwritten digits from around 80 persons, where each digit is stretched into a 16 \(\times\) 16 rectangular box with 256 gray-scale values.

  8. Statlog (Australian Credit Approval): includes credit card data (described by 14 attributes) of 690 users.

  9. Urban Land Cover: contains 168 high resolution aerial images of 9 types, each represented by 148 features.

  10. Vertebral Column: contains 6 biomechanical features of 310 patients categorized according to their status.

In these datasets, the objects as well as the features extracted from different dendrograms are represented by vectors. Thus, to obtain the pairwise distances, we compute the squared Euclidean distances between the respective vectors. Some clustering algorithms such as spectral clustering require pairwise similarities as input, instead of a vector-based representation. Therefore, as proposed in Chehreghani (2016a), we convert the pairwise distances \({\mathbf {X}}\) (or \({\mathbf {D}}^D\), if obtained from a dendrogram) to a similarity matrix \({\mathbf {S}}\) via \({\mathbf {S}}_{ij} = \max ({\mathbf {X}}) - {\mathbf {X}}_{ij} + \min ({\mathbf {X}})\), where the \(\max (.)\) and \(\min (.)\) operations return the maximal and minimal elements of the given matrix. Note that an alternative transformation is an exponential function of the form \({\mathbf {S}}_{ij} = \exp (-\frac{{\mathbf {X}}_{ij}}{\sigma ^2})\), which requires fixing the free parameter \(\sigma\) in advance. However, in particular in unsupervised learning, this task is nontrivial and the appropriate values of \(\sigma\) occur in a very narrow range (von Luxburg 2007).
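The parameter-free conversion above is a one-liner; a small sketch (the function name is ours):

```python
import numpy as np

def distances_to_similarities(X):
    """S_ij = max(X) - X_ij + min(X): larger distances map to smaller similarities, with no free parameter."""
    X = np.asarray(X, dtype=float)
    return X.max() - X + X.min()
```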

Evaluation The ground truth solutions of these datasets are available. Therefore, we can quantitatively measure the performance of each method by comparing the estimated and the true cluster labels. For each estimated clustering solution, we compute three commonly used quality measures: (1) adjusted Mutual Information (Vinh et al. 2010), which gives the mutual information between the estimated and true solutions, (2) adjusted Rand score (Hubert and Arabie 1985), which computes the similarity between them, and (3) V-measure (Rosenberg and Hirschberg 2007), which gives the harmonic mean of homogeneity and completeness. We compute the adjusted variants of these criteria, i.e., they yield zero for random solutions.
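One possible way to compute these scores is with scikit-learn, as sketched below; `labels_true` and `labels_pred` are placeholder names for the ground-truth and estimated labels (whether these exact implementations were used in the experiments is not specified).

```python
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score, v_measure_score

ami = adjusted_mutual_info_score(labels_true, labels_pred)   # adjusted Mutual Information
ari = adjusted_rand_score(labels_true, labels_pred)          # adjusted Rand score
vms = v_measure_score(labels_true, labels_pred)              # V-measure
```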

Results Tables 1 and 2 show the results on different UCI datasets. Each block row represents a separate dataset (in order, Forest Type, Hayes-Roth, Lung Cancer, Mammographic Mass and One-Hundred Plant in Table 1, and Perfume, Semeion Handwritten Digit, Statlog, Urban Land Cover and Vertebral Column in Table 2). For each dataset, we investigate the different feature extraction methods (base, PCA, LSA and those obtained from different dendrograms) with three different clustering algorithms. The goal of studying the three clustering algorithms is to demonstrate that our feature extraction methods can be used with various forms of clustering algorithms and are not limited to a specific algorithm. In this way, we investigate one probabilistic clustering model (GMM), one that uses a vector-based representation (K-means) and one that is applied to pairwise relations (spectral clustering). The three evaluation criteria that we use are the most common criteria for evaluating clustering methods. The results of the ensemble method are shown in italics. For each clustering algorithm and each evaluation measure, the best result among the different feature extraction methods is shown in bold.

Table 1 Performance of different representations and clustering methods on different UCI datasets
Table 2 Performance of different representations and clustering methods on different UCI datasets

The base method indicates performing GMM, K-means or spectral clustering on the original vectors without inferring any new features. We also investigate Principal Component Analysis (PCA) and Latent Semantic Analysis (LSA) as two other baselines. As discussed in Theorem 2, the matrix of pairwise dendrogram-based distances satisfies the ultrametric conditions. An ultrametric is stronger than a metric, i.e., any ultrametric is a metric too. The only difference is the last condition in the proof of Theorem 2. For an ultrametric, we require \(\forall i,j,k: {\mathbf {D}}^D_{ij} \le \max ({\mathbf {D}}^D_{ik},{\mathbf {D}}^D_{kj})\). It is obvious that this condition implies the triangle (metric) inequality too, i.e., \(\forall i,j,k: {\mathbf {D}}^D_{ij} \le {\mathbf {D}}^D_{ik} + {\mathbf {D}}^D_{kj}\). Hence, \({\mathbf {D}}^D\) induces a metric. On the other hand, different embedding methods usually rely on the metric conditions being satisfied. Therefore, in principle any embedding and dimension reduction method can be applied to the dendrogram-based pairwise distances, in the same way that it can be applied to the base pairwise distances. Thus, further investigation of the results of different embedding methods is orthogonal to our contribution and we postpone it to future work.

The different dendrogram-based feature extraction methods are specified by the name of the criterion used to build the dendrogram. The ensemble method refers to aggregating the different solutions and then performing correlation clustering. Given the equivalence of the single linkage method, Minimax distances and the tree preserving embedding method in Shieh et al. (2011a), the single linkage variant can be seen as another baseline which also constitutes a special instantiation of the dendrogram-based feature extraction methodology. Note that the superior performance of Minimax distances (single linkage features) over methods such as metric learning or link-based methods has been demonstrated in previous works (Kim and Choi 2007, 2013; Chehreghani 2016b, 2017b) (see for example Figure 1 in Kim and Choi (2013)).

We interpret the results of Tables 1 and 2 as follows. For each dataset (block row) and each clustering algorithm, we investigate whether “some” of the dendrogram-based features (i.e., single, complete, average or Ward) perform better (according to the three evaluation criteria) than the baseline methods (base, PCA and LSA). If so, then we conclude our framework provides a rich and diverse family of non-parametric feature extraction methods wherein some instances yield more suitable features for the data at hand. Thus, a user has more freedom and options to choose the correct features. However, the user might not have sufficient information to choose the correct features (dendrograms), thus, we propose to use the ensemble variant, in the context of averaging (aggregating) multiple learners.

According to the results reported in Tables 1 and 2, we observe the following. (1) Extracting features from dendrograms yields better representations that improve the evaluation scores of the final clusters. The dendrogram might be built in different ways, which correspond to computing different types of features. In particular, we observe that the features extracted via average linkage and Ward linkage often lead to very good results. Single linkage (Minimax) features are more suitable for low-dimensional data wherein connectivity paths still exist. However, in higher dimensions, the other methods might perform better due to their robustness and flexibility. (2) The ensemble method works well, in particular compared to the baselines and most of the dendrogram-based approaches. Note that the ensemble method is more than just averaging the results. It can be interpreted as obtaining a good (strong) learner from a set of weaker learners. Thereby, in several cases, the ensemble method performs even better than all the other alternatives.

Aggregation of representations As a side study, we investigate the sequential aggregation of different dendrogram-based features in representation space, i.e., we consider the combination of every two such feature extractors. For this purpose, we first compute a dendrogram and extract the respective features. Then, we use these features to compute a second dendrogram from which we obtain a new set of features. Finally, we apply a clustering method (GMM, K-means and spectral clustering) and evaluate the results w.r.t. Mutual Information, Rand score and V-measure.

We observe that for most of the datasets, aggregation of different features either improves the results or preserves the accuracy obtained with the first representation alone. However, aggregation of the clustering solutions usually yields more significant changes (improvements) than aggregating the representations. One of the significant changes happens on the Perfume dataset. See the results in Tables 3, 4 and 5, where respectively GMM, K-means and spectral clustering have been applied to the final features to produce the clusters. The first and the second dendrograms are indicated by the rows and the columns, respectively (where S refers to single, C to complete, A to average, and W to Ward, the different ways of obtaining the features). These results should be compared with the block row in Table 2 that corresponds to the Perfume dataset (the first block row). We observe that on this dataset, feature aggregation often improves the results for different clustering methods. However, as mentioned before, such an aggregation is usually less significant (on other datasets).

Table 3 Aggregation of two representations on the Perfume dataset
Table 4 Aggregation of two representations on the Perfume dataset, where K-means is used for the clustering of the final features
Table 5 Aggregation of two representations on the Perfume dataset, where spectral clustering is applied to the final features to cluster them

We observe that on this dataset, the W-S combination (extracting the features first via Ward and then via single linkage) consistently yields the best results among all the different combinations. In Table 6, we compare these results with the best single feature extractor for the Perfume dataset, which is based on the Ward linkage. Single linkage, even though it does not yield very good results by itself, improves the Ward features the most. According to Table 6, except for spectral clustering, using the single linkage features helps the clustering algorithm produce better results. However, the best result is obtained with GMM, for which combining Ward with any option is helpful.

Table 6 Comparison of Ward (W) and Ward-single (W-S) features on the Perfume dataset

Model selection Our framework provides several options for choosing the dendrogram and the level function, and at the same time a principled way to aggregate and choose the best options (either in solution space or in representation space).

The availability of such alternatives yields a rich family of unsupervised models for representation learning and feature extraction. We note that this availability is different from optimizing the free parameters of a kernel.

  1. In our framework, the number of choices is very limited, whereas for a kernel function the free parameter(s) can usually take a wide (continuous) range of different values. Moreover, the optimal values of the kernel parameters usually occur inside very narrow ranges, which makes it difficult to find them via search or cross-validation, even using labeled data (Nadler and Galun 2007; von Luxburg 2007).

  2. In our framework, every choice has an explicit interpretation that makes model selection more straightforward. For example, single linkage is more suitable for elongated structures and patterns, whereas average linkage is better suited to high-dimensional data. On the other hand, the proposed level function in Eq. 8 is better adapted to density-diverse structures.

  3. Finally, as we demonstrated on all the datasets, our framework also provides a consistent way to compute an ensemble of the different choices and options. According to the experimental results, the ensemble solution performs very well compared to the individual choices. Computing such an ensemble solution is nontrivial for many kernels.

Here, as a side study, we compare the two choices of level function on the ensemble solution, i.e., the option defined in Eq. 5 and the one defined in Eq. 8. As explained before, Eq. 8 suggests a context-sensitive level function that takes the data diversity into account. According to the results in Tables 1 and 2, with the level function in Eq. 8, the ensemble solution of GMM on the different UCI datasets yields the following MI scores: 0.4771, 0.1426, 0.2659, 0.0834, 0.6990, 0.9183, 0.5661, 0.0864, 0.1534, 0.2475. With the level function in Eq. 5, the ensemble solution of GMM on the different UCI datasets gives the following MI scores: 0.4148, 0.1301, 0.2738, 0.0834, 0.6364, 0.9075, 0.5747, 0.0864, 0.1451, 0.2262. We observe that on two datasets (Mammographic Mass and Statlog) the two variants yield the same results. Among the remaining eight datasets, on six of them the level function in Eq. 8 performs better, whereas on only two datasets (Lung Cancer and Semeion Handwritten Digit) the level function in Eq. 5 yields higher scores. Notably, however, the results from both choices are acceptable.

Efficiency of correlation clustering optimization In our framework, we employ an efficient optimization of correlation clustering to compute the ensemble solution. We have studied its effectiveness in terms of the quality of the ensemble solution. Here, we investigate the efficiency of its optimization procedure in terms of runtime. In particular, we compare our local search optimization with the Linear Programming (LP) method (Demaine et al. 2006) and the Semidefinite Programming (SDP) relaxation (Charikar et al. 2003; Mathieu and Schudy 2010). Table 7 shows the different runtime results. We observe that the local search method performs significantly faster than the alternatives. It is notable that the SDP method encounters memory issues for datasets larger than 200 objects. We stop it when its runtime exceeds 10 h.

Table 7 Comparison of the runtimes of different methods for optimizing the correlation clustering objective used to obtain the ensemble solutions

Experiments on scientific datasets Finally, we investigate the proposed methods on two real-world datasets collected within a scientific data analytics project. The goal is to extract clusters of different subjects and topics. The extracted clusters help to analyze (1) how an automated approach can distinguish the scientific outcomes in different subjects and accordingly categorize the respective authors, and (2) how separable or related the different subjects and topics are. The first dataset contains 10,000 published scientific articles in 10 different topics of computer science including algorithms, database, machine learning, networks, hardware, software engineering, formal methods, security, logic and information systems. The second dataset contains 10,000 published scientific articles in different topics of electrical engineering. Each ground truth cluster consists of 1000 articles. For each dataset, we obtain the TF-IDF vectors of the articles after removing the stop words. The number of features is 5823 and 5495 for the computer science and electrical engineering datasets, respectively. We compute the base pairwise distances as the squared Euclidean distances between the TF-IDF vectors. We then use them to compute the dendrograms.
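A possible preprocessing sketch for these text datasets with scikit-learn; `docs` is a placeholder for the list of article texts, and the exact stop-word list and vectorizer settings used in the experiments are not specified here.

```python
from scipy.spatial.distance import pdist, squareform
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs).toarray()  # TF-IDF vectors
base_distances = squareform(pdist(tfidf, metric='sqeuclidean'))              # squared Euclidean distances
```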

Table 8 shows the performance of different representations and clustering methods on these datasets, where the first block row corresponds to the computer science dataset and the second block row corresponds to the electrical engineering dataset. The results of the ensemble method are shown in italics. For each clustering algorithm and each evaluation measure, the best result among the different feature extraction methods is shown in bold. We observe results consistent with those on the UCI datasets. (1) Using different dendrogram-based features often improves the results for different clustering methods w.r.t. the evaluation criteria. (2) The ensemble solution either yields the best results or results very close to the best choice, i.e., it can effectively address the model selection problem.

Table 8 Performance of different representations and clustering methods on two scientific datasets

5 Conclusion

We extended the previous Minimax and tree preserving representation learning methods, which correspond to building a single linkage dendrogram, and proposed a generic framework to compute representations from different dendrograms, beyond single linkage. We then studied an embedding to extract vector-based features for such distances, which extends their applicability to a wide range of machine learning algorithms. We further considered the aggregation of different dendrogram-based features in solution space and in representation space. In the first approach, based on the consistency of the cluster labels of different objects, we build a graph with positive and negative edge weights and then apply correlation clustering to obtain the final clusters. In the second approach, in the spirit of deep learning models, we apply different dendrogram-based feature extractors sequentially, such that the input of the next layer is the output of the current one, and then apply the particular (clustering) algorithm to the final features. Our experiments on several datasets demonstrated the effectiveness of the proposed framework.