Order preserving hierarchical agglomerative clustering

Partial orders and directed acyclic graphs are commonly recurring data structures that arise naturally in numerous domains and applications and are used to represent ordered relations between entities in the domains. Examples are task dependencies in a project plan, transaction order in distributed ledgers and execution sequences of tasks in computer programs, just to mention a few. We study the problem of order preserving hierarchical clustering of this kind of ordered data. That is, if we have $a<b$ in the original data and denote their respective clusters by $[a]$ and $[b]$, then we shall have $[a]<[b]$ in the produced clustering. The clustering is similarity based and uses standard linkage functions, such as single- and complete linkage, and is an extension of classical hierarchical clustering. To achieve this, we define the output from running classical hierarchical clustering on strictly ordered data to be partial dendrograms; sub-trees of classical dendrograms with several connected components. We then construct an embedding of partial dendrograms over a set into the family of ultrametrics over the same set. An optimal hierarchical clustering is defined as the partial dendrogram corresponding to the ultrametric closest to the original dissimilarity measure, measured in the p-norm. Thus, the method is a combination of classical hierarchical clustering and ultrametric fitting. A reference implementation is employed for experiments on both synthetic random data and real world data from a database of machine parts. When compared to existing methods, the experiments show that our method excels both in cluster quality and order preservation.


Introduction
Clustering is one of the oldest and most frequently used techniques for exploratory data analysis and unsupervised classification.The toolbox contains a large variety of methods and algorithms, spanning from the initial, but still popular ideas of k-means (Macqueen, 1967) and hierarchical clustering (Johnson, 1967), to more recent methods, such as density-and model based clustering (Kriegel et al., 2011;Fraley and Raftery, 2002), and semi-supervised methods (Basu et al., 2008), plus a large list of variants.All these methods have one thing in common: they try to extract hidden structure from the data, and make it visible to the analyst.But they also share another feature: if the analysed data is already endowed with some form of structure, the structure is lost in the clustering process; the clustering does not try to retain the structure.
In this paper, we show how to extend hierarchical clustering to relational data in a way that preserves the relations.In particular, if the input is a set X equipped with a strict partial order <, and if a, b ∈ X, we ensure that if a < b then we will have [a] < ′ [b] after clustering, where [a] and [b] are the respective clusters of a and b, and < ′ is a partial order on the clusters naturally induced by <.
Since directed acyclic graphs (DAGs) correspond to partial orders, our method works equally well for DAGs.If the input is a DAG, then every clustering in the produced hierarchy is a DAG of clusters, and there exists a DAG homomorphism from the original DAG to the cluster DAG.

Motivating real-world use case
The motivation for our method comes from an industry database of machine parts that are arranged in part-of relations: parts are registered as sub-parts of other parts.For historical reasons, there have been incidents of copy-paste of machine designs, and the copies have been given entirely new identifiers with no links to the original design.In hindsight, there is a wish to identify these equivalent machine parts, but telling them apart is hard.Also, the metadata that is available has a tendency of displaying high similarity between a part and its sub-parts, leading to "vertical clustering" in the data.
Since the motivation is to identify equivalent machinery with the aim of replacing one piece of machinery with an equivalent part, and since a part and its sub-parts by no means can be interchanged, it is essential to maintain this parent-child relationship.Moreover, since a part and its sub-part are never equivalent, this is a strict order relation.The set of all machine parts thus makes up a strictly partially ordered set.By preserving these relations in the clustering process, we can eliminate the errors due to close resemblance between the part and the sub-part, resulting in improved over all quality of the clustering.This is but one concrete example of a real world problem where the method we present performs significantly better than standard methods that disregard the structure.It is possible to imagine several other cases for which we have not yet had the opportunity to test our methodology.We will only mention two here; citation network analysis and time series alignment: Citation networks are partial orders, where the order is defined by the citations.If we perform order preserving clustering in the above sense on citation networks, the clusters will contain related research, and the clusters will be ordered according to appearance relative other related research.This differs from clustering with regards to time: when clustering with time as a parameter, you have to choose, implicitly or explicitly, a time interval for each cluster.When the citation graph is used for ordering, the clusters will contain research that occurred in parallel, citing similar sources, and being cited by similar sources, regardless to whether they occurred in some particular time interval.
A time series is a totally ordered set of events, so that a family of time series is a partially ordered set.Assume that you want to do time series alignment, matching events from one time series with events from another, but for some reason the time stamps are corrupted and cannot be used for this purpose.Given a measure of (dis-)similarity between events, we can cluster the events to figure out which events are the more similar.Since an optimal order preserving clustering is one that both preserves all event orders and matches the most similar events across the time series, ideally the result is a series of clusters with each cluster containing the events that correspond to each other across the time series.

Problem overview
Given a set X together with a notion of (dis-)similarity between the elements of X, a hierarchical agglomerative clustering can be obtained as follows (Jain and Dubes, 1988, §3.2): 1. Start by placing each element of X in a separate cluster.
2. Pick the two clusters that are most similar according to the (dis-)similarity measure, and combine them into one cluster by taking their union.
3. If all elements of X are in the same cluster, we are done.Otherwise, go to Step 2 and continue.
The result from this process is a dendrogram; a tree structure showing the sequence of the clustering process (Figure 1).Each path in this diagram, starting at the bottom and advancing upwards, represents a hierarchical clustering.But, since we are required to preserve the strict order relation, we cannot merge any more elements than what we see here.This means that we will never obtain dendrograms like the one in Figure 1, that joins at the top when all elements are placed in a single cluster.Rather, the output of hierarchical agglomerative clustering would take the form of partial dendrograms like those of Figure 3.
To complicate matters, if both (a, d) and (a, c) are pairs of minimal dissimilarity, then they are both candidates for the first merge.From Figure 2 we can see that ad and ac are mutual exclusive merges, and that choosing one over the other leads to very different solutions.We therefore need a method to decide which candidate merge, or which candidate partial dendrogram, is the better.

Outline of our method and contributions
As our first contribution, to solve the problem of picking one candidate merge among a set of tied connections, we present a permutation invariant method for hierarchical agglomerative clustering.The method uses the classical linkage functions of single-, average-and complete linkage, but is optimisation based, as opposed to the algorithmic definition of classical hierarchical clustering.Recalling that every hierarchical clustering corresponds to a unique ultrametric (Jardine and Sibson, 1971), the optimisation criterion is that of minimising the matrix norm of the difference between the original dissimilarity and the ultrametric corresponding to the hierarchical clustering, a method known as ultrametric fitting (De Soete et al., 1987).
We have seen that order preserving hierarchical agglomerative clustering on strictly partially ordered sets leads to partial dendrograms.In order to evaluate the ultrametric fitting of a partial dendrogram, our next contribution is an embedding of partial dendrograms over a set into the family of ultrametrics over the same set.
Our main contribution, order preserving hierarchical agglomerative clustering of strictly partially ordered sets, is the combination of the two.We define an optimal order preserving hierarchical clustering to be the hierarchical clustering with the partial dendrogram that has the best ultrametric fit relative the original dissimilarity measure.
In want of an efficient algorithm, we present a method of approximation that can be computed in polynomial time.We demonstrate the approximation on synthetic data generated as random directed acyclic graphs and random dissimilarity measures, as well as on data from the parts database motivating this research.We evaluate the quality of the obtained clustering by computing the adjusted Rand index relative a planted partition (Hubert and Arabie, 1985).We provide a novel method for comparing two induced order relations using a modified adjusted Rand index, which we believe is a first of its kind.We also provide simple method for computing the level of order preservation of a clustering of an ordered set by counting the number of induced loops.
Beyond our main contribution, we believe that the embedding of partial dendrograms into ultrametrics may be of interest to a larger audience.The embedding provides a means for treating partial dendrograms as complete dendrograms, offering access to the entire rack of tools that already exists in this domain.An obvious example candidate is that of hierarchical clustering with must-link and no-link constraints.The no-link constraints will necessarily lead to partial dendrograms that can be easily evaluated in our framework.

Summary of contributions
Our main contribution is the theory for order preserving hierarchical agglomerative clustering for strict posets.Further contributions we wish to highlight are: • A theory for embedding partial dendrograms over a set into the set of complete dendrograms over the same set.
• An optimisation based, permutation invariant hierarchical clustering methodology for nonordered sets that is very similar to classical hierarchical clustering.
• A polynomial time approximation scheme for order preserving hierarchical agglomerative clustering • A novel method for comparison of induced order relations over a set based on the adjusted Rand index.
• A measure of the level of order preservation of a clustering of an ordered set.

Related work
Hierarchical agglomerative clustering is described in a plethora of books and articles, and we shall not try to give an account of that material.For an introduction to the subject, see (Jain and Dubes, 1988, §3.2).

Clustering of ordered data
There are quite a few articles presenting clustering of ordered data, placing themselves in one of two categories.
The first is clustering of sets where the (dis)similarity measure is replaced by information about whether one pair of elements is more similar than another pair of elements, for example based on user preferences.This is sometimes referred to as comparison based clustering.See the recent article by Ghoshdastidar et al. (2019) for an example and references.In this category, we also find the works of Janowitz (2010), providing a wholly order theoretic description of hierarchical clustering, including the case where the dissimilarity measure is replaced by a partially ordered set.
The second variant is to partition a family of ordered sets so that similarly ordered sets are associated with each other.Examples include the paper by Kamishima and Fujiki (2003), where they develop a variation of k-means, called k-o ′ means, for clustering preference data, each list of preferences being a totally ordered set.Other examples in this category include clustering of times series, identifying which times series are alike ( Luczak, 2016).
Our method differs from all of the above in that we cluster elements inside one ordered set through the use of a (dis)similarity measure, while maintaining the original orders of elements.

Clustering to detect order
Another variant is the detection of order relations in data through clustering: In (Carlsson et al., 2014), it is demonstrated how hierarchical agglomerative quasi-clustering can be used to deduce a partial order of "net flow" from an asymmetric network.
In this category, it is also worth mentioning dynamic time warping.This is a method for aligning time series, and can be considered as clustering across two time series that is indeed order preserving.See ( Luczak, 2016) for further references on this.

Acyclic graph partitioning problems
The problem of order preserving hierarchical agglomerative clustering can be said to belong to the family of acyclic graph partitioning problems (Herrmann et al., 2017).If we consider the strict partial order to be a directed acyclic graph (DAG), the task is to partition the vertices into groups so that the groups together with the arrows still makes up a DAG.
Graph partitioning has received a substantial attention from researchers, especially within computer science, over the last 50 years.Two important fields of application of this theory are VLSI and parallel execution.
In VLSI, short for Very Large Scale Integration, the problem can be formulated as follows: Given a set of micro processors, the wires that connect them, and a set of circuit boards, how do you best place the processors on the circuit boards in order to optimise a given objective function?Typically, a part of the objective function is to minimise the wire length.But other features may also be part of the optimisation, such as the amount or volume of traffic between certain processors etc. (Markov et al., 2015) For parallel processing, the input data is a set of tasks to be executed.The tasks are organised as a DAG, where predecessors must be executed before descendants.Given a finite number of processors, the problem is to group the tasks so that they can be run group-wise on a processor, or running groups in parallel on different processors, in order to execute all tasks as quickly as possible.Typically additional information available is memory requirements, expected execution times for the tasks, etc. (Buluç et al., 2016) It is not difficult to understand why both areas have received attention, being essential in the development of modern computers.The development of theory and methods has been both successful and abundant, and a large array of techniques are available, both academic and commercially.
Although both problems do indeed perform clustering of strict partial orders, their solutions are not directly transferable to exploratory data analysis.Mostly because they have very specific constraints and objectives originating from their respective problem domains.
The method we propose in this paper has as input a strict partial order (equivalently; a DAG) together with an arbitrary dissimilarity measure.We then use the classical linkage functions single-, average-, and complete linkage to suggest clusterings of the vertices from the input dataset, while preserving the original order relation.
Our method therefore places itself firmly in the family of acyclic graph partitioning methodologies, but with different motivation, objective and solution, compared to existing methods.

Hierarchical clustering as an optimisation problem
Several publications aim at solving hierarchical clustering in terms of optimisation.However, due to the procedural nature of classical hierarchical clustering, combined with the linkage functions, pinning down an objective function may be an impossible task.Especially since classical hierarchical clustering is not even well defined for complete linkage in the presence of tied connections.This leads to a general abandonment of linkage functions in optimisation based hierarchical clustering.
Quite commonly, optimisation based hierarchical clustering is done in terms of ultrametric fitting.That is, it aims to find an ultrametric that is as close to the original dissimilarity measure as possible, perhaps adding some additional constraints (Gilpin et al., 2013;Chierchia and Perret, 2019).It is well known that solving single linkage hierarchical clustering is equivalent to finding the so called maximal sub-dominant ultrametric.That is; the ultrametric that is pointwise maximal among all ultrametrics not exceeding the original dissimilarity (Rammal et al., 1986).But for the other linkage functions, there is no equivalent result.
Optimisation based hierarchical clustering therefore generally present alternative definitions of hierarchical clustering.Quite often based on objective functions that originate from some particular domain.Exceptions from this are, for example, Ward's method (Ward, 1963), where the topology of the clusters are the focus of the objective, and also the recent addition by Dasgupta (2016), where the optimisation aims towards topological properties of the generated dendrogram.
Although our method is, eventually, based on ultrametric fitting, we optimise over a very particular set of dendrograms.Namely the dendrograms that can be generated through classical hierarchical clustering with linkage functions.It is therefore reasonable to claim that our method places itself between classical hierarchical clustering and optimised models.

Clustering with constraints
A significant amount of research has been devoted to the topic of clustering with constraints in the form of pairwise must-link or no-link constraints, often in addition to other constraints, such as minimal-and maximal distance constraints, and so on.Some work as also been done on hierarchical agglomerative clustering with constraints, starting with the works of Davidson and Ravi (2005).For a thorough treatment of constrained clustering, see (Basu et al., 2008).
Order preserving clustering (as well as acyclic partitioning) can be seen as a particular version of constrained clustering, where the constraint is a directed, transitive cannot-link constraint.A type of constraint that is not found in the constrained clustering literature.

Clustering in information networks
A large amount of research has been conducted on the problem of clustering nodes in networks, and a more recent field of research is that of clustering data organised in heterogeneous information networks, or HINs for short (Pio et al., 2018).A HIN is an undirected graph where both vertices and edges may have different, or even multiple, types.RDF graphs (Lassila and Swick, 1999) is but one example of HINs.In a sense, we can say that the availability of multiple types allow HINs to model the real world more closely, but with the penalty of increased complexity.It is fair to consider HIN clustering a generalisation of classical network clustering, where, in the classical setting, all vertices and edges are of one common type.
However, the general case in clustering both classical networks and HINs is that although the network structure serves to influence the clustering, the structure is usually lost in the clustering.The most classical example is where connectedness between vertices contribute to vertex similarity, and then the most connected vertices (clique-like subgraphs) are clustered together.Although this can be seen as a type of relation preserving clustering, in order preserving clustering, the opposite is taking place: the more connected two vertices are, the more reason not to place them in the same cluster.Indeed, as we show in Section 4, for the theory we present in this paper, two elements can only be clustered together if there are no paths connecting them.
An example of HIN clustering that is structure preserving is (Li et al., 2017).A HIN comes with a schema, or a schematic graph, describing which types are related to which other types.For Li et al. (2017), the goal is to cluster each set of same-type nodes according to a discovered similarity measure.The result is thus a schematic graph where each node is a clustering of vertices of the same type.This differs from the problem we study in that we do not know which elements are of the same type; to discover this is the goal of the clustering.Hence, the problems are similar but different; we could rephrase our problem as that of deriving a directed schematic graph from unlabeled vertices, where each vertex in the schematic graph is a set of equivalent machine parts, and the directed edges are the part-of relations.

Organisation of the remainder of this paper
Section 2 provides necessary background material.
In Section 3, we develop optimised hierarchical agglomerative clustering for non-ordered sets; our permutation invariant clustering model that is tailored especially to fit into our framework for agglomerative clustering of ordered sets.
In Section 4, we tackle the problem of order preservation during clustering: We define what we mean by order preservation, and classify exactly the clusterings that are order preserving.We also provide concise necessary and sufficient conditions for an hierarchical agglomerative clustering algorithm to be order preserving.
Section 5 defines partial dendrograms and develops the embedding of partial dendrograms over an ordered set into the family of ultrametrics over the same set.
Our main result, order preserving hierarchical agglomerative clustering for strict partial orders, is presented Section 6.
Section 7 provides a polynomial time approximation scheme for our method, and Section 8 demonstrates the efficacy of the approximation on synthetic data.
Section 9 presents the results from applying our approximation method to a subset of the data in the parts database, comparing with existing methods, and finally, Section 10 closes the article with some concluding remarks, and a list of future work topics.

Background
In this section we recall basic background material.We start by recollecting the required order-theoretical tools together with equivalence relations, before recalling classical hierarchical clustering.

Relations
Definition 1.A relation R on a set X is a subset R ⊆ X × X, and we say that x and y are related if (x, y) ∈ R. The short hand notation aRb is equivalent to writing (a, b) ∈ R.

Strict and non-strict partial orders
A strict partial order on a set X is a relation S on X that is irreflexive and transitive.Recall that, an irreflexive and transitive relation is also anti-symmetric.A strictly partially ordered set, or a strict poset, is a pair (X, S), where X is a set and S is a strict partial order on X.We commonly denote a strict partial order by the symbol <.
On the other hand a partial order on X is a relation P on X that is reflexive, asymmetric and transitive, and the pair (X, P ) is called a partially ordered set, or a poset.The usual notation for a partial order is ≤.
We shall just refer to strict and non-strict partial orders as orders, unless there is any need for disambiguation: If R is an order on X, we say that a, b ∈ X are comparable if either (a, b) ∈ R or (b, a) ∈ R. And, if every pair of elements in X are comparable, we call X totally ordered.A totally ordered subset of an ordered set is called a chain, and a subset where no two elements are comparable is called an antichain.We denote non-comparability by a⊥b.That is, for any elements a, b in an antichain, we have a⊥b.
A cycle in a relation E is a sequence in E on the form (a, b 1 ), (b 1 , b 2 ), . . ., (b n , a).The transitive closure of E is the minimal set E for which the following holds: If there is a sequence of pairs (a 1 , a 2 ), (a 2 , a 3 ), . . ., (a n−1 , a n ) in E, then (a 1 , a n ) ∈ E.
Let (X, E) be an ordered set.An element x 0 ∈ X is a minimal element if there is no element y ∈ X − {x 0 } for which (y, x 0 ) ∈ E. Dually, y 0 is a maximal element if there is no x ∈ X − {y 0 } for which (y 0 , x) ∈ E. If (X, E) has a unique minimal element, then this is called the bottom element or the least element, and a unique maximal element is called the top element or the greatest element.
Finally, a map f : (X, , and if f is a set isomorphism (that is, a bijection) for which f −1 is also order preserving, we say that f is an order isomorphism, and that the sets (X, < X ) and (Y, < Y ) are order isomorphic, writing (X, < X ) ≈ (Y, < Y ).

Partitions and equivalence relations
A partition of X is a collection of disjoint subsets of X, the union of which is X.The family of all partitions of X, denoted P(X), has a natural partial order defined by partition-refinement: If A = {A i } i and B = {B j } j are partitions of X, we say that A is a refinement of B, writing A ⋐ B, if, for every A i ∈ A there exists a B j ∈ B such that A i ⊆ B j .The sets of a partition are referred to as blocks.
An equivalence relation is a relation R on X that is reflexive, symmetric and transitive.Let the family of all equivalence relations over a set X be denoted by R(X).If R ∈ R(X) and (x, y) ∈ R, we say that x and y are equivalent, writing x ∼ y.The maximal set of elements equivalent to x ∈ X is called the equivalence class of x, and is denoted [x].R(X) is also partially ordered, but by subset inclusion: that is, for R, S ∈ R(X), we say that R is less than or equal to S if and only if R ⊆ S .
The quotient of X modulo R, denoted X/R, is the set of equivalence classes of X under R. Notice that [x] is an element of X/R, but a subset of X.Since the equivalence classes are subsets of X that together cover X, X/R is a partition of X with equivalence classes being the blocks of the partition.The family of partitions of X is in a one-to-one correspondence with the equivalence relations of X, and the correspondence is order preserving; if A = X/A and B = X/B, we have Both P(X) and R(X) have top-and bottom elements: The least element of P(X) is the singleton partition S(X), where each element is in a block by itself: S(X) = {{x} | x ∈ X}.The singleton partition corresponds to the diagonal equivalence relation, given by ∆(X) = {(x, x) | x ∈ X}, which is the least element of R(X).The greatest element of P(X) is the trivial partition {X}, corresponding to the equivalence relation X × X, where all element are equivalent.That is and {X} = X/(X × X).
If A and B are partitions of X with A being a refinement of B, we say that A is finer than B, and that B is coarser than A. We use the exact same terminology for the corresponding equivalence relations.
For a subset A ⊆ X, let the notation X/A denote the partition of X where all of A is one equivalence class, and the rest of X remains as singletons.Formally, this corresponds to the equivalence relation R A = ∆(X) ∪ (A × A).And finally, the quotient map corresponding to an equivalence relation That is, q R sends each element to its equivalence class.

Classical hierarchical clustering
In this section, we recall classical hierarchical clustering in terms of Jardine and Sibson (1971).Our theory builds directly on the theory for classical hierarchical clustering, so we need to provide a fair bit of detail, especially in view of the fact that there is a general lack of standardised notation for hierarchical clustering theory, and that the level of formality in definitions and notation varies among publications.
We start by recalling the formal definition of a dendrogram, before recalling dissimilarity measures and ultrametrics.Thereafter, we recall linkage functions, and at the end of the section, we tie all the concepts together and provide a definition of classical hierarchical agglomerative clustering.
Definition 2. A clustering of a set X is a partition of X, and a hierarchical clustering is a chain in P(X) containing both the bottom and top elements.A cluster in a clustering is a block in the partition.The elements in bold make up a chain in P(X) that contains both the bottom-and top elements, and therefore constitutes a hierarchical clustering of X.
Alternatively, a clustering of X is an equivalence relation R ∈ R(X), and a hierarchical clustering is a chain in R(X) containing both the bottom-and top elements of R(X).A cluster is, then, an equivalence class in X/R.We will refer to clusters as equivalence classes, clusters or blocks depending on the context, all terms being frequently used in clustering literature.

Dendrograms
For the remainder of the paper, let R + denote the non-negative reals.We generally assume that R + is equipped with the usual total order ≤.Now, for a set X, let P(X) be partially ordered by partition refinement, and let θ : R + → P(X) be an order preserving map.Consider the following list of possible properties of θ: If θ satisfies D1, then θ corresponds to what Carlsson and Mémoli (2013) refers to as a persistent set.If θ satisfies D1 and D2, then θ is what Jardine and Sibson (1971) refers to as a numerically stratified dendrogram, and if θ also satisfies D3, then Jardine and Sibson refer to θ as a definite numerically stratified dendrogram.Furthermore, the concept we have referred to as partial dendrograms corresponds to θ satisfying D1 and D3, so a partial dendrogram is the same as a definite persistent set.
It is the authors' impression that the current use of the term dendrogram in conjunction to classical hierarchical clustering mainly covers what Jardine and Sibson call a definite numerically stratified dendrogram.We thus land on the following definitions: Definition 3. A dendrogram over X is an order preserving map θ : R + → P(X) satisfying axioms D1, D2 and D3.If θ satisfies D1 and D3, we call θ a partial dendrogram over X.
We will use the term dendrogram to denote both the graphical and the functional representation.If im(θ) = {B i } n i=0 , we assume that the enumeration is compatible with the order relation on P(X); in other words, that {B i } n i=0 is a chain in P(X).We denote the family of all dendrograms over X by D(X), and the family of all partial dendrograms over X by PD(X).

Dissimilarity measures and ultrametrics
we call d an ultrametric (Rammal et al., 1986).The pair (X, d) is correspondingly called a dissimilarity space or an ultrametric space.The family of all dissimilarity measures over X is denoted by M(X), and the family of all ultrametrics by U(X).
Example 2 (Ultrametric).Property d3 is referred to as the ultrametric inequality, and is a strengthening of the usual triangle inequality.In an ultrametric space (X, u), every triple of points is arranged in an isosceles triangle: Let a, b, c ∈ X, and let the pair a, b be of minimal distance such that u(a, b) ≤ min{u(a, c), u(b, c)}.The ultrametric inequality gives us Ultrametrics show up in many different contexts, such as p-Adic number theory (Holly, 2001), infinite trees (Hughes, 2004), numerical taxonomy (Sneath and Sokal, 1973) and also within physics (Rammal et al., 1986), just to cite a few.For hierarchical clustering, ultrametrics are relevant because the dendrograms over a set are in a bijective relation to the ultrametrics over the same set (Carlsson and Mémoli, 2010).
We shall also need the following terms, which apply to any dissimilarity space: The diameter of (X, d) is given by the maximal inter-point distance: And the separation of (X, d) is the minimal inter point distance: It is a well known fact that there exists an injective map from dendrograms to ultrametrics (Jardine and Sibson, 1971): In (Carlsson and Mémoli, 2010) the map Ψ X is shown to be a bijection.If θ ∈ D(X), the map is defined as That is, the ultrametric distance is the least real number t for which θ maps to a partition where x and y are in the same block.The minimisation is well defined due to Axiom D1.The ultrametric can be read from the diagrammatic representation of the dendrogram as the minimum height you have to ascend to in order to traverse from one element to the other following the paths in the tree.

Classical hierarchical clustering
We first need to recall linkage functions.Our definition follows the lines of Carlsson and Mémoli (2010): Definition 4. Let P(X) denote the power set of X.A linkage functions on X is a map so that for each partition Q ∈ P(X) and dissimilarity measure d ∈ M(X), the restriction L| Q×Q×{d} is a dissimilarity measure on Q.
The classical linkage functions are defined as Definition 5 (Classical HC).Given a dissimilarity space (X, d) and a linkage function L, if we follow the procedure outlined in Section 1.2, using L as the "notion of dissimilarity", the result is a chain of partitions at which the partitions were formed.The sequence of pairs We define a classical hierarchical clustering of (X, d) using L to be a dendrogram Otherwise, the ρ i will not make up a monotone sequence, and the resulting function θ Q will not be an order preserving map.Although all of SL, AL and CL satisfy (3), it is fully possible to define linkage functions that do not.
At any point during the clustering process, if we encounter a partition Q with two distinct pairs of elements (p 1 , q 1 ), (p 2 , q 2 ) ∈ Q × Q for which we say that the two connections are tied, since they are both eligible candidates for the next merge.It is well known that HC SL is invariant with respect to the order of resolution of ties (Jardine and Sibson, 1971), a property referred to as being permutation invariant, a characteristic shared by neither HC AL nor HC CL .

Optimised hierarchical clustering
In this section we devise a permutation invariant version of hierarchical clustering based on the classical definition.The key to permutation invariance is in dealing with tied connections.If we consider the procedure for hierarchical clustering outlined in Section 1.2, we can resolve tied connections by picking a random minimal dissimilarity pair.The way the procedure is specified, this turns HC L into a non-deterministic algorithm; it may produce different dendrograms for the same input in the presence of ties, depending on which tied pair is selected.But more importantly, it is capable of producing any dendrogram that can be produced by any tie resolution order: Definition 7. Given a dissimilarity space (X, d) and a linkage function L, let D L (X, d) be the set of all possible outputs from HC L (X, d).
A dissimilarity measure d over a finite set X can be described as an |X| × |X| real matrix [d i,j ].Hence, given an ultrametric u ∈ U(X) we can compute the pointwise difference We suggest the following definition, recalling the definition of Ψ X (1): Definition 8. Given a dissimilarity space (X, d) and a linkage function L, the optimised hierarchical agglomerative clustering over (X, d) using L is given by That is; among all dendrograms that can be generated by HC L (X, d), optimised hierarchical agglomerative clustering picks the dendrogram that is closest to the original dissimilarity measure.In the tradition of ultrametric fitting, this is the right choice of candidate.
As D L (X, d) contains all dendrograms generated over all possible permutations of enumerations of X, the below theorem follows directly from Definition 8: Theorem 9. HC L opt is permutation invariant.That is, the order of enumeration of the elements of the set X does not affect the output from HC L opt (X, d).
And since HC SL is permutation invariant, we have Since HC AL and HC CL are not permutation invariant, there is no corresponding result in these cases.For complete linkage, however, we have the following theorem.First, notice that due to the definition of complete linkage (Definition 4), if θ is a solution to HC CL opt (X, d) and u = Ψ X (θ) is the corresponding ultrametric, then Hence, in the case of complete linkage we can reformulate (5) as follows: To see why this is the case, notice that if u, u ′ ∈ M(X) and both d ≤ u and d ≤ u ′ pointwise, then we can produce two non-negative functions δ, δ ′ on X × X so that u = d + δ and u ′ = d + δ ′ .
To reduce clique to HC CL opt , define a dissimilarity measure on V as follows:

And since we have
It follows that a largest possible cluster at proximity level 1 is a maximal clique in G.
We claim that minimising the norm is equivalent to producing a maximal cluster at proximity level 1: Let d be the , then these are exactly the blocks that are subsets of cliques, so each V i contributes with Having more ones reduces the norm of d.Let V j be of maximal cardinality in {V i } s i=1 .Assume first that V j has at least two elements more than the next to largest block, and let Removing one element from V j reduces the number of ones in the dissimilarity matrix by P (P −1)−(P −1)(P −2) = 2(P −1).Let the next to largest block have Q elements.Transferring the element to this block then increases the number of ones by this means that the total number of ones is reduced by moving an element from the largest block to any of the smaller blocks.Hence, achieving the largest possible number of ones implies maximising the size of the largest block.
If now, V j only has one element more than the next to largest block, moving an element as above corresponds to keeping the number of ones.Since each V i for 1 ≤ i ≤ s is a subset of a clique in G, the maximal number of ones is achieved by producing a block V j that contains exactly a maximal clique of G.
Therefore, if I {1} (x) is the indicator function for the set {1}, the size of a maximal clique in G can be computed as counting the maximal number of row-wise ones in [d i,j ] in O(N 2 ) time.We therefore conclude that HC CL opt is NP-hard.
The computational hardness of HC CL opt is directly connected to the presence of tied connections: every encounter of n tied connections leads to n! new candidate solutions.

Since neither HC AL
opt is permutation invariant, the authors strongly believe that this is also NP-hard, although that remains to be proven.
We cannot in general expect the mapping θ → Ψ X (θ) − d p to be injective, meaning that the answer to (5) may not be unique.Recall that P(X) denotes the power set of X.We shall consider HC L opt (X, −) to be the function mapping a dissimilarity measure over X to a set of dendrograms over X.

Other permutation invariant solutions
Carlsson and Mémoli ( 2010) offer an alternative approach to permutation invariant hierarchical agglomerative clustering.In their solution, when they face a set of tied connections, they merge all tied the pairs in one operation, resulting in permutation invariance.
In the case of order preserving clustering, a family of tied connections can contain several mutually exclusive merges due to the order relation.Using the method of Carlsson and Mémoli leads to a problem of figuring which blocks of tied connections to merge together, and in which combinations and order.This leads to a combinatorial explosion of alternatives.The method we have suggested is utterly simple, but it is designed to circumvent this very problem.

Order preserving clustering
In this section, we determine what it means for an equivalence relation to be order preserving with regards to a strict partial order, and establish precise conditions that are necessary and sufficient for a hierarchical agglomerative clustering algorithm to be order preserving.

Order preserving equivalence relations
Recalling the definition of a clustering (Definition 2), let (X, <) be a strict poset.If R is an equivalence relation on X with quotient map q : X → X/R, we have already established, in Section 1.1, that we require ∀x, y∈X : x < y ⇒ q(x) < ′ q(y).
That is, we are looking for a particular class of equivalence relations; namely those for which the quotient map is order preserving.
Given a strict poset (X, E), there is a particular induced relation on the quotient set X/R for any equivalence relation R ∈ R(X) (Blyth, 2005, §3.1): Definition 12.Given a strict poset (X, E) and an equivalence relation R ∈ R(X), first define the relation S 0 on X by The transitive closure of S 0 is called the relation on X/R induced by E. We denote this relation by S.
Example 3.An instructive illustration of what the relation S 0 looks like for a strict poset (X, <) under the equivalence relation R is that of an R-fence (Blyth, 2005), or just fence, for short: Triple lines represent equivalences under R, and the arrows represent the order on (X, <).The fence illustrates visually how one can traverse from a 1 to b n along arrows and through equivalence classes in X/R, and in that case we say that the fence links b 1 to a n .The induced relation S has the property that (a, b) ∈ S if there exists an R-fence in X linking a to b.
Recall that a cycle in a relation R is a sequence of pairs starting and ending with the same element: (a, b 1 ), (b 1 , b 2 ), . . ., (b n , a).The below theorem is an adaptation of (Blyth, 2005, Thm.3.1) to strict partial orders.
Theorem 13.Let (X, E) be a strict poset, R ∈ R(X), and let S be the relation on X/R induced by E. Then the following statements are equivalent: 1. S is a strict partial order on X/R; 2. There are no cycles in S 0 ; 3. q R : (X, E) −→ (X/R, S) is order preserving.
Proof.From the definition of strict posets, they contain no cycles, so 1 ⇒ 2. Since a non-cyclic set is irreflexive, and since S is transitive by construction, 2 ⇒ 1.
Let q R be order preserving.Notice that if S 0 is the set defined in (9), we have S 0 = q R × q R (E).In particular, for all x, y ∈ X for which (x, y) ∈ E, we have ([x], [y]) ∈ S 0 .Assume that S is not a strict order.Then there is a cycle in S 0 ; that is there are x, y ∈ X for which (x, y) ∈ E, but ([y], [x]) ∈ S 0 also.This yields This yields a ∼ a ′ and b ∼ b ′ , so we have But, since we have both q R (a) = q R (a ′ ) and (a, b) ∈ E, this contradicts the fact that q R is order preserving, so our assumption that both ([x], [y]) and ([y], [x]) are elements of S 0 must be wrong.Hence, if q R is order preserving, there are no cycles in S 0 , and S is a strict partial order on X/R.This shows that 3 ⇒ 1.
Finally, let S be a strict partial order, and assume that q R is not order preserving.Then, there exists x, y ∈ X where (x, y) ∈ E and for which at least one of ([x], [y]) ∈ S or ([y], [x]) ∈ S holds.Now, ([x], [y]) ∈ S by Definition 12. Therefore, ([y], [x]) ∈ S implies that S has a cycle, contradicting the fact that S is a strict partial order.Definition 14.Let (X, E) be a strict poset.An equivalence relation R ∈ R(X) is regular if there exists an order on X/R for which the quotient map is order preserving.We denote the set of all regular equivalence relations over an ordered set (X, <) by R(X, <).Likewise, the family of all regular partitions of (X, <) is denoted P(X, <).
In general, we will denote the induced order relation for a strict poset (X, <) and a regular equivalence relation R ∈ R(X, <) by < ′ .

The structure of regular equivalence relations
We now establish a sufficient and necessary condition for an agglomerative clustering algorithm to be order preserving.Recall that, if A ⊆ X, X/A denotes the quotient for which the quotient map q A : X → X/A sends all of A to a point, and is the identity otherwise.That is, for every x, y ∈ X, we have q A (x) = q A (y) ⇔ x, y ∈ A.
Theorem 15.If A ⊆ X for a strict poset (X, <), the quotient map q A : X → X/A is order preserving if and only if A is an antichain in (X, <).
Proof.If A is not an antichain, then X/A places comparable elements in the same equivalence class, so q A is not order preserving.Assume A is an antichain.If q A is not order preserving, then there is a cycle in (X/A, < ′ ), and since we have only one non-singleton equivalence class, the cycle must be on the form b A c.
But this means we have a, a ′ ∈ A for which b < a and a ′ < c, but since c < b, this implies a ′ < a, contradicting the fact that A is an antichain.
Since a composition of order preserving maps is order preserving, this also applies to a composition of quotient maps for a chain of regular equivalence relations Combining this with Theorem 15, we have the following: A clustering of a strict poset will be order preserving if it can be produced as a sequence of pairwise merges of non-comparable elements.
We close the section with an observation about the family of all hierarchical clusterings over a strict poset: Theorem 16.For a strict poset (X, <), the set P(X, <) of regular partitions over (X, <) has S(X) as its least element.Unless < is the empty order, there is no greatest element.
Proof.S(X) is always a regular partition, so S(X) ∈ P(X, <).And since S(X) is a refinement of every partition of X, S(X) is the least element of P(X, <).
If the order relation is not empty, then there are at least two elements that are comparable, and, according to Theorem 15, they cannot be in the same equivalence class.Hence, there is no greatest element.
The situation of Theorem 16 is depicted in Figure 2, and has already been discussed in Section 1.2:In the case of tied connections that represent mutually exclusive merges, choosing to merge one connection over the other may lead to very different results.We therefore need a strategy to select one of these solutions over the others.This will be the main focus of Sections 5 and 6.

Partial dendrograms
In this section, we construct the embedding of partial dendrograms into ultrametrics.Let an ordered dissimilarity space be denoted by (X, <, d).We generally assume that the order relation is non-empty, meaning that there are comparable elements in (X, <).Recall the partial dendrograms of Figure 3, and the mathematical definition of a partial dendrogram in Definition 3. Partial dendrograms are clearly a generalisation of dendrograms.To distinguish between the two, we will occasionally refer to the non-partial dendrograms as complete dendrograms.
For a partial dendrogram θ, we will write θ(∞) to denote the maximal partition in the image of θ.The only difference between a partial dendrogram and a complete dendrogram is that for a partial dendrogram we do not require a greatest element in the image of θ.However, since P(X, <) is finite, a partial dendrogram θ ∈ PD(X, <) is eventually constant ; that is, there exists a positive real number t 0 for which We call the smallest such number the diameter of θ, formally given by Looking at the partial dendrograms of Figure 3, each connected component in a partial dendrogram is a complete dendrogram over its leaf nodes.Since complete dendrograms map to ultrametrics, each connected component gives rise to an ultrametric on the subset of X constituted by the connected component's leaf nodes.That is, if θ(∞) = {B j } k j=1 , and if θ j is the complete dendrogram over B j for 1 ≤ j ≤ k, we can define the ultrametrics u j = Ψ Bj (θ j ) so that {(B j , u j )} k j=1 is a disjoint family of ultrametric spaces, which union covers X.Now consider the following general result.
Lemma 17.Given a family of bounded, disjoint ultrametric spaces {(X j , d j )} n j=1 together with a positive real number K ≥ max j {diam(X j , d j )}, the map given by Proof.To prove that the ultrametric inequality holds, we start by showing that d ∪1,2 is an ultrametric on the restriction to the disjoint union X 1 ∪ X 2 : Let x, y ∈ X 1 and z ∈ X 2 , and choose a positive K ≥ max{diam(X 1 , d 1 ), diam(X 2 , d 2 )}.We now have This means that every triple of points are either already contained in an ultrametric space, or they make up an isosceles triangle.In both cases, the ultrametric inequality holds, according to the observation in Example 2.
By induction, we can now prove that (X 1 ∪ X 2 ) ∪ X 3 ), d ∪1,2,3 is an ultrametric space, and so on, until all the (X j , d j ) are included.
Definition 18.Given an ordered space (X, <) and a non-negative real number ε, the ultrametric completion on ε is the map U ε : PD(X, <) −→ U(X) mapping where u θ is defined as in (10), setting K = diam(θ) + ε.The above discussion serves to show that the construction is well defined.Our next goal is two-fold.First, we wish to provide an (explicit) function from partial dendrograms to dendrograms that realises this map.And second, we wish to establish conditions for this function to be an embedding; that is, an injective map.Injectivity is not strictly required for the theory to work, but it increases the discriminative power of the theory.An example to the contrary is provided towards the end of the section.
We have the map Ψ X : D(X) −→ U(X) from (1), mapping dendrograms to ultrametrics.We now seek a map κ ε : PD(X, <) −→ D(X) making the following diagram commute: Seeing that κ ε must map partial dendrograms to complete dendrograms, a quick glance at Figure 4 suggests the following definition: Proof.Assume first that θ ∈ PD(X, <) is a proper partial dendrogram, and that im(θ) = {B i } n i=0 .Let the coarsest partition in the image of θ be given by B n = {B j } m j=1 .That is, each block B j corresponds to a connected component in the partial dendrogram.Pick a block B ∈ B n and assume x, y ∈ B. If then B k is the finest partition containing all of B in one block.Since B ⊆ X, the partitions constitute a chain in P(B) containing both S(B) and {B}.Hence, we can construct a complete dendrogram over B by defining This is exactly the complete dendrogram corresponding to the connected component of the tree over X having the elements of B as leaf nodes.By Definition 18, Due to (12), we have Hence, by the definition of Ψ X in (1) we conclude that Combining this with (13), we get that whenever x, y ∈ B, we have Ψ X • κ ε = U ε .
On the other side, let x ∈ B i and y ∈ B j with i = j.By definition, we have U ε (θ)(x, y) = diam(θ) + ε.And, since there is no block in θ(∞) containing both x and y, we find that the minimal partition in im(κ ε (θ)) containing x and y in one block is {X}.But this means that Ψ X (κ ε (θ))(x, y) = diam(θ) + ε, so Ψ X • κ ε = U ε holds in this case too.
Theorem 20.Let (X, <) be a strict poset with a non-empty order relation.
Example 5.If ε is not chosen to be strictly positive, the map U ε will not necessarily be injective.Consider the below dendrograms.This illustrates what we mean by reduced discriminative power in the case of a noninjective completion.Since the partial dendrograms exhibit distinctively different information, it is desirable that the methodology can distinguish them.

Hierarchical clustering of ordered sets
We are now ready to embark on the specification of order preserving hierarchical clustering of ordered sets.We do this by extending our notion of optimised hierarchical clustering from Section 3. Consider the following modification of classical hierarchical clustering.The only difference is that for each iteration, we check that there are elements that actually can be merged while preserving the order relation.According to Theorem 15, this means merging a pair of noncomparable elements at each iteration.Recall that S(X) denotes the singleton partition of X.
Let (X, <, d) be given together with a linkage function L.
2. Among the pairs of non-comparable clusters, pick a pair of minimal dissimilarity according to L, and combine them into one cluster by taking their union.
3. Endow the new clustering with the induced order relation.
4. If all elements of X are in the same cluster, or if all clusters are comparable, we are done.Otherwise, go to Step 2 and continue.
The procedure results in a chain of ordered partitions {(Q i , < i )} m i=0 together with the dissimilarities {ρ i } m i=0 at which the partitions where formed.For an ordered set (X, <), recall that non-comparability of a, b ∈ X is denoted a⊥b.Let the non-comparable separation of (X, <, d), be given by The reader may wish to compare the following lemma to Remark 6.
Lemma 21.The sequence of pairs {(Q i , ρ i )} m i=0 produced by the above procedure maps to a partial dendrogram through application of (2) if and only if Since the singleton partition Q 0 maps to a partial dendrogram, the algorithm will produce a partial dendrogram for any ordered dissimilarity space, and since there can be at most |X| − 1 merges, the procedure always terminates.
As for classical hierarchical clustering, the procedure is non-deterministic in the sense that given a set of tied pairs, we may pick a random pair for the next merge.Hence, the procedure is capable of producing partial dendrograms for all possible tie resolution strategies: Definition 22.Given an ordered dissimilarity space (X, <, d) and a linkage function L, we write D L (X, <, d) to denote the set of all possible outputs from the above procedure The set D L (X, <, d) differs from D L (X, d) in two important ways: • D L (X, <, d) contains partial dendrograms, not dendrograms.
• The cardinality of D L (X, <, d) is at least that of D L (X, d), and often higher, due to mutually exclusive merges and the "dead ends" in P(X, <) (see Figure 2).
Even for single linkage we have D SL (X, <, d) > 1 if there are mutually exclusive tied connections.
In the spirit of optimised hierarchical clustering, we suggest the following definition, employing the ultrametric completion U ε from Definition 18: Definition 23.Given an ordered dissimilarity space (X, <, d) together with a linkage function L, let ε > 0. An order preserving hierarchical agglomerative clustering using L and ε is given by The next theorem shows that if we remove the order relation, then optimised clustering and order preserving clustering coincide.Keep in mind that a dissimilarity space is an ordered dissimilarity space with an empty order relation; that is, (X, d) = (X, ∅, d).
Theorem 24.If the order relation is empty, then order preserving optimised hierarchical clustering and optimised hierarchical clustering coincide: where < Q denotes the (trivial) induced order.Hence, we have D L (X, ∅, d) = D L (X, d).Since U ε | D(X) = Ψ X , the result follows.

On the choice of ε
In HC <L opt,ε (X, <, d) we identify the elements from D L (X, <, d) that are closest to the dissimilarity measure d when measured in the p-norm.The injectivity of U ε induces a relation d,ε on PD(X, <) defined by and the optimisation finds the minimal elements under this order.
The choice of ε may affect the ordering of dendrograms under d,ε .We show this by providing an alternative formula for u − d p that better expresses the effect of the choice of ε.Assume θ is a partial dendrogram over (X, <) with θ(∞) = {B i } m i=1 , and let U ε (θ) = u.We split the sum for computing u − d p in two: the intra-block differences and the inter-block differences.The intra-block differences are independent of ε, and are given by On the other hand, the inter-block differences are dependent on ε, and can be computed as If we think of u as an approximation of d, and saying that |X| = N , the mean p-th error of this approximation can be expressed as a function of ε: From the formula for E d (ε|θ, p), we see that when ε becomes large, the inter-block differences dominate the approximation error.For increasing ε, having low error eventually equals having few inter-block pairs.Alternatively: the intra-block differences have insignificant influence on the approximation error for large ε.This means that as ε increases beyond diam(X, d), the partial dendrograms close to d will be those that have a low number of inter-block pairs, regardless of the quality of the intra-block ultrametric fit.From the standpoint of ultrametric fitting, this is intuitively wrong.Also, large ε will lead to clusterings where as many elements as possible are placed in one large cluster, since this is the most effective method for reducing the number of inter-block pairs.
On the other side, a low value of ε will move the weight towards optimising the intra-block ultrametric fit.Since the inter-block distances are all set to diam(θ) + ε, it is in the intra-block ultrametric fit we can make a difference in the optimisation.This will also reduce the bias of cluster size as a function of ε.
In all, it is the authors' opinion that this points towards selecting a low value for ε.In the process of choosing, we have the following result at our aid: Theorem 25.For any finite ordered dissimilarity space (X, <, d) and linkage function L, there exists an ε 0 > 0 for which That is; all ε ∈ (0, ε 0 ) induce the same order on the partial dendrograms.
Proof.Since X is finite, D L (X, <, d) is also finite.And according to E d (ε|θ, p), if the cardinality of D L (X, <, d) is n, there are at most pn positive values of ε that are distinct global minima of partial dendrograms in D L (X, <, d).But this means there is a finite set of ε for which the order on (D L (X, <), ε,p ) changes.And since all these values are strictly positive, they have a strictly positive lower bound.
Since the value of ε 0 depends on D L (X, <, d), it is non-trivial to compute.For practical applications, we recommend to choose a very small positive number for ε, but not so small that it becomes zero due to floating point rounding when added to the diameter of the partial dendrograms.

Idempotency of HC <L opt,ε
A detailed axiomatic analysis along the lines of for example Ackerman and Ben-David ( 2016) is beyond the scope of this paper, and is considered for future work.We still include a proof of idempotency of HC <L opt,ε , since this is an essential property of classical hierarchical clustering.Idempotency of hierarchical clustering necessarily depends on the linkage function.We introduce the following concept, that allows us to prove this property for a range of linkage functions: We say that L is a convex linkage function if we always have Notice that if u is an ultrametric on X, the ultrametric inequality yields This is to say that a convex linkage function preserves the structure of the original ultrametric when minimal dissimilarity elements are merged.As a result, for any u ∈ U(X), the set D L (X, u) contains exactly one element, namely the dendrogram corresponding to the ultrametric, which is why classical hierarchical clustering is idempotent.
For ordered spaces, the case is different.It is easy to construct an ordered ultrametric space (X, <, u) for which u(a, b) = sep(X, u) and a < b, in which case the ultrametric cannot be reproduced.Hence, all of U(X) cannot be fixed points under U ε • HC <L opt,ε (X, <, −), but the mapping is still idempotent: Theorem 26 (Idempotency).For an ordered dissimilarity space (X, <, d) and a convex linkage function L, we have θ Then each B i is an antichain in (X, <), so we have Since ε > 0, we also have And, lastly, since every pair of comparable elements are in pairwise different blocks, we have Now, since L is convex, based on the discussion preceding the theorem, the intra-block structure of every block will be preserved.And, since every inter-block dissimilarity is accompanied by comparability across blocks, the procedure for generation of D L (X, <, U ε (θ)) will exactly reproduce the intra block structure of all blocks and then halt.Hence, D L (X, <, U ε (θ)) = {θ}.

Polynomial time approximation
In the absence of an efficient algorithm for HC <L opt,ε , this section provides a polynomial time approximation scheme.The efficacy as approximation is demonstrated in Section 8, and a demonstration on real world data is given in Section 9.
Recall the set D L (X, <, d) of partial dendrograms over (X, <, d) from Definition 22.The algorithm for producing a random element of D L (X, <, d) is described at the beginning of Section 6; the key is to pick a random pair for merging whenever we encounter a set of tied connections.
The approximation model is deceivingly simple; we generate a set of random partial dendrograms, and choose the one with the best ultrametric fit.
Definition 27.Let (X, <, d) be given, and let N be a positive integer.For any random selection of N partial dendrograms We denote the N -fold approximation scheme by HC <L N,ε .

Running time complexity
Assume that |X| = n.In the worst case, we may have to check n 2 pairs to find one that is not comparable, and the test for a⊥b has complexity O(n 2 ), leading to a complexity of O(n 4 ) of finding a mergeable pair.Since there are up to n − 1 merges, the worst case estimate of the running time complexity for producing one element in D L (X, <, d) is O(n 5 ).
A part of this estimate is the number of comparability tests we have to perform in order to find a mergeable pair.For a sparse order relation, we may have to test significantly less than n 2 pairs before finding a mergeable pair: if K is the expected number of test we have to do, the expected complexity of finding a mergeable pair becomes O(Kn 2 ).This yields a total expected algorithmic complexity of O(Kn 3 ).If the order relation is empty, we have K = 1, and the complexity of producing a dendrogram becomes O(n 3 ), which is the running time complexity of classical hierarchical clustering.Hence, if the order relation is sparse, we can generally expect the algorithm to execute significantly faster than the worst case estimate.
When producing an N -fold approximation, the N random partial dendrograms can be generated in parallel, reducing the computational time of the approximation.For the required number of dendrograms to obtain a good approximation, please see Section 8.

Demonstration of approximation efficacy on randomly generated data
The purpose of the demonstration is to check to which degree the approximation reproduces the order preserving clusterings of HC <L opt,ε .We start by describing the random data model and the quality measures we use in assessing the efficacy of the approximation, before presenting the experimental setup and the results.

Random ordered dissimilarity spaces
To test the correctness and convergence ratio of the approximation scheme, we employ randomly generated ordered dissimilarity spaces.The random model consists of two parts: the random partial order and the random dissimilarity measure.

Random partial order
A partial order is equivalent to a transitively closed directed acyclic graph, so we can use any random model for directed acyclic graphs to generate random partial orders.We choose to use the classical Erdős-Rényi random graph model (Bollobás, 2001).Recall that a directed acyclic graph on n vertices is a binary n × n adjacency matrix that is permutation similar to a strictly upper triangular matrix; that is, there exists a permutation that, when applied to both the rows and the columns of one matrix, transforms it into the other.Let this family of n × n matrices be denoted by A(n).For a number p ∈ [0, 1], the sub-family A(n, p) ⊆ A(n) is defined as follows: for A ∈ A(n), let A ′ be strictly upper triangular and permutation similar to A. Then each entry above the diagonal of A ′ is 1 with probability p.The sought partial order is the transitive closure of this graph; we denote the corresponding set of transitively closed directed acyclic graphs by A(n, p).

Random dissimilarity measure
If |X| = n, a dissimilarity measure over X with no tied connections consists of n 2 distinct values.Hence, any permutation of the sequence {1, . . ., n 2 } is a non-tied random dissimilarity measure over X.
To generate tied connections, let t ≥ 1 be the expected number of ties per level.That is, for each unique value in the dissimilarity measure, that value is expected to have multiplicity t.In the case where t does not divide n 2 , we resolve this by setting the multiplicity of the largest dissimilarity to n 2 mod t .We write D(n, t) to denote the family of random dissimilarity measures over sets of n elements with an expected number of t ties per level.
Definition 28.Given positive integers n and t together with p ∈ [0, 1], the family of random ordered dissimilarity spaces generated by (n, p, t) is given by

Measures of cluster quality
In the demonstration, we start by generating a random ordered dissimilarity space.We then run the optimal clustering method on the space, finding the optimal order preserving hierarchical clustering.Finally, we run the approximation scheme on the space and study to which degree the approximation manages to reproduce the optimal hierarchical clustering.For this, we need a quantitative measure of clustering quality relative a known optimum.
A large body of literature exists on the topic of comparing clusterings (see for instance (Vinh et al., 2010) for a brief review).We have landed on the rather popular adjusted Rand index (Hubert and Arabie, 1985) to measure the ability of the approximation in finding a decent partition, comparing against the optimal result.Less work is done on this type of comparison for partial orders and directed acyclic graphs.We suggest to use a modified version of the adjusted Rand index for this purpose too, based on an adaptation of the Rand index used for network analysis (Hoffman et al., 2015).For an introduction to the Rand index, and also to some of the versions of the adjusted Rand index, see (Rand, 1971;Hubert and Arabie, 1985;Gates and Ahn, 2017).

Adjusted Rand index for partition quality
The Rand index compares two clusterings by computing the percentage of corresponding decisions made in forming the clusterings; that is, counting whether pairs of elements are placed together in both clusterings or apart in both clusterings.An adjusted Rand index reports in the range (−∞, 1], where zero is equivalent to a random draw, and anything above zero is better than chance.We use the adjusted Rand index (ARI) to compute the efficacy of the approximation in finding a partition close to a given planted partition.This corresponds to what Gates and Ahn (2017) refers to as a one sided Rand index, since one of the partitions are given, whereas the other is drawn from some distribution.In the below demonstration, we assume that the approximating partition is drawn from the set of all partitions over X under the uniform distribution.

Adjusted Rand index for induced order relations
When comparing induced orders on partitions over a set, unless the partitions coincide, it is not obvious which blocks in one partition correspond to which blocks in the other.To overcome this problem, we base our measurements on the base space projection: Definition 29.For an ordered set (X, E) and a partition Q of X with induced order E ′ , the base space projection of (Q, E ′ ) onto X is the order relation E Q on X defined as This allows us to compare the induced orders in terms of different orders on X.Notice that if the induced order E ′ is a [strict] partial order on Q, then E Q is a [strict] partial order on X. Hoffman et al. (2015) demonstrate that the adjusted Rand index can be used to detect missing links in networks by computing the similarity of edge sets.The concept relies on the fact that a network link and a link in an equivalence relation are not that different: Both networks and equivalence relations are special classes of relations, and the Rand index simply counts the number of coincidences and mismatches between two relation sets.While Hoffman et al. (2015) uses the ARI to compare elements within a network, we use the same method to compare across networks.
Let A and B be the adjacency matrices of two base space projections, and let A i denote the i-th row of A, and likewise for B i .If a, b is the inner product of a and b, we define Here, a i is the number of common direct descendants of i in both relations, b i is the number of descendants of i found in A but not in B, c i is the number of descendants of i in B but not in A, while d i counts the common non-descendants of i in the two relations.Using this, we can compute the element wise adjusted order Rand index measuring the element wise order correlation between the base space projections in the Hubert-Arabie adjusted Rand index (Hubert and Arabie, 1985;Warrens, 2008) 1 .Notice that we compare the i-th row in A to the i-row in B since these rows correspond to the projections' respective descentand relations for the i-th element in X.In (Hoffman et al., 2015), the above index is computed for each element pair within the network to produce the intra-network similarity coefficient.
Since we are interested in the overall match, we choose to report on the mean value, defining the adjusted order Rand index for A and B as

Normalised ultrametric fit
A natural choice of quality measure is to report the ultrametric fit U ε (θ) − d p of the obtained partial dendrogram θ, especially if we can compare it to the ultrametric fit of the optimal solution.The scale of the ultrametric fit depends heavily on both the size of the space and the order of the norm, so we choose to normalise.Also, we invert the normalised value, so that the optimal fit has a value of 1, and a worst possible fit has value 0. This makes it easy to compare the convergence of the ultrametric fit to the convergence of the ARI and ōARI.
Definition 30.Given a set of partial dendrograms {θ i } over (X, <, d), let their respective ultrametric fits be given by In the presence of a reference solution, we substitute min i {δ i } with the ultrametric fit of the reference.

Ultrametric fit relative the optimal ultrametric
The reference partition can be reached through different sequences of merges, and neither AL nor CL are invariant in this respect.Neither ARI, ōARI nor ultrametric fit captures the match between the optimal hierarchy and the approximated hierarchy.We therefore also include plots of the difference between the optimal ultrametric u opt and the approximated ultrametric u N,ε .Since both ultrametrics are equivalent to their respective hierarchies, the magnitude u opt − u N,ε p can be interpreted as a measure of difference in hierarchies.In the below plots, this is reported as opt.f it.As for the ultrametric fit, we normalise and invert the values for easy comparison.

Demonstration on randomly generated data
The experiments in the demonstration split in two.First, we demonstrate the efficacy of the approximation relative a known optimal solution, to see to which degree HC <L N,ε manages to approximate HC <L opt,ε .Second, we study the convergence rate of the ultrametric fit for spaces with much larger numbers of tied connections; spaces for which the optimal algorithm does not terminate within any reasonable time.
For each parameter combination in Table 1, a set of 30 random ordered dissimilarity spaces are generated.For each space, 100 approximations are generated according to the prescribed procedure.We then bootstrap the approximations to generate N -fold approximations for different N .Table 1: Parameter settings for the demonstrations.The right-most column indicates whether the reference clustering is available or not.left-most column refers to the figure wherein the outcome of the corresponding experiment is presented.The parameters have been chosen to illustrate how the algorithm behaviour changes with changing expected number of ties, changing link probability in the random partial order, and choice of linkage function.
We present the results in terms of convergence plots, showing the efficacy of the approximation as a function of the sample size N .For the results where a reference solution is available, the plots contain four curves: E(ARI) -The expected adjusted Rand index of the approximated partition.E(ōARI) -The expected adjusted Rand index of the approximated induced order.norm.f it -The mean of the normalised fit.opt.f it -The mean of the normalised difference between the approximated ultrametric and the optimal ultrametric.
For the results where no reference solution is available, we present the distribution of the normalised fit.
The results are presented in Figures 5, 6, 7 and 9 on pages 30, 30, 31 and 32, respectively.The parameter settings corresponding to the figures are given in Table 1 for easy reference, and are also repeated in the figure text.
As we can see from the below results, the approximation generally performs very well.We also see that a large expected number of tied connections requires larger sample size for a good approximation, while a more dense order relation (higher value of p) seems to require a smaller sample compared to a more sparse relation.We also see that there is a seemingly strong correlation between the ultrametric fit of the approximation and the similarity between the approximation ultrametric and the optimal ultrametric.
Regarding choice of linkage function, the approximation only requires small samples for both SL and AL, while CL requires larger samples for larger numbers of tied connections.

First conclusions
The first thing that strikes the eye is that the approximations converge very rapidly.Even for moderately sized spaces (∼500 elements), it appears to be sufficient with 20 samples for SL and AL, and for smaller spaces (∼200 elements), even fewer samples are required.We also notice that there is a strong correlation between the ARI, ōARI and normalised fit.
For the part of the demonstration where we have no reference clustering, we cannot know for sure whether the best reported fit is also optimal.However, from the convergent behaviour of the data, and the strong correlation between optimality and normalised fit in Figures 5 and 6, this points in the direction of convergence to the true optimum.
Only CL displays convergence issues, indicating that if one wishes to use CL for large spaces or large numbers of tied connections, it may be wise to do so in conjunction with convergence tests.
On the other hand, since SL is independent of tie resolution order, every sequence of merges ending in the same maximal partition will produce the same partial dendrogram.This explains why the convergence rate of SL is less affected by the expected number of tied connections than, say, CL.
The convergence rate is very high in some of the plots of Figures 8 and 9.The authors believe this is due the high probability of two random elements being comparable (high p in O(n, p, t)), since a dense relation leads to fewer candidate solutions.This in contrast to the larger set of candidates for a more sparse relation, such as in Figure 7.
On the other hand, as we can see in Figures 7 and 8, keeping p fixed and increasing the number of tied connections, and thereby the number of possible branch points, causes a slower convergence rate.
To summarise, we see that the approximation is both good and effective for SL and AL.For CL, although the approximation method seems good, the required sample size must be increased in the presence of large amounts of tied connections.9 Demonstration on data from the parts database While the above demonstration shows that HC <L N,ε performs well with respect to approximating HC <L opt,ε , another question is how order preserving hierarchical clustering deals with the dust of reality.In this section, we present results from applying the approximation algorithm to subsets of the parts database described briefly in Section 1.1.As benchmark, we run classical hierarchical clustering on the same problem instances, comparing the performance of the methods using ARI, ōARI and loop frequency (described below).As hierarchical methods for constrained clustering do not offer a no-link constraint, we also propose a simplified approach simulating no-link behaviour for AL and CL which we call HC + .
To select data for the demonstration we proceeded as follows: We considered the part-of relations as a directed graph, and extracted all the connected components.As it turned out, there was one gigantic component and a large number of singleton elements, but also a handfull of connected components of 11 to 40 elements each.We selected these smaller connected components as our demo dataset without any further consideration.Dissimilarities between the elements were obtained from a dissimilarity measure produced by an ongoing project in the company working on the very task of classifying equivalent equipment.Some key characteristics of the data is provided in The in/out deg.column provides the directed average degree when the data is considered as a DAG.The column p shows the probability for two random elements to be connected in the transitive reduction.
Due to limited labeling of the data, we do not know which elements are copies of other elements, so we have to fake copying to produce planted partitions.For the demonstration, we pick a connected component (X 0 , E 0 ) where X 0 = {x 0 1 , . . ., x 0 n }, and for some positive number m we make m − 1 copies of X 0 and E 0 , leading to m partially ordered sets (X k , E k ) m−1 k=0 .We then form their disjoint union (X, E) where |X| = m X 0 .X now consists of m connected components, each a copy of the others.If x 0 i ∈ X 0 , then the set of elements equivalent to Hence, the clusters we seek are the sets on this form.If we denote the dissimilarity measure that comes with the data by d 0 , we define the extension to all of X as follows: First, if both elements are in the same component X k for 0 ≤ k ≤ m, then we simply use d 0 .And if they are in different components, indicating that they are in a copy-relationship, we increase their dissimilarity by an offset α ≥ 0. Concretely, the extended dissimilarity d α : X × X → R + is given by This means that if x and y are copies of each other, then d α (x, y) = α, and if x and y are in the same component and if z is a copy of x, then d(z, y) = α + d 0 (x, y).Furthermore, for each modified distance, we add a small amount of Gaussian noise to α to induce some variability.As a result, two copies x r i and x s i are offset by approximately α, and by varying the magnitude of α we can study how the offset affects the clustering.

Simulated constrained clustering
The available methods for hierarchical constrained clustering do not easily incorporate the partial order as a constraint.What we would like to compare against, is hierarchical constrained clustering with do-not-cluster constraints.For CL and AL, we can obtain this by setting the dissimilarity between comparable elements to a sufficiently large number, causing all comparable elements to be merged towards the end.Indeed, for CL it is sufficient to set this dissimilarity to any value exceeding max{d α }, and as the below demonstration shows, this value works equally well for AL.We denote hierarchical clustering with this kind of modified dissimilarity by HC +L .Since d 0 < 1 for all pairs of elements, we chose to use 1.0 as our maximum dissimilarity.

A measure of order preservation
While the ōARI measures the correlation between the induced order of the planted partition and the induced order of the obtained clustering, the ōARI does not convey information about whether the induced relation is a partial order or not.Since this is a key question for applications where order preservation is of high importance (such as acyclic partitioning of graphs), we suggest the following simple measure.Let (Q, E ′ ) be a partition of (X, E), and let E Q be the base space projection of (Q, E ′ ) onto X (Definition 29).We say that (Q, E ′ ) induces a loop if there are elements on the form ( There is at most one loop per element of X, and if E Q contains a cycle, then every element of the cycle corresponds to a loop.In the name of normalisation, we measure the amount of loops as the fraction of elements in X that is a part of a cycle:

Picking a clustering in the hierarchy for comparison
Given a problem instance (X, <, d) and a planted partition Q ∈ P(X, <), the planted induced partial order is necessarily the induced relation < ′ .But in comparing a hierarchical clustering with a planted partition, we have to make a choice of clustering in the hierarchy.Given a hierarchical clustering, we choose to find the clustering in the hierarchy that has the highest ARI relative the planted partition.We then report all of ARI, ōARI and loops with regards to this clustering.

Variance of the difference
In the below plots, we present the mean values of ARI, ōARI and loops together with a visual indication of variability.For each instance of a random ordered dissimilarity space (X, <, d), we run all of HC <L N,ε , HC L and HC +L .Thus, we can analyse the performance of the methods by pairwise comparison on a problem instance level.That is, we choose to consider pairwise differences such as ARI(HC <L N,ε (X, <, d)) − ARI(HC +L (X, <, d)) as one random variable, and likewise for ōARI and loops.The variance of this random variable shows the variance in the difference, and we can use this magnitude to analyse whether the sets of results are statistically distinguishable.For the below plots, we mark a region about each line corresponding to one standard deviation of this random variable.This means that the regions encompassing the lines will not overlap unless the difference between the mean values is less than two standard deviations.
To reduce the number of plots, we choose to plot the results of all three methods together.This is obviously impractical with respect to pairwise comparisons, so we employ the following convention: the indicated variance about the mean of HC <L N,ε and HC +L is the standard deviation of the differences between these methods.The indicated variance about the mean of HC L represents the standard deviation of the differences between HC L and HC <L N,ε .

Execution and results
The parameters given in Table 3 define how the ordered dissimilarity spaces are constructed for each of the connected components.For each instance of an ordered dissimilarity space, HC <L N,ε , HC L and HC +L are all run on the same instance with a choice of linkage function L ∈ {SL, AL, CL}.This allows us to compare the performance of the methods against each other on a per-instance basis.For each parameter combination in {α} × {SL, AL, CL}, we repeated this process 50 times.The number m of copies is the least number for which the total number of elements, m X 0 , is at least 200.
We present three families of plots, for ARI, ōARI and loops, respectively We have picked three connected components for the presentation that we believe represent the span of observations.The full set of plots is provided in the appendix.
First, connected component number 7 (cc7) is the sample on which we see the most clear benefit from using HC <L N,ε , significantly outperforming both HC L and HC +L on all quality measures.Although cc7 is not representative for the majority of observations, it is empirical evidence that there exist problem instances for which order preserving clustering cannot be well approximated by hierarchical constrained clustering through do-not-cluster constraints.
Connected component number 1 (cc1) represents the majority of the instances.While HC <L N,ε still is best in class with respect to all quality measures, we see that for AL and CL the method HC +L performs equally well with respect to ARI and sometimes also ōARI.At the other extreme of cc7 there is connected component number 4 (cc4), presented in the bottom row of Figure 10.For this component, all the clustering models perform equally well in all quality measures, indicating that they produce the exact same clusterings.This can only be explained by the fact that the original dissimilarity measure d 0 , when restricted to this component, both is an ultrametric, and incorporates the order relation (Section 6.2).
The results are also summarised in Table 4 after the plots.We summarise the experiment observations in To conclude, we see that if clustering is the sole objective, then HC +L is a good alternative to HC <L whenever L ∈ {AL, CL}.If order preservation, or acyclic partitioning, is of any importance, then HC <L N,ε is the only viable method among those we have tested.Moreover, as demonstrated by the top row of Figure 10, although HC +L may be a good approximation of HC <L N,ε when L ∈ {AL, CL}, there are problem instances on which the latter outperforms the former with significant margin, also for ARI.

Summing up
In this paper we have put forth a theory for order preserving hierarchical agglomerative clustering for strictly partially ordered sets.The clustering uses classical linkage functions such as single-, average-, and complete linkage.The clustering is optimisation based, and therefore also permutation invariant.
The output of the clustering process is partial dendrograms; sub-trees of dendrograms with several connected components.We have shown that the family of partial dendrograms over a set embed into the family of dendrograms over the set.
When applying the theory to non-ordered sets, we see that we have a new theory for hierarchical agglomerative clustering that is very close to the classical theory, but that is optimisation based rather than algorithmic.Differently from classical hierarchical clustering, our theory is permutation invariant.We have shown that for single linkage, the theory coincides with classical hierarchical clustering, while for complete linkage, the clustering problem becomes NP-hard.However, the computational complexity is directly linked to the number of tied connections, and in the absence of tied connections, the theories coincide.
We present a polynomial approximation scheme for the clustering theory, and demonstrate its convergence properties and efficacy on randomly generated data.We also provide a demonstration on real world data comparing against existing methods, showing that our model is best in class in all selected quality measures.

Future work topics
We suggest the following future work topics:

Complexity
While NP-hardness of HC <CL opt,ε follows from Theorem 11, the complexity classes of order preserving hierarchical agglomerative clustering for SL and AL remain to be established.

Order versus dissimilarity
Since the order relation is treated as a binary constraint it has a significant effect on the output from the clustering process, and may in some cases lead to undesirable outcomes.For example, if the dissimilarity measure associates "wrong" elements for clustering, the induced order relation may exclude future merges of elements correctly belonging together by erroneously identifying them as comparable.Also, if elements are wrongly identified as comparable to begin with, they can never be merged.Both due to Theorem 15.
Together, these observations indicate that "loosening up" the stringent nature of the order relation may be beneficial in applications where order preservation is not a strict requirement.
Acknowledgments.I would also like to thank the anonymous reviewers at Machine Learning for constructive feedback and comments greatly improving the exposition.I would also like to thank Henrik Forssell, Department of Informatics (University of Oslo), and Gudmund Hermansen, Department of Mathematics (University of Oslo), for their comments, questions and discussions leading up to this work.

A Plots from the part database demo
This section lists all the plots from the experiments described in Section 9.The plots are grouped per connected component, and present results for all clustering methods, quality measures and linkage models.Please see Table 2 for a list of statistical properties of the different connected components, and Table 3 for

B Reference implementation
The implementation used for the experiments in Sections 8 and 9 is available as open source at https://bitbucket.org/Bakkelund/ophac.

Figure 1 :Figure 2 :
Figure 1: A dendrogram over the set X = {a, b, c, d, e}.The elements of X are the leaf nodes of the dendrogram, and, starting at the bottom, the horizontal bars indicate which elements are joined at which step in the process.The numbers on the y-axis indicate at which dissimilarity level the different clusters were formed.Now, given a partially ordered set X = {a, b, c, d} where a < b and c < d, we can use arrows to denote the order relation, thinking of X as a directed acyclic graph with two connected components.If we want to produce a hierarchical clustering of X, while at the same time maintaining the order relation, our options are depicted in the Hasse digram in Figure 2.

Figure 3 :
Figure 3: Partial dendrograms over the set X = {a, b, c, d} with a < b and c < d.Each partial dendrogram corresponds to a path in Figure 2 starting at the bottom and advancing upwards to the ordered set depicted below the dendrogram.
Both of the partial dendrograms are mapped to the same complete dendrogram (on the right) for ε = 0.

Figure 8 :
Figure 8: Polynomial approximation rate for n = 500, p = 0.05 and t ∈ {50, 100}.The first axis is the size of the drawn sample.

Figure 9 :
Figure 9: Polynomial approximation rate for n = 500, p = 0.10 and t = 100.The first axis is the size of the drawn sample.

Figure 10 :Figure 11 :Figure 12 :
Figure10: Performance of the different clustering methods with respect to ARI on connected components 7, 1 and 4. The shaded regions represent one standard deviation of the pairwise differences, as described in Section 9.4.
the parameter settings used during the experiments.Results for connected component no.0.

Table 2 :
Some key characteristics of the connected components selected for the demonstration.

Table 3 :
The variance of the difference is based on these sets of 50 executions.Parameters for execution of experiments.

Table 4 .
As can be seen from the table, HC <L N,ε is best in class in every category.However, HC +L is also best in class in 81% of the cases when we restrict our attention to ARI and L ∈ {AL, CL}.

Table 4 :
The table presents for how many of the eight selected samples the different methods are best in class with regards to ARI, ōARI and loops.The scores are based on visual inspection of the plots.For ARI and ōARI, we count a one if there is less than one standard deviation to the best plot in at least half the sampled α values and zero otherwise.For loops, we count a one if the expected value is zero throughout.The full list of plots can be found in Appendix A.