Heuristic for Quartet Clustering

. Given a set of n data objects and their pairwise dissimilarities, the goal of quartet clustering is to construct an optimal tree from the total number of possible combinations of quartet topologies on n , where optimality means that the sum of the dissimilarities of the embedded (or consistent) quartet topologies is minimal. This corresponds to an NP-hard combinatorial optimization problem, also referred to as minimum quartet tree cost (MQTC) problem. We provide details and formulation of this challenging problem, and propose a basic greedy heuristic that is characterized by a very high speed and some interesting implementation details. The solution approach, though simple, substantially improves the performance of a Reduced Variable Neighborhood Search for the MQTC problem. The latter is one of the most popular heuristic algorithms for tackling the MQTC problem.


Introduction
Quartet clustering methods are popular in computational biology, where dendrograms (or phylogenies) are ubiquitous. These methods aim at reconstructing a rooted dendrogram from a set of pairwise distant objects (or taxa). Given a set of objects, define Q to be the set of all the possible quartets, and Q t to be the set of consistent quartets being embedded in a dendrogram t. The problem of recombining the quartet topologies of Q to form an estimate of the correct tree diagram can be naturally formulated as an optimization problem. Steel [19] formulated the maximum quartet consistency (MQC) problem, which looks for a dendrogram tree t maximizing the number of consistent quartets Q t belonging to a subset P ⊆ Q of quartet topologies. This problem has been shown to be NP-hard [19], and Jiang et al. [15] proved that the problem admits a polynomial time approximation scheme by using the technique of smooth integer polynomial programming and by exploiting the natural denseness of the set Q. However, this scheme only guarantees a dendrogram that may deviate from Q by εn 4 quartet topologies for any small constant ε > 0, where n is the number of taxa.
Due to these results, most quartet methods are heuristics which attempt to solve the MQC problem, or some variants of the MQC problem with weaker optimization requirements. Strimmer and von Haeseler [20] formulated the quartet tree-puzzling problem, which is a variant of the MQC problem where each quartet is provided with a probability value to be embedded, and for each set of four objects the quartet with the highest probability is selected (at random in case of ties) to form a "maximum-likelihood dendrogram". Felsenstein [11] presented a heuristic which solves the MQC by incrementally growing the tree diagram in random order by stepwise addition of objects in the local optimal way. This procedure is repeated iteratively for different object orders, adding agreement values on the branches of the tree. Both agglomerative approaches are quite fast, but suffer from the usual bottom-up problem: a wrong decision early on cannot be corrected later. Berry et al. [1] reported an interesting result. They presented two "quartet cleaning" algorithms for correcting bounded numbers of quartet errors (i.e. incorrect inferences of simple quartet topologies) for many popular quartet problems.
Cilibrasi and Vitányi [2] proposed for the first time the minimum quartet tree cost (MQTC) problem. Given a set N of n ≥ 4 objects, the MQTC deals with a full unrooted binary tree with n leaves, a special topology dendrogram having all internal nodes connected exactly with three other nodes, the n objects assigned as leaf nodes, and without any distinction between parent and child nodes [12]. A full unrooted binary tree with n ≥ 4 leaves has exactly n − 2 internal nodes, and consequently has a total of 2n − 2 nodes. Full unrooted binary trees are of primary interest in clustering contexts because, of all tree diagrams with a fixed number of nodes, they have the richest internal structure (most differentiated paths between nodes). They are therefore very suitable for representing the structure of a set of objects [12]. A full unrooted binary tree with exactly n = 4 leaves is also referred to as simple quartet topology, or just as quartet [10,12]. Given a set N of n ≥ 4 objects, the number of sets of four objects from the set N is given by: Given four generic objects {a, b, c, d} ∈ N , there exist exactly three different quartets: ab|cd, ac|bd, ad|bc, where the vertical bar divides the two pairs of leaves, with each pair labelled by the corresponding objects and attached to the same internal node. Therefore the total number of possible simple quartet topologies of N is: 3 · n 4 .
A full unrooted binary tree is said to be "consistent" with respect to a simple quartet topology, say ab|cd, if and only if the path from a to b does not cross the path from c to d. This quartet ab|cd is also said to be "embedded" in the given full unrooted binary tree.
Considering the set N , the MQTC problem accepts as input a distance matrix, D, which is a matrix containing the dissimilarities, taken pairwise, among the n objects 1 . To extract a hierarchy of clusters from the distance matrix, the MQTC problem determines a full unrooted binary tree with n leaves that visually represents the symmetric n × n distance matrix as well as possible according to a cost measure. Consider the set Q of all possible 3 · n 4 quartets, and let C : Q → + be a cost function assigning a real valued cost C(ab|cd) to each quartet topology ab|cd ∈ Q. The cost assigned to each simple quartet topology is the sum of the dissimilarities (taken from D) between each pair of neighbouring leaves [4]. For example, the cost of the quartet ab|cd is ) and D (c, d) indicate, respectively, the dissimilarities among (a and b) and (c and d), obtained from the D.
Consider now the set Γ of all full unrooted binary trees with 2n−2 nodes (i.e. n leaves and n − 2 internal nodes), obtained by placing the n objects to cluster as leaf nodes of the trees. For each t ∈ Γ , precisely one of the three possible simple quartet topologies for any set of four leaves is consistent [4]. Thus, there exist precisely n 4 consistent quartet topologies (one for each set of four objects) for each t ∈ Γ .
The cost associated with a full unrooted binary tree t ∈ Γ is the sum of the costs of its n 4 consistent quartet topologies, that is: where Q t is the set of such n 4 quartet topologies embedded in t. In a hierarchical clustering context, we do not even have a priori knowledge that certain simple quartet topologies are objectively true and must be embedded. Thus, the MQTC problem assigns a cost value to each simple quartet topology, in order to express the relative importance of the simple quartet topologies to be embedded in the full unrooted binary tree having the n objects as leaves. The full unrooted binary tree with the minimum cost balances the importance of embedding different quartet topologies against others, leading to a binary tree that visually represents the symmetric distance matrix n × n as well as possible. The solution of this problem allows the hierarchical representation of a set of n objects within a full unrooted binary tree [12]. That is, the resulting binary tree will have the n objects assigned as leaves such that objects with short relative dissimilarities will be placed close to each other in the tree. This hierarchical clustering approach coming from the MQTC problem is also referred in the literature to as quartet method [4]. Such method is more sensitive and objective than other quartet clustering methods, which are usually too slow when they are exact or global, and too inaccurate or uncertain when they are statistical incremental, like the case of quartet tree-puzzling. In [4] the MQTC problem was shown to be NP-hard, and a Randomized Hill Climbing heuristic was also proposed to obtain approximate problem solutions. Other MQTC metaheuristics based on Greedy Randomized Adaptive Search Procedure, Simulated Annealing, and Variable Neighbourhood Search were proposed in [6]. These metaheuristics performed well for the problem, although the best performance was obtained by a Reduced Variable Neighbourhood Search (RVNS) implementation [6].
In this paper we propose some improved metaheuristics for the MQTC problem, to be used to get solutions of higher quality, in terms of reduced costs and computational running times. In particular we first propose a basic greedy heuristic which is characterized by a very high speed and some interesting implementation improvements, which can be used to enhance the MQTC metaheuristics to date in the literature. This greedy algorithm is characterized by its ease of implementation and simplicity, and it takes inspiration from the recently proposed "less is more approach" [9,18], which supports the adoption of non-sophisticated and effective metaheuristics instead of hard-to-reproduce and complex solution approaches. In particular we will show how the performance of the RVNS quartet heuristic is improved, with particular emphasis to computational running time, by adopting our proposed basic greedy heuristic to construct initial solutions for the algorithm.
The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 describes the details of the proposed heuristics, along with their main implementation concepts and pseudo-code formulations. Our computational experience is reported in Sect. 4, and finally the paper ends with conclusions in Sect. 5.

Related Work
The MQTC problem was originally proposed in [2]. There the main focus was on compression-based distances, but the authors visually presented the tree reconstruction results by full unrooted binary trees deriving by their MQTC problem formulation. Hence, they developed the quartet method for hierarchical clustering, a new approach aimed at general hierarchical clustering of data from different domains, not necessarily biological phylogenies. Several practical applications of the quartet method have been explored in the literature. In particular, Cilibrasi et al. [5] proposed a robust automatic music classification procedure consisting of two steps. The first step consisted of extracting the "Normalized Compression Distances" [16] among some considered pieces of music. The Normalized Compression Distance is a similarity metric based on string compression which mimics the ideal performance of Kolmogorov complexity [16]. The second step consisted of creating an efficient visualization of the extracted pairwise distances by means of the quartet method of hierarchical clustering. To substantiate the claims of universality and robustness of this automatic classification method, evidence of other successful applications in areas as diverse as genomics, virology, languages, literature, handwriting, astronomy and combinations of objects from completely different domains, were reported in [2]. In addition, Cilibrasi and Vitányi [3] reported an interesting application of this theory, consisting of the automatic extraction of similarities among words and phrases from the WWW using Google page counts. Granados et al. [13] studied the impact of several kinds of information distortion on compression-based text clustering, showing their results as ternary trees by means of the quartet method of hierarchical clustering. In a recent application, a variant of the quartet method based on the Variable Neighborhood Search metaheuristic was used for biomedical literature extraction and clustering [7,8]. The proposed application was able to retrieve relevant references for systematic reviews and meta-analysis from the Medline/PubMed database, and for visualizing the retrieved bibliography through an intuitive graph layout.
In [4], the authors presented the minimum quartet tree cost problem in a more formal way. They showed the main concepts, components, advantages and disadvantages of the quartet method of hierarchical clustering, particularly underlining the similarities and differences with respect to other methods from biological phylogeny. Cilibrasi and Vitányi [4] also showed that the MQTC problem is NP-hard by reduction from the MQC problem, and provided a Randomized Hill Climbing heuristic to obtain approximate problem solutions. Several other efficient metaheuristics based on Greedy Randomized Adaptive Search Procedure, Simulated Annealing, and Variable Neighbourhood Search were proposed and compared for the MQTC problem in [6]. The best reported performance was obtained by an implementation of a Reduced Variable Neighbourhood Search metaheuristic, which we will use as a reference benchmark in our paper and try to overcome its performance.

Greedy Constructive Heuristic
We first propose a new greedy heuristic for the MQTC problem, used to construct initial solutions of good quality requiring short computational running time [9,18]. In the metaheuristics to date used for solving the MQTC problem [4,6], the initial solution was usually set either completely at random, or by selecting the corresponding flat structure, and then this solution was iteratively improved towards local optimality using the different heuristic guidelines of the specific metaheuristic implementation. The aim of the greedy constructive heuristic that we propose here consists of providing starting solutions having already a good quality, and obtained with an high speed too, which can bring to an improvement of the overall performance of the MQTC heuristic deployed afterwards.
We are given as input n ≥ 4 different objects and the corresponding symmetric distance matrix D containing the n × n pairwise distances among those objects. The algorithm makes use also of another distance matrix D among a set N of n ≥ 4 objects, with n ≤ n, which will be used iteratively from our optimization routine to reduce the dimensionality n of the original set of objects in N . Initially, matrix D is set equal to D, i.e. the sets of objects N and N are equivalent. At this stage, another graph t , which will be used during the algorithm iterations as a support solution, is initialized to null, i.e. t ← ∅. Then the core of our greedy heuristic begins by selecting objects from N to be included in the support solution t . At this purpose we greedily select from N the objects that have the shortest minimum pairwise distance from D . Say the two objects a and b in N have this shortest distance, that is Note that in case of ties for the object pairs having the shortest distance in D , the routine simply selects an object pair at random within this set. Afterwards, these nodes a and b are connected to the support solution graph t . The following three cases are possible: -None of the two objects a and b are already connected in the partial solution t , and therefore they are joined together by means of a terminal node; -One of the two objects is already included as a leaf node in t , and therefore the other object b is linked to the subgraph in t containing a by means of a transition node (i.e. the dotted internal node in the figure). Please note that node b requires to be included in the partial solution t by a link with a new transition node since, being it a leaf, if it would be included by a link with a terminal node instead, we would not be able to add any further nodes afterwards; -Both objects a and b are already included in the partial solution t but they belong to two different subgraphs, and therefore these two subgraphs containing respectively the two objects are connected together by means of a cross node.
Afterwards, we apply a routine, referred to as distance matrix reduction, to merge the added nodes a and b together to form another object, say x, by reducing in this way the dimension of N of one unit, i.e. n = n − 1. The distance matrix D is recomputed accordingly by removing the distances of the two objects a and b with all the other objects in N , and adding the distances of the new node x with the other objects, which are calculated as the averages distances, respectively of a and b, with the other nodes in N . That is D (x, y) = D (a, y)+D (b, y) 2 , for all objects y ∈ N , y = a, b. The rationale behind this procedure is to greedily merge together highly connected objects which may bring higher values of the quartet cost function of the subgraphs inferred in the partial solution t .
This greedy procedure is repeated iteratively until a fully connected unrooted binary tree t is obtained, i.e. t ∈ Γ , which is equivalent also in getting a reduced distance matrix D with a size n = 4 (i.e. it would not be possible to reduce further the corresponding set N since it only contain four objects). Then the support solution t is assigned to the output full unrooted binary tree t, which is produced as final outcome of the algorithm.

Reduced Variable Neighbourhood Search
Variable Neighbourhood Search (VNS) is a popular metaheuristic for solving hard combinatorial optimization problems based on dynamically changing neighbourhood structures during the search process [14]. VNS does not follow a trajectory, but it searches for new solutions in increasingly distant neighbourhoods of the current solution, jumping only if a better solution is found. Reduced Variable Neighbourhood Search (RVNS) is a variant of the classic VNS algorithm, that has been shown to be successful for many combinatorial problems where local optima with respect to one or several neighbourhoods are relatively close to each other [14]. RVNS is a typical example of a pure stochastic heuristic, akin to a classic Monte-Carlo method, but more systematic [17]. It is useful especially for very large problem instances for which the inner local search within the classic VNS approach is costly, as in the case with quartet clustering.
The Reduced Variable Neighbourhood Search for the MQTC problem starts by selecting an initial full unrooted binary tree t ∈ Γ with 2n−2 nodes, obtained by placing the n ≥ 4 objects to cluster as leaves, with total cost C(t). In the original RVNS implementation in [6], the initial full unrooted binary tree t was selected at random.
Then, the shaking phase, which represents the core idea of RVNS, is applied to t. The shaking phase aims to change the neighbourhood structure, N k (·), when the algorithm is trapped at a local optimum. The new incumbent solution, say t , is generated at random in order to avoid cycling, which might occur if a deterministic rule is used. The simplest and most common choice for the neighbourhood structure consists of setting neighbourhoods with increasing cardinality: |N 1 (·)| < |N 2 (·)| < ... < |N kmax (·)|, where k max represents the maximum size of the shaking phase. Let k be the current size of the shaking phase. The algorithm starts by selecting the first neighbourhood (k ← 1) and, at each iteration, it increases the parameter k if a better solution is not obtained (k ← k + 1), until the largest neighbourhood is reached (k ← k max ). The process of changing neighbourhoods when no improvement occurs diversifies the search. In particular, the choice of neighbourhoods of increasing cardinality yields a progressive diversification of the search process.
For the MQTC problem, a shaking phase of size k consists of the random selection of another full unrooted binary tree t within the neighbourhood N k (t) of the current solution t. To obtain t from N k (t), the algorithm performs k consecutive base moves, where a base move is a single basic modification that each internal node of t can perform with its neighbouring internal nodes. The possible base moves that can be performed depend on the types of internal node pairs [6]. In the case of: -two transition nodes: either the attached leaves are exchanged, or they are transformed into one cross node and one terminal node connected to the corresponding leaves; -one terminal node and one transition node: the leaf of the transition node is exchanged with one of the two leaves of the terminal node; -one terminal node and one cross node: they are transformed into two transition nodes with the two leaves of the terminal node attached; -one transition node and one cross node: the transition node is moved in one of the other two branches of the cross node; -two cross nodes: one branch of one cross node is swapped with a branch of the other cross node.
Note that each base move corresponds just to a limited local modification of the structure of the incumbent solution t, which results in most of the coefficients of the corresponding Complete Pseudo-Adjacency matrix C to remain unchanged. In this was there will be no need to recalculate all the coefficients of C, but only recomputing a small subset of it, speeding up consistently this step.
At the beginning of RVNS, the first neighbourhood (k ← 1) is selected and, at each iteration, the parameter k is increased (k ← k + 1) whenever the solution obtained is not an improvement of the current best solution (i.e. C(t ) > C(t)). When k > 1, the first base move is performed to a randomly selected internal node and one of its neighbouring internal nodes with respect to the considered distance of rank one. Then, to perform the successive base move, the algorithm selects one of the two internal nodes considered, and another neighbouring internal node that must be different from the two internal nodes already considered, and so on. The procedure is repeated until k consecutive base moves are performed.
If an improved binary tree t is produced by the shaking phase (C(t ) < C(t)), this becomes the best solution to date (t ← t ) and the algorithm restarts from the first neighbourhood (k ← 1) of t. The process of increasing progressively parameter k whenever no improvements are obtained, occurs until the maximum size of the shaking phase, k max , is reached. When this happens, k is re-initialized to the first neighbourhood (k ← 1). The correct setting of k max is an important user task. For the MQTC problem, a simple reactive schema for the efficient tuning of k max has been implemented [6]. At the starting point, k max is set to a small value (k max = 2) and is increased (k max = k max +1) every i update iterations between two consecutive improvements. For the value of this parameter we use the setting of [6], where i update = (1.25 · 10 5 )/n 2 + 50. Throughout the execution of the algorithm, the best solution to date is stored as the binary tree t, which will be produced as output of the algorithm when the user termination condition (e.g. a maximum allowed CPU time) is reached.

Computational Results
In order to evaluate the algorithms, we performed experiments to compare them in terms of quality of produced solutions and computational running time. For evaluating solution quality, we used both the cost function C(·), already defined previously, and of another metric, referred to as normalized tree benefit score, S(·) ∈ [0, 1] [2,6], which is a more intuitive performance measure of the goodness of quartet clustering. Given the set N of n ≥ 4 documents to cluster, let m be the best (minimal) cost, calculated as the sum of the n 4 minimum costs of each set of four objects in N , and let M be the worst (maximal) cost, calculated as the sum of the n 4 maximum costs of each set of four objects in N . The normalized tree benefit score S(t) of a full unrooted binary tree t ∈ Γ is obtained by rescaling and normalizing in [0, 1] the cost function C(t), i.e. S(t) = M −C(t) M −m ∈ [0, 1]. While a lower cost function C(t) results in a better solution t, conversely a higher normalized tree benefit score means a better clustering quality.
Our experimental algorithms comparison was made upon classic MQTC problem datasets, already used in previous studies in the literature (see e.g. [2,4,6]). They are briefly described in the following, but for more details the reader is referred to [2,4,6].
-Data constructed artificially to have none inconsistency, that is data for which the exact solutions are known in advance and have been built to have normalized tree benefit score equal to one. The construction mechanism is described in detail in [2,6]. These data aim at testing whether the quartet-based tree reconstruction is reliable and accurate on clean consistent data with known solutions. They consist of ten different problem instances ranging from a number of objects n = 10 to 100. They are referred to as: artificial. -Example of natural data concerning a study in genomics with DNA sequences of different placental mammalian species. The distance matrices from the genomic data were computed by using an automated software method described by Cilibrasi and Vitányi [2,4], who downloaded the whole mitochondrial genomes of the placental mammalian species from the GenBank Database on the World Wide Web. They consists of three sets of data with n = 10, n = 24, and n = 34, and are referred to as: nature. Table 1 show the results of our experimental comparison of the algorithms on the considered datasets. The heuristics are identified with the following abbreviations: Greedy, for the greedy constructive heuristic described in Sect. 3.1; RVNS rand , for the original implementation of the Reduced Variable Neighbourhood Search (Sect. 3.2) with initial solution selected at random; RVNS greedy , for the new Reduced Variable Neighbourhood Search implementation where the initial solution is selected by using the greedy constructive heuristic, Greedy. All the algorithms were implemented in C++ under the Microsoft Visual Studio 2015 framework, and were deployed on an Intel Quad-Core i5 64-bit microprocessor at 2.30 GHz with 16 GB RAM.
As stopping condition for the RVNS-based metaheuristics it was considered a maximum allowed CPU time (max-CPU-time). In particular, as also used in [2,6], we set max-CPU-time to one hour. Selection of the maximum allowed CPU time as the stopping criterion was made in order to have a direct comparison among the RVNS metaheuristics with respect to the quality of their solutions. Instead, for the Greedy algorithm it was not necessary to set any stopping criterion since, being a constructive heuristic, it automatically ends when a feasible solution, i.e. a fully connected unrooted binary tree, is obtained.
Looking at Table 1, the first column shows the number of objects, n, characterizing the different datasets (artificial, nature, geographical ) while the remaining columns give the computational results in terms of clustering quality (i.e. cost function values C(·), cost, and normalized tree benefit scores S(·), score), and computational running time in seconds (time) for the different algorithms. The performance of an heuristic can be considered better than another if it obtains a lower cost function value, or more intuitively a larger normalized tree benefit score. In case of ties, an algorithm is consider better than another if it was faster. From the results showed in the table, we can immediately denote that Greedy was much faster of several orders of magnitude than the original RVNS implementation, RVNS rand , although in most of the cases it obtained solutions with worst quality. This is an understandable result since RVNS rand is an explorative metaheuristic that runs for a longer time, max-CPU-time, while Greedy instead stops immediately when a feasible solution is reached. But when Greedy was then embedded inside the RVNS metaheuristic in order to produce initial goodquality solutions in RVNS greedy , a very powerful metaheuristic was obtained. Indeed, as it can be seen in the table, RVNS greedy retained the high-speed feature from Greedy, but also the characteristics of good-quality solutions that is proper of the RVNS approach for the given problem. Indeed, looking at the performance of both the RVNS algorithms, the obtained solutions were comparable with respect to clustering quality, but RVNS greedy was much faster. Note that for the artificial datasets without inconsistencies, both RVNS implementations were able to reach optimality, RVNS greedy being much faster, while this was not achieved by Greedy.
Summarizing, the novel RVNS implementation with the greedy constructive heuristic used for selecting the initial starting solutions resulted to be the best performing method in our computational experiments in terms of both quartet clustering quality and, especially, computational running time.

Conclusions
In this paper we proposed some improved heuristics for quartet clustering, a novel hierarchical clustering approach based on the minimum quartet tree cost (MQTC) problem, which is NP-hard and whose goal is to derive an optimal tree from the total number of possible combinations of quartet topologies on some input objects n, where optimality means that the sum of the dissimilarities of the embedded (or consistent) quartet topologies is minimal.
In particular we provided the details of a new basic greedy heuristic that is characterized by a very high speed. Although the performance of this simple method in terms of quartet clustering quality, evaluated by means of a defined cost function and of a normalized tree benefit score, was not as good as that of the best solution method reported in the literature, i.e. a Reduced Variable Neighbourhood Search (RVNS) metaheuristic, this greedy method was used to considerably improve the performance of the RVNS by using it to construct initial good-quality solutions instead of randomly selected solutions.
This produces a very efficient solution approach to the problem, as demonstrated by our experiments on the comparison of the considered algorithms on a set of well-known MQTC datasets and by the reported computational results, which represents an advancement of the state-of-the-art on the solution methods used for quartet clustering.