# Optimal algorithms for comparing trees with labeled leaves

## Authors

DOI: 10.1007/BF01908061

- Cite this article as: Day, W.H.E. Journal of Classification (1985) 2: 7. doi:10.1007/BF01908061

## Abstract

Let $R_n$ denote the set of rooted trees with $n$ leaves in which: the leaves are labeled by the integers in $\{1, \ldots, n\}$; and among interior vertices only the root may have degree two. Associated with each interior vertex $v$ in such a tree is the subset, or *cluster*, of leaf labels in the subtree rooted at $v$. Cluster $\{1, \ldots, n\}$ is called *trivial*. Clusters are used in quantitative measures of similarity, dissimilarity and consensus among trees. For any $k$ trees in $R_n$, the *strict consensus tree* $C(T_1, \ldots, T_k)$ is that tree in $R_n$ containing exactly those clusters common to every one of the $k$ trees. Similarity between trees $T_1$ and $T_2$ in $R_n$ is measured by the number $S(T_1, T_2)$ of nontrivial clusters in both $T_1$ and $T_2$; dissimilarity, by the number $D(T_1, T_2)$ of clusters in $T_1$ or $T_2$ but not in both. Algorithms are known to compute $C(T_1, \ldots, T_k)$ in $O(kn^2)$ time, and $S(T_1, T_2)$ and $D(T_1, T_2)$ in $O(n^2)$ time. I propose a special representation of the clusters of any tree $T \in R_n$, one that permits testing in constant time whether a given cluster exists in $T$. I describe algorithms that exploit this representation to compute $C(T_1, \ldots, T_k)$ in $O(kn)$ time, and $S(T_1, T_2)$ and $D(T_1, T_2)$ in $O(n)$ time. These algorithms are optimal in a technical sense. They enable well-known indices of consensus between two trees to be computed in $O(n)$ time. All these results apply as well to comparable problems involving unrooted trees with labeled leaves.