Abstract
In many situations, it would be useful to know not just the best phylogenetic tree for a given data set, but the collection of high-quality trees. This goal is typically addressed using Bayesian techniques; however, current Bayesian methods do not scale to large data sets. Furthermore, for large data sets with relatively low signal one cannot even store every good tree individually, especially when the trees are required to be bifurcating. In this paper, we develop a novel object called the “history subpartition directed acyclic graph” (or “history sDAG” for short) that compactly represents an ensemble of trees with labels (e.g. ancestral sequences) mapped onto the internal nodes. The history sDAG can be built efficiently and can also be efficiently trimmed to only represent maximally parsimonious trees. We show that the history sDAG allows us to find many additional equally parsimonious trees, extending combinatorially beyond the ensemble used to construct it. We argue that this object could be useful as the “skeleton” of a more complete uncertainty quantification.
1 Introduction
Here we develop a structure that can compactly represent and extend collections of phylogenetic trees with ancestral sequences mapped on the internal nodes. One motivation for this structure comes from uncertainty quantification in statistical phylogenetics, which is typically approached in one of two ways. Bayesian analysis attempts to characterize the posterior distribution of phylogenetic trees given data: the collection of trees that credibly explain the data, and their probabilities of being the generative tree. On the other hand, the phylogenetic bootstrap (Felsenstein 1985) resamples columns of the multiple sequence alignment, infers an optimal tree for each one of the resampled data sets, then aggregates features of the resulting trees.
Neither of these is tenable for very large and densely sampled data sets, such as for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) collections. Traditional Bayesian analysis is often too slow to apply to these large data sets, and introduces many extra unknown model parameters in a signal-weak setting. Bootstrapping may remain fast enough when using recent approximations (Hoang et al. 2018), but has a different problem: it is common for well-established clades (supported by other data) to be supported on the sequence level by a single mutation, so the bootstrap support of the corresponding clade will exactly equal the frequency with which we draw that mutation in the bootstrap sample. Thus, the bootstrap underestimates support in this case (Wertheim et al. 2022).
Phylogenetic placement offers a different type of uncertainty estimate: an assessment of the level of certainty in inserting a new sequence into an existing phylogeny. However, these assessments of uncertainty are relative to a fixed reference tree. For SARS-CoV-2 this can be done in the UShER framework (Turakhia et al. 2021), in which this insertion procedure is used for iterative tree building. No attempt is made to characterize uncertainty of the complete tree in this framework.
The lack of uncertainty quantification may have consequences for interpretation of SARS-CoV-2 evolution. For example, the current practice for the PANGO nomenclature system (Rambaut et al. 2020) for SARS-CoV-2 does not require any sort of support estimation. A typical workflow involves placement and local tree construction. If there is indeed high probability of a single tree, then this is fine. If not, this seems potentially problematic.
We argue as follows that the diversity of maximally parsimonious trees on the data can be used to bound uncertainty from below. First, if there are more maximally-parsimonious explanations of the data, this decreases the probability that any one explanation is correct. For this reason, we expect there to be an inverse relationship between the number of maximally-parsimonious explanations of the data and the certainty of a given node or other feature in the tree. Furthermore, this inverse relationship should express a lower bound on the uncertainty because there are many other potential compelling trees that are not quite maximally parsimonious. In any case, analyzing even just the maximally parsimonious set of trees commonly involves so many trees that storing them individually and learning from them with existing techniques is computationally prohibitive. This is especially the case with parsimony analysis of large data sets, such as those for SARS-CoV-2 (Turakhia et al. 2021; Ye et al. 2022).
As a second motivation for our work, we also suggest that gathering a collection of maximally parsimonious trees could be helpful for Bayesian analysis. Although the parsimony criterion is of course not the same as likelihood, the two objectives are closely linked in the case where sequences are densely sampled relative to the amount of evolution (Thornlow et al. 2021). Previous work has shown how closely related sequences can greatly inflate the posterior distribution (Whidden and Matsen 2015), and a parsimony analysis would have revealed this inflation. Thus, we hope to use the collection of maximally parsimonious trees as an aid for designing proposal distributions, extending previous successful strategies (Zhang et al. 2020), and for quantifying exploration of tree space.
In this paper, we formalize a data structure called the history subpartition directed acyclic graph (a.k.a. history sDAG) to characterize the ensemble of maximally parsimonious trees for large data sets. This is related to the idea of characterizing the trees in a single optimal “terrace” in phylogenetic tree space, with respect to parsimony (Sanderson et al. 2011, 2015). We describe algorithms to build history sDAGs from internally labeled trees, collapse edges with no mutations, and trim history sDAGs to express only trees which are optimal according to general criteria, such as parsimony. Although history sDAG construction is not the same as uncertainty estimation, which would allow for some less-than-maximally-parsimonious trees, it is a first step in that direction. We provide a Python implementation with a flexible interface for the history sDAG as a container type for trees, endowed with abstract methods for convenient dynamic programs on the history sDAG structure, as well as all methods from this paper for manipulating history sDAGs constructed from maximally parsimonious trees. This implementation shows the effectiveness of the approach, efficiently recovering many orders of magnitude more equally parsimonious trees than were used to “seed” the history sDAG when applied to a SARS-CoV-2 data set.
1.1 Intuitive overview
Here we provide an intuitive overview of the definitions and concepts used in this paper. Formal definitions will be given in the sections following the overview.
This paper develops methods for understanding evolutionary relationships between samples from a population of closely related evolving entities, acknowledging uncertainty. We will focus on samples consisting of nucleotide sequences, but keep our language general to emphasize that other data such as sample time and geographic location could also be used.
One way to formalize evolutionary relationships among samples, and inferred ancestral states, is to arrange them in a rooted phylogenetic tree with leaf and internal node labels. Node labels in this tree can include data of the type associated with the given samples. Specifically, leaf nodes are labeled by samples, and interior nodes are labeled by inferred ancestral states. The set of samples which label leaves will be called the leaf labels. Interior node labels may be chosen from some larger label set which includes the leaf labels as a subset. Instead of directly using this notion of a rooted, internally labeled tree, we will define a more convenient object called a history, which holds the same data as such a tree. We will make the definition formal below, but a history may be thought of as a rooted, internally labeled tree (this object has been called other names in the past, including an ancestral scenario (Ishikawa et al. 2019)). For example, a history might be used to represent a phylogenetic tree in which all (internal and tip) nodes are labeled with DNA sequences.
In a history, a node’s clade is the set of labels of its descendant leaf nodes (we emphasize that internal node labels are excluded from the clade definition). A clade of a node’s child is a child clade. The child clades of a node form a partition of the node’s clade. We therefore call this set of child clades a node’s subpartition. Each edge in a history connects two nodes, each with a label and subpartition. As a formality convenient for this paper, each history will contain a universal ancestor (UA) node added as a parent of the root node.
Some histories explain the relationships between their leaf labels more plausibly than others. One common measure of optimality for a history labeled by nucleotide sequences is its parsimony score, which is the total number of nucleotide base changes along all edges in the history. A history is said to be maximally parsimonious if no other history on the same leaf labels has a lower parsimony score.
In general, there are many possible maximally parsimonious histories with leaves labeled by the same set of nucleotide sequences. We will use a structure called the history subpartition directed acyclic graph (history sDAG) to efficiently encode a large collection of histories (Fig. 1). The “history” modifier emphasizes that this structure encodes a collection of possible rooted evolutionary histories, each of which contains not only a tree structure, but also ancestral state labels.
The history sDAG consists of a collection of nodes, each associated with a combination of label and subpartition, and one formal universal ancestor (UA) node, which is denoted \(\rho \). As we will see later, edges exiting \(\rho \) keep track of the root nodes of the histories in the history sDAG.
A directed edge in a history sDAG represents an edge in a corresponding set of histories, from a parent node to a child node which have the same labels and subpartitions as the parent and child nodes of the edge in the DAG. Thus, the history sDAG structure records combinations of labels and subpartitions, and adjacencies between these combinations, in the corresponding collection of histories.
By using a carefully chosen definition of history, introduced in the next section, the history sDAG can easily be constructed as the graph union of a set of histories. These histories need not have identical leaf labels. Specifically, we think of each history as its own history sDAG, with each node annotated by its label and subpartition. The history sDAG constructed from the original set of histories is simply the union of nodes and edges in each history (Fig. 1). The history sDAG then contains as subgraphs at least those histories used to construct it.
Any subgraph of the DAG which is a tree, includes exactly one edge descending from the UA node, and exactly one edge descending from each child clade of each of its nodes, is a history (Fig. 2). Each of the histories contained in a history sDAG represents a combination of substructures from the histories used to construct the history sDAG.
In addition to thinking of the history sDAG as a way of recording structures observed in a collection of histories, we can also think of it as a way of generating histories. In fact, the set of histories in the history sDAG is in general a superset of the set of histories used to construct the DAG (Fig. 3). These new histories result from combining subhistories from histories used to construct the history sDAG. This is similar to tree fusion, in which clades from different trees are combined to improve the parsimony score of the final tree (Goloboff 1999). The connection with tree fusion is explored further in the Discussion section.
As described above, exploring phylogenetic uncertainty by examining many maximally parsimonious histories requires an efficient way to store and compute on those histories. The history sDAG provides a compact structure for storing collections of histories, but in general contains histories beyond those used to build the DAG. We therefore encounter a key question: are these additional histories also maximally parsimonious?
In the following two sections, we will show that maximum parsimony is in fact preserved by the history sDAG. Theorem 1 shows that any history expressed by a history sDAG constructed from maximally parsimonious histories must itself be maximally parsimonious. To achieve this we must first show that swapping certain substructures between histories preserves maximum parsimony. Then we will show that the collection of histories in the history sDAG is closed under these subhistory swaps, and that any history in the history sDAG can be obtained by such a subhistory swap involving histories used to construct the DAG. This means that the history sDAG is not only an effective way to store many maximum parsimony histories, but also may allow us to very quickly discover more such histories.
Preservation of maximum parsimony in the history sDAG has two important consequences:
-
A history sDAG constructed from maximally parsimonious histories will contain only maximally parsimonious histories. If a set T of histories with the same parsimony score is used to construct a history sDAG, and if that history sDAG expresses a history with any other parsimony score, then T must not have contained maximum parsimony histories.
-
It is always possible to trim an arbitrary history sDAG to express all of, and only, its maximally parsimonious histories. In particular, a new history sDAG constructed from the maximally parsimonious histories represented by the original history sDAG will contain only those histories used to construct it.
Throughout this paper we will refer to maximally parsimonious histories using the more general term minimum-weight histories, since maximum parsimony is characterized by minimizing the sum of a weight over all edges in a history. Indeed, this term is more general because we can use weight functions that are more complex than simply the sum of the number of mutations, or which consider label data other than nucleotide sequences.
We provide an implementation of the history sDAG and related algorithms in the open source Python package historydag, installable with pip and available at https://github.com/matsengrp/historydag. This package provides methods for constructing, trimming, collapsing, and extracting histories from the history sDAG as described in the following sections. historydag also implements methods which we will describe in future work, for efficiently calculating weights of histories represented in the history sDAG, and for expressing and sampling from a probability distribution on histories in the history sDAG.
For reference, we provide a summary of notation in Table 1.
2 Histories and the history sDAG
We will now provide a formal definition of histories and the history sDAG.
Let Y refer to a set of labels, such as nucleotide sequences. We can think of observed labels as a set \(X\subset Y \), labeling history leaves. We will not emphasize this set of leaf labels X, since a history sDAG may express collections of histories with varying leaf label sets. In the case of parsimony however, we will be interested in collections of histories which share a leaf label set consisting of observed nucleotide sequences.
We are interested in representing collections of rooted, multifurcating, non-unifurcating trees with nodes (including internal nodes) labeled by elements of Y. As mentioned in the Overview, we will make this easy by carefully defining histories.
Isomorphism classes of such internally labeled trees are in bijection with histories, as defined below. This correspondence is shown formally in Appendix A, but sufficient intuition may be found in Fig. 1.
Let Y be a set of labels, and let \({\mathcal {P}}(\cdot )\) denote the power set.
Definition 1
Let \({\text {Part}}(Y) \) be the set of all \(U\subset {\mathcal {P}}(Y){\setminus } \left\{ \emptyset \right\} \) such that,
-
for \(C_1, C_2\in U \), if \(C_1 \ne C_2 \) then \(C_1\cap C_2 = \emptyset \)
-
\(|U| \ne 1 \).
That is, \({\text {Part}}(Y) \) contains \(\emptyset \) and all sets of two or more nonempty, disjoint subsets (clades) of Y.
Given a set of leaf labels \(X\subset Y \), \({\text {Part}}(X) \) would contain all of the possible subpartitions of leaf labels in an internally labeled tree with leaves labeled by X. Notice that \({\text {Part}}(X) \subset {\text {Part}}(Y) \) for any such \(X\subset Y \). Since a history sDAG may contain histories with varying leaf label sets, elements of \({\text {Part}}(Y) \) are used to construct general history sDAG nodes.
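As a small illustration of Definition 1, the following sketch checks whether a candidate set of clades belongs to \({\text {Part}}(Y) \). The function name `is_subpartition` and the use of nested `frozenset` objects are assumptions made for this example only; they are not taken from the historydag package.

```python
from itertools import combinations


def is_subpartition(U):
    """Return True if U satisfies Definition 1.

    U is a collection of clades, each clade a frozenset of labels.
    Clades must be nonempty and pairwise disjoint, and U must not
    contain exactly one clade (the empty collection is allowed).
    """
    clades = list(U)
    if len(clades) == 1:
        return False
    if any(len(clade) == 0 for clade in clades):
        return False
    return all(a.isdisjoint(b) for a, b in combinations(clades, 2))


# Illustrative subpartitions over the labels A, B, C.
leaf = frozenset()
cherry = frozenset({frozenset({"A"}), frozenset({"B"})})
unifurcation = frozenset({frozenset({"A", "B"})})
overlapping = frozenset({frozenset({"A", "B"}), frozenset({"B", "C"})})

print(is_subpartition(leaf), is_subpartition(cherry))
print(is_subpartition(unifurcation), is_subpartition(overlapping))
```

The empty subpartition is accepted because it is the subpartition of a leaf node, while a single clade is rejected because it would correspond to a unifurcation.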
We will see that with the exception of a universal ancestor node, all nodes in the history sDAG structure consist of a label \(\ell \in Y\) and a subpartition \(U\in {\text {Part}}(Y) \).
Definition 2
A node-clade pair is a node \((\ell , U) \) and a choice of child clade \(C\in U \).
Definition 3
A history sDAG with labels Y is a directed graph (V, E) consisting of
-
A node set \(V\subset \left( Y\times {\text {Part}}(Y)\right) \cup \left\{ \rho \right\} \) such that \(\rho \in V \) is the universal ancestor (UA) node. For a node \(v = (\ell , U)\in V \), \(v \ne \rho \), we say that v’s label is \(\ell \), its subpartition is U, its child clades are elements of U, and its clade union \({\text {CU}}(v) \) is \(\left\{ \ell \right\} \) if \(U = \emptyset \), or \(\bigcup \nolimits _{C\in U} C \) otherwise.
-
A directed edge set \(E\subset V\times V \) containing edges \(e = (v_1, v_2) \) from a parent node \(v_1 \) to a target or child node \(v_2 \) such that
-
1.
All nodes are reachable from the UA node \(\rho \), which itself accepts no incoming edges.
-
2.
For any edge whose parent node is not \(\rho \), the clade union of the target node must be in the subpartition of the parent node.
Formally, for any edge \(e = \left( (\ell _1, U_1), (\ell _2, U_2) \right) \in E \), if C is the clade union of \((\ell _2, U_2) \), then \(C\in U_1 \).
We say then that the edge e descends from the node-clade pair \(\left( (\ell _1, U_1), C \right) \).
-
3.
For each node \(v = (\ell , U) \), and for each choice of child clade \(C\in U \), at least one edge descends from the node-clade pair (v, C) .
Notice that by requirements (1) and (3) in the definition of the history sDAG, all nodes in the history sDAG must have descendant edges, except for those of the form \((\ell , \emptyset ) \). We will refer to these as leaf nodes.
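The three requirements of Definition 3 can be checked mechanically. The sketch below is one possible checker, under the assumption that a node is stored as a pair `(label, subpartition)` with the subpartition a `frozenset` of `frozenset` clades, and that the string `"UA"` stands in for \(\rho \). These representation choices and the names `clade_union` and `is_history_sdag` are hypothetical.

```python
UA = "UA"  # stand-in for the universal ancestor node rho (an assumption)


def clade_union(node):
    """CU(v): the label itself for a leaf, else the union of child clades."""
    label, subpartition = node
    if not subpartition:
        return frozenset({label})
    return frozenset().union(*subpartition)


def is_history_sdag(nodes, edges):
    """Check requirements (1)-(3) of Definition 3 for a node set and edge set."""
    if UA not in nodes:
        return False
    if any(p not in nodes or c not in nodes for p, c in edges):
        return False
    # (1) The UA node accepts no incoming edges, and reaches every node.
    if any(child == UA for _, child in edges):
        return False
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    seen, stack = {UA}, [UA]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    if seen != set(nodes):
        return False
    # (2) Below a non-UA parent, the child's clade union is a child clade.
    for parent, child in edges:
        if parent != UA and clade_union(child) not in parent[1]:
            return False
    # (3) Every node-clade pair has at least one descending edge.
    for node in nodes:
        if node == UA:
            continue
        covered = {clade_union(child) for child in children.get(node, ())}
        if set(node[1]) - covered:
            return False
    return True


a, b = ("A", frozenset()), ("B", frozenset())
root = ("A", frozenset({frozenset({"A"}), frozenset({"B"})}))
nodes = {UA, root, a, b}
edges = {(UA, root), (root, a), (root, b)}
print(is_history_sdag(nodes, edges))
```

Dropping the edge `(root, b)` makes the check fail twice over: `b` is no longer reachable from the UA node, and the node-clade pair for the clade containing `B` has no descending edge.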
Observation 1
Since only nodes of the form \((\ell , \emptyset ) \) may have no children, all leaf nodes in a history sDAG must be of this form, and therefore no two leaf nodes may be labeled by the same element of Y.
Observation 2
For any history sDAG edge \((v_1 = (\ell _1, U_1), v_2) \), we know that \({\text {CU}}(v_2) \subset {\text {CU}}(v_1) \), since \({\text {CU}}(v_2) \in U_1 \). More generally, consider a history sDAG (V, E) , in which a node \(v' \) is reachable from another node v via a sequence of edges in E. By transitivity of inclusion, \({\text {CU}}(v') \subset {\text {CU}}(v) \).
Definition 4
A history is a history sDAG in which the UA node \(\rho \) has a unique child node, and each node-clade pair has exactly one descendant edge.
The set of labels of the leaf nodes in a history t will be denoted L(t) .
Notice that not every element of Y must appear as a node label in a history or history sDAG. That is, Y is an ambient label set, such as the set of all nucleotide sequences of a fixed length, from which history sDAG node labels can be chosen.
Also notice that there is no distinction between leaf node and internal node labels. In practice, the set of leaf node labels will be associated with a set of observed evolving entities. When a sampled entity is inferred to be an ancestor of other sampled entities, we can represent this in a history with an internal node carrying the label corresponding to the sampled ancestor.
Informally, a labeled tree can be converted to a history by annotating each node with its subpartition, and adding a UA node as a parent of the root node (Fig. 1). The unique child of the UA node in a history will be called the root node, since it represents the root node of a corresponding internally labeled tree.
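The conversion just described can be sketched as a single post-order traversal. The example assumes that a labeled tree is given as nested `(label, [children])` pairs, that leaf labels are distinct, and that no node has exactly one child; the function name `tree_to_history` and the `"UA"` placeholder are hypothetical.

```python
UA = "UA"  # stand-in for the universal ancestor node rho (an assumption)


def tree_to_history(tree):
    """Convert a nested (label, [children]) tree to (nodes, edges).

    Each node becomes (label, subpartition), where the subpartition is the
    set of clades of its children, and a UA node is added above the root.
    """
    nodes, edges = {UA}, set()

    def visit(subtree):
        label, children = subtree
        # Visit children first, so that their clades are known.
        results = [visit(child) for child in children]
        subpartition = frozenset(clade for _, clade in results)
        node = (label, subpartition)
        nodes.add(node)
        for child_node, _ in results:
            edges.add((node, child_node))
        if children:
            clade = frozenset().union(*subpartition)
        else:
            clade = frozenset({label})
        return node, clade

    root, _ = visit(tree)
    edges.add((UA, root))
    return nodes, edges


# A three-leaf tree whose labels are short illustrative sequences.
example_tree = ("AA", [("AA", []), ("AC", [("AC", []), ("CC", [])])])
nodes, edges = tree_to_history(example_tree)
print(len(nodes), len(edges))
```

In this example the internal node labeled `AA` and the leaf labeled `AA` remain distinct nodes, because their subpartitions differ.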
The natural substructure of a history is analogous to a subtree of a labeled tree, and will be very useful in later sections.
Definition 5
Given a history sDAG (V, E) , a subgraph \(s = (V_s, E_s) \) with \(V_s \subset V \) and \(E_s \subset E \) is a subhistory of (V, E) if
-
1.
\(\rho \notin V_s \),
-
2.
there exists a root node \(v_r\in V_s \) such that all other nodes in \(V_s \) are reachable from \(v_r \), and
-
3.
each node-clade pair in s has exactly one descendant edge.
The set of labels of leaf nodes in a subhistory s is denoted L(s) .
Later we will establish formally that a history is in fact a tree. Given that fact, naming a subhistory is equivalent to removing an edge from a history, and discarding the component which contains the UA node.
In addition to the UA node, the definition of history contains redundant information in the sense that the subpartition of a node, formally a piece of data associated with each node, can be recovered as the set of sets of labels of leaf nodes reachable from that node’s children. Although this choice may seem an unnecessary complication, it is essential in distinguishing histories contained in a larger history sDAG. This redundancy is shown in the following lemma, which is proven in Appendix A:
Lemma 3
Let (V, E) be a history sDAG or subhistory, and let \(v\in V \). The set of labels of leaf nodes reachable from v is \({\text {CU}}(v) \).
Lemma 3 implies that a history’s set of leaf labels is determined by the subpartition of its root node. This will be relevant later, when we describe what it means for a history to be found in a history sDAG.
We intend for a history to be tree-shaped, but this is not assumed by the definition given. Lemma 3 also allows us to prove this essential fact.
Lemma 4
A history sDAG (V, E) is a history if and only if it is a tree, and contains exactly one edge descending from \(\rho \).
The proof of this lemma is given in Appendix A.
Notice that since elements of \({\text {Part}}(Y) \) may not contain exactly one clade, and since nodes in a history have exactly one child node per child clade, no node (other than the UA node) in a history may have exactly one child. This is required to ensure that the history sDAG may not contain cycles. Although this is not stated in Definition 3, it is an important property of the history sDAG as the name suggests, and is proven in Appendix A:
Lemma 5
A history sDAG (V, E) is acyclic.
Sometimes data sets include a fixed root node label, such as a common ancestor sequence. A search for minimum weight labeled histories explaining such a data set may yield labeled histories with a unifurcation at the root node. We accommodate this by considering the fixed root sequence a leaf node label, and placing the corresponding leaf node as an additional child of the root node.
Since histories are tree-shaped history sDAGs, we can store collections of histories by taking their graph union. However, we should first verify that a graph union of history sDAGs is itself a history sDAG.
Lemma 6
Let (V, E) and \((V', E') \) be history sDAGs on labels Y. Then \((V\cup V', E\cup E') \) is also a history sDAG.
Proof
All the nodes and edges required to satisfy Definition 3 are present in \((V\cup V', E\cup E') \), since they are present in each of the original history sDAGs. All nodes are reachable from the UA node, through exactly the same sequence of edges by which they were reachable in at least one of the original history sDAGs. \(\square \)
Definition 6
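For a set T of histories with labels in Y, the history sDAG constructed from T is the graph union of the histories in T:
$$\begin{aligned} \left( \bigcup _{(V_t, E_t)\in T} V_t,\ \bigcup _{(V_t, E_t)\in T} E_t \right) . \end{aligned}$$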
We should also formalize the way in which a history sDAG contains histories. To do so, we will need to define a trim, which is a history sDAG which appears as a substructure in a larger history sDAG.
Definition 7
Let (V, E) be a history sDAG on labels Y. Then \((V', E') \) is a trim of (V, E) if \(V' \subset V \), \(E'\subset E \), and \((V', E') \) is a history sDAG on labels Y. We say a history \(t = (V'', E'') \) is in the history sDAG (V, E) if \((V'', E'') \) is a trim of (V, E) .
The collection of histories in the history sDAG constructed from a collection of histories T will be denoted D(T) .
We can now see why we must specify in Definition 4 that \(\rho \) has exactly one child node in a history. Edges descending from the UA node in a history sDAG keep track of which DAG nodes are allowed to be root nodes. It may be possible to choose a tree-shaped trim of a history sDAG in which two nodes \(v_1 \) and \(v_2 \) are children of \(\rho \), and \({\text {CU}}(v_1) \cap {\text {CU}}(v_2) = \emptyset \). Such a structure should be considered a trim containing two histories, but is not itself a history.
Any history sDAG should be uniquely determined by the collection of histories it contains. This intuition motivates the following two lemmas, which are proven in Appendix A:
Lemma 7
Let (V, E) be a history sDAG. For any \(v\in V \), there exists a subhistory s in (V, E) whose root node is v.
Lemma 8
Let (V, E) be a history sDAG, and let T be the collection of histories in (V, E) . Then (V, E) is the history sDAG constructed from T.
Finally, we will need to define the largest possible history sDAG constructed using a given set of labels.
Definition 8
The complete history sDAG on labels Y is the history sDAG which contains all possible edges on all nodes allowed by the choice of Y.
Equivalently, the complete history sDAG could be constructed as the graph union of all possible histories with labels in Y.
2.1 History weights
In this section we will define a general scheme for assigning weights to histories, and describe the relationship between these weights and the structure of the history sDAG.
As shown in Fig. 3, the history sDAG in general contains more histories than were used to construct it. These extra histories arise because the history sDAG allows subhistories to be swapped between the histories it contains, whenever the subhistories’ parent nodes share the same child clades and node label. We refer to this occurrence as subhistory swapping. Appendix A describes these subhistory swaps precisely, shows that all new histories in a history sDAG can be described in terms of sequences of these subhistory swap operations, and provides the proof for Theorem 1, which involves an argument that these subhistory swaps preserve history weights.
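To make the growth in the number of histories concrete, the following sketch takes the graph union of two histories that share a root node and counts the histories in the result. The counting recursion (a product over child clades of a sum over the children of each node-clade pair) is our own illustration rather than a description of the historydag implementation, and all names in the block are hypothetical.

```python
from functools import lru_cache
from math import prod

UA = "UA"  # stand-in for the universal ancestor node rho (an assumption)


def clade_union(node):
    label, subpartition = node
    return frozenset().union(*subpartition) if subpartition else frozenset({label})


def leaf(label):
    return (label, frozenset())


def internal(label, *kids):
    return (label, frozenset(clade_union(kid) for kid in kids))


def history(root_label, left_label, right_label):
    """Edge set of the history ((A, B), (C, D)) with the given internal labels."""
    a, b, c, d = map(leaf, "ABCD")
    left, right = internal(left_label, a, b), internal(right_label, c, d)
    root = internal(root_label, left, right)
    return {(UA, root), (root, left), (root, right),
            (left, a), (left, b), (right, c), (right, d)}


def count_histories(edges):
    """Count the histories in the history sDAG with the given edge set."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    @lru_cache(maxsize=None)
    def below(node):
        # One independent choice of child per child clade; leaves give 1.
        return prod(
            sum(below(c) for c in children[node] if clade_union(c) == clade)
            for clade in node[1]
        )

    return sum(below(root) for root in children.get(UA, []))


h1 = history("r", "x1", "y1")
h2 = history("r", "x2", "y2")
dag_edges = h1 | h2  # graph union of the two edge sets
print(count_histories(h1), count_histories(dag_edges))
```

The two input histories share the root node, since it has the same label and subpartition in both. Either subhistory below each child clade may therefore be chosen independently, and the union contains four histories, two of which were not used to construct it.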
This section will leave the details of subhistory swaps, and the proof of Theorem 1, to the Appendix, and only build the background necessary to state and understand Theorem 1.
We begin by defining another useful type of history substructure.
Definition 9
Let (V, E) be a history sDAG and let \(s = (V_s, E_s) \) be a subhistory of (V, E) . Also, let \(v_r \) be the root node of the subhistory s, and let \(v\in V \) be a parent node of \(v_r \), so that \((v, v_r) \in E \). Then the augmented subhistory \(s^{v} \) is the subgraph \(\left( V_s\cup \left\{ v\right\} , E_s \cup \left\{ (v, v_r) \right\} \right) \) of (V, E) consisting of the subhistory s plus the parent node v and the edge connecting v to \(v_r \).
Definition 10
Let (V, E) be a history sDAG, and \(v = (\ell , U) \in V \) a node. We make the following definitions.
-
\({\text {Ch}}(v):= \left\{ v_c \mid (v, v_c) \in E \right\} \) will denote the set of children of v
-
\({\text {Ch}}(v, C):= \left\{ v_c \mid (v, v_c)\in E,\ {\text {CU}}(v_c) = C \right\} \) will denote the set of children of the node-clade pair (v, C) for each clade \(C\in U\)
-
\({\text {B}}(v) \) will denote the set of subhistories in (V, E) rooted at v.
Although we are interested primarily in computing parsimony on histories labeled with nucleotide sequences, we will do so within a much more general framework of history weights.
Definition 11
Let (V, E) be the complete history sDAG on labels Y, and let \(f:E\rightarrow W \) be an edge weight function to a weight set W endowed with addition and containing an additive identity \(0\in W \). The weight of any subgraph \((V', E') \) of (V, E) is then given by the weight function \(g_f\):
$$\begin{aligned} g_f\left( (V', E') \right) := \sum _{e\in E'} f(e). \end{aligned}$$
In particular, since any history t in (V, E) is a subgraph of (V, E) , the weight of t is given by \(g_f(t) \).
In the case of parsimony, the label set Y will contain sequences, the function f is Hamming distance, and \(g_f \) will compute the parsimony score of a history. A history’s parsimony score is decomposable as a sum of an edge weight function over edges only when complete, unambiguous nucleotide sequences are accessible to that weight function as node label data. If nucleotide sequences of internal nodes are not contained in node label data, the contribution of an edge to a history’s parsimony score may be dependent on the structure of the rest of the history, making the decomposition impossible. In particular, the edge weight function f is required to be a function on all possible history sDAG edges, which correctly reports the contribution of an edge to the weight of any history which contains it.
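For the parsimony case, the weight of a history can be sketched as a sum of Hamming distances over its edges. This example assumes that nodes are `(label, subpartition)` pairs whose labels are equal-length strings, and that edges leaving the UA node contribute weight zero; that last convention is an assumption of this sketch, as are the names `hamming`, `edge_weight` and `weight`.

```python
UA = "UA"  # stand-in for the universal ancestor node rho (an assumption)


def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(s, t))


def edge_weight(edge):
    """Edge weight f: Hamming distance between parent and child labels."""
    parent, child = edge
    if parent == UA:
        return 0  # assumed convention: UA edges carry no mutations
    return hamming(parent[0], child[0])


def weight(edges, f=edge_weight):
    """g_f: the sum of f over all edges of a subgraph."""
    return sum(f(edge) for edge in edges)


leaf_aa, leaf_ac, leaf_cc = (("AA", frozenset()), ("AC", frozenset()),
                             ("CC", frozenset()))
inner = ("AC", frozenset({frozenset({"AC"}), frozenset({"CC"})}))
root = ("AA", frozenset({frozenset({"AA"}), frozenset({"AC", "CC"})}))
history_edges = {(UA, root), (root, leaf_aa), (root, inner),
                 (inner, leaf_ac), (inner, leaf_cc)}
print(weight(history_edges))  # parsimony score of this history
```

Only two edges in this history carry a mutation (`AA` to `AC`, and `AC` to `CC`), so its parsimony score is 2. Because `weight` takes the edge function as a parameter, other decomposable weights can be substituted without changing the summation.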
Although our focus here is parsimony, notice that this framework allows much more general notions of history weight, including situations where the function f is sensitive to edge direction or subpartitions, or takes values in a non-numeric set, such as a set of sequences. These generalizations will be important for future applications. For example, we could compute a branching process likelihood like that used by the gctree project, whose value can be decomposed over tree edges, and which can be summarized by a pair of integers (DeWitt et al. 2018).
To compare weights of histories, the weight set W must admit a total ordering. This ordering will be required to respect addition on W, in a slightly weaker sense than is generally meant:
Definition 12
A weight set W, endowed with addition, is clade-ordered with respect to some edge weight function f and history sDAG (V, E) on labels Y if
-
The ordering on W respects addition and is a total ordering on all of the following subsets of W:
-
\(\circ \) Sets of weights of subhistories below any node: \(\left\{ g_f(s) \mid s\in {\text {B}}(v) \right\} \), for any \(v \in V {\setminus } \left\{ \rho \right\} \),
-
\(\circ \) Sets of weights of augmented subhistories below any node-clade pair: \(\quad \left\{ g_f(s^v) \mid s\in {\text {B}}(v_c),\ v_c\in {\text {Ch}}(v, C)\right\} \), for any \(v = (\ell , U) \in V{\setminus } \left\{ \rho \right\} \), and any \(C\in U \).
-
The ordering on W is a total ordering on the set of weights of histories:
$$\begin{aligned} \left\{ g_f(t) \mid t\text { is a history in } (V, E) \right\} \subset W \end{aligned}$$
We say that the ordering on W respects addition on a set \(W'\subset W \) if for all \(a, b\in W' \) and for all \(c\in W \), \(a < b \) if and only if \(a + c < b + c \).
The following observation makes this definition easier to use.
Observation 9
Let W be a weight set which is clade-ordered with respect to a history sDAG (V, E) and edge weight function f. If \((V', E') \) is a trim of (V, E) , and \(f': E' \rightarrow W\) is equal to f restricted to \(E' \), then W is also clade-ordered with respect to \(f' \) and \((V', E') \).
For example, it may often be most convenient to argue that a weight set is clade-ordered with respect to the complete history sDAG on the label set Y, and a weight function defined on all possible edges in that history sDAG.
However, this is a strictly stronger condition on W, which is why the definition of clade-ordering is with respect to a particular history sDAG.
Through the rest of this section, the label set Y will be fixed, and it will be assumed that f is an edge weight function mapping into W, a weight set which is clade-ordered with respect to f.
Finally, we can describe exactly the sense in which the history sDAG preserves history weights, a property depicted in Fig. 4.
Theorem 1
Let T be a collection of histories, so that \(g_f(t) = K \) for all \(t\in T \). Then there exists a history \(t\in D(T) \) with \(g_f(t) < K \) if and only if there exists a history \(t'\in D(T) \) with \(g_f(t') > K \).
Theorem 1 is the motivation for and main result of this section, guaranteeing that a history sDAG constructed from minimum weight histories will only express minimum weight histories, and is proven in Appendix A.
However, since it may be impractical to verify that a collection of histories are minimum weight relative to all other possible histories on a chosen label set, Theorem 1 will often be more useful when applied in the form of the following corollary, that any history sDAG may be trimmed to express exactly its minimum weight histories, relative only to the other histories in that history sDAG.
Corollary 1.1
Let (V, E) be a history sDAG, and let f be an edge weight function as defined previously. Then there exists a history sDAG \((V', E') \) which is a trim of (V, E) such that the histories in \((V', E') \) are exactly the minimum weight histories in (V, E) with respect to f.
Proof
Let T be the collection of histories expressed by (V, E) , so that \(D(T) = T \). Let K be the minimum weight achieved by \(g_f \) on T, and let \(T' \subset T \) be the set of minimum weight histories:
$$\begin{aligned} T' = \left\{ t\in T \mid g_f(t) = K \right\} \end{aligned}$$
We know that \(T'\subseteq D(T') \), so we need only show that \(T'\supseteq D(T') \). Since \(T' \subseteq T \), we know that \(D(T') \subseteq D(T) = T\), and that since T is the collection of histories in (V, E) , there exists no history \(t\in D(T') \) with \(g_f(t) < K \). Therefore, by Theorem 1, there exists no \(t\in D(T') \) with \(g_f(t) > K \). Since \(T' \) contains all the histories in T with weight K, we therefore know that \(T' = D(T') \). Let \((V', E') \) be the history sDAG constructed from \(T' \). Since \(T' = D(T') \), the history sDAG \((V', E') \) contains exactly the histories in \(T' \). Also, because \((V', E') \) is a graph union of histories in (V, E) , we know that \(V'\subset V \) and \(E'\subset E \). Therefore \((V', E') \) is the trim of (V, E) that we seek. \(\square \)
We shall take a small excursion now, in which we return to the setting of maximum parsimony which motivates these methods. It makes little sense to minimize parsimony on the set of all histories with labels in an ambient sequence set Y. Rather, one attempts to minimize parsimony subject to the constraint that history leaves are labeled by some fixed set of observed nucleotide sequences.
Definition 13
Let T be a set of histories with labels in Y. We say that histories in T have a fixed set \(X\subset Y \) of leaf labels if \(L(t) = X \) for all \(t\in T \).
Given an edge-weight function f and a set \(X\subset Y \), we say that a history t with \(L(t) = X \) is minimum weight relative to all histories on the fixed set of leaf labels X if \(g_f(t) \le g_f(t') \) for all histories \(t' \) with \(L(t') = X\).
In the general language of this section, a history t with nucleotide sequence labels is maximally parsimonious if it is minimum weight relative to all histories on the fixed leaf label set L(t) , with Hamming distance as the edge-weight function.
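As a small illustration of this weight, the sketch below computes the parsimony score of a history as the sum of Hamming distances over its edges. For brevity it assumes each edge is given as a pair of (parent label, child label) strings of equal length, ignores subpartitions, and omits the edge descending from the UA node. The function names and the example sequences are illustrative only.

```python
def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def history_weight(label_edges, f=hamming):
    """g_f(t): sum of the edge weight f over all edges of a history."""
    return sum(f(parent, child) for parent, child in label_edges)


# A toy history on leaf labels {AAT, GAT, ACA}; the internal node AAT
# sits above a leaf with the same label (a sampled ancestor).
example_edges = [
    ("AAA", "AAT"),
    ("AAA", "ACA"),
    ("AAT", "AAT"),
    ("AAT", "GAT"),
]
score = history_weight(example_edges)
print(score)
```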
The following observation guarantees that Theorem 1 and Corollary 1.1 are useful in this setting.
Observation 10
Let T be a set of histories with a fixed set of leaf labels \(X\subset Y \). Then for any \(t\in D(T) \), \(L(t) = X \).
The truth of this observation can be argued precisely using the lemmas in Appendix A supporting the proof of Theorem 1, but is apparent from Definition 3 and Fig. 1.
This means that given a set T of maximally parsimonious histories on a fixed set of leaf labels X, D(T) must only contain histories with leaves labeled by X. By Theorem 1 then, D(T) must only contain histories which are maximally parsimonious on leaf labels X.
If T contains histories on a fixed label set X which are not necessarily maximally parsimonious, Observation 10 ensures that trimming the history sDAG constructed from T as in Corollary 1.1 will result in a new history sDAG which expresses histories with the same fixed set of leaf labels X.
2.2 Trimming the history sDAG
Here we describe a straightforward method for trimming a history sDAG to represent only its minimum-weight histories. Corollary 1.1 guarantees that merging only the minimum-weight histories in a history sDAG will result in a new history sDAG containing only those histories, but provides no efficient method for producing this trimmed history sDAG. The method described here involves removing all edges which point to suboptimal subhistories, and can be realized in two traversals of the history sDAG.
Definition 14
Let (V, E) be a history sDAG on labels Y, and let f be an edge-weight function \(f: E\rightarrow W \) for W a weight set which is clade-ordered with respect to f and (V, E) .
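The minimum weight of an augmented subhistory beneath a node \(v=(\ell , U) \in V \) and a clade \(C\in U \) is given by \(M_f(v, C) \), defined as
$$\begin{aligned} M_f(v, C) = \min \left\{ g_f(s^v) \mid s\in {\text {B}}(v_c),\ v_c\in {\text {Ch}}(v, C) \right\} \end{aligned}$$
Also let \(M_f(v) \) report the minimum weight of any subhistory rooted at the node \(v=(\ell , U) \), and for any leaf node \(v'\in V \), let \(M_f(v') \) be the additive identity of W.
Notice that because W is clade-ordered, \(M_f(v) \) can be computed as
$$\begin{aligned} M_f(v) = \sum _{C\in U} M_f(v, C) \end{aligned}$$ (1)
That is, the minimum weight of a subhistory beneath a node is given by the sum over clades of the minimum weight achieved by an augmented subhistory below each clade.
Notice that the clade-ordering on W also allows us to compute \(M_f(v, C) \) more easily, as
$$\begin{aligned} M_f(v, C) = \min _{v_c\in {\text {Ch}}(v, C)} \left( f(v, v_c) + M_f(v_c) \right) \end{aligned}$$
With Eq. 1, this defines an efficient dynamic program for calculating the minimum weight of all histories in a history sDAG with respect to f, with
$$\begin{aligned} \min \left\{ g_f(t) \mid t\text { is a history in } (V, E) \right\} = M_f(\rho ) \end{aligned}$$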
\(M_f \) will be used to define the trimmed history sDAG:
Definition 15
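Let (V, E) be a history sDAG and \(f:E\rightarrow W \) be an edge weight function, with W clade-ordered. The minimum weight trim of (V, E) with respect to f is defined to be \(({\underline{V}}, {\underline{E}}) \), where
$$\begin{aligned} {\underline{E}}'&= \left\{ (v, v')\in E \mid f(v, v') + M_f(v') = M_f\left( v, {\text {CU}}(v')\right) \right\} ,\\ {\underline{V}}&= \left\{ v\in V \mid v = \rho \text { or } v\text { is reachable from } \rho \text { via edges in } {\underline{E}}' \right\} ,\\ {\underline{E}}&= \left\{ (v, v')\in {\underline{E}}' \mid v, v'\in {\underline{V}} \right\} . \end{aligned}$$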
Notice that \({\underline{E}}'\) consists of edges from E which point to optimal subhistories, \({\underline{V}} \) contains nodes reachable from \(\rho \) via those edges, and \({\underline{E}} \) removes edges from \({\underline{E}}' \) which connect any nodes not in \({\underline{V}} \).
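The two traversals can be sketched as follows. This is a minimal illustration rather than the implementation used for the results in this paper. It assumes a node is a pair `(label, subpartition)` with the subpartition a `frozenset` of clades, a clade is a `frozenset` of leaf labels, the clade union of a leaf is the singleton containing its label, and the UA node carries the label `None` with edges leaving it given weight 0. All function and variable names are hypothetical.

```python
from collections import defaultdict


def clade_union(node):
    """Leaf labels below a node; a leaf (empty subpartition) yields its own label."""
    label, subpartition = node
    return frozenset().union(*subpartition) if subpartition else frozenset([label])


def hamming_edge(parent, child):
    """Hamming distance between labels; edges leaving the UA node cost 0 (assumption)."""
    if parent[0] is None:
        return 0
    return sum(x != y for x, y in zip(parent[0], child[0]))


def min_weights(edges, f):
    """First traversal: M[v] and M_clade[(v, C)] by memoized recursion."""
    by_clade, nodes = defaultdict(list), set()
    for parent, child in edges:
        by_clade[(parent, clade_union(child))].append(child)
        nodes.update((parent, child))
    M, M_clade = {}, {}

    def m(v):
        if v not in M:
            total = 0  # additive identity; leaves have no clades to sum over
            for clade in v[1]:
                best = min(f(v, c) + m(c) for c in by_clade[(v, clade)])
                M_clade[(v, clade)] = best
                total += best
            M[v] = total
        return M[v]

    for v in nodes:
        m(v)
    return M, M_clade


def min_weight_trim(root, edges, f):
    """Second traversal: keep optimal edges reachable from the root."""
    M, M_clade = min_weights(edges, f)
    optimal = {(p, c) for p, c in edges
               if f(p, c) + M[c] == M_clade[(p, clade_union(c))]}
    out = defaultdict(list)
    for p, c in optimal:
        out[p].append(c)
    keep, stack = {root}, [root]
    while stack:
        for c in out[stack.pop()]:
            if c not in keep:
                keep.add(c)
                stack.append(c)
    return keep, {(p, c) for p, c in optimal if p in keep and c in keep}


def node(label, *clades):
    return (label, frozenset(frozenset(c) for c in clades))


a, b, c = node("AA"), node("AT"), node("TT")
x1 = node("AA", {"AA"}, {"AT"})
x2 = node("TT", {"AA"}, {"AT"})
r1 = node("AT", {"AA", "AT"}, {"TT"})
rho = node(None, {"AA", "AT", "TT"})
edges = {(rho, r1), (r1, x1), (r1, x2), (r1, c),
         (x1, a), (x1, b), (x2, a), (x2, b)}

M, _ = min_weights(edges, hamming_edge)
kept_nodes, kept_edges = min_weight_trim(rho, edges, hamming_edge)
print(M[rho], len(kept_nodes), len(kept_edges))
```

In this toy example the edge from `r1` to `x2` points to a suboptimal subhistory and is dropped. The edges below `x2` are individually optimal but are removed in the final step because `x2` is no longer reachable from the UA node.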
The following lemma verifies that this structure is what its name suggests.
Lemma 11
Let (V, E) be a history sDAG, and \(f:E\rightarrow W \) be an edge-weight function, with W a weight set which is clade-ordered with respect to f and (V, E) . Let \((V', E') \) be the history sDAG constructed from minimum-weight histories in (V, E) , with respect to f, and let \(({\underline{V}}, {\underline{E}}) \) be the minimum weight trim of (V, E) with respect to f. Then \((V', E') = ({\underline{V}}, {\underline{E}}) \).
The proof for this lemma is given in Appendix A.
2.3 Collapsing histories
The space of possible minimum weight histories on a fixed leaf label set is in general very large. However, some diversity in this set is a result of unnecessary history edges between nodes with the same label. Unless these edges target a leaf node, they are unnecessary, and their existence cannot be supported by the observed data represented in leaf labels.
Just as polytomies can be resolved as many possible bifurcating structures, collapsing history edges which connect nodes with identical labels reduces the number of possible histories on a fixed set of leaves, without restricting the number of informative evolutionary scenarios that can be expressed by those histories (Fig. 5).
Motivated by this observation, we will enforce in practice that adjacent nodes in a history not have the same label, unless one of them is a leaf node. This choice is possible because we allow multifurcations in histories, which leads to the definition of “collapsing” below. On the other hand, sampled ancestors in a history can be witnessed as an internal node with the observed label \(\ell \in Y \), adjacent to the leaf node labeled \(\ell \). Since the edge between these two nodes targets a leaf, such a structure is allowed in a history.
A history containing internal edges whose parent and child nodes carry the same label may be modified to remove such edges. Doing so will add multifurcations to the history, as shown in Fig. 6. The following definition allows us to mark edges as collapsible arbitrarily, not just when their parent and child node labels match. This generality is useful in precisely stating Lemma 13.
Definition 16
Let (V, E) be a history or history sDAG.
Given a binary-valued function \(b: E\rightarrow \left\{ 0,1 \right\} \), an edge \(e = \left( (\ell , U), (\ell ', U') \right) \in E \) is b-collapsible if \(b(e) = 1 \) and \(U' \ne \emptyset \) (so the target node is not a leaf node). An edge is b-collapsed if it is not b-collapsible. (V, E) is b-collapsed if each edge in E is b-collapsed.
For the purpose of this paper we are interested in collapsing edges whose parent and child nodes have the same label. In this situation b should return 1 on edges whose parent and child nodes have the same label, and we will use the terms label-collapsible and label-collapsed instead of b-collapsible and b-collapsed.
A history which is not label-collapsed can be converted to a label-collapsed history by merging adjacent nodes with the same label, but this process requires also modifying subpartitions (Fig. 6).
To formalize this, we first explain what it means to collapse an edge in a history.
Definition 17
Let \(t = (V_t, E_t) \) be a history with labels Y. Let (V, E) be the complete history sDAG on labels Y. Also let \(e = \left( (\ell _p, U_p), (\ell _c, U_c) \right) \in E_t \) be an edge in t, so that \((\ell _c, U_c) \) is not a leaf node. Let \(C = {\text {CU}}(\ell _c, U_c) \) be the clade in \(U_p \) from which the edge e descends.
The history \(t_e = (V_e, E_e) \), formed by collapsing e in t, is defined as follows:
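Define \(q:V_t \rightarrow V \) via
$$\begin{aligned} q(v) = \begin{cases} \left( \ell _p, \left( U_p{\setminus } \left\{ C \right\} \right) \cup U_c \right) & \text {if } v = (\ell _p, U_p) \text { or } v = (\ell _c, U_c),\\ v & \text {otherwise.} \end{cases} \end{aligned}$$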
Then \(V_e = q(V_t) \), and \(E_e = \left\{ (q(v), q(v')) \,\big \vert \,(v, v') \in E_t{\setminus } \left\{ e \right\} \right\} \).
Notice that after collapsing an edge, the resulting structure remains a valid history, because for any clade \(C\in U_c \), and for any node \(v_c \) which is a child of the node-clade pair \(\left( (\ell _c, U_c), C \right) \), the node \(v_c \) becomes a child of the node-clade pair \(\left( q(\ell _c, U_c), C \right) \). Also notice that \(q(\ell _c, U_c) = q(\ell _p, U_p) \) inherits the unique parent of \((\ell _p, U_p) \) in t.
The new history has one edge fewer than the original.
We can convert a history t to a label-collapsed history by iteratively collapsing each edge in t whose parent and child nodes have the same label.
Lemma 12
A history \(t_0 = (V_0, E_0)\) determines a unique label-collapsed history \(t_c \), which is the result of a finite sequence of edge collapses.
That is, there exists a finite sequence \(t_0, t_1 = (V_1, E_1), \ldots , t_n = (V_n, E_n) \) for which
-
\(t_i \) is the result of collapsing some edge \(e_i = \left( (\ell _i, U_i), (\ell _i', U_i') \right) \) in \(t_{i-1} \) for which \(\ell _i = \ell _i' \) and \(U_i'\ne \emptyset \), and
-
\(t_n \) is label-collapsed
Furthermore, for any such sequence of histories, \(t_n = t_c \).
Proof
Using the correspondence between histories and rooted, internally labeled, multifurcating trees established in Appendix A, we can use the fact that collapsing edges between internal nodes with the same label is a well-defined map on such trees. Since the order of edge collapse has no effect on the final tree, neither does the order of edge collapse on the final history in the sequence named above. \(\square \)
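The tree-level operation used in this proof can be sketched in a few lines. The example assumes a history is represented through its corresponding rooted, internally labeled tree, written as nested `(label, children)` pairs with a leaf having an empty child list. Subpartitions are implicit in this representation, and the function name is hypothetical.

```python
def label_collapse(tree):
    """Collapse every edge whose parent and child share a label,
    unless the child is a leaf. Returns a new (label, children) tree."""
    label, children = tree
    new_children = []
    for child in children:
        c_label, c_children = label_collapse(child)
        if c_children and c_label == label:
            # Internal child with the parent's label: splice its children
            # into the parent, creating a multifurcation.
            new_children.extend(c_children)
        else:
            new_children.append((c_label, c_children))
    return (label, new_children)


def count_edges(tree):
    """Number of edges in a (label, children) tree."""
    return sum(1 + count_edges(child) for child in tree[1])


# The internal node labeled A below the root labeled A is collapsed;
# the leaf labeled A is kept, since edges targeting leaves are not collapsible.
original = ("A", [("A", [("A", []), ("B", [])]), ("C", [])])
collapsed = label_collapse(original)
print(collapsed, count_edges(original), count_edges(collapsed))
```

Because children are collapsed before their parent is examined, a chain of internal nodes sharing one label is absorbed in a single pass, matching the statement that the order of collapses does not change the result.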
Label-collapsing histories individually is straightforward, but collapsing a large collection of histories could be done more efficiently by label-collapsing their history sDAG.
Label-collapsing histories from within a history sDAG is not as straightforward, because some edges descending from a node-clade pair may need to be collapsed, while others may not. This means that an algorithm to collapse the history sDAG must occasionally add new nodes to the DAG (Fig. 7).
In order to describe the behavior of collapsing in the history sDAG, we require the following definition.
Definition 18
Given a history sDAG (V, E), we say that a collection of histories T is an edge cover of (V, E) if for every edge \(e\in E\), there exists a history \(t\in T\) such that e is contained in t.
Further, a collection of histories T is a b-collapsible edge cover of (V, E) if for every b-collapsible edge \(e\in E\) and every subhistory s containing e, there is a history \(t\in T\) that contains s.
The following lemma describes what it means to collapse a single edge in a history sDAG.
Lemma 13
Let (V, E) be a history sDAG with label set Y.
Also let \((v_p = (\ell _p, U_p), v_c = (\ell _c, U_c))\in E \) be an internal edge. That is, \(U_c\ne \emptyset \) (so \(v_c \) isn’t a leaf node).
Define a binary function \(b: E\rightarrow \left\{ 0, 1 \right\} \) which is constant at 0, except that \(b(v_p, v_c)=1 \), and let T be any b-collapsible edge cover of (V, E) .
Let \(v_p' = (\ell _p, U_p\cup U_c \setminus {\text {CU}}(v_c)) \) be the “new parent node”, and define:
Let \(R = \emptyset \) if there exists an edge \((v_p, v)\in E^+ \) with \({\text {CU}}(v) = {\text {CU}}(v_c) \). Otherwise, let R be the set of parent edges of \(v_p \), of the form \((v', v_p) \in E^+ \).
Then, let \(E^- = E^+\setminus R \). Finally, define
and
Claim: \((V', E') \) is the history sDAG constructed from \(T' \), the set of histories which result by collapsing the edge \((v_p, v_c) \) in each history in T in which it appears.
Notice that if \((v_p, v_c) \) is the only edge descending from the node-clade pair \((v_p, {\text {CU}}(v_c)) \), then collapsing it requires removing the node \(v_p \), and all edges involving it, from the history sDAG. Also, the definition of \(E^- \) will not leave any parent nodes of \(v_p \) with too few descendant edges, because we added edges from all parent nodes of \(v_p \) to \(v_p' \).
The last step in the construction of \(E' \) ensures that any nodes left without parents in the collapsing process will not appear in the label-collapsed history sDAG.
The proof for Lemma 13 is given in Appendix A.
Finally we arrive at the main result of this section, which provides a guarantee that all histories in a history sDAG can be collapsed by a finite sequence of edge collapses. Although this lemma is stated for label-collapsing, the result can immediately be generalized to b-collapsing, with respect to an arbitrary binary function b.
Lemma 14
Let \((V_0, E_0) \) be a history sDAG, and define a sequence \((V_i, E_i)_{i\in {\mathbb {N}}} \) of history sDAGs, so that \((V_k, E_k) \) is generated by collapsing an edge
$$\begin{aligned} e_{k-1} = \left( (\ell _{k-1}, U_{k-1}), (\ell _{k-1}', U_{k-1}') \right) \end{aligned}$$
in \((V_{k-1}, E_{k-1}) \) with \(\ell _{k-1} = \ell _{k-1}' \) and \(U_{k-1}' \ne \emptyset \) if such an edge exists. If no such edge exists, then \((V_k, E_k) = (V_{k-1}, E_{k-1}) \).
Then there exists \(N\in {\mathbb {N}}\) such that \((V_N, E_N) \) is label-collapsed. Also, if \(T_0 \) is a label-collapsible edge cover of \((V_0, E_0) \), and \(T_0' \) is the set of histories resulting from label-collapsing each history in \(T_0 \), then each history in \(T_0' \) is in \((V_N, E_N) \).
Although this lemma is written for label-collapsing, it extends to collapsing with respect to an arbitrary binary function b, defined on all possible edges in the complete history sDAG with the same leaf nodes and with labels chosen from the same ambient label set as \((V_0, E_0) \).
Note that the collapsing algorithm presented below produces the collapsed history sDAG \((V_N, E_N) \).
The proof for Lemma 14 is given in Appendix A.
Lemma 14 suggests an algorithm for collapsing a history sDAG, whose implementation is given below.
Algorithm A
(Collapsing a history sDAG) Modifies a history sDAG so that no edges connect two non-leaf nodes with the same label, and the histories represented in the resulting history sDAG are the same as the set of histories represented by the original history sDAG, with each label-collapsed.
-
1.
Build queue. \({\mathcal {Q}}:= (v_i, v_i')_{i=1}^{|E|}\) is a queue of the edges \((v_i, v_i') \in E \), ordered so that if \((v_i, v_i') \) and \((v_j, v_j') \) are such that \(v_i' = v_j \), then \(j > i \). That is, edges at the beginning of the queue are closer to the UA node of the history sDAG.
-
2.
Collapse loop head. If \({\mathcal {Q}} \) is empty, END. Otherwise, remove the first element \(\left( v_p = (\ell _p, U_p), v_c = (\ell _c, U_c) \right) \) from \({\mathcal {Q}} \).
-
(a)
Check collapsed. If \(\ell _p = \ell _c \) and \(v_c \) is not a leaf node, and \((v_p, v_c)\in E\), go to new parent. Otherwise, return to collapse loop head.
-
(b)
New parent. Set \(v_p':= \left( \ell _p, U_p \cup U_c {\setminus } \left\{ \bigcup \limits _{C\in U_c} C \right\} \right) \). Add \(v_p'\) to V.
-
(c)
Add grandparents to new parent. For any \((v, v_p)\in E \), add \((v, v_p') \) to E and to the beginning of \({\mathcal {Q}} \).
-
(d)
Add children to new parent. For any \((v_p, v)\in E\), if the clade union of v is not the same as the clade union of \(v_c \), add \((v_p', v) \) to E and to the beginning of \({\mathcal {Q}} \).
-
(e)
Add grandchildren to new parent. For any \((v_c, v) \in E \), add \((v_p', v) \) to E and to the beginning of \({\mathcal {Q}} \).
-
(f)
Remove collapsed edge. Remove \((v_p, v_c) \) from E.
-
(g)
Remove lonely parent. If no edge \((v_p, v)\in E \) exists with clade unions of v and \(v_c \) equal, then do routine removenode \(v_p \) from (V, E).
-
(h)
Remove orphaned child. If no edge \((v, v_c)\in E \) exists, do routine removenode \(v_c \) from (V, E) . Return to Collapse loop head.
The routine removenode v from (V, E) is the following:
-
1.
Remove node. Remove v from V.
-
2.
Remove children loop head. For each child node \(v_c \) of v:
-
(a)
Remove edge. Remove the edge \((v, v_c) \) from E.
-
(b)
Clean child node. If no edge \((v', v_c) \) exists in E, then do routine removenode \(v_c\) from (V, E) .
-
3.
Remove parent loop head. For each parent node \(v_p \) of v:
-
(a)
Remove edge. Remove the edge \((v_p, v) \) from E.
Notice that each iteration of the collapse loop corresponds with an element in the sequence of history sDAGs named in Lemma 14. Since the order of edges in the sequence \((e_k) \) in Lemma 14 has no effect on the resulting history sDAG, the order of edges in the queue should have no effect on the history sDAG produced by this algorithm.
2.4 History sDAG completion
We now introduce “completion,” which essentially means that we add every edge that respects clade union sets. More precisely, Definition 3 specifies that each edge of a history sDAG must target a node whose clade union is in the subpartition of its parent. Given a collection of history sDAG nodes V, we can create an edge set \(E' \) containing all edges allowed by this requirement. The resulting DAG \((V, E')\) then contains all histories that can be constructed using nodes from V. If V is the node set for some valid history sDAG, then the resulting DAG \((V, E') \) must also be a history sDAG.
By completing a history sDAG, additional histories are represented. Although there is no guarantee about the weight of these new histories, the completed history sDAG may contain additional minimum weight histories, which makes this operation useful.
This idea is expressed in the following definition.
Definition 19
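Let T be a collection of histories with labels in Y. Let (V, E) be the history sDAG constructed from T. The completed history sDAG constructed from T is the history sDAG \((V, E') \), where
$$\begin{aligned} E' = \left\{ (v, v') \mid v = (\ell , U)\in V,\ v'\in V{\setminus } \left\{ \rho \right\} ,\ {\text {CU}}(v')\in U \right\} \end{aligned}$$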
We will also refer to \((V, E') \) as the completion of (V, E) .
The completed history sDAG constructed from T is a history sDAG because it includes at least those edges present in the history sDAG constructed from T, and all the additional edges are allowed by the definition of the history sDAG. We emphasize that history sDAG completion adds no new nodes, and that a completed history sDAG is in general a much smaller object than the complete history sDAG on a taxon set described in Definition 8.
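Completion can be sketched as a search over pairs of nodes. As an illustration only, the example below assumes a node is a pair `(label, subpartition)` with clades stored as `frozenset`s of leaf labels, takes the clade union of a leaf to be the singleton containing its label, and gives the UA node the label `None`. The UA node is explicitly excluded as an edge target. The names are hypothetical.

```python
def clade_union(node):
    """Leaf labels below a node; a leaf (empty subpartition) yields its own label."""
    label, subpartition = node
    return frozenset().union(*subpartition) if subpartition else frozenset([label])


def complete(nodes, root):
    """All edges (p, c) with the clade union of c in the subpartition of p.
    No nodes are added, and the UA node is never a child."""
    nodes = list(nodes)
    return {(p, c) for p in nodes for c in nodes
            if c != root and clade_union(c) in p[1]}


def node(label, *clades):
    return (label, frozenset(frozenset(c) for c in clades))


a, b, c = node("AA"), node("AT"), node("TT")
x1 = node("AA", {"AA"}, {"AT"})
x2 = node("TT", {"AA"}, {"AT"})
r1 = node("AT", {"AA", "AT"}, {"TT"})
r2 = node("AA", {"AA", "AT"}, {"TT"})
rho = node(None, {"AA", "AT", "TT"})

# Edges of a history sDAG built from two histories: one through r1 and x1,
# the other through r2 and x2.
original_edges = {(rho, r1), (rho, r2), (r1, x1), (r1, c), (r2, x2), (r2, c),
                  (x1, a), (x1, b), (x2, a), (x2, b)}
all_nodes = {v for edge in original_edges for v in edge}
completed_edges = complete(all_nodes, rho)
new_edges = completed_edges - original_edges
print(len(original_edges), len(completed_edges))
```

In this toy example, completion adds exactly the two edges that pair each tree root node with the subhistory originally seen only beneath the other.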
Earlier sections show that \(T \subset D(T) \) for a set of histories T because a history sDAG constructed from T allows subhistory swaps involving conforming subhistories. In contrast, the completed history sDAG constructed from T allows any subhistories on the same leaf labels to swap, regardless of their parent nodes.
Swaps between subhistories with the same leaf label sets will not preserve history weights in the same sense as conforming subhistory swaps. Therefore, the completed history sDAG constructed from a set of histories T is not guaranteed to preserve weights in any sense. However, the lemmas from the previous sections guarantee that any history sDAG can be trimmed to express only its minimum weight histories. This means that the completed history sDAG can be used as a way to find even more minimum-weight histories than the original history sDAG construction, given a set of minimum-weight histories T. For example, completing a history sDAG constructed from maximally parsimonious, or nearly maximally parsimonious histories, could in some cases find additional maximally parsimonious histories which wouldn’t have been present before completion.
The completed history sDAG constructed from a set T of histories represents all possible histories which can be constructed using the nodes of histories in T. A choice of input histories can therefore be framed as a choice of plausible pairs of labels and subpartitions, which then determines a collection of plausible histories.
3 Exploring parsimony diversity of SARS-CoV-2 clades
The original motivation for the history sDAG was to store a collection of minimum-weight histories. The theorems in the preceding sections show that the history sDAG is an ideal object for this task, and can discover new minimum weight histories in addition to those which we seek to store. Because SARS-CoV-2 is densely sampled relative to the rate of mutation and undergoes minimal recombination, parsimony methods are well-suited to studying its evolution (Thornlow et al. 2021). However, we will now demonstrate that there exists considerable uncertainty in a parsimonious reconstruction of SARS-CoV-2 evolution.
Searching for maximally parsimonious trees is computationally intensive, and scales poorly as the number of leaves increases. Traditionally, tools like PHYLIP's dnapars were used to produce an assortment of maximally parsimonious trees on a given set of sequences (Felsenstein 2009). Recently, the UShER project made it possible to quickly reconstruct a single approximate parsimony tree on millions of sampled sequences (Thornlow et al. 2021). Neither method guarantees that the reconstructed trees are maximally parsimonious relative to all possible trees on the given leaf sequences.
Users of both methods often accept the first tree produced, ignoring the uncertainty inherent to the parsimony assumption. However, there are in general many possible maximally parsimonious trees on a given set of leaf sequences.
Indeed, dnapars by default outputs a non-exhaustive collection of maximally parsimonious trees. However, for very large sets of sequences, a collection of nearly maximally parsimonious trees may be produced much more quickly using UShER. As a demonstration, we use UShER to reconstruct trees on an assortment of SARS-CoV-2 clades, extracted from the global phylogeny of public SARS-CoV-2 sequences provided by the UShER project (accessed 3-3-2022) (Lanfear 2020; Turakhia et al. 2021). We allowed UShER to reconstruct trees on the set of unique sequences from each clade, as well as the ancestral sequence in the original tree, outputting a maximum of 200 trees resulting from alternative parsimonious placements of samples. Including the ancestral sequence guarantees that the resulting reconstruction is comparable to the subtree of the global phylogeny corresponding to the same clade. We then use the UShER utility matOptimize (Ye et al. 2022) to attempt to optimize each tree, allowing the optimizer to make up to four moves for each sample which do not improve the parsimony score. Allowing a few such moves is intended to increase the diversity in output trees, without requiring excessive computation time. We saved four intermediate trees during optimization of each tree output by UShER. Optimizations of different trees output by UShER are not guaranteed to achieve the same parsimony score. However, even optimized trees which are not globally maximally parsimonious are likely to contain parsimony-optimal substructures.
For each clade, the collection of 800 intermediate trees resulting from these tree optimizations is used to create a history sDAG, after outgrouping the ancestral sequence in each. These 800 trees are not guaranteed to be unique, and in fact there are often many duplicates. The resulting history sDAG is then completed, trimmed to only express maximally parsimonious histories, and label-collapsed.
Whereas it is computationally expensive to construct a maximally parsimonious tree, the operations of trimming, collapsing, and completing are highly optimized, and in practice take only a few seconds for the history sDAGs used to produce Fig. 8. The number of operations required for the proposed trimming algorithm is bounded by \({\mathcal {O}}(E\cdot (MCS + MNC))\), and similarly the algorithm for completing the history sDAG is bounded by \({\mathcal {O}}(N^2\cdot MNC)\) where N is the number of nodes, E the number of edges, MCS the maximum size of any set of edges descending from a node-clade pair, and MNC the maximum number of child clades for any node in the history sDAG.
The resulting history sDAG sometimes contains histories which are slightly more parsimonious than any trees found by UShER, and in most cases, the number of maximally parsimonious histories contained in the resulting history sDAG is many orders of magnitude greater than the number of histories used as input (Fig. 8). However, this increase is far from uniform across clades. For the clade AY.46.6, the history sDAG expresses an impressive 25 orders of magnitude more tree diversity than the input trees found by UShER, and all of those trees have a slightly better parsimony score than any tree found by UShER. On the other hand, clade AY.111 also stands out in contrast, with only two unique trees found by UShER, and only those same two unique trees contained in the resulting history sDAG.
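Counts of this size cannot be obtained by enumerating histories. One natural way to compute them, sketched here as an assumption rather than as the procedure used for Fig. 8, is a recursion over the DAG: the number of subhistories below a node is the product, over its clades, of the summed counts of the children descending from each clade. The example reuses a toy representation in which a node is a pair `(label, subpartition)` and the UA node has label `None`. All names are hypothetical.

```python
from collections import defaultdict
from math import prod


def clade_union(node):
    """Leaf labels below a node; a leaf (empty subpartition) yields its own label."""
    label, subpartition = node
    return frozenset().union(*subpartition) if subpartition else frozenset([label])


def count_histories(root, edges):
    """Number of histories expressed by a history sDAG, without enumeration."""
    by_clade = defaultdict(list)
    for parent, child in edges:
        by_clade[(parent, clade_union(child))].append(child)
    memo = {}

    def count(v):
        if v not in memo:
            # The empty product is 1, so a leaf has exactly one subhistory.
            memo[v] = prod(sum(count(c) for c in by_clade[(v, clade)])
                           for clade in v[1])
        return memo[v]

    return count(root)


def node(label, *clades):
    return (label, frozenset(frozenset(c) for c in clades))


a, b, c = node("AA"), node("AT"), node("TT")
x1 = node("AA", {"AA"}, {"AT"})
x2 = node("TT", {"AA"}, {"AT"})
r1 = node("AT", {"AA", "AT"}, {"TT"})
r2 = node("AA", {"AA", "AT"}, {"TT"})
rho = node(None, {"AA", "AT", "TT"})

original_edges = {(rho, r1), (rho, r2), (r1, x1), (r1, c), (r2, x2), (r2, c),
                  (x1, a), (x1, b), (x2, a), (x2, b)}
completed_edges = original_edges | {(r1, x2), (r2, x1)}

n_original = count_histories(rho, original_edges)
n_completed = count_histories(rho, completed_edges)
print(n_original, n_completed)
```

Python integers have arbitrary precision, so the same recursion applies unchanged when the count spans many orders of magnitude.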
For some clades, such as 20F, the number of unique trees found by UShER is greater than the final number of trees expressed in the history sDAG. Although surprising, this is not contradictory, since many of the unique trees found by UShER may have a higher parsimony score than the trees contained in the final history sDAG.
It is unlikely that Fig. 8 reflects the true diversity of maximally parsimonious trees for each clade. In fact, the true minimum parsimony scores for tree reconstructions of each clade may be lower than the parsimony score of trees found here. The variation in tree diversity between clades is instead likely determined by features in the particular trees found by UShER. Further investigation of the true diversity of maximum parsimony trees will be left for future work.
Regardless, the large diversity of trees for most clades suggests that considerable uncertainty remains about tree structure when performing a maximum-parsimony search, even after collapsing edges without mutations into multifurcations. This uncertainty represents an opportunity to fine-tune the accepted tree in settings where parsimony is an appropriate assumption. For example, the histories found by this method could be used as a starting point for further optimization according to criteria other than parsimony. Such criteria, and their efficient calculation in the history sDAG, will be the subject of future work.
4 Discussion
This paper establishes that the history sDAG is an efficient structure for storage of similar internally labeled trees, and provides a foundation for future work to understand phylogenetic uncertainty using massive collections of parsimonious trees.
We described efficient methods for basic manipulation of the history sDAG object, and used these methods to demonstrate that for densely sampled SARS-CoV-2 data, it is possible to build a history sDAG containing many alternative parsimonious evolutionary histories. We implemented this process on clades containing up to seven thousand leaves, although it would have been feasible to use clades containing perhaps ten times as many. Software which is currently in development will allow parsimony optimization via matOptimize (Ye et al. 2022) directly on the history sDAG, avoiding the time-consuming step of generating many input trees with UShER, and hopefully allowing these methods to scale to even larger datasets.
Thanks to the convenient structure of the history sDAG, it will be possible to efficiently summarize clade-level uncertainty in these histories, although such methods will be described and benchmarked in a future paper. This approach can only be expected to work well when the tree posterior is overwhelmingly concentrated on maximally parsimonious trees, and even then clade supports estimated with the history sDAG may not be directly comparable to supports observed in a sample from the tree posterior. However, for phylogenetic inference resulting in a single maximally parsimonious tree (which is typically arbitrarily chosen from the collection of MP trees), our method could provide a valuable understanding of the uncertainty resulting from this choice. Clade support estimation via the history sDAG may have advantages over standard approaches to phylogenetic uncertainty estimation. Unlike a bootstrap approach, all alternative histories in the history sDAG are built on the same data, and therefore clade support derived from the history sDAG could be more accurate for clades defined by only a few mutations (Wertheim et al. 2022). Unlike a Bayesian approach, our method makes no attempt to fully resolve a tree when there is insufficient signal to do so, and we expect it to scale well to large data.
The history sDAG is related to various earlier works, as we now describe.
4.1 The Subsplit DAG
The history sDAG generalizes a similar construction useful for likelihood computations and variational inference on trees, integrating out ancestral sequence uncertainty (Zhang and Matsen 2018, 2019). Although this form of the DAG structure is not expressed in the original variational inference papers, it is described in a more recent paper (Jun et al. 2023). In this subsplit DAG, internal nodes do not contain label data, and each internal node is required to have exactly two child clades (a subsplit is a subpartition with two parts). That is, the subsplit DAG is a history sDAG in which internal nodes all share the same fixed label, and each node has two child clades. The additional node label information in the history sDAG is essential for efficient storage and retrieval of maximally parsimonious trees, with the inferred ancestral sequences dictated by the parsimony assumption.
4.1.1 The Buneman Graph
A construction known as the Buneman graph is related to the history sDAG. In this construction, a collection of observations, each consisting of a collection of binary traits, can be arranged in a graph. This Buneman graph contains as subgraphs all possible maximally parsimonious trees relating the observations (Semple and Steel 2003). This construction has been generalized to sequences of non-binary characters (Bandelt and Röhl 2009; Misra et al. 2011), and one such generalization was applied to the problem of finding provably maximally parsimonious trees on nucleotide sequence data (Misra et al. 2011).
However, although the Buneman graph contains all maximally parsimonious trees on a set of observations, it may also contain trees which are not maximally parsimonious. The Buneman graph is therefore not a natural data structure for storing collections of maximally parsimonious trees, since considerable additional computation may be needed to find the maximally parsimonious trees in the graph. In contrast, the history sDAG may be trimmed to express only maximally parsimonious trees, and sampling or iterating through the trees it contains is trivial. In addition, the history sDAG can be immediately generalized to arbitrary observed data (abstracted as node labels), and allows efficient computation and trimming with respect to weight functions other than parsimony.
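The trimming idea can be sketched on a toy example. The following is a minimal illustration, not the historydag implementation: it assumes nodes are written as `(label, subpartition)` pairs, that edge weights are Hamming distances between parent and child labels, and it omits the UA node for brevity. The names `trim`, `best`, and the toy sequences are hypothetical.

```python
from functools import lru_cache

def hamming(a, b):
    """Number of mismatched sites between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def clade_union(node):
    """Leaf labels below a node; a leaf (label, frozenset()) is its own clade."""
    label, subpartition = node
    return frozenset().union(*subpartition) if subpartition else frozenset([label])

def trim(children):
    """Return (best, trimmed): minimum weight below each node, and the DAG
    keeping only edges that achieve the minimum for their node-clade pair."""
    @lru_cache(maxsize=None)
    def best(node):
        return sum(
            min(hamming(node[0], c[0]) + best(c)
                for c in children[node] if clade_union(c) == clade)
            for clade in node[1]
        )

    trimmed = {}
    for node, kids in children.items():
        keep = []
        for clade in node[1]:
            options = [c for c in kids if clade_union(c) == clade]
            target = min(hamming(node[0], c[0]) + best(c) for c in options)
            keep += [c for c in options if hamming(node[0], c[0]) + best(c) == target]
        trimmed[node] = keep  # nodes made unreachable are not pruned in this sketch
    return best, trimmed

# Toy DAG: two alternative ancestors (n1, n2) above the clade {AA, AC}.
AA, AC, GC = ("AA", frozenset()), ("AC", frozenset()), ("GC", frozenset())
pair = frozenset([frozenset(["AA"]), frozenset(["AC"])])
n1, n2 = ("AA", pair), ("GG", pair)
root = ("AC", frozenset([frozenset(["AA", "AC"]), frozenset(["GC"])]))
children = {root: [n1, n2, GC], n1: [AA, AC], n2: [AA, AC], AA: [], AC: [], GC: []}

best, trimmed = trim(children)
print(best(root), n2 in trimmed[root])
```

Because the minimum is taken independently for each node-clade pair, any other weight that decomposes as a sum over edges could be substituted for `hamming` in this sketch.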
4.1.2 Tree Fusion
The swapping of subhistories that takes place in the history sDAG bears some resemblance to the procedure known as tree fusion, used in some parsimony software like TNT, in which clades are swapped between trees to improve parsimony scores (Goloboff 1999; Goloboff and Pol 2007).
Generally, the history sDAG can be thought of as a structure which efficiently represents, and allows computation on, the set of trees resulting from all possible combinations of these clade swaps. However, the history sDAG can only swap subhistories that have identical parent node labels and subpartitions. In contrast, tree fusion can consider trees resulting from swapping any subtrees, as long as they contain the same set of samples.
Tree fusion is better approximated in the completed history sDAG, which does allow swaps of any subhistories containing the same samples. That is, for a history sDAG (V, E) constructed from a set of histories T, the set of histories in the completion of (V, E) consists of all histories resulting from combinations of swaps involving subhistories of histories in T, regardless of their parent nodes. However, subhistory swaps are still fundamentally different from the swaps of subtopologies realized during tree fusion, since subhistory swaps maintain the same ancestral node labels that were present in the original histories involved in each swap. In order to ensure that ancestral labels are optimal in the new histories contained in the completed history sDAG, we would need an algorithm to reconstruct these ancestral states from scratch. Such an algorithm for computing optimal ancestral states in the history sDAG would be analogous to the Sankoff algorithm for reconstructing ancestral states on trees.
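For reference, the Sankoff algorithm on a single tree can be sketched as follows. This is the standard tree algorithm for one site, not the history sDAG analogue discussed above; the nested-tuple tree encoding, the unit substitution cost, and the function name `sankoff` are assumptions made for illustration.

```python
import math

STATES = "ACGT"

def unit_cost(a, b):
    """Assumed substitution cost: 0 for no change, 1 for any change."""
    return 0 if a == b else 1

def sankoff(tree, leaf_states, cost=unit_cost):
    """Return {state: minimum cost of the subtree if its root has that state}.

    `tree` is a leaf name (str) or a tuple of subtrees; `leaf_states` maps
    leaf names to observed states at a single site.
    """
    if isinstance(tree, str):
        return {s: (0 if s == leaf_states[tree] else math.inf) for s in STATES}
    child_tables = [sankoff(child, leaf_states, cost) for child in tree]
    return {
        s: sum(min(cost(s, c) + table[c] for c in STATES) for table in child_tables)
        for s in STATES
    }

tree = (("a", "b"), "c")
leaf_states = {"a": "A", "b": "A", "c": "G"}
root_table = sankoff(tree, leaf_states)
print(min(root_table.values()))
```

An optimal root state is any state attaining the minimum of `root_table`; optimal internal states would then be recovered by a traceback pass, which this sketch omits.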
Despite these limitations, Fig. 8 shows that the subtree swaps which are realized in the history sDAG can be effective in reducing parsimony scores. Although the history sDAG does not fully implement tree fusion, it concurrently applies subhistory swaps in many different histories, and allows the resulting trees to be filtered efficiently according to arbitrary criteria. This may represent an advantage over methods which keep track of and optimize far fewer trees.
4.1.3 Tree Sequences
The history sDAG also bears some similarities to the tree sequence (Kelleher et al. 2019; Speidel et al. 2019). The tree sequence encodes a single evolutionary history for segments of a multiple sequence alignment, with changes of evolutionary history at specific points along the alignment due to recombination. The history sDAG, on the other hand, is meant to encode an unordered collection of equally parsimonious histories.
4.1.4 Future Work
We are in the process of building software that will allow us to do larger-scale inference using the history sDAG. In addition to the uncertainty quantification goals described above, this software will also allow us to do broader exploration of the set of maximally parsimonious trees than previously possible. We also hope to use the history sDAG as a means of improving MCMC sampling.
Maximally parsimonious trees may be a good starting point for inference via other methods, such as the branching process used by the tree inference package gctree (DeWitt et al. 2018). To support this, we will develop efficient algorithms to make calculations on histories contained in the history sDAG. We will also explore ways to search for new optimal histories, such as maximally parsimonious histories, directly within the structure of the history sDAG.
Data Availability
The SARS-CoV-2 data used to produce clade reconstructions in the Exploring Parsimony Diversity section was read from the public SARS-CoV-2 tree distributed by the UShER team at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/. This data originates from GenBank (Hatcher et al. 2016) at https://www.ncbi.nlm.nih.gov, COG-UK (Nicholls et al. 2020) at https://www.cogconsortium.uk/tools-analysis/public-data-analysis-2/, and the China National Center for Bioinformation (Song et al. 2020; Zhao et al. 2020; Gong et al. 2020; Yu et al. 2022) at https://bigd.big.ac.cn/ncov/release_genome.
Code Availability
The history sDAG data structure described in this paper, as well as various algorithms described in this paper and in future work, are implemented in the open source Python package historydag, which is available at https://github.com/matsengrp/historydag.
All code necessary to reproduce the SARS-CoV-2 clade reconstruction example is available at https://github.com/matsengrp/usher-clade-reconstructions/tree/7953eda7eb5c15556753fc23b4807b748f6a2464.
References
Bandelt HJ, Röhl A (2009) Quasi-median hulls in Hamming space are Steiner hulls. Discrete Appl Math 157(2):227–233. https://doi.org/10.1016/j.dam.2006.09.015
DeWitt WSIII, Mesin L, Victora GD et al (2018) Using genotype abundance to improve phylogenetic inference. Mol Biol Evol 35(5):1253–1265. https://doi.org/10.1093/molbev/msy020
Felsenstein J (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39(4):783–791. https://doi.org/10.1111/j.1558-5646.1985.tb00420.x
Felsenstein J (2009) PHYLIP. https://evolution.genetics.washington.edu/phylip/doc/main.html
Goloboff PA (1999) Analyzing large data sets in reasonable times: solutions for composite optima. Cladistics 15(4):415–428. https://doi.org/10.1111/j.1096-0031.1999.tb00278.x
Goloboff PA, Pol D (2007) On divide-and-conquer strategies for parsimony analysis of large data sets: Rec-I-DCM3 versus TNT. Syst Biol 56(3):485–495. https://doi.org/10.1080/10635150701431905
Gong Z, Zhu JW, Li CP et al (2020) An online coronavirus analysis platform from the national genomics data center. Zool Res 41(6):705. https://doi.org/10.24272/j.issn.2095-8137.2020.065
Hatcher EL, Zhdanov SA, Bao Y et al (2016) Virus variation resource—improved response to emergent viral outbreaks. Nucleic Acids Res 45(D1):D482–D490. https://doi.org/10.1093/nar/gkw1065
Hoang DT, Chernomor O, von Haeseler A et al (2018) UFBoot2: improving the ultrafast bootstrap approximation. Mol Biol Evol 35(2):518–522. https://doi.org/10.1093/molbev/msx281
Ishikawa SA, Zhukova A, Iwasaki W et al (2019) A fast likelihood method to reconstruct and visualize ancestral scenarios. Mol Biol Evol 36(9):2069–2085. https://doi.org/10.1093/molbev/msz131
Jun SH, Nasif H, Jennings-Shaffer C et al (2023) A topology-marginal composite likelihood via a generalized phylogenetic pruning algorithm. Algorithms Mol Biol (to appear)
Kelleher J, Wong Y, Wohns AW et al (2019) Inferring whole-genome histories in large population datasets. Nat Genet 51(9):1330–1338. https://doi.org/10.1038/s41588-019-0483-y
Lanfear R (2020) A global phylogeny of SARS-CoV-2 sequences from GISAID. https://doi.org/10.5281/zenodo.3958883
Misra N, Blelloch G, Ravi R et al (2011) Generalized Buneman pruning for inferring the most parsimonious multi-state phylogeny. J Comput Biol 18(3):445–457. https://doi.org/10.1089/cmb.2010.0254
Nicholls SM, Poplawski R, Bull MJ et al (2020) Majora: continuous integration supporting decentralised sequencing for SARS-CoV-2 genomic surveillance. bioRxiv. https://doi.org/10.1101/2020.10.06.328328
Rambaut A, Holmes EC, O’Toole Á et al (2020) A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol 5(11):1403–1407. https://doi.org/10.1038/s41564-020-0770-5
Sanderson MJ, McMahon MM, Steel M (2011) Terraces in phylogenetic tree space. Science 333(6041):448–450. https://doi.org/10.1126/science.1206357
Sanderson MJ, McMahon MM, Stamatakis A et al (2015) Impacts of terraces on phylogenetic inference. Syst Biol 64(5):709–726. https://doi.org/10.1093/sysbio/syv024
Semple C, Steel M (2003) Phylogenetics, vol 24. Oxford University Press on Demand, London
Song S, Ma L, Zou D et al (2020) The global landscape of SARS-CoV-2 genomes, variants, and haplotypes in 2019nCoVR. Genom Proteom Bioinf 18(6):749–759. https://doi.org/10.1016/j.gpb.2020.09.001
Speidel L, Forest M, Shi S et al (2019) A method for genome-wide genealogy estimation for thousands of samples. Nat Genet 51(9):1321–1329. https://doi.org/10.1038/s41588-019-0484-x
Thornlow B, Ye C, De Maio N et al (2021) Online phylogenetics using parsimony produces slightly better trees and is dramatically more efficient for large SARS-CoV-2 phylogenies than de novo and maximum-likelihood approaches. bioRxiv. https://doi.org/10.1101/2021.12.02.471004
Turakhia Y, Thornlow B, Hinrichs AS et al (2021) Ultrafast sample placement on existing trees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nat Genet 53(6):809–816. https://doi.org/10.1038/s41588-021-00862-7
Wertheim JO, Steel M, Sanderson MJ (2022) Accuracy in near-perfect virus phylogenies. Syst Biol 71(2):426–438. https://doi.org/10.1093/sysbio/syab069
Whidden C, Matsen FAIV (2015) Quantifying MCMC exploration of phylogenetic tree space. Syst Biol 64(3):472–491. https://doi.org/10.1093/sysbio/syv006
Ye C, Thornlow B, Hinrichs A et al (2022) matOptimize: a parallel tree optimization method enables online phylogenetics for SARS-CoV-2. Bioinformatics 38(15):3734–3740. https://doi.org/10.1093/bioinformatics/btac401
Yu D, Yang X, Tang B et al (2022) Coronavirus GenBrowser for monitoring the transmission and evolution of SARS-CoV-2. Brief Bioinf 23(2):bbab583. https://doi.org/10.1093/bib/bbab583
Zhang C, Matsen FAIV (2018) Generalizing tree probability estimation via Bayesian networks. In: Bengio S, Wallach H, Larochelle H et al (eds) Advances in neural information processing systems 31. Curran Associates Inc, Red Hook, pp 1449–1458
Zhang C, Matsen FAIV (2019) Variational Bayesian phylogenetic inference. In: International conference on learning representations (ICLR). https://openreview.net/pdf?id=SJVmjjR9FX
Zhang C, Huelsenbeck JP, Ronquist F (2020) Using Parsimony-Guided tree proposals to accelerate convergence in Bayesian phylogenetic inference. Syst Biol 69(5):1016–1032. https://doi.org/10.1093/sysbio/syaa002
Zhao WM, Song SH, Chen ML et al (2020) The 2019 novel coronavirus resource. Yi chuan= Hereditas 42(2):212–221. https://doi.org/10.16288/j.yczz.20-030
Acknowledgements
We thank JT McCrone and Gytis Dudas for discussions that informed this work, Mike Steel for pointing us to relevant literature, as well as Marc Suchard for suggestions on exposition. Thanks also to Ye Cheng, Russ Corbett-Detig, Yatish Turakhia, and the rest of the UShER team for helpful discussions and their help applying UShER to the SARS-CoV-2 example. We also thank Matthew Macaulay, Hassan Nasif, Anna Kooperberg, Michael Karcher, Tanvi Ganapathy, Shosuke Kiami, Seong-Hwan Jun, Cheng Zhang, and Mathieu Fourment for their work on the “subsplit DAG,” a closely related idea. The SARS-CoV-2 data which made the exploration of diversity of parsimonious reconstructions of SARS-CoV-2 clades possible is from the public databases GenBank (Hatcher et al. 2016), COG-UK (Nicholls et al. 2020), and the China National Center for Bioinformation (Song et al. 2020; Zhao et al. 2020; Gong et al. 2020; Yu et al. 2022). We thank the laboratories submitting sequence data to these public databases, as well as the researchers and laboratories contributing viral samples on which these sequences are based.
Funding
FAM was supported by R01 AI162611; FAM is an Investigator of the Howard Hughes Medical Institute. WSD was supported by National Institute of Allergy and Infectious Diseases Grant F31AI150163, and by a Fellowship in Understanding Dynamic and Multi-Scale Systems from the James S. McDonnell Foundation. Scientific Computing Infrastructure at Fred Hutch is funded by ORIP grant S10OD028685.
Author information
Contributions
Will Dumm and Frederick Matsen wrote the first draft of the manuscript, with edits and contributions to proofs from Mary Barker and edits from William DeWitt. Will Dumm and William Howard-Snyder prepared the SARS-CoV-2 clade reconstruction example. All authors commented on previous versions, and read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Ethics approval
No ethics approval process was required for this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Proofs omitted from the text
Lemma 3
Let (V, E) be a history sDAG or subhistory, and let \(v\in V \). The set of labels of leaf nodes reachable from v is \({\text {CU}}(v) \).
Proof
Let (V, E) be a history sDAG or subhistory on labels Y, and let \(v\in V \) be a non-UA node.
Let X be the set of labels of leaf nodes reachable from v. \(X\subset {\text {CU}}(v) \) by Observation 2. To show inclusion in the other direction, we will show by induction on \(|{\text {CU}}(v)| \) that for any \(\ell \in {\text {CU}}(v) \), the leaf node \((\ell , \emptyset ) \) is reachable from v in (V, E) .
As a base case, if \(|{\text {CU}}(v)| = 1 \), then v must be the leaf node with label \(\ell \), so the statement is immediately true.
Now suppose that for any node \(v'\in V \) with \(|{\text {CU}}(v')| < n \), and for any \(\ell '\in {\text {CU}}(v') \), the leaf node \((\ell ', \emptyset ) \) is reachable from \(v' \). Suppose that \(v = (\ell _v, U) \) is such that \(|{\text {CU}}(v)| = n \), and let \(\ell \in {\text {CU}}(v) \). Then \(\ell \in C \) for some child clade \(C\in U \). Since U contains at least two disjoint, nonempty subsets of Y, it must be true that \(C\subsetneq {\text {CU}}(v) \). Any node-clade pair in a history sDAG or subhistory must have at least one descendant edge, so there exists an edge \(\left( v, v_c \right) \in E \) such that \(C = {\text {CU}}(v_c) \), and \(|{\text {CU}}(v_c)| < n \). Since \(\ell \in C \), we know that \(\ell \in {\text {CU}}(v_c) \), and by the inductive hypothesis, \((\ell , \emptyset ) \) is reachable from \(v_c \), and therefore also from v. \(\square \)
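Lemma 3 can be checked on a toy example. The sketch below is illustrative only: it assumes nodes are encoded as `(label, subpartition)` pairs with leaves written as `(label, frozenset())`, that the clade union of a leaf is the singleton containing its label, and that the DAG is given as a hypothetical `children` dictionary.

```python
def clade_union(node):
    """CU(v): union of the child clades, or {label} for a leaf."""
    label, subpartition = node
    return frozenset().union(*subpartition) if subpartition else frozenset([label])

def reachable_leaf_labels(node, children):
    """Labels of all leaf nodes reachable from `node` by following edges."""
    if not node[1]:  # an empty subpartition marks a leaf
        return {node[0]}
    labels = set()
    for child in children[node]:
        labels |= reachable_leaf_labels(child, children)
    return labels

# Toy DAG with two alternative ancestors of the clade {a, b}.
a, b, c = ("a", frozenset()), ("b", frozenset()), ("c", frozenset())
ab = frozenset([frozenset(["a"]), frozenset(["b"])])
ab1, ab2 = ("x", ab), ("y", ab)
root = ("r", frozenset([frozenset(["a", "b"]), frozenset(["c"])]))
children = {root: [ab1, ab2, c], ab1: [a, b], ab2: [a, b], a: [], b: [], c: []}

for node in children:
    print(node[0], reachable_leaf_labels(node, children) == clade_union(node))
```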
Lemma 4
A history sDAG (V, E) is a history if and only if it is a tree, and contains exactly one edge descending from \(\rho \).
Proof
We will prove the equivalent statement that, for a history sDAG (V, E) with exactly one edge descending from \(\rho \), (V, E) is a tree if and only if (V, E) contains exactly one edge descending from each node-clade pair. We will prove the contrapositive of both directions.
Assume first that there exists a node-clade pair in (V, E) with at least two descendant edges. That is, there exist edges \((v, v_1), (v, v_2)\in E \) such that \(v_1\ne v_2 \), but \({\text {CU}}(v_1) = {\text {CU}}(v_2) \). Let \(\ell \in {\text {CU}}(v_1) \). By Lemma 3, \((\ell , \emptyset ) \) is reachable from both \(v_1 \) and \(v_2 \), so (V, E) is not a tree.
Now, suppose that (V, E) is not a tree, meaning that there exist two edges \((v_1, v), (v_2, v) \in E \) with the same child node. Since all nodes in a history sDAG must be reachable from \(\rho \), there exist paths in E connecting \(\rho \) to both \(v_1 \) and \(v_2 \). (V, E) has only one edge exiting \(\rho \), so these two paths must diverge at some non-UA node \(v_r = (\ell _r, U_r) \in V \). That is, there are edges \((v_r, v_1'), (v_r, v_2') \in E \) such that \(v_1 \) is reachable from \(v_1' \) and \(v_2 \) is reachable from \(v_2' \). Therefore, v is reachable from both \(v_1' \) and \(v_2' \), so \({\text {CU}}(v) \subset {\text {CU}}(v_1') \) and \({\text {CU}}(v) \subset {\text {CU}}(v_2') \), and \({\text {CU}}(v_1') \cap {\text {CU}}(v_2') \ne \emptyset \). However, since \(v_1' \) and \(v_2' \) are children of the node \(v_r \), both \({\text {CU}}(v_1') \) and \({\text {CU}}(v_2') \) must be elements of \(U_r \). Elements of \(U_r \) are disjoint, nonempty subsets of Y, so \({\text {CU}}(v_1') = {\text {CU}}(v_2') \). We have demonstrated that the two edges \((v_r, v_1') \) and \((v_r, v_2') \) descend from the same node-clade pair \((v_r, {\text {CU}}(v_1')) \) in (V, E) . \(\square \)
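The equivalence used in this proof can be illustrated on a small example. This is a sketch under assumed encodings: edges are `(parent, child)` pairs, the UA node is the hypothetical string `"UA"`, and both edge lists below have exactly one edge leaving the UA node, as the proof requires.

```python
from collections import Counter

UA = "UA"  # stand-in for the universal ancestor node

def clade_union(node):
    """CU(v): union of the child clades, or {label} for a leaf."""
    label, subpartition = node
    return frozenset().union(*subpartition) if subpartition else frozenset([label])

def one_edge_per_node_clade_pair(edges):
    """True if every non-UA node-clade pair has exactly one descending edge."""
    counts = Counter((p, clade_union(ch)) for p, ch in edges if p != UA)
    return all(n == 1 for n in counts.values())

def is_tree(edges):
    """True if no node is the child of more than one edge."""
    in_degree = Counter(ch for _, ch in edges)
    return all(n == 1 for n in in_degree.values())

a, b, c = ("a", frozenset()), ("b", frozenset()), ("c", frozenset())
ab = frozenset([frozenset(["a"]), frozenset(["b"])])
ab1, ab2 = ("x", ab), ("y", ab)
root = ("r", frozenset([frozenset(["a", "b"]), frozenset(["c"])]))

history_edges = [(UA, root), (root, ab1), (root, c), (ab1, a), (ab1, b)]
dag_edges = history_edges + [(root, ab2), (ab2, a), (ab2, b)]

for edges in (history_edges, dag_edges):
    print(is_tree(edges), one_edge_per_node_clade_pair(edges))
```

In the second edge list the node-clade pair consisting of the root and the clade {a, b} has two descending edges, and correspondingly the leaves a and b each have two parents.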
Lemma 5
A history sDAG (V, E) is acyclic.
Proof
We will show that no edge may take part in a cycle.
Recall first that the UA node only admits outgoing edges, so no edge exiting \(\rho \) can be part of a cycle.
Consider an edge \(e = \left( v_p = (l_p, U_p), v = (l, U) \right) \) whose parent is not \(\rho \). Since \(v_p \) is not the UA node, \(|U_p| \ge 2 \), and either \(U = \emptyset \) or \(\bigcup \limits _{C\in U} C \in U_p \). In the first case, v is a leaf node, which can only accept incoming edges, so the edge e cannot be part of any cycle. In the second case, since \(|U_p| \ge 2 \) and elements of \(U_p \) are nonempty, disjoint subsets of X,
\[ \left| \bigcup \limits _{C\in U} C \right| < \left| \bigcup \limits _{C'\in U_p} C' \right| . \]
The same inequality is true of any edge reachable from v which does not terminate at a leaf node, so no edge reachable from v can have \(v_p \) as a target. \(\square \)
Lemma 7
Let (V, E) be a history sDAG. For any \(v\in V \), there exists a subhistory s in (V, E) whose root node is v.
Proof
We will prove this by induction on \(|{\text {CU}}(v)| \).
As the base case, suppose \(|{\text {CU}}(v)| = 1 \). Then v must be a leaf node, and the subhistory we seek is the one consisting of only the node v.
Now suppose that for any node \(v'\in V \) with \(|{\text {CU}}(v')| < n \), there is a subhistory rooted at \(v' \) in (V, E) .
Let \(v = (\ell , U) \in V \), and suppose that \(|{\text {CU}}(v)| = n \). Then \(U = \left\{ C_1, \ldots , C_m \right\} \) where \(C_1, \ldots , C_m \) are \(m \ge 2 \) disjoint subsets of Y. For each \(C_i \in U \), by the definition of the history sDAG, there exists at least one edge \((v, v_i) \in E \) with \({\text {CU}}(v_i) = C_i \). Also, since \(|C_i| < n \), each \(v_i \) is guaranteed to have a subhistory \(s_i = (V_{s_i}, E_{s_i}) \) in (V, E) , with \(v_i \) as its root, by the inductive hypothesis.
Notice that the node sets \(V_{s_i} \) for \(1\le i\le m \) are pairwise disjoint: Let \(v' \in V_{s_k} \) and \(v''\in V_{s_j} \), for \(k\ne j \), \(1\le k, j \le m \). A necessary condition for node equality is that \({\text {CU}}(v') = {\text {CU}}(v'') \). But notice that \({\text {CU}}(v') \subset {\text {CU}}(v_k) = C_k \), and \({\text {CU}}(v'') \subset {\text {CU}}(v_j) = C_j \). Since \(C_j\cap C_k = \emptyset \), \({\text {CU}}(v') \ne {\text {CU}}(v'') \), so \(v' \ne v'' \), and \(V_{s_j} \cap V_{s_k} = \emptyset \).
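Therefore, we can build a subhistory rooted at v consisting of v, the subhistories \(s_i \) for \(1\le i\le m \), and the edges connecting v to the root node \(v_i \) of each \(s_i \). That is,
\[ \left( \{v\} \cup \bigcup _{i=1}^m V_{s_i},\ \left\{ (v, v_i) : 1\le i\le m \right\} \cup \bigcup _{i=1}^m E_{s_i} \right) \]
is the subhistory we seek, rooted at v.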
This subhistory has exactly one edge descending from each node-clade pair because \(s_i \) are subhistories, and because for each child clade \(C_i \) of v, the edge \((v, v_i) \) descends from \((v, C_i) \). \(\square \)
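The inductive construction in this proof corresponds to a short recursion. The following sketch assumes the same illustrative encoding of nodes as `(label, subpartition)` pairs and a hypothetical `children` dictionary; it picks the first available child for each child clade, which is an arbitrary choice.

```python
def clade_union(node):
    """CU(v): union of the child clades, or {label} for a leaf."""
    label, subpartition = node
    return frozenset().union(*subpartition) if subpartition else frozenset([label])

def build_subhistory(node, children):
    """Return the edges of one subhistory rooted at `node`.

    For each child clade, one child below that clade is chosen and the
    construction recurses, mirroring the inductive step of the proof.
    """
    edges = []
    for clade in node[1]:
        child = next(ch for ch in children[node] if clade_union(ch) == clade)
        edges.append((node, child))
        edges.extend(build_subhistory(child, children))
    return edges

a, b, c = ("a", frozenset()), ("b", frozenset()), ("c", frozenset())
ab = frozenset([frozenset(["a"]), frozenset(["b"])])
ab1, ab2 = ("x", ab), ("y", ab)
root = ("r", frozenset([frozenset(["a", "b"]), frozenset(["c"])]))
children = {root: [ab1, ab2, c], ab1: [a, b], ab2: [a, b], a: [], b: [], c: []}

subhistory = build_subhistory(root, children)
print(len(subhistory))
```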
Lemma 8
Let (V, E) be a history sDAG, and let T be the collection of histories in (V, E) . Then (V, E) is the history sDAG constructed from T.
Proof
We will argue that for any edge \(e = (v, v_c)\in E \), there exists a history \((V', E') \) in (V, E) with \(e\in E' \).
Let \(\left( e_i=(v_i, v_{i+1}) \right) _{i=0}^{n-1} \) be a sequence of edges in E which is a path from \(\rho \) to \(v_c \), so that \(v_0 = \rho \) and \(e_{n-1} = (v_{n-1}, v_n) = (v, v_c) = e \). For \(1\le i\le n \), let \(s_i \) be a subhistory rooted at \(v_i \), which exists by Lemma 7.
Now recursively construct \(s_i' \) for \(1\le i< n \) by replacing the edge descending from the node-clade pair \(\left( v_i, {\text {CU}}(v_{i+1}) \right) \) in \(s_i \), and the subhistory consisting of all nodes and edges reachable from the child node of that edge, with the edge \(e_i \) and the subhistory \(s_{i+1} \), rooted at \(v_{i+1} \). Finally, let \(s_n' = s_n \).
\(s_i' \) remains rooted at \(v_i \), and now contains the edge \(e_i \) and all the edges \(e_j \), for \(i< j < n \), including the edge e.
\(s_1' \) is then a subhistory rooted at \(v_1 \) which contains e. Lemma 4 implies that by adding the edge \((v_0=\rho , v_1) \) to \(s_1' \), we’ve constructed a history in (V, E) containing e. Note that in the language of following sections, this history can also be expressed as \(t\triangleleft s_2 \triangleleft \cdots \triangleleft s_n \), where t is the history consisting of \(s_1 \) and the edge \((\rho , v_1) \).
To finish the proof, let \((V_T, E_T) \) be the history sDAG constructed from T. Since a history sDAG must be connected, to show that (V, E) and \((V_T, E_T) \) are equal is to show that \(E_T = E \). Any edge in \(E_T \) must be present in some history in T, and since T is the set of histories in (V, E) , any such edge must be in E. Therefore, \(E_T \subset E \). Also, we just showed that any edge \(e\in E \) must take part in some history in T, and therefore e must also be in \(E_T \). Therefore, \(E_T = E \). \(\square \)
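Lemma 8 can be illustrated by enumerating every history in a toy DAG and checking that the union of their edges recovers the full edge set. This is a sketch under assumptions: the UA node is omitted, so "histories" here are subhistories rooted at the single root node, and the encoding and the name `subhistories` are hypothetical.

```python
from itertools import product

def clade_union(node):
    """CU(v): union of the child clades, or {label} for a leaf."""
    label, subpartition = node
    return frozenset().union(*subpartition) if subpartition else frozenset([label])

def subhistories(node, children):
    """Yield the edge set of every subhistory rooted at `node`."""
    per_clade = []
    for clade in node[1]:
        options = []
        for child in children[node]:
            if clade_union(child) == clade:
                for below in subhistories(child, children):
                    options.append(below | {(node, child)})
        per_clade.append(options)
    # One independent choice per child clade; a leaf yields the empty edge set.
    for choice in product(*per_clade):
        yield frozenset().union(*choice)

a, b, c = ("a", frozenset()), ("b", frozenset()), ("c", frozenset())
ab = frozenset([frozenset(["a"]), frozenset(["b"])])
ab1, ab2 = ("x", ab), ("y", ab)
root = ("r", frozenset([frozenset(["a", "b"]), frozenset(["c"])]))
children = {root: [ab1, ab2, c], ab1: [a, b], ab2: [a, b], a: [], b: [], c: []}

histories = list(subhistories(root, children))
all_edges = {(p, ch) for p, kids in children.items() for ch in kids}
print(len(histories), frozenset().union(*histories) == all_edges)
```

Explicit enumeration is only feasible for very small examples, since the number of histories grows as a product over node-clade pairs.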
1.1 History weights
The lemmas in this appendix subsection are necessary for the proof of Theorem 1. The proof of that theorem is given at the end of this subsection.
Definition 20
Let s and \(s' \) be subhistories of histories t and \(t' \) in some ambient history sDAG. We say that s and \(s' \) are conforming subhistories if \(s' \) has the same set of leaf labels as s, and the parent node in \(t' \) of \(s' \) is the same as the parent node of s in t.
More formally, if \(t = (V, E) \) and \(t' = (V', E') \), and \(s = (V_s, E_s), s' = (V_s', E_s') \), with \(V_s \subset V \) and \(V_s' \subset V' \), then s and \(s' \) are conforming if:
1. \(v_p = v_p'\) for \(v_p, v_p' \) the parent nodes of s and \(s' \) in t and \(t' \), respectively, and
2. \(L(s) = L(s')\).
Notice that since no internal node in a history may have exactly one child, a history may not contain two distinct subhistories with the same leaf nodes. Therefore, given a history t and a subhistory \(s' \), a choice of subhistory s of t conforming with \(s' \) is guaranteed to be unique, if it exists.
Notice also that the definition of conforming subhistory does not allow a history to be conforming with any subhistory of itself. To evaluate conformity of a subhistory, there must be some ambient parent node (Fig. 9).
We can now define the exchange of substructures that takes place between histories in the history sDAG.
Definition 21
Let \(t = (V, E)\) be a history, and let \(s = (V_s, E_s) \) be a subhistory of t. Also, let \((V_d, E_d) \) be a history sDAG, and let \(s' = (V', E') \) be any subhistory of \((V_d, E_d) \) conforming with s.
A subhistory swap of t and \(s' \) is a history with the structure and labeling of t, except that the subhistory s of t is replaced with the subhistory \(s' \).
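More formally, the subhistory swap replacing s with \(s'\) is the history with nodes \((V\setminus V_s)\cup V' \), and edges
\[ \left( E \setminus \left( E_s \cup \left\{ (v_p, v) \right\} \right) \right) \cup E' \cup \left\{ (v_p, v') \right\} , \]
where \(v_p \) is the parent node of s in t, v is the root node of s, and \(v' \) is the root node of \(s'\).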
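A subhistory swap can be sketched directly on edge sets. The code below is an illustration under assumed encodings, with histories written as sets of `(parent, child)` edges and the UA node omitted; the function names are hypothetical, and the swap returns `None` when no conforming subhistory exists.

```python
def clade_union(node):
    """CU(v): union of the child clades, or {label} for a leaf."""
    label, subpartition = node
    return frozenset().union(*subpartition) if subpartition else frozenset([label])

def subhistory_edges(root, edges):
    """All edges reachable from `root` within the edge set `edges`."""
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        for parent, child in edges:
            if parent == node:
                found.add((parent, child))
                stack.append(child)
    return found

def swap(t_edges, v_p, v_new, s_new_edges):
    """Replace the subhistory of t below (v_p, CU(v_new)) with the one rooted at v_new."""
    matches = [ch for p, ch in t_edges
               if p == v_p and clade_union(ch) == clade_union(v_new)]
    if not matches:
        return None  # no conforming subhistory: the swap is undefined
    v_old = matches[0]
    s_old_edges = subhistory_edges(v_old, t_edges)
    return ((t_edges - s_old_edges) - {(v_p, v_old)}) | s_new_edges | {(v_p, v_new)}

a, b, c = ("a", frozenset()), ("b", frozenset()), ("c", frozenset())
ab = frozenset([frozenset(["a"]), frozenset(["b"])])
ab1, ab2 = ("x", ab), ("y", ab)
root = ("r", frozenset([frozenset(["a", "b"]), frozenset(["c"])]))

t1 = {(root, ab1), (root, c), (ab1, a), (ab1, b)}
t2 = {(root, ab2), (root, c), (ab2, a), (ab2, b)}
s2 = subhistory_edges(ab2, t2)

swapped = swap(t1, root, ab2, s2)
print(swapped == t2)
```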
Definition 22
Let the swap operator \(\triangleleft \) be a left-associative operator on history, subhistory pairs, defined so that \(t\triangleleft s' \) is the subhistory swap of t and \(s' \), if a subhistory of t conforming with \(s' \) exists. \(t\triangleleft s' \) is undefined if no such subhistory exists.
Notice again that the subhistory (right argument of \(\triangleleft \)) in a subhistory swap must exist in the context of some ambient history sDAG, so that it can be evaluated whether swapped subhistories are conforming.
The definition of conformity is slightly more restrictive than it needs to be to guarantee that subhistory swaps of conforming subhistories preserve parsimony. In particular, there is no need to require that the parent nodes of the swapped subhistories have the same subpartitions. However, this assumption is natural in the context of the history sDAG structure, and is necessary for the argument to extend to edge weight functions that depend on nodes’ subpartitions.
Lemma 15
The operator \(\triangleleft \) is well-defined on subhistories. Also, given subhistories \(t, s' \) both with labels in Y, and with the leaves of t labeled by \(X\subset Y \), then \(t\triangleleft s' \) is a history with labels in Y and leaves labeled by X. That is, \(\triangleleft \) preserves the leaf labels of its left argument.
Proof
To show that \(\triangleleft \) is well-defined, we need to show that given histories \(t, s' \), the subhistory swap \(t\triangleleft s' \) is a history, and is uniquely determined by the choice of t and \(s' \). \(t\triangleleft s' \) is a history directly from the definition, and by the observation that since neither t nor \(s' \) may have unifurcations, their subhistory swap may not either. \(t\triangleleft s' \) replaces a subhistory s of t with \(s' \), where s must have exactly the same leaf label set as \(s' \). If such a choice of s exists, it must be unique by the assumption that nodes in a history may not have exactly one child. This guarantees that no two nodes in a history are above the same set of leaves.
Now assume that t and \(s' \) are subhistories on labels Y, and t has leaves labeled by \(X\subset Y \). To see that \(t\triangleleft s' \) is a history with labels in Y and leaves labeled by X, notice first that \(s' \) must have leaf nodes labeled bijectively by a set \(C\subset X \), the same set of leaf labels as the subhistory in t that \(s' \) replaces. Therefore the labeling on \(t\triangleleft s' \), restricted to leaf nodes, is bijective as a union of two bijective functions with disjoint domains, and images partitioning X. The labeling on \(t\triangleleft s' \) maps into Y as a union of functions which both map into Y. \(\square \)
We now describe the sense in which subhistory swaps preserve history weight.
Lemma 16
Let \(t_1 = (V_1, E_1) \) and \(t_2 = (V_2, E_2) \) be histories on labels Y. For \(i\in \left\{ 1,2 \right\} \), let \(s_i \) be a subhistory of \(t_i \), so that \(s_1 \) and \(s_2 \) are conforming.
Let \(t_1' = t_1 \triangleleft s_2 \) be the history constructed by replacing \(s_1 \) with \(s_2 \) in \(t_1 \), and similarly define \(t_2' = t_2 \triangleleft s_1\) to be the history constructed by replacing \(s_2 \) with \(s_1 \) in \(t_2 \). Finally, suppose that f is an edge-weight function taking values in a weight set W, clade-ordered with respect to Y. Then \(g_f(t_1') < g_f(t_1) \) if and only if \(g_f(t_2') > g_f(t_2) \).
Proof
Let \(v_i \) be the parent node of \(s_i \) in \(t_i \), and let \(K_i = g_f(s_i^{v_i})\) for \(i\in \left\{ 1,2\right\} \). That is, \(K_i \) is the weight of the augmented subhistory \(s_i\) and its parent edge. Then for some weights \(w_1, w_2\in W \), \(g_f(t_i) = w_i + K_i \), and also \(g_f(t_1') = w_1 + K_2 \) and \(g_f(t_2') = w_2 + K_1 \). The following are equivalent, since \(K_i \) are weights of subhistories below the same clade, and W is clade-ordered:
\[ w_1 + K_2 < w_1 + K_1 \iff K_2 < K_1 \iff w_2 + K_2 < w_2 + K_1 , \]
so \(g_f(t_1') < g_f(t_1) \) if and only if \(g_f(t_2') > g_f(t_2) \). \(\square \)
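As a small numerical illustration of Lemma 16, suppose the weights are integer parsimony scores with \(w_1 = 5\), \(K_1 = 3\), \(w_2 = 4\), and \(K_2 = 2\). These values are hypothetical and chosen only to make the arithmetic concrete.

```latex
\begin{aligned}
g_f(t_1) &= w_1 + K_1 = 8, & g_f(t_1') &= w_1 + K_2 = 7 < g_f(t_1),\\
g_f(t_2) &= w_2 + K_2 = 6, & g_f(t_2') &= w_2 + K_1 = 7 > g_f(t_2).
\end{aligned}
```

In this example the swap that improves \(t_1\) worsens \(t_2\) by the same amount, \(K_1 - K_2 = 1\).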
To extend the conclusion of this lemma to all the histories in the history sDAG, we need a few more lemmas:
Lemma 17
Suppose \(t_1, \ldots , t_n \) are histories in the history sDAG (V, E) , and \(s_i \) is a subhistory of \(t_i \) for \(2\le i\le n \). Then \(t_1 \triangleleft s_2 \triangleleft \ldots \triangleleft s_n \) is a history in (V, E) .
Proof
We need only show this is true for \(n=2 \), since subhistory swaps are left-associative. Let \(t_1, t_2 \) be histories in the history sDAG, and let \(s_2 \) be a subhistory of \(t_2 \), conforming with some subhistory \(s_1 \) of \(t_1 \). Also let \(v_1, v_2 \) be the root nodes of \(s_1 \) and \(s_2 \) respectively. Conformity means that the parent node \(v_p \) of \(s_1 \) in \(t_1 \) is the same as the parent node of \(s_2 \) in \(t_2 \). Notice that all the edges in \(t_1 \) are in E, as well as all the edges of \(s_2 \), since we assumed that \(t_1, t_2 \) are histories in the DAG. Also notice that the edge \((v_p, v_2) \) is in E, because it is an edge in \(t_2 \). Therefore, all the edges in \(t_1\triangleleft s_2 \) are in E, and \(t_1\triangleleft s_2 \) is a history in the history sDAG. \(\square \)
The following lemma describes how any history in a history sDAG built from a collection of histories T can be built from a collection of swaps operating on subhistories from T.
Lemma 18
Let \(t\in D(T) \) be a history in the history sDAG (V, E) constructed from a collection of histories T. Then for some sequence of histories \((t_i)_{i=1}^n \) in T, and choices of subhistories \(s_i \) of \(t_i \) for all i, \(t = t_1 \triangleleft s_1 \triangleleft \ldots \triangleleft s_n \).
Proof
Let \(t = (V_t, E_t) \) be a history in (V, E) . Every edge in \(E_t \) must appear in some \(t'\in T \). Since t is a tree, there exists a preordering \((v_i)_{i=0}^n \) of vertices in \(V_t \) so that \(v_j \) is reachable from \(v_i \) only if \(i \le j \). That is, if \(i > j \), then \(v_j \) must not be reachable from \(v_i \). Let \((e_i)_{i=1}^n \) be an ordering of edges in \(E_t \) such that \(v_i\) is the target node of \(e_i \) for all \(1\le i\le n \). We will use the notation \(v_i' \) to denote the parent node of the edge \(e_i \). Notice then that for edges \(e_i = (v_i', v_i) \) and \(e_j = (v_j', v_j) \), if \(v_i = v_j' \), then \(i < j \).
Finally, also define a sequence of parent node-clade pairs \((p_i)_{i=1}^n \) so that \(p_i = (v_i', {\text {CU}}(v_i)) \). We will say that an edge \(e_j=(v_j', v_j) \) is reachable from a node-clade pair \(p_i \) if there exists a path of edges ending with \(e_j \) such that the first edge in the path descends from the node-clade pair \(p_i \). Notice that since the sequence \((e_i) \) is a preordering of the history t, and since only one edge may descend from each node-clade pair in a history, if \(e_j \) is reachable from the node-clade pair \(p_i \), then \(i \le j \).
Now choose a sequence \((t_i)_{i=1}^n \) of histories in T so that \(e_i \) is an edge in \(t_i \) for all i, and let \(s_i \) be the subhistory of \(t_i \) rooted at \(v_i \), the child node of the edge \(e_i \). The edge \(e_i \) is not reachable from the parent node-clade pair of any \(s_k\) with \(k > i \), because the indices are chosen to preorder nodes and edges. Notice that a subhistory swap can only change edges reachable from the shared parent node-clade pair of the subhistories being swapped. Assume temporarily that \(e_i \) is in \(t_1\triangleleft s_1\triangleleft \cdots \triangleleft s_i \). Then the edge \(e_i \) must be in the history \(t_1\triangleleft s_1\triangleleft \cdots \triangleleft s_k \) for \(k > i \), since \(e_i \) must not be reachable from \(p_k \).
Because of this, to show that all edges in \(E_t \) are in \(t_1\triangleleft s_1\triangleleft \cdots \triangleleft s_n \), we need only show that the edge \(e_i \) is in \(t_1\triangleleft s_1 \triangleleft \cdots \triangleleft s_i \) for all \(1\le i\le n \), which we now establish. Inducting on i, notice first that \(e_1 \) is in \(t_1\triangleleft s_1 = t_1\), by our choice of \(t_1 \). Supposing that \(e_j \) is in \(t_1\triangleleft s_1\triangleleft \cdots \triangleleft s_j \) for all \(j < i \), notice that \(v_i' = v_j\) for some \(j < i \), so that \(v_i' \) is in \(t_1\triangleleft s_1\triangleleft \cdots \triangleleft s_{i-1} \), and \(s_i \) is conforming with some subhistory of \(t_1\triangleleft s_1\triangleleft \cdots \triangleleft s_{i-1} \). The subhistory swap with \(s_i \) therefore replaces the unique child node v in \(t_1\triangleleft s_1\triangleleft \cdots \triangleleft s_{i-1} \) which descends from the node-clade pair \((v_i', {\text {CU}}(v_i)) \), and all of its descendants, with \(s_i \), which is rooted at \(v_i \) and attached below \(v_i' \) with the edge \((v_i', v_i)= e_i \) in \(t_1\triangleleft s_1\triangleleft \cdots \triangleleft s_i \).
Therefore, \(t_1\triangleleft s_1\triangleleft \cdots \triangleleft s_n \) contains at least all those edges in \(E_t \). Furthermore, \(t_1\triangleleft s_1\triangleleft \cdots \triangleleft s_n\) is a history with the same leaves as t, so it can’t contain any more edges than those in \(E_t \) and remain a tree. That is, \(t = t_1\triangleleft s_1\triangleleft \cdots \triangleleft s_n \). \(\square \)
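The subhistory swap used throughout this proof can be sketched concretely. The following is a minimal illustration, not the authors' implementation: it assumes a node is a pair `(label, subpartition)` with the subpartition stored as a frozenset of clades, that a history is a set of `(parent, child)` edges, and that a subhistory includes the edge attaching it to its parent. The names `clade_union`, `subhistory`, and `swap` are hypothetical.

```python
def clade_union(node):
    """CU(v): the set of leaf labels below a node (a leaf's clade is its own label)."""
    label, clades = node
    return frozenset().union(*clades) if clades else frozenset({label})

def subhistory(history, top_edge):
    """All edges of `history` at or below `top_edge`, including `top_edge` itself."""
    below = {}
    for parent, child in history:
        below.setdefault(parent, []).append((parent, child))
    found, stack = set(), [top_edge]
    while stack:
        edge = stack.pop()
        found.add(edge)
        stack.extend(below.get(edge[1], []))
    return found

def swap(t, s_top, s):
    """t <| s: replace the subhistory of t descending from the node-clade pair
    (parent, CU(child)) of s_top with the subhistory s."""
    parent, child = s_top
    old = [e for e in t if e[0] == parent and clade_union(e[1]) == clade_union(child)]
    if len(old) != 1:
        raise ValueError("s does not conform with any subhistory of t")
    return (t - subhistory(t, old[0])) | s

def leaf(x):
    return (x, frozenset())

def cherry(label, x, y):
    return (label, frozenset({frozenset({x}), frozenset({y})}))

# Two histories on leaves a, b, c, d that differ in both internal cherry labels.
root = ("x", frozenset({frozenset("ab"), frozenset("cd")}))
ua = ("UA", frozenset({frozenset("abcd")}))  # stand-in for the UA node

def make(ab_label, cd_label):
    ab, cd = cherry(ab_label, "a", "b"), cherry(cd_label, "c", "d")
    return {(ua, root), (root, ab), (root, cd),
            (ab, leaf("a")), (ab, leaf("b")), (cd, leaf("c")), (cd, leaf("d"))}

t1, t2 = make("x", "x"), make("y", "y")
ab_y, cd_y = cherry("y", "a", "b"), cherry("y", "c", "d")

# One swap yields a history in neither input; a second swap reaches t2.
mixed = swap(t1, (root, ab_y), subhistory(t2, (root, ab_y)))
rebuilt = swap(mixed, (root, cd_y), subhistory(t2, (root, cd_y)))
```

Here `mixed` combines the `ab` cherry of `t2` with the `cd` cherry of `t1`, which is the kind of history that the DAG contains beyond the ensemble used to build it.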
With the preceding lemmas, it is finally possible to prove the main result of this section:
Theorem 1
Let T be a collection of histories, so that \(g_f(t) = K \) for all \(t\in T \). Then there exists a history \(t\in D(T) \) with \(g_f(t) < K \) if and only if there exists a history \(t'\in D(T) \) with \(g_f(t') > K \).
Proof
By Lemma 18, any history \(t\in D(T) \) can be expressed as a finite sequence of subhistory swaps involving histories in T. We will induct on n, the number of subhistory swaps involving histories in T required to express t. First, suppose the history t can be expressed as \(t = t_1\triangleleft s_2 \), a subhistory swap involving histories \(t_1, t_2\in T \), and the subhistory \(s_2 \) of \(t_2 \), conforming with a subhistory \(s_1 \) of \(t_1 \). Then by Lemma 16, \(g_f(t) < K \) if and only if \(g_f(t_2\triangleleft s_1) > K \). Since \(t_2\triangleleft s_1 \in D(T) \) by Lemma 17, we’ve shown that \(g_f(t) < K \) implies there exists a history \(t'\in D(T) \) with \(g_f(t') > K \). By the same argument, if \(g_f(t) > K \), then there exists a history \(t'\in D(T) \) with \(g_f(t') < K \).
Now suppose that for \(i < n \), and for any \(t\in D(T) \) which can be expressed as \(t= t_1\triangleleft s_2\triangleleft \cdots \triangleleft s_i \), where \(s_2, \ldots , s_i \) are subhistories of histories in T,
-
if \(g_f(t) < K \) then there exists a history \(t' \in D(T)\) with \(g_f(t') > K \), and
-
if \(g_f(t) > K \) then there exists a history \(t' \in D(T)\) with \(g_f(t') < K \).
Let \(t\in D(T) \) be expressible as \(t = t_1\triangleleft s_2\triangleleft \cdots \triangleleft s_n \), where \(s_2, \ldots ,s_n \) are subhistories of histories \(t_2, \ldots , t_n \), and \(t_1, \ldots , t_n \in T \). Suppose \(g_f(t) < K \), and let \(t_* = t_1\triangleleft s_2\triangleleft \cdots \triangleleft s_{n-1} \). Notice that \(t_* \in D(T) \) by Lemma 17 since \(t_1, \ldots , t_n \in D(T) \). We seek to show there exists \(t'\in D(T) \) with \(g_f(t') > K \).
If \(g_f(t_*) > K \), then \(t_* \) is the history we seek.
If \(g_f(t_*) < K \), then the history we seek exists by the inductive hypothesis, since \(t_* \) is expressible as a sequence of subhistory swaps involving \(n-1 \) histories in T.
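If \(g_f(t_*) = K \), let s be the subhistory of \(t_* \) conforming with \(s_n \). Notice by Lemma 17, \(t_n \triangleleft s \in D(T) \). Since \(t = t_*\triangleleft s_n \) with \(g_f(t) < K \), and \(g_f(t_*) = g_f(t_n) = K \), Lemma 16 gives \(g_f(t_n \triangleleft s) > K\), so \(t_n \triangleleft s \) is the history we seek.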
A similar argument shows that if \(g_f(t) > K \), there exists \(t'\in D(T) \) with \(g_f(t') < K \). \(\square \)
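For intuition, the symmetry supplied by Lemma 16 is easy to verify in a special case. The sketch below assumes real-valued edge weights with \(g_f \) equal to the sum of edge weights in a history, which is an assumption made here for illustration rather than the generality in which the lemma is stated. The notation \(w(s) \), for the total weight of a subhistory s together with the edge attaching it to its parent, is hypothetical. If \(s_1 \) in \(t_1 \) and \(s_2 \) in \(t_2 \) conform, and \(g_f(t_1) = g_f(t_2) = K \), then:

```latex
\begin{align*}
g_f(t_1 \triangleleft s_2) &= g_f(t_1) - w(s_1) + w(s_2) = K + \bigl(w(s_2) - w(s_1)\bigr), \\
g_f(t_2 \triangleleft s_1) &= g_f(t_2) - w(s_2) + w(s_1) = K - \bigl(w(s_2) - w(s_1)\bigr).
\end{align*}
```

Under these assumptions the two swapped histories deviate from K by equal and opposite amounts, so one lies strictly below K exactly when the other lies strictly above it.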
1.1.1 Trimming the history sDAG
Lemma 11
Let (V, E) be a history sDAG, and \(f:E\rightarrow W \) be an edge-weight function, with W a weight set which is clade-ordered with respect to f and (V, E) . Let \((V', E') \) be the history sDAG constructed from minimum-weight histories in (V, E) , with respect to f, and let \(({\underline{V}}, {\underline{E}}) \) be the minimum weight trim of (V, E) with respect to f. Then \((V', E') = ({\underline{V}}, {\underline{E}}) \).
Proof
First, \(({\underline{V}}, {\underline{E}}) \) is a history sDAG: \({\underline{V}}\subset V \), \({\underline{E}} \subset E \), and all nodes in \({\underline{V}} \) are reachable from \(\rho \) by construction. Also, for each node \(v = (\ell , U) \in {\underline{V}} \), and each \(C\in U \), there is at least one edge descending from the node-clade pair (v, C) , since at least one edge must achieve the minimum augmented subhistory weight in each clade.
Since a history sDAG is uniquely determined by the histories it contains by Lemma 8, it’s enough to show that these two history sDAGs contain the same set of histories. First, let t be a history in \((V', E') \), and let \(e = (v, v_c) \) be any edge in t. Since t achieves the minimum weight of any history in (V, E) , \(M_f(v_c) + f(v, v_c) \) must be equal to \(M_f(v, {\text {CU}}(v_c)) \). If this were not true, there would necessarily exist some subhistory s in (V, E) for which \(g_f(t\triangleleft s) < g_f(t) \), and \(t\triangleleft s \) would be a history in (V, E) , contradicting the assumption that t is a minimum-weight history in (V, E) . Therefore, all edges in t are in \({\underline{E}}' \). All nodes in t are reachable from \(\rho \) via paths in \({\underline{E}}' \), in particular the paths which follow the history t, so all nodes in t are in \({\underline{V}} \), and all edges in t are in \({\underline{E}} \). That is, \(({\underline{V}}, {\underline{E}}) \) contains at least all minimum-weight histories in (V, E) .
To show that \(({\underline{V}}, {\underline{E}}) \) contains only minimum-weight histories, let K be the minimum weight of any history in (V, E) , and let t be a history in \(({\underline{V}}, {\underline{E}}) \) with \(g_f(t) > K \). Consider the edge \(e = (v, v_c) \) in t closest to \(\rho \) which is not in \((V', E') \). Since \(e\notin E' \), e must not be an edge in any minimum-weight history in (V, E) . However, since e is the closest edge to \(\rho \) in t which is not in \((V', E') \), it must be true that \(v\in V' \). That is, there must be no subhistory \(s\in {\text {Ch}}(v_c) \) in (V, E) such that \(g_f(s^v) = M_f(v, {\text {CU}}(v_c)) \), and the edge e must not be in \({\underline{E}} \). Therefore, t is not in \(({\underline{V}}, {\underline{E}}) \), and \(({\underline{V}}, {\underline{E}}) \) contains exactly the minimum-weight histories in (V, E) , so by Lemma 8, \(({\underline{V}}, {\underline{E}}) = (V', E') \). \(\square \)
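The trimming characterized by this lemma can be sketched as a dynamic program. This is a minimal illustration under stated assumptions, not the authors' implementation: nodes are `(label, subpartition)` pairs, the DAG is a set of `(parent, child)` edges, weights are integers that add along a history, and the names `min_weight_trim` and `best_below` are hypothetical. `best_below(v)` plays the role of \(M_f(v) \), and an edge is kept when it achieves the minimum for its node-clade pair.

```python
from functools import lru_cache

def clade_union(node):
    """CU(v): the set of leaf labels below a node (a leaf's clade is its own label)."""
    label, clades = node
    return frozenset().union(*clades) if clades else frozenset({label})

def min_weight_trim(edges, weight, root):
    """Return (trimmed edge set, minimum history weight) for a history DAG."""
    # Group edges by the node-clade pair they descend from.
    by_pair = {}
    for parent, child in edges:
        by_pair.setdefault((parent, clade_union(child)), []).append((parent, child))

    @lru_cache(maxsize=None)
    def best_below(v):
        # Minimum weight of a subhistory rooted at v: one best edge per child clade.
        return sum(min(weight[e] + best_below(e[1]) for e in by_pair[(v, clade)])
                   for clade in v[1])

    # Keep the edges that achieve the minimum for their node-clade pair.
    optimal = set()
    for candidates in by_pair.values():
        best = min(weight[e] + best_below(e[1]) for e in candidates)
        optimal |= {e for e in candidates if weight[e] + best_below(e[1]) == best}

    # Keep only what is still reachable from the root.
    children = {}
    for parent, child in optimal:
        children.setdefault(parent, []).append(child)
    trimmed, seen, stack = set(), {root}, [root]
    while stack:
        v = stack.pop()
        for c in children.get(v, []):
            trimmed.add((v, c))
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return trimmed, best_below(root)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Two leaves and two candidate root labels; edge weight is Hamming distance.
leaf_aa, leaf_ac = ("AA", frozenset()), ("AC", frozenset())
split = frozenset({frozenset({"AA"}), frozenset({"AC"})})
r1, r2 = ("AA", split), ("CC", split)
ua = ("UA", frozenset({frozenset({"AA", "AC"})}))  # stand-in for the UA node
dag = {(ua, r1), (ua, r2), (r1, leaf_aa), (r1, leaf_ac), (r2, leaf_aa), (r2, leaf_ac)}
f = {e: 0 if e[0] == ua else hamming(e[0][0], e[1][0]) for e in dag}

trimmed, best = min_weight_trim(dag, f, ua)
```

In this example the edges below `r2` are each optimal for their own node-clade pair, but `r2` itself is no longer reachable once the edge from the UA node is dropped, so the reachability pass removes them.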
1.1.2 Collapsing histories
Lemma 13
Let (V, E) be a history sDAG with label set Y.
Also let \((v_p = (\ell _p, U_p), v_c = (\ell _c, U_c))\in E \) be an internal edge. That is, \(U_c\ne \emptyset \) (so \(v_c \) isn’t a leaf node).
Define a binary function \(b: E\rightarrow \left\{ 0, 1 \right\} \) which is constant at 0, except that \(b(v_p, v_c)=1 \), and let T be any b-collapsible edge cover of (V, E) .
Let \(v_p' = (\ell _p, (U_p\cup U_c) \setminus \left\{ {\text {CU}}(v_c) \right\} ) \) be the “new parent node”, and define:
Let \(R = \emptyset \) if there exists an edge \((v_p, v)\in E^+ \) with \({\text {CU}}(v) = {\text {CU}}(v_c) \). Otherwise, let R be the set of parent edges of \(v_p \), of the form \((v', v_p) \in E^+ \).
Then, let \(E^- = E^+\setminus R \). Finally, define
and
Claim: \((V', E') \) is the history sDAG constructed from \(T' \), the set of histories which result by collapsing the edge \((v_p, v_c) \) in each history in T in which it appears.
Proof
Let \((V_!, E_!) \) be the DAG constructed from \(T' \). We must show that \(E' = E_! \), so that by construction, \(V' = V_! \).
First, to show that \(E'\subset E_! \), let \((v_1, v_2)\in E' \).
-
If \(\left\{ v_1, v_2 \right\} \cap \left\{ v_p, v_p', v_c \right\} = \emptyset \), then \((v_1, v_2) \in E_! \) because collapsing in histories only modifies edges incident to the edge being collapsed.
-
If \(v_2 = v_p \), then \(v_p \) was not removed from V, meaning that some edge \((v_p, v) \in E \) must exist, with \({\text {CU}}(v) = {\text {CU}}(v_c) \). T contains all the histories in (V, E) , so there is a history \((V_t, E_t) \) in T so that \((v_p, v) \in E_t \). Since each history has exactly one edge descending from each node-clade pair, \((v_p, v_c)\notin E_t \), and \((V_t, E_t) \in T' \). Therefore, \((v_p, v) \in E_! \).
-
If \(v_2 = v_p' \), then \((v_1, v_2) \notin E \), but \((v_1, v_p) \in E \), and \((v_p, v_c) \in E \). Since \((v_1, v_p) \) and \((v_p, v_c) \) are adjacent in (V, E) , there’s a subhistory which contains both edges, and since \((v_p, v_c)\) is b-collapsible and T is a b-collapsible edge cover, there is a history \((V_t, E_t)\in T \) which contains that subhistory and, consequently, both edges. The corresponding label-collapsed history in \(T' \) contains \((v_1, v_2) \). Therefore, \((v_1, v_2) \in E_! \).
-
If \(v_2 = v_c \), then \(v_1\ne v_p \), and \((v_1, v_2) \in E \). Some history in T must contain the edge \((v_1, v_2) \), and may not contain the edge \((v_p, v_c) = (v_p, v_2) \), in order to be a tree. Therefore, this history is unchanged in \(T' \), and \((v_1, v_2) \in E_! \).
-
If \(v_1 = v_p \), then \(v_2\ne v_c \) since \((v_p, v_c) \notin E' \). Therefore, \((v_1, v_2) \in E \), and by the same reasoning as above, \((v_1, v_2) \in E_! \).
-
If \(v_1 = v_c \), then again \((v_1, v_2) \in E \), so by the same reasoning, \((v_1, v_2) \in E_! \).
-
If \(v_1 = v_p' \), then either \((v_c, v_2) \in E \) or \((v_p, v_2) \in E\). There exists a history \((V_t, E_t) \in T \) with \((v_p, v_c) \in E_t \), and either \((v_c, v_2) \) or \((v_p, v_2) \) in \(E_t \), in which collapsing \((v_p, v_c) \) yields the edge \((v_p', v_2) \). The resulting history is in \(T' \), so \((v_1, v_2) \in E_! \).
We’ve addressed all the situations where one of \(v_1, v_2 \in \left\{ v_p, v_p', v_c \right\} \). Both nodes can’t be in that set, because no pair of nodes in \(\left\{ v_p, v_p', v_c \right\} \) can have an edge between them in \(E' \), by construction.
Now, to show that \(E_!\subset E' \), let \((v_1, v_2)\in E_! \). First, notice that \(E_!\subset E^+ \), because \(E^+ \) contains all the edges that are added to histories in T when collapsing \((v_p, v_c) \), and \(E_! \) does not contain \((v_p, v_c) \).
By definition, \((v_1, v_2)\in E_t \), for some \((V_t, E_t)\in T' \). If \(E^- = E^+ \), then since \((v_1, v_2) \in E_t \), the node \(v_1 \) is reachable from the UA node, and so \((v_1, v_2) \in E'\). If \(E^-\ne E^+ \), then \(v_p \) must have had no edges descending from its child clade \({\text {CU}}(v_c) \), in \(E^+ \). That means \(v_p\notin V_t \), since a history must have exactly one descendant edge for each node-clade pair. Therefore, no removed parent edges are in \(E_t \), and \(E_t\subset E^- \). This means that \(v_1 \) is reachable from the UA node in \(E^-\), and \((v_1, v_2) \in E' \).
Therefore, \(E' = E_! \) and \(V' = V_! \). \(\square \)
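The effect of collapsing a single edge inside one history can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: a history is a set of `(parent, child)` edges, nodes are `(label, subpartition)` pairs, and the function name `collapse` is hypothetical. The new parent node is built as in the lemma: the child's clades replace the clade \({\text {CU}}(v_c) \) in the parent's subpartition.

```python
def clade_union(node):
    """CU(v): the set of leaf labels below a node (a leaf's clade is its own label)."""
    label, clades = node
    return frozenset().union(*clades) if clades else frozenset({label})

def collapse(history, edge):
    """Collapse the internal edge (v_p, v_c) of a history given as a set of edges."""
    v_p, v_c = edge
    if not v_c[1]:
        raise ValueError("cannot collapse an edge whose child is a leaf")
    # New parent: v_p's label, with v_c's clades replacing the clade CU(v_c).
    new_parent = (v_p[0], (v_p[1] | v_c[1]) - {clade_union(v_c)})

    def rename(v):
        return new_parent if v in (v_p, v_c) else v

    # Drop the collapsed edge; re-attach every other edge that touched v_p or v_c.
    return {(rename(a), rename(b)) for a, b in history if (a, b) != edge}

def leaf(x):
    return (x, frozenset())

# Three leaves; the root and the (a, b) cherry carry the same label "x".
ua = ("UA", frozenset({frozenset("abc")}))  # stand-in for the UA node
root = ("x", frozenset({frozenset("ab"), frozenset("c")}))
inner = ("x", frozenset({frozenset("a"), frozenset("b")}))
t = {(ua, root), (root, inner), (root, leaf("c")), (inner, leaf("a")), (inner, leaf("b"))}

collapsed = collapse(t, (root, inner))
star = ("x", frozenset({frozenset("a"), frozenset("b"), frozenset("c")}))
```

Because \({\text {CU}}(v_p') = {\text {CU}}(v_p) \), ancestors of the parent keep their subpartitions, which is why only edges incident to the collapsed edge change.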
Lemma 14
Let \((V_0, E_0) \) be a history sDAG, and define a sequence \((V_i, E_i)_{i\in {\mathbb {N}}} \) of history sDAGs, so that \((V_k, E_k) \) is generated by collapsing an edge \(e_{k-1} = ((\ell _{k-1}, U_{k-1}), (\ell _{k-1}', U_{k-1}')) \) in \((V_{k-1}, E_{k-1}) \) with \(\ell _{k-1} = \ell _{k-1}' \) and \(U_{k-1}' \ne \emptyset \) if such an edge exists. If no such edge exists, then \((V_k, E_k) = (V_{k-1}, E_{k-1}) \).
Then there exists \(N\in {\mathbb {N}}\) such that \((V_N, E_N) \) is label-collapsed. Also, if \(T_0 \) is a label-collapsible edge cover of \((V_0, E_0) \), and \(T_0' \) is the set of histories resulting from label-collapsing each history in \(T_0 \), then each history in \(T_0' \) is in \((V_N, E_N) \).
Proof
Let \(T_0 \) be a label-collapsible edge cover of \((V_0, E_0) \), and define a sequence of sets of histories \((T_i)_{i\in {\mathbb {N}}} \), where \(T_k = T_{k-1} \) if \((V_k, E_k) = (V_{k-1}, E_{k-1}) \), and otherwise let \(T_k\) be obtained by collapsing all histories in \(T_{k-1}\) at the edge \(e_{k-1}\). Notice that if such an N exists, then \(T_0' \subseteq T_N \) by Lemma 13. Therefore we need only show that such an N exists.
Let T denote a label-collapsible edge cover of (V, E), and denote the multiset of collapsible edges in all \(t\in T\) as \(E_{col}\). For each collapsible edge \(e\in E\), the label-collapsed DAG \((V', E')\) is equivalent to the DAG obtained from the label-collapsed histories \(T'\) by Lemma 13.
Note that the number of trees in \(T'\) is equal to the number of trees in T. However, the total number of unique trees in \(T'\) can be smaller than in T since collapsing an edge in two different trees can produce the same resulting tree. Also, since collapsing an edge in a history does not introduce any new edges in that history, collapsing e strictly reduces the number of edges in \(E_{col}\). So the multiset of collapsible edges in \(T'\) is a strict subset of \(E_{col}\). We will demonstrate that \(T'\) is a label-collapsible edge cover of \((V', E')\). Since the set \(E_{col}\) is finite, these results imply that any such sequence of history sDAGs results in a label-collapsed DAG in a finite number of steps.
To show that \(T'\) is a label-collapsible edge cover of \((V', E')\), we show that for a collapsible edge \(e_c\in E'\), every subhistory in \((V', E')\) which contains \(e_c\) is contained in \(T'\).
Let \(e_c\) be given, and suppose \(s'\) is any subhistory in \((V', E')\) containing \(e_c\).
If every edge in \(s'\) is disjoint from the vertices \(\{v_p', v_p\}\), then there is an identical subhistory s in (V, E), and, by the label-collapsible edge covering property of T, there exists \(t\in T\) containing s. Since every edge in s is disjoint from the set of edges altered by collapsing at e, collapsing t at e yields a history in \(T'\) that contains \(s=s'\) as a subhistory.
If there is an edge in \(s'\) of the form \((v, v_p')\) then consider the corresponding un-collapsed subhistory s in (V, E) consisting of edges
where the edges adjacent to \(v_p'\) are replaced with the corresponding structures in E. By construction, s is a subhistory in (V, E) containing a collapsible edge e and such that collapsing at e yields \(s'\). Since T is a label-collapsible edge cover for (V, E), there exists \(t\in T\) containing s, and, since collapsing s at e yields \(s'\), the label-collapsed history \(t'\in T'\) contains \(s'\). The analogous argument holds if \(s'\) is a subhistory containing an edge of the form \((v_p', v)\).
If there is an edge in \(s'\) of the form \((v_p, v)\), then by the observation following Lemma 13, this implies that there is another edge descending from the node-clade pair \((v_p, {\text {CU}}(e))\) distinct from e, and that \(s'\) can be viewed as a subhistory of a subhistory containing that alternative edge. So the subhistory \(s'\) corresponds to a subhistory s in (V, E) which belongs to a history t that cannot contain e. Since s is contained in a history that does not contain e, collapsing at e does not affect s, and so collapsing s yields \(s'=s\). Thus s is a subhistory in (V, E) that contains the collapsible edge \(e_c\), and hence there exists a history \(t\in T\) containing s. Since t does not contain e, collapsing at e yields \(t'\in T'\) which, trivially, contains \(s'=s\).
And so \(T'\) is a label-collapsible edge cover for \((V', E')\). \(\square \)
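The termination argument can be seen on a single history: each collapse removes exactly one edge, so repeatedly collapsing edges whose endpoints share a label must stop. The sketch below is an illustration only, assuming histories are sets of `(parent, child)` edges between `(label, subpartition)` nodes. The names `collapse` and `label_collapse` are hypothetical, and the lemma itself concerns the whole history sDAG rather than one history.

```python
def clade_union(node):
    """CU(v): the set of leaf labels below a node (a leaf's clade is its own label)."""
    label, clades = node
    return frozenset().union(*clades) if clades else frozenset({label})

def collapse(history, edge):
    """Collapse one internal edge (v_p, v_c) in a history given as a set of edges."""
    v_p, v_c = edge
    new_parent = (v_p[0], (v_p[1] | v_c[1]) - {clade_union(v_c)})

    def rename(v):
        return new_parent if v in (v_p, v_c) else v

    return {(rename(a), rename(b)) for a, b in history if (a, b) != edge}

def label_collapse(history):
    """Collapse edges with equal parent and child labels until none remain.
    Returns the collapsed history and the number of collapses performed."""
    steps = 0
    while True:
        # Collapsible: same label on both ends, and the child is not a leaf.
        candidates = [(p, c) for p, c in history if p[0] == c[0] and c[1]]
        if not candidates:
            return history, steps
        history = collapse(history, candidates[0])
        steps += 1  # each collapse removes exactly one edge, so the loop ends

def leaf(x):
    return (x, frozenset())

# A caterpillar on leaves a, b, c, d whose internal nodes all carry label "x".
ua = ("UA", frozenset({frozenset("abcd")}))  # stand-in for the UA node
n2 = ("x", frozenset({frozenset("a"), frozenset("b")}))
n1 = ("x", frozenset({frozenset("ab"), frozenset("c")}))
root = ("x", frozenset({frozenset("abc"), frozenset("d")}))
t = {(ua, root), (root, n1), (root, leaf("d")),
     (n1, n2), (n1, leaf("c")), (n2, leaf("a")), (n2, leaf("b"))}

result, steps = label_collapse(t)
star = ("x", frozenset(frozenset(x) for x in "abcd"))
```

The order in which collapsible edges are chosen does not change the outcome in this example: both orders end at the same star-shaped history after two collapses.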
1.2 Histories are labeled trees
In this subsection, we show that history substructures in the history sDAG are in bijection with isomorphism classes of rooted, internally labeled, multifurcating trees. There will be a number of notational differences from the rest of the paper. Rather than a history, t will denote a labeled tree, and s a subtree of a labeled tree. \(\tau \) will denote a tree’s graph structure, in which nodes are abstract objects rather than the (label, subpartition) pairs that the history sDAG consists of. The function L will denote the set of leaf nodes below an internal node in a labeled tree. Also, \(\varphi \) will denote a labeling function of a labeled tree, rather than a disambiguation of a history.
Y will continue to mean a set of labels, as in the rest of the paper.
Definition 23
An (internally) labeled tree \(t = (\tau , \varphi ) \) is
-
a rooted, multifurcating tree \(\tau =(V, E) \), and
-
a labeling function on vertices, \(\varphi :V\rightarrow Y \), where Y is a label set.
We will let L(t) refer to the set of leaf nodes of the tree \(\tau \), and require that
-
no node in \(\tau \) has exactly one child, and
-
the labels on leaf vertices must be unique (that is, \(\left. \varphi \right| _{L(t)} \) must be injective), but labels on internal vertices need not be (that is, \(\varphi \) need not be injective or surjective).
However, we will primarily use a different definition in this text, which is equivalent up to isomorphism on internally labeled trees:
Definition 24
Let \(t = (V, E, \varphi ) \) and \(t' = (V', E', \varphi ') \) be two labeled trees. Then t and \(t' \) are isomorphic if there exists a bijection \(h:V\rightarrow V' \) which preserves labels and respects tree structure. That is,
-
\(\varphi (v) = \varphi '(h(v)) \) for all \(v\in V \)
-
\(E' = \left\{ \left( h(v), h(v') \right) \,\big \vert \,(v, v') \in E \right\} \)
Lemma 19
Let (V, E) be the complete history sDAG on labels Y. Given a history \((V_t, E_t) \) in (V, E) , let \(V' = V_t \setminus \left\{ \rho \right\} \) and \(E' = E_t \setminus \left\{ (\rho , v) \right\} \), with \(\rho \) the UA node and v the only child node of the UA node in \((V_t, E_t) \). Define the function \(\varphi : V'\rightarrow Y \) as \(\varphi ((\ell , U)) = \ell \in Y \).
The correspondence \((V_t, E_t) \mapsto t_{(V_t, E_t)}=(V', E', \varphi ) \) from histories in (V, E) to labeled trees on labels Y is well-defined.
Proof
Given the history \((V_t, E_t) \), the label function restricted to leaf nodes, \(\left. \varphi \right| _{L(t_{(V_t, E_t)})}: L(t_{(V_t, E_t)}) \rightarrow Y\) is an injection, since DAG leaf nodes are uniquely labeled by elements of Y. Also, any node w in the labeled tree \(t_{(V_t, E_t)} \) is determined by a unique node \(v \in V_t \). v must have either no child clades, or at least two child clades, and v must have a child node for each child clade. Each child node of v in \((V_t, E_t) \) corresponds to a child node of w in \(t_{(V_t, E_t)} \), so w may not have exactly one child node.
That is, the map named in the lemma is well-defined. \(\square \)
Lemma 20
Let (V, E) be the complete history sDAG on labels Y. Let \(t = (\tau , \varphi ) \) be a labeled tree with labels in Y, and with root node \(w_0\).
For each node w of \(\tau \), let \(C_w\subset Y\) be the set of leaf labels below the node w, and let
Define
and
The correspondence \(t\mapsto (V_t, E_t) \) from labeled trees on labels Y to histories in (V, E) is well-defined.
Proof
The assignment \(w\mapsto v_w = (\ell , U) \) of nodes in the labeled tree to nodes in the DAG is well-defined:
-
U consists of disjoint subsets of Y because \(\varphi \) is injective on leaves of t, and sets of leaves between child nodes of w must be disjoint.
-
U is either empty, or contains more than one subset of Y, since w can be a leaf node with no children, or an interior node of t with two or more children.
-
\(U=\emptyset \) if and only if w is a leaf node, because w has no children if and only if w is a leaf node.
This assignment \(w\mapsto v_w \) is also injective: In particular, no two nodes in a labeled tree may have the same subpartition. To see this, let \(w_1, w_2\) be two nodes in a labeled tree t, with subpartitions \(U_1 \) and \(U_2 \). We will show that \(U_1\ne U_2 \).
If one of the nodes \(w_1, w_2 \) is not reachable from the other, then \(U_1\cap U_2 = \emptyset \), and \(U_1\ne U_2 \).
Otherwise, suppose that \(w_2 \) is reachable from \(w_1 \). \(w_1 \) must have more than one child node, so \(w_1 \) has at least one child node \(w_1' \) from which \(w_2 \) is not reachable. Therefore, the clade below \(w_1' \) must be disjoint from all the clades in \(U_2 \), since t is a tree. However, the clade below \(w_1' \) is an element of \(U_1 \), so \(U_1\ne U_2 \).
The assignment \((w_1, w_2)\mapsto (v_{w_1}, v_{w_2}) \) from edges in the labeled tree to edges in the history sDAG is well-defined and injective:
Either
-
The union of the child clades of \(w_2 \) is a child clade of \(w_1 \) (in particular, the child clade under the child \(w_2 \)), or
-
\(w_2 \) is a leaf node and therefore one of the child clades of \(w_1 \) is \(\left\{ \ell _{w_2} \right\} \).
The assignment on edges is also injective, since the assignment on nodes is injective.
\((V_t, E_t)\) is a history sDAG:
Since the assignments of nodes and edges in the labeled tree to nodes and edges in the complete DAG are well-defined, \(V_t\subset V \) and \(E_t\subset E \). To finish showing that \((V_t, E_t) \) is a history sDAG, notice that each node-clade pair (v, C) has a descendant edge, namely the one to \(v_w \), where w is the parent node of the clade C in t. Also, each node is reachable from the UA node, since each node in t is reachable from \(w_0 \).
Notice \((V_t, E_t) \) has the same tree structure and labels as t, by construction. Therefore, such a choice of history \((V_t, E_t) \) is uniquely determined by a labeled tree t. \(\square \)
Lemma 21
(Correspondence) The map from histories to labeled trees named in Lemma 19, and the map from labeled trees to histories named in Lemma 20, are inverses, up to label-preserving bijection on nodes. In particular, both maps name a bijective correspondence between histories in a history sDAG (V, E) on labels Y, and labeled trees with labels in Y.
Proof
Given a labeled tree t, the labeled tree recovered from the history \((V_t, E_t) \) is exactly the tree we started with (up to the isomorphism in Definition 24).
In the other direction, the labeled tree \(t_{(V_t, E_t)} \) derived from a history \((V_t, E_t) \), induces exactly the history \((V_t, E_t) \). This demonstrates the desired bijection. \(\square \)
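The two maps of Lemmas 19 and 20 can be sketched in code to make the correspondence concrete. This is an illustration under assumptions rather than the authors' construction: a labeled tree is a nested pair `(label, [children])`, a DAG node is `(label, subpartition)`, and the UA node is given a placeholder label. The function names are hypothetical. Children are sorted by their leaf sets when rebuilding a tree, so equality of nested tuples stands in for isomorphism in the sense of Definition 24.

```python
def tree_to_history(tree, ua_label="UA"):
    """Map a labeled tree (label, [children]) to (UA node, set of history edges)."""
    edges = set()

    def visit(t):
        label, kids = t
        if not kids:
            return (label, frozenset()), frozenset({label})
        built = [visit(k) for k in kids]
        # The subpartition of a node is the set of clades below its children.
        node = (label, frozenset(clade for _, clade in built))
        edges.update((node, child) for child, _ in built)
        return node, frozenset().union(*(clade for _, clade in built))

    top, leaves = visit(tree)
    ua = (ua_label, frozenset({leaves}))
    edges.add((ua, top))
    return ua, edges

def leaf_key(t):
    """Sorted leaf labels below a tree node, used to order siblings canonically."""
    label, kids = t
    return sorted(x for k in kids for x in leaf_key(k)) if kids else [label]

def history_to_tree(ua, edges):
    """Map a history back to a labeled tree, dropping the UA node."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    def build(v):
        kids = sorted((build(c) for c in children.get(v, [])), key=leaf_key)
        return (v[0], kids)

    (top,) = children[ua]  # the UA node has exactly one child
    return build(top)

# A small labeled tree, written with siblings already in canonical order.
tree = ("x", [("y", [("a", []), ("b", [])]), ("c", [])])
ua, edges = tree_to_history(tree)
round_trip = history_to_tree(ua, edges)
back_again = tree_to_history(round_trip)
```

Going tree to history to tree, and history to tree to history, both return the starting object in this example, which mirrors the two directions of the proof.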
1.3 Supplementary tables
See Table 2.
Dumm, W., Barker, M., Howard-Snyder, W. et al. Representing and extending ensembles of parsimonious evolutionary histories with a directed acyclic graph. J. Math. Biol. 87, 75 (2023). https://doi.org/10.1007/s00285-023-02006-3