Representing and extending ensembles of parsimonious evolutionary histories with a directed acyclic graph

In many situations, it would be useful to know not just the best phylogenetic tree for a given data set, but the collection of high-quality trees. This goal is typically addressed using Bayesian techniques; however, current Bayesian methods do not scale to large data sets. Furthermore, for large data sets with relatively low signal one cannot even store every good tree individually, especially when the trees are required to be bifurcating. In this paper, we develop a novel object called the “history subpartition directed acyclic graph” (or “history sDAG” for short) that compactly represents an ensemble of trees with labels (e.g. ancestral sequences) mapped onto the internal nodes. The history sDAG can be built efficiently and can also be efficiently trimmed to only represent maximally parsimonious trees. We show that the history sDAG allows us to find many additional equally parsimonious trees, extending combinatorially beyond the ensemble used to construct it. We argue that this object could be useful as the “skeleton” of a more complete uncertainty quantification.


Introduction
Here we develop a structure that can compactly represent and extend collections of phylogenetic trees with ancestral sequences mapped on the internal nodes. One motivation for this structure comes from uncertainty quantification in statistical phylogenetics, which is typically approached in one of two ways. Bayesian analysis attempts to characterize the posterior distribution of phylogenetic trees given data: the collection of trees that credibly explain the data, and their probabilities of being the generative tree. On the other hand, the phylogenetic bootstrap (Felsenstein, 1985) resamples columns of the multiple sequence alignment, infers an optimal tree for each one of the resampled data sets, then aggregates features of the resulting trees.
Neither of these is tenable for very large and densely sampled data sets, such as for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) collections. Traditional Bayesian analysis is often too slow to apply to these large data sets, and introduces many extra unknown model parameters in a signal-weak setting. Bootstrapping may remain fast enough when using recent approximations (Hoang et al, 2018), but has a different problem: it is common for well-established clades (supported by other data) to be supported on the sequence level by a single mutation, so the bootstrap support of the corresponding clade will exactly equal the frequency with which we draw that mutation in the bootstrap sample. Thus, the bootstrap underestimates support in this case (Wertheim et al, 2022).
Phylogenetic placement offers a different type of uncertainty estimate: an assessment of the level of certainty in inserting a new sequence into an existing phylogeny. However, these assessments of uncertainty are relative to a fixed reference tree. For SARS-CoV-2 this can be done in the UShER framework (Turakhia et al, 2021), in which this insertion procedure is used for iterative tree building. No attempt is made to characterize uncertainty of the complete tree in this framework.
The lack of uncertainty quantification may have consequences for interpretation of SARS-CoV-2 evolution. For example, the current practice for the PANGO nomenclature system (Rambaut et al, 2020) for SARS-CoV-2 does not require any sort of support estimation. A typical workflow involves placement and local tree construction. If there is indeed high probability of a single tree, then this is fine. If not, this seems potentially problematic.
We argue as follows that the diversity of maximally parsimonious trees on the data can be used to bound uncertainty from below. First, if there are more maximally parsimonious explanations of the data, this decreases the probability that any one explanation is correct. For this reason, we expect there to be an inverse relationship between the number of maximally parsimonious explanations of the data and the certainty of a given node or other feature in the tree. Furthermore, this inverse relationship should express a lower bound on the uncertainty because there are many other potential compelling trees that are not quite maximally parsimonious. In any case, analyzing even just the maximally parsimonious set of trees commonly involves so many trees that storing them individually and learning from them with existing techniques is computationally prohibitive. This is especially the case with parsimony analysis of large data sets, such as those for SARS-CoV-2 (Turakhia et al, 2021; Ye et al, 2022).
As a second motivation for our work, we also suggest that gathering a collection of maximally parsimonious trees could be helpful for Bayesian analysis. Although the parsimony criterion is of course not the same as likelihood, the two objectives are closely linked in the case where sequences are densely sampled relative to the amount of evolution. Previous work has shown how closely related sequences can greatly inflate the posterior distribution (Whidden and Matsen, 2015), and a parsimony analysis would have revealed this inflation. Thus, we hope to use the collection of maximally parsimonious trees as an aid for designing proposal distributions, extending previous successful strategies (Zhang et al, 2020), and for quantifying exploration of tree space.
In this paper, we formalize a data structure called the history subpartition directed acyclic graph (a.k.a. history sDAG) to characterize the ensemble of maximally parsimonious trees for large data sets. This is related to the idea of characterizing the trees in a single optimal "terrace" in phylogenetic tree space, with respect to parsimony (Sanderson et al, 2011, 2015). We describe algorithms to build history sDAGs from internally labeled trees, collapse edges with no mutations, and trim history sDAGs to express only trees which are optimal according to general criteria, such as parsimony. Although history sDAG construction is not the same as uncertainty estimation, which would allow for some less-than-maximally-parsimonious trees, it is a first step in that direction. We provide a Python implementation with a flexible interface for the history sDAG as a container type for trees, endowed with abstract methods for convenient dynamic programs on the history sDAG structure, as well as all methods from this paper for manipulating history sDAGs constructed from maximally parsimonious trees. This implementation shows the effectiveness of the approach, efficiently recovering many orders of magnitude more equally parsimonious trees than were used to "seed" the history sDAG when applied to a SARS-CoV-2 data set.

Intuitive Overview
Here we provide an intuitive overview of the definitions and concepts used in this paper. Formal definitions will be given in the sections following the overview. This paper develops methods for understanding evolutionary relationships between samples from a population of closely related evolving entities, acknowledging uncertainty. We will focus on samples consisting of nucleotide sequences, but keep our language general to emphasize that other data such as sample time and geographic location could also be used.
One way to formalize evolutionary relationships among samples, and inferred ancestral states, is to arrange them in a rooted phylogenetic tree with leaf and internal node labels. Node labels in this tree can include data of the type associated with the given samples. Specifically, leaf nodes are labeled by samples, and interior nodes are labeled by inferred ancestral states. The set of samples which label leaves will be called the leaf labels. Interior node labels may be chosen from some larger label set which includes the leaf labels as a subset. Instead of directly using this notion of a rooted, internally labeled tree, we will define a more convenient object called a history, which holds the same data as such a tree. We will make the definition formal below, but a history may be thought of as a rooted, internally labeled tree (this object has been called other names in the past, including an ancestral scenario (Ishikawa et al, 2019)). For example, a history might be used to represent a phylogenetic tree in which all (internal and tip) nodes are labeled with DNA sequences.
In a history, a node's clade is the set of labels of its descendant leaf nodes (we emphasize that internal node labels are excluded from the clade definition). A clade of a node's child is a child clade. The child clades of a node form a partition of the node's clade. We therefore call this set of child clades a node's subpartition. Each edge in a history connects two nodes, each with a label and subpartition. As a formality convenient for this paper, each history will contain a universal ancestor (UA) node added as a parent of the root node.
Some histories explain the relationships between their leaf labels more plausibly than others. One common measure of optimality for a history labeled by nucleotide sequences is its parsimony score, which is the total number of nucleotide base changes along all edges in the history. A history is said to be maximally parsimonious if no other history on the same leaf labels has a lower parsimony score.
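As a small illustration of the parsimony score, the following sketch sums Hamming distances over the edges of an internally labeled tree. The nested (label, children) tuple representation and the two example trees are assumptions made here for illustration; they are not taken from the historydag package. The formal UA edge is not represented, since it carries no mutations.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def parsimony_score(tree) -> int:
    """Total number of base changes over all edges of a labeled tree.

    A tree is a (label, children) pair, where children is a list of trees.
    """
    label, children = tree
    return sum(
        hamming(label, child[0]) + parsimony_score(child) for child in children
    )


# Hypothetical internally labeled trees on the leaves AA, AC, AT, AG.
tree_a = ("AA", [("AA", [("AA", []), ("AC", [])]),
                 ("AT", [("AT", []), ("AG", [])])])
tree_b = ("AA", [("AC", [("AA", []), ("AC", [])]),
                 ("AG", [("AT", []), ("AG", [])])])

score_a = parsimony_score(tree_a)  # three edges carry one change each
score_b = parsimony_score(tree_b)  # four edges carry one change each
```

Under these assumptions the two trees share a topology and leaf labels but differ in their internal labels, and only the first attains the lower score.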
In general, there are many possible maximally parsimonious histories with leaves labeled by the same set of nucleotide sequences. We will use a structure called the history subpartition directed acyclic graph (history sDAG) to efficiently encode a large collection of histories (Figure 1). The "history" modifier emphasizes that this structure encodes a collection of possible rooted evolutionary histories, each of which contains not only a tree structure, but also ancestral state labels.
The history sDAG consists of a collection of nodes, each associated with a combination of label and subpartition, and one formal universal ancestor (UA) node, which is denoted ρ. As we will see later, edges exiting ρ keep track of the root nodes of the histories in the history sDAG.
A directed edge in a history sDAG represents an edge in a corresponding set of histories, from a parent node to a child node which have the same labels and subpartitions as the parent and child nodes of the edge in the DAG. Thus, the history sDAG structure records combinations of labels and subpartitions, and adjacencies between these combinations, in the corresponding collection of histories.
By using a carefully chosen definition of history, introduced in the next section, the history sDAG can easily be constructed as the graph union of a set of histories. These histories need not have identical leaf labels. Specifically, we think of each history as its own history sDAG, with each node annotated by its label and subpartition. The history sDAG constructed from the original set of histories is simply the union of nodes and edges in each history (Figure 1).
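The graph-union construction can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the historydag implementation: trees are nested (label, children) tuples, a node is a (label, subpartition) pair with the subpartition stored as a frozenset of frozensets, and the string "UA" stands in for the universal ancestor node. All function names are hypothetical.

```python
UA = "UA"  # stand-in for the universal ancestor node


def clade_union(node):
    """Leaf labels below a node: {label} for a leaf, else the union of its child clades."""
    label, subpartition = node
    if not subpartition:
        return frozenset([label])
    return frozenset().union(*subpartition)


def tree_to_history(tree):
    """Convert a (label, children) tree into the (nodes, edges) of a history."""
    nodes, edges = {UA}, set()

    def visit(subtree):
        label, children = subtree
        child_nodes = [visit(child) for child in children]
        # The subpartition holds one clade per child.
        subpartition = frozenset(clade_union(c) for c in child_nodes)
        node = (label, subpartition)
        nodes.add(node)
        edges.update((node, c) for c in child_nodes)
        return node

    root = visit(tree)
    edges.add((UA, root))
    return nodes, edges


def build_sdag(trees):
    """Graph union of the histories corresponding to the given trees."""
    nodes, edges = set(), set()
    for tree in trees:
        tree_nodes, tree_edges = tree_to_history(tree)
        nodes |= tree_nodes
        edges |= tree_edges
    return nodes, edges


# Two hypothetical trees with the same topology but different internal labels.
tree_a = ("AA", [("AA", [("AA", []), ("AC", [])]),
                 ("AT", [("AT", []), ("AG", [])])])
tree_b = ("AA", [("AC", [("AA", []), ("AC", [])]),
                 ("AG", [("AT", []), ("AG", [])])])

single_nodes, single_edges = tree_to_history(tree_a)
dag_nodes, dag_edges = build_sdag([tree_a, tree_b])
```

Because nodes are identified by label and subpartition, the two trees here share their UA node, root node, and leaf nodes, so the union is smaller than the two histories stored separately.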
The history sDAG then contains as subgraphs at least those histories used to construct it. Any subgraph of the DAG which is a tree, includes exactly one edge descending from the UA node, and exactly one edge descending from each child clade of each of its nodes, is a history (Figure 2). Each of the histories contained in a history sDAG represents a combination of substructures from the histories used to construct the history sDAG.
Fig. 1 A history sDAG constructed from three internally labeled trees on label set of sequences {AA, AC, AT, AG}. Each tree is converted to the equivalent history structure, and the union of these histories is the history sDAG. Each node in a history or the history sDAG consists of a label (in this case a sequence of two bases) shown in the top half of the node, and a subpartition, with each set in the subpartition separated by a vertical bar in the bottom half of the node. Leaf nodes have no children, so appear with only their label. Although in this example labels are length-two nucleotide sequences, the label set is arbitrary, and could include sequences, geographic location, or other information.
In addition to thinking of the history sDAG as a way of recording structures observed in a collection of histories, we can also think of it as a way of generating histories. In fact, the set of histories in the history sDAG is in general a superset of the set of histories used to construct the DAG (Figure 3). These new histories result from combining subhistories from histories used to construct the history sDAG. This is similar to tree fusion, in which clades from different trees are combined to improve the parsimony score of the final tree (Goloboff, 1999). The connection with tree fusion is explored further in the Discussion section.
Fig. 2 … a history structure highlighted in red (left) and a labeled tree corresponding to that history (right)
As described above, exploring phylogenetic uncertainty by examining many maximally parsimonious histories requires an efficient way to store and compute on those histories. The history sDAG provides a compact structure for storing collections of histories, but in general contains histories beyond those used to build the DAG. We therefore encounter a key question: are these additional histories also maximally parsimonious?
In the following two sections, we will show that maximum parsimony is in fact preserved by the history sDAG. Theorem 1 shows that any history expressed by a history sDAG constructed from maximally parsimonious histories must itself be maximally parsimonious. To achieve this we must first show that swapping certain substructures between histories preserves maximum parsimony. Then we will show that the collection of histories in the history sDAG is closed under these subhistory swaps, and that any history in the history sDAG can be obtained by such a subhistory swap involving histories used to construct the DAG. This means that the history sDAG is not only an effective way to store many maximum parsimony histories, but also may allow us to very quickly discover more such histories.
Fig. 3 The history sDAG can express more histories than were used to construct it. The history sDAG in (b) is constructed from the two internally labeled trees in (a), and represents the four internally labeled trees in panels (a) and (c). Notice that the trees in (c) are not among the trees used to construct the sDAG, but result from swapping the substructures highlighted in green and orange in (a).
Preservation of maximum parsimony in the history sDAG has two important consequences:
• A history sDAG constructed from maximally parsimonious histories will contain only maximally parsimonious histories. If a set T of histories with the same parsimony score is used to construct a history sDAG, and if that history sDAG expresses a history with any other parsimony score, then T must not have contained maximum parsimony histories.
• It is always possible to trim an arbitrary history sDAG to express all of, and only, its maximally parsimonious histories. In particular, a new history sDAG constructed from the maximally parsimonious histories represented by the original history sDAG will contain only those histories used to construct it.
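One way such a trim could be computed is by dynamic programming over node-clade pairs. The sketch below is an assumption-laden illustration, not the algorithm of the historydag package: it assumes all histories share one leaf label set, uses Hamming distance as the edge weight with weight 0 on edges leaving the UA node, and represents nodes as (label, subpartition) pairs. For every node-clade pair it keeps only the edges that attain the minimum weight below that pair.

```python
from functools import lru_cache

UA = "UA"  # stand-in for the universal ancestor node


def clade_union(node):
    label, subpartition = node
    return frozenset([label]) if not subpartition else frozenset().union(*subpartition)


def leaf(label):
    return (label, frozenset())


def internal(label, *children):
    return (label, frozenset(clade_union(c) for c in children))


def edge_weight(parent, child):
    """Hamming distance between labels; edges leaving the UA node weigh 0."""
    if parent == UA:
        return 0
    return sum(x != y for x, y in zip(parent[0], child[0]))


def trim(edges, weight=edge_weight):
    """Keep only edges that appear in some minimum weight history."""
    # Group child nodes by (parent, clade); all children of UA form one group.
    groups = {}
    for parent, child in edges:
        clade = None if parent == UA else clade_union(child)
        groups.setdefault(parent, {}).setdefault(clade, []).append(child)

    @lru_cache(maxsize=None)
    def best(node):
        # Minimum weight of a subhistory rooted at node (0 for leaves).
        return sum(
            min(weight(node, c) + best(c) for c in children)
            for children in groups.get(node, {}).values()
        )

    kept, seen, stack = set(), {UA}, [UA]
    while stack:
        node = stack.pop()
        for children in groups.get(node, {}).values():
            target = min(weight(node, c) + best(c) for c in children)
            for c in children:
                if weight(node, c) + best(c) == target:
                    kept.add((node, c))
                    if c not in seen:
                        seen.add(c)
                        stack.append(c)
    return kept


# Hypothetical history sDAG with two candidate parents for each cherry.
aa, ac, at, ag = (leaf(x) for x in ("AA", "AC", "AT", "AG"))
n_aa, n_ac = internal("AA", aa, ac), internal("AC", aa, ac)
n_at, n_ag = internal("AT", at, ag), internal("AG", at, ag)
root = internal("AA", n_aa, n_at)
edges = {(UA, root)}
edges |= {(root, n) for n in (n_aa, n_ac, n_at, n_ag)}
edges |= {(n, l) for n in (n_aa, n_ac) for l in (aa, ac)}
edges |= {(n, l) for n in (n_at, n_ag) for l in (at, ag)}

trimmed = trim(edges)
```

In this example the internal node labeled AC is removed, since the alternative labeled AA explains the same clade with fewer changes, while both candidates for the clade {AT, AG} tie and are kept.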
Throughout this paper we will refer to maximally parsimonious histories using the more general term minimum-weight histories, since maximum parsimony is characterized by minimizing the sum of a weight over all edges in a history. Indeed, this term is more general because we can use weight functions that are more complex than simply the sum of the number of mutations, or which consider label data other than nucleotide sequences.
We provide an implementation of the history sDAG and related algorithms in the open source Python package historydag, installable with pip and available at https://github.com/matsengrp/historydag. This package provides methods for constructing, trimming, collapsing, and extracting histories from the history sDAG as described in the following sections. historydag also implements methods which we will describe in future work, for efficiently calculating weights of histories represented in the history sDAG, and for expressing and sampling from a probability distribution on histories in the history sDAG.
For reference, we provide a summary of notation in Table 1.
Table 1 Summary of notation
labels
  ℓ       a label, such as a nucleotide sequence
  Y       a set of labels
  X       the set of leaf labels, a subset of the label set Y
  C       a clade, i.e. a subset of leaf labels
  U       a set of disjoint clades
histories
  t       a history
  s       a subhistory of a history
  v       a node in a history or history sDAG
  s_v     a subhistory and its parent node
  ρ       the UA node
  e       an edge in a history or history sDAG
  T       a set of histories
  f       an edge weight function on pairs of history or history sDAG nodes
  g_f     a weight function on histories, summing f over all edges
  L(s)    the set of leaf nodes of a subhistory s
  CU(v)   the clade union of a node v
history sDAGs
  V       a set of history sDAG or history nodes
  E       a set of history sDAG or history edges
  D(T)    the set of histories expressed by a history sDAG constructed from histories T
  Ch(v)   children of a node v in a history or history sDAG
  B(v)    subhistories below a node v in a history sDAG

Histories and the History sDAG
We will now provide a formal definition of histories and the history sDAG. Let Y refer to a set of labels, such as nucleotide sequences. We can think of observed labels as a set X ⊂ Y , labeling history leaves. We will not emphasize this set of leaf labels X, since a history sDAG may express collections of histories with varying leaf label sets. In the case of parsimony however, we will be interested in collections of histories which share a leaf label set consisting of observed nucleotide sequences.
We are interested in representing collections of rooted, multifurcating, nonunifurcating trees with nodes (including internal nodes) labeled by elements of Y . As mentioned in the Overview, we will make this easy by carefully defining histories.
Isomorphism classes of such internally labeled trees are in bijection with histories, as defined below. This correspondence is shown formally in Appendix A, but sufficient intuition may be found in Figure 1.
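Definition 1. Let Y be a set of labels, and let P(·) denote the power set. The set of subpartitions of Y is Part(Y) = {∅} ∪ {U ⊂ P(Y) : |U| ≥ 2, ∅ ∉ U, and C ∩ C′ = ∅ for all distinct C, C′ ∈ U}.
That is, Part(Y) contains ∅ and all sets of two or more nonempty, pairwise disjoint subsets (clades) of Y.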
Given a set of leaf labels X ⊂ Y , Part(X) would contain all of the possible subpartitions of leaf labels in an internally labeled tree with leaves labeled by X. Notice that Part(X) ⊂ Part(Y ) for any such X ⊂ Y . Since a history sDAG may contain histories with varying leaf label sets, elements of Part(Y ) are used to construct general history sDAG nodes.
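The membership condition for Part(Y) described above can be checked mechanically. The following sketch assumes clades are given as Python sets of labels; the function name is hypothetical and the check only mirrors the informal description (empty, or at least two nonempty, pairwise disjoint clades).

```python
def is_subpartition(candidate) -> bool:
    """Check whether a collection of clades is a subpartition."""
    clades = [frozenset(clade) for clade in candidate]
    if not clades:
        return True   # the empty subpartition, used by leaf nodes
    if len(clades) < 2:
        return False  # a single clade would describe a unifurcation
    if any(not clade for clade in clades):
        return False  # clades must be nonempty
    # Clades are pairwise disjoint exactly when no label is counted twice.
    return sum(len(c) for c in clades) == len(frozenset().union(*clades))


checks = {
    "empty": is_subpartition([]),
    "two leaves": is_subpartition([{"AA"}, {"AC"}]),
    "single clade": is_subpartition([{"AA", "AC"}]),
    "overlapping": is_subpartition([{"AA", "AC"}, {"AC", "AT"}]),
    "empty clade": is_subpartition([set(), {"AA"}]),
}
```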
We will see that with the exception of a universal ancestor node, all nodes in the history sDAG structure consist of a label ℓ ∈ Y and a subpartition U ∈ Part(Y ).
Definition 2. A node-clade pair is a node (ℓ, U ) and a choice of child clade C ∈ U .
Definition 3. A history sDAG with labels Y is a directed graph (V, E) consisting of
• A node set V ⊂ (Y × Part(Y)) ∪ {ρ}, containing the UA node ρ. For a node v = (ℓ, U) we say that v's label is ℓ, its subpartition is U, its child clades are elements of U, and its clade union CU(v) is the union of its child clades, or {ℓ} if U = ∅.
• A directed edge set E ⊂ V × V containing edges e = (v_1, v_2) from a parent node v_1 to a target or child node v_2
such that
1. All nodes are reachable from the UA node ρ, which itself accepts no incoming edges.
2. For any edge whose parent node is not ρ, the clade union of the target node must be in the subpartition of the parent node. Formally, for any edge e = ((ℓ_1, U_1), (ℓ_2, U_2)) ∈ E, if C is the clade union of (ℓ_2, U_2), then C ∈ U_1. We say then that the edge e descends from the node-clade pair ((ℓ_1, U_1), C).
3. For each node v = (ℓ, U ), and for each choice of child clade C ∈ U , at least one edge descends from the node-clade pair (v, C).
Notice that by requirements (1) and (3) in the definition of the history sDAG, all nodes in the history sDAG must have descendant edges, except for those of the form (ℓ, ∅). We will refer to these as leaf nodes.
Observation 1. Since only nodes of the form (ℓ, ∅) may have no children, all leaf nodes in a history sDAG must be of this form, and therefore no two leaf nodes may be labeled by the same element of Y .
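The three requirements on history sDAG edges can be verified directly for a small example. The sketch below is illustrative only and makes several assumptions: nodes are (label, subpartition) pairs, the string "UA" stands in for ρ, and the clade union of a leaf node (ℓ, ∅) is taken to be {ℓ}, so that leaf nodes can satisfy requirement (2).

```python
UA = "UA"  # stand-in for the universal ancestor node


def clade_union(node):
    label, subpartition = node
    return frozenset([label]) if not subpartition else frozenset().union(*subpartition)


def is_history_sdag(nodes, edges) -> bool:
    """Check requirements (1)-(3) on a candidate node set and edge set."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)

    # (1) The UA node has no incoming edges and reaches every node.
    if any(child == UA for _, child in edges):
        return False
    seen, stack = {UA}, [UA]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    if seen != set(nodes):
        return False

    # (2) Below any non-UA parent, the child's clade union is a child clade.
    for parent, child in edges:
        if parent != UA and clade_union(child) not in parent[1]:
            return False

    # (3) Every node-clade pair has at least one descending edge.
    for node in nodes:
        if node == UA:
            continue
        covered = {clade_union(child) for child in children.get(node, ())}
        if not set(node[1]) <= covered:
            return False
    return True


# A hypothetical two-leaf example.
aa, ac = ("AA", frozenset()), ("AC", frozenset())
root = ("AA", frozenset([frozenset(["AA"]), frozenset(["AC"])]))
nodes = {UA, root, aa, ac}
edges = {(UA, root), (root, aa), (root, ac)}

valid = is_history_sdag(nodes, edges)
unreachable = is_history_sdag(nodes, edges - {(root, ac)})
uncovered = is_history_sdag(nodes - {ac}, edges - {(root, ac)})
```

Removing the edge to the leaf AC first violates reachability; removing the leaf as well leaves the node-clade pair (root, {AC}) without a descending edge.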

Definition 4. A history is a history sDAG in which the UA node ρ has a unique child node, and each node-clade pair has exactly one descendant edge.
The set of labels of the leaf nodes in a history t will be denoted L(t).
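Since a history chooses exactly one edge below each node-clade pair, the number of histories in a history sDAG can be counted by multiplying over child clades and summing over the edges descending from each clade. The following is a minimal sketch of that count under assumptions made here for illustration: a shared leaf label set, nodes stored as (label, subpartition) pairs, and hypothetical helper names.

```python
from functools import lru_cache
from math import prod

UA = "UA"  # stand-in for the universal ancestor node


def clade_union(node):
    label, subpartition = node
    return frozenset([label]) if not subpartition else frozenset().union(*subpartition)


def leaf(label):
    return (label, frozenset())


def internal(label, *children):
    return (label, frozenset(clade_union(c) for c in children))


def count_histories(edges) -> int:
    """Number of histories expressed by the edge set of a history sDAG."""
    choices = {}
    for parent, child in edges:
        # All children of the UA node compete for the single root position.
        clade = None if parent == UA else clade_union(child)
        choices.setdefault(parent, {}).setdefault(clade, []).append(child)

    @lru_cache(maxsize=None)
    def count(node):
        # Product over child clades of the number of subhistories below each;
        # the empty product gives 1 for leaf nodes.
        return prod(
            sum(count(child) for child in children)
            for children in choices.get(node, {}).values()
        )

    return count(UA)


aa, ac, at, ag = (leaf(x) for x in ("AA", "AC", "AT", "AG"))
n_aa, n_ac = internal("AA", aa, ac), internal("AC", aa, ac)
n_at, n_ag = internal("AT", at, ag), internal("AG", at, ag)
root = internal("AA", n_aa, n_at)

one_history = {(UA, root), (root, n_aa), (root, n_at),
               (n_aa, aa), (n_aa, ac), (n_at, at), (n_at, ag)}
other_history = {(UA, root), (root, n_ac), (root, n_ag),
                 (n_ac, aa), (n_ac, ac), (n_ag, at), (n_ag, ag)}

n_single = count_histories(one_history)
n_union = count_histories(one_history | other_history)
```

The union of the two example histories expresses four histories, two of which were not used to build it.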
Notice that not every element of Y must appear as a node label in a history or history sDAG. That is, Y is an ambient label set, such as the set of all nucleotide sequences of a fixed length, from which history sDAG node labels can be chosen.
Also notice that there is no distinction between leaf node and internal node labels. In practice, the set of leaf node labels will be associated with a set of observed evolving entities. When a sampled entity is inferred to be an ancestor of other sampled entities, we can represent this in a history with an internal node carrying the label corresponding to the sampled ancestor.
Informally, a labeled tree can be converted to a history by annotating each node with its subpartition, and adding a UA node as a parent of the root node (Figure 1). The unique child of the UA node in a history will be called the root node, since it represents the root node of a corresponding internally labeled tree.
The natural substructure of a history is analogous to a subtree of a labeled tree, and will be very useful in later sections.
Definition 5. Given a history sDAG (V, E), a subhistory s = (V_s, E_s) is a subgraph of (V, E) such that […] there exists a root node v_r ∈ V_s such that all other nodes in V_s are reachable from v_r, and 3. each node-clade pair in s has exactly one descendant edge.
The set of labels of leaf nodes in a subhistory s is denoted L(s).
Later we will establish formally that a history is in fact a tree. Given that fact, naming a subhistory is equivalent to removing an edge from a history, and discarding the component which contains the UA node.
In addition to the UA node, the definition of history contains redundant information in the sense that the subpartition of a node, formally a piece of data associated with each node, can be recovered as the set of sets of labels of leaf nodes reachable from that node's children. Although this choice may seem an unnecessary complication, it is essential in distinguishing histories contained in a larger history sDAG. This redundancy is shown in the following lemma, which is proven in Appendix A:
Lemma 3. Let (V, E) be a history sDAG or subhistory, and let v ∈ V. The set of labels of leaf nodes reachable from v is CU(v).
Lemma 3 implies that a history's set of leaf labels is determined by the subpartition of its root node. This will be relevant later, when we describe what it means for a history to be found in a history sDAG.
We intend for a history to be tree-shaped, but this is not assumed by the definition given. Lemma 3 also allows us to prove this essential fact.
Lemma 4. A history sDAG (V, E) is a history if and only if it is a tree, and contains exactly one edge descending from ρ.
The proof for this lemma is given in Appendix A. Notice that since elements of Part(Y) may not contain exactly one clade, and since nodes in a history have exactly one child node per child clade, no node (other than the UA node) in a history may have exactly one child. This is required to ensure that the history sDAG may not contain cycles. Although this is not stated in Definition 3, it is an important property of the history sDAG as the name suggests, and is proven in Appendix A:
Lemma 5. A history sDAG (V, E) is acyclic.
Sometimes data sets include a fixed root node label, such as a common ancestor sequence. A search for minimum weight labeled histories explaining such a data set may yield labeled histories with a unifurcation at the root node. We accommodate this by considering the fixed root sequence a leaf node label, and placing the corresponding leaf node as an additional child of the root node.
Since histories are tree-shaped history sDAGs, we can store collections of histories by taking their graph union. However, we should first verify that a graph union of history sDAGs is itself a history sDAG.
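Lemma 6. Let (V, E) and (V′, E′) be history sDAGs with labels Y. Then their graph union (V ∪ V′, E ∪ E′) is also a history sDAG.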
Proof. All the nodes and edges required to satisfy Definition 3 are present in (V ∪ V′, E ∪ E′), since they are present in each of the original history sDAGs. All nodes are reachable from the UA node ρ, through exactly the same sequence of edges by which they were reachable in at least one of the original history sDAGs.
Definition 6. For a set T of histories with labels in Y, the history sDAG constructed from T is the graph union of the histories in T: (⋃_{(V_t, E_t) ∈ T} V_t, ⋃_{(V_t, E_t) ∈ T} E_t).
We should also formalize the way in which a history sDAG contains histories. To do so, we will need to define a trim, which is a history sDAG which appears as a substructure in a larger history sDAG.
The collection of histories in the history sDAG constructed from a collection of histories T will be denoted D(T ).
We can now see why we must specify in Definition 4 that ρ has exactly one child node in a history. Edges descending from the UA node in a history sDAG keep track of which DAG nodes are allowed to be root nodes. It may be possible to choose a tree-shaped trim of a history sDAG in which two nodes v_1 and v_2 are children of ρ, and CU(v_1) ∩ CU(v_2) = ∅. Such a structure should be considered a trim containing two histories, but is not itself a history.
Any history sDAG should be uniquely determined by the collection of histories it contains. This intuition motivates the following two lemmas, which are proven in Appendix A:
Lemma 7. Let (V, E) be a history sDAG. For any v ∈ V, there exists a subhistory s in (V, E) whose root node is v.
Lemma 8. Let (V, E) be a history sDAG, and let T be the collection of histories in (V, E). Then (V, E) is the history sDAG constructed from T .
Finally, we will need to define the largest possible history sDAG constructed using a given set of labels.
Definition 8. The complete history sDAG on labels Y is the history sDAG which contains all possible edges on all nodes allowed by the choice of Y .
Equivalently, the complete history sDAG could be constructed as the graph union of all possible histories with labels in Y .

History Weights
In this section we will define a general scheme for assigning weights to histories, and describe the relationship between these weights and the structure of the history sDAG. As shown in Figure 3, the history sDAG in general contains more histories than were used to construct it. These extra histories arise because the history sDAG allows swapping subhistories between the histories it contains, whenever the subhistories' parent nodes share the same child clades and node label. We refer to this occurrence as subhistory swapping. Appendix A describes these subhistory swaps precisely, shows that all new histories in a history sDAG can be described in terms of sequences of these subhistory swap operations, and provides the proof for Theorem 1, which involves an argument that these subhistory swaps preserve history weights.
This section will leave the details of subhistory swaps, and the proof of Theorem 1, to the Appendix, and only build the background necessary to state and understand Theorem 1.
We begin by defining another useful type of history substructure.
Definition 9. Let (V, E) be a history sDAG and let s = (V_s, E_s) be a subhistory of (V, E). Also, let v_r be the root node of the subhistory s, and let v ∈ V be a parent node of v_r in (V, E). The augmented subhistory s_v is the graph (V_s ∪ {v}, E_s ∪ {(v, v_r)}), consisting of the subhistory s plus the parent node v and the edge connecting v to v_r.
Definition 10. Let (V, E) be a history sDAG, and v = (ℓ, U ) ∈ V a node. We make the following definitions.
Although we are interested primarily in computing parsimony on histories labeled with nucleotide sequences, we will do so within a much more general framework of history weights.
Definition 11. Let (V, E) be the complete history sDAG on labels Y, and let f : E → W be an edge weight function to a weight set W endowed with addition and containing an additive identity 0 ∈ W. The weight of any subgraph (V′, E′) of (V, E) is g_f((V′, E′)) = Σ_{e ∈ E′} f(e), the sum of f over the edges of the subgraph. In particular, since any history t in (V, E) is a subgraph of (V, E), the weight of t is given by g_f(t).
In the case of parsimony, the label set Y will contain sequences, the function f is Hamming distance, and g f will compute the parsimony score of a history. A history's parsimony score is decomposable as a sum of an edge weight function over edges only when complete, unambiguous nucleotide sequences are accessible to that weight function as node label data. If nucleotide sequences of internal nodes are not contained in node label data, the contribution of an edge to a history's parsimony score may be dependent on the structure of the rest of the history, making the decomposition impossible. In particular, the edge weight function f is required to be a function on all possible history sDAG edges, which correctly reports the contribution of an edge to the weight of any history which contains it.
Although our focus here is parsimony, notice that this framework allows much more general notions of history weight, including situations where the function f is sensitive to edge direction or subpartitions, or takes values in a nonnumeric set, such as a set of sequences. These generalizations will be important for future applications. For example, we could compute a branching process likelihood like that used by the gctree project, whose value can be decomposed over tree edges, and which can be summarized by a pair of integers (DeWitt et al, 2018).
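To make the generality of this framework concrete, the following sketch computes a history weight for two different weight sets: integers (the parsimony score) and pairs of integers added componentwise (per-site mutation counts). The per-site weight is a hypothetical example chosen here and is not the gctree likelihood; the node representation, the 0 weight on the UA edge, and all names are likewise assumptions for illustration.

```python
from functools import reduce
from operator import add

UA = "UA"  # stand-in for the universal ancestor node


def history_weight(edges, f, zero, plus):
    """Combine the edge weights f(parent, child) with `plus`, starting from `zero`."""
    return reduce(plus, (f(parent, child) for parent, child in edges), zero)


def mutation_count(parent, child):
    """Integer weight: Hamming distance between labels, 0 on the UA edge."""
    if parent == UA:
        return 0
    return sum(x != y for x, y in zip(parent[0], child[0]))


def per_site_mutations(parent, child):
    """Pair-valued weight: one 0/1 entry per site of a length-two sequence."""
    if parent == UA:
        return (0, 0)
    return tuple(int(x != y) for x, y in zip(parent[0], child[0]))


def add_pairs(a, b):
    """Componentwise addition on pairs of integers."""
    return tuple(x + y for x, y in zip(a, b))


def node(label, *clades):
    """Build a (label, subpartition) node from clades given as label tuples."""
    return (label, frozenset(frozenset(clade) for clade in clades))


# A hypothetical history on leaves AA, AC, AT, AG.
aa, ac, at, ag = node("AA"), node("AC"), node("AT"), node("AG")
n_aa = node("AA", ("AA",), ("AC",))
n_at = node("AT", ("AT",), ("AG",))
root = node("AA", ("AA", "AC"), ("AT", "AG"))
history_edges = [(UA, root), (root, n_aa), (root, n_at),
                 (n_aa, aa), (n_aa, ac), (n_at, at), (n_at, ag)]

parsimony = history_weight(history_edges, mutation_count, 0, add)
per_site = history_weight(history_edges, per_site_mutations, (0, 0), add_pairs)
```

Python compares tuples lexicographically, which gives one possible total ordering on such pairs that is compatible with componentwise addition.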
To compare weights of histories, the weight set W must admit a total ordering. This ordering will be required to respect addition on W, in a slightly weaker sense than is generally meant:
Definition 12. A weight set W, endowed with addition, is clade-ordered with respect to some edge weight function f and history sDAG (V, E) on labels Y if
• The ordering on W respects addition and is a total ordering on all of the following subsets of W:
  • Sets of weights of subhistories below any node
  • Sets of weights of augmented subhistories below any node-clade pair
• The ordering on W is a total ordering on the set of weights of histories
We say that the ordering on W respects addition on a set W′ ⊂ W if for all a, b ∈ W′ and for all c ∈ W, a < b if and only if a + c < b + c.
The following observation makes this definition easier to use.
Observation 9. Let W be a weight set which is clade-ordered with respect to a history sDAG (V, E) and edge weight function f. […]
For example, it may often be most convenient to argue that a weight set is clade-ordered with respect to the complete history sDAG on the label set Y, and a weight function defined on all possible edges in that history sDAG. However, this is a strictly stronger condition on W, which is why the definition of clade-ordering is with respect to a particular history sDAG.
Through the rest of this section, the label set Y will be fixed, and it will be assumed that f is an edge weight function mapping into W , a weight set which is clade-ordered with respect to f . Finally, we can describe exactly the sense in which the history sDAG preserves history weights, a property depicted in Figure 4.  Theorem 1 states that if a history sDAG is built from a collection of histories which all have weight K, then either the resulting sDAG must contain only histories of weight K, or there must be histories with weights greater and less than K. In either case the resulting history sDAG may contain more histories than were used to build it. Corollary 1.1 observes that since no parsimony score less than the maximum parsimony score can be achieved by a history on a given leaf set, a history sDAG built from maximally parsimonious histories must contain only maximally parsimonious histories.
Theorem 1. Let T be a collection of histories, so that g_f(t) = K for all t ∈ T. Then there exists a history t ∈ D(T) with g_f(t) < K if and only if there exists a history t′ ∈ D(T) with g_f(t′) > K.
Theorem 1 is the motivation for and main result of this section, guaranteeing that a history sDAG constructed from minimum weight histories will only express minimum weight histories, and is proven in Appendix A.
However, since it may be impractical to verify that a collection of histories are minimum weight relative to all other possible histories on a chosen label set, Theorem 1 will often be more useful when applied in the form of the following corollary, that any history sDAG may be trimmed to express exactly its minimum weight histories, relative only to the other histories in that history sDAG.
Corollary 1.1. Let (V, E) be a history sDAG, and let f be an edge weight function as defined previously. Then there exists a history sDAG whose histories are exactly the minimum weight histories of (V, E) with respect to f.
Proof. Let T be the collection of histories expressed by (V, E), so that D(T) = T. Let K be the minimum weight achieved by $g_f$ on T, and let T′ ⊂ T be the set of minimum weight histories: $T' = \{t \in T : g_f(t) = K\}$. Every history in D(T′) is also a history in D(T) = T, so no history in D(T′) has weight less than K. By Theorem 1, no history in D(T′) has weight greater than K either, so D(T′) expresses exactly the histories in T′.
We shall take a small excursion now, in which we return to the setting of maximum parsimony which motivates these methods. It makes little sense to minimize parsimony on the set of all histories with labels in an ambient sequence set Y. Rather, one attempts to minimize parsimony subject to the constraint that history leaves are labeled by some fixed set of observed nucleotide sequences.
Definition 13. Let T be a set of histories with labels in Y. We say that histories in T have a fixed set of leaf labels X ⊂ Y if L(t) = X for all t ∈ T.
Given an edge-weight function f and a set X ⊂ Y, we say that a history t with L(t) = X is minimum weight relative to all histories on the fixed set of leaf labels X if $g_f(t) \le g_f(t')$ for every history t′ with L(t′) = X.
In the general language of this section, a history t with nucleotide sequence labels is maximally parsimonious if it is minimum weight relative to all histories on the fixed leaf label set L(t), with Hamming distance as the edge-weight function.
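To make the parsimony setting concrete, the following is a minimal sketch of computing the weight of a single history with Hamming distance as the edge-weight function. The list-of-label-pairs representation and the names `hamming`, `history_weight`, and `toy_edges` are assumptions made for illustration only. They are not taken from the paper or from any particular software package, and the edge leaving the UA node is simply omitted.

```python
def hamming(a: str, b: str) -> int:
    """Number of sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def history_weight(edges, weight=hamming):
    """Sum an edge-weight function over (parent_label, child_label) pairs."""
    return sum(weight(parent, child) for parent, child in edges)


# Hypothetical toy history on leaves AA, AC, GC.
# The root is labeled AA and has one internal child labeled AC.
toy_edges = [
    ("AA", "AA"),  # root -> leaf AA (sampled ancestor, zero mutations)
    ("AA", "AC"),  # root -> internal node AC (one mutation)
    ("AC", "AC"),  # internal node -> leaf AC (zero mutations)
    ("AC", "GC"),  # internal node -> leaf GC (one mutation)
]
toy_score = history_weight(toy_edges)
```

Here `toy_score` is the parsimony score of the toy history. Passing a different `weight` function gives the weight of the same history under another edge-weight function.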
The following observation guarantees that Theorem 1 and Corollary 1.1 are useful in this setting.
Observation 10. Let T be a set of histories with a fixed set of leaf labels X ⊂ Y . Then for any t ∈ D(T ), L(t) = X.
The truth of this observation can be argued precisely using the lemmas in Appendix A supporting the proof of Theorem 1, but is apparent from Definition 3 and Figure 1.
This means that given a set T of maximally parsimonious histories on a fixed set of leaf labels X, D(T ) must only contain histories with leaves labeled by X. By Theorem 1 then, D(T ) must only contain histories which are maximally parsimonious on leaf labels X.
If T contains histories on a fixed label set X which are not necessarily maximally parsimonious, Observation 10 ensures that trimming the history sDAG constructed from T as in Corollary 1.1 will result in a new history sDAG which expresses histories with the same fixed set of leaf labels X.

Trimming the history sDAG
Here we describe a straightforward method for trimming a history sDAG to represent only its minimum-weight histories. Corollary 1.1 guarantees that merging only the minimum-weight histories in a history sDAG will result in a new history sDAG containing only those histories, but provides no efficient method for producing this trimmed history sDAG. The method described here involves removing all edges which point to suboptimal subhistories, and can be realized in two traversals of the history sDAG.
Definition 14. Let (V, E) be a history sDAG on labels Y , and let f be an edge-weight function f : E → W for W a weight set which is clade-ordered with respect to f and (V, E).
The minimum weight of an augmented subhistory beneath a node v = (ℓ, U) ∈ V and a clade C ∈ U is given by
$$M_f(v, C) = \min \{\, g_f(s) : s \text{ is an augmented subhistory below } (v, C) \,\}.$$
Also let $M_f(v)$ report the minimum weight of any subhistory rooted at the node v = (ℓ, U), with $M_f(v) = 0$ for any leaf v, and otherwise
$$M_f(v) = \sum_{C \in U} M_f(v, C). \tag{1}$$
That is, the minimum weight of a subhistory beneath a node is given by the sum over clades of the minimum weight achieved by an augmented subhistory below each clade.
Notice that the clade-ordering on W also allows us to compute $M_f(v, C)$ more easily, as
$$M_f(v, C) = \min_{(v, v') \in E,\; CU(v') = C} \big( f(v, v') + M_f(v') \big).$$
With Equation 1, this defines an efficient dynamic program for calculating the minimum weight of all histories in a history sDAG with respect to f.
$M_f$ will be used to define the trimmed history sDAG, the minimum weight trim of (V, E) with respect to f. In this construction, E′ consists of edges from E which point to optimal subhistories, the trimmed node set contains nodes reachable from ρ via those edges, and the trimmed edge set removes edges from E′ which connect any nodes not in the trimmed node set.
The following lemma verifies that this structure is what its name suggests.
Lemma 11. Let (V, E) be a history sDAG, and f : E → W be an edge-weight function, with W a weight set which is clade-ordered with respect to f and (V, E). Let (V ′ , E ′ ) be the history sDAG constructed from minimum-weight histories in (V, E), with respect to f , and let (V , E) be the minimum weight trim of (V, E) with respect to f . Then The proof for this lemma is given in Appendix A.
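The two-traversal trimming procedure described in this subsection can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. A node is assumed to be a pair `(label, frozenset_of_clades)`, the DAG a dict mapping each node to a dict from child clade to a list of child nodes, and edges leaving the UA node (label `None`) are assumed to have zero weight. The recursion for `M` follows the verbal description of Equation 1: a sum over child clades of the best edge-plus-subhistory weight below each clade.

```python
def hamming(a, b):
    """Number of differing sites between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def edge_weight(parent, child):
    # Assumption: edges leaving the UA node (label None) carry zero weight.
    return 0 if parent[0] is None else hamming(parent[0], child[0])


def min_weights(dag):
    """First traversal: M[v] is the minimum weight of a subhistory rooted at v."""
    M = {}

    def visit(v):
        if v not in M:
            # Leaves have no child clades, so the empty sum gives 0.
            M[v] = sum(
                min(edge_weight(v, c) + visit(c) for c in children)
                for children in dag[v].values()
            )
        return M[v]

    for v in dag:
        visit(v)
    return M


def trim(dag, root):
    """Second traversal: keep only optimal edges reachable from the root."""
    M = min_weights(dag)
    trimmed, stack = {}, [root]
    while stack:
        v = stack.pop()
        if v in trimmed:
            continue
        trimmed[v] = {}
        for clade, children in dag[v].items():
            best = min(edge_weight(v, c) + M[c] for c in children)
            kept = [c for c in children if edge_weight(v, c) + M[c] == best]
            trimmed[v][clade] = kept
            stack.extend(kept)
    return trimmed


# Hypothetical toy DAG: leaves "A" and "C", three candidate root labels.
la, lc = ("A", frozenset()), ("C", frozenset())
ca, cc = frozenset({"A"}), frozenset({"C"})
roots = [(lab, frozenset({ca, cc})) for lab in "ACG"]
ua = (None, frozenset({ca | cc}))
dag = {ua: {ca | cc: roots}, la: {}, lc: {}}
for r in roots:
    dag[r] = {ca: [la], cc: [lc]}

trimmed = trim(dag, ua)
```

In this toy example the root labeled "G" requires two mutations while the other two roots require one, so the trimmed DAG no longer contains the "G" root. The recursive `visit` is adequate for small examples; a DAG with long root-to-leaf paths would call for an explicit postorder traversal instead.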

Collapsing histories
The space of possible minimum weight histories on a fixed leaf label set is in general very large. However, some diversity in this set is a result of unnecessary history edges between nodes with the same label. Unless these edges target a leaf node, they are unnecessary, and their existence cannot be supported by the observed data represented in leaf labels. Just as polytomies can be resolved as many possible bifurcating structures, collapsing history edges which connect nodes with identical labels reduces the number of possible histories on a fixed set of leaves, without restricting the number of informative evolutionary scenarios that can be expressed by those histories ( Figure 5).
Motivated by this observation, we will enforce in practice that adjacent nodes in a history not have the same label, unless one of them is a leaf node. This choice is possible because we allow multifurcations in histories, which leads to the definition of "collapsing" below. On the other hand, sampled ancestors in a history can be witnessed as an internal node with the observed label ℓ ∈ Y, adjacent to the leaf node labeled ℓ. Since the edge between these two nodes targets a leaf, such a structure is allowed in a history. A history containing internal edges whose parent and child nodes carry the same label may be modified to remove such edges. Doing so will add multifurcations to the history, as shown in Figure 6. The following definition allows us to mark edges as collapsible arbitrarily, not just when their parent and child node labels match. This generality is useful in precisely stating Lemma 13.
Fig. 5 By collapsing red edges between nodes with identical labels, all five internally labeled tree structures shown here are equivalent
Definition 16. Let (V, E) be a history or history sDAG.
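Given a binary-valued function b : E → {0, 1}, we say that an edge e ∈ E is b-collapsible if b(e) = 1 and the child node of e is not a leaf node, and that (V, E) is b-collapsed if it contains no b-collapsible edges.
For the purpose of this paper we are interested in collapsing edges whose parent and child nodes have the same label. In this situation b should return 1 on edges whose parent and child nodes have the same label, and we will use the terms label-collapsible and label-collapsed instead of b-collapsible and b-collapsed.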
A history which is not label-collapsed can be converted to a label-collapsed history by merging adjacent nodes with the same label, but this process requires also modifying subpartitions ( Figure 6). To formalize this, we first explain what it means to collapse an edge in a history.
Definition 17. Let t = (V_t, E_t) be a history with labels Y. Let (V, E) be the complete history sDAG on labels Y. Also let e = ((ℓ_p, U_p), (ℓ_c, U_c)) ∈ E_t be an edge in t, so that (ℓ_c, U_c) is not a leaf node. Let C = CU(ℓ_c, U_c) be the clade in U_p from which the edge e descends.
The history t_e = (V_e, E_e), formed by collapsing e in t, is defined as follows:
Notice that after collapsing an edge, the resulting structure remains a valid history, because for any clade C ∈ U_c, and for any node v_c which is a child of the node-clade pair ((ℓ_c, U_c), C), the node v_c becomes a child of the node-clade pair (q(ℓ_c, U_c), C). Also notice that q(ℓ_c, U_c) = q(ℓ_p, U_p) inherits the unique parent of (ℓ_p, U_p) in t.
The new history has one edge fewer than the original. We can convert a history t to a label-collapsed history by iteratively collapsing each edge in t whose parent and child nodes have the same label.
Lemma 12. A history t_0 = (V_0, E_0) determines a unique label-collapsed history t_c, which is the result of a finite sequence of edge collapses.
That is, there exists a finite sequence t_0, t_1, ..., t_n of histories such that
• t_{i+1} is formed by collapsing a label-collapsible edge in t_i, for each i < n
• t_n is label-collapsed
Furthermore, for any such sequence of histories, t_n = t_c.
Proof. Using the correspondence between histories and rooted, internally labeled, multifurcating trees established in Appendix A, we can use the fact that collapsing edges between internal nodes with the same label is a well-defined map on such trees. Since the order of edge collapse has no effect on the final tree, neither does the order of edge collapse on the final history in the sequence named above.
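Label-collapsing a single history can be illustrated on the corresponding internally labeled tree. The sketch below assumes a tree is a nested pair `(label, children)`, where `children` is a list of such pairs and a leaf has an empty list. This representation and the function name `label_collapse` are illustrative assumptions, not the paper's node-and-subpartition formalism. Children are collapsed first, and any internal child sharing its parent's label is then spliced into the parent, which creates a multifurcation.

```python
def label_collapse(tree):
    """Return the label-collapsed version of a (label, children) tree."""
    label, children = tree
    new_children = []
    for child in children:
        child_label, grandchildren = label_collapse(child)
        if grandchildren and child_label == label:
            # Internal child with the parent's label: collapse the edge by
            # attaching the child's children directly to the parent.
            new_children.extend(grandchildren)
        else:
            # Leaves are always kept, even when their label matches the
            # parent (this is how a sampled ancestor is represented).
            new_children.append((child_label, grandchildren))
    return (label, new_children)


# Hypothetical toy tree: root "A" with an internal child also labeled "A".
toy_tree = ("A", [("A", [("A", []), ("C", [])]), ("G", [])])
collapsed = label_collapse(toy_tree)
```

After collapsing, the root of the toy tree has three leaf children, and the leaf labeled "A" remains because edges targeting leaves are never collapsed. Because each child is collapsed before it is compared with its parent, the result contains no remaining edge between two internal nodes with equal labels.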
Label-collapsing histories individually is straightforward, but collapsing a large collection of histories could be done more efficiently by label-collapsing their history sDAG.
Label-collapsing histories from within a history sDAG is not as straightforward, because some edges descending from a node-clade pair may need to be collapsed, while others may not. This means that an algorithm to collapse the history sDAG must occasionally add new nodes to the DAG (Figure 7).
Fig. 7 Analogous to Figure 6, but within a history sDAG, (a) shows part of a history sDAG, with an edge e to be collapsed. Collapsing e requires adding the node v′_p (b). In this example, both v_p and v_c remain in the history sDAG, because even without e they each have a parent edge, as well as one child edge descending from each child clade. Edges in (a) are colored to match with the corresponding new edges in (b), and with the annotations in Equation 2
In order to describe the behavior of collapsing in the history sDAG, we require the following definition.
Definition 18. Given a history sDAG (V, E), we say that a collection of histories T is an edge cover of (V, E) if for every edge e ∈ E, there exists a history t ∈ T such that e is contained in t.
Further, a collection of histories T is a b-collapsible edge cover of (V, E) if for every b-collapsible edge e ∈ E and every subhistory s containing e, there is a history t ∈ T that contains s.
The following lemma describes what it means to collapse a single edge in a history sDAG.
Define a binary function b : E → {0, 1} which is constant at 0, except that b(v_p, v_c) = 1, and let T be any b-collapsible edge cover of (V, E). Let v′_p be the "new parent node", and define:
Then, let E− = E+ \ R. Finally, define
is the history sDAG constructed from T′, the set of histories which result by collapsing the edge (v_p, v_c) in each history in T in which it appears.
Notice that if e is the only edge descending from the node-clade pair (v p , CU(v c )), then collapsing e requires removing the node v p , and all edges involving it, from the history sDAG. Also, the definition of E − will not leave any parent nodes of v p with too few descendant edges, because we added edges from all parent nodes of v p to v ′ p .
The last step in the construction of E ′ ensures that any nodes left without parents in the collapsing process will not appear in the label-collapsed history sDAG.
The proof for Lemma 13 is given in Appendix A.
Finally we arrive at the main result of this section, which provides a guarantee that all histories in a history sDAG can be collapsed by a finite sequence of edge collapses. Although this lemma is stated for label-collapsing, the result can immediately be generalized to b-collapsing, with respect to an arbitrary binary function b.
Lemma 14. Let (V_0, E_0) be a history sDAG, and define a sequence (V_i, E_i)_{i∈ℕ} of history sDAGs, so that (V_k, E_k) is generated by collapsing a label-collapsible edge e_k in (V_{k−1}, E_{k−1}), whenever such an edge exists. Then there exists N ∈ ℕ such that (V_N, E_N) is label-collapsed. Also, if T_0 is a label-collapsible edge cover of (V_0, E_0), and T′_0 is the set of histories resulting from label-collapsing each history in T_0, then each history in T′_0 is in (V_N, E_N).
Although this lemma is written for label-collapsing, it extends to collapsing with respect to an arbitrary binary function b, defined on all possible edges in the complete history sDAG with the same leaf nodes and with labels chosen from the same ambient label set as (V_0, E_0).
Note that the collapsing algorithm presented below produces the collapsed history sDAG (V N , E N ).
The proof for this lemma is given in Appendix A. Lemma 14 suggests an algorithm for collapsing a history sDAG, whose implementation is given below.
Algorithm A. (Collapsing a history sDAG). Modifies a history sDAG so that no edges connect two non-leaf nodes with the same label, and the histories represented in the resulting history sDAG are the same as the set of histories represented by the original history sDAG, with each label-collapsed.
That is, edges at the beginning of the queue are closer to the UA node of the history sDAG
2. Collapse loop head. If Q is empty, END. Otherwise, remove the first
(a) Check collapsed. If ℓ_p = ℓ_c and v_c is not a leaf node, and (v_p, v_c) ∈ Q, go to new parent. Otherwise, return to collapse loop head.
(c) Add grandparents to new parent. For any (v, v_p) ∈ E, add (v, v′_p) to E and to beginning of Q.
Notice that each iteration of the collapse loop corresponds with an element in the sequence of history sDAGs named in Lemma 14. Since the order of edges in the sequence (e_k) in Lemma 14 has no effect on the resulting history sDAG, the order of edges in the queue should have no effect on the history sDAG produced by this algorithm.

History sDAG Completion
We now introduce "completion," which essentially means that we add every edge that respects clade union sets. More precisely, Definition 3 specifies that each edge of a history sDAG must target a node whose clade union is in the subpartition of its parent. Given a collection of history sDAG nodes V , we can create an edge set E ′ containing all edges allowed by this requirement. The resulting DAG (V, E ′ ) then contains all histories that can be constructed using nodes from V . If V is the node set for some valid history sDAG, then the resulting DAG (V, E ′ ) must also be a history sDAG.
By completing a history sDAG, additional histories are represented. Although there is no guarantee about the weight of these new trees, it is possible that additional minimum weight trees may be found by the completed history sDAG, which makes this operation useful.
This idea is expressed in the following definition.
Definition 19. Let T be a collection of histories with labels in Y. Let (V, E) be the history sDAG constructed from T. The completed history sDAG constructed from T is the history sDAG (V, E′), where
$$E' = \{\, (v, v') : v = (\ell, U) \in V,\; v' \in V,\; CU(v') \in U \,\}.$$
We will also refer to (V, E′) as the completion of (V, E).
The completed history sDAG constructed from T is a history sDAG because it includes at least those edges present in the history sDAG constructed from T , and all the additional edges are allowed by the definition of the history sDAG. We emphasize that history sDAG completion adds no new nodes, and that a completed history sDAG is in general a much smaller object than the complete history sDAG on a taxon set described in Definition 8.
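Completion can be sketched directly from the requirement that each edge target a node whose clade union lies in the subpartition of its parent. The sketch below assumes nodes are pairs `(label, frozenset_of_clades)`, with leaves carrying an empty set of clades and the UA node carrying the label `None`. These conventions, and the names `clade_union` and `complete`, are assumptions made for illustration. The UA node is excluded as a possible child, which is a further assumption of this sketch.

```python
def clade_union(node):
    """Union of a node's child clades, or the node's own label for a leaf."""
    label, clades = node
    return frozenset().union(*clades) if clades else frozenset({label})


def complete(nodes):
    """Return a dict node -> {clade: [children]} with every allowed edge."""
    by_clade_union = {}
    for v in nodes:
        if v[0] is None:
            continue  # assumption: the UA node is never a child
        by_clade_union.setdefault(clade_union(v), []).append(v)
    return {
        v: {clade: list(by_clade_union.get(clade, [])) for clade in v[1]}
        for v in nodes
    }


# Hypothetical toy node set taken from two histories on leaves a, b, c:
# one history uses root x with child p, the other root y with child q.
a, b, c = (frozenset({s}) for s in "abc")
leaves = [(s, frozenset()) for s in "abc"]
p = ("p", frozenset({a, b}))
q = ("q", frozenset({a, b}))
x = ("x", frozenset({a | b, c}))
y = ("y", frozenset({a | b, c}))
ua = (None, frozenset({a | b | c}))

completed = complete([ua, x, y, p, q, *leaves])
```

In the toy example, the history sDAG built from the two input histories would contain only the edges x to p and y to q below the clade {a, b}. After completion, both x and y have both p and q as children below that clade, so four histories are represented instead of two, and no new nodes have been added.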
Earlier sections show that T ⊂ D(T ) for a set of histories T because a history sDAG constructed from T allows subhistory swaps involving conforming subhistories. In contrast, the completed history sDAG constructed from T allows any subhistories on the same leaf labels to swap, regardless of their parent nodes.
Swaps between subhistories with the same leaf label sets will not preserve history weights in the same sense as conforming subhistory swaps. Therefore, the completed history sDAG constructed from a set of histories T is not guaranteed to preserve weights in any sense. However, the lemmas from the previous sections guarantee that any history sDAG can be trimmed to express only its minimum weight histories. This means that the completed history sDAG can be used as a way to find even more minimum-weight histories than the original history sDAG construction, given a set of minimum-weight histories T . For example, completing a history sDAG constructed from maximally parsimonious, or nearly maximally parsimonious histories, could in some cases find additional maximally parsimonious histories which wouldn't have been present before completion.
The completed history sDAG constructed from a set T of histories represents all possible histories which can be constructed using the nodes of histories in T. A choice of input histories can therefore be framed as a choice of plausible pairs of labels and subpartitions, which then determines a collection of plausible histories.

Exploring parsimony diversity of SARS-CoV-2 clades
The original motivation for the history sDAG was to store a collection of minimum-weight histories. The theorems in the preceding sections show that the history sDAG is an ideal object for this task, and can discover new minimum weight histories in addition to those which we seek to store. Because SARS-CoV-2 is densely sampled relative to the rate of mutation and undergoes minimal recombination, parsimony methods are well suited to studying its evolution. However, we will now demonstrate that there exists considerable uncertainty in a parsimonious reconstruction of SARS-CoV-2 evolution.
Searching for maximally parsimonious trees is computationally intensive, and scales poorly as the number of leaves increases. Traditionally, tools like PHYLIP's dnapars were used to produce an assortment of maximally parsimonious trees on a given set of sequences (Felsenstein, 2009). Recently, the UShER project made it possible to quickly reconstruct a single approximate parsimony tree on millions of sampled sequences. Neither method guarantees that the reconstructed trees are maximally parsimonious relative to all possible trees on the given leaf sequences.
Users of both methods often accept the first tree produced, ignoring the uncertainty inherent to the parsimony assumption. However, there are in general many possible maximally parsimonious trees on a given set of leaf sequences.
Indeed, dnapars by default outputs a non-exhaustive collection of maximally parsimonious trees. However, for very large sets of sequences, a collection of nearly maximally parsimonious trees may be produced much more quickly using UShER. As a demonstration, we use UShER to reconstruct trees on an assortment of SARS-CoV-2 clades, extracted from the global phylogeny of public SARS-CoV-2 sequences provided by the UShER project (accessed 3-3-2022) (Lanfear, 2020;Turakhia et al, 2021). We allowed UShER to reconstruct trees on the set of unique sequences from each clade, as well as the ancestral sequence in the original tree, outputting a maximum of 200 trees resulting from alternative parsimonious placements of samples. Including the ancestral sequence guarantees that the resulting reconstruction is comparable to the subtree of the global phylogeny corresponding to the same clade. We then use the UShER utility matOptimize (Ye et al, 2022) to attempt to optimize each tree, allowing the optimizer to make up to four moves for each sample which do not improve the parsimony score. Allowing a few such moves is intended to increase the diversity in output trees, without requiring excessive computation time. We saved four intermediate trees during optimization of each tree output by UShER. Optimizations of different trees output by UShER are not guaranteed to achieve the same parsimony score. However, even optimized trees which are not globally maximally parsimonious are likely to contain parsimony-optimal substructures.
For each clade, the collection of 800 intermediate trees resulting from these tree optimizations is used to create a history sDAG, after outgrouping the ancestral sequence in each. These 800 trees are not guaranteed to be unique, and in fact there are often many duplicates. The resulting history sDAG is then completed, trimmed to only express maximally parsimonious histories, and label-collapsed.
Whereas it is computationally expensive to construct a maximally parsimonious tree, the operations of trimming, collapsing, and completing are highly optimized, and in practice take only a few seconds for the history sDAGs used to produce Figure 8. The number of operations required for the proposed trimming algorithm is bounded by O(E · (MCS + MNC)), and similarly the algorithm for completing the history sDAG is bounded by O(N² · MNC), where N is the number of nodes, E the number of edges, MCS the maximum size of any set of edges descending from a node-clade pair, and MNC the maximum number of child clades for any node in the history sDAG.
Fig. 8 Unique trees found by UShER, and unique trees in the resulting history sDAG, for each selected SARS-CoV-2 clade. Point colors indicate if the parsimony score of trees in the history sDAG is lower than the best parsimony score achieved by UShER (legend: No M.P. Improvement, M.P. Improvement, Identity). Parsimony improvement compared to UShER trees does not exceed 0.04% for any clade. These data are summarized in Supplementary Table 2
The resulting history sDAG sometimes contains histories which are slightly more parsimonious than any trees found by UShER, and in most cases, the number of maximally parsimonious histories contained in the resulting history sDAG is many orders of magnitude greater than the number of histories used as input (Figure 8). However, this increase is far from uniform across clades. For the clade AY.46.6, the history sDAG expresses an impressive 25 orders of magnitude more tree diversity than the input trees found by UShER, and all of those trees have a slightly better parsimony score than any tree found by UShER. On the other hand, clade AY.111 also stands out in contrast, with only two unique trees found by UShER, and only those same two unique trees contained in the resulting history sDAG.
For some clades, such as 20F, the number of unique trees found by UShER is greater than the final number of trees expressed in the history sDAG. Although surprising, this is not contradictory, since many of the unique trees found by UShER may have a higher parsimony score than the trees contained in the final history sDAG.
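Counts of histories as large as those discussed here cannot be obtained by enumerating trees one at a time. The following is a minimal sketch of how the number of histories in a DAG of this shape could be counted by dynamic programming. It is offered as an illustration and is not necessarily how the counts reported in the paper were produced. The dict-of-dicts representation and the name `count_histories` are assumptions: each node maps each of its child clades to a list of alternative children, and a history chooses exactly one child per clade.

```python
from math import prod


def count_histories(dag, root):
    """Number of histories below root: a product over child clades of the
    sum, over alternative children, of the counts below each child."""
    counts = {}

    def visit(v):
        if v not in counts:
            # For a leaf there are no clades, and the empty product is 1.
            counts[v] = prod(
                sum(visit(child) for child in children)
                for children in dag[v].values()
            )
        return counts[v]

    return visit(root)


# Hypothetical toy DAG with string node names, on leaves a, b, c.
toy_dag = {
    "ua": {"abc": ["x", "y"]},
    "x": {"ab": ["p", "q"], "c": ["leaf_c"]},
    "y": {"ab": ["p", "q"], "c": ["leaf_c"]},
    "p": {"a": ["leaf_a"], "b": ["leaf_b"]},
    "q": {"a": ["leaf_a"], "b": ["leaf_b"]},
    "leaf_a": {},
    "leaf_b": {},
    "leaf_c": {},
}
n_histories = count_histories(toy_dag, "ua")
```

Each node is visited once, so the cost grows with the number of edges rather than with the number of histories. Python integers have arbitrary precision, so counts spanning many orders of magnitude do not overflow.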
It is unlikely that Figure 8 reflects the true diversity of maximally parsimonious trees for each clade. In fact, the true minimum parsimony scores for tree reconstructions of each clade may be lower than the parsimony score of trees found here. The variation in tree diversity between clades is instead likely determined by features in the particular trees found by UShER. Further investigation of the true diversity of maximum parsimony trees will be left for future work.
Regardless, the large diversity of trees for most clades suggests that considerable uncertainty remains about tree structure when performing a maximum-parsimony search, even after collapsing edges without mutations into multifurcations. This uncertainty represents an opportunity to fine-tune the accepted tree in settings where parsimony is an appropriate assumption. For example, the histories found by this method could be used as a starting point for further optimization according to criteria other than parsimony. Such criteria, and their efficient calculation in the history sDAG, will be the subject of future work.

Discussion
This paper establishes that the history sDAG is an efficient structure for storage of similar internally labeled trees, and provides a foundation for future work to understand phylogenetic uncertainty using massive collections of parsimonious trees.
We described efficient methods for basic manipulation of the history sDAG object, and used these methods to demonstrate that for densely sampled SARS-CoV-2 data, it is possible to build a history sDAG containing many alternative parsimonious evolutionary histories. We implemented this process on clades containing up to seven thousand leaves, although it would have been feasible to use clades containing perhaps ten times as many. Software which is currently in development will allow parsimony optimization via matOptimize (Ye et al, 2022) directly on the history sDAG, avoiding the time-consuming step of generating many input trees with UShER, and hopefully allowing these methods to scale to even larger datasets.
Thanks to the convenient structure of the history sDAG, it will be possible to efficiently summarize clade-level uncertainty in these histories, although such methods will be described and benchmarked in a future paper. This approach can only be expected to work well when the tree posterior is overwhelmingly concentrated on maximally parsimonious trees, and even then clade supports estimated with the history sDAG may not be directly comparable to supports observed in a sample from the tree posterior. However, for phylogenetic inference resulting in a single maximally parsimonious tree (which is typically arbitrarily chosen from the collection of MP trees), our method could provide a valuable understanding of the uncertainty resulting from this choice. Clade support estimation via the history sDAG may have advantages over standard approaches to phylogenetic uncertainty estimation. Unlike a bootstrap approach, all alternative histories in the history sDAG are built on the same data, and therefore clade support derived from the history sDAG could be more accurate for clades defined by only a few mutations (Wertheim et al, 2022). Unlike a Bayesian approach, our method makes no attempt to fully resolve a tree when there is insufficient signal to do so, and we expect it to scale well to large data.
The history sDAG is related to various earlier works, as we now describe.

The Subsplit DAG
The history sDAG generalizes a similar construction useful for likelihood computations and variational inference on trees, integrating out ancestral sequence uncertainty (Zhang and Matsen, 2018, 2019). Although this form of the DAG structure is not expressed in the original variational inference papers, it is described in a more recent paper (Jun et al, 2023). In this subsplit DAG, internal nodes do not contain label data, and each internal node is required to have exactly two child clades (a subsplit is a subpartition with two parts). That is, the subsplit DAG is a history sDAG in which internal nodes all share the same fixed label, and each node has two child clades. The additional node label information in the history sDAG is essential for efficient storage and retrieval of maximally parsimonious trees, with the inferred ancestral sequences dictated by the parsimony assumption.

The Buneman Graph
A construction known as the Buneman graph is related to the history sDAG. In this construction, a collection of observations, each consisting of a collection of binary traits, can be arranged in a graph. This Buneman graph contains as subgraphs all possible maximally parsimonious trees relating the observations (Semple and Steel, 2003). This construction has been generalized to sequences of non-binary characters (Bandelt and Röhl, 2009;Misra et al, 2011), and one such generalization was applied to the problem of finding provably maximally parsimonious trees on nucleotide sequence data (Misra et al, 2011). However, although the Buneman graph contains all maximally parsimonious trees on a set of observations, it may also contain trees which are not maximally parsimonious. The Buneman graph is therefore not a natural data structure for storing collections of maximally parsimonious trees, since considerable additional computation may be needed to find the maximally parsimonious trees in the graph. In contrast, the history sDAG may be trimmed to express only maximally parsimonious trees, and sampling or iterating through the trees it contains is trivial. In addition, the history sDAG can be immediately generalized to arbitrary observed data (abstracted as node labels), and allows efficient computation and trimming with respect to weight functions other than parsimony.

Tree Fusion
The swapping of subhistories that takes place in the history sDAG bears some resemblance to the procedure known as tree fusion, used in some parsimony software like TNT, in which clades are swapped between trees to improve parsimony scores (Goloboff, 1999;Goloboff and Pol, 2007).
Generally, the history sDAG can be thought of as a structure which efficiently represents, and allows computation on, the set of trees resulting from all possible combinations of these clade swaps. However, the history sDAG can only swap subhistories that have identical parent node labels and subpartitions. In contrast, tree fusion can consider trees resulting from swapping any subtrees, as long as they contain the same set of samples.
Tree fusion is better approximated in the completed history sDAG, which does allow swaps of any subhistories containing the same samples. That is, for a history sDAG (V, E) constructed from a set of histories T , the set of histories in the completion of (V, E) consists of all histories resulting from combinations of swaps involving subhistories of histories in T , regardless of their parent nodes. However, subhistory swaps are still fundamentally different from the swaps of subtopologies realized during tree fusion, since subhistory swaps maintain the same ancestral node labels that were present in the original histories involved in each swap. In order to ensure that ancestral labels are optimal in the new histories contained in the completed history sDAG, we would need an algorithm to reconstruct these ancestral states from scratch. Such an algorithm for computing optimal ancestral states in the history sDAG would be analogous to the Sankoff algorithm for reconstructing ancestral states on trees.
Despite these limitations, Figure 8 shows that the subtree swaps which are realized in the history sDAG can be effective in reducing parsimony scores. Although the history sDAG does not fully implement tree fusion, it concurrently applies subhistory swaps in many different histories, and allows the resulting trees to be filtered efficiently according to arbitrary criteria. This may represent an advantage over methods which keep track of and optimize far fewer trees.

Tree Sequences
The history sDAG also bears some similarities to the tree sequence (Kelleher et al, 2019;Speidel et al, 2019). The tree sequence encodes a single evolutionary history for segments of a multiple sequence alignment, with changes of evolutionary history at specific points along the alignment due to recombination. The history sDAG, on the other hand, is meant to encode an unordered collection of equally parsimonious histories.

Future Work
We are in the process of building software that will allow us to do larger-scale inference using the history sDAG. In addition to the uncertainty quantification goals described above, this software will also allow us to do broader exploration of the set of maximally parsimonious trees than previously possible. We also hope to use the history sDAG as a means of improving MCMC sampling.
Maximally parsimonious trees may be a good starting point for inference via other methods, such as the branching process used by the tree inference package gctree (DeWitt et al, 2018). To support this, we will develop efficient algorithms to make calculations on histories contained in the history sDAG. We will also explore ways to search for new optimal histories, such as maximally parsimonious histories, directly within the structure of the history sDAG.

Acknowledgements
We thank JT McCrone and Gytis Dudas for discussions that informed this work, Mike Steel for pointing us to relevant literature, as well as Marc Suchard for suggestions on exposition. Thanks also to Ye Cheng, Russ Corbett-Detig, Yatish Turakhia, and the rest of the UShER team for helpful discussions and their help applying UShER to the SARS-CoV-2 example. We also thank Matthew Macaulay, Hassan Nasif, Anna Kooperberg, Michael Karcher, Tanvi Ganapathy, Shosuke Kiami, Seong-Hwan Jun, Cheng Zhang, and Mathieu Fourment for their work on the "subsplit DAG," a closely related idea.
The SARS-CoV-2 data that made it possible to explore the diversity of parsimonious reconstructions of SARS-CoV-2 clades come from the public databases GenBank (Hatcher et al, 2016), COG-UK (Nicholls et al, 2020), and the China National Center for Bioinformation (Song et al, 2020; Zhao et al, 2020; Gong et al, 2020; Yu et al, 2022). We thank the laboratories submitting sequence data to these public databases, as well as the researchers and laboratories contributing viral samples on which these sequences are based.

Competing interests
The authors declare no competing interests.

Ethics approval
No ethics approval process was required for this work.

Code availability
The history sDAG data structure described in this paper, as well as various algorithms described in this paper and in future work, are implemented in the open source Python package historydag, which is available at https://github.com/matsengrp/historydag.

Authors' contributions
Will Dumm and Frederick Matsen wrote the first draft of the manuscript, with edits and contributions to proofs from Mary Barker and edits from William DeWitt. Will Dumm and William Howard-Snyder prepared the SARS-CoV-2 clade reconstruction example. All authors commented on previous versions, and read and approved the final manuscript.

Open access
This article is subject to HHMI's Open Access to Publications policy. HHMI lab heads have previously granted a nonexclusive CC BY 4.0 license to the public and a sublicensable license to HHMI in their research articles. Pursuant to those licenses, the author-accepted manuscript of this article can be made freely available under a CC BY 4.0 license immediately upon publication.

Appendices

A Proofs omitted from the text
Lemma 3. Let (V, E) be a history sDAG or subhistory, and let v ∈ V. The set of labels of leaf nodes reachable from v is CU(v).
Proof. Let (V, E) be a history sDAG or subhistory on labels Y , and let v ∈ V be a non-UA node. Let X be the set of labels of leaf nodes reachable from v. X ⊂ CU(v) by Observation 2. To show inclusion in the other direction, we will show by induction on | CU(v)| that for any ℓ ∈ CU(v), the leaf node (ℓ, ∅) is reachable from v in (V, E).
As a base case, if $|CU(v)| = 1$, then $v$ must be the leaf node with label $\ell$, so the statement is immediately true. Now suppose that for any node $v' \in V$ with $|CU(v')| < n$, and for any $\ell \in CU(v')$, the leaf node $(\ell, \emptyset)$ is reachable from $v'$. Suppose $v = (\ell_v, U)$ is such that $|CU(v)| = n$, and let $\ell \in CU(v)$. Then $\ell \in C$ for some child clade $C \in U$. Since $U$ contains at least two disjoint, nonempty subsets of $Y$, it must be true that $C \subsetneq CU(v)$. Any node-clade pair in a history sDAG or subhistory must have at least one descendant edge, so there exists an edge $(v, v_c) \in E$ such that $C = CU(v_c)$, and $|CU(v_c)| < n$. Since $\ell \in C$, we know that $\ell \in CU(v_c)$, and by the inductive hypothesis, $(\ell, \emptyset)$ is reachable from $v_c$, and therefore also from $v$.
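Lemma 3 can be checked numerically on a small example. The following is a minimal sketch, not the historydag implementation: it assumes nodes are written as (label, subpartition) pairs with the subpartition a frozenset of clades, that a leaf has an empty subpartition and clade union equal to its own label, and that the UA node is omitted. The function names and the toy labels are hypothetical.

```python
def clade_union(node):
    """CU(v): the union of a node's child clades (a leaf's own label for a leaf)."""
    label, clades = node
    return frozenset().union(*clades) if clades else frozenset([label])


def reachable_leaf_labels(node, edges):
    """Labels of all leaf nodes reachable from `node` along directed edges."""
    label, clades = node
    if not clades:  # a leaf node is written (label, empty subpartition)
        return frozenset([label])
    found = frozenset()
    for parent, child in edges:
        if parent == node:
            found |= reachable_leaf_labels(child, edges)
    return found


# Toy history sDAG on leaf labels a, b, c (internal labels are made up).
a, b, c = ("a", frozenset()), ("b", frozenset()), ("c", frozenset())
ab = frozenset([frozenset("a"), frozenset("b")])
x1, x2 = ("x1", ab), ("x2", ab)  # two alternative ancestors of clade {a, b}
root = ("r", frozenset([frozenset("ab"), frozenset("c")]))
edges = [(root, x1), (root, x2), (root, c),
         (x1, a), (x1, b), (x2, a), (x2, b)]

# Lemma 3: the reachable leaf labels coincide with CU(v) at every node.
for v in (root, x1, x2, a, b, c):
    assert reachable_leaf_labels(v, edges) == clade_union(v)
```

The recursion mirrors the induction in the proof: each step descends to a child whose clade union is strictly smaller.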
Lemma 4. A history sDAG (V, E) is a history if and only if it is a tree, and contains exactly one edge descending from ρ.
Proof. We will prove the equivalent statement that, for a history sDAG (V, E) with exactly one edge descending from ρ, (V, E) is a tree if and only if (V, E) contains exactly one edge descending from each node-clade pair. We will prove the contrapositive of both directions.
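Assume first that there exists a node-clade pair in $(V, E)$ with at least two descendant edges. That is, there exist edges $(v, v_1), (v, v_2) \in E$ with $v_1 \neq v_2$ and $CU(v_1) = CU(v_2)$. By Lemma 3, any leaf node with label in $CU(v_1)$ is reachable from both $v_1$ and $v_2$, so that leaf is reached from $v$ by two distinct paths, and $(V, E)$ is not a tree.
Conversely, assume that $(V, E)$ is not a tree, meaning that there exist two edges $(v_1, v), (v_2, v) \in E$ with the same child node. Since all nodes in a history sDAG must be reachable from $\rho$, there exist paths in $E$ connecting $\rho$ to both $v_1$ and $v_2$. $(V, E)$ has only one edge exiting $\rho$, so these two paths must diverge at some non-UA node $v_r = (\ell_r, U_r) \in V$. That is, there are edges $(v_r, v'_1), (v_r, v'_2) \in E$ with $v'_1 \neq v'_2$, each lying on a path from $\rho$ to $v$. Every leaf reachable from $v$ is then reachable from both $v'_1$ and $v'_2$, so by Lemma 3, $CU(v'_1)$ and $CU(v'_2)$ are not disjoint. However, since $v'_1$ and $v'_2$ are children of the node $v_r$, both $CU(v'_1)$ and $CU(v'_2)$ must be elements of $U_r$. Elements of $U_r$ are disjoint, nonempty subsets of $Y$, so $CU(v'_1) = CU(v'_2)$. We have demonstrated that the two edges $(v_r, v'_1)$ and $(v_r, v'_2)$ descend from the same node-clade pair $(v_r, CU(v'_1))$ in $(V, E)$.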
Proof. We will show that no edge may take part in a cycle. Recall first that the UA node only admits outgoing edges, so no edge exiting ρ can be part of a cycle.
Consider an edge $e = (v_p, v)$, with $v_p = (\ell_p, U_p)$ and $v = (\ell, U)$, whose parent is not $\rho$. Since $v_p$ is not the UA node, $|U_p| \geq 2$, and either $U = \emptyset$ or $\bigcup_{C \in U} C \in U_p$. In the first case, $v$ is a leaf node, which can only accept incoming edges, so the edge $e$ cannot be part of any cycle. In the second case, since $|U_p| \geq 2$ and elements of $U_p$ are nonempty, disjoint subsets of $X$, $|CU(v)| < |CU(v_p)|$. The same inequality is true of any edge reachable from $v$ which does not terminate at a leaf node, so no edge reachable from $v$ can have $v_p$ as a target.
Lemma 7. Let (V, E) be a history sDAG. For any v ∈ V , there exists a subhistory s in (V, E) whose root node is v.
Proof. We will prove this by induction on | CU(v)|.
As the base case, suppose | CU(v)| = 1. Then v must be a leaf node, and the subhistory we seek is the one consisting of only the node v.
Now suppose it's true that for any node v with | CU(v)| < n, there's a subhistory s rooted at v in (V, E).
Let $v = (\ell, U) \in V$, and suppose that $|CU(v)| = n$. Then $U = \{C_1, \ldots, C_m\}$ where $C_1, \ldots, C_m$ are $m \geq 2$ disjoint subsets of $Y$. For each $C_i \in U$, by the definition of the history sDAG, there exists at least one edge $(v, v_i) \in E$ with $CU(v_i) = C_i$. Since $|CU(v_i)| < n$, there exists a subhistory $s_i = (V_{s_i}, E_{s_i})$ with $v_i$ as its root, by the inductive hypothesis.
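Notice that the node sets $V_{s_i}$ for $1 \leq i \leq m$ are pairwise disjoint: Let $v' \in V_{s_k}$ and $v'' \in V_{s_j}$, for $k \neq j$, $1 \leq k, j \leq m$. A necessary condition for node equality is that $CU(v') = CU(v'')$. However, $CU(v') \subseteq C_k$ and $CU(v'') \subseteq C_j$ are nonempty subsets of disjoint clades, so $v' \neq v''$. Therefore, we can build a subhistory rooted at $v$ consisting of $v$, the subtrees $s_i$ for $1 \leq i \leq m$, and the edges connecting $v$ to the root node $v_i$ of each subtree $s_i$. That is,
$$\Big(\{v\} \cup \bigcup_{i=1}^{m} V_{s_i},\; \{(v, v_i) \mid 1 \leq i \leq m\} \cup \bigcup_{i=1}^{m} E_{s_i}\Big)$$
is the subhistory we seek, rooted at $v$. This subhistory has exactly one edge descending from each node-clade pair because the $s_i$ are subhistories, and because for each child clade $C_i$ of $v$, the edge $(v, v_i)$ descends from $(v, C_i)$.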
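The inductive construction in this proof translates directly into a recursive procedure. The sketch below is illustrative only and makes the same assumptions as the paper's node definition written as Python tuples: a node is a (label, subpartition) pair, a leaf has an empty subpartition, and the UA node is omitted. The names subhistory_edges and clade_union, and the toy labels, are hypothetical.

```python
def clade_union(node):
    """CU(v): the union of a node's child clades (a leaf's own label for a leaf)."""
    label, clades = node
    return frozenset().union(*clades) if clades else frozenset([label])


def subhistory_edges(node, edges):
    """Edges of one subhistory rooted at `node`: one edge per node-clade pair."""
    label, clades = node
    chosen = []
    for clade in clades:
        # A history sDAG has at least one edge below every node-clade pair,
        # so this search succeeds; we simply take the first such edge.
        child = next(c for p, c in edges
                     if p == node and clade_union(c) == clade)
        chosen.append((node, child))
        chosen.extend(subhistory_edges(child, edges))
    return chosen


# Toy history sDAG with two alternative ancestors of the clade {a, b}.
a, b, c = ("a", frozenset()), ("b", frozenset()), ("c", frozenset())
ab = frozenset([frozenset("a"), frozenset("b")])
x1, x2 = ("x1", ab), ("x2", ab)
root = ("r", frozenset([frozenset("ab"), frozenset("c")]))
edges = [(root, x1), (root, x2), (root, c),
         (x1, a), (x1, b), (x2, a), (x2, b)]

s = subhistory_edges(root, edges)
# Exactly one edge descends from each node-clade pair in the result.
pairs = [(p, clade_union(ch)) for p, ch in s]
assert len(pairs) == len(set(pairs))
```

Taking the first matching edge is an arbitrary choice; any other choice of one edge per node-clade pair yields another subhistory rooted at the same node.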
Lemma 8. Let (V, E) be a history sDAG, and let T be the collection of histories in (V, E). Then (V, E) is the history sDAG constructed from T .
Proof. We will argue that for any edge $e = (v, v_c) \in E$, there exists a history in $(V, E)$ containing $e$. Let $(e_i)_{i=0}^{n-1}$, with $e_i = (v_i, v_{i+1})$, be a sequence of edges in $E$ which is a path from $\rho$ to $v_c$, so that $v_0 = \rho$ and $e_{n-1} = (v_{n-1}, v_n) = (v, v_c) = e$. For $1 \leq i \leq n$, let $s_i$ be a subhistory rooted at $v_i$, which exists by Lemma 7. Let $s'_n = s_n$, and recursively construct $s'_i$ for $1 \leq i < n$ by replacing the edge descending from the node-clade pair $(v_i, CU(v_{i+1}))$ in $s_i$, and the subhistory consisting of all nodes and edges reachable from the child node of that edge, with the edge $e_i$ and the subhistory $s'_{i+1}$, rooted at $v_{i+1}$. Then $s'_i$ remains rooted at $v_i$, and now contains the edge $e_i$ and all the edges $e_j$, for $i < j < n$, including the edge $e$. $s'_1$ is then a subhistory rooted at $v_1$ which contains $e$. Lemma 4 implies that by adding the edge $(v_0 = \rho, v_1)$ to $s'_1$, we've constructed a history in $(V, E)$ containing $e$. Note that in the language of following sections, this history can also be expressed as $t \triangleleft s_2 \triangleleft \cdots \triangleleft s_n$, where $t$ is the history consisting of $s_1$ and the edge $(\rho, v_1)$.
To finish the proof, let (V T , E T ) be the history sDAG constructed from T . Since a history sDAG must be connected, to show that (V, E) and (V T , E T ) are equal is to show that E T = E. Any edge in E T must be present in some history in T , and since T is the set of histories in (V, E), any such edge must be in E. Therefore, E T ⊂ E. Also, we just showed that any edge e ∈ E must take part in some history in T , and therefore e must also be in E T . Therefore, E T = E.
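As an illustration of why the set of histories in a history sDAG can be larger than the collection used to build it, the sketch below builds a DAG as the union of the edges of two input histories and counts the histories it contains. This is a toy model under stated assumptions, not the historydag API: nodes are (label, subpartition) tuples, the UA node is omitted so counting starts at its single child, and the names build_sdag and count_histories are hypothetical.

```python
from math import prod


def clade_union(node):
    """CU(v): the union of a node's child clades (a leaf's own label for a leaf)."""
    label, clades = node
    return frozenset().union(*clades) if clades else frozenset([label])


def build_sdag(histories):
    """Edge set of the DAG built from histories: the union of their edge sets."""
    return set().union(*histories)


def count_histories(node, edges):
    """Number of subhistories rooted at `node`: a product over child clades
    of the summed counts of the children available below each clade."""
    label, clades = node
    if not clades:
        return 1
    return prod(
        sum(count_histories(c, edges) for p, c in edges
            if p == node and clade_union(c) == clade)
        for clade in clades
    )


a, b, c, d = (("a", frozenset()), ("b", frozenset()),
              ("c", frozenset()), ("d", frozenset()))
ab = frozenset([frozenset("a"), frozenset("b")])
cd = frozenset([frozenset("c"), frozenset("d")])
x1, x2 = ("x1", ab), ("x2", ab)  # alternative ancestral labels for {a, b}
y1, y2 = ("y1", cd), ("y2", cd)  # alternative ancestral labels for {c, d}
root = ("r", frozenset([frozenset("ab"), frozenset("cd")]))

t1 = {(root, x1), (root, y1), (x1, a), (x1, b), (y1, c), (y1, d)}
t2 = {(root, x2), (root, y2), (x2, a), (x2, b), (y2, c), (y2, d)}
dag = build_sdag([t1, t2])
n_histories = count_histories(root, dag)
```

Here two input histories that differ below two independent clades yield a DAG containing four histories, since each of the two choices below {a, b} can be paired with either choice below {c, d}.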

A.1 History Weights
The lemmas in this appendix subsection are necessary for the proof of Theorem 1. The proof of that theorem is given at the end of this subsection.
Definition 20. Let s and s ′ be subhistories of histories t and t ′ in some ambient history sDAG. We say that s and s ′ are conforming subhistories if s ′ has the same set of leaf labels as s, and the parent node in t ′ of s ′ is the same as the parent node of s in t.
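More formally, if $t = (V, E)$ and $t' = (V', E')$, and $s = (V_s, E_s)$, $s' = (V_{s'}, E_{s'})$ have root nodes $v$ and $v'$ respectively, then $s$ and $s'$ are conforming if $CU(v) = CU(v')$ and there is a node $v_p$ such that $(v_p, v) \in E$ and $(v_p, v') \in E'$.
Notice that since no internal node in a history may have exactly one child, a history may not contain two distinct subhistories with the same leaf nodes. Therefore, given a history $t$ and a subhistory $s'$, a choice of subhistory $s$ of $t$ conforming with $s'$ is guaranteed to be unique, if it exists.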
Notice also that the definition of conforming subhistory does not allow a history to be conforming with any subhistory of itself. To evaluate conformity of a subhistory, there must be some ambient parent node (Figure 9). In both (a) and (b), the two subhistories highlighted in red are conforming, since they share the same set of leaves, and in each respective history their parent node has the same label and descendant clades. We include this example to emphasize that conforming subhistories need not have the same internal node labels.
We can now define the exchange of substructures that takes place between histories in the history sDAG.
Definition 21. Let t = (V, E) be a history, and let s = (V s , E s ) be a subhistory of t. Also, let (V d , E d ) be a history sDAG, and let s ′ = (V ′ , E ′ ) be any subhistory of (V d , E d ) conforming with s.
A subhistory swap of t and s ′ is a history with the structure and labeling of t, except that the subhistory s of t is replaced with the subhistory s ′ .
More formally, the subhistory swap replacing $s$ with $s'$ is the history with nodes $(V \setminus V_s) \cup V'$, and edges
$$\big(E \setminus (E_s \cup \{(v_p, v)\})\big) \cup E' \cup \{(v_p, v')\},$$
where $v_p$ is the parent node of $s$ in $t$, $v$ is the root node of $s$, and $v'$ is the root node of $s'$.
Definition 22. Let the swap operator ◁ be a left-associative operator on (history, subhistory) pairs, defined so that t ◁ s′ is the subhistory swap of t and s′, if a subhistory of t conforming with s′ exists. t ◁ s′ is undefined if no such subhistory exists.
Notice again that the subhistory (right argument of ◁) in a subhistory swap must exist in the context of some ambient history sDAG, so that it can be evaluated whether swapped subhistories are conforming.
The definition of conformity is slightly more restrictive than it needs to be to guarantee that subhistory swaps of conforming subhistories preserve parsimony. In particular, there is no need to require that the parent nodes of the swapped subhistories have the same subpartitions. However, this assumption is natural in the context of the history sDAG structure, and is necessary for the argument to extend to edge weight functions that depend on nodes' subpartitions.
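The subhistory swap can be sketched on edge sets as follows. This is a minimal illustration under stated assumptions, not the historydag implementation: histories are sets of (parent, child) edges between (label, subpartition) tuples, the UA node is omitted, and the right argument is given as a history t_prime together with the root v_prime of the subhistory to be swapped in. The function names are hypothetical, and None stands in for "undefined".

```python
def clade_union(node):
    """CU(v): the union of a node's child clades (a leaf's own label for a leaf)."""
    label, clades = node
    return frozenset().union(*clades) if clades else frozenset([label])


def edges_below(root, edges):
    """All edges reachable from `root` within the edge set `edges`."""
    out, stack = set(), [root]
    while stack:
        node = stack.pop()
        for p, c in edges:
            if p == node:
                out.add((p, c))
                stack.append(c)
    return out


def swap(t, t_prime, v_prime):
    """Sketch of t ◁ s', with s' the subhistory of t_prime rooted at v_prime.

    Returns None when no subhistory of t conforms with s'.
    """
    # Parent of s' in its ambient history (None if v_prime has no parent).
    v_p = next((p for p, c in t_prime if c == v_prime), None)
    # Root of the conforming subhistory of t: same parent, same clade union.
    v = next((c for p, c in t
              if p == v_p and clade_union(c) == clade_union(v_prime)), None)
    if v_p is None or v is None:
        return None
    s, s_prime = edges_below(v, t), edges_below(v_prime, t_prime)
    return (t - s - {(v_p, v)}) | s_prime | {(v_p, v_prime)}


a, b, c, d = (("a", frozenset()), ("b", frozenset()),
              ("c", frozenset()), ("d", frozenset()))
ab = frozenset([frozenset("a"), frozenset("b")])
cd = frozenset([frozenset("c"), frozenset("d")])
x1, x2, y1, y2 = ("x1", ab), ("x2", ab), ("y1", cd), ("y2", cd)
root = ("r", frozenset([frozenset("ab"), frozenset("cd")]))
t1 = {(root, x1), (root, y1), (x1, a), (x1, b), (y1, c), (y1, d)}
t2 = {(root, x2), (root, y2), (x2, a), (x2, b), (y2, c), (y2, d)}

swapped = swap(t1, t2, x2)  # replace the subhistory below {a, b} in t1
```

In this toy example the swapped history keeps the {c, d} side of t1 and takes the {a, b} side from t2, so it is a history that appears in neither input.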
Lemma 15. The operator ◁ is well-defined on subhistories. Also, given subhistories t, s ′ both with labels in Y , and with the leaves of t labeled by X ⊂ Y , then t ◁ s ′ is a history with labels in Y and leaves labeled by X. That is, ◁ preserves the leaf labels of its left argument.
Proof. To show that ◁ is well-defined, we need to show that given histories t, s ′ , the subhistory swap t ◁ s ′ is a history, and is uniquely determined by the choice of t and s ′ . t ◁ s ′ is a history directly from the definition, and by the observation that since neither t nor s ′ may have unifurcations, their subhistory swap may not either. t ◁ s ′ replaces a subhistory s of t with s ′ , where s must have exactly the same leaf label set as s ′ . If such a choice of s exists, it must be unique by the assumption that nodes in a history may not have exactly one child. This guarantees that no two nodes in a history are above the same set of leaves. Now assume that t and s ′ are subhistories on labels Y , and t has leaves labeled by X ⊂ Y . To see that t ◁ s ′ is a history with labels in Y and leaves labeled by X, notice first that s ′ must have nodes labeled bijectively by a set C ⊂ X, the same set of leaf labels as the subhistory in t that s ′ replaces. Therefore the labeling on t ◁ s ′ , restricted to leaf nodes, is bijective as a union of two bijective functions with disjoint domains, and images partitioning X. The labeling on t◁s ′ maps into Y as a union of functions which both map into Y .
We now describe the sense in which subhistory swaps preserve history weight.
Lemma 16. Let t 1 = (V 1 , E 1 ) and t 2 = (V 2 , E 2 ) be histories on labels Y . For i ∈ {1, 2}, let s i be a subhistory of t i , so that s 1 and s 2 are conforming.
Let $t'_1 = t_1 \triangleleft s_2$ be the history constructed by replacing $s_1$ with $s_2$ in $t_1$, and similarly define $t'_2 = t_2 \triangleleft s_1$ to be the history constructed by replacing $s_2$ with $s_1$ in $t_2$. Finally, suppose that $f$ is an edge-weight function taking values in a weight set $W$, clade-ordered with respect to $Y$. Then
$$g_f(t'_1) < g_f(t_1) \quad \text{if and only if} \quad g_f(t'_2) > g_f(t_2).$$
Proof. Let $v_i$ be the parent node of $s_i$ in $t_i$, and let $K_i = g_f(s_i^{v_i})$ for $i \in \{1, 2\}$. That is, $K_i$ is the weight of the augmented subhistory consisting of $s_i$ and its parent edge. Then for some weights $w_1, w_2 \in W$, $g_f(t_i) = w_i + K_i$, and also $g_f(t'_1) = w_1 + K_2$ and $g_f(t'_2) = w_2 + K_1$. The following are equivalent, since $K_i$ are weights of subhistories below the same clade, and $W$ is clade-ordered:
$$g_f(t'_1) < g_f(t_1) \iff w_1 + K_2 < w_1 + K_1 \iff K_2 < K_1 \iff w_2 + K_2 < w_2 + K_1 \iff g_f(t_2) < g_f(t'_2).$$
To extend the conclusion of this lemma to all the histories in the history sDAG, we need a few more lemmas:
Lemma 17. Suppose $t_1, \ldots, t_n$ are histories in the history sDAG $(V, E)$, and $s_i$ is a subhistory of $t_i$ for $2 \leq i \leq n$. Then $t_1 \triangleleft s_2 \triangleleft \cdots \triangleleft s_n$ is a history in $(V, E)$.
Proof. We need only show this is true for n = 2, since subhistory swaps are left-associative. Let t 1 , t 2 be histories in the history sDAG, and let s 2 be a subhistory of t 2 , conforming with some subhistory s 1 of t 1 . Also let v 1 , v 2 be the root nodes of s 1 and s 2 respectively. Conformity means that the parent node v p of s 1 in t 1 is the same as the parent node of s 2 in t 2 . Notice that all the edges in t 1 are in E, as well as all the edges of s 2 , since we assumed that t 1 , t 2 are histories in the DAG. Also notice that the edge (v p , v 2 ) is in E, because it is an edge in t 2 . Therefore, all the edges in t 1 ◁ s 2 are in E, and t 1 ◁ s 2 is a history in the history sDAG.
The following lemma describes how any history in a history sDAG built from a collection of histories T can be built from a collection of swaps operating on subhistories from T .
Lemma 18. Let t ∈ D(T ) be a history in the history sDAG (V, E) constructed from a collection of histories T . Then for some sequence of histories (t i ) n i=1 in T , and choices of subhistories s i of t i for all i, t = t 1 ◁ s 1 ◁ . . . ◁ s n .
Proof. Let t = (V t , E t ) be a history in (V, E). Every edge in E t must appear in some t ′ ∈ T . Since t is a tree, there exists a preordering (v i ) n i=0 of vertices in V t so that v j is reachable from v i only if i ≤ j. That is, if i > j, then v j must not be reachable from v i . Let (e i ) n i=1 be an ordering of edges in E t such that v i is the target node of e i for all 1 ≤ i ≤ n. We will use the notation v ′ i to denote the parent node of the edge e i . Notice then that for edges Finally, also define a sequence of parent node-clade pairs (p i ) n i=1 so that ). We will say that an edge e j = (v ′ j , v j ) is reachable from a node-clade pair p i if there exists a path of edges ending with e j such that the first edge in the path descends from the node-clade pair p i . Notice that since the sequence (e i ) is a preordering of the history t, and since only one edge may descend from each node-clade pair in a history, if e j is reachable from the node-clade pair p i , then i ≤ j.
Now choose a sequence (t i ) n i=1 of histories in T so that e i is an edge in t i for all i, and let s i be the subhistory of t i rooted at v i , the child node of the edge e i . The edge e i is not reachable from the parent node-clade pair of any s k with k > i, because the indices are chosen to preorder nodes and edges. Notice that a subhistory swap can only change edges reachable from the shared parent node-clade pair of the subhistories being swapped. Assume temporarily that e i is in t 1 ◁ s 1 ◁ · · · ◁ s i . Then the edge e i must be in the history t 1 ◁ s 1 ◁ · · · ◁ s k for k > i, since e i must not be reachable from p k .
Because of this, to show that all edges in E t are in t 1 ◁ s 1 ◁ · · · ◁ s n , we need only show that the edge e i is in t 1 ◁ s 1 ◁ · · · ◁ s i for all 1 ≤ i ≤ n, which we now establish. Inducting on i, notice first that e 1 is in t 1 ◁ s 1 = t 1 , by our choice of t 1 . Supposing that e j is in t 1 ◁ s 1 ◁ · · · ◁ s j for all j < i, notice that v ′ i = v j for some j < i, so that v ′ i is in t 1 ◁ s 1 ◁ · · · ◁ s i−1 , and s i is conforming with some subhistory of t 1 ◁ s 1 ◁ · · · ◁ s i−1 . The subhistory swap with s i therefore replaces the unique child node v in t 1 ◁ s 1 ◁ · · · ◁ s i−1 which descends from the node-clade pair (v ′ i , CU(v i )), and all of its descendants, with s i , which is rooted at v i and attached below v ′ i with the edge (v ′ i , v i ) = e i in t 1 ◁ s 1 ◁ · · · ◁ s i . Therefore, t 1 ◁s 1 ◁· · ·◁s n contains at least all those edges in E t . Furthermore, t 1 ◁ s 1 ◁ · · · ◁ s n is a history with the same leaves as t, so it can't contain any more edges than those in E t and remain a tree. That is, t = t 1 ◁ s 1 ◁ · · · ◁ s n .
With the preceding lemmas, it is finally possible to prove the main result of this section:
Theorem 1. Let T be a collection of histories, so that g f (t) = K for all t ∈ T . Then there exists a history t ∈ D(T ) with g f (t) < K if and only if there exists a history t ′ ∈ D(T ) with g f (t ′ ) > K.
Proof. By Lemma 18, any history t ∈ D(T ) can be expressed as a finite sequence of subhistory swaps involving histories in T . We will induct on n, the number of subhistory swaps involving histories in T required to express t. First, suppose the history t can be expressed as t = t 1 ◁s 2 , a subhistory swap involving histories t 1 , t 2 ∈ T , and the subhistory s 2 of t 2 , conforming with a subhistory s 1 of t 1 . Then by Lemma 16, g f (t) < K if and only if g f (t 2 ◁ s 1 ) > K. t 2 ◁ s 1 ∈ D(T ) by Lemma 17, so we've shown that g f (t) < K implies there exists a history g f (t) > K. Consider the edge e = (v, v c ) in t closest to ρ which is not in (V ′ , E ′ ). Since e / ∈ E ′ , e must not be an edge in any minimum-weight history in (V, E). However, since e is the closest edge to ρ in t which is not in (V ′ , E ′ ), it must be true that v ∈ V ′ . That is, there must be no subhistory s ∈ Ch(v c ) in (V, E) such that g f (s v ) = M f (v, CU(v c )), and the edge e must not be in E. Therefore, t is not in (V , E), and (V , E) contains exactly the minimum-weight histories in (V, E), so by Lemma 8, (V , E) = (V ′ , E ′ ).

A.1.2 Collapsing histories
Lemma 13. Let (V, E) be a history sDAG with label set Y .
Also let (v p = (ℓ p , U p ), v c = (ℓ c , U c )) ∈ E be an internal edge. That is, Define a binary function b : E → {0, 1} which is constant at 0, except that b(v p , v c ) = 1, and let T be any b-collapsible edge cover of (V, E).
Let v ′ p = (ℓ p , U p ∪ U c \ CU(v c )) be the "new parent node", and define: Then, let E − = E + \ R. Finally, define Claim: (V ′ , E ′ ) is the history sDAG constructed from T ′ , the set of histories which result by collapsing the edge (v p , v c ) in each history in T in which it appears.
Proof. Let (V ! , E ! ) be the DAG constructed from T ′ . We must show that E ′ = E ! , so that by construction, collapsing in histories only modifies edges incident to the edge being collapsed.
• If v 2 = v p , then v p was not removed from V , meaning that some edge (v p , v) ∈ E must exist, with CU(v) = CU(v c ). T contains all the histories in (V, E), so there is a history (V t , E t ) in T so that (v p , v) ∈ E t . Since each history has exactly one edge descending from each node-clade pair, , there's a subhistory which contains both edges, and since (v p , v c ) is b-collapsible and T is a b-collapsible edge cover, there is a history (V t , E t ) ∈ T which contains both the subhistory, and consequently, both edges. The corresponding label-collapsed history in T ′ contains (v 1 , v 2 ). Therefore, (v 1 , v 2 ) ∈ E ! .
• If v 2 = v c , then v 1 ̸ = v p , and (v 1 , v 2 ) ∈ E. Some history in T must contain the edge (v 1 , v 2 ), and may not contain the edge (v p , v c ) = (v p , v 2 ), in order to be a tree. Therefore, this history is unchanged in Therefore, (v 1 , v 2 ) ∈ E, and by the same reasoning as above, We've addressed all the situations where one of v 1 , v 2 ∈ v p , v ′ p , v c . Both nodes can't be in that set, because no pair of nodes in v p , v ′ p , v c can have an edge between them in E ′ , by construction. Now, to show that E ! ⊂ E ′ , let (v 1 , v 2 ) ∈ E ! . First, notice that E ! ⊂ E + , because E + contains all the edges that are added to histories in T when collapsing (v p , v c ), and E ! does not contain (v p , v c ).
By definition, (v 1 , v 2 ) ∈ E t , for some (V t , E t ) ∈ T ′ . If E − = E + , and since (v 1 , v 2 ) ∈ E t , v 1 is reachable from the UA node, and (v 1 , v 2 ) ∈ E ′ . If E − ̸ = E + , then v p must have had no edges descending from its child clade CU(v c ), in E + . That means v p / ∈ V t , since a history must have exactly one descendant edge for each node-clade pair. Therefore, no removed parent edges are in E t , and E t ⊂ E − . This means that v 1 is reachable from the UA node in E − , and Lemma 14. Let (V 0 , E 0 ) be a history sDAG, and define a sequence (V i , E i ) i∈N of history sDAGs, so that (V k , E k ) is generated by collapsing an edge Then there exists N ∈ N such that (V N , E N ) is label-collapsed. Also, if T 0 is a label-collapsible edge cover of (V 0 , E 0 ), and T ′ 0 is the set of histories resulting from label-collapsing each history in T 0 , then each history in Proof. Let T 0 be a label-collapsible edge cover of (V 0 , E 0 ), and define a sequence of sets of histories (T i , and otherwise let T k be obtained by collapsing all histories in T k−1 at the edge e k−1 .
Notice that if such an N exists, then T ′ 0 ⊆ T N by Lemma 13. Therefore we need only show that such an N exists.
Let T denote a label-collapsible edge cover of (V, E), and denote the multiset of collapsible edges in all t ∈ T as E col . For each collapsible edge e ∈ E, the label-collapsed DAG (V ′ , E ′ ) is equivalent to the DAG obtained from the label-collapsed histories T ′ by Lemma 13.
Note that the number of trees in T ′ is equal to the number of trees in T . However, the total number of unique trees in T ′ can be smaller than in T since collapsing an edge in two different trees can produce the same resulting tree. Also, since collapsing an edge in a history does not introduce any new edges in that history, collapsing e strictly reduces the number of edges in E col . So the multiset of collapsible edges in T ′ is a strict subset of E col . We will demonstrate that T ′ is a label-collapsible edge cover of (V ′ , E ′ ). Since the set E col is finite, these results imply that any such sequence of history sDAGs results in a label-collapsed DAG in a finite number of steps.
To show that T ′ is a label-collapsible edge cover of (V ′ , E ′ ), we show that for a collapsible edge e c ∈ E ′ , every subhistory in (V ′ , E ′ ) which contains e c is contained in T ′ .
Let e c be given, and suppose s ′ is any subhistory in (V ′ , E ′ ) containing e c . If every edge in s ′ is disjoint from the vertices {v ′ p , v p }, then there is an identical subhistory s in (V, E), and, by the label-collapsible edge covering property of T , there exists t ∈ T containing s. Since every edge in s is disjoint from the set of edges altered by collapsing at e, collapsing t at e yields a history in T ′ that contains s = s ′ as a subhistory.
If there is an edge in s ′ of the form (v, v ′ p ) then consider the corresponding un-collapsed subhistory s in (V, E) consisting of edges Where the edges adjacent to v ′ p are replaced with the corresponding structures in E. By construction, s is a subhistory in (V, E) containing a collapsible edge e and such that collapsing at e yields s ′ . Since T is a label-collapsible edge cover for (V, E), there exists t ∈ T containing s, and, since collapsing s at e yields s ′ , the label-collapsed history t ′ ∈ T ′ contains s ′ . The analogous argument holds if s ′ is a subhistory containing an edge of the form (v ′ p , v). If there is an edge in s ′ of the form (v p , v), then by the observation following Lemma 13, this implies that there is another edge descending from the node-clade pair (v p , CU(e)) distinct from e, and that s ′ can be viewed as a subhistory of a subhistory containing that alternative edge. So the subhistory s ′ corresponds to a subhistory s in (V, E) which belongs to a history t that cannot contain e. Since s is contained in a history that does not contain e, collapsing at e does not affect s, and so collapsing s yields s ′ = s. Thus s is a subhistory in (V, E) that contains the collapsible edge e c , and hence there exists a history t ∈ T containing s. Since t does not contain e, collapsing at e yields t ′ ∈ T ′ which, trivially, contains s ′ = s.
And so T ′ is a label-collapsible edge cover for (V ′ , E ′ ).

A.2 Histories are labeled trees
In this subsection, we show that history substructures in the history sDAG are in bijection with isomorphism classes of rooted, internally labeled, multifurcating trees. There will be a number of notational differences from the rest of the paper. Rather than a history, t will denote a labeled tree, and s a subtree of a labeled tree. τ will denote a tree's graph structure, in which nodes are abstract objects rather than the label, subpartition pairs that the history sDAG consists of. The function L will denote the set of leaf nodes below an internal node in a labeled tree. Also, φ will denote a labeling function of a labeled tree, rather than a disambiguation of a history. Y will continue to mean a set of labels, as in the rest of the paper.
We will let L(t) refer to the set of leaf nodes of the tree τ , and require that
• no node in τ has exactly one child, and
• the labels on leaf vertices must be unique (that is, φ| L(t) must be injective), but labels on internal vertices need not be (that is, φ need not be injective or surjective).
However, we will primarily use a different definition in this text, which is equivalent up to isomorphism on internally labeled trees:
Definition 24. Let $t = (V, E, \varphi)$ and $t' = (V', E', \varphi')$ be two labeled trees. Then $t$ and $t'$ are isomorphic if there exists a bijection $h : V \to V'$ which preserves labels and respects tree structure. That is, $\varphi'(h(v)) = \varphi(v)$ for all $v \in V$, and $(v_1, v_2) \in E$ if and only if $(h(v_1), h(v_2)) \in E'$.
Lemma 19. Let $(V, E)$ be the complete history sDAG on labels $Y$. Given a history $(V_t, E_t)$ in $(V, E)$, let $V' = V_t \setminus \{\rho\}$ and $E' = E_t \setminus \{(\rho, v)\}$, with $v$ the only child node of the UA node in $(V_t, E_t)$. Define the function $\varphi : V' \to Y$ as $\varphi((\ell, U)) = \ell \in Y$. The correspondence $(V_t, E_t) \mapsto t_{(V_t, E_t)} = (V', E', \varphi)$ from histories in $(V, E)$ to labeled trees on labels $Y$ is well-defined.
Proof. Given the history $(V_t, E_t)$, the label function restricted to leaf nodes, $\varphi|_{L(t_{(V_t, E_t)})} : L(t_{(V_t, E_t)}) \to Y$, is an injection, since DAG leaf nodes are uniquely labeled by elements of $Y$. Also, any node $w$ in the labeled tree $t_{(V_t, E_t)}$ is determined by a unique node $v \in V_t$. The node $v$ must have either no child clades, or at least two child clades, and $v$ must have a child node for each child clade. Each child node of $v$ in $(V_t, E_t)$ corresponds to a child node of $w$ in $t_{(V_t, E_t)}$, so $w$ may not have exactly one child node.
That is, the map named in the lemma is well-defined.
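Lemma 20. Let $(V, E)$ be the complete history sDAG on labels $Y$. Let $t = (\tau, \varphi)$ be a labeled tree with labels in $Y$, and with root node $w_0$. For each node $w$ of $\tau$, let $C_w \subset Y$ be the set of leaf labels below the node $w$, and let $v_w = (\varphi(w), \{C_{w'} \mid w' \text{ a child of } w\})$. Let $(V_t, E_t)$ consist of the UA node $\rho$ and the nodes $v_w$, together with the edge $(\rho, v_{w_0})$ and an edge $(v_{w_1}, v_{w_2})$ for each edge $(w_1, w_2)$ of $\tau$.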
The correspondence t → (V t , E t ) from labeled trees on labels Y to histories in (V, E) is well-defined.
Proof. The assignment w → v w = (ℓ, U ) of nodes in the labeled tree to nodes in the DAG is well-defined:
• U consists of disjoint subsets of Y because φ is injective on leaves of t, and sets of leaves between child nodes of w must be disjoint.
• U is either empty, or contains more than one subset of Y , since w can be a leaf node with no children, or an interior node of t with two or more children.
• U = ∅ if and only if w is a leaf node, because w has no children if and only if w is a leaf node.
This assignment w → v w is also injective: In particular, no two nodes in a labeled tree may have the same subpartition. To see this, let w 1 , w 2 be two nodes in a labeled tree t, with subpartitions U 1 and U 2 . We will show that U 1 ≠ U 2 . If one of the nodes w 1 , w 2 is not reachable from the other, then U 1 ∩ U 2 = ∅, and U 1 ≠ U 2 .
Otherwise, suppose that w 2 is reachable from w 1 . w 1 must have more than one child node, so w 1 has at least one child node w ′ 1 from which w 2 is not reachable. Therefore, the clade below w ′ 1 must be disjoint from all the clades in U 2 , since t is a tree. However, the clade below w ′ 1 is an element of U 1 , so U 1 ̸ = U 2 .
The assignment (w 1 , w 2 ) → (v w1 , v w2 ) from edges in the labeled tree to edges in the history sDAG is well-defined and injective: Either
• the union of the child clades of w 2 is a child clade of w 1 (in particular, the child clade under the child w 2 ), or
• w 2 is a leaf node and therefore one of the child clades of w 1 is {ℓ w2 }.
The assignment on edges is also injective, since the assignment on nodes is injective.
(V t , E t ) is a history sDAG: Since the assignments of nodes and edges in the labeled tree to nodes and edges in the complete DAG are well-defined, V t ⊂ V and E t ⊂ E. To finish showing that (V t , E t ) is a history sDAG, notice that each node-clade pair (v, C) has a descendant edge, namely the one to v w , where w is the parent node of the clade C in t. Also, each node is reachable from the UA node, since each node in t is reachable from w 0 .
Notice (V t , E t ) has the same tree structure and labels as t, by construction. Therefore, such a choice of history (V t , E t ) is uniquely determined by a labeled tree t.
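The node assignment in this proof, which sends each tree node to the pair of its label and the set of clades below its children, can be sketched recursively. The following is an illustrative sketch under stated assumptions, not the historydag implementation: a labeled tree is written as a nested (label, list of subtrees) tuple, the UA node and its single outgoing edge are omitted, and the function name to_history is hypothetical.

```python
def to_history(tree):
    """Map a labeled tree (label, [subtrees]) to (root node, edge set, clade).

    Each tree node w becomes the DAG node (label of w, {clade below each child}).
    """
    label, children = tree
    if not children:
        # A leaf maps to (label, empty subpartition); its clade is its label.
        return (label, frozenset()), set(), frozenset([label])
    edges, child_nodes, child_clades = set(), [], []
    for child in children:
        node, sub_edges, clade = to_history(child)
        child_nodes.append(node)
        child_clades.append(clade)
        edges |= sub_edges
    v_w = (label, frozenset(child_clades))
    edges |= {(v_w, node) for node in child_nodes}
    return v_w, edges, frozenset().union(*child_clades)


# Both internal nodes carry the same label "g", yet they map to distinct
# DAG nodes because their subpartitions differ.
tree = ("g", [("g", [("a", []), ("b", [])]), ("c", [])])
root_node, history_edges, leaf_labels = to_history(tree)
history_nodes = {n for edge in history_edges for n in edge}
```

The example reflects the injectivity argument above: two tree nodes with identical labels still receive different (label, subpartition) pairs, so the tree's five nodes and four edges are all preserved.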
Lemma 21. (Correspondence) The map from histories to labeled trees named in Lemma 19, and the map from labeled trees to histories named in Lemma 20, are inverses, up to label-preserving bijection on nodes. In particular, both maps name a bijective correspondence between histories in a history sDAG (V, E) on labels Y , and labeled trees with labels in Y .