1 Introduction

Here we develop a structure that can compactly represent and extend collections of phylogenetic trees with ancestral sequences mapped on the internal nodes. One motivation for this structure comes from uncertainty quantification in statistical phylogenetics, which is typically approached in one of two ways. Bayesian analysis attempts to characterize the posterior distribution of phylogenetic trees given data: the collection of trees that credibly explain the data, and their probabilities of being the generative tree. On the other hand, the phylogenetic bootstrap (Felsenstein 1985) resamples columns of the multiple sequence alignment, infers an optimal tree for each one of the resampled data sets, then aggregates features of the resulting trees.

Neither approach is tenable for very large and densely sampled data sets, such as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) collections. Traditional Bayesian analysis is often too slow to apply to these large data sets, and introduces many extra unknown model parameters in a signal-weak setting. Bootstrapping may remain fast enough when using recent approximations (Hoang et al. 2018), but has a different problem: it is common for well-established clades (supported by other data) to be supported on the sequence level by a single mutation, so the bootstrap support of the corresponding clade will exactly equal the frequency with which we draw that mutation in the bootstrap sample. Thus, the bootstrap underestimates support in this case (Wertheim et al. 2022).

Phylogenetic placement offers a different type of uncertainty estimate: an assessment of the level of certainty in inserting a new sequence into an existing phylogeny. However, these assessments of uncertainty are relative to a fixed reference tree. For SARS-CoV-2 this can be done in the UShER framework (Turakhia et al. 2021), in which this insertion procedure is used for iterative tree building. No attempt is made to characterize uncertainty of the complete tree in this framework.

The lack of uncertainty quantification may have consequences for interpretation of SARS-CoV-2 evolution. For example, the current practice for the PANGO nomenclature system (Rambaut et al. 2020) for SARS-CoV-2 does not require any sort of support estimation. A typical workflow involves placement and local tree construction. If a single tree does have high probability, this workflow is adequate. If not, lineage designations made without support estimates are potentially problematic.

We argue as follows that the diversity of maximally parsimonious trees on the data can be used to bound uncertainty from below. First, if there are more maximally-parsimonious explanations of the data, this decreases the probability that any one explanation is correct. For this reason, we expect there to be an inverse relationship between the number of maximally-parsimonious explanations of the data and the certainty of a given node or other feature in the tree. Furthermore, this inverse relationship should express a lower bound on the uncertainty because there are many other potential compelling trees that are not quite maximally parsimonious. In any case, analyzing even just the maximally parsimonious set of trees commonly involves so many trees that storing them individually and learning from them with existing techniques is computationally prohibitive. This is especially the case with parsimony analysis of large data sets, such as those for SARS-CoV-2 (Turakhia et al. 2021; Ye et al. 2022).

As a second motivation for our work, we also suggest that gathering a collection of maximally parsimonious trees could be helpful for Bayesian analysis. Although the parsimony criterion is of course not the same as likelihood, the two objectives are closely linked in the case where sequences are densely sampled relative to the amount of evolution (Thornlow et al. 2021). Previous work has shown how closely related sequences can greatly inflate the posterior distribution (Whidden and Matsen 2015), and a parsimony analysis would have revealed this inflation. Thus, we hope to use the collection of maximally parsimonious trees as an aid for designing proposal distributions, extending previous successful strategies (Zhang et al. 2020), and for quantifying exploration of tree space.

In this paper, we formalize a data structure called the history subpartition directed acyclic graph (a.k.a. history sDAG) to characterize the ensemble of maximally parsimonious trees for large data sets. This is related to the idea of characterizing the trees in a single optimal “terrace” in phylogenetic tree space, with respect to parsimony (Sanderson et al. 2011, 2015). We describe algorithms to build history sDAGs from internally labeled trees, collapse edges with no mutations, and trim history sDAGs to express only trees which are optimal according to general criteria, such as parsimony. Although history sDAG construction is not the same as uncertainty estimation, which would allow for some less-than-maximally-parsimonious trees, it is a first step in that direction. We provide a Python implementation with a flexible interface for the history sDAG as a container type for trees, endowed with abstract methods for convenient dynamic programs on the history sDAG structure, as well as all methods from this paper for manipulating history sDAGs constructed from maximally parsimonious trees. This implementation shows the effectiveness of the approach, efficiently recovering many orders of magnitude more equally parsimonious trees than were used to “seed” the history sDAG when applied to a SARS-CoV-2 data set.

1.1 Intuitive overview

Here we provide an intuitive overview of the definitions and concepts used in this paper. Formal definitions will be given in the sections following the overview.

This paper develops methods for understanding evolutionary relationships between samples from a population of closely related evolving entities, acknowledging uncertainty. We will focus on samples consisting of nucleotide sequences, but keep our language general to emphasize that other data such as sample time and geographic location could also be used.

One way to formalize evolutionary relationships among samples, and inferred ancestral states, is to arrange them in a rooted phylogenetic tree with leaf and internal node labels. Node labels in this tree can include data of the type associated with the given samples. Specifically, leaf nodes are labeled by samples, and interior nodes are labeled by inferred ancestral states. The set of samples which label leaves will be called the leaf labels. Interior node labels may be chosen from some larger label set which includes the leaf labels as a subset. Instead of directly using this notion of a rooted, internally labeled tree, we will define a more convenient object called a history, which holds the same data as such a tree. We will make the definition formal below, but a history may be thought of as a rooted, internally labeled tree (this object has been called other names in the past, including an ancestral scenario (Ishikawa et al. 2019)). For example, a history might be used to represent a phylogenetic tree in which all (internal and tip) nodes are labeled with DNA sequences.

In a history, a node’s clade is the set of labels of its descendant leaf nodes (we emphasize that internal node labels are excluded from the clade definition). A clade of a node’s child is a child clade. The child clades of a node form a partition of the node’s clade. We therefore call this set of child clades a node’s subpartition. Each edge in a history connects two nodes, each with a label and subpartition. As a formality convenient for this paper, each history will contain a universal ancestor (UA) node added as a parent of the root node.

Some histories explain the relationships between their leaf labels more plausibly than others. One common measure of optimality for a history labeled by nucleotide sequences is its parsimony score, which is the total number of nucleotide base changes along all edges in the history. A history is said to be maximally parsimonious if no other history on the same leaf labels has a lower parsimony score.
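As a minimal illustration of this score, the sketch below sums Hamming distances over the edges of a tree. The edge-list representation and the function names are illustrative assumptions only, not part of any package described here.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def parsimony_score(edges: List[Tuple[str, str]]) -> int:
    """Total number of base changes over (parent_label, child_label) edges."""
    return sum(hamming(parent, child) for parent, child in edges)


# Two hypothetical star-shaped trees on the leaf labels AA, AC, AT, AG,
# differing only in the inferred root sequence.
star_rooted_at_aa = [("AA", "AA"), ("AA", "AC"), ("AA", "AT"), ("AA", "AG")]
star_rooted_at_cc = [("CC", "AA"), ("CC", "AC"), ("CC", "AT"), ("CC", "AG")]

score_aa = parsimony_score(star_rooted_at_aa)  # 0 + 1 + 1 + 1
score_cc = parsimony_score(star_rooted_at_cc)  # 2 + 1 + 2 + 2
```

Here the tree rooted at AA scores 3 and the tree rooted at CC scores 7. Since the four leaves show four distinct bases at the second site, at least three changes are needed, so the first tree is maximally parsimonious.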

In general, there are many possible maximally parsimonious histories with leaves labeled by the same set of nucleotide sequences. We will use a structure called the history subpartition directed acyclic graph (history sDAG) to efficiently encode a large collection of histories (Fig. 1). The “history” modifier emphasizes that this structure encodes a collection of possible rooted evolutionary histories, each of which contains not only a tree structure, but also ancestral state labels.

The history sDAG consists of a collection of nodes, each associated with a combination of label and subpartition, and one formal universal ancestor (UA) node, which is denoted \(\rho \). As we will see later, edges exiting \(\rho \) keep track of the root nodes of the histories in the history sDAG.

A directed edge in a history sDAG represents an edge in a corresponding set of histories, from a parent node to a child node which have the same labels and subpartitions as the parent and child nodes of the edge in the DAG. Thus, the history sDAG structure records combinations of labels and subpartitions, and adjacencies between these combinations, in the corresponding collection of histories.

By using a carefully chosen definition of history, introduced in the next section, the history sDAG can easily be constructed as the graph union of a set of histories. These histories need not have identical leaf labels. Specifically, we think of each history as its own history sDAG, with each node annotated by its label and subpartition. The history sDAG constructed from the original set of histories is simply the union of nodes and edges in each history (Fig. 1). The history sDAG then contains as subgraphs at least those histories used to construct it.

Fig. 1

A history sDAG constructed from three internally labeled trees on the label set of sequences \(\left\{ AA, AC, AT, AG \right\} \). Each tree is converted to the equivalent history structure, and the union of these histories is the history sDAG. Each node in a history or the history sDAG consists of a label (in this case a sequence of two bases) shown in the top half of the node, and a subpartition, with each set in the subpartition separated by a vertical bar in the bottom half of the node. Leaf nodes have no children, so appear with only their label. Although in this example labels are length-two nucleotide sequences, the label set is arbitrary, and could include sequences, geographic location, or other information.

Any subgraph of the DAG which is a tree, includes exactly one edge descending from the UA node, and exactly one edge descending from each child clade of each of its nodes, is a history (Fig. 2). Each of the histories contained in a history sDAG represents a combination of substructures from the histories used to construct the history sDAG.

In addition to thinking of the history sDAG as a way of recording structures observed in a collection of histories, we can also think of it as a way of generating histories. In fact, the set of histories in the history sDAG is in general a superset of the set of histories used to construct the DAG (Fig. 3). These new histories result from combining subhistories from histories used to construct the history sDAG. This is similar to tree fusion, in which clades from different trees are combined to improve the parsimony score of the final tree (Goloboff 1999). The connection with tree fusion is explored further in the Discussion section.
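The way new histories arise can be made concrete by counting. Because a history takes one edge below the UA node and one edge below each child clade of each node, the number of histories below a node is a product over its child clades of a sum over the alternative children. The following is a minimal sketch under assumed conventions: nodes are `(label, subpartition)` tuples, the string `"UA"` stands in for the universal ancestor, and all names are hypothetical.

```python
from functools import lru_cache
from math import prod

UA = "UA"  # assumed stand-in for the universal ancestor node


def node(label, *child_clades):
    """Build a (label, subpartition) node; no clades gives a leaf."""
    return (label, frozenset(frozenset(c) for c in child_clades))


def clade_union(v):
    """Leaf: its own label. Otherwise: union of its child clades."""
    label, subpartition = v
    return frozenset([label]) if not subpartition else frozenset().union(*subpartition)


def count_histories(edges):
    """Count histories expressed by a history sDAG given as a set of edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    @lru_cache(maxsize=None)
    def below(v):
        # Product over child clades of the number of choices under each clade.
        return prod(
            sum(below(c) for c in children[v] if clade_union(c) == clade)
            for clade in v[1]
        )

    return sum(below(root) for root in children[UA])


leaves = {x: node(x) for x in "ACGT"}
root = node("A", {"A", "C"}, {"G", "T"})
left_a, left_c = node("A", {"A"}, {"C"}), node("C", {"A"}, {"C"})
right_g, right_t = node("G", {"G"}, {"T"}), node("T", {"G"}, {"T"})


def history(left, right):
    """Edge set of one history sharing the same root node."""
    return {(UA, root), (root, left), (root, right),
            (left, leaves["A"]), (left, leaves["C"]),
            (right, leaves["G"]), (right, leaves["T"])}


h1, h2 = history(left_a, right_g), history(left_c, right_t)
n_union = count_histories(h1 | h2)
```

In this toy example each input contributes one history, while their union expresses four: the two inputs plus the two obtained by exchanging the substructures below the shared root.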

Fig. 2

A history sDAG on label set \(\left\{ TT, CC, GA, AA, CA, GG, AG, CG\right\} \), with a history structure highlighted in red (left) and a labeled tree corresponding to that history (right) (color figure online)

Fig. 3

The history sDAG can express more histories than were used to construct it. The history sDAG in b is constructed from the two internally labeled trees in a, and represents the four internally labeled trees in a and c. Notice that the trees in c are not among the trees used to construct the sDAG, but result from swapping the substructures highlighted in green and orange in a (color figure online)

As described above, exploring phylogenetic uncertainty by examining many maximally parsimonious histories requires an efficient way to store and compute on those histories. The history sDAG provides a compact structure for storing collections of histories, but in general contains histories beyond those used to build the DAG. We therefore encounter a key question: are these additional histories also maximally parsimonious?

In the following two sections, we will show that maximum parsimony is in fact preserved by the history sDAG. Theorem 1 shows that any history expressed by a history sDAG constructed from maximally parsimonious histories must itself be maximally parsimonious. To achieve this we must first show that swapping certain substructures between histories preserves maximum parsimony. Then we will show that the collection of histories in the history sDAG is closed under these subhistory swaps, and that any history in the history sDAG can be obtained by such a subhistory swap involving histories used to construct the DAG. This means that the history sDAG is not only an effective way to store many maximum parsimony histories, but also may allow us to very quickly discover more such histories.

Preservation of maximum parsimony in the history sDAG has two important consequences:

  • A history sDAG constructed from maximally parsimonious histories will contain only maximally parsimonious histories. If a set T of histories with the same parsimony score is used to construct a history sDAG, and if that history sDAG expresses a history with any other parsimony score, then T must not have contained maximum parsimony histories.

  • It is always possible to trim an arbitrary history sDAG to express all of, and only, its maximally parsimonious histories. In particular, a new history sDAG constructed from the maximally parsimonious histories represented by the original history sDAG will contain only those histories used to construct it.
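One way to picture the trimming described in the second consequence is a two-pass dynamic program: first compute the minimum weight achievable below every node, then keep only edges that attain the minimum for their node-clade pair. The sketch below is an illustration under stated assumptions, not necessarily the procedure used in our implementation: weights are integers, edges leaving the UA node are assigned weight zero, and every name is hypothetical.

```python
UA = "UA"  # assumed stand-in for the universal ancestor node


def node(label, *child_clades):
    return (label, frozenset(frozenset(c) for c in child_clades))


def clade_union(v):
    label, subpartition = v
    return frozenset([label]) if not subpartition else frozenset().union(*subpartition)


def edge_weight(parent, child):
    # Assumption: edges leaving the UA node carry zero weight.
    if parent == UA:
        return 0
    return sum(x != y for x, y in zip(parent[0], child[0]))


def trim_to_min_weight(edges, f):
    """Return (kept_edges, minimum_weight) for a history sDAG edge set."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    def groups(v):
        # Alternatives compete within a child clade (all UA children compete).
        kids = children.get(v, [])
        if v == UA:
            return [kids]
        return [[c for c in kids if clade_union(c) == clade] for clade in v[1]]

    best = {}

    def min_below(v):
        if v not in best:
            best[v] = sum(min(f(v, c) + min_below(c) for c in g) for g in groups(v))
        return best[v]

    kept, seen, stack = set(), {UA}, [UA]
    while stack:
        v = stack.pop()
        for g in groups(v):
            target = min(f(v, c) + min_below(c) for c in g)
            for c in g:
                if f(v, c) + min_below(c) == target:
                    kept.add((v, c))
                    if c not in seen:
                        seen.add(c)
                        stack.append(c)
    return kept, min_below(UA)


leaf_a, leaf_c = node("A"), node("C")
roots = {x: node(x, {"A"}, {"C"}) for x in "ACG"}
edges = set()
for r in roots.values():
    edges |= {(UA, r), (r, leaf_a), (r, leaf_c)}

kept, min_weight = trim_to_min_weight(edges, edge_weight)
```

In this example the roots labeled A and C each give weight 1 and are retained, while the root labeled G gives weight 2 and is removed together with its edges.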

Throughout this paper we will refer to maximally parsimonious histories using the more general term minimum-weight histories, since maximum parsimony is characterized by minimizing the sum of a weight over all edges in a history. Indeed, this term is more general because we can use weight functions that are more complex than simply the sum of the number of mutations, or which consider label data other than nucleotide sequences.

We provide an implementation of the history sDAG and related algorithms in the open source Python package historydag, installable with pip and available at https://github.com/matsengrp/historydag. This package provides methods for constructing, trimming, collapsing, and extracting histories from the history sDAG as described in the following sections. historydag also implements methods which we will describe in future work, for efficiently calculating weights of histories represented in the history sDAG, and for expressing and sampling from a probability distribution on histories in the history sDAG.

For reference, we provide a summary of notation in Table 1.

Table 1 Notation used in the text

2 Histories and the history sDAG

We will now provide a formal definition of histories and the history sDAG.

Let Y refer to a set of labels, such as nucleotide sequences. We can think of observed labels as a set \(X\subset Y \), labeling history leaves. We will not emphasize this set of leaf labels X, since a history sDAG may express collections of histories with varying leaf label sets. In the case of parsimony however, we will be interested in collections of histories which share a leaf label set consisting of observed nucleotide sequences.

We are interested in representing collections of rooted, multifurcating, non-unifurcating trees with nodes (including internal nodes) labeled by elements of Y. As mentioned in the Overview, we will make this easy by carefully defining histories.

Isomorphism classes of such internally labeled trees are in bijection with histories, as defined below. This correspondence is shown formally in Appendix A, but sufficient intuition may be found in Fig. 1.

Let Y be a set of labels, and let \({\mathcal {P}}(\cdot )\) denote the power set.

Definition 1

Let \({\text {Part}}(Y) \) be the set of all \(U\subset {\mathcal {P}}(Y){\setminus } \left\{ \emptyset \right\} \) such that,

  • for \(C_1, C_2\in U \), if \(C_1 \ne C_2 \) then \(C_1\cap C_2 = \emptyset \)

  • \(|U| \ne 1 \).

That is, \({\text {Part}}(Y) \) contains \(\emptyset \) and all sets of two or more nonempty, disjoint subsets (clades) of Y.
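The conditions of Definition 1 translate directly into a membership test. The following is a small sketch with a hypothetical function name; it treats the candidate U as a set of clades, so repeated clades are merged before checking.

```python
from itertools import combinations


def in_part(U, Y):
    """Check whether U, an iterable of clades, is an element of Part(Y)."""
    clades = {frozenset(c) for c in U}
    if len(clades) == 1:
        return False  # |U| must not equal 1
    if any(not c or not c <= Y for c in clades):
        return False  # clades must be nonempty subsets of Y
    # Distinct clades must be disjoint.
    return all(a.isdisjoint(b) for a, b in combinations(clades, 2))


Y = {"AA", "AC", "AT", "AG"}
empty_ok = in_part([], Y)                        # the empty subpartition
single_clade = in_part([{"AA"}], Y)              # excluded: exactly one clade
two_clades = in_part([{"AA"}, {"AC", "AT"}], Y)  # need not cover all of Y
overlapping = in_part([{"AA", "AC"}, {"AC"}], Y)
```

Note that the clades of an element of \({\text {Part}}(Y) \) are not required to cover Y, which is why the third example is accepted.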

Given a set of leaf labels \(X\subset Y \), \({\text {Part}}(X) \) would contain all of the possible subpartitions of leaf labels in an internally labeled tree with leaves labeled by X. Notice that \({\text {Part}}(X) \subset {\text {Part}}(Y) \) for any such \(X\subset Y \). Since a history sDAG may contain histories with varying leaf label sets, elements of \({\text {Part}}(Y) \) are used to construct general history sDAG nodes.

We will see that with the exception of a universal ancestor node, all nodes in the history sDAG structure consist of a label \(\ell \in Y\) and a subpartition \(U\in {\text {Part}}(Y) \).

Definition 2

A node-clade pair is a node \((\ell , U) \) and a choice of child clade \(C\in U \).

Definition 3

A history sDAG with labels Y is a directed graph \((V, E)\) consisting of

  • A node set \(V\subset \left( Y\times {\text {Part}}(Y)\right) \cup \left\{ \rho \right\} \) such that \(\rho \in V \) is the universal ancestor (UA) node. For a node \(v = (\ell , U)\in V \), \(v \ne \rho \), we say that v’s label is \(\ell \), its subpartition is U, its child clades are elements of U, and its clade union \({\text {CU}}(v) \) is \(\left\{ \ell \right\} \) if \(U = \emptyset \), or \(\bigcup \nolimits _{C\in U} C \) otherwise.

  • A directed edge set \(E\subset V\times V \) containing edges \(e = (v_1, v_2) \) from a parent node \(v_1 \) to a target or child node \(v_2 \) such that

    1. All nodes are reachable from the UA node \(\rho \), which itself accepts no incoming edges.

    2. For any edge whose parent node is not \(\rho \), the clade union of the target node must be in the subpartition of the parent node.

       Formally, for any edge \(e = \left( (\ell _1, U_1), (\ell _2, U_2) \right) \in E \), if C is the clade union of \((\ell _2, U_2) \), then \(C\in U_1 \).

       We say then that the edge e descends from the node-clade pair \(\left( (\ell _1, U_1), C \right) \).

    3. For each node \(v = (\ell , U) \), and for each choice of child clade \(C\in U \), at least one edge descends from the node-clade pair \((v, C)\).

Notice that by requirements (1) and (3) in the definition of the history sDAG, all nodes in the history sDAG must have descendant edges, except for those of the form \((\ell , \emptyset ) \). We will refer to these as leaf nodes.
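The three edge requirements of Definition 3 can be checked mechanically. The sketch below assumes nodes are `(label, subpartition)` tuples with the string `"UA"` standing in for \(\rho \), and it assumes each subpartition is already a valid element of \({\text {Part}}(Y) \); the function names are hypothetical.

```python
UA = "UA"  # assumed stand-in for the universal ancestor node


def clade_union(v):
    label, subpartition = v
    return frozenset([label]) if not subpartition else frozenset().union(*subpartition)


def is_history_sdag(nodes, edges):
    """Check the edge requirements (1)-(3) of Definition 3."""
    if UA not in nodes or any(child == UA for _, child in edges):
        return False  # UA must be present and accept no incoming edges
    if any(p not in nodes or c not in nodes for p, c in edges):
        return False
    # Requirement 2: the target's clade union is a child clade of the parent.
    for parent, child in edges:
        if parent != UA and clade_union(child) not in parent[1]:
            return False
    # Requirement 3: every node-clade pair has at least one descending edge.
    for v in nodes:
        if v == UA:
            continue
        covered = {clade_union(c) for p, c in edges if p == v}
        if not v[1] <= covered:
            return False
    # Requirement 1: every node is reachable from the UA node.
    reached, stack = {UA}, [UA]
    while stack:
        v = stack.pop()
        for p, c in edges:
            if p == v and c not in reached:
                reached.add(c)
                stack.append(c)
    return reached == set(nodes)


leaf_a, leaf_c = ("A", frozenset()), ("C", frozenset())
root = ("A", frozenset({frozenset({"A"}), frozenset({"C"})}))
nodes = {UA, root, leaf_a, leaf_c}
edges = {(UA, root), (root, leaf_a), (root, leaf_c)}

valid = is_history_sdag(nodes, edges)
missing_edge = is_history_sdag(nodes, edges - {(root, leaf_c)})
```

Removing the edge to the leaf labeled C violates requirement (3) for the root, so the second check fails.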

Observation 1

Since only nodes of the form \((\ell , \emptyset ) \) may have no children, all leaf nodes in a history sDAG must be of this form, and therefore no two leaf nodes may be labeled by the same element of Y.

Observation 2

For any history sDAG edge \((v_1 = (\ell _1, U_1), v_2) \), we know that \({\text {CU}}(v_2) \subset {\text {CU}}(v_1) \), since \({\text {CU}}(v_2) \in U_1 \). More generally, consider a history sDAG \((V, E)\), in which a node \(v' \) is reachable from another node v via a sequence of edges in E. By transitivity of inclusion, \({\text {CU}}(v') \subset {\text {CU}}(v) \).

Definition 4

A history is a history sDAG in which the UA node \(\rho \) has a unique child node, and each node-clade pair has exactly one descendant edge.

The set of labels of the leaf nodes in a history t will be denoted \(L(t)\).
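Definition 4 and the leaf label set can be illustrated with a short check. This sketch assumes its input already satisfies Definition 3, represents nodes as `(label, subpartition)` tuples with `"UA"` standing in for \(\rho \), and uses hypothetical function names.

```python
from collections import Counter

UA = "UA"  # assumed stand-in for the universal ancestor node


def clade_union(v):
    label, subpartition = v
    return frozenset([label]) if not subpartition else frozenset().union(*subpartition)


def is_history(nodes, edges):
    """Definition 4 check, assuming (nodes, edges) is a history sDAG."""
    if sum(1 for p, _ in edges if p == UA) != 1:
        return False  # the UA node must have a unique child
    descending = Counter((p, clade_union(c)) for p, c in edges if p != UA)
    return all(
        descending[(v, clade)] == 1
        for v in nodes if v != UA
        for clade in v[1]
    )


def leaf_labels(nodes):
    """L(t): labels of nodes with an empty subpartition."""
    return {v[0] for v in nodes if v != UA and not v[1]}


part = frozenset({frozenset({"A"}), frozenset({"C"})})
leaf_a, leaf_c = ("A", frozenset()), ("C", frozenset())
root_a, root_c = ("A", part), ("C", part)

nodes_a = {UA, root_a, leaf_a, leaf_c}
edges_a = {(UA, root_a), (root_a, leaf_a), (root_a, leaf_c)}
nodes_c = {UA, root_c, leaf_a, leaf_c}
edges_c = {(UA, root_c), (root_c, leaf_a), (root_c, leaf_c)}

single = is_history(nodes_a, edges_a)
union = is_history(nodes_a | nodes_c, edges_a | edges_c)
```

Each input is a history, but their union is not, because the UA node then has two children.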

Notice that not every element of Y must appear as a node label in a history or history sDAG. That is, Y is an ambient label set, such as the set of all nucleotide sequences of a fixed length, from which history sDAG node labels can be chosen.

Also notice that there is no distinction between leaf node and internal node labels. In practice, the set of leaf node labels will be associated with a set of observed evolving entities. When a sampled entity is inferred to be an ancestor of other sampled entities, we can represent this in a history with an internal node carrying the label corresponding to the sampled ancestor.

Informally, a labeled tree can be converted to a history by annotating each node with its subpartition, and adding a UA node as a parent of the root node (Fig. 1). The unique child of the UA node in a history will be called the root node, since it represents the root node of a corresponding internally labeled tree.
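This informal conversion can be sketched in a few lines. The nested `(label, [children])` tree encoding, the `"UA"` placeholder, and the function names are assumptions made for illustration.

```python
UA = "UA"  # assumed stand-in for the universal ancestor node


def clade_union(v):
    """Leaf: its own label. Otherwise: union of its child clades."""
    label, subpartition = v
    return frozenset([label]) if not subpartition else frozenset().union(*subpartition)


def tree_to_history(tree):
    """Convert a nested (label, [children]) tree into (nodes, edges)."""
    nodes, edges = {UA}, set()

    def visit(subtree):
        label, children = subtree
        child_nodes = [visit(child) for child in children]
        # Annotate the node with the set of its children's clade unions.
        v = (label, frozenset(clade_union(c) for c in child_nodes))
        nodes.add(v)
        edges.update((v, c) for c in child_nodes)
        return v

    root = visit(tree)
    edges.add((UA, root))  # the UA node becomes the parent of the root
    return nodes, edges


# Hypothetical internally labeled tree on leaves AA, AC, AT, AG.
tree = ("AA", [("AA", []), ("AC", []), ("AT", [("AT", []), ("AG", [])])])
nodes, edges = tree_to_history(tree)
root = ("AA", frozenset({frozenset({"AA"}), frozenset({"AC"}),
                         frozenset({"AT", "AG"})}))
```

The resulting history has seven nodes, including the UA node, and six edges, and its root node carries the subpartition with child clades {AA}, {AC}, and {AT, AG}.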

The natural substructure of a history is analogous to a subtree of a labeled tree, and will be very useful in later sections.

Definition 5

Given a history sDAG \((V, E)\), a subgraph \(s = (V_s, E_s) \) with \(V_s \subset V \) and \(E_s \subset E \) is a subhistory of \((V, E)\) if

  1. \(\rho \notin V_s \),

  2. there exists a root node \(v_r\in V_s \) such that all other nodes in \(V_s \) are reachable from \(v_r \), and

  3. each node-clade pair in s has exactly one descendant edge.

The set of labels of leaf nodes in a subhistory s is denoted \(L(s)\).

Later we will establish formally that a history is in fact a tree. Given that fact, naming a subhistory is equivalent to removing an edge from a history, and discarding the component which contains the UA node.

In addition to the UA node, the definition of history contains redundant information in the sense that the subpartition of a node, formally a piece of data associated with each node, can be recovered as the set of sets of labels of leaf nodes reachable from that node’s children. Although this choice may seem an unnecessary complication, it is essential in distinguishing histories contained in a larger history sDAG. This redundancy is shown in the following lemma, which is proven in Appendix A:

Lemma 3

Let \((V, E)\) be a history sDAG or subhistory, and let \(v\in V \). The set of labels of leaf nodes reachable from v is \({\text {CU}}(v) \).

Lemma 3 implies that a history’s set of leaf labels is determined by the subpartition of its root node. This will be relevant later, when we describe what it means for a history to be found in a history sDAG.

We intend for a history to be tree-shaped, but this is not assumed by the definition given. Lemma 3 also allows us to prove this essential fact.

Lemma 4

A history sDAG \((V, E)\) is a history if and only if it is a tree, and contains exactly one edge descending from \(\rho \).

The proof of this lemma is given in Appendix A.

Notice that since elements of \({\text {Part}}(Y) \) may not contain exactly one clade, and since nodes in a history have exactly one child node per child clade, no node (other than the UA node) in a history may have exactly one child. This is required to ensure that the history sDAG may not contain cycles. Although this is not stated in Definition 3, it is an important property of the history sDAG as the name suggests, and is proven in Appendix A:

Lemma 5

A history sDAG \((V, E)\) is acyclic.

Sometimes data sets include a fixed root node label, such as a common ancestor sequence. A search for minimum weight labeled histories explaining such a data set may yield labeled histories with a unifurcation at the root node. We accommodate this by considering the fixed root sequence a leaf node label, and placing the corresponding leaf node as an additional child of the root node.

Since histories are tree-shaped history sDAGs, we can store collections of histories by taking their graph union. However, we should first verify that a graph union of history sDAGs is itself a history sDAG.

Lemma 6

Let \((V, E)\) and \((V', E') \) be history sDAGs on labels Y. Then \((V\cup V', E\cup E') \) is also a history sDAG.

Proof

All the nodes and edges required to satisfy Definition 3 are present in \((V\cup V', E\cup E') \), since every node and edge of each original history sDAG is retained in the union. All nodes are reachable from the UA node \(\rho \), through exactly the same sequence of edges by which they were reachable in at least one of the original history sDAGs. \(\square \)

Definition 6

For a set T of histories with labels in Y, the history sDAG constructed from T is the graph union of the histories in T:

$$\begin{aligned} \left( \bigcup _{(V, E) \in T}V, \bigcup _{(V, E) \in T} E \right) \end{aligned}$$
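Because nodes are identified by their label and subpartition, the graph union in Definition 6 reduces to plain set union. The following is a minimal sketch assuming nodes are hashable `(label, subpartition)` tuples and `"UA"` stands in for \(\rho \); the helper names are hypothetical.

```python
UA = "UA"  # assumed stand-in for the universal ancestor node


def node(label, *child_clades):
    return (label, frozenset(frozenset(c) for c in child_clades))


def history_sdag_from(histories):
    """Graph union of an iterable of (nodes, edges) histories."""
    nodes, edges = set(), set()
    for history_nodes, history_edges in histories:
        nodes |= history_nodes
        edges |= history_edges
    return nodes, edges


leaf_a, leaf_c, leaf_g = node("A"), node("C"), node("G")
root = node("A", {"A"}, {"C", "G"})


def history(inner_label):
    """A history whose inner node above leaves C and G has the given label."""
    inner = node(inner_label, {"C"}, {"G"})
    nodes = {UA, root, leaf_a, leaf_c, leaf_g, inner}
    edges = {(UA, root), (root, leaf_a), (root, inner),
             (inner, leaf_c), (inner, leaf_g)}
    return nodes, edges


nodes, edges = history_sdag_from([history("C"), history("G")])
```

The two histories share the UA node, the root, and all three leaves, so the union has seven nodes and eight edges rather than twelve and ten.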

We should also formalize the way in which a history sDAG contains histories. To do so, we will need to define a trim, which is a history sDAG which appears as a substructure in a larger history sDAG.

Definition 7

Let \((V, E)\) be a history sDAG on labels Y. Then \((V', E') \) is a trim of \((V, E)\) if \(V' \subset V \), \(E'\subset E \), and \((V', E') \) is a history sDAG on labels Y. We say a history \(t = (V'', E'') \) is in the history sDAG \((V, E)\) if \((V'', E'') \) is a trim of \((V, E)\).

The collection of histories in the history sDAG constructed from a collection of histories T will be denoted \(D(T)\).

We can now see why we must specify in Definition 4 that \(\rho \) has exactly one child node in a history. Edges descending from the UA node in a history sDAG keep track of which DAG nodes are allowed to be root nodes. It may be possible to choose a tree-shaped trim of a history sDAG in which two nodes \(v_1 \) and \(v_2 \) are children of \(\rho \), and \({\text {CU}}(v_1) \cap {\text {CU}}(v_2) = \emptyset \). Such a structure should be considered a trim containing two histories, but is not itself a history.

Any history sDAG should be uniquely determined by the collection of histories it contains. This intuition motivates the following two lemmas, which are proven in Appendix A:

Lemma 7

Let \((V, E)\) be a history sDAG. For any \(v\in V \), there exists a subhistory s in \((V, E)\) whose root node is v.

Lemma 8

Let \((V, E)\) be a history sDAG, and let T be the collection of histories in \((V, E)\). Then \((V, E)\) is the history sDAG constructed from T.

Finally, we will need to define the largest possible history sDAG constructed using a given set of labels.

Definition 8

The complete history sDAG on labels Y is the history sDAG which contains all possible edges on all nodes allowed by the choice of Y.

Equivalently, the complete history sDAG could be constructed as the graph union of all possible histories with labels in Y.

2.1 History weights

In this section we will define a general scheme for assigning weights to histories, and describe the relationship between these weights and the structure of the history sDAG.

As shown in Fig. 3, the history sDAG in general contains more histories than were used to construct it. These extra histories arise because the history sDAG allows subhistories to be exchanged between the histories it contains, whenever the subhistories’ parent nodes share the same child clades and node label. We refer to this operation as subhistory swapping. Appendix A describes these subhistory swaps precisely, shows that all new histories in a history sDAG can be described in terms of sequences of these subhistory swap operations, and provides the proof of Theorem 1, which involves an argument that these subhistory swaps preserve history weights.

This section will leave the details of subhistory swaps, and the proof of Theorem 1, to the Appendix, and only build the background necessary to state and understand Theorem 1.

We begin by defining another useful type of history substructure.

Definition 9

Let \((V, E)\) be a history sDAG and let \(s = (V_s, E_s) \) be a subhistory of \((V, E)\). Also, let \(v_r \) be the root node of the subhistory s, and let \(v\in V \) be a parent node of \(v_r \), so that \((v, v_r) \in E \). Then the augmented subhistory \(s^{v} \) is the subgraph \(\left( V_s\cup \left\{ v\right\} , E_s \cup \left\{ (v, v_r) \right\} \right) \) of \((V, E)\) consisting of the subhistory s plus the parent node v and the edge connecting v to \(v_r \).

Definition 10

Let \((V, E)\) be a history sDAG, and \(v = (\ell , U) \in V \) a node. We make the following definitions.

  • \({\text {Ch}}(v):= \left\{ v_c \mid (v, v_c) \in E \right\} \) will denote the set of children of v

  • \({\text {Ch}}(v, C):= \left\{ v_c \mid (v, v_c)\in E,\ {\text {CU}}(v_c) = C \right\} \) will denote the set of children of the node-clade pair \((v, C)\) for each clade \(C\in U\)

  • \({\text {B}}(v) \) will denote the set of subhistories in \((V, E)\) rooted at v.

Although we are interested primarily in computing parsimony on histories labeled with nucleotide sequences, we will do so within a much more general framework of history weights.

Definition 11

Let \((V, E)\) be the complete history sDAG on labels Y, and let \(f:E\rightarrow W \) be an edge weight function to a weight set W endowed with addition and containing an additive identity \(0\in W \). The weight of any subgraph \((V', E') \) of \((V, E)\) is then given by the weight function \(g_f\)

$$\begin{aligned} g_f\left( (V', E') \right) = \sum \limits _{e \in E'} f(e). \end{aligned}$$

In particular, since any history t in \((V, E)\) is a subgraph of \((V, E)\), the weight of t is given by \(g_f(t) \).

In the case of parsimony, the label set Y will contain sequences, the function f is Hamming distance, and \(g_f \) will compute the parsimony score of a history. A history’s parsimony score is decomposable as a sum of an edge weight function over edges only when complete, unambiguous nucleotide sequences are accessible to that weight function as node label data. If nucleotide sequences of internal nodes are not contained in node label data, the contribution of an edge to a history’s parsimony score may be dependent on the structure of the rest of the history, making the decomposition impossible. In particular, the edge weight function f is required to be a function on all possible history sDAG edges, which correctly reports the contribution of an edge to the weight of any history which contains it.
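The weight function \(g_f \) of Definition 11 can be sketched generically, with the edge weight function and additive identity passed as parameters. In this illustration the function names are hypothetical, nodes are `(label, subpartition)` tuples, and the choice to give edges leaving the UA node zero Hamming weight is an assumption made so that such edges do not affect the parsimony score.

```python
from functools import reduce
import operator

UA = "UA"  # assumed stand-in for the universal ancestor node


def hamming_edge_weight(edge):
    """Edge weight f for parsimony: Hamming distance between node labels."""
    parent, child = edge
    if parent == UA:
        return 0  # assumption: the UA node contributes no mutations
    return sum(x != y for x, y in zip(parent[0], child[0]))


def subgraph_weight(edges, f, zero=0):
    """g_f: add f(e) over all edges, starting from the additive identity."""
    return reduce(operator.add, (f(e) for e in edges), zero)


leaf_ac, leaf_at = ("AC", frozenset()), ("AT", frozenset())
root = ("AA", frozenset({frozenset({"AC"}), frozenset({"AT"})}))
history_edges = {(UA, root), (root, leaf_ac), (root, leaf_at)}

parsimony = subgraph_weight(history_edges, hamming_edge_weight)
edge_count = subgraph_weight(history_edges, lambda e: 1)
```

With the Hamming edge weight the example history has parsimony score 2. Swapping in a different `f`, here one that counts edges, changes the notion of weight without changing the summation.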

Although our focus here is parsimony, notice that this framework allows much more general notions of history weight, including situations where the function f is sensitive to edge direction or subpartitions, or takes values in a non-numeric set, such as a set of sequences. These generalizations will be important for future applications. For example, we could compute a branching process likelihood like that used by the gctree project, whose value can be decomposed over tree edges, and which can be summarized by a pair of integers (DeWitt et al. 2018).

To compare weights of histories, the weight set W must admit a total ordering. This ordering will be required to respect addition on W, in a slightly weaker sense than is generally meant:

Definition 12

A weight set W, endowed with addition, is clade-ordered with respect to some edge weight function f and history sDAG \((V, E)\) on labels Y if

  • The ordering on W respects addition and is a total ordering on all of the following subsets of W:

    ◦ Sets of weights of subhistories below any node: \(\left\{ g_f(s) \mid s\in {\text {B}}(v) \right\} \), for any \(v \in V {\setminus } \left\{ \rho \right\} \),

    ◦ Sets of weights of augmented subhistories below any node-clade pair: \(\left\{ g_f(s^v) \mid s\in {\text {B}}(v_c),\ v_c\in {\text {Ch}}(v, C)\right\} \), for any \(v = (\ell , U) \in V{\setminus } \left\{ \rho \right\} \), and any \(C\in U \).

  • The ordering on W is a total ordering on the set of weights of histories:

    $$\begin{aligned} \left\{ g_f(t) \mid t\text { is a history in } (V, E) \right\} \subset W \end{aligned}$$

We say that the ordering on W respects addition on a set \(W'\subset W \) if for all \(a, b\in W' \) and for all \(c\in W \), \(a < b \) if and only if \(a + c < b + c \).

The following observation makes this definition easier to use.

Observation 9

Let W be a weight set which is clade-ordered with respect to a history sDAG \((V, E)\) and edge weight function f. If \((V', E') \) is a trim of \((V, E)\), and \(f': E' \rightarrow W\) is equal to f restricted to \(E' \), then W is also clade-ordered with respect to \(f' \) and \((V', E') \).

For example, it may often be most convenient to argue that a weight set is clade-ordered with respect to the complete history sDAG on the label set Y, and a weight function defined on all possible edges in that history sDAG.

However, this is a strictly stronger condition on W, which is why the definition of clade-ordering is stated with respect to a particular history sDAG.

Through the rest of this section, the label set Y will be fixed, and it will be assumed that f is an edge weight function mapping into W, a weight set which is clade-ordered with respect to f.

Finally, we can describe exactly the sense in which the history sDAG preserves history weights, a property depicted in Fig. 4.

Fig. 4

Theorem 1 states that if a history sDAG is built from a collection of histories which all have weight K, then either the resulting sDAG must contain only histories of weight K, or there must be histories with weights greater and less than K. In either case the resulting history sDAG may contain more histories than were used to build it. Corollary 1.1 observes that since no history on a given leaf set can achieve a parsimony score lower than that of a maximally parsimonious history, a history sDAG built from maximally parsimonious histories must contain only maximally parsimonious histories.

Theorem 1

Let T be a collection of histories, so that \(g_f(t) = K \) for all \(t\in T \). Then there exists a history \(t\in D(T) \) with \(g_f(t) < K \) if and only if there exists a history \(t'\in D(T) \) with \(g_f(t') > K \).

Theorem 1 is the motivation for and main result of this section, guaranteeing that a history sDAG constructed from minimum weight histories will only express minimum weight histories, and is proven in Appendix A.

However, since it may be impractical to verify that a collection of histories are minimum weight relative to all other possible histories on a chosen label set, Theorem 1 will often be more useful when applied in the form of the following corollary, that any history sDAG may be trimmed to express exactly its minimum weight histories, relative only to the other histories in that history sDAG.

Corollary 1.1

Let \((V, E)\) be a history sDAG, and let f be an edge weight function as defined previously. Then there exists a history sDAG \((V', E') \) which is a trim of \((V, E)\) such that the histories in \((V', E') \) are exactly the minimum weight histories in \((V, E)\) with respect to f.

Proof

Let T be the collection of histories expressed by \((V, E)\), so that \(D(T) = T \). Let K be the minimum weight achieved by \(g_f \) on T, and let \(T' \subset T \) be the set of minimum weight histories:

$$\begin{aligned} T' = \left\{ t\in T \,\big \vert \,g_f(t) = K \right\} \end{aligned}$$

We know that \(T'\subseteq D(T') \), so we need only show that \(T'\supseteq D(T') \). Since \(T' \subseteq T \), we know that \(D(T') \subseteq D(T) = T\), and that since T is the collection of histories in \((V, E)\), there exists no history \(t\in D(T') \) with \(g_f(t) < K \). Therefore, by Theorem 1, there exists no \(t\in D(T') \) with \(g_f(t) > K \). Since \(T' \) contains all the histories in T with weight K, we therefore know that \(T' = D(T') \). Let \((V', E') \) be the history sDAG constructed from \(T' \). Since \(T' = D(T') \), the history sDAG \((V', E') \) contains exactly the histories in \(T' \). Also, because \((V', E') \) is a graph union of histories in \((V, E)\), we know that \(V'\subset V \) and \(E'\subset E \). Therefore \((V', E') \) is the trim of \((V, E)\) that we seek. \(\square \)

We shall take a small excursion now, in which we return to the setting of maximum parsimony which motivates these methods. It makes little sense to minimize parsimony score over the set of all histories with labels in an ambient sequence set Y. Rather, one attempts to minimize parsimony score subject to the constraint that history leaves are labeled by some fixed set of observed nucleotide sequences.

Definition 13

Let T be a set of histories with labels in Y. We say that histories in T have a fixed set \(X\subset Y \) of leaf labels if \(L(t) = X \) for all \(t\in T \).

Given an edge-weight function f and a set \(X\subset Y \), we say that a history t with \(L(t) = X \) is minimum weight relative to all histories on the fixed set of leaf labels X if \(g_f(t) \le g_f(t') \) for all histories \(t' \) with \(L(t') = X\).

In the general language of this section, a history t with nucleotide sequence labels is maximally parsimonious if it is minimum weight relative to all histories on the fixed leaf label set L(t) , with Hamming distance as the edge-weight function.

The following observation guarantees that Theorem 1 and Corollary 1.1 are useful in this setting.

Observation 10

Let T be a set of histories with a fixed set of leaf labels \(X\subset Y \). Then for any \(t\in D(T) \), \(L(t) = X \).

The truth of this observation can be argued precisely using the lemmas in Appendix A supporting the proof of Theorem 1, but is apparent from Definition 3 and Fig. 1.

This means that given a set T of maximally parsimonious histories on a fixed set of leaf labels X, D(T) must only contain histories with leaves labeled by X. By Theorem 1 then, D(T) must only contain histories which are maximally parsimonious on leaf labels X.

If T contains histories on a fixed label set X which are not necessarily maximally parsimonious, Observation 10 ensures that trimming the history sDAG constructed from T as in Corollary 1.1 will result in a new history sDAG which expresses histories with the same fixed set of leaf labels X.

2.2 Trimming the history sDAG

Here we describe a straightforward method for trimming a history sDAG to represent only its minimum-weight histories. Corollary 1.1 guarantees that merging only the minimum-weight histories in a history sDAG will result in a new history sDAG containing only those histories, but provides no efficient method for producing this trimmed history sDAG. The method described here involves removing all edges which point to suboptimal subhistories, and can be realized in two traversals of the history sDAG.

Definition 14

Let \((V, E)\) be a history sDAG on labels Y, and let f be an edge-weight function \(f: E\rightarrow W \) for W a weight set which is clade-ordered with respect to f and \((V, E)\).

The minimum weight of an augmented subhistory beneath a node \(v=(\ell , U) \in V \) and a clade \(C\in U \) is given by \(M_f(v, C) \), defined as

$$\begin{aligned} M_f(v, C) = \min \left\{ g_f(s^v) \,\big \vert \,v_c \in {\text {Ch}}(v, C), s \in {\text {B}}(v_c) \right\} . \end{aligned}$$

Also let \(M_f(v) \) report the minimum weight of any subhistory rooted at the node \(v=(\ell , U) \), and for any leaf node \(v'\in V \), let \(M_f(v') \) be the additive identity of W.

Notice that because W is clade-ordered, \(M_f(v) \) can be computed as

$$\begin{aligned} M_f(v) = \sum _{C\in U} M_f(v, C). \end{aligned}$$
(1)

That is, the minimum weight of a subhistory beneath a node is given by the sum over clades of the minimum weight achieved by an augmented subhistory below each clade.

Notice that the clade-ordering on W also allows us to compute \(M_f(v, C) \) more easily, as

$$\begin{aligned} M_f(v, C) = \min \left\{ M_f(v_c) + f(v, v_c) \,\big \vert \,v_c \in {\text {Ch}}(v, C)\right\} . \end{aligned}$$

With Eq. 1, this defines an efficient dynamic program for calculating the minimum weight of all histories in a history sDAG with respect to f, with

$$\begin{aligned} M_f(\rho ) = \min \left\{ M_f(v_c) + f(\rho , v_c) \,\big \vert \,v_c \in {\text {Ch}}(\rho ) \right\} . \end{aligned}$$
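The recursion can be sketched as a memoized traversal. The code below is an illustrative sketch, not the authors' implementation: it assumes nodes are stored as `(label, subpartition)` tuples, that weights are integers (so the additive identity is 0), and that edges leaving the UA node carry weight 0. The helpers `node`, `clade_union`, `hamming_or_zero` and `min_weights` are hypothetical names.

```python
def node(label, *clades):
    """A history sDAG node (label, subpartition); a leaf has no child clades."""
    return (label, frozenset(frozenset(c) for c in clades))


def clade_union(v):
    """CU(v): the leaf labels below v (just v's own label for a leaf)."""
    label, clades = v
    return frozenset().union(*clades) if clades else frozenset([label])


def hamming_or_zero(parent, child):
    """Edge weight f: Hamming distance; UA edges get weight 0 (an assumption)."""
    if parent[0] == "UA":
        return 0
    return sum(x != y for x, y in zip(parent[0], child[0]))


def min_weights(rho, edges, f):
    """Return {v: M_f(v)} for every node reachable from rho."""
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)
    M = {}

    def below(v):
        if v not in M:
            # Eq. 1: sum over child clades C of M_f(v, C).
            # For a leaf the sum is empty, giving the additive identity 0.
            M[v] = sum(
                min(below(c) + f(v, c) for c in children[v] if clade_union(c) == C)
                for C in v[1]
            )
        return M[v]

    below(rho)
    return M


# Toy sDAG: two alternative root labels above the same two leaves.
a, b = node("AC"), node("AG")
r1 = node("AC", {"AC"}, {"AG"})
r2 = node("TT", {"AC"}, {"AG"})
rho = node("UA", {"AC", "AG"})
edges = [(rho, r1), (rho, r2), (r1, a), (r1, b), (r2, a), (r2, b)]
M = min_weights(rho, edges, hamming_or_zero)
```

Each node is visited once and each edge is examined once per visit of its parent, so the work grows with the size of the DAG rather than with the number of histories it expresses.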

\(M_f \) will be used to define the trimmed history sDAG:

Definition 15

Let \((V, E)\) be a history sDAG and \(f:E\rightarrow W \) be an edge weight function, with W clade-ordered. The minimum weight trim of \((V, E)\) with respect to f is defined to be \(({\underline{V}}, {\underline{E}}) \), where

$$\begin{aligned} {\underline{E}}'&= \left\{ (v, v_c)\in E \,\big \vert \,M_f(v_c) + f(v, v_c) = M_f(v, {\text {CU}}(v_c)) \right\} ,\\ {\underline{V}}&= \left\{ v\in V\,\big \vert \,v \text { reachable from } \rho \text { via a path in } {\underline{E}}' \right\} , \text { and}\\ {\underline{E}}&= \left\{ (v, v_c)\in {\underline{E}}' \,\big \vert \,v, v_c \in {\underline{V}} \right\} . \end{aligned}$$

Notice that \({\underline{E}}'\) consists of edges from E which point to optimal subhistories, \({\underline{V}} \) contains nodes reachable from \(\rho \) via those edges, and \({\underline{E}} \) removes edges from \({\underline{E}}' \) which connect any nodes not in \({\underline{V}} \).
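A compact sketch of the trim follows, under the same illustrative assumptions as before: nodes are `(label, subpartition)` tuples, weights are integers, and UA edges weigh 0. The function `min_weight_trim` and its helpers are hypothetical names, not part of any published software.

```python
def node(label, *clades):
    return (label, frozenset(frozenset(c) for c in clades))


def clade_union(v):
    label, clades = v
    return frozenset().union(*clades) if clades else frozenset([label])


def f(parent, child):
    """Hamming edge weight; UA edges get weight 0 (an assumption)."""
    if parent[0] == "UA":
        return 0
    return sum(x != y for x, y in zip(parent[0], child[0]))


def min_weight_trim(rho, edges, f):
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)
    M = {}

    def below(v):  # M_f(v), memoized
        if v not in M:
            M[v] = sum(below_clade(v, C) for C in v[1])
        return M[v]

    def below_clade(v, C):  # M_f(v, C)
        return min(below(c) + f(v, c) for c in children[v] if clade_union(c) == C)

    # First pass: keep edges that point to an optimal subhistory for their clade.
    optimal = {
        (p, c) for p, c in edges
        if below(c) + f(p, c) == below_clade(p, clade_union(c))
    }
    # Second pass: keep nodes reachable from rho through optimal edges.
    kept, stack = {rho}, [rho]
    while stack:
        v = stack.pop()
        for p, c in optimal:
            if p == v and c not in kept:
                kept.add(c)
                stack.append(c)
    return kept, {(p, c) for p, c in optimal if p in kept and c in kept}


a, b = node("AC"), node("AG")
r1 = node("AC", {"AC"}, {"AG"})
r2 = node("TT", {"AC"}, {"AG"})
rho = node("UA", {"AC", "AG"})
edges = [(rho, r1), (rho, r2), (r1, a), (r1, b), (r2, a), (r2, b)]
V_trim, E_trim = min_weight_trim(rho, edges, f)
```

In this toy case the edges below `r2` are each optimal for their own clade, so they survive the first pass; they are removed only because `r2` is no longer reachable from the UA node.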

The following lemma verifies that this structure is what its name suggests.

Lemma 11

Let \((V, E)\) be a history sDAG, and \(f:E\rightarrow W \) be an edge-weight function, with W a weight set which is clade-ordered with respect to f and \((V, E)\). Let \((V', E') \) be the history sDAG constructed from minimum-weight histories in \((V, E)\), with respect to f, and let \(({\underline{V}}, {\underline{E}}) \) be the minimum weight trim of \((V, E)\) with respect to f. Then \((V', E') = ({\underline{V}}, {\underline{E}}) \).

The proof for this lemma is given in Appendix A.

2.3 Collapsing histories

The space of possible minimum weight histories on a fixed leaf label set is in general very large. However, some diversity in this set is a result of unnecessary history edges between nodes with the same label. Unless these edges target a leaf node, they are unnecessary, and their existence cannot be supported by the observed data represented in leaf labels.

Just as polytomies can be resolved as many possible bifurcating structures, collapsing history edges which connect nodes with identical labels reduces the number of possible histories on a fixed set of leaves, without restricting the number of informative evolutionary scenarios that can be expressed by those histories (Fig. 5).

Fig. 5

By collapsing red edges between nodes with identical labels, all five internally labeled tree structures shown here are equivalent (color figure online)

Motivated by this observation, we will enforce in practice that adjacent nodes in a history not have the same label, unless one of them is a leaf node. This choice is possible because we allow multifurcations in histories, which leads to the definition of “collapsing” below. On the other hand, sampled ancestors in a history can be witnessed as an internal node with the observed label \(\ell \in Y \), adjacent to the leaf node labeled \(\ell \). Since the edge between these two nodes targets a leaf, such a structure is allowed in a history.

A history containing internal edges whose parent and child nodes carry the same label may be modified to remove such edges. Doing so will add multifurcations to the history, as shown in Fig. 6. The following definition allows us to mark edges as collapsible arbitrarily, not just when their parent and child node labels match. This generality is useful in precisely stating Lemma 13.

Definition 16

Let \((V, E)\) be a history or history sDAG.

Given a binary-valued function \(b: E\rightarrow \left\{ 0,1 \right\} \), an edge \(e = \left( (\ell , U), (\ell ', U') \right) \in E \) is b-collapsible if \(b(e) = 1 \) and \(U' \ne \emptyset \) (so the target node is not a leaf node). An edge is b-collapsed if it is not b-collapsible. \((V, E)\) is b-collapsed if each edge in E is b-collapsed.

For the purpose of this paper we are interested in collapsing edges whose parent and child nodes have the same label. In this situation b should return 1 on edges whose parent and child nodes have the same label, and we will use the terms label-collapsible and label-collapsed instead of b-collapsible and b-collapsed.

A history which is not label-collapsed can be converted to a label-collapsed history by merging adjacent nodes with the same label, but this process requires also modifying subpartitions (Fig. 6).

Fig. 6

(a) Part of a history, with an edge e to be collapsed. (b) Collapsing e requires replacing the child clade \(C_2 \) of \(v_p \) with the child clades of \(v_c \), to create the new combined node \(v_p' \).

To formalize this, we first explain what it means to collapse an edge in a history.

Definition 17

Let \(t = (V_t, E_t) \) be a history with labels Y. Let \((V, E)\) be the complete history sDAG on labels Y. Also let \(e = \left( (\ell _p, U_p), (\ell _c, U_c) \right) \in E_t \) be an edge in t, so that \((\ell _c, U_c) \) is not a leaf node. Let \(C = {\text {CU}}(\ell _c, U_c) \) be the clade in \(U_p \) from which the edge e descends.

The history \(t_e = (V_e, E_e) \), formed by collapsing e in t, is defined as follows:

Define \(q:V_t \rightarrow V \) via

$$\begin{aligned} q(v) = {\left\{ \begin{array}{ll} \left( \ell _p, U_p \cup U_c \setminus \left\{ C \right\} \right) &{} v = (\ell _p, U_p) \text { or } v=(\ell _c, U_c)\\ v &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

Then \(V_e = q(V_t) \), and \(E_e = \left\{ (q(v), q(v')) \,\big \vert \,(v, v') \in E_t{\setminus } \left\{ e \right\} \right\} \).

Notice that after collapsing an edge, the resulting structure remains a valid history, because for any clade \(C\in U_c \), and for any node \(v_c \) which is a child of the node-clade pair \(\left( (\ell _c, U_c), C \right) \), the node \(v_c \) becomes a child of the node-clade pair \(\left( q(\ell _c, U_c), C \right) \). Also notice that \(q(\ell _c, U_c) = q(\ell _p, U_p) \) inherits the unique parent of \((\ell _p, U_p) \) in t.

The new history has one edge fewer than the original.

We can convert a history t to a label-collapsed history by iteratively collapsing each edge in t whose parent and child nodes have the same label.

Lemma 12

A history \(t_0 = (V_0, E_0)\) determines a unique label-collapsed history \(t_c \), which is the result of a finite sequence of edge collapses.

That is, there exists a finite sequence \(t_0, t_1 = (V_1, E_1), \ldots , t_n = (V_n, E_n) \) for which

  • \(t_i \) is the result of collapsing some edge \(e_i = \left( (\ell _i, U_i), (\ell _i', U_i') \right) \) in \(t_{i-1} \) for which \(\ell _i = \ell _i' \) and \(U_i'\ne \emptyset \), and

  • \(t_n \) is label-collapsed

Furthermore, for any such sequence of histories, \(t_n = t_c \).

Proof

Using the correspondence between histories and rooted, internally labeled, multifurcating trees established in Appendix A, we can use the fact that collapsing edges between internal nodes with the same label is a well-defined map on such trees. Since the order of edge collapse has no effect on the final tree, neither does the order of edge collapse on the final history in the sequence named above. \(\square \)
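The edge collapse of Definition 17 and the iteration in Lemma 12 can be sketched on a single history stored as a set of edges between `(label, subpartition)` nodes. This is an illustrative sketch with hypothetical function names. The merged subpartition is computed as \((U_p \setminus \{C\}) \cup U_c\), which agrees with the expression in Definition 17 whenever the child node has at least two child clades.

```python
def node(label, *clades):
    """A history node (label, subpartition); a leaf has no child clades."""
    return (label, frozenset(frozenset(c) for c in clades))


def collapse_edge(edges, e):
    """Collapse the internal edge e = (v_p, v_c), applying the map q to all nodes."""
    (lp, Up), (lc, Uc) = e
    if not Uc:
        raise ValueError("edges that target a leaf are never collapsed")
    C = frozenset().union(*Uc)        # clade of v_p from which e descends
    merged = (lp, (Up - {C}) | Uc)    # q(v_p) = q(v_c)

    def q(v):
        return merged if v in e else v

    return {(q(a), q(b)) for a, b in edges if (a, b) != e}


def label_collapse(edges, pick=min):
    """Collapse same-label internal edges until none remain.

    `pick` chooses which collapsible edge goes next, so different
    collapse orders can be compared.
    """
    edges = set(edges)
    while True:
        todo = [(p, c) for p, c in edges if p[0] == c[0] and c[1]]
        if not todo:
            return edges
        edges = collapse_edge(edges, pick(todo, key=repr))


# A chain of three internal nodes that all carry the label "A".
w, x, y, z = (node(s) for s in "wxyz")
n = node("A", {"y"}, {"z"})
m = node("A", {"x"}, {"y", "z"})
r = node("A", {"w"}, {"x", "y", "z"})
rho = node("UA", {"w", "x", "y", "z"})
history = {(rho, r), (r, w), (r, m), (m, x), (m, n), (n, y), (n, z)}

star = node("A", {"w"}, {"x"}, {"y"}, {"z"})
collapsed_a = label_collapse(history, pick=min)
collapsed_b = label_collapse(history, pick=max)
```

Both collapse orders end at the same multifurcating history rooted at `star`, as Lemma 12 states, and each collapse removes exactly one edge.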

Label-collapsing histories individually is straightforward, but collapsing a large collection of histories could be done more efficiently by label-collapsing their history sDAG.

Label-collapsing histories from within a history sDAG is not as straightforward, because some edges descending from a node-clade pair may need to be collapsed, while others may not. This means that an algorithm to collapse the history sDAG must occasionally add new nodes to the DAG (Fig. 7).

Fig. 7

Analogous to Fig. 6, but within a history sDAG, (a) shows part of a history sDAG, with an edge e to be collapsed. Collapsing e requires adding the node \(v_p' \) (b). In this example, both \(v_p \) and \(v_c \) remain in the history sDAG, because even without e they each have a parent edge, as well as one child edge descending from each child clade. Edges in (a) are colored to match with the corresponding new edges in (b), and with the annotations in Eq. 2 (color figure online)

In order to describe the behavior of collapsing in the history sDAG, we require the following definition.

Definition 18

Given a history sDAG \((V, E)\), we say that a collection of histories T is an edge cover of \((V, E)\) if for every edge \(e\in E\), there exists a history \(t\in T\) such that e is contained in t.

Further, a collection of histories T is a b-collapsible edge cover of \((V, E)\) if for every b-collapsible edge \(e\in E\) and every subhistory s containing e, there is a history \(t\in T\) that contains s.

The following lemma describes what it means to collapse a single edge in a history sDAG.

Lemma 13

Let \((V, E)\) be a history sDAG with label set Y.

Also let \((v_p = (\ell _p, U_p), v_c = (\ell _c, U_c))\in E \) be an internal edge. That is, \(U_c\ne \emptyset \) (so \(v_c \) isn’t a leaf node).

Define a binary function \(b: E\rightarrow \left\{ 0, 1 \right\} \) which is constant at 0, except that \(b(v_p, v_c)=1 \), and let T be any b-collapsible edge cover of \((V, E)\).

Let \(v_p' = (\ell _p, U_p\cup U_c \setminus \left\{ {\text {CU}}(v_c) \right\} ) \) be the “new parent node”, and define:

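$$\begin{aligned} E^+ = \left( E \setminus \left\{ (v_p, v_c) \right\} \right)&\cup \left\{ (v, v_p') \,\big \vert \,(v, v_p)\in E \right\} \\&\cup \left\{ (v_p', v) \,\big \vert \,(v_p, v)\in E,\ {\text {CU}}(v) \ne {\text {CU}}(v_c) \right\} \\&\cup \left\{ (v_p', v) \,\big \vert \,(v_c, v)\in E \right\} . \end{aligned}$$
(2)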

Let \(R = \emptyset \) if there exists an edge \((v_p, v)\in E^+ \) with \({\text {CU}}(v) = {\text {CU}}(v_c) \). Otherwise, let R be the set of parent edges of \(v_p \), of the form \((v', v_p) \in E^+ \).

Then, let \(E^- = E^+\setminus R \). Finally, define

$$\begin{aligned} E' = \left\{ (v_1, v_2) \,\big \vert \,(v_1, v_2) \in E^-,\ v_1 \text { reachable from } \rho \text { via edges in } E^- \right\} \end{aligned}$$

and

$$\begin{aligned} V' = \left\{ v_1, v_2 \,\big \vert \,(v_1, v_2) \in E' \right\} . \end{aligned}$$

Claim: \((V', E') \) is the history sDAG constructed from \(T' \), the set of histories which result by collapsing the edge \((v_p, v_c) \) in each history in T in which it appears.

Notice that if \(e = (v_p, v_c) \) is the only edge descending from the node-clade pair \((v_p, {\text {CU}}(v_c)) \), then collapsing e requires removing the node \(v_p \), and all edges involving it, from the history sDAG. Also, the definition of \(E^- \) will not leave any parent nodes of \(v_p \) with too few descendant edges, because we added edges from all parent nodes of \(v_p \) to \(v_p' \).

The last step in the construction of \(E' \) ensures that any nodes left without parents in the collapsing process will not appear in the label-collapsed history sDAG.

The proof for Lemma 13 is given in Appendix A.

Finally we arrive at the main result of this section, which provides a guarantee that all histories in a history sDAG can be collapsed by a finite sequence of edge collapses.

Lemma 14

Let \((V_0, E_0) \) be a history sDAG, and define a sequence \((V_i, E_i)_{i\in {\mathbb {N}}} \) of history sDAGs, so that \((V_k, E_k) \) is generated by collapsing an edge

$$\begin{aligned} e_{k-1} = \left( (\ell _{k-1}, U_{k-1}), (\ell _{k-1}', U_{k-1}')\right) \end{aligned}$$

in \((V_{k-1}, E_{k-1}) \) with \(\ell _{k-1} = \ell _{k-1}' \) and \(U_{k-1}' \ne \emptyset \) if such an edge exists. If no such edge exists, then \((V_k, E_k) = (V_{k-1}, E_{k-1}) \).

Then there exists \(N\in {\mathbb {N}}\) such that \((V_N, E_N) \) is label-collapsed. Also, if \(T_0 \) is a label-collapsible edge cover of \((V_0, E_0) \), and \(T_0' \) is the set of histories resulting from label-collapsing each history in \(T_0 \), then each history in \(T_0' \) is in \((V_N, E_N) \).

Although this lemma is written for label-collapsing, it extends to collapsing with respect to an arbitrary binary function b, defined on all possible edges in the complete history sDAG with the same leaf nodes and with labels chosen from the same ambient label set as \((V_0, E_0) \).

Note that the collapsing algorithm presented below produces the collapsed history sDAG \((V_N, E_N) \).

The proof for this lemma is given in Appendix A.

Lemma 14 suggests an algorithm for collapsing a history sDAG, whose implementation is given below.

Algorithm A

(Collapsing a history sDAG) Modifies a history sDAG so that no edges connect two non-leaf nodes with the same label, and the histories represented in the resulting history sDAG are the same as the set of histories represented by the original history sDAG, with each label-collapsed.

  1. Build queue. \({\mathcal {Q}}:= (v_i, v_i')_{i=1}^{|E|}\) is a queue of the edges \((v_i, v_i') \in E \), ordered so that if \((v_i, v_i') \) and \((v_j, v_j') \) are such that \(v_i' = v_j \), then \(j > i \). That is, edges at the beginning of the queue are closer to the UA node of the history sDAG.

  2. Collapse loop head. If \({\mathcal {Q}} \) is empty, END. Otherwise, remove the first element \(\left( v_p = (\ell _p, U_p), v_c = (\ell _c, U_c) \right) \) from \({\mathcal {Q}} \).

    (a) Check collapsed. If \(\ell _p = \ell _c \), \(v_c \) is not a leaf node, and \((v_p, v_c)\in E\), go to New parent. Otherwise, return to Collapse loop head.

    (b) New parent. Set \(v_p':= \left( \ell _p, U_p \cup U_c {\setminus } \left\{ \bigcup \limits _{C\in U_c} C \right\} \right) \). Add \(v_p'\) to V.

    (c) Add grandparents to new parent. For any \((v, v_p)\in E \), add \((v, v_p') \) to E and to the beginning of \({\mathcal {Q}} \).

    (d) Add children to new parent. For any \((v_p, v)\in E\), if the clade union of v is not the same as the clade union of \(v_c \), add \((v_p', v) \) to E and to the beginning of \({\mathcal {Q}} \).

    (e) Add grandchildren to new parent. For any \((v_c, v) \in E \), add \((v_p', v) \) to E and to the beginning of \({\mathcal {Q}} \).

    (f) Remove collapsed edge. Remove \((v_p, v_c) \) from E.

    (g) Remove lonely parent. If no edge \((v_p, v)\in E \) exists with clade unions of v and \(v_c \) equal, then do routine removenode \(v_p \) from \((V, E)\).

    (h) Remove orphaned child. If no edge \((v, v_c)\in E \) exists, do routine removenode \(v_c \) from \((V, E)\). Return to Collapse loop head.

The routine removenode v from \((V, E)\) is the following:

  1. Remove node. Remove v from V.

  2. Remove children loop head. For each child node \(v_c \) of v:

    (a) Remove edge. Remove the edge \((v, v_c) \) from E.

    (b) Clean child node. If no edge \((v', v_c) \) remains in E, then do routine removenode \(v_c\) from \((V, E)\).

  3. Remove parent loop head. For each parent node \(v_p \) of v:

    (a) Remove edge. Remove the edge \((v_p, v) \) from E.

Notice that each iteration of the collapse loop corresponds with an element in the sequence of history sDAGs named in Lemma 14. Since the order of edges in the sequence \((e_k) \) in Lemma 14 has no effect on the resulting history sDAG, the order of edges in the queue should have no effect on the history sDAG produced by this algorithm.
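A set-based variant of the procedure can be sketched as follows. It is not the queue-based Algorithm A itself: it applies the edge additions and removals of steps (b) to (g) to one edge at a time, and replaces the recursive removenode bookkeeping with a reachability filter from the UA node, in the spirit of Lemma 13. All function names are hypothetical, and the sketch assumes every internal node other than the UA node has at least two child clades.

```python
def node(label, *clades):
    return (label, frozenset(frozenset(c) for c in clades))


def clade_union(v):
    label, clades = v
    return frozenset().union(*clades) if clades else frozenset([label])


def collapse_in_dag(rho, edges, e):
    """Collapse one internal edge e = (v_p, v_c) of a history sDAG."""
    vp, vc = e
    C = clade_union(vc)
    vp_new = (vp[0], (vp[1] - {C}) | vc[1])                       # step (b)
    E = set(edges) - {e}                                          # step (f)
    E |= {(a, vp_new) for a, b in edges if b == vp}               # step (c)
    E |= {(vp_new, b) for a, b in edges
          if a == vp and clade_union(b) != C}                     # step (d)
    E |= {(vp_new, b) for a, b in edges if a == vc}               # step (e)
    # Step (g): if v_p has no child left below clade C, detach it from its parents.
    if not any(a == vp and clade_union(b) == C for a, b in E):
        E = {(a, b) for a, b in E if b != vp}
    # Drop edges whose source can no longer be reached from the UA node.
    reachable, stack = {rho}, [rho]
    while stack:
        v = stack.pop()
        for a, b in E:
            if a == v and b not in reachable:
                reachable.add(b)
                stack.append(b)
    return {(a, b) for a, b in E if a in reachable}


def label_collapse_dag(rho, edges):
    """Repeat single-edge collapses until no label-collapsible edge remains."""
    E = set(edges)
    while True:
        todo = [(p, c) for p, c in E if p[0] == c[0] and c[1]]
        if not todo:
            return E
        E = collapse_in_dag(rho, E, todo[0])


# Node r has two alternative children below clade {y, z}; only nA shares r's label.
x, y, z = (node(s) for s in "xyz")
nA = node("A", {"y"}, {"z"})
nG = node("G", {"y"}, {"z"})
r = node("A", {"x"}, {"y", "z"})
rho = node("UA", {"x", "y", "z"})
dag = {(rho, r), (r, x), (r, nA), (r, nG), (nA, y), (nA, z), (nG, y), (nG, z)}
collapsed = label_collapse_dag(rho, dag)
star = node("A", {"x"}, {"y"}, {"z"})
```

In this example the node `r` must stay, because the history through `nG` still needs it, while the collapsed history requires the new node `star`. This is the situation in which collapsing adds a node to the DAG.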

2.4 History sDAG completion

We now introduce “completion,” which essentially means that we add every edge that respects clade union sets. More precisely, Definition 3 specifies that each edge of a history sDAG must target a node whose clade union is in the subpartition of its parent. Given a collection of history sDAG nodes V, we can create an edge set \(E' \) containing all edges allowed by this requirement. The resulting DAG \((V, E')\) then contains all histories that can be constructed using nodes from V. If V is the node set for some valid history sDAG, then the resulting DAG \((V, E') \) must also be a history sDAG.

By completing a history sDAG, additional histories are represented. Although there is no guarantee about the weight of these new trees, it is possible that additional minimum weight trees may be found by the completed history sDAG, which makes this operation useful.

This idea is expressed in the following definition.

Definition 19

Let T be a collection of histories with labels in Y. Let \((V, E)\) be the history sDAG constructed from T. The completed history sDAG constructed from T is the history sDAG \((V, E') \), where

$$\begin{aligned} E' = \left\{ (v, v') \,\big \vert \,v, v'\in V \text { and the clade union of } v' \text { is a child clade of } v \right\} \end{aligned}$$

We will also refer to \((V, E') \) as the completion of \((V, E)\).

The completed history sDAG constructed from T is a history sDAG because it includes at least those edges present in the history sDAG constructed from T, and all the additional edges are allowed by the definition of the history sDAG. We emphasize that history sDAG completion adds no new nodes, and that a completed history sDAG is in general a much smaller object than the complete history sDAG on a taxon set described in Definition 8.
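Completion amounts to one pass over pairs of nodes. The sketch below is illustrative: it stores nodes as `(label, subpartition)` tuples, never lets the UA node be the target of an edge, and assumes no other node has a single child clade (such a node would otherwise match its own subpartition). The names `complete`, `node` and `clade_union` are hypothetical.

```python
def node(label, *clades):
    return (label, frozenset(frozenset(c) for c in clades))


def clade_union(v):
    label, clades = v
    return frozenset().union(*clades) if clades else frozenset([label])


def complete(rho, nodes):
    """All edges (v, w) such that the clade union of w is a child clade of v."""
    return {
        (v, w)
        for v in nodes
        for w in nodes
        if w != rho and w != v and clade_union(w) in v[1]
    }


# Two input histories with the same subpartitions but different internal labels.
x, y, z = (node(s) for s in "xyz")
r1, n1 = node("R1", {"x"}, {"y", "z"}), node("N1", {"y"}, {"z"})
r2, n2 = node("R2", {"x"}, {"y", "z"}), node("N2", {"y"}, {"z"})
rho = node("UA", {"x", "y", "z"})
original = {
    (rho, r1), (r1, x), (r1, n1), (n1, y), (n1, z),
    (rho, r2), (r2, x), (r2, n2), (n2, y), (n2, z),
}
nodes = {v for edge in original for v in edge}
completed = complete(rho, nodes)
added = completed - original
```

Here the only new edges connect each root to the other history's internal node, which is exactly a swap of subhistories on the same leaf labels below different parent nodes.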

Earlier sections show that \(T \subset D(T) \) for a set of histories T because a history sDAG constructed from T allows subhistory swaps involving conforming subhistories. In contrast, the completed history sDAG constructed from T allows any subhistories on the same leaf labels to swap, regardless of their parent nodes.

Swaps between subhistories with the same leaf label sets will not preserve history weights in the same sense as conforming subhistory swaps. Therefore, the completed history sDAG constructed from a set of histories T is not guaranteed to preserve weights in any sense. However, the lemmas from the previous sections guarantee that any history sDAG can be trimmed to express only its minimum weight histories. This means that the completed history sDAG can be used as a way to find even more minimum-weight histories than the original history sDAG construction, given a set of minimum-weight histories T. For example, completing a history sDAG constructed from maximally parsimonious, or nearly maximally parsimonious histories, could in some cases find additional maximally parsimonious histories which wouldn’t have been present before completion.

The completed history sDAG constructed from a set T of histories represents all possible histories which can be constructed using the nodes of histories in T. A choice of input histories can therefore be framed as a choice of plausible pairs of labels and subpartitions, which then determines a collection of plausible histories.

3 Exploring parsimony diversity of SARS-CoV-2 clades

The original motivation for the history sDAG was to store a collection of minimum-weight histories. The theorems in the preceding sections show that the history sDAG is an ideal object for this task, and can discover new minimum weight histories in addition to those which we seek to store. Because SARS-CoV-2 is densely sampled relative to the rate of mutation and undergoes minimal recombination, parsimony methods are well-suited to studying its evolution (Thornlow et al. 2021). However, we will now demonstrate that there exists considerable uncertainty in a parsimonious reconstruction of SARS-CoV-2 evolution.

Searching for maximally parsimonious trees is computationally intensive, and scales poorly as the number of leaves increases. Traditionally, tools like PHYLIP’s dnapars were used to produce an assortment of maximally parsimonious trees on a given set of sequences (Felsenstein 2009). Recently, the UShER project made it possible to quickly reconstruct a single approximate parsimony tree on millions of sampled sequences (Thornlow et al. 2021). Neither method guarantees that the reconstructed trees are maximally parsimonious relative to all possible trees on the given leaf sequences.

Users of both methods often accept the first tree produced, ignoring the uncertainty inherent to the parsimony assumption. However, there are in general many possible maximally parsimonious trees on a given set of leaf sequences.

Indeed, dnapars by default outputs a non-exhaustive collection of maximally parsimonious trees. However, for very large sets of sequences, a collection of nearly maximally parsimonious trees may be produced much more quickly using UShER. As a demonstration, we use UShER to reconstruct trees on an assortment of SARS-CoV-2 clades, extracted from the global phylogeny of public SARS-CoV-2 sequences provided by the UShER project (accessed 3-3-2022) (Lanfear 2020; Turakhia et al. 2021). We allowed UShER to reconstruct trees on the set of unique sequences from each clade, as well as the ancestral sequence in the original tree, outputting a maximum of 200 trees resulting from alternative parsimonious placements of samples. Including the ancestral sequence guarantees that the resulting reconstruction is comparable to the subtree of the global phylogeny corresponding to the same clade. We then use the UShER utility matOptimize (Ye et al. 2022) to attempt to optimize each tree, allowing the optimizer to make up to four moves for each sample which do not improve the parsimony score. Allowing a few such moves is intended to increase the diversity in output trees, without requiring excessive computation time. We saved four intermediate trees during optimization of each tree output by UShER. Optimizations of different trees output by UShER are not guaranteed to achieve the same parsimony score. However, even optimized trees which are not globally maximally parsimonious are likely to contain parsimony-optimal substructures.

For each clade, the collection of 800 intermediate trees resulting from these tree optimizations is used to create a history sDAG, after outgrouping the ancestral sequence in each. These 800 trees are not guaranteed to be unique, and in fact there are often many duplicates. The resulting history sDAG is then completed, trimmed to only express maximally parsimonious histories, and label-collapsed.

Whereas it is computationally expensive to construct a maximally parsimonious tree, the operations of trimming, collapsing, and completing are highly optimized, and in practice take only a few seconds for the history sDAGs used to produce Fig. 8. The number of operations required for the proposed trimming algorithm is bounded by \({\mathcal {O}}(E\cdot (MCS + MNC))\), and similarly the algorithm for completing the history sDAG is bounded by \({\mathcal {O}}(N^2\cdot MNC)\), where N is the number of nodes, E the number of edges, MCS the maximum size of any set of edges descending from a node-clade pair, and MNC the maximum number of child clades for any node in the history sDAG.

Fig. 8

Unique trees found by UShER, and unique trees in the resulting history sDAG, for each selected SARS-CoV-2 clade. Point colors indicate if the parsimony score of trees in the history sDAG is lower than the best parsimony score achieved by UShER. Parsimony improvement compared to UShER trees does not exceed 0.04% for any clade. These data are summarized in Supplementary Table 2 (color figure online)

The resulting history sDAG sometimes contains histories which are slightly more parsimonious than any trees found by UShER, and in most cases, the number of maximally parsimonious histories contained in the resulting history sDAG is many orders of magnitude greater than the number of histories used as input (Fig. 8). However, this increase is far from uniform across clades. For the clade AY.46.6, the history sDAG expresses an impressive 25 orders of magnitude more tree diversity than the input trees found by UShER, and all of those trees have a slightly better parsimony score than any tree found by UShER. In contrast, UShER found only two unique trees for clade AY.111, and the resulting history sDAG contains only those same two trees.

For some clades, such as 20F, the number of unique trees found by UShER is greater than the final number of trees expressed in the history sDAG. Although surprising, this is not contradictory, since many of the unique trees found by UShER may have a higher parsimony score than the trees contained in the final history sDAG.
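Counts of this size cannot be obtained by enumeration, but the number of histories expressed by a history sDAG can be computed with one memoized traversal: a history chooses one child edge for every node-clade pair, so the count below a node is a product over its child clades of the summed counts of the children under each clade. The sketch below is one natural way to do this and is not claimed to be the authors' implementation; `count_histories` and the toy nodes are hypothetical.

```python
from math import prod


def node(label, *clades):
    return (label, frozenset(frozenset(c) for c in clades))


def clade_union(v):
    label, clades = v
    return frozenset().union(*clades) if clades else frozenset([label])


def count_histories(rho, edges):
    """Number of distinct histories expressed by the sDAG rooted at rho."""
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)
    memo = {}

    def count(v):
        if v not in memo:
            # Independent choice per child clade; an empty product (leaf) is 1.
            memo[v] = prod(
                sum(count(c) for c in children.get(v, []) if clade_union(c) == C)
                for C in v[1]
            )
        return memo[v]

    return count(rho)


x, y, z = (node(s) for s in "xyz")
r1, n1 = node("R1", {"x"}, {"y", "z"}), node("N1", {"y"}, {"z"})
r2, n2 = node("R2", {"x"}, {"y", "z"}), node("N2", {"y"}, {"z"})
rho = node("UA", {"x", "y", "z"})
original = {
    (rho, r1), (r1, x), (r1, n1), (n1, y), (n1, z),
    (rho, r2), (r2, x), (r2, n2), (n2, y), (n2, z),
}
completed = original | {(r1, n2), (r2, n1)}
n_original = count_histories(rho, original)
n_completed = count_histories(rho, completed)
```

Python integers have arbitrary precision, so the same recursion remains exact even when the count spans dozens of orders of magnitude.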

It is unlikely that Fig. 8 reflects the true diversity of maximally parsimonious trees for each clade. In fact, the true minimum parsimony scores for tree reconstructions of each clade may be lower than the parsimony score of trees found here. The variation in tree diversity between clades is instead likely determined by features in the particular trees found by UShER. Further investigation of the true diversity of maximum parsimony trees will be left for future work.

Regardless, the large diversity of trees for most clades suggests that considerable uncertainty remains about tree structure when performing a maximum-parsimony search, even after collapsing edges without mutations into multifurcations. This uncertainty represents an opportunity to fine-tune the accepted tree in settings where parsimony is an appropriate assumption. For example, the histories found by this method could be used as a starting point for further optimization according to criteria other than parsimony. Such criteria, and their efficient calculation in the history sDAG, will be the subject of future work.

4 Discussion

This paper establishes that the history sDAG is an efficient structure for storage of similar internally labeled trees, and provides a foundation for future work to understand phylogenetic uncertainty using massive collections of parsimonious trees.

We described efficient methods for basic manipulation of the history sDAG object, and used these methods to demonstrate that for densely sampled SARS-CoV-2 data, it is possible to build a history sDAG containing many alternative parsimonious evolutionary histories. We implemented this process on clades containing up to seven thousand leaves, although it would have been feasible to use clades containing perhaps ten times as many. Software currently in development will allow parsimony optimization via matOptimize (Ye et al. 2022) directly on the history sDAG, avoiding the time-consuming step of generating many input trees with UShER, and hopefully allowing these methods to scale to even larger datasets.

Thanks to the convenient structure of the history sDAG, it will be possible to efficiently summarize clade-level uncertainty in these histories, although such methods will be described and benchmarked in a future paper. This approach can only be expected to work well when the tree posterior is overwhelmingly concentrated on maximally parsimonious trees, and even then clade supports estimated with the history sDAG may not be directly comparable to supports observed in a sample from the tree posterior. However, for phylogenetic inference resulting in a single maximally parsimonious tree (which is typically arbitrarily chosen from the collection of MP trees), our method could provide a valuable understanding of the uncertainty resulting from this choice. Clade support estimation via the history sDAG may have advantages over standard approaches to phylogenetic uncertainty estimation. Unlike a bootstrap approach, all alternative histories in the history sDAG are built on the same data, and therefore clade support derived from the history sDAG could be more accurate for clades defined by only a few mutations (Wertheim et al. 2022). Unlike a Bayesian approach, our method makes no attempt to fully resolve a tree when there is insufficient signal to do so, and we expect it to scale well to large data.

The history sDAG is related to various earlier works, as we now describe.

4.1 The Subsplit DAG

The history sDAG generalizes a similar construction useful for likelihood computations and variational inference on trees, integrating out ancestral sequence uncertainty (Zhang and Matsen 2018, 2019). Although this form of the DAG structure is not expressed in the original variational inference papers, it is described in a more recent paper (Jun et al. 2023). In this subsplit DAG, internal nodes do not contain label data, and each internal node is required to have exactly two child clades (a subsplit is a subpartition with two parts). That is, the subsplit DAG is a history sDAG in which internal nodes all share the same fixed label, and each node has two child clades. The additional node label information in the history sDAG is essential for efficient storage and retrieval of maximally parsimonious trees, with the inferred ancestral sequences dictated by the parsimony assumption.

4.1.1 The Buneman Graph

A construction known as the Buneman graph is related to the history sDAG. In this construction, a collection of observations, each consisting of a collection of binary traits, can be arranged in a graph. This Buneman graph contains as subgraphs all possible maximally parsimonious trees relating the observations (Semple and Steel 2003). This construction has been generalized to sequences of non-binary characters (Bandelt and Röhl 2009; Misra et al. 2011), and one such generalization was applied to the problem of finding provably maximally parsimonious trees on nucleotide sequence data (Misra et al. 2011).

However, although the Buneman graph contains all maximally parsimonious trees on a set of observations, it may also contain trees which are not maximally parsimonious. The Buneman graph is therefore not a natural data structure for storing collections of maximally parsimonious trees, since considerable additional computation may be needed to find the maximally parsimonious trees in the graph. In contrast, the history sDAG may be trimmed to express only maximally parsimonious trees, and sampling or iterating through the trees it contains is trivial. In addition, the history sDAG can be immediately generalized to arbitrary observed data (abstracted as node labels), and allows efficient computation and trimming with respect to weight functions other than parsimony.
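The trimming and counting operations mentioned above can be illustrated with a minimal sketch. The dictionary encoding, the toy edge weights, and the function names `count_histories`, `min_weight`, and `trim` are assumptions made for illustration; they are not the interface of the authors' software. The sketch also assumes that a history selects exactly one edge below each child clade of every node it contains, so that counts multiply across clades and add across alternative edges.

```python
from math import prod

# Hypothetical toy encoding: node -> list of child clades, where each clade is
# a list of alternative (child_node, edge_weight) pairs. Leaves have no clades.
# Edge weights stand in for the number of mutations along the edge.
dag = {
    "root": [[("x", 1), ("y", 1)], [("u", 1), ("v", 0)]],
    "x": [[("a", 1)], [("b", 0)]],
    "y": [[("a", 0)], [("b", 0)]],
    "u": [[("c", 0)], [("d", 0)]],
    "v": [[("c", 1)], [("d", 0)]],
    "a": [], "b": [], "c": [], "d": [],
}


def count_histories(dag, node):
    """Number of distinct subhistories rooted at `node`."""
    # One alternative is chosen per clade: sum over alternatives, multiply over clades.
    return prod(
        sum(count_histories(dag, child) for child, _ in clade)
        for clade in dag[node]
    )


def min_weight(dag, node):
    """Smallest total edge weight of any subhistory rooted at `node`."""
    return sum(
        min(w + min_weight(dag, child) for child, w in clade)
        for clade in dag[node]
    )


def trim(dag):
    """Keep, below each clade, only the edges that achieve the minimum weight."""
    trimmed = {}
    for node, clades in dag.items():
        trimmed[node] = []
        for clade in clades:
            best = min(w + min_weight(dag, c) for c, w in clade)
            trimmed[node].append(
                [(c, w) for c, w in clade if w + min_weight(dag, c) == best]
            )
    return trimmed


print(count_histories(dag, "root"))        # all histories in the toy DAG
print(min_weight(dag, "root"))             # best achievable total weight
print(count_histories(trim(dag), "root"))  # histories achieving that weight
```

The recursions here are left unmemoized for brevity; on a structure of realistic size each node's count and minimum weight would be computed once and cached. Replacing the edge weights with another additive edge function gives trimming with respect to a criterion other than parsimony.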

4.1.2 Tree Fusion

The swapping of subhistories that takes place in the history sDAG bears some resemblance to the procedure known as tree fusion, used in some parsimony software like TNT, in which clades are swapped between trees to improve parsimony scores (Goloboff 1999; Goloboff and Pol 2007).

Generally, the history sDAG can be thought of as a structure which efficiently represents, and allows computation on, the set of trees resulting from all possible combinations of these clade swaps. However, the history sDAG can only swap subhistories that have identical parent node labels and subpartitions. In contrast, tree fusion can consider trees resulting from swapping any subtrees, as long as they contain the same set of samples.
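The restriction to identical parent node labels and subpartitions can be made concrete with a small sketch. The nested-tuple encoding of a history, the symbolic labels (which stand in for sequences), and the helper names `leaves`, `node_key`, `add_history`, and `count` are all hypothetical choices for illustration. The sketch assumes that a node is identified by its label together with the set of leaf sets below its children, and that merging histories amounts to taking the union of their nodes and edges.

```python
from math import prod


def leaves(tree):
    """Set of leaf labels below a history encoded as (label, children)."""
    label, children = tree
    if not children:
        return frozenset([label])
    return frozenset().union(*(leaves(child) for child in children))


def node_key(tree):
    """Identify a node by its label and its subpartition (child leaf sets)."""
    label, children = tree
    return (label, frozenset(leaves(child) for child in children))


def add_history(edges, tree):
    """Add every edge of `tree` to `edges`: (parent key, clade) -> child keys."""
    parent = node_key(tree)
    for child in tree[1]:
        edges.setdefault((parent, leaves(child)), set()).add(node_key(child))
        add_history(edges, child)


def count(edges, key):
    """Number of histories below the node `key` in the merged structure."""
    _, clades = key
    return prod(
        sum(count(edges, child) for child in edges[(key, clade)])
        for clade in clades
    )


def leaf(name):
    return (name, ())


# Two input histories with the same root label and the same root subpartition
# {a, b} | {c, d}, but different internal labels below the root.
t1 = ("r", (("x1", (leaf("a"), leaf("b"))), ("y1", (leaf("c"), leaf("d")))))
t2 = ("r", (("x2", (leaf("a"), leaf("b"))), ("y2", (leaf("c"), leaf("d")))))

edges = {}
add_history(edges, t1)
add_history(edges, t2)
print(count(edges, node_key(t1)))
```

Because both roots share a label and a subpartition, they become one node, and the subhistories below each of its two clades can be combined freely. Had the two roots carried different labels, they would have remained separate nodes and no swap between these two histories would be represented at that level.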

Tree fusion is better approximated in the completed history sDAG, which does allow swaps of any subhistories containing the same samples. That is, for a history sDAG (V, E) constructed from a set of histories T, the set of histories in the completion of (V, E) consists of all histories resulting from combinations of swaps involving subhistories of histories in T, regardless of their parent nodes. However, subhistory swaps are still fundamentally different from the swaps of subtopologies realized during tree fusion, since subhistory swaps maintain the same ancestral node labels that were present in the original histories involved in each swap. In order to ensure that ancestral labels are optimal in the new histories contained in the completed history sDAG, we would need an algorithm to reconstruct these ancestral states from scratch. Such an algorithm for computing optimal ancestral states in the history sDAG would be analogous to the Sankoff algorithm for reconstructing ancestral states on trees.

Despite these limitations, Fig. 8 shows that the subtree swaps which are realized in the history sDAG can be effective in reducing parsimony scores. Although the history sDAG does not fully implement tree fusion, it concurrently applies subhistory swaps in many different histories, and allows the resulting trees to be filtered efficiently according to arbitrary criteria. This may represent an advantage over methods which keep track of and optimize far fewer trees.

4.1.3 Tree Sequences

The history sDAG also bears some similarities to the tree sequence (Kelleher et al. 2019; Speidel et al. 2019). The tree sequence encodes a single evolutionary history for segments of a multiple sequence alignment, with changes of evolutionary history at specific points along the alignment due to recombination. The history sDAG, on the other hand, is meant to encode an unordered collection of equally parsimonious histories.

4.1.4 Future Work

We are in the process of building software that will allow us to do larger-scale inference using the history sDAG. In addition to the uncertainty quantification goals described above, this software will also allow us to do broader exploration of the set of maximally parsimonious trees than previously possible. We also hope to use the history sDAG as a means of improving MCMC sampling.

Maximally parsimonious trees may be a good starting point for inference via other methods, such as the branching process used by the tree inference package gctree (DeWitt et al. 2018). To support this, we will develop efficient algorithms to make calculations on histories contained in the history sDAG. We will also explore ways to search for new optimal histories, such as maximally parsimonious histories, directly within the structure of the history sDAG.