1 Introduction

In formal phylogenetics, Dollo models postulate that a character can only appear once in the course of evolution, although it may disappear from several descendent lineages. This idea has a number of combinatorial consequences, and suggests an algorithmic inference of phylogeny that differs significantly from standard approaches. In this paper, we present a proof of principle for such an algorithm in the context of unrooted binary trees.

Our approach is hierarchical and agglomerative: it combines pairs of input taxa or already constructed subtrees but, crucially, also makes use, via Dollo, of a limited amount of information from outside these pairs.

The input to the generic version of our algorithm consists of n taxa, sometimes referred to as operational taxonomic units (OTUs), corresponding to the terminal vertices of the phylogeny to be constructed. Each taxon is represented by a set of distinct characters, and the sets may overlap with each other to varying degrees. The strategy is to construct \(n-2\) output sets, each containing some of the same characters, corresponding to the non-terminal vertices, or hypothetical taxonomic units (HTUs), of a binary tree constructed at the same time, such that the Dollo condition, or at least some of its important consequences, is maintained.

In contrast to the generic version of the algorithm, for any specific formulation of a Dollo model, the algorithm must be modified so that the output sets conform to the requirements of the problem at hand. These modifications generally mean that Dollo is no longer a sufficient condition, but remains a necessary condition for a character to be included in an output set.

In the main part of this paper, the elements of an input set are all the proximities between oriented genes in one genome. The elements of each output set are chosen from a very large number of proximities, namely those satisfying the Dollo condition, but narrowed down to include only those consistent with an optimal linear ordering, i.e., chromosome fragments. We have previously invoked these concepts in studying the “small phylogeny problem”, where the tree topology is given (e.g., [1,2,3,4,5]), but here we concentrate on the “large phylogeny problem”, where we actually construct the phylogeny.

We apply our method to three plant orders to compare the results with known phylogenies, with almost total agreement.

Finally, we discuss possible improvements in efficiency and extensions beyond unrooted binary branching phylogenies.

2 Dollo’s Law in the Context of Unrooted Binary Trees

The idea that a character is gained only once and can never be regained once it is lost is realized in an unrooted tree by the property that the set of vertices containing the character is connected. This is a necessary and sufficient condition, valid both for terminal vertices (of degree 1) and for internal (ancestral) ones (of degree 3 in an unrooted binary tree).
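The connectedness condition can be checked mechanically. The following is a minimal sketch, assuming the tree is given as a list of edges and each vertex (terminal or internal) carries a set of characters; the function name `is_dollo_compatible` and the small four-leaf tree are hypothetical and serve only as illustration.

```python
from collections import deque


def is_dollo_compatible(edges, vertex_sets, character):
    """Return True if the vertices carrying `character` form a connected subtree."""
    carriers = {v for v, chars in vertex_sets.items() if character in chars}
    if len(carriers) <= 1:
        return True  # zero or one carrier is trivially connected

    # Keep only the tree edges whose two endpoints both carry the character.
    neighbours = {v: set() for v in carriers}
    for u, v in edges:
        if u in carriers and v in carriers:
            neighbours[u].add(v)
            neighbours[v].add(u)

    # Breadth-first search from an arbitrary carrier.
    start = next(iter(carriers))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in neighbours[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen == carriers


# Hypothetical unrooted binary tree: leaves a, b, c, d and internal vertices x, y.
edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
vertex_sets = {
    "a": {1, 2}, "b": {1}, "c": {2}, "d": {3},
    "x": {1, 2}, "y": {2},
}

ok_1 = is_dollo_compatible(edges, vertex_sets, 1)  # carriers a, b, x
ok_2 = is_dollo_compatible(edges, vertex_sets, 2)  # carriers a, x, y, c
# Removing character 2 from y disconnects c from a and x.
broken = dict(vertex_sets, y=set())
ok_broken = is_dollo_compatible(edges, broken, 2)
```

In this toy tree, characters 1 and 2 satisfy the condition, while dropping character 2 from the internal vertex y violates it, since the character would then have to be gained twice.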

Fig. 1. Necessary condition for characters to appear at internal vertex of binary branching tree. Light shaded character (small square) appears in all three trees (triangles) subtended by the internal vertex (circle). Dark shaded character appears in only two of the trees. Unshaded character appears in only one subtree so does not affect internal vertex. The shaded characters are “phylogenetically validated” with respect to the internal vertex. The unshaded one is not validated.

Formally, the connectedness condition can be satisfied by a set consisting only of non-terminal nodes of a tree. For phylogenetic inference, however, we require that a character be present in the input sets of at least two terminal vertices, i.e., visible to the observer. Moreover, if it were in only one terminal set, its presence could not be required in any of the output sets at the non-terminal vertices.

In an unrooted binary branching tree, each non-terminal vertex subtends three subtrees, as in Fig. 1. For a data set to be used in constructing a phylogenetic tree and the output sets, clearly each character must be present in at least two of the three subtrees, as illustrated in Fig. 1. More precisely, each character must be present at least in one terminal vertex set in at least two of the three subtrees. We call these characters “phylogenetically validated”.
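As an illustration of this definition, the sketch below computes the phylogenetically validated characters for one internal vertex, given the terminal character sets of each of its three subtrees. The function name and the labels `light`, `dark` and `plain` (echoing the shading in Fig. 1) are hypothetical.

```python
def validated_characters(subtree_leaf_sets):
    """Characters present in a terminal set of at least two of the three subtrees.

    `subtree_leaf_sets` is a list of three lists; each inner list holds the
    character sets of the terminal vertices of one subtree.
    """
    # Pool the terminal sets of each subtree.
    present = [set().union(*leaves) for leaves in subtree_leaf_sets]
    candidates = set().union(*present)
    # Keep characters seen in two or three of the pooled subtrees.
    return {c for c in candidates if sum(c in p for p in present) >= 2}


subtrees = [
    [{"light", "dark"}, {"light"}],    # first subtree, two terminal vertices
    [{"light", "dark", "plain"}],      # second subtree, one terminal vertex
    [{"light"}],                       # third subtree, one terminal vertex
]
validated = validated_characters(subtrees)
```

Here `light` occurs in all three subtrees and `dark` in two, so both are validated; `plain` occurs in only one subtree and is excluded.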

3 Generic Algorithm

  1. input n sets of characters
  2. \(i=1\)
  3. all n sets are “eligible” vertices
  4. while \(i\le n-3\)
    (a) *for each pair \((G,H)\) of the \(m=n-i+1\) eligible vertices, calculate a potential ancestral vertex A containing all phylogenetically validated characters (in both G and H, or in one of G or H plus any other eligible vertex)
    (b) pick the pair \((G',H')\) with the maximum total number of characters in their potential ancestor A
    (c) ancestor vertex A becomes eligible, and the other two, \(G'\) and \(H'\), become ineligible
    (d) two edges of the tree are defined: \(AG'\) and \(AH'\)
    (e) \(i=i+1\)
  5. Now \(i = n-2\), so that there are three eligible vertices \(G', H', K'\). Define three edges of the tree: \(AG', AH'\) and \(AK'\)
  6. calculate ancestral vertex A containing all phylogenetically validated characters (in any two or all three of \(G',H'\) and \(K'\))
  7. output all \(2n-3\) tree edges and all \(n-2\) selected output (ancestral, or HTU) sets.

Fig. 2. The enclosed area contains input vertices. Outside the enclosed area are ancestors. Five of the input vertices have already been joined to an ancestor. These are shaded, as is one ancestor that has itself been joined to an earlier ancestor. Eligible vertices, either input or ancestral, are not shaded. The dotted lines show an attempt to join an input vertex to an ancestor, with one dotted line potentially linking a Dollo character.

The asterisked step 4a may be interpreted in at least two different ways. In requiring that a character be present in a subtree, we may mean that the character must be present in some input set associated with a terminal vertex of that subtree, or we may require something stronger, namely that the character be present in the eligible (ancestral) vertex of that subtree.

In constructing a potential ancestor A of two input sets G and H, we define a Dollo character to be one in either G or H, but not both, as well as in some eligible vertex other than G or H.
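The generic algorithm can be sketched in a few lines of Python. This is a minimal, unweighted sketch under several assumptions that are ours rather than the paper's: it adopts the stronger interpretation of step 4a (a character must be present in the eligible vertex itself), breaks ties by taking the first maximal pair encountered, names ancestors `A1`, `A2`, ... (input names must not collide with these), and uses a hypothetical five-taxon input.

```python
from itertools import combinations


def potential_ancestor(g, h, others):
    """Characters in both g and h, or in exactly one of them plus another eligible vertex."""
    elsewhere = set().union(*others)
    return (g & h) | ((g ^ h) & elsewhere)


def dollo_tree(input_sets):
    """Return (edges, ancestors) for a dict mapping taxon name -> character set."""
    if len(input_sets) < 3:
        raise ValueError("need at least three taxa")
    sets = dict(input_sets)
    eligible = list(input_sets)
    edges, ancestors = [], {}
    step = 1
    # Equivalent to "while i <= n-3": stop when three eligible vertices remain.
    while len(eligible) > 3:
        best = None
        for g, h in combinations(eligible, 2):
            others = [sets[k] for k in eligible if k not in (g, h)]
            a = potential_ancestor(sets[g], sets[h], others)
            if best is None or len(a) > len(best[0]):
                best = (a, g, h)
        a, g, h = best
        name = f"A{step}"
        sets[name] = a
        ancestors[name] = a
        edges += [(name, g), (name, h)]
        # The new ancestor becomes eligible; its two children do not.
        eligible = [k for k in eligible if k not in (g, h)] + [name]
        step += 1
    # Final ancestor: characters in any two or all three remaining vertices.
    g, h, k = eligible
    name = f"A{step}"
    ancestors[name] = (sets[g] & sets[h]) | (sets[g] & sets[k]) | (sets[h] & sets[k])
    edges += [(name, g), (name, h), (name, k)]
    return edges, ancestors


# Hypothetical input: five taxa with small integer characters.
taxa = {
    "t1": {1, 2, 3}, "t2": {1, 2, 4}, "t3": {5, 6, 7},
    "t4": {5, 6, 8}, "t5": {3, 9},
}
edges, ancestors = dollo_tree(taxa)
```

For n input sets the sketch returns \(2n-3\) edges and \(n-2\) ancestral sets, as in step 7. Because it simply counts characters, it inherits whatever groupings the unweighted criterion prefers.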

Our sketch of the generic algorithm is meant to illustrate how the Dollo principle allows a kind of look-ahead in the hierarchical construction of the phylogeny (Fig. 2). Without some modification, however, the algorithm can produce counter-intuitive results.

Consider the input sets (1, 2), (3, 4), (1, 3), (2, 4), where all four characters are also in other sets. The largest potential ancestor (1, 2, 3, 4) is constructed by pairing (1, 2) with (3, 4), or pairing (1, 3) with (2, 4). Note that this ancestor is based entirely on Dollo characters.

The grouping of (1, 2) with (3, 4) in this construction is not intuitively satisfying from the phylogenetic viewpoint, since these two input sets have nothing in common. This suggests down-weighting the Dollo characters. For example, if we assign weight 1/2 to each Dollo character and weight 2 to each character in both G and H, the potential ancestor of (1, 2) and (3, 4) has weight only 2, while the ancestor of (1, 2) and (1, 3) has weight 3. It is the latter that will be chosen by the algorithm if we replace “maximum number” by “maximum weight”.
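The arithmetic of this example can be reproduced directly. The sketch below assumes, as stated above, that all four characters also occur in other eligible sets; the function name `ancestor_weight` is hypothetical.

```python
from fractions import Fraction


def ancestor_weight(g, h, elsewhere,
                    shared_weight=Fraction(2), dollo_weight=Fraction(1, 2)):
    """Total weight of the potential ancestor of g and h.

    Characters in both sets score `shared_weight`; Dollo characters (in exactly
    one of g, h and also in `elsewhere`) score `dollo_weight`.
    """
    shared = g & h
    dollo = (g ^ h) & elsewhere
    return shared_weight * len(shared) + dollo_weight * len(dollo)


# All four characters are assumed to occur in some other eligible set.
elsewhere = {1, 2, 3, 4}
w_disjoint = ancestor_weight({1, 2}, {3, 4}, elsewhere)  # four Dollo characters
w_overlap = ancestor_weight({1, 2}, {1, 3}, elsewhere)   # one shared, two Dollo
```

The disjoint pair scores \(4 \times 1/2 = 2\), while the overlapping pair scores \(2 + 2 \times 1/2 = 3\), so the pair sharing character 1 is preferred.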

A principled way of assigning weights might be \(2+\alpha q/m\) for characters in both G and H plus any other eligible vertex, and \(1+\beta q/m\) for a character in one of G or H plus any other eligible vertex, where q is the number of other eligible vertices out of m containing the character, and \(2\beta<\alpha <1\).
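A minimal sketch of such a weighting function follows. The default values of `alpha` and `beta` are arbitrary choices satisfying \(2\beta<\alpha<1\), and the treatment of the case \(q=0\) (weight 2 for a character in both G and H, weight 0 for an unvalidated character in only one of them) is our assumption.

```python
def character_weight(in_g, in_h, q, m, alpha=0.8, beta=0.3):
    """Weight of one character when pairing G and H.

    q is the number of other eligible vertices containing the character,
    out of m eligible vertices in total.
    """
    if not (2 * beta < alpha < 1):
        raise ValueError("weights must satisfy 2*beta < alpha < 1")
    if in_g and in_h:
        return 2 + alpha * q / m
    if (in_g or in_h) and q > 0:
        return 1 + beta * q / m  # Dollo character
    return 0.0  # absent from both, or present in one only and nowhere else


shared = character_weight(True, True, q=5, m=10)
dollo = character_weight(True, False, q=5, m=10)
```

Since \(q \le m\), a Dollo character under this scheme never weighs more than \(1+\beta\), which is below the minimum weight 2 of a character shared by G and H.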

4 Genomics Case: Generalized Adjacencies of Genes

4.1 Algorithm: Ancestor via Maximum Weight Matching (MWM)

In this version of the algorithm, the set corresponding to each given extant genome consists of all the generalized gene adjacencies in that genome. This includes pairs of adjacent genes, taking their orientations or DNA strandedness into account, but also includes pairs of genes that are not immediate neighbours, but that are separated by at most 7 other genes in the gene order on a chromosome.
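To make the notion concrete, the following sketch enumerates generalized adjacencies on one chromosome. The encoding is an assumption for illustration only: genes are signed integers (the sign giving the strand), each gene has a tail extremity `"t"` and a head extremity `"h"`, and an adjacency is the unordered pair of the facing extremities of two genes.

```python
def generalized_adjacencies(chromosome, max_gap=7):
    """Adjacencies between genes separated by at most `max_gap` other genes.

    `chromosome` is a list of non-zero signed integers in gene order.
    """
    adjacencies = set()
    for i, left_gene in enumerate(chromosome):
        # Extremity of the left gene that faces rightwards along the chromosome.
        right_end = (abs(left_gene), "h" if left_gene > 0 else "t")
        # j - i - 1 <= max_gap, i.e. j <= i + max_gap + 1.
        for j in range(i + 1, min(i + max_gap + 2, len(chromosome))):
            right_gene = chromosome[j]
            # Extremity of the right gene that faces leftwards.
            left_end = (abs(right_gene), "t" if right_gene > 0 else "h")
            adjacencies.add(frozenset((right_end, left_end)))
    return adjacencies


immediate = generalized_adjacencies([1, -2, 3], max_gap=0)
forward = generalized_adjacencies([1, -2, 3], max_gap=1)
# Reading the same chromosome from the other strand gives the same set.
reverse = generalized_adjacencies([-3, 2, -1], max_gap=1)
```

With this encoding the adjacency set does not depend on the direction in which the chromosome is read, which is convenient when comparing genomes.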

The ancestor genomes also contain adjacencies, not the simple union of the contents of the two daughter genomes, but only the best set of consistent adjacencies, namely the output of a MWM from among all the adjacencies in the two daughters plus certain adjacencies from other “eligible” genomes. The matching criterion ensures that all the adjacencies in a set associated with an ancestor are compatible with a linear ordering, as in a chromosome, although we are not concerned with actually building chromosomes here.

The definition of orthologous genes in the various input genomes, the construction of the sets of adjacencies, and the use of MWM are taken from the first steps of the Raccroche method [1,2,3,4,5], which is concerned with actual chromosome reconstruction of the ancestral genomes in a small phylogeny context. That is not our concern here; we focus instead on the branching structure of a phylogeny, namely the large phylogeny problem.

The MWM analysis was carried out using Joris van Rantwijk’s implementation [6] of Galil’s algorithm for maximum matching in general (not just bipartite) graphs [7], based on Edmonds’ “blossom” method, including the search for augmenting paths, and his “primal-dual” method for finding a matching of maximum weight [8].

  1. input n extant genomes
  2. \(i=1\)
  3. all n genomes are “eligible”
  4. while \(i\le n-3\)
    (a) for all pairs \((G,H)\) of the \(m=n-i+1\) eligible vertices, find the potential ancestral genome A as the Maximum Weight Matching of phylogenetically validated generalized adjacencies (in both G and H: weight 2; in one of G or H plus any other eligible vertex: weight 2; in both G and H plus any other eligible vertex: weight 3)
    (b) pick the pair \((G',H')\) with the highest Maximum Weight Matching score
    (c) ancestor genome A becomes eligible, and the other two, \(G'\) and \(H'\), become ineligible
    (d) two edges of the tree are defined: \(AG'\) and \(AH'\)
    (e) \(i=i+1\)
  5. Now \(i = n-2\), so that there are three eligible vertices \(G', H', K'\). Define three edges of the tree: \(AG', AH'\) and \(AK'\)
  6. calculate ancestral vertex A containing all phylogenetically validated adjacencies (in any two or all three of \(G',H'\) and \(K'\))

A down-weighting scheme for Dollo characters may also be adopted here, although we do not pursue it further.

The use of MWM is what distinguishes our method from Dollo parsimony and other adjacency methods. MWM ensures that the adjacencies that appear in a vertex set are consistent, i.e., that they can all be part of a chromosome, whereas other phylogenetic methods based on adjacencies do not take into account that some sets of adjacencies are inconsistent with a chromosome-based genome. These methods may base their results on a gene that is adjacent to three or four genes, while MWM will only retain adjacencies that allow a gene to be adjacent to at most two other genes.
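The effect of the matching constraint can be seen on a toy example. The sketch below is an exhaustive search usable only on very small graphs; it is a stand-in for illustration and is not the blossom-based implementation cited above. Vertices are gene extremities written as hypothetical strings such as `"1h"` (head of gene 1), and the edge weights are illustrative values in the style of step 4(a).

```python
def max_weight_matching(weighted_edges):
    """Exhaustive maximum weight matching over a list of (u, v, weight) edges.

    Returns (total_weight, matching). Exponential time: tiny inputs only.
    """
    def search(index, used):
        if index == len(weighted_edges):
            return 0, []
        # Option 1: leave this candidate adjacency out.
        best_weight, best_matching = search(index + 1, used)
        u, v, w = weighted_edges[index]
        # Option 2: keep it, provided neither extremity is already matched.
        if u not in used and v not in used:
            rest_weight, rest_matching = search(index + 1, used | {u, v})
            if rest_weight + w > best_weight:
                best_weight = rest_weight + w
                best_matching = [(u, v)] + rest_matching
        return best_weight, best_matching

    return search(0, frozenset())


# The head of gene 1 is claimed by two candidate adjacencies; only one can stay.
candidates = [
    ("1h", "2t", 3),
    ("1h", "3t", 2),
    ("2h", "3t", 2),
]
score, matching = max_weight_matching(candidates)
```

Because each extremity is matched at most once, each gene ends up adjacent to at most two others (one per extremity). Here the adjacency `("1h", "3t")` is discarded, and the retained adjacencies are consistent with the linear order 1, 2, 3.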

5 Application to Phylogenies of Three Plant Orders

Detailed references to all the genomes mentioned here are given in reference [5], including access codes for the CDS files of the genomes we use in the CoGe platform [9, 10].

The phylogenies that serve as validation of our constructs are by and large uncontroversial, based on up-to-date sources, mainly [11,12,13].

5.1 Asterales

From the family Asteraceae, we used the published genomes of safflower (Carthamus tinctorius) and artichoke (Cynara cardunculus) from the subfamily Carduoideae, lettuce (Lactuca sativa) and dandelion (Taraxacum mongolicum) from the subfamily Cichorioideae, and Mikania micrantha and Stevia rebaudiana from the subfamily Asteroideae.

Figure 3 and Table 1 show that our method partitioned the six genomes correctly into three groups.

Fig. 3. Partial phylogeny of the order Asterales (family Asteraceae) as correctly reconstructed by our algorithm, with haploid numbers of chromosomes. Labels on interior nodes indicate the algorithm steps at which they were created (cf. Table 1).

Table 1. Steps in searching for Asterales ancestors

5.2 Fagales

From the order Fagales, we used the published genomes of oak (Quercus robur) and beech (Fagus sylvatica) from the family Fagaceae, birch (Betula platyphylla) and hazelnut (Corylus mandshurica) from the family Betulaceae, and walnut (Juglans regia), pecan (Carya illinoinensis) and hickory (Carya cathayensis) from the family Juglandaceae.

Fig. 4. Partial phylogeny of the order Fagales, as correctly reconstructed by our algorithm, with haploid numbers of chromosomes. Labels on interior nodes indicate the algorithm steps at which they were created (cf. Table 2).

Table 2. Steps in searching for Fagales ancestors

Figure 4 and Table 2 show that our method partitioned the seven genomes correctly into three families.

5.3 Sapindales

While the results on the Asterales and Fagales are heartening, we cannot expect the method always to perform as well. This is illustrated by an analysis of the order Sapindales. From this order, we used the published genomes of cashew (Anacardium occidentale), mango (Mangifera indica) and pistachio (Pistacia vera) from the family Anacardiaceae, and maple (Acer catalpifolium), longan (Dimocarpus longan), lychee (Litchi chinensis) and yellowhorn (Xanthoceras sorbifolium) from the family Sapindaceae.

As seen in Fig. 5, except for the incorrect placement of yellowhorn, the method separates the two families as expected. Were this genome to be removed, the reconstructed tree would be identical to the known tree, aside from a permutation of the species within the Sapindaceae. Indeed, Table 3 shows that at the fourth step in the algorithm, yellowhorn was almost assigned to join the other Sapindaceae.

Fig. 5. Partial phylogeny of the order Sapindales with haploid numbers of chromosomes. Left is the known phylogeny and right is the reconstructed one. Labels on interior nodes at right indicate the algorithm steps at which they were created.

Table 3. Steps in searching for Sapindales ancestors

6 Discussion and Conclusions

The algorithmic reconstruction of ancient gene orders, and associated phylogenies, has a history of over three decades (e.g., [14,15,16,17]). The present work differs from all these in that it is situated in a paradigm [1,2,3,4,5] where only the strictly monoploid ancestors in a phylogeny are reconstructed, or strictly linear chromosomal fragments, whether or not such genomes actually existed or are just inherent in the basic chromosomal organization within the possibly re-occurring polyploid history of the group.

Although our approach builds a hierarchy by combining pairs of OTUs or already constructed subtrees, it is unusual in that it makes use, via Dollo, of a limited amount of information from outside these pairs. This is in effect a very partial look-ahead.

The goal of this paper was to present evidence that our method can construct accurate or plausible phylogenies, based entirely on comparative gene order. We were not preoccupied with questions of computing time. Indeed, since our reconstruction relies on running maximum weight matching (MWM) software on very large graphs, it is bound to be computationally expensive. Moreover, since the current experimental version uses MWM to exhaustively evaluate all possible pairs separately at each step in building the hierarchy, only a moderate number of OTUs can be input. There are, however, many possibilities for improving the efficiency: constraining the search space, using branch and bound techniques, saving a certain number of partial solutions in parallel, and other techniques.