Gene Order Phylogeny via Ancestral Genome Reconstruction Under Dollo

Xu, Qiaoji; Sankoff, David

doi:10.1007/978-3-031-36911-7_7

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13883)

Included in the following conference series:

RECOMB International Workshop on Comparative Genomics

Abstract

We present a proof of principle for a new kind of stepwise algorithm for unrooted binary gene-order phylogenies. This method incorporates a simple look-ahead inspired by Dollo’s law, while simultaneously reconstructing each ancestor (sometimes referred to as hypothetical taxonomic units “HTU”). We first present a generic version of the algorithm illustrating a necessary consequence of Dollo characters. In a concrete application we use generalized oriented gene adjacencies and maximum weight matching (MWM) to reconstruct fragments of monoploid ancestral genomes as HTUs. This is applied to three flowering plant orders while estimating phylogenies for these orders in the process. We discuss how to improve on the extensive computing times that would be necessary for this method to handle larger trees.

1 Introduction

In formal phylogenetics, Dollo models postulate that a character can only appear once in the course of evolution, although it may disappear from several descendent lineages. This idea has a number of combinatorial consequences, and suggests an algorithmic inference of phylogeny that differs significantly from standard approaches. In this paper, we present a proof of principle for such an algorithm in the context of unrooted binary trees.

Our approach is essentially hierarchical and agglomerative: it combines pairs of input taxa or already constructed subtrees. Crucially, however, it also makes use, via Dollo, of a limited amount of information from outside these pairs.

The input to the generic version of our algorithm consists of n taxa, sometimes referred to as observed taxonomic units (“OTUs”), corresponding to the terminal vertices of the phylogeny to be constructed. Each taxon is represented by a set of distinct characters, and the sets may overlap with each other to varying degrees. The strategy is to construct \(n-2\) output sets, each containing some of the same characters, corresponding to the non-terminal vertices, or HTUs, of a binary tree constructed at the same time, such that the Dollo condition, or at least some of its important consequences, is maintained.

In contrast to the generic version of the algorithm, for any specific formulation of a Dollo model, the algorithm must be modified so that the output sets conform to the requirements of the problem at hand. These modifications generally mean that Dollo is no longer a sufficient condition, but remains a necessary condition for a character to be included in an output set.

In the main part of this paper, the elements of an input set are all the proximities between oriented genes in one genome. The elements of each output set are chosen from a very large number of proximities, namely those satisfying the Dollo condition, but narrowed down to include only those consistent with an optimal linear ordering, i.e. fragments of chromosome. We have previously invoked these concepts in studying the “small phylogeny problem”, where the tree topology is given (e.g., [1,2,3,4,5]), but here we concentrate on the “large phylogeny problem”, where we actually construct this phylogeny.

We apply our method to three plant orders to compare the results with known phylogenies, with almost total agreement.

Finally we discuss possible improvements in efficiency and extensions beyond unrooted binary branching phylogenies.

2 Dollo’s Law in the Context of Unrooted Binary Trees

The idea that a character is gained only once, and can never be regained after it is lost, is realized in an unrooted tree by the property that the set of vertices containing the character is connected. This condition is necessary and sufficient, and it applies both to terminal vertices (degree 1) and to internal, or ancestral, vertices (degree 3 in an unrooted binary tree).
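The connectedness condition can be checked directly on a small tree. The following is a minimal sketch, assuming the tree is given as a list of edges and each vertex (terminal or internal) carries a set of characters; the function name `is_dollo_consistent` and the toy tree are hypothetical and only illustrate the condition.

```python
from collections import deque


def is_dollo_consistent(edges, content, character):
    """Return True if the vertices holding `character` form a connected subtree."""
    holders = {v for v, chars in content.items() if character in chars}
    if not holders:
        return True  # a character that appears nowhere violates nothing
    # Adjacency lists restricted to vertices that hold the character.
    neighbours = {v: set() for v in holders}
    for u, v in edges:
        if u in holders and v in holders:
            neighbours[u].add(v)
            neighbours[v].add(u)
    # Breadth-first search from an arbitrary holder.
    start = next(iter(holders))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in neighbours[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == holders


# Hypothetical unrooted binary tree: terminals a, b, c, d; internal vertices X, Y.
edges = [("a", "X"), ("b", "X"), ("X", "Y"), ("c", "Y"), ("d", "Y")]
content = {
    "a": {1, 2}, "b": {1}, "c": {2, 3}, "d": {3},
    "X": {1, 2}, "Y": {2, 3},
}
```

In this toy tree, character 2 is held by a, X, Y and c, which form a path, so the condition holds. Removing character 2 from X would disconnect a from the other holders and violate it.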

Formally, the connectedness condition can be satisfied by a set consisting only of non-terminal vertices of a tree. For phylogenetic inference, however, we require that a character be present in the input sets of at least two terminal vertices, i.e., that it be visible to the observer. Moreover, if a character were in only one terminal set, its presence could not be required in any of the output sets at the non-terminal vertices.

In an unrooted binary branching tree, each non-terminal vertex subtends three subtrees, as in Fig. 1. For a data set to be used in constructing a phylogenetic tree and the output sets, clearly each character must be present in at least two of the three subtrees, as illustrated in Fig. 1. More precisely, each character must be present at least in one terminal vertex set in at least two of the three subtrees. We call these characters “phylogenetically validated”.
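As an illustration of this definition, the sketch below computes the phylogenetically validated characters at one non-terminal vertex. It assumes each of the three subtrees is supplied as a list of terminal character sets; the function name `validated_characters` is hypothetical.

```python
def validated_characters(subtree_a, subtree_b, subtree_c):
    """Characters present in a terminal set of at least two of three subtrees.

    Each argument is a list of character sets, one per terminal vertex
    of the corresponding subtree.
    """
    # Pool the terminal sets of each subtree.
    present = [set().union(*leaves) for leaves in (subtree_a, subtree_b, subtree_c)]
    candidates = present[0] | present[1] | present[2]
    # Keep a character only if at least two subtrees contain it.
    return {c for c in candidates if sum(c in p for p in present) >= 2}


# Toy data: character 2 is shared by subtrees A and B, character 3 by B and C.
subtree_a = [{1, 2}, {1}]
subtree_b = [{2, 3}]
subtree_c = [{3, 4}]
validated = validated_characters(subtree_a, subtree_b, subtree_c)
```

Characters 1 and 4 each occur in a single subtree only, so they are not validated at this vertex even though character 1 occurs in two terminal sets.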

3 Generic Algorithm

1. input n sets of characters
2. \(i=1\)
3. all n sets are “eligible” vertices
4. while \(i\le n-3\)
   (a) *for each pair (G, H) of the \(m=n-i+1\) eligible vertices, calculate a potential ancestral vertex A containing all phylogenetically validated characters (in both G and H, or in one of G or H plus any other eligible vertex).
   (b) pick the pair \((G',H')\) with the maximum number of characters in their potential ancestor A.
   (c) ancestor vertex A becomes eligible, and the other two, \(G'\) and \(H'\), become ineligible
   (d) two edges of the tree are defined: \(AG'\) and \(AH'\)
   (e) \(i=i+1\)
5. Now \(i = n-2\), so that there are three eligible vertices \(G', H', K'\). Define three edges of the tree: \(AG', AH'\) and \(AK'\)
6. calculate ancestral vertex A containing all phylogenetically validated characters (in any two or all three of \(G',H'\) and \(K'\))
7. output all \(2n-3\) tree edges and all \(n-2\) selected output (ancestral, or HTU) sets.

The asterisked step 4(a) may be interpreted in at least two different ways. In requiring that a character be present in a subtree, we may mean that the character must be present in some input set associated with a terminal vertex of that subtree, or we may require something stronger, namely that the character be present in the eligible (ancestral) vertex of that subtree.

In constructing a potential ancestor A of two input sets G and H, we define a Dollo character to be one in either G or H, but not both, as well as in some eligible vertex other than G or H.
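The generic procedure can be sketched in a few lines. The version below is an illustrative reading, not the authors' code: it adopts the stronger interpretation of step 4(a) (a character must be in an eligible vertex, not merely in some terminal below it), breaks ties by taking the first pair in sorted name order, and names the HTUs `A1`, `A2`, and so on. All function and variable names are hypothetical.

```python
from itertools import combinations


def potential_ancestor(g, h, others):
    """Characters in both g and h, or in one of them plus another eligible set."""
    elsewhere = set().union(*others) if others else set()
    return (g & h) | ((g | h) & elsewhere)


def generic_dollo_tree(inputs):
    """inputs: dict name -> set of characters, with at least 3 entries.

    Returns (edges, htus): a list of tree edges and a dict of ancestral sets.
    """
    eligible = dict(inputs)
    edges, htus = [], {}
    step = 0
    while len(eligible) > 3:
        best = None
        for g, h in combinations(sorted(eligible), 2):
            others = [s for k, s in eligible.items() if k not in (g, h)]
            anc = potential_ancestor(eligible[g], eligible[h], others)
            if best is None or len(anc) > len(best[2]):
                best = (g, h, anc)
        g, h, anc = best
        step += 1
        name = f"A{step}"
        htus[name] = anc
        edges += [(name, g), (name, h)]
        # The chosen pair becomes ineligible; its ancestor becomes eligible.
        del eligible[g], eligible[h]
        eligible[name] = anc
    # Final step: join the last three eligible vertices to one ancestor.
    step += 1
    name = f"A{step}"
    sets = list(eligible.values())
    htus[name] = {c for c in set().union(*sets) if sum(c in s for s in sets) >= 2}
    edges += [(name, k) for k in sorted(eligible)]
    return edges, htus
```

For n input sets this produces \(2n-3\) edges and \(n-2\) ancestral sets, in agreement with the final output step of the algorithm.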

Our sketch of the generic algorithm is meant to illustrate how the Dollo principle allows a kind of look-ahead in the hierarchical construction of the phylogeny (Fig. 2). Without some modification, however, the algorithm can result in some counter-intuitive results.

Consider the input sets (1, 2), (3, 4), (1, 3), (2, 4), where all four characters are also in other sets. The largest potential ancestor (1, 2, 3, 4) is constructed by pairing (1, 2) with (3, 4), or pairing (1, 3) with (2, 4). Note that this ancestor is based entirely on Dollo characters.

The grouping of (1, 2) with (3, 4) in this construction is not intuitively satisfying from the phylogenetic viewpoint since these two input sets have nothing in common. This suggests down-weighting the Dollo characters. For example, if we assign weight 1/2 to Dollo characters compared to weight 2 to a character in both G and H, the potential ancestor of (1, 2) and (3, 4) only has weight 2, while the ancestor of (1, 2) and (1, 3) has weight 3. It is the latter that will be chosen by the algorithm if we replace “maximum number” by “maximum weight”.
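The arithmetic in this example can be reproduced with a short sketch, assuming the four input sets above are the only eligible vertices. The function `ancestor_weight` and the dictionary keys are hypothetical names; the weights 2 and 1/2 are those given in the text.

```python
from fractions import Fraction


def ancestor_weight(g, h, others, shared_weight=2, dollo_weight=Fraction(1, 2)):
    """Total weight of the potential ancestor of g and h."""
    elsewhere = set().union(*others) if others else set()
    shared = g & h                          # characters in both G and H
    dollo = ((g | h) - shared) & elsewhere  # in one of G or H plus another vertex
    return shared_weight * len(shared) + dollo_weight * len(dollo)


sets = {"s12": {1, 2}, "s34": {3, 4}, "s13": {1, 3}, "s24": {2, 4}}


def weight_of(x, y):
    """Weight of the potential ancestor of two named input sets."""
    others = [s for k, s in sets.items() if k not in (x, y)]
    return ancestor_weight(sets[x], sets[y], others)


w_disjoint = weight_of("s12", "s34")  # four Dollo characters at 1/2 each
w_overlap = weight_of("s12", "s13")   # one shared character plus two Dollo characters
```

Here `w_disjoint` evaluates to 2 and `w_overlap` to 3, matching the values in the text, so the weighted criterion prefers the pair that actually shares a character.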

A principled way of assigning weights might be \(2+\alpha q/m\) for characters in both G and H plus any other eligible vertex, and \(1+\beta q/m\) for a character in one of G or H plus any other eligible vertex, where q is the number of other eligible vertices out of m containing the character, and \(2\beta<\alpha <1\).
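This weighting can be written as a small function. The sketch below is only one way to encode it; the default values of `alpha` and `beta` are arbitrary assumptions chosen to satisfy \(2\beta<\alpha<1\), and the function name is hypothetical.

```python
def character_weight(in_g, in_h, q, m, alpha=0.5, beta=0.2):
    """Weight of one character when scoring the potential ancestor of G and H.

    q is the number of other eligible vertices containing the character,
    and m is the current number of eligible vertices.
    """
    if not 2 * beta < alpha < 1:
        raise ValueError("weights must satisfy 2*beta < alpha < 1")
    if in_g and in_h:
        return 2 + alpha * q / m
    if (in_g or in_h) and q >= 1:
        return 1 + beta * q / m  # Dollo character
    return 0.0  # not phylogenetically validated
```

For example, with \(m=4\) and \(q=2\), a shared character receives \(2+0.5\cdot 2/4=2.25\) and a Dollo character \(1+0.2\cdot 2/4=1.1\) under these assumed parameter values.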

4 Genomics Case: Generalized Adjacencies of Genes

4.1 Algorithm: Ancestor via Maximum Weight Matching (MWM)

In this version of the algorithm, the set corresponding to each given extant genome contains all the generalized gene adjacencies in that genome. These include pairs of adjacent genes, taking their orientation (DNA strandedness) into account, but also pairs of genes that are not immediate neighbours and are separated by at most 7 other genes in the gene order on a chromosome.
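To make the definition concrete, the sketch below enumerates generalized adjacencies on one chromosome. The encoding is an assumption made for illustration: each gene is a string with a `+` or `-` sign, each gene has a tail (`_t`) and a head (`_h`) extremity, and an adjacency is the unordered pair formed by the right extremity of the left gene and the left extremity of the right gene. The function names are hypothetical.

```python
def extremities(signed_gene):
    """Return (left, right) extremities of a signed gene read left to right."""
    name = signed_gene.lstrip("+-")
    if signed_gene.startswith("-"):
        return (name + "_h", name + "_t")  # reversed gene: head comes first
    return (name + "_t", name + "_h")


def generalized_adjacencies(chromosome, max_between=7):
    """All oriented gene pairs separated by at most `max_between` other genes."""
    adjacencies = set()
    for i, left_gene in enumerate(chromosome):
        _, right_end = extremities(left_gene)
        # j - i - 1 genes lie between positions i and j.
        for j in range(i + 1, min(i + max_between + 2, len(chromosome))):
            left_end, _ = extremities(chromosome[j])
            adjacencies.add(frozenset((right_end, left_end)))
    return adjacencies


chromosome = ["+a", "-b", "+c"]
immediate = generalized_adjacencies(chromosome, max_between=0)
generalized = generalized_adjacencies(chromosome, max_between=1)
```

With `max_between=0` only the two immediate adjacencies are produced; allowing one intervening gene adds the pair linking the head of a to the tail of c.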

The ancestor genomes also contain adjacencies, not the simple union of the contents of the two daughter genomes, but only the best set of consistent adjacencies, namely the output of a MWM from among all the adjacencies in the two daughters plus certain adjacencies from other “eligible” genomes. The matching criterion ensures that all the adjacencies in a set associated with an ancestor are compatible with a linear ordering, as in a chromosome, although we are not concerned with actually building chromosomes here.
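The role of the matching can be seen on a toy graph. The sketch below is a brute-force stand-in suitable only for a handful of edges; it is not the blossom-based algorithm used for real data. It assumes the graph's vertices are gene extremities (tail `_t`, head `_h`) and its weighted edges are candidate adjacencies, so that a matching uses each extremity at most once. The function name and the example weights are hypothetical.

```python
def max_weight_matching(weighted_edges):
    """Exhaustive maximum weight matching.

    weighted_edges: list of (u, v, weight) tuples.
    Returns (total_weight, chosen_edges). Exponential time: toy inputs only.
    """
    if not weighted_edges:
        return 0, []
    (u, v, w), rest = weighted_edges[0], weighted_edges[1:]
    # Option 1: leave the first edge out.
    best_total, best_edges = max_weight_matching(rest)
    # Option 2: take it and discard every edge sharing an endpoint with it.
    compatible = [e for e in rest if u not in e[:2] and v not in e[:2]]
    total, edges = max_weight_matching(compatible)
    if total + w > best_total:
        best_total, best_edges = total + w, [(u, v, w)] + edges
    return best_total, best_edges


# Candidate adjacencies: the head of gene b is claimed by both c and d.
candidates = [
    ("a_h", "b_t", 3),
    ("b_h", "c_t", 2),
    ("b_h", "d_t", 3),
    ("c_h", "d_t", 2),
]
score, chosen = max_weight_matching(candidates)
```

In this example the heaviest conflicting edge (`b_h`, `d_t`) is dropped in favour of two lighter edges that together weigh more, and the retained adjacencies are consistent with the single linear order a, b, c, d.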

The definition of orthologous genes in the various input genomes, the construction of the sets of adjacencies, and the use of MWM are taken from the first steps of the Raccroche method [1,2,3,4,5], which is concerned with actual chromosome reconstruction of the ancestral genomes in a small phylogeny context. This is not our concern here, which is the branching structure of a phylogeny, namely the large phylogeny problem.

The MWM analysis was carried out using Joris van Rantwijk’s implementation [6] of Galil’s algorithm for maximum matching in general (not just bipartite) graphs [7], based on Edmonds’ “blossom” method, including the search for augmenting paths, and his “primal-dual” method for finding a matching of maximum weight [8].

1. input n extant genomes
2. \(i=1\)
3. all n genomes are “eligible”
4. while \(i\le n-3\)
   (a) for all pairs (G, H) of the \(m=n-i+1\) eligible vertices, find potential ancestral genome A as the Maximum Weight Matching of phylogenetically validated generalized adjacencies (in both G and H: weight 2; in one of G or H plus any other eligible vertex: weight 2; in both G and H plus any other eligible vertex: weight 3).
   (b) pick the pair \((G',H')\) with the highest Maximum Weight Matching score.
   (c) ancestor genome A becomes eligible, and the other two, \(G'\) and \(H'\), become ineligible.
   (d) two edges of the tree are defined: \(AG'\) and \(AH'\)
   (e) \(i=i+1\)
5. Now \(i = n-2\), so that there are three eligible vertices \(G', H', K'\). Define three edges of the tree: \(AG', AH'\) and \(AK'\)
6. calculate ancestral vertex A containing all phylogenetically validated adjacencies (in any two or all three of \(G',H'\) and \(K'\))
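The weighting in step 4(a) can be sketched as follows. This is an illustrative reading of the three cases, assuming each genome is represented as a set of adjacencies (any hashable labels will do); the function name `adjacency_weights` is hypothetical. The resulting weighted adjacencies would then be passed to a maximum weight matching routine.

```python
def adjacency_weights(g, h, others):
    """Assign weights to the phylogenetically validated adjacencies of G and H.

    g, h: adjacency sets of the two genomes being paired.
    others: adjacency sets of the remaining eligible genomes.
    """
    elsewhere = set().union(*others) if others else set()
    weights = {}
    for adj in g | h:
        in_both = adj in g and adj in h
        in_other = adj in elsewhere
        if in_both and in_other:
            weights[adj] = 3  # in G, in H, and in another eligible genome
        elif in_both or in_other:
            weights[adj] = 2  # in both, or in one plus another eligible genome
        # adjacencies seen in only one genome are not validated and are skipped
    return weights


# Toy adjacency labels standing in for oriented gene pairs.
g = {"ab", "bc", "cd"}
h = {"ab", "bc", "ce"}
others = [{"ab", "cd"}]
weights = adjacency_weights(g, h, others)
```

In this toy case `ab` receives weight 3, `bc` and `cd` receive weight 2, and `ce`, seen only in H, is excluded.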

A down-weighting scheme for Dollo characters may also be adopted here, although we do not pursue this further.

The use of MWM is what distinguishes our method from Dollo parsimony and other adjacency methods. MWM ensures that the adjacencies that appear in a vertex set are consistent, that they can all be part of a chromosome, whereas all other phylogenetic methods based on adjacencies do not care that some sets of adjacencies are not consistent with a chromosome-based genome. These methods may base their results on a gene that is adjacent to three or four genes, while MWM will only retain adjacencies that allow a gene to be adjacent to only two other genes.

5 Application to Phylogenies of Three Plant Orders

Detailed references to all the genomes mentioned here are given in reference [5], including access codes for the CDS files of the genomes we use in the CoGe platform [9, 10].

The phylogenies that serve as validation of our constructs are by and large uncontroversial, based on up-to-date sources, mainly [11,12,13].

5.1 Asterales

From the family Asteraceae, we used the published genomes of safflower (Carthamus tinctorius) and artichoke (Cynara cardunculus) from the subfamily Carduoideae, lettuce (Lactuca sativa) and dandelion (Taraxacum mongolicum) from the subfamily Cichorioideae, and Mikania micrantha and Stevia rebaudiana from the subfamily Asteroideae.

Figure 3 and Table 1 show that our method partitioned the six genomes correctly into three groups.

Table 1. Steps in searching for Asterales ancestors

5.2 Fagales

From the order Fagales, we used the published genomes of oak (Quercus robur) and beech (Fagus sylvatica) from the family Fagaceae, birch (Betula platyphylla) and hazelnut (Corylus mandshurica) from the family Betulaceae, and walnut (Juglans regia), pecan (Carya illinoinensis) and hickory (Carya cathayensis) from the family Juglandaceae.

Table 2. Steps in searching for Fagales ancestors

Figure 4 and Table 2 show that our method partitioned the seven genomes correctly into three families.

5.3 Sapindales

While the results on the Asterales and Fagales are heartening, we cannot expect the method to always perform as well. This is illustrated by an analysis of the order Sapindales. From this order, we used the published genomes of cashew (Anacardium occidentale), mango (Mangifera indica) and pistachio (Pistacia vera) from the family Anacardiaceae, and maple (Acer catalpifolium), longan (Dimocarpus longan), lychee (Litchi chinensis) and yellowhorn (Xanthoceras sorbifolium) from the family Sapindaceae.

As seen in Fig. 5, except for the incorrect placement of yellowhorn, the method separates the two families as expected. Were this genome to be removed, the reconstructed tree would be identical to the known tree, aside from a permutation of the species within the Sapindaceae. Indeed, Table 3 shows that at the fourth step in the algorithm, yellowhorn was almost assigned to join the other Sapindaceae.

Table 3. Steps in searching for Sapindales ancestors

6 Discussion and Conclusions

The algorithmic reconstruction of ancient gene orders, and associated phylogenies, has a history of over three decades (e.g., [14,15,16,17]). The present work differs from all these in that it is situated in a paradigm [1,2,3,4,5] where only the strictly monoploid ancestors in a phylogeny are reconstructed, or strictly linear chromosomal fragments, whether or not such genomes actually existed or are just inherent in the basic chromosomal organization within the possibly re-occurring polyploid history of the group.

Although our approach builds a hierarchy by combining pairs of OTUs or already constructed subtrees, it is unusual in that it makes use, via Dollo, of a limited amount of information from outside these pairs. This is in effect a very partial look-ahead.

The goal of this paper was to present evidence that our method can construct accurate or plausible phylogenies, based entirely on comparative gene order. We were not preoccupied with questions of computing time. Indeed, since our reconstruction is based on maximum weight matching (MWM) software on very large graphs, it is bound to be computationally expensive. Moreover, since the current experimental version uses MWM to exhaustively evaluate all possibilities separately at each step in building a hierarchy, only a moderate number of OTUs can be input. There are, however, many possibilities for improving efficiency: constraining the search space, using branch and bound techniques, saving a certain number of partial solutions in parallel, and other techniques.

References

1. Xu, Q., Jin, L., Zheng, C., Leebens-Mack, J.H., Sankoff, D.: RACCROCHE: ancestral flowering plant chromosomes and gene orders based on generalized adjacencies and chromosomal gene co-occurrences. In: Jha, S.K., Măndoiu, I., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds.) ICCABS 2020. LNCS, vol. 12686, pp. 97–115. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-79290-9_9
2. Xu, Q., Jin, L., Leebens-Mack, J.H., Sankoff, D.: Validation of automated chromosome recovery in the reconstruction of ancestral gene order. Algorithms 14(6), 160 (2021)
3. Chanderbali, A.S., Jin, L., Xu, Q., et al.: Buxus and Tetracentron genomes help resolve eudicot genome history. Nat. Commun. 13, 643 (2022). https://doi.org/10.1038/s41467-022-28312-w
4. Xu, Q., et al.: Ancestral flowering plant chromosomes and gene orders based on generalized adjacencies and chromosomal gene co-occurrences. J. Comput. Biol. 28(11), 1156–79 (2021)
5. Xu, Q., Jin, L., Zheng, C., Zhang, X., Leebens-Mack, J., Sankoff, D.: From comparative gene content and gene order to ancestral contigs, chromosomes and karyotypes. Sci. Rep. 13, 6095 (2023)
6. van Rantwijk, J.: Maximum Weighted Matching (2008). http://jorisvr.nl/article/maximum-matching
7. Galil, Z.: Efficient algorithms for finding maximum matching in graphs. ACM Comput. Surv. 18, 23–38 (1986)
8. Edmonds, J.: Paths, trees, and flowers. Can. J. Math. 17, 449–67 (1965)
9. Lyons, E., Freeling, M.: How to usefully compare homologous plant genes and chromosomes as DNA sequences. Plant J. 53, 661–673 (2008)
10. Lyons, E., et al.: Finding and comparing syntenic regions among Arabidopsis and the outgroups papaya, poplar and grape: CoGe with rosids. Plant Physiol. 148, 1772–1781 (2008)
11. Published Plant Genomes. Usadel lab, Forschungszentrum Jülich, Heinrich Heine University, Düsseldorf (2022). https://www.plabipd.de/
12. Stevens, P.F.: Angiosperm Phylogeny Website. Version 14 (2017). http://www.mobot.org/MOBOT/research/APweb/
13. Chase, M.W., et al.: An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG IV. Bot. J. Linn. Soc. 181, 1–20 (2016)
14. Sankoff, D., Leduc, G., Antoine, N., Paquin, B., Lang, B.F., Cedergren, R.: Gene order comparisons for phylogenetic inference: evolution of the mitochondrial genome. Proc. Natl. Acad. Sci. 89(14), 6575–9 (1992)
15. Moret, B.M., Warnow, T.: Advances in phylogeny reconstruction from gene order and content data. Methods Enzymol. 395, 673–700 (2005)
16. Hu, F., Lin, Y., Tang, J.: MLGO: phylogeny reconstruction and ancestral inference from gene-order data. BMC Bioinf. 15, 1–6 (2014)
17. Perrin, A., Varré, J.S., Blanquart, S., Ouangraoua, A.: ProCARs: progressive reconstruction of ancestral gene orders. BMC Genomics 16(Suppl 5), S6 (2015)

Author information

Authors and Affiliations

Department of Mathematics and Statistics, University of Ottawa, 150 Louis Pasteur Pvt., Ottawa, ON, K1N 6N5, Canada
Qiaoji Xu & David Sankoff

Corresponding author

Correspondence to David Sankoff.

Editor information

Editors and Affiliations

Freie Universität Berlin, Berlin, Germany
Katharina Jahn
Comenius University, Bratislava, Slovakia
Tomáš Vinař

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this paper

Cite this paper

Xu, Q., Sankoff, D. (2023). Gene Order Phylogeny via Ancestral Genome Reconstruction Under Dollo. In: Jahn, K., Vinař, T. (eds) Comparative Genomics. RECOMB-CG 2023. Lecture Notes in Computer Science, vol 13883. Springer, Cham. https://doi.org/10.1007/978-3-031-36911-7_7

DOI: https://doi.org/10.1007/978-3-031-36911-7_7
Published: 13 July 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36910-0
Online ISBN: 978-3-031-36911-7
eBook Packages: Computer Science (R0)
