Phylogenetics Algorithms and Applications
Phylogenetics is a powerful approach for tracing the evolution of present-day species. By studying phylogenetic trees, scientists gain a better understanding of how species have evolved and can explain the similarities and differences among them. Phylogenetic study can also help in analysing the evolution of, and the similarities among, diseases and viruses, and can further help in prescribing vaccines against them. This paper explores computational solutions for building the phylogeny of species and highlights the benefits of alignment-free methods of phylogenetics. It also discusses the application of phylogenetic study to disease diagnosis and evolution.
Keywords: Phylogenetics, Cancer evolution, Sequence analysis
1 Introduction
Phylogenetics can be considered one of the best tools for understanding the spread of contagious diseases, for example, the transmission of the human immunodeficiency virus (HIV) and the origin and subsequent evolution of the severe acute respiratory syndrome (SARS)-associated coronavirus (SCoV). Earlier, morphological traits were used for assessing similarities between species and building phylogenetic trees. Presently, phylogenetics relies on information extracted from genetic material such as deoxyribonucleic acid (DNA), ribonucleic acid (RNA) or protein sequences. Methods used for phylogenetic inference have changed drastically during the past two decades, from alignment-based to alignment-free methods. This paper reviews methods of phylogenetic tree construction, from character-based to distance-based methods and from alignment-based to alignment-free methods. A brief review of the applications of phylogenetic trees in cancer studies is also given.
2 Literature Review
A phylogenetic tree can be unrooted or rooted; a rooted tree implies a direction corresponding to evolutionary time, and the species at the leaves of the tree correspond to present-day species. Species can be expressed as DNA strings formed by combining the four nucleotides A, T, C and G (adenine, thymine, cytosine and guanine). The literature reports various string processing algorithms that can quickly analyse these DNA and RNA sequences and build a phylogeny of sequences or species based on their similarity and dissimilarity. A high similarity between two sequences usually implies significant functional or structural likeness, and such sequences are closely related in the phylogenetic tree. To get more precise information about the extent of similarity to other sequences stored in a database, we must be able to compare a sequence quickly with a set of sequences. For this, we need to perform multiple sequence comparison. Dynamic programming facilitates this comparison through alignment methods, but at a higher computational cost. Moreover, the iterative computational steps limit its utility for long sequences. Alignment-free methods overcome this limitation, as they use alternative metrics such as word frequency or sequence entropy to find the similarity between sequences.
3 Methods of Phylogenetic Tree Construction
Phylogenetic tree generation typically starts with sequence alignment, and the resulting tree reveals how the alignment can influence tree formation. Alignment-based methodologies are probably the most widely used tools in sequence analysis problems. They consist of arranging two sequences one on top of the other to highlight their common symbols and substrings. An alignment method depends on alignment parameters, including insertions, deletions and gaps, which play a pivotal role in the construction of the phylogenetic tree. A phylogenetic tree is formed as the outcome of sequence analysis performed on DNA or RNA strings. Sequence comparison reveals the patterns of shared history between species, helping in the prediction of ancestral states. The comparison of sequences also helps in understanding the biology of living organisms, which is required to find similarities and relationships among species. For sequence comparison, we can follow alignment-based or alignment-free methods [3, 6, 7].
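To make the role of matches, mismatches and gaps concrete, the following is a minimal sketch of global pairwise alignment by dynamic programming, in the style of the Needleman-Wunsch method. The function name `needleman_wunsch` and the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions and are not taken from the paper.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for placing a[i-1] over b[j-1] in the same column.
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # match or mismatch
                score[i - 1][j] + gap,            # gap in b (deletion)
                score[i][j - 1] + gap,            # gap in a (insertion)
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)   # 1
print(top)     # ACGT
print(bottom)  # A-GT
```

The table has (n + 1) x (m + 1) cells, so time and memory grow with the product of the two sequence lengths. This is one reason alignment becomes costly for long sequences.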
3.1 Sequence Alignment
3.2 Character-Based Methods
Table 1 Comparison of different phylogenetic tree construction methods
Appropriate for very similar sequences and a small number of sequences
Very time-consuming as it tests all possible trees
Parsimony may fail for diverged sequences
Suffers from long-branch attraction
Predicts the evolutionary tree that minimizes the number of steps required to generate the observed variation in the sequences
The tree is built with the fewest changes required to explain the differences observed in the data
Suitable for very dissimilar sequences
Hypotheses about evolutionary relationships can be formulated
More accurate phylogenetic trees can be constructed for a small number of taxa in a reasonable time frame
A slow search algorithm will lead to slow response
Takes a long time for large datasets
It tries to find a model that has the highest probability to generate the input sequence under a given evolutionary model
Faster than the character-based method
They are fast and can be used with a variety of models
Conversion from sequence data to distance data leads to loss of information
Provides an unrooted tree and a single resultant tree
Reliable for related sequences
Evolution rate is constant in all branches
UPGMA provides a rooted tree
Less sensitive to variations in evolutionary rate
Dependent on the model used to obtain the distance matrix
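As a rough illustration of the parsimony criterion in the comparison above (choosing the tree that needs the fewest changes), the sketch below counts the minimum number of state changes on a fixed rooted binary tree using Fitch's small-parsimony procedure. The tree encoding (nested tuples of leaf names), the function names and the toy alignment are assumptions made for this example only.

```python
def fitch_site(tree, states):
    """Return (possible_states, changes) for one alignment column.

    tree   -- a leaf name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping each leaf name to its character in this column
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch_site(tree[0], states)
    right_set, right_changes = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state, so no extra change is needed.
        return common, left_changes + right_changes
    # The children disagree: keep both options and count one change.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch change counts over all columns of an aligned dataset."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical aligned sequences for four taxa.
alignment = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACT"}
print(parsimony_score((("A", "B"), ("C", "D")), alignment))  # 2
print(parsimony_score((("A", "C"), ("B", "D")), alignment))  # 4
```

The first topology explains the data with fewer changes, so parsimony would prefer it. A full search has to score many candidate topologies in this way, which is consistent with the time cost noted in the comparison.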
3.3 Distance-Based Methods
Distance-based methods use the dissimilarity (the distance) between two sequences to construct trees. They are much less computationally intensive than character-based methods and are mostly accurate, as they take mutations into account. For tree generation, hierarchical clustering is generally used, in which dendrograms (cluster trees) are created. Table 1 briefly compares various phylogenetic tree construction methods.
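The hierarchical clustering step can be sketched with UPGMA, one of the distance-based methods named in Table 1. This is a minimal version under stated assumptions: the function name `upgma`, the nested-tuple tree encoding and the toy distance values are hypothetical, and the input distances are assumed to be precomputed by some other method.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names -- list of taxon labels
    dist  -- dict mapping frozenset({a, b}) to the distance between a and b
    Returns (tree, merges): tree is a nested tuple, merges lists each new
    cluster with the height at which it was formed.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        new = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # The distance to the merged cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[new] = size_a + size_b
        merges.append((new, height))
    return merges[-1][0], merges


# Hypothetical pairwise distances between four taxa.
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(p): v for p, v in pairs.items()}
tree, merges = upgma(["A", "B", "C", "D"], dist)
print(tree)                     # ('D', ('C', ('A', 'B')))
print([h for _, h in merges])   # [1.0, 3.0, 5.0]
```

Because each merge is placed at half the cluster distance, the result is a rooted tree in which all leaves are equidistant from the root. This reflects the constant-rate assumption listed for UPGMA in Table 1.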
4 Alignment-Based Versus Alignment-Free Sequence Comparison
Multiple alignment of related sequences often yields the most helpful information on their phylogeny. However, it can produce incorrect results when applied to more divergent or rearranged sequences. Some computationally intensive multiple alignment methods align sequences strictly in the order in which they receive them. Other multiple sequence alignment methods emphasize that more closely related sequences should be aligned first. As a result, sequences that are less related to one another but nevertheless share a common ancestor may be clustered separately. This implies that the sequences may be aligned more accurately and yet yield an incorrect phylogeny. Alignment can provide an optimized tree if a recursive approach is followed; however, this increases the complexity of the problem. If the sequences differ greatly in length, the alignment performance significantly affects tree generation.
The use of dynamic programming in alignment makes the computation more expensive, and the iterative steps limit its utility for large datasets. Therefore, consistent efforts have been made to develop and improve multiple sequence alignment methods that support variable-length sequences with high accuracy and align a larger number of sequences simultaneously. Based on the considerations discussed above, alignment quality affects the relationships represented in a phylogenetic tree. Because of these problems with alignment-based phylogeny, the importance of alignment-free methods is apparent.
4.1 Alignment-Free Methods for Sequence Analysis
Alignment-free methods proposed in recent years can be classified into various categories, as shown in Fig. 1. These include k-tuple methods based on word frequencies and methods that represent the sequence without using word frequencies, i.e. compression algorithms, probabilistic methods and information theory-based methods. In the k-tuple method, a genetic sequence is represented by a frequency vector of fixed-length subsequences, and similarity or dissimilarity measures are computed from these frequency vectors. The probabilistic methods represent a sequence using the transition matrix of a Markov chain of a pre-specified order, and two sequences are compared by finding the distance between their transition matrices. Graphical representations, comprising 2D, 3D or even 20D methods, provide an easy way to view, sort and compare various sequences. Graphical representation further helps in recognizing major characteristics shared by similar biological sequences.
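The probabilistic approach described above can be sketched for a first-order Markov chain over DNA. The choice of order 1, the pseudocount used to avoid zero rows, and the Euclidean (Frobenius) distance between matrices are assumptions made for illustration; the paper does not specify which order or distance is used.

```python
from collections import Counter
import math

ALPHABET = "ACGT"


def transition_matrix(seq, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current)."""
    counts = Counter(zip(seq, seq[1:]))  # counts of adjacent symbol pairs
    matrix = {}
    for x in ALPHABET:
        row_total = sum(counts[(x, y)] for y in ALPHABET)
        row_total += pseudocount * len(ALPHABET)
        matrix[x] = {
            y: (counts[(x, y)] + pseudocount) / row_total for y in ALPHABET
        }
    return matrix


def matrix_distance(p, q):
    """Euclidean distance between two transition matrices."""
    return math.sqrt(
        sum((p[x][y] - q[x][y]) ** 2 for x in ALPHABET for y in ALPHABET)
    )


# Hypothetical sequences: two share a repeating pattern, one does not.
m1 = transition_matrix("ACGTACGTACGT")
m2 = transition_matrix("ACGTACGAACGT")
m3 = transition_matrix("AAAAAAAAAAAA")
print(matrix_distance(m1, m2) < matrix_distance(m1, m3))  # True
```

Each sequence is reduced to a fixed 4 x 4 matrix regardless of its length, so sequences of different lengths can be compared without aligning them.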
As discussed, the k-tuple method uses k-words to characterize the compositional features of a sequence numerically. A biological sequence is converted into a vector or a matrix of word frequencies. k-word frequencies are fast to compute and can be applied to full sequences. However, a large value of k poses a challenge in computing time and space, and k-word methods underestimate or even ignore the importance of where a word occurs in the sequence. The string-based distance measure uses substring matches with k mismatches.
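A minimal sketch of the k-tuple idea follows: each sequence becomes a vector of normalized k-word frequencies, and two sequences are compared by the distance between their vectors. The function names, the default k = 2 and the use of Euclidean distance are illustrative assumptions rather than the specific measure used in the literature reviewed here.

```python
from collections import Counter
from itertools import product
import math


def kmer_vector(seq, k=2, alphabet="ACGT"):
    """Return the normalized frequency of every possible k-word in seq."""
    words = ["".join(p) for p in product(alphabet, repeat=k)]  # 4**k words
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


print(len(kmer_vector("ACGTACGT", k=2)))  # 16
print(len(kmer_vector("ACGTACGT", k=5)))  # 1024

# With k = 1, these two different sequences have identical vectors.
print(euclidean(kmer_vector("AACC", k=1), kmer_vector("CACA", k=1)))  # 0.0
```

The vector length is 4**k, which shows why large k is costly in time and space. The last line illustrates the positional weakness noted above: sequences with the same word composition but different word order can be indistinguishable.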
5 Application of Phylogenetics in Cancer Studies
Cancer research is considered one of the most significant areas in the medical community. Mutations in genomic sequences are responsible for cancer development and increased aggressiveness in patients [13, 14]. The combination of all such gene mutations, or progression pathways, across a population can be summarized in a phylogeny describing the different evolutionary pathways. Phylogenetic trees can be applied to find similarities among breast cancer subtypes based on gene data [14, 15]. The discovery of genes associated with cancer subtypes helps researchers to map different pathways and classify cancer subtypes according to their mutations. Methods of phylogenetic tree inference have proliferated in cancer genome studies, such as those of breast cancer. Phylogenetics can capture important mutational events among different cancer types; a network approach can also capture tumour similarities.
It has been observed from the literature that, in cancer, driver genes alter the progression of the disease and even affect the participation of other genes, thus generating a gene interaction network. Phylogenetic methods can address the problem of class prediction by using a classification tree. They also give a deeper understanding of biological heterogeneity among cancer subtypes.
6 Conclusion
This paper has focused on the various methods of sequence analysis used to generate phylogenetic trees. The limitations associated with sequence alignment methods have led to the development of alignment-free sequence analysis. However, most existing alignment-free methods are unable to build an accurate tree, so more refinement of these methods is required. Phylogenetic study is not limited to species evolution; it extends to disease evolution as well. Extending phylogenetics to disease diagnosis can lead to new treatment options and a better understanding of disease progression.
The research is funded by the Department of Science and Technology, Delhi, under sanction number SR/WOS-A/ET-1015/2015.
- 2. Moret, B.M.E., Warnow, T.: Reconstructing optimal phylogenetic trees: a challenge in experimental algorithmics. In: Experimental Algorithmics, LNCS, pp. 163–180 (2002)
- 4. Geetika, Hanmandlu, M., Gaur, D.: Analyzing DNA strings using information theory concepts. In: ICTCS-16, ACM Conference, Udaipur, no. 9 (2016)
- 8. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3), 443–453 (1970)
- 13. Somarelli, J., Ware, K., Kostadinov, R., Robinson, J., Amri, H., Abu-Asab, M., Fourie, N., Diogo, R., Swofford, D., Townsend, J.: PhyloOncology: understanding cancer through phylogenetic analysis. Biochimica et Biophysica Acta (BBA) Rev. Cancer 1867(2), 101–108 (2017)
- 15. Munjal, G., Hanmandlu, M., Srivastava, S.: Novel gene selection method for breast cancer classification. J. Biochem. Technol. 8(4), 1116–1120
- 17. Nemeth, C.: Hidden Markov models with applications to DNA sequence analysis. STOR-i, Lancaster University
- 19. Potter, R.M.: Constructing phylogenetic trees using multiple sequence alignment. University of Washington (2008)
- 21. Cho, A.: Constructing phylogenetic trees using maximum likelihood. Ph.D. Thesis, Scripps women’s college Claremont (2012)
- 22. Felsenstein, J.: PHYLIP. University of Washington, Seattle, WA (1993)
- 25. Potiny, S.: An improved phylogenetic tree comparison method. Thesis, University of North Carolina (2010)
- 26. Bryant, D., Moulton, V.: Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol. Biol. Evol. 21(2), 255–265 (2004)
- 27. Brinkman, F.S.L.: Bioinformatics: a practical guide to the analysis of genes and proteins. John Wiley and Sons (2001)
- 28. Munjal, G., Sharma, P., Gaur, D.: Sequence similarity using composition method. Int. J. Data Sci. 3(1), 19–28. https://doi.org/10.1504/IJDS.2018.090626