Alignment-free phylogeny of whole genomes using underlying subwords

Comin, Matteo; Verzotto, Davide

doi:10.1186/1748-7188-7-34

Alignment-free phylogeny of whole genomes using underlying subwords

Research
Open access
Published: 06 December 2012

Volume 7, article number 34, (2012)
Cite this article

Download PDF

You have full access to this open access article

Algorithms for Molecular Biology Aims and scope Submit manuscript

Alignment-free phylogeny of whole genomes using underlying subwords

Download PDF

Matteo Comin¹ &
Davide Verzotto¹

8389 Accesses
41 Citations
1 Altmetric
Explore all metrics

Abstract

Background

With the progress of modern sequencing technologies a large number of complete genomes are now available. Traditionally the comparison of two related genomes is carried out by sequence alignment. There are cases where these techniques cannot be applied, for example if two genomes do not share the same set of genes, or if they are not alignable to each other due to low sequence similarity, rearrangements and inversions, or more specifically to their lengths when the organisms belong to different species. For these cases the comparison of complete genomes can be carried out only with ad hoc methods that are usually called alignment-free methods.

Methods

In this paper we propose a distance function based on subword compositions called Underlying Approach (UA). We prove that the matching statistics, a popular concept in the field of string algorithms able to capture the statistics of common words between two sequences, can be derived from a small set of “independent” subwords, namely the irredundant common subwords. We define a distance-like measure based on these subwords, such that each region of genomes contributes only once, thus avoiding to count shared subwords a multiple number of times. In a nutshell, this filter discards subwords occurring in regions covered by other more significant subwords.

Results

The Underlying Approach (UA) builds a scoring function based on this set of patterns, called underlying. We prove that this set is by construction linear in the size of input, without overlaps, and can be efficiently constructed. Results show the validity of our method in the reconstruction of phylogenetic trees, where the Underlying Approach outperforms the current state of the art methods. Moreover, we show that the accuracy of UA is achieved with a very small number of subwords, which in some cases carry meaningful biological information.

Availability

http://www.dei.unipd.it/∼ciompin/main/underlying.html

Alignment- and reference-free phylogenomics with colored de Bruijn graphs

Article Open access 07 April 2020

Can Formal Languages Help Pangenomics to Represent and Analyze Multiple Genomes?

A greedy alignment-free distance estimator for phylogenetic inference

Article Open access 07 June 2017

Background

The global spread of low-cost high-throughput sequencing technologies has made a large number of complete genomes publicly available, and this number is still growing rapidly. In contrast, only few computational methods can really handle as input entire chromosomes, or entire genomes.

Traditionally the comparison of related genomes is carried out by sequence alignment. Popular methods extract gene-specific sequences from all species under examination and build a multiple sequence alignment for each gene[1]. Then all multiple sequence alignments are merged to form the final phylogeny. Other methods[2] use genes as a dictionary, counting the presence or absence of a gene. This gene profile is then used to derive a similarity score. However, if the genomes in question do not share a common set of genes, or if they cannot be aligned to each other, e.g., due to substantially different lengths, these traditional techniques cannot be applied. As a general example, in a pairwise comparison of genomes popular alignment tools rely on a specific order of elements for each genome sequence, and on a set of sparse shared seeds that should then be extended to obtain a global alignment. Therefore low sequence similarity, rearrangements, and inversions can cause major problems in identifying a possible alignment and thus the actual sequence similarity.

Furthermore, when considering whole genomes, the global alignment of large sequences has become a prohibitive task even for supercomputers, hence simply infeasible. To overcome these obstacles, in the last ten years a variety of alignment-free methods have been proposed. In principle they are all based on counting procedures that characterize a sequence based on its constituents, e.g., k-mers[3, 4].

An important aspect in phylogeny reconstruction is the fact that gene-based methods strictly focus on comparing the coding regions of genomes, which can account for as little as 1% of the genomic sequence in humans[5]. Whereas it is known that the use of whole genomes may provide more robust information when comparing different organisms[6]. Also most alignment-free methods in the literature use only a portion of complete genomes[7]. For instance, there are approaches that use only genic regions[3] or mitochondria; other methods filter out regions that are highly repetitive or with low complexity[4]. Recently, it has been shown that the evolutionary information is also carried by non-genic regions[8]. For certain viruses, we are not even able to estimate a complete phylogeny by analyzing their genes, since these organisms may share a very limited genetic material[7].

Sims et al. recently applied the Feature Frequency Profiles method (FFP) presented in[4] to compute a whole-genome phylogeny of mammals[8]—i.e., large eukaryotic genomes including the human genome — and of bacteria. This method needs to estimate the parameter k in order to compute a feature vector for each sequence, where the vector represents the frequency of each possible k-mer. Each feature vector is then normalized by the total number of k-mers found (i.e., by the sequence length minus k-1), obtaining a probability distribution vector, or feature frequency profile, for each genome. FFP finally computes the distance matrix between all pairs of genomes by applying the Jensen-Shannon[9] divergence to their frequency profiles.

This general characterization of strings based on their subsequence composition closely resembles some of the information theory problems, and is tightly related with the compression of strings. In fact, compositional methods can be viewed as the reinterpretation of data compression methods, well known in the literature[10, 11]. For a comprehensive survey on the importance and implications of data compression methods in computational biology, we refer the reader to[12].

When comparing entire genomes we want to avoid that large non-coding regions, which by nature tend to be highly repetitive, may contribute to our scoring function a multiple number of times, thus misleading the final similarity score. In fact, while analyzing massive genomes, the number of repeated patterns is very high, particularly in the non-genic regions. Furthermore if we allow mismatches the number of patterns can grows exponentially[13–15]. In this paper we will address this problem by controlling the distribution of subwords over the sequences under consideration, so that their contribution will not be overcounted.

Moreover, when comparing genomes it is well known that different evolutionary mechanisms can take place. In this framework, two closely related species are expected to share larger portions of DNA than two distant ones, whereby also other large complements and reverse-complements, or inversions, may occur[16]. In this work we will take into account all these symmetries, in order to define a measure of similarity between whole genomes.

Matching statistics

Among the many distance measures proposed in the literature, which in most cases are dealing with k-mers, an effective and particularly elegant method is the Average Common Subword approach (ACS), introduced by Ulitsky et al.[7]. They use a popular concept in the field of string algorithms, known as matching statistics[17]. In short, given two sequences s₁ and s₂, where s₁ is the reference sequence, the matching statistic is a vector l such that l[i] is the length of the longest subword starting at position i of s₁ that is also a subword of s₂, for every possible position i of s₁(see Table1).

Table 1 Example of matching statistics l ₁ [ i ] and l ₂ [ j ] for the strings s ₁ = ACACGTAC and s ₂ = TACGTGTA

Full size table

A popular measure of similarity between strings is the average of this vector. In fact the general form of ACS is:

ACS (s_{1}, s_{2}) = \frac{\sum_{i = 1}^{| s_{1} |} l [i]}{| s_{1} |} .

We can notice the similarity with the cross entropy of two probability distributions P and Q:

H (P, Q) = - \sum_{x} p (x) log q (x),

where p(x) log q(x) measures the number of bits needed to code an event x from P if a different coding scheme based on Q is used, averaged over all possible events x.

From the theoretical prospective it can be shown[7] that the ACS approach mimics the cross entropy estimated between two large sequences generated by a finite-state Markov process. In practice, this is closely related to the Kullback-Leibler information divergence, and represents the minimum number of bits needed to describe one string, given the other: D_KL(P ∥ Q) = H(P,Q) − H(P). This is perhaps the most frequently used information-theoretic similarity measure.

The advantage of using the matching statistics is that it is not based on fixed-length subwords, but it can capture also variable length matches, in contrast to most methods that are based on fixed sets of k-mers. In fact, with the latter the choice of the parameter k is critical, and every method needs to estimate k from the data under examination, typically using empirical measurements[4].

For this reason ACS proved to be useful for reconstructing whole-genome phylogenies of viruses, bacteria, and eukaryotes, outperforming in most cases the state-of-the-art methods[7].

Methods

In this section we propose a distance measure between entire genomes based on the notion of underlying subwords. In order to build a sound similarity measure between genomes, we need first to study the properties of the matching statistics. Our first contribution is the characterization of the subwords that are needed to compute the matching statistics. A second contribution is the selection of these subwords so that the resulting similarity measure does not contain overcounts. Our main idea is to avoid overlaps between selected subwords, more precisely by discarding common subwords occurring in regions covered by other more significant subwords.

Irredundant common subwords

In the literature, the values l[i] used by the ACS approach are called the matching statistics, as described in detail by Gusfield[17]. Our first contribution is to characterize the matching statistics in order to identify which subwords are essentials.

It is well known that the total number of distinct subwords of any length found in a sequence of length n can be at most Θ(n²). Remarkably a notable family of fewer than 2n subwords exist that is maximal in the host sequence, in the sense that it is impossible to extend a word in this class by appending one or more characters to it without losing some of its occurrences[18]. It has been shown that the matching statistics can be derived from this set of maximal subwords[19]. Here we will further tighten this bound by showing that to compute the matching statistics it is enough to consider a subset of the maximal subwords, called irredundant common subwords.

The notion of irredundancy was introduced in[20] and later modified for the problem of protein comparison[21, 22]. It proved useful in different contexts from data compression[23] to the identification of transcription factors[24]. In this paper we introduce the concept of irredundant common subwords (i.e., without mismatches/wildcards). This ensures that there exists a close correspondence between the irredundant common subwords and the matching statistics.

Definition 1

(Irredundant/Redundant common subword) A common subword w is irredundant if and only if at least an occurrence of w in s₁ or s₂ is not covered by other common subwords. A common subword that does not satisfy this condition is called a redundant common subword.

We observe that the number of irredundant common subwords $I_{s_{1}, s_{2}}$ is bounded by m + n, where |s₁| = n and |s₂| = m, since it is a subset of the set of maximal common subwords (see[19, 25] for a more complete treatment of this topic).

Proposition 1

The matching statistics $l_{s_{1}} (i)$ can be computed by combining together all and only the irredundant common subwords of s₁ and s₂.

Proof

To show that the vector $l_{s_{1}} (i)$ can be derived from the irredundant common subwords, we define a new vector of scores l_wfor each subword w, where l_w[j] = |w|−j + 1 represents the length of each suffix j of w, with j = 1,…,|w|. Then, for each subword w in $I_{s_{1}, s_{2}}$ we superimpose the vector l_won all the occurrences of w in s₁. For each position i, in s₁, $l_{s_{1}} (i)$ is the maximum value of the scores ma x_w(l_w[j]), such that k + j = i and k is an occurrence of w.

To complete the proof we have to show that every occurrence of a common subword of s₁ and s₂ is covered by some occurrence of a subword in $I_{s_{1}, s_{2}}$ . By definition of irredundant common subword, any occurrence of a subword corresponds to an irredundant common subwords or is covered by some subword in $I_{s_{1}, s_{2}}$ . Moreover every irredundant common subword w has at least an occurrence i that is not covered by other subwords. Thus, $l_{s_{1}} (i)$ corresponds exactly to |w| and the subword w is necessary to compute the matching statistics. In conclusion, by using the method described above for $l_{s_{1}} (i)$ , we can compute for each position the length of the maximum common subword starting in that location, which corresponds to the matching statistics. □

In summary, the notion of irredundant common subwords is useful to decompose the information provided by the matching statistics into several patterns. Unfortunately these subwords can still overlap in some position. This might lead to an overcount in the matching statistics, in which the same region of the string contributes more than once. Our aim is to remove the possibility of overcount by filtering the most representative common subwords for each region of the sequences s₁ and s₂, which will also remove all overlaps.

Underlying subwords

When comparing entire genomes we want to avoid that large non-coding regions, which by nature tend to be highly repetitive, may contribute to the similarity score a multiple number of times, thus misleading the final score. In fact, while analyzing massive genomes, the number of repeated patterns is very high, particularly in the non-genic regions. Therefore we need to filter out part of this information, and select the “best” common subword, by some measure, for each region of the sequences.

In this regard, we must recall the definition of pattern priority and of underlying pattern, adapted from[26] to the case of pairwise sequence comparison. We will take as input the irredundant common subwords and the underlying quorum u = 2, i.e. they must appears at least twice. Let now w and w^′ be two distinct subwords. We say that w has priority over w^′, or w→w^′, if and only if either |w| ≥ |w^′|, or |w| = |w^′| and the first occurrence of w appears before the first occurrence of w^′. In this case, every subword can be defined just by its length and one of its starting positions in the sequences, meaning that any set of subwords is totally ordered with respect to the priority rule. We say that an occurrence l of w is tied to an occurrence l^′ of a subword w^′, if the two occurrences overlap, i.e. ([l,l + |w| − 1]∩l^′,l^′ + |w^′| − 1]) ≠ ∅, and w^′→w. Otherwise, we say that l is untied from l^′.

Definition 2

(Underlying subword) A set of subwords $U_{s_{1}, s_{2}} \subseteq I_{s_{1}, s_{2}}$ is said to be underlying if and only if:

(i)
every subword w in $U_{s_{1}, s_{2}}$ , called an underlying subword, has at least two occurrences, one in s ₁ and the other in s ₂, that are untied from all the untied occurrences of other subwords in $U_{s_{1}, s_{2}} ∖ w$ , and
(ii)
there does not exist a subword $w \in I_{s_{1}, s_{2}} ∖ U_{s_{1}, s_{2}}$ such that w has at least two untied occurrences, one per sequence, from all the untied occurrences of subwords in $U_{s_{1}, s_{2}}$ .

This subset of $I_{s_{1}, s_{2}}$ is composed only by those subwords that rank higher with our priority rule with respect to s₁. In fact, if there are overlaps between subwords that are in $I_{s_{1}, s_{2}}$ , we will select only the subwords with the highest priority. Similarly to the score ACS(s₁s₂), the set $U_{s_{1}, s_{2}}$ is asymmetric and depends on the order of the two sequences; we will address this issue in Section “A distance-like measure based on underlying subwords”. As for the underlying patterns[26], one can show that the set of underlying subwords exists, and is unique. As a corollary we know that the untied occurrences of the underlying subwords can be mapped into the sequences s₁ and s₂ without overlaps. Moreover, by definition, the total length of the untied occurrences cannot exceed the length of the sequences. These two properties are crucial when building a similarity measure, because any similarity that is based on these subwords will count the contribution of a region of the sequence only once.

Efficient computation of underlying subwords

To summarize we select the irredundant common subwords that best fit each region of s₁ and s₂, employing a technique that we call Underlying Approach or, in short, UA. This technique is based on a simple pipeline. We first select the irredundant common subwords and subsequently filter out the subwords that are not underlying. From a different perspective, we start from the smallest set of subwords that captures the matching statistics and remove the overlaps by applying our priority rule. In the following we show how to compute the irredundant common subwords and the matching statistics, and then we present an approach for the selection of the underlying subwords among these subwords. The general structure of the Underlying Approach (UA) is the following:

1) Compute the set of the irredundant common subwords $I_{s_{1}, s_{2}}$
2) Rank all subwords in $I_{s_{1}, s_{2}}$ according to the priority and initialize $U$ to an empty set.
3) Iteratively select a subwords p from $I_{s_{1}, s_{2}}$ following the ranking.
4a) If p has at least two untied occurrences: add p to $U$ and update the corresponding regions of Γ(see next) in which p occurs;
4b) otherwise, discard p and return to (3).

Discovery of the irredundant common subwords

In step (1) we construct the generalized suffix tree $T_{s_{1}, s_{2}}$ of s₁ and s₂. We recall that an occurrence of a subword is (left)right-maximal if it cannot be covered from the (left)right by some other common subword. The first step consists in making a depth-first traversal of all nodes of $T_{s_{1}, s_{2}}$ , and coloring each internal node with the colors of its leaves (each color corresponds to an input sequence). In this traversal, for each leaf i of $T_{s_{1}, s_{2}}$ , we capture the lowest ancestor of i having both the colors c₁ and c₂, say the node w. Then, w is a common subword, and i is one of its right-maximal occurrences (in s₁ or in s₂); we select all subwords having at least one right-maximal occurrence. The resulting set will be linear in the size of the sequences, that is O(m + n). This is only a superset of the irredundant common subwords, since the occurrences of these subwords could be not left-maximal.

In a second phase, we map the length of each right-maximal occurrence i into $l_{s_{1}} (i)$ , and, using Proposition 1, we check which occurrences i have length greater than or equal to the length stored in the location i−1 (for locations i ≥ 2). These occurrences are also left-maximal, since they cannot be covered by a subword appearing at position i−1. Finally we can retain all subwords that have at least an occurrence that is both right- and left-maximal, i.e, the set of irredundant common subwords $I_{s_{1}, s_{2}}$ . Note that, by employing the above technique, we are able to directly discover the irredundant common subwords and the matching statistics $l_{s_{1}} (i)$ .

The construction of the generalized suffix tree $T_{s_{1}, s_{2}}$ and the subsequent extraction of the irredundant common subwords $I_{s_{1}, s_{2}}$ can be completed in time and space linear in the size of sequences.

Selection of the underlying subwords

In this section we describe, given the set of the irredundant common subwords $I_{s_{1}, s_{2}}$ , how to filter out the subwords that are not underlying, obtaining the set of underlying subwords $U_{s_{1}, s_{2}}$ .

The extraction of underlying subwords takes as input the set $I_{s_{1}, s_{2}}$ and the tree $T_{s_{1}, s_{2}}$ from the previous section. First we need to sort all subwords in $I_{s_{1}, s_{2}}$ according to the priority rule (step 2). Then, starting from the top subword, we analyze iteratively all subwords by checking their untied occurrences (step 3). If the subword passes a validity test we select it as underlying (step 4a), otherwise we move on with the next subword (step 4b). The two key steps of this algorithm are: sorting the subwords (step 2) and checking for their untied occurrences (step 4a).

Step 2 is implemented as follows. For all subwords we retrieve their lengths and first occurrences in s₁ from the tree $T_{s_{1}, s_{2}}$ . Then each subword is characterized by its length and the first occurrence. Since these are integers in the range [0,n] we can apply radix sort[27], first by length and then by occurrence. This step can be done in linear time.

In order to implement step 4a we need to define the vector Γ of n booleans, representing the locations of s₁. If Γ[i] is true, then the location i is covered by some untied occurrence. We also preprocess the input tree and add a link for all nodes v to the closest irredundant ancestor, say prec(v). This can be done by traversing the tree in preorder. During the visit of a the node v if it is not irredundant we transmit to the children prec(v) otherwise if v is irredundant we transmit v. This preprocess can be implemented is linear time and space.

For each subword w in $I_{s_{1}, s_{2}}$ we consider the list $L_{w}$ of occurrences to be checked. All $L_{w}$ are initialized in the following way. Every leaf v, that represent a position i, send its value i to the location list of the closest irredundant ancestor using the link prec(v). Again this preprocess takes linear time and space since all positions appear in exactly one location list. We will updated these lists $L_{w}$ only with the occurrences to be checked, i.e. that are not covered by some underlying subword already discovered. We start analyzing the top subword w and for this case $L_{w}$ is composed by all the occurrences of w.

For each occurrence i of w we need to check only its first and last location in the vector Γ; i.e., we need to check the locations Γ[i] and Γ[i + |w|−1]. If one of these two values is set to true, then i is tied by some subword w^′. Otherwise, if both the values are set to false, then i must be untied from all other subwords. Since all subwords already evaluated are not shorter than w, then they cannot cover some locations in Γ[i,i + |w|−1] without also covering Γ[i] or Γ[i + |w|−1]. Thus, if Γ[i] and Γ[i + |w|−1] are both set to false, we mark this occurrence i as untied for the subword w and update the vector Γaccordingly.

If Γ[i] is true we can completely discard the occurrence i, for the subword w and also for all its prefixes, that are represented by the ancestors of w in the tree $T_{s_{1}, s_{2}}$ . Thus the occurrence i will no longer be evaluated for any other subword.

If Γ[i] is false and Γ[i + |w|−1] is true, we need to further evaluate this occurrence for some ancestors of w. In this case, one can compute the longest prefix, w^′, of w such that Γ[i + |w^′|−1] is set to false and w^′is an irredundant common subword. Then the occurrence i is inserted into the list $L_{w^{'}}$ .

This step is performed by first computing the length d < |w| such that Γ[i + d−1] is false and Γ[i + d] is true, and then retrieving the corresponding prefix w^′ of w in the tree that spells an irredundant common subword with length equal to or shorter than d. We can compute d by means of a length table χ in support (or in place) of the boolean vector Γ. For each untied occurrence i of w, χ stores the values [1,2,…,|w|] in the locations [i,i + 1,…,i + |w|−1], similarly to the proof of Proposition 1. Using this auxiliary table we can compute the value of d for the location under study i as d = |w|−χ[i + |w|−1].

Now, to select w^′, the longest prefix of w with |w^′| ≤ d, we employ an algorithm proposed by Kopelowitz and Lewenstein[28] for solving the weighted ancestor problem, where weights correspond to the length of words spelled in the path from the root to each node, in case of a suffix tree. In the weighted ancestor problem one preprocesses a weighted tree to support fast predecessor queries on the path from a query node to the root. That is, with a linear preprocessing on a tree of height n, using the above algorithm it is possible to locate any ancestor node w^′ that has a weight less than d in time O(loglogn). In our case, the maximum length for an irredundant subword is min{m,n}, thus we can find a suitable ancestor w^′of w in time O(loglogmin{m,n}), with O(m + n) preprocessing of the tree $T_{s_{1}, s_{2}}$ .

At the end of the process, if the subword w has at least one untied occurrence per sequence, then we mark w as underlying subword. Otherwise, all the occurrences of w that are not covered are sent to its ancestors, using the previous procedure.

To analyze the overall complexity we need to compute how many times the same location i is evaluated. Suppose, for example, that i belongs to $L_{w}$ of the subword w. The location i is evaluated again for some $\bar{w}$ , and inserted into the list $L_{\bar{w}}$ , only if Γ[i] is false and Γ[i + |w|−1] is true. Note that the locations not already covered are in the range [i,i + |w|−d−1], with d > 0. Then, the subword $\bar{w}$ is the longest prefix of w that is an irredundant common subword and that lives completely in the locations [i,i + |w|−d−1]; however $\bar{w}$ may not cover the entire interval. Now, the occurrence i will be evaluated again only if there exists another subword w^′ that overlaps with $\bar{w}$ , and that has a higher priority with respect to $\bar{w}$ . The worst case is when w^′ends exactly at position i + |w|−d−1 and overlaps with $\bar{w}$ by only one location. Since w^′must be evaluated before $\bar{w}$ , then $| w^{'} | \geq | \bar{w} |$ . Thus the worst case is when the two subwords have about the same length. In this settings the length of the subword $\bar{w}$ can be at most (|w|−d)/2. We can iterate this argument at most O(log|w|) times for the same position i. Therefore any location can be evaluated at most O(logmin{m,n}) times. In conclusion, our approach requires O((m + n)logmin{m,n}loglogmin{m,n}) time and O(m + n) space to discover the set of all underlying subwords $U_{s_{1}, s_{2}}$ .

Extension to inversions and complements

In this section we discuss the extension of the algorithmic structure discussed above to accommodate also inversion and complement matches.

A simple idea is to concatenate each sequence with its inverse and its complement, while keeping separate the occurrences coming from direct matches, inversions, and complements. In brief, we first define $\hat{x}$ as the concatenation of a string x with its inverse, followed by its complement, in this exact order. Then, we compute the irredundant common subwords, $I_{s_{1}, \hat{s_{2}}}$ , on the sequences s₁and $\hat{s_{2}}$ . We subsequently select the underlying subwords by ranking all the irredundant common subwords in the set $I_{s_{1}, \hat{s_{2}}}$ . Using the same algorithm described above we compute the set $U_{s_{1}, \hat{s_{2}}}$ , and then we map each subword occurrence to the reference sequences s₁. This will include also inversions and complements of s₂ that are shared by s₁. In this way, we can store all the untied occurrences and consider all possible matches for each region of s₁.

In this framework, we choose to take into account all these symmetries, and thus the experiments presented will use this extended approach. We will also measure the contribution of inversions and complements to our similarity measure.

A distance-like measure based on underlying subwords

In the following we report the basic steps of our distance-like measure. Let us assume that we have computed $U_{s_{1}, s_{2}}$ , and the other specular set $U_{s_{2}, s_{1}}$ . For every subword $w \in U_{s_{1}, s_{2}}$ we sum up the score $h_{w}^{s_{1}} \sum_{i = 1}^{| w |} i = h_{w}^{s_{1}} | w | (| w | + 1) / 2$ in UA(s₁s₂), where $h_{w}^{s_{1}}$ is the number of its untied occurrences in s₁, similarly to ACS[7]. Then, we average UA(s₁,s₂) over the length of the first sequence, s₁, yielding

UA (s_{1}, s_{2}) = \frac{\sum_{w \in U_{s_{1}, s_{2}}} h_{w}^{s_{1}} | w | (| w | + 1)}{2 n} .

This is a similarity score that is large when two sequences are similar, therefore we take its inverse.

Moreover, for a fixed sequence s₁ this score can also grow with the length of s₂, since the probability of having a match in s₁increases with the length of s₂. For this reason, we consider the measure log₄(|s₂|)/UA(s₁,s₂); we use a base-4 logarithm since DNA sequences have four bases. Another issue with the above formula is the fact that it is not equal to zero for s₁ = s₂; thus we subtract the correction term log₄(|s₁|)/UA(s₁,s₁), which ensures that this condition is always satisfied. Since $U_{s_{1}, s_{1}}$ contains only one subword, the sequence s₁ itself, which trivially has only one untied occurrence in s₁, this yields to UA(s₁,s₁) = |s₁|(|s₁| + 1)/(2|s₁|) = (|s₁| + 1)/2. The following formulas accommodate all of these observations in a symmetrical distance-like measure d_UA(s₁,s₂) between the sequences s₁ and s₂:

\bar{UA} (s_{1}, s_{2}) = \frac{{log}_{4} (| s_{2} |)}{UA (s_{1}, s_{2})} - \frac{2 {log}_{4} (| s_{1} |)}{(| s_{1} | + 1)},

d_{UA} (s_{1}, s_{2}) = \frac{\bar{UA} (s_{1}, s_{2}) + \bar{UA} (s_{2}, s_{1})}{2} .

We can easily see that the correction term rapidly converges to zero as |s₁| increases. Moreover, we notice that d_UA(s₁,s₂) grows as the two sequences s₁ and s₂ diverge. From now we will simply refer to the measure d_UA(s₁,s₂) as the Underlying Approach measure, or UA.

Results

Genome datasets and reference taxonomies

We assess the effectiveness of the Underlying Approach on the estimation of whole-genome phylogenies of different organisms. We tested our distance function on three types of datasets: viruses, prokaryotes, and unicellular eukaryotes.

In the first dataset we selected 54 virus isolates of the 2009 human pandemic Influenza A – subtype H1N1, also called the “Swine Flu.” The Influenza A virion has eight segments of viral RNA with different functions. These RNAs are directly extracted from infected host cells, and synthesized into complementary DNA by reverse transcription reaction, where a specific gene amplification is performed for each segment[29]. We concatenate these segments according to their conventional order given by the literature[30]; this step, in general, does not affect the final phylogeny computed by our algorithm, and is used to sort subwords by location. The resulting sequences are very similar to each other, and have lengths in the order of 13,200 nucleotides each, accounting for a total of 714,402 b. To compute a reference taxonomic tree, we perform multiple sequence alignment using the ClustalW2[31] tool^a as suggested by many scientific articles on the 2009 Swine Flu[29, 30]. Then, we compute the tree using the Dnaml tool from the PHYLIP[32] software package,^b which implements the maximum likelihood method for aligned DNA sequences. In Dnaml we used the parameters suggested in[29, 30], which consider empirical base frequencies, constant rate variation among sites (with no weights), a transition ratio of 2.0, and best tree search based on proper searching heuristics.

In the second dataset we selected 18 prokaryotic organisms among the species used in[7] for a DNA phylogenomic inference. We chose the species whose phylogenomic tree can be inferred by well-established methods in the literature (see Table2). The organisms come from both the major prokaryotic domains: Archaea, 8 organisms in total, and Bacteria, 10 organisms in total. The sequences in question have lengths ranging from 0.6 Mbp to 8 Mbp, accounting for a total 48 Mbp. We compute their tree-of-life by using genes that code for the 16S RNA, the main RNA molecule inside the small ribosomal subunit characterizing prokaryotes and widely used to reconstruct their phylogeny; the considered sequences are called 16S rDNA. We can extract a multiple alignment of 16S rDNA sequences of the selected organisms from the Ribosomal Database Project[33];^c our experiments are based on the release 8.1. Next, we perform a maximum likelihood estimation on the aligned set of sequences, employing Dnaml from PHYLIP with standard parameters, in order to compute a reference tree based on the resulting estimation.

Table 2 Benchmark for prokaryotes – Archaea & Bacteria domains

Full size table

In the third dataset we selected five eukaryotic organisms of the protozoan genus Plasmodium whose genomes have been completely sequenced (Table3). Plasmodium are unicellular eukaryotic parasites best known as the etiological agents of malaria infectious disease. The sequences have lengths ranging from 18 Mbp to 24 Mbp, accounting for a total 106 Mbp. We used as reference tree the taxonomy computed by Martinsen et al.[34], as suggested by the Tree of Life Project.

Table 3 Plasmodium are parasites known as causative agents of malaria in different hosts and geographic regions

Full size table

Whole-genome phylogeny reconstruction

We exploited the above datasets to compare our method, the Underlying Approach (UA), with other efficient state-of-the-art approaches in the whole-genome phylogeny reconstruction challenge: ACS[7], FFP[4]^d and FFP_RY. The FFP_RYmethod, in contrast to FFP, employs the Purine-Pyrimidine reduced alphabet (RY) which is composed by two character classes: [A,G] (both purine bases, denoted by R) [C,T] (both pyrimidines, denoted by Y). We implemented the ACS method by ourselves, while for FFP and FFP_RY we used the FFP package release 3.14 available online.

We reconstruct the phylogenomic trees from the distance matrices using the Neighbor-joining method as implemented in the PHYLIP package. We compare the resulting topologies with the respective reference trees using the symmetric difference of Robinson and Foulds (R-F) and the triplet distance. For two unrooted binary trees with n ≥ 3 leaves, the R-F score is in the range [0,2n−6]. A score equal to 0 means that the two trees are isomorphic, while 2n−6 means that all non-trivial bipartitions are different. The R-F difference between two or more trees can be computed using the TreeDist tool from the PHYLIP package.

We ran FFP and FFP_RYfor different values of k (the fixed subword length) as suggested by[4], retaining the best results in agreement with the reference trees. Table4 compares our method with the other state-of-the-art approaches, by showing the R-F difference with respect to the reference taxonomy tree.

Table 4 Comparison of whole-genome phylogeny reconstructions

Full size table

Our method, UA, achieves good performance in every test considering the R-F difference with the reference taxonomy tree, and very good performance if we further analyze the resulting phylogenies, as in Figures1,2, and3. For every dataset the best results are shown in bold. We can observe that UA is constantly the best performing method, and that this advantage becomes more evident for large dataset, where sequences share large parts, such as the Influenza A (H1N1) viruses.

The Robinson and Foulds distance is a standard method to evaluate topological discordance between trees. However when dealing with large trees it is known that small variations can generate very large R-F scores (typically, already for n∼10). For this reason we conducted a second series of experiments using the triplet distance[35]. The triplet distance is a more refined measure that does not suffer this problem. Moreover, to better compare all taxonomies, we report the triplet distance between all trees. Tables5,6 and7 show the triplet distance between all trees for all datasets. This more refined measure confirms the applicability of UA with respect to FFP and ACS.

Table 5 Comparison of whole-genome phylogeny of influenza virus

Full size table

Table 6 Comparison of whole-genome phylogeny of prokaryotes

Full size table

Table 7 Comparison of whole-genome phylogeny of Plasmodium

Full size table

In more detail, Figure1 shows that our approach can distinguish the two main clades of the 2009 Influenza A-H1N1 (in green and red), which have been outlined in[30]. The origin of the flu could reside in the Mexican isolate (Mexico/4108, in green), from which all other green isolates may have ensued. Two sub-clades for the U.S. states of California and Texas are highlighted within the red clade, most probably corresponding to the first major evolutions of the viral disease.

Similar results are obtained for the second dataset, as shown in Figure2. UA can easily distinguish the Archaea domain, in red, from the Bacteria domain, in green, and also other sub-clades with respect to the reference tree (these sub-clades are highlighted in the figure with different colors). The organisms in black do not form a clade with other organisms in the reference tree. For the third dataset (Figure3), the whole-genome phylogeny of the genus Plasmodium generated by UA corresponds exactly to the taxonomy found in the literature.

The accuracy results are promising, but we believe that of equal interest are the patterns used for the classification. Our approach, by construction, uses only a very small number of patterns. For this reason we report in Table8 some statistics for the underlying subwords selected, averaged over all experiments. We can notice that the number of irredundant patterns is in general smaller than the length of the genomes, and this is a first form of information filtering. Moreover we can observe that only a few underlying subwords are selected on average among the irredundant common subwords. This number is always very small when compared with all possible irredundant subwords, and much smaller than the length of the sequences.

Table 8 Main statistics for the underlying approach averaged over all experiments

Full size table

Similar considerations can be drawn for the underlying subwords length. On average they can be very long, especially with respect to FFP that uses only k-mers with k in the range [5,10]. Furthermore, each underlying subword occurs only a few times per sequence, and in general about one occurrence per sequence. Removing the high-frequency subwords, we can notice that the underlying subwords typically have length ≥ log₄ min{m,n}, and in the case of viruses they can be very large, capturing more information than FFP. The longest underlying subwords appear in the virus dataset, and they are on the order of a thousand bases. We checked if these subwords may have some biological meaning and we found that in some cases they correspond to whole viral segments that are shared between two genomes. This confirms that, in some cases, the underlying subwords used for classification can capture some biological insight.

Another interesting aspect is the contribution of inversions and complements in our similarity measure, with respect to the classical notion of match. We compute the average number of occurrences used in our scoring function that is due to inversions and complements. The contribution of inversions and complements is about 28-33% and 19-20%, respectively. This fact may be due to the nature of the sequences considered, but we believe that this topic deserves more attention.

Conclusion

In conclusion, we have shown that the underlying subwords can be used for the reconstruction of phylogenetic trees. Preliminary experiments have shown very good performance in the identification of major clusters for viruses, prokaryotes, and unicellular eukaryotes. An important observation that distinguishes our methods from the others is that only a small number of underlying subwords is used in our distance, nevertheless the results are promising. From this fact we can speculate that only a very limited number of subwords is needed to establish the phylogeny of genomic sequences. Thus, an interesting problem that can be addressed using the underlying subwords is the selection of probes for DNA chips.

In the future, we plan to extend this method for the comparison of whole genomes based on short reads coming from next-generation sequencing, instead of using assembled genomes.

Endnotes

^a ClustalW2 is available athttp://www.ebi.ac.uk/Tools/msa/clustalw2.

^b PHYLIP (phylogenetic inference package) is a free computational phylogenetics software package available athttp://evolution.genetics.washington.edu/phylip.

^c The Ribosomal Database Project is available athttp://rdp.cme.msu.edu.

^d The FFP software package release 3.14 is available athttp://ffp-phylogeny.sourceforge.net.

References

Wildman D, Uddin M, Opazo JC, Liu G, Lefort V, Guindon S, Gascuel O, Grossman LI, Romero R, Goodman M: Genomics, biogeography, and the diversification of placental mammals. Proc Natl Acad Sci USA. 2007, 104: 14395-14400. 10.1073/pnas.0704342104
Article PubMed CAS PubMed Central Google Scholar
Huynen M, Bork P: Measuring genome evolution. Proc Natl Acad Sci USA. 1998, 95: 5849-5856. 10.1073/pnas.95.11.5849
Article PubMed CAS PubMed Central Google Scholar
Chor B, Horn D, Goldman N, Levy Y, Massingham T: Genomic DNA k-mer spectra: models and modalities. Genome Biol. 2009, 10 (10): R108. 10.1186/gb-2009-10-10-r108
Article PubMed PubMed Central Google Scholar
Sims GE, Jun SRR, Wu GA, Kim SH: Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Nat Acad Sci USA. 2009, 106 (8): 2677-2682. 10.1073/pnas.0813249106
Article PubMed CAS PubMed Central Google Scholar
Venter C: The sequence of the human genome. Science. 2001, 291: 1305-1350.
Article Google Scholar
Delsuc F, Brinkmann H, Philippe H: Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 2005, 6: 361-375.
Article PubMed CAS Google Scholar
Ulitsky I, Burstein D, Tuller T, Chor B: The average common substring approach to phylogenomic reconstruction. J Comput Biol. 2006, 13 (2): 336-350. 10.1089/cmb.2006.13.336
Article PubMed CAS Google Scholar
Sims GE, Jun SR, Wu GA, Kim SH: Whole-genome phylogeny of mammals: Evolutionary information in genic and nongenic regions. Proc Nat Acad Sci USA. 2009, 106 (40): 17077-17082. 10.1073/pnas.0909377106
Article PubMed CAS PubMed Central Google Scholar
Lin J: Divergence measures based on the Shannon entropy. IEEE T Inform Theory. 1991, 37: 145-151. 10.1109/18.61115
Article Google Scholar
Apostolico A, Comin M, Parida L: Mining, compressing and classifying with extensible motifs. Algorithms Mol Biol. 2006, 1: 4. 10.1186/1748-7188-1-4
Article PubMed PubMed Central Google Scholar
Apostolico A, Comin M, Parida L: Motifs in Ziv-Lempel-Welch Clef. Proceedings of IEEE DCC Data Compression Conference. IEEE Computer Society, 2004, 72-81.
Google Scholar
Giancarlo R, Scaturro D, Utro F: Textual data compression in computational biology: a synopsis. Bioinformatics. 2009, 25 (13): 1575-1586. 10.1093/bioinformatics/btp117
Article PubMed CAS Google Scholar
Iliopoulos C, Mchugh J, Peterlongo P, Pisanti N, Rytter W, Sagot MF: A first approach to finding common motifs with gaps. Int J Foundations Comput Sci. 2005, 16 (6): 1145-1154. 10.1142/S0129054105003716
Article Google Scholar
Apostolico A, Comin M, Parida L: Conservative extraction of over-represented extensible motifs. Bioinformatics. 2005, 21 (Suppl 1): i9-i18. 10.1093/bioinformatics/bti1051
Article PubMed CAS Google Scholar
Apostolico A, Comin M, Parida L: VARUN: discovering extensible motifs under saturation constraints. IEEE/ACM Trans Comput Biol Bioinformatics. 2010, 7 (4): 752-762.
Article CAS Google Scholar
Kong SG, Fan WL, Chen HD, Hsu ZT, Zhou N, Zheng B, Lee HC: Inverse symmetry in complete genomes and whole-genome inverse duplication. PLoS ONE. 2009, 4 (11): e7553. 10.1371/journal.pone.0007553
Article PubMed PubMed Central Google Scholar
Gusfield D: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. New York, USA: Cambridge University Press, 1997.
Book Google Scholar
Apostolico A: The myriad virtues of subword trees. Combinatorial Algorithms on Words, A. Apostolico, Z. Galil (Eds.). 1985, 12: 85-96.
Article Google Scholar
Apostolico A: Maximal words in sequence comparisons based on subword composition. Algorithms and Applications, Volume 6060 of Lecture Notes in Computer Science. Edited by: Elomaa T, Mannila H, Orponen P. Berlin: Springer-Verlag, 2010, 34-44.
Google Scholar
Apostolico A, Parida L: Incremental paradigms of motif discovery. J Comput Biol. 2004, 11: 15-25. 10.1089/106652704773416867
Article PubMed CAS Google Scholar
Comin M, Verzotto D: Classification of protein sequences by means of irredundant patterns. BMC Bioinformatics. 2010, 11 (Suppl. 1): S16.
Article PubMed PubMed Central Google Scholar
Comin M, Verzotto D: The Irredundant Class method for remote homology detection of protein sequences. J Comput Biol. 2011, 18 (12): 1819-1829. [http://dx.doi.org/10.1089/cmb.2010.0171].] 10.1089/cmb.2010.0171
Article PubMed CAS Google Scholar
Apostolico A, Comin M, Parida L: Bridging lossy and lossless compression by motif pattern discovery. Lect Notes Comput Sci. 2006, 4123: 793-813. 10.1007/11889342_51
Article Google Scholar
Comin M, Parida L: Detection of subtle variations as consensus motifs. Theor Comput Sci. 2008, 395 (2-3): 158-170. 10.1016/j.tcs.2008.01.017
Article Google Scholar
Ukkonen E: Maximal and minimal representations of gapped and non-gapped motifs of a string. Theor Comput Sci. 2009, 410 (43): 4341-4349. 10.1016/j.tcs.2009.07.015
Article Google Scholar
Comin M, Verzotto D: Comparing, ranking and filtering motifs with character classes: application to biological sequences analysis. Biological Knowledge Discovery Handbook: Preprocessing, Mining and Postprocessing of Biological Data. Edited by: Elloumi M, Zomaya AY. 2013, chapter 13-chapter 13. Wiley.
Google Scholar
Cormen TH, Leiserson CE, Rivest RL: Introduction To Algorithms, chap. 9. MIT Press, 1990, 178–180.
Google Scholar
Kopelowitz T, Lewenstein M: Dynamic weighted ancestors. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2007). Society for Industrial and Applied Mathematics Philadelphia SIAM, 2007, 565-574.
Google Scholar
Smith GJD, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, Pybus OG, Ma SK, Cheung CL, Raghwani J, Bhatt S, Peiris JSM, Guan Y, Rambaut A: Origins and evolutionary genomics of the 2009 swine-origin H1N1 Influenza A epidemic. Nature. Nature Publishing Group. 2009, 459 (7250): 1122-1125.
CAS Google Scholar
Shiino T, Okabe N, Yasui Y, Sunagawa T, Ujike M, Obuchi M, Kishida N, Xu H, Takashita E, Anraku A, Ito R, Doi T, Ejima M, Sugawara H, Horikawa H, Yamazaki S, Kato Y, Oguchi A, Fujita N, Odagiri T, Tashiro M, Watanabe H: Molecular Evolutionary Analysis of the Influenza A(H1N1)pdm, May–September, 2009: Temporal and Spatial Spreading Profile of the Viruses in Japan. PLoS ONE. 2010, 5 (6): e11057. 10.1371/journal.pone.0011057
Article PubMed PubMed Central Google Scholar
Thompson J, Higgins D, Gibson T: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680. 10.1093/nar/22.22.4673
Article PubMed CAS PubMed Central Google Scholar
Felsenstein J: PHYLIP – Phylogeny Inference Package (Version 3.2). Cladistics. 1989, 5: 164-166.
Google Scholar
Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje JM: The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009, 37: D141-D145. 10.1093/nar/gkn879
Article PubMed CAS PubMed Central Google Scholar
Martinsen ES, Perkins SL, Schall JJ: A three-genome phylogeny of malaria parasites (Plasmodium and closely related genera): Evolution of life-history traits and host switches. Mol Phylogenet Evol. 2008, 47: 261-273. 10.1016/j.ympev.2007.11.012
Article PubMed CAS Google Scholar
Critchlow D, Pearl D, Qian C: The triples distance for rooted bifurcating phylogenetic trees. Syst Biol. 1996, 45 (3): 323-334. 10.1093/sysbio/45.3.323
Article Google Scholar

Download references

Acknowledgements

M.C. was partially supported by the Ateneo Project “Pattern discovery and computational problems for whole genomes comparison” of the University of Padova. The authors would like to thank prof. Benny Chor for providing the data and for helpful discussions. We thank the reviewers for their careful read and thoughtful comments on the paper.

Author information

Authors and Affiliations

Department of Information Engineering, University of Padova, Padova, Italy
Matteo Comin & Davide Verzotto

Authors

Matteo Comin
View author publications
You can also search for this author in PubMed Google Scholar
Davide Verzotto
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Matteo Comin.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

All authors contributed equally to this study. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Comin, M., Verzotto, D. Alignment-free phylogeny of whole genomes using underlying subwords. Algorithms Mol Biol 7, 34 (2012). https://doi.org/10.1186/1748-7188-7-34

Download citation

Received: 19 September 2012
Accepted: 29 November 2012
Published: 06 December 2012
DOI: https://doi.org/10.1186/1748-7188-7-34

Alignment-free phylogeny of whole genomes using underlying subwords

Abstract

Background

Methods

Results

Availability

Similar content being viewed by others

Alignment- and reference-free phylogenomics with colored de Bruijn graphs

Can Formal Languages Help Pangenomics to Represent and Analyze Multiple Genomes?

A greedy alignment-free distance estimator for phylogenetic inference

Background

Matching statistics

Methods

Irredundant common subwords

Definition 1

Proposition 1

Proof

Underlying subwords

Definition 2

Efficient computation of underlying subwords

Discovery of the irredundant common subwords

Selection of the underlying subwords

Extension to inversions and complements

A distance-like measure based on underlying subwords

Results

Genome datasets and reference taxonomies

Whole-genome phylogeny reconstruction

Conclusion

Endnotes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation