1 Introduction

Proteins are essential molecules that perform a wide range of functions in biological systems. A protein is composed of amino acids, and its amino acid sequence determines its chemical structure. Analysis of amino acid sequences can provide useful insights into the tertiary structure of proteins and into the reconstruction of evolutionary trees [13, 25, 51, 56]. Phylogenetics is the study of the evolutionary history of organisms, and it can also provide information for function prediction. For example, pharmaceutical researchers may use phylogenetic methods to identify closely related species that may share medicinal qualities [15]. Traditional phylogenetic approaches based on multiple sequence alignment, such as maximum parsimony and maximum likelihood, become impractical because of their high computational complexity, given that most proteomes contain millions of amino acids [11, 23, 31, 50]. It is therefore valuable to develop novel alignment-free methods for phylogenetic analysis.

In the past two decades, many alignment-free methods have been developed [1, 2, 9, 12, 20–22, 27, 29, 32–45, 54, 55, 57, 58]. These methods aim to extract hidden information from protein sequences, each from a different angle. Graphical representations of proteins have emerged as one class of alignment-free methods [1, 9, 20–22, 27, 29, 32–45, 58]. They offer insights into the local and global characteristics of an amino acid sequence and into the occurrence, variation and repetition of particular patterns along it. Alternatively, compression-based methods generally regard the protein sequence as plain text and define the similarity between two protein sequences as the relative compression ratio [16–18, 28, 53, 56]. These methods suffer from aggregate errors arising from compression. A third class of methods extends single amino acid composition to string composition, where a string is a consecutive segment of amino acids [5, 10, 14, 19, 30, 46]. Hao and Qi [10], Li et al. [19] and Qi et al. [30] analyzed k-word frequencies and extracted phylogenetic properties on a genome-wide scale for prokaryotes. Methods based on the k-word distribution face the dilemma of choosing the word length k. Theoretically, one may increase the maximum string length to obtain a finer composition of whole genomes and hence more accurate pairwise evolutionary distances. However, increasing the string length requires too much memory and CPU time to be practical. Ulitsky et al. introduced the average common substring measure (ACS), which is based on the average length of the longest common substrings. However, the ACS considers only the length of the longest common word starting at each position in the two sequences [8, 47]. The lengths of the other common words also play an important role in measuring the evolutionary distance between two sequences. Motivated by their work, in this paper we develop the harmonic distribution over all lengths of common substrings starting at each position between two sequences. Based on the harmonic distribution, we propose a new alignment-free method for phylogenetic analysis.

The proposed method is tested by phylogenetic analysis of two data sets: 24 transferrin sequences from vertebrates and 26 spike protein sequences from coronaviruses. The results demonstrate that the new method is effective and feasible.

2 Materials and Methods

2.1 Average Common Substring Measure

The average common substring measure is based on the longest common words between two sequences. It was introduced by Ulitsky et al. [47] as the average length of the longest common substrings starting at each position in both sequences.

Let \(A=A_{1}A_{2} \ldots\,A_{n}\) and \(B=B_{1}B_{2}\ldots\,B_{m}\) be two sequences of lengths n and m, respectively. For any position i in A, the substring of A of length l(i) starting at i is denoted by \(A(i, i+l(i)-1)=A_{i}A_{i+1}\ldots\,A_{i+l(i)-1}\). At each position i in A, we search for the longest substring starting at i that also occurs anywhere in B. Let \(\omega_{i}\) be this substring and \(|\omega_{i}|\) its length. Averaging the lengths \(|\omega_{i}|\) gives the measure \(L(A,B)=\sum_{i=1}^{n}|\omega_{i}|/n\). Intuitively, the larger L(A,B) is, the more similar the two sequences are. Because L(A,B) grows with the length of B, the similarity between A and B is normalized as L(A,B)/log(m). The average common substring distance is obtained by taking the reciprocal of L(A,B)/log(m) and subtracting a “correction term”, which ensures that d(A,A) = 0: d(A,B) = log(m)/L(A,B) − log(n)/L(A,A). Since in general d(A,B) ≠ d(B,A), the average common substring measure is finally defined by

$$ ACS(A,B)=\frac{1}{2}(d(A,B)+d(B,A)). $$
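
The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of Ulitsky et al. [47]: the function names are hypothetical, positions are 0-based, both sequences are assumed to be non-empty, and the naive substring test stands in for the suffix-tree search that a practical implementation would use. Returning infinity when the two sequences share no residue is also an assumption of this sketch.

```python
import math


def longest_match_length(a: str, i: int, b: str) -> int:
    """|omega_i|: length of the longest prefix of a[i:] that occurs in b."""
    length = 0
    # If a[i:i+k] occurs in b, every shorter prefix occurs too, so we can
    # extend one character at a time until the match fails.
    while i + length < len(a) and a[i:i + length + 1] in b:
        length += 1
    return length


def mean_match_length(a: str, b: str) -> float:
    """L(A,B): average of |omega_i| over all positions i of a."""
    return sum(longest_match_length(a, i, b) for i in range(len(a))) / len(a)


def d(a: str, b: str) -> float:
    """d(A,B) = log(m)/L(A,B) - log(n)/L(A,A), with n = |a| and m = |b|."""
    l_ab = mean_match_length(a, b)
    if l_ab == 0.0:
        return math.inf  # assumption: no shared residue means infinite distance
    return math.log(len(b)) / l_ab - math.log(len(a)) / mean_match_length(a, a)


def acs(a: str, b: str) -> float:
    """ACS(A,B): symmetrized average common substring distance."""
    return 0.5 * (d(a, b) + d(b, a))
```

By construction, `acs(a, a)` is zero because the two terms of `d(a, a)` cancel, and `acs(a, b)` equals `acs(b, a)`.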

As described above, this distance considers only the length of the longest common substring starting at each position in both sequences. In fact, the lengths of the other common substrings also play an important role in measuring the similarity between two sequences. Therefore, we propose a novel measure that involves all lengths of common substrings between two sequences.

2.2 Harmonic Common Substring Measure

At each position i in A, we search for the longest word, the second longest word, the third longest word, and so on, that start at i and are common to B. Let \(\omega_{ij}^{A}\) be the common substring of length j that starts at position i in A and occurs anywhere in B, and let \(n_{ij}^{A}\) be the frequency of \(\omega_{ij}^{A}\) in B. We define the random variable \(HCS_{i}^{A}\) to represent the harmonic distribution over all lengths of common substrings starting at position i in A. The distribution of \(HCS_{i}^{A}\) is given by

$$
\begin{array}{c|cccc}
HCS_{i}^{A} & 1 & 2 & \cdots & L_{i} \\
\hline
P & \frac{\frac{1}{n_{i1}^{A}}}{\frac{1}{n_{i1}^{A}}+\cdots+\frac{1}{n_{iL_{i}}^{A}}} & \frac{\frac{1}{n_{i2}^{A}}}{\frac{1}{n_{i1}^{A}}+\cdots+\frac{1}{n_{iL_{i}}^{A}}} & \cdots & \frac{\frac{1}{n_{iL_{i}}^{A}}}{\frac{1}{n_{i1}^{A}}+\cdots+\frac{1}{n_{iL_{i}}^{A}}}
\end{array}
$$

Here \(L_{i}\) is the length of the longest common word starting at position i in A.

For each position i in A, we can obtain the distribution of \(HCS_{i}^{A}\). Its expectation, denoted by \(EHCS_{i}^{A}\), is computed by

$$ EHCS_{i}^{A}=\sum_{k=1}^{L_{i}}k\frac{\frac{1}{n_{ik}^{A}}}{{\frac{1}{n_{i1}^{A}}}+\cdots+\frac{1}{n_{iL_{i}}^{A}}}. $$
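
As a small worked illustration with hypothetical counts (not taken from the data sets in this paper), suppose that the longest common word starting at position i has length \(L_{i}=3\), and that the common words of lengths 1, 2 and 3 occur \(n_{i1}^{A}=4\), \(n_{i2}^{A}=2\) and \(n_{i3}^{A}=1\) times in B. The harmonic weights are then 1/4, 1/2 and 1, and

$$ EHCS_{i}^{A}=\frac{1\cdot\frac{1}{4}+2\cdot\frac{1}{2}+3\cdot 1}{\frac{1}{4}+\frac{1}{2}+1}=\frac{17/4}{7/4}=\frac{17}{7}\approx 2.43. $$

For this position the ACS would use the value 3, whereas the expectation is pulled toward the shorter words, with rarer words receiving larger weights.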

Clearly, the expectation of \(HCS_{i}^{A}\) involves information not only from the longest common substring but also from the other common substrings. We can therefore derive the harmonic common substring measure from \(EHCS_{i}^{A}\). Firstly, we replace \(|\omega_{i}|\) by \(EHCS_{i}^{A}\) in L(A,B) to get \(EL(A,B)=\sum_{i=1}^{n}EHCS_{i}^{A}/n\). Secondly, we “normalize” EL(A,B) to get EL(A,B)/log(m) in order to account for the length of B. Thirdly, we derive the distance ED(A,B) = log(m)/EL(A,B) − log(n)/EL(A,A). Lastly, we define the harmonic common substring measure by computing

$$ HCS(A,B)=\frac{1}{2}(ED(A,B)+ED(B,A)). $$

Like ACS, HCS(A,B) is derived on the basis of the KL relative entropy [3, 47]. Given a set of amino acid sequences, our algorithm computes the pairwise distances for this set according to HCS(A,B). The substring search can be performed efficiently by using suffix trees [49]. It has been shown that computing the pairwise distances among m sequences of length up to l (here m denotes the number of sequences) takes \(O(m^{2}l \cdot \log(l))\) time [47].
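
The four steps of the HCS definition can be sketched in Python as follows. This is a naive illustration under stated assumptions, not the suffix-tree algorithm referred to above: all function names are hypothetical, positions are 0-based, occurrences in B are counted with overlaps, a position with no common word is assumed to contribute an expectation of 0, and the distance is taken to be infinite when EL(A,B) is 0.

```python
import math
from typing import List


def count_occurrences(word: str, text: str) -> int:
    """Count occurrences of word in text, allowing overlaps."""
    count = 0
    start = text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count


def ehcs(a: str, i: int, b: str) -> float:
    """EHCS_i^A: expected common-word length at 0-based position i of a."""
    inverse_counts = []  # inverse_counts[k - 1] = 1 / n_ik for k = 1..L_i
    k = 1
    while i + k <= len(a):
        n_ik = count_occurrences(a[i:i + k], b)
        if n_ik == 0:  # a[i:i+k] is absent from b, so L_i = k - 1
            break
        inverse_counts.append(1.0 / n_ik)
        k += 1
    if not inverse_counts:
        return 0.0  # assumption: no common word at this position
    total = sum(inverse_counts)
    return sum(length * w for length, w in enumerate(inverse_counts, 1)) / total


def el(a: str, b: str) -> float:
    """EL(A,B): average of EHCS_i^A over all positions i of a."""
    return sum(ehcs(a, i, b) for i in range(len(a))) / len(a)


def ed(a: str, b: str) -> float:
    """ED(A,B) = log(m)/EL(A,B) - log(n)/EL(A,A), with n = |a| and m = |b|."""
    el_ab = el(a, b)
    if el_ab == 0.0:
        return math.inf  # assumption: no shared residue means infinite distance
    return math.log(len(b)) / el_ab - math.log(len(a)) / el(a, a)


def hcs(a: str, b: str) -> float:
    """HCS(A,B): symmetrized harmonic common substring distance."""
    return 0.5 * (ed(a, b) + ed(b, a))


def distance_matrix(seqs: List[str]) -> List[List[float]]:
    """Pairwise HCS distances for a list of non-empty sequences."""
    return [[hcs(a, b) for b in seqs] for a in seqs]
```

The resulting matrix is symmetric with a zero diagonal. The repeated `str.find` scans make this sketch far slower than a suffix-tree search, so it is suitable only for short sequences.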

3 Results and Discussion

In this section, we apply our method to two sets of proteins to see how much phylogenetic information HCS(A,B) can extract. Generally, the validity of a phylogenetic tree can be tested by comparing it with authoritative ones. Here, we adopt this idea to test the validity of our phylogenetic trees.

3.1 Phylogenetic Analysis of Transferrin

In the first experiment, we choose transferrin sequences from 24 vertebrates as the data set. Taxonomic information and accession numbers are provided in Table 1. The proteomic sequence is a concatenation of all the known amino acid sequences for an organism, with delimiters between them. All sequences were obtained from the NCBI genome database in FASTA format.

Table 1 Transferrin sequences, sources, and accession numbers

The phylogenetic tree illustrated in Fig. 1 is constructed from the HCS(A,B) distances using the UPGMA method in the PHYLIP package [6]. To assess the validity of our evolutionary tree, we show the result of Dai et al. [4] in Fig. 2.
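
To pass a matrix of pairwise distances to PHYLIP, it has to be written in PHYLIP's distance-matrix format: a first line with the number of taxa, followed by one line per taxon with a name occupying ten characters and then the distances. The sketch below writes the square form of that format, which can be read by, for example, PHYLIP's `neighbor` program with its UPGMA option. The function name and the taxon labels in the usage example are hypothetical, and truncating names to ten characters assumes that the truncated names remain unique.

```python
from typing import List, Sequence


def format_phylip_matrix(names: Sequence[str],
                         matrix: Sequence[Sequence[float]]) -> str:
    """Return a square distance matrix as PHYLIP-format text."""
    if len(names) != len(matrix) or any(len(row) != len(names) for row in matrix):
        raise ValueError("matrix must be square and match the number of names")
    lines: List[str] = [f"{len(names):5d}"]  # header: number of taxa
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)  # PHYLIP names occupy exactly 10 characters
        distances = " ".join(f"{value:.6f}" for value in row)
        lines.append(f"{label} {distances}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical labels and distances, for illustration only.
    text = format_phylip_matrix(["Human_TF", "Rat_TF"],
                                [[0.0, 0.5], [0.5, 0.0]])
    print(text, end="")
```

The returned text can be saved as the input file of the chosen PHYLIP program.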

Fig. 1

The phylogenetic tree constructed by our method HCS(A,B). The proteomic sequence is a concatenation of all the known amino acid sequences for an organism, also with delimiters. Our phylogenetic tree can be obtained at any ionic strength, temperature, time

Fig. 2

The phylogenetic tree based on the distance of the structural characteristic vector in Dai et al. [4]. The proteomic sequence is a concatenation of all the known amino acid sequences for an organism, also with delimiters. The phylogenetic tree can be obtained at any ionic strength, temperature, time

Comparing the trees in Figs. 1 and 2, we find that ours agrees better with the classical result:

  1. Of the two trees, the one in Fig. 1 is more consistent with the tree constructed by Ford [7], which is the classical result among published trees. This supports the validity of our method. From Fig. 1 we can observe that the transferrin (TF) proteins and lactoferrin (LF) proteins are well separated and grouped accurately into their respective taxonomic classes.

  2. In Fig. 1, the Human TF, Rabbit TF, Rat TF and Cow TF are clustered into the same branch, whereas in Fig. 2 the Rat TF and Cow TF are separated from the Human TF and Rabbit TF, which contradicts the classical result.

  3. The transferrin (TF) proteins and lactoferrin (LF) proteins are clustered into their corresponding branches in Fig. 1, whereas in Fig. 2 they are mixed together and lie far from each other. This contradicts the traditional view.

  4. With respect to the Possum transferrin, our result in Fig. 1 is generally better than that in Fig. 2, which shows that our result is closer to the classical results.

In summary, our method shows a clear advantage over the method of Dai et al. [4] on this data set.

3.2 Phylogenetic Analysis of Spike Proteins

To further verify the validity of our method, in the second experiment we turn to the spike proteins of coronaviruses. The phylogeny of coronavirus protein sequences has been studied by different methods, such as multiple sequence alignment, graphical representation, and word frequency [13, 24, 26, 48, 52]. Here the phylogenetic tree for the 26 coronavirus spike protein sequences listed in Table 2 is constructed by our method and presented in Fig. 3. The proteomic sequence is a concatenation of all the known amino acid sequences for an organism, with delimiters between them. All sequences were obtained from the NCBI genome database in FASTA format.

Table 2 Coronavirus spike protein sequences, sources, and accession numbers

Fig. 3

The phylogenetic tree for the 26 spike proteins constructed by our method HCS(A,B). The proteomic sequence is a concatenation of all the known amino acid sequences for an organism, also with delimiters. Our phylogenetic tree can be obtained at any ionic strength, temperature, time

From Fig. 3, we can see that the phylogenetic tree constructed by our method is consistent with the known facts of evolution [52]:

  1. As can be seen from Fig. 3, the SARS-CoVs cluster together and form a new separate branch, which is not closely related to any of the other groups.

  2. With respect to HCoV-OC43, our result in Fig. 3 is the same as the result of Yang et al. [52], which shows that our result is close to the classical results.

4 Conclusion

With the rapid progress of genome sequencing projects worldwide, more and more biological sequences have become available. However, traditional sequence alignment tools and standard evolutionary models cannot cope with large-scale protein sequence data. Alignment-free methods are therefore of great value, as they remove the technical constraints of alignment.

In the present study, we propose a novel alignment-free method, the harmonic common substring measure, for phylogenetic reconstruction based on protein sequences. It is well known that the more similar two sequences are, the greater the number of factors (substrings) they share. The main advantage of this algorithm is therefore that it can extract more of the information hidden in common substrings. Our examples indicate that our method is at least as good as, and usually better than, some existing alignment-free methods, in terms of both reconstruction accuracy and computational efficiency.