Whole Proteome Prokaryote Phylogeny Without Sequence Alignment: A K-String Composition Approach

Journal of Molecular Evolution


We propose a systematic way of inferring the evolutionary relatedness of microbial organisms from their oligopeptide content, i.e., the frequency of amino acid K-strings in their complete proteomes. The method circumvents the ambiguity of choosing genes for phylogenetic reconstruction and avoids the need to align sequences that differ substantially in length and gene content. The only “parameter” in the method is the length K of the oligopeptides, which tunes the “resolution power” of the method. The topology of the trees converges as K increases. Applied to a total of 109 organisms, including 16 Archaea, 87 Bacteria, and 6 Eukarya, the method yields an unrooted tree that agrees with the biologists’ “tree of life” based on SSU rRNA comparison in a majority of basic branchings and, in particular, in all lower taxa.
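As a rough illustration of what "K-string composition" means, the following Python sketch counts overlapping K-strings across a set of protein sequences and converts the counts into relative frequencies. The function names, the toy sequences, and the cosine-based dissimilarity are assumptions made for illustration only; the abstract does not state how the composition vectors are normalized or compared, so this should not be read as the authors' procedure.

```python
from collections import Counter
from math import sqrt


def kstring_frequencies(proteins, k):
    """Relative frequencies of overlapping K-strings over all sequences.

    `proteins` is an iterable of amino acid strings (a toy "proteome").
    Sequences shorter than k contribute nothing.
    """
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {s: c / total for s, c in counts.items()}


def cosine_distance(f, g):
    """Illustrative dissimilarity: 1 minus the cosine of two frequency vectors.

    This measure is an assumption of the sketch, not taken from the source.
    """
    dot = sum(v * g.get(s, 0.0) for s, v in f.items())
    norm = sqrt(sum(v * v for v in f.values())) * sqrt(sum(v * v for v in g.values()))
    if norm == 0:
        return 1.0  # treat an empty composition as maximally dissimilar
    return 1.0 - dot / norm


if __name__ == "__main__":
    # Hypothetical toy sequences, not real proteomes.
    org_a = ["MKVLAAGIV", "MKVLSAG"]
    org_b = ["MKVLAAGLV", "MRVLSAG"]
    for k in (2, 3, 4):
        fa = kstring_frequencies(org_a, k)
        fb = kstring_frequencies(org_b, k)
        print(k, len(fa), len(fb), round(cosine_distance(fa, fb), 3))
```

Longer K-strings yield more distinct strings and fewer matches shared by chance, which is one way to picture K acting as a resolution parameter.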


  1. B Alberts (1994) Molecular biology of the cell, 3rd ed. Garland, New York, 121

  2. L Aravind, RL Tatusov, YI Wolf, DR Walker, EV Koonin et al. (1998) Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet 14:442–444. doi:10.1016/S0168-9525(98)01553-4, PMID 9825671

  3. SL Baldauf, JD Palmer, WF Doolittle (1996) The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny. Proc Natl Acad Sci USA 93:7749–7754. doi:10.1073/pnas.93.15.7749, PMID 8755547

  4. DA Benson, I Karsch-Mizrachi, DJ Lipman, J Ostell, DL Wheeler (2003) GenBank. Nucleic Acids Res 31:23–27. doi:10.1093/nar/gkg057, PMID 12519940

  5. Bergey’s Manual Trust (1984–1989) Bergey’s manual of systematic bacteriology, 1st ed, Vols 1–4. Williams & Wilkins, Baltimore

  6. Bergey’s Manual Trust (2001) Bergey’s manual of systematic bacteriology, 2nd ed, Vol 1. Springer-Verlag, New York

  7. V Brendel, JS Beckmann, EN Trifonov (1986) Linguistics of nucleotide sequences: Morphology and comparison of vocabularies. J Biomol Struct Dyn 4:11–21. PMID 3078230

  8. K Chu, J Qi, Z Yu, VO Anh (2003) Origin and phylogeny of chloroplasts: A simple correlation analysis of complete genomes. Mol Biol Evol. doi:10.1093/molbev/msg021, PMID 12598686

  9. V Daubin, M Gouy, G Perriere (2001) Bacterial molecular phylogeny using supertree approach. Genome Inform 12:155–164

  10. WF Doolittle (1999) Phylogenetic classification and the universal tree. Science 284:2124–2128. PMID 10381871

  11. WF Doolittle (2000) Uprooting the tree of life. Sci Am February 90–95

  12. J Felsenstein (1993) PHYLIP (phylogeny inference package), version 3.5c. Distributed by the author at

  13. ST Fitz-Gibbon, CH House (1999) Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res 27:4218–4222. doi:10.1093/nar/27.21.4218, PMID 10518613

  14. GM Garrity, M Winters, DB Searles (2001) Taxonomic outline of the prokaryotic genera. In: Bergey’s manual of systematic bacteriology, 2nd ed, Rel 1.0 (Available at )

  15. BL Hao, HM Xie, SY Zhang (2001) Compositional representation of protein sequences and the number of Eulerian loops. (Available at arXiv:physics/0103028)

  16. R Hu, B Wang (2001) Statistically significant strings are related to regulatory elements in the promoter region of Saccharomyces cerevisiae. Physica A 290:464–474. doi:10.1016/S0378-4371(00)00488-X

  17. MA Huynen, B Snel, P Bork (1999) Lateral gene transfer, genome surveys, and the phylogeny of prokaryotes. Science 286:1441. doi:10.1126/science.286.5444.1443a

  18. S Karlin, C Burge (1995) Dinucleotide relative abundance extremes: A genomic signature. Trends Genet 11:283–290. PMID 7482779

  19. LM Margulis, KV Schwartz (1998) Five kingdoms, 3rd ed. WH Freeman, New York, 60

  20. O Matte-Tailliez, C Brochier, P Forterre, H Philippe (2002) Archaeal phylogeny based on ribosomal proteins. Mol Biol Evol 19:631–639. PMID 11961097

  21. RGE Murray (1989) The higher taxa, or, a place for everything…? In: ST Williams, ME Sharpe, JG Holt (eds) Bergey’s manual of systematic bacteriology, Vol 4. Williams and Wilkins, Baltimore, 2329–2332

  22. E Pennisi (1998) Genome data shake tree of life. Science 280:672–674. doi:10.1126/science.280.5364.672, PMID 9599142

  23. E Pennisi (1999) Is it time to uproot the tree of life? Science 284:1305–1308. doi:10.1126/science.284.5418.1305, PMID 10383313

  24. MA Ragan (2001) Detection of lateral gene transfer among microbial genomes. Curr Opin Genet Dev 11:620–626. doi:10.1016/S0959-437X(00)00244-6

  25. N Saitou, M Nei (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425. PMID 3447015

  26. B Snel, P Bork, MA Huynen (1999) Genome phylogeny based on gene content. Nature Genet 21:108–110. doi:10.1038/5052, PMID 9916801

  27. F Tekaia, A Lazcano, B Dujon (1999) The genomic tree as revealed from whole genome proteome comparisons. Genome Res 9:550–557. PMID 10400922

  28. JF Tomb, O White, AR Kerlavage et al. (1997) The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388:539–547. PMID 9252185

  29. DL Wheeler, DM Church, S Federhen et al. (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res 31:28–33. doi:10.1093/nar/gkg033, PMID 12519941

  30. CR Woese (1998) The universal ancestor. Proc Natl Acad Sci USA 95:6854–6859. PMID 9618502

  31. CR Woese (2000) Interpreting the universal phylogenetic tree. Proc Natl Acad Sci USA 97:8392–8396. PMID 10900003

  32. CR Woese, GE Fox (1977) Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proc Natl Acad Sci USA 74:5088–5090. PMID 270744

  33. YI Wolf, IB Rogozin, NV Grishin, RL Tatusov, EV Koonin (2001) Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol Biol 1:8. doi:10.1186/1471-2148-1-8, PMID 11734060

  34. E Zuckerkandl, L Pauling (1965) Evolutionary divergence and convergence in proteins. In: V Bryson, HJ Vogel (eds) Evolving genes and proteins. Academic Press, New York, 97–166


The authors thank Drs. Yang Zhong (Fudan University) and Hongya Gu (Peking University) for discussion and comments. We also thank an anonymous referee who pointed out the problem with small genomes. The use of the 64 CPU IBM Cluster at Peking University is also gratefully acknowledged. This work was supported in part by grants from the Special Funds for Major State Basic Research Project of China, the Innovation Project of CAS, and Major Innovation Research Project “248” of Beijing Municipality.

Corresponding author

Correspondence to Ji Qi.



The list of all prokaryotic genomes used in our study is given in Tables A1 and A2. The species are listed in accordance with their “Bergey Code” to make it easier to compare the trees with Bergey’s Manual. The Bergey Code is a shorthand for the classification given in the 2001 edition of Bergey’s Manual of Systematic Bacteriology (Garrity et al. 2001). For example, Lactococcus lactis is listed under Phylum BXIII (Firmicutes), Class III (Bacilli), Order II (Lactobacillales), Family VI (Streptococcaceae), Genus II (Lactococcus). We changed all Roman numerals to Arabic and wrote the lineage as B13., dropping the taxonomic units and the Latin names.
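The Roman-to-Arabic step of the Bergey Code can be sketched in Python as follows. The function names and the tuple return format are hypothetical, introduced only for illustration. Because the written form of the code is shown only partially in the text, the sketch returns the numeric ranks as a list and does not assume a particular separator or layout.

```python
# Values of the Roman numeral symbols needed for taxonomic ranks.
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XIII' or 'IV' to an integer."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # A smaller symbol before a larger one is subtracted (e.g. IV = 4).
        if ROMAN.get(nxt, 0) > value:
            total -= value
        else:
            total += value
    return total


def bergey_ranks(prefix, numerals):
    """Return the prefix letter and Arabic rank numbers for a lineage.

    `numerals` lists the Roman numerals for phylum, class, order,
    family and genus, in that order (a hypothetical input format).
    """
    return prefix, [roman_to_int(n) for n in numerals]


if __name__ == "__main__":
    # Lactococcus lactis: Phylum BXIII, Class III, Order II, Family VI, Genus II
    print(bergey_ranks("B", ["XIII", "III", "II", "VI", "II"]))
```

Keeping the ranks as integers makes it straightforward to sort species by lineage, which is how the tables are described as being ordered.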

Table 2 Archaea names, abbreviations, and NCBI accession numbers, ordered by their Bergey code
Table 3 Bacterium names, abbreviations, and NCBI accession numbers, ordered by their Bergey code

The six eukaryotes included are Saccharomyces cerevisiae (Yeast; NC_001133–48), Caenorhabditis elegans (worm; NC_003279–84), Arabidopsis thaliana (Arath; NC_003070.), Encephalitozoon cuniculi (Enccu; NC_003242.29–38), Plasmodium falciparum (Plafa; NC_000521.910.4314–18.25–31), and Schizosaccharomyces pombe (Schpo; NC_003421. 23.24).

Cite this article

Qi, J., Wang, B. & Hao, BI. Whole Proteome Prokaryote Phylogeny Without Sequence Alignment: A K-String Composition Approach. J Mol Evol 58, 1–11 (2004).
