Journal of Molecular Evolution, Volume 58, Issue 1, pp 1–11

Whole Proteome Prokaryote Phylogeny Without Sequence Alignment: A K-String Composition Approach

  • Ji Qi
  • Bin Wang
  • Bai-lin Hao


A systematic way of inferring the evolutionary relatedness of microbial organisms from their oligopeptide content, i.e., the frequency of amino acid K-strings in their complete proteomes, is proposed. The new method circumvents the ambiguity of choosing the genes for phylogenetic reconstruction and avoids the necessity of aligning sequences of essentially different length and gene content. The only “parameter” in the method is the length K of the oligopeptides, which serves to tune the “resolution power” of the method. The tree topology converges as K increases. Applied to a total of 109 organisms, including 16 Archaea, 87 Bacteria, and 6 Eukarya, the method yields an unrooted tree that agrees with the biologists’ “tree of life” based on SSU rRNA comparison in a majority of basic branchings and, in particular, in all lower taxa.
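The idea of comparing proteomes by K-string frequencies can be illustrated with a minimal sketch. The abstract does not specify how frequencies are normalized or how the distance is defined, so everything below is an assumption made for illustration: the function names `kstring_frequencies` and `composition_distance` are hypothetical, the frequencies are plain relative counts of overlapping K-strings, and the distance is taken to be (1 − cosine similarity)/2. It should not be read as the authors' actual procedure.

```python
from collections import Counter
from math import sqrt


def kstring_frequencies(proteins, k):
    """Relative frequencies of overlapping K-strings, pooled over all proteins.

    `proteins` is an iterable of amino acid strings (one per protein).
    Plain relative counts are an assumption of this sketch.
    """
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()} if total else {}


def composition_distance(freq_a, freq_b):
    """Illustrative distance (1 - cosine similarity) / 2 between two
    composition vectors. With non-negative frequencies it lies in [0, 0.5].
    The choice of this formula is an assumption, not taken from the abstract.
    """
    dot = sum(v * freq_b.get(s, 0.0) for s, v in freq_a.items())
    norm_a = sqrt(sum(v * v for v in freq_a.values()))
    norm_b = sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("empty composition vector")
    return (1.0 - dot / (norm_a * norm_b)) / 2.0


# Toy "proteomes" (made-up sequences, for illustration only).
proteome_a = ["MKVLA", "MKT"]
proteome_b = ["MKVLA", "MKT"]
proteome_c = ["GGGGG"]

K = 2  # the single parameter: length of the oligopeptides
freq_a = kstring_frequencies(proteome_a, K)
freq_b = kstring_frequencies(proteome_b, K)
freq_c = kstring_frequencies(proteome_c, K)

d_ab = composition_distance(freq_a, freq_b)  # identical compositions
d_ac = composition_distance(freq_a, freq_c)  # no shared K-strings
```

Computing such a distance for every pair of organisms gives a distance matrix, which is the usual input to a distance-based tree-building method. Because no alignment is involved, proteomes of very different size and gene content can be compared directly, and K is the only quantity that needs to be chosen.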


Prokaryote, Phylogeny, Archaea, K-strings, Compositional distance, Tree of life



The authors thank Drs. Yang Zhong (Fudan University) and Hongya Gu (Peking University) for discussion and comments. We also thank an anonymous referee who pointed out the problem with small genomes. The use of the 64 CPU IBM Cluster at Peking University is also gratefully acknowledged. This work was supported in part by grants from the Special Funds for Major State Basic Research Project of China, the Innovation Project of CAS, and Major Innovation Research Project “248” of Beijing Municipality.



Copyright information

© Springer-Verlag New York Inc. 2004

Authors and Affiliations

  1. The Institute of Theoretical Physics, Academia Sinica, Beijing 100080, China
  2. The T-Life Research Center, Fudan University, Shanghai 200433, China
