Theoretical Chemistry Accounts, Volume 125, Issue 3–6, pp 569–581

Similarity clustering of proteins using substantive knowledge and reconstruction of evolutionary gene histories in herpesvirus

  • Boris Mirkin
  • Renata Camargo
  • Trevor Fenner
  • George Loizou
  • Paul Kellam
Regular Article


The issue of clustering proteins into homologous protein families (HPFs) has attracted considerable attention from researchers. On the one hand, many databases of protein families have been developed by using popular sequence alignment tools and relatively simple clustering methods followed by extensive manual curation. On the other hand, more elaborate clustering approaches have been used, yet with a very limited degree of success. This paper advocates an approach to clustering protein families that uses knowledge of protein functions to adjust the similarity scale shift parameter. A further source of external information is used when we reconstruct HPF evolutionary histories over an evolutionary tree; the consistency between these histories and information on gene arrangement in the genomes is used to narrow down the choice of clustering.


Evolutionary Tree; Similarity Shift; Neighbourhood List; Substantive Knowledge; Hypothetical Ancestor



The authors thank the Wellcome Trust for its support under Grant 072831/Z/03/Z to Birkbeck, University of London. We are grateful to the anonymous reviewers for many helpful remarks and suggestions, and also to the editors for their efficient management of the reviewing process.



Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  • Boris Mirkin (1, 2)
  • Renata Camargo (1)
  • Trevor Fenner (1)
  • George Loizou (1)
  • Paul Kellam (3)
  1. School of Computer Science and Information Systems, Birkbeck College, University of London, London, UK
  2. Division of Applied Mathematics, Higher School of Economics, Moscow, Russia
  3. Centre for Virology, Department of Infection, University College London, London, UK
