Similarity clustering of proteins using substantive knowledge and reconstruction of evolutionary gene histories in herpesvirus

  • Regular Article
Theoretical Chemistry Accounts

Abstract

Clustering proteins into homologous protein families (HPFs) has attracted considerable attention from researchers. On the one hand, many databases of protein families have been built using popular sequence alignment tools and relatively simple clustering methods, followed by extensive manual curation. On the other hand, more elaborate clustering approaches have been tried, yet with very limited success. This paper advocates an approach to clustering protein families that uses knowledge of protein functions to adjust the similarity scale shift parameter. A further source of external information is used as we reconstruct HPF evolutionary histories over an evolutionary tree: the consistency between these histories and information on gene arrangement in the genomes is used to narrow down the choice of clustering.
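To make the role of a similarity scale shift concrete, the following is a minimal sketch and not the method of the paper. It assumes that the shift is a constant subtracted from every pairwise similarity and that proteins are then grouped by plain connected components over the pairs whose shifted similarity stays positive. The abstract does not specify the clustering algorithm, so this linkage rule is swapped in purely for illustration; the function name `cluster_with_shift` and the toy similarity values are hypothetical.

```python
from typing import Dict, List, Tuple


def cluster_with_shift(
    items: List[str],
    similarity: Dict[Tuple[str, str], float],
    shift: float,
) -> List[List[str]]:
    """Group items linked by pairs whose shifted similarity is positive.

    Illustrative assumption: the "shift" is subtracted from every pairwise
    similarity, and clusters are connected components of the positive links.
    """
    parent = {x: x for x in items}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), s in similarity.items():
        if s - shift > 0:  # the pair survives the shift
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups: Dict[str, List[str]] = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    # Sort for a deterministic result.
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy data: four proteins and three pairwise similarities.
proteins = ["a", "b", "c", "d"]
toy_similarity = {("a", "b"): 0.9, ("b", "c"): 0.4, ("c", "d"): 0.8}

fine = cluster_with_shift(proteins, toy_similarity, shift=0.5)
coarse = cluster_with_shift(proteins, toy_similarity, shift=0.3)
```

In this toy setting a larger shift splits the proteins into more, smaller families and a smaller shift merges them, which is why the choice of shift matters and why external knowledge, such as protein function, is a natural guide for setting it.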

Acknowledgments

The authors thank the Wellcome Trust for its support under Grant 072831/Z/03/Z to Birkbeck, University of London. We are grateful to the anonymous reviewers for many helpful remarks and suggestions, and also to the editors for their efficient management of the reviewing process.

Author information

Corresponding author

Correspondence to Boris Mirkin.

Additional information

Dedicated to Professor Sandor Suhai on the occasion of his 65th birthday and published as part of the Suhai Festschrift Issue.

About this article

Cite this article

Mirkin, B., Camargo, R., Fenner, T. et al. Similarity clustering of proteins using substantive knowledge and reconstruction of evolutionary gene histories in herpesvirus. Theor Chem Acc 125, 569–581 (2010). https://doi.org/10.1007/s00214-009-0614-0

  • DOI: https://doi.org/10.1007/s00214-009-0614-0
