Abstract
The problem of clustering proteins into homologous protein families (HPFs) has attracted considerable attention from researchers. On the one hand, many databases of protein families have been developed using popular sequence alignment tools and relatively simple clustering methods, followed by extensive manual curation. On the other hand, more elaborate clustering approaches have been tried, but with very limited success. This paper advocates an approach to clustering protein families that uses knowledge of protein functions to adjust the similarity scale shift parameter. A further source of external information is used when we reconstruct HPF evolutionary histories over an evolutionary tree: the consistency between these histories and information on gene arrangement in the genomes is used to narrow down the choice of clustering.
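The idea of tuning a similarity shift against functional knowledge can be illustrated with a small sketch. It is not the paper's algorithm: the clustering step here is a plain connected-components stand-in, and every name in it (`cluster_with_shift`, `count_contradictions`, `choose_shift`, the toy matrix `sim`, and the annotated pairs) is hypothetical. It only shows how a shift value might be chosen so that the resulting clusters contradict as few known same-function and different-function pairs as possible.

```python
from itertools import combinations


def cluster_with_shift(sim, shift):
    """Stand-in clustering (an assumption, not the paper's method):
    link i and j when sim[i][j] - shift > 0, then return one
    connected-component label per item."""
    n = len(sim)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if sim[i][j] - shift > 0:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def count_contradictions(labels, same_pairs, diff_pairs):
    """Count annotated pairs that the clustering gets wrong."""
    split = sum(labels[i] != labels[j] for i, j in same_pairs)
    merged = sum(labels[i] == labels[j] for i, j in diff_pairs)
    return split + merged


def choose_shift(sim, candidates, same_pairs, diff_pairs):
    """Pick the candidate shift with the fewest contradictions."""
    return min(
        candidates,
        key=lambda b: count_contradictions(
            cluster_with_shift(sim, b), same_pairs, diff_pairs
        ),
    )


# Toy symmetric similarity matrix for four hypothetical proteins.
sim = [
    [1.0, 0.9, 0.4, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.4, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
same_pairs = [(0, 1), (2, 3)]  # assumed to share a function
diff_pairs = [(0, 2)]          # assumed to differ in function
candidates = [0.0, 0.3, 0.5, 0.85]

best_shift = choose_shift(sim, candidates, same_pairs, diff_pairs)
best_labels = cluster_with_shift(sim, best_shift)
```

On this toy data, small shifts merge all four proteins into one cluster and a large shift splits the second annotated pair, so an intermediate shift is selected.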
Acknowledgments
The authors thank the Wellcome Trust for its support under Grant 072831/Z/03/Z to Birkbeck, University of London. We are grateful to the anonymous reviewers for many helpful remarks and suggestions, and also to the editors for their efficient management of the reviewing process.
Additional information
Dedicated to Professor Sandor Suhai on the occasion of his 65th birthday and published as part of the Suhai Festschrift Issue.
Cite this article
Mirkin, B., Camargo, R., Fenner, T. et al. Similarity clustering of proteins using substantive knowledge and reconstruction of evolutionary gene histories in herpesvirus. Theor Chem Acc 125, 569–581 (2010). https://doi.org/10.1007/s00214-009-0614-0
DOI: https://doi.org/10.1007/s00214-009-0614-0