Similarity clustering of proteins using substantive knowledge and reconstruction of evolutionary gene histories in herpesvirus

Mirkin, Boris; Camargo, Renata; Fenner, Trevor; Loizou, George; Kellam, Paul

doi:10.1007/s00214-009-0614-0

Regular Article
Published: 05 August 2009

Volume 125, pages 569–581, (2010)

Theoretical Chemistry Accounts

Boris Mirkin¹,², Renata Camargo¹, Trevor Fenner¹, George Loizou¹ & Paul Kellam³

Abstract

The problem of clustering proteins into homologous protein families (HPFs) has attracted considerable attention from researchers. On the one hand, many databases of protein families have been developed using popular sequence alignment tools and relatively simple clustering methods, followed by extensive manual curation. On the other hand, more elaborate clustering approaches have been used, yet with only a very limited degree of success. This paper advocates an approach to clustering protein families that uses knowledge of protein functions to adjust the similarity scale shift parameter. A further source of external information is utilised as we proceed to reconstruct HPF evolutionary histories over an evolutionary tree; the consistency between these histories and information on gene arrangement in the genomes is used to narrow down the choice of clustering.
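
The role of a similarity scale shift can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it assumes, purely for illustration, that a constant shift is subtracted from every pairwise similarity and that proteins joined by a chain of still-positive similarities fall into one cluster. The function name `cluster_with_shift`, the protein labels and the similarity values are all hypothetical.

```python
from typing import Dict, List, Tuple


def cluster_with_shift(
    similarity: Dict[Tuple[str, str], float],
    items: List[str],
    shift: float,
) -> List[List[str]]:
    """Group items linked by similarities that remain positive after the shift.

    Illustrative assumption: clusters are the connected components of the
    graph whose edges are the pairs with (similarity - shift) > 0.
    """
    # Union-find structure: every item starts as its own cluster root.
    parent = {item: item for item in items}

    def find(x: str) -> str:
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), s in similarity.items():
        if s - shift > 0:  # the pair is still "similar" after shifting
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Collect the members of each component and return them in sorted order.
    groups: Dict[str, List[str]] = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy data: four proteins and three pairwise similarity scores.
toy_items = ["p1", "p2", "p3", "p4"]
toy_similarity = {("p1", "p2"): 0.9, ("p2", "p3"): 0.4, ("p3", "p4"): 0.8}

for shift in (0.3, 0.5, 0.95):
    print(shift, cluster_with_shift(toy_similarity, toy_items, shift))
```

In this toy setting a larger shift yields more, smaller clusters, so choosing the shift amounts to choosing the granularity of the families; the abstract describes using knowledge of protein functions to make that choice.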

Acknowledgments

The authors thank the Wellcome Trust for its support under Grant 072831/Z/03/Z to Birkbeck, University of London. We are grateful to the anonymous reviewers for many helpful remarks and suggestions, and also to the editors for their efficient management of the reviewing process.

Author information

Authors and Affiliations

School of Computer Science and Information Systems, Birkbeck College, University of London, London, UK
Boris Mirkin, Renata Camargo, Trevor Fenner & George Loizou
Division of Applied Mathematics, Higher School of Economics, Moscow, Russia
Boris Mirkin
Centre for Virology, Department of Infection, University College London, London, UK
Paul Kellam

Corresponding author

Correspondence to Boris Mirkin.

Additional information

Dedicated to Professor Sandor Suhai on the occasion of his 65th birthday and published as part of the Suhai Festschrift Issue.

About this article

Cite this article

Mirkin, B., Camargo, R., Fenner, T. et al. Similarity clustering of proteins using substantive knowledge and reconstruction of evolutionary gene histories in herpesvirus. Theor Chem Acc 125, 569–581 (2010). https://doi.org/10.1007/s00214-009-0614-0

Received: 27 January 2009
Accepted: 18 July 2009
Published: 05 August 2009
Issue Date: March 2010
DOI: https://doi.org/10.1007/s00214-009-0614-0
