Similarity clustering of proteins using substantive knowledge and reconstruction of evolutionary gene histories in herpesvirus
The problem of clustering proteins into homologous protein families (HPFs) has attracted considerable attention from researchers. On the one hand, many databases of protein families have been developed using popular sequence alignment tools and relatively simple clustering methods, followed by extensive manual curation. On the other hand, more elaborate clustering approaches have been tried, yet with very limited success. This paper advocates an approach to clustering protein families that uses knowledge of protein functions to adjust the similarity scale shift parameter. A further source of external information is used when we reconstruct HPF evolutionary histories over an evolutionary tree: the consistency between these histories and information on gene arrangement in the genomes is used to narrow down the choice of clustering.
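The idea of tuning a similarity shift against functional knowledge can be illustrated with a minimal sketch. It is not the authors' algorithm: the linking rule (two proteins are joined when their similarity stays positive after the shift is subtracted, and clusters are the resulting connected components), the error count, and all function and variable names are assumptions made purely for illustration.

```python
from itertools import combinations


def shifted_clusters(names, sim, shift):
    """Cluster items whose pairwise similarity stays positive after subtracting `shift`.

    Assumption: clusters are connected components of the thresholded graph.
    `sim` maps unordered pairs (as tuples) to similarity; missing pairs count as 0.
    """
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        s = sim.get((a, b), sim.get((b, a), 0.0))
        if s - shift > 0:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    # Sort clusters by their member names so the output order is deterministic.
    return sorted((frozenset(g) for g in groups.values()), key=lambda g: sorted(g))


def count_errors(clusters, same_function, different_function):
    """Count disagreements between a clustering and (hypothetical) functional knowledge."""
    label = {n: i for i, c in enumerate(clusters) for n in c}
    errors = sum(1 for a, b in same_function if label[a] != label[b])
    errors += sum(1 for a, b in different_function if label[a] == label[b])
    return errors


def pick_shift(names, sim, candidates, same_function, different_function):
    """Return the candidate shift whose clustering contradicts the fewest known pairs."""
    return min(
        candidates,
        key=lambda s: count_errors(
            shifted_clusters(names, sim, s), same_function, different_function
        ),
    )
```

In this sketch a small shift merges everything into one family and a large shift splits true families apart, so the pairs of proteins with known functions act as the external constraint that selects an intermediate value. The second stage described above, checking reconstructed histories against gene arrangement, is not covered here.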
Keywords: Evolutionary Tree, Similarity Shift, Neighbourhood List, Substantive Knowledge, Hypothetical Ancestor
The authors thank the Wellcome Trust for its support under Grant 072831/Z/03/Z to Birkbeck, University of London. We are grateful to the anonymous reviewers for many helpful remarks and suggestions, and also to the editors for their efficient management of the reviewing process.