Multi-domain Protein Family Classification Using Isomorphic Inter-property Relationships
Multi-domain proteins result from the duplication and combination of a complex but limited set of domains. The ability to distinguish multi-domain homologs from unrelated pairs that merely share a domain is essential to genomic analysis. Heuristics based on sequence similarity and alignment coverage have been proposed to screen out domain insertions, but have met with limited success. In this paper we propose a protein classification schema for multi-domain protein superfamilies. We build segmented profiles of physico-chemical properties and amino acid composition, then apply vector-quantization-based dimensionality reduction to obtain a feature profile for rule discovery and classification. Association rules are mined to identify the isomorphic relationships that govern the formation of domains between proteins, so that homologous pairs are correctly predicted and unrelated pairs, including those that share domains, are rejected. Our results demonstrate that conserved domain classes can be classified effectively using these feature profiles, and that the classifier is not susceptible to the class imbalances frequently encountered in these databases.
Keywords: Multi-domain proteins, supervised classification, association rules
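The pipeline outlined in the abstract (segmented property profiles, vector quantization, association rules) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: hydropathy on the standard Kyte-Doolittle scale stands in for the full set of physico-chemical properties, a fixed hand-picked codebook replaces a learned vector quantizer, and a single rule is scored by support and confidence rather than mined with a full algorithm such as Apriori. The toy sequences, the family labels `famA` and `famB`, and all function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale (one physico-chemical property, used here as a stand-in).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def segment_profile(seq, n_segments=2):
    """Mean hydropathy of each of n_segments contiguous, near-equal segments.

    Assumes len(seq) >= n_segments so that no segment is empty.
    """
    bounds = [round(i * len(seq) / n_segments) for i in range(n_segments + 1)]
    profile = []
    for start, end in zip(bounds, bounds[1:]):
        chunk = seq[start:end]
        profile.append(sum(KD[a] for a in chunk) / len(chunk))
    return profile


def quantize(profile, codebook=(-3.0, 0.0, 3.0), labels=("low", "mid", "high")):
    """Map each segment value to its nearest codeword and emit one item per segment.

    The codebook is hand-picked for illustration; a real quantizer would learn it.
    """
    items = set()
    for i, value in enumerate(profile):
        k = min(range(len(codebook)), key=lambda j: abs(value - codebook[j]))
        items.add(f"s{i}={labels[k]}")
    return items


def rule_stats(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_ante = sum(antecedent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / len(transactions)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Hypothetical toy data: (sequence, family label).
toy = [("IIVVKKDD", "famA"), ("LLVVRRDE", "famA"), ("KKDDIIVV", "famB")]

# Each protein becomes a "transaction": its quantized segment items plus its class item.
transactions = [quantize(segment_profile(s)) | {f"class={c}"} for s, c in toy]

support, confidence = rule_stats(transactions, {"s0=high"}, {"class=famA"})
print(support, confidence)
```

Treating the class label as just another item lets the same support and confidence computation score class-predicting rules, which is the usual setup for associative classification.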