Bisecting K-Means Initialization Technique for Protein Sequence Motif Identification
Bioinformatics applies information technology to the management and analysis of biological data. Within bioinformatics, data mining helps researchers mine large amounts of biomolecular data. Major research efforts in the field involve sequence analysis, protein structure prediction, and gene finding. Proteins are important molecules in all living organisms and are involved in virtually all cell functions. Protein sequence motifs are short fragments of conserved amino acids that recur across protein sequences. Identifying such motifs is one of the challenging tasks in bioinformatics, and data mining is one technique for exploring them. Motifs are identified from segments of protein sequences, but not all generated segments are significant for finding motifs, and the segments carry no classes or labels. Hence, a Singular Value Decomposition (SVD) entropy technique is adopted as a preprocessing method to select sequence segments. Adaptive Fuzzy C-Means clustering is performed on the selected segments to obtain granules. Bisecting K-Means is then applied to each granule to obtain the specified number of clusters. The resulting cluster centroids are given as initial seeds to the K-Means algorithm, which clusters each granule separately. The results obtained with this new initialization technique are compared with those of K-Means clustering under random initialization. The comparison shows that the new seed selection technique performs better than random initialization. The proposed method identifies significant motif patterns.
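The seeding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the split criterion (largest within-cluster squared error), the squared Euclidean distance, the helper names, and the toy 2-D points are all assumptions made for illustration, and the abstract does not specify them. The sketch runs Bisecting K-Means until k clusters exist, then passes their centroids to ordinary K-Means as starting seeds.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def assign(points, centroids):
    """Group points by their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        clusters[idx].append(p)
    return clusters


def kmeans(points, seeds, iters=50):
    """Lloyd's K-Means started from the given seed centroids."""
    centroids = list(seeds)
    for _ in range(iters):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids, assign(points, centroids)


def sse(cluster):
    """Within-cluster sum of squared errors."""
    c = mean(cluster)
    return sum(sq_dist(p, c) for p in cluster)


def bisecting_seeds(points, k, rng):
    """Bisecting K-Means: repeatedly split one cluster in two until k exist.

    Assumption: the cluster with the largest SSE is split next.
    Returns the k cluster centroids, to be used as K-Means seeds.
    """
    if k > len(set(points)):
        raise ValueError("k exceeds the number of distinct points")
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)
        clusters.remove(worst)
        # Two distinct starting points keep both halves non-empty.
        start = rng.sample(sorted(set(worst)), 2)
        _, halves = kmeans(worst, start)
        clusters.extend(halves)
    return [mean(c) for c in clusters]


# Toy data standing in for one granule: three tight groups of 2-D points.
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5)]
points = [(cx + dx, cy + dy)
          for cx, cy in [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
          for dx, dy in offsets]

rng = random.Random(0)
seeds = bisecting_seeds(points, k=3, rng=rng)
centroids, clusters = kmeans(points, seeds)
print(sorted(centroids))
```

In the pipeline the abstract describes, this would be run once per granule produced by the fuzzy clustering stage, with the segment representation and distance measure used in the paper substituted for the toy points and Euclidean distance here.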
Keywords: Sequence Motifs, HSSP, SVD, Bisecting K-Means, Adaptive Fuzzy C-Means