Bisecting K-Means Initialization Technique for Protein Sequence Motif Identification

  • M. Chitralegha
  • K. Thangavel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8284)


Bioinformatics applies information technology to the management and analysis of biological data. In this field, data mining helps researchers mine large amounts of biomolecular data. Major research efforts in bioinformatics involve sequence analysis, protein structure prediction and gene finding. Proteins are important molecules in all living organisms and are involved in virtually all cell functions. Protein sequence motifs are short fragments of conserved amino acids that recur across protein sequences. Identifying such motifs is one of the challenging tasks in bioinformatics, and data mining is one technique for exploring them. Protein motifs are identified from segments of protein sequences. Not all generated sequence segments are significant for finding sequence motifs, and the segments carry no classes or labels. Hence, a Singular Value Decomposition (SVD) entropy technique is adopted as a preprocessing method to select sequence segments. The Adaptive Fuzzy C-Means clustering method is then applied to the selected segments to obtain granules. Next, Bisecting K-Means is applied to each granule to obtain the specified number of clusters. The resulting cluster centroids are given as initial seeds to the K-Means algorithm, which clusters each granule separately. The results obtained with this new initialization technique are compared with those of random initialization for K-Means clustering. The comparison shows that the new seed selection technique performs better than random initialization. The proposed method identifies significant motif patterns.
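
The seeding step described in the abstract (Bisecting K-Means produces centroids, which then initialize K-Means) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the segment encoding, the distance measure, or the rule for choosing which cluster to split, so the sketch assumes plain numeric vectors, squared Euclidean distance, and splitting the cluster with the largest sum of squared errors. To keep it reproducible, each 2-way split is seeded with a far-apart pair of points rather than with random trial splits. The function names `kmeans` and `bisecting_seeds` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    c = mean(points)
    return sum(sq_dist(p, c) for p in points)


def kmeans(points, centroids, iters=50):
    """Lloyd's K-Means started from the given centroids.

    Returns (centroids, labels). An empty cluster keeps its old centroid.
    """
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(iters):
        labels = [min(range(len(centroids)),
                      key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        new = []
        for j, c in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == j]
            new.append(mean(members) if members else c)
        if new == centroids:  # assignments are stable
            break
        centroids = new
    return centroids, labels


def bisecting_seeds(points, k):
    """Return up to k seed centroids by repeatedly bisecting clusters.

    Assumption: the cluster with the largest SSE is split next, and each
    split is a 2-means run seeded with two far-apart points.
    """
    clusters = [list(points)]
    while len(clusters) < k:
        target = max(clusters, key=sse)
        centre = mean(target)
        a = max(target, key=lambda p: sq_dist(p, centre))
        b = max(target, key=lambda p: sq_dist(p, a))
        _, labels = kmeans(target, [a, b])
        halves = [[p for p, l in zip(target, labels) if l == j]
                  for j in (0, 1)]
        if not halves[0] or not halves[1]:
            break  # nothing left to split (e.g. only duplicate points)
        clusters.remove(target)
        clusters.extend(halves)
    return [mean(c) for c in clusters]


# Toy data standing in for the segments of one granule (illustrative only).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 0), (5, 1), (6, 0), (6, 1),
          (20, 0), (20, 1), (21, 0), (21, 1)]
seeds = bisecting_seeds(points, 3)         # Bisecting K-Means supplies seeds
centroids, labels = kmeans(points, seeds)  # K-Means refines from those seeds
```

In the pipeline the abstract describes, a call like this would be made once per granule, and the baseline for comparison would pass randomly chosen points to `kmeans` in place of `seeds`.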


Keywords: Sequence Motifs, HSSP, SVD, Bisecting K-Means, Adaptive Fuzzy C-Means





Copyright information

© Springer International Publishing Switzerland 2013

Authors and Affiliations

  • M. Chitralegha (1)
  • K. Thangavel (1)
  1. Department of Computer Science, Periyar University, Salem, India
