Efficient Discovery of Structural Motifs from Protein Sequences with Combination of Flexible Intra- and Inter-block Gap Constraints

  • Chen-Ming Hsu
  • Chien-Yu Chen
  • Ching-Chi Hsu
  • Baw-Jhiune Liu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3918)


Discovering protein structural signatures directly from primary sequence information is a challenging task, because the residues associated with a functional motif are not necessarily clustered in one region of the sequence. This work proposes an algorithm that discovers conserved sequential blocks interleaved by large irregular gaps from a set of unaligned biological sequences. Unlike previous work that employs only one type of constraint on gap flexibility, we propose using a combination of intra- and inter-block gap constraints to discover longer patterns with larger irregular gaps. The smaller flexible intra-block gap constraint relaxes the restriction within local motif blocks while still keeping them compact, and the larger flexible inter-block gap constraint allows longer irregular gaps between compact motif blocks. Using two types of gap constraints for different purposes improves the efficiency of the mining process while maintaining the high accuracy of the mining results. The efficiency of the algorithm also helps to identify functional motifs that are conserved in only a small subset of the input sequences.
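To make the two kinds of gap constraint concrete, the following is a minimal sketch of how one might test whether a given block pattern occurs in a sequence. It is not the mining algorithm proposed in this work; it only checks a pattern that is already known. The function names `occurs` and `support`, the parameters `max_intra`, `min_inter`, and `max_inter`, and the choice to measure a gap as the number of skipped residues are all assumptions made for illustration.

```python
from typing import Sequence


def occurs(seq: str, blocks: Sequence[str],
           max_intra: int, min_inter: int, max_inter: int) -> bool:
    """Return True if the blocks occur in seq, in order, under both gap constraints.

    Assumes at least one block and that every block is non-empty.
    A gap is counted as the number of residues skipped between two matched residues.
    """
    # Flatten the pattern into (residue, min_gap, max_gap) steps.
    steps = []
    for b, block in enumerate(blocks):
        for i, residue in enumerate(block):
            if i > 0:
                steps.append((residue, 0, max_intra))          # inside a block: small gap
            elif b > 0:
                steps.append((residue, min_inter, max_inter))  # between blocks: large gap
            else:
                steps.append((residue, 0, 0))                  # first residue: bounds unused

    # Positions in seq where the pattern prefix matched so far can end.
    reachable = {p for p, c in enumerate(seq) if c == steps[0][0]}
    for residue, lo, hi in steps[1:]:
        nxt = set()
        for p in reachable:
            last = min(p + 1 + hi, len(seq) - 1)
            for q in range(p + 1 + lo, last + 1):
                if seq[q] == residue:
                    nxt.add(q)
        reachable = nxt
        if not reachable:
            return False
    return True


def support(seqs: Sequence[str], blocks: Sequence[str],
            max_intra: int, min_inter: int, max_inter: int) -> float:
    """Fraction of the input sequences that contain the pattern."""
    hits = sum(occurs(s, blocks, max_intra, min_inter, max_inter) for s in seqs)
    return hits / len(seqs)
```

With blocks `["CDE", "GH"]`, `max_intra=1`, `min_inter=2`, and `max_inter=6`, the sequence `MACXDEQQQQGHK` matches: one residue is skipped inside the first block and four are skipped between the blocks. The sequence `MACDEGHK` does not match, because the two blocks are adjacent and the inter-block gap falls below the minimum.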


Mining Process; Structural Motif; Sequential Pattern; Pattern Mining; Motif Block





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Chen-Ming Hsu (1)
  • Chien-Yu Chen (2)
  • Ching-Chi Hsu (3)
  • Baw-Jhiune Liu (1)

  1. Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan, R.O.C.
  2. Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.
  3. Institute for Information Industry, Taipei, Taiwan, R.O.C.
