Efficient Discovery of Structural Motifs from Protein Sequences with Combination of Flexible Intra- and Inter-block Gap Constraints
Discovering protein structural signatures directly from primary sequence information is challenging because the residues associated with a functional motif are not necessarily clustered in one region of the sequence. This work proposes an algorithm that aims to discover conserved sequential blocks separated by large irregular gaps in a set of unaligned biological sequences. Unlike previous work, which employs only one type of constraint on gap flexibility, we propose combining intra-block and inter-block gap constraints to discover longer patterns with larger irregular gaps. The smaller flexible intra-block gap constraint relaxes the restriction within local motif blocks while keeping them compact, and the larger flexible inter-block gap constraint allows longer irregular gaps between compact motif blocks. Using two types of gap constraints for different purposes improves the efficiency of the mining process while maintaining high accuracy of the mining results. This efficiency also helps to identify functional motifs that are conserved in only a small subset of the input sequences.
Keywords: Mining Process, Structural Motif, Sequential Pattern, Pattern Mining, Motif Block
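To make the two kinds of gap constraint concrete, the following is a minimal illustrative sketch, not the algorithm proposed in the paper. It only checks whether a given pattern of blocks occurs in a sequence and counts its support; it does not mine patterns. The function names `occurs` and `support`, the parameters `max_intra` and `max_inter`, and the reading of a "flexible" gap as any number of skipped residues from zero up to a maximum are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def occurs(seq: str, blocks: Sequence[str], max_intra: int, max_inter: int) -> bool:
    """Return True if the blocks occur in seq, in order, within both gap limits.

    Assumed semantics (hypothetical, for illustration only):
    - consecutive residues inside a block may be separated by 0..max_intra residues;
    - the last residue of a block and the first residue of the next block
      may be separated by 0..max_inter residues.
    """
    # Flatten the blocks into (residue, max_gap_before) steps.
    steps: list[tuple[str, Optional[int]]] = []
    for block in blocks:
        for i, residue in enumerate(block):
            if not steps:
                limit = None          # the first residue may start anywhere
            elif i == 0:
                limit = max_inter     # crossing from one block to the next
            else:
                limit = max_intra     # staying inside the same block
            steps.append((residue, limit))
    if not steps:
        return True  # an empty pattern trivially occurs

    # Set of positions at which the matched prefix of the pattern can end.
    ends = {p for p, c in enumerate(seq) if c == steps[0][0]}
    for residue, limit in steps[1:]:
        nxt = set()
        for e in ends:
            # A gap of g skipped residues puts the next residue at e + 1 + g.
            for q in range(e + 1, min(len(seq), e + limit + 2)):
                if seq[q] == residue:
                    nxt.add(q)
        ends = nxt
        if not ends:
            return False
    return bool(ends)


def support(seqs: Sequence[str], blocks: Sequence[str],
            max_intra: int, max_inter: int) -> int:
    """Count how many sequences contain the pattern at least once."""
    return sum(occurs(s, blocks, max_intra, max_inter) for s in seqs)
```

In this sketch, a small `max_intra` keeps each block compact while a larger `max_inter` lets blocks sit far apart. For example, with blocks `["ACD", "GH"]`, `max_intra=1` and `max_inter=6`, the toy sequence `"ACxDxxxxxxGHK"` matches, but it stops matching if `max_inter` is reduced to 5 or `max_intra` to 0.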