Some Results on Flexible-Pattern Discovery

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1848))

Abstract

Given an input sequence of data, a “rigid” pattern is a repeating sequence, possibly interspersed with “don't care” characters. In practice, the patterns or motifs of interest are those that also allow a variable number of gaps (or “don't care” characters); we call these flexible motifs. The number of rigid motifs can be exponential in the size of the input sequence, and when the input is a sequence of real numbers there can be an uncountably infinite number of motifs (assuming two real numbers are equal if they are within some δ > 0 of each other). It has been shown earlier that, by suitably defining the notions of maximality and redundancy, there exist only a linear number (no more than 3n) of irredundant motifs, and that there is a polynomial-time algorithm to detect them. Here we present a uniform framework that encompasses both rigid and flexible motifs, with generalizations to sequences of sets and real numbers, and show the somewhat surprising result that the number of irredundant flexible motifs still has a linear bound. However, the algorithm to detect them has a higher complexity than that for rigid motifs.
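To make the distinction between rigid and flexible motifs concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it only locates occurrences of a motif that is already given, and it does not discover motifs or test maximality or redundancy. The token representation (a solid character, or a pair (lo, hi) giving the allowed number of "don't care" positions) and the names motif_to_regex and occurrences are assumptions introduced for illustration.

```python
import re

def motif_to_regex(motif):
    """Translate a motif into a regular expression string.

    Hypothetical representation: each token is either a single solid
    character, or a (lo, hi) pair meaning "between lo and hi don't-care
    positions". A rigid motif uses only pairs with lo == hi.
    """
    parts = []
    for token in motif:
        if isinstance(token, str):
            parts.append(re.escape(token))          # solid character
        else:
            lo, hi = token
            parts.append(".{%d,%d}" % (lo, hi))     # gap of don't cares
    return "".join(parts)

def occurrences(motif, s):
    """Return every start index in s at which the motif occurs.

    The lookahead wrapper makes each match zero-width, so overlapping
    occurrences are reported too.
    """
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(s)]

s = "abcaxxcayyyc"
rigid = ["a", (1, 1), "c"]      # 'a', exactly one don't care, 'c'
flexible = ["a", (1, 3), "c"]   # 'a', one to three don't cares, 'c'

print(occurrences(rigid, s))     # [0]
print(occurrences(flexible, s))  # [0, 3, 7]
```

On this input the rigid motif occurs once, while the flexible motif, which lets the gap vary from one to three positions, occurs three times. This shows how allowing variable gaps groups occurrences that a rigid pattern would treat as distinct.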


Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Parida, L. (2000). Some Results on Flexible-Pattern Discovery. In: Giancarlo, R., Sankoff, D. (eds) Combinatorial Pattern Matching. CPM 2000. Lecture Notes in Computer Science, vol 1848. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45123-4_5

  • DOI: https://doi.org/10.1007/3-540-45123-4_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67633-1

  • Online ISBN: 978-3-540-45123-5

  • eBook Packages: Springer Book Archive
