Finding Optimal Pairs of Patterns

  • Hideo Bannai
  • Heikki Hyyrö
  • Ayumi Shinohara
  • Masayuki Takeda
  • Kenta Nakai
  • Satoru Miyano
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3240)

Abstract

We consider the problem of finding the optimal pair of string patterns for discriminating between two sets of strings, i.e. finding the pair of patterns that is best with respect to some appropriate scoring function that gives higher scores to pattern pairs which occur more in the strings of one set, but less in the other. We present an \(O(N^2)\) time algorithm for finding the optimal pair of substring patterns, where N is the total length of the strings. The algorithm considers all possible Boolean combinations of the patterns, e.g. patterns of the form \(p \land \lnot q\), which indicates that the pattern pair is considered to match a given string s if p occurs in s AND q does NOT occur in s. The same algorithm can be applied to a variant of the problem where we are given a single set of sequences along with a numeric attribute assigned to each sequence, and the problem is to find the optimal pattern pair whose occurrence in the sequences is correlated with this numeric attribute. An efficient implementation based on suffix arrays is presented, and the algorithm is applied to several nucleotide sequence datasets of moderate size, combined with microarray gene expression data, aiming to find regulatory elements that cooperate, complement, or compete with each other in enhancing and/or silencing certain genomic functions.
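To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper: it enumerates substring pairs naively instead of using suffix arrays, and it makes no attempt to reach the stated time bound. The scoring function (fraction of matched positives minus fraction of matched negatives), the set of Boolean combinations, the length cap max_len, and all function names are assumptions chosen for illustration only.

```python
from itertools import combinations

# Boolean combinations of a = "p occurs in s" and b = "q occurs in s".
# This particular set is an illustrative assumption.
COMBINATIONS = {
    "p AND q": lambda a, b: a and b,
    "p AND NOT q": lambda a, b: a and not b,
    "NOT p AND q": lambda a, b: (not a) and b,
    "p OR q": lambda a, b: a or b,
}


def substrings(strings, max_len):
    """Return all distinct substrings of length 1..max_len, sorted."""
    return sorted({s[i:i + k]
                   for s in strings
                   for k in range(1, max_len + 1)
                   for i in range(len(s) - k + 1)})


def score(matched_pos, matched_neg, n_pos, n_neg):
    """Toy score (an assumption): higher when the pair matches more
    positive strings and fewer negative strings."""
    return matched_pos / n_pos - matched_neg / n_neg


def best_pattern_pair(pos, neg, max_len=3):
    """Exhaustively search substring pairs and Boolean combinations.

    Returns (score, p, q, combination_name) for the best pair found.
    """
    patterns = substrings(pos + neg, max_len)
    best = (float("-inf"), None, None, None)
    for p, q in combinations(patterns, 2):
        for name, f in COMBINATIONS.items():
            matched_pos = sum(f(p in s, q in s) for s in pos)
            matched_neg = sum(f(p in s, q in s) for s in neg)
            sc = score(matched_pos, matched_neg, len(pos), len(neg))
            if sc > best[0]:
                best = (sc, p, q, name)
    return best


if __name__ == "__main__":
    pos = ["AAC", "CAA"]   # strings the pair should match
    neg = ["AACT", "GG"]   # strings the pair should not match
    print(best_pattern_pair(pos, neg))
```

In this toy input no single substring occurs in both positive strings and in neither negative string, but a pair such as A AND NOT T matches exactly the positive set, so the search reaches the maximum toy score of 1.0. This illustrates why combinations involving negation can discriminate where a single pattern cannot.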

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Hideo Bannai (1)
  • Heikki Hyyrö (2)
  • Ayumi Shinohara (2, 3)
  • Masayuki Takeda (3)
  • Kenta Nakai (1)
  • Satoru Miyano (1)

  1. Human Genome Center, Institute of Medical Science, The University of Tokyo, Tokyo, Japan
  2. PRESTO, Japan Science and Technology Agency (JST)
  3. Department of Informatics, Kyushu University 33, Fukuoka, Japan
