Identification of Distinguishing Motifs

  • WangSen Feng
  • Zhanyong Wang
  • Lusheng Wang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4580)


Motivation: Motif identification for sequences has many important applications in biological studies, e.g., diagnostic probe design, locating binding sites and regulatory signals, and potential drug target identification. There are two versions of the problem.

  1. Single Group: Given a group of n sequences, find a length-l motif that appears in each of the given sequences such that its occurrences are similar to one another.

  2. Two Groups: Given two groups of sequences B and G, find a length-l (distinguishing) motif that appears in every sequence in B and does not appear anywhere in the sequences in G.


Here, the occurrences of the motif in the given sequences have errors. Currently, most existing programs can handle only the single-group case. Moreover, it is very difficult to use edit distance (allowing indels and replacements) for motif detection.
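To make the two-group definition concrete, the following is a minimal sketch of how one might test whether a given candidate motif is distinguishing under edit distance. It is not the algorithm proposed in the paper. The function names and the thresholds d_b and d_g are assumptions introduced for illustration: a motif is taken to "appear" in a sequence if some substring lies within edit distance d_b of it, and to "not appear" if every substring is at distance at least d_g.

```python
from typing import Sequence


def min_substring_edit_distance(motif: str, seq: str) -> int:
    """Smallest edit distance between motif and any substring of seq.

    Standard approximate-matching DP: an occurrence may start at any
    position of seq for free, and indels and replacements each cost 1.
    """
    # Row for the empty motif prefix: it matches anywhere at cost 0.
    prev = [0] * (len(seq) + 1)
    for i, a in enumerate(motif, start=1):
        # Motif prefix of length i against an empty substring costs i.
        curr = [i]
        for j, b in enumerate(seq, start=1):
            curr.append(min(
                prev[j - 1] + (a != b),  # match or replacement
                prev[j] + 1,             # motif character left unmatched
                curr[j - 1] + 1,         # extra character in seq
            ))
        prev = curr
    # The occurrence may end at any position of seq.
    return min(prev)


def is_distinguishing(motif: str, B: Sequence[str], G: Sequence[str],
                      d_b: int, d_g: int) -> bool:
    """Check the two-group condition for one candidate motif.

    d_b and d_g are hypothetical thresholds, not values from the paper.
    """
    in_every_b = all(min_substring_edit_distance(motif, s) <= d_b for s in B)
    in_no_g = all(min_substring_edit_distance(motif, s) >= d_g for s in G)
    return in_every_b and in_no_g


if __name__ == "__main__":
    B = ["TTACGTTT", "GGACTTGG"]  # toy sequences that should contain the motif
    G = ["TTTTTTTT"]              # toy sequence that should not
    print(is_distinguishing("ACGT", B, G, d_b=1, d_g=3))
```

Checking one candidate this way takes time proportional to the motif length times the total length of the sequences. The hard part of the problem is searching the space of candidate motifs, which this sketch does not address.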

Results: (1) We propose a randomized algorithm for the single-group problem that can handle indels in the occurrences of the motif. (2) We give an algorithm for the two-group problem. (3) We evaluate the algorithms with extensive simulations.


Keywords: motif detection, EM algorithms, two groups





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • WangSen Feng (1)
  • Zhanyong Wang (2)
  • Lusheng Wang (2)

  1. Department of Computer Science, Peking University, People’s Republic of China
  2. Department of Computer Science, City University of Hong Kong, Hong Kong
