Journal of Combinatorial Optimization, Volume 3, Issue 2–3, pp 247–275

An Approximation Algorithm for Alignment of Multiple Sequences using Motif Discovery

  • Laxmi Parida
  • Aris Floratos
  • Isidore Rigoutsos

Abstract

Given a set of N sequences, the Multiple Sequence Alignment problem is to align these N sequences, possibly with gaps, so as to bring out the best commonality among them. The quality of an alignment is usually measured by penalizing mismatches and gaps and rewarding matches, using appropriate weight functions. However, for larger values of N, additional constraints are required to give meaningful alignments. We identify a user-controlled parameter, an alignment number K (2 ≤ K ≤ N): this additional requirement constrains the alignment to have at least K sequences agree on a character, whenever possible. We identify a natural optimization problem for this approach, called the K-MSA problem, and show that it is MAX SNP hard. We give a natural extension of this problem that incorporates “biological relevance” by using motifs (common patterns in the sequences) and give an approximation algorithm for this problem in terms of the motifs in the data. MUSCA is an implementation of this approach; our experimental results indicate that it is efficient, particularly on large numbers of long sequences, and gives good alignments when tested on biological data such as DNA and protein sequences.
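To make the alignment number K concrete, the following is a minimal sketch that checks, column by column, whether at least K rows of a given gapped alignment agree on a character, and that scores the alignment with match rewards and mismatch and gap penalties. It is an illustration only, not the K-MSA algorithm or MUSCA: the function names, the `-` gap symbol, the unit weights, and the sum-of-pairs scoring scheme are all assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations

GAP = "-"  # assumed gap symbol; the abstract does not specify one


def column_agreement(alignment, col):
    """Largest number of rows sharing the same non-gap character in a column."""
    counts = Counter(row[col] for row in alignment if row[col] != GAP)
    return max(counts.values(), default=0)


def columns_meeting_k(alignment, k):
    """Indices of columns in which at least k rows agree on a character."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    return [c for c in range(width) if column_agreement(alignment, c) >= k]


def sum_of_pairs_score(alignment, match=1, mismatch=-1, gap=-1):
    """Score an alignment by summing a pairwise weight over every column.

    Sum-of-pairs with unit weights is an assumed scoring scheme, used here
    only to illustrate rewarding matches and penalizing mismatches and gaps.
    """
    total = 0
    for col in range(len(alignment[0])):
        for a, b in combinations((row[col] for row in alignment), 2):
            if a == GAP and b == GAP:
                continue  # two gaps facing each other contribute nothing
            if a == GAP or b == GAP:
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


# Hypothetical toy alignment of N = 4 sequences.
rows = ["AC-GT", "ACTGT", "A-TGA", "GCTG-"]
print(columns_meeting_k(rows, 3))
print(sum_of_pairs_score(rows))
```

Raising K in this sketch shrinks the set of columns that satisfy the agreement requirement, which is the sense in which K acts as a user-controlled constraint on the alignment.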

multiple sequence alignment; alignment number; protein sequences; motif discovery; MAX SNP hard; approximate algorithm; set covering problem

Copyright information

© Kluwer Academic Publishers 1999

Authors and Affiliations

  • Laxmi Parida (1)
  • Aris Floratos (1)
  • Isidore Rigoutsos (1)
  1. Computational Biology Center, IBM Thomas J. Watson Research Center, Yorktown Heights, USA
