Journal of Combinatorial Optimization, Volume 22, Issue 4, pp 778–796

The multiple sequence sets: problem and heuristic algorithms

Abstract

“Sequence set” is a mathematical model used in many applications, such as biological sequence analysis and text processing. However, the “single” sequence set model is not appropriate for rapidly increasing problem sizes. For example, very large genome sequences should be split and processed chunk by chunk. For these applications, the underlying mathematical model is “Multiple Sequence Sets” (MSS). To process multiple sequence sets, sequences are distributed to different sets, and the sequences in each set are then processed in parallel. Deriving effective algorithms for MSS processing is challenging.

In this paper, we first define the cost functions for the problem of Process of Multiple Sequence Sets (PMSS). The PMSS problem is then formulated as minimizing the total cost of processing. Based on an analysis of the features of multiple sequence sets, we propose the Distribution and Deposition (DDA) algorithm and the DDA* algorithm for the PMSS problem. In the DDA algorithm, the sequences are first distributed to multiple sets according to their alphabet contents; the sequences in each set are then processed by a deposition algorithm. The DDA* algorithm differs from DDA in that it distributes sequences by clustering based on a set of sequence features. Experiments showed that DDA and DDA* always produced smaller results than the other algorithms, and that DDA* outperformed DDA in most instances. Both algorithms were also efficient in time and space.
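The two-phase structure described above (distribute, then process each set) can be illustrated with a minimal sketch. The abstract does not give the cost functions or the deposition algorithm, so everything concrete here is an assumption: sequences are grouped by their most frequent symbol as a stand-in for "alphabet contents", each set is processed with the well-known majority-merge heuristic for a common supersequence (swapped in for the deposition step), and the total cost is taken to be the sum of the supersequence lengths. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def distribute_by_alphabet(sequences, num_sets):
    """Assumed distribution step: group by most frequent symbol,
    then fold the groups round-robin into at most num_sets sets."""
    groups = defaultdict(list)
    for seq in sequences:
        counts = Counter(seq)
        # Ties go to the smallest symbol so the result is deterministic.
        dominant = min(counts, key=lambda ch: (-counts[ch], ch), default="")
        groups[dominant].append(seq)
    sets = [[] for _ in range(num_sets)]
    for i, key in enumerate(sorted(groups)):
        sets[i % num_sets].extend(groups[key])
    return [s for s in sets if s]


def majority_merge(sequences):
    """Stand-in for the deposition step: repeatedly emit the symbol that is
    the next unread symbol of the largest number of sequences."""
    positions = [0] * len(sequences)
    out = []
    while True:
        fronts = Counter(
            seq[pos] for seq, pos in zip(sequences, positions) if pos < len(seq)
        )
        if not fronts:
            break  # every sequence has been fully consumed
        symbol = min(fronts, key=lambda ch: (-fronts[ch], ch))
        out.append(symbol)
        # Advance every sequence whose next unread symbol was just emitted.
        positions = [
            pos + 1 if pos < len(seq) and seq[pos] == symbol else pos
            for seq, pos in zip(sequences, positions)
        ]
    return "".join(out)


def is_supersequence(sup, seq):
    """True if seq can be obtained from sup by deleting symbols."""
    it = iter(sup)
    return all(ch in it for ch in seq)


def process_sets(sequences, num_sets):
    """Distribute, process each set independently, and sum the assumed costs."""
    sets = distribute_by_alphabet(sequences, num_sets)
    supers = [majority_merge(s) for s in sets]
    total_cost = sum(len(s) for s in supers)
    return sets, supers, total_cost


if __name__ == "__main__":
    sets, supers, total = process_sets(["ACA", "AAC", "GGT", "GTG"], num_sets=2)
    for members, sup in zip(sets, supers):
        print(members, "->", sup)
    print("total cost:", total)
```

Because each set is processed independently, the per-set calls could run in parallel; the sketch keeps them sequential for clarity. A feature-based clustering step in the spirit of DDA* would replace only `distribute_by_alphabet`.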

Keywords

Multiple sequence sets; Distribution and deposition; Shortest common supersequence; Sequence features; Performance ratio

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Department of Pathology, University of Michigan, Ann Arbor, USA
  2. Department of Computer Science, National University of Singapore, Singapore, Singapore
