Cluster Computing, Volume 10, Issue 2, pp 187–202

An exact parallel algorithm to compare very long biological sequences in clusters of workstations

  • Azzedine Boukerche
  • Alba Cristina Magalhaes Alves de Melo
  • Edans Flavius de Oliveira Sandes
  • Mauricio Ayala-Rincon

Abstract

Biological sequence comparison is one of the most important operations in computational biology, since it is used to determine how similar two sequences are. Smith and Waterman proposed an exact algorithm (SW), based on dynamic programming, that obtains the best local alignment between two sequences in quadratic time and space.
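As an illustration of the quadratic time and space cost mentioned above, the following is a minimal sketch of the textbook SW recurrence, not the parallel variant proposed in the paper. The function name `smith_waterman` and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for the example; the abstract does not state the scoring scheme used.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned s, aligned t) for one best local alignment.

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    rows, cols = len(s) + 1, len(t) + 1
    # Full similarity matrix: (len(s)+1) x (len(t)+1) cells, hence quadratic space.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill phase: every cell is visited once, hence quadratic time.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment
                          H[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                          H[i - 1][j] + gap,      # gap in t
                          H[i][j - 1] + gap)      # gap in s
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback phase: walk back from the best cell until a zero is reached.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


if __name__ == "__main__":
    print(smith_waterman("GGACGTCC", "TTACGTAA"))
```

The matrix `H` holds one cell per pair of positions, which is why memory becomes the limiting factor for long sequences: two 1.6 MBP inputs would need on the order of 2.56 × 10^12 cells if stored in full.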

SW is rarely used to compare long biological sequences, since the computation time and the amount of memory required become prohibitive. For this reason, heuristic methods like BLAST are widely used. Although faster, these heuristic methods do not guarantee that the best result will be produced.

In this paper, we propose an exact parallel variant of the SW algorithm that obtains the best local alignments in quadratic time and reduced space. The speedups obtained in two clusters (8-machine and 16-machine) for DNA sequences longer than 32 KBP (kilo base-pairs) were very close to linear and, in some cases, superlinear. For very long DNA sequences (1.6 MBP), we were able to reduce execution time from 12.25 hours to 1.54 hours in our 8-machine cluster. As far as we know, this is the first time 1.6 MBP sequences have been compared with an exact SW variant. In this case, 30240 best local alignments were obtained.
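As a worked check of the "very close to linear" claim, the speedup and parallel efficiency for the 1.6 MBP case can be computed from the two reported times. This assumes that 12.25 hours is the single-machine baseline, which the abstract implies but does not state; the symbols $T_1$, $T_8$, $S$ and $E$ are introduced here for illustration.

```latex
S = \frac{T_1}{T_8} = \frac{12.25\ \text{h}}{1.54\ \text{h}} \approx 7.95,
\qquad
E = \frac{S}{p} = \frac{7.95}{8} \approx 0.99
```

Under that assumption, the 8-machine run reaches roughly 99% of ideal linear speedup.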

Keywords

  • Biological sequence comparison
  • Parallel algorithm

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Azzedine Boukerche (1)
  • Alba Cristina Magalhaes Alves de Melo (2)
  • Edans Flavius de Oliveira Sandes (2)
  • Mauricio Ayala-Rincon (2)

  1. SITE, University of Ottawa, Ottawa, Canada
  2. University of Brasilia, Brasilia, Brazil
