A survey of genome sequence assembly techniques and algorithms using high-performance computing

Ahmed, Munib; Ahmad, Ishfaq; Ahmad, Mohammad Saad

doi:10.1007/s11227-014-1297-4

A survey of genome sequence assembly techniques and algorithms using high-performance computing

Published: 04 October 2014

Volume 71, pages 293–339, (2015)
Cite this article

The Journal of Supercomputing Aims and scope Submit manuscript

Munib Ahmed¹,
Ishfaq Ahmad¹ &
Mohammad Saad Ahmad¹

662 Accesses
2 Citations
4 Altmetric
Explore all metrics

Abstract

Genome assembly has been an area of active research since the DNA structure was discovered and has gathered more steam after the Human Genome project was launched. A large number of genomes have been assembled and many more are in the pipeline. A number of full-scale assemblers and other special-purpose modules have been reported. Since the volume of data involved in the genome assembly process is extraordinarily large and requires significantly large computational power and processing time, many assemblers have utilized parallel computing to achieve faster and more efficient reconstruction of the DNA. A genome assembler is a multi-step process including different components that may be partly or fully parallelized. Although several assemblers and individual modules that perform various tasks, such as pairwise alignment, multiple sequence alignment, and repeat finding, have been analyzed and documented before, this paper provides a holistic view of the assembly process in the realm of parallel and distributed computing, streamlining all the individual tasks related, but not limited to, the whole genome shotgun sequencing into a sequence of loosely coupled stages where one stage consumes the output of the preceding stage and passes its results to the next one. Many of these tasks are essential to the current and next-generation sequence assemblers. The paper walks through the entire streamlined process while describing, analyzing, and commenting on the algorithms and techniques that have been designed and implemented for each of the stages. Where applicable, the paper suggests improvements that may form the basis of potentially new research work.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Time- and memory-efficient genome assembly with Raven

Article 20 May 2021

SAGE: String-overlap Assembly of GEnomes

Article Open access 15 September 2014

Recent Advances in Gene and Genome Assembly: Challenges and Implications

References

Ahmed M, Ahmad I, Khan S (2011) A theoretical analysis of scalability of the parallel genome assembly algorithms. In: Second international conference on bioinformatics models, methods and algorithms. pp 234–237
Ahmed M, Ahmad I, Khan S (2011) A comparative analysis of approaches to parallel genome assembly. J Interdiscipl Sci Comput Life Sci 3:1–7. doi:10.1007/s12539-011-00
Article Google Scholar
Ahmed M, Ahmad M, Ahmad I (2008) A multi-pronged parallel approach to enhance speed and accuracy of sequence assembly. In: Biotechnology and bioinformatics symposium
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
Article Google Scholar
Aluru S, Futamura N, Mehrotra K (2003) Parallel biological sequence comparison using prefix computations. J Parallel Distrib Comput 63(3):264–272
Article MATH Google Scholar
Bao Z, Eddy S (2002) Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res 8:1269–1276
Article Google Scholar
Batzoglou S, Jaffe D, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov J, Lander E (2002) Arachne: a whole-genome shotgun assembler. Genome Res 12(1):177–189
Article Google Scholar
Berger M, Munson P (1991) A novel randomized iterative strategy for aligning multiple protein sequences. CABIOS 7:479–484
Google Scholar
Blackshields G, Wallace I, Larkin M, Higgins D (2006) Analysis and comparison of benchmarks for multiple sequence alignment. In Silico Biol 6:321–339
Google Scholar
Blazewicz J, Figlerowicz M, Jackowiak P, Janny D, Jarczynski D, Kasprzak M, Nalewaj M, Nowierski B, Styszynski R, Szajkowski L, Widera P (2004) Parallel DNA sequence assembly. In: Proceedings of the fifth Mexican international conference in computer science (ENC ’04). IEEE Computer Society, New York, pp 378–382
Brudno M, Batzoglou S (2004) ProbCons: Probabilistic consistency based multiple alignment of amino acid sequences. In: Proceedings of nineteenth national conference on artificial intelligence. pp 703–708
Chao K, Pearson W, Miller W (1992) Aligning two sequences within a specified diagonal band. Comput Appl Biosci 8:481–487
Google Scholar
Cheetham J, Dehne F, Pitre S, Rau-Chaplin A, Taillon P (2003) Parallel CLUSTAL W for PC clusters. In: International conference on computational science and its applications. Lecture notes in computer science, vol 2668. pp 300–309
Darling A, Carey L, Feng W (2003) The design, implementation, and evaluation of mpiBLAST. In: Fourth international conference on Linux clusters: the HPC revolution 2003 in conjunction with The ClusterWorld Conference & Expo
Deng X, Li E, Shan J, Chen W (2006) Parallel implementation and performance characterization of MUSCLE. In: Parallel and distributed processing symposium
Dovichi N, Zhang J (2000) How capillary electrophoresis sequenced the human genome. Angew Chemie Int Edition 39:4463–4468
Article Google Scholar
Du Z, Lin F (2006) pNJTree: a parallel program for reconstruction of neighbor-joining tree and its application in ClustalW. J Parallel Comput 32:5–6
Article Google Scholar
Ebedes J, Datta A (2004) Multiple sequence alignment in parallel on a workstation cluster. Bioinformatics 20(7):1193–1195
Edgar R (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797
Article Google Scholar
Edgar R, Myers E (2005) PILER: identification and classification of genomic repeats. Bioinformatics 1 21(Supplement 1):i152–i158
Essoussi N, Boujenfa K, Limam M (2008) A comparison of MSA tools. Bioinformatics 2:452–455
Ewing B, Hillier L, Wendl M, Green P (1998) Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 8(3):175–185
Article Google Scholar
Felsenfeld A, Peterson J, Schloss J, Guyer M (1999) Assessing the quality of the DNA sequence from the human genome project. Genome Res 9:1–4
Google Scholar
Grama A, Gupta A, Kumar V (1993) Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Parallel Distrib Technol 1(3):12–21
Article Google Scholar
Green P (1996) http://bozeman.mbt.washington.edu/phrap.docs/phrap.html. Accessed 19 Sep 2014
Gordon D, Abajian C, Green P (1998) Consed: a graphical tool for sequence finishing. Genome Res 8:195–202
Article Google Scholar
Gusfield D (1997) Algorithms on strings, trees and sequences. Cambridge University Press, Cambridge, pp 9–10
Higgins D, Sharp P (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73(1):237–44
Article Google Scholar
Higgins D (1994) CLUSTAL V: multiple alignment of dna and protein sequences. Methods Mol Biol 25:307–318
Google Scholar
Hirosawa M, Totoki Y, Hoshida M, Ishikawa M (1995) Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 11:13–18
Google Scholar
Huang X, Wang J, Aluru S, Yang S, Hillier L (2003) PCAP: a whole-genome assembly program. Genome Res 13:2164–2170
Article Google Scholar
Huang X, Madan A (1999) CAP3: a DNA sequence assembly program. Genome Res 9(9):868–877
Article Google Scholar
Isokawa M, Wayama M, Shimizu T (1996) Multiple sequence alignment using a genetic algorithm. Genome Inform 7:176–177
Google Scholar
Jeanmougin F, Thompson J, Gouy M, Higgins D, Gibson T (1998) Multiple sequence alignment with clustal X. Trends Biochem Sci 23:403–405
Article Google Scholar
Johnson D, Metaxas P (1997) Connected components in O(log3/2n) parallel time for the CREW PRAM. J Comput Syst Sci 54(2):227–242
Kalyanaraman A, Kothari S, Brendel V, Aluru S (2003) Efficient clustering of large EST data sets on parallel computers. Nucleic Acids Res 31(11):2963–2964
Article Google Scholar
Kalyanaraman A, Aluru S, Brendel V, Kothari S (2003) Space and time efficient parallel algorithms and software for EST clustering. IEEE Trans Parallel Distrib Syst 14:1209–1221
Article Google Scholar
Karl JA, Wiseman RW, O’Connor DH (2009) Cost-effective sequence-based nonhuman primate MHC class I genotyping from RNA. Methods 49(1):11–17. doi:10.1016/j.ymeth.2009.05.002
Larkin M, Blackshields G, Brown N, Chenna R, McGettigan P, McWilliam H, Valentin F, Wallace A, Wilm R, Lopez R, Thompson J, Gibson T, Higgins D (2007) Clustal W and clustal X version 2.0. Bioinformatics 23:2947–2948
Article Google Scholar
Lee Z, Su S, Chuang C, Liu K (2008) Genetic algorithm with ant colony optimization (GA-ACO) for multiple sequence alignment. Appl Soft Comput 8:55–78
Article Google Scholar
Li K (2003) ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19(12) :1585–1586
Li R, Zhu H, Ruan J, Qian W, Li S, Yang H, Wang J (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 20(2):265–272
Article Google Scholar
Lipman D, Altschul S, Kececioglu D (1989) A tool for multiple sequence alignment. Proc Natl Acad Sci 86:4412–4415
Article Google Scholar
Liu X, Pande P, Meyerhenke H, Bader D (2013) PASQUAL: parallel techniques for next generation genome sequence assembly. IEEE Trans Parallel Distrib Syst 24(5):977–986
Article Google Scholar
Luo J, Ahmad I, Ahmed M (2005) Parallel multiple sequence alignment using dynamic scheduling. In: International conference on information technology: coding and computing, vol 1. pp 8–13
Mardis E (2008) Next-generation DNA sequencing methods. Ann Rev Genomics Hum Genet 9:387–402
Article Google Scholar
Martins W, Cuvillo J, Francisco B, Theobald J, Gao G (2001) A multithreaded parallel implementation of a dynamic programming algorithm for sequence comparison. In: Proceedings of the Pacific symposium on biocomputing. pp 311–332
Miller P, Nadkarni P, Carriero N (1991) Parallel computation and FASTA: confronting the problem of parallel database search for a fast sequence comparison algorithm. Comput Appl Biosci 7(1):71–78
Google Scholar
Mullikin J, Ning Z (2003) The Phusion assembler. Genome Res 1:81–90
Article Google Scholar
Myers E, Sutton G, Smith H, Adams M, Venter J (2002) On the sequencing and assembly of the human genome. Proc Natl Acad Sci 99(7):4145–4146
Article Google Scholar
Needleman S, Wunsch C (1970) A general method applicable to the search for similarities in the amino acid sequence of two sequences. J Mol Biol 48:443–453
Article Google Scholar
Notredame C, Higgins D, Heringa J (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:205–217
Article Google Scholar
Ogden T, Rosenberg M (2006) Multiple sequence alignment accuracy and phylogenetic inference. Syst Biol 56(2):314–328
Article Google Scholar
Pevzner P, Tang H, Waterman S (2001) An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA 98(17):9748–9753
Article MathSciNet MATH Google Scholar
Pevzner P, Tang H, Tesler G (2004) De novo repeat classification and fragment assembly. Genome Res 14(9):1786–1796
Article Google Scholar
Porreca G (2010) Genome sequencing on nanoballs. Nat Biotechnol 28(1):43–44
Article Google Scholar
Prism ABIABI (1996) DNA sequencing analysis software. In: User’s manual, PE Applied Biosystems, Foster City
Ralston A (1982) De Bruijn sequences—a model example of the interaction of discrete mathematics and computer science. Math Magaz 55:131–143
Article MathSciNet MATH Google Scholar
Ronaghi M, Uhlen M, Nyren P (1998) A sequencing method based on real-time pyrophosphate. Science 281(5375):363
Article Google Scholar
Rusk N (2011) Torrents of sequence. Nat Methods 8(1):44–44
Google Scholar
Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol 4:406–425
Google Scholar
Sanger F, Nicklen S, Coulson A (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci 74:5463–7
Article Google Scholar
Simpson J, Wong K, Jackman S, Schein J, Jones S, Birol I (2009) ABySS: a parallel assembler for short read sequence data. Genome Res 19:1117–1123
Article Google Scholar
Shi W, Zhou W (2005) A parallel Euler approach for large-scale biological sequence assembly. In: Proceedings of the third international conference on information technology and applications
Smit A, Hubley R, Green P (1996–2010) RepeatMasker Open-3.0. http://www.repeatmasker.org. Accessed 20 Sep 2014
Smith T, Waterman M (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197
Article Google Scholar
Southern E (1975) Detection of specific sequences among DNA fragments separated by gel electrophoresis. J Mol Biol 98:503–517
Article Google Scholar
Sutton G, White O, Adams M, Kerlavage A (1995) TIGR assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci Technol 1(1):9–19
Article Google Scholar
Thompson J, Plewniak F, Poch O (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 27(13):12682–2690
Article Google Scholar
Valouev A, Ichikawa J, Tonthat T, Stuart J, Ranade S, Peckham H, Zeng K, Malek J, Costa G, McKernan K, Sidow A, Fire A, Johnson S (2008) A high-resolution nucleosome position map of C. Elegans reveals a lack of universal sequence-dictated positioning. Genome Res 18(7):1051–1063
Article Google Scholar
Venter J, Adams M, Myers E (2001) The sequence of the human genome. Science 16(291):1304–1351
Article Google Scholar
Volfovsky N, Haas B, Salzberg S (2001) A clustering method for repeat analysis in DNA sequences. Genome Biol 2(8)
Watson J, Crick F (1953) Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171:737–738
Article Google Scholar
Yap T, Munson P, Frieder O, Martino R (1995) Parallel multiple sequence alignment using speculative computation. In: Proceedings of the international conference on parallel processing
Yap T, Frieder O, Martino R (1998) Parallel computation in biological sequence analysis. IEEE Trans Parallel Distrib Syst 9(3) :283–294
Zhang C, Wong A (1997) A genetic algorithm for multiple molecular sequence alignment. Comput Appl Biosci 13(6):565–581
Google Scholar
Zhao F, Li T, Bryant D (2008) A new pheromone trail-based genetic algorithm for comparative genome assembly. Nucleic Acids Res 36(10):3455–3462
Article Google Scholar
Zola J, Yang X, Rospondek S, Aluru S (2007) Parallel T-Coffee: a parallel multiple sequence aligner. In: Proceedings of international society for computers and their applications, parallel and distributed computing systems. pp 248–253

Download references

Author information

Authors and Affiliations

University of Texas at Arlington, Arlington, USA
Munib Ahmed, Ishfaq Ahmad & Mohammad Saad Ahmad

Authors

Munib Ahmed
View author publications
You can also search for this author in PubMed Google Scholar
Ishfaq Ahmad
View author publications
You can also search for this author in PubMed Google Scholar
Mohammad Saad Ahmad
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Munib Ahmed.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ahmed, M., Ahmad, I. & Ahmad, M.S. A survey of genome sequence assembly techniques and algorithms using high-performance computing. J Supercomput 71, 293–339 (2015). https://doi.org/10.1007/s11227-014-1297-4

Download citation

Published: 04 October 2014
Issue Date: January 2015
DOI: https://doi.org/10.1007/s11227-014-1297-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A survey of genome sequence assembly techniques and algorithms using high-performance computing

Abstract

Access this article

Similar content being viewed by others

Time- and memory-efficient genome assembly with Raven

SAGE: String-overlap Assembly of GEnomes

Recent Advances in Gene and Genome Assembly: Challenges and Implications

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A survey of genome sequence assembly techniques and algorithms using high-performance computing

Abstract

Access this article

Similar content being viewed by others

Time- and memory-efficient genome assembly with Raven

SAGE: String-overlap Assembly of GEnomes

Recent Advances in Gene and Genome Assembly: Challenges and Implications

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation