Abstract
Sufficiently long genome strings, permitting adequate overlaps, are key to producing a quality genome assembly with minimal error rates and high coverage. Next Generation Sequencing (NGS) platforms produce large volumes (terabytes) of short raw genomic strings, or reads (150–600 bases each), with minimal error rates. If the raw short reads can be lengthened computationally before assembly, the full potential of NGS short reads and de novo assembly can be realized. The large data redundancy offered by billions of such raw reads, compounded by a target genome length of billions of bases, requires a complex big data engineering solution. This paper presents a co-designed algorithm-architecture model for ReneGENE de novo assembly (part of the larger ReneGENE-GI Genome Informatics pipeline). The module takes short reads from NGS platforms, presented in random order, and extends them iteratively to an appropriate length by identifying overlaps among them, aiding high-coverage assembly with minimal error rates. The task is distributed across multiple processes, allowing parallel read assembly with scalable performance. Supported by parallel algorithms, multi-dimensional data structures and fine-grain synchronization, the module realises irregular computing for de novo assembly. A single-FPGA realization of this model with 128 de novo compute elements shows a 48.69x improvement in performance over an 8-core implementation on a standard workstation based on Intel Core i7-4770 processors.
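The iterative extension step described in the abstract can be illustrated with a small sketch. This is not the ReneGENE-Novo algorithm, whose details the abstract does not give; it is a minimal greedy extension that assumes error-free reads, exact suffix-prefix overlaps, and extension to the right only. The function names `suffix_prefix_overlap` and `extend_read` and the parameters `min_overlap` and `target_length` are hypothetical, introduced purely for illustration.

```python
def suffix_prefix_overlap(left, right, min_overlap=1):
    """Return the length of the longest suffix of `left` that equals a
    prefix of `right`, or 0 if no overlap reaches `min_overlap`.
    Assumes min_overlap >= 1 and exact (error-free) matching."""
    max_len = min(len(left), len(right))
    for k in range(max_len, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def extend_read(seed, pool, min_overlap=3, target_length=None):
    """Greedily extend `seed` to the right with reads drawn from `pool`.

    In each iteration, the unused read with the longest overlap is chosen
    and its non-overlapping tail is appended. The loop stops when no read
    overlaps by at least `min_overlap`, or when `target_length` is reached.
    """
    extended = seed
    unused = list(pool)  # copy so the caller's list is left untouched
    while unused and (target_length is None or len(extended) < target_length):
        best_k, best_i = 0, -1
        for i, read in enumerate(unused):
            k = suffix_prefix_overlap(extended, read, min_overlap)
            # Require k < len(read) so that the read adds at least one base.
            if best_k < k < len(read):
                best_k, best_i = k, i
        if best_i < 0:
            break  # no remaining read overlaps sufficiently
        extended += unused.pop(best_i)[best_k:]
    return extended


if __name__ == "__main__":
    reads = ["GGATTC", "TACGGA"]  # presented in arbitrary order
    print(extend_read("ACGTAC", reads, min_overlap=3))
```

Each seed can be extended independently of the others, which is one reason a task of this shape lends itself to distribution across many compute elements; how the paper partitions and synchronizes that work is not stated in the abstract.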
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Natarajan, S., KrishnaKumar, N., Anuchan, H.V., Pal, D., Nandy, S.K. (2018). ReneGENE-Novo: Co-designed Algorithm-Architecture for Accelerated Preprocessing and Assembly of Genomic Short Reads. In: Voros, N., Huebner, M., Keramidas, G., Goehringer, D., Antonopoulos, C., Diniz, P. (eds) Applied Reconfigurable Computing. Architectures, Tools, and Applications. ARC 2018. Lecture Notes in Computer Science, vol 10824. Springer, Cham. https://doi.org/10.1007/978-3-319-78890-6_45
DOI: https://doi.org/10.1007/978-3-319-78890-6_45
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-78889-0
Online ISBN: 978-3-319-78890-6
eBook Packages: Computer Science, Computer Science (R0)