Abstract
We show how positional markers can be used to encode genetic variation within a Burrows-Wheeler Transform (BWT), and use this to construct a generalisation of the traditional “reference genome” that incorporates known variation within a species. Our goal is to support inference of the closest mosaic of previously known sequences to the genome(s) under analysis. Our scheme increases the alphabet size; by using a wavelet tree encoding of the BWT, we reduce the performance impact on rank operations. We give a specialised form of the backward search that allows variation-aware exact matching. We implement this and demonstrate that constructing an index of the whole human genome with 8 million genetic variants requires 25 GB of RAM. We also show that inferring a closer reference can close large, kilobase-scale coverage gaps in P. falciparum.
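For readers unfamiliar with the backward search mentioned above, the following is a minimal sketch of the standard (variation-unaware) form over a plain linear text. It is not the specialised, variation-aware search described in the paper and does not encode positional markers. The function names are hypothetical, the suffix array is built naively, and a precomputed occurrence table stands in for the rank queries that a wavelet tree would answer in a real index.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, C table and occurrence table.

    Assumes "$" does not occur in text and sorts before every other symbol.
    """
    text = text + "$"
    # Naive suffix array construction; adequate only for small examples.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the symbol preceding each sorted suffix (wraps around to "$").
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of symbols in text that are strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i] (a rank query).
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open interval over the suffix array
    for ch in reversed(pattern):
        if ch not in C:
            return []
        # Narrow the interval to suffixes that are preceded by ch.
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_fm_index("GATTACATTAG")
print(backward_search("TTA", *index))
```

Each step consumes one pattern symbol from right to left and uses two rank queries, which is why a larger alphabet makes the cost of rank the main performance concern.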
Acknowledgments
We would like to thank Jacob Almagro-Garcia, Phelim Bradley, Rayan Chikhi, Simon Gog, Lin Huang, Jerome Kelleher, Heng Li, Gerton Lunter, Rachel Norris, Victoria Popic, and Jouni Siren for discussions and help. We thank the SDSL developers for providing a valuable resource.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Maciuca, S., del Ojo Elias, C., McVean, G., Iqbal, Z. (2016). A Natural Encoding of Genetic Variation in a Burrows-Wheeler Transform to Enable Mapping and Genome Inference. In: Frith, M., Storm Pedersen, C. (eds) Algorithms in Bioinformatics. WABI 2016. Lecture Notes in Computer Science, vol 9838. Springer, Cham. https://doi.org/10.1007/978-3-319-43681-4_18
DOI: https://doi.org/10.1007/978-3-319-43681-4_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43680-7
Online ISBN: 978-3-319-43681-4
eBook Packages: Computer Science, Computer Science (R0)