A Natural Encoding of Genetic Variation in a Burrows-Wheeler Transform to Enable Mapping and Genome Inference

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9838)

Abstract

We show how positional markers can be used to encode genetic variation within a Burrows-Wheeler Transform (BWT), and use this to construct a generalisation of the traditional “reference genome” that incorporates known variation within a species. Our goal is to support the inference of the closest mosaic of previously known sequences to the genome(s) under analysis. Our scheme results in an increased alphabet size; by using a wavelet tree encoding of the BWT, we reduce the performance impact on rank operations. We give a specialised form of the backward search that allows variation-aware exact matching. We implement this, and demonstrate that constructing an index of the whole human genome with 8 million genetic variants requires 25 GB of RAM. We also show that inferring a closer reference can close large kilobase-scale coverage gaps in P. falciparum.
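
For orientation, the following is a minimal sketch of the standard (not variation-aware) backward search over a BWT, which is the operation the abstract says is specialised. It is not the authors' implementation: the function names `bwt_index` and `backward_search` are hypothetical, the suffix array is built naively, and a plain occurrence table stands in for the wavelet tree that would answer rank queries over the enlarged alphabet.

```python
from collections import Counter

def bwt_index(text):
    """Build suffix array, BWT, C table and occurrence table for text + '$'."""
    text += "$"  # sentinel, assumed to sort before every other symbol
    # Naive suffix sorting; adequate for a toy example only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    # C[c] = number of symbols in the text strictly smaller than c.
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i] = occurrences of c in bwt[:i], i.e. rank(c, i).
    # A wavelet tree would answer this query in compressed space.
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, bwt, C, occ

def backward_search(pattern, sa, C, occ):
    """Return sorted text positions where pattern occurs exactly."""
    lo, hi = 0, len(sa)  # half-open suffix array interval
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

sa, bwt, C, occ = bwt_index("ACGTACGA")
print(backward_search("ACG", sa, C, occ))
```

Each pattern symbol costs two rank queries, which is why the cost of rank over a larger alphabet matters for the scheme described above.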

Acknowledgments

We would like to thank Jacob Almagro-Garcia, Phelim Bradley, Rayan Chikhi, Simon Gog, Lin Huang, Jerome Kelleher, Heng Li, Gerton Lunter, Rachel Norris, Victoria Popic, and Jouni Siren for discussions and help. We thank the SDSL developers for providing a valuable resource.

Author information

Corresponding author

Correspondence to Zamin Iqbal.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Maciuca, S., del Ojo Elias, C., McVean, G., Iqbal, Z. (2016). A Natural Encoding of Genetic Variation in a Burrows-Wheeler Transform to Enable Mapping and Genome Inference. In: Frith, M., Storm Pedersen, C. (eds) Algorithms in Bioinformatics. WABI 2016. Lecture Notes in Computer Science, vol 9838. Springer, Cham. https://doi.org/10.1007/978-3-319-43681-4_18

  • DOI: https://doi.org/10.1007/978-3-319-43681-4_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-43680-7

  • Online ISBN: 978-3-319-43681-4

  • eBook Packages: Computer Science, Computer Science (R0)