Comparing DNA Sequence Collections by Direct Comparison of Compressed Text Indexes

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNBI, volume 7534)

Abstract

Popular sequence alignment tools such as BWA convert a reference genome to an indexing data structure based on the Burrows-Wheeler Transform (BWT), from which matches to individual query sequences can be rapidly determined. However, the utility of also indexing the query sequences themselves remains relatively unexplored.
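To make the idea of matching queries against a BWT-based index concrete, the following is a minimal in-memory sketch, not the implementation used by BWA or BEETL. It builds the BWT of a single text by sorting rotations (quadratic, for illustration only) and counts occurrences of a query by backward search. The function names `bwt` and `count_matches` and the `$` sentinel are assumptions made for this example.

```python
from collections import Counter


def bwt(text, sentinel="$"):
    """Return the BWT of `text` by sorting all rotations (illustration only)."""
    s = text + sentinel  # the sentinel sorts before A, C, G, T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_matches(bwt_str, pattern):
    """Count occurrences of `pattern` in the indexed text by backward search."""
    # C[c] = number of symbols in the text that are strictly smaller than c
    counts = Counter(bwt_str)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    def occ(c, i):
        # Occurrences of c in bwt_str[:i]; a real index would use rank structures
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # half-open interval of matching suffixes
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("ACGTACGT")
    print(count_matches(index, "ACG"))
```

The query is processed right to left, and each step narrows an interval of sorted suffixes; the width of the final interval is the number of matches.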

Here we show that an all-against-all comparison of two sequence collections can be computed from the BWT of each collection with the BWTs held entirely in external memory, i.e. on disk and not in RAM. As an application of this technique, we show that BWTs of transcriptomic and genomic reads can be compared to obtain reference-free predictions of splice junctions that have high overlap with results from more standard reference-based methods.
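The sketch below illustrates, under strong simplifying assumptions, what comparing two collections through their BWTs can look like. It is not the external-memory algorithm of the paper: both indexes are held in RAM, construction is by naive suffix sorting, and the comparison is limited to classifying k-mers as shared or unique to one collection. The names `collection_bwt`, `Index`, and `compare` are hypothetical.

```python
from collections import Counter


def collection_bwt(reads):
    """BWT of a read collection; every read gets a '$' terminator."""
    entries = []
    for idx, read in enumerate(reads):
        s = read + "$"
        for i in range(len(s)):
            # s[i - 1] is the symbol preceding the suffix; it wraps to '$' at i == 0
            entries.append((s[i:], idx, s[i - 1]))
    # Equal suffixes from different reads are ordered by read index
    entries.sort(key=lambda e: (e[0], e[1]))
    return "".join(e[2] for e in entries)


class Index:
    """Naive in-memory BWT index supporting one-symbol backward extension."""

    def __init__(self, reads):
        self.bwt = collection_bwt(reads)
        counts = Counter(self.bwt)
        self.C, total = {}, 0
        for c in sorted(counts):
            self.C[c] = total
            total += counts[c]

    def full(self):
        return (0, len(self.bwt))

    def extend(self, interval, c):
        """Interval of suffixes starting with c + (string matched so far)."""
        lo, hi = interval
        if c not in self.C:
            return (0, 0)
        base = self.C[c]
        return (base + self.bwt[:lo].count(c), base + self.bwt[:hi].count(c))


def compare(reads_a, reads_b, k):
    """Classify k-mers as shared, only in A, or only in B by walking both indexes."""
    a, b = Index(reads_a), Index(reads_b)
    shared, only_a, only_b = set(), set(), set()

    def walk(kmer, ia, ib):
        in_a, in_b = ia[1] > ia[0], ib[1] > ib[0]
        if not (in_a or in_b):
            return  # the string occurs in neither collection
        if len(kmer) == k:
            target = shared if (in_a and in_b) else (only_a if in_a else only_b)
            target.add(kmer)
            return
        for c in "ACGT":
            walk(c + kmer, a.extend(ia, c), b.extend(ib, c))

    walk("", a.full(), b.full())
    return shared, only_a, only_b


if __name__ == "__main__":
    print(compare(["ACGT", "CGTA"], ["ACGA"], 3))
```

Both indexes are extended in lockstep, so a branch is abandoned as soon as the string occurs in neither collection. K-mers found in only one collection are the kind of signal a reference-free comparison of transcriptomic and genomic reads would start from.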

Code to construct and compare the BWT of large genomic data sets is available at http://beetl.github.com/BEETL/ as part of the BEETL library.

Keywords

  • Splice Junction
  • External Memory
  • Junction Site
  • Tasmanian Devil
  • Short Read Alignment

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cox, A.J., Jakobi, T., Rosone, G., Schulz-Trieglaff, O.B. (2012). Comparing DNA Sequence Collections by Direct Comparison of Compressed Text Indexes. In: Raphael, B., Tang, J. (eds) Algorithms in Bioinformatics. WABI 2012. Lecture Notes in Computer Science, vol 7534. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33122-0_17

  • DOI: https://doi.org/10.1007/978-3-642-33122-0_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33121-3

  • Online ISBN: 978-3-642-33122-0

  • eBook Packages: Computer Science, Computer Science (R0)