Skip to main content

Nucleotide Sequence Alignment and Compression via Shortest Unique Substring

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 9044))

Abstract

Aligning short reads produced by high throughput sequencing equipments onto a reference genome is the fundamental step of sequence analysis. Since the sequencing machinery generates massive volumes of data, it is becoming more and more vital to keep those data compressed also. In this study we present the initial results of an on-going research project, which aims to combine the alignment and compression of short reads with a novel preprocessing technique based on shortest unique substring identifiers. We observe that clustering the short reads according to the set of unique identifiers they include provide us an opportunity to combine compression and alignment. Thus, we propose an alternative path in high-throughput sequence analysis pipeline, where instead of applying an immediate whole alignment, a preprocessing that clusters the reads according to the set of shortest unique substring identifiers extracted from the reference genome is to be performed first. We also present an analysis of the short unique substrings identifiers on the human reference genome and examine how labeling each short read with those identifiers helps in alignment and compression.

This work has been supported by the Scientific & Technological Research Council of Turkey (TÜBİTAK), BİDEB–2221 Fellowship Program, and also with the TÜBİTAK-ARDEB-1005 grant number 114E293.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Alkan, C., Kidd, J.M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari, F., Kitzman, J.O., Baker, C., Malig, M., Mutlu, O., et al.: Personalized copy number and segmental duplication maps using next-generation sequencing. Nature Genetics 41(10), 1061–1067 (2009)

    Article  Google Scholar 

  2. Berger, B., Peng, J., Singh, M.: Computational solutions for omics data. Nature Reviews Genetics 14(5), 333–346 (2013)

    Article  Google Scholar 

  3. Bonfield, J.K., Mahoney, M.V.: Compression of fastq and sam format sequencing data. PloS One 8(3), e59190 (2013)

    Google Scholar 

  4. Cox, A.J., Bauer, M.J., Jakobi, T., Rosone, G.: Large-scale compression of genomic sequence databases with the burrows–wheeler transform. Bioinformatics 28(11), 1415–1419 (2012)

    Article  Google Scholar 

  5. Deorowicz, S., Grabowski, S.: Compression of dna sequence reads in fastq format. Bioinformatics 27(6), 860–862 (2011)

    Article  Google Scholar 

  6. Deorowicz, S., Grabowski, S.: Data compression for sequencing data. Algorithms for Molecular Biology 8(1), 25 (2013)

    Article  Google Scholar 

  7. Fonseca, N.A., Rung, J., Brazma, A., Marioni, J.C.: Tools for mapping high-throughput sequencing data. Bioinformatics 28(24), 3169–3177 (2012)

    Article  Google Scholar 

  8. Hsi-Yang, F.M., Leinonen, R., Cochrane, G., Birney, E.: Efficient storage of high throughput dna sequencing data using reference-based compression. Genome Research 21(5), 734–740 (2011)

    Article  Google Scholar 

  9. Giancarlo, R., Rombo, S.E., Utro, F.: Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. Briefings in Bioinformatics, bbt088 (2013)

    Google Scholar 

  10. Hach, F., Hormozdiari, F., Alkan, C., Hormozdiari, F., Birol, I., Eichler, E.E., Sahinalp, S.C.: mrsfast: A cache-oblivious algorithm for short-read mapping. Nature Methods 7(8), 576–577 (2010)

    Article  Google Scholar 

  11. Hach, F., Numanagić, I., Alkan, C., Sahinalp, S.C.: Scalce: Boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28(23), 3051–3057 (2012)

    Article  Google Scholar 

  12. Hach, F., Sarrafi, I., Hormozdiari, F., Alkan, C., Eichler, E.E., Sahinalp, S.C.: mrsfast-ultra: a compact, snp-aware mapper for high performance sequencing applications. Nucleic Acids Research, gku370 (2014)

    Google Scholar 

  13. İleri, A.M., Külekci, M.O., Xu, B.: Shortest unique substring query revisited. In: Kulikov, A.S., Kuznetsov, S.O., Pevzner, P. (eds.) CPM 2014. LNCS, vol. 8486, pp. 172–181. Springer, Heidelberg (2014)

    Chapter  Google Scholar 

  14. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with bowtie 2. Nature Methods 9(4), 357–359 (2012)

    Article  Google Scholar 

  15. Li, H., Durbin, R.: Fast and accurate long-read alignment with burrows–wheeler transform. Bioinformatics 26(5), 589–595 (2010)

    Article  Google Scholar 

  16. Loh, P.-R., Baym, M., Berger, B.: Compressive genomics. Nature Biotechnology 30(7), 627–630 (2012)

    Article  Google Scholar 

  17. Pei, J., Wu, W.C.-H., Yeh, M.-Y.: On shortest unique substring queries. In: 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 937–948. IEEE (2013)

    Google Scholar 

  18. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. Journal of Molecular Biology 147(1), 195–197 (1981)

    Article  Google Scholar 

  19. Tsuruta, K., Inenaga, S., Bannai, H., Takeda, M.: Shortest Unique Substrings Queries in Optimal Time. In: Geffert, V., Preneel, B., Rovan, B., Štuller, J., Tjoa, A.M. (eds.) SOFSEM 2014. LNCS, vol. 8327, pp. 503–513. Springer, Heidelberg (2014)

    Chapter  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Adaş, B., Bayraktar, E., Faro, S., Moustafa, I.E., Külekci, M.O. (2015). Nucleotide Sequence Alignment and Compression via Shortest Unique Substring. In: Ortuño, F., Rojas, I. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2015. Lecture Notes in Computer Science(), vol 9044. Springer, Cham. https://doi.org/10.1007/978-3-319-16480-9_36

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-16480-9_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-16479-3

  • Online ISBN: 978-3-319-16480-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics