Bacterial Genomic Data Analysis in the Next-Generation Sequencing Era

  • Protocol
Data Mining Techniques for the Life Sciences

Part of the book series: Methods in Molecular Biology (MIMB, volume 1415)

Abstract

Bacterial genome sequencing is now an affordable option for many laboratories, with applications in research, diagnostic, and clinical microbiology. A large and growing number of tools is available for genomic data analysis. However, these tools differ in algorithms, programming languages, hardware requirements, and user interfaces, and combining them, as sequence data interpretation requires, often demands (bio)informatics skills that are difficult to find in many laboratories. In addition, multiple data sources, exceedingly large datasets, and increasing computational complexity further challenge the accessibility, reproducibility, and transparency of the entire process. In this chapter we cover the main bioinformatics steps required for a complete bacterial genome analysis using next-generation sequencing data, from raw sequence data to assembled and annotated genomes. All the tools described are available in the Orione framework (http://orione.crs4.it), which uniquely and transparently combines the most widely used open source bioinformatics tools for microbiology, allowing microbiologists without specific hardware or informatics skills to conduct data-intensive computational analyses, from quality control to microbial gene annotation.

Acknowledgments

This work was partially supported by the Sardinian Regional Authorities.

Author information

Corresponding author

Correspondence to Giorgio Fotia.

Copyright information

© 2016 Springer Science+Business Media New York

About this protocol

Cite this protocol

Orsini, M., Cuccuru, G., Uva, P., Fotia, G. (2016). Bacterial Genomic Data Analysis in the Next-Generation Sequencing Era. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 1415. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3572-7_21

  • DOI: https://doi.org/10.1007/978-1-4939-3572-7_21

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-3570-3

  • Online ISBN: 978-1-4939-3572-7

  • eBook Packages: Springer Protocols
