Abstract
Next-generation sequencing (NGS), also called massively parallel or deep deoxyribonucleic acid (DNA) sequencing, has revolutionized genomic research in recent years. Although the cost of generating NGS data has fallen since the technology emerged, it can still be a limiting factor. Strategies such as pool-seq and low-coverage sequencing have therefore been developed to reduce cost. Whether these cheaper strategies remain effective in NGS studies still needs to be established. We applied a bioinformatics pipeline to pool-seq, low-coverage retinoblastoma data obtained from tumor samples only. Retinoblastoma is a childhood eye malignancy that is initiated by RB1 mutation or MYCN amplification and can lead to loss of vision in one or both eyes and, in some cases, death. We ran the pipeline on the retinoblastoma data and on two additional data sets, both to validate it and to compare its performance. High-confidence variant calls from the Genome in a Bottle Consortium served as the reference for these purposes. Our pipeline called more variants than a standard pipeline on all three data sets, and its recall and F-score values were notably higher. We further report results on the disease data in terms of variants, variant types, and disease-related genes. This study provides a guideline for running an NGS data analysis pipeline on data that combine pool-seq and low-coverage sequencing. To obtain more conclusive results for these two strategies, we recommend using cancer data with higher mutation rates and larger pools.
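The recall and F-score reported above are computed by comparing a call set against a truth set such as the Genome in a Bottle high-confidence calls. The following is a minimal sketch of that arithmetic, not the pipeline used in this study. It assumes variants are matched by exact (chromosome, position, reference allele, alternate allele) identity, which is simpler than the haplotype-aware matching that dedicated comparison tools perform. The function name `benchmark_calls` and the toy variants are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
# Exact-tuple matching is an assumption of this sketch.
Variant = Tuple[str, int, str, str]


def benchmark_calls(called: Iterable[Variant], truth: Iterable[Variant]) -> Dict[str, float]:
    """Compare a call set with a truth set and return precision, recall and F-score."""
    called_set, truth_set = set(called), set(truth)

    tp = len(called_set & truth_set)   # called and present in the truth set
    fp = len(called_set - truth_set)   # called but absent from the truth set
    fn = len(truth_set - called_set)   # in the truth set but not called

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score (F1) is the harmonic mean of precision and recall.
    f_score = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f_score": f_score}


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    truth = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
             ("chr2", 75, "G", "A"), ("chr2", 900, "T", "C")]
    called = [("chr1", 100, "A", "G"), ("chr2", 75, "G", "A"),
              ("chr3", 10, "A", "T")]
    print(benchmark_calls(called, truth))
```

Under this definition, a pipeline that calls more variants raises recall only if the additional calls fall inside the truth set; calls outside it count as false positives and lower precision, which is why the F-score is reported alongside recall.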
Acknowledgments
The numerical calculations reported in this paper were partially performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources).
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Data availability
Disease data were downloaded from the European Nucleotide Archive (ENA, https://www.ebi.ac.uk/ena/data/view/PRJEB6630) under accession number PRJEB6630. NA12878 low-coverage sequencing data were downloaded from the NCBI Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra) under accession number SRR622461. NA20355 low-coverage sequencing data were downloaded from the data portal of the 1000 Genomes Project (https://www.internationalgenome.org/data-portal/sample/NA20355) under accession numbers ERR251661 and ERR251662. Disease test data are available in the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress) under accession number E-MTAB-3515.
About this article
Cite this article
Özdemir Özdoğan, G., Kaya, H. Next-Generation Sequencing Data Analysis on Pool-Seq and Low-Coverage Retinoblastoma Data. Interdiscip Sci Comput Life Sci 12, 302–310 (2020). https://doi.org/10.1007/s12539-020-00374-8
DOI: https://doi.org/10.1007/s12539-020-00374-8