Abstract
The manipulation and analysis of genomic data stored in publicly accessible repositories have become routine tasks in genomics and bioinformatics laboratories. Rapid advances in genome sequencing and the emergence of many sequencing projects have driven bioinformaticians to create a variety of programs and pipelines that automatically analyze such large datasets, in particular gene annotation pipelines. Simple, easy-to-use programs for handling annotation files are important, particularly for non-developers, because they accelerate genomic data analysis. One of the first tasks when working with genomic annotation files is extracting different features. To this end, we developed GAD (https://github.com/bio-projects/GAD), a fast, easy-to-use, and configurable Python script for handling annotation files such as GFF3 and GTF. GAD is a cross-platform tool with a graphical interface that extracts genome features such as intergenic regions and upstream and downstream genes. In addition, GAD finds all ambiguous Sequence Ontology term names and either extracts them or treats them as genes or transcripts. Results are produced in a variety of file formats supported by other bioinformatics programs, such as BED, GTF, GFF3, and FASTA. GAD can handle large genomes of different sizes and any number of files with minimal user effort. Our script could therefore be integrated into various pipelines in genomic laboratories to accelerate data analysis.
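To make the kind of feature extraction described above concrete, the following is a minimal Python sketch of one such task: deriving intergenic regions from the gene features of a GFF3 file and reporting them as BED intervals. It is not GAD's implementation. The function names `parse_gff3_genes` and `intergenic_bed` are hypothetical, strand is ignored, overlapping genes are merged, and regions before the first gene or after the last gene on a sequence are not reported because the sequence length is not known from the gene lines alone.

```python
from collections import defaultdict


def parse_gff3_genes(lines):
    """Collect (start, end) gene intervals per sequence from GFF3 lines.

    Hypothetical helper for illustration; only rows whose type column
    is exactly "gene" are kept.
    """
    genes = defaultdict(list)
    for line in lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # A GFF3 feature line has nine tab-separated columns.
        if len(cols) != 9 or cols[2] != "gene":
            continue
        genes[cols[0]].append((int(cols[3]), int(cols[4])))
    return genes


def intergenic_bed(genes):
    """Return BED rows (seqid, start, end) for gaps between genes.

    GFF3 coordinates are 1-based and inclusive; BED coordinates are
    0-based and half-open. A gap spanning GFF3 positions
    prev_end + 1 .. start - 1 becomes BED (prev_end, start - 1).
    """
    rows = []
    for seqid in sorted(genes):
        prev_end = None
        for start, end in sorted(genes[seqid]):
            if prev_end is not None and start > prev_end + 1:
                rows.append((seqid, prev_end, start - 1))
            # Track the furthest end seen so overlapping genes merge.
            prev_end = end if prev_end is None else max(prev_end, end)
    return rows


if __name__ == "__main__":
    example = [
        "##gff-version 3",
        "chr1\tsrc\tgene\t100\t200\t.\t+\t.\tID=g1",
        "chr1\tsrc\tmRNA\t100\t200\t.\t+\t.\tID=t1;Parent=g1",
        "chr1\tsrc\tgene\t150\t300\t.\t-\t.\tID=g2",
        "chr1\tsrc\tgene\t500\t600\t.\t+\t.\tID=g3",
    ]
    for seqid, start, end in intergenic_bed(parse_gff3_genes(example)):
        print(f"{seqid}\t{start}\t{end}")
```

In this example the two overlapping genes are merged into 100..300, so the only intergenic region is GFF3 positions 301..499, which is written as the BED interval `chr1 300 499`.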

Acknowledgements
We thank the authors' families for their continued support. Special thanks go to Assistant Professor Dr. Ahmed Ismail for his help and support.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
About this article
Cite this article
Yasser, N., Karam, A. GAD: A Python Script for Dividing Genome Annotation Files into Feature-Based Files. Interdiscip Sci Comput Life Sci 12, 377–381 (2020). https://doi.org/10.1007/s12539-020-00378-4
DOI: https://doi.org/10.1007/s12539-020-00378-4
Keywords
- Genome annotation
- Extraction
- Features
- GFF3
- GTF
- BED