ATGC transcriptomics: a web-based application to integrate, explore and analyze de novo transcriptomic data
In recent years, applications based on massively parallel RNA sequencing (RNA-seq) have become valuable approaches for studying non-model species, e.g., those without a fully sequenced genome. RNA-seq is a useful tool for detecting novel transcripts and genetic variations and for evaluating differential gene expression by digital measurements. The large and complex datasets resulting from functional genomic experiments represent a challenge in data processing, management, and analysis. This problem is especially significant for small research groups working with non-model species.
We developed a web-based application, called ATGC transcriptomics, with a flexible and adaptable interface that allows users to work with next-generation sequencing (NGS) transcriptomic analysis results using an ontology-driven database. This new application simplifies data exploration, visualization, and integration for a better comprehension of the results.
ATGC transcriptomics gives non-expert computer users and small research groups access to a scalable storage option and simple data integration, including database administration and management. The software is freely available under the terms of GNU public license at http://atgcinta.sourceforge.net.
Keywords: De novo transcriptomics; Data integration; Ontology storage; Web application
Abbreviations
API: Application program interface
BLAST: Basic local alignment search tool
CSS: Cascading style sheet
GFF: General feature format
GMOD: Generic model organism database consortium
HTML: HyperText markup language
HTTP: Hypertext transfer protocol
NCBI: National Center for Biotechnology Information
NGS: Next generation sequencing
OBO: Open biomedical ontologies
RAM: Random access memory
RPKM: Reads per kilobase per million mapped reads
SQL: Structured query language
VCF: Variant call format
XML: Extensible markup language
Functional genomic analysis of species lacking a reference genome involves the integration of heterogeneous data sources, including transcriptome assembly and annotation, along with gene expression and genetic variants. Handling the results of NGS data analysis requires a user-friendly tool to manage, visualize and analyze a large amount of results in a comprehensive and integrated manner. In the last few years, many applications for conducting functional genomic studies have appeared [1, 2]. A typical de novo RNA-seq pipeline involves: (i) sequence quality control, (ii) transcript assembly, (iii) functional and structural annotation of those transcripts, (iv) discovery of molecular markers (microsatellites (SSRs) and single nucleotide polymorphisms (SNPs)), and (v) differential expression of transcripts between the conditions assayed (see Conesa et al., 2016 for a detailed review). All these steps produce large, complex, structured results which need to be handled properly. Most of the applications performing these steps in a de novo RNA-seq pipeline have greatly aided small research groups but, usually, these same groups lack data management capabilities and know-how.
ATGC transcriptomics relies on the ability to store data in an orderly manner, by using an ontology-based and modular database schema such as Chado. Our web application allows a broad interpretation of the produced data and the formulation of data-driven hypotheses to test biological questions. Thus, ontology-driven databases provide a flexible and adaptable schema capable of extracting complex relationships arising from the data structure. Chado is a relational database schema designed by the Generic Model Organism Database Consortium (GMOD) to handle complex representations of biological knowledge. The GMOD introduces the concept of computational ontology as applied to the management of the information structure in a way that accounts for the tacit knowledge of ontology. Computational ontologies, such as the Gene Ontology project and Sequence Ontology, not only allow users to sort data by introducing a structure with terms and definitions, but also provide a common language to store and share information. The Chado schema, which was originally designed by FlyBase, has been widely used to develop databases of specific organisms [7, 8]. Several software packages import or export data using the Chado database schema: GBrowse, Maker, Apollo and Ergatis.
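To make the idea of ontology-driven storage concrete, the following minimal sketch types each stored feature by a controlled-vocabulary term instead of a hard-coded category. It uses an in-memory SQLite database and a heavily simplified subset of the Chado tables cv, cvterm and feature; the real schema runs on PostgreSQL and carries many more columns and constraints. The helper names and the feature identifiers are illustrative assumptions, not part of ATGC transcriptomics.

```python
import sqlite3

# Simplified subset of Chado (assumption): the real tables have more columns.
SCHEMA = """
CREATE TABLE cv (cv_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    cv_id INTEGER NOT NULL REFERENCES cv(cv_id),
    name TEXT NOT NULL
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
"""


def build_demo_db():
    """Create the demo tables and load a few hypothetical features."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO cv (cv_id, name) VALUES (1, 'sequence')")
    conn.executemany(
        "INSERT INTO cvterm (cvterm_id, cv_id, name) VALUES (?, 1, ?)",
        [(1, "gene"), (2, "mRNA")],
    )
    conn.executemany(
        "INSERT INTO feature (uniquename, type_id) VALUES (?, ?)",
        [("isogroup00001", 1), ("isotig00001", 2), ("isotig00002", 2)],
    )
    return conn


def features_of_type(conn, term_name):
    """Return the names of all features whose type is the given ontology term."""
    rows = conn.execute(
        "SELECT f.uniquename FROM feature f "
        "JOIN cvterm t ON t.cvterm_id = f.type_id "
        "WHERE t.name = ? ORDER BY f.uniquename",
        (term_name,),
    )
    return [row[0] for row in rows]
```

Because the feature type is a foreign key into the vocabulary table, supporting a new kind of feature only needs a new term row, not a schema change.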
However, the administration, management and analysis of large databases lacking a graphical interface may be complicated and non-intuitive for non-informatics users. In this context, web applications emerged as the best choice due to users' familiarity with this kind of tool, their simple installation and shared access to databases. Recently, Tripal has appeared as a solution to develop web front-ends for Chado specifically designed to manage genomic and genetic information. Tripal is a toolkit for the construction of biological, scientific research-oriented websites that is based on the Drupal content management system. This toolkit provides an Application Program Interface (API) that makes the application more flexible, thus allowing complete customization of data. This option, however, requires informatics expertise, for example in a de novo transcriptomic assay (discussed below).
Here we present ATGC transcriptomics, a web-based application that arose from the need for genomic data management in small research groups. In parallel, during the development process, we used this new web-based application to integrate and analyze data from different projects that involve diverse data sources [15, 16, 17, 18].
ATGC transcriptomics is a free and open-source application, available online, designed to visualize, explore, analyze and share de novo transcriptomic data generated by NGS platforms, using the Chado database schema to store data. The ATGC application allows users to integrate heterogeneous datasets such as structural annotation, genotype information, experimental designs, gene expression, functional annotation and other computational analyses, into a single database with a user-friendly interface. ATGC transcriptomics supports storing data in several modules and implements different ontologies for classification and further analysis of data through ontology-based searches, graphics, and detailed tables.
The ATGC application was built using the Chado database schema implemented in PostgreSQL, a web interface based on Web2py, and an in-house Python module called pychado, which was created as an interface between the web application and the database.
The Chado schema is partitioned into modules, thus creating a scalable model with the addition of specific modules that provide support for additional data types including those from novel technologies, without modifying the actual schema. The database schema consists of five core modules, which are named Sequence, General, Publication, Audit and Controlled vocabularies (Ontologies). Moreover, ATGC transcriptomics incorporates the following modules: Organism, Companalysis, Phenotype, Strain, Genetics, Stock, and Expression, to store data from a wide variety of biological experiments and research fields, such as comparative sequence analysis, gene expression studies, genetics, taxonomy, biological collections and phenotypic diversity.
For the database interface, we implemented the Model-View-Controller (MVC) pattern that Web2py provides to access the modular schema, separating the data representation (model), data presentation (view) and application logic (controller). We used an internal Web2py database to store configuration data related to the user interface and the Chado database administration. The application appearance can be easily changed or customized, since all web pages inherit styles from an application-wide cascading style sheet (CSS) and the overall page layout is controlled by a single HTML layout file that gives global control over the views. The ATGC web interface is designed to handle a variety of application tasks. Database navigation relies on a set of drop-down lists including Home, Data loading, Search, Ontology annotation, Download, Software, Modify and Delete, and Setup (Fig. 2a). The lists are configurable by the authorized user in a straightforward way using the Web2py administration interface.
The pychado module connects the application to the Chado database using the psycopg Python package. Pychado contains a structure of data classes and methods to insert, modify or delete data entries and to retrieve information adapted to the Chado modules and tables. Furthermore, taking into account the ontology-driven storage, we modeled ontology terms and structures so that the ontology graph can be traversed to save data and to enable searches and queries.
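The ontology graph traversal mentioned above can be illustrated with a short sketch. It is not the pychado implementation, whose internals are not described here. It only shows, under stated assumptions, how a search for one term can be expanded to all of its descendants through is_a edges. The term identifiers and the function names are hypothetical.

```python
from collections import deque

# Hypothetical child -> parents ("is_a") edges; not real ontology identifiers.
IS_A = {
    "TERM:2": ["TERM:1"],
    "TERM:3": ["TERM:1"],
    "TERM:4": ["TERM:2", "TERM:3"],
    "TERM:5": ["TERM:4"],
}


def descendants(term, is_a=IS_A):
    """Return the term plus every term that reaches it through is_a edges."""
    # Invert the child -> parents map into parent -> children.
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    # Breadth-first walk down the graph, visiting each term once.
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def features_for_term(term, annotations, is_a=IS_A):
    """Return features annotated with the term or with any of its descendants."""
    wanted = descendants(term, is_a)
    return sorted(f for f, terms in annotations.items() if wanted & set(terms))
```

Expanding the query term first means that a feature annotated only with a very specific term is still found when the user searches with a broader one.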
ATGC transcriptomics permits the storage of sequence-related data, including structural annotation (GFF format), sequence alignments obtained by BLAST, functional annotation results from Blast2GO and InterProScan, and classification of other database features using any other ontology that can be loaded (tabular file format). This application allows users to load data from genotypes or strains, genetic variants, molecular markers, and sequence relationships (e.g., assembly of gene isoforms). Moreover, it is also possible to load and analyze diverse experimental approaches. For example, users can load differential gene expression studies, including the assay structure with replicates and the experiment characteristics specified using ontological classification.
For small-scale projects, a typical desktop computer can be used to install and run ATGC transcriptomics. The application was tested and works correctly in a virtual machine with 1 CPU and 1 GB of RAM, and requires an average of 500 MB of RAM together with the Linux operating system. The complete installation from source code requires a Unix operating system (e.g., Linux) with a minimum set of software packages and a single administration user with basic UNIX skills (see manual). Additionally, a pre-installed self-contained virtual machine is available. Users of any operating system (e.g., Windows) may download the complete virtual machine to bypass the installation steps. Detailed instructions for different types of installation are provided in Additional files 1 and 2.
Database administration and management
We developed an interface for database creation, administration, update, backup, restore and ontology loading. We generated a template for the database schema in SQL format, with all the modules to be used, including small modifications to load RNA-Seq data (not fully supported in Chado). The interface allows users to load ontologies using files in OBO and XML format and also to load data with relationships between ontologies and several classification systems, in a section called dbxref2ontology (e.g., Interpro2go). ATGC transcriptomics allows users to have several databases on the same application instance, with the option to switch between them and maintain the original environment settings automatically. Moreover, it is possible to keep the information updated and curated in the database by using the modify and delete section. These processes take into account data types and the database schema. Furthermore, the backup and restore components allow users to create a complete database backup in SQL format. Thus, the user can restore the database to a previous state when required.
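As an illustration of what loading an ontology from an OBO file involves, the sketch below reads the [Term] stanzas of an OBO document and keeps each term's identifier, name and is_a parents. It is a minimal reader written for this example, not the loader used by the application; it ignores all other tags and stanza types, and the identifiers in the usage example are hypothetical.

```python
def parse_obo_terms(text):
    """Parse the [Term] stanzas of an OBO document into {id: {"name", "is_a"}}."""
    terms = {}
    current = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            in_term = line == "[Term]"
            current = None
            continue
        if not in_term or not line or line.startswith("!"):
            continue
        tag, _, value = line.partition(":")
        value = value.split("!", 1)[0].strip()  # drop a trailing "! comment"
        if tag == "id":
            current = {"name": None, "is_a": []}
            terms[value] = current
        elif current is not None and tag == "name":
            current["name"] = value
        elif current is not None and tag == "is_a":
            current["is_a"].append(value)
    return terms


SAMPLE_OBO = """format-version: 1.2

[Term]
id: EX:0000001
name: root term

[Term]
id: EX:0000002
name: child term
is_a: EX:0000001 ! root term

[Typedef]
id: part_of
name: part of
"""

if __name__ == "__main__":
    for term_id, term in parse_obo_terms(SAMPLE_OBO).items():
        print(term_id, term["name"], term["is_a"])
```

The resulting dictionary maps directly onto the kind of term and relationship records that an ontology-driven schema stores.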
Load data and results of software analysis
The application is capable of loading data to the Chado database using standard file formats, adding entries either one at a time or with a bulk loader. ATGC implements loading methods for different types of sources and software results, such as FASTA format for sequences, GFF for structural annotations, XML for BLAST results, ANNOT or RAW for Blast2GO or InterProScan functional annotations, and VCF for molecular markers and genetic variants. Moreover, several outputs from specific software (e.g., misaSSR or ssahaSNP) and simple file formats (generally CSV or TAB) are implemented to load additional information (e.g., relationships between features). The CSV or TAB files can be created from flat text files using a web interface, such as Galaxy, or by exporting spreadsheets directly. From the Chado stock and genotype modules, we included the concept of genotype (line), whose characteristics are associated with and defined by ontology terms. As a consequence, the alleles of different types of molecular markers can be loaded for each genotype. Finally, using the Chado modules expression, stock, and library, we created an accessible way to load the complete structure of RNA-Seq experiments. The load includes the characteristics of the conditions defined by the ontology terms and the experimental design with replicates. The user can load different measured values to a sequence in a replicate-specific manner from a particular experimental condition, including genotype identity and transcript expression values measured as reads per kilobase per million mapped reads (RPKM), among others. Detailed information on all steps of data loading and database administration is provided in Additional file 2.
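To show what loading a structural annotation entails, the following sketch parses GFF3 feature lines (nine tab-separated columns, with key=value attributes in the last column) and extracts the child-to-parent links that a loader would store as feature relationships. This is a simplified reader for illustration only, not the application's loader. It does not decode percent-escaped characters, and the feature identifiers in the usage example are hypothetical.

```python
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments or blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated key=value pairs.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    record["attributes"] = attributes
    return record


def parent_links(records):
    """Return (child_id, parent_id) pairs from the ID and Parent attributes."""
    links = []
    for record in records:
        attrs = record["attributes"]
        if "ID" in attrs and "Parent" in attrs:
            # A feature may list several comma-separated parents.
            for parent in attrs["Parent"].split(","):
                links.append((attrs["ID"], parent))
    return links


if __name__ == "__main__":
    example = "isotig00001\texample\tCDS\t10\t300\t.\t+\t0\tID=cds1;Parent=mrna1"
    print(parent_links([parse_gff3_line(example)]))
```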
Searches and query
Regarding data sharing, users can share information with colleagues (external system users) from a simple machine or server, by serving the application over the network with the Apache HTTP server, or directly with the web server included in the Web2py framework. Users can adapt the drop-down menu to show the chosen options and hide the parts of the menu that allow database modification. Optionally, with small modifications, users can employ the Web2py access control mechanism to manage the permissions of external users to perform restricted operations (e.g., restricting access to database modifications while keeping the options in the drop-down menu). Furthermore, we created several points for data access, either inside the download section (general files, such as the complete functional annotation in ANNOT format) or as particular downloads in other places (e.g., downloading a FASTA sequence from the sequence report page, or spreadsheets from search result tables).
Results and discussion
Tools capability comparison (table; only the row labels are preserved here): Complete software installation; Access to personalize whole application; Access control for external (non-administrative) users; Out of the box ready querying of ontology terms (GO, SO) in database; Expression data integrated in schema (ready-to-use); Functional Annotation Loading: Blast2Go (annot files), InterProScan (raw or xml files), Kegg Pathways (KAAS output files).
De novo RNA-seq case study
We will test the ATGC transcriptomics application and will show different ways to answer several biological questions arising from RNA-Seq data. For this purpose, we will use a training dataset from the transcriptome sequencing and differential expression assay of Diachasmimorpha longicaudata. Three conditions are analyzed (male larvae, adult females and adult males). The main objective of this project was to identify new transcripts involved in the sex determination mechanism and to develop new molecular markers for those transcripts to study them in natural populations. The data consist of reads obtained from 454 sequencing, assembled into “isotigs” (transcript contigs) with the Newbler assembler. Moreover, specifically in this project, we ran a battery of applications on our original dataset to obtain the following: (1) structural annotation of all transcripts, using Transdecoder; (2) functional annotation, using the Blast2Go suite and the InterProScan software; (3) SSR discovery, using the misaSSR software; (4) SNP discovery, using the ssahaSNP tool; and (5) transcript expression: the relative expression level of a given transcript was calculated by normalizing its read count by the length of that transcript and dividing by the number of sequence reads in the library.
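The expression measure described in step (5) can be written out as a small function. This sketch follows the standard RPKM definition (reads per kilobase of transcript per million mapped reads). The scaling constant of 10^9 comes from that definition and is an assumption here, since the text only states that counts were normalized by transcript length and library size. The function name and the example numbers are illustrative.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    The factor 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling.
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # 100 reads on a 2000 bp transcript in a library of one million mapped reads.
    print(rpkm(100, 2000, 1_000_000))
```

Dividing by both length and library size makes values comparable between long and short transcripts and between libraries sequenced to different depths.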
First, we create a new database and load the ontologies the user specifies (GO and SO are mandatory). In this case, they are the GO, the SO, and the hymenoptera anatomy ontology (HAO). We load the data using the web interface as explained in the tutorial. We create an organism and then we load the contigs (“isotigs” in Newbler transcriptomic nomenclature) using a FASTA file and the genes (“isogroups” in Newbler nomenclature) using the section “Load Features from list file”. We proceed in this way because we do not have sequences for these “isogroups” (a type of gene locus: the assembler provides an inferred relationship between “isotigs”). Next, we load the structural annotation using “Load features and relationships from gff3 file”, the functional annotation using “Feature -> CV associations”, and the Blast results used for the annotation with the section “Blast Run Results (XML Files)”. Molecular markers in VCF format (SNPs and SSRs) are loaded using “Load markers”. To load the differential expression data, we first create the structure of experiments and libraries. Using the section “Create Experiment”, we load a set of three experiments called “male_larvae”, “male_adult”, and “female_adult”, with one RNA-Seq library for each. To assign the features that describe the experiments (for example phenological state and sex), we use the hymenoptera anatomy ontology. Finally, we load the expression data as read pseudocount values for each contig using the section “Feature -> Library associations”.
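The first loading steps, reading contigs from a FASTA file and relating them to their gene loci, can be sketched as follows. The reader itself follows the FASTA format. The gene=... token in the header lines of the usage example is an assumption made for illustration and may not match the exact header layout produced by the assembler. In the workflow above, the isogroups are in fact loaded from a separate list file.

```python
def read_fasta(lines):
    """Yield (identifier, description, sequence) from an iterable of FASTA lines."""
    name, description, chunks = None, "", []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, description, "".join(chunks)
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, description, "".join(chunks)


def group_by_tag(records, tag):
    """Group identifiers by the value of a key=value token in the description."""
    groups = {}
    for name, description, _sequence in records:
        for token in description.split():
            key, sep, value = token.partition("=")
            if sep and key == tag:
                groups.setdefault(value, []).append(name)
    return groups


if __name__ == "__main__":
    # Hypothetical header layout; adjust the tag name to the real files.
    sample = [
        ">isotig00001 gene=isogroup00001 length=8",
        "ACGT",
        "ACGT",
        ">isotig00002 gene=isogroup00001",
        "GGCC",
    ]
    print(group_by_tag(list(read_fasta(sample)), "gene"))
```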
After all information has been saved properly, we can begin to query the database. The first question raised in this project was which contigs have a functional annotation related to “GO:0003006: developmental process involved in reproduction” as well as associated microsatellite markers, and what the expression of these transcripts was among the three experimental conditions. To answer this, the user goes to the section “Search -> Features by name” and obtains a result table with all microsatellites stored in the database, using only ‘%’ as the search expression in the microsatellite pull-down menu (Additional file 3). Afterwards, the simplest way to continue the analysis is to use the GO exploration at the bottom of the page, which lets the user navigate all GO terms associated with the list created above. Hence, the user can reach any GO term using the full capacity of the controlled vocabulary. On reaching the GO term “developmental process involved in reproduction”, the user can click on “Feature List (new window)” and explore the transcripts that have that GO term and at least one associated SSR (Additional file 4). From a total of 982 transcripts with at least one SSR, we find that 8 transcripts are associated with GO:0003006. In the detailed view of each transcript, the user can examine the expression pattern of that transcript in all conditions. Exploring the “Feature report” of the transcripts, we found one “isogroup” with three “isotigs” which are highly expressed in the larval stage and not in adults (male or female). This putative “isogroup” (gene locus) is probably one of the targets for further investigation.
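The logic of this query, keeping transcripts that carry at least one SSR and a given GO term and then inspecting their expression per condition, can be expressed as a short sketch over in-memory dictionaries. It does not reproduce how the application queries the database. The data structures, function names and example values are hypothetical; only the GO identifier and the condition names are taken from the case study.

```python
def transcripts_with_marker_and_term(markers, annotations, term):
    """Return transcripts that have at least one marker and the given ontology term."""
    with_marker = {t for t, found in markers.items() if found}
    with_term = {t for t, terms in annotations.items() if term in terms}
    return sorted(with_marker & with_term)


def expression_profile(expression, transcript, conditions):
    """Return {condition: value} for one transcript, using 0.0 when no value is stored."""
    return {c: expression.get((transcript, c), 0.0) for c in conditions}


if __name__ == "__main__":
    # Hypothetical example data.
    markers = {"isotig00001": ["ssr_1"], "isotig00002": [], "isotig00003": ["ssr_2"]}
    annotations = {
        "isotig00001": {"GO:0003006"},
        "isotig00002": {"GO:0003006"},
        "isotig00003": {"GO:0008150"},
    }
    expression = {("isotig00001", "male_larvae"): 12.5}
    conditions = ["male_larvae", "male_adult", "female_adult"]
    for hit in transcripts_with_marker_and_term(markers, annotations, "GO:0003006"):
        print(hit, expression_profile(expression, hit, conditions))
```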
In brief, this application provides non-expert computer users with accessible biological data management and simple data integration, as it can be used without any prior knowledge of programming, database administration, and/or management. ATGC transcriptomics presents a more comprehensive interface for ontological storage and browsing as a method to explore information and relationships between transcripts and other features of interest, such as the concept of alternative splicing, and relationships between transcripts and genes (or genomic loci) with functional annotation. ATGC offers the opportunity to expand the database schema by adding other modules that store information from different data sources (e.g., microarrays, phylogeny, and genetic maps). Moreover, the web interface can easily be improved via Python modules that can be automatically copied and distributed with the application. Overall, ATGC is a user-friendly application which allows small research groups to handle their de novo transcriptome data as a whole.
Availability and requirements
Project name: ATGC transcriptomics.
Project home page: http://atgcinta.sourceforge.net.
Project demo site: http://atgc-sur.inta.gob.ar.
Operating system(s): Platform independent, source code installation or using complete virtual machine.
Programming language(s): Python.
Software requirements: Postgresql, Ncbi-blast+, Emboss, Python, Python packages: Matplotlib, Pygraphviz, Biopython, Psycopg2, Perl, Perl packages: Bioperl, Libgo, Libpg, Libdata-stag, Libdbix-dbstag, Libsql-translator and Apache2 (Optional).
License: ATGC transcriptomics is freely available under the terms of the GNU public license.
We would like to acknowledge Julia Sabio Y Garcia for the manuscript correction.
This work was supported by the Instituto Nacional de Tecnología Agropecuaria (AEBio-245001, 245732; PNBIO-1131041, 1131043); Agencia Española de Cooperación Internacional y Desarrollo (D/031348/10; A1/041041/11); Consejo Nacional de Investigaciones Científicas y Técnicas; BBSRC, Institute Strategic Programme Grant BB/J004669/1 and Marie Curie IRSES Project DEANN (PIRSES-GA-2013-612583).
SG and BC designed and wrote the software, set up the implementation and wrote the user guide. SG, MR and NP drafted the manuscript. MR, PM, PF and NP collaborated on the software design. JD and NP conceived and financed the project. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
5. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–9.
8. Jung S, Lee T, Ficklin S, Yu J, Cheng C, Main D. Chado use case: storing genomic, genetic and breeding data of Rosaceae and Gossypium crops in Chado. Database. 2016;2016:baw010.
9. Donlin MJ. Using the Generic Genome Browser (GBrowse). In: Current Protocols in Bioinformatics. Hoboken: John Wiley & Sons, Inc; 2009. p. 1–25 (December).
11. Lewis SE, Searle SMJ, Harris N, Gibson M, Lyer V, Richter J, Wiel C, Bayraktaroglu L, Birney E, Crosby MA, Kaminker JS, Matthews BB, Prochnik SE, Smithy CD, Tupy JL, Rubin GM, Misra S, Mungall CJ, Clamp ME. Apollo: a sequence annotation editor. Genome Biol. 2002;3:RESEARCH0082.
12. Orvis J, Crabtree J, Galens K, Gussman A, Inman JM, Lee E, Nampally S, Riley D, Sundaram JP, Felix V, Whitty B, Mahurkar A, Wortman J, White O, Angiuoli SV. Ergatis: a web interface and scalable software system for bioinformatics workflows. Bioinformatics. 2010;26:1488–92.
13. Ficklin SP, Sanderson L-A, Cheng C-H, Staton ME, Lee T, Cho I-H, Jung S, Bett KE, Main D. Tripal: a construction toolkit for online genome databases. Database. 2011;2011:bar044.
14. Drupal content management system. http://www.drupal.org
15. Helguera M, Rivarola M, Clavijo B, Martis MM, Vanzetti LS, González S, Garbus I, Leroy P, Šimková H, Valárik M, Caccamo M, Doležel J, Mayer KFX, Feuillet C, Tranquilli G, Paniego N, Echenique V. New insights into the wheat chromosome 4D structure and virtual gene order, revealed by survey pyrosequencing. Plant Sci. 2015;233:200–12.
16. Torales SL, Rivarola M, Pomponio MF, Gonzalez S, Acuña CV, Fernández P, Lauenstein DL, Verga AR, Hopp HE, Paniego NB, Poltri SNM. De novo assembly and characterization of leaf transcriptome for the development of functional molecular markers of the extremophile multipurpose tree species Prosopis alba. BMC Genomics. 2013;14:705.
17. Fernandez P, Soria M, Blesa D, DiRienzo J, Moschen S, Rivarola M, Clavijo BJ, Gonzalez S, Peluffo L, Príncipi D, Dosio G, Aguirrezabal L, García-García F, Conesa A, Hopp E, Dopazo J, Heinz RA, Paniego N. Development, characterization and experimental validation of a cultivated sunflower (Helianthus annuus L.) gene expression oligonucleotide microarray. PLoS One. 2012;7:e45899.
19. PostgreSQL. http://www.postgresql.org
20. Web2py. http://www.web2py.com
21. Psycopg python package. http://initd.org/psycopg/
24. LUbuntu. http://lubuntu.net/
28. Apache HTTP server. http://apache.org/
29. Droc G, Lariviere D, Guignon V, Yahiaoui N, This D, Garsmeur O, Dereeper A, Hamelin C, Argout X, Dufayard J-F, Lengelle J, Baurens F-C, Cenci A, Pitollat B, D’Hont A, Ruiz M, Rouard M, Bocs S. The Banana Genome Hub. Database. 2013;2013:bat035.
33. Turnkey. http://genome.ucla.edu/turnkey/
34. Chado on rails framework. http://gmod.org/wiki/Chado_on_Rails
36. Grails. http://grails.org
37. Sanderson L, Ficklin SP, Cheng C, Jung S, Feltus FA, Bett KE, Main D. Tripal v1.1: a standards-based toolkit for construction of online genetic and genomic databases. Database. 2013;2013:bat075.
39. Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M, Macmanes MD, Ott M, Orvis J, Pochet N, Strozzi F, Weeks N, Westerman R, William T, Dewey CN, Henschel R, Leduc RD, Friedman N, Regev A. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013;8:1494–512.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.