Key words

1 Introduction

Ascidians, also known as tunicates or sea squirts, are marine invertebrate filter-feeding animals. They belong to the Tunicata subphylum , an extant sister clade of the vertebrate clade in the chordate phylum. The close evolutionary relationship is most evident during embryo development where ascidians share vertebrate morphological features such as a notochord and a neural tube. Broadly, ascidians can be classified as solitary or colonial and whether they are pelagic or sessile. Botrylloid colonial ascidians have become an important model organism to study whole-body regeneration, chordate evolution, immunobiology, and allorecognition [1,2,3,4,5,6,7]. Fueled by recent advances in next-generation sequencing technology, RNA-sequencing has become a standard tool to characterize entire transcriptomes of cells, tissues, and whole organisms. This protocol uses publicly available RNA-seq data from a regeneration time course experiment in Botrylloides leachii [1]. Additionally, with the availability of a sequenced genome [8], a wider range of software tools (for sequenced organisms) are now also available to analyze this RNA-seq data from B. leachii.

Despite the wide use of RNA-seq, data analysis workflows are equally numerous and not yet standardized [9]. RNA-seq can provide a snapshot of all transcripts present in a cell or tissue of interest. Often though, the research questions are centered around quantifying changes in gene expression across time points or treatment conditions. To accurately compare different samples to each other important post-processing steps of RNA-seq data must be done before meaningful conclusions can be drawn. The focus of this chapter is to provide the user with a workflow to produce a table of differentially expressed genes (DEGs) from an RNA-seq experiment. The dataset used in this protocol has been previously published by our lab and consists of eight samples, each consisting of pooled RNA from multiple regenerating fragments isolated during a regeneration time-course [1]. Although no replicates were included in this earlier study, limiting differential expression analysis of lowly expressed genes, this protocol includes a “simulated” workflow for DESeq2 [10] to obtain a list of differentially expressed genes across stages of regeneration. We also provide an overview of library preparation strategies and overall experimental design.

Figure 1 presents a general workflow for species with a sequenced and annotated genome.

Fig. 1
figure 1

Overview over the protocol . The first step in a typical RNA-seq workflow is removal of low quality and adapter sequences (step 1). Reads are mapped to the genome either using STAR where it is optional to include a transcriptome reference annotation (step 2). Star maps, assembles and quantifies transcripts simultaneously producing count tables (step 3). Count tables are then used as input for differential gene expression analysis with DESeq2 which produces tables of differentially expressed (DE) genes (step 4). Gene ontology analysis of DE genes is performed with GOATOOLS

In principle, this protocol can also be applied to RNA-seq datasets generated from other organisms. Although a genome reference is needed when following the protocol described here, we also suggest software to quantify gene expression without such a reference.

2 Materials

One of the challenges in conducting RNA-seq analysis or any other bioinformatic task is the installation and maintenance of various software packages and programs which can represent a major hurdle for users new to this field. Commands written in shell will be indicated by the “$” prefix, commands in R will be proceeded by “>” and outputs by “#”.

  1. 1.

    RNA-seq data (see Notes 14).

  2. 2.

    Reference genome (see Notes 5 and 6).

  3. 3.

    Genome annotations and gene ontology information (see Notes 68).

  4. 4.

    A medium to high-performance computer (see Note 9).

  5. 5.

    A terminal access to that computer (see Note 10).

  6. 6.

    A scientific software management environment: Bioconda (see Notes 11 and 12).

  7. 7.

    An RNA sequence mapper: STAR (see Note 12).

  8. 8.

    Differential expression software: DESeq2 (see Note 12).

  9. 9.

    Command-line access to the sequence read archive (SRA): SRA-Tools (see Note 13).

  10. 10.

    Gene Ontology (GO) analysis toolkit: GOATOOLS (see Note 14).

  11. 11.

    R statistical computing environment (see Note 15).

3 Methods

This protocol will be exemplified using our paired RNA-seq data of regeneration in B. leachii [1]. This dataset consists of eight samples which are listed in Table 1 and can be retrieved from the SRA archive SRP064769 (see Note 16).

Table 1 RNA-seq sample list from SRA archive SRP064769

3.1 Quality Control and Trimming

First, before mapping can be done, quality and adapter trimming should be performed.

  1. 1.

    Open a terminal on the computer.

  2. 2.

    Change directory to the folder containing the RNA-seq data to be analyzed:

    $ cd /Users/yourusername/projectfolder/RNAseqdata

  3. 3.

    Quality control (QC) paired-end data by running the following command:

    $ trim_galore --paired --fastqc embr_1.fq embr_2.fq

  4. 4.

    QC single-ended data run this command instead:

    $ trim_galore --fastqc embr.fq

  5. 5.

    Open the newly created embr.fastqc file using a web browser.

  6. 6.

    If there are persistent QC issues with the sequencing data additional filtering steps might be necessary (see Note 17).

  7. 7.

    Otherwise, proceed to mapping the reads to the reference genome.

3.2 Mapping

Before we can align the reads to the reference genome, we have to build a genome index using STAR [11]. Then samples will be mapped and gene expression values will be calculated and expressed as a raw gene count number.

  1. 1.

    In terminal create a folder for the reference genome files (see Note 18).

  2. 2.

    Create the genome index (see Note 19):

    $ STAR --runMode genomeGenerate --runThreadN 4 --genomeDir genome_Boleac --genomeFastaFiles Boleac_SBv3_genome.fasta --sjdbGTFfile Boleac_transcripts.gtf

  3. 3.

    Align the reads (see Notes 20 and 21):

    $ STAR --runMode alignReads --genomeDir genome --outFileNamePrefix BL_regeneration/reg_0 --sjdbGTFfile Boleac_transcripts_v5.gtf --quantMode GeneCounts --runThreadN 4 --outSAMtype BAM SortedByCoordinate --readFilesIn reg_0_1_val_1.fq reg_0_2_val_2.fq

  4. 4.

    Repeat for other samples (see Note 22).

3.3 Normalizing Count Data with DESeq2

  1. 1.

    Start R software environment [12].

  2. 2.

    Set the working directory to the path where your STAR output files are contained.

    > setwd(“path to your local folder”)

  3. 3.

    Load DESeq2 by typing:

    > library(DESeq2)

  4. 4.

    Concatenate count files by typing (see Notes 23 and 24):

    > ff <- list.files( path = "./", pattern = "*ReadsPerGene.out.tab$", full.names = TRUE ) > counts.files <- lapply( ff, read.table, skip = 4 ) > counts <- as.data.frame(sapply( counts.files, function(x) x[ , 4 ])) > ff <- gsub( "[.]ReadsPerGene[.]out[.]tab", "", ff ) > ff <- gsub( "[.]/counts/", "", ff ) > colnames(counts) <- c("embr","wcol","reg_0","reg_1","reg_2","reg_3","reg_4","reg_5") > row.names(counts) <- counts.files[[1]]$V1

  5. 5.

    DESeq2 requires a sample information sheet (see Note 25) that can be created using a spreadsheet software such as MS Excel and imported into R by typing:

    coldata <- read.csv(file="coldata.csv", row.names=1)

  6. 6.

    Generate a dds object which normalizes all count data across samples to library size by typing:

    > dds <- DESeqDataSetFromMatrix(countData = counts,colData = coldata,design = ~ stage) > dds <- DESeq(dds)

  7. 7.

    Create a sample distance matrix type (see Note 26):

    > res <- results(dds) > vsd <- vst(dds, blind=FALSE) > sampleDists <- dist(t(assay(vsd))) > data <- plotPCA(vsd, returnData=TRUE) > percentVar <- round(100 * attr(data, "percentVar"))

  8. 8.

    Load ggplot2 by typing:

    > library(ggplot2)

  9. 9.

    Plot the principal component analysis (PCA) of your data (Fig. 2) with:

    > p<-ggplot(data, aes(PC1, PC2)) + geom_text(label=rownames(data), nudge_x=0.25, nudge_y=0.25,check_overlap=T) + xlab(paste0("PC1: ",percentVar[1],"% variance")) + ylab(paste0("PC2: ",percentVar[2],"% variance")) + theme_light() > p

Fig. 2
figure 2

PCA plot created from the top 500 variable genes among samples

3.4 Differential Gene Expression Analysis Using DESeq2

To extract differentially expressed genes (DEGs) between conditions (stages 1 and 2 and 0, arbitrarily chosen), we define the contrasts we are interested in with the results() function from DESeq2 (see Note 27).

  1. 1.

    To get differentially expressed genes between stages 1 and 0, we type:

    > res1_0 <- results(dds, contrast=c("stage","1","0"))

  2. 2.

    We then using the subset function to generate result tables of differentially up- and down-regulated at an adjusted p-value also described as false discovery rate (FDR) of 5%.

    > res_1_0_sig<-subset(res1_0, padj < 0.05) > res_1_0_sig_up<-subset(res_1_0_sig, log2FoldChange > 0.5) > res_1_0_sig_up_order<-res_1_0_sig_up[order(res_1_0_sig_up$padj),] > write.csv(res_1_0_sig_up_order, file=“res_1_0_sig_up_order.csv”) > res_1_0_sig_down<-subset(res_1_0_sig, log2FoldChange < 0.5) > res_1_0_sig_down_order<-res_1_0_sig_down[order(res_1_0_sig_down$padj),] > write.csv(res_1_0_sig_down_order, file=“res_1_0_sig_down_order.csv”)

  3. 3.

    To view the results table (Table 2):

    > head(res_1_0_sig_up_order)

  4. 4.

    Perform the same process to generate results for comparing stages 2 and 0.

Table 2 List of DEGs ordered by increasing significance

> res2_0 <- results(dds, contrast=c("stage","2","0")) > res_2_0_sig<-subset(res2_0, padj < 0.05) > res_2_0_sig_up<-subset(res_2_0_sig, log2FoldChange > 0.5) > res_2_0_sig_up_order<-res_2_0_sig_up[order(res_2_0_sig_up$padj),] > write.csv(res_2_0_sig_up_order, file=“res_2_0_sig_up_order.csv”) > res_2_0_sig_down<-subset(res_2_0_sig, log2FoldChange < 0.5) > res_2_0_sig_down_order<-res_2_0_sig_down[order(res_2_0_sig_down$padj),] > write.csv(res_2_0_sig_down_order, file=“res_2_0_sig_down_order.csv”)

3.5 Gene Ontology Enrichment Analysis

To get more insight into the biological significance of DE genes we obtained in the above step, we will assign gene ontologies and perform Gene Ontology (GO) enrichment analysis using goatools [13].

  1. 1.

    In R create a list of gene IDs from the result file:

    > gene_ids_up<-rownames(res_1_0_sig_up_order) > write.table(gene_ids_up,file="gene_ids_up.csv", row.names=FALSE, quote = FALSE, col.names=FALSE)

  2. 2.

    Repeat this step for all the result output files created.

  3. 3.

    Define our background gene list:

    > res1_0_universe<-subset(res1_0, baseMean > 10) > gene_ids_univ<-rownames(res1_0_universe) > write.table(gene_ids_univ,file="gene_ids_univ.txt", row.names=FALSE, quote = FALSE, col.names=FALSE)

  4. 4.

    Modify the GAF file so the identifiers match the gene IDs in our lists:

    > gaf<-read.table(file="Boleac_slimTunicate.gaf",skip=5,sep="\t") > gaf_red<-gaf_new[ ,c("V6","V5")] > gaf_red_col<-aggregate(V5 ~V6, gaf_red, paste, collapse=";") > write.table(gaf_red_col,file="ids2go_BL.txt", row.names=FALSE, quote = FALSE, col.names=FALSE)

  5. 5.

    Find enrichments with a python script run in terminal (see Note 28).

    $ find_enrichment.py gene_ids_up.txt gene_ids_univ.txt ids2go_BL.txt --pval=0.05 --method=fdr_bh --pval_field=fdr_bh --outfile=results_id2gos_1_0_up.xlsx

  6. 6.

    Plot top 15 enriched terms of each category (see Notes 29 and 30, Fig. 3).

    > library(“readxl”) > go<-readxl::read_excel("results_id2gos_1_0_up.xlsx") > go_MF<-subset(go, NS=="MF") > go_MF_15<-go_MF[1:15, ] > b<-ggplot(go_MF_15, aes(x=reorder(name, -p_fdr_bh),y=study_count,color=p_fdr_bh,size=study_count))+geom_point() +coord_flip() > b + theme_minimal() + labs(x="Molecular Function",y="Gene count",color="p.adjust", size="Gene count")

Fig. 3
figure 3

Example of a GO plot for molecular function terms overrepresented using the list of stage 1 upregulated genes. The dotplot generated using ggplot shows the 15 most significant (adjusted p-value < 0.05) enriched GO terms within the molecular function (MF) category

4 Notes

  1. 1.

    Although this protocol focuses largely on the analysis of RNA-seq data, a few important considerations about the study design are mentioned here. How many reads per sample and how many biological replicates should an RNA-seq experiment have? As a general rule, biological replicates should be prioritized over-read depth. To better estimate the number of reads needed, Illumina offers a read coverage calculator called Scotty (https://support.illumina.com/downloads/sequencing_coverage_calculator.html). This web-based tool is designed to use user data from a small trial experiment consisting of at least two biological replicates per condition to estimate the number of replicates and sequencing depth needed. Alternatively, publicly available datasets closely resembling the user’s data can be used. The higher the expected biological variation the more replicates and the higher the sequencing depth needed to identify DEGs.

  2. 2.

    Due to the high sensitivity of RNA-seq, so-called batch-effects created during sample or library preparation can have negative implications on identifying DEGs stemming from “true” biological variation. To control for these effects, processing (e.g., RNA-extraction ) of samples and library prep should ideally happen at the same time.

  3. 3.

    This experiment used an RNA-seq library preparation method generating strand-specific paired-end reads which are recommended when no reference genome is available. Strand-specific information is also useful if anti-sense transcripts such as long-non-coding RNAs are of interest. If the goal is to simply measure differential gene expression between conditions, then an un-stranded library can be sufficient although the price difference in stranded vs. non-stranded kits almost warrants a stranded kit.

  4. 4.

    RNA quality is very important and should be assessed using Bioanlayzer (Agilent Technologies) which gives an RNA integrity number (RIN) based on the ratio of the major ribosomal RNA fractions. Especially samples that are widely different from the average RIN of other samples should be reconsidered. Most sequencing providers will recommend a certain threshold above which they will process a sample (>7.0). Ascidians biosynthesize cellulose which is incorporated into their protective tunic imposes the use of relatively harsh methods to extract RNA, which can lead to degradation of the RIN in some samples. We have had success using plant RNA extraction kits to circumvent this issue.

  5. 5.

    If there is no genomic sequence available to map RNA-seq reads, programs such as Trinity [14] or SOAPdenovo [15] can be used to assemble and quantify transcripts.

  6. 6.

    Download the genome sequence (FASTA) and annotation files (GTF and GAF) for B. leachii can be obtained from Aniseed https://www.aniseed.cnrs.fr/aniseed/download/download_data. This can be done directly in terminal using the commands:

    $ wget https://www.aniseed.cnrs.fr/aniseed/download/?file=data%2Fboleac%2FBoleac_SBv3_genome_gff3_fasta.zip $ wget https://www.aniseed.cnrs.fr/aniseed/download/?file=data%2Fboleac%2FBoleac.gaf.gz

  7. 7.

    Download the gene annotation and gene ontology files in terminal by typing:

    $ wget https://www.aniseed.cnrs.fr/aniseed/download/?file=data%2Fboleac%2FBoleac_slimTunicate.gaf.gz -O Boleac_slimTunicate.gaf.gz $ gunzip Boleac_slimTunicate.gaf.gz $ wget http://purl.obolibrary.org/obo/go/go-basic.obo -O go-basic.obo

  8. 8.

    If there is no genome annotation available, it is possible to perform automated genome annotation. However, this analysis lies outside of the scope of this protocol . For a generic pipeline for performing such an analysis, see Blanchoud et al. [8].

  9. 9.

    A computer with more than 4 GB RAM is required for analysis. The most memory-intensive step is the alignment of the reads to the genome which scales with the genome size of the organism. This step could be also performed on a high-capacity server more suitable for this task.

  10. 10.

    We recommend that you familiarize yourself with basic command line use (e.g., https://ubuntu.com/tutorials/command-line-for-beginners#1-overview) to create, rename, and navigate folders and files.

  11. 11.

    To also make RNA-seq analysis more reproducible, we recommend using a software manager called Bioconda [16]. This protocol will be using software available in this environment, but similar software could be installed from other sources too.

  12. 12.

    To install Miniconda (a version of Anaconda which supports Bioconda) on a computer, run the following commands:

    $ curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-${ARCH}.sh $ sh Miniconda3-latest-${ARCH}.sh

    where the environment variable ARCH should be set to the type of your local operating system (e.g., Linux-ppc64le, Linux-x86_64, MacOSX-x86_64, Windows-x86). This installation includes software tools used in this protocol such as STAR [11], statistical environment R [12], and DESeq2 [10]. For a full list of installed packages run:

    $ conda list $ conda install -c bioconda bioconductor-deseq2

  13. 13.

    To install SRA-Tools run:

    $ conda install -c bioconda sra-tool

    Detailed instructions and information on setting up downloads from SRA archives can be found on the website https://ncbi.github.io/sra-tools/

  14. 14.

    Install goatools [13] by typing the code below in terminal.

    $ conda install -c bioconda goatools

  15. 15.

    Go to R project website (http://www.r-project.org/), download, and install an up-to-date R version [12].

  16. 16.

    Individual files from the archive SRP064769 can be downloaded using the fastq-dump command:

    $ fastq-dump --split-files SRR2729873

    This will create two FASTQ files for each sample, rename all files with their short identifier according to Table 3, e.g., SRR2729873 to embr before proceeding.

  17. 17.

    A detailed explanation of FastQC and examples of good/bad datasets and what to look for in the reports created can be found here on the website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

  18. 18.

    To create the directory that will contain the genome index, use the command:

    $ mkdir genome_Boleac

  19. 19.

    The option “--genomeDir” specifies where the generated indexes are stored. The path to the FASTA file is specified with “–genomeFastaFiles” and the GTF annotation file with “–sjdbGTFfile.” In this case, the annotation file is supplied in GFF format “–sjdbGTFtagExonParentTranscript,” Parent option can be used instead. “–runThreadN” option defines the number of threads used for parallelization.

  20. 20.

    This commands give paths to genome indexes (--genomeDir), the GTF file (--sjdbGTFfile), the trimmed FASTQ files (– readFilesIn), and will quantify transcripts(--quantMode GeneCounts). This step is computationally intensive and can take several hours.

  21. 21.

    The option—quantMode GeneCounts—produces a file named ReadsPerGene.out.tab with four columns specified in Table 3: geneID; 2: counts for unstranded RNA; 3: counts for first strand; 4: counts for second strand. Additionally, the number of unmapped and multi-mapping reads are in the header. If the percentage of unmapped reads is higher than 25% of total reads, the sample is problematic and should be reconsidered.

  22. 22.

    This command detailed in step 3 of Subheading 3.2 is only for sample reg_0. For other samples, the “–outFileNamePrefix” as well as “–readFilesIn” part needs to be altered. Again, specify a directory for the output files (see Note 20), and give paths to genome indexes (--genomeDir) as well as GTF file (--sjdbGTFfile).

  23. 23.

    For subsequent analysis, using DESeq2 read counts must be selected based on the library preparation protocol was used, unstranded or stranded. To identify the library preparation method and sequencing strategy of the sample, the number of reads for each strand and in total can be counted using the awk function:

    $ grep -v "N_" reg_0ReadsPerGene.out.tab | awk '{unst+=$2;forw+=$3;rev+=$4}END{print\ unst,forw,rev}'

    which results in “$ 8842238 508841 8718480.” This is interpreted as 8,718,480 reads map to the reverse or second strand and only 508,841 reads map to the first strand, which indicates a stranded library was made and the reverse strand was sequenced.

  24. 24.

    Depending on the library preparation method, it is crucial to select the right column in this step. In our case, second-strand synthesis was used, so column 4 was selected for further analysis.

    > counts <- as.data.frame( sapply( counts.files, function(x) x[ , 4 ))

  25. 25.

    To define which condition or time-points samples are associated, we create a coldata object which can be made in a text-editor or Microsoft Excel and saved in a CSV file format in the working directory specified earlier. In this example, the coldata object has a unique identifier for each sample (column 1) and one column specifying conditional information (Table 4). The unique identifier of samples must be identical in the coldata object as well as in the counts object. To check if that is the case type:

    > all(rownames(coldata) == colnames(counts))

    In this case, the counts object column names can be replaced with the sample identifiers in coldata. Please note that the condition in this chosen arbitrary is for demonstration purposes only.

  26. 26.

    A principal component plot (PCA) is a useful tool to assess if and how individual samples cluster and if the variation in the expression of the most heterogeneously expressed genes can be explained by the nature of the sample (e.g., samples clustering along developmental time).

    In this case, we can see (Fig. 2) a strong separation of the embryonic stage (explaining most of the variation) and separation of regeneration stages and the whole colony as the second-largest component of variation.

  27. 27.

    It is not recommended to perform differential gene expression analysis on data sets without biological replicates. In our case and for demonstration purposes only, we arbitrarily pooled samples from reg_1 and reg_2 (Stage 1) and reg_3 and reg_4 (Stage 2) to compare it to reg_0 (Stage 0) in the coldata object (Table 4) when running DESeq2 in step 6 of Subheading 3.3. To get differentially expressed genes between these conditions, we need to specify which comparison we want to make.

  28. 28.

    The results of the goatools enrichment analysis contains GO terms which are found significantly (Benjamini/Hochberg: fdr_bh at a 5% false discovery rate) overrepresented in the differentially expressed genes compared to the GO terms of all expressed genes in the samples. The resulting file contains terms for three major categories such as biological process (BP), cellular component (CC), and molecular function (MF).

  29. 29.

    To import xls files back into R, install the library “readxl”:

    > install.packages("readxl")

  30. 30.

    To plot the other categories or more terms, you can change the subset and name the output accordingly.

Table 3 An example of a text file showing the first few lines of gene raw count data
Table 4 Coldata object with sample information