Background

Anopheles funestus is a major vector species of malaria in sub-Saharan Africa [1]. Similar to the more extensively studied vector species Anopheles gambiae, resistance to commonly used insecticides is a growing problem in An. funestus [29]. Efforts to improve the design and implementation of resistance management strategies require a good understanding of the molecular basis of resistance. Use of genomic tools is paramount to achieve this goal. In contrast to An. gambiae, for which significant progress has been made in its genomics since its genome was sequenced more than a decade ago [10], An. funestus has received much less attention. However, progress is being made as the genome of Anopheles funestus was recently sequenced as part of a program to sequence a number of Anopheles species genomes [11]. The draft genome assembly, in the form of unplaced scaffolds not assembled into chromosome sequences, is publicly available via the VectorBase web resource [12, 13]. The draft genome assembly has been annotated, using the genome annotation pipeline MAKER [14, 15]. However, this annotation has not been extensively manually curated and is inaccurate for a number of known genes of interest such as cytochrome P450 monooxygenase and glutathione S-trasferase (GST) genes. Thus, more efforts are currently needed to maximise the usefulness of the An. funestus genome and allow more studies to be performed such as transcriptome profiling or genome-wide association studies. One approach to optimise the quality of this genome is to use RNAseq to improve genome annotation.

RNAseq is a powerful method to study the transcriptomes of organisms, providing a rich dataset allowing transcriptional profiling as well as the identification of novel transcripts, alternative splicing and detection of expressed sequence polymorphisms [16, 17]. However in the case of Anopheles mosquitoes, limited information have been generated to determine the best methodology for library preparation to ensure an accurate profiling of transcriptomes, notably between the enrichment of polyadenylated (poly(A)) RNA transcripts and the alternative method of ribosomal RNA (rRNA) depletion. Methods of mRNA enrichment have been shown to have an effect on the observed transcriptome [1821]. Comparisons of poly(A) mRNA enrichment with rRNA depletion methods report that the depletion methods result in more reads aligned outside of annotated gene coding regions [18, 20]. This may be due to the presence of novel transcripts and/or a greater proportion of genomic DNA in the library. Also, the greater number of reads aligned within introns in depletion libraries could be derived from immature transcripts [21]. Depletion libraries show more even read distribution less 3′ bias than poly-A libraries [18].

In malaria-transmitting mosquitoes, insecticide resistance is often due to ‘metabolic resistance’: up-regulation of genes involved in detoxification, such as cytochrome P450 and GST genes. This is particularly the case in An. funestus, for which pyrethroid resistance is caused by metabolic resistance and not by target site resistance, such as mutations in the voltage gated sodium channel gene. As there are no known genomic biomarkers to detect metabolic resistance, transcriptomic analyses are necessary. These analyses have commonly used microarrays to detect up-regulated genes [46, 8]. RNAseq may offer advantages over microarrays in that unannotated, or poorly annotated, transcripts can be defined and corrected using the sequence alignment, which is more data-rich than the results of microarray analysis. This data richness includes information on alternative splicing, as well as non-protein-coding RNA transcripts (ncRNA) that may play a role in differential gene expression.

Here, we perform an RNAseq experiment on Anopheles funestus (FANG strain), primarily to compare two methods to enrich messenger RNA (mRNA) relative to ribosomal RNA (rRNA): ribosomal depletion and poly(A) selection. We compare the results of both methods applied to total RNA from the same individual mosquitoes and identify the most differentially represented transcripts between the methods to identify non-polyadenylated transcripts. We use the RNAseq data along with published de novo assemblies of Anopheles funestus transcripts [22, 23] to improve the genome annotation for selected genes, focusing on detoxification associated gene families.

Results and discussion

Preparation of transcriptome sequencing libraries

Total RNA was extracted from pools of whole mosquitoes of the An. funestus FANG colony. Extracts showed a pink colouration, probably due to eye pigments. It was not known if this would affect the sequencing library quality, so one sample was split and bead clean-up carried out on one of the sub-samples, to compare results with its uncleaned counterpart and assess whether this discolouration had any effect on the sequencing results. BioAnalyzer traces of total RNA (Additional file 1) showed evidence of the ‘hidden break’ reported in other insect total RNA samples [24]: two peaks of approximately 2000 bp representing 18S rRNA and the 28S rRNA subunits alpha and beta. The total RNA traces also showed some ‘spikiness’ the cause of which was unclear. The profile of degraded RNA tends to look smoother, with a broad peak extending left of the major 18S/28S rRNA peaks. Whether this spikiness would affect the sequencing results was not known at this stage.

After Ribo-Zero rRNA depletion or poly(A) mRNA enrichment, BioAnalyzer traces suggested that all samples were of sufficient quality for RNAseq library preparation, with no obvious quality differences among samples within each method (Additional file 1). The amount of rRNA remaining in the samples appeared to be much lower for Ribo-Zero depleted samples than for poly(A) enriched samples, where even after 3 rounds of poly(A) selection, samples displayed a large peak near 2000 bp (Additional file 1). Later analysis indicated that this peak may represent mitochondrial 16S rRNA.

Initial quality checking of the sequence read data showed that the 8 libraries were evenly represented in the sequenced pool (median 61,229,870 reads per library; range 44,384,316 to 68,640,804 reads) and of good quality, with little read trimming due to sequenced adapters or low quality base calls (full details in Additional file 1).

Alignment of sequence reads to reference genome

All read libraries were aligned to a reference consisting of 1392 scaffolds derived from the Anopheles funestus FUMOZ colony with the An. funestus mitochondrial genome sequence added. Broad alignment metrics are shown in Table 1. Overall, a median of 35,309,124 (range 21,002,025 to 38,878,773), or 57 % (range 35 % to 59 %) of the reads to be aligned could be aligned. Alignment of spliced reads is more difficult than for unspliced reads and leads to lower proportions of the starting libraries being aligned, but this alone is unlikely to account for the relatively low alignment rate. The incompleteness of the reference genome sequence and high levels of sequence polymorphism and genetic divergence between laboratory colonies FANG (derived from Angola) and FUMOZ (from Mozambique, used for sequencing the reference genome) may also contribute to it. A greater proportion of the poly(A) than the Ribo-Zero libraries tended to be aligned (mean difference 11 %, median 8 %). For the sample where an initial bead clean-up was applied (“F1_XP”), the proportion of reads aligned for the Ribo-Zero library (“F1_XP_RZ”; 55 %) was greater than for the uncleaned sample (“F1_RZ”; 35 %). This difference was not seen for poly(A) libraries (59 % for both “F1_XP_PA” and “F1_PA”). This suggests that the purity of the sample may affect the efficacy of the rRNA depletion reaction, which is supported by the ribosomal RNA mapping results shown in Table 2.

Table 1 Metrics describing the transcriptome alignments
Table 2 Tag counts for ten putative rRNA genes

Tag counting for annotated genomic features and analysis of the ribosomal RNA content of libraries

Tag counting means quantifying read pairs or singletons (representatives of a single fragment of transcript) aligned in genomic regions annotated as genome features (e.g. protein coding genes or ncRNAs). As the RNAseq protocol used is strand-specific, tags can be aligned sense or antisense to a feature, which mean different things: i.e. that antisense tags come from an antisense transcript, rather than from the primary annotated transcript. Sense and antisense tag counting was performed for each sequence alignment.

To assess how much ribosomal RNA was represented in each sequence library (i.e. the success of the mRNA enrichment procedure) gene features annotated as rRNAs were extracted from the genome annotation. There were ten of these features (Table 2): 8 resulting from automated annotation of the genome assembly and 2 annotated on the mitochondrial genome. For putative nuclear 28S and 18S rRNA genes, poly(A) mRNA enrichment almost always (except for the F1_XP sample pair) resulted in fewer tags from these genes than Ribo-Zero rRNA depletion. Putative 5.8S rRNA genes rarely showed any tags (possibly due to ambiguous read alignment arising from non-uniqueness among multiple 5.8S rRNA copies in the reference genome or to their small size). The mitochondrial 12S rRNA gene showed the same trend as for nuclear 28S and 18S rRNA genes. However, the mitochondrial 16S rRNA gene showed the opposite trend, with more tags in the poly(A) mRNA enriched libraries than in the Ribo-Zero depleted libraries.

Tag counting is summarised in Table 3. As counting used annotated genomic features, antisense counts should be much lower than sense counts. This was the case, with a ten-fold greater assignment of tags to features in the sense orientation than antisense. The count data generated for each genomic feature is the data analysed in differential gene expression (DGE) analysis. Therefore, in general, the more sense tag-count data available, the greater the power of DGE analysis to detect differential gene expression. Overall, the poly(A) mRNA-enriched libraries produced more “on target” sense tags, as a proportion of all aligned tags (between 7.5 and 9.2 % more than Ribo-Zero), as might be expected given enrichment on the poly(A) tails of mRNA transcripts (Table 3 and Fig. 1). However, after excluding tag counts from 10 putative rRNA genes, the differences between Ribo-Zero and poly(A) samples were less marked and for all samples (between 1.3 and 2.7 % more on-target tags for poly(A) ). Overall, “on target” sense tag counts (excluding rRNA genes) comprised around 47.9 to 56.4 % of the total aligned tags. This relatively low proportion probably reflects the fact that 5′ and 3′ un-translated regions (UTRs) are rarely predicted by automated annotation pipelines, so a large proportion of tags from mRNA transcripts are not counted as such when using gene models mostly comprising protein coding regions. Overall (excluding rRNA genes), our differential “gene expression” analysis would be based on between 6.6 and 13.4 million tags per sample.

Table 3 Summary of sense tag counting
Fig. 1
figure 1

Sense tag counts per annotated gene feature as a proportion of total aligned tags. Grey bars show total sense tag counts as a proportion of total aligned tags while black bars show this after excluding 10 putative rRNA genes

Comparison of poly(A) mRNA-enriched and Ribo-Zero rRNA-depleted samples

Usually, the variation between samples within a sample group (within-group variation) is smaller than that between samples from different sample groups (overall variation) because the former consists of technical and biological variation only, while the latter also contains variation due to the treatment effect. When the treatment effect is the dominant term of variation, the sample groups can be clearly separated by statistical tools. If the treatment effect is weak compared to the technical and biological variation the samples will be difficult to discriminate based on the data. Here, the treatment effect is the difference between poly(A) and Ribo-Zero mRNA enrichment methods and a large treatment effect indicates that the mRNA enrichment/rRNA depletion method greatly affects the composition of the transcriptome. We assessed these effects.

Initial visual inspection of sequence alignments indicated that the Ribo-Zero libraries showed more ‘background’ alignments (reads aligned in intergenic and intronic regions) than the poly(A) datasets. It is not certain whether this represents real, un-polyadenylated transcripts or possible erroneous sequencing of DNA. However, as samples came from the same pool and were DNase treated prior to being split, any difference in the amount of DNA being sequenced should be due to specific differences in the enrichment/depletion process.

As an initial gene-wise analysis of gene expression levels, reads or read pairs (“tags”) mapped in the sense orientation to annotated gene models were counted for each sample. One caveat to this analysis is that the gene models used to generate tag counts were from a draft genome annotation and therefore were likely to contain errors. 257 of the 13,793 annotated features had sense tag counts of zero in all eight samples and were not analysed. To assess variation in the data, log2 sense tag counts per gene were plotted among sample pairs (Additional file 1). These give an indication of how similar are samples belonging to the same group and how different are samples from different sample groups. The associated correlation coefficients are displayed as a correlation heatmap (Additional file 1). Principal component analysis (PCA), using log2 sense tag counts per gene, was used to plot each sample relative to all others (Additional file 1). These analyses showed very low technical variation between the cleaned and uncleaned FANG-1 samples and greater biological variation among FANG-1, FANG-2 and FANG-3 samples. Ribo-Zero samples showed more biological variation than poly(A) samples, though this was mainly due to the FANG-2 Ribo-Zero sample, which was a clear outlier (Additional file 1). FANG-2 Ribo-Zero was the smallest sample library, with around 44 million reads compared to around 55–62 million reads for the other samples, which may be associated with this result. Together, these assessments indicate that the mRNA enrichment method has a large effect on the sequenced transcriptomes of the samples.

Analysis of differential transcriptome composition between mRNA enrichment methods

To assess differences between the transcriptomes due to the mRNA enrichment method applied, differential gene expression (DGE) analysis was carried out to identify significantly differentially represented annotated genome features. First, as technical variation due to pre-enrichment sample cleaning of FANG-1 was low and to make full use of the data, the mean tag counts for cleaned and uncleaned samples were calculated and used in the DGE analysis. The data were modelled using a generalized linear model in a one-factor (mRNA enrichment method) experiment with two levels (“Ribo-Zero” and “poly(A)”). Full results of the analysis are shown in Additional file 2. Figure 2 shows the relationship between log2 fold change in expression between Ribo-Zero and poly(A) and mean expression level across all samples. The power in the data to discriminate between Ribo-Zero and poly(A) samples (clearly shown by PCA) is reflected in the large number of significantly differentially represented transcripts among the two transcriptomes. This analysis indicates that results based on Ribo-Zero and poly(A) mRNA enrichment may be very different from one another. Great care must therefore be taken when comparing results derived using the different methods.

Fig. 2
figure 2

Plot of differential transcript abundance in Ribo-Zero and poly(A) transcriptomes. The y-axis shows log2 fold-change of transcript abundance between Ribo-Zero and poly(A) samples. Greater than 0 indicates that a transcript is more abundant in Ribo-Zero samples than in poly(A) samples and vice versa, as indicated on the plot. Green lines indicate 1.5-fold change. Red points are significantly differentially represented among treatments (FDR-adjusted P-values < 0.05). Orange points are transcripts represented in only one sample, so fold-change could not be calculated. The x-axis shows the log2 average abundance of the transcripts in both Ribo-Zero and poly(A) transcriptomes. The shape of the plot and distribution of significantly differentially represented transcripts indicates that low abundance transcripts show more variation among Ribo-Zero and poly(A) transcriptomes due to sampling effects than do higher abundance transcripts

In order to assess why the composition of Ribo-Zero and poly(A) transcriptomes differ so much, we looked at the 20 most differentially represented genome features in each group (ordered by FDR-adjusted P-value). Table 4 lists the 20 features most over- or under-represented in the Ribo-Zero transcriptome relative to poly(A). The two most over-represented features in the Ribo-Zero transcriptomes were ncRNAs, suggesting a large population of non-polyadenylated transcripts, including many ncRNAs, may exist in the cells and be under-represented in poly(A) transcriptomes. Two histone genes (histone H2A and H3) were over-represented in the Ribo-Zero transcriptome, consistent with the occurrence of replication-dependent, unpolyadenylated and replication-independent, polyadenylated histone forms [19, 25]. Also, three genes with Cadherin domains were over-represented in the Ribo-Zero transcriptome. In other systems, Cadherin is among a set of genes controlled at the translational level by cytoplasmic polyadenylation [26]. If this were occurring in An. funestus, we would expect transcript forms with short poly(A) tails to be over-represented in the Ribo-Zero transcriptome. Further investigation would be needed to confirm if this is the case. Among the genes over-represented in the poly(A) transcriptome were a number of putative core metabolic genes such as transcription-associated proteins and ribosomal proteins. A study of poly(A) tail length in yeast showed that ribosomal proteins had long poly(A) tails [27]. We speculate that preferential enrichment of transcripts with longer poly(A) tails may affect the resulting transcriptome composition.

Table 4 The most significantly differentially represented genome features between Ribo-Zero and poly(A) transcriptomes

A broader analysis of functional categories and gene ontologies represented in the gene sets showed differences between poly(A) and Ribo-Zero libraries. The broad functional categories (as defined in the KEGG databases) of genes over-represented in the Ribo-Zero or poly(A) libraries were quite different. 28 % (516/1,841) of genes were annotated for the Ribo-Zero set, compared to 51 % (643/1,250) for the poly(A) set and 36 % (3702/10,253) for genes not differentially represented among treatments. The largest category (after “unclassified”) for the Ribo-Zero set was “environmental information processing”, while for the poly(A) set it was “genetic information processing” (Fig. 3). Many of these “genetic information processing”-associated genes in the poly(A) set were ribosomal proteins, as indicated by gene ontology analysis (summarised in Table 5), where ribosome-associated terms were enriched in this set (Fig. 4 and Additional file 1). In contrast, many of the “environmental information processing”-associated genes in the Ribo-Zero set may be membrane and/or nuclear proteins involved in signal transduction or transcriptional regulation (Fig. 4 and Additional file 1).

Fig. 3
figure 3

Representation of functional categories in poly(A) and Ribo-Zero over-represented gene sets and in genes not differentially represented (“both”)

Table 5 Summary of gene ontology analysis
Fig. 4
figure 4

Cellular compartment GO terms differentially represented between gene sets over-represented in poly(A) (red) and Ribo-Zero (blue) samples

Detection of un-annotated transcripts

To further compare the transcriptomes, transcripts were annotated from aligned data from each method. 11,395 genes (15,605 transcripts) were predicted using the poly(A) data and 15,914 genes (20,317 transcripts) using the Ribo-Zero data (Table 6). After removing transcripts that overlapped those predicted in the AfunF1.2 annotation, more remained in the Ribo-Zero transcriptome (nearly 5,000) that in the poly(A) transcriptome (around 1,000), suggesting the presence of transcripts not detected in the poly(A) transcriptome. Some of this difference could be due to an increased amount of immature transcripts (that are neither fully spliced nor polyadenylated) and/or a greater contribution from DNA contamination in the Ribo-Zero samples (which are subjected to less PCR amplification specifically of transcripts). However, visual inspection of a sample of these putative transcripts showed that a number were indeed unannotated, Ribo-Zero-specific transcribed regions that did not occur within the introns of annotated genes and were predominantly strand-specific (i.e. did not result from DNA contamination, which would not be strand-specific). An example is shown in Fig. 5. Further analyses are needed to elucidate the functions of these transcripts.

Table 6 Comparison of the published genome annotation with de novo annotations based on RNAseq alignments
Fig. 5
figure 5

Example of putative novel transcript(s) represented in the Ribo-Zero libraries but not in the poly(A) libraries. The top panel shows a coverage depth summary plot (grey plot; log-scaled coverage depth) and aligned reads (red bars are reads from pairs aligned to the plus strand, blue bars from pairs aligned to the minus strand) for the poly(A) data. Below these are the same for the Ribo-Zero data. Below these are gene models (grey boxes represent exons, black lines un-translated regions: introns and UTRs) from the gene set AfunF1.2. A large transcribed region on the minus strand is evident in the Ribo-Zero data only, representing one or more putatively poly(A)- transcripts

Validation and improvement of the annotation of Anopheles funestus genes, focused on detoxification-associated gene families

An important part of genome annotation is validation of the predicted gene models. These RNAseq data allowed this to be done. On a genome-wide scale there were 44,031 predicted introns, of which 43,852 were unique (i.e. not the same intron in multiple transcript isoforms or overlapping genes). Of these, 28,346 (61.64 %) were validated by at least 10 reads in the FANG-1, 2 and 3 poly(A) data sets (31,927 introns, 72.91 %, were validated by at least 1 read in the same data set). A small number of introns were supported at only one end, suggesting mis-annotation of the gene models. Of the 44,031 predicted introns, 1,125 were supported only at the 5′ end and 2,118 only at the 3′ end. Introns supported only at the 3′ end tend to be the first predicted intron (45 % the first intron vs. 13 % the last intron), while introns supported only at the 5′ end tend to be the last predicted intron (33 % the last intron vs. 17 % the first intron), suggesting that the ends of up to around 3,000 gene models are mis-annotated and could be manually improved. Results of the validation analysis for all predicted introns are shown in Additional file 3. For the 28,346 validated introns, consensus motifs at intron-exon boundaries are shown in Fig. 6. The 5′ end of the intron has an extended GTAAGT motif, while the 3′ end of the intron has the conserved AG motif preceded by an extended pyrimidine-rich region. No conserved motif is seen in the flanking exons, except for a slight enrichment of purines at the last position of the upstream exon.

Fig. 6
figure 6

Consensus sequences at the 5′ and 3′ intron-exon boundaries of 28,346 introns validated by RNAseq. Position 0 at the 5′ end marks the first nucleotide in the intron. Position 0 at the 3′ end marks the first nucleotide in the following exon. Two diagonal lines denote the variable central portions of the introns. The y-axis indicates the bit-score for each nucleotide position. An extended motif of GTAAGT is seen at the 5′ end, with a slight enrichment of purines at the last position of the preceding exon. The conserved AG motif is seen at the 3′ end, preceded by an extended pyrimidine-rich region. Little or no sequence conservation is seen in the flanking exons, apart from at the last position of the preceding exon

Visual inspection of the automated MAKER genome annotation of Anopheles funestus showed that, in many cases, the MAKER annotation had inferred a single gene model from multiple closely located genes. This behaviour is particularly problematic for detoxification-associated genes such as cytochrome P450 and GST because these genes often occur in clusters of tandemly duplicated paralogous genes. Manual editing was required to improve the genome annotation, an example of which is shown in Fig. 7.

Fig. 7
figure 7

Example of an incorrect gene model (AFUN009068) and three manually edited gene models. The top panel shows a coverage depth summary plot (grey plot; log-scaled coverage depth) and aligned reads (red bars are reads from pairs aligned to the plus strand, blue bars from pairs aligned to the minus strand). Below these are the gene models (grey boxes represent exons, black lines un-translated regions: introns and UTRs). The gene model in gene set AfunF1.2 (AFUN009068) is shown above three manually edited gene models based on read alignments and orthologous genes in the Anopheles gambiae (Ag) genome annotation (bottom). The three genes are carboxylesterases (COE)

We aimed to improve the annotation of genes involved in the three phases of detoxification: transformation (phase I), conjugation (phase II) and transport/efflux (phase III). Two major enzyme families, P450 (phase I) and GST (phase I and II), have already been manually curated and, at the time of writing, are being incorporated into the VectorBase genome annotation, AfunF1.3 (Dr. Craig Wilding and Dr. Dan Lawson, personal communications). Therefore, we focused upon seven other detoxification gene families. These were: the phase I enzymes carboxylesterases (COE), flavin-containing monooxygenases (FMO), aldehyde oxidases (AOX) and aldehyde dehydrogenases (ALDH); the phase II enzymes UDP glucosyltransferases (UGT) and sulfotransferases (SULT); and the phase III transporter-encoding ATP-binding cassette genes (ABC).

Manual editing considerably improved the annotation of detoxification-associated genes. A table of manually edited gene models and unedited, putatively correct, gene models belonging to each family is shown in Additional file 4.

For the carboxylestarase genes (COE), 22 gene models were manually edited to produce 28 putative COE genes and 6 non-COE genes, 10 gene models were not modified. For the flavin-containing monooxygenase genes (FMO) 4 gene models were identified and were not modified. For the aldehyde oxidase genes (AO), 9 gene models were manually edited to produce 6 putative AO genes, 2 xanthine dehydrogenase genes and 2 non-AO genes, 2 gene models were not modified. For the aldehyde dehydrogenase genes (ALDH), 2 gene models were manually edited to produce 2 ALDH genes and 1 non-ALDH gene, 1 gene model was not modified.

For the UDP-glycosyltransferase genes (UGT), 4 gene models were manually edited to produce 8 UGT genes and 3 non-UGT genes, 14 gene models were not modified. For the sulfotransferase genes (SULT), 1 gene model was manually edited to produce 5 SULT genes and and an oestrogen sulfotransferase gene, 3 gene models were not modified.

For the ATP-binding cassette genes (ABC), 9 gene models were manually edited to produce 13 ABC genes and 10 non-ABC genes, 16 gene models were not modified.

All manually edited gene models have been submitted to VectorBase for incorporation into the Anopheles funestus genome annotation.

Integration of An. funestus transcriptome datasets and assignment of FUMOZ scaffolds to chromosome arms

Two An. funestus whole transcriptome datasets were published before the whole genome assembly [22, 23]. We integrated both the “Crawford” transcriptome (14,850 sequences) and the “Gregory” transcriptome (18,103 sequences) with the FUMOZ predicted transcript set, AfunF1.2 (13,897 sequences) by clustering them based on sequence similarity. The results are shown in Additional file 5. Clustering the full set of 46,850 sequences at 95 % similarity resulted in 25,519 clusters. The results suggested that the “Gregory” transcriptome was more fragmented than the other two: its 18,103 sequences were represented in only 5,741 unique clusters, compared to 12,349 clusters for the “Crawford” transcriptome and 13,172 clusters for the AfunF1.2 transcript set.

Genome sequences assembled to the level of whole chromosomes are more useful than draft assemblies for analyses of genome wide variation and gene expression, because structural and spatial information can be important (e.g. for defining the location of a selective sweep or the effect of regional chromatin remodelling on co-regulation of gene expression). The comparison of An. funestus predicted genes to those of An. gambiae allowed a number of scaffolds to be tentatively assigned to chromosome arms based on homology to An. gambiae. Table 7 summarises the number of scaffolds with 5 or more genes assigned to a chromosome arm by putative 1:1 orthology with An. gambiae (all scaffold IDs are listed in Additional file 6). In a small number of anomalous cases where a scaffold contained genes with orthology to genes on a different chromosome An. gambiae arm, the minority chromosome arm was never supported by more than 1 gene and these were removed by setting a threshold of 5 genes to support assignment of a scaffold to a group. This conservative approach to defining linkage groups assigned only a small number (286/1392 = 20.55 %) of scaffolds. However, these were generally large scaffolds and represented the greater proportion (183,517,537/225,223,604 = 81.48 %) of the total genome length. These groups could form a starting point for genome assembly finishing and for further analysis of synteny and co-linearity of genes between An. funestus and An. gambiae.

Table 7 Anopheles funestus scaffolds assigned to groups representing putative chromosome arms

Conclusions

The sequencing of the genome of the major malaria vector Anopheles funestus allows new analyses of its biology and may accelerate research on this species, which currently lags behind that of Anopheles gambiae. To carry out analyses of gene expression relevant to insecticide resistance, we have piloted the use of RNAseq. In addition, we have improved the genome annotation for a number of gene families with roles in the detoxification of xenobiotic agents.

As reported in other eukaryotic systems, the method used to enrich for mRNA relative to rRNA has a large effect upon the observed transcriptome [1821]. Functional classes of genes are differentially represented in poly(A) mRNA-enriched and Ribo-Zero rRNA-depleted transcriptomes, with greater representation of ribosomal proteins with poly(A) and membrane-associated proteins with Ribo-Zero. Genes with both poly(A) + and poly(A)- transcripts, such as histones, and ncRNA, such as signal recognition particle RNAs, were also highly represented in the Ribo-Zero transcriptome. The Ribo-Zero libraries show a higher proportion of reads aligning outside of annotated exons, which may represent a combination of novel, poly(A)- transcripts and the introns of immature transcripts.

In making recommendations for which method to use for studies of the transcriptomes of mosquitoes, several factors should be considered. For analyses of differential gene expression, power to detect different expression levels is proportional to the number of reads aligned to each gene. Poly(A) mRNA enrichment does provide more of this ‘on-target’ data than Ribo-Zero rRNA depletion, yet the difference is less striking after accounting for rRNA genes, in particular a highly poly(A)-enriched mitochondrial 16S rRNA. The transcriptome profiles of poly(A) samples were more highly correlated with each other than Ribo-Zero samples, indicating that power to detect differential gene expression would be greater for poly(A) mRNA-enriched samples. However, we argue that Ribo-Zero rRNA depletion produces a truer representation of the transcriptome because (i) there is less PCR amplification of transcripts, which can bias the final results and (ii) The transcriptome includes the poly(A)- fraction, which may be functionally relevant. Due to the large effect that the mRNA enrichment method has on the transcriptome obtained, researchers should be very careful in comparing results obtained using different methods.

The draft genome annotation of An. funestus is currently inadequate for detailed studies of gene expression targeted at insecticide resistance mechanisms. Major gene families involved in detoxification of xenobiotic substances, such as cytochrome P450 and GST genes, are not well annotated by automated approaches. Manually edited gene models for these gene families are being incorporated into the genome annotation at the time of writing (Dr. Craig Wilding and Dr. Dan Lawson, personal communications). We extended this annotation improvement to 7 more key detoxification genes families: 4 associated with phase I; 2 with phase II and 1 with phase III. This effort offers a major improvement of the utility of the An. funestus genome as a tool to study insecticide resistance in this species. Accurate annotation of these gene families is particularly important for An. funestus, as insecticide resistance appears to be primarily mediated by detoxification of the compounds in this species.

Methods

Mosquito samples used in the study

The study used Anopheles funestus mosquitoes of the FANG laboratory colony, a fully insecticide susceptible colony derived from Angola [28]. Mosquito eggs (a kind gift of Prof. Lizette Koekemoer, University of the Witwatersrand, South Africa) were hatched and mosquitoes reared to adulthood in the insectaries at the Liverpool School of Tropical Medicine under conditions described elsewhere [2].

RNA extraction, sequence library preparation and sequencing

Total RNA was extracted from three pools of 10 individual adult female mosquitoes using the Arcturus PicoPure RNA isolation kit (Life Technologies, Carlsbad, USA), according to the manufacturer’s instructions and including a DNase treatment step. On visual inspection of the purified RNA, the samples displayed some pigmentation, the likely effect of which on data quality was unknown. To assess this, the most abundant sample (FANG-1) was split in two. One half of the sample was cleaned using Agencourt AMPure XP beads (Beckman Coulter, Brea, USA) prior to further processing.

All four samples were split in two and one of each pair subjected to poly(A) mRNA enrichment and the other to ribosomal RNA depletion. Polyadenylated RNA was selected from total RNA samples using 3 rounds of poly(A) selection with the Dynabeads mRNA purification kit (Life Technologies), using 1.5 μg of starting material. Total RNA was rRNA-depleted using the Ribo-Zero low input kit for Human/Mouse/Rat (Illumina, San Diego, USA), using 100 ng of starting material. RNAseq libraries were prepared from poly(A) and Ribo-Zero mRNA-enriched material using the ScriptSeq v2 RNAseq library preparation kit (Illumina), using 15 cycles of PCR amplification. Libraries were purified using Agencourt AMPure XP beads (Beckman Coulter). Each library was quantified using a Qubit fluorometer (Life Technologies) and the size distribution assessed using the 2100 Bioanalyzer (Agilent Technologies, Santa Clara, USA).

The eight final libraries were pooled in equimolar amounts using the Qubit and Bioanalyzer data. The quantity and quality of each pool was assessed by Bioanalyzer and subsequently by qPCR using the Kapa Illumina library quantification kit (Kapa Biosystems, Wilmington, USA), on a Light Cycler LC480II (Roche, Basel, Switzerland), according to manufacturers’ instructions. The pool of libraries was sequenced on one lane of the HiSeq 2500 (Illumina) at 2x125 bp paired-end sequencing with v4 chemistry. Sequence library preparation and sequencing were done at the Centre for Genomic Research, University of Liverpool, UK.

Alignment of sequence reads to a reference genome

Initial processing and quality assessment of the sequence data was performed using a software pipeline developed at the Centre for Genomic Research, University of Liverpool (Dr. Richard Gregory, personal communication). In this pipeline, basecalling and de-multiplexing of indexed reads was performed by CASAVA version 1.8.2 (Illumina) to produce samples from the pooled sequence data, in fastq format. The raw fastq files were trimmed to remove Illumina adapter sequences using Cutadapt version 1.2.1 [29]. The 3' end of any read matching the adapter sequence over at least 3 bp was trimmed off. Reads were further trimmed to remove low quality bases, using Sickle version 1.200 [30], with a minimum window quality score of 20. After trimming, reads shorter than 10 bp were removed. If both reads from a pair passed this filter, each was included in either the R1 (forward reads) or R2 (reverse reads) file. If only one of a read pair passed this filter, it was included in the R0 (unpaired reads) file.

The reference sequence used for alignment was the Anopheles funestus assembled scaffold sequences derived from the FUMOZ laboratory colony, a multi-insecticide resistant colony derived from southern Mozambique [11, 28]. Assembly AfunF1 (GenBank assembly identifier GCA_000349085.1; GenBank WGS project identifier APCI01) was downloaded from VectorBase [12, 13]. A single sequence representing the Anopheles funestus mitochondrial genome (accession number DQ146364) [31] was added to these scaffolds to make the reference sequence used in the analysis.

R1/R2 read pairs were aligned to the reference sequence using TopHat version 2.0.10 [32], based on the bowtie2 aligner [33]. The alignment was carried out using the following non-default parameters: the insert inner distance (between 3′ ends of R1 and R2 reads) was set to 0, with a standard deviation of 50 (options “--mate-inner-dist 0” and “--mate-std-dev 50”); for reads with multiple alignments in the genome(s), only a single, best alignment was recorded, with ties assigned randomly (option “--max-multihits 1”); the library type was specified as second-stranded (option “--library-type fr-secondstrand”), as it was a ScriptSeq library (in second strand sequencing, the forward read of a pair, R1, will map sense to a transcript, and the reverse read, R2, will map antisense).

Analysis of transcript abundance among samples

Tags mapped in the sense orientation to annotated Anopheles funestus genes (automated predictions from gene set AfunF1.2, 2014-08-22, downloaded from VectorBase and annotated genes from the mitochondrial genome) were counted using htseq-count, part of the ‘HTSeq’ framework, version 0.5.3p9 [34]. Read count data were analysed in the R environment, in particular using the package edgeR for differential gene expression analysis [35].

Analysis of the tag count data used log2 transformed counts (after replacing 0 values with 1 s). Pairwise scatter plots of these values were plotted and the Pearson’s correlation coefficients (r 2) represented in a sample correlation heatmap. Principal component analysis (PCA) was applied and the first, second and third principal components of variation plotted.

Differential gene expression analysis was carried out using edgeR [35]. Due to the low level of technical variation between cleaned and uncleaned FANG-1 samples, differential gene expression analysis used the mean tag count values for cleaned and uncleaned FANG-1 samples for each enrichment method. Two groups of samples (“Ribo-Zero” and “poly(A)”) were defined. Normalisation factors were calculated to correct for differences in total tag counts among samples, which may otherwise cause bias in differential gene expression analysis, using the “TMM” (Trimmed Mean M-values) method in edgeR [36] with default parameters. Common, trended (with a minimum of 500 genes in each bin) and tag-wise dispersion parameters were estimated. Trended dispersion was used for significance testing. Variation of RNAseq data can be modelled by a negative binomial distribution [37] and the data modelled using a generalized linear model [38]. For the contrast “Ribo-Zero / poly(A)”, the estimated log2 Fold Change for each gene was tested in edgeR using a Likelihood-Ratios (LR) test [39]. P-values associated with logFC were adjusted for multiple testing using the False Discovery Rate (FDR) approach [40]. Significantly differentially expressed genes were defined as those with an FDR-adjusted P-value < 5 %.

Analysis of gene ontologies and KEGG functional categories

All GO terms associated with features in genome annotation AfunF1.2 were obtained from VectorBase. The terms were separated by GO domain: biological process (BP; 7,228 instances), molecular function (MF; 14,362 instances) and cellular compartment (CC; 3,648 instances). An “instance” refers to a GO term associated with a gene ID and the term is used here because multiple GO terms can be associated with the same gene ID and multiple gene IDs with the same GO term. To test for significant differences in GO term representation among sets of genes, Fisher’s exact test was applied for each GO term, with a correction for multiple testing using the method of Benjamini and Hochberg [40]. The gene sets tested were those over-represented in Ribo-Zero or in poly(A) libraries (with adjusted p-values < 0.01 in the test for differential expression).

Proteins were annotated using blastKOALA v2.0 [41], a tool available via the Kyoto Encyclopedia of Genes and Genomes, KEGG [4244]. Four sets of proteins were submitted to blastKOALA: those over-represented in Ribo-Zero libraries (n = 1841); those over-represented in poly(A) libraries (n = 1250); those not significantly differentially represented (two sets, as a maximum of 10,000 sequences can be submitted at a time: n = 5531 and n = 4722). The protein sequence files were uploaded and the taxonomy ID of An. funestus (taxid:62324) entered. The database searched against was the “genus_eukaryotes” database (containing 3,696,044 entries).

Detection of un-annotated transcripts

Alignment files of the four poly(A) mRNA-enriched libraries and the four Ribo-Zero rRNA-depleted libraries (in bam format) were merged to create a single poly(A) and a single Ribo-Zero alignment. Alignments to the mitochondrial genome were filtered out. These filtered alignments were used to predict transcripts using Cufflinks v2.2.1 [45]. Default parameters were applied except for the options “--library-type” (set to “fr-secondstrand”) and “--min-frags-per-transfrag” (set to 100 rather than the default 10). Putative transcripts were compared to the genome annotation using Cuffmerge and transcripts not associated with annotated features extracted for each of the poly(A) and the Ribo-Zero alignments. These transcripts were merged using Cuffmerge and the resulting file of features used in conjunction with the alignment files to generate read counts for poly(A) and the Ribo-Zero samples using featureCounts v1.4.6 [46].

Genome annotation validation and improvement

Genomic location information for introns annotated in An. funestus gene set AfunF1.2 and the mitochondrial genome was extracted from the genome annotation. Similar location information for introns identified by alignment of RNAseq data, for the poly(A) samples, was extracted and the two datasets were compared to identify introns validated by RNAseq. Similar analyses were done for 5′ and 3′ intron ends alone. For validated introns, sequences spanning the 5′ and 3′ intron ends were extracted and used to identify enriched nucleotide motifs, represented as sequence logos using WebLogo 3 [47, 48].

Manual genome annotation improvement was carried out using Anopheles gambiae gene models as a reference. Starting with translated An. gambiae genes, BLASTp similarity searches were carried out in VectorBase to define the closest matching An. funestus proteins and identify gross differences in protein length. Then, gene models were compared and An. funestus gene models modified to match those of An. gambiae where appropriate. RNAseq alignments were used to guide the An. funestus gene model editing. As transcript ends are difficult to define from RNAseq alignments (due to coverage drop off near to molecule ends), edited gene models included putative protein coding regions but not putative un-translated regions. Edited gene models were submitted to VectorBase [12, 13].

Clustering of sequences from multiple transcriptome datasets and assignment of scaffolds to chromosome arms

Sequences from two transcriptome assemblies were downloaded from NCBI [49]: the “Crawford” transcriptome, comprising 14,850 sequences (accessions EZ966136-EZ980985) and the “Gregory” transcriptome, comprising 18,103 sequences (accessions EZ915182-EZ933284). Sequences from the FUMOZ predicted transcript set AfunF1.2, comprising 13,897 sequences, were downloaded from VectorBase [12, 13]. The three datasets were concatenated and clustered by sequence identity using CD-HIT-EST v4.6 [50], with a clustering threshold of 95 % nucleotide identity.

To tentatively assign An. funestus genome assembly scaffolds to chromosome arms, predicted protein sequences from An. funestus (gene set Afun1.2) and An. gambiae (gene set Agam4.3) were downloaded from VectorBase [12, 13]. Sequence headers were parsed to retain the scaffold/chromosome information and the two datasets concatenated and clustered by peptide sequence identity using CD-HIT v4.6 [50], with a clustering threshold of 75 % amino acid identity. An. funestus scaffolds were assigned to a chromosome arm if they contained at least 5 putative 1:1 An. gambiae orthologues from that chromosome arm. Chromosome arms 2 L and 3R were switched to account for a whole arm translocation between 2 L and 3R in An. funestus relative to An. gambiae.

Availability of supporting data

The RNAseq read data reported in this study were submitted to the European Nucleotide Archive (ENA) under the study accession PRJEB10294 (http://www.ebi.ac.uk/ena/data/view/PRJEB10294) and the sample accessions ERS809802-ERS809809.