Data description

Background

Fopius arisanus is an egg-pupal parasitoid of Tephritid fruit flies. Its importance as a biological control agent for these invasive and damaging pests stems from the fact that it is an egg parasitoid and can therefore parasitize flies across a broad range of Tephritid species during their early developmental stages [1]. In Hawaii, it was estimated that F. arisanus constitutes up to 95 % of the parasitoid guild, and that levels of parasitism in the oriental fruit fly (Bactrocera dorsalis) range between 65 % and 70 %, significantly reducing the infestation of fruits by these flies [2]. However, for some other fly species, such as Bactrocera cucurbitae (melon fly), F. arisanus was reported to have low parasitism rates [3, 4]. It is also known that this parasitoid wasp is able to discriminate between hosts depending on the fruit substrate on which they feed [3]. Foundational genomic and transcriptomic information for this species would help scientists to understand the mechanisms underlying parasitoid behavior, describe the physiology and biology of host selection and host–parasitoid interactions, design better biological control strategies, and develop monitoring tools for parasitism rates in the field.

Samples

Samples were derived from a research colony of F. arisanus maintained on B. dorsalis at the US Department of Agriculture–Agricultural Research Service (USDA–ARS) Daniel K. Inouye Pacific Basin Agricultural Research Center Insectary in Hilo, Hawaii, USA. Wasp larvae, pupae, and male and female adults were obtained in order to generate samples representative of a broad range of life stages and ages. In brief, a cohort of B. dorsalis eggs was exposed to mated F. arisanus females for approximately 24 h. Larvae and pupae from the cohort of exposed B. dorsalis eggs were dissected in order to target larval and pupal stages of F. arisanus. When an F. arisanus individual was found, it was carefully removed from the egg, rinsed in sterile water, and snap-frozen in liquid nitrogen. Adult males and females were obtained after their emergence from parasitized pupae. For each developmental stage, an effort was made to collect individuals of varying ages within that stage (i.e. corresponding to each developmental instar), so as to encompass as many stage-specific genes as possible. For this purpose, daily collections were made across a developmental stage, total RNA was extracted from each sample, and RNA samples collected from the same developmental stage were then pooled in equimolar concentrations. These samples have been identified as NCBI BioSamples SRS691550, SRS691551, SRS69153, and SRS691554, associated with BioProject PRJNA259570. RNA was extracted from each sample set using the Zymo Quick-RNA MiniPrep Extraction kit (Zymo Research, Irvine, California, USA) following the recommended procedures for each tissue. The extracted RNA was then quantified with the Qubit Broad Range RNA assay on a Qubit 2.0 fluorometer (Life Technologies, Carlsbad, California, USA). The size and quality of the total RNA were determined with an RNA 6000 Nano Chip on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, California, USA).

Sequencing

Total RNA was sent to the Beijing Genomics Institute (BGI Americas, University of California, Davis, California, USA) and eukaryotic mRNA libraries were prepared using TruSeq technology (TruSeq RNA Sample Prep Kit v2). The resulting four libraries (larvae, pupae, adult male, and adult female) were barcoded and sequenced together on a single lane of the Illumina HiSeq 2000 sequencing system, generating approximately 44.48 Gb of raw data from approximately 211 million 2 × 100 bp paired-end reads. These raw reads were filtered by quality and for adapter contamination using an in-house pipeline at BGI, which removed reads containing adapter sequences, reads with more than 5 % ambiguous bases, and reads in which more than 50 % of bases had a Phred quality score below 10. After filtering, data were reduced by approximately 6 % to 42.15 Gb. These filtered data were used for de novo assembly, and were also deposited in the NCBI Sequence Read Archive (SRA) under accessions SRX689037, SRX689038, SRX689040, and SRX689041, associated with BioProject PRJNA259570.
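The three read-level criteria above can be illustrated with a minimal Python sketch. This is not the BGI in-house pipeline, which is not described in further detail here. The function names, the exact-substring adapter match, the Phred+33 quality encoding, and the choice to drop a pair when either mate fails are all assumptions made for illustration.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def passes_filter(seq, quals, adapters=(),
                  max_n_frac=0.05, low_q=10, max_low_frac=0.50):
    """Return True if a read survives the three criteria in the text.

    seq      -- read sequence
    quals    -- list of integer Phred scores, one per base
    adapters -- adapter sequences to screen for (hypothetical input;
                matched here as exact substrings only)
    """
    n = len(seq)
    if n == 0:
        return False
    # Criterion 1: discard reads containing adapter sequence.
    if any(adapter in seq for adapter in adapters):
        return False
    # Criterion 2: discard reads with more than 5 % ambiguous (N) bases.
    if seq.upper().count("N") / n > max_n_frac:
        return False
    # Criterion 3: discard reads with more than 50 % of bases below Q10.
    low_quality_bases = sum(1 for q in quals if q < low_q)
    if low_quality_bases / n > max_low_frac:
        return False
    return True


def pair_passes(read1, read2, adapters=()):
    """Keep a read pair only if both mates pass (an assumption).

    Each read is a (sequence, quality_string) tuple.
    """
    return all(passes_filter(seq, phred33(qual), adapters)
               for seq, qual in (read1, read2))
```

Both thresholds are written as strict inequalities, so a read with exactly 5 % ambiguous bases or exactly 50 % low-quality bases is retained, matching the "more than" wording of the criteria.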

Transcriptome assembly

A single representative de novo assembly was generated from a concatenation of the four libraries using the Trinity pipeline (r2014_07–17) [5, 6]. In brief, reads were normalized in silico to 50x coverage, and then assembled using default Trinity parameters (except for the addition of the ‘--jaccard_clip’ flag to reduce transcript fusions from non-strand-specific data). After assembly, transcript- and unigene-level expression values were calculated using RSEM [7], and open reading frames (ORFs) were predicted with TransDecoder [6], including those with a detectable Pfam-A domain based on a HMMER3 search. Next, the raw transcriptome was filtered to discard poorly supported transcripts and to retain transcripts with strong evidence of protein-coding regions and reasonable support for expression. To do this, we used Transvestigator [8], filtering the assembly with parameters set to retain only those transcripts with a transcripts per million (TPM) value greater than 0.5, transcript isoforms representing at least 5 % of the abundance of the parent unigene, and transcripts with a predicted ORF. Transvestigator was also used to prepare the data for NCBI Transcriptome Shotgun Assembly (TSA) submission by ensuring that the predicted ORF was on the positive strand. This confirmed a single ORF per transcript, and generated an NCBI .tbl file for submission. In addition to the filters described above, since the larval and pupal samples were derived from the dissection of B. dorsalis, any protein sequence with a BLASTp match containing no more than one mismatch at the amino acid level to a B. dorsalis protein (acquired from previously published B. dorsalis transcriptome and genome datasets, NCBI accessions GAKP00000000.1 and GCF_000789215.1) was flagged, and the parent unigene and all transcripts derived from that unigene were discarded. This resulted in the removal of 496 host-derived transcript sequences. Statistics on unfiltered and filtered assemblies are detailed in Table 1.
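The filtering logic described above can be sketched in a few lines of Python. This is an illustration of the stated criteria only, not the Transvestigator implementation. The `Transcript` record, the function name, and the computation of isoform abundance as transcript TPM divided by the summed TPM of the parent unigene are assumptions. The set of host-flagged transcript identifiers is taken as a precomputed input (for example, derived from the BLASTp comparison against B. dorsalis proteins).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    tid: str        # transcript identifier (hypothetical naming)
    unigene: str    # parent unigene identifier
    tpm: float      # transcript-level TPM
    has_orf: bool   # True if an ORF was predicted for this transcript


def filter_transcripts(transcripts, host_flagged=frozenset(),
                       min_tpm=0.5, min_iso_frac=0.05):
    """Apply the expression, isoform, ORF, and host filters from the text."""
    # Sum TPM per unigene so each isoform's share can be computed.
    gene_tpm = defaultdict(float)
    for t in transcripts:
        gene_tpm[t.unigene] += t.tpm

    # One flagged transcript removes its whole parent unigene.
    host_genes = {t.unigene for t in transcripts if t.tid in host_flagged}

    kept = []
    for t in transcripts:
        if t.unigene in host_genes:
            continue                      # host-derived unigene
        if not t.has_orf:
            continue                      # no predicted ORF
        if t.tpm <= min_tpm:
            continue                      # TPM must be greater than 0.5
        if t.tpm / gene_tpm[t.unigene] < min_iso_frac:
            continue                      # below 5 % of unigene abundance
        kept.append(t)
    return kept
```

Because the TPM check runs before the isoform-fraction check, the division is only reached for transcripts with positive TPM, so the unigene total is never zero at that point.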

Table 1 Transcriptome assembly and annotation statistics for F. arisanus

Annotation

Annotation was performed at the peptide level, and these annotations were used to generate a transcript name and product, as well as functional annotations. All predicted proteins were subjected to analysis using InterProScan5 to search all available databases, including gene ontology and InterPro term lookup. In addition, proteins were subjected to a BLASTp search against the UniProtKB/SwissProt database (downloaded 10 November 2013). Annotation information was pulled from these results using Annie [8], which assigns gene names and products by cross-referencing SwissProt BLAST hits, and performs database cross-referencing from InterProScan5 results. The resulting annotation file was provided to Transvestigator, as described above, to include functional annotations in the resulting .gff3 and .tbl files (described at [8]).

Orthology-based comparison of F. arisanus proteins to existing hymenopteran parasitoid genome annotation sets

Transcriptome data were compared with gene sets of four other parasitic wasps: Copidosoma floridanum (CFLO draft peptide set, i5k workspace [9]), Orussus abietinus (Parasitic Wood Wasp, OABI draft peptide set, i5k workspace), Trichogramma pretiosum (TPRE draft peptide set, i5k workspace), and Nasonia vitripennis (Jewel Wasp, Nvit_OGSv1 [10]) (Fig. 1). In addition, data from Apis mellifera (European Honey Bee, amel_OGSv3.2 [10]) were used to provide a comparison with a non-wasp hymenopteran species. Orthologous groups between predicted proteins for these species were identified using OrthoMCL [11, 12] with default parameters. Data were summarized to identify orthologs shared between species (Fig. 2). Peptide sequences for each species, and a putative ortholog list between species, are presented in the GigaDB accession associated with this publication [13].
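As an illustration of how ortholog groups can be summarized across species, the following Python sketch parses an OrthoMCL-style groups file, in which each line has the form `group_id: taxon|protein taxon|protein ...`, and tallies the groups by the set of species they contain. It is a minimal sketch and not the script used to produce Fig. 2. The function names are hypothetical, as are the taxon codes FARI and NVIT used in the usage note (CFLO, OABI, and TPRE follow the peptide set names above).

```python
from collections import Counter


def parse_groups(lines):
    """Parse 'group_id: taxon|protein ...' lines into {group: [(taxon, protein)]}."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, members = line.split(":", 1)
        # Split on the first '|' only, in case protein IDs contain '|'.
        groups[name.strip()] = [tuple(m.split("|", 1)) for m in members.split()]
    return groups


def sharing_pattern_counts(groups):
    """Count groups per species combination (one count per Venn region)."""
    counts = Counter()
    for members in groups.values():
        counts[frozenset(taxon for taxon, _ in members)] += 1
    return counts


def proteins_with_orthologs(groups):
    """Per species, count proteins in groups that include another species."""
    counts = Counter()
    for members in groups.values():
        if len({taxon for taxon, _ in members}) > 1:
            for taxon, _ in members:
                counts[taxon] += 1
    return counts
```

For example, given the three lines `OG1: FARI|p1 CFLO|c1 NVIT|n1`, `OG2: FARI|p2 FARI|p3`, and `OG3: FARI|p4 NVIT|n2 NVIT|n3`, `proteins_with_orthologs` counts two FARI proteins, one CFLO protein, and three NVIT proteins, because the single-species group OG2 contributes paralogs only.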

Fig. 1

Comparison of F. arisanus transcriptome assembly to related hymenopteran parasitoids. Distribution of (a) transcript length and (b) predicted protein length of the F. arisanus transcriptome compared to published transcript and protein sets from related hymenopteran genomes (Copidosoma floridanum, Orussus abietinus [parasitic wood wasp], Trichogramma pretiosum, Nasonia vitripennis, and Apis mellifera) available on NCBI or the i5k web space (i5k.nal.usda.gov, [9])

Fig. 2

Putative orthologs between parasitoid genomes. Venn diagram showing the number of orthologs shared between five different parasitoid wasp species (Copidosoma floridanum, Orussus abietinus [parasitic wood wasp], Trichogramma pretiosum, Nasonia vitripennis, and Fopius arisanus) available on NCBI or the i5k web space (i5k.nal.usda.gov, [9]). Inset tree was constructed utilizing COI (cytochrome c oxidase subunit 1 mitochondrial region) sequences using maximum likelihood and rooted with A. mellifera to show relative phylogenetic relatedness of species. Nodes showed >90 % reliability after bootstrapping. Numbers in parentheses after the species name are the number of orthologous proteins (orthologous to at least one of the other species analyzed) and total number of predicted proteins for the respective genome annotations

Availability of supporting data and materials

The raw datasets supporting the results of this article, including unfiltered assembly results, protein predictions, BLAST results, annotations, and orthology files, are available in the GigaScience repository [13]. Filtered data used for de novo assembly are deposited in the NCBI Sequence Read Archive (SRA) under accessions SRX689037, SRX689038, SRX689040, and SRX689041, associated with BioProject PRJNA259570.