Next-generation pyrosequencing of gonad transcriptomes in the polyploid lake sturgeon (Acipenser fulvescens): the relative merits of normalization and rarefaction in gene discovery

Hale, Matthew C; McCormick, Cory R; Jackson, James R; DeWoody, J Andrew

doi:10.1186/1471-2164-10-203

Research article
Open access
Published: 29 April 2009

Volume 10, article number 203, (2009)

BMC Genomics

Matthew C Hale¹,
Cory R McCormick¹,
James R Jackson² &
…
J Andrew DeWoody¹

Abstract

Background

Next-generation sequencing technologies have been applied most often to model organisms or species closely related to a model. However, these methods have the potential to be valuable in many wild organisms, including those of conservation concern. We used Roche 454 pyrosequencing to characterize gene expression in polyploid lake sturgeon (Acipenser fulvescens) gonads.

Results

Titration runs on a Roche 454 GS-FLX produced more than 47,000 sequencing reads. These reads represented 20,741 unique sequences that passed quality control (mean length = 186 bp). These were assembled into 1,831 contigs (mean contig depth = 4.1 sequences). Over 4,000 sequencing reads (~19%) were assigned gene ontologies, mostly to protein, RNA, and ion binding. A total of 877 candidate SNPs were identified from > 50 different genes. We employed an analytical approach from theoretical ecology (rarefaction) to evaluate depth of sequencing coverage relative to gene discovery. We also considered the relative merits of normalized versus native cDNA libraries when using next-generation sequencing platforms. Not surprisingly, fewer genes from the normalized libraries were rRNA subunits. Rarefaction suggests that normalization has little influence on the efficiency of gene discovery, at least when working with thousands of reads from a single tissue type.

Conclusion

Our data indicate that titration runs on 454 sequencers can characterize thousands of expressed sequence tags which can be used to identify SNPs, gene ontologies, and levels of gene expression in species of conservation concern. We anticipate that rarefaction will be useful in evaluations of gene discovery and that next-generation sequencing technologies hold great potential for the study of other non-model organisms.

Background

Massively parallel pyrosequencing via Roche's 454 platform [1] has great potential for identifying genes of interest to ecologists and evolutionary biologists. The 454 approach provides more accurate base calling and deeper sequencing coverage than is possible with conventional Sanger sequencing while dramatically decreasing labor [1–3]. For the most part, pyrosequencing has so far been restricted to model organisms [3–6] or species closely related to a model [7] because its short reads make de novo genome sequencing difficult without a scaffold [8]. However, the depth of coverage provided by the Roche 454 platform means that transcriptomes of non-model organisms can be characterized without a genome sequence. Recently, 454 technology has been applied to transcriptomes of the rose gum tree (Eucalyptus grandis [9]) and the Glanville fritillary butterfly (Melitaea cinxia [10]). These studies illustrate the potential of 454 pyrosequencing to rapidly characterize expressed genes that can be used to address pertinent questions regarding a species' ecology, life history, and evolution [11, 12].

A complete description of expressed sequence tags (ESTs) provides an overview of the transcriptome, i.e., those genes expressed (transcribed) in a given tissue at a specific point in time. Pyrosequencing of ESTs can be used to characterize gene expression [6], QTL [13], single nucleotide polymorphisms (SNPs; [9, 14]), and patterns of selection [9]. The identification of SNPs is especially appealing in non-model species because, as genetic markers, SNPs can illuminate population structure, sex ratios, and genetic variability [15]. Recently, 454 sequencing has been used to identify tens of thousands of SNPs in inbred lines of maize and in E. grandis [9, 14], demonstrating the power of pyrosequencing for SNP discovery.

There are many factors to be considered prior to transcriptome characterization, including the expected number of unique transcripts in the sampled tissue(s) and their relative abundance. Transcriptome sequencing needs to be deep in order to identify rare transcripts and to overcome short read lengths; the absolute depth required is usually unknown, even though it is important for quantifying levels of gene expression. If the relative abundance of transcripts is not critical, normalization can enhance the identification of rare transcripts by reducing the number of overabundant transcripts [16]. This is especially important when sequencing from cDNA pooled from many different organs/tissues or individuals (e.g. [7, 9]). However, in non-model species with poorly characterized genomes, it may not be apparent whether relative gene expression is responsible for an interesting phenotype. To date, the costs and benefits of normalization have been evaluated only in model species where entire genome sequences are available (e.g. Arabidopsis; [3]).

We are interested in the sex determining mechanisms of fishes, and 454 pyrosequencing of gonad transcriptomes may provide a powerful approach for understanding the genetic architecture of sexual differentiation. In particular, we are interested in lake sturgeon (Acipenser fulvescens), a species of conservation concern in North America due to historical overharvest, pollution and habitat fragmentation [17]. The restoration of lake sturgeon populations is complicated by their biology, namely delayed sexual maturity (between 10 and 30 years of age), infrequent spawning (every few years), and sexual monomorphism [18]. With regard to the latter issue, DNA sexing assays have proven invaluable in the conservation of other sexually monomorphic species and could be a great boon for lake sturgeon biologists [19]. However, the search for sturgeon sex determining genes has so far been unsuccessful [18, 20, 21]. This is probably because the genetics of sex determination in fishes is much more complicated (i.e., evolutionarily labile) than in mammals and birds [19, 22, 23]. Furthermore, the lake sturgeon genome is large (5 times the size of the human genome) and polyploid [24].

Herein we present the results of a pilot study designed to characterize gonad transcriptomes in lake sturgeon. To our knowledge, this is the first such study in vertebrates and the first in a polyploid species. Our ultimate goal is to provide an overview of transcription in lake sturgeon gonads, including the discovery of new genes and the SNPs they harbor. Our proximate goals were a) to evaluate the relative merits of normalization for 454 runs and b) to evaluate an ecological technique used to determine species richness (rarefaction) and establish its usefulness in gene discovery. We did so by performing five titration runs on the 454 using two normalized and three non-normalized (hereafter referred to as native) pools of cDNA.

Results

Gene and SNP discovery

Sequence assembly

Libraries 1 and 2 were normalized, whereas libraries 3–5 were native (not normalized). The titration runs produced 3,811, 13,414, 11,178, 10,296 and 8,361 sequences from libraries one through five respectively, for a total of 47,060 reads. These yielded 1,234, 1,385, 8,700, 5,061 and 4,361 reads after quality control (a total of 20,741; Table 1). Mean read length varied across libraries, with longer sequences in the normalized libraries (range 143–232 bp; Table 1; Figure 1). The number of contigs varied from 110 to 578 across libraries and mean read depth per contig varied from 2.8 to 4.9 sequences (Table 2; Figure 2). Figures 1 and 2 represent contig length and contig depth as averaged across the normalized and native libraries. Regression analysis indicated a significant positive relationship between contig length and contig depth (r² = 0.25, slope = 5.861 (± 0.237), p < 0.001).
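
The read tallies in this paragraph can be checked in a few lines. The following is a minimal Python sketch that uses only the counts reported above; the variable names are illustrative and are not part of the original analysis.

```python
# Raw and post-QC read counts for libraries 1-5, as reported in the text.
raw_reads = [3811, 13414, 11178, 10296, 8361]
qc_reads = [1234, 1385, 8700, 5061, 4361]
# Libraries 1 and 2 were normalized; libraries 3-5 were native.
normalized = [True, True, False, False, False]

total_raw = sum(raw_reads)
total_qc = sum(qc_reads)

# Fraction of reads retained after quality control, per library.
retention = [qc / raw for qc, raw in zip(qc_reads, raw_reads)]

# Post-QC reads pooled by library type.
qc_normalized = sum(q for q, is_norm in zip(qc_reads, normalized) if is_norm)
qc_native = total_qc - qc_normalized

print("raw total:", total_raw, "post-QC total:", total_qc)
print("retention per library:", [round(r, 3) for r in retention])
print("post-QC normalized:", qc_normalized, "post-QC native:", qc_native)
```

Pooling by library type shows how unevenly the post-QC reads are split between the normalized and native libraries, which matters when the two are compared at equal sampling effort.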

Table 1 Number of reads and nucleotides produced by five cDNA libraries (L1–L5). Significant BLAST hit criteria were an e-value ≤ 1 × 10^-3 and a bit score > 40.

Table 2 Contig summary statistics from the PCAP analyses of all five libraries (L1–L5).

Xenobiotics

Pyrosequencing from metazoan tissues can produce sequences from endosymbionts [10]. Of the lake sturgeon ESTs that passed quality control (QC), 88.7% blasted back to vertebrates, 6.9% to invertebrates, 3.7% to bacteria, 0.4% to plants, and 0.3% to fungi (Figure 3a). The mean e-value for the top BLAST hits to vertebrates was 9.9 × 10^-05 (± 0.001), compared to 8.4 × 10^-04 (± 0.002) for non-vertebrate taxa. Of the ESTs with a top BLAST hit to vertebrates, 59% had a top BLAST hit to a fish species, 29% matched a mammal, 5.9% a reptile or amphibian, and 6.2% a bird (Figure 3b). The BLAST hits to fishes most frequently matched zebrafish, pufferfish, and trout (Figure 3c).

Gene ontologies

All lake sturgeon reads that passed QC were annotated for sequence similarities using BLASTx against the GenBank database; over 5,000 produced a significant BLASTx hit (Table 1). Gene Ontology (GO) assignments were broken into Molecular Function, Biological Process, and Cellular Components. GO analyses categorized ESTs into one of 115 Molecular Function categories; those most highly represented include protein binding, RNA binding, and zinc ion binding. Sixteen of the 115 Molecular Function categories (13.9%) differed statistically with regard to their representation between normalized and native libraries (Table 3). For example, a much higher proportion of sequences from the native libraries were RNA binding genes (18.3%) than in the normalized libraries (5.3%; P-value = 0.001). A higher proportion of sequences from the normalized libraries were involved in several key pathways (such as ATP binding, oxygen binding, and zinc ion binding) than in the native libraries (see Additional files 1, 2 and 3).
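
The significance criteria stated in the Table 1 caption (e-value ≤ 1 × 10^-3 and bit score > 40) amount to a simple filter over BLAST hits. A minimal Python sketch follows; the `BlastHit` record, the `is_significant` helper, and the example hits are hypothetical and serve only to illustrate the thresholds, not the pipeline actually used in the study.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    """Hypothetical record holding the fields needed for filtering."""
    query_id: str
    subject_id: str
    evalue: float
    bit_score: float


def is_significant(hit, max_evalue=1e-3, min_bit_score=40.0):
    """Return True if the hit has e-value <= 1e-3 and bit score > 40."""
    return hit.evalue <= max_evalue and hit.bit_score > min_bit_score


# Made-up hits for illustration only (not data from the study).
hits = [
    BlastHit("read_0001", "subject_A", 2e-12, 71.2),
    BlastHit("read_0002", "subject_B", 5e-2, 33.5),   # fails both criteria
    BlastHit("read_0003", "subject_C", 1e-3, 40.0),   # bit score is not > 40
    BlastHit("read_0004", "subject_D", 1e-3, 40.1),   # passes at the e-value boundary
]

significant = [h.query_id for h in hits if is_significant(h)]
print(significant)
```

Note that the e-value threshold is inclusive while the bit-score threshold is strict, mirroring the wording of the caption.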

Table 3 Of 115 different Molecular Functions identified in the Gene Ontology analysis, 16 differed significantly in expression between normalized and native libraries.

GO analysis identified 66 Biological Process categories, including translation, muscle cell differentiation, and transport. Fifteen of these 66 (22.7%) were represented unequally between the normalized and native libraries (Table 4). Not surprisingly, genes involved in protein translation exhibited the biggest differences in expression; 12.8% of sequences in the normalized libraries were involved in translation compared to 40.7% of sequences in the native libraries (p < 0.001).
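
To illustrate how a difference such as 12.8% versus 40.7% might be evaluated, the sketch below applies a two-proportion z-test in Python. This is an assumption made for illustration: this section does not state which statistical test was used, and the denominators (500 and 1,000 annotated sequences) are hypothetical values chosen only so that the proportions match those reported.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions (pooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 64/500 = 12.8% (normalized), 407/1000 = 40.7% (native).
z, p_value = two_proportion_z_test(64, 500, 407, 1000)
print(f"z = {z:.2f}, p = {p_value:.3g}")
```

With many GO categories tested at once, a correction for multiple comparisons would normally also be considered; that step is omitted here.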

Table 4 List of genes found within the Biological Process category of the Gene Ontology assignment with a significant difference in expression between normalized and native libraries.

GO analysis identified 62 Cellular Components among our ESTs, including cytoplasm, mitochondrion, and membrane. Twelve of the 62 (19.3%) were represented unequally between the libraries (Table 5). Four gene classes (small cytosolic subunit, large cytosolic subunit, plasma membrane and ribosome) were overrepresented in the native libraries. The other 8 were more abundant in the normalized libraries.

Table 5 List of genes found within the Cellular Component category of the Gene Ontology assignment with a significant difference in expression between normalized and native libraries.

SNP detection

We identified 877 candidate SNPs from 1,840 contigs whose sequences spanned 403,258 bp; this equates to 1 SNP every 460 bp. Of the 877 SNPs, only 16 were insertions/deletions whereas 861 were substitutions. A total of 722 SNPs (82.3%) were singletons present in one read within a contig. Of the 155 SNPs that appeared in multiple reads, 66 occurred in different fish (libraries). Table 6 summarizes a subset of the SNP data: those SNPs found in genes of known function and with a Ts/Tv ratio of less than 1 (more transversions than transitions). These genes could be targets of selection (see Additional file 4 for all SNP data). SNP density varied across genes; this may be due in part to strong historical selection, and the Ts/Tv ratio can help identify such genes [25, 26]. A total of 561 SNPs were transitions and 300 SNPs were transversions, giving a mean Ts/Tv ratio of 1.87 across the transcriptomes. We estimated the Ts/Tv ratio for all contigs that BLASTed to a known gene (see Additional file 5); their frequency distribution is shown in Figure 4.
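
The summary statistics in this paragraph follow directly from the reported counts. A short Python sketch of the arithmetic is given below; it uses only numbers from the text, and the variable names are illustrative.

```python
# Counts reported in the text.
total_snps = 877
indels = 16
transitions = 561
transversions = 300
singletons = 722
contig_bp = 403_258  # total length of the contigs searched for SNPs

substitutions = total_snps - indels
# Transitions and transversions should account for every substitution.
assert transitions + transversions == substitutions

ts_tv = transitions / transversions          # pooled Ts/Tv ratio
bp_per_snp = contig_bp / total_snps          # average spacing between SNPs
singleton_fraction = singletons / total_snps
multi_read_snps = total_snps - singletons    # SNPs seen in more than one read

print(f"Ts/Tv = {ts_tv:.2f}")
print(f"1 SNP per {bp_per_snp:.0f} bp")
print(f"singletons = {singleton_fraction:.1%}")
print(f"SNPs in multiple reads = {multi_read_snps}")
```

The ratio computed here pools all substitutions, which reproduces the reported value of 1.87; per-contig ratios, as plotted in Figure 4, would be calculated separately for each contig.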

Table 6 Subset of those contigs constructed from all five cDNA libraries combined (20,741 individual reads).

Rarefaction and normalization

Rarefaction analysis was pioneered in theoretical ecology, where it was used to evaluate species richness [27, 28]. It has more recently been adopted in population genetics [29–31] and it has potential utility in genomics, particularly with regard to gene discovery as a function of effort. Figure 5 illustrates the difference in gene discovery rates in normalized and native libraries. Figure 5a plots our empirical data with the curves truncated at ~2,600 reads (the number present in the normalized libraries after quality control), whereas Figure 5b shows simulated data based on the projection of the curves in Figure 5a. With both empirical and simulated data, the rate of gene discovery is moderately faster in the normalized libraries, but only up to ~4,000 sequences.
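
As an illustration of how rarefaction applies to gene discovery, the Python sketch below computes the expected number of distinct genes in a random subsample of n reads, using the standard hypergeometric rarefaction formula from ecology with genes in place of species and reads in place of individuals. It is a sketch under stated assumptions: the function name and the two toy libraries are hypothetical, and the software actually used to produce Figure 5 is not described in this section.

```python
from math import comb


def rarefaction_expected_genes(counts, n):
    """Expected number of distinct genes in a random subsample of n reads.

    counts: number of reads assigned to each gene in the full library.
    Each term is the probability that a gene appears at least once when
    n reads are drawn without replacement.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of reads")
    denom = comb(total, n)
    # comb(total - c, n) is 0 when n > total - c, so such genes are always seen.
    return sum(1 - comb(total - c, n) / denom for c in counts)


# Toy libraries (hypothetical, 100 reads and 10 genes each).
skewed = [50, 20, 10, 5, 5, 3, 3, 2, 1, 1]      # a few dominant transcripts
even = [12, 11, 11, 10, 10, 10, 9, 9, 9, 9]     # abundances evened out

for n in (5, 20, 50, 100):
    print(n,
          round(rarefaction_expected_genes(skewed, n), 2),
          round(rarefaction_expected_genes(even, n), 2))
```

In these toy data the evened-out library reaches a given number of genes with fewer reads, and both curves converge once every read is sampled. Truncating the comparison at the size of the smaller library, as in Figure 5a, keeps the sampling effort equal.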