Massively parallel pyrosequencing via Roche's 454 platform [1] has great potential for identifying genes of interest to ecologists and evolutionary biologists. The 454 approach provides more accurate base calling and deeper sequencing coverage than is possible with conventional Sanger sequencing while dramatically decreasing labor [13]. For the most part, pyrosequencing has so far been restricted to model organisms [36] or species closely related to a model [7] because of the short reads which make de novo genome sequencing difficult without a scaffold [8]. However, the depth of coverage provided by the Roche 454 platform means that transcriptomes of non-model organisms can be characterized without a genome sequence. Recently, 454 technology has been applied to transcriptomes of the rose gum tree (Eucalyptus grandis [9]) and the Glanville fritillary butterfly (Melitaea cinxia [10]). These studies illustrate the potential of 454 pyrosequencing to rapidly characterize expressed genes that can be used to address pertinent questions regarding a species' ecology, life history, and evolution [11, 12].

A complete description of expressed sequence tags (ESTs) provides an overview of the transcriptome, those genes expressed (transcribed) in a given tissue at a specific point in time. Pyrosequencing of ESTs can be used to characterize gene expression [6], QTL [13], single nucleotide polymorphisms (SNPs; [9, 14]), and patterns of selection [9]. The identification of SNPs is especially appealing in non-model species because as genetic markers, SNPs can illuminate population structure, sex ratios, and genetic variability [15]. Recently, 454 sequencing has been used to identify tens of thousands of SNPs in inbred lines of maize and in E. grandis [14, 9] demonstrating the power of pyrosequencing for SNP discovery.

There are many factors to be considered prior to transcriptome characterization, including the expected number of unique transcripts in the sampled tissue(s) and their relative abundance. When it comes to sequencing effort, transcriptome sequencing needs to be deep in order to identify rare transcripts and to overcome short read lengths, but the absolute depth required is usually unknown although important for quantifying levels of gene expression. If the relative abundance of transcripts is not critical, normalization can enhance the identification of rare transcripts by reducing the number of overabundant transcripts [16]. This is especially important when sequencing from cDNA pooled from many different organs/tissues or individuals (e.g. [7, 9]). However, in non-model species with poorly characterized genomes, it may not be apparent if relative gene expression is responsible for an interesting phenotype. To date, the costs and benefits of normalization have been evaluated only in model species where entire genome sequences are available (e.g. Arabidopsis; [3]).

We are interested in the sex determining mechanisms of fishes, and 454 pyrosequencing of gonad transcriptomes may provide a powerful approach for understanding the genetic architecture of sexual differentiation. In particular, we are interested in lake sturgeon (Acipenser fulvescens), a species of conservation concern in North America due to historical overharvest, pollution and habitat fragmentation [17]. The restoration of lake sturgeon populations is complicated by their biology, namely delayed sexual maturity (between 10 and 30 years of age), infrequent spawning (every few years), and sexual monomorphism [18]. With regard to the latter issue, DNA sexing assays have proven invaluable in the conservation of other sexually monomorphic species and could be a great boon for lake sturgeon biologists [19]. However, the search for sturgeon sex determining genes has so far been unsuccessful [18, 20, 21]. This is probably because the genetics of sex determination in fishes is much more complicated (i.e., evolutionary labile) than in mammals and birds [19, 22, 23]. Furthermore, the lake sturgeon genome is large (5 times the size of humans) and polyploid [24].

Herein we present the results of a pilot study designed to characterize gonad transcriptomes in lake sturgeon. To our knowledge, this is the first such study in vertebrates and the first in a polyploid species. Our ultimate goal is to provide an overview of transcription in lake sturgeon gonads, including the discovery of new genes and the SNPs they harbor. Our proximate goals were a) to evaluate the relative merits of normalization for 454 runs and b) to evaluate an ecological technique used to determine species richness (rarefaction) and establish its usefulness in gene discovery. We did so by performing five titration runs on the 454 using two normalized and three non-normalized (hereafter referred to as native) pools of cDNA.


Gene and SNP discovery

Sequence assembly

Libraries 1 and 2 were normalized whereas native libraries 3–5 were not. The titration runs produced 3,811, 13,414, 11,178, 10,296 and 8,361 sequences from libraries one through five respectively, for a total of 47,060 reads. These yielded 1,234, 1,385, 8,700, 5,061 and 4,361 reads after quality control (a total of 20,741; Table 1). Mean read length varied across libraries, with longer sequences in the normalized libraries (range 143–232 bp; Table 1; Figure 1). The number of contigs varied from 110 to 578 across libraries and mean read depth per contig varied from 2.8 to 4.9 sequences (Table 2; Figure 2). Figures 1 and 2 represent contig length and contig depth as averaged across the normalized and native libraries. Regression analysis indicated there was a significant positive relationship between contig length and contig depth (r2 = 0.25, slope = 5.861 (± 0.237), p = < 0.001).

Figure 1
figure 1

Length of contigs as averaged (± SD) between the normalized and native libraries. Note the logarithmic axes.

Figure 2
figure 2

Depth of contigs as averaged (± SD) between the normalized and native libraries.

Table 1 Number of reads and nucleotides produced by five cDNA libraries (L1–L5). Significant BLAST hit criteria were an e-value ≤ 1 × 10-3 and a bit score > 40.
Table 2 Contig summary statistics from the PCAP analyses of all five libraries (L1–L5).


Pyrosequencing from metazoan tissues can produce sequences from endosymbionts [10]. Of the lake sturgeon ESTs that passed quality control (QC), 88.7% blasted back to vertebrates, 6.9% to invertebrates, 3.7% to bacteria, 0.4% to plants, and 0.3% to fungi (Figure 3a). The mean e-value for the top BLAST hits to vertebrates was 9.9 × 10-05 (± 0.001), compared to 8.4 × 10-04 (± 0.002) for non-vertebrate taxa. Of the ESTs with a top BLAST hit to vertebrates, 59% had a top BLAST hit to a fish species, 29% matched a mammal, 5.9% a reptile or amphibian, and 6.2% a bird (Figure 3b). The BLAST hits to fishes most frequently matched zebrafish, pufferfish, and trout (Figure 3c).

Figure 3
figure 3

Taxonomic distribution of the BLAST results. (a) The percentage of BLAST hits that are vertebrate, invertebrate, bacteria, plant, and fungi. (b) The percentage of vertebrate blast hits that are mammals, birds, herps, and fish. (c) The percentage of BLAST fish hits that are zebrafish, trout, puffer fish, sole, flounder, sturgeon, and other fish species.

Gene ontologies

All lake sturgeon reads that passed QC were annotated for sequence similarities using BLASTx against the GenBank database; over 5,000 produced a significant BLASTx hit (Table 1). Gene Ontology (GO) assignments were broken into Molecular Function, Biological Process, and Cellular Components. GO analyses categorized ESTs into one of 115 Molecular Function categories; those most highly represented include protein binding, RNA binding, and zinc ion binding. Sixteen of the 115 Molecular Function categories (13.9%) differed statistically with regard to their representation between normalized and native libraries (Table 3). For example, a much higher proportion of sequences from the native libraries were RNA binding genes (18.3%) than in the normalized libraries (5.3%; P-value = 0.001). A higher proportion of sequences from the normalized libraries were involved in several key pathways (such as ATP binding, oxygen binding, and zinc ion binding) than in the native libraries (see Additional files 1, 2 and 3).

Table 3 Of 115 different Molecular Functions identified in the Gene Ontology analysis, 16 differed significantly in expression between normalized and native libraries.

GO analysis identified 66 Biological Process categories, including translation, muscle cell differentiation, and transport. Fifteen of these 66 (22.7%) were represented unequally between the normalized and native libraries (Table 4). Not surprisingly, genes involved in protein translation exhibited the biggest differences in expression; 12.8% of sequences in the normalized libraries were involved in translation compared to 40.7% of sequences in the native libraries (p < 0.001).

Table 4 List of genes found within the Biological Process category of the Gene Ontology assignment with a significant difference in expression between normalized and native libraries.

GO analysis identified 62 Cellular Components among our ESTs, including cytoplasm, mitochondrion, and membrane. Twelve of the 62 (19.3%) were represented unequally between the libraries (Table 5). Four gene classes (small cytosolic subunit, large cytosolic subunit, plasma membrane and ribosome) were overrepresented in the native libraries. The other 8 were more abundant in the normalized libraries.

Table 5 List of genes found within the Cellular Component category of the Gene Ontology assignment with a significant difference in expression between normalized and native libraries.

SNP detection

We identified 877 candidate SNPs from 1,840 contigs whose sequences spanned 403,258 bp; this equates to 1 SNP every 460 bp. Of the 877 SNPs, only 16 were insertion/deletions whereas 861 were substitutions. A total of 722 SNPs (82.3%) were singletons present in one read within a contig. Of the 155 SNPs that appeared in multiple reads, 66 occurred in different fish (libraries). Table 6 summarizes a subset of the SNP data, those SNPs found in genes of known function, and with a Ts/Tv ratio of less than 1 (more transversions than transitions) these genes could be targets of selection (see Additional file 4 for all SNP data). SNP density varied across genes, this may be due in part to strong historical selection, and the Ts/Tv ratio can help identify such genes [25, 26]. A total of 561 SNPs were transitions and 300 SNPs were transversions, giving a mean Ts/Tv ratio of 1.87 across the transcriptomes. We estimated the Ts/Tv ratio for all contigs that BLASTed to a known gene (see Additional file 5); their frequency distribution is shown in Figure 4.

Figure 4
figure 4

Frequency plot of Ts/Tv ratios for 60 contigs with a significant BLAST hit. Points represent the number of contigs with that specific Ts/Tv ratio. Ts/Tv ratios greater than 1 indicate an excess of transitions, those lower than 1 indicate an excess of transversions. A line of best fit is plotted through the data.

Table 6 Subset of those contigs constructed from all five cDNA libraries combined (20,741 individual reads).

Rarefaction and normalization

Rarefaction analysis was pioneered in theoretical ecology, where it was used to evaluate species richness [27, 28]. It has more recently been adopted in population genetics [2931] and it has potential utility in genomics, particularly with regard to gene discovery as a function of effort. Figure 5 illustrates the difference in gene discovery rates in normalized and native libraries. Figure 5a plots our empirical data with the curves truncated at 260 reads (the number present in the normalized libraries), whereas Figure 5b shows simulated data based on the projection of the curves in Figure 5a. When using both empirical and simulated data, the rate of gene discovery is moderately faster in the normalized libraries, but only up to ~4,000 sequences.

Figure 5
figure 5

Rarefaction curves illustrating gene discovery (number of genes) as a function of effort (number of sequencing reads). (a) Rarefaction analysis comparing empirical gene discovery between normalized and native libraries. (b) Rarefaction analysis comparing simulated gene discovery between normalized and native libraries.


Gene and SNP discovery

Sequence assembly

We performed five Roche 454 titration runs on lake sturgeon cDNA samples with the ultimate goal of characterizing (for the first time) a polyploid vertebrate transcriptome. The lake sturgeon data compare favorably with other 454 cDNA studies of non-model species, both in terms of read length and contig assembly. Mean length of lake sturgeon reads after removing primers, linkers, and substandard bases was 190 bp, in between the 111 bp and 209 bp reported in similar non-model 454 studies [9, 10]. Likewise, our mean contig length (218 bp) and depth (4.1 reads) were similar to the fritillary assembly (197 bp and 2.9 reads) and the Eucalypt assembly (353 bp and 9.9 sequences). Reads generated by 454 sequencing are generally shorter than those produced by dideoxy sequencing, but the base-calling accuracy of the 454 is equivalent or even superior to conventional sequencing [1]. Furthermore, the absolute number of reads produced in a run can help increase overall accuracy.


In the Glanville Fritillary butterfly, 454 transcriptomics provided evidence of microsporidia, intracellular parasites that affect insect population dynamics [32, 10]. Thus it is possible that some of the sequences in our data set are from xenobiotics that are commensal with lake sturgeon. However, 88.7% of our sequences with a significant BLAST hit best matched a vertebrate, and of these 59% best matched another fish suggesting the vast majority of our sequences originated from lake sturgeon. Interestingly, 156 (3.7%) of our sequences BLASTed to a bacterial species. This could be due to noise (i.e., the lack of closely related sequences in GenBank) or to true xenobiotics that are lake sturgeon commensals, pathogens, or parasites. Of these 156 sequences, the average top BLAST hit had an e-value of 1.6 × 10-04, an order of magnitude less than the average top BLAST hits to vertebrates (e-value = 9.9 × 10-05) but still high, suggesting the presence of xenobiotics in our gonad samples. The top BLAST hits to bacteria were from a variety of species, of which 6.4% were aquatic. Of the reads that BLASTed to bacteria, 14.7% were to pathogenic species. Because our sequences were derived from reproductive tissues, some of the bacterial sequences could belong to sexually transmitted pathogens or Wolbachia-type microorganisms, although our preliminary analyses revealed no hits to sexually transmitted microorganisms. Further investigation of lake sturgeon xenobiotics could prove informative regarding coevolution, disease, and infection.

Gene ontologies

In principle, GO analysis can help gauge the merits and demerits of normalization. By comparing GO results from the normalized and native libraries, some interesting trends emerge. For instance, 32% of the annotated genes within the GO Molecular Function category were singletons in the native libraries compared to 6% in the normalized libraries. [These numbers include only those reads that produced a significant BLAST hit; due to the lack of related sequences in Genbank, most (74%) reads did not produce a significant BLAST hit.] Results from the Biological Process category were similar, with singletons comprising 32% of the native libraries compared to 15% in the normalized libraries. In terms of Molecular Function, the most striking result is the abundance of rRNA genes (as shown by the terms "Ribosome biogenesis" and "RNA binding" in Table 3). A total of 18.2% of all sequences from the native libraries are RNA binding genes as compared to 5.3% from the normalized libraries. In other words, normalization is effective at tempering the signature of abundant transcripts exactly as expected. However, even without normalization more than 60% of transcripts were not rRNAs. This suggests that many transcriptomics studies should forego normalization, as native libraries allow one to answer questions regarding gene dosage (i.e., relative expression levels) whereas normalized libraries do not. Clearly, this decision depends on the depth of coverage expected and the design of the study.

SNP detection

We identified a total of 877 candidate SNPs in lake sturgeon, about one every 460 bp. This compares to one in every 192 bp in Eucalypt [9] and one in 214 bp in maize [14]. Previous studies have used Sanger sequencing to confirm ~80% of SNPs initially identified via 454 pyrosequencing [9]. Due to the polyploid nature of the lake sturgeon genome, we might have expected more diversity than in the diploids (because of duplicate genes), but as mentioned earlier lake sturgeon are of conservation concern. Our source population was established from a few wild fish less than 20 years ago [33], and thus our SNP data may not reflect broader patterns of genetic diversity in lake sturgeon. The SNP frequency estimate of 1 per ~460 bp includes all SNPs identified in all contigs in the data set. However, the SNP frequency in contigs that produced a significant BLAST hit was 1 per ~142 bp. This is due to the removal of short contigs that are less likely to both a) contain SNPs and b) produce a significant BLAST hit. This bias is especially prevalent in our data as the majority of contigs are ≤ 300 bps.

Our SNP discovery requirements were stringent (see Methods) and our estimates of SNP density should be quite conservative. For example, we documented only 16 indels despite the susceptibility of 454 sequencing technology to indel-type errors [1]. Sixty-six of the candidate SNPs we identified were seen in multiple fish; these could be orthologous intraspecific polymorphisms (useful for population genetic analyses) or intragenomic polymorphisms due to gene duplications (useful for reconstructing gene trees). Population surveys will be required to discriminate among these alternatives. Eighteen candidate SNPs were unique to one sex; 16 were unique to males (i.e., found in multiple male libraries but not in female libraries) and 2 unique to female (i.e., found in both female libraries but not in male libraries). Three contigs that had SNPs unique to males produced a significant BLAST hit; contigs 27.1, 58.1 and 258.1, which produced a top BLAST hit to ORF1, transposase, and a muscle binding protein respectively. None of these genes are expected a priori to be involved in sexual differentiation, but further research will determine if these candidate SNPs might play a role in sex determination or will prove useful for sex identification.

We investigated the Ts/Tv ratio of SNPs in genes with a significant BLAST hit to help identify potential targets of selection, as transversions are rarer than transitions in neutral genes [25]. The plot of Ts/Tv ratios (Figure 4) shows that the majority of genes have either a Ts/Tv ratio near 1 (equal number of transitions and transversions) or greater than 1 (more transitions than transversions). Only seven contigs produced a Ts/Tv ratio < 1 (more transversions than transitions). These contigs produced a top BLAST hit to ORF2, Calmodulin, XXC finger 6, Ribosomal rRNA S20, endonuclease reverse transcriptase, polyprotein, and Tc1-like transposase. Of these, Tc1-like transposase may be of interest as studies on salmon have found similar signatures of selection on this gene [34]. Tc1-like transposase is co-regulated with other genes implicated in the immune response, signal transduction, and regulation of transcription. Moreover, Tc1-like domains reside in a number of immune and stress-related salmonid genes [34] and probably perform similar functions in lake sturgeon.

Rarefaction and normalization

Rarefaction has proven valuable in ecology and population genetics (27–31); in this paper we demonstrate that rarefaction is also useful in the context of gene discovery. In the lake sturgeon, every 5.2 reads (on average) resulted in a different significant BLAST hit (Figure 5). To truly evaluate gene expression, sequencing effort must be sufficient to detect each RNA species and its dosage. In other words, the absolute depth of sequencing coverage required is a function of the total number of transcripts and their relative abundance. cDNA normalization is often employed when gene discovery (as opposed to gene dosage) is the primary consideration [7, 10]. Normalization effectively reduces the expression of common genes, thus enhancing detection of rare transcripts relative to a native library. However, in many cases (e.g., sex determining genes) it may not be clear a priori if dosage is important [35]. Our data indicate the depth of sequencing coverage provided by next-generation sequencing platforms may obviate the need for normalization. For example, the proportion of singletons following contig assembly was ~65% of the reads in the normalized libraries and ~67% in the native libraries. Thus, normalization has little impact on the discovery of rare transcripts when thousands of sequences are considered. The rarefaction curves in Figure 5b illustrate this point; after ~5000 sequences, there is no real advantage to normalization. Given that next-generation pyrosequencing approaches all generate tens or hundreds of thousands of sequences, our data suggest that normalization is not usually necessary and can obscure the relative gene expression data that can be inferred from native pools of cDNA.


We have demonstrated that it is possible to use 454 sequencing methods to rapidly characterize transcriptomes in non-model species of conservation concern, notably a polyploid vertebrate with a large genome. We have identified over 5000 ESTs from lake sturgeon that are represented by similar sequences in GenBank; we also identified 877 candidate SNPs. We used Ts/Tv ratios to identify specific ESTs that might have been targets of selection, and we determined that lake sturgeon SNPs occur (on average) every few hundred bp in transcribed genes. Titration runs indicate that normalization is not normally necessary when thousands of sequences from a single tissue type are considered. Finally, we have utilized ecological rarefaction in a genomics context to gauge the relative intensity of sampling required for gene discovery (i.e., sequencing depth). Overall, pyrosequencing has great potential to rapidly identify genes of interest to ecologists and evolutionary biologists, including those from species of conservation concern.


RNA isolation and cDNA library construction

Gonads were sampled from five different lake sturgeon. One sample was collected from a juvenile lake sturgeon of unknown sex estimated to be 3–4 years of age; this fish died in the Purdue hatchery one night and was frozen whole the following morning for subsequent biopsy. Two adolescent males and two adolescent female lake sturgeon were sampled from Lake Oneida, New York in June 2007 and May 2008. Based on stocking records and established length-at-age relationships [33], lake sturgeon collected in 2007 were most likely 12 years of age and those in 2008 were 13. From these four fish, we collected gonad biopsies with the surgical assistance of veterinarians from Cornell University. The two male fish expressed milt during handling and were apparently ready to mate, whereas gametes from the two females were not fully developed [36]. Biopsies were immediately frozen in liquid nitrogen and RNA was extracted using TRIZOL reagent (Invitrogen) following the manufacturer's protocol but with the addition of an extra isolation step: following homogenization, insoluble material was removed by centrifugation at 12,000 × g for 10 minutes. The resulting supernatant containing the RNA was removed and the protocol proceeded as normal. The resulting RNA pellet was resuspended in 50 μl of RNase free water. Quantity and quality of total RNA was analysed using a spectrophotometer (Nanodrop) and by gel electrophoresis.

Five cDNA libraries were constructed. Library 1 was from the 2007 adolescent male, library 2 from the 2007 adolescent female, library 3 from the unknown juvenile, library 4 from the 2008 adolescent male, and library 5 from the 2008 adolescent female. Libraries 1, 2, and 5 were constructed from 1 μg of RNA whereas libraries 3 and 4 were constructed with 1.5 μg of RNA. Libraries 1 and 2 were constructed using the ClonTech SMART cDNA kit, whereas libraries 3, 4 and 5 used a modified version of the kit: the CDS III/3' primer we used was (5'-TAGAGGCCGAGGCGGCCGACATGTTTTGTTTTTTTTTCTTTTTTTTTTVN) to avoid long homopolymer repeats unsuitable for 454 sequencing. All other primers were unmodified from the ClonTech protocol [37]. For all five libraries, cDNA was amplified using PCR Advantage II polymerase (ClonTech) and the following thermal profile: 1 min at 95°C followed by 23 cycles of 95°C for 7 secs, 68°C for 6 minutes. Five microliters of PCR product were analyzed in a 1% agarose gel to determine amplification efficiency. The entire cDNA pools from samples 3, 4 and 5 were digested with 10 units of SfiI for a total of two hours at 50°C because cDNA libraries constructed with ClonTech's SMART kit are susceptible to primer concatemerization, and the SfiI enzyme digests both the CDS III/3' and the SMART IV primers. All libraries were purified with the QIAquick PCR Purification Kit (QIAGEN) following manufacturer's instruction and concentrated with a conventional ethanol precipitation. The quality and quantity of the cDNA library was evaluated by both spectrophotometry and by gel electrophoresis. Libraries 1 and 2 were normalized using the Evrogen Trimmer-Direct Kit (Evrogen); normalized cDNA was purified using the QIAquick PCR purification Kit. Libraries 3, 4 and 5 were not normalized.

454 sequencing and assembly

Approximately 4 μg of amplified cDNA was used for 454 library construction and sequencing following established protocols [1]. Titration runs on the 454 were conducted on all 5 libraries. Bases were called by measuring the luminescence intensity from each well and comparing it to known standards. Upon completion, sequences were screened for primer concatemers, weak signal, and poly A/T tails.

Commercial software provided with Roche 454 sequencers (i.e., Newbler assembler) performs poorly with non-model species that lack a reference genome sequence [3, 38]. Thus, sequence assembly was conducted using the default parameters in PCAP, a free program that works well with next-generation sequence datasets [39]. BLASTx was used to search for similar sequences in GenBank. The top hit of each BLASTx search with an e-value ≤ 1 × 10-03 and a bit score > 40 was considered a significant match. BLAST results were then imported into Blast2GO [40], a software package that retrieves GO terms, allowing the function of ESTs to be determined and compared [40, 41]. Significant GO was determined with an e-value ≤ 1 × 10-03 and a bit score > 40. Ontology was categorized with respect to Molecular Function, Biological Process, and Cellular Component.

We statistically evaluated normalization efficiency by calculating the exact binomial probability of obtaining the observed number of counts of a particular gene by chance alone. In other words, our null hypothesis was no statistical difference in the expression of genes between the normalized and native libraries. Genes with a p-value less than 0.05 were more abundant in one of the libraries than expected by chance.


We generated rarefaction curves [27, 28] to compare the rate of discovery of new genes in the normalized versus native libraries as compared to the amount of effort as measured by sequencing depth. The mean number of genes that produced a significant BLAST hit and the mean number of sequences were calculated for the normalized and native libraries. Rarefaction curves were constructed using resampling procedures similar to bootstrapping: all identified genes within a library were recorded along with their frequency. The list was then randomized; the numbers sorted in ascending order, and the first appearance of the new gene was recorded along with its frequency. This was repeated 1,000 times with the program EcoSim700 [42]. Rarefaction analysis was performed on empirical data (i.e., actual counts from the sequences) and up to 10,000 simulated sequences.

SNP discovery

To identify SNPs, we pooled all sequences from the five libraries into PCAP and constructed contigs from the entire data set. The consensus sequences from these contigs were used as reference sequences to which individual reads were aligned using GC Reference Mapper (454 Life Science). Each read was only aligned to one site in the reference sequence; if reads were aligned to different references sequences then the reads were discarded. We limited SNP scoring to contigs composed of four or more reads, and we only identified SNPs where 30 bp of high-quality sequence data was present both upstream and downstream of the variable site. We also discounted SNPs in homopolymer repeats of > 4 nucleotides. The Ts/Tv ratios were determined for every SNP-containing contig that BLASTed to a known gene.