Background

Drosophila melanogaster is an important model organism. After more than 50 years of study, the anatomy of the brain is well described and many brain functions have been mapped to particular substructures [1,2,3,4,5,6,7,8]. The adult brain is composed of approximately 200,000 neurons which are organized into discrete substructures. The optic lobe (composed of the lamina, medulla, lobula and lobula plate) is primarily involved in the processing of visual information from the photoreceptors and sending that information to the central brain [2,5,9]. The antennal lobes are chiefly responsible for the processing of olfactory information [10]. The mushroom bodies are involved in olfactory learning and memory and other complex behaviors [11,12,13,14,15]. A group of approximately six neurons in the lateral protocerebrum is sufficient to drive circadian rhythms in locomotor activity [16,17]. The central complex, although poorly understood, appears to be involved in motor coordination [18,19,20].

Despite our increasing knowledge of Drosophila brain anatomy and function, relatively little information is available concerning the molecules expressed in the brain that coordinate function and manifest behavior. Classic methods of identifying genes involved in neural function include behavioral screening of mutagenized flies, then rescreening candidate lines for pleiotropic effects due to developmental defects. This process is both laborious and time consuming. To augment this genetic approach, sequencing of random cDNAs is proving effective in identifying genes expressed in a specific cell type [21]. Much information has been collected through the analysis of expressed sequence tags (ESTs) [22,23,24,25]. Using this approach, sequence information is gathered from one or both ends of a cDNA and cataloged to determine the complexity of an mRNA population. Here, we use a modified EST approach and completely sequence novel cDNAs. Others have used a similar approach by shotgun sequencing concatenated cDNA inserts [26,27]. One goal of our work was to begin to develop a catalog of transcripts expressed in the brain. These transcripts, because of the location of their expression, are expected to contain a higher proportion of clones that are involved in neuronal function.

Many Drosophila head libraries have been used to isolate cDNAs that correspond to genes identified by genetic screens for their involvement in brain function. Several transcripts identified in this manner are expressed at a relatively low level (dunce [28], CREB [29], dco [30], period [31], timeless [32], dissonance [33]). The Drosophila brain makes up only a small part of head tissue (approximately 14% dry weight). By eliminating non-brain tissues, we increase the relative representation of rare neural transcripts in this unique library.

We began a catalog of the genes expressed in the brain of adult Drosophila in support of more conventional methods of understanding brain function. Cataloging sequence information and publishing the data through electronic databases has enriched molecular science in general. In a matter of a few minutes, one can use information from a single sequencing reaction to identify a gene that was sequenced by another laboratory, and one may be able to deduce the function of the isolated clone. This set of tools facilitates molecular work in virtually every branch of biological sciences. This report details construction, quality analysis and initial characterization of a unique library created from adult Drosophila brains. Surprisingly, we discovered that 11% (29 clones) of the Drosophila brain cDNA clones that were randomly chosen for analysis are not matched with any EST sequence generated in support of the Drosophila genome project (as of 10 October 2000). Further, the genes encoding 59% of these novel ESTs are not predicted by algorithms used for fly genome annotation. From our analysis of ESTs that do not correspond to one of the 13,601 annotated genes, we predict that the number of genes in the Drosophila genome may be underestimated by 10-15% (approximately 1,300 to 2,000 genes).

Results and discussion

Library quality assessment

Desiccated brain tissue from adult Drosophila melanogaster was used to construct a library using the Stratagene HybriZAP system. This library was designed for protein expression and, therefore, was constructed such that full-length cDNAs containing 5' untranslated regions are not likely to be present. The number of clones in the library and the size of the clones were used to assess the quality of the library. The number of clones in the primary library was determined by titering one of the five packaging reactions. The total number of clones in the primary library (that is, all five packaging reactions) is 6.75 × 10⁶. From the analysis of the fully sequenced clones (141 novel and matched isolog clones reported in this study), the majority of the inserts (53%) were between 400 and 800 base pairs (511 base pairs ± 197 base pairs average deviation). Characterized clones from the library range from 139 to 1,746 base pairs (bp), including only 15 As of the poly(A) tail. The insert size for this library is as expected using the Stratagene HybriZAP kit, given that the size-selection column retains DNA molecules larger than 200 bp (Stratagene technical support, personal communication). Of the 283 clones that were either completely sequenced or end-tag sequenced, approximately 4% (12 clones) did not contain an insert (Figure 1).

Figure 1

Scheme for classifying the Drosophila brain library clones.

Clone selection

To try to maximize the discovery of novel transcripts, we investigated whether there was a correlation between transcript abundance and the presence of the sequence in a public database. Specifically, a reverse northern blot experiment using radiolabeled head cDNA was performed to determine whether hybridization level could be used to identify frequently occurring transcripts. We reasoned that the abundance of these transcripts may increase their representation in data banks when compared to less abundant transcripts. The data from this experiment are shown in Table 1.

Table 1 Hybridization data from 85 randomly chosen clones

The level of hybridization to the probe varied considerably within a category. In particular, novel transcripts did not uniformly have low levels of hybridization, which suggested that hybridization level would not greatly aid in identifying novel clones. Therefore, subsequent clones for this study were randomly chosen for sequence analysis. It is possible that abundant transcripts may not be as well represented in the database as a result of directed cloning of rarer molecules, or that cDNA abundance in this library may not accurately reflect relative transcript abundance in the fly brain.

Sequence data

We obtained sequence data for 271 independently isolated cDNAs representing transcripts expressed in the Drosophila brain (Figure 1, Table 2). Of these, 141 clones originally classified as either novel (114 clones) or matched isologs (27 clones) were completely sequenced. Only end-tag sequence data was collected for clones classified as matched isolog ribosomal protein sequences (16 clones), known Drosophila sequences (71 clones) and known Drosophila ribosomal protein sequences (23 clones). All insert sequences or ESTs can be obtained by searching GenBank with the appropriate accession numbers listed in Table 2. Data for 20 mitochondrial 16S clones are not reported here because mitochondrial expression is not the focus of this study.
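The clone counts given above can be cross-checked with a short tally. This is only an illustrative sketch: the dictionary labels and the `percent` helper are names chosen here, and the counts are the ones stated in the text.

```python
# Clone counts as stated in the text. The labels are illustrative and are
# not taken from the paper's tables.
sequenced = {
    "novel (completely sequenced)": 114,
    "matched isolog (completely sequenced)": 27,
    "matched isolog, ribosomal protein (end-tag only)": 16,
    "known Drosophila (end-tag only)": 71,
    "known Drosophila, ribosomal protein (end-tag only)": 23,
    "mitochondrial 16S (not reported)": 20,
}
NO_INSERT = 12  # clones without an insert, from the library quality assessment

total_with_insert = sum(sequenced.values())
total_examined = total_with_insert + NO_INSERT


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


print("clones with sequence data:", total_with_insert)
print("clones examined in total:", total_examined)
print("originally novel (%):", round(percent(114, total_with_insert)))
print("no insert (%):", round(percent(NO_INSERT, total_examined)))
```

The two totals correspond to the 271 sequenced cDNAs and the 283 clones examined, and the rounded percentages correspond to the approximately 42% and 4% quoted in the text.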

Table 2 GenBank accession numbers of all sequenced clones

We generated sequence data for 27 Drosophila genes that had been previously sequenced from other organisms. Isologs that were identified in the brain library but which had not been identified or sequenced in Drosophila melanogaster are listed in Table 3 (with the exception of ribosomal protein genes). An isolog is defined as a sequence that has a high degree of similarity to genes identified in other organisms, but the functional relationship between these genes has not been demonstrated [34]. As expected, we recovered many previously identified Drosophila genes (Table 4). We did not continue with full insert sequencing of these Drosophila sequences, but the EST data for these clones was submitted to GenBank.

Table 3 Isologs identified in Drosophila brain study
Table 4 Brain cDNA clones matched with previously reported Drosophila genes

Approximately 42% of the sequence data generated in this study were originally novel according to sequence analysis searches conducted at the beginning of this project. Since then, much EST data has been added to GenBank and the Drosophila genome sequence has been released. Thus, in October 2000 the 114 previously novel brain cDNAs were again compared with fly sequence data. The percentage of transcripts that do not have corresponding ESTs is reduced to 11% (Table 2; of 29 clones, 17 have no EST matches and are not predicted genes following genome annotation, and 12 have no EST matches but are matched with a predicted gene). Although each of these 29 clones lacks an EST match, each clone is identified within the Drosophila genome sequence recently reported by Adams et al. [35]. It is possible that some of these clones represent the 3' ends of ESTs for which only 5' sequence data is available. Considering that data for approximately 80,000 ESTs (24,193 ESTs from adult heads alone) are reported [36] and that our analysis examined only 271 randomly chosen brain library clones, 11% is a surprisingly large number. This indicates that this library is a valuable resource for generating sequence data that will facilitate genome annotation, specifically identifying regions transcribed in the adult fly brain.

From our analysis it is clear that EST data are essential for accurate and thorough genome annotation. In particular, using current genome annotation algorithms, 42 of the 271 brain clones do not correspond to predicted genes (Table 2). Of these 42 clones, however, 25 have EST matches with the Berkeley Drosophila Genome Project (BDGP) data (Tables 2, 5). Comparisons of the remaining 17 cDNA sequences with the Drosophila genome sequence show evidence of RNA processing (exon/intron borders and consensus splicing sequences) for two clones, and presence of a poly(A) addition sequence (AAUAAA) 12 to 30 bp upstream of an extensive poly(A) region at the 3' end of the insert sequence for seven clones (Table 5b). Ten of the 17 clones were detected in reverse northern experiments using either brain or body radiolabeled cDNA (Table 6). The distribution of detection by brain cDNA, body cDNA, both or neither (not detectable above background) for the 17 clones in this category is similar to the distributions observed in the other categories (Tables 5a, 6), and strikingly similar to the detection frequency observed for the 'matched with an EST and a predicted gene' category. Although these data suggest that these sequences are transcribed, additional experiments are necessary to confirm whether this is true for each clone. None of the clones in this category is predicted to encode a protein larger than 100 amino acids. It is possible that these sequences may correspond to genomic DNA. Alternatively, these novel RNA molecules may perform some unknown cellular function that requires a conserved structure rather than a conserved sequence.

Table 5a Correlation between EST match, gene prediction and hybridization analysis
Table 5b Additional information
Table 6 Body versus brain expression of originally novel clones

The Drosophila genome is predicted to contain 13,601 genes [35]. If our observations are representative and can be extended to the number of genes in the fly genome, then our analysis suggests that the total number of genes may be underestimated by approximately 15% (42 of the 271 randomly chosen cDNAs do not correspond to a predicted gene). Thus, approximately 2,000 genes may await discovery.
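The extrapolation above is simple proportional arithmetic. The sketch below reproduces it under the same assumption the text makes, namely that the 271 clones are a representative sample; the function and variable names are illustrative only.

```python
ANNOTATED_GENES = 13601   # predicted gene count cited in the text [35]
CLONES_EXAMINED = 271     # randomly chosen brain cDNAs
CLONES_UNPREDICTED = 42   # clones with no corresponding predicted gene


def estimate_unannotated(annotated, examined, unpredicted):
    """Scale the unpredicted fraction of the sample up to the annotated gene count.

    Assumes the sample is representative of the whole genome, as the text does.
    """
    fraction = unpredicted / examined
    return fraction, fraction * annotated


fraction, missing = estimate_unannotated(
    ANNOTATED_GENES, CLONES_EXAMINED, CLONES_UNPREDICTED
)
print(f"unpredicted fraction: {fraction:.1%}")
print(f"estimated unannotated genes: {missing:.0f}")
```

The computed values are slightly above the rounded figures of approximately 15% and approximately 2,000 genes quoted in the text.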

Transcript distribution analysis

A second hybridization study was conducted to determine whether clones originally identified as novel were detectable in the brain and/or body of adult Drosophila. This data may offer clues as to which transcripts are involved in basic neuronal function, as opposed to a function that may be specific to the brain. The Drosophila central nervous system (CNS) includes thoracic and abdominal ganglia and, therefore, neural transcripts are often expressed throughout the body. Thus, it was possible that few transcripts would be brain-specific.

To determine how the (originally) novel clones were distributed in the animal, plasmid templates from 114 novel clones were spotted on filters and hybridized with radiolabeled cDNA from either brains or bodies (minus heads). The results of this study are listed in Table 6. In this experiment cDNA probe is limiting and, therefore, many transcripts that are in low abundance may not be detectable. In fact, 36% of the clones were not detected in either brain or body. These clones may correspond to less abundant transcripts. Ideally, hybridization probe would be in excess in these experiments to determine which clones are brain specific, but Drosophila brain cDNA is limiting. Approximately 30% of the clones were detectable only in the brain and are candidates for genes involved in brain function. Clones that were detected in both tissues made up about one third of the novel transcripts (29%). About 5% of the clones were detected only in the body. As the library is made from brain tissue, we did not expect to recover many transcripts that would only be detectable in the body, as compared to the brain.

We used published localization data from previously identified transcripts to evaluate the data we collected for the novel clones (Table 4). Approximately 22% (7 of 32) of the known Drosophila genes listed are neural-specific, and approximately 30% of novel transcripts were detected only in the brain. Approximately 29% of the novel transcripts were detectable in both brain and body tissues. Known Drosophila genes that were localized in body and brain tissues accounted for 56% (18 of 32) of genes for which localization data was available (Table 4). Clones detected in both tissues may indicate that the gene product is needed in all cells. Genes from the nervous system would be expected to be expressed in both tissues, so transcripts detected in both cannot be ruled out of this category; but these transcripts are not brain specific. Approximately 5% of the novel clones were detectable only in the body, as compared to 22% (7 of 32) of the known Drosophila clones detected only in body tissues. These transcripts are apparently expressed at a higher level in the body and at relatively low levels in the brain. It should be noted that localization data were not specified for 18 of the 49 (37%) known transcripts listed in Table 4. This analysis suggests that this brain cDNA library is a rich source for generating cDNA sequence information and for identifying novel, brain-specific cDNAs.

Conclusions

The initial analysis of an adult Drosophila brain library is presented here. Somewhat surprisingly, we observe no clear connection between the abundance of a transcript and its appearance in a sequence data bank. However, molecular screens that are directed towards isolating rare transcripts may skew the transcript-related data in sequence banks towards less abundant molecules. As shown in Figure 1 and Table 2, we have identified and sequenced 29 novel clones that do not match with other known expressed sequences (but do match with fly genomic sequence information), 85 clones that are matched with EST data, 71 clones that were previously reported Drosophila sequences, 39 clones that contain ribosomal protein sequences, 27 clones that are matched with genes previously reported for other organisms (isologs, Table 3) and 20 clones corresponding to mitochondrial sequences.

Why did we recover such a high percentage of novel sequences? Libraries made from brain tissue are proposed to have a higher complexity of transcripts than libraries made from other tissues [21]. Therefore, EST screens of brain libraries should yield larger numbers of independent transcripts, as a result of the increased transcript complexity within brain tissues. Another possible explanation for the surprisingly large number of novel cDNAs identified in our analysis is that our library is not normalized. It has been proposed that hybrids form between poly(dA) and poly(dT) sequences during the hybridization/subtraction reaction used to normalize libraries and that these sequences are subsequently lost [36].

An ultimate goal of this project is to create a database of all the transcripts expressed in the Drosophila brain and to correlate this information with their patterns of expression in the brain. This type of a database would be a valuable resource and could be used in comparative studies with other organisms. Comparisons of transcripts from organisms with relatively simple brains (Drosophila) to organisms with more complex neural function (humans) may offer insights into basic brain function and aid in the identification of transcripts involved in higher-order brain functions. The 35 clones that appear enriched in the brain may identify proteins or RNAs that are involved in a brain-specific function. Transcripts identified in this library can be directly tested for protein-protein interaction using the yeast two-hybrid capability of the library, making it a good resource for many areas of study.

Our analysis of this unique brain library demonstrates that many transcribed regions of the Drosophila genome remain undiscovered, and that approximately 2,000 more genes may be identified. Genome annotation efforts emphasize identifying protein-coding regions [37]. Thus, it is possible that some of the ESTs lacking a corresponding predicted gene were missed during genome annotation because an open reading frame (or one of sufficient size) was not predicted.

Complete genomic sequences are excellent resources, and extensive annotation of a genome makes the sequence information even more powerful. Current software is not sufficient to identify all transcribed regions within the genome. As of the year 2000, EST data for 24,193 clones from adult Drosophila head libraries is reported and estimated to represent over 40% of all Drosophila genes [38]. Our results confirm that not all transcribed regions of the genome are identified and that EST analyses are essential for accurate and complete genome annotation.

Materials and methods

Tissue preparation

To produce the animals for the brain dissections, adult Drosophila were entrained using 12 h light and 12 h darkness in temperature-controlled incubators at 25°C. Entrained adult D. melanogaster (Canton S) flies were collected 3 h after the lights were turned off, frozen on dry ice, shaken to detach heads from bodies, and separated through a screen to isolate the heads. Frozen heads were incubated in prechilled -20°C, 100% acetone (EM Science) at -20°C overnight to replace the water in the tissue with acetone [39]. Prechilling the acetone prevents the heads from thawing when added to the acetone. Heads were dried at room temperature, and brains were removed using fine dissecting tweezers.

RNA preparation

RNA was isolated according to the Micro RNA Isolation protocol from Stratagene, with the exception of homogenization. Dried tissue was homogenized in denaturing solution with β-mercaptoethanol for 1 min, incubated on ice for 15 min, and then homogenized for an additional 5 min. The addition of a rehydration step increased the yield of RNA from dried tissue to approximately the same level as fresh tissue (16-21.9 μg total RNA extracted from 100 fresh heads, 15-29.3 μg total RNA extracted from 100 acetone-dried heads with rehydration step, compared to 3-6.6 μg total RNA extracted from 100 acetone-dried heads with no rehydration step). Poly(A) RNA was isolated using the Poly(A) Quick mRNA Isolation Kit from Stratagene according to the manufacturer's instructions. Approximately 5 μg poly(A) RNA was extracted from approximately 15,000 brains. Weighing acetone-dried brains allowed us to estimate how many were used to construct the library (50 acetone-dried brains weigh 0.25 mg and 50 acetone-dried heads weigh 1.8 mg).
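The dry-weight figures above support two quick calculations: the share of head dry weight contributed by the brain, and the number of brains in a weighed batch. The following is a minimal sketch; the function names are illustrative, and the 75 mg batch in the example is a hypothetical input rather than a value reported in the text.

```python
BATCH_SIZE = 50          # brains (or heads) per weighed batch, from the text
BRAIN_BATCH_MG = 0.25    # dry weight of 50 acetone-dried brains
HEAD_BATCH_MG = 1.8      # dry weight of 50 acetone-dried heads


def brain_fraction_of_head():
    """Fraction of head dry weight that is brain (equal batch sizes cancel)."""
    return BRAIN_BATCH_MG / HEAD_BATCH_MG


def brains_from_weight(total_mg):
    """Estimate how many dried brains make up a batch of the given weight."""
    return total_mg * BATCH_SIZE / BRAIN_BATCH_MG


print(f"brain share of head dry weight: {brain_fraction_of_head():.1%}")
# Hypothetical example: a 75 mg batch of dried brains.
print(f"brains in a 75 mg batch: {brains_from_weight(75.0):.0f}")
```

The first value is consistent with the figure of approximately 14% dry weight given in the Background.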

Library construction

The library was constructed using a Stratagene HybriZAP™ Library kit. First-strand cDNA synthesis was primed from the 3' end of the poly(A) RNA using a poly(T) primer that also contained an XhoI restriction site and a GAGA sequence (5'-GAGAGAGAGAGAGAGAGAGAACTAGTCTCGAGTTTTTTTTTTTTTTTTTT-3'). 5-methyl dCTP was used during first-strand cDNA synthesis to protect internal XhoI sites. Second-strand synthesis was primed by the partially digested RNA that resulted from RNase H treatment of the first-strand synthesis reaction. Pfu DNA polymerase was used to blunt the cDNA and EcoRI adapters were ligated to the blunt ends. The cDNA was digested with EcoRI and XhoI, size separated (retaining molecules approximately 200 bp or larger), ligated into HybriZAP™ vector arms, and packaged into phage heads for amplification.

Determining the number of primary clones

The cDNA library was titered to determine how many independent clones were recovered. At the 10⁻¹ dilution there were 270 plaques per plate, giving a total of 6.75 × 10⁶ clones in the primary library. This number was calculated as follows: [(number of plaques, 270) × (dilution factor, 10) × (total packaging volume, 500 μl)] / [(total number of mg packaged, 8.75 × 10⁻⁵) × (number of μl packaged, 1)] = 1.542 × 10¹⁰ plaque-forming units (PFU) per mg, or 1.35 × 10⁶ PFU per packaging reaction. There are five packaging reactions for the entire library, for a total of 6.75 × 10⁶ clones in the primary library.
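The titer calculation can be written out step by step. This is a sketch using the values given in the text; the function name and parameter names are illustrative, and the 1 μl term is treated here as the volume plated, which is an interpretation of the formula rather than something the text states explicitly.

```python
def primary_library_titer(plaques, dilution_factor, packaging_volume_ul,
                          amount_packaged_mg, volume_plated_ul, n_reactions):
    """Return (PFU per packaging reaction, PFU per mg, total primary clones)."""
    # Plaques on the plate, corrected for dilution and scaled to the full
    # packaging volume, give the PFU in one packaging reaction.
    pfu_per_reaction = (plaques * dilution_factor * packaging_volume_ul
                        / volume_plated_ul)
    # Dividing by the amount of DNA packaged gives the packaging efficiency.
    pfu_per_mg = pfu_per_reaction / amount_packaged_mg
    total_clones = pfu_per_reaction * n_reactions
    return pfu_per_reaction, pfu_per_mg, total_clones


per_reaction, per_mg, total = primary_library_titer(
    plaques=270,
    dilution_factor=10,
    packaging_volume_ul=500,
    amount_packaged_mg=8.75e-5,
    volume_plated_ul=1,
    n_reactions=5,
)
print(f"PFU per packaging reaction: {per_reaction:.3g}")
print(f"PFU per mg: {per_mg:.4g}")
print(f"primary clones (five reactions): {total:.3g}")
```

Separating the per-reaction and per-mg values makes it clear that the library size depends only on the plaque count, dilution, and packaging volume, while the amount of DNA packaged affects only the efficiency figure.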

Sequencing-template preparation

PCR template

Individual phage plaques were incubated in 400 μl SM buffer overnight and used as amplification template. Amplification reactions were performed in a total volume of 40 μl and contained 2 μl eluted phage, 40 ng of each primer (FADI 5'-CACTACAATGGATGATG-3' and RADI 5'-CTTGCGGGGTTTTTCAG-3'), 0.001% Tween 20 (Sigma), 2.5 U Taq DNA polymerase (Promega), 1x Taq polymerase buffer (Promega), 1.56 mM MgCl₂ (Promega), and 0.25 mM of each dNTP (USB). PCR was performed on a Perkin-Elmer 9600 GeneAmp PCR system, as specified in the HybriZAP™ Two-Hybrid cDNA Gigapack Cloning Kit instruction manual (Stratagene). After amplification, reactions were incubated at 37°C for 15 min with 0.5 U μl⁻¹ exonuclease I (USB) and 0.5 U μl⁻¹ shrimp alkaline phosphatase (Amersham Life Sciences). Enzymes were inactivated by heat treatment at 85°C for 15 min. Resulting samples were electrophoretically separated on a 1% agarose (Kodak) gel and compared to a quantitative marker, BioMarker-EXT (BioVentures), to estimate the DNA concentration of each sample. This DNA was directly used in subsequent sequencing reactions.

Plasmid template

Individual phage plaques were incubated in 400 μl SM buffer overnight and then used for excision. Library phage were incubated with ExAssist Helper Phage™ (Stratagene) and XL1-Blue Escherichia coli cells, and grown overnight in Luria Broth. E. coli cells were killed by heat treatment (70°C, 20 min). XLOLR E. coli cells were inoculated with the released phagemids and this mixture was plated on 50 μg ml⁻¹ ampicillin (Sigma) selection medium. Resulting colonies were cultured for subsequent plasmid DNA preparation (Perfect Prep Plasmid DNA kit, 5'-3' Inc.).

Sequencing

Initial sequence information was obtained using standard sequencing methods (described below) and a vector primer directed toward the 5' end of the insert (FADI 5'-CACTACAATGGATGATG-3'). These ESTs were evaluated using the BLAST search program [40,41] linked to the nonredundant GenBank database. Novel cDNAs and isologs were completely sequenced. If a cDNA had been previously identified, sequence determination was not continued. The second standard sequencing reaction was primed from the poly(A) tail using 1.6 pmol of an anchored poly(T) primer (PLYT 5'-TTTTTTTTTTTTTTTV-3' (V=A, C, or G)). cDNA sequences were completed using an octamer-primer walking strategy [42,43].
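The anchored PLYT primer is a mixture of three oligonucleotides, one for each base allowed at the degenerate 3' position. The sketch below expands the degenerate code to list them; the `expand_degenerate` helper and the lookup table are illustrative names, and only the V code stated in the text is included among the degenerate symbols.

```python
from itertools import product

# Plain bases map to themselves; V stands for A, C, or G, as stated in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "V": "ACG"}

PLYT = "TTTTTTTTTTTTTTTV"


def expand_degenerate(primer):
    """List every concrete sequence represented by a degenerate primer."""
    choices = [IUPAC[base] for base in primer]
    return ["".join(combo) for combo in product(*choices)]


variants = expand_degenerate(PLYT)
for seq in variants:
    print(seq)
```

Because the 3' base cannot be T, each variant can anneal only at the junction between the poly(A) tail and the upstream transcript sequence, which is what anchors the sequencing read.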

Automated sequencing reactions were performed using ABI PRISM Dye Terminator (or dRhodamine for PLYT reactions) Cycle Sequencing Ready Reaction Kits with AmpliTaq DNA polymerase, FS, according to the manufacturer's directions or as described for octamer-primed sequencing reactions [42,43]. The FADI primer was annealed at 48°C and the PLYT primer was annealed at 20°C. Sequencing reactions were ethanol precipitated, pellets were resuspended in 3.5 μl loading buffer, 1.5 μl was loaded onto a sequencing gel, and the data was collected by an ABI PRISM 377 DNA sequencer. Data collected from the ABI PRISM 377 DNA sequencer was manually edited using Sequencher 3.0 (GeneCodes).

Hybridization analyses

Eighty-five individual phage plaques were incubated in 400 μl SM buffer overnight and 2 μl phage eluant was used to produce a grid of plaques on a lawn of E. coli cells. Filter lifts were taken from the grid of plaques and hybridized at 65°C overnight with labeled Drosophila head cDNA at 1 × 10⁶ cpm per ml hybridization buffer (50% formamide, 5x SSC, 0.1% Ficoll w/v, 0.1% PVP w/v, 0.1% BSA w/v, 0.1% SDS w/v, 0.2 μg ml⁻¹ salmon sperm DNA and 1 mM EDTA). Filters were then washed sequentially at 42°C in 5x SSC for 1 h, at 65°C in 1x SSC for 1 h, and at 65°C in 0.1x SSC for 1 h. Filter lifts were exposed to phosphorimaging plates for 24 h (Fuji Medical Systems) and psl (photo stimulating units) of a standard area were determined using a Fuji Bas1000 Imager.

Hybridization of all novel clones

Plasmid DNA (10 ng) from each novel clone was denatured at 95°C for 5 min and spotted onto a nylon filter. Filters were hybridized at 65°C overnight with either labeled Drosophila body (minus head) or labeled Drosophila brain cDNA at 1 × 10⁶ cpm ml⁻¹ in Church and Gilbert buffer (7% SDS, 1 mM EDTA, 500 mM Na₂HPO₄, and pH to 7.2 with H₃PO₄ [44]). Filters were washed twice for 1 h (each) at 65°C in Church and Gilbert buffer. Filters were then exposed to phosphorimaging plates for 24 h, and psl of a standard area was determined using a Fuji Bas1000 Imager. 10 ng of each plasmid on the filter contains approximately 1.2 × 10¹² copies and, therefore, probe is expected to be limiting in these experiments.

Categorizing clones

Clones classified as 'novel' had no obvious match to nucleic acid/protein sequence information in GenBank (a score less than 100). Novel clones may contain blocks of less than 100 bases that are matched with other sequences, but these small regions have no known functional correlation and the similarity between the two sequences was very low. It is possible that some of our novel clones could be part of an EST from the BDGP that has not been fully sequenced. Sequences categorized as matched to EST data have a high degree of similarity (a score over 100) to reported sequence information, but the previously collected sequence data was not associated with a known function. Sequences categorized as 'known Drosophila' were a perfect match (with perhaps the exception of a few bases, fewer than 10) with sequence information from Drosophila. 'Matched isologs' are sequences that have a high degree of similarity (score of over 100 at the protein level) to a gene found in another organism, but a functional homology between these genes has not been determined. Ribosomal protein sequences were categorized as 'known' or 'isologs' using the above criteria. Ribosomal protein and mitochondrial sequence clones were categorized separately because these types of transcripts frequently occurred in our library.
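The classification criteria above can be summarized as a small decision function. This is a hedged sketch, not the procedure actually used: the `BestHit` record, its field names, and the order in which the rules are applied are all assumptions made for illustration, and the text does not say how a score of exactly 100 or a Drosophila match with 10 or more mismatches was handled.

```python
from dataclasses import dataclass
from typing import Optional

SCORE_CUTOFF = 100    # score threshold quoted in the text
MAX_MISMATCHES = 10   # "fewer than 10" mismatched bases for a known Drosophila match


@dataclass
class BestHit:
    """Hypothetical summary of a clone's best database match."""
    score: float
    from_drosophila: bool
    mismatches: int
    has_known_function: bool
    ribosomal: bool = False
    mitochondrial: bool = False


def classify(hit: Optional[BestHit]) -> str:
    """Assign a clone to one of the categories described in the text."""
    # Assumption: a score of exactly 100 is treated as a match.
    if hit is None or hit.score < SCORE_CUTOFF:
        return "novel"
    if hit.mitochondrial:
        return "mitochondrial"
    if not hit.has_known_function:
        return "matched EST"
    if hit.from_drosophila:
        if hit.mismatches >= MAX_MISMATCHES:
            return "unclassified"  # case not covered by the text
        base = "known Drosophila"
    else:
        base = "matched isolog"
    # Ribosomal protein clones were tallied separately within each category.
    return base + " (ribosomal protein)" if hit.ribosomal else base


print(classify(None))
print(classify(BestHit(350, True, 2, True)))
print(classify(BestHit(180, False, 40, True, ribosomal=True)))
```

Checking the score first mirrors the text, where the novel category is defined purely by the absence of a match above the threshold.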