Background

Reading disability (RD), or dyslexia, is a common syndrome with a significant genetic component. Genetic linkage studies have identified five potential RD loci located on chromosomes 1 [1, 2], 2 [3, 4], 6 [5-9], 15 [7, 10] and 18 [11]. The linkage to 6p has been the most often reproduced in independent samples, with five studies showing peaks of linkage to different regional marker sets [5-9]. We recently constructed a BAC/PAC contig of the 6p RD locus [12] and identified the precise location and order of 29 short tandem repeat (STR) markers spanning this region [13]. A subsequent study with this new marker panel identified a peak of transmission disequilibrium at marker JA04 (G72384) [14].

We searched the expressed sequence tag (EST) database, dbEST http://www.ncbi.nlm.nih.gov/dbEST/index.html, with the genomic sequence corresponding to the peak of transmission disequilibrium at marker JA04. ESTs are partial and usually incomplete cDNA sequences prepared from various tissues. Presently, nearly four million ESTs are catalogued in dbEST (release 04/19/02), and an estimated 80% or more of all human genes have at least one representative entry [15]. While not every gene is accounted for in dbEST, computer database searching, also known as in silico cloning, can identify new genes without physically manipulating DNA. These types of analyses can also characterize intron-exon boundaries, splice variants, tissue-specific expression levels, and gene homologies [16]. Furthermore, clustering ESTs together to form a contiguous sequence can predict putative open reading frames (ORFs). The first map of the human genome, which contained over 30,000 genes, was generated by mapping EST clusters to human-hamster radiation hybrid cell lines [17]. Despite their wide-ranging utility, ESTs have two inherent drawbacks: (1) they are based on single sequence reads, making them vulnerable to sequencing errors, and (2) they are generated from cDNA libraries that may contain unexpressed or incompletely spliced sequences derived from heterogeneous nuclear RNA or other artifacts. ORFs assembled from clustered ESTs could therefore contain both expressed and unexpressed sequences and should be treated with caution. Fortunately, the high redundancy of entries in dbEST permits the alignment of multiple ESTs for most genes, thus diminishing the effects of these drawbacks.

To identify candidate genes for RD and other disorders mapping to this region we downloaded and searched the two million base pairs of genomic sequence surrounding the peak of transmission disequilibrium. In addition to RD, risk loci for Behçet's disease [18], inflammatory bowel disease (IBD3) [19, 20], hypotrichosis simplex (HSS) [21], insulin dependent type 1 diabetes mellitus [22], attention deficit hyperactivity disorder (ADHD) [23] and schizophrenia [24] have all been assigned to this general chromosomal location by genetic linkage analysis. Using in silico cloning we identified a total of 19 genes and 2 pseudogenes and mapped their precise physical location and direction of transcription. The expression pattern of each gene was characterized by examining the number of ESTs identified from various tissues as well as by qualitative RT-PCR with RNA from 20 different human tissues. This study also allowed us to test the usefulness of in silico cloning to identify and map new genes in a focussed region of the genome.

Results

Using the blastcl3 server at NCBI, we performed in silico cloning studies of the 6p RD locus to identify coding regions. In total, 623 ESTs from 80 different tissues were identified and aligned to 2 Mb of genomic sequence. These searches captured 157 putative coding regions from 19 genes and 2 pseudogenes concentrated in the central 1200 Kb, shown in detail in Figure 1, with base pair 1 at the 5-prime end of FLJ12671 and the map ending 1 Kb centromeric to the 3-prime end of RPS10. Short tandem repeat marker JA04, which identified the peak of transmission disequilibrium for RD phenotypes in previous studies [14], is at 540 Kb. The most telomeric 200 Kb and the most centromeric 600 Kb of genomic sequence are void of coding regions. Intergenic distances range from less than 1 Kb (KIAA0319 and TRAF) to 110 Kb (HNRPA1 and P24). Cytokeratin 8, transcribed telomere to centromere, is located in the intron between exons 1 and 2 of KIAA0319 (transcribed centromere to telomere). Table 1 lists the genes identified in Figure 1, their size in Kb, the NCBI accession number for the corresponding mRNA or cDNA, number of exons, genomic mapping source, and putative function.

Figure 1

Transmission disequilibrium and genetic linkage analyses of the 6p21.3 reading disability locus, regional STR markers and transcription map. At the top of the figure are the results of the DeFries-Fulker linkage analysis (T score, solid line) and the QTDT linkage disequilibrium analysis (chi-square, dashed line) [14]. Below these, the location and order of the 29 STRs are shown; they identify a peak of transmission disequilibrium at marker JA04. Below the markers is a detailed representation of the 1.2 Mb surrounding marker JA04. The 19 genes and 2 pseudogenes encoded in this region are shown with the telomere on the left and the centromere on the right; arrows indicate their position and direction of transcription.

Table 1 Genes encoded within 2 Mb of JA04 in order from telomere (top) to centromere.

Thirteen genes were previously mapped to this region (RU2AS, MRS2L, GPLD1, SSADH, KIAA0319, TRAF, HT012, FLJ12619, Geminin, KIAA0386, Cyclophilin A, CMAH, and NUP50) and are present on the NCBI accession view (NT_017021) of 6p21.3-22. Our in silico studies identified an additional six genes (RPS10, FLJ12671, UBE2D2, AP3, Cytokeratin 8, and P24) and two pseudogenes (HNRPA1 and ASSP2) not represented on NT_017021. In addition we identified a putative fifth exon for P24 from hypothalamus (BG715502) and pineal gland (AA363698) cDNA libraries, located between exons 2 and 3 in the 5-prime untranslated region and flanked by non-traditional splice donor and acceptor bases [25]. The novel exon was not present in the full-length mRNA sequence (AF418980).

The functions of the newly identified genes were inferred by their similarity to other known genes identified in the BLAST searches. Ribosomal protein S10 (RPS10) is 95% identical to the genomic and mRNA sequences of a known ribosomal gene. RPS10 is one of the many proteins that make up the ribosome. Other genes highly homologous to RPS10 are RPS5, RPS9, RPS29, RPL5, RPL27a, and RPL28 [26]. FLJ12671 is a hypothetical gene with an unknown function identified by the NEDO human cDNA-sequencing project [27]. Blasting the FLJ12671 sequence against the nr database also identified hits on chromosomes 1 and 11, suggesting that this gene has been duplicated on several chromosomes. Ubiquitin conjugating enzyme E2D2 (UBE2D2) is a protein that targets abnormal or short-lived proteins for degradation by the 26S proteasome [28]. The EST sequences identified for the UBE2D2 gene on chromosome 6 are 93% identical to the human UBC4/5 gene located on chromosome 5, suggesting that this gene could be a duplicate as well. Adapter-related protein complex 3 (AP3) may be involved in intracellular protein transport [29]. Cytokeratin 8 is 95% identical to both the genomic and mRNA sequences of keratin. Vesicular membrane protein (P24), a previously characterized but unmapped gene [30], has been localized in intracellular organelles of highly differentiated neural cells and may have a role in the neural organelle transport system. ASSP2 is one of 12 pseudogenes of argininosuccinate synthetase encoded on 10 chromosomes, with the only functional sequence residing on chromosome 9 [31]. Heterogeneous nuclear ribonucleoprotein A1 (HNRPA1) is also a pseudogene, with three other copies on chromosomes 3, 13, and 20 [32]. The copy on chromosome 12q13.1 is thought to be the functional HNRPA1 gene, whose product serves as a carrier for RNA during export to the cytoplasm [33].

The RT-PCR results, though qualitative rather than quantitative, were useful for characterizing the pattern of tissue expression (Figure 2). While most genes were about equally represented in the mRNAs of the twenty tissues in the panel, three genes, P24, NAD(+)-dependent succinic semialdehyde dehydrogenase (SSADH), and KIAA0319, had exceptional patterns. P24 was almost exclusively expressed in brain by RT-PCR, in agreement with the brain cDNA library origin of all 29 publicly accessible P24 ESTs (Figure 2). RT-PCR suggested that the expression of SSADH was greatest in brain, though it was expressed in all tissues tested. Consistent with this ubiquitous expression, only 5 of 38 SSADH ESTs accessible on public domain servers were from brain cDNA libraries, with the remainder of mixed origin. KIAA0319 gave a strong signal from brain and cerebellum mRNA, reflected by the 10 of 29 publicly accessible ESTs originating from brain cDNA libraries.
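The EST tallies quoted above can be summarized as a brain-library fraction per gene. The following is a minimal sketch in Python and was not part of the original analysis; the dictionary and function names are illustrative assumptions, and only the counts (brain ESTs, total ESTs) are taken from the text.

```python
# Brain-library EST fraction per gene.
# Counts are (ESTs from brain cDNA libraries, total ESTs) as quoted in the text;
# all names below are hypothetical and chosen for illustration only.
brain_est_counts = {
    "P24": (29, 29),
    "SSADH": (5, 38),
    "KIAA0319": (10, 29),
}


def brain_fraction(brain: int, total: int) -> float:
    """Return the fraction of a gene's ESTs that came from brain libraries."""
    if total <= 0:
        raise ValueError("total EST count must be positive")
    if not 0 <= brain <= total:
        raise ValueError("brain EST count must lie between 0 and total")
    return brain / total


# Rank genes from the most to the least brain-enriched EST representation.
ranked_genes = sorted(
    brain_est_counts,
    key=lambda gene: brain_fraction(*brain_est_counts[gene]),
    reverse=True,
)

for gene in ranked_genes:
    brain, total = brain_est_counts[gene]
    print(f"{gene}: {brain}/{total} brain ESTs ({brain_fraction(brain, total):.0%})")
```

EST counts depend on which libraries were sequenced and how deeply, so a fraction of this kind is only a rough guide to tissue expression and is best read alongside the RT-PCR results.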

Figure 2

RT-PCR analysis of genes within 2 Mb of JA04. Results of the qualitative PCR are shown for the 19 genes identified in this region. RNA from 20 tissues was used along with a blank control (lane 16) for the RT-PCR step. The genes are listed on the left alongside their corresponding gel. The number of ESTs identified in brain libraries is listed on the right along with the total number of ESTs for that gene.

Discussion

The primary goal of this study was to identify candidate genes surrounding the peak of transmission disequilibrium for RD on chromosome 6p and to characterize their patterns of expression. A secondary goal was to investigate the usefulness of the in silico approaches and specifically the dbEST database to identify and map new genes.

We identified 19 genes within 2 Mb of the peak of transmission disequilibrium, but are some better candidates for RD than others? The patterns of tissue expression as profiled by RT-PCR, together with the frequency with which ESTs originated from brain cDNA libraries (Figure 2), highlight five genes that are highly expressed in the brain: P24, SSADH, GPLD1, KIAA0386, and KIAA0319. Of these five, only SSADH has been associated with a brain-related phenotype. Two point mutations, a G-to-T transversion in the intron 9 splice donor site and a G-to-A transition in the intron 5 splice donor site, cause an exon to be skipped, resulting in abnormal metabolism of GABA, an important neurotransmitter in the brain. The handful of described cases were originally diagnosed by anomalous GABA metabolites in the urine associated with developmental and speech delays, hyporeflexia, and behavioral problems including mild autism, with clinical variation between affected family members [34]. There are no data, however, that link GABA or GABA metabolism to specific defects of reading independent of IQ. None of the other genes highly expressed in brain have associated diseases or clinical phenotypes. GPLD1 selectively hydrolyzes inositol phosphate linkages in vitro, releasing into the cytosol proteins bound to the plasma membrane via a glycosylphosphatidylinositol anchor [35]. P24 is a neuron-specific membrane protein localized in intracellular organelles of highly differentiated neural cells and may be involved in neural organelle transport. KIAA0386 encodes a protein that stimulates the formation of a non-mitotic multinucleated syncytium from proliferative cytotrophoblasts during trophoblast differentiation [36]. KIAA0319 encodes a protein of unknown function [37]. While HT012, an uncharacterized hypothalamus protein, could also be considered as a possible candidate gene, RT-PCR and EST searches (1 of 18 from brain cDNA libraries) do not suggest a high level of expression or selective expression in the brain.

The in silico studies also identified candidate genes for other diseases that map to this region. The five brain candidate genes described above for RD are also reasonable candidates for the neurobehavioral disorders schizophrenia and ADHD. HSS results in the complete loss of scalp hair in childhood. Betz et al. [21] described evidence for linkage to HSS with markers spanning D6S276 (400 Kb telomeric of JA04) through D6S1607 (5.6 Mb centromeric of JA04). Neither the RT-PCR results nor the tissue origin of the ESTs for any single gene suggests a best candidate among the 19. Behçet's disease is an autoimmune disorder characterized by a systemic vasculitis that affects the joints, all sizes and types of blood vessels, the lungs, the central nervous system, and the gastrointestinal tract [38]. There is evidence for linkage with markers spanning nearly 28 Mb of 6p with JA04 in the middle [18]. Candidate genes for Behçet's disease would include those expressed in lymphocytes or perhaps bone marrow, such as AP3 and FLJ12671, and other immune-related genes such as TRAF and RU2AS. These genes may also serve as candidates for other autoimmune disorders that map to this region, such as IBD3.

Overall, the in silico method for identifying genes in a specific genomic region worked well here, yielding a reasonable density of about one gene or pseudogene per 95 Kb (21 in 2 Mb). This method is highly dependent upon the quality of the information in dbEST. Any contamination from non-coding DNA, bacterial DNA, cDNA from other species, vectors or mitochondrial DNA could generate false gene assignments. Fortunately, the high redundancy of EST hits in dbEST increased our confidence that any identification was likely physiologic and that the searches were sensitive. It is possible, however, that our in silico approach may have missed some genes, in particular those with small ORFs and/or large 5-prime and/or 3-prime UTRs [39]. As dbEST expands over the next few years, new genes may be identified with repeated in silico searches, or with biophysical approaches such as cDNA hybridization [40], exon trapping [41] and amplification [42], or by identification of evolutionarily conserved sequences [43] and HTF islands [44].

Conclusion

In summary, we examined 2 Mb surrounding the transmission disequilibrium peak with RD at short tandem repeat marker JA04 on chromosome 6p. In silico searches of the dbEST database identified 19 possible candidate genes. While tissue expression patterns suggest five candidates that are highly expressed in brain – one with a known association with neurological disease – neither the RT-PCR data nor the EST information can absolutely rule out any of the 19 as culpable candidates. We conclude therefore that in silico cloning is a powerful and effective technique for quickly identifying existing and novel genes, which can then be used to develop cDNA single nucleotide polymorphism markers (cSNPs) for pinpointing a more precise location of the 6p RD gene, and other disease genes that map to this region.

Methods

In silico cloning

The two million base pairs of index genomic sequence surrounding marker JA04 were downloaded from the NCBI website (accession number NT_017021). A Perl script was written to parse the sequence into 200 files, each containing a 10 Kb segment in FASTA format. Repeat sequences were masked with the RepeatMasker program (http://ftp.genome.washington.edu/RM/RepeatMasker.html). Each masked file was sent to the blastcl3 server as a query for searching the dbEST database using the BLAST algorithm [45]. Only ESTs with an identity of 93% or greater were considered; all other hits were discarded. Surviving ESTs were then used to search the nr database to identify parental cDNA or mRNA matches, which assembled 623 non-overlapping ESTs into 21 gene or pseudogene antecedents. The BLAST search of nr also showed hits against chromosome 6 BAC or PAC sequence; ESTs not mapping to chromosome 6 were discarded. The final assemblies of EST and cDNA or mRNA sequences were aligned to the index NT_017021 genomic sequence using the Martinez Needleman-Wunsch algorithm in MegAlign (DNA Star, Lasergene), permitting identification of exon-intron boundaries, new exons (P24), and the direction of transcription relative to the telomere.
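The segmentation and hit-filtering steps described above can be sketched as follows. This is a hedged Python illustration, not the original Perl script: the function names, the `EstHit` record, and the way a hit's identity and chromosome are represented are assumptions made for the example. Only the 10 Kb segment size, the 93% identity cutoff and the chromosome 6 requirement come from the text. The text applies the chromosome test after the nr search; the sketch combines both tests in one predicate for brevity.

```python
from typing import Iterator, NamedTuple, Tuple


def chunk_sequence(seq: str, chunk_size: int = 10_000) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, segment) for consecutive, non-overlapping segments.

    Coordinates are 1-based and inclusive, so the first 10 Kb segment is 1..10000.
    """
    for offset in range(0, len(seq), chunk_size):
        segment = seq[offset:offset + chunk_size]
        yield offset + 1, offset + len(segment), segment


def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format one sequence as a FASTA record with fixed-width sequence lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


class EstHit(NamedTuple):
    """Hypothetical record for one EST hit; not a real BLAST output format."""
    est_id: str
    percent_identity: float
    chromosome: str


def keep_hit(hit: EstHit, min_identity: float = 93.0, chromosome: str = "6") -> bool:
    """Keep hits at or above the identity cutoff that map to the target chromosome."""
    return hit.percent_identity >= min_identity and hit.chromosome == chromosome


# Example: a placeholder 2 Mb sequence splits into 200 segments of 10 Kb each.
placeholder_sequence = "N" * 2_000_000
segments = list(chunk_sequence(placeholder_sequence))
records = [
    to_fasta(f"segment_{start}_{end}", segment)
    for start, end, segment in segments
]
```

Each element of `records` could then be written to its own file and, after repeat masking, submitted as a separate query. Non-overlapping segments keep the bookkeeping simple, at the cost that an exon straddling a segment boundary is split between two queries.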

Qualitative PCR

A panel containing human total RNA from 20 different tissues was purchased from Clontech (BD Biosciences Clontech). RT-PCR was performed using the RETROscript first strand synthesis kit for RT-PCR (Ambion). 2μg of RNA was denatured with 0.5μM of random decamer primers in a total volume of 12μl, and heated to 77°C for three minutes and placed on ice. First strand cDNA was then synthesized with the addition of 2μl 10 × RT buffer, 4μl dNTP mix, 1 Unit RNase inhibitor, and 100 Units MMLV-RT in a total volume of 20μl. The reaction was incubated for one hour at 44°C, and then 92°C for ten minutes and then stored at -20°C.

To check the quality of the first strand cDNA, 1μl of cDNA was used in a PCR reaction to amplify the rig/S15 ribosomal gene. 1.5μl of 10 × PCR buffer (Qiagen), 250μM of each dNTP, 0.5 Units of HotstarTaq polymerase (Qiagen) and 0.5μM of primer (Ambion) were used in a 15μl reaction. The reaction was denatured for 15 minutes at 95°C, then ten cycles of 94°C for 30 seconds, 65°C for 30 seconds (-1°C/cycle), and 72°C for 30 seconds, 20 additional cycles of 94°C for 30 seconds, 55°C for 30 seconds and 72°C for 30 seconds, and a final extension of 72°C for ten minutes. PCR reactions were performed in a MJ Research thermocycler. PCR products were electrophoresed on 2% agarose gels.
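The cycling conditions above describe a touchdown profile, which can be written out cycle by cycle. The short Python sketch below is an illustration only: the function name and parameters are hypothetical, and it assumes that the first touchdown cycle anneals at 65°C and that each later touchdown cycle is 1°C lower, so the tenth anneals at 56°C before the 20 cycles at 55°C.

```python
from typing import List


def touchdown_annealing_temps(
    start_c: float = 65.0,
    step_c: float = 1.0,
    touchdown_cycles: int = 10,
    final_c: float = 55.0,
    final_cycles: int = 20,
) -> List[float]:
    """Return the annealing temperature (degrees C) for every PCR cycle.

    Assumption: the first touchdown cycle uses start_c and each subsequent
    touchdown cycle is step_c lower; the remaining cycles all use final_c.
    """
    touchdown = [start_c - step_c * cycle for cycle in range(touchdown_cycles)]
    plateau = [final_c] * final_cycles
    return touchdown + plateau


annealing_temps = touchdown_annealing_temps()

# Print the schedule; denaturation (94 C) and extension (72 C) are 30 s in every cycle.
for cycle_number, temp in enumerate(annealing_temps, start=1):
    print(f"cycle {cycle_number:2d}: anneal at {temp:.0f} C for 30 s")
```

Under these assumptions the program runs 30 cycles in total, with the annealing temperature stepping down over the first ten and held constant thereafter.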

Primers (listed in Table 2) were designed to amplify the mRNA sequences of each of the 19 genes identified in the transcript map (Figure 1). Each amplicon was designed to be between 90 and 150 base pairs in length. For each amplicon, 1μl of cDNA, 1.5μl of 10 × PCR buffer (Qiagen), 250μM of each dNTP, 0.5 Units of HotstarTaq polymerase (Qiagen) and 0.5μM of primer (Life Technologies) were used in a 15μl reaction. PCR reactions were performed as above. One lane (Figure 2, lane 16) contained water during the RT-PCR step to check for RNA contamination. Products were electrophoresed on 2% agarose gels and stained with ethidium bromide.

Table 2 Primer sequences for RT-PCR amplification of regional genes.

Note Added In Proof

Since submission of the manuscript for review, contig NT_017021 has been incorporated into NCBI contig NT_007592.13.