The interpretation of sequence polymorphism data, such as the large volumes of data produced by genome-wide association studies, is largely based on the concept of a gene as a stand-alone genomic entity with a discrete start and end, as defined by current genomic annotations. The immediate corollary of this notion is that the effect of a nucleotide change is most likely to be local, or at least confined to the locus in which the change was found. However, surveys aimed at unbiased cataloguing of the transcripts produced by human and other genomes [17] challenge the notion of a gene as a separate, discrete genomic unit. This, in turn, may affect the interpretation of any nucleotide change found to be associated with a phenotype or a disease. Following the results of these surveys, the concept of a gene has expanded in several directions.

First, a multitude of different transcripts are made at any given locus. Analysis of the existing expressed sequence tag (EST) data suggests that a protein-coding locus can produce on average 5.7 different transcripts [1, 8]. Although only some of these alternative transcripts seem to have protein-coding capacity, this expands the number of transcripts in which a given exon can participate. Logically, a nucleotide change in a shared exon could affect any of the transcripts that share it, and thus the phenotypic effect of a nucleotide change is likely to be the sum of its effects on all the transcripts that contain it. The profile of expressed transcripts is likely to differ between tissues [9], and the effect of a nucleotide change could thus depend on the repertoire of transcripts expressed by the locus in each cell. In a simple example, shown for the annotated transcripts in Figure 1, the phenotype may manifest in a tissue that expresses an exon overlapping the variant but not in another tissue that expresses transcripts that skip that exon. In a more complex case, also depicted in Figure 1, a polymorphic nucleotide or stretch of nucleotides could be part of a coding exon in one tissue and of a non-coding exon in another; or it could lie in both a regulatory region of one group of transcripts and an exon of another group. Even more complex scenarios are possible, considering that a large number of different isoforms could be expressed in any given cell type.
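The simple case of tissue-dependent exon usage can be illustrated with a minimal sketch. All transcript names, coordinates and tissue labels below are hypothetical and serve only as an illustration; they are not taken from Figure 1 or from any real annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    name: str
    exons: tuple        # (start, end) pairs, 1-based inclusive (hypothetical)
    tissues: frozenset  # tissues expressing this transcript (hypothetical)


def contains_in_exon(tx, pos):
    """Return True if position `pos` falls inside any exon of `tx`."""
    return any(start <= pos <= end for start, end in tx.exons)


def tissues_exposed(transcripts, pos):
    """Map each tissue to the expressed transcripts whose exons include `pos`."""
    exposed = {}
    for tx in transcripts:
        if contains_in_exon(tx, pos):
            for tissue in tx.tissues:
                exposed.setdefault(tissue, []).append(tx.name)
    return exposed


# Hypothetical locus: isoform_b skips the middle exon of isoform_a.
locus = [
    Transcript("isoform_a", ((100, 200), (300, 400), (500, 600)), frozenset({"liver"})),
    Transcript("isoform_b", ((100, 200), (500, 600)), frozenset({"brain"})),
]

variant_pos = 350  # lies in the exon that isoform_b skips
print(tissues_exposed(locus, variant_pos))
```

In this toy locus, a variant at position 350 is exonic only in the tissue that expresses isoform_a, whereas a variant at position 150 would be exonic in both tissues.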

Figure 1

Examples of the potential effects of different sequence polymorphisms on two hypothetical loci, A and B. In this scenario, locus A has two annotated transcripts (RNA-1 and RNA-2, dark blue), expressed in different tissues. Polymorphism 1 would affect an annotated exon of locus A that occurs in the annotated transcript RNA-1 and in unannotated transcripts (RNA-4 and RNA-5, cyan) but not in RNA-2. Polymorphism 2 would affect a coding exon that is present in both annotated coding transcripts and also in the non-coding transcript RNA-7. Polymorphisms 1 and 2 would currently be considered the only likely 'functional' polymorphisms in locus A, as they are the only ones to affect the annotated transcripts, RNA-1 and RNA-2. Polymorphisms 3-6 are 'non-coding' polymorphisms, with polymorphism 6 being relatively distant from locus A. However, in this example, these polymorphisms in fact overlap unannotated transcripts (cyan) within locus A, some of which extend outside locus A or encode regulatory small RNA molecules that act in trans on other loci. Polymorphism 3 overlaps a novel exon that is part of unannotated transcripts RNA-4, RNA-5 and RNA-6. It could thus affect transcripts derived from both locus A and locus B, whether the two loci are nearby or distant in the genome. Polymorphism 4 overlaps a regulatory region for unannotated transcripts RNA-5 and RNA-6 and the 5' untranslated region of RNA-4. It could thus also affect expression of transcripts from both locus A and locus B. Polymorphism 5 overlaps a regulatory region for a non-coding RNA (transcript RNA-7) that is a precursor for a small RNA, a miRNA (RNA-8). Thus, this polymorphism and polymorphism 2, which also overlaps this non-coding RNA, could affect expression of other loci regulated by this small RNA in trans. Polymorphism 6 affects a more distant region in the genome that is connected to locus A by transcript RNA-9.
All transcripts are shown transcribed from left to right; non-coding portions of transcripts are represented as thin boxes; coding portions are represented as thicker boxes; introns are shown as thin lines; asterisks indicate polymorphisms.

Second, the annotation of genomic regions that are considered exonic is incomplete. Unbiased studies using rapid amplification of cDNA ends (RACE) on the genes within the 1% of the genome chosen for the ENCODE project have shown that almost half the exons detected in these experiments do not overlap annotated exons [1]. Thus, a nucleotide change in a 'non-coding' region may in fact lie within an as-yet undiscovered exon. Overall, 90% of all genes have been shown to have either a novel internal exon or a novel 5' exon in at least one of the 12 tissues tested [3].

In addition, the boundary of a gene may extend well beyond the current annotation. A gene can have many boundaries and, in fact, exons of different genes can participate in creating chimeric transcripts. The above-mentioned RACE experiments have shown that 68.4% of all genes had a 5' extension in at least one tissue tested [3]. Novel 5' exons were found to be represented both by novel, unannotated regions and by exons of other genes. Indeed, transcripts connecting exons of nearby loci and more distant loci separated by other genes on both strands were commonly found [13]. In fact, 57% of loci that were extended at the 5' end had a connection to an exon of an upstream gene [3]. A majority of 5' extensions (87%) reached over an annotated gene [3]. Often 5' extensions were tissue- or cell-line-specific, suggesting that in different tissues the profile of gene-gene connections could be different. Connections in the ENCODE regions could be identified only up to genomic distances of around 0.5 megabases (Mb). A continuation of these studies on human chromosomes 21 and 22 found a wealth of distant connections that span megabases of genomic space [2].

These observations raise several questions. What are the mechanisms responsible for the production of chimeric RNAs encoded by genes separated by very long genomic stretches? What are the functions, if any, of such chimeric RNAs, and what are the implications of the uncovered connections (gene to gene, or a novel distant exon to a known gene) for cell biology and disease? So far, these questions remain unanswered. However, copy number variants are known to affect the expression of distant genes located megabases away from the bounds of the variable region [10, 11]. This shows that the effect of a genomic change does not have to be limited to the immediate vicinity of the change and could in fact be both local and distant.

A third direction in which the concept of a gene has expanded results from the observation that transcripts emanating from any given locus could be carriers of trans-acting non-coding RNAs, such as microRNAs (miRNAs) or small nucleolar RNAs (snoRNAs) [5, 12-14]. Thus, a polymorphism affecting either the sequence or the processing of such an RNA molecule [15] could in fact affect the expression of loci regulated by the small RNA in trans, with potentially no effect on the locus in which the polymorphism was found, as shown in a hypothetical scenario in Figure 1. Such effects could be prevalent given that we now know the repertoire of the small, non-coding transcripts in a human cell to be far greater than the annotated classes of known small RNAs, and that such novel small RNAs could be carried by long RNA precursors [16, 17].
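The trans-acting scenario can be sketched as a simple lookup from a precursor transcript to the small RNAs it carries and on to the loci those small RNAs regulate. The identifiers and relationships below are hypothetical placeholders, not real annotations or known targets.

```python
# Hypothetical relationships, for illustration only.
precursor_host = {"precursor_1": "locus_A"}              # locus hosting each precursor
precursor_products = {"precursor_1": ["small_rna_1"]}    # small RNAs processed from it
small_rna_targets = {"small_rna_1": ["locus_C", "locus_D"]}  # loci regulated in trans


def trans_affected_loci(precursor):
    """Loci other than the host that could be affected via small RNAs from `precursor`."""
    host = precursor_host.get(precursor)
    affected = set()
    for small_rna in precursor_products.get(precursor, []):
        affected.update(small_rna_targets.get(small_rna, []))
    affected.discard(host)  # report only effects outside the host locus
    return sorted(affected)


print(trans_affected_loci("precursor_1"))
```

A polymorphism overlapping the hypothetical precursor would thus point to candidate loci that differ from the locus in which it was found.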

Overall, these observations suggest that the identification of a sequence variant should not be the logical end point that automatically connects the locus that harbors it with a phenotype, but rather the beginning of a set of experimental procedures to unravel the effects of the variant. A necessary prerequisite for such experiments is unraveling the complexity of transcripts that either include the variant or originate nearby, because the variant could also affect a regulatory region of a novel transcriptional unit. Considering the vast number of unannotated transcripts present in a cell, it is important to characterize transcript complexity directly in the biological samples of interest, for example using RACE with oligonucleotides positioned in or around the polymorphism, rather than relying solely on the existing genomic annotations. One can envisage such analysis being followed by expression profiling to estimate the effects of a sequence variant on all transcripts with which it can be associated, including those that could connect it to distant regions in the genome. Such experiments could in turn be followed by direct perturbation of the candidate transcripts by knockdown or overexpression to estimate their contribution to a phenotype.
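As a rough sketch of the first step, the following example lists the transcripts, annotated or not, that either contain a variant in an exon or start shortly downstream of it. The transcript set, the coordinates and the 1,000-base upstream window are assumptions made for illustration only; transcripts are treated as transcribed from left to right, as in Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedTranscript:
    name: str
    exons: tuple     # (start, end) pairs sorted by start, 1-based inclusive
    loci: tuple      # annotated loci whose exons the transcript joins
    annotated: bool  # False for transcripts detected only experimentally


UPSTREAM_WINDOW = 1000  # assumed size of a putative regulatory region


def relation(tx, pos, window=UPSTREAM_WINDOW):
    """Classify `pos` relative to `tx` as 'exonic', 'upstream', 'intronic' or None."""
    tx_start, tx_end = tx.exons[0][0], tx.exons[-1][1]
    if any(start <= pos <= end for start, end in tx.exons):
        return "exonic"
    if tx_start - window <= pos < tx_start:
        return "upstream"
    if tx_start < pos < tx_end:
        return "intronic"
    return None


def candidates(transcripts, pos):
    """Transcripts that include `pos` in an exon or originate just downstream of it."""
    hits = []
    for tx in transcripts:
        rel = relation(tx, pos)
        if rel in ("exonic", "upstream"):
            hits.append((tx.name, rel, tx.loci, tx.annotated))
    return hits


# Hypothetical transcripts observed in the sample of interest.
observed = [
    ObservedTranscript("tx_annotated", ((5000, 5200), (6000, 6300)), ("locus_A",), True),
    ObservedTranscript("tx_novel_exon", ((5000, 5200), (5500, 5600), (6000, 6300)), ("locus_A",), False),
    ObservedTranscript("tx_chimeric", ((5700, 5800), (90000, 90200)), ("locus_A", "locus_B"), False),
]

for hit in candidates(observed, 5550):
    print(hit)
```

Here the variant at position 5550 is intronic with respect to the only annotated transcript, yet it falls in a novel exon of one unannotated transcript and upstream of a chimeric transcript that joins two loci, so both would be candidates for expression profiling and perturbation.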

In addition to aiding our interpretation of sequence polymorphism data, the wealth of novel transcripts found in the human genome, including the chimeric RNAs that connect distant regions in the genome, is largely unexplored territory for biomarker discovery. Unannotated transcripts tend to be cell-type-specific [3, 18] and thus should be attractive diagnostic molecules. The potential of non-coding RNAs as biomarkers has been shown by Reis et al. [19, 20]; however, this field remains mostly unexplored because of the emphasis on annotated protein-coding transcripts. Furthermore, novel protein-coding transcript isoforms, specifically those encoding proteins amenable to small molecule modulation, could be additional targets for small molecule therapeutics. In this respect, the high cell-type specificity of novel transcripts should provide an advantage: inhibition of a protein encoded by these transcripts is likely to be specific to a tissue or a cell type within a tissue, and is thus less likely to have side effects than inhibition of the annotated forms of these proteins, which are likely to be the most constitutive isoforms. This calls for a systematic analysis directed at obtaining the full transcript repertoire of such a 'druggable' transcriptome in a diverse set of cell types and tissues using highly sensitive technologies, for example RACEarray [2, 3, 21].