Introduction

Advances in large-scale genotyping and DNA sequencing have yielded unprecedented insights into human genomic diversity, and yet a large proportion of genetic risk factors for complex human diseases remains unknown. How can we shed light on the 'missing heritability' [1]? Whereas genetics has traditionally focused on nonsynonymous polymorphisms that alter the encoded amino acid sequence (coding single nucleotide polymorphisms (SNPs); the term 'SNP' is used here for all variants), the focus has now shifted to regulatory variants (rSNPs), which are likely to be more prevalent than coding SNPs. Suspected of being a primary driver of evolution [2-4], rSNPs can undergo positive selection, potentially reaching high frequency. Intense exploration of regulatory variants has been accelerated by new genomic technologies. Here, I discuss the findings of a recent genome-wide analysis of regulatory variation [5], which is among the largest of such studies conducted so far. In a broader context, I further assess new avenues that could lead to a better understanding of human health and disease.

Measuring cis- and trans-acting factors in mRNA expression

Several studies have used expression arrays to measure mRNA levels and coupled this with genome-wide SNP analyses, mostly in transformed lymphocytes. mRNA levels can then serve as quantitative phenotypes, and associations can be found with genomic regions (expression quantitative trait loci or eQTLs) that act either in cis or in trans, depending on whether the eQTL maps to the same gene as the measured mRNA or to another genomic region [6-10] (Figure 1). This approach reveals that mRNA expression is subject to pervasive genetic factors, which are mostly located in cis. On the other hand, if one measures allelic mRNA expression, any difference between expression from one allele and the other reveals the presence of cis-acting regulatory factors, and not trans-acting influences (Figure 1) [5, 11-13].
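The positional distinction between cis- and trans-acting eQTLs described above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from any of the cited studies: the names `Gene` and `classify_eqtl` are hypothetical, and the default 250 kb flanking window is an assumption borrowed from the fine-mapping window quoted later in this article for Ge et al. [5].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Genomic interval of the gene whose mRNA level is the phenotype (hypothetical type)."""
    chrom: str
    start: int
    end: int


def classify_eqtl(snp_chrom: str, snp_pos: int, gene: Gene, window: int = 250_000) -> str:
    """Label an eQTL 'cis' if the associated SNP lies on the same chromosome
    within `window` bp of the gene, and 'trans' otherwise.

    The window size is an assumption made for illustration only.
    """
    if snp_chrom != gene.chrom:
        return "trans"
    if gene.start - window <= snp_pos <= gene.end + window:
        return "cis"
    return "trans"


# Hypothetical coordinates, for illustration only.
example_gene = Gene("chr1", 1_000_000, 1_050_000)
label_near = classify_eqtl("chr1", 1_100_000, example_gene)   # within 250 kb of the gene
label_far = classify_eqtl("chr1", 5_000_000, example_gene)    # same chromosome, far away
label_other = classify_eqtl("chr7", 1_020_000, example_gene)  # different chromosome
```

A purely positional rule of this kind is only an operational definition; it says nothing about the mechanism, which is why allelic expression measurements are needed to confirm a cis-acting effect.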

Figure 1

Schematic representation of the detection of cis- and trans-regulatory variants and the type of polymorphisms involved in gene expression. eQTL mapping and expression arrays give information about cis- and trans-acting variants, and this can be compared with information from cis-eQTL mapping and AE measurements to determine which variants are cis-acting. These variants come in various forms, as shown at the bottom. To simplify, 'SNP' is taken here as representing all sequence variations; rSNPs affect transcription, and srSNPs (structural RNA SNPs) affect RNA processing and translation.

Ge et al. [5] measured genome-wide allelic expression (AE) differences on Illumina Human1M BeadChips in lymphoblastoid cells; they then compared these with allelic genomic DNA ratios to detect AE imbalance (AEI). Using multiple filters, they detected AE ratios of ± 0.05 deviation from unity, confirming pervasive cis regulation. The loci with AEI involved 30% of the measured RefSeq transcripts and extended to unannotated transcripts. Varying estimates of AEI prevalence are a result of different cutoff values for AE ratios, methodology, and numbers of individuals studied [11-13]. The simultaneous availability of genome-wide SNP analysis enabled further fine mapping of the cis-eQTLs, which showed that common SNPs accounted for 45% of the loci with AEI (when sequences up to 250 kb upstream and downstream were included) [5]. The authors demonstrated the utility of their results for finding disease-associated variants using the example of a region associated with systemic lupus erythematosus (SLE). Ge et al. [5] further compared the cis-eQTL loci detected using AE analysis with eQTLs obtained from mRNA expression arrays, and found a partial overlap. Differences between these two approaches are attributable to strong trans-acting factors (which can mask weaker cis effects), epigenetic events, and limitations of the AE analysis at individual SNPs (see below).
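The principle of comparing allelic RNA signals with allelic genomic DNA signals can be illustrated with a short Python sketch. It is a simplification under stated assumptions and does not reproduce the filtering pipeline of Ge et al. [5]: the function names are hypothetical, the inputs are assumed to be positive allele-specific signal intensities from one heterozygous individual, and the 0.05 default cutoff is used only because that figure is quoted above.

```python
def normalized_ae_ratio(rna_a: float, rna_b: float, dna_a: float, dna_b: float) -> float:
    """Allelic RNA ratio (allele A / allele B) divided by the allelic genomic DNA
    ratio measured at the same heterozygous SNP.

    Dividing by the DNA ratio corrects for assay bias between the two alleles,
    since genomic DNA carries both alleles in equal amounts in a heterozygote.
    """
    if min(rna_a, rna_b, dna_a, dna_b) <= 0:
        raise ValueError("allele signals must be positive")
    return (rna_a / rna_b) / (dna_a / dna_b)


def has_aei(ratio: float, cutoff: float = 0.05) -> bool:
    """Flag allelic expression imbalance when the normalized ratio deviates
    from unity by more than `cutoff` (an illustrative threshold)."""
    return abs(ratio - 1.0) > cutoff


# Hypothetical signal intensities, for illustration only.
imbalanced = normalized_ae_ratio(rna_a=600, rna_b=400, dna_a=500, dna_b=500)
balanced = normalized_ae_ratio(rna_a=510, rna_b=490, dna_a=505, dna_b=495)
```

As the text notes, the reported prevalence of AEI depends strongly on the chosen cutoff, so the threshold is exposed here as a parameter rather than fixed.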

The authors [5] concluded that cis-acting regulatory variants are frequent and could be used to clarify the genetic risk of complex disorders. To evaluate the potential of 'expression genetics', we must account for the complexity of transcription, mRNA processing, and translation; and we must ask what we can learn from AE assays at individual SNPs and what the limitations of this approach are.

Regulatory variants and the complexity of RNA transcripts

An allelic RNA expression imbalance measured at an individual SNP indicates the presence of a cis-regulatory process [14]. Epigenetic effects can account for AEI, for example through imprinting or the random monoallelic silencing that is observed for numerous genes in lymphoblastic cells [15], which are often highly clonal [16]; however, Ge et al. [5] suggest that epigenetic silencing occurs less frequently than previously thought in transformed B lymphocytes. Moreover, this phenomenon may be less prevalent in other (non-transformed) tissues [13]. Rather, AEI seems to arise mainly from cis-regulatory variants. However, the AE ratio measurements provide only a crude picture of a highly dynamic process from transcription to translation [14]. First, many genes have multiple transcription initiation sites, so that SNPs in the transcripts typically represent multiple species of RNA, each subject to distinct regulation. Second, docking sites for proteins and RNAs (such as microRNAs) can be affected, leading to altered (m)RNA processing, splicing, editing, polyadenylation, cellular trafficking, and the formation of non-colinear transcripts [17] or antisense RNAs [18]. Given that alternative splicing is a near universal phenomenon in human genes [19], AE analysis without separating the main RNA species at any given locus cannot provide a clear answer. Ge et al. [5] have addressed alternative splicing by analyzing windows of multiple SNPs across a gene locus, offering a broad, if incomplete, glimpse of alternative splicing genetics. However, this approach fails if a splice variant has similar turnover but distinct functions, or the spliced exon does not carry a polymorphism. AE analysis must be performed specifically for each splice variant, as demonstrated for the short and long mRNA isoforms of dopamine receptor D2 [20]. Two intronic SNPs were found to alter splicing and brain activity in vivo during cognitive processing in humans [20].
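A small worked example may help to show why AE measured at a marker SNP shared by several isoforms can hide isoform-specific imbalance. The read counts below are invented for illustration, and the isoform labels and function names are hypothetical; the sketch assumes only that allele-specific counts can be obtained separately for each isoform.

```python
# Hypothetical allele-specific counts (allele A, allele B) for two isoforms
# of one gene in a single heterozygous individual.
isoform_counts = {
    "long": (80, 40),   # allele A favored in the long isoform
    "short": (20, 60),  # allele B favored in the short isoform
}


def per_isoform_ratios(counts: dict) -> dict:
    """Allelic ratio (A / B) computed separately for each isoform."""
    return {name: a / b for name, (a, b) in counts.items()}


def pooled_ratio(counts: dict) -> float:
    """Allelic ratio seen at a marker SNP that is present in every isoform,
    so that counts from all isoforms are summed before the ratio is taken."""
    total_a = sum(a for a, _ in counts.values())
    total_b = sum(b for _, b in counts.values())
    return total_a / total_b


separate = per_isoform_ratios(isoform_counts)
pooled = pooled_ratio(isoform_counts)
```

With these numbers the long isoform shows a twofold excess of allele A and the short isoform a threefold excess of allele B, yet the pooled counts are 100 versus 100, so a shared marker SNP would report no imbalance at all.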

SNPs residing in transcribed RNAs have extensive potential to affect function, because the RNA transcript consists of a single-stranded nucleic acid, which folds onto itself to yield an assembly of structures that determine the RNA's biology. Over 90% of all SNPs alter RNA folding - a fact exploited in single-stranded conformational polymorphism (SSCP) SNP analysis - and thus have the potential to affect function [14]. We have named polymorphisms occurring in the RNA transcript 'structural RNA SNPs' (srSNPs) (Figure 1); this type of variant might be at least as prevalent as rSNPs [13]. Furthermore, synonymous SNPs located in protein-coding regions have been neglected as carriers of functional information; however, they can alter mRNA turnover, splicing, and translation, and are particularly adapted towards RNA folding structures that may have a role in evolution [21]. Increasing knowledge of transcript complexity has led to reassessment of the role of RNA variation in evolution and disease etiology.

Tissue selectivity of cis-regulatory variants

Ge et al. [5] found considerable overlap in AEI between lymphoblasts and a few tested primary cell lines of mesenchymal origin, whereas Dimas et al. [22] found from testing various blood cell types that 69 to 80% of cis-regulatory variants operate in a cell-type-specific manner. Tissue-specific enhancers determine selective expression for most genes [23] and, moreover, a large proportion of the machinery regulating transcription, mRNA processing, and translation differs from one tissue to the next. For example, a promoter SNP in VKORC1 (encoding vitamin K epoxide reductase complex subunit 1, the target of warfarin) affects expression only in the liver but not in the heart or lymphocytes [24]. Studying the TPH2 gene (encoding tryptophan hydroxylase 2, which is involved in serotonin biosynthesis) requires pontine tissues, in which the gene is actively transcribed before the protein is distributed throughout the brain [25]. Therefore, AE analysis must focus on relevant target tissues, whereas blood lymphocytes can serve as a surrogate only for a limited subset of genes.

The role of regulatory variants in evolution

Regulation of gene expression is now considered a primary driver of evolution [2-4]. The potential to alter gene expression only in specific target tissues imposes less constraint for developing new selectable traits. We must assume that positive selection to allele frequencies beyond those expected in a neutral model implies strong phenotypic penetrance associated with fitness, either of the individual or, more controversially, a group of individuals. When applied to humans, the concept of selection on a group includes cultural influence on human evolution and may involve 'balanced evolution', that is, the accumulation of high- and low-activity variants for key genes. Because such regulatory variants are linked to fitness rather than disease, it is not surprising that genome-wide association studies have failed to detect them. However, fitness genes can be a two-edged sword: for example, the activity of a gene product may be optimal for long life but not reproductive success. Similarly, fitness genes could conceivably contribute to disease risk if several interrelated genes have variants that cause a change in the same direction in any given individual. A disease association would become apparent only if interactions between several genes are considered. Knowing the functional variants is essential to tackle these complex interactions.

The way forward: how do we identify regulatory variants germane to fitness and disease?

The results of Ge et al. [5] significantly advance our understanding of cis-regulatory factors, and their possible role in heritability of complex disorders. We can now propose steps that are required to shed light on this hidden area.

First, AE should be measured for each transcript isoform, rather than at single marker SNPs that represent the mean of all isoform transcripts. Next-generation sequencing has the potential to provide this level of detail [9, 10]. Second, equal attention must be given to rSNPs and srSNPs; the latter affect mRNA processing and translation. Moreover, noncoding RNAs should be considered, as many hits from genome-wide association studies are in intergenic regions.

Because of the tissue selectivity of gene expression, the third step is that AE must be determined in relevant target tissues. Numerous tissue banks are available that provide human autopsy tissues from diseased subjects and controls that are suitable for AE analysis. Also, SNP scanning and subsequent molecular genetics studies are needed to identify the polymorphisms responsible for AEI. Knowing the main functional variants for a candidate gene greatly facilitates subsequent clinical association studies with accessible DNA samples. Furthermore, we should focus on genes that show positive selection in the human lineage, which indicates phenotypic penetrance. If multiple genes in a given pathway have frequent regulatory variants, appropriate multifactorial models should be tested for combined effects on fitness and disease.

Finally, drug targets presumably reside at critical intersections of protein networks, thereby altering the disease process. These targets should be revisited in order to check whether cis-regulatory factors have been overlooked. Polymorphisms in drug target genes often have a large effect on disease risk or treatment outcomes, which are the focus of pharmacogenomic studies.

Given the rapid advances in genomic technologies, these goals are achievable and promise breakthroughs in resolving complex disease risks, prevention strategies, and therapy outcomes.