Introduction

Understanding the mechanisms behind phenotypic variation is a key aim in human genetics, particularly in relation to disease and disease susceptibility. With approximately 1 difference per 1,000 nucleotides between any two human individuals chosen at random, deciphering which variants have function is a vital objective in modern-day genetic research. The availability of a much larger range of methodologies has led to a bias towards studies focused on coding sequence variation. The functioning of each gene, however, is determined not only by the protein itself but also by spatiotemporal expression patterns, with differences in the timing and levels of expression of genes potentially altering phenotype. Considerable differences in gene expression have been demonstrated both between individuals and between populations (Cheung et al. 2005; Stranger et al. 2005, 2007), and it has long been speculated that much of the phenotypic diversity within and between species is due to genetically determined differences in gene expression levels (King and Wilson 1975; Skelly et al. 2009).

Gene expression is controlled by genetic factors, acting both in cis and in trans, epigenetic factors and environmental influences (Stranger et al. 2007) with many complex heritable traits controlled by both cis and trans acting loci (Cheung et al. 2010; Wang et al. 2008). Mutations in cis-regulatory elements (including promoters, enhancers, silencers and insulators) can disrupt or enhance the binding of transcription factors and alter the state of gene expression of single genes. Transposable elements also appear to have influenced mammalian regulation with the creation of repeat associated binding sites (Bourque et al. 2008; Kunarso et al. 2010). Heterozygotes for regulatory SNPs display intermediate levels of expression and show allelic expression imbalance (AEI), one of the hallmarks of cis-acting variation (Fig. 1). Functional mutations in transcription factors (in trans), in contrast, are often pleiotropic because of modified binding to multiple transcription factor binding sites (TFBS), which may alter the expression of many genes. Trans-acting variants involved in gene expression operate equally on both chromosomes so no imbalance of allelic expression is seen (Fig. 1c).

Fig. 1

Heterozygous allelic imbalance through cis-acting regulatory variation. a A mutation in a proximal promoter may prevent transcription factor binding altering expression of the allelic transcript. Marker SNPs (shown here as black/white circles) can be used to determine transcript ratios. b A mutation in a distal enhancer site may prevent combinatorial binding and affect transcription levels. c A mutation in trans will affect both alleles equally and no AEI will be seen

TFBSs are usually 6–20 base pairs in length (Lapidot et al. 2008) and imprecise; one transcription factor may bind to a range of similar sequence TFBSs. Although these sites can differ, there are clear biases in the expectations of bases that appear at each binding site position, probabilities of which can be displayed using position weight matrices (PWMs) (Lapidot et al. 2008; Stormo 2000). Because of this flexibility in specificity, mutations in binding sites may not alter binding and thus have no phenotypic effect. In other instances, a nucleotide change within a binding site leads to an altered binding affinity or even the loss or gain of a TFBS (Doniger and Fay 2007), and it is these variations which have the potential to alter gene expression and consequently affect phenotype. Since transcription factors show different gene expression profiles in different cell types, nucleotide changes in binding sites will have different implications in different cells. This review will focus primarily on current understanding of the effects of cis-regulation on the human phenotype. Note that mutations which affect copy number, RNA stability or splicing also affect transcript levels in a cis-acting manner, but are not specifically considered in this review.
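The way a PWM separates tolerated from disruptive substitutions can be illustrated with a minimal sketch. The count matrix, the pseudocount, the uniform background frequency and the example sequences below are all invented for illustration and do not describe any real transcription factor; the scoring scheme (a sum of per-position log-odds) is one common convention among several.

```python
import math

# Hypothetical base counts at each position of a 6 bp binding site.
# These numbers are invented for illustration only.
COUNTS = [
    {"A": 18, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 0, "G": 19, "T": 1},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 16, "C": 2, "G": 1, "T": 1},
    {"A": 5, "C": 5, "G": 5, "T": 5},  # degenerate position: any base is tolerated
]


def build_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert base counts into log2 odds against a uniform background (assumed)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, site):
    """Sum the per-position log-odds for a candidate site of matching length."""
    if len(site) != len(pwm):
        raise ValueError("site length must match the PWM length")
    return sum(column[base] for column, base in zip(pwm, site.upper()))


pwm = build_pwm(COUNTS)
reference = "AGATAA"
core_variant = "ACATAA"   # substitution at a strongly constrained position
flank_variant = "AGATAC"  # substitution at the degenerate final position

for name, site in [("reference", reference),
                   ("core variant", core_variant),
                   ("flank variant", flank_variant)]:
    print(f"{name:14s} {site} score = {score(pwm, site):.2f}")
```

In this toy matrix the substitution at the constrained second position lowers the score markedly, whereas the substitution at the degenerate final position leaves it unchanged, mirroring the distinction drawn above between changes that alter binding and those that do not. A score difference of this kind is only a prediction and would still need experimental confirmation in a relevant cell type.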

Methods of detecting cis-acting variation

Until high-throughput methods for detecting gene expression were developed, techniques for the evaluation of gene expression were limited to single gene analyses. Semi-quantitative reverse transcription PCR and real time PCR were commonly used to quantify levels of expression and could be adapted to measure individual allelic transcripts of a gene by making use of marker SNPs located in the exons. The allelic transcripts could be distinguished by a variety of methods including direct sequencing and RFLP primer extension assays (Loh et al. 2010; Wang et al. 1995, 1998). By comparing the relative expression levels of allelic transcripts within a heterozygous sample, cis-acting differences, or ‘allelic expression’ (AE), can be measured while internally controlling for environmental and trans-acting factors (Pastinen and Hudson 2004; Verlaan et al. 2009). This approach also detects epigenetic effects such as imprinting (Pollard et al. 2008; Verlaan et al. 2009), although in this case analysis of parents would show exclusively maternal or paternal inheritance.
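The within-sample comparison of allelic transcripts can be sketched as a simple ratio calculation. The function names and signal values below are hypothetical, and normalising the cDNA ratio to the ratio obtained from genomic DNA of the same heterozygote (where the two alleles are present in equal amounts) is an assumption made here to correct for allele-specific assay bias, not a description of any one published protocol.

```python
import math


def allelic_ratio(signal_a, signal_b):
    """Ratio of the signals for allele A and allele B at a marker SNP."""
    if signal_a <= 0 or signal_b <= 0:
        raise ValueError("allele signals must be positive")
    return signal_a / signal_b


def normalised_log2_ae(cdna_a, cdna_b, gdna_a, gdna_b):
    """log2 of the cDNA allelic ratio divided by the genomic DNA allelic ratio.

    A value near 0 suggests balanced expression; a value of 1 means the
    A transcript is twice as abundant as the B transcript after correction.
    """
    return math.log2(allelic_ratio(cdna_a, cdna_b) / allelic_ratio(gdna_a, gdna_b))


# Invented peak heights for one heterozygous individual.
value = normalised_log2_ae(cdna_a=600.0, cdna_b=300.0, gdna_a=520.0, gdna_b=480.0)
print(f"normalised log2 allelic expression ratio: {value:.2f}")
```

Because both alleles are measured in the same RNA sample, environmental and trans-acting influences apply to both and cancel in the ratio, which is why a deviation from zero points towards a cis-acting (or epigenetic) cause.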

In the mid-1990s, it became possible to measure the expression levels of thousands of genes simultaneously, and various approaches have since been developed which utilise hybridisation or sequence-based methods with the capability of measuring complete transcriptomes (Bertone et al. 2004; Cheng et al. 2005). Microarrays exploit hybridisation methods and fluorescence, and commercially available kits using oligonucleotide or cDNA probes allow patterns of expression to be mapped, but measure both allelic transcripts of a gene together. However, by demonstrating association of SNPs near to the gene with expression levels (Stranger et al. 2005), it was possible to provide evidence of, and map, cis-acting regulatory loci. Such ‘eQTL mapping’ is a way of determining the relationship between the genome and transcriptome and has led to the observation that cis-eQTLs are a common cause of variation in gene expression (Stranger et al. 2005, 2007). The availability of bead chips has enabled genome-wide allelic expression studies by allowing the comparison of SNP allele ratios in expressed RNA transcripts normalized against genomic DNA heterozygote ratios (Ge et al. 2009).
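The association step in eQTL mapping can be illustrated with a minimal sketch that regresses expression on genotype at a single nearby SNP. The additive coding of genotypes as 0, 1 or 2 copies of one allele, the function name and the data are assumptions made for illustration; published analyses typically add covariates, significance testing and correction for the very large number of SNP-gene pairs tested.

```python
def linear_fit(genotypes, expression):
    """Ordinary least-squares fit of expression on allele dosage (0, 1 or 2).

    Returns (slope, intercept, r_squared).
    """
    n = len(genotypes)
    if n != len(expression) or n < 3:
        raise ValueError("need matched genotype and expression values (n >= 3)")
    mean_x = sum(genotypes) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("genotype or expression has no variance")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, (sxy ** 2) / (sxx * syy)


# Invented data: allele dosage at one SNP and normalised expression of a nearby gene.
genotypes = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
expression = [5.1, 4.8, 5.3, 6.0, 6.4, 5.9, 6.2, 7.1, 6.8, 7.4]

slope, intercept, r_squared = linear_fit(genotypes, expression)
print(f"effect per allele = {slope:.2f}, r^2 = {r_squared:.2f}")
```

A non-zero slope at a SNP close to the gene is consistent with a cis-eQTL, but because total expression is measured, it does not by itself exclude a trans-acting or environmental explanation in the way an allelic expression measurement does.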

The adaptation of next generation sequencing (NGS) methods to measure gene expression levels (and also to detect alternative splicing or novel transcripts) has been one of the most striking advances in transcriptomics in recent years. RNA-sequencing (RNA-Seq) of a population of RNA (total or fractionated) involves converting the RNA to cDNA fragments attached to adapters, followed by direct sequencing using NGS methods. The reads are then aligned or assembled to produce a map of the transcriptome and expression levels for each gene (Wang et al. 2009). RNA-Seq is able to describe previously unknown or novel sequences and discover new variants in transcribed sequences. Importantly, it is able to detect allele specific expression although, as with all AE methods that rely on heterozygous markers within exons, the number of SNPs within the coding regions and the number of individuals who are heterozygous for any particular gene transcript can be restrictive. A recent study that utilised unspliced primary transcripts as well as mRNA identified over 50% more genes showing AE differences than if exonic SNPs alone were used (Verlaan et al. 2009).
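One simple way to test for allele specific expression from RNA-Seq data is to ask whether the reads covering a heterozygous SNP depart from the 1:1 allelic ratio expected without cis-acting differences. The sketch below uses an exact two-sided binomial test; the function name and read counts are hypothetical, and the choice of test is an assumption made for illustration rather than the method of any study cited here. Real analyses generally also have to account for mapping bias towards the reference allele and for overdispersion of read counts.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes in n trials.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one. Intended for modest read depths (up to a few hundred reads).
    """
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    probabilities = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probabilities[k]
    # Small relative tolerance so that ties are not lost to rounding error.
    total = sum(pr for pr in probabilities if pr <= observed * (1 + 1e-9))
    return min(1.0, total)


# Invented read counts at one heterozygous exonic SNP.
ref_reads, alt_reads = 70, 30
p_value = binomial_two_sided_p(ref_reads, ref_reads + alt_reads)
print(f"ref fraction = {ref_reads / (ref_reads + alt_reads):.2f}, p = {p_value:.2e}")
```

The dependence on read depth at a heterozygous site is visible directly in such a test: with few reads even a strong imbalance cannot reach significance, which is one reason the number of informative SNPs and heterozygous individuals limits these methods.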

An important limitation in studying cis-regulation genome-wide is that analysis of the transcriptome is ultimately dependent on the cell type and developmental stage. So far the majority of genome wide studies have used a small number of cell lines, giving a very restrictive representation of tissue specific gene transcripts. Because of the density and coverage of the HapMap SNP data and DNA availability from the lymphoblastoid cell lines (LCL) for HapMap individuals, the majority of regulatory studies have been conducted in LCLs (see for example Ge et al. 2009; Stranger et al. 2005, 2007). Concerns have been raised about the use of LCLs in determining allelic expression, because of changes that may have occurred in cell culture (Gimelbrant et al. 2007; Pastinen et al. 2004), although it has since been reported that this is not a significant source of spurious AE association and that it is possible to correct for the confounding effect (Ge et al. 2009).

Where it is possible to compare allelic expression across cell types, further information on the complexity of the regulation of gene expression can be gained. Tissue specificity of AE has been documented in mice (Campbell et al. 2008) and differences in allelic expression have been demonstrated between human brain regions (Buonocore et al. 2010). Genome wide comparison of AE in human cell lines appears to show the same effect and indicates an enrichment of gene networks associated with immunological disease in LCLs, and musculoskeletal disease in human osteoblast cells with several of these genes implicated in multifactorial disease (Verlaan et al. 2009). A study of gene expression across three cell types estimated 69–80% of regulatory variants were cell type specific (Dimas et al. 2009) and recent work by Ernst and colleagues (using chromatin profiling across nine cell types) suggests disease associated SNPs are frequently found within enhancer elements and appear to be active in specific and relevant cell types (Ernst et al. 2011). Both of these studies clearly demonstrate the importance of examining expression across multiple cell types in future studies.

Locating putative functional elements and variants

In the absence of suitable cell lines, another method of identifying potential functional regulatory variants is to look at conservation across species (Goode et al. 2010; Lomelin et al. 2010). Goode and colleagues were able to show that a very large percentage of SNPs in conserved regions, identified by evolutionary rate profiling, were in non-coding regions under evolutionary constraint (Goode et al. 2010). By making use of a transgenic mouse embryo system, Pennacchio et al. further characterised some of the conserved sequences and used this information to rank and map potential enhancers across the human genome (Pennacchio et al. 2006).

However, as increasing evidence suggests that many cis-regulatory sequence regions (and many of the non-coding alleles under purifying selection (Asthana et al. 2007)) may be poorly conserved (Burton et al. 2007; Kunarso et al. 2010; Pennacchio and Visel 2010; Yokoyama et al. 2011), an alternative method is to identify regions of the genome with a high density of TFBSs (proposed to be more likely functional than regions with single TFBSs due to the formation of transcription factor complexes) (Yu et al. 2007; Zinzen et al. 2009). By selecting relevant or interacting TFs for a set of tissues, a bioinformatics study that followed this approach detected genes that appeared to be regulated by different TF clusters/complexes in different tissues and observed that conserved regulatory modules are more likely to regulate essential genes than non-conserved modules (Yu et al. 2007). The recent development of bioinformatics tools, such as COMPASSS (COMplex Pattern of Sequence Search Software) (Maccari et al. 2010), provides a further resource in the detection of putative functional cis-acting elements in the genome. As the empirical datasets with binding, sequence and expression data are rapidly expanding, it is likely that the sensitivity of these methods will increase and provide further insight into the complexities of cis-acting regulation.
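The idea of searching for regions with a high density of TFBSs can be sketched as a sliding window over the coordinates of predicted sites. The function name, window size, threshold and coordinates below are hypothetical choices made for illustration, and the sketch is not a description of how COMPASSS or the cited studies operate.

```python
def dense_windows(site_positions, window=200, min_sites=3):
    """Report runs of predicted binding sites that fall within one window.

    For each site, counts how many sites (including itself) lie less than
    `window` bp upstream of it, and returns (first_position, last_position,
    site_count) whenever the count reaches `min_sites`. Reported runs may
    overlap.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    positions = sorted(site_positions)
    clusters = []
    start = 0
    for end, position in enumerate(positions):
        # Advance the left edge until all sites fit within the window.
        while position - positions[start] >= window:
            start += 1
        count = end - start + 1
        if count >= min_sites:
            clusters.append((positions[start], position, count))
    return clusters


# Invented coordinates (bp) of predicted binding sites along one region.
predicted_sites = [100, 150, 180, 900, 2000, 2050, 2090, 2120]
for first, last, count in dense_windows(predicted_sites):
    print(f"{count} sites between {first} and {last}")
```

With these invented coordinates the isolated site at position 900 is ignored, while the two groups of closely spaced sites are reported as candidate regulatory modules. In practice the sites fed into such a scan would themselves come from motif predictions or binding data for transcription factors relevant to the tissue of interest.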

Methods to detect function of cis-acting regulatory regions

To identify or confirm particular functional elements in the genome, including promoters, enhancers, silencers and insulators, a number of empirical approaches can be taken. These range from methods involving transfection of reporter constructs to those examining properties of chromatin. For example, by immunoprecipitating from chromatin, ChIP-chip experiments are able to localize binding sites for any protein of interest. Approaches to identify active gene transcription include the Haplochip method, which exploits the fact that the amount of chromatin-bound active RNA polymerase II (Pol II) is related to the transcriptional activity of the corresponding gene, so that differences in active Pol II loading between the two alleles in a heterozygous sample provide a measure of allele-specific gene expression (Knight et al. 2003). However, to assess the effect of particular nucleotide substitutions it is more usual to conduct gel shift assays, often in combination with antibodies, to detect protein binding (for example Hultman et al. 2010), or transfections of constructs in which the relevant mutations have been introduced (for example Jensen et al. 2011; Troelsen et al. 2003). Both these methods have the limitation that they reflect effects in vitro which may not replicate well what happens in vivo, where for example different proportions of transcription factors may be present.

Cis-acting variation and evolution

The existence of conserved non-coding elements (CNE) has been helpful in identifying tissue specific enhancers (Lee et al. 2011; Pennacchio et al. 2006) as described above. However, regulatory elements may be lineage specific and also show evidence of rapid evolution in some groups, as described, for example, in Teleost fishes (Lee et al. 2011) where a high level of loss is associated with a whole genome duplication. It is now evident that cis-regulatory mutations have been important in the adaptive evolution of populations both in terms of speciation and environmental adaptations or adaptive divergence of particular species (Fay and Wittkopp 2008; Bourque et al. 2008; Kunarso et al. 2010; He et al. 2011; Lee et al. 2011; Yokoyama et al. 2011). It has been proposed that certain adaptations or traits, such as those involving immune responses, behaviour, reproduction and development, and also gain of function mutations, are more likely to occur through cis-regulatory mutations where transcription can be ‘fine-tuned’ to meet demands (Wray 2007). While a non-synonymous coding mutation usually affects the protein regardless of where or when it is expressed, a mutation in a cis-regulatory element has the potential to affect gene expression during a particular stage of development or in a specific cell type. Positive or adaptive selection is potentially able to operate more efficiently on cis-regulatory regions than coding regions because single nucleotide changes are less likely to have an all or nothing effect, while at the same time they are usually co-dominant (i.e. heterozygotes show an intermediate phenotype) and therefore directly available to natural selection in the presence of the ancestral allele (Wray 2007). Repeat associated binding sites often appear to be lineage specific, leading to the possibility that many binding sites have arisen fairly recently and rewired regulatory pathways.

Common cis-variation and selection in human evolution

It was first suggested some time ago that the differences between the proteins of the chimpanzee and human were insufficient to explain the differences in phenotypic characteristics and that differences in regulation could play an important role (King and Wilson 1975). One example of sequence changes in cis-regulatory regions in the human/chimpanzee divergence involves a 68 bp tandem repeat element 1,250 bp upstream from the start of transcription of PDYN (which encodes a neuropeptide with roles in cognition) for which there is evidence of function. Non-human primates have only one copy while human haplotypes contain between 1 and 4 copies (Rockman et al. 2005). Multiple other upstream polymorphisms are also thought to influence the expression of PDYN, some in a cell-specific or sex-specific manner (Babbitt et al. 2010). A selection of examples of polymorphic cis-regulatory variants with fairly certain phenotypic effect in humans are shown in Table 1. Several of these variations are involved in the inflammatory response, while others are involved in disease resistance or susceptibility and dietary adaptation.

Table 1 Examples of cis-acting functional variants thought to affect human phenotypic characteristics

Lactase persistence; an example of dietary adaptation through cis-acting regulatory polymorphisms

The ongoing expression of lactase into adulthood, in some humans but not others, is one of the classic examples of an environmental adaptation. The expression of lactase enzyme in the intestine is necessary for the breakdown of lactose, the main carbohydrate in milk. As milk is the primary source of nutrition for all newborn mammals, functional lactase enzyme is critical for survival (excluding the Pinnipedia, where milk has a high fat content and low lactose content) until other food sources can be consumed. Usually lactase is down-regulated after weaning (termed ‘lactase non-persistence’), although approximately 35% of the human population continue to produce lactase into adulthood (‘lactase persistence’) (Ingram et al. 2009a). The ability to use milk as a source of nutrition without digestive complications is thought to have put some people at a selective advantage, with the expansion of this genetic trait reflecting the onset of animal domestication and milking over the last 10,000 years (Ingram et al. 2009a). It was first directly demonstrated that the inter-individual differences in expression of lactase were cis-acting using allelic expression techniques (Wang et al. 1995, 1998). Extended sequencing identified a C/T SNP 13,910 base pairs upstream of the lactase gene (see Fig. 2) for which the T allele was 100% associated with lactase persistence in Finns (Enattah et al. 2002), and also directed differential levels of promoter construct expression (Lewinsky et al. 2005; Troelsen et al. 2003) in Caco-2 cells, a colon carcinoma cell line that expresses lactase. This allele is found at high frequency (in many European populations) on an exceptionally long haplotype, indicative of directional selection (Bersaglieri et al. 2004; Poulter et al. 2003).

Fig. 2

The lactase enhancer region. a The LCT enhancer is located upstream in intron 13 of MCM6. Multiple derived alleles are found clustered in this small region, indicated by yellow boxes. Dots above the variant sites show the 4 SNPs that have been associated with LP, or shown to have function so far. The position of this region upstream of LCT is shown above the sequence. b Known transcription factor binding sites (see Lewinsky et al. 2005; Jensen et al. 2011) in the enhancer region are shown in relation to the 4 LP associated SNPs (black dots) with approximate positions relative to LCT indicated

It was subsequently found however that this particular mutation was not responsible for the ongoing expression of lactase in all humans (Mulcare et al. 2004) and at least three further functional alleles (and likely more (Itan et al. 2010)) appear to have been selected independently (Enattah et al. 2008; Imtiaz et al. 2007; Ingram et al. 2007; Tishkoff et al. 2007). These SNPs are clustered in a region that acts as an enhancer for the lactase gene (Lewinsky et al. 2005) with several known transcription factor binding sites (Jensen et al. 2011; Lewinsky et al. 2005), outlined in Fig. 2. Interestingly, several of these SNPs can occur in a single ethnic group providing evidence in humans of a ‘soft selective sweep’ of the kind described below (Ingram et al. 2009b). Since Caco-2 is the only human cell line known to express significant levels of lactase, the genome wide expression studies so far have missed the LCT quantitative trait loci, which highlights the fact that there must be many more similar cis-acting variants to be discovered once more cell lines/tissues are used in genome wide expression studies.

Cis-regulation and selection for disease resistance, disease susceptibility and differing drug responses

Domestication also exposed populations to new pathogens and the resulting villages and towns facilitated the spread of diseases (Diamond 2002; Wolfe et al. 2007). Malaria is the disease for which there is the most evidence for human adaptation against infectious agents, in that a very large number of associations have been described with gene variants that confer resistance. Although many of these variants affect protein sequences, one of these mutations is a cis-acting regulatory mutation that disrupts the binding of the GATA1 transcription factor, preventing the expression of the Duffy blood group chemokine receptor (DARC) in erythrocytes (Tournamille et al. 1995). As the DARC protein is the usual point of entry for the malarial parasite Plasmodium vivax, individuals homozygous for the mutation are resistant to the disease, and in fact selection has been so strong that most individuals in the endemic areas are homozygous for the mutation. This particular example is illustrative of the ability of a cis-acting mutation to cause a change in gene expression in one cell type only without affecting expression levels elsewhere, as individuals with the mutation still express the DARC protein in other cell types (Chaudhuri et al. 1995) and show no adverse health effects.

Variation in the number of TA repeats in the TATA box for the enzyme UDP-glucuronosyltransferase 1A1 (UGT1A1), which catalyses the conjugation of glucuronic acid to bilirubin and also to a number of pharmaceutical drugs, has been shown directly by transfection studies to affect expression of UGT1A1, and is associated with bilirubin level as well as with drug sensitivity (Borlak and Klutcka 2004; Bosma et al. 1995). Interestingly, the geographic distribution of the low activity alleles (which are proposed to be protective against various infectious blood pathogens) in Africa suggests that this is another polymorphism that might have been under selection by malaria (Horsfall et al. 2011b). It is also an example of a polymorphism with both costs and benefits, since while bilirubin is potentially toxic, moderately high levels seem to be beneficial to respiratory and cardiac health (Horsfall et al. 2011a). There are several genes involved in the metabolism of drugs for which there is evidence for the effect of cis-variation in their expression (for two excellent reviews see Hines et al. 2008; Johnson et al. 2005). There are also many other examples of TATA box polymorphisms, some of which are involved in disease (see Savinkova et al. 2009).

Indeed there are now many examples in which promoter and other regulatory variants cause or modify Mendelian disease or disorders. Several examples come from the globin loci and have been reviewed by others (see Sankaran et al. 2010). One example of particular interest involves enhancer mutations in a cis-regulatory element upstream of the SHH gene that lead to aberrant developmental expression and incorrect limb bud formation resulting in extra digits in humans and certain other species. Several different causal mutations are located close together in this long-range (1 Mb upstream) enhancer, which is, as in the case of lactase, in an intron of another gene (Gurnett et al. 2007; Lettice et al. 2003).

Genome-wide association studies and cis-acting regulation

Genome-wide association studies to find the genetic components of multifactorial disease are able to locate regions along the genome that contain variants that underlie diseases or phenotype, and it is now evident that many of these will be regulatory. Recent developments are providing more precision in mapping of the causal loci (Andrew et al. 2008), but at present bioinformatics methods lack the power to identify the likely functional variants or the mechanism behind the differences in gene expression levels. With the complication of different transcription factor complexes initiating transcription in different cell types, it is currently not possible to predict whether a non-coding variant will be functional without experimental evidence in an appropriate environment. However, identification of variants in binding sites that are likely to alter gene expression and cause disease is becoming more attainable on a genome-wide scale, with the use of computational prediction methods which depend on accumulation of published experimental data (Lapidot et al. 2008). For example, in S. cerevisiae, it appears that not all nucleotide substitutions are equal; a substitution involving guanine appears to have more of an effect than a substitution of adenine (Lapidot et al. 2008). Although this effect has not yet been shown in the human genome, it may be that similar computational methods will eventually be able to predict the variants in regulatory elements that are more likely to alter transcription in humans and thus cause disease or disease susceptibility. It may also be the case that multiple independent mutations (perhaps closely located, as in the case of the enhancer polymorphisms of SHH and LCT) influence disease susceptibility and resistance and are potentially good candidates for multifactorial disease. These are characteristically harder to identify than single mutations because they may not be efficiently tagged by the SNPs tested, and this could account for some of the ‘missing heritability’ in genome wide association studies.

Since regulatory elements interact with trans-acting factors, are often the target of environmental response and can by definition be regulated, there is potential for manipulating gene expression as a method of disease prevention or therapy. A good example of this comes from the haemoglobins, in which attempts have been made to up-regulate fetal haemoglobin (HBF) to compensate for the defective adult haemoglobin in haemoglobinopathies, as reviewed by Sankaran et al. (2010). The switch from fetal to adult globin expression is under relatively tight developmental control, for which many of the cis- and trans-acting interactions are now known, but shows inter-individual differences, as well as non-genetic alterations (e.g. stress response). With the understanding of the key players involved in this regulation it is now possible to work on and refine methods of manipulating the expression of HBF.

Since changes in gene expression are an important source of evolutionary adaptation, Kudaravalli and colleagues combined eQTL data from HapMap lymphoblastoid cell lines with a haplotype based method for detecting signals of selection (Kudaravalli et al. 2009). They found a strong overlap between signals, particularly for genes known to be associated with diseases with immunological involvement, such as susceptibility to HIV infection, as might be expected for LCLs. Several recent genome wide disease association studies have attempted to overlap GWA signals with eQTLs or variants that may be involved in the regulation of genes (either through imprinting or allelic variability) (Heid et al. 2010; Kong et al. 2009; Nica et al. 2010; Voight et al. 2010), and statistical methods that take linkage disequilibrium (LD) into account are being developed to make these matches more robust. However, further studies to determine function are necessary, and analysis of data from comparable studies in multiple cell lines may provide further evidence on the function of cis-regulatory variants under selection or those associated with disease.

Parallel evolution due to cis-acting variation: soft selective sweeps

Care must be taken not to over-simplify the characteristics of selection (Przeworski et al. 2005), and it has recently been proposed that hard sweeps have been too infrequent in the human population over the last 250,000 years to have been responsible for much human genetic adaptation (Hernandez et al. 2011). Populations can adapt to new environmental pressures via multiple mutations with similar phenotypic effect that arise independently. If several mutations with similar effect are selected in parallel, by what has been termed a ‘soft selective sweep’ (Hermisson and Pennings 2005; Pennings and Hermisson 2006a, b), they may, for example, all reach intermediate frequencies in one population or go to fixation in different populations. Soft selective sweeps can arise from standing variation, from new mutations or from mutations introduced by migration, and as human population densities have increased over time, so has the likelihood of parallel adaptation occurring (Ralph and Coop 2010). It has also been noted that a number of more recent human adaptations have involved repeated mutations at small mutational target sites and often occur in relatively small geographic areas, while older adaptations from single changes are more widely spread (Ralph and Coop 2010).

Changes to the pigmentation in fruit flies (Gompel et al. 2005; Jeong et al. 2008), pelvic reduction in freshwater sticklebacks (Chan et al. 2010) and lactase persistence in humans (Enattah et al. 2008; Ingram et al. 2007, 2009b; Tishkoff et al. 2007) are all well described cases of parallel evolution where multiple independent cis-regulatory changes are responsible for a convergent phenotype. Currently, most standard tests for selection in humans are based on the ‘hard sweep model’ (Smith and Haigh 1974) where one allele under selection rises quickly in frequency and is consequently found at high frequency on an extended haplotypic background (Sabeti et al. 2002, 2007; Voight et al. 2006; Zhang et al. 2006). These tests may miss the diversity characteristic of soft sweeps in populations where there are multiple alleles with the same functional effect which are likely to occur on distinct haplotype backgrounds. Furthermore recent selection may have increased the allele frequency of older ‘standing’ variation which will also have greater haplotype diversity (Ralph and Coop 2010).

The frequency and significance of cis-variation in humans

Many recent developments have enabled a rapid accumulation of regulatory data in this fast advancing field. We now know that human genomes contain thousands of cis-regulatory variants (Rockman and Wray 2002) and considerable differences in gene expression have been demonstrated between individuals and populations (Cheung et al. 2005; Stranger et al. 2007). For example the work of Stranger et al. (2007) revealed 831 genes that displayed a significant cis association in lymphoblastoid cells alone, and the work of Ge and colleagues showed the phenomenon to be extremely frequent and found several thousand ‘windows’ of imbalance of allelic expression (Ge et al. 2009). By considering genome wide SNP frequency, distributions in non-coding DNA, and patterns of conservation it is now predicted that as much as 90% of the important functional variation in the human genome may be regulatory (Goode et al. 2010). Cis-regulatory variants, which have limited pleiotropic effects, are less likely to be deleterious than those acting in trans and are thus more likely to be favoured in evolutionary adaptation. It is tempting to speculate that because of the combinatorial nature of cis-acting regulatory elements there will be many more examples of soft selective sweeps, and it will be important to develop techniques for their detection, since they may otherwise be missed in genome wide association studies. Some of these regulatory variants may have reached high frequencies as a result of selection in our past, but may now be responsible for present day disease susceptibilities.