- Cite this article as:
- Osbourn, A.E. & Field, B. Cell. Mol. Life Sci. (2009) 66: 3755. doi:10.1007/s00018-009-0114-3
Operons (clusters of co-regulated genes with related functions) are common features of bacterial genomes. More recently, functional gene clustering has been reported in eukaryotes, from yeasts to filamentous fungi, plants, and animals. Gene clusters can consist of paralogous genes that have most likely arisen by gene duplication. However, there are now many examples of eukaryotic gene clusters that contain functionally related but non-homologous genes and that represent functional gene organizations with operon-like features (physical clustering and co-regulation). These include gene clusters for use of different carbon and nitrogen sources in yeasts, for production of antibiotics, toxins, and virulence determinants in filamentous fungi, for production of defense compounds in plants, and for innate and adaptive immunity in animals (the major histocompatibility locus). The aim of this article is to review features of functional gene clusters in prokaryotes and eukaryotes and the significance of clustering for effective function.
Keywords: Metabolism; Natural products; Antibiotics; Pathogens; Defense; Chromatin; Development; Innate and adaptive immunity
Operons (clusters of co-regulated genes with related functions) are a well-known feature of prokaryotic genomes. Archaeal and bacterial genomes generally contain a small number of highly conserved operons and a much larger number of unique or rare ones. Functional gene clustering also occurs in eukaryotes, from yeasts to filamentous fungi, mammals, nematodes, and plants. The members of these eukaryotic gene clusters contribute to a common function but do not usually share sequence similarity. These gene clusters therefore represent functional gene organizations with operon-like features (physical clustering and co-regulation), although the genes are not usually transcribed as a single mRNA as is the case in prokaryotes. This article reviews facets of genome organization in prokaryotes and eukaryotes that are of relevance for understanding the significance of the establishment, maintenance, and dissipation of functional gene clusters and the evolutionary forces that shape genome architecture.
Although the lacI repressor gene is able to regulate expression of the lacZYA genes when placed anywhere in the chromosome, this gene is normally positioned immediately upstream of lacZYA. Co-localization of the regulatory gene with the genes that it regulates has been suggested to be a “selfish” property of the lac gene cluster, since such organization may increase the fitness of this group of genes by allowing horizontal as well as vertical inheritance [10, 11]. Operons are certainly known to undergo horizontal gene transfer (HGT) [10, 12, 13]. However, essential genes are commonly found in operons, as are other genes that are not known to be transmitted by HGT, providing evidence against the selfish operon theory [14, 15]. The selfish cluster theory also does not explain the many operons that contain functionally unrelated genes . The emergence of new operons will require not only the coming together of genes but also the establishment and maintenance of co-ordinate regulation of these genes. A more likely explanation for the existence of operons is concerned with regulation.
The regulatory model provides several potential explanations for the existence of operons . It has been argued that co-regulation could be evolved more easily by modifying two independent promoters than by placing two genes in proximity . A counter-argument to this is that, for complex regulation, an operon with one complex promoter might be expected to arise more readily than would two independent complex promoters . The dependence of several genes on a single regulatory sequence is expected to put this sequence under stronger selection, so allowing for the emergence of more complex regulatory strategies . The observation that operons do tend to have more complex conserved regulatory sequences than individually transcribed genes is consistent with this hypothesis [9, 12, 16]. Genes within new operons are significantly less likely to be optimally spaced when compared to old operons, which is consistent with the idea that canonical spacings form by deletion after the operon has already formed . In laboratory experiments, the expression level of the lac operon evolves to optimality in a few hundred generations . Thus, changes in operon spacing could reflect fine tuning of expression levels.
Transcription of genes in a single transcript is expected to diminish gene expression noise and ensure more precise stoichiometry. The most highly conserved operons do tend to code for components of protein complexes [9, 14, 18, 19]. However, there are many examples of operons that do not code for protein complexes. In addition, only a few percent of known protein–protein interactions involve genes encoded by the same operon. Furthermore, the optimal expression level of each gene is not the same for all genes in an operon.
Rapid and reliable gene regulation may require the transcription factor gene to be in close proximity to the sites within the genome to which it binds (the rapid search hypothesis [2–24]). In prokaryotes, transcription and translation are coupled spatially and temporally. Such organization would therefore provide for synthesis of transcription factors near the genes that they regulate, enabling rapid binding of co-localized sites and tight co-ordinate regulation. Computer simulations have been used to address the relationship between genome organization and biophysical constraints and have provided evidence to support the rapid search hypothesis. Transcription–translational coupling may also facilitate co-ordinate downstream processes such as assembly of protein complexes through co-translational folding, and cell compartmentalization.
Because few operons are conserved across all or even most bacteria, it is clear that after operons form many of them “die”. Operons could be lost by the deletion of one or more genes or alternatively by splitting the operon apart. Since operon formation often brings functionally related genes together, it seems unlikely to be a neutral process. If operon formation is driven by gene expression, then it should be associated with changes in the expression patterns of the constituent genes. Studies of the expression patterns of genes in E. coli operons and of orthologous genes in “not-yet” operons in the related bacterium Shewanella oneidensis MR-1 have provided compelling evidence that operon formation has a major effect on gene expression patterns. Similarly, “dead” operons are significantly less co-expressed than live operons but significantly more co-expressed than random pairs of genes. Thus, operon destruction also has a major effect on gene expression patterns, but it does not entirely eliminate the similarity of expression. The turnover of operon structure may accompany switching between constitutive and inducible expression. Although constitutive expression may seem wasteful and hence deleterious, favorable gene combinations cannot be selected for unless the genes are expressed. Constitutive gene expression, even at very low levels, would logically be expected to be necessary to enable the selection-mediated garnering of beneficial combinations of genes into operons with subsequent fine tuning of expression. While genes within the same operon are under much stronger selection to remain together than genes that are in different operons [28, 29], there is also evidence of weaker selection for high level interactions between operons, some of which are so widely conserved that they are known as super- or uberoperons [21, 30].
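The comparison of co-expression between live operons, “dead” operons, and random gene pairs can be illustrated with a minimal sketch. It assumes that co-expression is scored as the mean pairwise Pearson correlation of expression profiles, which is one common choice and not necessarily the measure used in the studies cited above. The gene names, expression values, and the helper `mean_coexpression` are hypothetical and serve only as illustration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("constant profile: correlation is undefined")
    return cov / sqrt(var_x * var_y)


def mean_coexpression(profiles, groups):
    """Mean correlation over all gene pairs within each group of genes."""
    scores = [
        pearson(profiles[a], profiles[b])
        for group in groups
        for a, b in combinations(group, 2)
    ]
    return mean(scores)


# Hypothetical expression values across five conditions (not real data).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],
    "geneC": [5.0, 3.0, 4.0, 1.0, 2.0],
}

# One "operon" pair and one unrelated pair, purely for illustration.
operon_score = mean_coexpression(profiles, [["geneA", "geneB"]])
unrelated_score = mean_coexpression(profiles, [["geneA", "geneC"]])
print(operon_score, unrelated_score)
```

In practice each class (live operons, dead operons, random pairs) would contain many groups, and the distributions of scores would be compared statistically instead of through a single mean.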
Secondary metabolism in filamentous fungi
Filamentous fungi produce a huge array of secondary metabolites (sometimes also referred to as natural products). These compounds are commonly synthesized by groups of genes that form metabolic gene clusters [32–34]. The genes within these clusters are physically linked and co-regulated, but unlike bacterial operons they are not transcribed as a single mRNA. Nevertheless, these fungal gene clusters represent functional gene organizations with operon-like features (physical clustering and co-regulation). Examples include gene clusters for important pharmaceuticals (such as the β-lactam antibiotics penicillin and cephalosporin), the anti-hypercholesterolaemic agent lovastatin, ergopeptines, and carcinogenic toxins (aflatoxin and sterigmatocystin). Secondary metabolites are also important in mediating interactions between fungal pathogens and their hosts. For example, an unknown metabolite synthesized by the ACE1 gene cluster in the rice blast fungus Magnaporthe grisea is involved in recognition of particular rice cultivars [35, 36], and the host selective HC-toxin produced by the corn pathogen Cochliobolus carbonum is critical for the ability to cause disease on cultivars that have the Hm resistance gene [37–39]. Such gene clusters generally involve “signature” secondary metabolic genes such as non-ribosomal peptide synthetase (NRPS), type I polyketide synthase (PKS), terpene cyclase (TS), and dimethylallyl tryptophan synthetase (DMATS) genes, for the synthesis of non-ribosomal peptides, polyketides, terpenes, and indole alkaloids, respectively. These signature genes are clustered along with various combinations of genes for further metabolite elaboration (e.g., oxidoreductases, methylases, acetylases, esterases), transporters, and sometimes (but not always) regulators [32, 33, 40].
While the first fungal gene clusters were identified using a combination of genetic, molecular genetic and biochemical approaches, the advent of fungal genome sequencing has enabled facile discovery of new candidate secondary metabolic gene clusters based on genome browsing and co-expression analysis [32–34, 40].
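The “genome browsing” step can be sketched as a scan along an annotated chromosome for signature genes that are flanked by genes for metabolite elaboration, transport, or regulation. This is a simplified illustration under stated assumptions, not a description of any published tool: the annotation labels, the window size `flank`, the threshold `min_tailoring`, and the function `candidate_clusters` are all hypothetical.

```python
# Gene classes taken from the text; the label strings themselves are assumed.
SIGNATURE = {"NRPS", "PKS", "TS", "DMATS"}
TAILORING = {
    "oxidoreductase", "methylase", "acetylase",
    "esterase", "transporter", "regulator",
}


def candidate_clusters(genes, flank=3, min_tailoring=2):
    """Flag signature genes with enough tailoring-type neighbours.

    genes: list of (gene_id, annotation) tuples in chromosomal order.
    Returns (signature_gene_id, start_index, end_index) for each window,
    where the window is genes[start_index:end_index].
    """
    hits = []
    for i, (gene_id, kind) in enumerate(genes):
        if kind not in SIGNATURE:
            continue
        start = max(0, i - flank)
        end = min(len(genes), i + flank + 1)
        n_tailoring = sum(1 for _, k in genes[start:end] if k in TAILORING)
        if n_tailoring >= min_tailoring:
            hits.append((gene_id, start, end))
    return hits


# Hypothetical gene order along one chromosome (not real annotations).
genes = [
    ("g1", "other"), ("g2", "oxidoreductase"), ("g3", "PKS"),
    ("g4", "methylase"), ("g5", "transporter"), ("g6", "other"),
    ("g7", "other"), ("g8", "other"), ("g9", "other"),
    ("g10", "NRPS"), ("g11", "other"),
]

hits = candidate_clusters(genes, flank=2, min_tailoring=2)
print(hits)
```

Here the PKS gene `g3` is flagged because three tailoring-type genes fall within its window, whereas the isolated NRPS gene `g10` is not. Candidates found this way would still need support from co-expression data and functional analysis.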
Many fungal secondary metabolites have important biological properties such as antibacterial, antifungal, or antitumor activity . The exploitation of fungal metabolites for pharmaceutical purposes was pioneered by the discovery of penicillin in 1929 and its subsequent development for large-scale use as an antibiotic during the Second World War. The fungal natural product repertoire has since been the subject of extensive screening programmes for drug discovery . Although some secondary metabolites have been shown to be important in mediating interactions between fungal pathogens and their hosts [33, 36, 37, 42, 43], in many cases the biological significance of these compounds for the producing fungus is unknown. The production of different types of compound is often restricted to particular fungal lineages, and the most likely function of these compounds is in niche adaptation and survival. Many filamentous fungi live saprophytically in the soil where they are exposed to a diverse range of organisms. Secondary metabolites may therefore act as mediators of interactions within soil communities. Secondary metabolite production has recently been shown to protect Aspergillus nidulans from predation by an arthropod . In a more complex three-way interaction, colonization of perennial ryegrass by secondary metabolite-producing fungal endophytes has been shown to confer protection against insect herbivory .
Secondary metabolite gene clusters in filamentous fungi commonly contain a gene for a pathway-specific transcription factor that positively regulates expression of the associated biosynthetic genes, each of which has its own promoter. These transcription factors are often Zn(II)2Cys6 zinc binuclear cluster proteins, a class of protein so far found only in fungi. One such example is AflR, which is required for aflatoxin and sterigmatocystin biosynthesis in Aspergilli [46, 47]. Other transcription factors that are encoded in biosynthetic gene clusters include Cys2His2 zinc-finger proteins (Tri6 and MRT16 for trichothecene production) and an ankyrin repeat protein (ToxE for HC-toxin production). Transcription factor genes for synthesis of ergovaline and lolitrem in the endophytic fungi Neotyphodium lolii and Epichloë festucae do not reside within the cognate biosynthetic gene clusters and are as yet unidentified [49–52]. In these latter cases, identification of pathway regulators will necessarily rely on forward screens for mutants with altered regulation of expression of the pathway, on reverse approaches that involve searching for regulatory proteins that bind to promoters of characterized biosynthetic genes, and/or on genome-wide co-expression analysis.
There is evidence to suggest HGT of the penicillin gene cluster from bacteria to fungi [65–67]. HGT of clusters between fungi may in part explain the discontinuous distribution of gene clusters for synthesis of secondary metabolites within the Ascomycetes [68, 69]. The selfish cluster hypothesis has been put forward to explain clustering of functionally related genes in fungi. However, fungal genomes are very plastic, and it is likely that the formation and maintenance of metabolic gene clusters in fungal genomes is driven by selection for optimized production of metabolites that fulfil an adaptive function. Many fungal metabolic gene clusters are located close to telomeres, a chromosomal location that would be expected to facilitate recombination, DNA inversions, partial deletions, translocations, and other genomic rearrangements [70–73]. Intragenomic reorganization followed by vertical descent is therefore a more satisfactory explanation. Clustering may facilitate co-regulation of gene expression, although it is clearly not a prerequisite for this since expression of unlinked genes for other metabolic pathways can be readily co-regulated. As the number of complete genome sequences of filamentous fungi increases, it should become possible to elucidate and perhaps model the mechanisms that drive cluster formation and maintenance, following approaches similar to those used to study the life and death of bacterial operons. While this section has focused on gene clusters for the synthesis of secondary metabolites in filamentous fungi, it is noteworthy that clusters of diverse virulence genes with no obvious function in metabolism have recently been identified in the corn smut fungus Ustilago maydis following completion of the full genome sequence of this organism.
Metabolic gene clusters in yeast
Baker’s yeast Saccharomyces cerevisiae, unlike filamentous fungi, does not produce arrays of diverse secondary metabolites. Clusters of genes of related function are relatively unusual in S. cerevisiae by comparison with filamentous fungi, presumably for this reason. However, the S. cerevisiae genome does contain several functional gene clusters that are required for growth under certain conditions. These include gene clusters for utilization of specific carbon sources [e.g., the galactose (GAL) gene cluster], the DAL gene cluster for use of allantoin as a nitrogen source, and gene clusters for biotin synthesis, vitamin B1/B6 metabolism, and for arsenic resistance [75–77]. Studies of the distribution, origin, and fate of these gene clusters have provided important insights into the mechanisms underpinning adaptation of yeasts to new ecological niches.
The selection for formation of new metabolic gene clusters such as the DAL gene cluster is likely to be intense, driven by the need to adapt to growth under different environmental conditions. Gene clusters that have been formed by epistatic selection are expected to be recombination cold spots and so to be in linkage disequilibrium , and this is indeed the case for the DAL gene cluster . Epistatic selection for linkage may in addition be driven by the need to select for combinations of alleles that interact well in order to avoid the accumulation of toxic pathway intermediates within cells. For example, glyoxylate, which is an intermediate in the DAL pathway (Fig. 4), is toxic to yeast . Glyoxylate is produced by the Dal3 reaction and removed by the Dal7 reaction. There may therefore be selection for alleles of DAL3 and DAL7 that interact well and facilitate metabolic channeling. The finding that Dal3 enzyme activity is reduced in a dal7 mutant is consistent with this channeling hypothesis [76, 82].
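Linkage disequilibrium between two loci, such as two genes within a cluster, is commonly summarized by the coefficient D and the squared correlation r² computed from haplotype counts. The sketch below uses these standard definitions; the allele labels (A/a at one locus, B/b at the other) and the counts are hypothetical and are not data for the DAL cluster.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) for two biallelic loci from haplotype counts.

    D   = p_AB - p_A * p_B
    r^2 = D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / total   # frequency of allele B at locus 2
    d = p_AB - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("a locus is monomorphic: r^2 is undefined")
    return d, d * d / denom


# Hypothetical haplotype counts (not real data).
d_linked, r2_linked = linkage_disequilibrium(40, 10, 10, 40)
d_free, r2_free = linkage_disequilibrium(25, 25, 25, 25)
print(d_linked, r2_linked, d_free, r2_free)
```

In the first example the AB and ab haplotypes are over-represented, giving D = 0.15 and r² = 0.36; in the second all four haplotypes are equally frequent and both statistics are zero, as expected for freely recombining loci.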
Under new selection regimes, adaptations may evolve while established functions may become less important. The GAL genes, which are required for galactose utilization, are clustered in the genomes of every yeast species in which they are present . This pathway converts galactose into glucose-6-phosphate, a substrate for glycolysis. Galactose utilization is widespread amongst yeasts and is likely to be ancestral. However, several yeast species have lost the ability to use this carbon source. Comparisons of the genomes of galactose-utilizing and non-utilizing yeast species have revealed that three out of the four non-utilizing species examined lack any trace of the pathway except for a single gene. However, S. kudriavzevii, a close relative of S. cerevisiae, retains remnants of all seven dedicated GAL genes as syntenic pseudogenes, providing a rare glimpse of an entire pathway in the process of degradation . Thus, whilst a newly formed functional gene cluster confers a selective advantage in a new ecological niche, rapid and irreversible gene inactivation and pathway degeneration can occur under non-selective conditions. It has been suggested for S. kudriavzevii that this change may be associated with adaptation to growth on decaying leaves and soil rather than on sugar-rich substrates . The loss of genes and pathways through reductive evolution has been inferred for many organisms that have adapted to pathogenic or endosymbiotic lifestyles [85–92]. Adaptation to a new niche has been shown to result in a “cost” in terms of lost ancestral capabilities. These capabilities may be lost either because they are no longer under selection (neutral) or because of a deleterious effect on fitness in a new niche [75, 93–95].
Operon-like gene clusters in plants
Genes for metabolic pathways in plants are generally not clustered, at least for the majority of the pathways that have been characterized in detail to date. However, several examples of functional gene clusters for plant metabolic pathways have recently emerged. These are the cyclic hydroxamic acid (DIBOA) pathway in maize [96–98], triterpene biosynthetic gene clusters in oat [99, 100] and Arabidopsis  (the avenacin and thalianol gene clusters, respectively), and the diterpenoid momilactone cluster in rice [102, 103]. These gene clusters all appear to have been assembled from plant genes by gene duplication, acquisition of new function, and genome reorganization and are not likely to be a consequence of horizontal gene transfer from microbes. The existence of these clusters, of which at least three are implicated in plant defense [98, 99, 102–104], implies that plant genomes are able to assemble functional gene clusters that confer an adaptive advantage. The selection for rapid and recent formation of such metabolic gene clusters is likely to be intense, driven by the need to adapt to growth under different environmental conditions, and implies remarkable genome plasticity.
The benzoxazinoids are defense-related compounds that occur constitutively as glucosides in certain members of the Gramineae and in some dicots. 2,4-Dihydroxy-1,4-benzoxazin-3-one (DIBOA) is the primary hydroxamic acid in rye while its methoxy derivative 2,4-dihydroxy-7-methoxy-1,4-benzoxazin-3-one (DIMBOA) is predominant in maize and wheat [105–108]. In the Poaceae, the production of benzoxazinoids is developmentally regulated with highest levels being found in the roots and shoots of young seedlings. The glucosides are hydrolyzed in response to infection or physical damage to produce DIBOA and DIMBOA, which are antimicrobial and also have pesticidal and allelopathic activity. Induction of benzoxazinoid accumulation has also been reported in response to cis-jasmone treatment .
The complete molecular pathway for benzoxazinoid biosynthesis has been elucidated in maize (reviewed in ). The first committed step towards DIBOA and DIMBOA biosynthesis is the conversion of indole-3-glycerol phosphate to indole, which is catalyzed by the tryptophan synthase α (TSA) homologue BX1. Bx1 is likely to have been recruited from primary metabolism either directly or indirectly by duplication of the maize gene encoding TSA. BX1 and TSA are both chloroplast-localized indole-3-glycerol phosphate lyases (IGLs). BX1 functions as a monomer and produces free indole, while TSA forms a complex with the β-subunit of tryptophan synthase TSB to convert indole-3-glycerol phosphate to tryptophan [97, 110]. The subsequent conversion of indole into DIBOA is catalyzed by four related but highly substrate-specific cytochrome P450s (BX2-5). The glucosyltransferases BX8 and BX9 catalyze glucosylation of benzoxazinoids. The glucosides of DIBOA and DIMBOA have reduced chemical reactivity when compared to the aglycones, suggesting that glucosylation may reduce phytotoxicity and so be important for storage. Glucosylation takes place prior to hydroxylation by the 2-oxoglutarate dioxygenase (2-ODD) BX6 and O-methylation by O-methyltransferase (OMT) BX7. All the Bx genes with the exception of Bx9 are linked within 6 cM of Bx1 on maize chromosome 4 [97, 111].
The distribution of benzoxazinoids across the Gramineae is sporadic. Maize, wheat, rye, and certain wild barley species are capable of the synthesis of these compounds while oats, rice, and cultivated barley varieties are not [106, 108]. The pathway to DIBOA is conserved in maize, wheat, and wild barley [97, 114–117]. The Bx gene cluster is believed to be of ancient origin. Wheat and rye have undergone a shared genomic event that has led to the splitting of the Bx gene cluster into two parts that are located on different chromosomes. This can be explained by a reciprocal translocation in the ancestor of wheat and rye . Bx-deficient variants of a diploid accession of wild wheat Triticum boeoticum have recently been identified. Molecular characterization suggests that Bx deficiency in these accessions arose by disintegration of the Bx1 coding sequence, followed by degeneration and loss of all five Bx biosynthetic genes examined . Barley species that do not produce benzoxazinoids have also lost all Bx genes [114, 117]. The precise physical distances between all of the genes within the Bx cluster are not known. However, in maize, Bx1 and Bx2 genes are 2.5 kb apart  while Bx8 is 44 kb from Bx1 . In hexaploid wheat, the Bx3 and Bx4 genes are 7–11 kb apart within the three genomes . Although several of the Bx genes are in close physical proximity this gene cluster appears to be less tightly linked than the other examples that have been considered so far in this review. Interestingly, barley lines that produce benzoxazinoids do not synthesize gramine, a defense compound that is also derived from the tryptophan pathway. Conversely, gramine-accumulating barley species are deficient in benzoxazinoids. This has led to the suggestion that the biosynthetic pathways for these two different classes of defense compound are mutually exclusive, possibly due to competition for common substrates .
Outside the Poaceae, benzoxazinoids (in particular DIBOA and its glucoside) are found in certain isolated eudicot species belonging to the orders Ranunculales (e.g., larkspur, Consolida orientalis) and Lamiales (e.g., yellow archangel, Lamium galeobdolon; zebra plant, Aphelandra squarrosa). Comparison of the BX1 enzymes of grasses and benzoxazinone-producing eudicots indicates that these enzymes do not share a common monophyletic origin. Furthermore, the CYP71C family of CYP450s to which BX2-5 belong is not represented in the model eudicot, thale cress Arabidopsis thaliana, and all members of this family described to date originate from the Poaceae. It therefore seems likely that the ability to synthesize benzoxazinones has evolved independently in grasses and eudicots.
Investigation of triterpene biosynthesis in plants has led to the discovery of two other examples of operon-like metabolic gene clusters, namely the avenacin gene cluster in oat (Avena species) and the thalianol gene cluster in A. thaliana [99, 101, 104].
Avenacins are antimicrobial triterpene glycosides that confer broad spectrum disease resistance to soil-borne pathogens [104, 121]. Analysis of the genes and enzymes for avenacin synthesis has revealed that the pathway has evolved recently, since the divergence of oats from other cereals and grasses [99, 100, 122, 123, 186]. Transferal of genes for the synthesis of antimicrobial triterpenes into cereals such as wheat holds potential for crop improvement but first requires the necessary genes and enzymes to be characterized. Synthesis of avenacins is developmentally regulated and occurs in the epidermal cells of the root meristem. The major avenacin, A-1, has strong fluorescence under ultra-violet light and can be readily visualized in these cells. This fluorescence, which is an extremely unusual property amongst triterpenes, has enabled isolation of over 90 avenacin-deficient mutants using a simple screen for reduced root fluorescence [100, 104]. This mutant collection has facilitated gene cloning and pathway elucidation.
Sad1 encodes an oxidosqualene cyclase enzyme that catalyses the first committed step in the avenacin pathway [99, 122], while Sad2 encodes a second early pathway enzyme—a novel cytochrome P450 enzyme belonging to the newly described monocot-specific CYP51H subfamily . Sad1 and Sad2 are likely to have been recruited from the sterol pathway (from cycloartenol synthase and obtusifoliol 14α-demethylase, respectively) by gene duplication and acquisition of new functions [99, 100, 122]. Sterols and avenacins are both synthesized from the mevalonate pathway . While the genes for sterol synthesis are generally regarded as being constitutively expressed throughout the plant, the expression of Sad1, Sad2, and other cloned genes for avenacin biosynthesis is tightly regulated and is restricted to the epidermal cells of the root meristem [99, 100, 122]. Recruitment of Sad1 and Sad2 from the sterol pathway by gene duplication has therefore involved a change in expression pattern as well as neofunctionalisation.
The Sad1 and Sad2 genes are adjacent, and lie ~70 kb apart in the A. strigosa genome . A third gene has recently been cloned and shown to encode a serine carboxypeptidase-like acyltransferase that is required for avenacin acylation. This gene (Sad7) [104, 186] is ~60 kb from Sad1 on the opposite side to Sad2. Four other loci that are required for avenacin synthesis also co-segregate with these cloned genes, indicating that most of the genes for the pathway are likely to be clustered . Since avenacins confer broad spectrum disease resistance, the gene cluster is likely to have arisen through strong epistatic selection for maintenance and co-inheritance of this gene collective. In addition, interference with the integrity of the gene cluster can in some cases lead to the accumulation of toxic intermediates, with detrimental consequences for plant growth, so providing further selection for cluster maintenance . Gene clustering may also facilitate co-ordinate regulation of gene expression at the level of chromatin .
The genes within the A. thaliana thalianol gene cluster are expressed predominantly in the root epidermis, as is the case for the oat avenacin gene cluster [99–101]. The THAS, THAH, THAD, and BAHD acyltransferase genes all have marked histone H3 lysine 27 trimethylation, whereas the immediate flanking genes do not, suggesting co-ordinate regulation of expression of the gene cluster at the level of chromatin . As is the case for the avenacin pathway, tight regulation of the pathway appears to be critical since accumulation of thalianol pathway intermediates can impact on plant growth and development.
There are superficial similarities between the avenacin and thalianol gene clusters in that they are both required for triterpene synthesis and contain genes for oxidosqualene cyclases, CYP450s, and acyltransferases. However, phylogenetic analysis indicates that the genes within these clusters are monocot and eudicot specific, respectively, and that the assembly of these clusters has occurred recently and independently in the two plant lineages . This suggests that selection pressure may act during the formation of certain plant metabolic pathways to drive gene clustering, and that triterpene pathways are predisposed to such clustering.
A third example of a gene cluster for synthesis of terpenes in plants has been reported from rice, in this case for synthesis of diterpene defense compounds known as momilactones [102, 103]. Momilactones were originally identified as dormancy factors from rice seed husks and are also constitutively secreted from the roots of rice seedlings. In rice cell suspension cultures and in leaves, expression of the rice momilactone genes can be co-ordinately induced in response to challenge with pathogens, elicitor treatment, or exposure to UV irradiation [102, 103]. Synthesis of momilactones is initiated by terpene synthases that are distinct from the oxidosqualene cyclases that catalyze the first committed step in triterpene synthesis. The 168-kb momilactone gene cluster lies on rice chromosome 4 and consists of two diterpene synthase genes, a dehydrogenase gene, and two functionally uncharacterized P450 genes, all of which are involved or implicated in momilactone synthesis. These genes are all co-ordinately induced in response to treatment with a chitin oligosaccharide elicitor. Analysis of the promoters of the genes within this cluster has revealed the presence of potential recognition sites for WRKY and basic leucine zipper (bZIP) transcription factors, proteins that are associated with activation of defense responses. Gene clustering has been suggested to facilitate efficient coordinated expression of the momilactone gene cluster in response to elicitation.
Functions of gene clusters in animal defense and development
Global gene expression analysis has revealed extensive clustering of non-homologous genes that are co-ordinately expressed in eukaryotes, including in animals (for reviews, see [2, 126, 127]). These groups of genes may be expressed during development, or in certain tissues and diseased states, and have been reported in studies of Drosophila, nematode, mouse, and humans. Such co-expression domains may therefore be an important source for the discovery of new functional gene clusters in animals and other eukaryotes. However, more research is needed before we can fully understand the functional significance of co-expression domains. Of the known functional gene clusters in animals, the best characterized is the major histocompatibility complex (MHC), which encodes proteins involved in innate and adaptive immunity. Other classes of mammalian gene clusters include the Hox and β-globin loci, which are required for development and for the synthesis of haemoglobin, respectively. The latter two examples consist of genes that share sequence similarity and so are distinct from classical operons and from the functional gene clusters discussed above. However, investigation of these loci has revealed important insights into the mechanisms of regulation of arrayed gene clusters in eukaryotes and so these gene clusters will also be considered here.
The major histocompatibility complex (MHC)
The majority of genes from the MHC class I and III regions are constitutively expressed in all somatic cell types, although expression levels can vary over two orders of magnitude depending on the cell type or extracellular stimuli. By contrast, genes in the MHC class II region are expressed only in antigen-presenting cells, or in other cell types after induction by cytokines such as interferon-gamma. The class II gene transactivator (CIITA) acts as a master regulator for the expression of genes in the MHC class II region. Mutations in CIITA are one of the causes of bare lymphocyte syndrome in humans, a hereditary immunodeficiency disease characterized by a complete lack of MHC class II gene expression. CIITA acts by stabilizing a multi-subunit complex on the promoters of target genes to activate their transcription. Formation of the CIITA complex results in a large wave of histone acetylation and intergenic transcription that spreads out bi-directionally from the target promoter. CIITA-mediated transcription is highly specific and appears to target only around 25 genes in the human genome, including the HLA class II family genes within the MHC class II region, two unrelated genes in the MHC class I region, and other immune-related genes on different chromosomes. Despite the presence of such specific transcription factors, the physical clustering of genes in the MHC does appear to be important for their regulation. The chromatin fibre of the extended MHC forms extra-chromosomal loops that are periodically anchored to the nuclear matrix at matrix attachment regions (MARs) by MAR-binding proteins to form a chromatin “loopscape” [134, 135]. Treatment with interferon-gamma results in remodeling of the loopscape and differential regulation of genes within the affected regions. Remodeling also leads to alterations in the composition of MAR-binding proteins associated with the chromatin fibre and the subsequent recruitment of promyelocytic leukemia nuclear bodies, which may act as transcriptional factories for specific chromosomal loci [135, 137].
The origin of the major histocompatibility complex in metazoans is likely to predate the emergence of the jawed fish ~500 million years ago. The MHC region varies greatly in structure and size between species. This diversity is likely to have been driven by adaptive changes in response to selection pressure from pathogens and parasites. The linkage between MHC class I and class II genes has been preserved from the cartilaginous fish to humans, except in the bony fish, where linkage has disintegrated. The MHC is also highly variable within species; in humans, as many as 300 alleles can be found at a single locus. Some alleles are so divergent that their common ancestor is likely to predate the formation of the species. The MHC is also a site of strong linkage disequilibrium, and large conserved blocks of specific alleles, or haplotypes, of up to 3.2 Mb can be detected. Therefore, particular patterns of MHC alleles may be fine-tuned to work together, and this may be one of the mechanisms by which clustering of the MHC is maintained.
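The notion of linkage disequilibrium invoked above can be made concrete with the standard two-locus coefficients D and r², together with the textbook expectation that D decays by a factor (1 − c) per generation under random mating with recombination fraction c. The following is a minimal illustrative sketch and is not taken from the studies cited here; the function names and any frequencies passed to them are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LDResult:
    d: float          # D = p_AB - p_A * p_B
    r_squared: float  # r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B))


def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float) -> LDResult:
    """Two-locus linkage disequilibrium from haplotype and allele frequencies.

    p_ab is the frequency of the haplotype carrying allele A at the first
    locus and allele B at the second; p_a and p_b are the allele frequencies.
    """
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom <= 0.0:
        raise ValueError("both loci must be polymorphic (0 < p < 1)")
    d = p_ab - p_a * p_b
    return LDResult(d=d, r_squared=d * d / denom)


def ld_after_generations(d0: float, c: float, generations: int) -> float:
    """Expected D after a number of generations of random mating.

    Uses D_t = D_0 * (1 - c)^t, where c is the recombination fraction
    between the two loci. Selection and drift are ignored (an assumption
    of this sketch).
    """
    if not 0.0 <= c <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    return d0 * (1.0 - c) ** generations
```

Under this simple model, tightly linked loci (small c) retain a given allele combination for many generations, whereas unlinked loci (c = 0.5) lose half of any association each generation. This is one way to picture why physical proximity could help to keep co-adapted MHC haplotypes together.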
A number of regions in the human genome are paralogous to the MHC, and these are thought to have arisen by the duplication and fragmentation of a single proto-MHC after two rounds of whole genome duplication. One of these paralogous regions is the natural killer complex (NKC), a functional gene cluster that spans >1.5 Mb and contains C-type lectin receptor genes important for natural killer (NK) cell function, in addition to other immune-related genes. Remarkably, the MHC of chicken contains two C-type lectin receptor genes that are highly homologous to the receptor genes in the human NKC. The presence of these NK receptors in the chicken MHC suggests that the proto-MHC may have contained at least one ancestral C-type lectin receptor gene that was differentially retained at different loci after whole genome duplication in the lineages that gave rise to chickens and humans. The leukocyte receptor complex (LRC), which contains a large array of immunoglobulin family NK receptors, is similarly thought to be derived from the proto-MHC. Together, these results suggest that the clustering of immune-related genes at the MHC, NKC, and LRC was not independent, but instead was derived from the more ancient clustering of immune-related genes at the proto-MHC. How the ancestral genes became clustered remains a matter of debate.
The Hox gene cluster
Homeotic mutations in animals result in the transformation of one body segment into another and have played a crucial role in shaping our understanding of animal development. In the late 1970s, it was discovered that many homeotic mutations in Drosophila mapped to the Bithorax and Antennapedia complexes [141, 142], which we now know to consist of tandemly duplicated arrays of genes encoding homeodomain (Hox) transcription factors [143, 144]. The first ancestral Hox cluster is thought to have appeared before the separation of the Cnidarians and Bilaterians, some 600 million years ago. Diversification of the ancestral cluster facilitated the development of diverse and complex body plans across the metazoa. For example, in mammals there are four Hox clusters that have arisen through whole genome duplications, with up to 14 Hox genes within each cluster.
Retinoic acid (RA), a vitamin A-derived morphogen, can stimulate the sequential induction of Hox genes. RA stimulation is mediated through a conserved retinoic acid responsive element (RARE) upstream of the 3′-most Hox gene, Hox1. Through the course of embryonic development, conserved RAREs upstream of progressively more 5′ Hox genes propagate a wave of induction across the Hox cluster. Components of the RA signaling pathway either predate the Hox cluster or appeared shortly after its emergence, suggesting that temporal collinearity of expression may be an ancestral feature of the Hox cluster. The sequential activation of Hox genes in different metazoans is accompanied by directional changes in histone modifications, the opening up of the chromatin, and in some cases looping out of the chromatin fibre. The orchestration of this complex series of events is still not fully understood. Transcriptional repression of Hox genes through histone modifications and chromatin condensation is equally important for the establishment of appropriate Hox expression domains. In addition, Hox expression can be controlled post-transcriptionally, and perhaps also epigenetically, by non-coding RNAs and microRNAs (miRNAs), the genes for which are embedded in the Hox cluster.
The β-globin gene cluster
Classical operons in eukaryotes?
Conclusions and perspectives
Here we have reviewed the literature on gene clusters in prokaryotes and eukaryotes, with particular emphasis on functional gene clusters. This theme is very broad and inevitably we have not been able to cover the entire swathe of literature in this field. We apologise to those whose work we have not cited. Nevertheless, we hope that by bringing together disparate facets of this area we have been able to highlight some of the similarities and differences between the ways in which gene clusters are organized and regulated in different organisms. The reasons for gene clustering can be considered at both the functional level and the population level. These two considerations are not mutually exclusive. Considering the population level first, where the fitness of an allele at one locus depends on the genotype at another locus, a selective advantage may arise for genomic rearrangements that reduce the distance between the two loci. Significantly, this ratcheting effect may be enhanced when the fitness of recombinant haplotypes is low, for example where the combination of a functional and a non-functional allele at two loci results in the premature disruption of a biochemical pathway and accumulation of toxic intermediates [76, 123, 181]. Operons or gene complexes can thus be regarded as units of strong gene interaction, with tightening of linkage between structural genes. The selection for new gene clusters is likely to be intense, driven by the need to adapt to growth under different environmental conditions. At the functional level, physical clustering may be advantageous because it allows groups of genes to be co-ordinately regulated at the levels of nuclear organization and/or chromatin. Thus, one way in which alleles could interact well is by being co-localized in regions of chromosomes that facilitate co-ordinate regulation at this level and by being amenable to the same type of chromatin modification. In the future, it is likely that in-depth analysis of the levels at which functional gene clusters are post-transcriptionally regulated will reveal new facets of co-ordinate regulation that will shed further light on the mechanistic benefits of physical clustering.
The authors acknowledge their sources of funding (A.O., Biotechnology and Biological Sciences Research Council, UK, and the European Union; B.F., FEBS Long Term Fellowship) and would like to thank their colleagues for helpful discussions in the preparation of this article.