Introduction

The naive concepts regarding genome regulation that generations of biologists have learned to love are long gone, and have been supplanted by the notion of a dynamic epigenome. How did this transition happen and what are we left with?

Often presented as a continuation of the Human Genome Project (HGP), the “Encyclopedia of DNA Elements” project (ENCODE) was an attempt to systematically map the universe of potential transcription sites in the human genome. ENCODE was initiated in 2003 by the US National Human Genome Research Institute (NHGRI); the results of its first phase, released in 2012, included all the transcribed regions of the genome (ENCODE Project Consortium 2012). ENCODE’s results have been serving as a repository for the scientific community, as an encyclopedia of transcription factor binding sites as well as the chromatin modification landscape. Experimental protocols and analytical procedures have also been developed by the consortium. More recent continuations of the project are modENCODE (Celniker et al. 2009), which aims to identify functional elements of model organism genomes, and “Roadmap,” a project to make human epigenome data publicly accessible for disease-oriented research (Bernstein et al. 2010).

While ostensibly a paradigmatic example of big science and big data as applied to biology, the project became a tinder box for debates both theoretical and conceptual far exceeding its immediate results. Here we want to illustrate that to understand ENCODE you have to situate it as a culmination of at least three strands of development in biology. The first is the history of thinking about the organization of genomes, both physical and regulatory. The second is the history of ideas about gene regulation, primarily in eukaryotes. Finally, and connecting these two issues, is how best to think about the role of genetic material for organisms: Is it best viewed as a collection of genes? As a site of regulation? Is it passive or active? Should it be studied in informational terms or in material terms? And, ultimately, whether it is the epitome of biological design or, as evolutionary biologists have argued since at least the 1970s, is mostly made up of historically accumulated junk.

The Role of Genomes in Development

The relation between genes and development has been a central concern in developmental biology for more than half a century, both before and during the heyday of molecular biology. Two influential suggestions were made by Conrad Waddington in the 1940s. His “epigenetic landscape” (Waddington 1940) depicted development as a branching series of decisions that are driven by genes as well as by external conditions and lead to equilibrium, or canalized, states. In 1942, he promulgated his notion of the “epigenotype,” which refers to the whole complex of developmental processes lying between and connecting the genotype and the phenotype. Epigenetics was the term he used for the study of the epigenotype, with the goal of deciphering the causal mechanisms and network of relations linking genes and phenotypic effects (Waddington 1942; see Jablonka and Lamm 2012). The meaning of the term “epigenetics” has changed historically and is also partly dependent on the research context it is used in, often causing confusion. Broadly construed, the term “epigenetic inheritance” refers today to the persistence of certain gene regulatory changes in a (cellular or organismal) lineage. Construed more narrowly, epigenetics refers to the persistence of gene-regulatory variation through material overlap during mitosis that is not due to DNA sequence (Jablonka and Lamm 2008). For Waddington, the research agenda that the epigenotype concept was meant to capture was based on the realization that the relations between genes and characters were many-to-many, as well as on the importance of gene-gene interactions.

Deciphering a mechanism of gene regulation, or more precisely of the regulation of gene expression, in bacteria, the operon model elucidated in 1961 (Jacob and Monod 1961) served to indicate both the goal and the difficulties with understanding gene expression and regulation in eukaryotes. These challenges included the fact that eukaryote genomes consist of chromatin, which is DNA wrapped around proteins, possibly making their conformation harder to change via operators. The role of translation control as well as transcription control also needed to be considered. In addition, as appeared to be the case in cell differentiation in multicellular organisms, the control of whole batteries of genes simultaneously would have to be accounted for. Finally, cell differentiation indicated the possibility that different mechanisms control the type of cells while others control gene expression later on (Waddington 1966, pp. 59, 68). In the 1960s, puffing patterns in Drosophila chromosomes indicated that during development gene expression in cells may localize to specific regions of chromosomes (Ritossa 1962). It was also observed that small molecules such as hormones can affect gene expression (Tomkins and Martin 1970) while being unlikely to bind to specific DNA, suggesting an intermediary molecular control mechanism (Waddington 1966, p. 77).

Critically important for early studies of gene regulation was the observation that differentiation in higher organisms involves simultaneous control of multiple noncontiguous genes. In a retrospectively groundbreaking paper published in 1969, “Gene Regulation for Higher Cells: A Theory,” Roy John Britten and Eric H. Davidson proposed a mechanistic model of gene regulation in eukaryotes that aimed to explain precisely this: how regulatory logic becomes independent of the organization of the genome. They argued that the operon model was not applicable to eukaryotes, since it is unable to account for the transcriptional changes during cell differentiation in eukaryote development (Britten and Davidson 1969). While they explicitly acknowledged that regulatory processes occur at all levels of biological organization, the model they presented dealt with regulation of transcription. Contrary to what later became common wisdom (primarily after the HGP), they accepted with some caveats the notion that more complex organisms have larger genomes, suggested by Alfred E. Mirsky and Hans Ris in 1951 (Mirsky and Ris 1951). From this, however, they deduced a fundamental conclusion that is still accepted today, namely that most of the genomic changes leading to greater complexity are changes in the complexity of regulation, not the production of new protein-coding genes. This conclusion would later be a tenet of evolutionary developmental biology (evo-devo). The study of gene regulation mechanisms is one of the major subjects of study within the discipline of epigenetics, and increased in prominence once the limitations of the results of the HGP for understanding development became apparent.

Britten and Davidson were not the first researchers to emphasize the importance of novelty in gene regulation over that of novelty in genes. Earlier researchers suggesting such a role include Richard Goldschmidt (1940), Conrad Hal Waddington (1943), and Edgar Stedman and Ellen Stedman (1950) (reviewed in Wolter 2013). As Britten and Davidson note, T. H. Morgan also considered the role of gene regulation in differentiation in his 1934 book, Embryology and Genetics (Morgan 1934). Based on the discovery by Britten that genomes contained large amounts of repetitive, noncoding sequences, Britten and Davidson suggested that these sequences, whatever their origin, were used as targets for regulatory molecules. Initially, they proposed these regulatory molecules were RNAs. Histones, in contrast, were considered to lack tissue specificity and to display uniformity between “active” and “inactive” chromatin and hence the authors concluded that they serve as general inhibitors of transcription, rather than for tissue specific regulation, a sentiment that would resurface half a century later amongst those critical of the chromatin research of the 21st century.

Even though the term “chromatin” has become widespread in recent years (Deichmann 2015), its history goes back at least a century further. While the components of chromatin were biochemically described in the second half of the 19th century (Miescher 1871, 1874; Kossel 1884; Lamm et al. 2020), its functional role would only be investigated in the second half of the 20th century. Chromatin, originally named for its stainability, was early on identified to consist of condensed and under-condensed regions, referred to as heterochromatin and euchromatin respectively (Vanderlyn 1949; Brown 1966). The former includes chromosome regions of critical importance, specifically the centromeres and telomeres. Various considerations, none of them fully conclusive, led to the proposition that the condensed regions (heterochromatin) are entirely or largely lacking in genes or are genetically inert. The heterochromatic regions were often thought of as having a structural role (though the distinction between structural and informational aspects of the genome should be approached cautiously, especially when considered historically). These regions were also shown not to undergo recombination, and their replication was shown to rely on specific mechanisms. Later work identified the relation between chromatin state (condensed or open) and histone modifications and determined that the chromatin state of chromosome regions can change dynamically. How changes in chromatin state are related to gene regulation and the relation between chromatin state and repeatable sequences (often found in heterochromatin) are central topics in work on gene regulation.

The ENCODE project used a variety of histone modification patterns to identify functional areas in the human genome (Siggens and Ekwall 2014). These modifications affect the three-dimensional conformation of chromatin, and influence regulatory processes such as transcription factor binding and enhancer activity, in addition to affecting DNA replication, repair, and so on. In addition to indicating active chromatin, particular histone modifications are known to mark enhancers. Enhancers are short DNA regions that may be far upstream or downstream from a transcription site, yet their effect in cis increases the likelihood of transcription. It was discovered that the human genome contains a multitude of enhancers that vary in their pattern of activation in different cell types, showing a high degree of cell-type specificity (Schoenfelder et al. 2019). Enhancers regulate transcription through chromosome loops that bring the distant enhancer into physical proximity with the promoter, enabling interaction through a mediator complex. Multiple distal elements may interact with a transcription site and vice versa. The degree of specificity and information processing that occurs around enhancers has been the subject of theorizing (Arnosti 2005).

The hypothesis that “middle repetitive,” heterochromatic regions of the genome could be regulatory elements was already made in the 1960s and 1970s, when the presence of RNAs in chromatin fractions was detected (Sivolap and Bonner 1971). “Chromosomal” RNAs were discovered in nucleohistones in 1965 (Huang and Bonner 1965). However, this hypothesis was sidelined by the idea that the gene-poor, transposable-element rich heterochromatin would be mainly “junk” (Mattick and Amaral 2022). The term “junk DNA” was developed in the 1970s, primarily in the work of Japanese-American geneticist Susumu Ohno, and referred to repetitive noncoding DNA (1972). The relations between gene regulation, repetitive sequences, histones (and later, histone modification), and truly “junk” DNA are critical for conceptualizing the relations between genes and development (and hence genotypes and phenotype) as well as evolution. The ENCODE project has now notoriously become associated with a debate about “junk” DNA (Doolittle 2013; Graur et al. 2013), given what was claimed to be one of the major results of the project: that the largest portions of the genome are functional, i.e., not junk. The question whether ENCODE could identify the functioning parts of the human genome (relative to the concept of function in use) is still contested. What became increasingly apparent was that much more than protein coding genes was transcribed, notably a significant portion of the intergenic genome (Djebali et al. 2012).

While it was primarily conceived as an empirical project, the ENCODE project could not escape being framed within the theoretical concerns and history discussed in this section. In that vein, the story of ENCODE is deeply connected to that of turn-of-the-century epigenetics. In the 1990s, when imprinted genes, genes whose gene-regulatory state would persist intergenerationally, were discovered in mice and humans, DNA methylation became a key area in epigenetics (Haig 2011). The term epigenetics is now used in a much narrower sense than that introduced by Waddington, and for the case of ENCODE proponents refers specifically to molecular mechanisms believed to affect chromatin structure (Goldberg et al. 2007). Such mechanisms include the covalent modifications of histones or DNA. These modifications are believed to alter chromatin structure in a way that makes it more or less prone to transcription. Since the 2000s, most chromatin research is conducted under the label of epigenetics, since most such studies are undertaken to show how the epigenetic modifications change the chromatin landscape and thus lead to gene regulatory changes (Deichmann 2015). The ENCODE project embodied a pluralistic perspective on these long-running debates concerning gene regulation. It made room for both work on transcription factors and on epigenetic mechanisms of gene regulation, a topic which we now turn to.

Histones and Coding

Whereas Waddington’s dynamic perspective on development already resisted the idea of reducing development to a series of coded instructions, after the conclusion of the HGP some researchers were loath to give up the idea of a code entirely, and suggested another level of potential encoding, that of chromatin modifications, particularly histone modifications. In the beginning of the 21st century, fueled by excitement raised by discoveries in epigenetics in particular, it was proposed that particular combinations of histone modifications would denote particular downstream events. Thus, a new code, the “histone code” (Jenuwein and Allis 2001), was proposed, which shifted the level of examination to gene/chromatin regulation.

Proponents of the histone code argued that it might be possible to understand gene regulation and cell differentiation solely by understanding the histone code. They placed the causal primacy on histone modification:

First, the establishment of ... a combinatorial pattern of histone modification, i.e., the histone code, in a given cellular or developmental context .... Second, the specific interpretation or the “reading” of the histone code ... [which] function[s] broadly to set up an epigenetic landscape that determines cell fate decision-making during embryogenesis and development. (Mattick and Amaral 2022, p. 2002; citing Strahl and Allis 2000)

The quote implies that epigenetic processes occupy the highest level of control over developmental trajectories, notwithstanding the at-that-time common knowledge that, for instance, ectopic expression of transcription factors could change cell differentiation drastically. This debate is, necessarily, a turf war between different subdisciplines, which some coined “transcription factor people” versus “chromatin people.” It is about the causal primacy of the respective molecules, but also, necessarily, about institutional support, such as funding into these areas of research. By considering particular covalent histone modifications as signs for transcription or lack of transcription, the ENCODE project cautiously endorsed this hypothesis. Mapping histone modifications against other indicators of transcription held the promise of decoding the histone code. However, a histone code commensurate with the genetic code was not found (Rando 2012).

An important metaphor motivating the idea of a histone code is that of writers/readers/erasers of the histone code. The idea is that there are certain proteins that specifically “write” the histone code, that is, deposit particular covalent modifications on histones; that there are other molecules that can “read” the histone code, that is, specifically recognize a particular code and trigger downstream effects; and that there are molecules that can specifically remove certain covalent modifications. However, as Henikoff and Shilatifard (2011) argue, whereas the analogy with a code and language is accurate when it comes to DNA transcription and translation, this is not the case for the histone code: Writers do not write, but only modify amino acid residues, one at a time. Readers do not read, but only bind amino acid residues, one at a time. Erasers do not erase, but only remove modifications from amino acid residues, one at a time. It is, however, thought-provoking to note that the language metaphor, which has a long history in biology, can be used more broadly. As one example, François Jacob used the metaphor of text and language not merely as an analogy, but as a cognitive model (see Rheinberger 2006). Language, he noted, “constitutes a typical system of interaction between elements of an integrated whole” (Jacob [1970]1974, p. 251), similar to other cybernetic objects, which include human societies, living organisms, and automatic devices. In each, Jacob noted, “cybernetics finds a model that can be applied to the others.” Whether such a cybernetic perspective, somewhat reminiscent of Waddington’s, is ultimately helpful, remains, as we will show, a subject of debate.

Several commentators, particularly those who have been central in discoveries pertaining to transcription factor-based gene regulation, have been critical of epigenetics and the histone code particularly. For instance, they argue that epigenetic modifications, such as methyl- and acetyl-groups, lack specificity for particular histones but require a specific machinery—largely coordinated by transcription factors—to be attached to a particular region (Ptashne 2007). Region-specific factors are necessary for efficient and targeted gene-regulatory changes. Mark Ptashne, therefore, argues that the statement that genes might be regulated by chromatin modifications is highly misleading, given this lack of specificity (2007). In a “meaningful way” they are regulated by DNA binding proteins, transcription factors, which in turn recruit further regulatory machinery.

Similarly, critics hold that the idea that nucleosome modifiers regulate genes by opening or closing chromatin structure is problematic. A similar problem is presented by the idea of “activating” or “repressive” histone modifications, implying causality where none has been demonstrated (Henikoff and Shilatifard 2011). Another problem Ptashne locates with epigenetics is that self-perpetuating loops are often insinuated, even though they have not been shown to exist (except for DNA methylation) (Ptashne 2007, 2013). Thus, “memory” of histone marks cannot be ensured.

The key question concerns the specificity of chromatin modification. The idea that histone modifications are unspecific fuels what has been coined the “cause/cog” debate (Henikoff and Shilatifard 2011). While studies have shown that it is possible to predict attributes of the chromatin landscape through analyzing histone modification (a hypothesis that ENCODE also partly rests on), many doubt that these studies can prove functions, since they cannot demarcate causation from correlation. For instance, Eric Davidson argues: “Of course the basic problems are that binding in ChIP [chromatin immunoprecipitation] does not equal function, and that motif identification does not equal (functional) binding” (2015, p. 168). One alternative explanation, for instance, is that histone modifications are primarily affected by processes such as transcription which in turn affect the physical properties of nucleosomes that help maintain the active or silent state of chromatin. They would thus be consequences, simple cogs in the wheelhouse of gene transcription. This debate on causal primacy, specificity, and the “location” of genetic control and information, echoes the questions discussed by Waddington, who rejected the possibility of reducing them to a non-dynamic, noninteractive, universally applicable answer. The cybernetic perspective suggests that stabilized states can occur through various causal routes, and that causality in complex systems contains loops, which complicate or obviate the importance of a first cause in a causal cascade (Lamm 2014).

Regulatory RNAs

Ptashne argues that transcription factors, as opposed to histone marks, can guarantee not only specificity but also memory of transcriptional changes (Ptashne 2013). On the one hand, their tertiary structures bind specific stretches of DNA. On the other hand, through feedback loops they can guarantee their own expression and thus can ensure gene regulation over several generations in a lineage of cells. While those studying transcription factors are generally quite dismissive about the causal role of histone modifications or a histone code in gene expression, another epigenetic factor fares much better in their deliberations: regulatory RNAs (Fire et al. 1998; Ptashne 2013).

The study of regulatory or noncoding RNAs became increasingly important with the advent of new sequencing technologies in the years following ENCODE, while ENCODE relied on earlier technologies for mapping transcription sites in the genome. Regulatory RNAs are noncoding RNAs that are involved in a variety of processes; they forge a link between the specificity of the genetic code and regulation. Some commentators also see in the rise of regulatory RNA biology a return to the Britten-Davidson model, which, initially published in 1969, proposed that RNAs would be the master regulators of gene regulation. More abstractly, ongoing work on noncoding RNAs is now part of the developing picture of the bridge connecting genomes and phenotypes. In the current picture of genomes, noncoding RNAs are a critical factor, in addition to transcription factors and histones.

Take X-chromosome inactivation as an example: XIST is a long noncoding RNA, operating in cis, that in mammals with two X chromosomes is responsible for the inactivation of one X chromosome, to compensate for dosage effects. XIST is transcribed from the X chromosome and can through complementary binding affect the heterochromatinization of one of the X chromosomes. XIST has been shown to be involved in histone modifications as well as DNA methylation on the inactivated X chromosome (Wutz and Gribnau 2007). Given this guaranteed specificity, RNA-based epigenetics sits better with those critical of the histone code.

Small RNAs are another type of regulatory RNA. Small RNAs also work through complementary binding to other RNA targets. Some small RNAs are amplifiable. In some species, such as C. elegans, they can be amplified by RNA-dependent RNA polymerases (Lipardi 2001; Sijen 2001). In other species that lack these molecules, such as humans, there are other processes that amplify certain types of small RNAs. For instance, piRNAs, which are involved in transposon silencing in the germline, can be amplified through an amplificatory cycle called ping-pong (Aravin et al. 2007). Thus, through their amplifiability, small RNAs can in principle affect gene regulatory change throughout a particular lineage of cells.

We shall linger on the example of small RNAs a bit longer since they elucidate a perspective that we consider important when situating ENCODE within these debates, that of the “dynamic genome.” In the beginning, small RNAs were primarily associated with posttranscriptional gene silencing in the cytoplasm. But in several model species, including mice and human cell lines, nuclear functioning, including co-transcriptional gene silencing, has been reported. Initially, small RNA research in model organisms such as plants and yeast was central in reconsidering the functional notions attached to euchromatin and heterochromatin, particularly where heterochromatin has been associated with low transcriptional activities. Many genomic regions that have been associated with a heterochromatic state and thus low transcriptional activities originally received this status because these regions are not protein-coding. Nevertheless, they were known to be of high structural importance. But they are also sites of transcription (Reinhart and Bartel 2002).

Here are a few examples where small RNAs affect chromatin structure. One of the best-studied ways that small RNAs can affect chromatin structure is through directing covalent histone- and DNA-modifications. In turn, these modifications are believed to affect chromatin structure. This process is ensured by the complementarity of particular small RNAs to the targeted regions in the genome, as well as small RNA-based effector proteins that are involved in trafficking small RNAs as well as downstream effector functions. Small RNAs might act in-cis (on the locus of their transcription) and in-trans (on a locus distinct from the place of their transcription).

In mammals, piRNAs in the germline are necessary for the methylation of retrotransposon sequences throughout the genome. This process is hypothesized to work through piRNA targeting of nascent transposon sequences (Carmell et al. 2007). They have also been shown to be involved in the re-methylation of imprinted genes (Watanabe et al. 2011). Aberrant piRNA-mediated gene silencing in cancer stem cells has also been reported to have important effects in tumorigenesis (Jia et al. 2022). Additionally, other small RNA pathways have been associated with remodeling chromatin structure (reviewed in Li 2014). For instance, administration of synthetic double-stranded RNAs was shown to induce DNA methylation and histone modification (Hawkins et al. 2009). Furthermore, endogenous microRNAs have also been associated with co-transcriptional silencing of promoters (Kim et al. 2008).

In other model species, small RNA effects on genome organization are even more extreme: Deletions in the small RNA machinery can lead to chromosome segregation and microtubule annealing problems (Volpe et al. 2003). In C. elegans, there is a particular small RNA pathway associated with a particular effector protein that is hypothesized to operate on the level of chromatin exclusively, loss of which leads to chromosome loss during cell divisions. Thus, this machinery has been hypothesized to define chromatin boundaries during centromere formation (Claycomb et al. 2009). Also, it has been shown that heterochromatic regions, such as centromeres, are regions of high rates of small RNA transcription (Reinhart and Bartel 2002). The classical histone-code-readout for a gene being off, such as the methylation of certain histones, does not correspond with high rates of small RNA transcription in these regions (Lev et al. 2017). Furthermore, small RNA-based changes to chromatin structures include depositing histone modifications. These changes are believed to be heritable, without the histone mark in itself needing to be heritable. In particular species such as Tetrahymena small RNAs take their chromatin organizing function one step further by being involved in DNA elimination (Mochizuki et al. 2002).

The roles of each of the factors we discussed—transcription factors, DNA methylation, noncoding RNAs, and histone marks—are different in different species, and not all are present in all species. Their functions, interactions, and evolutionary history are all part of the current thinking about genomes.

Genomes After ENCODE

These recent discoveries in small RNA biology bring back the three questions we listed in the beginning and their connection to ENCODE. The ENCODE project certainly contributes to a picture of the genome that is a site of constant regulation. However, ENCODE was primarily intended to provide data for future studies and was somewhat pluralistic. Most crucially, ENCODE did not choose a side in the transcription factor versus chromatin debate, and was primarily interested in transcription simpliciter, leaving undecided the causal primacy of either approach and taking traces of both as indicators of transcription. ENCODE is also agnostic as to the cause or cog debate, since, in its descriptive approach, it remains undecided on whether histone modifications are a readout of active or repressed transcription because such modifications directly cause transcription or because these modifications are the consequence of other processes that modulate transcription rates.

Nevertheless, through being associated with these debates, ENCODE contributes to the question of how genomes are organized, given that a regulatory map of the genome (no matter whether it is a “histone code,” and whether histone modifications are causes or cogs) provides hints about the physical organization of genomes—condensed and open chromatin, as well as chromatin borders. There is also a potential to disassociate the notion of “heterochromatin” from silent chromatin, given ENCODE’s mapping of transcription sites. Indeed, ENCODE used chromatin marks to identify 15 chromatin states (ENCODE Project Consortium 2012). Naturally, ENCODE also contributed to questions of gene regulation but was never set up to be an arbiter of “transcription factor” versus “chromatin” or “RNA” versus “protein.” Rather, ENCODE provides an integrative, pluralist perspective on these issues, presupposing the importance of transcription factors and chromatin modifications, not implying any hierarchies in its methodological setup.

What the ENCODE debate certainly contributed to is a notion of the genome as highly active and plastic, given that it showed that most of the genome is transcribed. Arguably, it also diffused hard and fast distinctions between gene expression and other genomic processes. However, as many others have pointed out, this point shouldn’t have been framed as showing that almost all of the genome’s DNA is functional in a selected-effect or informational sense. ENCODE was never set up to prove or deny this. The distinction between causal role and selected-effect functioning aside, the result that most regions of the genome, euchromatic or heterochromatic, are transcribed could change conceptions of chromatin structure and genome organization in general.

Many transcripts are involved in feedback loops, where their transcription affects the heterochromatic state of their region of origin. In that way, they also affect genome organization by condensing or decondensing certain areas, maintaining boundaries so certain chromatin states do not spread, and affecting chromatin organization in a way that cell divisions can take place in an ordered manner. But it has become increasingly controversial to maintain that it is possible to discern different “kinds” of chromatin based on their characteristics under the microscope, that chromatin states are static and that there is a correspondence between “eu-” and “heterochromatin” and “active” and “silent” chromatin. An idea that ENCODE is compatible with, but never actively addressed, is viewing the genome as not only responsive, but also as being constantly negotiated: what stays, what is on, and what is off are the product of a dynamic interplay of different specific effectors and their positive and negative feedback loops, potentially also integrating environmental signals (Fedoroff and Botstein 1992; Shapiro 1992, 2011; Caporale 2006; Fontdevila 2011). Genomes, given these developments, become less a site of hierarchical firsts, where gene X causes Y, than a site of dynamic exchange with other genetic materials and transcription products, having a dynamic and functionally significant three-dimensional organization that is sensitive to the environment and in some cases the site of transgenerational signals. Waddington and Jacob would have been fascinated.