Throughout this book, we have mapped the participation of different scientific communities in genomic endeavours across three species—yeast, human and pig—and the distinct processes, epistemic goals and domains of application that informed the creation of annotated reference genomes. In this chapter, we examine how the existence of reference genomes enabled the creation of increasing amounts of additional genomic data, as well as other kinds of biological data. This involved the generation of new reference resources intended to represent forms of variation within a species that were either missing or insufficiently incorporated into its reference genome. Such reference resources could include new maps or sequences, novel ways to relate or align freshly identified variation to the reference genome, or tools to capture and document variants.

We examine two main currents of post-reference genome data production, collection and analysis: functional and systematic studies. By functional analysis, we mean the investigation of the effects of variation in genes and genomes, in terms of alterations to biological processes and therefore differences in phenotypes: the measurable presentation of traits in organisms.Footnote 1 By systematics, we mean the exploration of the patterns and specific details of genomic variation both within a species and between it and related species. We conclude by considering the implications of the increasing interrelationship of the functional and systematic modes in research pertaining to each of the three species.

Before a reference genome exists, and after its creation, communities tend to focus their interest on intra- and inter-species variability. Constructing a reference genome involves abstracting away—to a greater or lesser extent—variation to create a single canonical reference standard. Although the nature of reference genomes varies across species, this abstraction is often raised as a source of concern by the many different communities that may use them but who were not involved in their construction. This concern, we contend, arises from a tension between the presumed representativeness of reference genomes and their role as standards, as stipulated bases of reference and comparison.Footnote 2 This tension may seem to place—and in some cases, conflate—conflicting demands on reference genomes. Yet, the two demands are linked. Reference genomes, through their role as standards, enable researchers to gain a greater appreciation of the range of biologically-meaningful variation present across a species, and so help seed critiques of their representativeness. Furthermore, through their very role as a scaffold on which other representations of variation can be constructed and connected, reference genomes have enabled the development of data, tools and representations to facilitate more bespoke functional and systematic explorations of the biology of the species. Such developments have also encouraged visions of the refinement or even replacement of the reference genome as a central object in genomic and allied research.

In this chapter, we once again consider genomics research on all three of the species we have concentrated on in this book, in the context of understanding the nature of genomics after the reference genome. This is often referred to as ‘postgenomics’ (Richardson & Stevens, 2015). In Sect. 7.1, we contend that this label, and some of the meanings that have been attributed to it, reflects and reinforces a misleading picture of the history of genomics that has arisen through a disproportionate focus on the elucidation of the human reference genome. In this view, a concern with relating multiple kinds of genomic and non-genomic data (what we refer to as multi-dimensionality) and the biological contextualisation of that data is postgenomic. As we have seen in the preceding chapters, however, these facets were evident in pre-reference genome research and even featured alongside the generation of the reference genome for yeast and especially the pig. Even within Homo sapiens, the compilation of genomic data went hand-in-hand with, rather than preceded, an aspiration to capture variation and connect this with other biological and medical problems outside the concerted effort of the International Human Genome Sequencing Consortium (IHGSC).

To begin to make that case, in Sect. 7.2 we consider the extent to which reference genomics is an ongoing project, both for species that already have a reference genome, and for species still lacking one. This continued reference genomics does not just constitute a tidying up exercise or involve incremental improvement. To accept that would presume that there is some final standard of completion for reference genomes. Further, it would imply that reference genomes are something to be discovered, rather than constituting creative products. As we have shown in previous chapters, reference genomes are abstractions from the variation found in nature, in which decisions made about data infrastructures, mapping, library construction and use, sequencing method, assembly and annotation are pertinent to shaping the final product. Who was involved in these processes, and when, are therefore matters of deep significance.

In Sect. 7.3, we develop our argument by examining two modes of genomic research—functional and systematic studies—and the relationships between them, by first inspecting an example of pre-reference genome work: pig genetic diversity projects that ran from the mid-1990s to the mid-2000s. These were efforts that explicitly aimed at apprehending the diverse genetic resources that might be tapped for breeding programmes. They also neatly aligned with interests in the domestication, evolution, phylogeny and natural history of pigs that were held by many researchers who primarily worked on pig genetics for agricultural purposes. They therefore instantiate an early entanglement between functional and systematic work and show how some of the supposedly ‘postgenomic’ concerns with variation and multi-dimensionality operated before the creation of reference genomes.

In Sect. 7.4, we compare the manifestation of these functional and systematic modes of genomic research across yeast, human and pig. We show that genomics research across these three species exhibits different forms of entanglement (or lack thereof, at times) between systematic and functional approaches. This crucially affects the ways in which reference genomes are used and how new reference resources are developed and connected to each other. We document this through particular examples of work conducted after the release of the reference genomes of each species:

  • Yeast. EUROFAN, Génolevures and allied projects that followed the completion of the reference genome of Saccharomyces cerevisiae. EUROFAN sought to systematically produce mutants for particular genes and combinations of genes in S. cerevisiae, and so generate mutant stock collections as a standard reference intended for wider circulation. Génolevures was a network that sequenced and comparatively analysed the genomes of multiple yeast species, and in so doing explored their evolutionary dynamics and generated the comparative means for further developing functional analyses.

  • Human. ENCODE, the post-reference sequence project aiming to catalogue the functional elements of the human genome, and GENCODE as a sub-project of this. We also examine attempts to map and make sense of human genomic diversity, the establishment of reference sequences for particular populations, as well as the creation of ClinVar: a database of genomic variants associated with clinical interpretations of their possible implication in disease.

  • Pig. The Functional Annotation of Animal Genomes network (FAANG), which has grouped the pig genomics community with genomicists working on other farm animals. We also examine research on pig genomic diversity across breeds, particularly that related to tracking and understanding patterns of evolution, domestication and dispersal. Finally, we recount the creation of a SNP chip or microarray to test for the presence or absence of particular Single Nucleotide Polymorphisms (SNPs). The advent of this chip enabled further functional and systematic studies, as well as the development of novel resources and the inferential means by which researchers could connect new and existing resources. This eased the connection of genomic resources to particular modes of research and domains of application or ‘working worlds’ (Agar, 2020).

In yeast, we see first a pursuit of functional analysis to build on and enrich the reference genome as a resource, followed by systematic studies. Leading yeast genomicists, pursuing their own lines of molecular biological and biochemical research, increasingly realised the synergies between these two modes. The relationship between functional and systematic research in yeast reflected the ‘do-it-yourself’ approach of the yeast biologists who made up the genomics community, and the nature of yeast as a model organism.

In human genomics, we see continuity from the way that the reference genome effort was organised, with grand concerted efforts led by many of the same institutions that made up the IHGSC. We focus on ENCODE and GENCODE and compare these with some contemporary systematic studies that examined human genomic diversity, such as the production of reference sequences for particular populations. We conclude the discussion of human post-reference genomics by looking at a relatively new initiative, ClinVar, which aims to connect the infrastructures, norms and practices of large-scale genomics with those of the medical geneticists who were peripheral to the IHGSC, and who had instead developed their own separate and parallel data infrastructures.

For the pig, however, after the discussion of the functional and systematic motivations and consequences of pig genetic diversity research in Sect. 7.3, it will not come as a surprise that the post-reference genome distinction between functional and systematic modes has been far fuzzier than for yeast and human; there have been multiple crossovers between the modes and an early appreciation of their synergies by the community. After examining how the pig genomics community immediately set about functionally and systematically exploiting the reference genome they had helped to create, we examine their collaboration with the sequencing technology company Illumina to produce a SNP chip. The SNP chip is an excellent illustration of the significance of the involvement of a particular community in the creation of genomic resources, in particular in shaping the generation of new reference resources. This, however, introduces constraints into these resources as much as it engenders capabilities or affordances.

We conclude (in Sect. 7.5) by observing that the coming together of functional and systematic modes of genomic research instantiates a particular stage in the development of what we term a web of reference. Over time, webs of reference feature ever-denser webs of connectedness between distinct representations of the variation of, for example, a particular species. Such representations include reference sequences (e.g. of the species or sub-species populations), genome maps and resources such as SNP chips. Connections between such representations are progressively forged by data linkages and the process of identifying and validating inferential and comparative relationships between them. This is enabled by the creation of reference resources that seed the web, with new nodes representing new forms of data scaffolded on and linking to existing ones. These webs of reference are especially dense within particular species, but they can—and indeed have and often must—be connected to genomic reference resources beyond them. The way in which these webs develop depends on prior genomic research and reference resource creation (including that for other species) and the involvement of specific communities of genomicists in those efforts.

1 Postgenomics or Post-Reference Genomics?

This chapter explores the surplus of data concerning genomic variation that has been generated in the wake of the elucidation and publication of reference genomes. We refer to this as ‘post-reference genomics’, to indicate the differences between our treatment of this work and what is usually connoted by the term ‘postgenomics’. ‘Postgenomics’ has often implicitly referred to genome-related research that followed the determination of the human reference sequence. It therefore ignores the ongoing ‘reference genomics’ of the human, beyond the conclusion of the IHGSC endeavour, as well as that being conducted for other species. Many species, of course, still do not have a reference genome, while others (as we have shown throughout the book) had their reference sequences produced in a substantially different manner to the human one.

There has been some debate on what postgenomics means, beyond the chronology of simply following the initial publication of the human reference genome. Some scholars have suggested that it was conceived by IHGSC scientists to market their post-reference sequence research agenda, with parallels drawn between this agenda and the contemporary rise of the notion of translation: an imperative to transform research results and data into medical outcomes (Stevens & Richardson, 2015). While obtaining further grant funding may have been a significant driver of the framing of postgenomics as something distinct and new, other accounts have sought to characterise postgenomics as a more substantial endeavour. A common theme in this latter school of thought is that postgenomics constitutes research that has aimed to relate other forms of biological data (to integrate additional dimensions) to DNA sequence data, and therefore to begin to properly capture the complexity of biological processes. Here, the technologies, methods, infrastructures and data generated through genomics have been used as a platform for further biological research. In this version, postgenomics comprises a recognition of complexity and a non-deterministic, non-reductionist, interactionist and holistic vision of the organism.Footnote 3

This perspective on organismal complexity was first outlined at a 1998 conference held at the Max Planck Institute for the History of Science. Since, by any definition, this conference was held before ‘postgenomics’ had come into being in any form, postgenomics was envisaged in conjectural and promissory ways. At the conference, the biologist Richard Strohman outlined four phases of genomics: the first two being “monogenetic and polygenetic determinism”, then “a shift in emphasis from DNA to proteins” and then “functional genomics”. Following this was a burgeoning fifth stage, presumably postgenomics, but not labelled as such, which is “concerned with non-linear, adaptive, properties of complex dynamic systems”. This was an early statement of the idea that genomics pertains to the linear and deterministic while postgenomics opens out to nonlinear and nondeterministic facets of biology, but it differed from some of the accounts of later scholars by including aspects of this extra- and multi-dimensionality in genomics itself, rather than this being characteristic of postgenomics (Thieffry & Sarkar, 1999, p. 226).

Adrian Mackenzie has also evaluated genomics in terms of dimensionality. He identifies the period roughly between 1990 and 2015 as “the ‘primitive accumulation’ phase of genomics” that “has yielded not only a highly accessible stock of sequence data but sequences that can be mapped onto, annotated, tagged, and generally augmented by many other forms of data”. The single dimensionality of sequence data produced in this phase of genomics is something to be augmented with—and related to—other forms of data; it is the challenge of dealing with dimensionality that characterises “post-HGP [Human Genome Project]” biology (Mackenzie, 2015, pp. 79 and 91). In this conception, postgenomics is defined in terms of both the use of existing sequence data and associated infrastructures, and the establishment of connections between genomic data and other forms of ‘omic’ data (Stevens, 2015). Here, postgenomics involves the results and modes of research of genomics being brought together with other types of biological traditions and outputs, a process that is characterised by the advent of new forms of labour, for example the figure of the curator (Ankeny & Leonelli, 2015).

The designation of something called postgenomics as an endeavour to contextualise sequence data indicates that genomics became conceptualised in terms of sequence production, rather than involving both sequence production and use, and featuring a range of different ways in which production and use were related and combined. This production-centred interpretation tallies with an approach to genomics that foregrounded an increase in the efficiency and speed of data production, with the pressure of this drive helping to manifest and reify a strict division between producers (submitters) and users (downloaders). However, as we have shown elsewhere (García-Sancho & Lowe, 2022) and further illustrate throughout the book, a sharp division existed only within the IHGSC effort; other approaches to genomics featured different configurations and entanglements between sequence production and use. Additionally, contextualisation of sequence data has been pursued both before and during the production of a reference genome, as much as afterwards. We can, therefore, conclude that contextualisation is not a defining attribute of postgenomics: rather, following the advent of a reference genome, existing forms of contextualisation are altered and new ones are established.

What do multi-dimensionality, augmentation, integration and contextualisation mean in the post-reference genome world? They relate to ideas of completeness, comprehensiveness and the capturing of a whole or a totality. At the 1998 conference mentioned above, the biologist and scientific administrator Ernst-Ludwig Winnacker, founder of the Genzentrum, a major yeast and human sequencing centre (Chap. 2), said that postgenomics should be about “an understanding of the whole” (Thieffry & Sarkar, 1999, p. 223).

Historian Hallam Stevens has articulated how genomics itself seeks wholeness and comprehensiveness. Drawing on his detailed study of the Broad Institute, and mostly informed by human genomics, he presents genomics as a special form of data-driven bioscience. The nature of data in genomics makes it amenable to the adoption and development of bioinformatics and information technology-based approaches more generally.Footnote 4 Stevens’ interpretation of genomics is that the investigation of the particular is replaced by a sensibility that aims to characterise the totality. Totality and generality are key. For example, he points to the “Added value” generated by having completely sequenced genomes (Stevens, 2013, p. 161, quoting Bork et al., 1998).

Stevens stresses the dialectic of sequence data production and the development and incorporation of informatics infrastructures and approaches. Successive different structures of databases are indicative of shifts from pre-genomics to genomics to postgenomics. Genomics is about producing databases; the reference genome and a particular way of storing and presenting data—relational databases—are mutually constitutive. Distinctions between pre-genomics, genomics and postgenomics are therefore made on the basis of the structure of databases and the place and role of DNA sequence data within them. When researchers increasingly wanted to relate DNA sequence data to other forms of data (e.g. various omics data), this necessitated a shift from one kind of database structure to another. The relational databases that were able to capture well the single dimension of DNA sequence data catalogued in strings of As, Ts, Cs and Gs, therefore gave way to more complex networked databases (Stevens, 2013).

These interpretations of genomics are usually based on specific institutions and infrastructures, often in the orbit of the IHGSC. In such expositions, the effort of producing a reference genome is detached from prior genomic research, parallel genomic research (for example, by medical geneticists), and work following it. Accompanying this separation is the projection of distinct and exclusive attributes onto pre-genomic, genomic and postgenomic research.

As an alternative, we propose the designations of pre-reference genomics, reference genomics and post-reference genomics. This periodisation scheme is based on the availability (or otherwise) of an object—the reference genome—and the relationship of particular communities to it. It does not presume that each stage will exhibit specific essential characteristics. Our approach emphasises the historicity and specificity of reference genomes and helps us to discern a more fluid interconnectedness between stages. In the rest of the chapter, we illustrate this by comparing post-reference genome research on yeast, human and pig.

2 Improving Genomes

Reference genomes are not static: they are amended over time, with updated versions evaluated and validated using metrics that enable direct comparisons to be drawn between the new and the old. Even when a reference genome is considered to be ‘complete’—as the human reference genome was famously deemed in 2004—it still subsequently undergoes revisions that are intended to improve it according to existing and novel benchmarks. In what follows, we examine revisions of the human, yeast and pig reference genomes and how metrics and judgements of quality changed according to evolving and distinct objectives for the three species.

We have seen that Celera Genomics saw their full human sequence as provisional and in need of constant improvement and enrichment. This was in order that their corporate effort would be seen to offer sufficiently more value than the publicly-available data to justify paying a subscription to access it. Indeed, as we indicated in Chap. 6, Celera kept developing its whole-genome sequence: new additions that were incorporated after the initial public release in 2001 were only accessible with a paid subscription.

The working draft of the IHGSC sequence (release name: hg3) was published on the University of California Santa Cruz’s (UCSC) website on 7th July 2000. At this stage, though, it was just the sequence data that could be downloaded, with a UCSC browser to visualise it still being in the works. This version had significant gaps and ambiguous positioning of sequenced fragments. The major draft published in February 2001 could also be downloaded from the UCSC website. In the Nature paper accompanying its release, it was estimated that the draft encompassed 96% of the euchromatic regions, the parts of DNA open to transcription.Footnote 5 Much as with the addition of the pig Y chromosome sequence to the new Sus scrofa reference genome assembly in 2017 (Chap. 6), future reference assemblies of the human genome would incorporate data from several sources.

The quality of the human and other reference genomes has been assessed in a number of ways: in terms of coverage, contiguity and accuracy.

Coverage is a metric we have already encountered; it is a function of the depth of sequencing, roughly how many ‘reads’ cover each nucleotide position on average across the genome. It is expressed as a number followed by X (for example, 10X), with the number designating the average number of reads covering each position. However, there may be heterogeneity in the coverage of different regions of the genome. This outcome can be inadvertent, due to the clones captured in library production not evenly representing all areas of the genome, or because of the exigencies of assembling regions with different genomic properties. Or it may be deliberate, due to the kind of targeting we saw in swine genome sequencing at the Sanger Institute.
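The difference between an average coverage figure and the unevenness that can lie beneath it can be illustrated with a minimal sketch in Python. The ten-position toy genome, the read coordinates and the function names are hypothetical and chosen purely for illustration; they are not drawn from any of the projects discussed in this book.

```python
def mean_coverage(read_lengths, genome_length):
    """Average coverage: total sequenced bases divided by genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(read_lengths) / genome_length


def per_position_depth(reads, genome_length):
    """Count how many reads cover each position.

    `reads` is a list of (start, length) pairs with 0-based start positions
    (a hypothetical representation used only for this sketch).
    """
    depth = [0] * genome_length
    for start, length in reads:
        for pos in range(start, min(start + length, genome_length)):
            depth[pos] += 1
    return depth


# Toy data: four reads of four bases each on a genome of ten positions.
reads = [(0, 4), (2, 4), (2, 4), (6, 4)]
genome_length = 10

average = mean_coverage([length for _, length in reads], genome_length)
depth = per_position_depth(reads, genome_length)

print(f"average coverage: {average}X")
print("depth by position:", depth)
```

In this toy case the average is 1.6X, yet individual positions are covered between one and three times, which is the kind of heterogeneity that a single coverage figure conceals.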

Contiguity is the extent to which the building blocks of an assembly, such as contigs or scaffolds, are connected together. A contig is a continuous sequence in which the statistical confidence level in the order of the nucleotides exceeds a stipulated threshold, while a scaffold is a section of sequence that incorporates more than one contig, together with gaps of unknown sequence. The measured level of contiguity affects the classification of the level of a sequence assembly in the GenBank database. The designation of being a “complete genome” requires that all chromosomes should have been sequenced without gaps within them. Then there is a “chromosome” level of assembly: to qualify for this level, a sequence must encompass at least one chromosome, ideally with a complete contiguous sequence; if gaps remain, there need to be multiple scaffolds assigned to different locations across the chromosome. The other two levels are “scaffold” and “contig”, pertaining to the definitions of those objects.Footnote 6 Note that these are ways of assessing genome assemblies. They do not necessarily determine whether an assembly is designated as a ‘reference genome’ or the lesser category of ‘representative genome’ by the RefSeq database (Chap. 1, note 3), both of which are incorporated in the notion of reference genome we deploy across this book. As with improvements to mapping procedures or the evaluation of new genome libraries (Chap. 5), completeness can also be ascertained by searching for known genes or markers in the assembly, and enumerating those found and not found.

As well as these designations, there are metrics that are used to assess the contiguity of assemblies in a more fine-grained way. The most significant are the enumeration of the gaps (and the different kinds of gaps) and the estimated sequence length they represent, and also the calculation of N50 and L50 figures. The L50 figure is the smallest number of contigs whose total sequence lengths add up to at least 50% of the total length of the assembly. The N50 figure is the length of the shortest contig that constitutes part of the smallest set of contigs that together add up to at least 50% of the total length of the assembly. The L50 figure will therefore be expressed as a simple integer, while the N50 figure will be expressed in terms of numbers of nucleotides. These figures pertain to the length of the assemblies, rather than the presumed length of the actual chromosomes or whole genomes that are being assessed. For assemblies of the same length, the quality is presumed to be higher if the N50 figure is larger and the L50 figure smaller. The original draft human reference sequence, published in 2001, contained N50 figures for individual chromosomes and the genome as a whole. Gaps were counted across the assembly. These metrics enabled areas for improvement to be identified and analysed, but also provided a benchmark against which further improvements could be assessed.
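The two definitions above can be made concrete with a short Python sketch. The function name and the contig lengths are hypothetical, invented for illustration only; the calculation simply follows the definitions of N50 and L50 given in the text.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    L50: the smallest number of contigs whose lengths sum to at least
         half of the total assembly length.
    N50: the length of the shortest contig in that smallest set.
    """
    if not contig_lengths:
        raise ValueError("at least one contig is required")
    ordered = sorted(contig_lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Two hypothetical assemblies of the same total length (300 nucleotides).
more_contiguous = [80, 70, 50, 40, 30, 20, 10]
more_fragmented = [30] * 10

print(n50_l50(more_contiguous))  # (70, 2)
print(n50_l50(more_fragmented))  # (30, 5)
```

Both toy assemblies have the same total length, but the first has the larger N50 and the smaller L50, and so would be presumed to be of higher quality on these measures.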

Finally, there are measures of accuracy, which is the extent to which an assembly—and the parts thereof—is ‘correct’. This can relate to different aspects, such as the order and orientation of sequenced clones in the assembly, or pertain to the ‘base calls’—the assignment of the identity of individual nucleotides in each position in a DNA molecule—at the sequence level. This is, of course, trickier to execute than the other measures of quality, as it requires not just the measurement of the properties of the assembly and the construction of comparable metrics, but also necessitates assessment against a recognised standard. In the 2001 human reference sequence paper authored by the IHGSC, the accuracy of the assembly was evaluated by comparing it against an ordering of parts of the genome as dictated by sequence data derived from the ends of the cloned fragments in the DNA libraries used in the sequencing. This resulted in the identification of clones that did not overlap with others. These non-overlapping clones had been sought, as their presence could indicate misplacement of fragments in the assembly; they were subjected to closer investigation, resulting in “about 150” of the 421 “singletons” being attributed to misassembly.

Sequence quality at the level of nucleotides was evaluated in terms of the ‘PHRAP score’ for each one. The IHGSC used PHRAP and PHRED, software packages that were developed by Phil Green (both packages) and Brent Ewing (PHRED) at the University of Washington in Seattle. Together, they were, and still are, used for base calling. The software analyses the fluorescent peaks in the sequence read-out. It estimates error probabilities for each base call based on figures obtained from the read-out data and generates consensus sequences with error-probability estimates (Ewing et al., 1998; Ewing & Green, 1998).Footnote 7 The resulting PHRAP scores indicate the probability that an individual base call is incorrect, and therefore the overall accuracy of the sequencing. A score of 10 denotes an accuracy of 90% and that there is a 10% chance that any given base is wrong. A score of 20 means an accuracy of 99% (and a 1% chance of a given base being wrong), 30 means 99.9% (0.1% chance of a given base being incorrect), and so on. The 1998 Second International Strategy Meeting on Human Genome Sequencing held in Bermuda promulgated sequence quality standards that included an error rate of less than 1 in 10,000 (i.e. 99.99% accurate, a PHRAP score of at least 40) and a directive that the error rates derived from PHRAP and PHRED be included in sequence annotations.Footnote 8
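The scores listed above lie on a logarithmic scale: each additional 10 points corresponds to a tenfold reduction in the probability of error, following the relationship Q = -10 log10(P). A minimal Python sketch of the conversion follows; the function names are illustrative and are not taken from the PHRED or PHRAP software.

```python
import math


def error_probability(score):
    """Probability that a base call is wrong, given its quality score."""
    return 10 ** (-score / 10)


def quality_score(probability):
    """Quality score corresponding to a given error probability."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10 * math.log10(probability)


# Reproduce the scale described in the text for scores of 10 to 40.
for score in (10, 20, 30, 40):
    p = error_probability(score)
    print(f"score {score}: error probability {p:.4%}, accuracy {1 - p:.2%}")
```

On this scale, the Bermuda standard of fewer than 1 error in 10,000 bases corresponds to an error probability below 0.0001, and hence to a score of at least 40.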

Following the initial 2001 publication and online availability of a draft sequence, further assemblies were made available on the internet through GenBank, the DNA Data Bank of Japan and the European Nucleotide Archive (ENA), the last of which encompassed the databases housed at the European Bioinformatics Institute (EBI) from 2007 onwards. From December 2001, these assemblies were released using the name of the National Center for Biotechnology Information (NCBI), the institution into which GenBank was incorporated. NCBI Build 28 was the first release labelled in this way.Footnote 9 Then, in April 2003, the first assembly that constituted a human reference sequence was published, known as NCBI Build 33.Footnote 10

The 2004 IHGSC paper on the ‘finished’ euchromatic sequence was working from a subsequent assembly, NCBI Build 35 (International Human Genome Sequencing Consortium, 2004). In their analysis of this build, the authors compared it against the 2001 version using some of the measures indicated above, but also pursued some deeper analysis of the quality of the new sequence. The new assembly had 341 gaps, compared to 147,821 in 2001. The N50 for the 2004 sequence was 38,500 kilobases, a dramatic improvement from the 81 kilobases determined for the 2001 version. To further examine the completeness of the 2004 assembly, the consortium looked for 17,458 known human cDNA sequences in it and found that the “vast majority (99.74%) could be confidently aligned to the current genome sequence over virtually their complete length with high sequence identity”.

The 2004 paper assessed the accuracy of sequencing by inspecting discrepancies between nucleotides in the overlapping regions of 4356 clones from the same Bacterial Artificial Chromosome (BAC) library. This required some appreciation of the rate of polymorphism (genetic variation) across humans, as a difference in a single nucleotide could be due to this inter-individual or inter-group variation rather than constituting an error. Later in this chapter, we see how an appreciation of genomic variation and diversity was vital to making functional use of genomic data; here, we see how such an understanding, however tentative, played a part in fundamental analyses of the quality of a reference sequence itself.

Alongside these assessments, the IHGSC members evaluated whether junctions “between consecutive finished large-insert clones” that they had used “to construct the genome sequence” were spanned by another set of fosmid clones derived from a library that they created for this purpose (International Human Genome Sequencing Consortium, 2004, p. 936). With approximately 99% of the euchromatic sequence deemed to be of the requisite finished quality, the attention of the sequencers turned to the recalcitrant 1% and the heterochromatic regions, which would require new methods and materials to resolve, rather than merely a continuing scale-up of sequence production. Next-generation sequencing methods, including long-read technologies that sequence larger stretches of DNA and therefore reduce the number of problematic gaps or misassemblies, have assisted in this (e.g. Nurk et al., 2022). Furthermore, fundamental research pertaining to particular problematic regions has generated data and information that has enabled the creators of successive assemblies to amend and improve these refractory areas.

In addition to improvements to a single canonical reference sequence, attempts were increasingly made to ensure that the reference genome was more reflective of the variation manifested by the target organism. For instance, this was realised by creating the possibility of depicting alternate loci, contigs and scaffolds that differ from the reference sequence in databases and visualisations. An example of a new presentational mode that conveys different kinds of variants alongside the reference sequence is the pangenome graph, showing where these variants diverge from the standard and how common their departures from the reference version are (Khamsi, 2022).

In order to move towards a model of reference assemblies that incorporated variation, and to manage and conduct this ongoing work, the Genome Reference Consortium was established in 2007 by the Sanger Institute, the McDonnell Genome Institute (the new name of the genome centre at Washington University), the EBI, and NCBI. They initially focused on three species: human, mouse and zebrafish, the latter two because of their role as model organisms and due to existing investments in creating gene knock-out collections for these species (Church et al., 2011). Since then, rat and chicken—also model organisms—have been added, and The Zebrafish Model Organism Database and the Rat Genome Database have joined the consortium.Footnote 11

Pig and yeast are notably absent from the Genome Reference Consortium. In the case of yeast, ongoing improvements to the sequence and annotation of the reference genome, first released in 1996, are performed by the Saccharomyces Genome Database at Stanford University, with both the sequence and annotation treated as “a working hypothesis” subject to continual revision (Fisk et al., 2006). A major revision of the yeast reference genome was completed in 2011, using a colony derived from the AB972 sub-strain of S288C. Linda Riles had used AB972 to construct the genome libraries for the original sequencing of the yeast genome. The new sequence reads were aligned to the existing reference genome, with low-quality mismatches discarded and manual assembly and editing of the genome conducted, which involved checks of the literature for particular sequences and annotations.

While the comparison with the older reference affirmed the quality of that earlier standard, the new assembly made numerous corrections to it. The authors of the paper announcing it had sufficient confidence in it to suggest that the reference sequence was now comprehensive and accurate enough that, in future revisions, greater weight would be given to incorporating variation rather than fixing errors. They also suggested that, having worked towards and largely achieved a highly veridical representation of a single strain, the focus of yeast reference genomics should shift towards creating the most useful representation of the organism. One of the stated implications of this was the need to develop a pangenome including annotated sequences representing different S. cerevisiae laboratory strains and wild specimens, using some of the copious data being generated on these, as well as on related species (see Sect. 7.4; Engel et al., 2014).

In pig genomics, the first major revision after the completion of the reference genome (represented by the 2011 Sscrofa10.2 assembly) was released in 2017. The impetus for producing a new reference genome was provided by a team led by Tim Smith at the US Department of Agriculture’s Meat Animal Research Center (USDA MARC). They sequenced a boar from a population whose breed ancestry was estimated to be half Landrace, quarter Duroc and quarter Yorkshire. Smith was using Pacific Biosciences long-read sequencing technology, which held the promise of greater contiguity of sequence and fewer potential issues with assembly. However, when others in the pig genome community found out about Smith’s endeavour, the error rate for this technology made them sceptical of its worth.

They did, however, work with Smith and his team to produce a new reference genome. Together, they hit upon the strategy of using Pacific Biosciences long-read technology in conjunction with more reliable Illumina short-read technology. This, combined with the improved chemistry of the newer versions of the Pacific Biosciences technology, helped them to produce a high-quality assembly that formed the basis for Sscrofa11, which became the designated reference genome Sscrofa11.1 when the Y-chromosome data from the X+Y project (Chap. 6) was incorporated. Alan Archibald at the Roslin Institute used money acquired from the UK Biotechnology and Biological Sciences Research Council to fund a large part of this effort, paying Pacific Biosciences for an initial assembly that the community could then work on further. He was fortunate that the contractor Pacific Biosciences had engaged to do this had fallen behind schedule, meaning that Pacific Biosciences took it in-house and conducted the work themselves, ensuring that the project benefitted from the latest chemistry and the best expertise on deploying their technology.Footnote 12

The USDA assembly—resulting from Smith’s original work—was submitted separately, though it was compared with the new reference sequence in the eventual paper reporting its completion. Multiple metrics—such as the number of gaps between scaffolds, the coverage and the N50—demonstrated the superiority of Sscrofa11.1 to Sscrofa10.2, and this higher quality ensured a better automated annotation through the Ensembl pipeline, including a doubling of the number of gene transcripts identified (Warr et al., 2020).

It is worth observing here, though, that interpretations of the quality of assemblies are not straightforward. For example, the 2011 assembly Sscrofa10 has more scaffolds and gaps between scaffolds, and ‘worse’ N50 and L50 figures for scaffolds and contigs, than 2010’s Sscrofa9.2. This does not mean that the assembly is of a lower quality, but that additional chromosomes (such as the Y chromosome) and extra-nuclear DNA had been included in the assembly. The Y chromosome notoriously contains many repetitive sequences that are consequently difficult to assemble.
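The effect described here can be illustrated with a minimal sketch of how N50 and L50 are conventionally computed. The scaffold lengths below are invented for illustration and do not correspond to any Sscrofa build; the function name is likewise hypothetical.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold or contig lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the assembly;
    L50 is the number of sequences in that set.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("at least one sequence length is required")


# Invented example: three well-assembled scaffolds.
earlier_build = [100, 80, 60]

# The same scaffolds plus twenty short fragments, standing in for a
# newly included, repeat-rich chromosome that assembles poorly.
later_build = earlier_build + [10] * 20

print(n50_l50(earlier_build))  # (80, 2)
print(n50_l50(later_build))    # (60, 3)
```

In this toy case the original three scaffolds are untouched, yet N50 falls and L50 rises simply because more of the genome has been included. A 'worse' figure can therefore reflect a change in what the assembly represents rather than a loss of quality.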

This example shows that reference assemblies can constitute—and therefore represent—different objects, even within the same species. Furthermore, for the pig, in addition to the reference assemblies of the Swine Genome Sequencing Consortium, there is the USDA MARC assembly. There have also been other assemblies published for different breeds of pig (including Chinese breeds by the company Novogene) and the minipig used for biomedical research (sequenced by GlaxoSmithKline and BGI-Shenzhen, formerly the Beijing Genomics Institute), as well as other sequences concerning a variety of breeds and populations of pigs. These more specific references, with some recognised in formal designations and database entries and others not, are examined later in the chapter.

The discussion above shows that reference genomes are not monolithic, static objects. They are continually improved, impelled towards an ever-receding horizon of completeness. But parallel to this continual improvement of the standard reference sequence, genome assemblies have also ramified, as we see with the compiling of genomes for distinct breeds of pig. Additionally, for human and yeast, new aims that guide the evaluation of reference genomes in ways that go beyond the quality metrics of old (e.g. N50) have emerged, especially concerning the variation that the reference sequences instantiate. However, this concern with variation and variants is not something that arises after the reference genome, as the story of the IHGSC and the supposed emergence of a postgenomic era may suggest: it was already present beforehand.

3 Functional and Systematic Genomics Before Reference Genomes

Pre-reference genomics occurred in different eras for each of the species: up to the mid-1990s in the case of yeast, until the late-2000s for pig and preceding the turn of the millennium for the human. These distinct timeframes are pertinent because none of the developments in genomics for these species or any others have occurred in a vacuum: particularities of each were mediated by the adaptation and adoption of tools, methods and data produced for other species, and the comparative inferential apparatus that was constructed to enable such translations.

For yeast (Chap. 2), we saw that comprehensive genetic linkage maps were produced well before the initiative to sequence the genome started. Extensive physical maps were produced by Maynard Olson in the 1980s, building on Robert Mortimer’s earlier genetic linkage maps, and then later physical mapping was conducted by the groups in charge of the sequencing to aid this undertaking for each chromosome. In the case of yeast, the dominant focus of the community was on one laboratory strain that had already had much of its variation abstracted from it in the process of its construction as a model organism.

For human, as discussed in Chap. 3, a great deal of data was generated on variation through the medical genetics community, which extensively catalogued variants of particular genes and associated these with clinical cases of specific diseases, such as for cystic fibrosis. Significant hospital-based human DNA sequencing took place, such as at the John Radcliffe Hospital in Oxford, Guy’s Hospital in London or the University of Toronto Hospital for Sick Kids. Yet, because of the notable absences of these medical genetics groups from the IHGSC membership, these maps and sequences were only marginally accounted for in the production of the reference sequence.

For the pig, mapping projects generated considerable amounts of data concerning the variation of particular genetic markers, which were discerned through crosses of different breeds suspected to be genetically distinct owing to the geographical disparity of their origins and their morphological differences. The familiarity of these geneticists with the kinds of markers used in these studies enabled a subset of them to pursue the European Commission (EC) funded projects PigBioDiv 1 and 2 (1998–2000 and 2003–2006, respectively) to characterise the genetic diversity of pig breeds first within Europe, and then across Europe and China (Ollivier, 2009). These projects, as well as prior studies of pig genetic diversity that had been conducted from the mid-1990s, represented an integration of functional and systematic approaches and concerns.

Many researchers in the pig breeding community have had research interests connected to the variation and diversity of both domesticated pigs and their wild cousins. As a result, these topics were even included in early genome mapping initiatives. A pilot study of genetic diversity across twelve rare and commercial breeds of pig formed part of the EC’s PiGMaP II programme (1994–1996).Footnote 13 PiGMaP II’s organisation reflected the collaborative division of labour approach of the PiGMaP projects more broadly, with various groups supplying DNA from, and pedigree information concerning, animals from specific breeds they had access to. Meanwhile, researchers from Wageningen University and INRA Castanet-Tolosan (a station near Toulouse) selected a panel of 27 microsatellite markers (repetitive sequences of variable length) on the basis of their level of polymorphism, distribution across the genome, and practicality for use in genomic studies. This panel of 27 microsatellites was subsequently adopted by the Food and Agriculture Organization of the United Nations (FAO) for studying pig genetic diversity. Max Rothschild, in his capacity as the pig genome coordinator for the USDA’s Cooperative State Research, Education, and Extension Service, ensured that the appropriate PCR primers for these markers were produced and distributed among the community. In addition to the use of the microsatellites that were themselves a key product of the PiGMaP collaboration, minisatellites and DNA fingerprinting for detecting genetic variation and diversity were also trialled in PiGMaP II.Footnote 14

Beyond PiGMaP, in addition to some of the other projects discussed in Chap. 5, the community sought to further develop their work on pig biodiversity. An initial follow-up was the ‘European gene banking project for pig genetic resources’ that ran from 1996 to 1998, which assessed nineteen breeds of pig using eighteen of the standard set of 27 microsatellites together with the blood group variants and biochemical polymorphisms that had been traditionally employed in studies of variation (Ollivier, 2009).

A major development in the elucidation of pig genetic diversity was the advent of the EC-funded demonstration project, ‘Characterization of genetic variation in the European pig to facilitate the maintenance and exploitation of biodiversity’, which officially ran from October 1998 to September 2000 and was retrospectively referred to as PigBioDiv1.Footnote 15 It was led from the Jouy-en-Josas station of the French Institut National de la Recherche Agronomique (INRA) with quantitative geneticist Louis Ollivier as the coordinator. The participation of Graham Plastow of the Pig Improvement Company (PIC) reflected interest in the project by the breeding sector. On the FAO side, the involvement of Ricardo Cardellino and Pal Hajas showed that those with a longer-term and strategic view of the future of livestock also held this work to be important.Footnote 16

The aim of PigBioDiv1 was to create a means to maintain and track genetic variation. This was motivated by the breeding sector’s assumption that additional sources of genetic variation were needed in order to enable the further improvement of their commercial breeding lines,Footnote 17 to ensure the sustainability of livestock agriculture, and to respond to changing consumer and regulatory demands that might entail new breeding goals. This approach was stimulated by, and aimed to address, a growing policy concern with the conservation of “animal genetic resources” to safeguard global food security. The FAO were central to this drive and published “The Global Strategy for the Management of Farm Animal Genetic Resources” in 1999 to that end (Food and Agriculture Organization, 1999). The concept of “genetic resources”, which has been traced back to the 1970s, was adopted by the FAO in 1983 and formed part of the framework of the UN Convention on Biological Diversity in 1992. It has been criticised for foregrounding an instrumental value of biodiversity (Deplazes-Zemp, 2018), and this is certainly true in the case of the PigBioDiv projects.

Following the widespread adoption of microsatellites in pig genome mapping and the pilot diversity project, and in the light of FAO recommendations for using them in examining genetic diversity, these highly polymorphic markers formed the basis of both the PigBioDiv1 and PigBioDiv2 (February 2003 to January 2006) projects (see Table 7.1).

Table 7.1 Summary of the participating institutions, breeds studied, genetic markers used and some of the results and outputs of the PigBioDiv1 and PigBioDiv2 projects. Based on multiple sources, including Ollivier, 2009, Megens et al., 2008, and (both accessed 19th December 2022)

Sharing a view expressed by other participants, Chris Haley, a quantitative geneticist involved in the PigBioDiv projects, has observed that this work was based on the assumption that genetic diversity reflected functional diversity. Yet, microsatellites were known to be non-functional parts of the genome.Footnote 18 It was this property, however, that enabled them to be so polymorphic, and therefore useful in mapping and tracking diversity. Furthermore, despite being non-functional parts of the genome, microsatellites still had applications in functional research. Indeed, markers such as these can be, and have been, used in animal breeding, where it is not strictly necessary to find a causative gene, but merely something (like a microsatellite) that is statistically associated with one or many genes that may themselves be implicated in phenotypic variation for traits of interest (Lowe & Bruce, 2019). As we shall see later in this chapter, SNPs generated by the pig genomics community and compiled into a SNP chip were used in this way, but were also applied in more systematic studies of pig genetics concerned with variation and diversity.

The importance of the particular historicity of the pig genomics community, and its involvement in multiple different projects of data collection and resource generation, should not be underestimated here. The creation of the means to identify and map markers, and exploit the data and mapping relations so generated, relied on a coming together of molecular and quantitative geneticists. In some cases, this occurred within institutions (such as at the Roslin Institute with Chris Haley and Alan Archibald, partly driven by the immediate history of that institution, see Myelnikov, 2017; Lowe, 2021) or within the overall cooperative division of labour that had been forged. This community has been able to work with populations of livestock with well-recorded pedigrees, manipulate breeding in those populations, and produce data, tools and techniques intended to aid the improvement of selective breeding practices.Footnote 19 For this community, associated as they have been with the pragmatic and instrumental concerns of breeding, genetic variation has constituted a potential resource that breeders could exploit to improve populations in the ways they desired. The pig geneticists therefore developed a different disposition to the one that prevailed in medical genetics, a discipline that has been chiefly concerned with deleterious variants, or the one in yeast biology wherein the use of a standardised model strain with variation abstracted away has been a crucial basis of research. This helps to explain why systematic and functional studies were less entangled early on in yeast and human genomics compared to research on the pig.

As well as aiding this functionally-oriented research, the instrumental discernment of pig genetic diversity has also contributed to the identification of Quantitative Trait Loci (QTL), sites of genomic variation associated with phenotypic variation. This is unsurprising, given that the mapping of the pig genome from the early-1990s onwards involved the crossing of breeds that were assumed to be genetically distinct, and that this work was itself directed towards developing the methods to home in on QTL. The generation and exploitation of diversity was implicated in this more direct form of functionally-oriented research from the beginning of pig genomics. In the words of the summary of PigBioDiv2 on the European Union’s CORDIS website, through this research, “the discerning customer can not only demand tasty meat but can help to power the academic drive for conservation”.Footnote 20

This research also added to the data and knowledge concerning other systematic aspects of the pig: phylogenetic relationships, evolutionary history, processes of domestication, and more recent histories of genetic exchange and relationships between breeds. One of the key challenges and contributions of the project was in measuring diversity. They adapted an approach to measure diversity devised by the economist Martin Weitzman, which involved measuring the genetic distance between pairs of populations using the marker data.Footnote 21 The genetic distances were then used to cluster the populations and infer phylogenetic trees and relationships between them. It therefore provided insights into the relationships between different populations, including between European and Chinese ones, and between the patterns of variation prevailing in those two regions.Footnote 22 They attributed these patterns to historical flows of genes that resulted from different modes of domestication and ways of organising livestock farming and breeding (Megens et al., 2008; SanCristobal et al., 2006).
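Weitzman's diversity measure is usually stated recursively: the diversity of a set of populations is the largest value, over each member, of the diversity of the set without that member plus the distance from that member to its nearest remaining neighbour. The sketch below implements that recursion under stated assumptions. The population labels and pairwise distances are invented, and nothing here reproduces the PigBioDiv data or the project's actual software.

```python
from functools import lru_cache


def weitzman_diversity(populations, distance):
    """Weitzman diversity of a set of populations.

    `distance` maps frozenset({a, b}) to the genetic distance between
    populations a and b. The recursion is exponential in the number of
    populations, so this sketch only suits small sets.
    """

    @lru_cache(maxsize=None)
    def diversity(subset):
        if len(subset) <= 1:
            return 0.0
        best = 0.0
        for member in subset:
            rest = subset - {member}
            nearest = min(distance[frozenset((member, other))] for other in rest)
            best = max(best, diversity(rest) + nearest)
        return best

    return diversity(frozenset(populations))


# Hypothetical pairwise genetic distances between three populations.
toy_distance = {
    frozenset(("A", "B")): 1.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("B", "C")): 5.0,
}

total = weitzman_diversity(["A", "B", "C"], toy_distance)
without_c = weitzman_diversity(["A", "B"], toy_distance)
print(total)              # 6.0
print(total - without_c)  # 5.0, the diversity lost if population C disappears
```

The difference between the two values illustrates why such a measure was attractive for conservation planning: it assigns each population a marginal contribution to overall diversity, which can then be weighed against the cost of maintaining it.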

In the next section, we show that the close relationship between these two modes of functional and systematic research—and the continuity of researchers and institutions—persisted through the production of the pig reference genome and into the aftermath of its completion. In yeast and human, with few exceptions, these modes of research were considerably less entangled in the immediate aftermath of the production of the reference genome.Footnote 23

4 After the Reference Genome

4.1 Yeast: Successive Endeavours

EUROFAN, the European Functional Analysis Network, was always considered to be the next step after the Yeast Genome Sequencing Project (YGSP) by the community of S. cerevisiae genomicists. Although individual laboratories functionally interpreted and made use of some of the data from the sequencing project in their research, more concerted large-scale functional analysis was postponed until after the completion of the reference sequence. A high-quality reference would be needed in order to effect the targeted gene deletions that formed the centrepiece of EUROFAN.Footnote 24 Like the pig biodiversity projects examined above, EUROFAN benefitted from initial pilot programmes. In the case of yeast, researchers used these pilots to develop modes of gene disruption and methods of phenotypic assay for functional analysis.

The EUROFAN participants used the same yeast strain as in the sequencing project: S288C. S288C is a laboratory strain and has therefore had invariance in and between its colonies strictly enforced. As a result, for functional analysis it was necessary to create variation, so researchers could uncover the functions of genes in the reference genome. This was done by producing a new resource, a library of mutants. EUROFAN was conceived as a continuation of the annotation of the well-established and comprehensive reference genome, and indeed recapitulated the hierarchical but dispersed nature of the prior effort to sequence the yeast reference genome, especially the EC-funded portion of it. A division of labour was instituted, between:

  • The overall coordination of the project;

  • Liaison with the Yeast Industrial Platform (Chap. 2);

  • An informatics strand—based at the Martinsried Institute for Protein Sequences (MIPS)—to manage and assess the quality of submitted data and develop a database and computational tools for data analysis;

  • The creation of the mutants;

  • The storage, curation and distribution of the mutant collection;

  • Various kinds and stages of functional analysis occurring at the bench.

Like the YGSP, EUROFAN therefore involved a wide variety of institutions. There was considerable continuity between the participants in the YGSP and EUROFAN, and consequently it involved a set of laboratories working on the cell biology, molecular biology and biochemistry of yeast. This approach reflected a continued perception of the value of these large-scale networked projects for the research endeavours of these laboratories, and the advantages of coordinating such laboratories in a network for further genomic analysis.Footnote 25 In this way, the model of functional analysis was conceived as a means of contributing material towards the further investigation of genes, rather than it being intended to transform the basis of “normal” yeast biology.Footnote 26 Indeed, all but two of the 21 participating laboratories in EUROFAN had also been involved in the YGSP; roughly a quarter of the members of that prior effort took part in EUROFAN.Footnote 27 The creation of a curated resource in the form of a mutant collection as well as the ongoing annotation of the reference genome was attractive to the EC, but also meshed explicitly with the imperative to add more resources to the toolkit of yeast as a eukaryotic model organism.

The project was labelled as systematic (in the adjectival sense) rather than comprehensive. This is because only some of the sequences that were thought potentially to contain protein-coding genes were investigated. The work included an assessment of Open Reading Frames (ORFs) identified in the reference genome sequencing. ORFs are DNA sequences lying between a start codon and a stop codon, the signals that begin and terminate the translation of messenger RNA into protein. A workflow determined which of these ORFs would undergo successive forms of “increasingly specific” functional analysis. As a result, only a portion of the total of ORFs and genes identified through the initial sequencing and structural annotation of the reference genome were fully functionally characterised.
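To make the notion of an ORF concrete, the following is a minimal sketch of how candidate ORFs can be located computationally: scan each of the three forward reading frames for an ATG start codon followed, in frame, by a stop codon. It is a simplification resting on several assumptions: it ignores the reverse strand and introns, the `min_codons` threshold is a hypothetical parameter, and it is not the procedure used by the yeast sequencing consortia.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    `start` is the index of the A in ATG; `end` is exclusive and
    includes the stop codon. Only ORFs with at least `min_codons`
    codons before the stop codon are reported.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # resume the search after the stop codon
    return orfs


# Toy sequence containing one short ORF: ATG AAA TTT TAG
toy = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(toy):
    print(start, end, frame, toy[start:end])  # 2 14 2 ATGAAATTTTAG
```

A scan of this kind only proposes candidates. Whether a given ORF corresponds to a functional protein-coding gene is exactly what the deletion-based analysis was designed to test.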

The functional analysis commenced with deletions of specific ORFs through the design of constructs, known as ‘gene replacement cassettes’, and their insertion into yeast DNA. These gene replacement cassettes contained a gene (kanMX) that conferred resistance to the fungicidal chemical geneticin. Applying geneticin therefore left only the yeast cells that had integrated the cassette into their DNA and had consequently suffered the deletion of the ORF. This method was developed in the midst of the YGSP by Peter Philippsen, the coordinator of the sequencing of chromosome XIV of S. cerevisiae, and Achim Wach at Biozentrum (Wach et al., 1994). By observing and measuring the impact of the successful deletion of a specific ORF on the organism, researchers could infer the functional role that it played in yeast, for instance whether the deleted ORF was part of a protein-coding gene.

By 1996, researchers at the European Molecular Biology Laboratory (EMBL) had finished comparing ORF sequence data to protein sequences held in public databases. On the basis of sequence similarities, they made functional predictions for over half of all identified yeast genes (Bassett Jr et al., 1996). EUROFAN concentrated on the genes for which functional predictions of this kind were not possible. These were the so-called ‘orphans’: “novel genes discovered from systematic sequencing whose predicted products fail to show significant similarity when compared to other organisms, or only show similarity to proteins of unknown functions” (Dujon, 1998, p. 617). Functionally characterising these kinds of genes in EUROFAN would be particularly useful, considering yeast’s role as a model organism and in biotechnology. As a model organism, it would constitute a richer platform for inferring the functional implications of homologous sequences found in the less well-characterised genomes of other species. For biotechnology, the genes with novel functions that were identified could be expressed within yeast itself to yield potentially valuable products or be inserted into other organisms by transgenic techniques.

EUROFAN, as well as filling in the orphan gaps left after the EMBL’s analysis, aimed to observe gene effects and functions in ways that were missed by what leading yeast biologist Stephen Oliver described as the “function-first” approach of “classical genetics”, which relied on the detection of some observable heritable variation or change to infer the presence and function of a gene. Instead, in EUROFAN they deleted known genes to produce mutants, and then measured the quantitative effects of this, for instance on growth rates of the cells through competitive growth experiments, or the biochemical effects as assessed through measurement of metabolite concentrations (Oliver, 1997).Footnote 28

EUROFAN created mutants based on the deletion of 758 ORFs and then proceeded towards analysis of the deletants, which was led first by Peter Philippsen and then by Steve Oliver. In addition to this, parallel projects led by YGSP participants created mutant strains of smaller numbers of ORFs and Bernard Dujon’s laboratory committed “mass murder” by deleting multiple ORFs at a time and then characterising the mutant phenotypes arising from these (Goffeau, 2000).

Nevertheless, the desire to identify all of the genes in S. cerevisiae and characterise all ORF deletants remained. Funds to realise this came through a collaboration between two of the leading US figures in the original sequencing project: Mark Johnston at Washington University and Ron Davis of Stanford University. Johnston got a grant from the National Institutes of Health (NIH) for the period 1997 to 2000 for ‘Generation of the Complete Set of Yeast Gene Disruptions’, an initiative to create a comprehensive catalogue of S288C deletion strains, affecting all its genes. Davis also obtained a grant from the NIH to provide the tens of thousands of oligonucleotides—synthetic DNA sequences—that were needed for the production of the deletion cassettes (Giaever & Nislow, 2014).

This work, running from 1998 to 2002 and hosted at Stanford, became the Saccharomyces Genome Deletion Project, now Yeast Deletion Project, a consortium that involved many of the leading actors in European yeast genomics as well as the North Americans, including Howard Bussey at McGill in Canada (Giaever et al., 2002; Winzeler et al., 1999).Footnote 29 It was complementary to, and in many respects a development of, EUROFAN. The consortium analysed the deletion strains thus produced under several growth conditionsFootnote 30 and sent the strains—containing DNA barcodes to enable linkage of material and data resources—to be preserved and distributed by repositories such as ATCC (the American Type Culture Collection) and EUROSCARF (the European Saccharomyces Cerevisiae Archive for Functional Analysis).Footnote 31

All this functional annotation was captured by databases set up specifically for yeast biologists to be able to exploit the data deluge being generated by these projects. The Saccharomyces Genome Database (SGD) was founded in 1993 and first made available through the internet in 1994. It is primarily funded by the NIH—through the National Human Genome Research Institute (NHGRI)—and is hosted at Stanford University. SGD curators compile and integrate data on S. cerevisiae with the aim of presenting functionally annotated genomic data to yeast biologists in a usable form, providing them with a variety of tools that allow them to interrogate functional relationships and interactions (Dwight et al., 2004).Footnote 32 The Comprehensive Yeast Genome Database (CYGD) was established at MIPS and intended to be a development of the prior work conducted at MIPS and the European sequencing and functional annotation consortia. Expert curators manually annotated the yeast genome, using data from EUROFAN and other allied projects. Its main objectives were two-fold: to develop an informatics infrastructure to analyse and annotate complex interactions in the yeast cell and later to link data being generated on other species of yeast to S. cerevisiae, using comparative genomic approaches to improve the annotation of S. cerevisiae using this data (Güldener et al., 2005).Footnote 33

The functional efforts that populated these databases involved the creation of variation in a compendious fashion using a single well-characterised strain of yeast on the basis of a high-quality reference genome. This was a key difference from pig genomics, in which there was a long tradition of investigating variation before the reference genome was produced. For the human, there was also a tradition of investigating variation through medical genetics, but it became disconnected from the IHGSC effort to produce a reference sequence of the whole human genome.

In yeast, the functional analysis of this variation was intended to improve the value of the reference sequence by producing data to help annotate it. More broadly, it was pursued to generate and provide data and physical resources (the mutant strains), which could be used by the wider yeast research community for their own purposes, thereby improving the value of the species as a model organism. As with the YGSP, the creation of reference resources, both bioinformatic and material, was accompanied by the generation of implementable knowledge about the genome of the species that could inform the further study of wider aspects of its biology.

The creation of these reference resources also enabled the production of reference sequences for other strains of S. cerevisiae and related species. This led to a florescence of comparative and evolutionary-focused studies on S. cerevisiae and other types of yeast. One leading example is a network in which six French laboratories associated with the Centre National de la Recherche Scientifique (CNRS)Footnote 34 worked with the French national sequencing centre Genoscope. This network, Génolevures, was a programme of comparative genomics research concerning the ‘Hemiascomycetous’ budding yeasts, a group that includes S. cerevisiae.Footnote 35 In the first round of this initiative, Genoscope sequenced the genomes of thirteen species in this group at a low coverage of between 0.2 and 0.4X. The participating laboratories then analysed this sequence data with reference to S. cerevisiae, which served as a comparator, an “internal standard” according to Horst Feldmann’s description (Feldmann, 2000). This comparative approach facilitated the manual annotation of the thirteen new genomes, and, in turn, enabled the identification of 50 new genes for improving the annotation of the S. cerevisiae (S288C) reference genome. From 2000, all sequence and comparative data were stored in the Génolevures database, which has since been succeeded by three more specialised databases to hold the results produced by the consortium.Footnote 36

In 2002, Genoscope agreed to sequence the reference genomes of four species at a much higher 10X coverage: Kluyveromyces lactis, Debaryomyces hansenii, Yarrowia lipolytica and Candida glabrata (Souciet, 2011). The first three had been analysed in the initial Génolevures project; the last is a human pathogen closely related to S. cerevisiae. The comparison between the genomes involved a study of evolutionary conservation and divergence, which allowed researchers to identify and then investigate a variety of evolutionary changes that occurred in and between each of the phylogenetic branches (the lineages) that the species represented. This formed the basis for further investigations in the systematic mode, including the sequencing of additional species. Intriguingly, the comparative genomics that constituted, and was enabled by, Génolevures also allowed researchers to unveil manifold differences in gene content between the related species. These data were useful for further investigation into the physiological differences between them, and therefore advanced functional analysis as well (Bolotin-Fukuhara et al., 2005; Souciet, 2011).

This connection between the functional and systematic modes of yeast genomics was recognised by leading members of the community. For example, the next major grant that Mark Johnston secured following the 1997 to 2000 creation of deletion strains was another from the NIH: ‘Comparative DNA sequence analysis of the yeast genome’ running from 2001 to 2005. Using BLAST programmes (see Chap. 6) to compare nucleotide and protein sequence data between S. cerevisiae and other members of the Saccharomyces genus, Johnston and collaborators at the Washington University Genome Sequencing Center were able to estimate genetic distances between the species. This information, they supposed, would indicate which pairings would produce the most valuable comparative data. From these comparisons, they were able to identify various genomic elements, such as potential protein-coding genes and functional non-coding sequences (Cliften et al., 2001).

Most of the collaborators on that work then pursued a comparative study of the genomes of Saccharomyces species: S. cerevisiae itself, three others with genetic distances indicative of enough evolutionary distance to ensure divergence of non-functional sequences, and two more distantly related species. The objective of this was to identify signals of conserved “phylogenetic footprints” in the sequence that would indicate the presence of functional parts of the genome, including those that had been previously difficult to find, such as non-coding regulatory elements. The results enabled the further improvement of the annotation of the S. cerevisiae reference genome, and also included predictions of functional sequences that could be experimentally tested (Cliften et al., 2003).Footnote 37

Throughout, this work was accompanied by Johnston’s ongoing molecular biological research programme on glucose sensing and signalling in the yeast cell. He became involved in Génolevures in the late-2000s (The Génolevures Consortium, 2009), contributing further to de novo and improved sequencing and annotation of the members of the Saccharomyces genus. The increasingly dense comparative relations and data so established helped forge synergies between reference genomics, functional analysis of the genome, molecular biological research and systematic studies. Indeed, this had developed to the extent that the status of “model genus” was claimed for the Saccharomyces sensu stricto genus encompassing S. cerevisiae and close relatives, due to the magnitude of data and experimental resources available across and within it (Scannell et al., 2011).

This dynamic was explicitly articulated in the yeast genomics community. They were aware of the limitations of relying solely on a reference sequence of a highly-standardised laboratory strain that was phenotypically atypical. They believed that more reference sequences were required, both within the S. cerevisiae species itself and for related species. They appreciated that the data and knowledge of genome variation and evolution that they wrought from these could be used for functional analyses and inform the improvement of the reference resources that they were based on. Ed Louis, whom we encountered providing advice on telomeres and chromosomal evolution during the YGSP (Chap. 2), conveyed this in terms of a virtuous cycle (Fig. 7.1). In this cycle, additional data on genomic variation allow researchers to increase their knowledge concerning conservation across genomes. This helps them to improve annotations. Better annotations allow a refinement of the localisation of features such as synteny breakpoints: regions between two stretches of conserved sequence of a particular kind. And these, in turn, allow fresh appreciation of structural variation (Louis, 2011).Footnote 38

Fig. 7.1 Illustration of the synergistic relationship between systematic and functional modes of post-reference genome research, as depicted by Ed Louis (2011, p. 32). The figure shows gene annotation and genetic variation reinforcing one another through two mediators: the refinement of synteny breaks via annotated gene order, and the refinement of annotation via conservation. Reprinted by permission from Springer Nature Customer Service Centre GmbH: Humana Press, Yeast Systems Biology. Methods in Molecular Biology (Methods and Protocols) by Castrillo J., and Oliver S. (eds), 2011,

Ian Roberts of the National Collection of Yeast Cultures at the Institute of Food Research (Norwich, UK) and Stephen Oliver characterised research on the vast genomic and physiological diversity of yeasts and (functionally-oriented) systems biology as the “yin and yang” of biotechnological innovation involving these creatures, thereby emphasising the complementary and co-constitutive nature of these modes (Roberts & Oliver, 2011). As well as aiding manual improvements to annotations, the data and resources concerning diversity across yeast strains and species and their comparative relationships have also been harnessed to power automated annotation pipelines (Dunne & Kelly, 2017; Proux-Wéra et al., 2012).

In yeast, then, there was a passage from creating the reference genome, to pursuing functional analysis of that resource, to then producing data on other strains and related species, and using this to seed comparative and systematic research. The particular interpretation of comprehensiveness for these researchers was not restricted to a ‘complete’ reference genome but was far richer and more heterogeneous. It involved the establishment of relations between a variety of different forms of data and the creation of tools to make use of them. This reflected the desire of the yeast genomicists themselves to make use of the resources; they therefore had knowledge of what was needed for research purposes, and how the data, resources and tools could be deployed and contextualised. All this also reflected the disposition of people who were aware of what their stewardship of a model organism entailed.

Major drivers of the yeast genome research agenda, such as Stephen Oliver and Mark Johnston, were able to appreciate and leverage the synergies that could be created between the functional and systematic modes of research, because they were engaged in both. Thus, the continuity of participants across these successive phases of yeast genome research eased and motivated their ultimate integration. This was a rather different tale from that of pig genomics, where, as we showed above, systematic and functional forms of analysis had been entwined since the pre-reference genome stage. In human genomics, our next object of analysis, the functional and systematic modes were more like twin tracks than successive or permanently-entwined endeavours.

4.2 Human: Twin Tracks

As we have seen, in the sequencing of the human reference genome, the intended user communities were progressively detached from involvement in the production and annotation processes. However, in Chaps. 2 and 3 we showed how laboratories based in hospitals or medical schools had been conducting their own sequencing and making novel contributions by identifying genes and gene variants associated with particular pathological manifestations since before the start of whole-genome sequencing efforts. This programme of variant-focused and medically-oriented sequencing continued throughout the 1990s and beyond, with more and more mutations of particular genes catalogued and analysed, and more genes and key pathological variants associated with particular diseases or conditions. In some cases, research collaborations combined this approach with the sequencing of larger genomic regions: in the early-2000s, researchers at the Toronto Hospital for Sick Children (SickKids) joined forces with other medical genetics groups and Celera to sequence, analyse and extensively annotate human chromosome 7 (García-Sancho, Leng, et al., 2022; Scherer et al., 2003).

Several databases have been established to manage and present data on gene variants concerning human pathogenicity. These include Online Mendelian Inheritance in Man and the subscription access Human Gene Mutation Database (HGMD), while other databases have been created by particular communities focused on specific diseases or genes. The HGMD was founded in 1996, at the University of Cardiff in Wales. Its model is to scan biomedical literature and curate entries on ‘disease-causing mutations’, ‘possible-disease-associated polymorphisms’ and ‘functional polymorphisms’, according to the judgement of the curators assessing multiple lines of genomic, clinical and experimental evidence. Since 2000, HGMD has collaborated with commercial actors: the up-to-date version with enriched annotations and features is available on subscription from them, while a more basic free public version is also made available containing data that is at least three years old. Celera was the first commercial collaborator and included the extensive HGMD data in its Discovery System™ until 2005. From 2006 to 2015, the German bioinformatics company BIOBASE then developed HGMD Professional, a web application accessible upon purchase of a licence, to hold this premium data. In 2014, BIOBASE was purchased by the German biotechnology company QIAGEN, which had participated in the sequencing of the yeast genome.Footnote 39

Specialist disease-centred databases, such as the Toronto-based cystic fibrosis mutation database and network (Chap. 3), constitute resources and tools that are curated by the community of medical geneticists and clinicians themselves, rather than being provided top-down by the NCBI or any other specialist genomics organisation. In this respect, these specialist databases are similar to some of the ones that arose out of yeast and pig genomics initiatives. They are, however, more long-lasting than many of the pig ones, more fragmented than the yeast ones, and more specialised than both. The more concentrated and global databases of yeast genomics, and the more ephemeral ones of pig genomics, result from different funding and support regimes, but also reflect the role of genomic resources in each community. Yeast, as a model organism, requires comprehensiveness and the inclusion of a multitude of different forms of data in one or a few repositories that exhibit some form of persistence and longevity. The pig community, however, corrals certain kinds of genomic data that are appropriate to the research and translational problems that need to be solved at a certain point in time, with such prioritisation trumping completeness (and permanence). For medical genetics, on the other hand, the community is much larger and divided by disease categories. The pig genomics community is not as partitioned by a focus on particular traits (even if some pig researchers have investigated some traits more than others), nor is the yeast one divided into silos investigating specific kinds of molecular mechanisms or processes.Footnote 40

We return to the medical genetics track shortly. For now, we observe that it constituted a particular form of entanglement between functional and systematic research, which looked both at variation within genes and at variation between individuals, with this data linked to functional information drawn from a variety of sources. These sources even included evolutionary ones, insofar as they provided informative evidence used by curators, such as those at HGMD. Now, though, we consider a separate track that followed the publication of the human reference sequence by the IHGSC in 2004. In this track, distinct annotation efforts were conducted, on the analogy of EUROFAN, but in a quite different form. As during the determination of the human reference sequence, the medical genetics and IHGSC-based tracks remained largely separate throughout the 2000s until recent attempts at rapprochement, including the establishment of a centralised repository of clinically-relevant genomic data in 2013. This is why we refer to them as twin tracks: they developed simultaneously but maintained separate trajectories for a significant period of time.

We have already encountered the comparative genome sequencing effort across the tree of life sponsored by the NHGRI in Chap. 6, in which two working groups provided recommendations to a Coordinating Committee that then amended and submitted them to the NHGRI Advisory Council. The aim of this was to generate data on non-human primates, mammals and selected other species to inform human genome annotation. Here, it is instructive to note two significant changes to the recommendations of the Working Group on Annotating the Human Genome made by the Coordinating Committee. One was to propose even lower coverage sequencing for non-primate mammals, effectively downgrading this component to a pilot project. The rationale for this was that there was insufficient knowledge of mammalian genome evolution at that point to be able to definitively identify particular species as ideal candidates for the deeper shotgun sequencing originally recommended. Instead, they argued that a shallower study should provide sufficient grounds for identifying candidates for deeper sequencing or de novo sequencing. Thus, systematic knowledge needed further development before it could begin to yield data from which a comparative inferential apparatus could generate homologies and hypotheses for searching the human genome for functional elements.

The other change was to postpone working on a survey of human genome variation. In spite of the Committee identifying this element as a “high priority”, it baulked at committing significant resources to what amounted to a “resequencing project”, and recommended instead to wait and see whether resequencing costs declined sufficiently over the coming years.Footnote 41 A ‘Workshop on Characterizing Human Genetic Variation’ was held in August 2004 to discuss possible ways forward, with further proposals for studying human genomic variation developed within the NHGRI in 2005, alongside collaboration with the ongoing HapMap project.Footnote 42 That project, and other initiatives surveying human genomic variation, are discussed later in this section. For now, it is worth noting that this systematic exploration of human genomic variation became decoupled from the effort to develop resources for human genome annotation.

Operating in parallel with the ongoing efforts to develop a comparative approach to human annotation was ENCODE, the Encyclopedia of DNA Elements, an ongoing project that was conceived as a follow-up to the IHGSC effort. ENCODE was launched in September 2003 by the NHGRI, five months after the ‘completion’ of the euchromatic human genome sequence in April 2003. It has passed through successive phases and associated consortia since then (see Table 7.2 for the main participants in the Pilot Phase) but has continued to work towards the overarching goal of building “a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active”.Footnote 43 The rationale behind this effort was that a “comprehensive encyclopedia of all of these features is needed to fully utilize the sequence to better understand human biology, to predict potential disease risks, and to stimulate the development of new therapies to prevent and treat these diseases”. It was therefore conceived as a bridge from the structural dataset ‘completed’ in 2003, to the ability to make use of it.Footnote 44

Table 7.2 Table of the main participants in the ENCODE Pilot Project. This project was dominated by the members of the International Human Genome Sequencing Consortium (Chap. 4, Table 4.1), although it also included institutions that did not participate in that reference sequencing effort. Elaborated by both authors from: (last accessed 19th December 2022)

ENCODE therefore aimed at, and presumed the possibility of achieving, completeness. Despite constituting an essay in functional genomics, it entailed considerable structural annotation, as it required identifying and annotating genes and other key functional elements such as regulatory regions. Its methods, though, have extended beyond those that are applied in both automated and manual annotation pipelines. The search for regulatory elements that affect the expression of genes entailed the development of a panoply of other approaches, including a return to ‘wet lab’ experimentation analogous to the functional analysis activities in the laboratories participating in EUROFAN. This involved the use of techniques that aimed to identify signs of activity in the genome, for example biochemical signatures of particular chromatin structures (the way DNA is packed) that enable access to the DNA so it can be transcribed (Kellis et al., 2014).

One of the main outcomes of ENCODE has been the increasing realisation that what constitutes a functional element is relational and context-dependent. The move towards once again conducting genomics research in biological laboratories reflects this shift, as capturing what elements of the genome become functional in particular circumstances “requires a diverse experimental landscape” (Guttinger, 2019). While the establishment of the ENCODE project came out of the IHGSC effort, its investigation of the biology of the human genome has triggered the involvement of a broader range of experimental laboratories, due to ENCODE being concerned with living biological function and not merely constituting a data gathering exercise.

During the pilot phase of ENCODE, the GENCODE consortium was created to produce reference annotations of the human genome. From its inception, it was led by the Sanger Institute, and involved participants from several institutions including the EBI. GENCODE incorporated data from a variety of automated prediction pipelines and experimental data into the Ensembl pipeline and HAVANA manual curation (Chap. 6). The desire to demarcate truly functional regions from non-functional ones meant that genes had to be distinguished from pseudogenes, and that the significance of non-coding regions needed to be assessed. In both cases, inspired by research indicating the salience of regulatory regions in complex developmental processes, the identification and annotation of transcripts assumed great significance in the project and formed the basis for the manual annotation. They used transcriptomic data from EST and messenger RNA sequences and protein sequences obtained from GenBank and Uniprot, using BLAST to align these against the sequences of the original BAC clones used in human reference genome sequencing. The data arising from these efforts led to a mounting appreciation of the prevalence of alternative splicing across the genome, wherein there may be multiple products of a single gene.Footnote 45 Reflecting on their findings, the GENCODE team emphasised that the way in which a reference annotation is constructed “is extremely important for any downstream analysis such as conservation, variation, and assessing functionality of a sequence” (Harrow et al., 2012, p. 1760; see also Kokocinski et al., 2010).

Alongside this, efforts to catalogue the extent and diversity of human genomic variation were already underway. More so than in pig genomics, and far more so than for yeast genomics, this research has concentrated on variation within the target species. Even before the production of the human reference genome, there was a concerted project to map human genetic diversity. Although it received some support through the Human Genome Organisation (HUGO) and the NIH, the Human Genome Diversity Project founded in 1991 was unconnected with the IHGSC effort, and indeed also with medical genetics, being largely an initiative of researchers interested in human evolution and anthropology (M’Charek, 2005; Reardon, 2004). This concern with intra-specific human variation and diversity, and the connection of this with the study of the inheritance of traits (particularly disease traits), pre-dated the determination of the reference sequence. This work relied heavily on the use of genetic markers such as the Restriction Fragment Length Polymorphisms developed in the 1980s (Chap. 1). The advent of microarray technology, in the form of SNP chips, made a qualitative difference to this line of inquiry and to the relationship between research on human genetic diversity and the identification of particular genes with functional and pathological roles. Rather than recapitulate the research that has examined this (e.g. Rajagopalan & Fujimura, 2018), we instead explore the creation and impact of a SNP chip in pig genomics in the next section and relate this to the work on microarrays in human and medical genetics when appropriate.

In the 2000s, arising from the centralised and top-down world of the IHGSC, were the International HapMap Project (2002–2010) and the 1000 Genomes Project (2008–2015), which drew samples from populations around the world to identify common variants: those with at least 1% prevalence in any given population. For HapMap, SNPs were generated and selections of them were made according to the project criteria. Ten centres were used to genotype (that is, genetically assess) the samples that were collected, with over 60% of this genotyping done at either RIKEN (Rikagaku Kenkyūjo, the Institute of Physical and Chemical Research) in Japan or the G5 institutions: Sanger Institute, Whitehead Institute/Broad Institute, Baylor College of Medicine, Washington University in St Louis and the US Department of Energy’s Joint Genome Institute. The resulting haplotype map identified sets of human genome variants that tended to be inherited together. The mapping was conceived as a “short-cut” to identifying candidate genes and aiding association studies to ascertain the genomic variants implicated in disease (The International HapMap Consortium, 2003). Like the follow-up 1000 Genomes Project, which sequenced whole genomes to capture genomic variation rather than just sequencing parts of them, it was a top-down initiative that sought to provide a dataset to be picked up and exploited by a presumed external user community.

The efforts to cultivate and inform a user community, while directed towards helping researchers realise the value of the resource, demonstrated how separated producers and users were during the conception and realisation of such projects.Footnote 46 Furthermore, though they intended the data produced to be useful for what we describe as systematic studies, they were not conceived or generated for those purposes, but for the anticipated potential biomedical use of the data. To the extent that the data was analysed by the project for systematic purposes (e.g. The 1000 Genomes Project Consortium, 2010), it was presented as a separate application of the results. This systematic information was not articulated as being informative or indicative for functional studies, in the synergistic manner understood by the yeast genomics community by this time.

As with the reference genome produced by the IHGSC, however, we can interpret the fruits of these top-down projects in terms of the ways that they have been used as a means to create genomic resources more tailored to particular research needs. Consider the effort to produce “a regional reference genome” by a consortium of Danish researchers, “to improve interpretation of clinical genetics” in that country, enhance the power of association studies (examining the relationship between genomic and phenotypic variation) and aid precision medicine research (Maretty et al., 2017, pp. 87 and 91). This team produced 150 high-quality de novo assemblies, which they validated by aligning them against the then-current human reference genome assembly. They identified multiple forms of variants, aided by the reference panel produced by the 1000 Genomes Project and data from the NCBI’s Single Nucleotide Polymorphism database (dbSNP), which had itself been considerably enriched through the efforts of the International HapMap Project and the 1000 Genomes Project. The Danish team were therefore able to use the infrastructure and resources developed from these top-down projects, not directly to produce research that could be translated into clinical outcomes, but to construct their own local, targeted resources in the form of a local reference genome and a catalogue of variation pertinent to the populations they work with (Maretty et al., 2017).Footnote 47

The relationship between large-scale data infrastructures and more local and specific ones focusing on concrete objects, communities or research areas has recently strengthened. One manifestation of this shift has been the establishment of ClinVar and ClinGen. These represent an attempt at liaison between the separate tracks of human genome research and medical genetics research. ClinVar and ClinGen capture forms of variation and processes of evidential evaluation of their functional and pathological significance that are found in medical genetics and clinical research. They therefore promise a form of synergy involving the alignment of different modes of data practices, methods, analytical approaches and community norms.

ClinGen and ClinVar were established by the NIH in 2013, with the aim of providing open-access data on variants, tied to clinical interpretations of them. ClinGen is the overall programme that works in partnership with ClinVar, the database that is run by NCBI. Both continue to be funded by the NIH. Their founding was based on the concern that such clinically-relevant genetic data was being kept locally, either by individual researchers and laboratories, in disease-specific databases available to members of a particular community, or hidden behind a paywall like the most recent and rich data contained in HGMD. Furthermore, the different architectures of such databases and treatments of data were thought to stymie clinical interpretation.

The answer was a centralised repository, with uniform data standards and clear processes of curation and attribution of labels to individual variants indicating their potential clinical (or otherwise functional) significance. However, to make this work, data submitted to ClinVar would need to be contextualised with its putative medical significance. Rather than being stripped of all but a few items of contextualisation in the form of metadata, as sequence data submitted to GenBank and other similar databases is, this data on sequence variants needs to travel with clinical interpretations made by the submitters and the various kinds of evidence used in them. ClinVar serves as a repository for this information, with agreements or disagreements in interpretation assessed by the user researchers, rather than being solved by the database itself.Footnote 48

Where ClinVar takes a more active role is in the convening of expert panels to curate interpretations for particular genes. Applications can be made to the ClinGen Steering Committee for approval of the formation of an expert panel. The interpretations of these bodies then outrank virtually all other levels of “review status” (Landrum & Kattman, 2018).Footnote 49 One of these expert panels was called CFTR2, a group that worked, and still works, on the CFTR cystic fibrosis gene. Most of its members belong either to the Johns Hopkins University or the Toronto Hospital for Sick Children, reflecting a parallel route of research stretching back to the 1980s (Chap. 3) that was long separated from the established mainstream of human genomics research and infrastructure.Footnote 50

ClinGen also takes an active role in aggregating and curating genomic and health data from various sources and feeding this into ClinVar (Rehm et al., 2018). ClinGen and ClinVar constitute the platform for a convergence between the once wholly distinct tracks associated with the IHGSC enterprise and medical genetics. While these remain separate in day-to-day practice, the creation of a data infrastructure to draw upon the findings and expertise of clinicians and researchers—including those working in medical genetics—enables them to participate in a more concerted and unified whole-genome effort. This also provides human genomicists outside the medical genetics community—including at specialist genome centres—with access to information about variation and its clinical effects that is essential for the medical translation of sequence data.

4.3 Pigs: A Fuzzier Distinction

As with the production of a reference genome, by the time the pig genomics community was in a position to develop their own concerted functional annotation effort, they were able to benefit from the protocols, methods, data and experience of human functional annotation. This legacy enabled them to devise a pared-down approach more appropriate to the levels of funding they enjoyed. There were, however, aspects of functional annotation that drew on the particular history of this community, the uses that they envisaged for the data and the affordances provided by their particular subject organisms.

An initial call for the concerted annotation of non-model organism animals was made by Alan Archibald, Ewan Birney of the EBI and Paul Flicek (who had primarily worked on mouse genomics), at the International Society for Animal Genetics conference in Cairns (Australia) in July 2012.Footnote 51 This alliance reflected the ongoing connections between the pig genome community and the EBI. However, it was at the annual Plant and Animal Genome (PAG) conference, in San Diego in January 2014, that genomicists working on a variety of farm animals started developing the basis for an international multi-species collaboration to advance functional annotation following the initial sequencing of several reference genomes (The FAANG Consortium, 2015).Footnote 52 At that PAG conference, the Animal Biotechnology Working Group of the EU-US Biotechnology Research Task Force convened an “AgENCODE” workshop.Footnote 53 As the name suggests, the aim was to emulate ENCODE, and to that end, several speakers from that project contributed to the session and to subsequent workshops and conferences held by what became the Functional Annotation of Animal Genomes (FAANG) Consortium (Tuggle et al., 2016).

Presenting the outcomes of the AgENCODE workshop in a PAG conference session the following day were key figures from the genome mapping and sequencing of chicken, cattle and pig from the previous two decades: Gary Rohrer of USDA MARC, Alan Archibald of the Roslin Institute, Christine Elsik of the University of Missouri, Elisabetta Giuffra of the Animal Genetics and Integrative Biology unit (Génétique Animale et Biologie Intégrative, GABI) at the Jouy-en-Josas station of INRA and Martien Groenen of Wageningen University.Footnote 54 Reflecting the practices and careers of many livestock geneticists, these researchers worked on the genomes of multiple species. All this demonstrates the agriculturally-inclined origins of FAANG, which have shaped the aims and outputs of the project ever since. Although other potential applications such as the use of animals as biomedical models and understanding domestication and evolution have also been cited as motivations, these have not formed a substantial part of the published output or attention of the consortium.Footnote 55

The aim of the FAANG Consortium (and its constituent steering committee and working groups) has been to “produce comprehensive maps of functional elements in the genomes of domesticated animal species based on common standardized protocols and procedures” (The FAANG Consortium, 2015, pp. 2-3). The Consortium (Table 7.3 lists pig genomicists who were founding members) narrowed its focus to the animals for which there were reference assemblies most amenable to functional annotation (chicken, cattle, pig and sheep), identified a small set of core assays and defined experimental protocols based on the experiences of ENCODE, established a Data Collection Centre based at the EBI to aid and validate submissions to the data portal hosted on the FAANG website, and defined a core set of tissues to be used. The collection and sharing of a limited set of tissues derived from populations of low genetic diversity was intended to aid the replicability and comparability of the data produced using them across the community and to ensure that associations between functional genomic annotations and quantitative phenotypic data could be made even in the early stages of the project (The FAANG Consortium, 2015).

Table 7.3 Pig genomicists who were initial members of the FAANG Consortium, identified through authorship of ‘The FAANG Consortium’ article (2015). This article indicated that “[a]ll authors are signatories of the FAANG Consortium” (p. 5). The Consortium included 30 other members who primarily worked on other livestock species such as cattle and chicken

A key feature of FAANG has been its focus on defining and decomposing the phenotype, or the phenome. Phenome is a term that denotes the phenotypic equivalent of the genome, with phenomics constituting concerted phenotyping on the model of the major genomic sequencing projects. In the farm animal world, researchers have access to extensive gross phenotypic data (such as on coat colour, slaughter weight, number of eggs laid per day) on animals with well-defined pedigrees. This is due to the role that measuring phenotypes has played in the breeding industry, with which researchers have enjoyed close ties since at least the 1960s. The means by which to measure, analyse and interpret phenotypic data are long-established and have continually evolved as animal geneticists adopted more molecular approaches in the 1980s and then pursued genome mapping, sequencing and analysis from the 1990s onwards. Both molecular and genomic approaches have intersected with quantitative genetics research and methods.

The extent of this focus on phenotypic data eclipses that found for the other two species we have examined throughout this book. Yeast biologists have paid close attention to the phenotypes of their organism, but these are phenotypes of far less complexity than those of farm animals. Concerning the human, concerted efforts to characterise large groups of humans in phenotypic terms, for example in the history of physical anthropology (Müller-Wille & Rheinberger, 2012, pp. 106-107) or more recent initiatives such as the UK Biobank project (Bycroft et al., 2018), constitute exceptions to the general trend in which phenotypic data collection—and the development of infrastructures and practices to enable this—has been far less extensive than for at least some farm animal species. One cannot control the breeding or environmental conditions experienced by humans or track multiple phenotypic measurements in such a continuous and intrusive way as can be done for an experimental herd or flock (or for plants; see: Müller-Wille, 2018).

One of the key aims of FAANG has been to decompose the gross phenotypes they and breeders had previously been working with into more proximate molecular phenotypes (biomarkers), and then to causally link variation in these proximate molecular phenotypes to variation in gross phenotypes. Alongside other intended outputs of the FAANG collaboration, the identification of molecular phenotypes and associated specific genomic variants has been intended to better model the relationship between genotype and phenotype, to advance their agenda of improving genomic prediction from a known genotype to an expected phenotype. This emphasis on genotype–phenotype relationships and being able to more accurately predict the phenotype from a given genotype is not unique to pig or wider farm animal genomics, but it does attain a distinct salience and inflection in this area.

Within five years of FAANG swinging into action, the participants were looking beyond the initial in-depth studies of a limited range of tissues with low genetic diversity. While this had helped the Consortium to identify and map functional elements and regions, it became clear that data derived from a more genetically-diverse range of animals, and more tissues, would be necessary to further analyse the relationship between genomic variation and phenotypic variation. Genes specific to particular populations could be identified through this, and then visualised in pangenome graphs depicting variation aligned to the reference sequence. This, in turn, could aid the identification of candidate variants to be implemented in programmes of genome editing of livestock species, and in the tracing of genetic diversity in and across populations to inform conservation efforts. Beyond individual species, the functional genomic and phenotypic data that FAANG compiled enabled them to identify evolutionary conservation across species. On this foundation, they could develop comparative analyses and approaches to inform cross-species inferences as to the functional genomic basis of phenotypic traits (Clark et al., 2020).

This transition from a narrow focus to a broader outlook was eased by the design of FAANG and the long-standing entanglement of systematic and functional modes of research in pig genomics. Among pig genomicists, that entwinement had fostered both versatility and an acute appreciation of the wide array of possibilities and potential applications presented by the rich and connected data generated by the FAANG Consortium.

Beyond FAANG, there have been two other ways in which functional and systematic modes have been entangled in pig post-reference genomics. A 2013 paper reporting studies of the genetic diversity of rare breed Chato Murciano pigs kept on eight farms in Spain instantiates one of these. This research used an inspection of the extent of variation that existed in these pigs to assess their (functional) viability in the light of inbreeding and crossbreeding (Herrero-Medrano et al., 2013). The second way is the kind of cycle (as identified by Ed Louis for yeast) between further functional annotation of the genome and an appreciation of conservation and syntenic breakpoints: either across pig breeds, between related species or drawing on a multi-species comparative approach to enrich knowledge of genome evolution more broadly (e.g. Anthon et al., 2014). This, again, often depended on the construction of new sequences based on older ones, in order to establish new connections between genomes, to identify relationships, changes over evolutionary time and examples of different forms of variation. As Martien Groenen of Wageningen University observed in a review of pig genome research in the systematic mode, however, though advances in this direction were enabled by the existence of an annotated reference genome, they were also inhibited by its limitations (Groenen, 2016). A new reference sequence, together with the improved annotation based on it and produced through FAANG, has therefore proved a considerable boon to both systematic and functional studies.

We close this discussion of the relationship between functional and systematic modes of post-reference genome research concerning the pig by exploring a tool that represents a powerful platform to enable both: the Illumina PorcineSNP60 SNP chip or microarray (see Fig. 7.2).

Fig. 7.2
A digital model of the second-generation SNP chip. It has a bar code on the bottom, above which is the logo of Illumina. Many cells are visible on the surface of the chip.

The second-generation SNP chip for pigs: PorcineSNP60v2 BeadChip. Photograph courtesy of Illumina, Inc.

A SNP chip is a tool that enables the detection of the presence or absence of a particular set of DNA polymorphisms in a sample. In constructing them, DNA—of complementary sequence to the polymorphisms to be detected—is attached to the surface of the chip. The samples to be assayed are then labelled, typically with a fluorescent dye, and added to the chip. Any sequences complementary to the probes should attach to the chip’s surface and, when stimulated, produce a detectable signal which is recorded and can then be processed to give the results of the assay. There are numerous technical details and options that go into the construction and use of a particular chip. We focus here on the choice of the DNA to be attached to the chip surface: the probes that are used to detect particular genomic variation at the single-nucleotide allele level.

When it became possible to do so, the value of identifying and generating data on SNPs was quickly recognised by the community of pig genomicists. They had long valued the creation and mapping of genetic markers of various kinds (including those with no putative functional or mechanistic role), for the identification and mapping of QTL. SNPs are polymorphic—albeit less so than microsatellites—and abundant across the genome, including in regions poorly-represented by markers such as microsatellites. They therefore represented an opportunity to identify markers at a higher resolution and more broadly across the genome.

This is particularly significant given the translational domain most members of the pig genome community were working towards: animal breeding. While there had been efforts to identify particular genes and variants thereof from the 1980s, in many cases actual functional genes were not necessarily needed for the purposes of breeding. In the 1990s, for instance, an approach called ‘Marker-Assisted Selection’ (MAS) was developed that only required that a genetic marker be identified, provided that it was closely associated with a gene of interest that a breeder may want to select for or against (e.g. Rothschild & Plastow, 2002). While identifying a gene would be imperative for transgenic improvement of livestock, or for medical genetics research, it is not for animal breeding. Because the aim is to improve a population in measurable ways, finding and using markers that are good-enough indicators is a viable strategy. If a marker proves misleading in individual cases, this is not a problem, as the animals concerned can simply be removed from the breeding pool. By the turn of the millennium, quantitative geneticists were proposing new ways to develop MAS. One of these was ‘genomic selection’, in which many more markers would need to be genotyped across the genome to ensure that at least some of them were closely linked to any (probably unknown) loci with an actual causative effect on the eventual phenotype (Haley & Visscher, 1998; Meuwissen et al., 2001). This, therefore, created the demand for SNPs to be generated and incorporated into a chip to enable the genotyping of multiple sets of them (Lowe & Bruce, 2019).

Alongside this, industry was pursuing SNPs with a view to identifying candidate genes. Sygen (as PIC had been renamed) secured EC funds for PORKSNP, a project running from 2002 to 2006 to identify SNPs in genes expressed in pig muscle and then run association studies to search for loci involved in meat quality traits. Sygen provided the samples for subcontracted biotechnology companies to sequence.Footnote 56 Monsanto, who had entered the pig breeding market having bought into DeKalb in 1996 (completing the purchase in 1998), were also deeply interested in SNPs for performing genome-wide association studies. In November 2001, Monsanto’s Swine Genomics Technical Lead John Byatt spoke with Jane Peterson from the NHGRI’s Extramural Program (Chap. 3) about potential support for a pig genome sequencing project. In Peterson’s notes on the event, she observed that “Really what they need are SNPs—denser needed”.Footnote 57 However, as pig genome sequencing did not proceed at the NHGRI, Monsanto looked elsewhere: to the IHGSC’s competitor, Celera. In addition to its primary biomedical focus, Celera had acquired an agriculturally-oriented biotechnology company from its parent company Perkin-Elmer, in what was effectively an internal transfer. The head of this company, Celera AgGen, was Stephen Bates. Bates persuaded Craig Venter to shotgun sequence pigs, cattle and chickens and create livestock databases using the data so generated. In February 2002, this unit was sold to MetaMorphix Inc., a biotechnology company founded in 1994 by researcher Se-Jin Lee of the Johns Hopkins University School of Medicine, who was the discoverer of the protein myostatin. As part of the deal, MetaMorphix licensed Celera’s databases for pigs, cattle and chickens. In June 2004, they licensed what they called ‘GENIUS—Whole Genome System™’ for pigs to Monsanto for one million dollars and a share of royalties in the new breeding lines (and their hybrid offspring) developed by Monsanto using their data, which encompassed approximately 600,000 mapped SNPs and related intellectual property.Footnote 58 Despite the apparent fruitfulness of this association, Monsanto abandoned the pig breeding sector in 2007, selling Monsanto Choice Genetics to Newsham Genetics, and MetaMorphix filed for bankruptcy in 2010.Footnote 59

Meanwhile, the pig genome community was also pursuing SNPs and the creation of a SNP chip. In addition to their potential utility in animal breeding, the geneticists believed that the generation of SNPs would enable the exploitation of mouse and human data for homing in on candidate genes, as well as aiding the refinement of genetic linkage maps (Rohrer et al., 2002; Schook et al., 2005). Creating the basis for the production of SNPs was to be an outcome of the project to sequence the reference genome. Martien Groenen obtained funding to perform next-generation sequencing on additional pigs to identify SNPs, brought in other members of a consortium—which became the International Porcine SNP Chip Consortium—to pursue this, and led the analysis group to identify putative SNPs (Archibald et al., 2010).

Alongside this, a commercial partner was needed to produce and distribute the chip. The consortium held what Alan Archibald has described as a “beauty contest” at the PAG conference in January 2008, between genomic services and tool manufacturers Illumina and their main competitor, specialist microarray producer Affymetrix. Both had previously produced chips for cattle, and the judges were swayed by Illumina’s articulations of the lessons learned from producing theirs.Footnote 60 Illumina’s cattle chip was produced at the behest of the USDA in 2007, with its 54,001 SNPs used in genomic evaluations of American dairy cattle. This was quickly deployed in genomic selection, a process that has produced considerable results on a short timescale and demonstrated the value of the approach (Wiggans et al., 2017). In addition to Illumina’s lessons, a group at the USDA facility in Beltsville (Maryland) offered advice based on their own involvement in creating and using the cattle chip, with Curt Van Tassell in particular contributing valuable insights.

Martien Groenen had been involved in the development of a 20K chip (containing 20,000 SNPs) for the chicken, in collaboration with the breeding industry for that species.Footnote 61 It therefore made sense for him to play a leading role in the effort to create a pig SNP chip. For this, he leveraged existing relationships, such as with the Dutch pig breeding company Topigs, which provided genotype and sequencing data derived from their breeding lines.Footnote 62 As with other pig genomics projects, each participant brought their own funding to enable them to make their contributions, which included the provision of samples, the sequencing and identification of SNPs, conducting the selection and validation of SNPs, bioinformatics work and networking with other organisations (such as the EBI) to assist in developing and publishing the data produced through the project.Footnote 63

The commercial exigencies of the pig chip structured its contents. So too did the interests of the members of the pig genome community and the kinds of pigs—and therefore DNA samples and SNPs—that were available (see Table 7.4 for members of the ‘International Porcine SNP Chip Consortium’). Marylinn Munson from Illumina participated in the weekly working group meetings of the Consortium conducted over Skype, which made the crucial decisions shaping the chip, for instance, how many SNPs were included, with roughly 60,000 chosen. Options of up to a million SNPs were floated, but this was deemed to be excessive when the trade-off between the number of SNPs and the cost of the chip was considered. For the chips designed to genotype humans, which needed to be able to identify rare alleles (possibly involved in rare diseases) and to sample a variety of different populations, a chip with as many SNPs as technically feasible was required. For the pig, however, to ensure the competitive pricing and commercial viability of the chip, advance orders of $5 million would have to be obtained. Breeders therefore had to be interested in the chip, and this meant including alleles of at least 5% prevalence that were present in a range of breeds that mainly reflected commercial populations used by the major breeders. Where possible, SNPs known to be of relevance to livestock traits were included. Proprietary SNPs were excluded.Footnote 64 The team narrowed down the approximately half-a-million SNPs to the selection of tens of thousands to be included on the chip.Footnote 65 The DNA samples used on the chip were obtained from the Duroc, Piétrain, Landrace and Large White commercial breeds from Europe and North America and wild boar from Japan and Europe.Footnote 66
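The selection criteria described above can be illustrated with a short sketch. This is only a schematic rendering under stated assumptions, not the Consortium's actual procedure: the record fields, the example SNPs, the function name `select_snps` and the reading of "a range of breeds" as "at least two breeds" are all hypothetical. Only the 5% prevalence threshold, the exclusion of proprietary SNPs, the preference for trait-relevant SNPs and the existence of a cap on chip size come from the account given here.

```python
from dataclasses import dataclass

MIN_PREVALENCE = 0.05  # from the text: alleles of at least 5% prevalence
MIN_BREEDS = 2         # assumption: "a range of breeds" read as at least two


@dataclass
class CandidateSNP:
    """Hypothetical record for one candidate SNP."""
    snp_id: str
    allele_freq_by_breed: dict  # breed name -> frequency of the rarer allele
    proprietary: bool = False
    trait_relevant: bool = False


def select_snps(candidates, chip_capacity):
    """Return the IDs of the SNPs chosen for a chip of limited size."""
    eligible = []
    for snp in candidates:
        if snp.proprietary:
            continue  # proprietary SNPs were excluded
        breeds = [b for b, f in snp.allele_freq_by_breed.items()
                  if f >= MIN_PREVALENCE]
        if len(breeds) >= MIN_BREEDS:
            eligible.append((snp, len(breeds)))
    # Trait-relevant SNPs first, then those common in more breeds;
    # the ID is a final tie-breaker so the order is reproducible.
    eligible.sort(key=lambda pair: (not pair[0].trait_relevant,
                                    -pair[1], pair[0].snp_id))
    return [snp.snp_id for snp, _ in eligible[:chip_capacity]]


# Invented example data, for illustration only.
candidates = [
    CandidateSNP("snp_a", {"Duroc": 0.30, "Landrace": 0.12, "Large White": 0.08}),
    CandidateSNP("snp_b", {"Duroc": 0.02, "Landrace": 0.01}),        # too rare
    CandidateSNP("snp_c", {"Duroc": 0.40, "Pietrain": 0.25}, proprietary=True),
    CandidateSNP("snp_d", {"Pietrain": 0.10, "Large White": 0.06},
                 trait_relevant=True),
    CandidateSNP("snp_e", {"Duroc": 0.20}),                          # one breed only
]

chosen = select_snps(candidates, chip_capacity=2)
```

The `chip_capacity` parameter stands in for the trade-off between the number of SNPs and the cost of the chip: raising it admits more markers, but in the Consortium's case the price that breeders were prepared to pay set the limit at roughly 60,000.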

Table 7.4 List of members of the International Porcine SNP Chip Consortium and their institutional affiliations, from “Pig SNP Working Group” folder, Lawrence Schook’s personal papers, obtained 6th April 2018. Note the continuity of personnel and institutions from prior mapping and sequencing projects (Chap. 5, Table 5.2). This illustrates the stability of actors in the pig genomics community and their involvement in the creation of multiple successive genomic resources, as well as the primarily agricultural orientation of the participants

SNPs were identified through a series of procedures, some of which used the latest versions of the reference assembly. The SNPs that passed validation were then put through a selection process which included assessment across a variety of parameters. The resulting PorcineSNP60 Genotyping BeadChip was released by the end of 2008 (Ramos et al., 2009). The advent of SNP chips made genomic selection in pigs feasible, and it was adopted in the pig breeding industry as it had been in cattle (Knol et al., 2016; Samorè & Fontanesi, 2016).Footnote 67 A second version of the Illumina chip has since been developed, as well as other chips created with different selections of SNPs (Samorè & Fontanesi, 2016).

In addition to the direct use in genomic selection, the chip has also been extensively used in systematic studies, for instance concerning the diversity and patterns of domestication and geographic distributions of pigs. As with yeast, such research can reveal differences between populations and signatures of selection that enable candidate genes to be identified for further functional exploration (e.g. Diao et al., 2019; Yang et al., 2017).

A plethora of more direct functional analyses have been enabled by the chip, aiding researchers in finding and investigating genetic loci related to livestock production and welfare traits, for example through association studies (e.g. Maroilley et al., 2017). It has also helped researchers developing pigs as animal models of particular diseases (e.g. for muscular dystrophy: Selsby et al., 2015).Footnote 68 And finally, SNP chips can be used to produce and/or validate new reference resources, for instance in constructing a new high-density genetic linkage map (Tortereau et al., 2012) or assessing the completeness of the new reference sequence (Warr et al., 2020).

SNP chips, much like reference genomes and other reference resources, constitute platform tools that can be deployed for a variety of purposes. They enable new characterisations of variation and the creation of fresh resources based on them. In this, the variation imprinted on a chip conditions its affordances as a platform tool. And in the case of the pig, the heavy involvement of the pig genomics community in the generation and selection of the SNPs to be included, and the commercial demands driving this process, affect what the SNP chip can do, and what new resources it can help seed. For example, the lack of representation of samples of DNA from African breeds and populations of pigs in the Illumina 60K chip makes it of limited usefulness for breeding applications on that continent. As a result, there has been a call for the creation of more Africa-specific livestock SNP chips, as well as breed or region-specific reference genomes (Ibeagha-Awemu et al., 2019).Footnote 69

The development of genomic resources and the exploitation of them are therefore strongly conditioned by the historical paths taken. In the case of pig genomics, we have observed a close integration of functional and systematic modes of research from pre-reference genomics onwards, continuing even during the narrower and more concentrated endeavour to sequence the reference genome. The heavy involvement of the community of pig genomicists in the creation of genomic resources from the early-1990s onwards has enabled them to facilitate versatility in the wide use and applications of these resources once the pig reference genome was released. As we have seen though, this does not mean that the data and materials they have helped to generate lend themselves to an unlimited array of uses. It does mean, however, that they have a keen awareness of what these resources represent, how they can be built on and what they can be used for. The pig community has also benefited greatly from knowledge concerning the genomes and genomic research of other species. They have identified practices in human and cattle genomics, for example, and adapted them to their own ends and ways of working. They have also developed a comparative framework for making use of genomic data and other resources on mammals such as humans. As we have seen, the development of pig post-reference genomics differs considerably from that of human and yeast. We close the chapter by assessing the consequences of this, introducing the concept of webs of reference to help us to further characterise post-reference genomics and compare the historical trajectories of genomics across different species.

5 Seeding Webs of Reference

This chapter, together with elements of preceding ones, challenges existing views of postgenomics. By looking beyond human genomics and especially beyond the determination of the human reference sequence, we have shown that an emphasis on variation, multi-dimensionality and the contextualisation of sequence (and mapping) data has pre-existed reference genomics, and can be part of reference genomics itself, rather than simply succeeding and complementing reference genome sequences once they are produced.

Across the three species we have examined, the relationships between pre-reference genome research, reference genomics and post-reference genomics are affected by the differential involvement of particular communities in these efforts. In yeast and pig, there is a high level of continuity across these phases, with the respective communities involved in constitutive aspects of the process of reference genome sequencing, and in enriching and improving the products. They have done this through engagement with large-scale sequencing centres (e.g. the Sanger Institute) and other centralised actors (e.g. MIPS), though in different ways. For example, the relationship of the pig community to the Sanger Institute was more like Mark Johnston’s relationship to the Genome Sequencing Center at Washington University than it was equivalent to the role of the Sanger Institute as a contributor to the YGSP.

The yeast and pig communities also differed in their overall goals, the nature of their target organisms and the variation exhibited by these organisms. The yeast community were self-consciously curating a model organism with a panoply of linked datasets and experimental resources, with an eye towards comprehensiveness, permanence and accumulation. They worked with a highly-constructed laboratory strain of S. cerevisiae specifically designed to minimise variation within and between colonies. The pig community, on the other hand, often worked with a mixture of primarily commercial breeds of pig, reflecting the mainly agricultural aims of their research but also the ready availability of these creatures. But they also used wild boar, as well as crosses between breeds presumed to be genetically distinct due to their geographical distance. They created genetic markers, maps, mapping tools, QTL detection methods, families and pedigrees of pigs, reference assemblies, annotations of these, as well as masses of SNPs and the chips to genotype selections of them. They worked in a satisficing mode, with researchers, groups and institutions contributing to consortia and collaborations with their own pots of money from various funding sources, building on and using existing sets of resources they had produced for a prior purpose. In both species, we see a convergence between functional and systematic modes of practising genomics, involving considerable overlaps between actors pursuing both modes. Both communities realised that an investigation of diversity could aid functional analyses either directly through the identification and analysis of key physiological and genetic differences, or more indirectly by using the insights gained from systematic analysis to improve the functional annotation and characterisation of reference genomes and other reference resources associated with the species.

In human genomics, there has been more than one community at play. There is the IHGSC community, which through the mid-to-late 1990s and into the 2000s became increasingly narrow and concentrated. They emphasised the technical refinement of sequencing in large-scale centres and the development, advancement and integration of informatics pipelines. Then there has been the medical genetics community, focused on variation between individuals (and across populations more broadly) and in the sequences of particular genes. This latter community, as we have seen, became increasingly divorced from the IHGSC effort. Instead, they established connections with Celera and their activities, for instance through the annotation jamboree, the sequencing and analysis of chromosome 7 (Chap. 6), and in further developing the HGMD. This interaction constitutes a rapprochement between the medical genetics community and an institution that specialised in the sequence determination and informatics aspects of genomics to an exquisite degree, mediated by its own commercial strategies and responses to the actions of the IHGSC. A newer rapprochement between medical genetics and the mode of genomics characterised by centralised infrastructures and data repositories has been through ClinGen and ClinVar. These constitute an attempt to compile and interpret more richly-contextualised data on genetic variants of potential clinical import, and in so doing incorporate medical genetics practices and practitioners more into the centralised NCBI framework.

The community dynamics we have identified, in tandem with the way that pre-reference genomics and the creation of a reference genome proceeded, have affected how post-reference genome functional and systematic research related to each other. Throughout our examination of functional and systematic research, we have found that separately assessing the limitations of individual reference resources or tools fails to capture the inter-relations between them. Inter-relatedness has been a feature across the history of genomics, however, as existing resources are used for the construction of new ones, often through the deployment of comparative practices. Additionally, reference resources can relate to each other contemporaneously, through overlapping repertoires and data infrastructures, and by the ways in which one resource can inform the interpretation or validation of another.

Through interpreting the products of genomic research as part of webs of reference that exhibit a range of connections (Fig. 7.3), we can better assess the infrastructural roles and consequences of reference resources. In the three species, post-reference genome work involved the creation of reference resources that identified and characterised more genomic variation. The reference resources refer to the reference genome, are explicitly intended to connect different manifestations of variation, and contain a surplus of possibilities for the further identification and characterisation of genomic variation and the translation of such data into a multitude of different working worlds.

Fig. 7.3
An illustration of the webs of the reference genome. It is connected to the reference sequence of the population, DNA sequence of individuals, variant discovery, SNP chips, and related species.

A simplified depiction of a web of reference in which types of resources in the web are represented, rather than individual instantiations (there may be many different resources for each type). The development of webs of reference enables the exploration and characterisation of the extent, frequency, range and combinations of different types of genomic variation across a representational domain, such as a species. In so doing, the reference genome and other resources are further developed. The development of individual webs depends on the different historical trajectories leading to, and arising from, the creation of a reference genome. Elaborated by both authors

Based on our examination of the different confluences of systematic and functional research, we can observe that post-reference genomics does not merely consist of increasing dimensionality: the recording and linking of additional genomic variation and other forms of biological variation in data infrastructures. It also involves the generation of these dimensions and the establishment of relations between them, in different concrete ways. Additional dimensions close to the level of the DNA sequence such as RNA sequences and protein sequences do not just exist in nature to be the next logical source of data to link to the reference sequence after its production. These forms of data are produced and catalogued for particular purposes and from particular sources: recall, in Chap. 6, the use of cDNA from the cloned offspring of TJ Tabasco in pig genome annotation. Other forms of data may derive from different origins, and be chosen for their practical utility rather than their representativeness of the species or particular biological processes. Here, we might consider the narrow range of genetically homogeneous tissue samples and assays used in the initial phases of FAANG. Furthermore, as FAANG shows, additional dimensions of data being arrayed on top of reference sequences may not only represent distinct kinds of macromolecules, but phenotypes as well.

Systematic studies entail and power comparative genomic approaches that generate dense sets of data and knowledge concerning the relationships between the genomes of different strains, populations or species. This helps researchers to characterise the extent and nature of genomic variation across populations, species and sets of related species. The extent of the potential variation (including different types of genomic and other biological variation) that can be apprehended and compared is limitless. Therefore, a selection of what is actually identified and represented from that limitless array of the potentially comparable is made either a priori or during the process of analysis. What dimensionality is added to the web of reference depends on the history and interests of the community producing a resource and how this community relates to the processes involved in producing and improving the reference genome. In other words, we cannot characterise this expansion of dimensionality as being a mere consequence of a simple transition from genomics to postgenomics (or even to post-reference genomics): there are different temporalities and models across (and within) yeast, human and pig genomics.

Across both functional and systematic studies separately, and even more acutely in their intersection, the variation that is measured, analysed and integrated into data infrastructures constitutes only some of the potential range that could be pursued and exploited. The dimensions that are explored, even if they are apparently of the same kind, may be directed towards distinct goals, use different materials and be related to other dimensions differently. We refer to this as a variational surplus, in analogy to the surplus possibilities open to researchers working on particular experimental systems, as characterised by Hans-Jörg Rheinberger (1997, p. 161). So, does all this just result in a blooming, buzzing confusion of different approaches to variation among distinct projects and communities? The construction of infrastructures to establish links and relationships between different forms of data and material objects, and efforts towards integration (e.g. Leonelli, 2013), suggest not. (Footnote 70)

The history of post-reference genomics, elements of which we have examined in this chapter, suggests that there has been a shift in the kind of research on and using genomes. In the next chapter, we explore this in terms of “epistemic iteration”, a term coined by philosopher Hasok Chang (2004). For now, we note that in the absence of direct access to the ‘truth’, the improvement of standards such as reference genomes is evaluated using epistemic virtues, values and goals as guides. This improvement occurs through the correction and enrichment of these resources, and it builds on and supersedes prior standards. The past serves as a constraint or a condition but is not wholly determinative of the future course of the standard. Reference genomes and other reference resources can be seen as products of their history: the choices made by particular communities amongst those available to them, including objects, methods, and modes of validation and enrichment. These activities use and devise standards such as designated reference genomes and up-to-date maps. Each standard undergoes its own process of improvement, in which new versions succeed old ones. Linkages are made between different kinds of standards or reference resources, and such linkages are used in the construction and evaluation of one resource in terms of another. The significance of the shift to post-reference genomics rests on two related phenomena. One is the increase in the number of linkages that contribute to the improvement of individual standards/resources and to their use in the improvement of other standards/resources. The other is the amplifying and ramifying effect of such improvements at the more global level of webs of reference.

Before we discuss this shift further, however, we should acknowledge that for the purposes of organising the narrative and our analysis, we have assessed the production and nature of reference genomes, their annotation, and post-reference genomics in separate chapters. This should not be taken to imply that these are discrete aspects of genomics or that they occur in a regular and linear sequence. Rather, as we have attempted to demonstrate throughout, the boundaries between any one particular set of practices that depend on the outcomes of another set are rarely sharply drawn. Conceptually later processes such as annotation may inform revisions of assemblies or even details of the sequence of reference genomes, for example, and the distinctions between structural and functional annotation, and manual and automated means of conducting it, are rarely clear-cut.

With that in mind, we consider how the aims and shape of genomic research changed following the release of reference sequences. These reference genomes were not themselves static, but were continually modified and improved according to widely held epistemic criteria. These improvement efforts were often informed by the results of post-reference genomic projects that themselves relied on and used an existing reference sequence.

Alongside the enrichment of the reference genome, a panoply of reference resources have been created for distinct populations and individuals, and the means to make comparisons within and between species has been further developed. These have fed functional analysis, but have also enabled the increasing exploration and mapping of the terrain of variation within species and the establishment of connections between different species. While this has led to concerns about the extent to which the reference sequence represents the increasingly mapped terrain, the new locales established throughout this land were still seeded from the reference genome, and related to it. The terrain is not three-dimensional like a geographical landscape, but more like a hyperdimensional state space. In this way, webs of reference have been constructed, exploring the variational space for a given type (the species, a sub-species, or a higher-level grouping or taxon) as new reference standards are created to capture specified types or sub-types. These webs of reference, in which each node is related to others, have developed iteratively and recursively. The more linked data there is concerning the variational space of the type, the more that further exploration can be conceived, and existing reference resources improved using the new linked data. This is where the development of population-specific resources, and ways of representing genomic variation such as pangenome graphs, have taken post-reference genomics: seeding the web.

The reference genome is useful to the extent that it is a viable origin of radiation that enables functional and systematic lines of investigation to bloom and produce linkages between different kinds of data and material. Genomics involves the creation of standards that improve over time relative to the epistemic aims of their creation and use, becoming more stable, though never achieving completion due to shifts in epistemic goals and the non-existence of even a theoretical absolute standard. But this is just a part of the picture, particularly for post-reference genomics, in which developments include the progressive exploration of the indefinitely-dimensional variation space for particular species (or other types) and the establishment of connections between the variation spaces of different species (or of other types). The more the space is explored, the more connections can be made, and the more the basis for further exploration is created, both extensively across the space and intensively in particular regions of it (Fig. 7.3).

The way this process unfolds, and the webs of reference that are constructed through it, are unlikely to be generic. The greater degrees of freedom offered compared with reference genomics indicate that the involvement of particular communities in the generation of genomic resources will be at least as salient to how these webs develop as it was to how reference genomes were produced. However, the historicity and contingency underlying these webs of reference should not distract from the potentially new emergent dynamics generated through them. The existence of a web of reference at a certain level of development lowers the threshold for adding, and connecting, new reference resources. New groups and communities can draw upon and link to existing resources to generate their own, and therefore contribute towards and help shape the web. The wider context of reference resources should therefore be considered as a factor in enabling fresh participation and the connection of genomic data and resources to more specific research goals, in addition to the more widespread and distributed capacity to conduct sequencing that has emerged in the last 20 years.