This chapter looks at the processes of annotation: the identification and addition of biologically-relevant information to the reference genome, which can then be visualised in genome browsers, with the annotations aligned against the reference sequence itself. Annotation is both a key part of the creation of a reference genome and a definitional criterion for being designated a reference genome in the RefSeq database. It is the way in which the data produced in genomics are linked with the concerns and interests of the empirical life sciences and the particular problems that motivate the work of specific communities: what historian Jon Agar (2012, 2020) has termed “working worlds”.

This chapter demonstrates that the establishment of ever-more automated and refined pipelines incorporating multi-dimensional data—including cross-species comparative and ‘beyond the genome’ data such as protein sequences—was only part of the story of the development of genome annotation. We show that the manner in which annotation has developed was affected by: the ways in which the algorithms, protocols and operations of these pipelines were configured and improved; how they related to practices of manually annotating genomes; and the role played by the interactions of specialist genomicists with particular research communities. These factors were also pertinent to shaping what got annotated, how, and what use was made of the resulting enriched reference resources.

We show that different models of annotation are shaped by the relationship between reference sequence production efforts and the nature of the involvement of different communities converging around the genomes of particular species. Pig genome annotation, as a collaboration between the community of pig genomicists outlined in Chap. 5 and a well-developed annotation infrastructure at the Sanger Institute, differed in its nature and outcomes from yeast and human genomics. In yeast genomics, the community of yeast biologists was intimately involved in the reference genome production, while the initial annotation of the genome was orchestrated by the central bioinformatics coordinator, the Martinsried Institute for Protein Sequences.Footnote 1 In human genomics, two models existed: one involved the creation of high-throughput annotation pipelines at the institutions participating in the International Human Genome Sequencing Consortium (IHGSC), while the other—developed by their rival, the company Celera Genomics—was more open to input from prospective sequence users. In the former case, as with the reference genome sequencing, the medical genetics community was largely uninvolved. In the latter case, a subset of this medical genetics community was brought into the fold and contributed towards the realisation of a product—an annotated genome—distinct from that emanating from the large-scale sequencing centres leading the IHGSC effort.

One key commonality between the multiple species we have examined is the involvement of the Sanger Institute. In the previous chapter, we saw how the Sanger Institute’s relationships with the different species communities varied in important and consequential respects. In this chapter, we show how the relationship of the Sanger Institute to the existing pig genetics community, already particularly close during the production of the Sus scrofa reference genome, was even more entangled for the annotation of the resulting sequence. This annotation used data from prior annotation and sequencing (in particular of the human genome) and availed itself of the Sanger Institute’s infrastructures and procedures (pipelines) developed through human (and pre-human) sequencing projects. However, this annotation effort also had crucial input from the pig genomics community, whose members played a significant role in manually annotating the genome, confirming the automated annotations of the Sanger Institute, and contributing to an already-established panoply of comparative resources, empirical data and theoretical insights. Rather than just being a large-scale data producer, the Sanger Institute features here as a collaborator, facilitator, trainer and provider of quality assurance, as well as the manager of various data infrastructures.

This changing role exhibited by the Sanger Institute enables us to show that the story of increasingly automated and data-intensive annotation pipelines merely corresponds to some of the ways in which the IHGSC institutions operated. We demonstrate that a broader multi-species approach to examining the history of annotation practices helps us to notice strategies that connect to the working worlds of the communities using the sequence data. This allows us to disclose the activities of communities that had long been generating and interpreting sequences, and to incorporate their trajectories into the history of the production of reference genomes.

1 Annotation: Pipelines and Jamborees

1.1 What Is Annotation and How Does It Contribute to the Production of a Usable Reference Genome?

Broadly speaking, annotation is the marking of features of interest in the abstract landscape of the sequences of nucleotides. Typically, representations of the genome accessible to researchers and the lay public are in the form of a browser, a window in which the user can select or deselect different features and modes of presentation of the genome to be conveyed to them (Fig. 6.1). The different selected features are aligned vertically next to a horizontal representation of the strands of the chromosome, which depicts the order of nucleotides along it if the user zooms in sufficiently. The browsers are based on database resources, perhaps incorporating several nested layers of data drawn from different sources.

Fig. 6.1
A screen display of the Ensembl interface, showing chromosome 1 of the pig genome. The region-in-detail panel is highlighted, with tracks including EST clusters, 90-way GERP elements, genes, contigs and all sequence SNPs.

Example of a display of a reference genome—Sscrofa11.1 for Sus scrofa, the pig—on a genome browser: Ensembl. Taken from Sscrofa11.1 Chromosome 1: 90,744,428-90,875,121, Ensembl (Howe et al., 2020), release 105 (https://www.ensembl.org/Sus_scrofa/Location/View?r=1:90744428-90875121;db=core, last accessed 18th December 2022)

The features that can be annotated include:

  • Open Reading Frames (ORFs; segments between start and stop codons—specific sequences that may indicate the presence of transcribable DNA such as a gene);

  • Genes (and their structure, organisation and variants);

  • Repeat sequence regions, including those constituting telomeres at the ends of chromosomes and centromeres that perform a key role in the chromosome dynamics of cell division;

  • Pseudogenes (which appear similar to genes but do not function as such, due to mutations—these may have originally been copies of functioning genes);

  • Regulatory regions that are not themselves expressed, but that affect the expression of genes.

Beyond these, many different kinds of sequence variants can also be identified and annotated, including structural variants in which stretches of nucleotides have been deleted, inserted, added, moved and inverted (Mahmoud et al., 2019).Footnote 2 Genomic variation comes in many forms, from differences in individual nucleotides, through variation in the sequence of individual coding regions, variation in the number of copies of repeat sequence in particular regions, to differences in sequence at a more gross level such as structural variants.

Two key distinctions have emerged to describe the processes and objects of annotation: manual and automated annotation; and structural and functional annotation.

Manual annotation involves the marking of genomic features using biological knowledge, such as the known sequence and location of a given gene. In this way, the sequence is interpreted and contextualised using evidence from a variety of sources, which may include earlier automated annotations. In automated processes, the genome assembly is first computationally analysed to identify key features such as repeat sequences and ORFs, and existing datasets are then interrogated to make predictions concerning more complex features such as protein-coding genes. These predictions are examined further using a variety of algorithms embedded in different software to synthesise different forms of data and thus establish consensus models of each gene, which may include its structure and the existence of different forms. The data used in these automated processes include Expressed Sequence Tags (ESTs), known protein sequences and RNA sequences. These data can concern the species being annotated, as well as other species known—through prior comparative work—to be genomically close enough to the target species for cross-species inferences to be made between parts of the genomes known to be equivalent (Lowe, 2022).
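As an illustration of the kind of evidence-weighing that such automated processes perform, the sketch below filters candidate gene models according to how much of their span is supported by EST, cDNA or protein alignments. It is a deliberately minimal example of our own devising: the function names, coordinates and support threshold are illustrative and do not correspond to any particular pipeline.

```python
# A deliberately simplified sketch of the evidence-weighing step in automated
# annotation: candidate gene models are retained only if enough of their span
# is supported by independent evidence (EST, cDNA or protein alignments).
# All names, coordinates and thresholds are illustrative, not those of any real pipeline.

def coverage(candidate, evidence):
    """Fraction of the candidate interval covered by at least one evidence hit."""
    start, end = candidate
    covered = set()
    for ev_start, ev_end in evidence:
        lo, hi = max(start, ev_start), min(end, ev_end)
        if lo < hi:
            covered.update(range(lo, hi))
    return len(covered) / (end - start)

def filter_candidates(candidates, evidence, min_support=0.5):
    """Keep candidate models whose span is sufficiently supported by evidence."""
    return [c for c in candidates if coverage(c, evidence) >= min_support]

# Toy example: two predicted gene models and three evidence alignments
# (coordinates on the same assembled contig).
predicted = [(100, 400), (900, 1000)]
est_and_protein_hits = [(120, 260), (250, 390), (950, 960)]
print(filter_candidates(predicted, est_and_protein_hits))   # [(100, 400)]
```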

Typically, generic pipelines have been designed and continually developed to annotate genomes, similar to the way that pipelines have evolved to produce and assemble sequence data (Stevens, 2013). These pipelines involve the specification of a series of sequential tasks and associated protocols, though different options for routes along the pipeline typically exist, enabling projects with differing levels of resources to navigate it. While some projects may have the resources to, for example, pay for additional manual annotation to refine the automated annotation, others may not. The existence of generic pipelines, together with the use of cross-species data, shows how genomic endeavours for different species interact. The infrastructures are built to accommodate difference, but also to channel it, to ensure that the products of the pipelines are commensurate even though they may serve—and be used by—different communities.

Alongside the selection of the source of DNA and the planning of the project, it is in annotation that the reference genome as a creative product of a particular configuration of actors is most manifest. The ways and the extent to which the annotation process enables new forms of genomics and genome-related research and resource development, however, depends on the details of the construction of that reference genome. Such details include the libraries used and how the genomic variation of the species was abstracted into the reference sequence. Also crucial are the relationships of particular research communities to various aspects of the process from pre-reference genomics through to annotation, as we show below.

The distinction between structural and functional annotation appears to map onto the distinction between (reference) genomics and post-(reference) genomics, which is explored further in the following chapter (Chap. 7). Structural annotation is the identification of particular features of the genome such as genes and their organisation, but also other functional and non-functional elements. Functional annotation is the connection of this structural data to other forms of data that help to make sense of the products and role of particular genomic elements. Broadly, we discuss structural annotation more in this chapter and functional annotation in the following chapter, but in doing so, we reveal that the distinction and apparent temporal succession from structural to functional is not clear cut.

1.2 Creation of Annotation Infrastructures

Annotation practices pre-date the annotation of reference genomes, and even the invention of DNA sequencing: for instance, the annotations in Margaret Dayhoff’s early DNA sequence database were modelled on those in her previously-established protein sequence database (Strasser, 2019, p. 209). Stephen Hilgartner (2017) and Bruno Strasser (2019) have identified two broad periods in the generation, collection and curation of annotations in databases such as GenBank. In the earlier period, database staff themselves had to collect and annotate individual sequences by trawling the literature. In the period that succeeded this, the producer of the sequence data was able to submit it—with pertinent annotation—directly to databases with the help of specially-designed software tools.Footnote 3 In alliance with funders and journal editors, the databases helped to increasingly transform this practice into a duty.

In the first period, in the 1980s, annotation was essentially in the form of metadata; curators would read journal articles reporting a new nucleotide sequence and annotate the sequence by indicating the source of DNA and key features within the string of nucleotides. This process was advanced by agreements forged from 1982 onwards between GenBank and the Nucleotide Sequence Data Library hosted at the European Molecular Biology Laboratory (EMBL), and later (1987) between these and the DNA Data Bank of Japan. This tripartite alliance later became formalised as the International Nucleotide Sequence Database Collaboration. They divided up the laborious tasks of going through the literature and extracting and annotating sequences between themselves. Furthermore, to get around existing compatibility problems, they strove to harmonise the format that the data was recorded in.

In spite of this, and the use of supercomputers at the US Department of Energy’s National Laboratory at Los Alamos to try to automate data processing and annotation, the rapidly-increasing production of sequences led to a backlog. This encouraged GenBank to streamline the process, in part by skipping the annotation or making it more cursory (Hilgartner, 2017, pp. 157–161; Strasser, 2019, pp. 228–230). As annotation was meant to be about making the data useful and “biologically meaningful”, enabling it to be picked up and re-used by researchers using the database, this was problematic (EMBL Director General Lennart Philipson, as quoted by Strasser, 2019, p. 232). The EMBL, closer to bench biology than the physicist-led GenBank (which was based at Los Alamos from 1982 to 1992), was less keen on short-cuts around or through the annotation process (Strasser, 2019). The inadequacy of the initial algorithms designed for annotating sequences at the EMBL led to the conscription of biology students and clerical staff to contribute to the effort. When this also proved insufficient, more senior biologists were cultivated, which involved informing them about some of the basics of the operation of the database, as well as circulating new sequences that may have been of interest to them. Biological researchers at the EMBL could then work with the database staff to refine the sequences stored on the database—and their annotations—as well as helping to improve the algorithms used in automated annotation (García-Sancho, 2012, pp. 111–114).

From 1987 onwards, there was a strategic shift towards securing agreements with journals, by which they would only publish articles including DNA sequence data if they were accompanied with accession numbers, indicating that they had been submitted to a publicly-accessible database such as the DNA Data Bank of Japan, the EMBL one or GenBank. Even though these agreements and rules were variably enforced, they succeeded in encouraging more direct submission, especially when software tools making data submission easier for researchers spread. Further changes in rules and norms of submission followed in the 1990s and improvements in the way data were submitted and accessed also occurred. There was increasing adoption and ease of internet access, additional tools to interrogate the databases were developed (such as the Basic Local Alignment Search Tool—BLAST—sequence comparison software), additional databases beyond the basic sequence ones were launched, and ongoing improvements were made to the fundamental DNA sequence databases.

In 1992, GenBank came under the umbrella of the National Center for Biotechnology Information, which maintains a panoply of other reference and software resources, including the RefSeq database (Chap. 1), and ClinVar, which is explored in Chap. 7. As we showed earlier (Chap. 4), in 1994, the EMBL database moved from Heidelberg—where the EMBL headquarters are—to what is now known as the Wellcome Genome Campus in Hinxton, Cambridgeshire, to form the EMBL’s European Bioinformatics Institute (EBI). The Wellcome Genome Campus is also where the Sanger Institute is based, a co-location of significance to the story of the development of annotation infrastructures, and the specific examples of annotation we detail in the following section.

For now, the relationship between the Sanger Institute and the EBI is pertinent, because of the role of these institutions in the creation of means by which the data in well-stocked nucleotide databases could be brought together and presented in a useable form for researchers. These resources, the database system AceDB and the genome browser Ensembl, were forged in the exigencies of reference genome sequencing: of the nematode worm Caenorhabditis elegans and the human, respectively.

AceDB, which stands for ‘A C. elegans Data Base’, was created in 1989 by Jean Thierry-Mieg and Richard Durbin. The former was a Centre National de la Recherche Scientifique (CNRS) researcher in France, while Durbin was spending a spell at Stanford University between doctoral and postdoctoral work based at the Laboratory of Molecular Biology in Cambridge; he moved to the Sanger Institute in 1992 and stayed there full-time until 2017. As it developed, AceDB allowed users to access and relate different kinds of representations of the genome of C. elegans in an internet browser, and to move between representations of the DNA sequence and the genetic linkage and physical maps. In her historical investigation of C. elegans genomics and the nature of the AceDB enterprise, Soraya de Chadarevian has highlighted the infrastructuring work that is required to make maps—that have been produced in very different ways and constitute distinct representations—commensurable in databases and in visualisations generated using them. The production of new kinds of maps, including the full genome sequence, was driven by specific concrete demands (e.g. of particular communities) that were often independent of those that drove the construction of preceding maps. Making different kinds of maps interoperable through this work of commensuration flattens the specificities of the objectives, communities, practices and historical trajectories involved in forming these resources (de Chadarevian, 2004). This eases visualisation and navigation by users, but at the cost of abstracting away the underlying specificities and lineages. As we show below, this double-edged sword—of easing interoperability at the expense of flattening specificities—persisted in other infrastructures produced at the Wellcome Genome Campus.Footnote 4

In 1999, the same institution at which AceDB was developed—the Sanger Institute—collaborated with the EBI to launch a key platform to accelerate the IHGSC human reference sequence effort: Ensembl. The Ensembl team devised a pipeline to help assemble the reference sequence and present it online through a genome browser.Footnote 5 The Ensembl browser presents an abstracted view of any part of the genome one chooses to zoom in to. It offers a variety of ‘tracks’ representing different annotated features of the genome that can be selected and lined up alongside the reference nucleotide sequence, which is itself arrayed horizontally (Fig. 6.1). Ensembl does not only generate these visualisations but, for vertebrate species, also produces the annotations that are included in them, through its own automated annotation pipelines. It augments this with downloaded annotation data for other key non-vertebrate species. Ensembl, therefore, exemplifies the clear drive in the late-1990s to automate the annotation process and to bring it ‘in-house’ into the small number of institutions producing sequence data.
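The annotations behind these tracks can now also be retrieved programmatically. The sketch below, for instance, queries the present-day Ensembl REST service (rest.ensembl.org), which postdates the period discussed here, for the genes overlapping the Sscrofa11.1 region shown in Fig. 6.1. The endpoint and field names reflect the service as documented at the time of writing and may change in later releases.

```python
# A minimal sketch of retrieving annotation 'tracks' programmatically from the
# present-day Ensembl REST service (rest.ensembl.org). The region queried is the
# one shown in Fig. 6.1; field names may differ in later releases of the service.
import json
import urllib.request

region = "1:90744428-90875121"            # Sscrofa11.1, chromosome 1
url = (f"https://rest.ensembl.org/overlap/region/sus_scrofa/{region}"
       "?feature=gene;content-type=application/json")

with urllib.request.urlopen(url) as response:
    genes = json.load(response)

for gene in genes:
    # Each record carries the feature's coordinates and (where assigned) a name.
    print(gene.get("external_name", gene["id"]), gene["start"], gene["end"],
          gene.get("biotype"))
```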

The manual annotation of select species was conducted by the Human And Vertebrate Analysis and Annotation group (HAVANA) at the Sanger Institute. HAVANA had its origins in the Human Sequence Analysis team led by Tim Hubbard within ‘Team 71’, the Informatics division that was led by Durbin at the Sanger Institute. The Sanger Institute component of Ensembl led by Michele Clamp was also part of Hubbard’s team. Jennifer Ashurst (later Harrow) joined this team in April 2000 and led a distinct HAVANA group within the team from 2002. At the time she joined, there were two people working on manual annotation. However, it became apparent that Ensembl’s automated annotation generated too many false positives due to the quality of sequence data then available to them.Footnote 6 It did predict approximately 70% of human genes accurately, good enough for a rough-and-ready annotation of the draft genome, but not of the required quality for biomedical research or diagnostic purposes. To improve the quality of the annotation, manual annotation was required that would make use of data coming in from the automated pipelines, but also involve curatorial decisions based on biological knowledge.Footnote 7

HAVANA developed the curated Vertebrate Genome Annotation (VEGA) database and browser, which was built on Ensembl. VEGA was operational for human manual annotations from 2002, and mouse and zebrafish from 2003.Footnote 8 The browser was curated using both manual annotations conducted by the HAVANA group itself (such as for human chromosome 20) and by other groups and institutions (such as Ian Dunham’s for human chromosome 22, and Genoscope and the CNRS for human chromosome 14). From early on, this annotation and curation work was accompanied by the development of protocols for manual annotation. At two ‘Human Annotation Workshops’ (HAWK1 and HAWK2) hosted by HAVANA in March and September 2002, participants from multiple institutions involved in manual genome annotation discussed possible standards and guidelines. A test sequence was annotated using different manual and automated methods at HAWK1, and the results of this were compared.Footnote 9 These workshops formed the basis for the manual annotation standards used in VEGA and were intended to aid commensurability across other resources and genome browsers developed at the NCBI and University of California Santa Cruz (see note 5). The Otter manual annotation system that was developed for HAVANA by Ensembl and used in VEGA was designed in accordance with the standards formulated in the HAWK workshops (Searle et al., 2004).

Between 2014 and 2017, Ensembl transitioned to being solely part of the EBI, and HAVANA became part of Ensembl at the EBI in 2017. By then, HAVANA had branched out to work directly with some species communities on manual annotation; the pig was one of these, as we see later in this chapter.

2 Annotating the Yeast, Human and Pig Genomes

When we consider the annotation process across the three main species we examine, we find that its nature depended on: the generation and use of existing genomic resources such as maps and genome libraries; the existence of data such as those on Expressed Sequence Tags (ESTs), complementary DNA (cDNA) sequences, RNA sequences, and protein sequences; the nature of the inferential apparatus available for intra-specific and inter-specific data analysis; the kind of community and actors involved and their interests; and the mode of organisation of genomic projects. The available data sources and inferences were marshalled to find and elucidate the fine structure of genes and other elements of the genome. A closer look at the specifics of annotation practices across these three different species allows us to complicate the relationships between automated and manual processes, as well as between structural and functional annotation.

Four basic models of annotation were identified by bioinformatician Lincoln Stein in an article published in the summer of 2001 (Stein, 2001). He associated these models with particular stages of the annotation process, in terms of its increasing complexity and recontextualisation through forming connections to other kinds of biological data and knowledge. Two of his terms—factory and cottage industry—are familiar from earlier debates concerning the proper organisation of genomics (Chaps. 2 and 3). We have interpreted his designations in the scheme displayed in Fig. 6.2.

Fig. 6.2
Diagram with two rows, labelled ‘stage of annotation’ (nucleotide, protein, process) and ‘annotation model’ (factory, party, museum, cottage industry). The complexity of the stages and the functionality of the models increase from left to right.

Diagrammatic depiction of models of annotation and how they relate to different stages or levels of annotation. (Produced by both authors, based on Stein, 2001)

Stein’s scheme, as interpreted, highlights the importance of the establishment of mechanisms by which existing datasets and resources can be accessed and used in annotation, as well as the significance of the role of annotation itself in enabling the creation of new links to other datasets and standard references, such as the Gene Ontology. This enables the decontextualised reference sequence to be progressively connected to other forms of biological data and therefore recontextualised.Footnote 10 This process makes use of software and algorithms to search external databases. Crucially, it also uses maps and libraries employed in the construction of the reference genome to initially annotate the sequence. This seeds further annotation by providing reference points to aid the searching of external data, and also aids the later contextualisation of the annotated data. Stein’s conception, while consisting of stages, does break down firm distinctions between manual and automated annotation, as well as between structural and functional annotation, as entanglements of each are implicated at any one point. Key here is that the weights of the different modes (automated/manual; structural/functional) change as the annotation process proceeds. The schematic we have drawn from Stein is a useful overview of the general trends in the annotation process, and it constitutes a helpful reference point with which to consider examples that depart from the sequential and separable stages implied by it. For instance, we may observe that the main genome browsers such as Ensembl moved towards a hybrid factory-museum model (Loveland et al., 2012).

Quite apart from the particular manifestations of sequencing, and the extent to which they may depart from Stein’s ideal types, the ways in which particular communities and genomic endeavours undertake annotation are constrained by multiple factors. These include the histories, motives and resources of particular communities of genomicists. Furthermore, groups such as HAVANA developed forms of community annotation, in which they acted as facilitators—rather than the sole conductors—of the annotation process. As we detail below, these forms of community annotation involved the creation of software tools such as Otterlace/Zmap for manual annotation on the cottage industry model, as well as more direct interactions with research communities, such as the one that had been working on pig genetics and genomics (Loveland et al., 2012).

2.1 Yeast Genome Annotation

For yeast genome sequencing, as previously noted, one finds a community of geneticists, cell biologists, biochemists and molecular biologists, often dedicated to working with standardised strains of Saccharomyces cerevisiae. The ease of working on this unicellular eukaryote was what made it a model organism, and this engendered the virtuous cycle by which the existing weight of scientific capital—in the form of mounting knowledge, resources, tools, and mechanisms of dissemination and sharing—justified new investment in its further augmentation. When the perception began to grow that “[t]he yeast genome was becoming overstudied, and yet…, largely unexplored!”—that different research groups were working on the same genes while much of the genome was terra incognita—multiple laboratories across Europe, Japan, Canada and the USA rallied to participate in an unprecedented collaboration to sequence the first full eukaryotic genome (Dujon, 1996, p. 263; Chap. 2).

The structural annotation of the yeast genome reflected the hierarchical, top-down and distributed approach of the sequencing effort in Europe. Within the initiative funded by the European Commission, the centralised bioinformatics function located at the Martinsried Institute for Protein Sequences (MIPS) was married with the specific expertise of the laboratories performing sequencing, and seeking to make use of the data so generated.

After assuring the quality of the sequences it received and assembling contiguous tracts of sequence (contigs) on the basis of them, MIPS screened the data for ORFs by identifying stretches with no stop codon above a minimum number of nucleotides (from about 50 to 300, a lower number risking more false positives and a higher one more false negatives). They also sought ORFs below this threshold by searching for sequences that were homologous (showed sufficient similarity) to known protein sequences, based on knowledge of the genetic code and the processes of transcription and translation. Already, this analysis relied upon existing experimental knowledge of this well-studied organism, as well as the prior delineation of protein sequences and elucidation of their functions. Using sequence homologies, the MIPS team was able to classify the ORFs in terms of their putative functions (Mewes et al., 1998). Once the data had been passed on to the sequencing laboratories, the initial identification of the ORFs could be built on with a deeper analysis of these sequences. This was done either using existing biological data or materials (for example, concerning centromeric and telomeric DNA, tRNA and Ty elements for chromosome II) or by performing a variety of experiments to characterise their functional role. Following the conclusion of the reference genome sequencing, such experiments were organised and conducted in a concerted way in a successor project on functional analysis and annotation called EUROFAN, which is discussed in Chap. 7. Such functional analysis would also enable the verification of the structural annotation, given the limitations of homology analysis: about 40% of putative genes were “orphans”, either having no discovered homologues or only homologues with no known function.
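To give a sense of what such a screen involves computationally, the following sketch of our own scans each forward reading frame of a sequence for stop-codon-free stretches above a minimum length. It is a toy version only: the actual MIPS criteria also covered the reverse strand, start codons and the homology-based rescue of short ORFs, and the threshold shown is illustrative.

```python
# A toy version of an ORF screen of the kind described above: within each forward
# reading frame, report stop-codon-free stretches of at least a minimum number of
# codons. Real screens also handled the reverse strand, start codons and
# homology-based rescue of short ORFs; the threshold here is illustrative.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_free_stretches(sequence, min_codons=100):
    """Yield (frame, start, end) for stop-free runs of at least min_codons codons."""
    sequence = sequence.upper()
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if codon in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    yield frame, run_start, pos
                run_start = pos + 3
        if (len(sequence) - run_start) // 3 >= min_codons:
            yield frame, run_start, len(sequence)

# Example with a short threshold so the toy sequence produces hits.
toy = "ATG" + "GCT" * 60 + "TAA" + "CCGT"
print(list(stop_free_stretches(toy, min_codons=50)))
```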

Once the presumed coding regions were separated from the non-coding, the non-coding regions could be further analysed to detect sequence motifs (including promoter regions of genes) and other features such as transposable elements (Ty elements). Many of these non-coding elements were of interest to participants in the network, who could use the genomic data that they generated—and MIPS processed—to further their research. For example, Horst Feldmann at Ludwig-Maximilian University of Munich was particularly interested in Ty elements (Chap. 2) and advanced his research using the structurally annotated sequences he now had access to. These sequences had themselves been augmented using the data he had previously collected (Feldmann et al., 1994; Heumann et al., 1996; Mewes et al., 1998). While the centralised parts of this process, such as the role of MIPS, will seem analogous to some of the informatics pipelines and groups of the IHGSC discussed in the next section, the yeast biology laboratories played an important role in refining and developing the initial annotations that were made by MIPS. Unlike in human reference sequencing, in which prospective users were not involved in the processes of data production, in the yeast genome effort there was a set of users incorporated in those processes (García-Sancho, Lowe, et al., 2022).

The completion of the sequencing and sequence analysis of the different chromosomes at different times enabled innovations developed for one chromosome to be taken up by groups working on other parts of the yeast genome. For example, in the chromosome XI paper published in June 1994, yeast geneticist Bernard Dujon developed methods for evaluating which ORFs were indeed “functional genes” (Dujon et al., 1994); these methods were then used in the chromosome VIII paper published in September that year. Chromosome XI was Europe-led, while VIII was coordinated from Washington University by Mark Johnston. While they exhibited different organisational models, as we saw in Chap. 2, there was enough of a connection for each to build on the advances of the other.

Washington University’s model of annotation was also different, though in practice they used searches of public nucleotide and protein databases to identify cross-species homologies with known genes and protein sequences, as well as examining other elements such as tRNAs, much as MIPS did. For assembly and annotation, they (along with some European-led groups) used a version of AceDB: AScDB, with ‘Sc’ standing for S. cerevisiae rather than the ‘ce’ of C. elegans. AScDB had been specially adapted for yeast by Richard Durbin, young EMBL bioinformatician Erik Sonnhammer and LaDeana Hillier, the director of informatics at the Washington University Genome Sequencing Center (Johnston et al., 1994). Hillier collaborated closely with Johnston, and also worked on C. elegans and human genomics. With the benefit of a comparative perspective gained from interaction with the yeast, human and C. elegans efforts, she observed that a significant problem with “smaller numbers of groups doing the sequencing” was that “user education” could be “an issue”. However, for “yeast the user education was taken care of because the sequencing was done at so many different places that everybody [...] understood the limits of the data” (Hillier, 2012, p. 7).

Dujon and Johnston gave assistance to the chromosome I team that mainly operated at McGill University. That team was the next to publish—in April 1995—with Dujon helping with sequence analysis and Johnston providing the chromosome VIII sequence, which enabled some genome duplications to be identified. Later papers indicate a continuation of this cooperation around sequence analysis, and document a refinement of the processes, datasets and software used from the early published chromosomes onwards (Bussey et al., 1995; see also Galibert et al., 1996). This stands in contrast to the development of novel tools and the infrastructural transformations associated with human genome annotation, or the adaptation of established infrastructures and processes to the particular demands of pig genomics.

For the Europe-led sub-projects, MIPS continued its role in sequence analysis. It did not see its task as restricted to identifying individual genomic elements, but also as aiding the global characterisation of the genome, using its initial structural annotation to partition the genome into units. Sequence comparisons could then be made between these units, in order to identify gene duplications that would aid future functional analysis and provide data for tracking the evolution of the S. cerevisiae genome. These twin approaches of targeting function and diversity, which arose out of the initial work to structurally characterise the genome, form an important part of the narrative of Chap. 7.

For the purposes of sequencing and annotation, yeast had clear advantages over the bulkier organisms that we consider next: humans and pigs. The yeast genome is considerably smaller in size, but also more economical, in that it contains comparatively little non-coding DNA and few of the complex gene structures found in multicellular eukaryotes. As a model organism, yeast also had a panoply of available experimental evidence that could be used and built on to inform both automated and manual approaches to annotation. Additionally, the range and extent of functional analysis conducted by the yeast genomics community that we discuss in Chap. 7 was not possible for human and pig. This meant that distinct strategies for annotation needed to be developed for these species. For the human genome, this involved making use of the abundant EST and protein sequence data that had been gathered, the creation of automated and manual annotation pipelines, and advancing the means with which to conduct analyses of homology by harnessing and further developing comparative genomic approaches.

2.2 Human Genome Annotation

In the three major papers describing the sequence of the entire human genome (authored by the IHGSC in 2001 and 2004, and by Celera in 2001), only the Celera paper includes details of the annotation process. For the IHGSC, the details of annotation are dealt with only in the subsequent individual papers describing the sequence of each chromosome. This reflects, we suggest, the IHGSC’s primary concern of getting assembled sequence out into the public domain to prevent its enclosure by some form of intellectual property. On the part of Celera, the inclusion of information about annotation evinces their commercial strategy of building the foundations for the exploitation of the genome for biomedical purposes. Even though they described aspects of their annotation process, users would still have to pay to access Celera’s full annotated sequence. In this way, Celera sought to make itself an obligatory passage point for those seeking the richly-annotated data that it produced.

The first chromosome that the IHGSC sequenced was chromosome 22, by a team led by Ian Dunham at the Sanger Institute. The paper announcing this appeared in December 1999, before Ensembl and HAVANA were up and running. Tim Hubbard’s sequence analysis team were involved, though, and they integrated existing data on nucleotides and protein sequences, using similarity searches (through programmes implementing the ‘BLAST’ algorithm developed at the NIH by Gene Myers and colleagues) and prediction programmes (Dunham et al., 1999). As with the annotation of subsequent chromosomes, an early stage was identifying repetitive sequences and ‘masking’ them. This meant filtering them from view so that they were not incorporated in automated analyses of the sequence data. To do this, the annotators used ‘RepeatMasker’, a piece of software developed and (then) hosted at the University of Washington Genome Center. The remaining unmasked sequence was then analysed for the presence of various genomic features, for example by spotting areas of the genome with a relatively high proportion of guanine and cytosine bases in order to discern the presence and location of CpG islands, in which a cytosine is immediately followed by a guanine. These are frequently located in the promoter regions of genes and are therefore a good indicator of the presence of genes.
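As a rough illustration of this kind of analysis, the sketch below scans a repeat-masked sequence with a sliding window, skipping masked (lowercase) bases and flagging windows that are both GC-rich and enriched for CpG dinucleotides. The thresholds follow commonly cited rule-of-thumb criteria (GC content above 50% and an observed/expected CpG ratio above 0.6 over 200 bp windows); the programs actually used for chromosome 22 were more sophisticated than this.

```python
# A simplified CpG-island scan over a repeat-masked sequence: lowercase bases are
# treated as masked repeats and skipped, following the soft-masking convention.
# Thresholds are rule-of-thumb values (GC > 50%, observed/expected CpG > 0.6 over
# 200 bp windows), not those of the programs used in the chromosome 22 analysis.
def cpg_windows(sequence, window=200, step=50, min_gc=0.5, min_oe=0.6):
    """Yield (start, gc_fraction, obs_exp_cpg) for windows that look CpG-island-like."""
    for start in range(0, len(sequence) - window + 1, step):
        chunk = sequence[start:start + window]
        if any(base.islower() for base in chunk):   # overlaps a masked repeat
            continue
        g, c = chunk.count("G"), chunk.count("C")
        cg_dinucleotides = chunk.count("CG")
        gc_fraction = (g + c) / window
        expected = (g * c) / window if g and c else 0
        obs_exp = cg_dinucleotides / expected if expected else 0
        if gc_fraction > min_gc and obs_exp > min_oe:
            yield start, round(gc_fraction, 2), round(obs_exp, 2)

# Toy usage: a GC- and CpG-rich stretch embedded in AT-rich and masked sequence.
toy = "AT" * 200 + "GCGCGGCGCG" * 20 + "at" * 100
print(list(cpg_windows(toy)))
```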

At this point, the automated aspects of searches and the use of prediction programmes were interwoven with manual approaches. In large part, this was because of the calibration and verification required for each method, and the overall need to evaluate and refine the annotation process. A re-evaluation of the chromosome 22 annotation in 2003 re-affirmed the value of combining automated prediction, sequence similarity and comparative methods in annotation, but observed that the optimum configuration of them with respect to each other had not yet been found. Furthermore, at this time the ideal comparator species for similarity analysis was unclear. The authors acknowledged that while annotation processes would be improved, at that point automated approaches had significant limitations. As well as refining data categories and making use of new sources of data (e.g. new human ESTs and various kinds of data on related species), overcoming these limitations would involve manual analysis and experimentation (Collins et al., 2003).

The only other chromosome sequence published before the announcement of the completed draft of the whole genome in February 2001 was for chromosome 21, conducted by a consortium led by RIKEN (Rikagaku Kenkyūjo, the Institute of Physical and Chemical Research) in Japan.Footnote 11 This team also conducted gene predictions and sequence similarity searches. They additionally defined criteria by which putative gene classifications were assigned to one of five categories, depending on the strength of the evidence for them being protein-coding genes. They, therefore, placed the discernment of functional elements of the genome such as protein-coding genes at the heart of their annotation effort, an orientation appropriate to the biomedical interests of many of the institutions that worked on chromosome 21. That emphasis—and the function-centred annotation—motivated and aided the paper’s substantial analysis of the medical implications of their results (Hattori et al., 2000).

The biomedical interests of RIKEN’s collaborators were the exception rather than the rule for most institutions involved in sequencing subsequent chromosomes within the IHGSC effort. This was reflected in the way that the sequence data was analysed in the publications announcing their completion. Advances in the analysis of sequence data were heralded, but in so doing, the potential biomedical users of the data were a secondary concern. As we now detail, these analytical advances constituted refinements and additions that augmented the annotation pipelines for each successive chromosome. The augmentations that these specialist genomicists introduced were directed towards improving the capabilities of genomics qua genomics, as an enterprise in itself with its own internal goals and motivations. They sought to improve their assemblies and annotations according to internal generic metrics of quality, contiguity and coverage, guided by an overall ideal of completeness. In other words, they did not primarily shape the annotation process and its products in such a way as to fulfil the requirements of any specific external community or set of users.

The first chromosome sequence published after the announcement of the draft whole sequence was chromosome 20, in December 2001; after this, there was a gap in 2002 before a flurry of publications appeared across 2003 to 2006.Footnote 12 What did the progressive accretion of methods and sources of data consist of, across the five years following the completion of chromosome 20?

The chromosome 20 paper, signed only by authors from the Sanger Institute, was the first to use the Ensembl database in the analysis of the sequence; this sequence was, though, still assembled and visualised in AceDB. The genomicists were able to make use of sequence data from two vertebrates (the mouse Mus musculus and the pufferfish Tetraodon nigroviridis) in their comparative analyses rather than merely the mouse maps that the previous chromosomes had relied on (Deloukas et al., 2001).

For chromosome 14 (February 2003), a two-step annotation approach was employed by the collaboration between Genoscope, the Institute of Systems Biology in Seattle and the Washington University Genome Sequencing Center. First, automated methods used computational predictions to formulate provisional models of gene structure, which were refined by sequence similarity analysis. This was complemented by experimental data on gene expression using microarrays, a tool containing potentially many thousands of DNA probes that can indicate the presence or absence of specific complementary sequences. In the “manual curation” that followed, the genomicists used additional data to refine the gene models produced in the first stage and remove “suspicious data” such as partial matches that were not found to contain any significant coding sequences (Heilig et al., 2003, p. 607).

Washington University Genome Sequencing Center was also heavily involved in the completion of chromosome 7 (July 2003), as well as the Y chromosome (June 2003). These featured a significant focus on methods for the identification of pseudogenes, including KA/KS analysis to identify the kind and extent of selection operating on putative pseudogenes and known genes. In this type of analysis, the scientists generated reconstructed ancestral sequences to detect signatures of neutral evolution (and therefore an absence of positive or purifying selection) which would indicate the presence of a pseudogene. They then checked these inferences using the available mouse sequence data (Skaletsky et al., 2003; Hillier et al., 2003).Footnote 13
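The logic of a KA/KS comparison can be conveyed with some simple arithmetic. In the sketch below, which uses invented numbers and omits the corrections for multiple substitutions at the same site that real analyses apply, substitutions are counted separately at nonsynonymous and synonymous sites, converted into per-site rates, and compared: a ratio near one is the neutral signature expected of a pseudogene, a ratio well below one indicates purifying selection, and a ratio above one suggests positive selection.

```python
# A minimal illustration of how a KA/KS ratio is read: substitutions are counted
# separately at nonsynonymous and synonymous sites, each count is divided by the
# number of such sites, and the two rates are compared. Real analyses add
# corrections for multiple substitutions at the same site; the numbers below are
# invented purely to show the arithmetic.
def ka_ks(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    ka = nonsyn_subs / nonsyn_sites   # nonsynonymous substitutions per nonsynonymous site
    ks = syn_subs / syn_sites         # synonymous substitutions per synonymous site
    return ka / ks

ratio = ka_ks(nonsyn_subs=30, nonsyn_sites=600, syn_subs=10, syn_sites=200)
print(round(ratio, 2))   # 1.0: change accumulating as if unconstrained,
                         # the neutral signature expected of a pseudogene
# Ratios well below 1 indicate purifying selection (a functional gene);
# ratios above 1 suggest positive selection.
```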

Like chromosome 20, the paper heralding the completion of chromosome 6 (in October 2003) was wholly authored by people at the Sanger Institute. Since 2001, there had been considerable developments in their annotation process. Ensembl was now more refined, and the HAVANA team was established and embarking on their extensive manual annotation. VEGA was now up and running and hosting the annotated sequence data. Built into the heart of Ensembl’s automated annotation process were two sequence-matching tools: GeneWise for exploiting protein sequence data and Genomewise for using EST and cDNA data indicative of the presence of transcribed genes (Curwen et al., 2004; Birney, Clamp and Durbin, 2004). In its design, the Ensembl pipeline had been configured to integrate and more effectively deploy existing annotation methods. In addition, it was now able to make use of sequence data on the rat (Rattus norvegicus; an animal model), another pufferfish (Fugu rubripes; with a far more economical genome than other vertebrates) and the zebrafish (Danio rerio; a model organism), as well as the mouse and Tetraodon nigroviridis. Using the protocols and standards forged in the HAWK meetings in 2002, the HAVANA group manually curated the gene structures generated through the Ensembl pipeline. Given their later role in facilitating community annotation of immune response genes in the pig, it is appropriate that HAVANA’s first formal role in human genomics concerned chromosome 6, which contains the Major Histocompatibility Complex implicated in immune response.

We will return shortly to the annotation of the remaining chromosomes, focusing on the development of Ensembl and HAVANA at the Sanger Institute. For now, with the expansion of the number of creatures for which informative sequence data was available in mind, we make a brief excursion into the development of comparative genomic resources and approaches.

As we noted in earlier chapters, a comparative genomic perspective was present in genomics from its inception. Genome sequencing projects on other species were used as pilots to aid the planning of the Human Genome Project. Furthermore, the map and sequence data of those other species were used to help construct human genome maps and sequences, by applying knowledge about comparative regions between the species. Finally, it was also envisaged that establishing a rich understanding of comparative connections between human and non-human genomes would enable the more fruitful exploitation of the human resource. In one respect, this was because experimental interventions on organisms such as yeast and animal models could then be connected to and inform human biology through genomic and other omics data. In another respect, this was because of the mooted contribution of data on other species towards enriching the annotation of the human genome.

To aid human genome annotation in this way, in December 2003, the Large-Scale Sequencing Program of the US National Human Genome Research Institute (NHGRI) established two Working Groups: one on ‘Annotating the Human Genome’ chaired by Robert Waterston and the other on ‘Comparative Genome Evolution’ chaired by Laura Landweber and John Gerhart. Both groups were tasked with identifying what new sequencing could be conducted in large-scale sequencing centres to advance human genome annotation and functional analysis. The Comparative Genome Evolution group also had to identify which organisms to sequence to shed new light on human evolution and genome evolution across eukaryotes in general. Each of the groups identified three components of research, a range of organisms and appropriate sequencing strategies (including coverage to be obtained) to contribute to these components, and indicated percentages of total sequencing capacity to be allotted to each task.

The Annotation Working Group recommended that 15 non-primate mammalian genomes be shotgun sequenced at relatively low coverage in two successive sets (known as ‘Bins’). They further indicated that other genome efforts already in progress, including for non-mammals such as the chicken, should proceed further so that complete high-quality sequences be produced to aid the identification of conserved sequences across mammals. The second component suggested by the Annotation Working Group was the high-quality sequencing of two primate genomes and relatively high-coverage shotgun sequencing of three others, to enable differences to be identified between these and the human genome. The third component was a recommendation to survey human genomic variation by sequencing 1000 people at very low coverage. The group additionally suggested that “a modest cDNA effort be included as a component of all genomic sequencing projects” to aid assembly and gene prediction.Footnote 14

The Comparative Genome Evolution working group’s recommendations ranged more deeply and widely across the tree of life, further extending the selection criterion employed by the Annotation Group by which some species would be preferentially sequenced due to representing key phylogenetic positions. Both groups also deployed other criteria to recommend particular organisms as candidates for sequencing, including: the quality of the submissions (‘white papers’) sent in by the relevant communities; the role of the organism as a model; its potential biomedical significance; its economic importance; the possibility that a genome sequence for it would enable the construction of reference sequences for closely-related organisms of biological significance; and the size and heterogeneity of the genome.Footnote 15

A Coordinating Committee (chaired by William Gelbart) then evaluated the proposals, presenting a modified set of recommendations to the NHGRI’s Advisory Council for approval in May 2004.Footnote 16 We consider this further in the following chapter when addressing different aspects of post-reference genome work on the human. For now, it is pertinent to note that in the documented assessment of species proposals by the Working Group on Comparative Genome Evolution, their conception of the communities working on these organisms and submitting white papers to the NHGRI was very much as groups of users. The evaluations that the NHGRI made of the white papers were based on the readiness of these communities for receiving the genome. Their role was envisaged as developers of proposals for the NHGRI to judge, and as groups that needed to corral the appropriate resources to make use of what the NHGRI would end up providing for them.Footnote 17 New research goals were added for subsequent rounds of sequencing additional species, such as identifying the mammalian “core genome”. The increasing apparatus and empirical basis of comparative analysis guided the number and selection of sequencing targets and the methods deployed on them.Footnote 18

Returning to the annotation of the individual chromosomes, the remaining ones that the Sanger Institute was involved with were: 13, 9, 10, X, 17 and 1. For chromosome 13, published in April 2004, the availability of a new database for non-coding RNAs, Rfam, advanced the annotation of these, which had been deemed extremely tricky as recently as the chromosome 6 paper published in October 2003. Modifications had also been made to the Ensembl pipeline to aid manual curation. With the chromosome 9 paper, published in May 2004, there was a special focus on duplications of segments of the chromosome, which were assessed using KA/KS analysis (see note 13). Having previously mapped Single Nucleotide Polymorphisms (single base changes; SNPs) against their sequence using data from the dbSNP database, for chromosome 9 the genomicists identified their own bank of SNPs by analysing the sequence data from overlapping portions of DNA fragments (clones). In May 2004’s chromosome 10 paper, the authors continued their identification of SNPs and extended this focus at the single nucleotide level by comparing 617,071 single nucleotide sequence differences between human and chimpanzee, conducting KA/KS analysis on the results to ascertain the presence of sites of selection. From this paper on, there was an increasing focus on annotating alternative splice variants, which result from transcription processes that generate multiple different messenger RNA sequences from a single gene.
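The principle behind deriving SNPs from clone overlaps can be shown with a bare-bones sketch: where two independently sequenced clones cover the same stretch of the chromosome, positions at which their already-aligned sequences disagree are recorded as candidate variants. The example below is ours and deliberately simplified: the actual pipelines also applied base-quality and neighbourhood checks before accepting a SNP, and the coordinates and sequences shown are invented.

```python
# A bare-bones sketch of calling candidate SNPs from the overlap between two
# sequenced clones: where the aligned overlapping stretches disagree at a single
# base, the position is recorded as a putative SNP. Real pipelines also used
# base-quality thresholds before accepting a variant; the data here are invented.
def candidate_snps(clone_a, clone_b, offset_a):
    """Compare two already-aligned overlapping clone sequences of equal length.

    offset_a gives the chromosomal coordinate of the first base of the overlap,
    so reported positions are in chromosome coordinates.
    """
    snps = []
    for i, (base_a, base_b) in enumerate(zip(clone_a.upper(), clone_b.upper())):
        if base_a != base_b and "N" not in (base_a, base_b):
            snps.append((offset_a + i, base_a, base_b))
    return snps

overlap_a = "ACGTTACGGATCCA"
overlap_b = "ACGTTACAGATCCA"
print(candidate_snps(overlap_a, overlap_b, offset_a=1_250_000))
# [(1250007, 'G', 'A')]
```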

In the X chromosome paper published in March 2005, there was a particular focus on the evolution of the X chromosome, and comparisons were made between it and the Y chromosome. The chicken (Gallus gallus) genome assembly was used for this analysis, in addition to the previously mentioned comparator species, many of which now had newer assembly versions. For the April 2006 paper on chromosome 17, human sequencing was conducted at the Broad Institute; the Sanger Institute’s role focused more on the sequencing of mouse chromosome 11 as part of the Mouse Genome Sequencing Project.Footnote 19 The paper was mostly dedicated to a comparative analysis of the two chromosomes and a reconstructed ancestral chromosome, with the authors focusing on an assessment of the different changes to the chromosomes that occurred in the distinct evolutionary lineages.

The final chromosome to be published, in May 2006, was chromosome 1. In the paper, the genomicists aligned the chromosomal sequence to the now-standard array of comparator species (minus the chicken) to identify regions of evolutionary conservation. This paper also represented a culmination of the increasing focus on SNPs from 2004 onwards. These SNPs were used to identify and map genomic diversity within species, to identify recombination at a higher resolution than previously possible, to detect signals of selection, and to provide a resource augmenting the utility of the reference genome (Dunham et al., 2004; Humphray et al., 2004; Deloukas et al., 2004; Ross et al., 2005; Zody et al., 2006; Gregory et al., 2006).Footnote 20 The comparative approaches and cataloguing of diversity were conducted to ease the process of developing genomic resources, by feeding into and augmenting the pipelines of the IHGSC participants. The intended use of the resources so produced, however, was generic rather than tailored to specific user communities.

Celera’s approach was quite distinct from the IHGSC effort discussed above, giving potential communities of users of genomic data a more active and participatory role than in the IHGSC and NHGRI’s annotation strategies. As noted above, Celera’s 2001 paper discussed annotation far more than the contemporary IHGSC one. It was an automated annotation that it chronicled, though, in a discussion of their Otto gene prediction system. This software was designed to weigh different forms of data constituting evidence for particular annotations, namely cDNAs and ESTs. The weighting was based on Celera’s previous experience of the manual annotation of the Drosophila genome. This approach therefore reaffirmed and reflected the process of genomic discovery promoted by Venter in the early-1990s, especially the crucial importance it conferred on protein-coding regions of the genome, as revealed by EST and cDNA sequence data. While the paper reported some computational validation of Otto’s results, it acknowledged that the “[e]xtensive manual annotation to establish precise characterization of gene structure” that was still deemed necessary lay in the future (Venter et al., 2001, p. 1317).

As their automated annotation took inspiration from prior work on Drosophila, so did their manual annotation, by using the jamboree model. Drosophila genomics was not the only inspiration, however. A challenge that Celera faced was the absence of information about the means and decision-making procedures by which the public project’s annotations were made. Therefore, to develop their own annotation capabilities, they needed to obtain institutional knowledge of how the sausage was made. To that end, they recruited Peter Li from Johns Hopkins University, who had worked on the GDB Human Genome Database and the Online Mendelian Inheritance in Man (OMIM) catalogue while there, and as a result was acutely aware of the details of the annotation process. The OMIM connection, deepened by the use of data from it in the annotation of Celera’s gene sets, was just as significant as the model of Drosophila genomics to the way that Celera manually annotated the human genome. OMIM used curators who were experts on particular diseases, with their knowledge of the relevant genetics feeding into the published data. The need for biological expertise to contribute towards the annotation—and more broadly, the contextualisation of the data that Celera was generating—was keenly felt by the company. Due to its particular sequencing strategy, it had invested considerably in computational infrastructure and expertise for the purposes of assembly rather than in acquiring biological knowledge. But because of the need to generate rich and translationally-relevant data to be incorporated into proprietary databases (such as The Celera Discovery System™), drawing on this kind of expertise was essential.

A variety of academics were therefore invited to participate in a human annotation jamboree that took place in April 2001, two months after the publication of the draft reference sequence. This jamboree built on the previous one that Celera had held on the Drosophila genome and involved some of the OMIM curators (García-Sancho, Leng, et al., 2022). The human genome jamboree presented an opportunity for participation on the part of medical geneticists who had been largely uninvolved in the IHGSC effort. They contributed their expertise, working in concert with the computational experts at Celera, and in turn were given access to the latest proprietary data on their areas of interest, as well as to the fruits of their collaboration with Celera. Following the publication of their sequence in Science in 2001, Celera kept further improvements to their assembly behind a paywall for their clients, who were primarily pharmaceutical and biotechnology companies rather than academics. At the jamboree, though, the academics could assess the sequence assemblies in regions on which they had expertise, contributing information that would not just refine the gene structures predicted by Otto, but also inform improvements to the overall automated annotation pipeline.

Celera’s involvement with medical geneticists did not end there. A further Chromosome 7 Annotation Project was initiated, prompted by a suggestion from medical geneticist Stephen Scherer to Richard Mural, the head of the Annotation Team at Celera. The result was a higher-quality re-sequenced chromosome 7 that was better connected to biomedical and clinical research due to the expertise and physical mapping data provided by medical geneticists. This provided the medical genetics community with a useful resource, as well as aiding Celera in its strategic reorientation towards identifying diagnostic and therapeutic targets.Footnote 21

The ways in which genomes are improved and connected to other forms of data are explored further in the next chapter. For now, we note that the institutional imperatives of the IHGSC and Celera shaped the design of their respective annotation processes. Annotation, therefore, emerged in ways that reflected the trajectories, networks and goals of practitioners; Celera was more open to the medical genetics community, while the IHGSC was more self-contained.

In the following section, we consider the annotation of the pig genome, an effort in which existing pig genomicists interacted closely with teams at different stages of the sequencing and analysis pipeline established at the Sanger Institute. This reflected the model of interaction between medical geneticists and Celera more than the way that annotation unfolded within the IHGSC human reference genome sequencing. Furthermore, the relationship between the existing community of researchers working on the pig and the Sanger Institute helped to shift some of the Sanger Institute’s operations towards a model closer to the community annotation advanced by Celera.

2.3 Pig Genome Annotation

Because the pig genome was sequenced after those of other species at the Sanger Institute, its annotation used an established pipeline derived from procedures that had been deployed and refined in previous initiatives, in particular the sequencing and annotation of Homo sapiens. As in sequencing and assembly, the pig project adopted repertoires established through the experience of projects on other species, while adding distinctive twists to them.

For the sequencing itself, the community of pig genomicists, through the Swine Genome Sequencing Consortium (SGSC), had contracted the Sanger Institute to do the work, rather than the project being initiated from within the IHGSC (Chap. 5). This contractual relationship did not, however, imply a hands-off approach by the community; it was intimately involved in guiding the strategic, and in some cases operational, direction of the project. Part of this direction meant indicating to the Sanger Institute where it should target sequencing efforts, so that it could focus on particular areas associated with genes of interest to individual research groups. This reflected a desire to make genome data useable as promptly as possible. As a result, even while the sequencing was still underway the community pursued annotation, the identification of SNPs and the creation of a SNP chip that captured agriculturally-relevant genetic variation.

We discuss the creation of the SNP chip in the following chapter. Here we detail the annotation effort. Just over £1.1 million of funding was secured from the UK Biotechnology and Biological Sciences Research Council (BBSRC) for 2007–2010 by the Roslin Institute (with Alan Archibald as Principal Investigator and Andrew Law as co-investigator), the EBI (Ewan Birney as Principal Investigator) and the Sanger Institute (Tim Hubbard as Principal Investigator and Jane Rogers as co-investigator).Footnote 22 These grants funded four posts: one each in Hubbard’s and Rogers’ teams at the Sanger Institute, one in Archibald’s group at the Roslin Institute and one supervised by Birney at the EBI. Two of these positions (with Hubbard and Birney) were in the Ensembl teams at the EBI and Sanger Institute. As noted above, the annotation effort began while the sequencing itself was still being conducted. As in human genome sequencing, the pig genome was scanned using algorithms to predict the presence of genomic features. Pig protein and RNA sequence data were obtained from specific databases, and data on pig cDNAs and ESTs were also downloaded from GenBank. Many of the cDNAs and ESTs had been generated by the Animal Genome Research Program at the National Institute of Agrobiological Sciences in Japan, and the Japan Institute of Association for Techno-innovation in Agriculture, Forestry and Fisheries (Groenen et al., 2012 and Supplementary Information; Lowe, 2018). These resources were generated in part using samples from cloned offspring of TJ Tabasco (Schook et al., 2005; Uenishi et al., 2012).Footnote 23

A key feature of the automated annotation in the Swine Genome Sequencing Project (SGSP) was the integration into the Ensembl pipeline of multiple forms of data already generated by the community from prior projects. These data concerned maps, Quantitative Trait Loci, and clones, in addition to the cDNAs and ESTs mentioned above. The community provided Ensembl with these rich resources to enable the annotated reference sequence to be connected with, and immediately contextualised by, other forms of data and information produced by pig geneticists. This enabled functional inferences to be made concerning parts of the genome, and inferential pathways to be constructed between the pig genome and other porcine biological data, as well as between the pig genome and the genomes of other species. With the means to generate comparisons with other mammalian genomes being a key product of the grant work, this connectivity was intended to boost the pig as a comparative model: data and the results of experiments were intended to travel along the connections forged within the species, and then beyond it. Crucially, this wider horizon was accompanied by a desire to embrace the varied research needs of the community of pig researchers in the annotation, through the addition of tracks comprising other forms of data to the Ensembl browser. This was effected through Ensembl’s Distributed Annotation System, and pig geneticists interested in adding tracks for the forms of data valuable to them were invited to contact Archibald, who was in regular liaison with teams at the Sanger Institute and the EBI.Footnote 24
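As an illustration of how such a track could be consumed, the following sketch shows a minimal DAS 1.x-style client request for features in a genomic segment. The server URL and data-source name are hypothetical, and most public DAS services have since been retired; this is a sketch of the protocol’s shape, not of any specific service used in the pig project.

```python
# A minimal sketch of a DAS 1.x client request, of the kind a group might have
# used to serve an extra data track alongside the Ensembl pig annotation.
# The server URL and data-source name below are hypothetical.
import urllib.request
import xml.etree.ElementTree as ET

DAS_SERVER = "http://das.example.org/das"   # hypothetical DAS host
SOURCE = "pig_qtl_track"                    # hypothetical data-source name

def fetch_features(chromosome: str, start: int, end: int):
    """Request annotation features for a genomic segment from a DAS source."""
    url = f"{DAS_SERVER}/{SOURCE}/features?segment={chromosome}:{start},{end}"
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    # DAS responses nest FEATURE elements inside SEGMENT elements.
    for feature in tree.getroot().iter("FEATURE"):
        label = feature.get("label", feature.get("id"))
        start_el = feature.find("START")
        end_el = feature.find("END")
        yield label, (start_el.text if start_el is not None else None,
                      end_el.text if end_el is not None else None)

if __name__ == "__main__":
    for label, span in fetch_features("15", 1_000_000, 1_100_000):
        print(label, span)
```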

There were therefore multiple kinds of community involvement in even the automated annotation of the pig genome. The community helped to define the nature of the annotation, taking advantage of the clone-based sequencing to squeeze as much use as possible out of the products of sequencing and assembly: by integrating assembly and annotation, and by incorporating data and resources already developed by the community into the pipeline or through the Distributed Annotation System. This was particularly important, as the resource limitations of the overall genome project entailed a trade-off between comprehensiveness and utility, with the community opting for a rough-and-ready but immediately exploitable resource over aspirations of completeness.

This meant that the drawbacks of automated annotation, well-appreciated by the Ensembl team even for the more refined human genome, were greater still for the pig genome. As Jennifer Harrow reported to us, the algorithms at the heart of Ensembl were only as good as the assemblies they were working on, and for the pig these were incomplete and of lower quality than for the human. Manual curation of the data by the biologically-trained members of the HAVANA team was therefore more critical for improving and developing the initial annotation of the pig genome produced by the Ensembl pipeline than it was for human or mouse.Footnote 25

As with human genome sequencing, the annotated sequences produced through the Ensembl pipeline were published in the Ensembl database, while additional manual annotation was published on the HAVANA-led VEGA database, built on the Ensembl database.Footnote 26 HAVANA worked closely with some of the members of the pig genomics community, such as Christopher Tuggle at Iowa State University. James Reecy, an animal geneticist in Tuggle’s group, spent his faculty leave (equivalent to a sabbatical) with them from September 2007 to August 2008. Like many pig geneticists, Reecy worked on multiple livestock species, in his case primarily cattle. Reecy was interested in developing skills in manual annotation and areas of programming, and HAVANA had put together the most comprehensive approach to manual annotation in the world at the time. He was able to pursue this because of the close interactions between the pig genomics community and leading figures at the Sanger Institute, which we saw in Chap. 5. During his visit, Reecy met with Jane Rogers, Tim Hubbard and Richard Durbin, as well as Jennifer Harrow and Jane Loveland of HAVANA, discussing what he could offer in situ at the Sanger Institute. Aided by his demonstration that an animal geneticist could pick up the techniques of manual annotation, Reecy’s advocacy of community involvement in annotation met a receptive audience in the HAVANA team.

As a result, HAVANA decided to dedicate more attention to manual annotation than they had been contracted to do and in so doing developed new means of manually annotating a genome.Footnote 27 This new model took two forms. HAVANA consulted with the SGSC members on an informal basis for guidance on what precise parts of the genome they wanted special attention paid to. This was a continuation of the targeted approach to sequencing and meant that the annotation could be preferentially refined in particular regions of interest to researchers. In the process, information was fed back to the assembly team if a problem was detected in the course of the manual curation.Footnote 28 As the annotation started while the reference genome was being assembled,Footnote 29 this allowed it to feed into the assembly (and even inform the amendment of algorithms in automated assembly pipelines), as well as adding value to the eventual sequence.

Additionally, HAVANA shifted its mode of operation, developing new capabilities in education, training and engagement to increasingly function as community annotation facilitators, providing the pig geneticists with the tools, training and assistance so that they could annotate the genome themselves. This began with a training programme hosted at the Sanger Institute in July 2008. While this event was labelled as a “jamboree”, it differed from the Drosophila and human jamborees organised by Celera. Rather than just annotating the genomes in situ, the Sanger Institute event was intended to equip the researchers to go back to their own institutions and conduct annotation on regions of the genome pertinent to their existing research projects there. Abridged guidelines were created for pig annotation, due to the need to do the annotation quickly because of resource constraints, but also to economically document the key processes and procedures for these amateur annotators scattered around the world. Conference calls were used to share problems, observations and advice, but a manual was still needed for the HAVANA facilitators to refer to, and for the manual annotators to consult in their own offices and labs between meetings (see Fig. 6.3).

Fig. 6.3
Cover and selected page (titled “Building new gene objects”) of a manual, dated June 2008, produced by the HAVANA team for use by manual annotators of the pig genome community. From personal papers of Alan Archibald, “Pig Sequencing” folder, obtained 17th May 2017. Reproduced with permission, courtesy of Alan Archibald and the Human and Vertebrate Annotation group at the Wellcome Trust Sanger Institute. For a larger version of this figure that can be zoomed in and out, see https://www.pure.ed.ac.uk/ws/portalfiles/portal/314800096/higheres_fig_6_3.pdf

This community annotation effort was aided by the availability of the Otterlace/ZMap system, which combined a relational database and a graphical interface for the manual annotators to use (Loveland et al., 2012; Dawson et al., 2013). In turn, HAVANA used its close working relationship with the pig genomicists to develop its own tools and annotation processes.

The initial step in the manual annotation process was the computational alignment of multiple forms of data from the pig, and from other species such as human and mouse, onto the S. scrofa genome assembly. A crucial feature of the Otterlace/ZMap manual annotation system used by HAVANA and VEGA was that it enabled annotation of an ongoing assembly rather than just individual clones, which was all that previous curation tools had allowed users to annotate (Searle et al., 2004). This functionality was helpful to pig genomicists, who wanted to promptly exploit and further augment the sequences so assembled. It meshed with the more significant role that manual procedures had in the annotation of the S. scrofa reference genome. The combination of the automated pipeline with bespoke manual annotation distributed across laboratories around the world blended Stein’s factory and cottage industry models, and was therefore different from the case of Ensembl discussed above (Lowe, 2018).Footnote 30

This initial curation created a visualisation that displayed the sequence data along with a further layer of information indicating evidence for the possible presence of genes. With this, anyone with an account could log in to the Otterlace/ZMap system and start to annotate a chosen gene. The annotator could weigh the different forms of evidence presented to them, and amend the model of the gene according to that evidence and any specific knowledge of the gene that they had. They could then submit it for inspection by a HAVANA team member, who would work on it further to finish the annotation to the required standard.Footnote 31
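The shape of this workflow, in which an annotator amends a gene model in the light of the displayed evidence and then passes it on for review, can be sketched with a simple data structure. This is an illustration of the process described above, not the actual Otterlace data model; the gene symbol, accession and coordinates are invented.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Evidence:
    kind: str                      # e.g. "pig cDNA", "human protein alignment"
    accession: str
    exons_supported: List[Tuple[int, int]]

@dataclass
class GeneModel:
    symbol: str
    exons: List[Tuple[int, int]]
    evidence: List[Evidence] = field(default_factory=list)
    status: str = "automated"      # automated -> community_edited -> approved

    def amend_exons(self, new_exons: List[Tuple[int, int]], note: str) -> None:
        """Community annotator adjusts exon boundaries in light of the evidence."""
        self.exons = sorted(new_exons)
        self.status = "community_edited"
        print(f"{self.symbol}: exons amended ({note})")

    def submit_for_review(self) -> None:
        """Hand the edited model over for quality control."""
        print(f"{self.symbol} submitted for review")

    def approve(self) -> None:
        """Reviewer signs off the model to the required standard."""
        self.status = "approved"

if __name__ == "__main__":
    model = GeneModel("IL18", exons=[(100, 250), (400, 620)])
    model.evidence.append(Evidence("pig cDNA", "EX000001", [(100, 250), (380, 620)]))
    model.amend_exons([(100, 250), (380, 620)],
                      "5' boundary of exon 2 moved to match cDNA")
    model.submit_for_review()
    model.approve()
```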

In the earlier annotation of the human genome, as well as for well-funded model organisms such as the mouse, HAVANA had generally performed manual annotation wholly in-house. Its role was quite different for the pig: instead, it conducted education and training to enable researchers to manually annotate genes themselves, with the HAVANA team then performing quality control on the results. The only other species for which HAVANA was providing community annotation support at the time was cattle (Bos taurus). There were, though, weaker interactions between HAVANA and the cattle genomics community, partly because that community’s greater funding meant that a close relationship was less necessary, but also because of the less-established links that the Sanger Institute had enjoyed with members of this community compared to pig genomicists (Chap. 5).

Parallel to HAVANA’s tasks, the pig genomics community itself helped to organise the manual annotation activity. As bioinformatics coordinator, Reecy led the community side of the work and provided training on manual annotation in the USA and China. In Scotland, training was also provided by the Roslin Institute. With Reecy, Iowa State University colleague Zhi-Liang Hu set up a website listing the genes and gene families that were candidates for manual annotation, and individual researchers were invited to indicate which they intended to annotate. This has been described by Reecy as an “adopt-a-gene type approach”, building on the targeting strategy in the sequencing phase.Footnote 32 The community did not have the resources to manually curate the whole genome to a high standard. They needed to maximise the utility of the genome for their particular research purposes, and for this, selectivity and a distributed division of the work were appropriate. The value of the genome was therefore not primarily assessed in terms of generic metrics, even if data on the number of genes annotated still constituted a useful barometer of progress. The key was the utility of what had been done, not the extent of it; such concerns with completeness were more of a priority for the IHGSC. The pig community assessed the S. scrofa genome in terms of its use as a research tool for their own purposes. They were themselves deeply attuned to what was required in the agricultural and other translational domains towards which they worked.Footnote 33

For this manual annotation, particular groups were established based on common research and translation interests. Some of these focused on resolutely structural elements such as repetitive sequences, while others operated in areas where the line between structural and functional was blurred. Examples of the latter were the groups that aimed to annotate genes and analyse genomic regions relating to olfaction, immune response, and retroviral insertions into pig DNA such as Porcine Endogenous Retroviruses (PERVs). The range of interests of the pig genome community was reflected in these groups. In addition to the interests listed above, genomicists working on domestication and the relationships between the sequenced domesticated pig and European and Asian wild boar contributed analyses to the publication heralding the reference sequence (Groenen et al., 2012).

It was the involvement of the pig genomics community in annotation processes that helped to blur the line between structural and functional annotation. This is illustrated by the most developed of the annotation groups, which became the Immune Response Annotation Group (IRAG) and continued its activities well beyond the initial analysis of the reference genome. IRAG comprised 51 researchers based in thirteen institutions in China, France, India, Italy, Japan, the UK and the USA. There had been considerable work on immune response prior to genomic research, as we showed in Chap. 5. Further, a high-quality manually annotated sequence of the pig’s MHC (the Swine Leucocyte Antigen complex, or SLA) was published in 2006, as a result of work by Laboratoire Mixte CEA-INRA de Radiobiologie Appliquée (CEA-INRA), Genoscope, Tokai University in Japan, and the Sanger Institute. The HAVANA team and the CEA-INRA group (in particular, Christine Renard) performed the manual annotation of the SLA region (Renard et al., 2006). This region did not therefore need to be developed further in the subsequent ‘immunome’ project.

This ambitious ‘immunome’ project and group arose out of discussions between researchers at CEA-INRA and Iowa State University, in particular Claire Rogel-Gaillard at the former and Christopher Tuggle at the latter. They each had straightforward motivations for establishing this effort, since both worked on the immunogenetics of the pig. We have already encountered Rogel-Gaillard, part of the team at CEA-INRA (and later, just INRA) that had adopted genomic approaches to investigating immune response. This had involved studying, from the 1980s, the dense polymorphic regions containing genes implicated in immune response, as well as investigating PERVs in the late-1990s, which had implications for the prospective xenotransplantation of pig organs and tissues into humans. Together with Patrick Chardon, she had led the development of the YAC and BAC libraries of pig DNA to aid those research efforts (Chap. 5). Her research interests had increasingly been directed towards studying the genetics of immune response variability in terms of pig health and resilience against disease. Tuggle’s research had trended in a similar direction, though from a different origin: his work in the 1990s was at the heart of the mapping endeavour to identify (and then exploit) genes and Quantitative Trait Loci primarily involved in livestock production traits in pigs.Footnote 34

From this nucleus, a call for interested parties was issued, and once the participants were confirmed, the group set about seeking data from databases and the literature to identify a list of genes to annotate.Footnote 35 Once this list was agreed and the rules for annotation established, particular sets of genes were assigned to individual teams. The approach embodied the advantages and disadvantages of distributed, targeted community annotation: while researchers could apply their expertise to particular regions, some regions went unadopted, for instance those whose lower sequence quality made them difficult to annotate, or those that simply did not contain genes of interest.Footnote 36

Reecy provided training for the group’s annotators in a workshop, but beyond that people worked in their own offices and labs, using Otterlace. Annotators would be able to see the analysis for their particular region, with the data tracks (for example the RNAs aligned to it) depicted. They would also be able to use the software tools to tweak the predictions made at the Sanger Institute.Footnote 37 The work was coordinated, and credit negotiated, in regular conference calls, using the Webex videoconferencing application to share screens. Jennifer Harrow had overall oversight at the HAVANA end, which included making the decisions about which annotations to exclude. She guided Jane Loveland in the day-to-day management, coordinating annotation between different groups, showing annotators how to use tools and access data, conducting quality control on the annotations and giving feedback. The motivation for HAVANA was to enable communities to take on as much of the task of annotation themselves as possible, both as a general aim and a particular solution for the resource-poor pig genomics community.Footnote 38 While the HAVANA team primarily supplied support for the informatics aspects of the manual annotation, on the community side a trio of coordinators—Rogel-Gaillard, Tuggle and Harry Dawson—guided the effort with a view to making the resulting annotated sequence as valuable as possible for those who would make use of it. Dawson, based at the USDA’s Beltsville facility in Maryland, monitored which genes were being annotated, following up on any genes that remained unannotated. He also conducted cross-species comparative analyses based on the annotation data he compiled from the whole project.Footnote 39 Dawson had led the development of the Porcine Immunology and Nutrition (PIN) Database at Beltsville, which was launched in 2005 containing data on 2600 annotated pig genes, with gene expression data linked to information on gene function. The database (now known as the Porcine Translational Research Database) was configured to enable users to identify genetic pathways related to genes of interest and to connect to human and mouse databases for comparative purposes, as well as to other pig genomic databases (Dawson et al., 2007).Footnote 40

Because the annotation began with a panel of genes, rather than simply annotating the assembly that was there, genes missing from the assembly could be identified, and therefore areas of the assembly that needed further work could be pinpointed. Indeed, the annotation was conducted using version (build) 9 of the swine genome, and its results fed into the newer and improved version 10.2. The annotators refined the models of 1369 genes and elucidated 3472 transcripts from these, around a third of which were inferred using only data from other species. They extended the analysis concerning genes under positive selection undertaken in the 2012 Nature paper announcing the reference sequence. And finally, the group used transcriptomic data derived from experiments to discern the role of some of the genes involved in immune response, to identify networks of co-expression of genes, and to annotate accordingly (Dawson et al., 2013).

This work was motivated by the prospect of direct translational impact, which gave the group clear indications of where to target its focus and how to structure the division of labour within the project. To achieve the translational ends of the researchers involved, the methods and approaches employed in the project were comparative, and explorations of function were knitted together with examinations of diversity and evolution.Footnote 41 For example, inferences that the researchers made about the evolution of genes accompanied functionally-oriented transcriptomic studies. Genes identified for their putative function enabled both the functional and structural annotation of the genome to be improved. And these in turn fed into the refined assembly of the genome itself.

Concerning the improvement of the reference genome as a community-generated resource, we close with an account of the sequencing and annotation of the pig’s X and Y chromosomes. This project filled the gap left by the SGSP, which had excluded the sex chromosomes due to the complexities involved in their sequencing. The sequencing of the sex chromosomes therefore finally completed a reference sequence for the whole of the nuclear genome of S. scrofa. This project also shows how the existing community of pig genomicists was able to broker and contribute to a collaboration between the Sanger Institute and an external group of researchers who had been working on these sex chromosomes for both biomedical and agriculturally-oriented purposes.

This project involved the EBI and the Sanger Institute, was funded with a BBSRC grant, and used infrastructure and work supported by the European Commission and the Wellcome Trust, much like previous work we have described. It did not, however, involve any of the ‘usual suspects’ from the pig community as a collaborative partner, but rather a group based in the Department of Pathology at the University of Cambridge which had been consistently investigating the sex chromosomes of the pig since the turn of the century.Footnote 42 Their research had a dual aspect, being motivated by biomedical objectives as well as being supported by a major pig breeding firm, the Pig Improvement Company (PIC), due to the implications of the genetics of sperm development and male fertility for breeding purposes.Footnote 43 The Cambridge University-led arm of the sequencing and annotation of the pig X and Y chromosomes was also conducted in collaboration with PIC. A key figure in the mapping of individual genes relating to sperm fertility was Andy Day. His funding came from PIC, for whom he had worked since leaving university in 1995 and by whom he continued to be employed until 2006. Day’s research at the University of Cambridge used comparative approaches to exploit the more plentiful and refined data and resources concerning the human genome to aid in the mapping of specific genes in the pig (Day et al., 2003; Kollers et al., 2006). One of his collaborators, Claire Quilter, approached human–pig comparative genomics from a medical genetic angle: she worked on the role of the Y chromosome in male infertility, and on Turner syndrome, a condition that affects women and involves the lack of all or part of an X chromosome.Footnote 44

In the early-2000s, Quilter had been the lead author of a paper that surveyed porcine sex chromosomes, identifying and mapping 19 genes onto them. For this, she made use of the PigEBAC library developed by the Roslin Institute and the UK Human Genome Mapping Project Resource Centre. This work explored the evolutionary consequences of this mapping data, in part by comparing the order of genes determined on the porcine Y chromosome with the corresponding order of those genes on the human and mouse Y chromosomes (Quilter et al., 2002). As well as representing a convergence of biomedical and agriculturally-inclined research, it also presaged the entanglement of comparative, evolutionary and functional studies that would be further realised in the work conducted with the Sanger Institute, and also the relationship between systematic and functional genomics explored in Chap. 7.

The X and Y chromosomes were an interesting challenge for the HAVANA team, due to the high level of conservation in X chromosomes and the tricky genomics of the Y chromosome. The Y chromosome contains a lot of repetitive sequences and degenerated genes due to its near-complete isolation from recombination with the X chromosome during meiosis.Footnote 45 In the original reference genome operation by the SGSP, some limited sequencing of the Y chromosome had been conducted using clones from the DNA libraries derived from males. However, only 11 clones were sequenced—in a draft rather than finished condition—and a limited number of scaffolds containing positioned contigs were placed on the chromosome: hardly an assembly (Groenen et al., 2012, Supplementary Information).

On the sequencing side, the X and Y chromosome project began in 2009 under the leadership of Jane Rogers; when she left the Sanger Institute, it was taken over by Chris Tyler-Smith, a human evolutionary geneticist. Both sides of the project were funded by the BBSRC for three years, with the Sanger Institute being allotted £1,369,161 to Cambridge’s £349,639.Footnote 46 The endeavour would contribute an improved assembly and annotation of the X chromosome and the first assembly and annotation of the Y chromosome.

Beyond the original pig genome sequencing, the X and Y work benefited from a change in mapping techniques and improvements to sequencing techniques.Footnote 47 Optical mapping was used to build a new assembly of the X chromosome. To conduct this, Kerstin Howe—who led the team that analysed, validated and improved genome assemblies such as the pig one—worked alongside David C. Schwartz, who pioneered the method for eukaryotes.Footnote 48 Optical mapping does not require library clones, and so obviates the need to reconstruct the order of clones. It was therefore useful in correcting problematic repetitive regions that are difficult to resolve using clone-based mapping. The new optical map enabled a corrected assembly to be produced, which was then improved further, for example through targeted sequencing to close gaps and resolve assembly problems. This improved assembly in turn enabled an improved annotation, with 690 protein-coding genes annotated, a considerable advance over the 422 in the original (for Sscrofa10.2), and with increased numbers of non-coding genes and pseudogenes identified as well. As with the SGSP, there was close interaction between the annotation and assembly teams at the Sanger Institute.
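The principle behind using an optical map to check an assembly can be illustrated with a small sketch: restriction-fragment sizes predicted from the assembly sequence are compared with the fragment sizes measured on the optical map, and large discrepancies point to candidate mis-assemblies. The toy sequence, recognition site and measured sizes below are invented for illustration and greatly simplify the actual procedure.

```python
# A minimal sketch of the comparison underlying optical-map validation.
def in_silico_digest(sequence: str, site: str) -> list:
    """Return fragment lengths produced by cutting at every occurrence of `site`."""
    cuts, pos = [], sequence.find(site)
    while pos != -1:
        cuts.append(pos)
        pos = sequence.find(site, pos + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]

def flag_discrepancies(predicted: list, measured: list, tolerance: float = 0.1) -> list:
    """Report fragment indices where predicted and measured sizes differ markedly."""
    flags = []
    for i, (p, m) in enumerate(zip(predicted, measured)):
        if abs(p - m) > tolerance * max(p, m):
            flags.append((i, p, m))
    if len(predicted) != len(measured):
        flags.append(("fragment count differs", len(predicted), len(measured)))
    return flags

if __name__ == "__main__":
    assembly_region = "AAGCTTACGTAAGCTTGGGTTAAGCTTCC"   # toy sequence
    predicted = in_silico_digest(assembly_region, "AAGCTT")
    measured = [6, 10, 13]                               # toy optical-map sizes
    print("predicted fragments:", predicted)
    print("discrepancies:", flag_discrepancies(predicted, measured))
```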

For the Y chromosome, a bespoke library was created using DNA from a Duroc boar (the same breed as the originator of the CHORI-242 clones from which the bulk of the reference sequence was derived) donated by Genus, the company that incorporated PIC. At the Sanger Institute, a fingerprint contig map of overlapping clones was produced from this library, and this formed the basis of a minimum tiling path to guide the sequencing and assembly. The outputs of multiple sequencing platforms were used and combined, and the resulting assembly was then improved further, as with the X chromosome, to bring the sequence towards ‘Finished’ standard. This updated assembly was validated using PacBio long-read technology on the same clone library as had been used for the original sequencing conducted by the SGSP, and this validation affirmed the high quality of the new assembly.
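The idea of a minimum tiling path can be illustrated with a small sketch: from a set of clones whose mapped positions overlap, clones are chosen greedily so that as few as possible cover the region of interest. The clone names and coordinates below are invented for illustration and do not correspond to the actual Duroc library.

```python
# A minimal sketch of choosing a minimum tiling path from mapped, overlapping
# clones: repeatedly pick the clone that starts within the region already
# covered and extends coverage furthest.
from typing import List, Tuple

Clone = Tuple[str, int, int]   # (name, start, end) on the fingerprint map

def minimum_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[Clone]:
    clones = sorted(clones, key=lambda c: c[1])
    path, covered_to, i = [], region_start, 0
    while covered_to < region_end:
        best = None
        # Consider every clone that starts at or before the current coverage edge.
        while i < len(clones) and clones[i][1] <= covered_to:
            if best is None or clones[i][2] > best[2]:
                best = clones[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"Gap in clone coverage at position {covered_to}")
        path.append(best)
        covered_to = best[2]
    return path

if __name__ == "__main__":
    library = [("cloneA", 0, 180), ("cloneB", 150, 320), ("cloneC", 40, 210),
               ("cloneD", 300, 480), ("cloneE", 200, 360)]
    for name, start, end in minimum_tiling_path(library, 0, 480):
        print(name, start, end)
```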

For both the X and Y chromosomes, annotation involved the alignment of various EST, messenger RNA, and protein sequence data against the sequence. This was performed through the Otter annotation pipeline, and the results then underwent manual curation by the HAVANA team, using the Otterlace/ZMap tools according to the procedures developed for both human genome annotation through GENCODE (Chap. 7) and the immunome project (Skinner et al., 2016 and Supplementary Information).Footnote 49 The Y chromosome assembly subsequently became incorporated into the updated Sscrofa11.1 assembly, which became the reference genome (at ‘representative genome’ level in RefSeq) for the pig in 2017 (Warr et al., 2020).

Cambridge University’s side of the project involved identifying shared regions between the two chromosomes, both to aid their sequencing and to trace their evolutionary history; identifying functional genes and non-coding sequences on the Y chromosome; and locating and analysing HSFY, a gene also found in cattle, in order to study chromosomal evolution across pigs and closely-related species. The project was explicitly designed so that the insights gained about the structure of the chromosomes and the location of repetitive sequences would inform their sequencing and assembly, but also guide the exploitation of the data.Footnote 50 This research was therefore a good example of the functional and systematic synergies that are explored further in the following chapter.

It also shows how the specific genetic expertise of a group of researchers newly admitted to the community of pig genomicists fed into and informed the highly-developed pipelines and expertise at the Sanger Institute. Here, the Sanger Institute did not conduct this work merely at its own initiative or at the behest of the Wellcome Trust or an international collaboration like the IHGSC. Nor was it merely contracted to perform the work, as per the original relationship with the pig genomicists. Instead, building on the relationships developed through pig genome sequencing, which intensified as attention was directed towards annotation and the development of a new community-oriented model of it, the X and Y project constituted a more horizontal peer-to-peer collaboration from the start. This collaboration involved the highly-refined infrastructures and personnel of a large-scale genome centre. It incorporated a community of pig genomicists with a core of operators, such as Alan Archibald, who married a drive towards the development of genomic resources intended for wide use with a sensitivity to the particular uses to which they could be put. And finally, it included an existing set of researchers seeking to conduct sequencing and annotation pertaining directly to their ongoing interests.

The X and Y project instantiates deep entanglements between different models of sequencing and annotation. It challenges strict demarcations and distinctions, and also the linearities indicated by presumed separations between stages, whether in particular projects or pertaining to the wider development of genomics. Who would dare reduce this X and Y project—or any part of it—to a singular form of annotation along the lines of Stein’s ideal types, or even to any of the strategies pursued in prior genomics projects such as the genome centre model of the IHGSC, or the distributed model of the European Commission-funded Yeast Genome Sequencing Project? Instead, as the progression of pig genomics illustrates, aspects of these models were mobilised and combined, mediated by the historical trajectories of the actors coming together to form particular projects.

3 Annotation Strategies and Lineages of Genomics

In examining the different models of reference genome annotation for yeast, human and the pig, this chapter has begun to explore the development and use of genomic resources beyond the determination of the nucleotide sequence of the reference genome. This broader perspective expands the range of narratives that historians can mobilise to capture genomics as an ongoing and multifaceted endeavour, moulded in distinct ways by different communities.

The yeast genome annotation followed the distributed-but-hierarchical model of the European Commission’s sequencing project, with a key role for MIPS as the bioinformatics coordinator. The centralisation through MIPS reflected the division of labour of the sequencing across multiple, often small, laboratories and the need for a genome-wide perspective for some forms of genome analysis that the consortium wanted to perform. In this model, we see a strict separation of structural from functional annotation.

The human reference genome, on the IHGSC side, involved the development of the Ensembl pipeline and HAVANA to automatically and then manually annotate the sequence data. IHGSC institutions progressively added new sources of data and methods for the annotation of various elements in the human genome, such as protein-coding genes. Compared with Celera’s approach, this involved far less interaction with wider communities of researchers, and instead a concentration on developing pipelines and repertoires to improve the quality and extent of annotation, without directing or targeting it towards particular users. The aims and operations were therefore internal to a community of specialist genomicists, institutions and operatives, who sought to improve the output as measured by general metrics and guided by an ideal of completeness.

This, as we have seen, was not a fixed or essential characteristic of the genome centres, the key institutions in the IHGSC model. In the case of the Sanger Institute, for example, the relationship of some of its departments and key personnel to a well-coordinated pig genome community effected a change in the way this institution worked. As a result, the model and results of the annotation of the pig genome were quite distinct from those of the human annotation that preceded it.

Some of this was driven by resource constraints that limited the quality of the pig genome assembly in some respects, making manual curation more crucial in correcting the automated predictions. As funding would only go so far in paying for in-house manual curation, the community needed to take up the slack. The extent to which they were able to do this owed much to the community’s own history of coming together to coordinate the work of identifying genetic markers, compiling and integrating genetic, cytogenetic and physical maps, and creating databases and materials (such as genome libraries and radiation hybrid panels). They pursued the creation of genomic resources because they knew what kinds of data they needed to advance their own research. Together, they advanced their overall endeavour of improving the genomic reference resources concerning the pig, secured pots of money from various sources to do so, and then worked out how to stretch what they had as far as they could. This accommodated, but also drew upon, the heterogeneous but often overlapping interests held across the pig genome community. For its members, as for those of the yeast genomics community, genomics has constituted a nexus: multiple different interests could draw upon the resources generated through it, with those interests and motivations also shaping the creation of those resources in distinctive ways.

Indeed, a reference genome is a creative and dynamic product. The selection of the materials that are used in its creation and the decisions made in sequencing and assembly reaffirm that. It matters what libraries are used, what methods are used in sequencing and assembly, and what is or is not targeted for special treatment to refine sequence quality. This is even more the case for annotation. Annotation is affected by the prior steps, but in turn, what is annotated can feed back to further develop the assembly. It will also affect what the genome can be used for. The model of distributed community annotation—involving individuals, laboratories and groupings of researchers interested in genes with particular hypothesised functions—guided the annotation of the pig genome towards those regions deemed useful for proximate research purposes. In terms of the allotting of work, there was a similarity with the yeast genome sequencing network, though for the pig it was less hierarchical and comprehensive, and more discretionary.

The activities of the SGSP more generally, and IRAG and the X and Y chromosome sequencing more specifically, involved a wider set of actors, approaches and interests than the IHGSC. IRAG involved members of an existing community of pig genomicists that dated back to at least the 1990s. The project to sequence the X and Y chromosomes, though, showed how that community still had the ability to form new connections.

While the scale, speed and automation of sequencing operations had all increased at the Sanger Institute, this did not intensify the tendency we observed in the IHGSC effort: the narrowing of participation and the concentration of operations in-house (Chap. 4). Indeed, the Sanger Institute, and in particular the HAVANA group, opened out to and engaged with a specific external community to develop new genomic resources, tools and expertise through the assembly and annotation activities of the SGSP, IRAG and the X-Y project. That community shaped the direction of various aspects of the sequencing process, in so doing affecting the nature of the product. The Sanger Institute, at a time when it was adjusting to the period following the ‘completion’ of the human reference sequence and of each chromosome, itself changed the way it worked as a result.

In considering how the Sanger Institute and the pig genomics community shaped their emerging community annotation strategy and practices, we observe that the cottage industry model (Stein, 2001) needed to be implemented in combination with factory-style approaches. These genomicists, therefore, deployed modes of annotation regarded as characteristic of earlier ‘pre-genomic’ stages, in conjunction with the concentrated factory style that came to dominate the sequencing of the human reference genome. This challenge to the idea of progression through distinct and separate models and stages of activity is an important historiographical consequence of our account of pig genome annotation.

As well as helping to re-shape the way that HAVANA operated, the work of pig genome annotation fed into the processes of assembly, automated annotation and indeed manual annotation itself. This was enabled by the temporality of annotation that existed in the pig genome project, with manual annotation occurring alongside ongoing assembly. The manual annotation was therefore able to help correct the assembly as well as contributing to the improvement of automated prediction algorithms. The pig genome community conceived the genome they were helping to produce as provisional and incomplete; their attitude was one of satisficing (on satisficing, see Wimsatt, 2007).

Of course, as we see at the outset of the following chapter, reference genomes are never complete; they are always subject to changes intended to improve their quality and utility. But the pig genome community did not hold an ideal of completeness or comprehensiveness to be paramount in the creation of the first reference assemblies. In one respect, they shared this attitude with Celera. For Celera, the very provisionality of their human sequence was its selling point; it was important that the publicly-available data it had released in 2001 quickly became outmoded, and that it was widely known to be so. This was to make access to the continually-improved genome and associated data that they held behind a paywall more valuable to potential subscribers. It was this commercial strategy, along with the model of OMIM and their experiences with Drosophila sequencing and annotation, that encouraged Celera to forge collaborations with medical geneticists who had been peripheral to the IHGSC.

We have shown that distinctions between manual and automated annotation, annotation and assembly, and functional and structural annotation should all be qualified. In the next chapter, we demonstrate something analogous as we explore the changing relationship between the functional and systematic genomic research that followed the initial sequencing and annotation of the reference genomes of our three species.