The Ensembl Genome Browser: Strategies for Accessing Eukaryotic Genome Data

The Ensembl Genome Browser provides a wealth of freely available genomic data that can be accessed for many purposes by genetics, genomics, and molecular biology researchers. Herein we present two protocols for exploring different aspects of these data: a phenotype and its associated variants and genes, and a promoter and the epigenetic marks and protein-binding activity associated with it. These workflows illustrate a subset of the data types available through the Ensembl Browser, and can be considered a springboard for further exploration.


Introduction
Genome browsers are resources that integrate data at the genomic level, thereby allowing visualization of related genomic information in one space. These data can include genes, noncoding elements that regulate gene expression, genetic variation and the results of comparative genomics analyses, among other forms of annotation (Fig. 1) [1][2][3][4]. Commonly used genome browsers include Ensembl, the UCSC Genome Browser [5] and IGV [6].
The Ensembl project was initially launched in 1999 with the aim of developing methodologies for automatic annotation of (human) genomic sequence with genes and their constituent transcripts [7]. Since that time, the project has broadened substantially in scope; the Ensembl Genome Browser [8], which came online in 2000, now includes reference genomic sequence and annotation for nearly 100 chordate organisms. Ensembl is rapidly incorporating new data, including whole clades of new species' genomes and reference sequence for multiple strains of existing species, such as mouse. In addition, existing annotation is regularly augmented by the inclusion of new data sets. Ensembl's sister site, Ensembl Genomes, provides access to nonvertebrate genomes through dedicated portals for Bacteria, Fungi, Plants, Metazoa, and Protists [9,10].
Ensembl data, annotations, and analyses are updated every 2−3 months, alongside software updates to both the public-facing website and the underlying databases. Prior releases are frozen as archive sites, and from Dec 2013 (Ensembl version 74) will remain accessible via our web interface for at least 5 years following their initial release. A dedicated site is also maintained for the GRCh37 reference human genome assembly, which is annotated with new data on a limited basis (Fig. 2) [11]; partial data from ongoing genome annotation can be accessed via the preview Pre! site.
Data from Ensembl can be accessed at multiple scales. In this chapter, we describe data access through the browser web pages and via BioMart [12], a web-based tool that allows customized retrieval of data from the Ensembl databases. However, data can also be accessed programmatically via our Perl and REST APIs [13,14]. Files containing genome-wide data are available for all species represented in Ensembl via an FTP site [15]; data from all releases of Ensembl can be retrieved from the FTP site, or from our databases via the Perl APIs, in perpetuity.
Fig. 1 Ensembl features. Ensembl integrates gene annotation, genetic variation, gene regulation data, and comparative genomics onto a single genomic platform. Gene annotation is carried out in house, annotating the full intron-exon structure of coding and noncoding transcripts. Short variants, such as SNPs and indels, are pulled into Ensembl from external databases, alongside structural variants and copy-number variants. ChIP-seq and DNase-seq data are used for in-house prediction of regions of open chromatin and regulatory elements such as promoters, enhancers and CTCF binding sites on the genome and their activity in different cell types. Whole genome alignments and gene tree analyses are carried out in house to compare species in Ensembl. These data are presented alongside each other on the genome in the Ensembl browser, and can also be accessed for bulk export through BioMart, programmatically through APIs and as flat-files on the Ensembl FTP site.
Beyond providing access to data related to publicly available genome annotation, Ensembl integrates a number of tools designed to process or analyze your own data. The ID History Converter converts Ensembl IDs from a previous release into their current equivalents, while the Assembly Converter maps genomic coordinates from one version of a genome assembly to another. The Variant Effect Predictor predicts the functional consequences of a set of known and/or novel variants [16]. Sequence alignment using BLAST and BLAT against Ensembl genes, genomes and proteins is also available [17,18], along with a suite of tools developed as part of the 1000 Genomes Project [19] that can be accessed on the dedicated GRCh37 browser site [11].
Fig. 2 The Ensembl homepage. The Ensembl homepage provides access to a search function, which can retrieve information associated with, for example, genes, transcripts, proteins, variants, phenotypes, and ontology terms. In addition, links are available to Ensembl's sister site, Ensembl Genomes, as well as to the most-searched genomes and a complete list of annotated genomes. Fully annotated genomes are available on the main Ensembl site, while genomes whose annotation is in process can be browsed on the Ensembl Pre! site. Ensembl maintains web interfaces of archived versions for 5 years. These can be accessed from a link in the lower right-hand corner. Documentation and help pages can be accessed from the homepage, as well as in-house and external tools integrated into the Ensembl web interface. A dedicated page describing data-download strategies is also available and presents links to the point-and-click tool BioMart, which permits bulk download of Ensembl datasets with no requirement for programming expertise, as well as to the APIs and FTP site.
In this chapter we describe two workflows showcasing a subset of the data available in the Ensembl browser and indicating possible routes to access them. First, we demonstrate a phenotype-centric search highlighting variation data associated with genes and transcripts. Secondly, we present a gene-centric search illustrating gene and transcript models, and the exploration of regulatory features in the region of a gene. In each case we also indicate strategies for data export via BioMart. Those interested in our annotation methods, in programmatic access to Ensembl data, or in exploring other forms of data and annotation are encouraged to refer to our publications [20].

Materials
Computer, Internet connection.
An Internet browser: recent versions of Firefox, Chrome, Safari, and Internet Explorer are supported.

Methods
These workflows were written using Ensembl release 88 (March 2017). There may be updates to the data or interfaces if you are using a more recent release.
WF1: Phenotype-Based Searches and Identification of Associated Genetic Variation
The Ensembl browser can be searched using a variety of terms, including genomic regions, genes, variants, or phenotypes; the following workflow describes a phenotype-based search that highlights data and annotations collated in the Phenotype, Variant, Gene, and Transcript tabs.
Non-melanoma skin cancer, principally basal cell and squamous cell carcinomas, is a relatively common pathology associated with variants in several genes [21].
1. Getting started: To explore the phenotype in more detail, type "non-melanoma skin cancer" into the search box on the Ensembl home page, www.ensembl.org, and click the "Go" button. The search autocomplete may retrieve direct links to suggested results; this will allow you to proceed immediately to step 2.
A list of search results will be generated, with "Non-melanoma skin cancer (Human Phenotype)" appearing first. Options on the left-hand side of the page permit restriction by species and/or other categories: click on the different filters individually to apply them to the search results.

2. Studying loci associated with a phenotype: Click the "Non-melanoma skin cancer (Human Phenotype)" link to open the Phenotype tab.
The loci associated with non-melanoma skin cancer are presented in tabular form; their external identifiers, genomic coordinates and associated genes, and the publications in which they were initially described are all listed. Links are provided to further information about the annotation source and relevant publications (in this case, the GWAS catalog [22] and PubMed [23]; Fig. 3).
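The phenotype lookup in steps 1 and 2 can also be approached programmatically through the Ensembl REST service [14]. The following is a minimal sketch, not the interface used to write this workflow: it assumes that the public server at https://rest.ensembl.org offers a `phenotype/term/:species/:term` endpoint that returns a JSON list. The helper names `phenotype_term_url` and `fetch_json` are introduced here purely for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of the public Ensembl REST server.
REST_SERVER = "https://rest.ensembl.org"


def phenotype_term_url(term, species="homo_sapiens", server=REST_SERVER):
    """Build the URL that asks for features annotated with a phenotype term."""
    # Percent-encode the term so that spaces survive inside the URL path.
    encoded = urllib.parse.quote(term, safe="")
    return f"{server}/phenotype/term/{species}/{encoded}"


def fetch_json(url, timeout=30):
    """Send a GET request and decode the JSON body (requires network access)."""
    request = urllib.request.Request(
        url, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


url = phenotype_term_url("non-melanoma skin cancer")
```

Calling `fetch_json(url)` on a networked machine should return the associated loci as a list of records. Because the REST server reflects the current Ensembl release, the loci returned may differ from those shown for release 88.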

3. Studying a variant: One of the variants associated with non-melanoma skin cancer, rs1805007, falls within the MC1R gene. Click the "rs1805007" link to load the Variant tab.

The Variant tab collates data relating specifically to the variant of interest. (A full list of the databases from which Ensembl imports variation data can be found in the documentation [24].)
An overview of the data is found at the top of the Variant tab (Fig. 4A), while a table indicating the phenotypes associated with the variant can be found lower down the page.
The most severe consequence linked to rs1805007 is "missense_variant", indicating that the alternative allele at this locus leads to an amino acid substitution. All consequences of the rs1805007 variant can be explored by clicking on the "See all predicted consequences" link. Ensembl uses Sequence Ontology terms to describe variant consequences [25].
Fig. 3 The Ensembl phenotype tab. The Ensembl phenotype tab allows you to explore the phenotype ontology associated with a phenotype and any loci (variants, QTLs, or genes) linked to the phenotype. Loci associated with the phenotype are shown in a table on the Associated loci page. The buttons above the table allow filtering. Links take you to the database and/or paper where the link between locus and phenotype was made.
Below the consequence, you can see that the reference allele of rs1805007 at the genomic position 16:89919709 is C, and one alternative allele, T, has been observed. Minor allele frequency (MAF) has been calculated for the alternative allele, which was observed in 1000 Genomes Project participants: it was identified in 2% of participants in that study [2,26].
Navigating to the Variant tab from the Phenotype tab automatically loads a table containing the phenotype data relating to this variant, as mentioned above. Tanning ability, sensitivity to sun, and fair hair and skin color have all been associated with the variant, as has basal cell carcinoma, a form of non-melanoma skin cancer. Collectively, these phenotypes are consistent with the observed linkage between fair complexions and sensitivity to sun exposure.
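The overview fields described for the Variant tab can be pulled out of a REST response in a few lines. The sketch below assumes a record shaped like the JSON returned by `https://rest.ensembl.org/variation/human/rs1805007`, with `name`, `most_severe_consequence` and a `mappings` list. The example record is hand-built from the values quoted in the text above rather than downloaded, and `summarise_variant` is a hypothetical helper.

```python
def summarise_variant(record):
    """Reduce a variation record to the fields shown in the Variant tab overview."""
    # Use the first genomic mapping; a variant may map to more than one location.
    mapping = record["mappings"][0]
    # The allele string lists the reference allele first, e.g. "C/T".
    reference, *alternatives = mapping["allele_string"].split("/")
    return {
        "id": record["name"],
        "consequence": record["most_severe_consequence"],
        "position": f'{mapping["seq_region_name"]}:{mapping["start"]}',
        "reference": reference,
        "alternatives": alternatives,
    }


# Hand-built example using only values stated in the text (an assumption about
# the response layout, not a captured server response).
example_record = {
    "name": "rs1805007",
    "most_severe_consequence": "missense_variant",
    "mappings": [
        {
            "seq_region_name": "16",
            "start": 89919709,
            "end": 89919709,
            "allele_string": "C/T",
        }
    ],
}

summary = summarise_variant(example_record)
```

Keeping the parsing separate from the download means the same function can be applied to records saved from an archived release.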

4. The menu on the left presents additional options. Click "Population genetics" to view allele frequencies in global populations.
On this page, data from the 1000 Genomes [26], HapMap [27], and NHLBI Exome Sequencing [28] Projects and the Exome Aggregation Consortium (ExAC) [29] are displayed. The data from the 1000 Genomes Project are shown at the top (Fig. 4B); the pie-charts represent allele frequencies for different superpopulations. Allele frequencies for subpopulations within each superpopulation can be viewed by clicking the "Subpopulations" link beneath the corresponding superpopulation. Allele and genotype frequencies among 1000 Genomes Project participants can also be found in tabular form immediately below the graphical views.
The frequency of the T variant allele in 1000 Genomes Project participants is highest among European subgroups, and individuals homozygous for the variant also occur only in these subgroups. This is expected given the phenotypes associated with the variant (Fig. 4B).
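Ranking populations by the frequency of one allele, as read off the pie charts above, is straightforward once the frequencies are in tabular form. The sketch below assumes a list of records with `population`, `allele` and `frequency` keys, similar to the `populations` list that the REST `variation` endpoint is understood to return when called with `pops=1`. The population labels and numbers in the example are placeholders, not real frequencies.

```python
def rank_populations(populations, allele, prefix=""):
    """Return (population, frequency) pairs for one allele, highest first."""
    rows = [
        (entry["population"], entry["frequency"])
        for entry in populations
        # Keep only the requested allele and, optionally, one project prefix.
        if entry["allele"] == allele and entry["population"].startswith(prefix)
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder records: labels and values are invented for illustration only.
demo_populations = [
    {"population": "demo:POP_A", "allele": "T", "frequency": 0.01},
    {"population": "demo:POP_A", "allele": "C", "frequency": 0.99},
    {"population": "demo:POP_B", "allele": "T", "frequency": 0.25},
    {"population": "other:POP_C", "allele": "T", "frequency": 0.5},
]

ranked = rank_populations(demo_populations, "T", prefix="demo:")
```

The optional `prefix` argument restricts the ranking to a single project, since frequencies from several projects are displayed on the same page.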

5. To explore genes and transcripts with which the variant is associated, click "Genes and regulation" in the menu on the left. As we saw previously, the variant lies within the MC1R gene; the summary table here indicates that it overlaps two independent transcripts of this gene as a missense variant and is a downstream gene variant of a third transcript. Other genes and transcripts affected by the variant, as well as the associated consequences, are also shown (Fig. 4C).
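The per-transcript summary described in this step can be reproduced from Variant Effect Predictor output [16]. The sketch below assumes a record shaped like one element of the JSON list returned by the REST endpoint `vep/human/id/rs1805007`, in which `transcript_consequences` holds one entry per transcript with `transcript_id` and `consequence_terms`. The transcript identifiers in the example are placeholders, arranged to mirror the two missense and one downstream consequence mentioned above.

```python
from collections import defaultdict


def transcripts_by_consequence(vep_record):
    """Group transcript IDs under each Sequence Ontology consequence term."""
    grouped = defaultdict(list)
    for consequence in vep_record.get("transcript_consequences", []):
        # One transcript can carry several terms, so register it under each.
        for term in consequence["consequence_terms"]:
            grouped[term].append(consequence["transcript_id"])
    return dict(grouped)


# Placeholder record: transcript IDs are hypothetical, not real stable IDs.
demo_vep_record = {
    "id": "rs1805007",
    "transcript_consequences": [
        {"transcript_id": "TX_A", "consequence_terms": ["missense_variant"]},
        {"transcript_id": "TX_B", "consequence_terms": ["missense_variant"]},
        {"transcript_id": "TX_C", "consequence_terms": ["downstream_gene_variant"]},
    ],
}

grouped_transcripts = transcripts_by_consequence(demo_vep_record)
```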
In a second table, called "Gene expression correlations," you can find a list of genes whose expression has been found by the GTEx Project to be affected by the variant of interest [30].
Finally, any regulatory features or motifs in which the variant falls will be listed in two separate tables at the bottom of the page. There are no regulatory features or motifs that overlap the variant rs1805007.
Several phenotypes have been linked to rs1805007, and the MC1R gene also plays a role in coat and skin pigmentation in other organisms, suggesting a conserved function.
9. Click on the "GO: Biological process" link in the left-hand menu.
The GO, or Gene Ontology, terms related to biological processes which have been associated with the transcripts of the MC1R gene are displayed in the table (Fig. 5B) [31,32]. Each row of the table contains the GO accession number, a description of the GO term, and the evidence codes, annotation source and stable IDs of transcripts associated with that GO term. Hover over the evidence codes to see their definitions.
MC1R-encoded proteins are involved in signal transduction and the melanin biosynthesis pathway, and are located in the plasma membrane, consistent with a role in pigmentation.
10. Click on the "External references" link in the left-hand menu.
The level of support for a transcript prediction, and its biological relevance, can be inferred from the matching evidence records and associated flags.
12. Click the "ENST00000555147.1" link in the Transcript table; ENST00000555147.1 is the Ensembl stable ID for the MC1R-001 transcript.
The MC1R-001 transcript's biotype is listed as "protein-coding," and the transcript is colored golden in the graphical view. This indicates that it has been independently annotated with identical coordinates by both the Ensembl automated gene annotation and the HAVANA manual gene annotation methods [1] (Fig. 5A).
We are now located in the Transcript tab, which is visible in the blue navigation bar at the top of the page, next to the Gene tab. From the left-hand menu of the Transcript tab you can access complete, spliced or translated transcript sequences ("Exons," "cDNA," and "Protein," respectively), as well as graphical and tabular representations of annotated protein domains ("Protein summary" and "Domains & features," respectively). "General identifiers" provides links to related records in external repositories (Fig. 6A).
You can now click on "Hide Transcript
13. Click on the "Supporting evidence" link in the left-hand menu.
This page displays the records used in the annotation in graphical form; all records are hyperlinked to the original data in RefSeq, UniProt, and ENA [38] (Fig. 6B).
14. Click the "Variant table" link in the left-hand menu.
This table displays the set of variants associated with the MC1R-001 transcript (Fig. 7A).
15. Filter the table to view missense variants between amino acid coordinates 150-160.
(a) Filter the table for missense variants by clicking "Consequence" in the Filter section, then "Turn All Off" and "Missense variant."
(b) Filter the table to view variants at a specific amino acid coordinate within the translated sequence of the transcript by clicking on "Filter Other Columns," then "AA Coord." Use the sliders to restrict the area for which variants are shown to 150-160.
You can filter this table in numerous ways, including by consequence, source, and genomic or amino-acid coordinates (Fig. 7B). For missense variants, there are also options to filter by predicted pathogenicity score, as determined by SIFT [39] and/or PolyPhen [40] (PolyPhen calculations are available only for human variants). SIFT and PolyPhen pathogenicity predictions have been calculated for rs1805007 and the amino acid substitution is considered deleterious. (An additional variant, rs149922657, has been observed at the same position of MC1R, but has not been associated with any phenotype.)
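The two filters applied in the Variant table (consequence, then amino acid coordinate) amount to a simple row selection. The following sketch applies the same logic to an exported table held in memory. The column names `consequences` and `aa_coord`, and the rows themselves, are hypothetical stand-ins for whatever the export contains.

```python
def filter_variants(rows, consequence="missense_variant", aa_range=(150, 160)):
    """Keep rows with the given consequence whose AA coordinate is in range."""
    low, high = aa_range
    return [
        row
        for row in rows
        if consequence in row["consequences"]
        # Non-coding variants have no amino acid coordinate; skip them.
        and row["aa_coord"] is not None
        and low <= row["aa_coord"] <= high
    ]


# Hypothetical rows for illustration; IDs and coordinates are placeholders.
demo_rows = [
    {"id": "var_a", "consequences": ["missense_variant"], "aa_coord": 42},
    {"id": "var_b", "consequences": ["missense_variant"], "aa_coord": 155},
    {"id": "var_c", "consequences": ["synonymous_variant"], "aa_coord": 152},
    {"id": "var_d", "consequences": ["downstream_gene_variant"], "aa_coord": None},
]

selected_ids = [row["id"] for row in filter_variants(demo_rows)]
```

The range is inclusive at both ends, matching the behavior one would expect from sliders set to 150 and 160.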

16. Click "Haplotypes" in the left-hand menu.
This page allows you to view linked variants that tend to be coinherited. As a default, the amino acid identities and coordinates of each haplotype are shown, along with their frequencies in different 1000 Genomes Project populations [26]; however, clicking "switch to CDS view" at the top of the table will show nucleotide sequences instead (Fig. 8A). Clicking on any haplotype will load a table indicating its frequencies in different 1000 Genomes populations in more detail (Fig. 8B), as well as a sequence view highlighting the nucleotide and amino-acid positions altered, if applicable (Fig. 8C).

17. Exporting Ensembl variation data: Data can be exported from Ensembl at multiple scales. A link to the BioMart tool, which permits the download of customized datasets at intermediate scale, can be found in the navigation bar at the top of all Ensembl pages (Fig. 2) [12]. In the BioMart interface, select the Dataset "Ensembl Variation" (this will also include the release number, which is 88 at the time of writing), then "Human Short Variants (SNPs and indels excluding flagged variants)." To download all variants of ≤50 bp lying within MC1R, as well as 5 kb upstream and downstream of the gene, filter by "Gene-associated Variant Filters," selecting "Gene stable IDs" and inputting "ENSG00000258839," the stable ID for the MC1R gene. You can choose attributes of interest under "Variant" or "Flanking sequences" (for example, the variant name, source, consequence, start and end coordinates, and pathogenicity predictions), which will be listed next to each variant in the output table. Click the "Results" button to view and download the results table (Fig. 9).
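The point-and-click BioMart query above corresponds to an XML document that BioMart can also accept directly. The sketch below builds such a document with the Python standard library. The internal names used here (`hsapiens_snp` for the dataset, `ensembl_gene` for the filter, and the attribute names) are assumptions about how the web labels map onto BioMart identifiers, and should be checked against the XML shown by the "XML" button in the BioMart interface.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Assemble a BioMart XML query with one dataset, filters and attributes."""
    query = ET.Element(
        "Query",
        {
            "virtualSchemaName": "default",
            "formatter": "TSV",
            "header": "1",
            "uniqueRows": "1",
            "datasetConfigVersion": "0.6",
        },
    )
    dataset_node = ET.SubElement(
        query, "Dataset", {"name": dataset, "interface": "default"}
    )
    for name, value in filters.items():
        ET.SubElement(dataset_node, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(dataset_node, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Dataset, filter and attribute names below are assumed, not taken from the text.
variant_query = build_biomart_query(
    "hsapiens_snp",
    {"ensembl_gene": "ENSG00000258839"},  # MC1R stable ID from the text
    ["refsnp_id", "refsnp_source", "chr_name", "chrom_start", "chrom_end"],
)
```

The resulting string can be submitted as the `query` parameter of the BioMart web service. Generating the XML in code makes it easy to repeat the same export for a list of genes.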

WF2: Gene-Based Searches and Identification of Regulatory Features in a Genomic Region
The following workflow describes a gene-based search and indicates some of the data and annotations collated in the Gene tab and Regulation tab.
The POU5F1 gene, formerly known as OCT4, encodes one of the so-called "Yamanaka factors" implicated in cellular dedifferentiation and induction of pluripotency [41,42]. We can search Ensembl to view the POU5F1 gene model and associated annotation, including predicted regulatory features.
1. Getting started: Type "POU5F1" into the search box on the homepage, www.ensembl.org, or in the upper right corner of any browser page, and click "Go." This will generate a search-results page with "POU5F1 (Human Gene)" as the top hit. Click the title link to navigate directly to the POU5F1 Gene tab.
The gene "Summary" containing a graphical representation of the gene model loads by default following navigation from the search results.

2. Downloading gene sequences: The sequence of the gene and flanking regions can be downloaded from the Gene tab in two ways.
(a) To download the sequence in FASTA format for processing in an external tool, simply click the "Export data" button below the left-hand menu. This will open a pop-up window that presents customization options.
(b) To view POU5F1 sequence in the browser, click "Sequence" in the left-hand menu. This opens a display in FASTA format; buttons to download and to BLAST the sequence are shown on this page, and download customization options are similarly available (Fig. 10).
3. Exploring regulatory features: Select "Summary" in the left-hand menu. Scroll down to the graphical view of the gene model and locate the Regulatory Build track. The Regulatory Build depicts regulatory features that have been annotated based on epigenome-scale data imported from sources such as ENCODE [43], Roadmap Epigenomics [44] and Blueprint [45]. These motifs are color-coded according to the predicted function of the element (Fig. 11A).
4. Click on the red promoter overlapping the 5′ end of the longest transcript of POU5F1, POU5F1-004, to open a pop-up box with the stable ID ("ENSR00000195510"), type ("Promoter"), and genomic coordinates of the core element and flanking sequences. Click the stable ID to open the Regulation tab. Note: POU5F1 is transcribed from the reverse strand, and thus the 5′ sequences containing the promoter are located to the right of the gene.
Fig. 9 The BioMart interface. BioMart allows easy export of tables of gene, variant, or regulatory feature data. A video tutorial for BioMart is available at https://www.youtube.com/watch?v=QvGT2G0-hYA&ab_channel=EnsemblHelpdesk
Fig. 10 Exporting gene sequence from Ensembl. All sequence views in Ensembl allow download in either plain FASTA or annotated rich text format (RTF).
The Regulation tab displays a graphical representation of the genomic region surrounding the element and a table of the 68 cell types with regulation data currently in Ensembl, organized by activity state. In addition to the Regulatory Build, several tracks are shown by default; these include CRISPR/Cas9 genome-editing sites predicted by the Wellcome Trust Sanger Institute (WTSI) [30], transcription start sites identified by FANTOM5 [46], miRNA binding sites imported from Tarbase [31], and enhancers identified by VISTA [29]. Tracks with no data in the immediate region of the feature are not shown (Fig. 11B). (The term "track" refers to a data type that can be plotted against the genome.)
Feature activity by cell type can be viewed in graphical form by clicking the "Select cells" button and, in the resulting pop-up, choosing "All on" or selecting individual cell types.

5. To view the element's activation state in individual cell types, click the "Details by cell type" button at the top of the Regulation tab or the link in the left-hand menu. Click the "Select cells" button and then choose "A549" (repressed), "Placenta" (poised), "Pancreas" (inactive), "GM12878" (active). Next, click "Select evidence," then "All on," to load the experimental data available for the cell types of interest.
You are now viewing data from cell types in which the element is active, inactive, poised, and repressed (Fig. 12). These activation states are determined on the basis of the histone modifications observed in the region, along with transcription factor and RNA polymerase II or III binding, as well as areas of DNase I hypersensitivity indicating open chromatin [4].
Additional tracks can be accessed by clicking the "Configure this page" button, at left, or the cogwheel at the top of the image. These include the evidence underlying the Regulatory Build, as well as comparative genomics analyses and variation data that may provide additional context for the annotated feature.
6. Ensure that both "Peaks" and "Signal" buttons are selected. This will display a summary of the aligned reads (signal) as well as the peaks for each assay. Annotated features are clickable; for example, clicking on a predicted promoter will indicate any transcription factors known to bind it, along with links to the JASPAR database [47], where further information on motifs is presented. For other elements, the position of the apex is indicated with black arrowheads (Fig. 12).
8. Click on the blue "Configure this page" button to add regulatory data tracks for the same cell types: A549, placenta, pancreas, and GM12878. This opens a menu listing the many possible tracks available to display on the genome. Categories of tracks are listed on the left. Tracks can be turned on and off by clicking on the box alongside them. To see the activity of regulatory features in different cell types, turn them on within the "Regulatory features" section. In the "Histones & polymerases" and "Open chromatin & TFBS" sections, you will find that tracks are displayed as a matrix, with cell types along the top and evidence to the side (Fig. 13).
9. Exporting regulatory features with BioMart: A list of regulatory features, by type, in a genomic region can also be exported via BioMart. Navigate to BioMart, then select "Ensembl Regulation" > "Human Regulatory Features" (it may be necessary to refresh the window by clicking "New" if you have performed a previous query). For features within 5 kb up- and downstream of POU5F1, filter for Chromosome 6, Base pair start: 31159337, Base pair end: 31185731.
As defaults, "Chromosome Name," "Start (bp)," "End (bp)," and "Feature Type" are selected as Attributes. Add "Regulatory Stable ID" and generate your results.
Nine features are returned for this genomic region, including the promoter we explored, ENSR00000195510.
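The base-pair filter values used in this step follow from padding the gene by 5 kb on each side, and the same window can be handed to the REST service. In the sketch below, the gene coordinates 31164337-31180731 are back-calculated from the filter values in the text by removing the 5 kb flank, so they are an assumption rather than a quoted figure. The `overlap/region` endpoint with `feature=regulatory` is likewise assumed to be the REST counterpart of this BioMart query, and both function names are illustrative.

```python
def flanked_region(chromosome, gene_start, gene_end, flank=5000):
    """Extend a gene interval by a fixed flank on both sides."""
    # Coordinates are 1-based, so never let the start fall below 1.
    start = max(1, gene_start - flank)
    end = gene_end + flank
    return chromosome, start, end


def regulatory_overlap_url(chromosome, start, end, species="human",
                           server="https://rest.ensembl.org"):
    """Build a REST URL requesting regulatory features in a region."""
    region = f"{chromosome}:{start}-{end}"
    return f"{server}/overlap/region/{species}/{region}?feature=regulatory"


# Assumed POU5F1 coordinates, derived from the filter values in the text.
window = flanked_region("6", 31164337, 31180731)
overlap_url = regulatory_overlap_url(*window)
```

Because the REST server reports the current release, the number of features returned today may differ from the nine reported for release 88.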

Fig. 12 Evidence for and activity of regulatory features in different cell types. The Details by cell type page in the regulatory feature tab can be manipulated to show the activity of the feature in cell types of interest using the buttons at the top. For each cell type, the feature is shown color-coded to indicate its activity, with the evidence shown below. The evidence is ChIP-seq and DNase-seq data, and is shown as peaks of significant activity and as signal giving the number of reads. The top of the peak is indicated in the peak bar by pairs of black arrows. Black blocks in the regulatory features indicate the position of transcription factor binding motifs, which are listed in a pop-up when clicked on.

Discussion
Here, we describe methods to navigate variation and regulation data in the Ensembl browser, focusing on human, although the principle of navigation is relevant to queries in all species.
The typical entry point to a query in the browser is the search function. The Ensembl search is versatile and can retrieve information linked to a variety of inputs, including but not limited to genomic locations; gene, transcript, protein and regulatory feature IDs; GO terms; variant IDs; and phenotypes. Unless otherwise specified in the query, search results for human will be returned first; filters displayed on the left-hand side of the results page permit the restriction of results by category (e.g., gene, variant) and by organism.
Selecting a search result will open a tab that collates information on the entity: in the two workflows presented above, we present strategies for accessing the Phenotype, Variant, Gene, Transcript, Location, and Regulation tabs following phenotype- and gene-based searches. As you move from tab to tab in a single query, previously accessed tabs will remain open in the blue navigation bar at the top of the page to facilitate seamless data-retrieval in a minimal number of steps; you can reenter a previous tab simply by clicking on the tab header in the navigation bar (Fig. 6A).
Fig. 13 Adding regulation tracks to a region view. The Region in detail view displays a genomic region and can be customized to show tracks of interest using the Configure this page button. This opens a detailed menu listing all the available tracks, using categories on the left, including regulatory features and evidence. Regulatory evidence can be added using a matrix selector, listing the cell type along the top and type of evidence down the left.
By default, tabs will open with a summary of the information available for each entity (e.g., a transcript or variant), although herein we indicate a few cases where other data are loaded: for example, the Gene Variant table is presented immediately upon navigation from the Variant to the Gene tab. Should a view not be as expected, links to all data and annotations available in a tab can be found in the menu on the left; for Location, Gene and Transcript tabs, these links adhere to a similar framework but present annotations at different scales.
Tabs can be customized by clicking on the blue "Configure this page" button below the left-hand menu, or the cogwheel icons that appear in the upper borders of graphical displays (Fig. 13). Customization allows you to add or remove data tracks that may be useful to interpretation or analysis; for example, to view the evidence underlying an activation-state prediction for a regulatory feature in a cell type of interest (in the Location or Regulation tabs). Other examples of customization include, in the Location tab, the addition of tracks containing the data, imported from external repositories, that were used to annotate transcripts in a genomic region (ENA, UniProt, and RefSeq tracks accessible in Location tab) [11]. Public datasets can also be added from the Track Hub Registry [48], and you can import your own data, in multiple formats [49], for examination in the context of the browser.
Data can be exported directly from the browser by clicking the blue "Export data" buttons found below the left-hand menu in most tabs, or the "Download sequence" buttons above FASTA sequences in the Gene and Transcript tabs. In addition, the BioMart tool described in the workflows presented herein can be used to retrieve custom datasets from our Gene, Variation and Regulation databases, and data can be accessed programmatically from our Perl APIs and REST service. Data from all Ensembl releases can also be downloaded en masse from our FTP site.
Species' sequence data and annotations may be updated several times a year. You should therefore be attentive, when querying Ensembl data, to the current browser version, as annotations are subject to change. Data can, however, still be retrieved directly from archived versions of the browser, as well as via BioMart, while the browser web interface remains online. Following the decommissioning of any browser version, the data remain accessible from our FTP site and APIs, as mentioned above.
A dedicated email helpdesk is available to field any inquiries about Ensembl and we typically reply to messages within two days of receipt. We also hold training workshops upon invitation by research institutes; from 2013 to 2016 we participated in an average of 86 workshops, and trained 2150 students, per year. Our training materials are accessible online [50], along with a number of courses that are available on the Train Online Platform of the European Bioinformatics Institute (EMBL-EBI) [51], and we have published a blogpost outlining the process of hosting your own workshop [52]. Short help videos can be found both on our YouTube channel [53] and, for those who cannot access YouTube, on Youku [54]. We invite the community to contact us via help-desk@ensembl.org for more information about workshops, with questions regarding the browser, and to suggest features and resources which would assist their work.