Scripting Analyses of Genomes in Ensembl Plants

Ensembl Plants (http://plants.ensembl.org) offers genome-scale information for plants, with four releases per year. As of release 47 (April 2020) it features 79 species and includes genome sequence, gene models, and functional annotation. Comparative analyses help reconstruct the evolutionary history of gene families, genomes, and components of polyploid genomes. Some species have gene expression baseline reports or variation across genotypes. While the data can be accessed through the Ensembl genome browser, here we review specifically how our plant genomes can be interrogated programmatically and the data downloaded in bulk. These access routes are generally consistent across Ensembl for other non-plant species, including plant pathogens, pests, and pollinators.


Introduction
Plants play a central role in the ecology and economy of our planet and are essential to our food security. As the world population increased by 145% in the last 60 years, the yields of cereals increased even more, while not needing much more land [1]. This has been possible as a result of improved agricultural practices and crops. Currently, breeding programs take advantage of inexpensive genomic and phenotypic data. The next steps towards what is being called Breeding 4.0 [2] include adapting crops to changing environments and broadening the diversity pool to compensate for the losses that occurred during domestication. For this reason wild relatives of crops are being sequenced increasingly the detailed structure of the underlying data store. One is a Perl API, while the other uses a language-agnostic REST interface [9]. The REST service allows up to 15 requests per second.
In addition to the primary databases, Ensembl Plants also provides access to denormalized data warehouses, constructed using the BioMart tool kit [10]. These are specialized databases that support efficient gene- and variant-centric queries. Finally, a variety of data selections are exported from the databases in common file formats and made available for download via an FTP site.
These resources are summarized in Table 1. Recipes to query each of them are listed in Table 7.

Overview of Data Content
Genomes and Core Data
Genome assemblies are typically imported from the European Nucleotide Archive (ENA) [11], which is part of the International Nucleotide Sequence Database Collaboration (http://www.insdc.org, INSDC). Gene model annotations are imported from the ENA [11], Phytozome [12], or provided by community members (see Note 3). For instance, the rice annotation was imported from RAP-DB [13]. After import, various computational analyses are performed for each genome. A summary of these is given in Table 2. In addition, specific datasets are imported and analyzed according to the requirements of individual communities. These datasets typically fall into two classes: markers and variants across genotype panels.
The genomes currently included in Ensembl Plants are listed in Table 3. A summary of UniProt coverage of proteins encoded by genes within these genomes is given in Table 4 [17]. In all cases, genomes are identified by their Ensembl production name, which is usually binomial but can also include a strain name to distinguish particular cultivars or ecotypes, such as malus_domestica_golden. Details of other datasets incorporated can be found through the homepage for each species (see Note 3).

Variation Data
The variation schema can store genetic variants observed in populations or germplasm collections, their alleles and frequencies, alongside sample genotype data. Supported variant types include single nucleotide polymorphisms, indels, and structural variants. The functional consequence of variants on genes is predicted with the Ensembl Variant Effect Predictor (VEP) [14]. Linkage disequilibrium data and statistical associations with phenotypes are available for selected species. The variation datasets of release 47 of Ensembl Plants are described in Table 5. The Ensembl VEP is also a command line tool that can be used to efficiently annotate variants, and we provide recipes for it as well (see Table 7).

Comparative Genomics Data
The Ensembl Gene Tree pipeline is used to calculate evolutionary relationships among members of protein families (Table 2). For each gene, the translation of the canonical transcript is selected (see Note 4). Briefly, this pipeline first finds clusters of similar proteins and then, for each cluster, attempts to reconcile the relationship between the sequences with the known species cladogram (Fig. 1), derived from the NCBI Taxonomy database [42]. The analysis also contains a few nonplant outgroups. The TreeBeST software (https://github.com/Ensembl/treebest) is used to construct a consensus tree, which allows the identification of orthologues and paralogues. As polyploid genomes are split into components, homoeologous genes are effectively defined as orthologues among subgenomes. A number of plant genomes are also included in a pan-taxonomic gene tree, containing a representative selection of sequenced genomes from all domains of life. Recipe R2 can be used to check which comparative analyses have been run for a particular species. This information is also displayed in the table at http://plants.ensembl.org/species.html.
Other comparative analyses available in Ensembl Plants are pairwise whole-genome alignments and synteny (see Tables 2 and 6).

Baseline Expression Data
Baseline gene expression reports are available as "Gene expression" on the website for selected species. An example for barley is shown at http://plants.ensembl.org/Hordeum_vulgare/Gene/ExpressionAtlas?g=HORVU5Hr1G095630;r=chr5H:599085656-599133086. The underlying curated expression data, produced by Expression Atlas [44], can be browsed and downloaded via the expression widget.

RNA-seq Tracks
RNA-seq datasets from the public INSDC archives are mapped to genome assemblies in Ensembl Plants in every release. They are handled as ENA studies and for each of them CRAM files are created with the RNA-Seq-er pipeline (https://www.ebi.ac.uk/fg/rnaseq/api) [45] and published at ftp://ftp.ensemblgenomes.org/pub/misc_data/Track_Hubs. Each study contains a separate folder for each assembly that was used for mapping. These tracks can be interactively displayed in the browser, but can be of interest for high-throughput studies as well. For instance, study SRP133995 was mapped to tomato assembly SL3.0 and the tracksDb.txt file therein indicates the full path to the relevant CRAM file next to its metadata. CRAM files for a selected assembly can be discovered with recipe C1; note that the assembly name corresponds to column "assembly_default" in recipe R2. As of May 2020 there were 89,355 CRAM files available.
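As a rough illustration of how the CRAM paths in such a track file could be collected programmatically, the following Python sketch parses stanza-style text and returns the URLs ending in .cram. It is not recipe C1. It assumes the file follows the usual Track Hub layout, with blank-line-separated stanzas and a bigDataUrl setting; the function names and the sample stanzas are hypothetical.

```python
def parse_trackdb(text):
    """Split track file text into stanzas (dicts) separated by blank lines."""
    tracks, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current stanza
            if current:
                tracks.append(current)
                current = {}
            continue
        if line.startswith("#"):
            continue  # skip comment lines
        parts = line.split(None, 1)  # setting name, then the rest of the line
        current[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    if current:
        tracks.append(current)
    return tracks


def cram_urls(tracks):
    """Return the bigDataUrl values that point at CRAM files."""
    return [t["bigDataUrl"] for t in tracks
            if t.get("bigDataUrl", "").endswith(".cram")]


# Hypothetical stanzas, invented for illustration only
SAMPLE = """\
track RUN0000001
bigDataUrl ftp://example.org/RUN0000001.cram
shortLabel sample one
type cram

track RUN0000002
bigDataUrl ftp://example.org/RUN0000002.cram
shortLabel sample two
type cram
"""

urls = cram_urls(parse_trackdb(SAMPLE))
print(urls)
```

The metadata kept in each stanza (for example shortLabel) stays available in the parsed dictionaries, so runs can be filtered before any CRAM file is downloaded.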

Methods
This section describes some of the recipes listed in Table 7 in detail so that the reader can execute or modify any of them. Software dependencies required by these recipes are listed in https://github.com/Ensembl/plant-scripts/blob/master/README.md.
The different approaches are complementary. While the native Perl API is the most powerful and used extensively by Ensembl developers, it also requires some Perl knowledge and the installation of several repositories. Similarly, the BioMart and MySQL examples require knowledge of R and SQL, respectively. In contrast, the REST endpoints can be interrogated with any programming language, although only a defined set of queries is currently supported. The FTP recipes allow efficient bulk downloads, but with no customization. The source code for all recipes can be found at https://github.com/Ensembl/plant-scripts.

Clone the GitHub Repository and Install Dependencies
The following steps explain how to obtain a local copy of the recipes and how to test them on Linux/MacOS operating systems (OS).

1. Open a terminal and check whether git is installed by typing: git --version.
2. If required, install git using the appropriate software manager for your OS.
3. Clone the repository by typing: git clone https://github.com/Ensembl/plant-scripts.git
4. Navigate to the scripts directory: cd plant-scripts.

Perl API Recipes
The Ensembl Perl API enables access to all types of data from Ensembl Plants (genes, variation, comparative genomics, regulation, etc.) and it is documented extensively (see Note 5). It allows complex queries to be executed without the construction of any explicit SQL queries. The repository contains eight Perl API recipes, of which three are described here (A1, A4, and A8).

1. Load the Registry object with details of genomes available from the public Ensembl Genomes servers (recipe A1).
2. Set species and chromosome of interest and print a BED file with repeats (recipe A4). Ensembl uses 1-based inclusive coordinates internally.
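The coordinate convention matters when writing BED output, because BED intervals are 0-based and half-open. The conversion can be sketched in Python as follows; this is only an illustration of the arithmetic and not the Perl code of recipe A4. The function names and the example feature are hypothetical.

```python
def to_bed_interval(start, end):
    """Convert a 1-based inclusive interval to BED's 0-based half-open form."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the end is unchanged because it becomes exclusive
    return start - 1, end


def bed_line(chrom, start, end, name):
    """Format one feature given in 1-based inclusive coordinates as a BED line."""
    bed_start, bed_end = to_bed_interval(start, end)
    return "\t".join([chrom, str(bed_start), str(bed_end), name])


# Hypothetical repeat spanning bases 1001 to 1500 (500 bp) on chromosome 4
print(bed_line("4", 1001, 1500, "hypothetical_repeat"))
```

The feature length is preserved by the conversion: 1500 - 1001 + 1 in the 1-based system equals 1500 - 1000 in BED.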

FTP Recipes
There are 12 recipes in the repository that query the Ensembl Genomes FTP server. They use shell variables and the wget program to download files. The recipes refer to the Ensembl release and the Ensembl Plants release as RELEASE and EGRELEASE, respectively. Recipe F5 involves a prewritten BioMart query. Homologies can also be downloaded in OrthoXML format [47], which renders a smaller file but requires a more complex parser.
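To make the relationship between the two release numbers concrete, the Python sketch below derives the Ensembl Plants release from the Ensembl release and composes a download URL. It rests on two assumptions that should be checked against the FTP site: that the offset between the two numbering schemes is 53 (Ensembl 100 corresponds to Ensembl Plants 47), and that genomic FASTA files sit under a plants/release-N/fasta/species/dna/ directory. The function names are hypothetical and the recipes themselves use shell and wget.

```python
# Assumption: Ensembl release 100 corresponds to Ensembl Plants release 47
RELEASE_OFFSET = 53
# FTP server named in this chapter
FTP_SERVER = "ftp://ftp.ensemblgenomes.org/pub"


def eg_release(release):
    """Derive EGRELEASE (Ensembl Plants) from RELEASE (Ensembl)."""
    return release - RELEASE_OFFSET


def fasta_dir_url(species, release, division="plants"):
    """Compose the URL of the genomic FASTA folder of a species.

    The directory layout is an assumption and should be verified by
    browsing the FTP site before relying on it.
    """
    return (f"{FTP_SERVER}/{division}/release-{eg_release(release)}"
            f"/fasta/{species}/dna/")


print(fasta_dir_url("arabidopsis_thaliana", 100))
```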

MySQL Recipes
Direct access to the public MySQL server requires knowledge of the schemas (see Notes 1 and 8  access is recommended. Three recipes are shown here; they all require the mysql-client to be installed.

Count Protein-Coding Genes of a Particular Species
This is recipe S2. The source code works out the current release number, but it can also be set manually. See recipe F3 to obtain the corresponding sequences.
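The shape of such a query can be sketched in Python by composing the mysql client call without running it. This is not the source of recipe S2. The host, port, and user of the public server, the core database naming pattern, and the example database name are assumptions to be verified (for instance with SHOW DATABASES); the function names are hypothetical.

```python
# Assumptions: connection details of the public Ensembl Genomes MySQL server
HOST = "mysql-eg-publicsql.ebi.ac.uk"
PORT = 4157
USER = "anonymous"

# The gene table of a core database stores one biotype per gene
COUNT_SQL = "SELECT COUNT(*) FROM gene WHERE biotype = 'protein_coding'"


def core_db_pattern(species, eg_release, release):
    """LIKE pattern for the core database of a species (assumed naming scheme)."""
    return f"{species}_core_{eg_release}_{release}_%"


def mysql_command(database, sql):
    """Build the argument list for the mysql command line client."""
    return ["mysql", f"--host={HOST}", f"--port={PORT}", f"--user={USER}",
            database, f"--execute={sql}"]


# Step 1: find the exact database name for the chosen releases
pattern = core_db_pattern("arabidopsis_thaliana", 47, 100)
print(f"SHOW DATABASES LIKE '{pattern}'")

# Step 2: count genes in that database (name below is an assumed example)
print(mysql_command("arabidopsis_thaliana_core_47_100_11", COUNT_SQL))
```

The argument list can be passed to subprocess.run once the database name has been confirmed.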

REST Recipes
The following recipes, written in Python, can also be found in R and Perl languages in the repository. They communicate with the Ensembl REST service at https://rest.ensembl.org (see Note 9) using the functions get_json and get_json_post, defined in file exampleREST.py.
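For readers who want to see what a helper of this kind might involve, here is a minimal standard-library sketch that builds a request URL, spaces out requests to respect the limit of 15 per second, and decodes the JSON reply. It is not the get_json function from exampleREST.py; the names build_url, Throttle, and fetch_json are hypothetical, and error handling and retries are left out.

```python
import json
import time
import urllib.parse
import urllib.request

SERVER = "https://rest.ensembl.org"
MAX_PER_SECOND = 15  # request limit quoted for the REST service


def build_url(path, params=None, server=SERVER):
    """Join server, endpoint path, and optional query parameters."""
    url = server.rstrip("/") + "/" + path.lstrip("/")
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, max_per_second=MAX_PER_SECOND,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous request, if any

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


_throttle = Throttle()


def fetch_json(path, params=None):
    """GET an endpoint and return the decoded JSON (needs network access)."""
    _throttle.wait()
    request = urllib.request.Request(
        build_url(path, params),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example call, commented out because it contacts the server:
# print(fetch_json("info/ping"))
```

The clock and sleep functions are injectable so that the spacing logic can be checked without waiting or contacting the server.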

Check Consequences of SNPs Within CDS Sequences
Recipe R8 queries two endpoints (map/cds/ and info/vep/:species/region). The first one translates CDS to genomic coordinates; the second one retrieves the predicted consequences of the SNP in the coding sequence. This recipe can be used to annotate genomic variants in a given gene across germplasm panels, as done in [48].
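The chaining of the two calls can be sketched in Python as below: the genomic mapping returned by the first endpoint supplies the region for the second. This is not the code of recipe R8. The path templates (map/cds/:id/:region and vep/:species/region/:region/:allele) and the keys of the mapping record follow the public REST documentation as understood here and should be checked there; the identifiers and coordinates are placeholders.

```python
def cds_map_path(transcript_id, cds_start, cds_end):
    """Path that maps CDS coordinates of a transcript to the genome."""
    return f"map/cds/{transcript_id}/{cds_start}..{cds_end}"


def vep_region_path(species, mapping, allele):
    """Path that asks for the consequences of an allele at a genomic region.

    `mapping` is assumed to be one element of the "mappings" list returned
    by the CDS mapping call, with seq_region_name, start, end, and strand.
    """
    region = "{seq_region_name}:{start}-{end}:{strand}".format(**mapping)
    return f"vep/{species}/region/{region}/{allele}"


# Placeholder transcript, species, and mapping record (not real data)
mapping = {"seq_region_name": "1", "start": 1000, "end": 1000, "strand": 1}

print(cds_map_path("TRANSCRIPT_ID", 10, 10))
print(vep_region_path("species_name", mapping, "T"))
```

A single-base substitution has equal start and end positions in both calls. Handling of alleles for transcripts on the reverse strand is not covered by this sketch.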

Annotate the Effect of Variants with the Ensembl Variant Effect Predictor
The Ensembl VEP tool can be used to predict the effect of variants on genes, transcripts, and protein sequences (see Note 10). As mentioned in Table 2, this analysis is run for all genomic variants imported into Ensembl (see Table 5). While the Ensembl VEP is available through a web interface, the advantage of a local installation is that it can be used to analyze variation sets of any species, including species that are not in Ensembl Plants. If variants are mapped to a reference genome supported in Ensembl Plants, using a cache file increases performance. However, as shown in recipe V4, it is possible to use other reference FASTA files together with the corresponding GFF/GTF annotation files. The next steps summarize how the software is installed and used following recipes F8, V1, V2, and V3.
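The two modes of operation described above, cache-based and custom annotation, can be contrasted with a small Python sketch that only assembles the command line. It is not taken from the V recipes. It assumes a local vep executable on the PATH and uses the options --input_file, --output_file, --cache, --species, --cache_version, --gff, and --fasta; the function name and file names are hypothetical, and installation-specific options are omitted.

```python
def vep_command(input_vcf, output_file, species=None, cache_version=None,
                gff=None, fasta=None):
    """Build a vep argument list for cache mode or custom annotation mode."""
    cmd = ["vep", "--input_file", input_vcf, "--output_file", output_file]
    if gff is not None:
        # Custom annotation: a GFF file plus the matching reference FASTA
        if fasta is None:
            raise ValueError("a FASTA file is needed together with the GFF")
        cmd += ["--gff", gff, "--fasta", fasta]
    else:
        # Cache mode: species and cache version identify the cache to use
        if species is None or cache_version is None:
            raise ValueError("species and cache_version are needed in cache mode")
        cmd += ["--cache", "--species", species,
                "--cache_version", str(cache_version)]
    return cmd


# Cache mode for a species hosted in Ensembl Plants (file names are made up)
print(vep_command("variants.vcf", "out.txt",
                  species="arabidopsis_thaliana", cache_version=47))

# Custom mode for a genome that is not in Ensembl Plants
print(vep_command("variants.vcf", "out.txt",
                  gff="genes.gff.gz", fasta="genome.fa"))
```

The resulting list can be handed to subprocess.run. In custom mode, VEP expects the annotation file to be sorted, compressed, and indexed beforehand.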

Querying Plant Pangenomes
Upcoming Ensembl Plants releases will have an increasing number of species with multiple cultivars or ecotypes as additional assemblies are added in collaboration with the relevant communities. On the website these cultivars can be browsed from the appropriate reference genome page such as http://plants.ensembl.org/Triticum_aestivum/Info/Strains?db=core (see Note 12). Starting with several UK cultivars in release 48 (August 2020), Ensembl will host all cultivars of the first assembled wheat pangenome [49] from release 50, planned for early 2021 (see the example in Fig. 2). Note that related noncultivated species are often included in the pangenomes of crops. For example, Ensembl Plants hosts 11 Oryza species plus the outgroup plant Leersia perrieri. Both types of genome sets can be considered pangenomes.
Currently, some pangenomes in Ensembl can be interrogated using gene trees and whole-genome alignments (WGAs; see Tables 2 and 6). For example, recipe A9 can be used to retrieve syntenic orthologous genes in rice or Brassicaceae species. These analyses will be available for wheat as well once de novo gene annotation and WGAs are produced.

Getting Help
Documentation for Ensembl Plants, including FAQs, tutorials, and detailed information about the project, datasets, and pipelines that we run can be found under the "Documentation" and "Website help" links at the top of every page. Detailed information for each species can be found on the species homepage. The EMBL-EBI Train online website has several free courses on Ensembl, including the recently updated "Ensembl Genomes (non-chordates): Quick tour" (https://www.ebi.ac.uk/training/online/course/ensembl-genomes-non-chordates-quick-tour) and "Ensembl REST API" courses (https://www.ebi.ac.uk/training-beta/online/courses/ensembl-rest-api). Any data problems are reported on our blog http://www.ensembl.info/known-bugs. If the available documentation cannot answer your question, a helpdesk is provided (mail helpdesk@ensemblgenomes.org with your query).

Fig. 2. The RFL gene (TraesCS1B02G038500) lifted over from the reference landrace Chinese Spring to three wheat cultivars (CDC Landmark, Julius and Jagger). The genes are displayed in the Ensembl Plants genome browser. While in the first cultivar there are three annotated transcript isoforms including one with two exons, the others have a single transcript with one exon. Furthermore, the locus is annotated as a pseudogene in Julius.