Introduction

As the amount of gene-related data grows, a standard nomenclature for gene names should be adopted to ensure uniformity. In the absence of a universally accepted naming system, researchers have typically tried to follow naming precedents set out in published literature. For example, a two-letter species abbreviation is typically used as a prefix in gene names to identify the species of origin. For Arabidopsis thaliana, genes carry the prefix At. This species prefix is generally not required when presenting results from a single species but helps interpretation when genes from multiple species are being compared. While following precedent has helped researchers report results more clearly, the need for naming conventions has become widely recognized, especially as knowledge of genes and gene families has rapidly grown.

In response, a number of model plant communities such as A. thaliana (http://www.arabidopsis.org/portals/nomenclature/guidelines.jsp), tomato (http://solgenomics.net/static_content/solanaceae-project/docs/tomato-standards.pdf), and rice (McCouch 2008) have independently suggested naming methodologies. These were developed to address a number of issues around consistency, such as cases where the same gene has been given different names. For example, ETHYLENE INSENSITIVE 5 has also been published as AIN1 and XRN4 (Olmedo et al. 2006). In tomato, the ARF gene family was independently named by two different research groups, resulting in each gene having two names (Kumar et al. 2011; Wu et al. 2011). On the other hand, in some cases, the same three-letter abbreviation has been given to different genes, such as CELL NUMBER REGULATOR (CNR) in cherry (De Franceschi et al. 2013) and COLORLESS NON RIPENING (CNR) in tomato (Manning et al. 2006). Once a reference genome has become available, the problem can be amplified, as independent groups annotate entire gene families. Although these problems have not been entirely resolved, clarity within genomic databases has been provided by the common naming convention of giving each gene a chromosomal locus identification number (for example, AT1G54490 in Arabidopsis) as part of whole genome sequencing and gene prediction. This convention was first established for Arabidopsis and subsequently followed by other model systems (McCouch 2008). The locus IDs provide a way to distinguish genes that have been given the same name, or to identify when a single gene has been given multiple names. While this remedy provides a mechanism for database management, it does not eliminate the problem of gene names and species designations not being translatable among various reports.

Genomics research in the Rosaceae research community is expected to grow due to the availability of powerful sequencing tools combined with the importance of commercially cultivated crops such as almond, apple, apricot, blackberry, cherry, peach, pear, plum, raspberry, rose, and strawberry. The opportunity now exists to provide the plant research community with a standardized naming convention that will help avoid significant confusion with regard to gene naming. With this in mind, the Rosaceae Executive Committee (RosEXEC) along with the Rosaceae International Genomics Initiative (RosIGI) formed a subcommittee to develop a standardized convention for naming genes among Rosaceae species. Adoption by the research community of the recommended gene naming conventions detailed here will benefit the Rosaceae research and breeding communities, as well as all associated fields of biology. We respectfully urge the research community as well as editors of peer-reviewed journals to follow the naming conventions outlined here, and to work with the naming committee to address problems or make suggestions. Finally, it is worth noting that, at present, gene names will be allocated on the basis of functional analysis, occasionally performed in Rosaceae species but more often in model organisms. Genes predicted from whole genome sequencing that have no predicted function and/or phenotype are currently not eligible for gene names and gene symbols beyond the assigned genomic identifier.

Rosaceae gene naming convention—species prefix

The comparison of genes across Rosaceae species gives valuable insights into the way that different species have evolved. Whole genome sequence is available for strawberry (Shulaev et al. 2011), apple (Velasco et al. 2010), peach (Verde et al. 2013), pear (Wu et al. 2013; Chagné et al. 2014), and Prunus mume (Zhang et al. 2012), with more to come. Genome sequence has allowed cross-species comparisons at the genome level and within gene families. Considering the approximately 3000 species within the family Rosaceae (Hummer and Janick 2009), it is clear that two-letter species abbreviations will be insufficient to distinguish between species. Even among the species most commonly studied, conflicts have already arisen. Genes in Prunus persica and Pyrus pyrifolia as well as the model species of mosses, Physcomitrella patens, have already been given the same species prefix, Pp. To facilitate comparison, we recommend using a standard set of three-letter species prefixes for the major Rosaceae species (Table 1). The use of a three-letter code was chosen as it settled most naming conflicts without adding cumbersome length to gene names. In general, the first letter of the genus name and the first two letters of the species name should be used for the three-letter prefix. Using a three-letter prefix resulted in two conflicts for the major species: using our convention, both Prunus cerasus and Prunus cerasifera would be Pce, so we recommend Pci for P. cerasifera. Likewise, both P. mume and Prunus munsoniana would be Pmu, so we recommend Pmn for P. munsoniana.
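The three-letter rule and its two recommended exceptions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the override table are hypothetical, only the two exceptions named in the text are encoded, and Table 1 remains the authoritative list of prefixes.

```python
from collections import defaultdict

# Exceptions recommended in the text to resolve three-letter collisions.
PREFIX_OVERRIDES = {
    "Prunus cerasifera": "Pci",
    "Prunus munsoniana": "Pmn",
}


def default_prefix(binomial):
    """First letter of the genus plus the first two letters of the species."""
    # Drop a hybrid marker such as the one in "Malus × domestica".
    genus, species = binomial.replace("×", " ").split()[:2]
    return genus[0].upper() + species[:2].lower()


def species_prefix(binomial):
    """Apply the general rule unless the text recommends an exception."""
    return PREFIX_OVERRIDES.get(binomial, default_prefix(binomial))


def find_conflicts(binomials):
    """Group species whose default three-letter prefixes collide."""
    groups = defaultdict(list)
    for name in binomials:
        groups[default_prefix(name)].append(name)
    return {p: names for p, names in groups.items() if len(names) > 1}


species = ["Prunus persica", "Pyrus pyrifolia", "Prunus cerasus",
           "Prunus cerasifera", "Prunus mume", "Prunus munsoniana"]
print(find_conflicts(species))
print([species_prefix(s) for s in species])
```

Under this sketch, Prunus persica and Pyrus pyrifolia, which share the two-letter prefix Pp, receive the distinct prefixes Ppe and Ppy, while the default rule alone would still collide for the two species pairs discussed above.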

Table 1 Proposed abbreviations for major Rosaceae species

In order to distinguish all taxonomic species within Rosaceae, including non-commercial species, abbreviations would need to be substantially longer. For these non-cultivated species, we suggest that authors take a UNIPROT approach (http://www.uniprot.org/docs/speclist) (Magrane and Consortium 2011) using five-letter abbreviations; three for the genus name and two for the species such that Potentilla simplex (common cinquefoil) becomes Potsi, Waldsteinia fragarioides becomes Walfr, and Malus platycarpa (Bigfruit crab apple) becomes Malpl. This convention is still not sufficient to account for all members of the Rosaceae family and conflicts may still arise. In such cases, we recommend that researchers choose distinguishing nomenclature following the guidelines set forth here as closely as possible, making exceptions as needed to avoid conflicts with other species both within and across other plant families. Furthermore, we strongly recommend that authors do not include a species prefix in the gene symbol when submitting the gene data to NCBI, GDR, or any other databases, so as to minimize the creation of duplicated names due to the species prefix. In databases, the species data are typically stored along with the gene names, so having a species prefix in gene names is unnecessary and generates additional aliases. In GDR, the gene symbol without the species prefix will be stored with already published gene names, with or without prefix, as aliases. Authors should use the prefix only in publications for comparisons between genes of different species origin.
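The five-letter abbreviation described above can be sketched as follows. The function name is hypothetical, and the sketch makes no attempt to detect or resolve the conflicts that the text notes may still arise.

```python
def five_letter_code(binomial):
    """Three letters of the genus followed by two letters of the species."""
    genus, species = binomial.split()[:2]
    return genus[:3].capitalize() + species[:2].lower()


for name in ("Potentilla simplex", "Malus platycarpa"):
    print(name, "->", five_letter_code(name))
```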

Rosaceae gene naming conventions—gene name

When giving a gene a name/symbol, we encourage researchers to first check with the current literature, and within the gene databases (GenBank and GDR), to ensure that the gene they want to name has not already been assigned a name in a prior publication. It is not recommended that researchers rename genes that are already published unless a compelling reason exists such as a lack of clear orthology or to correct existing confusion. When dealing with species that have high heterozygosity, it is occasionally hard to know whether differences in sequence are allelic, a result of gene duplication, or genome duplication. Researchers should aim to reduce the incidences where the same gene name has been used for different genes and where a gene has been assigned different names, as has already been observed in other species. From a community perspective, we aim to achieve a unique name for each gene (where a name has not already been established in the literature), a single name for each gene, and a link between gene name and gene model number from whole genome sequencing.

Gene naming by function, mutant phenotype, and homology

When naming genes in Rosaceae species, we propose the common nomenclature that is now standard across most of the genomics community. The Arabidopsis community has set out clear guidelines for gene nomenclature which should be followed for the Rosaceae (Meinke and Koornneef 1997) (http://www.arabidopsis.org/portals/nomenclature/guidelines.jsp). It is ideal to design the name so that it can be associated with biological function or mutant phenotype. When the biological function of the gene has not been directly assayed, there is an advantage to assigning the name of the most similar, functionally tested gene. For example, Malus × domestica ACC SYNTHASE1 (ACS1) would be named after A. thaliana ACS1. Co-naming with an Arabidopsis gene may often pose a challenge due to gene duplications in either species through evolution, and unknown or unclear biological function for a given gene family member making it impossible to equate two genes other than by sequence. However, especially for regulatory genes such as transcription factors, we recommend that whenever possible, similar naming and numbering schemes are used.

The guidelines set out by the Arabidopsis community point out the pitfalls of naming a gene based solely on a mutant phenotype or allelic form without knowing detailed biological or biochemical function. To relate a gene to a gene family, the Arabidopsis guidelines recommend the gene name to end with “-like” when information is based solely on sequence homology. Many Rosaceae genes, however, may not be functionally characterized in the host species and will instead be named on the basis of orthologous relationships with genes functionally characterized in Arabidopsis or other model species. We therefore recommend naming genes after the closest orthologs in other model species, without adding “like” at the end.

Another recommendation from the Arabidopsis community is that gene names should not be assigned unless a full-length cDNA sequence has been obtained. For the Rosaceae, it is advantageous to accommodate genes that are identified using various sequencing methods, including those identified through high throughput transcriptome sequencing, manual editing of multiple alignments, or other experimental or computational approaches. Therefore, GDR will accept gene models under these less stringent constraints and asks that, when authors submit gene names/symbols to GDR, they also indicate the category of evidence for the gene structure. Four categories of evidence for a gene structure have currently been established: (A) cDNA sequencing, (B) transcriptome sequencing, (C) computational evidence, and (D) other. Detailed information about these categories of evidence is shown in Table 2. This classification system will allow the research community to take advantage of all available data while enabling individual datasets to be filtered as needed based on these evidence codes.

Table 2 Category of evidence for the gene structure

Assigning gene symbols

For the gene symbol, we encourage using a standard three-letter code class symbol for members of a gene family together with a hierarchical numbering system. When two- or five-letter codes have already been used for established gene families, authors are recommended to use these for consistency and avoid inventing new names. The class symbol should be derived from the full name of the gene. The current list of gene class symbols is available in GDR (http://www.rosaceae.org/gene_class). A gene symbol can be followed by a numeric suffix. A name without the numeric suffix is presumed to be the first gene with a particular function that has been identified and therefore is the equivalent of suffix “1.” A numeric suffix greater than one should appear when the new gene has similar function or phenotype to a gene at one or more other loci. The numerical suffix from Arabidopsis will not necessarily be conserved in Rosaceae due to duplication events in both species. As stated earlier, organism-specific prefixes are encouraged to be used only in publications for clarity and should not be part of the gene symbol.
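The structure of a simple gene symbol, a class symbol followed by an optional numeric suffix, can be sketched as a small parser. The names below are hypothetical, and the pattern assumes a class symbol of two to five letters with no embedded digits, which may not hold for every established gene family.

```python
import re

# Two to five letters for the class symbol, then an optional numeric suffix.
SYMBOL_RE = re.compile(r"([A-Za-z]{2,5})(\d+)?")


def split_symbol(symbol):
    """Return (class_symbol, number); a missing suffix is treated as 1."""
    match = SYMBOL_RE.fullmatch(symbol)
    if match is None:
        raise ValueError(f"not a simple gene symbol: {symbol!r}")
    class_symbol, number = match.groups()
    return class_symbol, int(number) if number else 1


def equivalent_symbols(a, b):
    """A bare class symbol and the same symbol with suffix 1 are equivalent."""
    return split_symbol(a) == split_symbol(b)


print(split_symbol("ACS1"))
print(split_symbol("MYB110"))
print(equivalent_symbols("CNR", "CNR1"))
```

Note that equivalence here concerns the symbol only. As the CNR example in the Introduction shows, identical symbols in different species do not necessarily refer to the same gene.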

Naming homoeologs, alleles, and splice variants

Within the same taxonomic family, it is advantageous to name genes after homoeologs that have already been named in closely related species. However, within the Rosaceae, genome duplications have occurred, resulting in nomenclature difficulties. For example, the Maloideae has undergone a genome duplication followed by a rearrangement leading to a haploid (x) chromosome number of 17, compared to 7, 8, or 9 of most other family members (Illa et al. 2011), thus presenting the possibility of having two genes that have a homoeologous relationship to a single gene in diploid strawberry or peach. Some effort has been made to give specific names to two genes derived from the whole genome duplication in apple by using related numerical suffixes. For example, two apple genes were named ARF1 and ARF101 following the strawberry homoeologous gene ARF1 (Devoghalaere et al. 2012). This solution was chosen at the time as a few studies had shown that homoeologous genes maintain a degree of conservation in function; for example, MYB10 (Espley et al. 2007) and MYB110 (Chagné et al. 2013) both control anthocyanin accumulation. Due to the practical difficulties in the identification of two genes with the same ancestral origin out of multiple homologs in a species that has undergone whole genome duplication, no specific naming convention is currently proposed. It is, however, recommended that at this time, genes arising by genome duplication be named sequentially within the gene families as is done for general homologs. When the gene symbols do not contain a number, a numeric suffix can be directly attached (e.g., PG1, PG2, etc.). If the gene symbol already has a number at the end, the numeric suffix should follow a period “.” (e.g., DHN3.1, DHN3.2, etc.). If a gene has a published name, then it should be kept.
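The sequential numbering rule for duplicated genes can be sketched as below. The function names are hypothetical, and the sketch covers only the suffix rule itself, not the decision of which genes belong to the same family.

```python
def numbered_symbol(base, n):
    """Attach n directly, or after a period if the base ends in a digit."""
    separator = "." if base[-1].isdigit() else ""
    return f"{base}{separator}{n}"


def name_family(base, count):
    """Name `count` genes sequentially from the same base symbol."""
    return [numbered_symbol(base, i) for i in range(1, count + 1)]


print(name_family("PG", 2))    # base symbol without a trailing number
print(name_family("DHN3", 2))  # base symbol that already ends in a number
```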

For naming alleles, we recommend a hyphen and a numeric suffix for the alleles (e.g., DHN3.1-1, DHN3.1-2) following the convention of the Arabidopsis community. For the multiple alleles found in diversity studies with multiple populations, authors are recommended to provide a table with alternate names for alleles. The alternate name will contain more information on the alleles, such as a suffix of database ID (e.g., NCBI accession number) instead of a numeric suffix. For example, the alternate name for DHN2.1-1 will be specified as DHN2.1-AB123456. For the species that have a whole genome sequence available, the sequences of the reference genome should serve as the wild type sequence. For naming splice variants, we recommend an underscore and a numeric suffix (e.g., DHN3.1_1, DHN3.1_2).
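The allele and splice variant suffixes can be sketched as a pair of formatters and a parser. All names are hypothetical. The pattern assumes a name carries either an allele suffix or a splice variant suffix, since the text does not describe how the two would be combined.

```python
import re
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    gene: str
    allele: Optional[str]           # numeric suffix or database accession
    splice_variant: Optional[int]


# Gene symbol, then optionally "-allele" or "_variant" (assumed exclusive).
NAME_RE = re.compile(
    r"(?P<gene>[A-Za-z]{2,5}\d*(?:\.\d+)?)"
    r"(?:-(?P<allele>[A-Za-z0-9]+)|_(?P<variant>\d+))?"
)


def allele_name(gene, suffix):
    """Hyphen plus a numeric suffix, or an accession for alternate names."""
    return f"{gene}-{suffix}"


def splice_variant_name(gene, number):
    """Underscore plus a numeric suffix."""
    return f"{gene}_{number}"


def parse_name(name):
    """Split a name into gene symbol, allele, and splice variant parts."""
    match = NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized name: {name!r}")
    variant = match.group("variant")
    return ParsedName(match.group("gene"), match.group("allele"),
                      int(variant) if variant else None)


print(allele_name("DHN3.1", 1))
print(allele_name("DHN2.1", "AB123456"))
print(parse_name("DHN3.1_2"))
```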

Facilitation of gene naming standardization

Researchers are encouraged to submit their gene naming data to GDR (http://www.rosaceae.org/data/submission) in addition to NCBI prior to or concurrent with publication. While NCBI does not accept named genes that do not come from single molecule sequencing, the GDR database will (see above).

When researchers submit data to GDR, they will be asked to provide the following types of data:

  1. Species: species from which the gene is sequenced.
  2. Species prefix: species prefix to be used in publications and presentations following the recommendations of this manuscript.
  3. Source germplasm: germplasm name and database/germplasm repository ID if available.
  4. Gene symbol: gene symbol that is composed of two to five letters and a numeric suffix, derived from the gene name.
  5. Gene name: gene full name derived from molecular function of the gene product, phenotype, or homology.
  6. Synonym or alias: any other gene symbols that have been used for the gene.
  7. Gene class symbol: the “root” of the gene symbol, composed of two to five letters, without the numeric suffix.
  8. Gene class symbol full name: full name of the gene class symbol (gene name without the numeric suffix).
  9. Gene model: gene model ID from the whole genome annotation if available.
  10. GenBank ID: GenBank accession number if available.
  11. Description: description of characteristics of the gene such as biochemical function, expression in particular tissue and/or growth stage, effects on phenotype, and location in subcellular component.
  12. Submitter: person who submitted the gene name data and email contact information.
  13. Category of evidence: the category of evidence as described above. If authors choose category D, detailed information needs to be given.
  14. Reference: citation of the publication where the gene has been published.
  15. Comments: any other comments.
  16. Sequence data and gene model structure: sequence in FASTA format and gene model structure (intron/exon, UTR, promoter, etc.) in a GFF file.

Table 3 shows examples of gene data to be submitted. Supplementary Table 1 shows the gene submission template, available also from GDR (http://www.rosaceae.org/data/submission). When new gene models, gene sequences, and/or splice variants are identified that are different from the whole genome sequence data, we recommend that users also submit, in addition to the gene data submission template, a FASTA file and a GFF file that contains gene model structure such as intron, exon, and promoter region.
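For groups that prepare submissions programmatically, the sixteen fields listed above could be held in a record such as the following. This is a sketch only and does not represent the GDR template or any GDR interface: the class, field names, choice of required fields, and checks are assumptions, the example values are illustrative rather than taken from Table 3, and the species prefix shown is derived from the general three-letter rule rather than read from Table 1.

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Evidence(Enum):
    """The four categories of evidence for a gene structure."""
    A = "cDNA sequencing"
    B = "transcriptome sequencing"
    C = "computational evidence"
    D = "other"


@dataclass
class GeneSubmission:
    species: str
    species_prefix: str
    gene_symbol: str
    gene_name: str
    gene_class_symbol: str
    gene_class_full_name: str
    submitter: str
    evidence: Evidence
    evidence_detail: str = ""            # needed when evidence is D
    source_germplasm: Optional[str] = None
    aliases: List[str] = field(default_factory=list)
    gene_model: Optional[str] = None
    genbank_id: Optional[str] = None
    description: str = ""
    reference: Optional[str] = None
    comments: str = ""
    fasta_file: Optional[str] = None
    gff_file: Optional[str] = None

    def problems(self):
        """Return a list of rule violations; empty if none were found."""
        issues = []
        if not re.fullmatch(r"[A-Za-z]{2,5}", self.gene_class_symbol):
            issues.append("class symbol should be two to five letters")
        if not self.gene_symbol.startswith(self.gene_class_symbol):
            issues.append("gene symbol should start with its class symbol")
        if self.evidence is Evidence.D and not self.evidence_detail:
            issues.append("category D requires detailed information")
        return issues


example = GeneSubmission(
    species="Malus × domestica",
    species_prefix="Mdo",  # assumed from the general rule; check Table 1
    gene_symbol="ACS1",
    gene_name="ACC SYNTHASE1",
    gene_class_symbol="ACS",
    gene_class_full_name="ACC SYNTHASE",
    submitter="Jane Doe <jane.doe@example.org>",  # placeholder contact
    evidence=Evidence.A,
)
print(example.problems())
```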

Table 3 Example of gene data

Conclusions

Establishing a standardized naming convention for Rosaceae species, compatible with what has been done in other plant research communities, will enable researchers to perform comparative analyses within as well as outside the Rosaceae. Guidelines for naming Rosaceae family genes have been developed and are presented here together with the invitation to store gene nomenclature as well as sequencing information in the central Rosaceae community database, GDR. We respectfully urge the plant research community to follow these suggestions; by doing so, we should all benefit from simplified literature reading and less confusion in reporting the identification and function of individual genes.