Naming and Annotation of Plasmids
Keywords: Coupling Protein; Submission Process; Note Line; Natural Plasmid; International Nucleotide Sequence Database
Genome sequences are being added to the public databases at phenomenal rates as sequencing becomes faster and cheaper. Plasmids are an important subset of extrachromosomal elements that often make significant contributions to the character of their hosts. Predicting these potential attributes depends on analyzing and cataloging the plethora of sequencing data in consistent and sensible ways. Currently, there is no consensus within the plasmid community on plasmid and gene names or on how to handle the annotation of plasmids during the submission process to databases such as GenBank, but there are good models that can form the basis of a general naming system. It is also important to have clearer rules for the naming of plasmid core functions such as replication, partitioning, and conjugative transfer, among others. This entry explores these issues and makes some proposals for a more sustainable and rational system for plasmid naming, annotation, and analysis as consensus is achieved among plasmid biologists. Based on the system used for naming plasmids from the Rhizobiaceae, each natural plasmid should be given a unique name that, wherever possible, indicates its natural host, the host used in plasmid capture experiments (exogenous isolation), or the source in metagenome sequencing projects. This unique designation will allow less ambiguous linkage to relevant experimental data.
From the perspective of those interested in plasmid genomes, the current standard of plasmid annotation is far from ideal, despite the large amount of effort that goes into the deposition of plasmid sequences in public databases. This can be contrasted with the better defined classification and annotation of viruses and bacteriophages. Plasmids do not have the defining features of phages (morphology, nucleic acid type, etc.) and often defy classification. Although the naming of a plasmid and its bacterial host is required for submission to NCBI (National Center for Biotechnology Information), GenBank, and other databases (see INSDC at http://www.insdc.org/, the International Nucleotide Sequence Database Collaboration between three databases GenBank, DDBJ, and ENA), the natural hosts for promiscuous plasmids and plasmids identified via metagenomic sequencing projects are often not known. In addition, plasmid capture experiments using exogenous isolation techniques (e.g., Ref. ), by their very nature, make it impossible to identify the original host. Plasmids are also extremely adept at changing their composition in ways that phages cannot. Plasmids consist of backbone replication genes plus a wide assortment of accessory genes including, but not limited to, conjugation and mobilization; resistance to antibiotics, heavy metals, pollutants, and organic solvents; virulence determinants; and mobile elements (transposons, insertion sequences, integrons, etc.) as well as many genes of known and unknown functions that appear to be duplicates of chromosomal genes. They also undergo deletion, amplification, recombination, and mutation resulting in inactivated replicons, pseudogenes, and many other genetic peculiarities. These features are not usually found in phages since maintenance of their genetic content and size constraints imposed by their capsid dimensions ensure a more rigorous uniformity.
Because there is no formal taxonomy scheme for plasmids, they present unusual challenges not found in other mobile genetic elements, which has led to ad hoc solutions surrounding the naming of plasmids and their genes.
The Key Issues
Natural plasmids can be divided into three groups: historically important plasmids, newly discovered plasmids that are studied in detail at the level of gene function, and plasmids that are by-products of genome or metagenome sequencing projects. Many of the latter, in all likelihood, will never receive much individual attention although they may inform the properties and evolution of a plasmid group. Any proposal for naming plasmids and their genes needs to accommodate these possibilities. The issues associated with plasmid sequences in the databases can be boiled down to two distinct points: naming and annotation. This entry raises some of these issues and makes suggestions to help resolve them, but there is a need for the wider plasmid community to debate them and adopt acceptable and sensible standards.
Plasmid naming is inconsistent and confusing for a variety of reasons. A process for the naming of new plasmids, including synthetic plasmids, was proposed by Novick et al. Plasmids were to be named using a lower case “p” followed by an alphanumeric designation composed of a two-letter combination that reflected the researcher’s institution followed by a number. Thus, pUC18 was the 18th plasmid from the University of California. However, the enormous number of plasmids that have been isolated or constructed since the mid-1970s has overwhelmed these simple rules. Either the rules are not followed or duplicate names arise because there is no simple way to check whether a plasmid name is unique. Added to this, the names of many of the well-studied plasmids such as F, R1, RK2, and ColE1 do not conform to these rules. Perhaps if these plasmids had been given new names 35 years ago, the naming of plasmids would have developed differently. Similarly, there has been no clear distinction between natural plasmids that make up the genomes of one or more strains or species and plasmids that have been created by recombinant DNA techniques or more recently by synthetic biology.
Plasmid annotation (naming of genes) is also in disarray for many reasons. Aside from inconsistent naming of genes and gene products or the lack of annotation altogether, the legacy of different names for the same functions has exacerbated the problem. The transfer genes in plasmid F were named in the order in which they were identified using groundbreaking genetic techniques. Thus, there is traYALEKB-Z (in no particular order; traZ was later dropped) followed by trbA-J. The transfer genes in plasmid RP4 were named, more usefully, according to their position in the two main operons Tra1 and Tra2 (traABC-M and trbABC-P). The virulence locus of the Agrobacterium tumefaciens Ti plasmid, which is involved in tumorigenesis in plants, consists of seven virulence operons, virA-G, with the cistrons in these operons being sequentially numbered, e.g., virB1-11. The VirB1-11 gene products as well as VirD4 have become paradigms for the type IV secretion system (T4SS) and associated coupling protein (T4CP), respectively. The coupling protein, which is required for conjugation, is named TraD in F, TraG in RP4, and VirD4 in the Ti plasmid. However, F also encodes, for instance, a mating pair stabilization protein called TraG that has an unrelated function. Well-meaning researchers have adopted gene names from these and other paradigmatic systems without regard to the function of the gene or the position of the cistron within the operon. For instance, many coupling proteins involved in conjugation are named VirD4 even though there is no evidence that they are involved in virulence or that they are the fourth gene in the fourth (D) operon within a vir locus.
Plasmids in GenBank
Several years ago, bacterial plasmids had their own link on the GenBank home page (http://www.ncbi.nlm.nih.gov/genbank/). However, the number of plasmids is now enormous and some overlying organizational framework was deemed necessary by the NCBI. It has now moved toward a system based on phylogeny of related organisms with plasmids found within certain species or subspecies listed alongside the host. The best list of complete plasmids, which includes plasmids with no known host that are included in RefSeq (see below), can be found within the NCBI FTP directory at the following address: ftp://ftp.ncbi.nih.gov/genomes/Plasmids/.
Although the NCBI website has become extremely complicated, with many subdirectories and links to web pages and PDFs that must be read, the annotation of sequences is fairly well explained. A useful overview is provided at http://www.ncbi.nlm.nih.gov/books/NBK21105/. The RefSeq platform (http://www.ncbi.nlm.nih.gov/RefSeq/) is also very helpful in providing examples of sequences annotated to GenBank standards. This includes organisms with multiple plasmids; the suggestions for naming these plasmids and the genes therein in a sequential order are well illustrated. It is interesting to compare the annotation for “plasmid F” in GenBank (AP001918.1) and RefSeq (NC_002483.1) and see the level of detail in the latter that was added to the original GenBank submission. The gold standards for annotation are probably the entries for E. coli K12 in GenBank at U00096 and in RefSeq at NC_000913, both of which have consistent annotation and are updated on a semi-regular basis. This illustrates the importance of updating annotations as more information accumulates, a process that is currently lacking in most cases.
A Proposal for Naming Plasmids
Although plasmid biologists generally agree that plasmid names should follow the scheme proposed by Novick et al., i.e., the format pAlphanumeric, using more than two letters where needed, this scheme has deficiencies with respect to both uniqueness and information content. A system based on plasmids from Rhizobiaceae could be used as the basis for creating the formal, unique name for each plasmid. As detailed below, the name starts with “p” to indicate “plasmid” and is followed by a series of elements that are unique for that plasmid. These elements give information about its host, strain, and sample source, or consist of the locus tag (which reflects this information), and end with a letter (for known plasmids using this system) or a number (for newly identified plasmids) that is unique when more than one plasmid is present within a given genome or metagenome.
For plasmids discovered and analyzed by classic procedures, the “p” designation should still be used as soon as a plasmid is identified but, for sequencing-based discovery, only complete, annotated plasmids should receive the “p” designation. Contigs that appear to contain plasmid sequences should be left as contigs until such time as it is clear that the whole sequence has been assembled and at least a preliminary annotation is completed. The rules for handling contigs are discussed at http://www.ncbi.nlm.nih.gov/genbank/wgs under Whole Genome Shotgun Submissions (WGS).
The “p” should be followed by a contraction of the genus and species names as in the case of restriction enzymes. Thus, Eco would signify E. coli. This would be followed by the strain designation, for instance, EcoK12, so that the F plasmid would be pEcoK12_1. In the name pRetCFN42f, Ret is derived from the host, Rhizobium etli, CFN42 is the strain designation, and “f” indicates that it is the sixth and largest plasmid found in the strain. This was (and is) a useful name for this plasmid even before it was sequenced. The proper name could be altered slightly to pRetCFN42_6. Note the use of an underscore between the strain designation and the plasmid number. The reason for changing over to numbers is to allow an open-ended system not limited by the number of letters available.
It would be helpful to have standard categories for plasmids that identify their source. Thus, natural plasmids could be split into endogenously isolated (isolated from a natural host, pEND), exogenously isolated (isolated by capture into a permissive host, pEXO), or reconstructed virtually from a metagenome project (pMET). Unnatural plasmids can be split into vectors (pVEC), constructs derived from vectors (pCON), derivatives of natural plasmids (pDER), or synthetic plasmids (pSYN). At some stage, these terms could also be incorporated into the long version of the plasmid name. As examples, pEcoK12_1 would be pEND_EcoK12_1 and pRetCFN42_6 would be pEND_RetCFN42_6.
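The construction of names for endogenously isolated plasmids can be sketched in a few lines of Python. This is only an illustration of the proposal, not an established tool: the function names are hypothetical, and the rule that punctuation in a strain designation (e.g., the hyphen in "K-12") is dropped is an assumption made here so that the output matches the examples in the text.

```python
import re


def host_code(genus: str, species: str) -> str:
    """Contract genus and species as for restriction enzymes, e.g. 'Eco'."""
    return genus[0].upper() + species[:2].lower()


def short_name(genus: str, species: str, strain: str, number: int) -> str:
    """Build p<Host><Strain>_<number>, e.g. 'pEcoK12_1'."""
    if number < 1:
        raise ValueError("plasmid numbers start at 1")
    # Assumption: punctuation in the strain designation (e.g. 'K-12') is dropped.
    strain_code = re.sub(r"[^A-Za-z0-9]", "", strain)
    return f"p{host_code(genus, species)}{strain_code}_{number}"


def endogenous_formal_name(genus: str, species: str, strain: str, number: int) -> str:
    """Add the source category: 'pEcoK12_1' becomes 'pEND_EcoK12_1'."""
    return "pEND_" + short_name(genus, species, strain, number)[1:]


print(endogenous_formal_name("Escherichia", "coli", "K-12", 1))
print(endogenous_formal_name("Rhizobium", "etli", "CFN42", 6))
```

Keeping the category prefix in a separate step mirrors the suggestion that the category terms could be incorporated into the long version of the name at a later stage.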
Alternatively, plasmids that have been sequenced and submitted to GenBank should make use of the locus tag, which is unique for each sequencing project, to generate a shorter, informal plasmid name. The locus tag is used to construct systematic gene identifiers for each gene within a complete genome (chromosome and plasmids). It is an alphanumeric of 3–12 characters where the first character must not be a digit (http://www.ncbi.nlm.nih.gov/genbank/genomesubmit#locus_tag). The locus tag contains information about the host and strain as well as the method of isolation that is required by GenBank during the sequence submission process. If the locus tag (REH) had been used for pEND_RetCFN42_6, the plasmid name would be pREH_6, and pEND_EcoK12_1 would be pFpla_1, where Fpla is the locus tag for the F plasmid.
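The locus tag rule quoted above (3–12 alphanumeric characters, first character not a digit) is simple enough to check mechanically before an informal name is built. The sketch below encodes only that rule as stated in this entry; the function names are hypothetical, and GenBank may apply further checks (such as uniqueness across projects) that are not modeled here.

```python
import re

# 3-12 alphanumeric characters, the first of which must not be a digit.
LOCUS_TAG_RE = re.compile(r"[A-Za-z][A-Za-z0-9]{2,11}")


def is_valid_locus_tag(tag: str) -> bool:
    """Check a locus tag against the length and character rule in the text."""
    return LOCUS_TAG_RE.fullmatch(tag) is not None


def informal_name(locus_tag: str, number: int) -> str:
    """Build the informal plasmid name p<locus tag>_<number>."""
    if not is_valid_locus_tag(locus_tag):
        raise ValueError(f"not a valid locus tag: {locus_tag!r}")
    if number < 1:
        raise ValueError("plasmid numbers start at 1")
    return f"p{locus_tag}_{number}"


print(informal_name("REH", 6))   # informal name for pEND_RetCFN42_6
print(informal_name("Fpla", 1))  # informal name for pEND_EcoK12_1
```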
In the case of exogenously isolated plasmids, the host used in the plasmid capture experiment and the source could be described in a series of three or more letter codes. Thus, a plasmid isolated from Mildred Lake in Northern Alberta, using exogenous techniques, would be named pEXO_PaeO_LMI_1, which indicates that Pseudomonas aeruginosa strain O was used to exogenously capture plasmid “1” from an isolate from Lake Mildred (LMI). Alternatively, if the plasmid sequence was submitted to GenBank, the locus tag could be used to give pPAO_1 where PAO is the locus tag for this sequencing project.
For plasmids identified within metagenome sequencing projects, the locus tag prefix, which is assigned once the Metagenome BioProject ID has been given, could again be used. See http://www.ncbi.nlm.nih.gov/genbank/metagenome for details. The source should be designated in a way that is acceptable to GenBank and the plasmids should be numbered rather than lettered since this does not limit the number of plasmids that could be found. Thus, pMET_BHO_SKN_4 would be the formal name for the 4th plasmid from skin (SKN would be a GenBank-approved designation) in a metagenomic project from Birmingham Hospital, which has been given the locus tag BHO. Using only the locus tag, the informal designation would be pBHO_4.
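Because the three formal name patterns for natural plasmids differ only in the number of underscore-separated fields, they can be taken apart again by a small parser. The following is a sketch under that reading of the proposal; the PlasmidName type and its field names are hypothetical, and the parser accepts only single plasmid names, not ranges.

```python
from typing import NamedTuple, Optional


class PlasmidName(NamedTuple):
    category: str          # END, EXO, or MET
    origin: str            # host plus strain (END, EXO) or locus tag (MET)
    source: Optional[str]  # sample source code; absent for END names
    number: int            # plasmid number within the genome or metagenome


def parse_formal_name(name: str) -> PlasmidName:
    """Split a formal natural-plasmid name into its components."""
    if not name.startswith("p"):
        raise ValueError("formal names start with 'p'")
    parts = name[1:].split("_")
    category = parts[0]
    if category == "END" and len(parts) == 3:
        origin, source, number = parts[1], None, parts[2]
    elif category in ("EXO", "MET") and len(parts) == 4:
        origin, source, number = parts[1], parts[2], parts[3]
    else:
        raise ValueError(f"unrecognized formal name: {name!r}")
    if not number.isdigit():
        raise ValueError("the final field must be the plasmid number")
    return PlasmidName(category, origin, source, int(number))


for example in ("pEND_EcoK12_1", "pEXO_PaeO_LMI_1", "pMET_BHO_SKN_4"):
    print(parse_formal_name(example))
```

A parser of this kind would only work if the individual codes never contain an underscore themselves, which is one reason to keep the underscore reserved as a field separator.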
Since the resulting formal official names are quite lengthy, traditional names or shorter names derived from the locus tag could be used. Authors will normally state the official name as well as the shorter name used in their publications. It should then be possible to give older plasmids official names, but retain their short names, for example, F or RP4, for everyday usage.
If a sequencing project reveals the presence of more than one plasmid, they could be numbered consecutively from largest to smallest in size (or vice versa in the tradition of studies on plasmids from Rhizobiaceae). For example, the plasmids pREB1-9 could be the informal names for pEND_AmaMBIC11017_1-9 from Acaryochloris marina, strain MBIC11017 (see NC_009926-34), where REB is the locus tag for the A. marina MBIC11017 genome. In the event that the same plasmid is identified in two separate sequencing projects, each copy should be named according to the rules outlined above. Thus, identical plasmids from separate sources should have different formal names, but their identity can be noted in the databases or relevant publications. Informal names can reflect either the locus tag for that particular project or the first or accepted name for that plasmid.
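Consecutive numbering by size is easy to automate once the replicons of a project have been assembled. The sketch below assigns numbers from largest to smallest by default and in the opposite direction on request; the replicon labels, sizes, and the locus tag XYZ in the usage example are invented purely for illustration.

```python
from typing import Dict


def number_by_size(sizes_bp: Dict[str, int], prefix: str,
                   largest_first: bool = True) -> Dict[str, str]:
    """Map each replicon label to <prefix>_<n>, with n assigned by size rank."""
    ordered = sorted(sizes_bp.items(), key=lambda item: item[1],
                     reverse=largest_first)
    return {label: f"{prefix}_{rank}"
            for rank, (label, _size) in enumerate(ordered, start=1)}


# Hypothetical replicons from one sequencing project with locus tag XYZ.
replicons = {"contig_a": 5_000, "contig_b": 120_000, "contig_c": 30_000}
print(number_by_size(replicons, "pXYZ"))
print(number_by_size(replicons, "pXYZ", largest_first=False))
```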
Modern sequencing methods are often incapable of circularizing a plasmid or, in the case of linear plasmids, providing sequence at the ends of the plasmid. Incomplete sequences represent a valuable source of information and should be submitted to GenBank as contigs but should not be given plasmid names. If further information becomes available, the database entry should be updated and the corrections noted. Examination of current practices in GenBank and RefSeq should illustrate what is currently acceptable.
Naming Plasmid Genes
The protocol for naming genes and gene products is well described in GenBank under the Prokaryotic Annotation Guide (http://www.ncbi.nlm.nih.gov/GenBank/genomesubmit_annotation). Below is the entry for F plasmid TraD from RefSeq:
/db_xref="GeneID:1263585"
/experiment="experimental evidence, no additional details recorded"
/note="type IV secretion system coupling protein; similar to F plasmid TraD"
The F plasmid (pFpla_1) has 108 genes, which are given sequentially numbered locus tags Fpla1-108 with traD being the 104th gene. In the case of traD, extensive experimental evidence exists regarding its function although no updates providing these references are given, as suggested by the /experiment line. The /note line indicates its putative function with “putative” being the word of choice at GenBank (refrain from using “potential” or “predicted”). The /note line could be read as “ATPase; type IV secretion system-associated coupling protein; similar to F plasmid TraD; VirD4 family.” This indicates that F TraD is a paradigm for a subset of coupling proteins within the VirD4 family.
In the absence of any experimental data, a situation more and more frequently encountered, genes should remain as locus tag designations since they are unique. The /gene field need not be filled in. In this hypothetical case, Fpla104 would be referred to as gene Fpla104 in the literature. If the function of a plasmid gene product is highly probable and there has been some human oversight in the annotation process (as opposed to routine automated annotation), a more descriptive gene name that describes known or putative phenotypes based on the literature can be used. The following are generally recognized as key gene names in plasmids: rep, par, cop, inc, stb, tra, sfx, eex, rlx, pri, ssb (replication, partition, copy number, incompatibility, stability, transfer, surface exclusion, entry exclusion, relaxase, primase, single-stranded DNA binding protein), and others could surely be added to the list. For instance, cpl could be used for the coupling protein since it is essential in the conversion of a T4SS to a conjugative system and deserves its own gene name. The usual format for gene names (abcD), as described in Demerec et al., should be used with the last, uppercase letter reflecting the placement of that particular gene within an operon. In long operons such as the transfer operon in plasmid F (33 kb), genes with no known function or with a function that is most likely not involved in transfer could remain as the locus tag designation. Some plasmids have adopted a different naming scheme, most notably the Ti family of plasmids in A. tumefaciens, whereby the gene and gene product names reflect the operon as well as the position within the operon, e.g., virD4. This scheme is well entrenched in the literature and has many advantages that need to be debated by the plasmid community, but it is not the currently accepted practice at NCBI.
/gene is the unique gene name within that genome using the standard bacterial gene naming conventions, either abcA or abcA1.
/locus_tag is normally an alphanumerical designator linked to the name of the genome and the position of the gene in that genome. It can be used as the informal gene name in the absence of detailed annotation.
/note allows reportage of the likely function identified/predicted by bioinformatics and other analyses.
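The relationship between the /gene and /locus_tag qualifiers can be expressed as a simple fallback rule: use the gene name when one has been assigned in the standard format, otherwise refer to the gene by its locus tag. The sketch below assumes the abcA or abcA1 pattern mentioned above is the only accepted form; the function names are hypothetical.

```python
import re
from typing import Optional

# Three lowercase letters, one uppercase letter, optional digits: abcA or abcA1.
GENE_NAME_RE = re.compile(r"[a-z]{3}[A-Z][0-9]*")


def is_standard_gene_name(name: str) -> bool:
    """Check a name against the abcA / abcA1 format."""
    return GENE_NAME_RE.fullmatch(name) is not None


def display_name(locus_tag: str, gene: Optional[str] = None) -> str:
    """Return the /gene name if it is well formed, else the /locus_tag."""
    if gene is not None and is_standard_gene_name(gene):
        return gene
    return locus_tag


print(display_name("Fpla104"))          # no gene name assigned yet
print(display_name("Fpla104", "traD"))  # experimentally characterized gene
```

Note that a pattern check of this kind says nothing about whether a name is appropriate: virD4 is well formed, yet may still be a misleading name for a transfer gene on a non-virulence plasmid.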
Plasmids often carry other, smaller, mobile elements such as insertion sequences, transposons, and integrons or gene clusters for well-characterized traits such as antibiotic or heavy metal resistance, and virulence. The conventions for naming these elements and genes should be followed as outlined in the Annotation Guide and in references such as Siguier et al. for IS elements.
Problems with Using Top BLAST Hits for Annotation
The bugbear in the naming of genes appears to be BLAST, a wonderful tool that often instills unwarranted confidence in its user. Many annotations reflect unswerving faith in the ability of BLAST to correctly name a gene. This often leads to naming new genes after the number one hit that has the highest identity in a BLAST search, regardless of whether this name makes sense or not. It may be that the closest related gene does indeed perform a similar function but that gene was incorrectly named during its submission process. Not only is this confusing, but it has the potential to propagate incorrect gene names, which pop up time and time again in subsequent BLAST searches.
One of the most egregious examples is the naming of newly found genes/gene products that are associated with type IV secretion systems after the Vir gene products of the Ti plasmid. Naming a transfer gene in a non-virulent plasmid virD4, for example, is confusing to anyone not intimately familiar with the coupling protein literature. Membership within the VirD4 family should be reserved for the /note line during the annotation process. Once there is some evidence for the function of a particular gene product, including evidence from a literature search, the gene can be named as described above, for instance, cplA. When deciding on gene product function, some tools are better than others. Because of the problems associated with BLAST, where erroneous functions are propagated in an alarming manner, care must be taken that the correct function is assigned. Swiss-Prot (http://web.expasy.org/groups/swissprot/) is probably the most informative and accurate databank for assigning function, followed by RefSeq at NCBI, where sequences are annotated by NCBI staff. The next most reliable source is annotation done by individual labs, followed by the great bulk of sequences entered into GenBank after automated annotation without much formal review.
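One way to act on this ordering of sources is to rank candidate hits by the reliability of their annotation first and by sequence identity only second, rather than simply taking the top BLAST hit. The sketch below is an illustration of that idea, not a published method: the tier labels, the dictionary layout of a hit, and the example values are all assumptions made here.

```python
from typing import Dict, List

# Lower rank means more reliable, following the ordering given in the text.
SOURCE_RANK = {"swissprot": 0, "refseq": 1, "lab": 2, "automated": 3}


def most_reliable_hit(hits: List[Dict]) -> Dict:
    """Pick the hit from the best source tier; break ties by higher identity."""
    if not hits:
        raise ValueError("no hits to choose from")
    return min(hits, key=lambda h: (SOURCE_RANK[h["source"]], -h["identity"]))


# Invented example hits for a putative coupling protein gene.
hits = [
    {"source": "automated", "identity": 99.0, "product": "VirD4"},
    {"source": "lab", "identity": 91.0, "product": "conjugal transfer protein"},
    {"source": "swissprot", "identity": 85.0,
     "product": "type IV secretion system coupling protein"},
]
print(most_reliable_hit(hits)["product"])
```

With these invented values the Swiss-Prot description is chosen even though the automated entry has the highest identity, which is the behavior the text argues for.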
Although it would have been helpful to discuss nomenclature for plasmids and their genes ten years ago, it is never too late to start. In the absence of guidelines from the plasmid community, researchers have used their common sense as well as followed the GenBank rules for submission to produce, for the most part, useful annotations of plasmids. However, the growing confusion about plasmid and gene names needs to be addressed. The simple solutions described above are proposed to initiate discussion within the plasmid biology community. The lists of standardized abbreviations used in naming plasmids and genes should be curated by the databases. Once the plasmid community has decided on these standardized abbreviations, the annotation process itself would be simplified, and this would aid database managers in understanding plasmid provenance and gene function. Since this proposal is not particularly complex and does not require extensive maintenance by interested parties, it should be relatively easy to make the case for why it should be generally adopted. The ability to use both long and short or common names, as is done for enzymes, for instance, should avoid the charge of a bureaucracy gone mad. The naming of genes remains more problematic than the naming of plasmids because of the sheer numbers of genes and the multiple errors currently present and being propagated throughout the databases. Using BLAST to generate names for new genes, coupled with the use of automated annotation services that are insensitive to these errors, has contributed mightily to this situation. These simple solutions will hopefully generate discussion within the plasmid community and with automated annotation services. Over time, problematic gene names should be diluted out or updated.
The authors wish to thank Celeste Brown (University of Idaho), Bill Klimke (NCBI), and Miguel Cevallos (CFN, Mexico) for useful discussions.
- 13. FTP directory of genomes/plasmids. ftp://ftp.ncbi.nih.gov/genomes/Plasmids/. Accessed 16 Mar 2014
- 14. International Nucleotide Sequence Database Collaboration (INSDC). http://www.insdc.org/. Accessed 16 Mar 2014
- 15. NCBI GenBank home page. http://www.ncbi.nlm.nih.gov/genbank/. Accessed 16 Mar 2014
- 16. NCBI GenBank bacterial genome submission guide (annotation). http://www.ncbi.nlm.nih.gov/GenBank/genomesubmit_annotation. Accessed 16 Mar 2014
- 17. NCBI GenBank bacterial genome submission guide (locus tag). http://www.ncbi.nlm.nih.gov/genbank/genomesubmit#locus_tag. Accessed 16 Mar 2014
- 18. NCBI GenBank metagenome submission guide. http://www.ncbi.nlm.nih.gov/genbank/metagenome. Accessed 16 Mar 2014
- 19. NCBI GenBank whole genome shotgun submissions guide. http://www.ncbi.nlm.nih.gov/genbank/wgs. Accessed 16 Mar 2014
- 20. NCBI handbook. http://www.ncbi.nlm.nih.gov/books/NBK21105/. Accessed 16 Mar 2014
- 21. NCBI reference sequence database. http://www.ncbi.nlm.nih.gov/RefSeq/. Accessed 16 Mar 2014
- 22. Swiss-Prot Group. http://web.expasy.org/groups/swissprot/. Accessed 16 Mar 2014