Background

The field of genomics has been expanding at a rapid pace since the annotated Escherichia coli K-12 genome was published in 1997 [1], with the current number of published genomes exceeding 66 and with another 364 on their way according to the Genomes OnLine Database (GOLD) [2]. Deciphering the functions encoded by all gene products of the genomes is the next big challenge in the field. Function attributions through experimental, biochemical and genetic analyses and through bioinformatic studies are continuing, and microarray technology is shedding additional light on the functions associated with the gene products of the organism in question. The wealth of biological information on E. coli is still increasing [3] and is contributing to a better understanding of this organism as well as of functions encoded in other organisms. It is therefore important that the most up-to-date information on E. coli gene products is available and used by researchers.

Several databases have been assembled for various areas of knowledge about the E. coli genome [4,5,6,7,8,9]. Each compilation has a different emphasis and collects different sets of information related to the function of the gene products. In the GenProtEC database, we have been curating information on physiological function and modular construction of gene products. Other databases most closely related to ours include EcoCyc, with emphasis on metabolic pathways [6], the CGSC database, with information on the genotypes and phenotypes of mutant strains [8], and EcoGene, which includes information on gene reconstructions, alternative gene boundaries and verified amino-terminal amino-acid sequences of the mature proteins [5]. The E. coli genome project at the University of Wisconsin-Madison presents genome data on E. coli K-12 and pathogenic enterobacteria [9].

We present a functional update for E. coli K-12 gene products that incorporates information from the literature and referenced databases obtained since the 1997 GenBank deposit. Our focus has been the biological function of the gene products. Coding sequences (CDSs) encoding proteins whose function previously was imputed or not known were re-evaluated, and putative functions were assigned by manually evaluating the results from BLAST and DARWIN (data analysis and retrieval with indexed nucleotide/peptide sequences) analyses. The MAGPIE (multipurpose automated genome project investigation environment) genome annotation system [10] was also applied. MAGPIE detected alternative boundaries for some of the open reading frames (ORFs).

Results

Number of genes in the E. coli K-12 genome

For the initial annotation of the E. coli K-12 genome [1], 4,404 genes were identified with Blattner numbers (Bnums). Among the genes, 4,288 were believed to encode proteins and 116 to encode RNAs. Since then six Bnums have been retired: b0322, b0395, b0663, b0667, b0669 and b0671 (G. Plunkett, personal communication). In addition, three new genes have been identified and assigned Bnums. These include the protein-coding b4406 (yaeP, SWISS-PROT P52099) and b4407 (thiS, SWISS-PROT O32583) and the RNA-encoding b4408. The current number of E. coli genes is 4,401, with 4,285 encoding proteins and 116 encoding RNAs.

MAGPIE identified 5,527 candidate CDSs, each of which was assigned a MAGPIE identifier (Magnum) (see MAGPIE [11] for details). The 4,285 CDSs identified by Bnums were also identified with Magnums. Variations were detected in either the start or the stop position for 1,077 of these CDSs, resulting in differences in the encoded proteins ranging from 1 to 147 amino acids, the latter in PtsA (Bnum b3947, Magnum ec_6103). The other Magnum-identified candidate CDSs include retired Bnums (six Magnums), CDSs located between the boundaries of Bnums (506 Magnums), and CDSs overlapping existing Bnums (730 Magnums). Among the Magnums located between the boundaries of Bnums are 21 CDSs that encode proteins of 80 or more amino acids. One of the CDSs located between Bnum boundaries (Magnum ec_2510) lies between b1624 and b1625 and encodes a protein of 66 amino acids. The carboxy-terminal 41 amino acids of this CDS are identical to the amino-acid sequence of the recently characterized beta-lactam resistance protein Blr (SWISS-PROT P56976) located at the same position [12]. Other Magnums located between Bnum boundaries may correspond to short E. coli proteins.

Functional annotation of E. coli K-12 gene products

The functional assignments of the E. coli gene products in the November 1997 GenBank U00096 deposit represented an accumulation of information retrieved from the literature (collected in the GenProtEC and EcoCyc databases) as well as imputed functions based on similarity of the translated sequences to a known protein [1]. Since the deposit to GenBank was made, our database GenProtEC has been continually updated with knowledge on E. coli gene products appearing in the literature [3,13]. Information on transcriptional regulators has been incorporated from the work of J. Collado-Vides [14,15], and transport protein information has been adapted from the work of M.H. Saier and I.T. Paulsen [16,17]. GenProtEC also contains imputed function assignments based on sequence similarity to orthologous or paralogous proteins, on gene (operon) location and on phenotypes of mutants [18].

Gene products whose functions were known were not considered further for the functional update. The remaining 2,294 CDSs whose gene products had a putative or unknown function assignment were analyzed using BLAST and DARWIN. BLAST analyses were carried out for both the Bnum- and the Magnum-derived protein sequences. The results for the Bnum-derived protein sequences and the automatic functions predicted by MAGPIE or HERON (human-emulated reasoning for objective notations) were manually evaluated and imputed functions were assigned. Although the manual annotation step could not compete with the speed of the automatic annotation process of HERON, it provided us with more useful function descriptions. A comparison of the manually assigned putative functions with the HERON-predicted functions showed that, leaving aside issues of specificity, a nearly equivalent function was predicted in 46% of the cases, whereas in 52% of the cases less information was obtained with HERON.

After the function update of the 2,294 CDSs, 1,306 gene products were assigned a putative function and 126 gene products were described by a phenotype. The remaining gene products were given one of the following three assignments: 'conserved protein', where sequence-similar matches were found but the function could not be determined in the absence of consistent functions reported for the matching sequences; 'conserved hypothetical protein', where sequence-similar matches existed but these had no associated function; 'unknown CDS', where the translated sequence had no known sequence match outside E. coli. The current function description includes 256 conserved proteins, 282 conserved hypothetical proteins and 324 unknown CDSs. The 862 gene products with no function assignment represent 19.6% of the E. coli chromosomal genes, and the unknown CDSs at this time represent 7.4% of E. coli genes.
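The category counts and percentages quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python; the variable names are illustrative and are not taken from GenProtEC or MAGPIE, and the numbers are simply those given in the text.

```python
# Counts quoted in the text; all names here are illustrative only.
total_genes = 4401          # current number of E. coli K-12 genes (Bnums)
reanalysed_cds = 2294       # CDSs with a putative or unknown function before the update

categories = {
    "putative function": 1306,
    "phenotype": 126,
    "conserved protein": 256,
    "conserved hypothetical protein": 282,
    "unknown CDS": 324,
}

# The five categories should account for every re-analysed CDS.
assert sum(categories.values()) == reanalysed_cds

# Gene products still lacking a function assignment.
no_function = (
    categories["conserved protein"]
    + categories["conserved hypothetical protein"]
    + categories["unknown CDS"]
)

pct_no_function = round(100 * no_function / total_genes, 1)
pct_unknown = round(100 * categories["unknown CDS"] / total_genes, 1)

print(no_function, pct_no_function, pct_unknown)
```

Both percentages are taken relative to all 4,401 genes, including the RNA genes, which is the denominator that reproduces the figures in the text.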

A sample of the annotated E. coli K-12 genes is shown in Table 1. Each gene is identified by a Bnum, Magnum, gene product type and gene product (Function April 2001). A complete table of the current 4,401 Bnums is available as an additional data file online and at MAGPIE [11]. In this table the genes are identified by their Bnum, Bnum_module, Bnum start and stop position, Magnum, Magnum start and stop position, and gene product type. The functions of the gene products are described by the currently annotated function (Function April 2001) and by the function in the GenBank deposit (Function November 1997). A continually updated table that contains the functions of E. coli gene products is available through GenProtEC [4].

Table 1 A sample of annotated E. coli K-12 genes

Many changes are evident when comparing the updated annotation to that of 1997. The number of CDSs without function assignment has been reduced from 1,354 to 862. This reduction is due to functions being experimentally determined (77 CDSs), assignment of putative functions (367 CDSs), phenotype-associated functions (14 CDSs), and genes identified as belonging to phages (138 CDSs). In addition, inferred function assignments were withdrawn for 104 CDS-coded proteins whose functions remain unknown.

The number of gene products with putative function assignments has changed from 1,120 to 1,306. New functions were inferred for 473 CDSs. Putative function assignments were also removed as a result of new experimental data (175 CDSs), assignment of phenotype (8 CDSs) or reassessment of putative function assignments (104 CDSs).
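The changes between the 1997 and 2001 annotations described above can be reconciled as a simple ledger of additions and removals. The sketch below, in Python, uses only the counts given in the text; the dictionary keys are shorthand labels introduced here for illustration.

```python
# Ledger for CDSs without a function assignment (counts from the text).
no_function_1997 = 1354
left_no_function = {
    "function experimentally determined": 77,
    "putative function assigned": 367,
    "phenotype-associated function": 14,
    "identified as phage genes": 138,
}
joined_no_function = 104  # inferred assignments that were withdrawn

no_function_2001 = (
    no_function_1997 - sum(left_no_function.values()) + joined_no_function
)

# Ledger for gene products with a putative function assignment.
putative_1997 = 1120
newly_inferred = 473
left_putative = {
    "new experimental data": 175,
    "phenotype assigned": 8,
    "putative assignment reassessed": 104,
}

putative_2001 = putative_1997 + newly_inferred - sum(left_putative.values())

print(no_function_2001, putative_2001)
```

The 104 reassessed putative assignments appear in both ledgers: they leave the putative class and enter the class without a function assignment.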

Proteins as modular entities

Some of the proteins encoded in the E. coli genome have arisen through fusion of two or more genes. Examples of such gene fusions are the multifunctional enzymes Aas (2-acylglycerophospho-ethanolamine acyl transferase and acyl-acyl carrier protein synthetase) and GlmU (N-acetyl glucosamine-1-phosphate uridyltransferase and glucosamine-1-phosphate acetyl transferase) [19,20]. We have chosen to deal with proteins as modular entities where a module is defined as a protein element that has at least 100 amino-acid residues, carries a biological function and is presumed to have an independent evolutionary history [21]. Most modules in E. coli are individual proteins. They can, however, also be part of a protein where multiple modules have been joined by gene fusion, as is the case for Aas and GlmU. Other protein types in E. coli such as transporters and regulators also involve gene fusion events. The current modular assignments are based on analysis of protein sequences within E. coli K-12 (P. Liang and M. Riley, unpublished data).

There are at present 287 compound genes identified in the E. coli genome, each containing two to four modules. Table 2 contains a list of multimodular proteins where each module encodes a distinct function. Enzymes, transporters and regulators are all present in the list. The majority of modular proteins, 217, contain modules belonging to different paralogous groups (data not shown). Other multimodular proteins appear to be a result of internal duplication (56 genes) or a combination of gene fusion and duplication (14 genes). The E. coli chromosome is currently represented by 4,401 genes encoding 116 RNAs and 4,616 protein modules. Additional modules are expected to be identified upon analysis of protein sequences from other genomes (P. Liang and M. Riley, unpublished data). Examples are the bifunctional proteins ThrA (aspartokinase I and homoserine dehydrogenase I) and MetL (aspartokinase II and homoserine dehydrogenase II) where only the amino-terminal modules representing the kinase activities have been identified on the basis of their sequence similarity to the E. coli unimodular aspartokinase III (LysC). Both the amino-terminal aspartokinase and the carboxy-terminal homoserine dehydrogenase activities of ThrA and MetL have been verified with biochemical and genetic tools [22,23]. The module representing the dehydrogenase activity has not been identified by matching internal paralogs as E. coli itself does not contain a unimodular sequence-similar dehydrogenase. Saccharomyces cerevisiae, however, does contain a unimodular sequence-similar homoserine dehydrogenase (DhoM, SWISS-PROT P31116), which can be used in identifying the carboxy-terminal module. Thus, by detecting orthologous matches to parts of genes we will be able to identify additional multimodular proteins.

Table 2 A sample of multimodular gene products of E. coli K-12

Current status

Table 3 presents a summary of the gene products encoded in the E. coli K-12 genome represented as modular entities. Half of the modules have been experimentally characterized. Enzymes are the largest gene product type, representing 43.9% of the characterized gene products and 34.2% of the total gene products. Other major gene product types are transporters and regulators. Among the remaining modules, 60% have function predictions. The gene products without a function assignment still constitute a significant portion of the E. coli genome (19% of modules). A summary of the development of information on E. coli gene products over the past eight years is shown in Table 4. It is evident that much knowledge has been gained since these analyses began in 1993 [3,24].

Table 3 Gene products encoded by the E. coli K-12 chromosome
Table 4 History of distribution of gene product types for E. coli K-12

Discussion

An updated version of the function assignments for E. coli K-12 gene products has been presented using the genes identified in the GenBank U00096 deposit. Alternative gene boundaries were produced by MAGPIE. The MAGPIE genome annotation system also identified candidate CDSs that may represent gene products not identified in the GenBank U00096 deposit. Small ORFs with biological activity are likely to be abundant in the organism but await verification by biological data. Undoubtedly, the intergenic regions of E. coli K-12, as studied by Rudd [25] and Bachellier et al. [26], are also important for the function and regulation of gene products.

The percentage of identified chromosomal gene products without a function assignment is decreasing and is currently 19.6%. Only 7.4% of E. coli genes have no match in current sequence databases. This number will be further reduced with the release of the annotated genomes of Salmonella, Shigella and other closely related organisms. Preliminary data show that the number of unknown CDSs (ORFs encoding proteins without sequence-similar matches) will be less than 170 after data on the Salmonella typhimurium genome is included (M.H.S., unpublished data).

The function assignments presented here mainly represent the molecular functions of the gene products. With the generation of microarray data, gene products will also be characterized to a greater degree by the role they play in the cell under specific conditions. We have recently developed a classification system for cellular functions of E. coli K-12 gene products and have assigned more than one cellular role to some gene products where this is appropriate [27]. There is also a need for a more uniform way of describing both the molecular and cellular roles of gene products among diverse organisms, and this issue is currently being addressed by the Gene Ontology Consortium [28].

Conclusions

We have presented a functional update of the gene products encoded by the genes of E. coli K-12 identified in the GenBank Accession U00096 deposit. The E. coli proteins were treated as modular entities where a module is at least 100 amino acids, carries a biological function, and has an independent evolutionary history. The functional update was performed by manual evaluation of the data obtained from GenProtEC, BLAST and DARWIN analyses, and MAGPIE annotation. A table containing the updated function assignments of E. coli K-12 gene products is available as an additional data file online, and at GenProtEC [4] and MAGPIE [11]. We believe these data will be valuable for analysis of E. coli K-12 itself as well as for the analysis of gene products encoded by other genomes.

Materials and methods

Automated annotation

MAGPIE ORF prediction

A three-step approach to ORF prediction was taken to prepare the MAGPIE project for E. coli. GLIMMER 2.0 with a minimum ORF length of 80 nucleotides was initially used to create the base set of predictions [29]. GLIMMER 2.0 was run with all default parameters, as recommended in the documentation [29], and trained on the annotated set of ORFs from the Blattner et al. release of 1997 [1]. Because GLIMMER selectively identifies ORFs that match a statistical model of a gene for the organism [29], it may miss genes that were laterally transferred or acquired more recently from other genomes. We therefore chose to combine the GLIMMER predictions with those of a syntactic tool encoded within MAGPIE. This tool identifies stop codons and then 'backtracks' to the farthest upstream acceptable in-frame start codon and defines this as the ORF [10]. A non-redundant set of all GLIMMER ORFs plus syntactic ORFs between GLIMMER ORFs was generated. Finally, ORFs annotated by Blattner et al. that were not present in the non-redundant set were added to the MAGPIE project.
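The syntactic step can be illustrated with a short sketch. The Python code below is not the MAGPIE implementation; it is a minimal illustration that assumes ATG, GTG and TTG as acceptable start codons, scans the forward strand only, and uses a hypothetical function name and a min_len parameter (in nucleotides) introduced here. Instead of literally backtracking from each stop codon, it remembers the first start codon seen since the previous in-frame stop, which yields the same farthest-upstream start.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of acceptable start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def syntactic_orfs(seq, min_len=80):
    """Return sorted (start, end) half-open coordinates of forward-strand ORFs.

    For every in-frame stop codon, the ORF runs from the farthest upstream
    start codon that follows the previous in-frame stop, through the stop
    codon itself. ORFs shorter than min_len nucleotides are discarded.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        first_start = None  # farthest upstream start since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                if first_start is not None and (i + 3 - first_start) >= min_len:
                    orfs.append((first_start, i + 3))
                first_start = None  # a stop closes the current reading frame
            elif codon in START_CODONS and first_start is None:
                first_start = i
    return sorted(orfs)


# Toy sequence: ATG at position 2, an internal GTG, and a TAA stop in frame.
example = "CCATGAAAGTGCCCTAAGG"
print(syntactic_orfs(example, min_len=9))
```

In the toy sequence the internal GTG is ignored because an in-frame ATG lies farther upstream, so a single ORF spanning both is reported. A complete tool would also scan the reverse complement and then remove candidates redundant with the GLIMMER set, as described above.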

BLAST analysis

The CDSs were compared to the NCBI nucleotide (nt) and non-redundant protein (nr) databases using gapped BLAST [30]. Protein-sequence motifs were identified by PROSITE [31]. A search against the MAGPIE-predicted proteins of over 40 completed genomes, including the previously annotated E. coli set, was also performed.

Functional annotation

Automated function annotation was provided using HERON. Description lines with low information content (for example, descriptions containing words such as "hypothetical" or "putative") were filtered out. HERON then calculated word frequencies in the remaining descriptions, identified the top three most common words, and selected the description of the highest-scoring sequence match (for homology comparisons) with one or more high-frequency words. The selected description became the automated annotation for the coding region.
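The selection procedure described above can be sketched as follows. This is not the HERON source code; it is a minimal Python illustration in which the function name, the tokenization rule and the (score, description) input format are assumptions made here, and the list of low-information words contains only the two examples named in the text.

```python
import re
from collections import Counter

LOW_INFORMATION_WORDS = {"hypothetical", "putative"}  # examples from the text


def tokenize(description):
    """Split a description line into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", description.lower())


def heron_like_annotation(hits, top_n=3):
    """Pick an annotation from (score, description) database hits.

    1. Drop descriptions containing a low-information word.
    2. Count word frequencies over the remaining descriptions.
    3. Return the description of the highest-scoring hit that contains
       at least one of the top_n most frequent words, or None.
    """
    informative = [
        (score, desc)
        for score, desc in hits
        if not LOW_INFORMATION_WORDS & set(tokenize(desc))
    ]
    counts = Counter(word for _, desc in informative for word in tokenize(desc))
    top_words = {word for word, _ in counts.most_common(top_n)}

    for _, desc in sorted(informative, key=lambda hit: hit[0], reverse=True):
        if top_words & set(tokenize(desc)):
            return desc
    return None


hits = [
    (250.0, "hypothetical protein"),
    (200.0, "ABC transporter permease"),
    (180.0, "ABC transporter ATP-binding protein"),
    (90.0, "transporter"),
]
print(heron_like_annotation(hits))
```

In this toy example the best-scoring hit is discarded as uninformative, and the next hit is chosen because it contains the most frequent word among the remaining descriptions.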

Manual annotation

BLAST analysis

The protein sequences collected from GenBank Accession U00096 were compared to the nr database using gapped BLAST [30].

DARWIN analysis

DARWIN (version 2.0) was used to detect sequence-similar proteins within E. coli K-12 and in 20 additional microbial genomes [32] (P. Liang and M. Riley, unpublished data). In addition to orthologous matches, groups of paralogous proteins of E. coli K-12 were generated on the basis of the DARWIN results. In our hands, DARWIN is particularly successful in identifying distant sequence similarities, probably because it applies multiple substitution matrices optimized for the organism and for each sequence pair.

Functional annotation

Functions were assigned to gene products on the basis of a manual evaluation of the results from the BLAST and DARWIN analyses. The automatic function prediction was also taken into account. In addition to incorporating recent experimental information, a substantial amount of human judgment was brought to bear.

Additional data files

A complete table of the current 4,401 Bnums is provided as an Excel file.