Bioinformatics: Databasing and Gene Annotation
“Omics” experiments amass large amounts of data requiring integration of several data sources for data interpretation. For instance, microarray, metabolomic, and proteomic experiments may at most yield a list of active genes, metabolites, or proteins, respectively. More generally, the experiments yield active features that represent subsequences of the gene, a chemical shift within a complex mixture, or peptides, respectively. Thus, in the best-case scenario, the investigator is left to identify the functional significance, but more likely the investigator must first identify the larger context of the feature (e.g., which gene, metabolite, or protein is being represented by the feature). To completely annotate function, several different databases are required, including sequence, genome, gene function, protein, and protein interaction databases. Because of the limited coverage of some microarrays or experiments, biological data repositories may be consulted, in the case of microarrays, to complement results. Many of the data sources and databases available for gene function characterization, including tools from the National Center for Biotechnology Information, Gene Ontology, and UniProt, are discussed.
Keywords: bioinformatics, databases, functional genomics, gene annotation, protein interaction, toxicogenomics
Genomic experiments amass large data sets, requiring the integration of supportive information from several other sources, including the most recent gene annotations, to facilitate biological interpretation. Typically, after microarray analysis and identification of the most active, or significant, genes, further investigation must be performed to elucidate the relevant pathways and networks involved in eliciting the phenotype (e.g., toxicity). Thus, investigators must integrate complementary information including gene names, abbreviations, and aliases for literature searches; cellular and extracellular locations; functional annotation; disease processes the gene participates in; and biological interaction data (e.g., protein-protein interactions) in order to comprehensively interpret the data. This information is oftentimes available in a variety of biological databases each serving a particular purpose or devoted to a specific data domain.
In general, genome sequences, from databases such as Ensembl (1,2), Entrez Genomes (3), and the University of California Santa Cruz (UCSC) Genome Browser (4), are the foundation on which all other annotation is built. From these genomic templates, expressed sequence tags (ESTs) and cDNAs in GenBank (3) can be clustered together and associated with genes (i.e., UniGene; Ref. 3), and exemplary, representative full-length sequences can be identified from GenBank and mapped back to locations in the genome (i.e., RefSeq; Ref. 3). These genes are then annotated in databases such as Entrez Gene (5), where functional information (Gene Ontology; Ref. 6) and disease information (Online Mendelian Inheritance in Man [OMIM]; Ref. 3) are integrated to provide a more comprehensive summary of the function of a gene. Similarly, elements from sequence-level databases (e.g., ESTs) can be associated with features printed on a microarray and related to a gene through its GenBank Accession number, facilitating the annotation of gene expression profiles from the microarray experiments. Integration of genomic and proteomic data is also possible through sequence relationships, from the mRNA to the translated protein sequence. This facilitates further functional predictions by providing protein domain and family information that may reveal functional characteristics, and protein-protein interaction data from databases such as BIND (Biomolecular Interaction Network Database) (7) and DIP (the Database of Interacting Proteins) (8).
Currently, there is significant effort in the development of public repositories such as the Chemical Effects in Biological Systems Knowledgebase (CEBS) (9), ArrayExpress (10,11), and the Gene Expression Omnibus (GEO) (12) to facilitate data integration across multiple domains and to ensure public accessibility, as well as to support the development of comprehensive networks and computational models capable of predicting toxicity.
2 Genome-Level Databases
Genome-level databases manage, at the very least, genome sequence data. However, they differ in their integration of other types of data and often in their assignment of computationally defined genes. The three primary genome-level databases are the Ensembl database (1,2), the Entrez Genomes database (3), and the UCSC Genome Browser (4). Each uses a different technique for predicting genes and gene structures (e.g., untranslated regions [UTR], regulatory regions, introns, and exons) from genome sequence data.
The National Center for Biotechnology Information (NCBI) Entrez Genomes database annotates genes based on the RefSeq database of reference, exemplary sequences. RefSeq sequences are initially aligned to the genomic sequence using the MegaBLAST algorithm to identify genes; mRNAs and ESTs are aligned through MegaBLAST to identify additional genes (www.ncbi.nlm.nih.gov/genome/guide/build.html#contig).
The UCSC Genome Browser uses the NCBI genome builds for its annotation; thus, there are no differences between the human genome builds at UCSC and NCBI. However, prior to the December 2001 human genome freeze, the UCSC created its own genome builds, separate from the NCBI. Previously, the primary difference between the two methods was in their genome assemblies: Entrez Genomes used sequence entries from the GenBank database to drive assemblies, whereas the UCSC Genome Browser used BAC clones and mRNA sequences, resulting in differences in the genome assemblies (14). For other genomes, such as the mouse (i.e., C57BL/6), rat (i.e., Brown Norway rat), chimpanzee, rhesus monkey, and dog (i.e., Boxer), the UCSC uses builds from the respective genome authorities (see genome.ucsc.edu/FAQ/FAQreleases for further details).
To annotate the genome builds, NCBI uses the MegaBLAST algorithm for alignments to genomes, whereas the UCSC efforts use the BLAT (BLAST-like alignment tool) for alignment of mRNA, EST, and RefSeq sequences to the genome. This means that although both sources use the same build for the human genome (i.e., the NCBI genome build), there could still be differences in annotation (i.e., assignment of genes and functions to the genomic sequence). Assuming both use the same GenBank and RefSeq versions, differences may be attributed to the different alignment algorithms. In addition, the UCSC Genome Browser also incorporates gene predictions from other sources, such as Ensembl and Acembly (4), and users can also upload their own annotations for display in the browser.
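One way to see how two pipelines working from the same build can still disagree is to compare their gene placements directly. The following is a minimal sketch, assuming each source's annotation has already been exported as a mapping from gene symbol to (chromosome, start, end). The gene symbols, coordinates, and the `compare_annotations` helper are hypothetical and are not part of any NCBI or UCSC tool.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)


def compare_annotations(
    a: Dict[str, Interval], b: Dict[str, Interval]
) -> Dict[str, List[str]]:
    """Report genes unique to either source and genes placed differently."""
    shared = a.keys() & b.keys()
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "different_coordinates": sorted(g for g in shared if a[g] != b[g]),
    }


# Hypothetical annotations of the same genome build from two pipelines.
source_a = {"GENE1": ("chr1", 100, 900), "GENE2": ("chr2", 50, 400)}
source_b = {
    "GENE1": ("chr1", 100, 900),
    "GENE2": ("chr2", 60, 400),   # same gene, slightly different boundary
    "GENE3": ("chr3", 10, 80),    # reported by only one pipeline
}

report = compare_annotations(source_a, source_b)
print(report)
```

A report of this kind separates genes that only one source predicts from genes that both predict but with different boundaries, which are the two kinds of discrepancy described above.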
3 Sequence-Level Databases
Sequence-level databases manage EST and cDNA sequence read data. Some databases, such as GenBank and RefSeq, deal with these sequences directly, whereas others manage them on a larger scale, where multiple sequences are grouped together, as in UniGene. Generally, these databases provide the first level of annotation for microarray studies, as the sequences are directly represented on the microarrays as printed features.
When a sequence read is generated, it is generally submitted to the GenBank database and assigned a GenBank Accession Number, a unique identifier that represents that sequence and is the identifier most commonly used for probes on cDNA microarrays (3). The UniGene database creates nonredundant gene clusters based on GenBank sequences (3). Clusters are built by sequence alignment and annotated relative to genes in the Entrez Gene database. Consequently, UniGene clusters can be thought of as collections of GenBank sequences that most likely describe the same gene.
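The relationship between probe accessions and gene clusters can be sketched as a simple grouping step. The example below assumes a lookup table from GenBank accession to cluster identifier is already available. All accessions and cluster identifiers shown are invented placeholders, and `group_by_cluster` is a hypothetical helper rather than part of any NCBI tool.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical accession -> cluster lookup (placeholder identifiers only).
accession_to_cluster = {
    "AA000001": "Cluster.1",
    "AA000002": "Cluster.1",
    "BB000003": "Cluster.2",
    "CC000004": "Cluster.1",
}


def group_by_cluster(mapping: Dict[str, str]) -> Dict[str, List[str]]:
    """Collect the accessions that most likely describe the same gene."""
    clusters = defaultdict(list)
    for accession, cluster in sorted(mapping.items()):
        clusters[cluster].append(accession)
    return dict(clusters)


clusters = group_by_cluster(accession_to_cluster)
for cluster_id, members in clusters.items():
    print(cluster_id, members)
```

On a microarray, several printed features may therefore collapse to a single cluster, which is worth checking before counting "active genes".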
RefSeq Status Codes and Their Level of Annotation*
RefSeq status code | Level of annotation
Genome Annotation | Records that are aligned to the annotated genome
Inferred | Predicted to exist based on genome analysis, but no known mRNA/EST exists within GenBank
Model | Predicted based on computational gene prediction methods; a transcript sequence may or may not exist within GenBank
Predicted | Sequences from genes of unknown function
Provisional | Sequences that represent genes with known functions; however, they have not been verified by NCBI personnel
Validated | Provisional sequences that have undergone a preliminary review by NCBI personnel
Reviewed | Validated sequences that represent genes of known function that have been verified by NCBI personnel
4 Annotation Databases
Annotation databases provide functional information for genes and may also catalogue the structure of the gene. They serve as an initial point for data interpretation of microarray data and hypothesis generation.
Entrez Gene is part of NCBI’s Entrez suite of bioinformatics tools. It provides information on genes that have a RefSeq or have been annotated by a genome annotation authority (e.g., Jackson Labs for mice) for several toxicology-relevant species, including human, mouse, rat, and dog (5). Consequently, an entry within Entrez Gene may have an associated NM (mature) or XM (nonreviewed) RefSeq, or may have no exemplary RefSeq sequence associated with it.
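Because the NM and XM designations appear as accession prefixes, a list of RefSeq accessions can be triaged by prefix before interpretation. This is a minimal sketch: the accessions are invented placeholders, `classify_refseq` is a hypothetical helper, and only the two prefixes discussed above are distinguished, with every other prefix grouped together.

```python
# Only the two mRNA prefixes discussed in the text are distinguished here.
REFSEQ_MRNA_PREFIXES = {
    "NM": "mature (NM) mRNA record",
    "XM": "nonreviewed (XM) mRNA record",
}


def classify_refseq(accession: str) -> str:
    """Classify an accession by the prefix before the underscore."""
    prefix, sep, _rest = accession.partition("_")
    if not sep:
        # GenBank-style accessions carry no underscore-delimited prefix.
        return "not a RefSeq-style accession"
    return REFSEQ_MRNA_PREFIXES.get(prefix.upper(), "other RefSeq record type")


# Placeholder accessions for illustration only.
examples = ["NM_000001.1", "XM_000002.1", "NP_000003.1", "AA000004"]
labels = {acc: classify_refseq(acc) for acc in examples}
for acc, label in labels.items():
    print(acc, "->", label)
```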
Entrez Gene Annotation Categories and Sources*
Gene names and abbreviations/symbols
Publications and genome authorities
Genome position and gene structures
Gene Ontology (GO) database, Gene References into Function (GeneRIF)
Gene Expression Omnibus (GEO), EST tissue expression from GenBank
For human studies, the Online Mendelian Inheritance in Man (OMIM) database, the online version of the Mendelian Inheritance in Man (16), provides linkages between human genes and diseases (3,17). Output pages from the Entrez Gene provide links to OMIM, which is also searchable through the NCBI Entrez system. For many of the diseases within OMIM, a synopsis of the clinical presentation is provided in addition to links to the genes associated with the disease. PubMed citations are also made available through OMIM, with hyperlinks to the PubMed database entries. OMIM also contains information on known allelic variants and some polymorphisms (17).
The GO Consortium maintains the mappings between genes and GO terms. It is important to note that each gene may have multiple associated GO terms and that a GO number has no significance other than serving as a unique identifier.
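The one-to-many relationship between genes and GO terms can be illustrated with a small inverted index. The sketch below assumes GO identifiers are written as `GO:` followed by seven digits. The gene names and their term assignments are hypothetical, and `genes_by_term` is an illustrative helper rather than part of any GO Consortium software.

```python
import re
from collections import defaultdict
from typing import Dict, List

GO_ID = re.compile(r"GO:\d{7}")

# Hypothetical gene -> GO term assignments; one gene may carry several terms.
gene_to_go = {
    "GENE_A": ["GO:0006915", "GO:0005634"],
    "GENE_B": ["GO:0006915"],
}


def genes_by_term(annotations: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert gene -> terms into term -> genes, validating identifier format."""
    index = defaultdict(set)
    for gene, terms in annotations.items():
        for term in terms:
            if not GO_ID.fullmatch(term):
                raise ValueError(f"malformed GO identifier: {term!r}")
            index[term].add(gene)
    return {term: sorted(genes) for term, genes in index.items()}


term_index = genes_by_term(gene_to_go)
print(term_index)
```

The identifiers are treated purely as opaque keys. Numerical closeness between two GO numbers is never used, which is consistent with the number carrying no meaning beyond uniqueness.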
5 Protein-Level Databases
In many instances, the gene annotation databases mentioned above provide hyperlinks to protein annotation databases to identify the proteins encoded by the genes of interest. Recently, several protein-level databases were merged into one primary protein resource, the Universal Protein Resource (UniProt). UniProt combines the Swiss-Prot, TrEMBL, and PIR-PSD databases into one resource consisting of three related databases: (1) the UniProt Archive, (2) the UniProt Knowledgebase, and (3) the UniRef database.
The UniProt Archive (UniParc) is a database of nonredundant protein sequences obtained from (1) translation of sequences within the gene sequence-level databases (e.g., GenBank), (2) RefSeq, (3) FlyBase, (4) WormBase, (5) Ensembl, (6) the International Protein Index, (7) patent applications, and (8) the Protein Data Bank (19). The UniProt Knowledgebase (UniProt) provides functional annotation of the sequences within the UniParc. Examples of the annotation include the protein name; a listing of protein domains and families from the InterPro database, which contains protein family, domain, and functional information (www.ebi.ac.uk/interpro) (20); the Enzyme Commission identifier; and Gene Ontology identifiers. Proteins represented within the UniParc and UniProt Knowledgebase are then gathered automatically to create the UniProt reference database (UniRef), a database of reference, exemplary sequences based on sequence identity. Three different versions of the UniRef database exist (i.e., UniRef100, UniRef90, and UniRef50), where the number denotes the percent identity required for sequences to be merged, from across all species represented in the parent databases, into a single reference protein sequence. Thus, UniRef50 requires only 50% identity for proteins to be merged. UniRef50 and UniRef90 provide faster sequence searches for identifying probable protein domains and functions by decreasing the size of the search space.
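The effect of the identity threshold on the size of the search space can be illustrated with a toy clustering. The sketch below is not the procedure UniProt uses to build UniRef. It assumes equal-length, already aligned sequences, scores identity position by position, and merges each sequence into the first representative that meets the threshold. The sequences and both function names are invented for illustration.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions in two equal-length, pre-aligned sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def greedy_cluster(sequences: List[str], threshold: float) -> Dict[str, List[str]]:
    """Assign each sequence to the first representative at or above the threshold."""
    clusters: Dict[str, List[str]] = {}
    for seq in sequences:
        for representative in clusters:
            if percent_identity(seq, representative) >= threshold:
                clusters[representative].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no match: seq becomes a new representative
    return clusters


# Invented toy sequences (10 residues each).
toy = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLLLLL", "GGGGGGGGGG"]

sizes = {t: len(greedy_cluster(toy, t)) for t in (100, 90, 50)}
print(sizes)
```

Lowering the threshold merges more sequences under fewer representatives, which is why searching against the 90% and 50% sets involves fewer comparisons than searching against the 100% set.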
The RefSeq database also contains reference protein sequences, similar in concept to the reference mRNA sequences. These are available through the Entrez Gene system when querying for a gene. For more information on RefSeq, see Section 3.
6 Protein Interaction Databases
Protein interaction databases such as the Biomolecular Interaction Network Database (BIND), the Database of Interacting Proteins (DIP), the Molecular INTeraction database (MINT), and the IntAct database provide information on the interaction of proteins with other proteins, genes, and small molecules. Both BIND (21) and DIP (8) manage data from protein interaction experiments, including yeast two-hybrid and co-immunoprecipitation experiments. These data are submitted to the databases directly or are collected by database curators who survey the literature. The data are provided to the public through queries on the Web sites or as interaction files in the Proteomics Standards Initiative (PSI) Molecular Interaction (PSI-MI) XML format.
Visualization of these data sets is made possible through tools such as Osprey (22) and Cytoscape (23), which generate protein interaction networks based on input data from protein interaction databases or from other sources. Cytoscape has the additional functionality of allowing the overlay of gene expression data on the protein interaction map (23). These visualization tools provide initial support in the elucidation of pathways that may be altered after treatment, facilitating the generation of new hypotheses and the identification of biomarkers of exposure and toxicity.
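The idea of overlaying expression data on an interaction map can be sketched without any visualization tool. The example below assumes interactions are available as simple protein pairs and expression changes as log2 fold changes. The protein names, the values, the cutoff, and both helper functions are hypothetical, and the code does not use or imitate the Osprey or Cytoscape interfaces.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical undirected interactions and log2 fold changes.
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
expression = {"P1": 2.5, "P2": -0.2, "P3": 3.1}  # P4 was not measured


def build_network(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from interaction pairs."""
    network: Dict[str, Set[str]] = defaultdict(set)
    for a, b in pairs:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def active_neighbors(
    network: Dict[str, Set[str]],
    expr: Dict[str, float],
    protein: str,
    cutoff: float = 1.0,
) -> List[str]:
    """List interaction partners whose absolute fold change meets the cutoff."""
    partners = network.get(protein, set())
    # Unmeasured partners default to 0.0 and are therefore never "active".
    return sorted(p for p in partners if abs(expr.get(p, 0.0)) >= cutoff)


network = build_network(interactions)
print(active_neighbors(network, expression, "P2"))
print(active_neighbors(network, expression, "P3"))
```

A protein whose own expression is unchanged but whose partners respond strongly, as with the placeholder P2 here, is the kind of node that may suggest a pathway-level hypothesis.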
7 Microarray Databases
Microarray databases ensure data are being properly managed, support analysis, archive data for long-term use, and facilitate sharing with collaborators or deposition in public repositories. The Minimum Information About a Microarray Experiment (MIAME) standards provide guidance on the types of information that must be captured and reported in support of a microarray study in order to ensure independent investigators can replicate and properly interpret the data (24). This includes information regarding the clones, genes, protocols, and samples associated with the study. Several journals require microarray submissions to adhere to the MIAME standard, and the MGED (Microarray Gene Expression Data) Society is encouraging journals to require that microarray data sets, in support of published articles, also be submitted to repositories as a condition of publication, similar to requirements that novel sequences be submitted to GenBank prior to publication (25,26). Submission of microarray data sets to the NCBI Gene Expression Omnibus (GEO) (12) or the ArrayExpress (10,11) at the European Bioinformatics Institute (EBI) fulfills this requirement. Recently, more specialized repository efforts have been undertaken, such as the Chemical Effects in Biological Systems (CEBS) Knowledgebase (9,27), which will catalogue gene expression data from chemical exposures with the associated pathology and toxicology data.
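A local database can enforce this kind of reporting requirement with a completeness check before a data set is shared or deposited. The sketch below covers only the four kinds of information named above (clones, genes, protocols, and samples). The field names, the example record, and `missing_sections` are hypothetical and do not reproduce the actual MIAME checklist or any repository's submission format.

```python
from typing import Dict, List

# The four kinds of information named in the text; not the full MIAME checklist.
REQUIRED_SECTIONS = ("clones", "genes", "protocols", "samples")


def missing_sections(submission: Dict[str, list]) -> List[str]:
    """Return the required sections that are absent or empty."""
    return [name for name in REQUIRED_SECTIONS if not submission.get(name)]


# Hypothetical study record with an empty protocols section.
submission = {
    "clones": ["clone_001"],
    "genes": ["GENE_A"],
    "protocols": [],
    "samples": ["control_liver_1", "treated_liver_1"],
}

gaps = missing_sections(submission)
print(gaps)
```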
With the emergence of more pharmacology and toxicology domain-specific data management systems, the International Life Sciences Institute (ILSI) Health and Environmental Sciences Institute (HESI) Technical Committee on the Application of Genomics to Mechanism-Based Risk Assessment, in cooperation with the MGED Society, began work on a toxicology-specific MIAME standard (MIAME/Tox) (28). MIAME/Tox is expected to further specify the minimum information required to replicate a toxicogenomics experiment, which will also serve to facilitate data sharing among the toxicogenomics community. Moreover, it is expected that these databases will be extended to include the management of complementary proteomic and metabolomic data as well as other toxicology-relevant data such as chemical/drug structure information, absorption, distribution, metabolism, and excretion.
The use of genomic technologies in the mechanistic understanding of drug and chemical effects in biological tissues requires effective gene annotation. Several annotation sources exist; however, no database captures all of the data, making toxicogenomic data interpretation and network development difficult. For example, information concerning the function of a gene exists within Entrez Gene; however, protein family and structure information exists within UniProt, and protein interaction data exist within databases such as BIND, DIP, and MINT. Ideally, the integration of data from these disparate sources into a single database would allow a more comprehensive interpretation of the available data. Moreover, a centralized comprehensive knowledgebase would also facilitate the identification of mechanistically based biomarkers for human toxicity and the development of computational models with greater predictive power, which could be used to support and improve quantitative risk assessments.
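The integration described here can be sketched as a merge of per-source lookups into one record per gene. The example assumes each source has already been reduced to a dictionary keyed by a shared gene identifier. The gene names, annotation values, and the `integrate` helper are hypothetical and stand in for content that would come from resources such as Entrez Gene, UniProt, and the interaction databases.

```python
from typing import Dict, Iterable


def integrate(gene_ids: Iterable[str], **sources: dict) -> Dict[str, dict]:
    """Combine per-source annotation tables into one record per gene.

    A gene missing from a source gets None for that field, which keeps
    gaps in coverage visible instead of silently dropping the gene.
    """
    return {
        gene: {name: table.get(gene) for name, table in sources.items()}
        for gene in gene_ids
    }


# Hypothetical per-source tables keyed by a shared gene identifier.
function_notes = {"GENE_A": "hypothetical function summary"}
protein_families = {"GENE_A": "hypothetical protein family"}
interaction_partners = {"GENE_A": ["GENE_B"], "GENE_B": ["GENE_A"]}

records = integrate(
    ["GENE_A", "GENE_B"],
    function=function_notes,
    family=protein_families,
    partners=interaction_partners,
)
for gene, record in records.items():
    print(gene, record)
```

In practice the hardest part is the shared key: each source uses its own identifiers, so a mapping step such as accession to gene is needed before a merge like this is possible.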
- 7. Bader, G. D. and Hogue, C. W. (2000) BIND—a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics 16, 465–477.
- 16. McKusick, V. A. (1998) Mendelian Inheritance in Man. A Catalog of Human Genes and Genetic Disorders. 12th ed. Johns Hopkins University Press, Baltimore.