Encyclopedia of Metagenomics

Living Edition
| Editors: Karen E. Nelson

Human Oral Microbiome Database (HOMD)

  • Tsute Chen
  • Floyd Dewhirst
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4614-6418-1_13-5

Keywords

Query Sequence; Human Microbiome Project; NCBI Taxonomy; Genome Viewer; Subject Sequence

Introduction

The human oral cavity is a rich biological site with several microbial niches including teeth, gingival sulcus, tongue, cheek, hard and soft palates, tonsils, throat, and saliva. The microbiome of the oral cavity (Dewhirst et al. 2010) and its niches have been examined based on 16S rRNA sequencing (Aas et al. 2005; Bik et al. 2010; Human Microbiome Project 2012). Before 2012, the metagenome of the oral cavity had been studied only to a limited degree because of the complexity of the site (Alcaraz et al. 2012; Belda-Ferre et al. 2012; Xie et al. 2010). More than 700 prevalent species comprise the oral microbiome, but many taxa are present at less than 0.1 % of the microbial population (Dewhirst et al. 2010). As oral bacterial reference genomes become available, primarily through the efforts of the Human Microbiome Project (Human Microbiome Project 2012), it is becoming possible to attribute metagenomic sequences to organisms at the genus and species levels (Martin et al. 2012). The anchoring of metagenome sequence information to specific organisms in a taxonomic framework is key to developing a full description of the bacteria-bacteria and bacteria-host interactions that underlie human oral health and disease.

The Human Oral Microbiome Database (HOMD) was developed in response to the lack of any naming or taxonomic scheme for the thousands of human oral 16S rRNA clone sequences that were being generated in the early 2000s and dumped into GenBank without any taxonomic anchor. Investigators were publishing manuscripts using clone names (such as BU063) as provisional taxonomic names. The only way to phylogenetically place an oral clone was to personally align sequences and generate one’s own phylogenetic trees. We recognized that there was a need for a 16S rRNA-based provisional taxonomic scheme to name and provide reference sequences for unnamed taxa known only from clone or isolate 16S rRNA sequences. The naming scheme had to be provisional because formal naming under the bacterial code requires isolation in pure culture and full phenotypic characterization; 16S rRNA sequence by itself is insufficient for formal naming. The taxonomic scheme described more fully below is based on a Human Oral Taxon number which runs currently from 001 to 918.

At about the time we recognized the need to create a taxonomic framework for the oral microbiome, the National Institute of Dental and Craniofacial Research released a request for proposals on “The metagenome of the oral microbiome.” We responded with a proposal entitled “A foundation for the oral microbiome and metagenome,” which was funded as DE016937. The goals of the grant were (1) to set up the HOMD web-accessible database with a provisional taxonomic scheme and to present all oral genomes in a graphical interface, (2) to complete reference genomes for oral taxa, and (3) to obtain isolates of previously uncultivated taxa and make them available to the research community by placing them in national type culture collections. We have made steady progress in achieving these goals, and this project is currently in its seventh year of funding.

The HOMD Website

The HOMD contains various types of information on human oral microorganisms including taxonomic, genomic, and bibliographic. The purpose of the HOMD website (http://www.homd.org) is to provide an easy-to-use online interface to search, retrieve, and navigate among these different types of information. HOMD also provides web-based bioinformatics software tools for data mining and analyses.

Technically, the HOMD website is built on a LAMP system hosted on the HOMD web servers. The LAMP system comprises the Linux operating system, the Apache web server, the MySQL relational database, and PHP for dynamic web page rendering. Textual content such as the taxonomy and metagenomic information is queried from the database, and the results are displayed dynamically in the web browser. A dedicated high-performance computer cluster handles computationally demanding analyses such as sequence homology searches.

The HOMD has been designed to be compatible with the most commonly used web browsers, such as Microsoft Internet Explorer, Firefox, Google Chrome, and Safari. We suggest using one of these browsers to ensure that the HOMD web pages and tools function properly. All the HOMD information and tools are available to the general public without logging in or acquiring a user account. The log-in function is mainly for maintaining the website and curating the database information. If a user has been designated a curator, he or she will see additional administrative submenus.

Detailed functionalities, web interfaces, and tools as well as useful usage tips are presented below. Technical information such as the implementation and design of the HOMD has been published elsewhere (Chen et al. 2010).

Features of the HOMD Web Pages

The design of the website was based on feedback from several researchers in the field of oral microbiology over the past several years. The user interface was designed to be user-friendly, intuitive, and practical. At the top of every HOMD page (Fig. 1) is a banner containing the HOMD logo. The banner automatically shrinks in height once the user navigates away from the home page so that it does not take space from the requested content. Clicking the banner image brings the user back to the HOMD home page. The top navigation menu is located directly below the banner and is available on every HOMD page, providing access to all of HOMD’s tools and information.

Fig. 1 Screenshot of the HOMD home page

Another useful feature of the HOMD web pages is the unique page ID system. The rightmost item displayed on the top navigation menu is the page ID – a unique code that distinctly identifies the current page that a user is viewing. For example, the page ID of the HOMD home page is “HP1” (Fig. 1), and once a user navigates away from home page to, e.g., the Taxon Table page, the page ID automatically changes to “TT1.” This feature allows precise page referencing. This is particularly useful when a user needs to refer to a specific page on HOMD site for discussion, bug reporting, or suggestion.

The HOMD home page also includes a top-down expandable menu on the left side and an introductory paragraph in the center. On the right side are the Meta-Database Search, the Announcement, and the Database Update boxes. The Meta-Database Search finds information across all the subsets of the HOMD databases, including the taxonomy, the metagenomic information, and the dynamic genome annotations. The result lists the number of matches to the keyword, with links leading to the detailed information. The Announcement box displays important system-wide updates and news for the HOMD. The Database Update box is automatically updated by the HOMD dynamic genome annotation pipeline (see “Dynamic Annotation of Genomic Sequences” section) to keep track of the status of the genome annotation.

HOMD also provides comprehensive documentation and an update history for its data and tools. The HOMD User’s Guide (i.e., the help documentation) was designed to help users use the tools, navigate the information, and interpret the results provided by HOMD. The User’s Guide is accessible through the top navigation menu on all pages and is dynamically linked to the relevant guide for each tool. For example, when users are viewing the Taxon Table page, the “How to Use This Page” item in the top navigation menu leads directly to the page that explains the use of the Taxon Table. Alternatively, users can browse the entire user documentation by clicking the “Table of content” tab shown at the top of each documentation page or the “User’s Guide” links in the top menu and the side menu of the home page. Every HOMD document can be searched either through the search box located at the bottom of the table of contents of the documentation page or through the Meta-Database Search box located at the top right of the home page.

The design of the online interfaces of HOMD has been driven by suggestions from HOMD users. HOMD is open to suggestions and feedback from the research community to further improve its interface and content. Currently, HOMD provides several ways to communicate with the research team and the research community. The contact information provides e-mail addresses for direct communication with the HOMD research team. There is also a mailing list for important updates and announcements. Users can subscribe to the HOMD Mailing List (https://groups.google.com/forum/#!forum/homd-mail) with their own e-mail address by sending an empty e-mail to homd-mail+subscribe@googlegroups.com. An automatic e-mail will be sent to the subscriber for confirmation. HOMD also provides a discussion platform for the research community (https://groups.google.com/forum/#!forum/homd-forum). Note that these web links may change over time. In any case, current or updated versions of the web links provided here will be available on the HOMD website.

The HOMD Database Schema

The information and data provided by HOMD are stored in several databases. The Oral Taxon IDs and the genome IDs serve as the keys that cross-link these databases. The database table structures and contents can be downloaded from the HOMD FTP (file transfer protocol) site at ftp://ftp.homd.org so that users can reconstruct the databases and perform advanced queries on their own computers.

Download Data from HOMD

Most of the data recorded in HOMD, including taxonomy, genomics, and 16S rRNA reference sequences, can be downloaded from the HOMD FTP site (ftp://ftp.homd.org). The FTP site provides both current and archived versions of the data for comparison. The FTP site can be accessed directly in the web browser. Each folder comes with a “readme” text file explaining the data, data format, and potential usage. Selected data such as the aligned reference sequence dataset, aligned 16S rRNA datasets for each taxon, and an HOMD taxonomy database in Excel format can be downloaded from the links provided in the HOMD web pages.

Taxonomy

Compilation of the HOMD Taxa

The HOMD describes information linked to oral microbe species. For bacteria or archaea that have not been validly named, there is no definition of “species.” Molecular methods to identify novel species have generally used 16S rRNA sequencing of isolates or 16S rRNA-based analysis of clone libraries. These strains or clones can then be clustered into phylotypes or taxa based on their 16S rRNA sequences. Phylotypes can be defined for any similarity cutoff. In HOMD, a cutoff of 98.5 % 16S rRNA sequence similarity was used to cluster the 16S rRNA sequences at the species level to define novel oral bacterial phylotypes. Each validly named species and novel phylotype cluster was given a unique integer number called the Human Oral Taxon (HOT) ID.

The original collection of oral microbial taxonomy information came from a combination of literature sources, primarily reports from Forsyth Institute investigators (Dzink et al. 1985, 1988; Socransky and Haffajee 1994; Tanner et al. 1979, 1998) and from Lillian Holdeman Moore and Ed Moore (Moore et al. 1982, 1983; Moore and Moore 1994), formerly at the Anaerobe Laboratory at the Virginia Polytechnic Institute. 16S rRNA sequences for these named species came either from sequences obtained in our laboratory or from GenBank. Over the past 20 years, our laboratory constructed and sequenced over 600 16S rRNA gene libraries and obtained over 35,000 clone sequences. The cloning, sequencing, aligning, treeing, and clustering methods used to create HOMD are described elsewhere (Dewhirst et al. 2010). In brief, sequences were manually aligned in a secondary structure-based database using the program RNA (Paster and Dewhirst 1988). Distance matrices and neighbor-joining trees were generated to determine the clustering of sequences. Sequences with similarity equal to or greater than 98.5 % were grouped together into a single taxon. Sequences were extensively checked for chimeras, and several sequences and some provisional taxa were removed. As a result, several hundred apparently novel full 16S rRNA sequences were identified.
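
As a rough illustration of the 98.5 % grouping rule, the following sketch clusters sequences by single linkage over a precomputed table of pairwise similarities. It is a simplification under stated assumptions, not the HOMD procedure: the text describes manual alignment, distance matrices, and neighbor-joining trees, whereas this example swaps in a plain union-find grouping, and the clone names and similarity values are invented.

```python
from itertools import combinations


def cluster_by_similarity(names, similarity, cutoff=98.5):
    """Group names by single linkage at the given percent-similarity cutoff.

    `similarity` maps frozenset({a, b}) -> percent similarity (hypothetical input);
    pairs missing from the table are treated as dissimilar.
    """
    parent = {name: name for name in names}

    def find(x):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if similarity.get(frozenset((a, b)), 0.0) >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Invented example data: four clones and some pairwise percent similarities.
names = ["cloneA", "cloneB", "cloneC", "cloneD"]
similarity = {
    frozenset(("cloneA", "cloneB")): 99.2,
    frozenset(("cloneB", "cloneC")): 98.5,
    frozenset(("cloneA", "cloneC")): 97.9,
    frozenset(("cloneC", "cloneD")): 92.0,
}
clusters = cluster_by_similarity(names, similarity)
print(clusters)
```

Note that single linkage places cloneA and cloneC in the same group through cloneB even though their direct similarity is below the cutoff; a stricter rule (for example, complete linkage) would split them, which is one reason tree-based inspection is useful alongside a fixed threshold.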

To share the information of both the named and novel human oral microbial taxa with the research community, we decided to build a database and designed web query interfaces and tools. When the HOMD was publicly launched in 2010, there were a total of 619 Human Oral Taxa in the initial release of the HOMD database. The 753 reference 16S rRNA gene sequences upon which this analysis was done have been released publicly for download on the HOMD website as version 10. At the time of writing this chapter, the total number of taxa described in the HOMD taxonomy database has grown to 688, represented by a total of 833 reference 16S rRNA sequences (HOMD RefSeq Version 13.1).

Navigating the HOMD Taxa

The HOMD taxonomy information can be viewed and retrieved in several ways. The information can be viewed online directly in a web browser or downloaded as text files. For online viewing, the taxonomy pages can be searched with keywords or navigated visually with the Taxon Table (Fig. 2) and the Taxonomic Hierarchy (Fig. 3). The Taxon Table can also be downloaded as an Excel file or a tab-delimited plain text file from the Tools & Download page or through the HOMD FTP site. The keyword search can be done through the Meta-Database Search box on the home page or the search box on the Taxon Table page. Both search boxes look for the input keyword(s) in all text fields of the HOMD taxonomy database table.

Fig. 2 Screenshot of the Taxon Table

Fig. 3 Screenshot of the Taxonomic Hierarchy expanded at the order level Bacteroidales

On the Taxon Table page, all the human oral microbial taxa are listed in a table ordered alphabetically by organism names. The order can be changed by clicking the column name HOT IDs, Genus, or Species names, to toggle the display sort order. Three commonly used filters are also provided to show only those taxa with “named species,” “unnamed cultivated species,” or “uncultured phylotypes.” Each taxon listed in the table contains links to the individual Taxon Description page (described later) and to the genomic information, if available.

The taxa can also be viewed in the taxonomic hierarchical order, i.e., from domain, phylum, class, order, family, genus, to species levels, on the Taxonomic Hierarchy page (Fig. 3). The hierarchical tree is fully collapsed by default and can be dynamically expanded at any given level (or all levels). The link, at the species level, brings users to the detailed Taxon Description page. The designation of each level is followed by two numbers enclosed in the square brackets indicating the number of taxa and genome sequences. For example, “Phylum Proteobacteria [107, 144]” indicates that in the phylum Proteobacteria, 107 taxa were identified in the oral cavity and 144 strains have genomic sequences available at HOMD. If a strain has been sequenced by multiple groups, or multiple strains sequenced for a species, we provide each sequence when available.
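
For readers who copy labels from the Taxonomic Hierarchy into their own scripts, the bracketed counts can be separated with a small parser. This is only a sketch: the label layout is taken from the single example quoted above, and the function and field names are hypothetical.

```python
import re

# Layout assumed from the example "Phylum Proteobacteria [107, 144]":
# rank, name, then [number of taxa, number of genome sequences].
LABEL_PATTERN = re.compile(
    r"^(?P<rank>\w+)\s+(?P<name>.+?)\s+\[(?P<taxa>\d+),\s*(?P<genomes>\d+)\]$"
)


def parse_hierarchy_label(label):
    """Split a hierarchy label into rank, name, taxon count, and genome count."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    return {
        "rank": match["rank"],
        "name": match["name"],
        "taxa": int(match["taxa"]),
        "genomes": int(match["genomes"]),
    }


parsed = parse_hierarchy_label("Phylum Proteobacteria [107, 144]")
print(parsed)
```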

Another way to obtain a summary of the HOMD taxa is to view the number of taxa at the various taxonomic levels. The Taxonomic Level page provides a list of taxa and the number of taxa at the next lower level for each of the seven taxonomic levels. Currently, the numbers are Domain (2), Phylum (14), Class (24), Order (40), Family (83), Genus (183), and Species (688).

Taxon Description

The HOMD Taxon Description page (Fig. 4) provides comprehensive information for a specific human oral microbial taxon. The information can be summarized in four categories: Taxonomic Hierarchy, biological characteristics, references, and community comments. Throughout the page, clickable dynamic cross-links are provided for additional information. The taxon page can be edited and curated by designated curators after they log in. The page also accepts input and comments from users in the research community. The information presented on this page is as follows:

Fig. 4 Screenshot of the Taxon Description page

  • Human Oral Taxon (HOT) ID – The Human Oral Taxon ID is a unique numeric ID representing a particular taxon, which allows the taxon to be referred to unambiguously in the scientific literature and other sources. The taxon can be accessed on the web with a simple uniform resource locator (URL) format, http://www.homd.org/taxon=NNN, where NNN is the HOT ID. The Human Microbiome Project Data Analysis and Coordination Center (DACC; accessible at http://www.hmpdacc.org) uses HOT IDs to designate the taxonomic identity of isolates from the oral cavity, with URLs cross-referenced to HOMD. These URLs are embedded in the data provided by DACC so that users can follow them to the more comprehensive information available for an individual genome. The HOT IDs were also embedded in the GenBank sequence records for the 35,000 clone sequences that were used to build the initial collection of HOMD taxa. The text embedded in the GenBank records has the syntax /db_xref="HOMD:tax_NNN", in which NNN is the numeric HOT ID. If the GenBank sequence is viewed in a web browser through the NCBI website, the portion of the text “tax_NNN” is clickable and links to the corresponding taxon page on the HOMD website. For example, the GenBank record for the partial 16S rRNA sequence of the Alloprevotella rava clone GB024 (Accession No. GU409552, http://www.ncbi.nlm.nih.gov/nuccore/GU409552) contains the text /db_xref="HOMD:tax_302", because the HOT ID for A. rava is 302. Clicking “tax_302” in this GenBank record in the web browser brings the user to the corresponding taxon page on HOMD (http://www.homd.org/taxon=302). NCBI embeds external database reference IDs in GenBank records for cross-database referencing. More information can be found at http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref.

  • Status – This field displays the culturing status for the taxon. A taxon can be either a validly named cultivated species, an unnamed cultivated species, or an unnamed uncultured phylotype. This status is shown in this field and will be updated upon the change of actual status of the taxon.

  • Type strain/reference strain – If the taxon’s status is validly named cultivated species, the Type Strain is listed here; if the taxon is an unnamed isolate, the strain information will be listed as Reference Strain. If no cultivated strain is available yet, the Reference Strain field will be listed as “None, not yet cultivated.”

  • Classification – The Taxon Description page lists the name of the taxon at each taxonomic level from Domain to Species. This classification is defined by HOMD and may differ from the NCBI Taxonomy, which can be accessed using a dynamic link. The HOMD taxonomy is based on analysis of where each taxon falls in phylogenetic trees generated using several treeing methods and including over 100 non-oral reference taxa identified by searching the “greengenes” 16S rRNA gene database (http://greengenes.lbl.gov). For example, in HOMD, an organism such as Eubacterium saburreum is placed in the family Lachnospiraceae (because that is where it falls phylogenetically), rather than in the family Eubacteriaceae (where its incorrect genus name “Eubacterium,” which has not yet been revised, would place it). Synonyms of the taxon that are currently in use or were previously used in the literature are also provided.

  • 16S rRNA gene sequence – GenBank accession numbers, with links to the corresponding NCBI Entrez records, for one or more 16S rRNA gene sequences associated with the taxon.

  • 16S rRNA gene sequence alignment – This field provides the link to the downloadable clone sequences preliminarily aligned to the reference sequence to which the clones belong. The current set contains the approximately 35,000 clone sequences (Dewhirst et al. 2010) aligned for each taxon. The clone alignments are provided in concatenated FASTA format, with the reference sequence(s) that were used as the template for alignment placed on top. To view the alignment in color and for further adjustment, third-party alignment viewing software may be used, such as SeaView (http://pbil.univ-lyon1.fr/software/seaview.html) and BioEdit (http://www.mbio.ncsu.edu/BioEdit/bioedit.html). Because some pairs of clone sequences may be nonoverlapping (i.e., 500-base sequences at opposite ends of the molecule), this file must be used with caution for tree construction.

  • Phylogeny – A phylogenetic tree showing the position of this taxon among related HOMD taxa is provided here. The tree images are in PDF format and can be viewed or downloaded with the link provided in this field. A link to a list of all the downloadable phylogenetic tree images encompassing all the HOMD taxa is also provided.

  • Prevalence by molecular cloning – The number of clones found for this taxon in an analysis of approximately 35,000 clones (Dewhirst et al. 2010). Based on the number of clones found, the rank abundance of the taxon (out of 619) is given.

  • Synonyms – Lists previous names for the organism if validly named. Isolate or clone designations are given as synonyms when they have appeared in the literature as “names” for the taxon, such as “BU063” (Zuger et al. 2007).

  • NCBI taxonomy – For validly named species, there is a link to the NCBI Taxonomy. NCBI has no taxonomy for unnamed taxa, which is the reason HOMD was created.

  • PubMed search – The number of hits when the name (genus plus species) of this taxon is used in a PubMed search. HOMD automatically updates this hit number every 2 weeks. For the most up-to-date result, click the “PubMed Link” to run the search live on the NCBI PubMed site. In general, there are no results for unnamed taxa, hence the need for HOMD. When articles referencing these taxa (often through clone numbers) are found by HOMD curators or community members, they are manually added to the Taxon Description.

  • Nucleotide search – A similar search using NCBI Entrez “nucleotide” as the reference database. The latest hit count is displayed with a link to NCBI for the most up-to-date search.

  • Protein search – A similar search using NCBI Entrez “protein” as the reference database. The latest hit count is displayed with a link to NCBI for the most up-to-date search.

  • Genomic sequence – The number of genomes that have been sequenced is indicated here, with a link to a detailed list of these genomes.

  • Hierarchy structure – An expandable/collapsible view of a dynamically displayed taxonomy tree indicating the position of the taxon on the page.

  • Cultivability – Conditions and media for growing strains of this taxon, if available.

  • Phenotypic characteristics – Generic phenotypic description of the taxon if the taxon has cultivated member(s).

  • Prevalence and source – Describes the frequency and source of clones and isolates from different oral sites and states of health or disease when known.

  • References – Literature and publications referencing this taxon. These references are manually curated with up to ten key references which may also include older references not indexed in PubMed.

  • Community comments – Registered and logged-in users can provide feedback related to this taxon. A comment requires the approval of the HOMD curators before it is shown to the public.
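
The two cross-reference conventions described in the Human Oral Taxon (HOT) ID item of the list above, the taxon URL and the GenBank db_xref qualifier, lend themselves to a short sketch. The URL pattern and qualifier syntax are taken from the text; the function names and the abbreviated feature lines are hypothetical and hand-written for illustration, and the URL format may have changed since publication.

```python
import re

# URL pattern and db_xref syntax as quoted in the text; NNN is the numeric HOT ID.
HOMD_TAXON_URL = "http://www.homd.org/taxon={hot_id}"
DB_XREF_PATTERN = re.compile(r'/db_xref="HOMD:tax_(\d+)"')


def taxon_url(hot_id):
    """Build the taxon page URL for a HOT ID (kept as text to preserve leading zeros)."""
    return HOMD_TAXON_URL.format(hot_id=hot_id)


def hot_ids_from_genbank(record_text):
    """Return every HOT ID referenced by HOMD db_xref qualifiers in a GenBank record."""
    return DB_XREF_PATTERN.findall(record_text)


# Hand-written lines in the style of a GenBank flat file (illustrative only,
# not copied from the actual GU409552 record).
record_text = '''
     source          1..500
                     /organism="Alloprevotella rava"
                     /db_xref="HOMD:tax_302"
'''
hot_ids = hot_ids_from_genbank(record_text)
urls = [taxon_url(hot_id) for hot_id in hot_ids]
print(hot_ids, urls)
```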

Identification of 16S rRNA Gene Sequence by BLAST Search

One of the most used HOMD software tools is the customized BLAST search specifically designed to identify user-provided 16S rRNA sequences against the comprehensive collection of the 16S rRNA reference gene sequences. Currently there are a total of 688 taxa defined based on version 13.1 of the 16S rRNA reference sequences. Since a phylotype can include members with up to 1.5 % sequence divergence (23 bases for a full 1,500-base sequence), multiple reference sequences have been selected where we have sequences diverging by more than 10 bases within a taxon.

HOMD provides two primary sets of 16S rRNA gene reference sequences (RefSeq) for download and for BLAST search. The first set is the HOMD 16S rRNA RefSeq. This set contains sequences representing all currently named and unnamed oral taxa. In the latest reference sequence set (version 13.1 at the time of writing), there are 834 reference sequences representing the 688 taxa. The second is the HOMD 16S rRNA Extended RefSeq. This set contains additional 16S rRNA gene reference sequences that are distinctly different from existing taxa but have not yet been assigned a taxon ID.

The HOMD reference sequences are corrected consensus sequences. Many have been corrected and extended based on alignment with other sequences for that taxon, with Ns and indels removed. Therefore, for many sequences, there will be differences between the reference sequence and the GenBank sequence listed in the header information. We have not yet updated our own GenBank sequences and cannot update those from other depositors. We believe these are currently the best reference sequences available and, for the purposes of BLAST analysis, have the advantage of being of uniform length.

On the HOMD 16S rRNA Sequence Identification page (Fig. 5), users can paste the query sequences into the text field or upload them from their computer. The query sequences should be in concatenated FASTA format. The maximum number of query sequences allowed in a single search is 5,000, because viewing BLAST results for more than 5,000 query sequences in the web browser becomes very slow; for larger searches, please contact the HOMD team. The HOMD 16S rRNA BLAST online tool was designed only for a modest number of sequences, up to a couple of thousand, which can be submitted in several batches. It is not capable of handling larger numbers of sequence reads, such as the hundreds of thousands of reads produced by a next-generation sequencing pipeline. For larger numbers of sequences, the search can be done on a collaborative basis: HOMD provides secure FTP (sFTP) upload for large batches of user sequences, the search is submitted manually to the HOMD BLAST server cluster on the user’s behalf, and the results are made downloadable through the sFTP site. The upload page also provides options for adjusting the BLAST search parameters, although the default settings should be sensitive enough to pick up matches even with short oligonucleotide sequences.

Fig. 5 HOMD 16S rRNA Sequence Identification. (a) Query sequence input interface; (b) Result page
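
Because a single online search accepts at most 5,000 query sequences and larger sets are expected to be submitted in several batches, a user may want to split a concatenated FASTA file before uploading. The following is a minimal sketch with hypothetical function names and invented sequences; it is not an HOMD tool, and the demonstration uses a batch size of 2 only to keep the example small.

```python
def read_fasta(text):
    """Parse concatenated FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def batch_records(records, max_per_batch=5000):
    """Split records into consecutive batches of at most max_per_batch entries."""
    return [records[i:i + max_per_batch] for i in range(0, len(records), max_per_batch)]


def to_fasta(records):
    """Serialize (header, sequence) pairs back to concatenated FASTA text."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in records)


# Invented reads; a real run would keep the default limit of 5,000 per batch.
fasta_text = ">read1\nACGT\nACGT\n>read2\nGGCC\n>read3\nTTAA\n"
records = read_fasta(fasta_text)
batches = batch_records(records, max_per_batch=2)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {len(batch)} sequences")
```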

Once the query sequences are submitted, they are uploaded to the HOMD computer servers and queued for the BLAST search. When all the searches are done, the results are presented to the submitter in tabular format. Results containing up to 20 top matches for each query sequence can be downloaded in text or Excel file formats. The original full BLAST results, including the alignments, can also be accessed from the result page. The match identity is presented both as the unmodified BLAST result and as an adjusted percent identity (API) calculated as
$$ \mathrm{API}=100\times \mathrm{M}/\left(\mathrm{M}+\mathrm{MM}\right) $$
where M is the number of matched (identical) positions and MM the number of mismatched positions between the query and the reference sequence. This calculation excludes any gaps introduced during the alignment process of the BLAST search. We have found that this correction gives much better values for single primer sequence reads, where the sequence adjacent to the primer often includes indels. The top hits are ordered by their API rank, and sequences with an alignment shorter than 95 % of the query sequence are excluded from ranking. The top four matched reference sequences are listed by this method, and the table shown on the web page contains links to the original BLAST results as well as to the Taxon Description pages for the reference sequences.
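
The API formula and the ranking rule can be sketched in a few lines. The formula, the 95 % alignment-length filter, and the top-four listing follow the description above; the hit dictionaries, their field names, and the reading of "alignment length" as the number of aligned query positions are assumptions made for illustration, and the numbers are invented.

```python
def adjusted_percent_identity(matches, mismatches):
    """API = 100 * M / (M + MM); gap positions are excluded from the denominator."""
    total = matches + mismatches
    if total == 0:
        raise ValueError("alignment has no matched or mismatched positions")
    return 100.0 * matches / total


def rank_hits(hits, query_length, min_coverage=0.95, top_n=4):
    """Drop hits aligned over less than min_coverage of the query, then rank by API."""
    eligible = [
        hit for hit in hits
        if hit["alignment_length"] >= min_coverage * query_length
    ]
    ranked = sorted(
        eligible,
        key=lambda hit: adjusted_percent_identity(hit["matches"], hit["mismatches"]),
        reverse=True,
    )
    return ranked[:top_n]


# Invented hits for a 500-base query (field names are hypothetical).
hits = [
    {"reference": "ref_B", "matches": 480, "mismatches": 18, "alignment_length": 498},
    {"reference": "ref_A", "matches": 495, "mismatches": 3, "alignment_length": 500},
    {"reference": "ref_C", "matches": 300, "mismatches": 0, "alignment_length": 300},
]
top = rank_hits(hits, query_length=500)
for hit in top:
    api = adjusted_percent_identity(hit["matches"], hit["mismatches"])
    print(hit["reference"], round(api, 2))
```

In this example ref_C is a perfect match over its aligned region but covers only 300 of 500 query bases, so it is excluded before ranking. For ref_A, the two gap positions are ignored, giving 495/498 rather than 495/500.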

Genomics

Genomics Tools Overview

Complementary to the taxonomy information, the HOMD also provides comprehensive information and tools for studying the genomes of human oral microbes. The HOMD genomics database serves as the curated repository for the molecular sequences of the human oral microbiome, including complete and partial genomic sequences as well as the 16S rRNA sequences mentioned in the previous section. Genomic sequences available at HOMD can be fully assembled genomes, high-coverage genomes, or genome surveys. HOMD also keeps track of the status of ongoing genome sequencing projects for human oral microorganisms. A Sequence Meta Information page is created to hold relevant genomic and sequence meta information once a sequencing project for a human oral microbe is announced and available in the NCBI Genome Project Database. The genome project status is updated biweekly by an automatic query script based on information collected from the NCBI Genome Project Database. Once genomic sequences are publicly released, they are dynamically annotated by HOMD (Dynamic Annotation). Annotation done by other data centers, if available, is termed “static annotation” and is viewable in a separate panel in the Genome Viewer (described below). Relevant tools are provided for viewing and searching the annotation. These tools were first developed as part of the Bioinformatics Resource for Oral Pathogens (BROP: http://www.brop.org; Chen et al. 2005). The programs and the data-mining schemes used in HOMD are designed for both finished and unfinished (collections of multiple contigs) genome sequences. The tools are integrated with the HOMD website, and icons or links to the tools available for a specific genome are automatically presented to users on the relevant pages. Important genomic data and bioinformatics tools provided by HOMD are described below. Additional information on the tools is available in the previous publication (Chen et al. 2005).

Genome Table

HOMD organizes genomes in three viewing options: Taxa with Annotated Genomes, Taxa with Genomes in Progress, and View All Genomes. The first option lists the oral taxa with annotated (static or dynamic) genomic information and provides links to all the genomes available for each taxon. The View Genome button links to the Genome Table showing all the available genomes of a specific taxon. The Genome Table shows the Oral Taxon ID (HOT), the Genus and Species names, Strain Culture Collection, HOMD Sequence ID (SEQ ID), number of contigs and singlets, combined sequence length, and links to available tools and information. The second option (Taxa with Genomes in Progress) lists those oral taxa for which a genome sequencing project is in progress but no sequence is yet available. The third option shows all the genomes in alphabetical order and provides search and sort functions for easy navigation. Each genome listed has a link to the Sequence Meta Information page described next.

Sequence Meta Information

The Sequence Meta Information page provides detailed biological, molecular biological, genetic, genomic, taxonomic, and annotation information for a particular strain that has been, is being, or will be sequenced (Fig. 6). Information on these pages is updated semiautomatically. Updated information from both Genomes OnLine and the NCBI Genome Project Database is retrieved biweekly and automatically compared with the existing database. New or modified Genome Project information is then added to the Sequence Meta Information pages after confirmation by curators. The Sequence Meta Information page contains the following human-curated information related to the target organism: Oral Taxon ID, HOMD Sequence ID (SEQ ID), Organism Name (genus, species), Culture Collection Entry Number, Isolate Origin, Sequencing Status, NCBI Genome Project ID, NCBI Taxonomy ID, Genomes Online Goldstamp ID, NCBI Genome Survey Sequence Accession ID, JCVI (previously TIGR) CMR ID, Sequencing Center, number of contigs and singlets, combined length (Kbp), GC percent, DNA molecular summary, ORF annotation summary, and 16S rRNA gene sequence. In addition, original external information from the NCBI Genome Project Database, NCBI Taxonomy Database, Genomes OnLine Database, and rRNA records in the NCBI Nucleotide Database, if available, is parsed into separate tables below the Sequence Meta Information for convenient referencing.
Fig. 6

Screenshot of the Sequence Meta Information page
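
The semiautomatic update described above amounts to comparing freshly retrieved records with the stored ones and queueing the differences for curator review. The following is a minimal sketch in Python, assuming each record is a dictionary keyed by a project identifier; the function name, field names, and sample values are hypothetical and are not taken from the HOMD code base.

```python
def find_pending_changes(existing, retrieved):
    """Return IDs of records that are new or modified relative to the stored copy.

    Both arguments map a project ID to a dict of metadata fields.
    Nothing is written back; the result is meant for curator review.
    """
    pending = {"new": [], "modified": []}
    for project_id, record in sorted(retrieved.items()):
        if project_id not in existing:
            pending["new"].append(project_id)
        elif existing[project_id] != record:
            pending["modified"].append(project_id)
    return pending


# Illustrative records only (hypothetical IDs and values).
stored = {"P1": {"status": "in progress"}, "P2": {"status": "complete"}}
fetched = {
    "P1": {"status": "complete"},
    "P2": {"status": "complete"},
    "P3": {"status": "in progress"},
}
changes = find_pending_changes(stored, fetched)
print(changes)
```

Keeping the comparison separate from the database write mirrors the curator confirmation step: nothing reaches the public pages until a person has approved the listed changes.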

Full and High-Coverage Genomes

Full genomes are oral microbial genomes that have been fully assembled, while high-coverage genomes are not fully assembled but cover most of the genome. Both types of genomes are annotated and deposited in a public database such as GenBank. HOMD aims to provide frequently updated genomic annotation for oral bacterial genomes (see below). In addition, HOMD provides graphical genomic viewing for static annotations done by other public data centers such as NCBI or JCVI.

Genome Surveys

One of the original major goals of the NIH-funded project “A Foundation for the Oral Microbiome and Metagenome,” DE016937, was to partially sequence up to 100 representative human oral microbial species. A total of 12 low-coverage partial genomic sequences were sequenced and deposited in NCBI before this project fused with the Human Microbiome Project. The genome information for these 12 surveys is still maintained on HOMD even though they currently also have complete or high-coverage genomes (The Forsyth Metagenomic Support Consortium and Izard 2010). Since the launch of the Human Microbiome Project, the HOMD team has been providing genomic DNA from human oral microbes to the four HMP sequencing centers for high coverage rather than survey sequencing (The Forsyth Metagenomic Support Consortium and Izard 2010).

Dynamic Annotation of Genomic Sequences

One of the major features of the HOMD Genomic Database is a pipeline that automatically and frequently updates the annotation of genomes of oral isolates. Although the amount of sequence data is still growing rapidly, the computational power needed for bioinformatic analysis of this data is catching up, and the cost and energy consumption per CPU are decreasing owing to the availability of multi-core CPUs. The lower cost of computational power has made it feasible for us to set up a small computation farm dedicated to the annotation of human oral microbial genomes. HOMD uses a cluster of multi-core, multi-node computer servers to update the annotation frequently. Current HOMD genome annotation algorithms include (i) BLASTP (http://www.ncbi.nih.gov/BLAST; Altschul et al. 1997) search against weekly updated NCBI nonredundant protein data (ftp://ftp.ncbi.nih.gov/blast/db/), (ii) BLASTP search against Swiss-Prot protein data (http://us.expasy.org/sprot/; Boeckmann et al. 2003), and (iii) InterProScan search against various sequence databases (Zdobnov and Apweiler 2001; http://www.ebi.ac.uk/interpro/). To provide data on the functional potential of genomes, BLASTP search results against Swiss-Prot are further processed for the construction of KEGG metabolic pathways and Gene Ontology Trees. We take advantage of the fact that the well-annotated Swiss-Prot protein sequence descriptions contain cross-links to the ENZYME (Bairoch 2000) and Gene Ontology (Camon et al. 2003) databases. The dynamic genome annotation runs continuously on the dedicated computer cluster except during the weekend, when the latest NCBI nonredundant protein, Swiss-Prot, and InterPro databases are downloaded to and updated on our server. Currently a total of 324 genomes representing 306 taxa are being repeatedly annotated by this pipeline. On average, each genome takes ~3 h to be annotated; thus, a full re-annotation cycle for all the 300+ genomes currently takes approximately a month.
Additional genomes are being added to the annotation pipeline as more sequences are made available by other public sequencing projects such as the Human Microbiome Project (http://www.hmpdacc.org). A live update status of the genome annotation is provided on the HOMD home page indicating the latest genome annotated or updated. HOMD aims to maintain frequent and dynamic computer annotation for genomic sequence of at least one isolate from each oral taxon whenever sequences are made publicly available, as well as static annotation of all annotated releases.
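
The re-annotation estimate can be checked with a back-of-the-envelope calculation. The sketch below is in Python; the genome count and the per-genome time come from the text, whereas the number of concurrent annotation workers and the hours of operation per day are not stated and are treated as adjustable assumptions.

```python
def cycle_days(n_genomes, hours_per_genome, workers=1, hours_per_day=24.0):
    """Estimate the days needed to annotate every genome once.

    workers and hours_per_day are assumptions; the text does not state them.
    """
    total_hours = n_genomes * hours_per_genome
    return total_hours / (workers * hours_per_day)


serial = cycle_days(324, 3.0)              # one genome at a time, around the clock
paired = cycle_days(324, 3.0, workers=2)   # two genomes annotated concurrently
print(f"{serial:.1f} days serial, {paired:.2f} days with two workers")
```

Under these assumptions a strictly serial run needs about 40 days, the same order of magnitude as the month quoted above; any concurrency shortens the cycle proportionally, while the weekend pause for database updates lengthens it.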

Genome Explorer

Genome Explorer is the centralized web interface that interconnects all the genomics resources in HOMD (Fig. 7). The front end of Genome Explorer is a user-friendly interface that allows investigators to navigate among all the genomics information provided at HOMD. HOMD Genomics Tools can be accessed by selecting either the tool or the genome first. If the user chooses the desired tool first, the user is then directed to the Genome Explorer interface for selecting genomes. Once a target genome is chosen, the interface dynamically presents all the tools, including linked external databases, available for the selected genome. Currently available tools include Genome Viewer, Dynamic Annotation, BLAST, Annotator, EMBOSS, KEGG pathways (Kanehisa 2002), Gene Ontology Tree (Ashburner et al. 2000), Genomewide ORF Alignment, and Sequence Download. The back end of Genome Explorer is a searchable annotation database that integrates all the results generated by the dynamic annotation pipeline described above. The search result is presented in a paginated and sortable table that also provides web links to (i) a summary page for the individual ORF, (ii) Genome Viewer to show the exact location of the target ORF in the genome, and (iii) the original BLAST or InterProScan results. The summary page provides all the information and tools available for a specific ORF, including all the data-mining results mentioned above, as well as convenient links to other web tools for performing fresh searches and analyses. In short, Genome Explorer is a one-stop interface for all the genomic information available for each target genome or gene.
Fig. 7

HOMD Genome Explorer displaying results of Dynamic Annotation for the genome Aggregatibacter actinomycetemcomitans HK1651

Genome Viewer

Genome Viewer is a graphical genomic sequence viewer developed originally for the BROP project (Chen et al. 2005) (Fig. 8). The Genome Viewer was designed to alleviate the inconvenience encountered when comparing two different sets of annotations for the same genome. Genome Viewer provides a graphical, six-frame translational view of the same region of the genome with individual panels showing different sets of annotations. Its navigation features include zooming, centering, and searching by gene ID. For example, the genome of Porphyromonas gingivalis W83 has been annotated separately by JCVI (TIGR), Los Alamos National Laboratory, and NCBI. These different annotations can be viewed and compared side by side in the Genome Viewer (http://www.homd.org/index.php?name=GenomeExp&org=pgin&gprog=gview).
Fig. 8

HOMD Genome Viewer displaying multiple sources of annotations for Aggregatibacter actinomycetemcomitans HK1651
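
A six-frame translational view is built from three reading frames on the forward strand and three on the reverse complement. The following Python sketch illustrates the idea only; it assumes the standard genetic code, marks incomplete or ambiguous codons with "X", and is not the Genome Viewer implementation, which the text does not describe.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
for name in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(name, frames[name])
```

Stop codons appear as "*" in each frame, which is what allows open reading frames from different annotation sources to be compared visually against the same coordinates.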

HOMD Genomic BLAST

With the increasing number of genomes being sequenced, the output of a high-throughput BLAST search can be very complex and time-consuming to interpret, with many redundant results. We developed a graphical tool based on the improved BLAST+ (Camacho et al. 2009) that allows the user to customize BLAST searches by dynamically selecting any combination of the genomic sequences available in HOMD. The HOMD Genomic BLAST provides a visual taxonomy-based navigation interface (Fig. 9) for easy and dynamic selection of a set of genomes for sequence homology search. The selection can be a combination of individual genomes and/or a group of genomes related at any taxonomic level (species, genus, etc.). The BLAST parameters are dynamically presented after the genome selection, and the results are available on the web and for download in multiple formats.
Fig. 9

Screenshot of the HOMD Genomic BLAST tool – the genome selection page showing 107 Bacteroides genomic sequences selected for BLAST Search

The HOMD Genomic BLAST query interface starts with the selection of the genomes to be searched against. All the HOMD genomes available for search are displayed and selectable in a collapsible tree based on the taxonomy hierarchy. As shown in Fig. 9, upon starting the HOMD Genomic BLAST, the taxonomy hierarchical tree is fully expanded by default and can be dynamically collapsed at any given level. The links at the species level and the genome level lead to the detailed Taxon Description and Sequence Meta Information pages, respectively. Numbers indicated in the square brackets at each level are the numbers of oral taxa, genomes with meta information, genomes with HOMD annotation, and genomes with NCBI annotation, respectively. The genome selection is flexible and can be a single genome, any arbitrary set of individual genomes, a group of genomes at any taxonomy level (from Domain to Species), all the genomes dynamically annotated at HOMD, all the genomes with static annotations by NCBI, or one representative genome from each species. The total number of genomes selected is shown at the top of the page.
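
Selecting a group of genomes at a given taxonomic level can be pictured as collecting every leaf beneath one node of the tree. The Python sketch below assumes a nested-dictionary representation with lists of genome labels at the leaves; the data structure, the function names, and the two-genome fragment are illustrative only and do not reflect how HOMD stores its taxonomy.

```python
# Tiny illustrative fragment: nested dicts for ranks, lists of genomes at the leaves.
TREE = {
    "Bacteria": {
        "Bacteroidetes": {
            "Porphyromonas": {
                "Porphyromonas gingivalis": ["Porphyromonas gingivalis W83"],
            },
        },
        "Proteobacteria": {
            "Aggregatibacter": {
                "Aggregatibacter actinomycetemcomitans": [
                    "Aggregatibacter actinomycetemcomitans HK1651",
                ],
            },
        },
    },
}


def genomes_under(node):
    """Recursively collect genome labels beneath a tree node."""
    if isinstance(node, list):  # leaf: a list of genome labels
        return list(node)
    found = []
    for child in node.values():
        found.extend(genomes_under(child))
    return found


def select(tree, path):
    """Walk down the tree along path (a list of names) and return all genomes below."""
    node = tree
    for name in path:
        node = node[name]
    return genomes_under(node)


phylum_selection = select(TREE, ["Bacteria", "Bacteroidetes"])
domain_selection = select(TREE, ["Bacteria"])
print(len(domain_selection), "genomes selected at the domain level")
```

A combined selection, such as one phylum plus a few individual genomes, is then just the union of several such calls.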

After the genomes are selected, users are directed to the next page to provide the query sequence and options for the BLAST search (Fig. 10). A summary of the selected genome(s) is presented at the top of this page with an option for going back and modifying the selection. Below the summary is the query sequence form. The query sequence, in FASTA format, can be copied and pasted into the sequence field or uploaded directly from the user’s computer. Up to ten sequences may be submitted at once. BLAST parameters are dynamically changed based on the type of query and subject sequences. The query sequences can be either nucleotide or protein sequences. The subject can be whole genomic DNA sequences or nucleotide or amino acid sequences of the annotated proteins of the selected genomes. Once the sequence type (nucleotide or protein) is selected by the user for both query and subject sequences, suitable BLAST programs are dynamically displayed for selection. For example, if both query and subject sequences are proteins, only BLASTP is available for search; likewise, if both queries and subjects are nucleotides, the search can be done with BLASTN or TBLASTX. Furthermore, alternative algorithms are available for nucleotide to nucleotide searches, including MegaBLAST (Morgulis et al. 2008) and Discontiguous MegaBLAST (Morgulis et al. 2008). Similarly, for protein to protein searches, available algorithms are BLASTP, PSI-BLAST (Altschul et al. 1997), PHI-BLAST (Zhang et al. 1998), and DELTA-BLAST (Boratyn et al. 2012). For each BLAST program, only the parameters and options corresponding to the selected program type and algorithm appear on this page. Detailed information about BLAST parameters is available under the link “Help.” For advanced users, command-line style BLAST+ parameters can be added in the Advanced Option section (Camacho et al. 2009).
Fig. 10

The HOMD Genomic BLAST tool – query sequence input and BLAST parameter adjustment page
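
The dependence of the offered BLAST programs on the query and subject sequence types can be written as a small lookup table. The Python sketch below follows the standard BLAST+ program definitions; the function and dictionary names are hypothetical, and the code is not the HOMD interface logic.

```python
# Which BLAST+ programs accept a given (query type, subject type) combination.
PROGRAMS = {
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
    ("protein", "protein"): ["blastp"],
}


def available_programs(query_type, subject_type):
    """Return the BLAST+ programs applicable to the chosen sequence types."""
    try:
        return PROGRAMS[(query_type, subject_type)]
    except KeyError:
        raise ValueError(
            f"unknown sequence types: {query_type!r}, {subject_type!r}"
        ) from None


print(available_programs("nucleotide", "nucleotide"))
print(available_programs("protein", "protein"))
```

Algorithm variants such as MegaBLAST or PSI-BLAST sit one level below this table: they are alternative search strategies within the nucleotide to nucleotide and protein to protein cases, respectively.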

Upon submission of the BLAST search, the requested job is sent to the back-end service for processing. The back-end service consists of a computer cluster that handles multiple requests from the query interface. The selected genomes/nucleotides/proteins are dynamically compiled into a virtual sequence database searchable by the BLAST programs, using the “blastdb_aliastool” tool provided by BLAST+ (Camacho et al. 2009). The search jobs are distributed to the compute nodes of the cluster, which is managed by the TORQUE resource manager (http://www.adaptivecomputing.com/products/open-source/torque). During the search, the user is presented with an intermediate page for monitoring the job status. This status page reports a summary of the job as well as the time elapsed since submission. The status page periodically refreshes itself, effectively polling the server while the job runs. The BLAST result is presented automatically when the job completes.
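
A virtual database of this kind is an alias file that points at existing per-genome BLAST databases, so no sequence data are copied. The Python sketch below only assembles a `blastdb_aliastool` command line from a list of database paths; the paths, output name, and title are hypothetical, and the exact invocation used by HOMD is not given in the text.

```python
import shlex


def alias_command(db_paths, dbtype, out_name, title):
    """Build a blastdb_aliastool command that groups existing BLAST databases."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    if not db_paths:
        raise ValueError("at least one database is required")
    return [
        "blastdb_aliastool",
        "-dblist", " ".join(db_paths),  # space-separated list of databases
        "-dbtype", dbtype,
        "-out", out_name,
        "-title", title,
    ]


# Hypothetical database paths for two selected genomes.
cmd = alias_command(
    ["db/genomeA", "db/genomeB"], "prot", "selected_genomes", "User-selected genomes"
)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run` on a machine with BLAST+ installed; building the alias per request keeps the per-genome databases unchanged regardless of which combination a user selects.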

BLAST results are presented dynamically in the output interface (Fig. 11). Users can check the details of the BLAST job and choose to download the results in different formats, such as HTML, archive, text, tabular, CSV, and XML. Additional jobs can also be submitted for the same queries and subjects with modified parameters. The search strategy, including the query, subject, and BLAST parameters, can be saved or downloaded for future reference. The actual BLAST results are presented in a manner similar to the typical HTML format. They include a Graphical Overview section (Fig. 3) that displays the alignment of the “high-scoring pairs” (HSPs) between the query and the subject sequences. HSPs are plotted against the query sequence and highlighted in different colors based on alignment scores. Every HSP on the plot is hyperlinked to the corresponding pairwise alignment in the Alignment section. Subject sequences that matched the query are listed in the Descriptions section, sorted by expect (E) value. The Alignment section presents the alignments of the HSPs as a series of pairwise alignments. Each alignment contains a hyperlink to the corresponding HOMD- or NCBI-annotated gene, if such information is available.
Fig. 11

The HOMD Genomic BLAST tool result summary page showing different download options for the BLAST search results
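
Downloaded tabular results lend themselves to simple post-processing. The Python sketch below assumes that the tabular download follows the default twelve-column layout of BLAST+ tabular output, which is not confirmed by the text; the two sample hits are invented for illustration.

```python
# Default column names of BLAST+ tabular output (assumed to apply here).
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_tabular(text):
    """Parse tab-separated BLAST hits and return them sorted by E value."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, line.split("\t")))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return sorted(hits, key=lambda h: h["evalue"])


# Invented example lines, not real search results.
sample = (
    "query1\tsubjectB\t75.0\t180\t45\t2\t5\t184\t20\t199\t2e-10\t65.1\n"
    "query1\tsubjectA\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-50\t380\n"
)
hits = parse_tabular(sample)
for hit in hits:
    print(hit["sseqid"], hit["evalue"], hit["bitscore"])
```

Sorting by E value reproduces the ordering used in the Descriptions section, with the most significant match first.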

To keep the HOMD Genomic BLAST responsive and convenient for the research community, we currently allow up to ten query sequences to be searched in a single job request. Since the time needed for the computation is linearly proportional to the numbers of both query and subject sequences, we expect the maximal waiting time to be no longer than 10 min, provided no previous job is waiting in the job queue. In fact, when ten protein sequences, each 500 amino acids long, were submitted to an empty queue to search against all the protein sequences of all HOMD genomes, the job was completed in about 400 s. Special requests may be considered, on a collaborative basis, for jobs containing more query sequences than the current limit.
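
The ten-sequence limit can be enforced before a job is queued by counting FASTA header lines. The Python sketch below is a minimal illustration; the function names and error messages are hypothetical and do not describe the checks actually performed by the HOMD server.

```python
MAX_QUERIES = 10  # limit stated in the text


def count_fasta_records(text):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def check_query_limit(text, limit=MAX_QUERIES):
    """Return the number of query sequences, or raise if none or too many."""
    n = count_fasta_records(text)
    if n == 0:
        raise ValueError("no FASTA records found")
    if n > limit:
        raise ValueError(f"{n} query sequences submitted; the limit is {limit}")
    return n


two_queries = ">seq1\nACGTACGT\n>seq2\nGGCCTTAA\n"
print(check_query_limit(two_queries))
```

Because search time grows with the number of queries, rejecting oversized submissions up front keeps the worst-case wait for other users bounded.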

The number of genomes hosted in the HOMD database grew from approximately 600 at launch (June 2011) to nearly 1,200 by the beginning of 2013. We expect the number to continue to grow, in step with the growth of the NCBI microbial genomes and the progress of the Human Microbiome Project. To keep pace with this foreseeable growth and the computing power necessary for Genomic BLAST and other tools, we will continue our efforts to enhance the capabilities of HOMD’s computing infrastructure.

Conclusions

The goal of creating the HOMD website and tools has been to create a community resource for those interested in obtaining information on human oral bacteria and their genomes. We have attempted to create a useful provisional taxonomic scheme so that investigators can refer to phylogenetically defined taxa rather than unanchored clones or OTUs. We provide full-length reference sequences and BLAST tools tied to our taxonomic scheme. Finally, we provide access to all genomes completed for human oral bacteria.

References

  1. Aas JA, et al. Defining the normal bacterial flora of the oral cavity. J Clin Microbiol. 2005;43:5721–32.
  2. Alcaraz LD, et al. Identifying a healthy oral microbiome through metagenomics. Clin Microbiol Infect. 2012;18 Suppl 4:54–7.
  3. Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
  4. Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–9.
  5. Bairoch A. The ENZYME database in 2000. Nucleic Acids Res. 2000;28:304–5.
  6. Belda-Ferre P, et al. The oral metagenome in health and disease. ISME J. 2012;6:46–56.
  7. Bik EM, et al. Bacterial diversity in the oral cavity of 10 healthy individuals. ISME J. 2010;4:962–74.
  8. Boeckmann B, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31:365–70.
  9. Camacho C, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421.
  10. Camon E, et al. The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res. 2003;13:662–72.
  11. Chen T, et al. The bioinformatics resource for oral pathogens. Nucleic Acids Res. 2005;33:W734–40.
  12. Chen T, et al. The Human Oral Microbiome Database: a web accessible resource for investigating oral microbe taxonomic and genomic information. Database (Oxford). 2010;2010:baq013.
  13. Dewhirst FE, et al. The human oral microbiome. J Bacteriol. 2010;192:5002–17.
  14. Dzink JL, et al. Gram negative species associated with active destructive periodontal lesions. J Clin Periodontol. 1985;12:648–59.
  15. Dzink JL, et al. The predominant cultivable microbiota of active and inactive lesions of destructive periodontal diseases. J Clin Periodontol. 1988;15:316–23.
  16. Human Microbiome Project Consortium. A framework for human microbiome research. Nature. 2012a;486:215–21.
  17. Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012b;486:207–14.
  18. Kanehisa M. The KEGG database. Novartis Found Symp. 2002;247:91–101; discussion 101–103, 119–128, 244–252.
  19. Martin J, et al. Optimizing read mapping to reference genomes to determine composition and species prevalence in microbial communities. PLoS One. 2012;7:e36427.
  20. Moore WE, Moore LV. The bacteria of periodontal diseases. Periodontol 2000. 1994;5:66–77.
  21. Moore WE, et al. Bacteriology of severe periodontitis in young adult humans. Infect Immun. 1982;38:1137–48.
  22. Moore WE, et al. Bacteriology of moderate (chronic) periodontitis in mature adult humans. Infect Immun. 1983;42:510–5.
  23. Morgulis A, et al. Database indexing for production MegaBLAST searches. Bioinformatics. 2008;24:1757–64.
  24. Paster BJ, Dewhirst FE. Phylogeny of campylobacters, wolinellas, Bacteroides gracilis, and Bacteroides ureolyticus by 16S ribosomal ribonucleic acid sequencing. Int J Syst Bacteriol. 1988;38:56–62.
  25. Socransky SS, Haffajee AD. Evidence of bacterial etiology: a historical perspective. Periodontol 2000. 1994;5:7–25.
  26. Tanner AC, et al. A study of the bacteria associated with advancing periodontitis in man. J Clin Periodontol. 1979;6:278–307.
  27. Tanner A, et al. Microbiota of health, gingivitis, and initial periodontitis. J Clin Periodontol. 1998;25:85–98.
  28. The Forsyth Metagenomic Support Consortium, Izard J. Building the genomic base-layer of the oral “omic” world. In: Sasano T, Suzuki O, editors. Interface oral health science 2009: proceedings of the 3rd international symposium for interface oral health science. New York: Springer; 2010.
  29. Xie G, et al. Community and gene composition of a human dental plaque microbiota obtained by metagenomic sequencing. Mol Oral Microbiol. 2010;25:391–405.
  30. Zdobnov EM, Apweiler R. InterProScan – an integration platform for the signature-recognition methods in InterPro. Bioinformatics. 2001;17:847–8.
  31. Zuger J, et al. Uncultivated Tannerella BU045 and BU063 are slim segmented filamentous rods of high prevalence but low abundance in inflammatory disease-associated dental plaques. Microbiology. 2007;153:3809–16.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Department of Microbiology, The Forsyth Institute, Cambridge, USA
  2. Department of Molecular Genetics, The Forsyth Institute, Cambridge, USA