Human Oral Microbiome Database (HOMD)
Keywords: Query Sequence; Human Microbiome Project; NCBI Taxonomy; Genome Viewer; Subject Sequence
The human oral cavity is a rich biological site with several microbial niches including teeth, gingival sulcus, tongue, cheek, hard and soft palates, tonsils, throat, and saliva. The microbiome of the oral cavity (Dewhirst et al. 2010) and its niches have been examined based on 16S rRNA sequencing (Aas et al. 2005; Bik et al. 2010; Human Microbiome Project 2012). The metagenome of the oral cavity has been studied to a limited degree prior to 2012 due to the complexity of the site (Alcaraz et al. 2012; Belda-Ferre et al. 2012; Xie et al. 2010). More than 700 prevalent species comprise the oral microbiome, but many taxa are present at less than 0.1 % of the microbial population (Dewhirst et al. 2010). As oral bacterial reference genomes are becoming available, primarily through the efforts of the Human Microbiome Project (Human Microbiome Project 2012), it is becoming possible to attribute metagenomic sequences to organisms at genus and species level (Martin et al. 2012). The anchoring of metagenome sequence information to specific organisms in a taxonomic framework is key to developing a full description of the bacteria-bacteria and bacteria-host interactions that underlie human oral health and disease.
The Human Oral Microbiome Database (HOMD) was developed in response to the lack of any naming or taxonomic scheme for the thousands of human oral 16S rRNA clone sequences that were being generated in the early 2000s and dumped into GenBank without any taxonomic anchor. Investigators were publishing manuscripts using clone names (such as BU063) as provisional taxonomic names. The only way to phylogenetically place an oral clone was to personally align sequences and generate one’s own phylogenetic trees. We recognized that there was a need for a 16S rRNA-based provisional taxonomic scheme to name and provide reference sequences for unnamed taxa known only from clone or isolate 16S rRNA sequences. The naming scheme had to be provisional because formal naming under the bacterial code requires isolation in pure culture and full phenotypic characterization; 16S rRNA sequence by itself is insufficient for formal naming. The taxonomic scheme described more fully below is based on a Human Oral Taxon number which runs currently from 001 to 918.
At about the time we recognized the need to create a taxonomic framework for the oral microbiome, the National Institute of Dental and Craniofacial Research released a request for proposals on “The metagenome of the oral microbiome.” We responded with a proposal entitled “A foundation for the oral microbiome and metagenome,” which was funded as DE016937. The goals of the grant were (1) to set up the HOMD web-accessible database with a provisional taxonomic scheme and to present all oral genomes in a graphical interface, (2) to complete reference genomes for oral taxa, and (3) to obtain isolates of previously uncultivated taxa and make them available to the research community by placing them in national type culture collections. We have made steady progress in achieving these goals, and this project is currently in its seventh year of funding.
The HOMD Website
The HOMD contains various types of information on human oral microorganisms including taxonomic, genomic, and bibliographic. The purpose of the HOMD website (http://www.homd.org) is to provide an easy-to-use online interface to search, retrieve, and navigate among these different types of information. HOMD also provides web-based bioinformatics software tools for data mining and analyses.
Technically, the HOMD website is built on a LAMP system hosted on web server computers. The LAMP system comprises the Linux operating system, the Apache web server, the MySQL relational database, and PHP for dynamic web page rendering. Textual content such as the taxonomy and metagenomic information is queried, and the results are displayed dynamically in the web browser, by the LAMP system. A dedicated high-performance computer cluster handles computationally demanding analyses such as sequence homology searches.
The HOMD has been designed to be compatible with the most commonly used web browsers, such as Microsoft Internet Explorer, Firefox, Google Chrome, and Safari. We suggest using one of these browsers to ensure that the HOMD web pages and tools function correctly. All HOMD information and tools are available to the general public without logging in or acquiring a user account. The log-in function exists mainly for maintaining the website and curating the database information. A user who has been designated a curator will see additional administrative submenus.
Detailed functionalities, web interfaces, and tools as well as useful usage tips are presented below. Technical information such as the implementation and design of the HOMD has been published elsewhere (Chen et al. 2010).
Features of the HOMD Web Pages
A useful feature of the HOMD web pages is the unique page ID system. The rightmost item displayed on the top navigation menu is the page ID, a unique code that identifies the page the user is currently viewing. For example, the page ID of the HOMD home page is “HP1” (Fig. 1), and once a user navigates away from the home page to, e.g., the Taxon Table page, the page ID automatically changes to “TT1.” This feature allows precise page referencing, which is particularly useful when a user needs to refer to a specific page on the HOMD site for discussion, bug reporting, or suggestions.
The HOMD home page also includes a top-down expandable menu on the left side and an introductory paragraph in the center. On the right side are the Meta-Database Search, the Announcement, and the Database Update boxes. The Meta-Database Search looks for a keyword across all the subsets of HOMD databases, including the taxonomy, the metagenomic information, and the dynamic genome annotations. The result lists the number of matches to the keyword in each subset, with links leading to the detailed information. The Announcement box displays important system-wide updates and news for the HOMD. The Database Update box is automatically updated by the HOMD dynamic genome annotation pipeline (see “Dynamic Annotation of Genomic Sequences” section) to keep track of the status of the genome annotation.
HOMD also provides comprehensive documentation and an update history of its data and tools. The HOMD User’s Guide (i.e., the help documentation) was designed to help users use the tools, navigate the information, and interpret the results provided by HOMD. The User’s Guide is accessible through the top navigation menu on all pages and is dynamically linked to the relevant guide for each tool. For example, when users are viewing the Taxon Table page, the “How to Use This Page” item in the top navigation menu leads directly to the page that explains the use of the Taxon Table. Alternatively, users can browse the entire user documentation by clicking the “Table of content” tab shown at the top of each documentation page or the “User’s Guide” links in the top menu and side menu of the home page. Every HOMD document can be searched either through the search box located at the bottom of the table of contents of the documentation page or through the Meta-Database Search box located at the top right of the home page.
The design of the online interfaces of HOMD has been driven by suggestions from HOMD users. HOMD is open to suggestions and feedback from the research community to further improve its interface and content. Currently, HOMD provides several ways to communicate with the research team and the research community. The contact information provides e-mail addresses for direct communication with the HOMD research team. There is also a mailing list for important updates and announcements. Users can subscribe to the HOMD Mailing List (https://groups.google.com/forum/#!forum/homd-mail) with their own e-mail address by sending an empty e-mail to the e-mail address: email@example.com. An automatic e-mail will be sent to the subscriber for confirmation. HOMD also provides a discussion platform for the research community (https://groups.google.com/forum/#!forum/homd-forum). Note that these web links may change over time; current links will be available on the HOMD website.
The HOMD Database Schema
The information and data provided by HOMD are stored in several databases. The Oral Taxon IDs and the genome IDs serve as the keys to cross-link these databases. The database table structures and contents can be downloaded from the HOMD FTP (file transfer protocol) site at ftp://ftp.homd.org to allow users to reconstruct the databases and perform advanced queries on their own computers.
Download Data from HOMD
Most of the data recorded in HOMD, including taxonomy, genomics, and 16S rRNA reference sequences, can be downloaded from the HOMD FTP site (ftp://ftp.homd.org). The FTP site provides both current and archived versions of the data for comparison. The FTP site can be accessed directly in the web browser. Each folder comes with a “readme” text file explaining the data, data format, and potential usage. Selected data such as the aligned reference sequence dataset, aligned 16S rRNA datasets for each taxon, and an HOMD taxonomy database in Excel format can be downloaded from the links provided in the HOMD web pages.
Compilation of the HOMD Taxa
The HOMD describes information linked to oral microbe species. For bacteria, or archaea, that have not been validly named, there is no definition of “species.” Molecular methods to identify novel species generally have used 16S rRNA sequencing of isolates or 16S rRNA-based analysis of clone libraries. These strains or clones can then be clustered into phylotypes or taxa based on their 16S rRNA sequences. A phylotype can be defined for any similarity cutoff. In HOMD, a cutoff of 98.5 % 16S rRNA sequence similarity was used to cluster the 16S rRNA sequences at the species level to define novel oral bacterial phylotypes. Each validly named species and novel phylotype cluster was given a unique integer number called the Human Oral Taxon (HOT) ID.
The original collection of oral microbial taxonomy information came from a combination of literature, primarily reports from Forsyth Institute investigators (Dzink et al. 1985, 1988; Socransky and Haffajee 1994; Tanner et al. 1979, 1998) and from Lillian Holdeman Moore and Ed Moore (Moore et al. 1982, 1983; Moore and Moore 1994), formerly at the Anaerobe Laboratory at the Virginia Polytechnic Institute. 16S rRNA sequences for these named species came either from sequences obtained in our laboratory or from GenBank. Over the past 20 years, our laboratory constructed and sequenced over 600 16S rRNA gene libraries and obtained over 35,000 clone sequences. The cloning, sequencing, aligning, treeing, and clustering methods used to create HOMD are described elsewhere (Dewhirst et al. 2010). In brief, sequences were manually aligned in a secondary structure-based database using the program RNA (Paster and Dewhirst 1988). Distance matrices and neighbor-joining trees were generated to determine the clustering of sequences. Sequences with similarity equal to or greater than 98.5 % were grouped together into a single taxon. Sequences were extensively checked for chimeras, and several sequences and some provisional taxa were removed. In this way, several hundred apparently novel full-length 16S rRNA sequences were identified.
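The grouping rule stated above (similarity of 98.5 % or greater places sequences in one taxon) can be illustrated with a minimal Python sketch. It is not the procedure used to build HOMD, which relied on manual secondary-structure alignment, distance matrices, and neighbor-joining trees. The sketch assumes the sequences are already aligned to equal length, ignores columns containing a gap or N, and swaps in simple single-linkage clustering; the function names are hypothetical.

```python
from itertools import combinations

CUTOFF = 0.985  # species-level similarity threshold stated in the text


def similarity(a: str, b: str) -> float:
    """Fraction of identical bases over alignment columns where both
    sequences have a base (columns with '-' or 'N' are skipped)."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        matches += x == y
    return matches / compared if compared else 0.0


def cluster(seqs: dict, cutoff: float = CUTOFF) -> list:
    """Single-linkage clustering with a union-find structure.

    Assumption: any pair at or above the cutoff is merged, so a cluster
    may contain pairs that are individually below the cutoff.
    """
    parent = {name: name for name in seqs}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for a, b in combinations(seqs, 2):
        if similarity(seqs[a], seqs[b]) >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())
```

Calling `cluster({"cloneA": seq1, "cloneB": seq2})` returns a list of sets of names, one set per provisional phylotype. Because pairs with little or no overlap give unreliable similarity values, a real analysis would also require a minimum number of compared columns.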
To share the information of both the named and novel human oral microbial taxa with the research community, we decided to build a database and designed web query interfaces and tools. When the HOMD was publicly launched in 2010, there were a total of 619 Human Oral Taxa in the initial release of the HOMD database. The 753 reference 16S rRNA gene sequences upon which this analysis was done have been released publicly for download on the HOMD website as version 10. At the time of writing this chapter, the total number of taxa described in the HOMD taxonomy database has grown to 688, represented by a total of 833 reference 16S rRNA sequences (HOMD RefSeq Version 13.1).
Navigating the HOMD Taxa
On the Taxon Table page, all the human oral microbial taxa are listed in a table ordered alphabetically by organism names. The order can be changed by clicking the column name HOT IDs, Genus, or Species names, to toggle the display sort order. Three commonly used filters are also provided to show only those taxa with “named species,” “unnamed cultivated species,” or “uncultured phylotypes.” Each taxon listed in the table contains links to the individual Taxon Description page (described later) and to the genomic information, if available.
The taxa can also be viewed in the taxonomic hierarchical order, i.e., from domain, phylum, class, order, family, genus, to species levels, on the Taxonomic Hierarchy page (Fig. 3). The hierarchical tree is fully collapsed by default and can be dynamically expanded at any given level (or all levels). The link, at the species level, brings users to the detailed Taxon Description page. The designation of each level is followed by two numbers enclosed in the square brackets indicating the number of taxa and genome sequences. For example, “Phylum Proteobacteria [107, 144]” indicates that in the phylum Proteobacteria, 107 taxa were identified in the oral cavity and 144 strains have genomic sequences available at HOMD. If a strain has been sequenced by multiple groups, or multiple strains sequenced for a species, we provide each sequence when available.
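For readers who copy level labels from the hierarchy page into their own notes or scripts, the bracket convention described above can be parsed as follows. This is a small sketch that assumes the label has exactly the form shown in the example ("Rank Name [taxa, genomes]"); the function and field names are hypothetical and not part of HOMD.

```python
import re
from typing import NamedTuple


class LevelLabel(NamedTuple):
    rank: str      # e.g. "Phylum"
    name: str      # e.g. "Proteobacteria"
    taxa: int      # number of oral taxa at this level
    genomes: int   # number of strains with genomic sequences


# Rank word, then the name, then "[taxa, genomes]" at the end of the label.
_LABEL = re.compile(
    r"^(?P<rank>\w+)\s+(?P<name>.+?)\s*\[(?P<taxa>\d+),\s*(?P<genomes>\d+)\]$"
)


def parse_level_label(text: str) -> LevelLabel:
    """Split a hierarchy label into rank, name, and the two counts."""
    m = _LABEL.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognized label: {text!r}")
    return LevelLabel(m["rank"], m["name"], int(m["taxa"]), int(m["genomes"]))
```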
Another way to summarize the HOMD taxa is to view the number of taxa at various taxonomy levels. The Taxonomic Level page provides a list of taxa and the number of taxa at the next lower level for each of the seven taxonomic levels. Currently, the numbers are Domain (2), Phylum (14), Class (24), Order (40), Family (83), Genus (183), and Species (688).
Human Oral Taxon (HOT) ID – The Human Oral Taxon ID is a unique numeric ID representing a particular taxon, so that the taxon can be referred to unambiguously from other sources of scientific literature. The taxon can be accessed on the web with a simple uniform resource locator (URL) format, http://www.homd.org/taxon=NNN, where NNN is the HOT ID. The Human Microbiome Project Data Analysis and Coordination Center (DACC; accessible at http://www.hmpdacc.org) uses HOT IDs to designate the taxonomic identity of isolates from the oral cavity, with URLs cross-referenced to HOMD. These URLs are embedded in the data provided by DACC so that users can find the more comprehensive information for an individual genome. The HOT IDs were also embedded in the GenBank sequence records for the 35,000 clone sequences that were used to build the initial collections of the HOMD taxa. The text embedded in the GenBank records has the syntax /db_xref="HOMD:tax_NNN", in which NNN is the numeric HOT ID. If the GenBank sequence is viewed in a web browser through the NCBI website, the portion of the text “tax_NNN” is clickable and links to the corresponding taxon page on the HOMD website. For example, the GenBank record for the partial 16S rRNA sequence of the Alloprevotella rava clone GB024 (Accession No. GU409552, http://www.ncbi.nlm.nih.gov/nuccore/GU409552) contains the text /db_xref="HOMD:tax_302", because the HOT ID for A. rava is 302. Clicking “tax_302” in this GenBank record in the web browser will bring the user to the corresponding taxon page on HOMD (http://www.homd.org/taxon=302). NCBI embeds external database reference IDs in GenBank records for cross-database referencing. More information can be found at http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref.
Status – This field displays the culturing status for the taxon. A taxon can be either a validly named cultivated species, an unnamed cultivated species, or an unnamed uncultured phylotype. This status is shown in this field and will be updated upon the change of actual status of the taxon.
Type strain/reference strain – If the taxon’s status is validly named cultivated species, the Type Strain is listed here; if the taxon is an unnamed isolate, the strain information will be listed as Reference Strain. If no cultivated strain is available yet, the Reference Strain field will be listed as “None, not yet cultivated.”
Classification – The Taxon Description page lists the nomenclatures of each taxonomic level from Domain to Species. This classification is defined by HOMD and may be different from the NCBI Taxonomy. The NCBI Taxonomy can be accessed using a dynamic link. The HOMD taxonomy is based on analysis of where each taxon falls in phylogenetic trees generated using several treeing methods and including over 100 non-oral reference taxa identified by searching the “greengenes” 16S rRNA gene database (http://greengenes.lbl.gov). For example, in HOMD, an organism such as Eubacterium saburreum is placed in the family Lachnospiraceae (because that is where it falls phylogenetically), rather than in the family Eubacteriaceae (because its incorrect genus name “Eubacterium” has not yet been revised). Synonyms of the taxon that are currently in use or were used before in the literature or publications are also provided.
16S rRNA gene sequence – GenBank accession numbers of one or more 16S rRNA gene sequences associated with the taxon, each linked to the corresponding NCBI Entrez record.
16S rRNA gene sequence alignment – This field provides the link to the downloadable clone sequences preliminarily aligned to the reference sequence to which the clones belong. The current set contains the approximately 35,000 clone sequences (Dewhirst et al. 2010) aligned for each taxon. The clone alignments are provided in concatenated FASTA format, with the reference sequence(s) that were used as the template for alignment on top. To view the alignment in color and to adjust it further, third-party alignment viewing software may be used, such as SeaView (http://pbil.univ-lyon1.fr/software/seaview.html) and BioEdit (http://www.mbio.ncsu.edu/BioEdit/bioedit.html). Because some pairs of clone sequences may be nonoverlapping (e.g., 500-base sequences at opposite ends of the molecule), this file must be used with caution for tree construction.
Phylogeny – A phylogenetic tree showing the position of this taxon among related HOMD taxa is provided here. The tree images are in PDF format and can be viewed or downloaded with the link provided in this field. A link to a list of all the downloadable phylogenetic tree images encompassing all the HOMD taxa is also provided.
Prevalence by molecular cloning – The number of clones found for this taxon in an analysis of approximately 35,000 clones (Dewhirst et al. 2010). Based on the number of clones found, the rank abundance of the taxon (out of 619) is given.
Synonyms – Lists previous names for the organism if validly named. Isolate or clone designations are given as synonyms when they have appeared in the literature as “names” for the taxon, such as “BU063” (Zuger et al. 2007).
NCBI taxonomy – For validly named species, there is a link to the NCBI Taxonomy. NCBI has no taxonomy for unnamed taxa, which is the reason HOMD was created.
PubMed search – The number of hits when the name (genus plus species) of this taxon is used in a PubMed search. HOMD automatically updates this hit number every 2 weeks. To get the most up-to-date result, click the “PubMed Link” to run the search live on the NCBI PubMed site. In general, there are no results for unnamed taxa, hence the need for HOMD. When articles referencing these taxa (often through clone numbers) are found by HOMD curators or community members, they are manually added to the Taxon Description.
Nucleotide search – The same search as above using NCBI Entrez “nucleotide” as the reference database. The latest hit count is displayed with a link to NCBI for the most up-to-date search.
Protein search – The same search as above using NCBI Entrez “protein” as the reference database. The latest hit count is displayed with a link to NCBI for the most up-to-date search.
Genomic sequence – Number of genomes that have been sequenced is indicated here with a link to a detailed list of these genomes.
Hierarchy structure – An expandable/collapsible view of a dynamically displayed taxonomy tree indicating the position of the taxon on the page.
Cultivability – Conditions and media for growing strains of this taxon, if available.
Phenotypic characteristics – Generic phenotypic description of the taxon if the taxon has cultivated member(s).
Prevalence and source – Describes the frequency and source of clones and isolates from different oral sites and states of health or disease when known.
References – Literature and publications referencing this taxon. These references are manually curated with up to ten key references which may also include older references not indexed in PubMed.
Community comments – Registered and logged-in users can provide feedback related to this taxon. A comment requires the approval of the HOMD curators before it is shown to the public.
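The two HOT ID conventions described in the Human Oral Taxon (HOT) ID field above, the taxon URL and the GenBank db_xref qualifier, lend themselves to a short Python sketch. It assumes only the two patterns quoted in the text; the function names are hypothetical, and whether the website expects zero-padded IDs below 100 is not stated in the text, so the number is passed through unchanged.

```python
import re

# URL pattern quoted in the text; NNN is the HOT ID.
HOMD_TAXON_URL = "http://www.homd.org/taxon={hot_id}"

# Qualifier syntax quoted in the text: /db_xref="HOMD:tax_NNN"
_DB_XREF = re.compile(r'/db_xref="HOMD:tax_(\d+)"')


def taxon_url(hot_id: int) -> str:
    """Build the taxon page URL for a HOT ID (no zero padding applied)."""
    return HOMD_TAXON_URL.format(hot_id=hot_id)


def hot_ids_from_genbank(record_text: str) -> list:
    """Return every HOT ID cross-referenced in a GenBank flat-file record."""
    return [int(n) for n in _DB_XREF.findall(record_text)]
```

For the A. rava example, `hot_ids_from_genbank` applied to the text of record GU409552 would be expected to yield 302, and `taxon_url(302)` gives the URL cited in the text.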
Identification of 16S rRNA Gene Sequence by BLAST Search
One of the most used HOMD software tools is the customized BLAST search specifically designed to identify user-provided 16S rRNA sequences against the comprehensive collection of the 16S rRNA reference gene sequences. Currently there are a total of 688 taxa defined based on version 13.1 of the 16S rRNA reference sequences. Since a phylotype can include members with up to 1.5 % sequence divergence (23 bases for a full 1,500-base sequence), multiple reference sequences have been selected where we have sequences diverging by more than 10 bases within a taxon.
HOMD provides two primary sets of 16S rRNA gene reference sequences (RefSeq) for download and for BLAST search. The first set is the HOMD 16S rRNA RefSeq. This set contains sequences representing all currently named and unnamed oral taxa. In the latest reference sequence set (version 13.1 at the time of writing), there are 834 reference sequences representing the 688 taxa. The second is the HOMD 16S rRNA Extended RefSeq. This set contains additional 16S rRNA reference gene sequences that are distinctly different from existing taxa but have not yet been assigned a taxon ID.
The HOMD reference sequences are corrected consensus sequences. Many have been corrected and extended based on alignment with other sequences for that taxon and Ns and indels removed. Therefore, for many sequences, there will be differences between the reference sequence and the GenBank sequence listed in the header information. We have not yet updated our own GenBank sequences and cannot update those from other depositors. We believe these are currently the best reference sequences available and, for the purposes of BLAST analysis, have the advantage of being of a uniform length.
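As an illustration of how the 98.5 % species-level cutoff might be applied to BLAST results, the following Python sketch parses standard BLAST+ tabular output (`-outfmt 6`, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), keeps the best-scoring hit per query, and reports the reference only when percent identity reaches the cutoff. The output format and decision rule of the HOMD BLAST tool itself are not described in this section, so this is an assumption for a locally run search against downloaded reference sequences; all function names are hypothetical.

```python
from typing import NamedTuple, Optional

SPECIES_CUTOFF = 98.5  # percent identity, from the clustering rule in the text


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float
    length: int
    bitscore: float


def parse_tabular(text: str) -> list:
    """Parse BLAST+ default tabular output (12 tab-separated columns)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[11])))
    return hits


def best_hits(hits: list) -> dict:
    """Keep the highest-bitscore hit for each query sequence."""
    best = {}
    for h in hits:
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return best


def call_taxon(hit: Hit, cutoff: float = SPECIES_CUTOFF) -> Optional[str]:
    """Return the reference ID if identity meets the cutoff, else None."""
    return hit.subject if hit.pident >= cutoff else None
```

A production analysis would also check alignment length against query length, since a short, high-identity alignment is weak evidence for a species-level assignment.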
Genomics Tools Overview
Complementary to the taxonomy information, the HOMD also provides comprehensive information and tools for studying the genomes of human oral microbes. The HOMD genomics database serves as the curated repository for the molecular sequences of the human oral microbiome, including complete and partial genomic sequences as well as the 16S rRNA sequences mentioned in the previous section. Genomic sequences available at HOMD can be fully assembled genomes, high-coverage genomes, or genome surveys. HOMD also keeps track of the status of ongoing genome sequencing projects for human oral microorganisms. A Sequence Meta Information page is created to hold relevant genomics and sequence meta information once a sequencing project for a human oral microbe is announced and available in the NCBI Genome Project Database. The genome project status is updated biweekly based on information collected from the NCBI Genome Project Database with an automatic query script. Once genomic sequences are publicly released, they are dynamically annotated by HOMD (Dynamic Annotation). Annotation done by other data centers, if available, is termed “static annotation” and is viewable in a separate panel in the Genome Viewer (described below). Relevant tools are provided for viewing and searching the annotation. These tools were first developed as part of the Bioinformatics Resource for Oral Pathogens (BROP: http://www.brop.org; Chen et al. 2005). The programs and the data-mining schemes used in HOMD are designed for both finished and unfinished (collections of multiple contigs) genome sequences. The tools are integrated with the HOMD website, and icons or links to the tools available for a specific genome are automatically presented on the relevant pages. Important genomic data and bioinformatics tools provided by HOMD are described below. Additional information on the tools is available in the previous publication (Chen et al. 2005).
HOMD organizes genomes in three viewing options: Taxa with Annotated Genomes, Taxa with Genomes in Progress, and View All Genomes. The first option lists the oral taxa with annotated (static or dynamic) genomic information and provides links to all the genomes available for each taxon. The View Genome button links to the Genome Table showing all the available genomes of a specific taxon. The Genome Table shows the Oral Taxon ID (HOT), the Genus and Species names, Strain Culture Collection, HOMD Sequence ID (SEQ ID), number of contigs and singlets, combined sequence length, and links to available tools and information. The second option (Taxa with Genomes in Progress) lists those oral taxa for which a genome sequencing project is in progress but no sequence is yet available. The third option shows all the genomes in alphabetical order and provides searching and sorting functions for easy navigation. Each genome listed has a link to the Sequence Meta Information page described next.
Sequence Meta Information
Full and High-Coverage Genomes
Full genomes are the oral microbial genomes that have been fully assembled, while the high-coverage genomes are not fully assembled but represent coverage of most of the genomes. Both types of genomes are annotated and deposited in a public database such as GenBank. HOMD aims to provide frequently updated genomic annotation for oral bacterial genomes (see below). In addition, HOMD provides graphical genomic viewing for static annotations done by other public data centers such as NCBI or JCVI.
One of the original major goals of the NIH-funded project “A Foundation for the Oral Microbiome and Metagenome,” DE016937, was to partially sequence up to 100 representative human oral microbial species. A total of 12 low-coverage partial genomic sequences were sequenced and deposited in NCBI before this project fused with the Human Microbiome Project. The genome information for these 12 surveys is still maintained on HOMD even though they currently also have complete or high-coverage genomes (The Forsyth Metagenomic Support Consortium and Izard 2010). Since the launch of the Human Microbiome Project, the HOMD team has been providing genomic DNA from human oral microbes to the four HMP sequencing centers for high coverage rather than survey sequencing (The Forsyth Metagenomic Support Consortium and Izard 2010).
Dynamic Annotation of Genomic Sequences
One of the major features of the HOMD Genomic Database is a pipeline that automatically and frequently updates the annotation of genomes of oral isolates. Although the amount of sequence data is still growing rapidly, the computational power needed for bioinformatic analysis of these data is catching up, and the cost and energy consumption per CPU are decreasing owing to the availability of multi-core CPUs. The lower cost of computational power has made it feasible for us to set up a small computation farm dedicated to the annotation of human oral microbial genomes. HOMD uses a cluster of multi-core, multi-node computer servers to update the annotation frequently. Current HOMD genome annotation algorithms include (i) BLASTP (http://www.ncbi.nih.gov/BLAST; Altschul et al. 1997) search against weekly updated NCBI nonredundant protein data (ftp://ftp.ncbi.nih.gov/blast/db/), (ii) BLASTP search against Swiss-Prot protein data (http://us.expasy.org/sprot/; Boeckmann et al. 2003), and (iii) InterProScan search against various sequence databases (Zdobnov and Apweiler 2001; http://www.ebi.ac.uk/interpro/). To provide data on the functional potential of genomes, BLASTP search results against Swiss-Prot are further processed for the construction of KEGG metabolic pathways and Gene Ontology trees. We take advantage of the fact that the well-annotated Swiss-Prot protein sequence descriptions contain cross-links to the ENZYME (Bairoch 2000) and Gene Ontology (Camon et al. 2003) databases. The dynamic genome annotation runs around the clock on the dedicated computer cluster except during the weekend, when the latest NCBI nonredundant protein, Swiss-Prot, and InterPro databases are downloaded to and updated on our server. Currently a total of 324 genomes representing 306 taxa are being repeatedly annotated by this pipeline. On average, each genome takes about 3 h to annotate; thus, the current re-annotation cycle is approximately a month for all the 300+ genomes.
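The re-annotation cycle quoted above can be checked with a back-of-envelope calculation. With 324 genomes at about 3 h each, a strictly sequential run needs 972 h, or about 40 days of continuous computing, which is of the same order as the stated cycle of approximately a month. The Python sketch below makes the estimate explicit; the number of parallel workers and the number of active days per week are hypothetical parameters, since the text does not state how the pipeline is scheduled across the cluster.

```python
def cycle_days(
    n_genomes: int,
    hours_per_genome: float,
    workers: int = 1,               # hypothetical: parallel annotation jobs
    hours_per_day: float = 24.0,
    active_days_per_week: float = 7.0,  # e.g. 5.0 if weekends are excluded
) -> float:
    """Calendar days needed to annotate every genome once."""
    if workers < 1 or not 0 < active_days_per_week <= 7:
        raise ValueError("invalid scheduling parameters")
    compute_hours = n_genomes * hours_per_genome
    active_days = compute_hours / (workers * hours_per_day)
    # Stretch active days to calendar days when some weekdays are idle.
    return active_days * 7.0 / active_days_per_week
```

For example, `cycle_days(324, 3)` gives 40.5 days for a single uninterrupted worker, and `cycle_days(324, 3, workers=2)` halves that.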
Additional genomes are being added to the annotation pipeline as more sequences are made available by other public sequencing projects such as the Human Microbiome Project (http://www.hmpdacc.org). A live update status of the genome annotation is provided on the HOMD home page indicating the latest genome annotated or updated. HOMD aims to maintain frequent and dynamic computer annotation for genomic sequence of at least one isolate from each oral taxon whenever sequences are made publicly available, as well as static annotation of all annotated releases.
HOMD Genomic BLAST
The HOMD Genomic BLAST query interface starts with the selection of the genomes to be searched against. All the HOMD genomes available for search are displayed and selectable in a collapsible tree based on the taxonomy hierarchy. As shown in Fig. 9, upon starting the HOMD Genomic BLAST, the taxonomy hierarchical tree is fully expanded by default and can be dynamically collapsed at any given level. The links at the species level and the genome level lead to the detailed Taxon Description and Sequence Meta Information pages, respectively. The numbers in square brackets at each level are the numbers of oral taxa, genomes with meta information, genomes with HOMD annotation, and genomes with NCBI annotation, respectively. The genome selection is flexible and can be a single genome, any arbitrary set of individual genomes, a group of genomes at any taxonomy level (from Domain to Species), all the genomes dynamically annotated at HOMD, all the genomes with static annotations by NCBI, or a representative genome from each species. The total number of genomes selected is shown at the top of the page.
Upon submission of the BLAST search, the requested job is sent to the back-end service for processing. The back-end service consists of a computer cluster that handles multiple requests from the query interface. The selected genomes/nucleotides/proteins are dynamically compiled into a virtual sequence database searchable by the BLAST programs, using the “blastdb_aliastool” tool provided by BLAST+ (Camacho et al. 2009). The search jobs are distributed to the compute nodes of the cluster, which is managed by the TORQUE resource manager (http://www.adaptivecomputing.com/products/open-source/torque). During the search, the user is presented with an intermediate page to monitor the job status. This status page reports a summary of the job as well as the time elapsed since submission. The status page periodically refreshes itself, effectively polling the server while the job runs. The BLAST result is presented automatically when the job completes.
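The virtual-database step can be sketched in Python as follows. The example only assembles a `blastdb_aliastool` command line for a set of per-genome BLAST databases, using the tool's `-dblist`, `-dbtype`, `-out`, and `-title` options. It is an illustration under assumptions rather than the HOMD back-end code: the function name and the database paths are hypothetical, and the per-genome databases are assumed to have been built beforehand with `makeblastdb`.

```python
from typing import List, Optional


def alias_command(
    db_paths: List[str],
    out: str,
    dbtype: str = "nucl",
    title: Optional[str] = None,
) -> List[str]:
    """Build the argument list for blastdb_aliastool.

    db_paths: existing BLAST databases (one per selected genome).
    out:      base name of the alias (virtual) database to create.
    """
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    if not db_paths:
        raise ValueError("select at least one database")
    if any(" " in p for p in db_paths):
        # -dblist is a space-separated list, so paths must not contain spaces.
        raise ValueError("database paths must not contain spaces")
    cmd = [
        "blastdb_aliastool",
        "-dblist", " ".join(db_paths),
        "-dbtype", dbtype,
        "-out", out,
    ]
    if title:
        cmd += ["-title", title]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine with BLAST+ installed, after which the alias name given in `out` can be used as the `-db` argument of `blastn` or `blastp`.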
To keep the HOMD Genomic BLAST responsive for the research community, we currently allow up to ten query sequences to be searched in a single job request. Since the computation time is linearly proportional to the numbers of both query and subject sequences, we expect the maximal waiting time to be no longer than 10 min, provided no previous job is waiting in the job queue. In fact, when ten protein sequences of 500 amino acids each were submitted to an empty queue to search against all the protein sequences of all HOMD genomes, the job was completed in about 400 s. Special requests may be considered, on a collaborative basis, for jobs containing more query sequences than the current limit.
The number of genomes hosted by the HOMD database has grown from approximately 600 genomes at launch (June 2011) to nearly 1,200 genomes towards the beginning of 2013. We expect the number to continue to grow, in concordance with the growth of the NCBI microbial genomes as well as the progress of the Human Microbiome Project. To keep pace with this foreseeable growth and the computing power necessary for Genomic BLAST and other tools, we will continue our efforts to enhance the capabilities of HOMD’s computer backbone.
The goal of creating the HOMD website and tools has been to create a community resource for those interested in obtaining information on human oral bacteria and their genomes. We have attempted to create a useful provisional taxonomic scheme so that investigators can refer to phylogenetically defined taxa rather than unanchored clones or OTUs. We provide full-length reference sequences and BLAST tools tied to our taxonomic scheme. Finally, we provide access to all genomes completed for human oral bacteria.
- The Forsyth Metagenomic Support Consortium, Izard J. Building the genomic base-layer of the oral “omic” world. In: Sasano T, Suzuki O, editors. Interface oral health science 2009: proceedings of the 3rd international symposium for interface oral health science. New York: Springer; 2010.