Background

Genomic data are essential resources for modern biology and are most useful when freely accessible to all, especially when the databases are curated and simple, efficient data mining tools are available. Major centralized repositories play a crucial role [1–5]. However, because genomic data are complex and diverse, it is very difficult, if not impossible, to meet all scientific demands through these major repositories alone. Well-designed smaller databases and websites dedicated to a specific clade or family can be vital for analysis and research, especially for individual laboratories focusing on model organisms [6].

The first haloarchaeal genome sequenced was that of Halobacterium sp. NRC-1 [7, 8]. Initially, the 191 kilobase pair plasmid pNRC100 was sequenced and made public in 1998 [7]. In 2000, with the sequencing of the remainder of the 2.57 megabase pair genome of NRC-1, the annotation of pNRC100 was extensively revised and updated [8]. To provide access to the most current data and facilitate functional genomic studies on Halobacterium sp. NRC-1, we created a custom database and website named HaloWeb. The prototype HaloWeb site was made available to the public in 2000 as a service to the community and has been available for the past ten years [6–21].

With the recent increase in the number of completed genomes, including ten additional haloarchaeal genomes [22–28], research efforts have shifted from the single- to the multiple-genome level. As a result, it became necessary to update the HaloWeb site to incorporate the newly sequenced genomes, including up-to-date annotation data. The updated HaloWeb site incorporates enhanced data access and mining tools for Halobacterium sp. NRC-1 and the other haloarchaeal genomes.

Onsite tools for genome map generation, gene and intergenic sequence extraction, and sequence editing have been developed and implemented on the website. In addition, other popular web tools and resources have been dynamically linked. The database and website also provide templates for additional ongoing genome sequencing projects, and we expect to maintain and update resources for future data mining. Finally, the HaloWeb platform provides our laboratory with an information management system for integrating public genomic data with additional proprietary transcriptomic and comparative genomic resources.

Results and Discussion

The HaloWeb server runs on Free/Libre and Open Source Software (FLOSS), specifically the Linux, Apache, MySQL, and Perl (LAMP) stack [29]. The HaloWeb gateway page (Figure 1) contains links to the 11 haloarchaeal homepages, as well as other useful resources such as HaloEd, a database for education using halophilic microorganisms, and convention and conversion information. Most information is freely accessible in the public domain portion of the site.

Figure 1

HaloWeb Database Gateway Page. This page provides information (sequencing status and date, physiology and ecology, and genome size) and links for the eleven sequenced haloarchaea.

HaloWeb Genome Home

The genome homepage for each organism contains links to the organism's gene table, search page, genome maps, and sequence editing tool, as well as links to the BLAST and genome sequence download pages at NCBI and to abstracts on the organism in PubMed.

Gene Table

The gene table allows for genomic analysis of all 11 organisms by providing selection options based on different criteria, such as replicon and gene type. The uniform interface also gives a consistent view across organisms, which facilitates database traversal. For each gene, the table contains the locus, orientation, replicon, annotation, and gene ID.
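The selection options described above can be illustrated with a minimal sketch. HaloWeb itself is implemented in Perl; Python is used here only for illustration. The `GeneRow` fields mirror the columns named in the text, but the field names, the `filter_genes` helper, the locus format, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class GeneRow:
    """One gene-table row; field names are illustrative, not HaloWeb's schema."""
    gene_id: str
    locus: str          # assumed "start..end" notation
    orientation: str    # "+" or "-"
    replicon: str
    gene_type: str
    annotation: str


def filter_genes(rows: Iterable[GeneRow],
                 replicon: Optional[str] = None,
                 gene_type: Optional[str] = None) -> List[GeneRow]:
    """Keep rows matching every criterion that was supplied (None = no filter)."""
    return [
        row for row in rows
        if (replicon is None or row.replicon == replicon)
        and (gene_type is None or row.gene_type == gene_type)
    ]


# Made-up rows for illustration only; these are not real HaloWeb records.
SAMPLE_ROWS = [
    GeneRow("gene0001", "1..900", "+", "chromosome", "protein", "hypothetical protein"),
    GeneRow("gene0002", "1200..1275", "-", "chromosome", "tRNA", "placeholder tRNA"),
    GeneRow("gene0003", "50..650", "+", "pNRC100", "protein", "hypothetical protein"),
]
```

Calling `filter_genes(SAMPLE_ROWS, replicon="chromosome", gene_type="protein")` would keep only the first row, while omitting both criteria returns the full table.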

Search Tool

The search tool provides a comprehensive approach to data mining, allowing genes to be searched by ID number, name, annotation, or location in each genome. Through a unified interface, it issues MySQL queries against the organism's database. The queries are optimized for quick retrieval by selecting only the columns needed to fill the results table.
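A query of this kind might look like the following sketch. It is not the HaloWeb code: the site uses Perl with MySQL, whereas this example uses Python's built-in `sqlite3` module as a stand-in so that it runs without a server. The table name `genes`, all column names, and the demo rows are assumptions made for illustration.

```python
import sqlite3

# Columns a user may search by text; the names are hypothetical.
SEARCH_FIELDS = ("gene_id", "name", "annotation")
# Only the columns needed for the results table are selected.
RESULT_COLUMNS = "gene_id, name, replicon, start_bp, end_bp, annotation"


def make_demo_db() -> sqlite3.Connection:
    """Build a small in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genes ("
        "gene_id TEXT PRIMARY KEY, name TEXT, replicon TEXT, "
        "start_bp INTEGER, end_bp INTEGER, strand TEXT, "
        "annotation TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO genes VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [
            ("gene0001", "abcA", "chromosome", 1, 900, "+", "hypothetical transporter", "ATGAAA"),
            ("gene0002", "xyzB", "chromosome", 1200, 1275, "-", "hypothetical kinase", "ATGCCC"),
            ("gene0003", "defC", "pNRC100", 50, 650, "+", "hypothetical protein", "ATGGGG"),
        ],
    )
    return conn


def search_by_text(conn: sqlite3.Connection, field: str, term: str) -> list:
    """Substring search on one whitelisted column; the term is bound as a parameter."""
    if field not in SEARCH_FIELDS:
        raise ValueError(f"unsupported search field: {field}")
    sql = f"SELECT {RESULT_COLUMNS} FROM genes WHERE {field} LIKE ? ORDER BY replicon, start_bp"
    return conn.execute(sql, (f"%{term}%",)).fetchall()


def search_by_location(conn: sqlite3.Connection, replicon: str, lo: int, hi: int) -> list:
    """Return genes on a replicon that overlap the interval lo..hi."""
    sql = (
        f"SELECT {RESULT_COLUMNS} FROM genes "
        "WHERE replicon = ? AND start_bp <= ? AND end_bp >= ? ORDER BY start_bp"
    )
    return conn.execute(sql, (replicon, hi, lo)).fetchall()
```

The stored sequence and strand columns are deliberately left out of `RESULT_COLUMNS`, which is one plausible reading of "minimum columns necessary". The column name is checked against a whitelist because SQL placeholders can bind values but not identifiers.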

Gene Page

The HaloWeb gene pages (Figure 2) link to information resources for each gene through our custom database query tools. The tools permit BLASTing the gene against protein and nucleotide databases at NCBI, accessing protein data at GenBank [2], and accessing the associated COGs and KOGs from NCBI [30]. There are also links to the KEGG [3] and GTOP [4] databases. A table is also generated containing links to surrounding genes, the number of which may be selected from a dropdown menu. The table lists each gene's ID, name, size, and annotation.
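Selecting the surrounding genes for such a table can be sketched as a simple window over the ordered gene list. The function name `neighbor_window` and the `flank` parameter (the number of genes shown on each side, standing in for the dropdown value) are hypothetical. The sketch also assumes the replicon is treated as linear, so the window is clipped at either end rather than wrapping around.

```python
from typing import List, Sequence


def neighbor_window(gene_ids: Sequence[str], target: str, flank: int) -> List[str]:
    """Return the target gene plus up to `flank` genes on each side.

    `gene_ids` must be ordered by position on the replicon. The window is
    clipped at the ends of the list (a linear-replicon assumption).
    """
    if flank < 0:
        raise ValueError("flank must be non-negative")
    index = list(gene_ids).index(target)  # raises ValueError if target is absent
    lo = max(0, index - flank)
    hi = min(len(gene_ids), index + flank + 1)
    return list(gene_ids[lo:hi])
```

For an ordered list of seven genes, a `flank` of 2 around the fourth gene yields five rows, while the same setting around the first gene yields only three.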

Figure 2

HaloWeb Gene Page. An example of a gene page is shown for Halobacterium sp. NRC-1 orc7. A variety of information (when available) is linked near the top of the page, followed by a table of the gene region, a corresponding genetic map, and a sequence data form.

As an alternative way of navigating the database, a gene map with links to surrounding genes is available. The number of genes shown is set by the dropdown menu, and an image map adds informational popups and links to the otherwise static map. Controls below the map move the gene map window by changing the selected gene or by jumping to either end of the current map. Below the map is a form whose controls open a popup with sequence data for the current gene region. Optionally, sequence data for an area around the gene, including intergenic sequences, can be retrieved.
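Retrieving a gene's sequence together with the surrounding region can be sketched as follows. The function names, the 1-based inclusive coordinate convention, and the `flank` parameter are assumptions for illustration and are not taken from the HaloWeb code. The sketch also clips at the replicon ends instead of wrapping around a circular replicon.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_region(replicon_seq: str, start: int, stop: int,
                   strand: str = "+", flank: int = 0) -> str:
    """Extract a gene with `flank` extra bases on each side.

    `start` and `stop` are 1-based inclusive coordinates on the forward
    strand. For a gene on the "-" strand the extracted region is returned
    as its reverse complement.
    """
    if not 1 <= start <= stop <= len(replicon_seq):
        raise ValueError("coordinates out of range")
    lo = max(1, start - flank)
    hi = min(len(replicon_seq), stop + flank)
    region = replicon_seq[lo - 1:hi]
    return reverse_complement(region) if strand == "-" else region
```

Setting `flank` to the distance to the neighboring genes would return the adjacent intergenic sequence along with the gene itself.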

Maps

Map queries are also possible in HaloWeb (Figure 3). The first dialog is accessed by clicking on the "Genome Map" link. This dialog contains a replicon selection radio box and a button to continue to the next section. The second section is a form to set the format for the generation of the map, including dropdown menus for bases per line, pixels per line, and a list of genes. The list of genes is used for selecting the first and last genes, using buttons to fill in the read-only text boxes. There are also check boxes to use links or get the entire replicon. Finally, the map is generated by clicking the "Submit" button, which opens a new tab with the image.
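The way the bases-per-line and pixels-per-line settings could determine where a gene is drawn can be sketched as a coordinate calculation. This covers only the layout arithmetic, not the drawing, which HaloWeb performs with the Perl GD module. The function `layout_gene` and its return format are hypothetical, and positions are assumed to be 1-based and relative to the start of the map.

```python
from typing import List, Tuple


def layout_gene(start: int, stop: int,
                bases_per_line: int, pixels_per_line: int) -> List[Tuple[int, int, int]]:
    """Split a gene into per-line drawing segments.

    Returns (line_index, x0, x1) tuples, where x0 and x1 are pixel offsets
    from the left edge of that line. A gene that crosses a line boundary
    produces one segment per line it touches.
    """
    if bases_per_line <= 0 or pixels_per_line <= 0:
        raise ValueError("line settings must be positive")
    if not 1 <= start <= stop:
        raise ValueError("invalid coordinates")
    scale = pixels_per_line / bases_per_line   # pixels per base
    segments = []
    pos = start - 1        # 0-based position of the first base
    end = stop             # 0-based exclusive end
    while pos < end:
        line = pos // bases_per_line
        line_origin = line * bases_per_line
        seg_end = min(end, line_origin + bases_per_line)
        x0 = round((pos - line_origin) * scale)
        x1 = round((seg_end - line_origin) * scale)
        segments.append((line, x0, x1))
        pos = seg_end
    return segments
```

With 1000 bases and 500 pixels per line, a gene spanning positions 901 to 1100 is split into one segment at the end of the first line and one at the start of the second.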

Figure 3

HaloWeb Map Query Page. An example of a map query is shown for the Halobacterium sp. NRC-1 chromosome. After input of the desired specifications, a map of the region is generated.

Conclusions

With the completion of the updated HaloWeb site, genome data from a major family of microorganisms, the haloarchaea, are readily accessible. This resource has served the academic research community for many years. HaloWeb also includes proprietary data generated in-house, including microarray and protein cluster data, and serves as a useful laboratory information management system [31].

Methods

Software Tools

The servers run Red Hat Enterprise Linux and Fedora, in both 32-bit and 64-bit versions. The Apache 2 web server serves the web pages, and a MySQL Community Server provides the database backend. Most scripts are written in Perl and connect to MySQL through the Database Interface (DBI) module from the Comprehensive Perl Archive Network (CPAN), which allows the greatest flexibility in script writing and in the choice of database program. Perl also allows easy graphics generation, as in the gene mapping utility, through the object-oriented GD module, which wraps the GD library. Parameters are passed through the Common Gateway Interface (CGI) module. In some cases, JavaScript code is also used.

Genome data

Genome data for the following organisms were obtained from NCBI: Halobacterium sp. NRC-1, Haloarcula marismortui, Natronomonas pharaonis, Haloquadratum walsbyi, Haloferax volcanii, Halorubrum lacusprofundi, Halobacterium sp. R-1, Halomicrobium mukohataei, Halorhabdus utahensis, Halogeometricum borinquense, and Haloterrigena turkmenica.