Background

Recently, two marsupial genomes and one monotreme genome have been sequenced: the grey short-tailed opossum (Monodelphis domestica; 7× coverage) [1], the tammar wallaby (Macropus eugenii; 2×) (in prep.), and the platypus (Ornithorhynchus anatinus; 6×) [2]. The marsupial and monotreme lineages branched off from the lineage leading to eutherian mammals approximately 148 and 166 million years ago, respectively [3]. They hold a unique evolutionary position, providing a link to the reptilian phase of our ancestry. Together with their unusual biological traits, this makes them a valuable source of insight into mammalian biology and evolution.

Genome sequencing has generated large amounts of genomic data, which has expedited the identification of genes in these species. Despite the availability of genome assemblies, only the most phylogenetically conserved immune genes have been identified by automated gene annotation pipelines. Genes involved in the immune response are subject to intense selective pressure due to the need to overcome pathogenic challenges. As a result, it is common for immune genes, particularly those with immunomodulatory roles, to show very low levels of sequence conservation between species [4, 5]. This has led to many key immune molecules being missed by the Ensembl [6] and NCBI Gnomon (http://www.ncbi.nlm.nih.gov) genome annotation platforms. Fewer than a third of the opossum immune genes annotated using specialized search strategies by Wong et al. 2006 [7], Belov et al. 2006 [8] and Belov et al. 2007 [9] were predicted by the Ensembl pipeline [6]. Aside from high levels of sequence divergence, many immune gene families have also evolved through rapid successions of gene loss and gain, resulting in a lack of direct orthologs. Hence, these genes are difficult to characterize with local pairwise similarity search algorithms, such as BLAST [10], which use a single gene sequence to query a database.

To overcome the lack of annotated sequence information for immune genes, targeted, manually curated strategies were applied [7, 9, 11, 12]. Identification of the most highly divergent sequences required an intensive combination of strategies incorporating hidden Markov model searches, exploitation of conserved syntenic regions, sensitive local search algorithms and gene prediction integrating extrinsic information [7, 9, 11, 12]. Less divergent genes missed by Ensembl could be identified and annotated using chained-BLAST searches [9].

Here, we present a database of curated marsupial and monotreme immune sequences. We have included novel predicted and expressed sequences as well as previously annotated genes [7, 9, 11–45]. Examples of gene groups represented in the database include chemokines, interleukins, Natural Killer (NK) receptors, Major Histocompatibility Complex (MHC) antigens, surface receptors and antimicrobial peptides. Annotations derived from a transcriptomic analysis of a primary lymphoid organ have also been included [46]. Many of these genes (e.g. 209 expressed tammar genes) have not been annotated by Ensembl and their sequences are not curated in other public databases. The database has a simple interface and offers several methods for querying the sequences. On entry to the database, sequences were further annotated to provide searchable functional information. The availability of a comprehensive gene set assists large-scale projects such as transcriptomic analyses and microarray studies. It also facilitates the development of marsupial- and monotreme-specific reagents, allowing detailed analyses of metatherian and prototherian immune responses.

Construction and content

IDMM was implemented using the Python web framework Django (version 1.1) [47] with a SQLite3 (version 3.6.3) database [48]. Data can be easily updated by approved managers through a simple web interface. Once sequences are added, they are automatically matched to HGNC names and GO terms through a BLAST search. Amino acid sequences are additionally searched against the Conserved Domain Database (CDD) [49] to create protein domain annotations. Sequences are stored in FASTA format and are identified by their sequence header description which includes the gene name and species name.
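As an illustration of how FASTA records can be keyed by their header description, the following is a minimal sketch and not the IDMM code itself. The function name read_fasta and the example headers are hypothetical; the only property taken from the text is that each header identifies one sequence.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into a {header description: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            # Headers act as identifiers, so duplicates are rejected.
            if header in records:
                raise ValueError(f"duplicate FASTA header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Sequence lines of one record are joined into a single string.
            records[header] += line
    return records


# Hypothetical headers containing a gene name and a species name.
example = """>IL2 tammar wallaby
ATGGCC
TTGA
>CD4 platypus
ATGAAC
""".splitlines()

records = read_fasta(example)
```

Rejecting duplicate headers at load time is a design choice of this sketch; it keeps the header usable as a lookup key.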

Database content and data source

A total of 2,935 genes, 602 expressed (538 tammar wallaby, 24 opossum, 16 platypus, 11 echidna, 6 red-necked wallaby, 4 brushtail possum and 3 bandicoot) and 2,333 predicted (1,639 opossum, 694 platypus), are currently stored in the database. The database includes 1,985 published sequences. We have integrated data from various published resources, which include expressed and predicted genes from opossum (1,663) [7–9, 18, 34, 35, 38, 39, 43–45, 50], tammar (37) [21–26, 33, 35–37, 45], brushtail possum (4) [18, 20, 27–29, 40], echidna (11) [14, 17, 19, 30, 32], bandicoot (3) [15], red-necked wallaby (6) [42] and platypus (261) [11–14, 16, 18, 19, 31, 41]. Manually annotated gene families include: major histocompatibility complex (MHC), leucocyte receptor complex (LRC), cytokine, defensin, cathelicidin, natural killer complex (NKC) and Fc receptor genes. Both opossum and platypus sequences were annotated using a curated list of human immune genes from the IRIS database [51]. For predicted genes, candidate gene regions were first identified using either BLAST [10] or HMMER hidden Markov model [52] searches. Following this, best hits were either concatenated into genes or used to predict a full gene model using a gene prediction program. A total of 516 wallaby genes were annotated based on opossum genes identified in Wong et al. 2006 [7] and Belov et al. 2007 [9]. Of these, at least 217 were not annotated by Ensembl (version 58). Wallaby reads were derived from the pyrosequencing of wallaby thymus transcriptomes and annotated using the wallaby (v1.0) genome assembly [46]. For each annotated wallaby gene there were often multiple, overlapping reads; these were assembled and included in the database (1,786 wallaby reads in total). A further 449 platypus gene sequences were obtained by concatenation of the highest-scoring IRIS BLAST hits against the platypus genome assembly (v5.0) (unpublished). Of these, 366 genes were not annotated by Ensembl (version 58).

Sequence annotation

All reads were defined by their species name, a gene symbol, the method of identification and sequence type (nucleotide or amino acid). To facilitate the retrieval of genes associated with specific immune roles, we categorized genes based on nine functional terms. These include the broad categories of humoral and cellular immunity and components of the innate (inflammation and complement system) and adaptive (antigen processing and presentation, and phagocytosis) immune responses, as well as genes with regulatory functions such as chemokines and transcription factors. To provide additional sequence-based and functional information, all sequences were automatically annotated upon submission to the database. Automatic annotation was performed by searching the human SWISS-PROT [53] database at NCBI [54] with the submitted sequences using the network BLAST client (netblast) [10]. This resulted in the association of sequences with official human gene names [55], Gene Ontology (GO) terms [56] and, for protein sequences, domain names. The accession of the best hit from each BLAST search was retrieved and, if its E-value was less than 1e-3, matched to a list of pre-generated accession-specific tags. These tags were linked to human gene names and gene ontology annotations using Entrez Gene data [57].
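The best-hit step described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the pipeline used to build IDMM: BLAST results are assumed to be already parsed into (query, accession, E-value) tuples, and the function names, accessions and tag contents are hypothetical placeholders. Only the 1e-3 cutoff comes from the text.

```python
from typing import Dict, Iterable, Tuple

E_VALUE_CUTOFF = 1e-3  # threshold stated in the text

Hit = Tuple[str, str, float]  # (query id, hit accession, E-value)


def best_hits(rows: Iterable[Hit],
              cutoff: float = E_VALUE_CUTOFF) -> Dict[str, Tuple[str, float]]:
    """Keep, for each query, the hit with the lowest E-value below the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, accession, evalue in rows:
        if evalue >= cutoff:
            continue  # hit is not significant enough to annotate from
        if query not in best or evalue < best[query][1]:
            best[query] = (accession, evalue)
    return best


def annotate(best: Dict[str, Tuple[str, float]],
             tags: Dict[str, dict]) -> Dict[str, dict]:
    """Attach the pre-generated tag set of the best-hit accession to each query."""
    return {query: tags[acc] for query, (acc, _) in best.items() if acc in tags}


# Hypothetical parsed BLAST rows and accession-specific tags.
rows = [
    ("tammar_seq1", "ACC_A", 1e-20),
    ("tammar_seq1", "ACC_B", 1e-5),
    ("opossum_seq2", "ACC_C", 0.5),  # above the cutoff, so left unannotated
]
tags = {"ACC_A": {"symbol": "GENE_A", "go": ["GO term 1"]}}

annotations = annotate(best_hits(rows), tags)
```

Queries whose only hits exceed the cutoff simply receive no automatic annotation, which matches the statement that only sequences with an E-value below 1e-3 were annotated.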

Utility and Discussion

User interface

Users can interrogate the database and retrieve gene sequences through a variety of simple query tools. The search interface spans three webpages. From the main page, users can query the database through keyword, organism name, human gene name, protein domain name and by the method through which sequences were obtained (Figure 1). A link exists to a GO term browser where terms can be examined in a tree structure that supports the natural relationships between GO terms (Figure 2). Finally, the BLAST program is implemented for users to search against sequences in the database.

Figure 1. Main search page of IDMM.

Figure 2. Gene Ontology (GO) tree browser.

Search by curated gene symbols

All gene names determined through annotation can be browsed. All sequences have been annotated with a gene name based on the human gene symbol, with the exception of lineage-specific expansions, such as NKC genes and MHC genes. Characterized species-specific expansions (i.e. without human one-to-one orthologs) are labelled using the gene family name followed by a unique set of numeric identifiers.

Keyword search

A simple keyword search permits users to query the database using any string of characters from the description lines of FASTA sequences, human gene descriptions and GO names. All FASTA descriptions contain the common name of the species from which the sequences were derived. In addition to terms present in the FASTA header description, users may also search terms generated by automatic annotation, which include the full gene name (in addition to the HGNC symbol) and GO terms. Only sequences of high similarity (E-value < 1e-3) to human genes were automatically annotated. Two keyword searches are available: one that requires an exact but case-insensitive match in sequence headers only, and one that matches all terms containing the keyword across all associations, including, for example, GO descriptions.
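The difference between the two keyword searches can be illustrated with a small sketch. It is written under assumptions and does not reproduce the IDMM implementation: the Record class, the function names and the example records are hypothetical, and "exact match" is interpreted here as a match on a whole header term.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    header: str                                             # FASTA header description
    associations: List[str] = field(default_factory=list)  # gene names, GO names


def search_headers_exact(records: List[Record], keyword: str) -> List[Record]:
    """Case-insensitive match of the keyword against whole header terms only."""
    kw = keyword.lower()
    return [r for r in records
            if kw in re.findall(r"[a-z0-9_-]+", r.header.lower())]


def search_all_containing(records: List[Record], keyword: str) -> List[Record]:
    """Case-insensitive substring match across the header and all associations."""
    kw = keyword.lower()
    return [r for r in records
            if any(kw in text.lower() for text in [r.header, *r.associations])]


# Hypothetical records.
records = [
    Record("IL2 tammar wallaby", ["interleukin 2", "cytokine activity"]),
    Record("IL21 opossum", ["interleukin 21"]),
    Record("CD4 platypus", ["T cell surface glycoprotein CD4"]),
]

exact = [r.header for r in search_headers_exact(records, "il2")]
broad = [r.header for r in search_all_containing(records, "il2")]
by_name = [r.header for r in search_all_containing(records, "interleukin")]
```

In this toy data set the exact search for "il2" returns only the IL2 record, whereas the broader search also returns IL21 because its header contains the keyword as a substring.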

Search by sequence identification method

Sequence retrieval via the initial sequence identification method (e.g. BLAST) allows simple discrimination between expressed and predicted genes. It is important to note that while chained high-scoring BLAST alignments may provide more sequence information, the predicted sequence may not be identical to the actual transcribed sequence. The identification method used is also given in each sequence label.

Search by HGNC gene symbols

To facilitate the retrieval of marsupial and monotreme homologs to human genes, a list of human gene symbols is available for browsing. We queried marsupial and monotreme database sequences against all human proteins and linked the best hits based on the E-value. The resultant annotations are, in effect, reciprocal best hits of predicted genes. By comparison of gene symbols, users can rapidly determine the accuracy of an ortholog assignment. This strategy provides a measure of the level of confidence in the assigned gene name.
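The symbol comparison described above can be sketched in a few lines. This is an illustrative example only; the function name, sequence identifiers and gene symbols are hypothetical, and the best human hits are assumed to have been computed beforehand.

```python
from typing import Dict


def symbol_agreement(curated: Dict[str, str],
                     best_human_hit: Dict[str, str]) -> Dict[str, str]:
    """Compare each curated symbol with the symbol of its best human hit."""
    report: Dict[str, str] = {}
    for seq_id, symbol in curated.items():
        hit = best_human_hit.get(seq_id)
        if hit is None:
            report[seq_id] = "no human hit"
        elif hit.upper() == symbol.upper():
            report[seq_id] = "consistent"
        else:
            report[seq_id] = "conflict"
    return report


# Hypothetical curated symbols and best-hit symbols.
curated = {"seq1": "GENE_A", "seq2": "GENE_B", "seq3": "GENE_C"}
best_human_hit = {"seq1": "gene_a", "seq2": "GENE_X"}

report = symbol_agreement(curated, best_human_hit)
```

A "conflict" does not necessarily indicate an error, since lineage-specific expansions have no one-to-one human ortholog; it only flags assignments that merit closer inspection.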

Search by conserved protein domains

To facilitate rapid identification of gene family members, users can search for sequences based on annotated protein functional units from the Conserved Domain Database (CDD). CDD names can be browsed as a list or through hyperlinks in a 'tag cloud'. Conserved domain annotations are only available for amino acid sequences.

Search based on GO terms

Users may interrogate biological processes, molecular functions and structural components through a GO browser (Figure 2). The browser follows the tree-like hierarchy of GO data by linking general terms to specific terms. The GO terms associated with monotreme and marsupial sequences are inferred through sequence similarity to human Entrez Gene annotations. For each term, the number of associated database sequences is shown after the GO name. By clicking on this number, users can extract all associated sequences. Note that Entrez GO annotations often omit higher-level terms, so the count shown for a general term may underestimate the number of genes in that category. It is therefore advisable to browse through GO child terms.
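The undercounting of general terms can be illustrated by comparing direct annotation counts with counts obtained after propagating annotations to ancestor terms. The sketch below is not part of IDMM; the function name, the three-term toy hierarchy and the sequence identifiers are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


def propagate(direct: Dict[str, Set[str]],
              parents: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each term to the sequences annotated to it or to any descendant."""
    closure: Dict[str, Set[str]] = defaultdict(set)
    for term, seqs in direct.items():
        stack, seen = [term], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue  # a term can be reached through several parents
            seen.add(current)
            closure[current] |= seqs
            stack.extend(parents.get(current, ()))
    return dict(closure)


# Toy hierarchy: child term -> list of parent terms.
parents = {
    "innate immune response": ["immune response"],
    "immune response": ["immune system process"],
}
# Hypothetical direct annotations.
direct = {
    "innate immune response": {"seqA", "seqB"},
    "immune response": {"seqC"},
}

propagated = propagate(direct, parents)
direct_count = len(direct["immune response"])
propagated_count = len(propagated["immune response"])
```

Here the general term has one directly annotated sequence but three once the child term is taken into account, which is why browsing child terms gives a more complete picture.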

Search through the BLAST interface

Users can direct BLAST queries against the sequence database. Users can perform nucleotide, translated nucleotide and protein searches. Results are presented in standard BLAST text output format.

Sequence retrieval

With the exception of BLAST searches, sequences are viewed through a standard retrieval interface. FASTA headers uniquely identify each sequence. In addition to the option of retrieving reads individually, users may choose to retrieve all identified sequences at once. Users can also fetch all associated annotations for each sequence. An option to display amino acid or nucleotide sequences is available.

Conclusion

Targeted search strategies for immune genes and gene families have led to the annotation of previously unidentified marsupial and monotreme genes in the recent genome assemblies of the opossum, tammar wallaby and platypus. Genes involved in immunity are generally poorly annotated in genome assemblies due to their high rate of sequence divergence and gene duplication. This high sequence divergence of marsupial and monotreme immune genes also renders them difficult to isolate with classical laboratory techniques. IDMM provides easy access to marsupial and monotreme immune sequences. It hosts a catalogue of novel and integrated sets of published genes, searchable through a simple and fast interface. The availability of these sequences will facilitate the development of species-specific immunological reagents, enabling accurate studies of immune responses in these species. The database will also be useful for comparative studies of immunity.

Availability and requirements

IDMM is publicly available at http://hp580.angis.org.au/tagbase/gutentag/.