Introduction

In recent years, immunological databases and analysis resources (DBARs) have become common tools, increasingly used by the biological and immunological research communities. DBARs are used by scientists working in academia, biotechnology companies, and large pharma alike to aid in the design and evaluation of new vaccines, diagnostics, and immunotherapeutics. Basic scientists use them to aid the design and interpretation of experiments probing the nature of host-pathogen interactions, autoimmune diseases, cancer, transplantation, and allergies. In addition, bioinformatics scientists use immunological databases as a source of data to explore, refine, and develop new tools and algorithms. Finally, it should be underscored that the development of immunological databases has played an important role in the design of formal data ontologies and their integration within the broad, mainly grassroots efforts to develop a global ontology of biological events and investigations.

For the purpose of this review, we briefly summarize databases and data analysis resources of potential immunological interest and then focus in detail on two main categories of DBARs—databases hosting primary data and experimental details relating to immune epitopes and analysis resources that host tools to analyze such data and/or to predict epitopes or epitope characteristics in unknown antigenic systems. Because of our role in the development of the Immune Epitope Database and Analysis Resource (IEDB), this resource is reviewed in more detail as both a prototype and test case.

The task of compiling a listing of all online resources of potential immunological interest is in and of itself not an easy one, perhaps a testament to the tremendous growth and richness of the field. Herein, a list of over 40 different DBARs (Table 1) has been assembled by compiling resources known to us and immunological databases published in the 2009 Nucleic Acids Research Immunological Database List (www.oxfordjournals.org/nar/database/cat/14). The list has been broadly classified into ten different categories, relating to the scope of each particular DBAR. Table 1 lists each DBAR, its scope, the principal investigator(s), and the year each DBAR was established. In some cases, multiple DBARs with similar scopes were consolidated, such as DBARs with common principal investigators (i.e., Raghava, Brusic, and Flower), and the various National Institute of Allergy and Infectious Diseases (NIAID) Bioinformatics Resource Centers (Aurrecoechea et al. 2007; Squires et al. 2008; Greene et al. 2007a, b; McNeil et al. 2007; Brinkac et al. 2009; Snyder et al. 2007; Lawson et al. 2007).

Table 1 Database and analysis resources of immunological interest

Databases hosting immune epitope data

With respect to databases hosting primary data and experimental details relating to immune epitopes, several different resources should be considered. Some resources, such as AntiJen (Toseland et al. 2005), FIMM (Schönbach et al. 2005), and HLA-ligand (Sathiamurthy et al. 2003), are not currently maintained and/or the data contained within them were migrated to newer versions and websites. With regard to the scope of the data curated, there is considerable overlap between some of the main databases. However, some clear distinctions can be made. For example, the SYFPEITHI database (Rammensee et al. 1999) currently contains the most comprehensive collection of naturally processed and cancer-derived epitopes. The HIV Molecular Immunology Database (Los Alamos) contains the most comprehensive and highly curated collection of HIV/SIV-derived epitopes (Korber et al. 2007). Finally, while the Immune Epitope Database and Analysis Resource (Peters et al. 2005) does not currently curate cancer- and HIV-derived epitopes, it does contain the most comprehensive and highly curated epitope collection relating to infectious diseases, microbes (excluding HIV), allergens, and autoimmunity. It is expected that all transplantation epitopes will become available in the IEDB within the next year.

We sought to perform a comparative analysis of the data housed in each DBAR in terms of references curated, number and types of epitopes and assays, number of antigens and proteins from which the epitopes are derived, and host organisms in which the immune response directed against the epitope originated. This analysis was challenging because, in some cases, the resources are no longer available online, while, in other cases, the data are not available, as the databases can only be searched for specific records, and global searches are not feasible.

With these caveats in mind, the results of our analysis were compiled following thorough examination and querying of each DBAR or extracted from metrics published by the DBARs themselves and are listed in Table 2. Although the bases for the comparisons are IEDB data parameters, a concerted effort was made to retrieve equivalent metrics from the other DBARs. However, the exact definitions of each parameter may not necessarily be consistent among the various DBARs.

Table 2 Data content in epitope DBARs

As shown in Table 2, in terms of curated references, the HIV Molecular Immunology Database hosts data derived from about 2,500 references and the IEDB from about 7,000. Given that the focuses of these two databases are non-overlapping, these two DBARs combined are the most comprehensive in terms of curated references. However, it should be pointed out that neither the IEDB nor the HIV Molecular Immunology Database currently curates cancer references. Other databases, such as MHCBN (Lata et al. 2009), EPIMHC (Reche et al. 2005), and SYFPEITHI, can be used to fill this gap.

Perhaps, as a result of its broad focus and comprehensive approach to curation, the IEDB seems to be the most comprehensive repository in terms of number of epitopes and specific assays curated. Exceptions to this are found in the realms of MHC ligand elution assays and information on peptides interacting with TAP. The former are abundantly represented in the SYFPEITHI resource. While actual numbers were not available to us, we estimate that the number of records relating to this type of assay present in SYFPEITHI vastly outnumbers those present in the IEDB. The MHCBN database provides a search interface that enables the user to query for TAP-associated peptides in human, mouse, or rat hosts and provides the results in terms of binding affinity.

In conclusion, each of the DBARs examined has a clear focus in terms of the scope of the data it houses. In the following section, we describe in more detail the IEDB, in whose design and implementation our group has been involved.

The development of a formal ontology for the IEDB

The IEDB is unique within DBARs hosting primary data in two respects. First, the IEDB was designed with an experiment-centric view. Rather than hosting lists of epitopes and associated characteristics, the IEDB data structure is based on curation of the actual experimental data associated with a given potential epitope structure. For this reason, the experimental details relating to the organism that represents the source of the epitope and the details relating to the host whose immune system recognized the epitopes are both captured. Likewise, the experimental circumstances surrounding the immunization, the assay system, and the ultimate readout utilized are also captured.

Second, the experimental data are captured in a searchable format, thus allowing the user to select the type of host, experimental procedures, taxonomic domains, or immunological outcome of interest. This circumvents the need for somewhat arbitrary stipulations of how to define an epitope and allows searches to be flexible and adapted to specific questions (Vita et al. 2008).

Soon after work commenced on the IEDB, it became apparent that development of a formal ontology was necessary to accurately represent experimental detail in a relational database, encompassing for each captured experiment as many as 300 different data fields. A formal ontology, as described in detail elsewhere (Schulze-Kremer 2002; Bard and Rhee 2004), is a formal representation of the different entities encountered in a given domain and their relations to each other. Development of a formal ontology for the IEDB became instrumental in ensuring the uniform and consistent curation of the data, so that different curators could consistently represent different papers and experiments. Ultimately, a formal ontology allowed the representation of complex processes in a computer-readable format and made it possible to integrate the knowledge contained in different databases.

The first version of the IEDB ontology was developed before any information had been curated and was used to guide the design of the database itself (Sathiamurthy et al. 2005). With the database implemented and data being curated, a more formal and comprehensive ontology was developed. This was done in parallel with the initiation of a collaborative project, the Ontology of Biomedical Investigations (OBI), which aims to represent entities necessary to describe investigations in general, such as assays, reagents, and data (Lord et al. 2009).

Thus, the information captured in the IEDB can be described in the same terms as in other resources that also utilize OBI. The specific terms necessary to describe epitopes and their recognition were captured in the ONTology of Immune Epitopes (ONTIE; Greenbaum et al. 2009a, b). Having the IEDB data represented using terms rigorously defined in a formal ontology has facilitated the ability to perform data consistency checks, formulate highly expressive queries, and has enhanced the potential for seamless interoperability with other data resources (Peters and Sette 2007).

Analysis resources: prediction of T cell epitopes

In parallel to databases hosting primary data, a number of online resources provide tools that facilitate predictions of T cell epitopes on the basis of MHC class I and class II binding, their propensity to be transported by TAP, or their generation by proteasomal processing (for class I restricted epitopes only). In terms of predicting MHC binding, the simplest approach is based on motifs describing primary and secondary anchors associated with epitopes or ligands for specific allelic molecules. Several different resources provide motif listings (see SYFPEITHI, Center for Biological Sequence Analysis, the HIV Molecular Immunology Database), but it is widely recognized that predictions based on motifs alone are associated with poor performance because too many potential leads are identified and many epitopes lack canonical motifs (Ruppert et al. 1993). Accordingly, more sophisticated predictive tools have been developed, such as quantitative matrices, artificial neural networks, and support vector machines.
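The contrast between motif-based and matrix-based prediction described above can be illustrated with a minimal sketch. The anchor motif and the matrix values below are invented placeholders chosen only for illustration; they are not measured binding data for any real MHC allele, and the function names are hypothetical rather than taken from any of the resources discussed.

```python
# Illustrative only: MOTIF and MATRIX hold invented placeholder values,
# not binding data for any real MHC molecule.

# Hypothetical anchor motif for 9-mers (0-based positions 1 and 8,
# i.e. peptide positions 2 and 9).
MOTIF = {1: set("LM"), 8: set("VL")}

# Hypothetical sparse scoring matrix: (position, residue) -> contribution.
# Any (position, residue) pair not listed contributes 0.0.
MATRIX = {
    (1, "L"): 1.0, (1, "M"): 0.8, (1, "I"): 0.5,
    (8, "V"): 1.0, (8, "L"): 0.9, (8, "I"): 0.6,
    (2, "D"): -0.4, (5, "P"): -0.3,
}


def matches_motif(peptide):
    """Binary call: True only if every anchor position holds an allowed residue."""
    return all(peptide[pos] in allowed for pos, allowed in MOTIF.items())


def matrix_score(peptide):
    """Graded call: sum of per-position contributions (higher = better)."""
    return sum(MATRIX.get((pos, aa), 0.0) for pos, aa in enumerate(peptide))


def scan(protein, length=9):
    """Yield (peptide, motif_match, matrix_score) for every window of `length`."""
    for start in range(len(protein) - length + 1):
        peptide = protein[start:start + length]
        yield peptide, matches_motif(peptide), matrix_score(peptide)
```

In this toy setting, a peptide with non-canonical anchors (for example, isoleucine at both anchor positions) is rejected outright by the motif but still receives an intermediate matrix score, which is one way to see why graded methods can recover epitopes that lack canonical motifs.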

In discussing these types of analytical tools in the context of the various analysis resources, two separate issues can be identified: first, the accuracy and sensitivity of the tools provided by the various resources and, second, the breadth of MHC class I and class II molecules for which such predictions are available.

A rigorous evaluation of the performance of the various tools was lacking until recently, when side-by-side evaluations of various tools were presented for both MHC class I and class II molecules (Peters et al. 2006; Lin et al. 2008; Wang et al. 2008). In those evaluations, care was taken to ensure that all methods were benchmarked on large and rigorously curated datasets and that the methods were not evaluated using the same datasets utilized for training the algorithms. Rigorous measures of the accuracy, sensitivity, and true predictive value of the algorithms were defined and consistently applied. When such across-the-board, level-playing-field evaluations were performed, it was found that, in general, different methods provide relatively similar levels of performance. Within the different methods evaluated, however, non-linear methods such as those using artificial neural networks and consensus tended to provide the best overall performance compared with linear ones (e.g., scoring matrices).

The main determinant of the performance of a specific algorithm, in actuality, appeared to be the amount of data available to train and evaluate the predictions. In that respect, it is predicted that the performance of MHC binding predictions will continue to improve as the quantity of experimental data available to the bioinformatics community continues to steadily increase. As expected, the overall performance of MHC class I predictions was significantly better than that of their class II counterparts. No systematic evaluations of the value of proteasomal cleavage and TAP transport predictions have been published, but our empirical experience suggests that, although of theoretical relevance, these predictions generally provide little, if any, improvement in performance over predictions of MHC binding alone.

Table 3 provides a summary of the scope of T cell prediction tools available in epitope-related DBARs. The IEDB provides the largest number of predictors, reflective of the fact that multiple predictive methods are offered while also allowing the user to generate consensus predictions, which have been shown to be most effective (Mallios 2003; Wang et al. 2008; Zhang et al. 2009). In terms of breadth of algorithms available, the HIV Molecular Immunology Database provides the most extensive library of MHC class I predictors. However, the only method utilized for prediction is HLA binding motif, which has been shown to be less accurate and sensitive than other methods, such as neural networks and Stabilized Matrix Method (Peters and Sette 2005; Lundegaard et al. 2008). Most other resources are comparable in the breadth (number) of MHC class II allelic predictors. In terms of hosts for which predictors are available, in general, human and murine MHC are the most frequently found. Predictions for other hosts are also offered such as non-human primates (chimpanzee, macaque), rat, and cow.
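As a rough illustration of how a consensus prediction can be formed from several methods, the sketch below ranks peptides within each method and then takes the median rank across methods. The median-rank rule is an assumption made here for illustration; the text above does not specify how any particular resource combines its predictors, and the function names are hypothetical.

```python
from statistics import median


def ranks(scores):
    """Rank peptides within one method; rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {peptide: i + 1 for i, peptide in enumerate(ordered)}


def consensus(method_scores):
    """Median rank per peptide across methods (lower = stronger consensus).

    `method_scores` is a list of dicts, one per method, each mapping the
    same set of peptides to that method's score (assumed: higher = better).
    """
    per_method = [ranks(scores) for scores in method_scores]
    peptides = per_method[0].keys()
    return {p: median(r[p] for r in per_method) for p in peptides}
```

Working on ranks rather than raw scores sidesteps the fact that different methods report outputs on different scales, which is the main design reason for this choice in the sketch.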

Table 3 Tools content in epitope DBARs

In summary, a variety of different analysis resources provide tools that can be utilized to predict class I and class II restricted epitopes, by a number of different methods and for a number of different alleles. However, several areas appear to be worth considering for future developments, and they include improving the performance of class II predictive tools, expanding the breadth of class II alleles for which predictive tools are available and also increasing coverage of host species beyond mice and humans.

The prediction of B cell epitopes

In contrast to the progress made in the realm of MHC binding prediction tools, the prediction of B cell epitopes has, thus far, proven a more challenging task. The performance of various B cell prediction tools was evaluated by Blythe and Flower (2005) and also scrutinized in a specific focus panel (Greenbaum et al. 2007). A key difference from T cell epitope prediction tools is that the specificity associated with MHC molecules is not present, and as such, the prediction methods are developed to be generally applicable irrespective of genetic polymorphisms and of the species of the host mounting the immune response.

Most algorithms utilized are based on the assumption that epitopes recognized by antibody responses are exposed on protein surfaces and/or are enriched in the content of specific amino acid residues. Accordingly, various combinations of structural predictions, molecular modeling, hydrophilicity, and solvent exposure scales are utilized. At best, the various methods are associated with area under the curve (AUC) values around 0.7 (with 0.5 being the AUC value of random predictions and up to 0.99 and 0.89 for state-of-the-art MHC binding predictions for certain class I and class II alleles, respectively).
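For readers less familiar with the AUC metric quoted above, the following minimal sketch computes it directly from its probabilistic definition: the chance that a randomly chosen true epitope receives a higher prediction score than a randomly chosen non-epitope, with ties counted as one half. The function name and the example scores are hypothetical.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Returns the fraction of (positive, negative) pairs in which the
    positive is scored higher; ties contribute 0.5. A value of 0.5
    corresponds to random predictions and 1.0 to perfect separation.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

The pairwise form is quadratic in the number of scores and is meant for clarity rather than speed; rank-based formulations give the same value more efficiently on large benchmark sets.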

Several DBARs [ABCpred and BcePred (Saha and Raghava 2007), IEDB (Ponomarenko et al. 2008)] host state-of-the-art B cell epitope prediction tools. BcePred provides prediction of linear B cell epitopes utilizing the traditional approach based on physicochemical properties, a strategy which has in the past been shown to be only marginally better than random (Blythe and Flower 2005). ABCpred, on the other hand, utilizes a more recent recurrent neural network approach, which, when evaluated on protein sequences not used in the development of its algorithm, was shown to produce relatively better predictive performance (Saha and Raghava 2006). Newer approaches are also being developed, especially in the context of a recent initiative from the NIAID that awarded large-scale B cell epitope discovery contracts with the recommendation to utilize the data generated to improve B cell epitope prediction methods.

Analysis tools: sequence conservation, population coverage, and epitope visualization

The two preceding sections describe bioinformatics tools aimed at the prediction of T and B cell epitopes. In addition, various other types of tools are available to the scientific community. These tools can be collectively designated as analytical tools, as they are designed to assist in understanding the data associated with various epitopes rather than prediction of new ones. Examples of these tools are sequence conservation tools, population coverage tools, and epitope visualization tools.

The HIV Molecular Immunology Database provides a number of analysis tools designed to aid researchers in applying epitope knowledge to vaccine design. For example, the Hepitope tool tests for HLA alleles that are enriched in a set of individuals that react with a set of known reactive peptides. The Epicover tool, which computes how well a potential vaccine cocktail (antigen set) covers potential user-specified epitopes, can also be harnessed for vaccine development (Thurmond et al. 2008). An alignment tool called Epilign is also available and allows the user to align epitopes or functional domains to HIV1, HIV2, or SIV.

The IEDB also hosts several analytical tools. The epitope conservancy tool, for example, enables the user to specify the sequence of a set of epitopes of interest, and the tool can return the degree to which each epitope is conserved in a set of related protein sequences of interest, also specified by the user (Bui et al. 2007a, b). Within the IEDB, a tool also allows users to compute the population coverage projected for a given T cell epitope(s) based on its known HLA restriction or binding characteristics and on the frequency of HLA molecules in different ethnic groups (Bui et al. 2006).
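The two kinds of calculation described in this paragraph can be sketched in simplified form. Both functions below are illustrative assumptions rather than the published algorithms: conservancy is reduced to exact substring matching, and population coverage is reduced to a single HLA locus under Hardy-Weinberg proportions. Function names, allele labels, and frequencies are hypothetical.

```python
def conservancy(epitope, sequences):
    """Fraction of protein sequences containing an exact copy of the epitope.

    Simplification: only 100% identity is considered; partial matches
    are ignored.
    """
    if not sequences:
        return 0.0
    hits = sum(1 for seq in sequences if epitope in seq)
    return hits / len(sequences)


def population_coverage(allele_freqs, restricting_alleles):
    """Projected fraction of individuals carrying at least one restricting allele.

    Simplification: one locus, Hardy-Weinberg proportions. If p is the
    summed frequency of the restricting alleles, an individual lacks all
    of them with probability (1 - p)^2.
    """
    p = sum(allele_freqs.get(allele, 0.0) for allele in restricting_alleles)
    p = min(p, 1.0)  # guard against frequencies that sum above 1
    return 1.0 - (1.0 - p) ** 2
```

For example, with hypothetical allele frequencies of 0.2 and 0.3 for two restricting alleles, the summed frequency is 0.5 and the projected coverage is 1 - 0.5^2 = 0.75.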

Another class of analytical tools can be collectively designated as epitope visualization tools. These tools range from tools that allow visualizing the location of an epitope or a series of residues within a given 3D structure, to genome browsers that map and visualize the epitope location within different ORFs and their respective location within genomic information (Beaver et al. 2007). MHCBN offers a peptide mapping tool that displays the location of known MHC binders, TAP binders, and T cell epitopes available in MHCBN database on the protein sequence provided by the user.

The curation of immune epitope data

It is becoming more and more apparent that curation of large amounts of biological data is a requisite to the establishment of large data depositories in general. A key element of large-scale curation is the development of objective criteria for curation, which is dependent on development of ontologies, as described above. Another key element is process automation, which is, in turn, dependent upon ontology and objective process development.

These issues apply to the development of biological databases in general and to immunological DBARs in particular. As stated above and described in more detail elsewhere (Vita et al. 2008), the curation of experimental data at the level of detail and granularity required by the IEDB ontology requires the establishment of a rigorous yet objective process to ensure consistency and compatibility with partial automation.

The IEDB process of curating relevant scientific published literature starts with automated PubMed queries that are executed at 3-month intervals. These queries are designed to be broad in nature, in order to capture as many potentially relevant papers as possible. Specifically, of the over 18 million papers listed in the PubMed resource, we have to date identified approximately 145,000 as being potentially relevant. The abstracts of these potentially relevant references are then scanned by automated text classifiers (Wang et al. 2007) and also further inspected by senior immunologists, to select the truly relevant references to continue in the curation process per se.

A total of roughly 24,000 references have been identified at this stage and divided into major reference classes (infectious disease, HIV, autoimmunity, allergy, transplantation, cancer, and “others”). Within each class, each reference is then placed in one of several categories. For example, the autoimmunity class is further categorized into diabetes, multiple sclerosis, rheumatoid arthritis, lupus, etc. Within each category, subcategories are used to more accurately categorize the references. For example, the diabetes category includes subcategories of insulin and GAD. The classes, categories and subcategories, are used to prioritize and organize the curation flow. They have also provided interesting insights relating to global disease morbidity and mortality data. We have found that, in most cases, diseases associated with high morbidity and mortality have been the most studied, while some areas such as dengue, Schistosoma, HSV-2, Bordetella pertussis and Chlamydia trachomatis were associated with far less extensive coverage. These types of analyses may provide a justification for focusing research towards relatively less well-studied yet critical disease areas (Davies et al. 2009).

Following categorization, the references are curated by a staff of doctoral-level curators. Quality control is provided by computer-based validation and by a system of peer review of curated records (Vita et al. 2006). Currently, the curation of microbes and allergen epitopes is essentially complete and up-to-date, while curation of autoimmune and transplant epitopes is ongoing.

Meta-analysis of influenza immune epitope data

A corollary of the availability of large amounts of data in specialized data repositories is that the data itself can be mined to investigate trends that might not be revealed by examining the data included in a given study because of small sample size (Liberati et al. 2009). Meta-analysis of immunological data is particularly effective in revealing, in a given field of research, pathogen, or disease system, which areas have been targeted most extensively by research and which areas conversely represent knowledge gaps. Immune epitope data meta-analyses for a given disease or pathogen facilitate the use of the data, engage community experts, and can lead to formulation of novel hypotheses. Typically, a meta-analysis is based on the inventory of current knowledge of T cell and antibody epitopes, host organisms, disease states, conservancy, and other relevant variables.

In the case of influenza, analysis of the immunologic data available in the literature as of the end of 2006 (Bui et al. 2007a, b) provided a comprehensive catalog of influenza epitopes, thus establishing a resource for investigators wishing to utilize them in basic studies, or in the evaluation of different vaccination strategies or vaccine constructs. The analysis, however, also revealed several gaps existing at that point in time. Relatively few epitopes were defined in birds and non-human primates, and there was a striking paucity of well-defined antibody epitopes, especially in humans. Few epitopes were characterized for their protective potential. Overall, a limited number of epitopes were reported for avian influenza strains and subtypes. Finally, other than HA and NP proteins, there were relatively few epitopes reported for the other influenza proteins. It should be noted that several of these knowledge gaps have since been significantly bridged by researchers from many different groups (Ekiert et al. 2009; Sun et al. 2009; Yu et al. 2008).

An updated analysis of influenza epitope data with special emphasis on swine-origin H1N1 (Greenbaum et al. 2009a, b) examined the sequence of reported epitopes, which, by definition, represent the pool of preexisting immunity in the general human population. As expected, the majority of antibody epitopes were not conserved in the novel swine-origin influenza (S-OIV), supporting the notion that widespread vaccination with an S-OIV-specific vaccine is required to prevent infection in the general populace. However, the majority of the epitopes recognized by CD8+ T cells were completely invariant. Based on these results, it was hypothesized and then experimentally demonstrated that some T cell immunity is preexisting in the general population against S-OIV and of magnitude similar to that preexisting against seasonal H1N1 influenza.

Meta-analysis of immune epitope data of additional pathogens

Additional epitope meta-analyses relating to the knowledge associated with tuberculosis, botulinum/anthrax toxins, malaria, and poxviruses have been produced (Bui et al. 2007a, b; Blythe et al. 2007; Zarebski et al. 2008; Vaughan et al. 2009; Moutaftsi 2010).

While the Mycobacterium tuberculosis genome contains approximately 4,000 potentially expressed proteins, epitopes have only been identified from approximately 150 of them (∼4%). Furthermore, 23 of these proteins contain ten epitopes or more. These 23 proteins account for more than 71% of the total epitopes identified. It is possible that immune responses are highly focused on very few antigens, and the immune system is oblivious to the vast majority of the coding ORFs. More likely, many antigens have not been characterized and investigated, and a genome-wide approach to epitope/antigen identification would reveal many additional antigens. Another noteworthy finding was that, while epitopes have been described for various disease states, such as clinically active versus latent tuberculosis, exposed but not converted, and BCG vaccinated, studies comparing these different patient populations side-by-side in a systematic fashion have been scarce.

A similar analysis revealed a wealth of plasmodial epitope data available for the scientific community (Vaughan et al. 2009), including a total of 1,566 unique epitopes consisting of 892 T cell (mostly CD4+) and 896 B cell. Strikingly, as in the case of mycobacteria, epitopes were derived from relatively few antigens. While antigens from all life cycle stages were represented, most epitopes were derived from antigens expressed at the parasite surface during liver and asexual blood stages. In all, epitope data were available for only 46 plasmodial antigens, and more than 95% of the malaria genome was not represented. As in the case of M. tuberculosis, high throughput epitope/antigen identification might reveal many new promising antigens.

Indeed, analysis of poxviruses and vaccinia virus highlighted that a large number of antigens spanning virtually every ORF can be targeted by immune responses in complex pathogens with a genome composed of more than 200 different ORFs (Moutaftsi et al. 2007; Sette et al. 2008). In conclusion, meta-analysis of epitope data is a novel avenue to provide the scientific community with a “forest” rather than “tree” level view of the content and granularity of the scientific literature related to specific disease indications.

The impact of immunological databases

As described in the previous sections, a number of different DBARs are available to the scientific community. In this and following sections, we explore in more detail the specific impact that these resources have had on the scientific community. We first sought to estimate the impact of each DBAR in terms of the number of PubMed citations of primary publications describing each DBAR. To this end, we assembled a list of publications associated with each DBAR. We performed PubMed queries designed to retrieve publications relevant to each DBAR, adopting a uniform and unbiased approach that would facilitate interdatabase comparisons. Accordingly, each query consisted of the name(s) of the principal investigator(s), in addition to the word “database” followed by a wild-card character. The resulting publication list was then manually inspected to exclude those publications that were obviously irrelevant. The error rate of this approach was analyzed by using the IEDB as a test case. This query strategy produced 35 publications. When cross-checked against our master list, the query had successfully retrieved 22 of the 36 (61%) IEDB-related references.
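The query strategy and the retrieval check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the text specifies only that each query combined investigator name(s) with the word "database" and a wild card, so joining multiple names with OR and using the PubMed `[Author]` field tag are choices made here, and the function names are hypothetical.

```python
def build_query(investigators):
    """Assemble a PubMed-style query string from investigator names.

    Assumption: multiple investigators are combined with OR, and the
    trailing * is the wild card appended to "database".
    """
    authors = " OR ".join(f"{name}[Author]" for name in investigators)
    return f"({authors}) AND database*"


def retrieval_rate(retrieved_ids, master_ids):
    """Fraction of the master list that the query recovered."""
    master = set(master_ids)
    return len(master & set(retrieved_ids)) / len(master)
```

Applied to the IEDB test case, a query that recovers 22 of the 36 references on the master list has a retrieval rate of 22/36, or about 61%.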

Table 4 lists the number of publications obtained for each DBAR following this procedure, both in absolute terms and in terms of publications/year. It was found that a total of eight DBARs were associated with at least ten different publications, with yearly rates in the 0.67 to 4.5 publications range. Within this group were several different epitope-based DBARs, including SYFPEITHI, AntiJen, BCIpep, and the HIV Molecular Immunology Database.

Table 4 Publication and citation metrics for immunologically relevant DBARs

The number of citations made to these primary publications was also quantified using the ISI Web of Knowledge and Google Scholar. This analysis revealed that immunological and epitope-related databases have a very significant impact. Taken together, the 13 main epitope-related DBARs receive on average 466 citations per year, roughly equivalent to one third of the annual citations generated by the RCSB PDB resource. The latter has been in operation since 1998 and has a much broader scope, containing over 60,000 structures of all biological molecules, and as such is of relevance for immunological and non-immunological applications alike.

Probing the nature of citations to assess database usage

The preceding sections illustrate the breadth of DBARs available, and how these resources are widely utilized and cited in the scientific literature. We were interested in probing the usage of the DBARs in more detail and specifically ascertaining how the online resources are used. While it is not immediately straightforward to establish the specific use associated with each user and visit, analysis of the papers referencing a DBAR can more readily provide insights in this respect.

To this end, the IEDB citations were further evaluated by subdividing them by year and category (general IEDB, analysis resource, and curation/meta-analyses). It was found that citations of general IEDB publications grew at the fastest rate and doubled each of the first 3 years, subsequently reaching a plateau. By contrast, citations of publications relating to the analysis resource as well as the meta-analyses grew at slower rates, but now represent the dominant source of IEDB citations. Citations of the meta-analyses started later than the other two categories (the first meta-analysis was published in 2007), but also appear to grow at a similar rate (data not shown).

Next, each manuscript citing at least one IEDB publication was manually reviewed. We found that those citations fall into several broad categories, such as retrieval of specific T or B cell datasets (20% and 6%, respectively) and utilization of specific tools (24%). A number of references utilized the IEDB to further ontology development and/or integration (8%) or to develop and improve predictive or analytical tools (23%). The distribution of each of the citation categories is shown in Fig. 1.

Fig. 1 IEDB citation categorization by nature of the citation made

Thus, this analysis suggests that the usage of immunological database and analysis resources is roughly balanced between data retrieval and tool usage (IEDB dataset retrieval constitutes 26% of citations, while tool usage accounts for 24%). Furthermore, the results indicate that over 80% of all citations are attributable to practical applications of DBARs, either in terms of tool/dataset use or further development of new tools and applications.
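
The percentages reported above can be tallied to check the two summary figures. The following is a minimal Python sketch and not part of the original analysis. Only the percentage values come from the text; the dictionary key names are illustrative labels, and the sketch assumes that the "over 80%" figure is the sum of all five categories named in the text.

```python
# Citation categories and their shares (in percent) as reported in the text.
# The key names are illustrative labels chosen for this sketch.
citation_shares = {
    "t_cell_dataset_retrieval": 20,
    "b_cell_dataset_retrieval": 6,
    "tool_usage": 24,
    "ontology_development": 8,
    "tool_development": 23,
}

# Dataset retrieval combines the T cell and B cell dataset categories.
dataset_retrieval = (
    citation_shares["t_cell_dataset_retrieval"]
    + citation_shares["b_cell_dataset_retrieval"]
)

# Assumption: "practical applications" covers all five categories above.
practical_total = sum(citation_shares.values())

# Share of citations not covered by the categories named in the text.
remainder = 100 - practical_total

print(f"dataset retrieval: {dataset_retrieval}%")
print(f"tool usage: {citation_shares['tool_usage']}%")
print(f"practical applications: {practical_total}%")
print(f"other categories: {remainder}%")
```

Under this assumption, the tally gives 26% for dataset retrieval and 81% for the five categories combined, consistent with the figures quoted in the text.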

Conclusions and discussion

The last decade has witnessed unprecedented growth in the number of publicly available immunological databases and analysis resources (Bourne 2005). Though these resources can be of considerable value to scientists working in myriad settings, the explosive rate of their proliferation makes it challenging for those scientists to maintain a clear appreciation of the resources at their disposal. We thus undertook this review to investigate and present the content of various immunological DBARs, the scope of their predictive and analytical capabilities, and their overall impact on the scientific community, with the ultimate goal of informing readers and potentially guiding them to the resource(s) that may be of greatest utility for their research interests.

After compiling a list of resources of potential immunological interest, we systematically examined those DBARs hosting experimental data relating to immune epitopes. Our survey revealed that, in terms of data content, each DBAR examined tends to have a clear strength in certain data subsets or disease areas and can, therefore, better cater to the needs of scientists seeking those particular data. A noteworthy trend among DBARs is the growing integration of formal data ontologies (Noy et al. 2009; Yip 2009). Such standardization has already been shown to facilitate interdatabase connections and data sharing, as evidenced by links between resources with diverse focuses. A prime example is found in the IEDB, where users can follow links from epitope data to relevant data in BioHealthBase and EuPathDB, which, in turn, also link from their respective websites to the IEDB. Therefore, it is envisioned that data will become increasingly accessible and integrated with other data resources in the near future.

We further considered the DBARs that provide access to predictive and analytical tools for immune epitope data. Our intent was not to compare the performance of these tools, as such analyses have been published elsewhere (Mallios 2003; Blythe and Flower 2005; Peters et al. 2005; Peters et al. 2006; Saha and Raghava 2006; Lin et al. 2008; Lundegaard et al. 2008; Wang et al. 2008; Zhang et al. 2009). However, several conclusions about the current state of predictive tools did emerge from this examination. In particular, our survey highlights clear shortcomings in the predictive tools available: MHC class II and B cell epitope predictive tools merit improvement, both in terms of predictive performance and, for MHC class II, in terms of the coverage of species and alleles currently available. We anticipate that progress in these areas will follow the emergence of larger experimental datasets that will become publicly available in the near future.

We also explored the impact of immunological databases on the scientific community, as well as their strength as a means to clearly and concisely represent empirical data in a centralized resource. To this end, we undertook an effort to systematically quantify the impact of immunological DBARs by collecting metrics on their publication and citation rates. The high cumulative citation rate of the epitope-related DBARs is a clear indicator of the degree to which these resources permeate the scientific community and help guide research. A closer examination of the nature of these citations, using the IEDB as an example DBAR, revealed that they are mostly attributable to practical applications of the IEDB, which represents further evidence of the direct impact of DBARs.

Another indicator of DBAR impact is their utility for performing systematic meta-analyses (Kaczorowski 2009). To illustrate this point, we presented several examples of meta-analyses that have been performed to date based on the data available in the IEDB. With these examples, we hope both to raise the reader's awareness of such analyses and to promote further meta-analyses as a means of driving and guiding continued applications of empirical data.

In this review, we have highlighted the present utility of the diverse collection of immunological databases and analysis resources, while also exposing areas that require further development. In the final analysis, it is clear that, while immunological DBARs are already widely utilized by the scientific community, the field is in many respects still in its early stages, and continued development and refinement are necessary. Hence, it is reasonable to anticipate that the coming years will see a diminishing lag between the emergence of robust experimental data and the ability of the scientific community to efficiently access and interpret such data.