Findings

As genomic sequencing technology has matured, genomic databases have been generated for more and more species, and public databases grow larger every day. Together with advanced instrumentation and powerful search engines, the increasing comprehensiveness and refinement of these databases have benefited mass spectrometry (MS)-based protein identification and biomarker discovery. Despite these improvements, however, MS-based protein characterization using public databases has not yet been perfected for all species. For instance, the annotation of individual genes and their related protein products has not been standardized. Because sequence-focused protein identification by MS relies primarily on peptides produced by proteolytic enzyme digestion, much important annotation information, including protein function, can be ignored by the search engine [1]. Search results can be optimized by using custom databases that focus on protein function with clear annotation, such as those generated with programs like “Database on Demand” [1, 2]. It has also been reported that search algorithms lose sensitivity as the search space (i.e. database size) increases [3], and that the more similar the database sequences are to that of the protein of interest, the more accurate the search result [4]. These points are especially important during biomarker discovery and validation, as well as in protein identification for “non-mainstream” organisms [5]. Many custom protein databases have already been created to meet the special circumstances of the molecules under examination, including databases for prokaryotic ubiquitin-like protein (Pup) [6], O-GlcNAcylated proteins [7], and biomolecular interaction networks [8].

This paper presents four projects, spanning six years at the National Microbiology Laboratory in Canada, that involved the creation and application of curated databases for biomarker identification and validation. All MS-based protein identification was performed using liquid chromatography tandem mass spectrometry (LC-MS/MS) detection and the Mascot database search algorithm. All the curated databases are provided in FASTA file format in Additional file 1. The detected proteins of interest are shown in Table 1.

Table 1 Search output produced by searching MS sequence data of various peptides against curated databases (CD) and the public databases, MSDB, NCBInr, and PBR

The first project involved analyzing two SDS-PAGE (sodium dodecyl sulfate polyacrylamide gel electrophoresis) protein bands derived from sheeppox virus [9]. A western blot demonstrated that one protein band (“band A”) was highly immunoreactive to serum from sheep infected with the virus; its identification could therefore have implications for vaccine design and/or reagent development for viral diagnosis. In-gel digestion was performed on this band, and the extracted tryptic peptides were separated and detected by LC-MS/MS. Mascot (Matrix Science) was used to perform the database search. A search of the public database MSDB (Mass Spectrometry Sequence Database; 3,229,079 sequences; created by the Proteomics Group at Imperial College London) identified “putative virion core protein-lumpy skin disease virus” with a Mascot score of 859 and 51 matched peptides. A search of the curated poxvirus-specific database (21,000 sequences), created from the PBR (Poxvirus Bioinformatics Resource Centre) website (http://www.poxvirus.org/index.asp?bhcp=1), gave a more accurate identification (i.e. the “sheeppox virus protein”) with higher confidence (Mascot score = 1039) based on 80 peptide matches (Additional files 2 and 3). This observation demonstrates that a smaller but more focused database is very useful for confirmation and validation of the molecule under study.

The second project employed MS to detect a protein with transcript variants. Microtubule-associated protein tau (or simply “tau”) has several variant forms [10, 11]; this study examined tau transcript variant 2 (tau-2, GenBank accession NM_005910), which is routinely used in our laboratory as a biomarker for prion disease diagnosis [12]. When tau-2 MS data were searched against the public database NCBInr (National Center for Biotechnology Information Non-Redundant), the “peripheral nervous system (PNS) specific tau” protein was the primary identification (Table 1, Additional file 4), when in fact tau-2 is a central nervous system tau variant. Moreover, searches of in-gel and in-solution digests returned top hits representing different variants of the same protein. These inconsistencies made quality control assessment of the MS data difficult. Consequently, a curated database with clear annotations was used for the search, and a consistent result was obtained (Table 1, Additional file 5).

In the third project, a curated database was employed to detect a protein that does not normally exist in nature. A recombinant sheep-hamster chimeric prion protein was designed for use in a novel and promising assay called “real-time quaking-induced conversion” (RT-QuIC), with which low levels of infectious prion can be detected in human cerebrospinal fluid [13]. When the NCBInr database was used to confirm the presence of the chimeric protein in a digested SDS-PAGE band, only one peptide, representing prion protein from species other than sheep or hamster, was revealed (Table 1, Additional file 6), while the actual source proteins [hamster (Mesocricetus auratus) and sheep (Ovis aries)] appeared only as the third and fourth hits, respectively. A curated database called “PrpSheep-Hamster” was therefore created to accurately annotate and identify the chimeric protein (Table 1, Additional file 7). Indeed, database searches of MS data obtained from two separate but identical in-gel digested protein bands demonstrated that the smaller, more focused database gave higher identification confidence and more sequence-specific peptide matches (Table 2). This case illustrates that the MS characterization of proteins with few tryptic digestion sites may benefit from smaller and hence more accurate databases.
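The point about tryptic digestion sites can be illustrated with a minimal in silico digestion sketch. It assumes the commonly used trypsin rule (cleavage C-terminal to K or R, except when the next residue is P) and an illustrative peptide length window of 7 to 30 residues as a stand-in for what is typically detectable; the two sequences are hypothetical placeholders, not the chimeric prion protein, and the function names are introduced here for illustration only.

```python
def tryptic_digest(sequence):
    """Split a protein sequence after K or R, unless the next residue is P."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal remainder that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


def detectable(peptides, min_length=7, max_length=30):
    """Keep peptides inside an assumed, illustrative length window."""
    return [p for p in peptides if min_length <= len(p) <= max_length]


# Hypothetical sequences: one with several cleavage sites, one with a single site.
site_rich = "MAAAGGGKLLLVVVTTRSSSEEEDDKFFFYYYWWR"
site_poor = "MAAAGGGLLLVVVTTSSSEEEDDFFFYYYWWQQQNNNHHHAAAGGGLLLK"

for name, seq in (("site_rich", site_rich), ("site_poor", site_poor)):
    peptides = tryptic_digest(seq)
    print(name, len(peptides), "peptides,", len(detectable(peptides)), "in window")
```

Under these assumptions, a sequence with few cleavage sites yields few peptides of usable length, so each identification rests on very few matches. With so little evidence per protein, a large search space containing many homologous entries makes the correct entry harder to rank first.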

Table 2 Search output produced by searching sheep-hamster PrP MS sequence data against a curated prion protein database (CD) alone and in conjunction with the public database, Swiss-Prot

The fourth project highlights the ability of MS combined with a curated protein database to supplement traditional E. coli flagellar serotyping. As there are 53 flagellar serotypes in E. coli, serotyping by antigen-antibody agglutination reactions is a costly and tedious process [14, 15]. In response, a unique method was developed to enrich flagella for high quality MS detection and identification [15], but problems arose because specific H types (i.e. serotypes) could not be obtained when the resulting MS data were searched against the NCBInr database. For the flagellar serotype H37, for example, a search of NCBInr listed the sequence simply as “flagellin” (Table 1, Additional file 8). To solve this problem, a curated E. coli flagellar database representing all serotypes was created as a FASTA file using sequence data obtained from NCBInr. The custom database was used to successfully identify all examined flagellar H types from reference E. coli strains [15] (Table 1 and Additional file 9 show one example, H37). Searches using the curated database alone, rather than the curated database in conjunction with the public database Swiss-Prot, also produced a larger number of matched peptides with higher confidence scores, and often attained better coverage in shorter search times (Table 3). Lastly, searches of MS sequence data against the curated database and the public databases Swiss-Prot and NCBInr demonstrated that only the smaller, more focused curated database yielded accurate top hit information with 100 % sensitivity and specificity (Table 4).
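As a rough illustration of how a curated, clearly annotated FASTA file might be derived from public sequence data, the sketch below selects entries by accession and rewrites their headers to state the H type explicitly. It is not the procedure used to build the databases in Additional file 1: the accessions, sequences, serotype assignments, header layout, and function names are all hypothetical placeholders.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def curate(records, serotype_by_accession):
    """Keep only listed accessions and rewrite each header with its H type."""
    curated = []
    for header, sequence in records:
        parts = header.split()
        if not parts:
            continue
        accession = parts[0]
        serotype = serotype_by_accession.get(accession)
        if serotype is not None:
            new_header = f"{accession} flagellin serotype {serotype} [Escherichia coli]"
            curated.append((new_header, sequence))
    return curated


def to_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text with wrapped sequence lines."""
    out = []
    for header, sequence in records:
        out.append(">" + header)
        for i in range(0, len(sequence), width):
            out.append(sequence[i:i + width])
    return "\n".join(out) + "\n"


# Hypothetical public-database excerpt; sequences are placeholders.
public_db = """>ACC0001 flagellin [Escherichia coli]
MAAAKGGGR
LLLVVVK
>ACC0002 hypothetical protein [Escherichia coli]
MTTTSSSK
>ACC0003 flagellin [Escherichia coli]
MAAAKEEEDDR
""".splitlines()

# Hypothetical accession-to-serotype assignments supplied by the curator.
serotypes = {"ACC0001": "H37", "ACC0003": "H7"}

curated = curate(read_fasta(public_db), serotypes)
print(to_fasta(curated))
```

The design choice worth noting is that the serotype is written into the header itself, so whatever entry the search engine reports as the top hit already carries the annotation of interest rather than a generic label such as “flagellin”.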

Table 3 Search output produced by searching E. coli flagellin MS sequence data against a curated E. coli flagellin database (CD) alone and in conjunction with the public database, Swiss-Prot
Table 4 Top hits produced by searching E. coli flagellin MS data against a curated E. coli flagellin database (CD) and the public databases, Swiss-Prot and NCBInr

Conclusions

With the growing comprehensiveness of genome data for many species and the maturity of MS-based technology, biomarkers are increasingly being applied and validated for disease diagnosis and for the improvement of conventional bio-assay methods. The above cases show that curated databases are very useful for accurate, specific, and consistent identification and confirmation of proteins and biomarkers of interest. Moreover, clearly annotated, fit-for-purpose databases prove extremely useful for high quality, standardized method development and validation using MS-based technology. Owing to the sophistication of MS instrumentation and specific software requirements, together with variations in protein expression and post-translational modifications, the detection of analogous proteins by MS remains complicated. This paper will hopefully serve as an example and a reminder for all MS users, especially those engaged in specific and/or “non-mainstream” research and applications, quality control of recombinant DNA technology, and targeted biomarker identification and validation, to use curated, fit-for-purpose databases in order to identify proteins from MS data consistently and accurately.

Availability of supporting data

All the databases are available in Additional file 1-Database.zip. Any questions regarding the application of the databases should be addressed to K. C. (Keding.Cheng@phac-aspc.gc.ca).