Introduction

The ability to search for proteins of interest via text query is a standard utility of protein biomedical resources such as NCBI Protein [1], MMDB [2], UniProt [3], sites created for the Protein Data Bank (PDB) by the members of the wwPDB [4] (RCSB Protein Data Bank [5], PDBe [6], BMRB [7], and PDBj [8]), and the Structural Biology Knowledgebase (SBKB) [9]. These resources offer search services over a variety of annotations. For example, NCBI Protein has curated information regarding protein sequences that is available for text query. UniProt hosts text searches over a collection of annotation records of the protein sequences, which were collected based on a review of the associations documented in the literature and/or were derived computationally. The SBKB provides searches for protein structures over summary text fields from the primary literature citations. The fields include abstracts and associated terms such as medical subject headings (MeSH terms) [9]. The wwPDB websites offer a variety of searches that include those over the collections of text fields from primary literature citations and the cross-referenced annotations from other protein databases [5, 8]. Searches for ligands contained within the protein structures have also been implemented [5, 6, 8]. With these and related protein resources, users have at their disposal a means to search for protein structures based on a collection of associated protein annotations and attributes. A recent review of protein databases and some of the associated searches that are available therein is provided by Chen et al. [10].

The results of a text query within a protein database can be presented so that users browse all the entries that match any of the text fields or browse only the entries that have matches within specified fields. For example, UniProt allows a user to retrieve matches based on all the annotations collected within the UniProt data files or to restrict the search to matches within particular annotation fields. Annotation fields in UniProt include the protein and organism name fields. Similarly, the RCSB PDB provides a list of all the protein structures found based on matches across all the available text fields, or the results of searches that are restricted to matches within particular annotation categories, such as the enzyme type or a Gene Ontology term category. Given that text searches may produce a large collection of annotations and structures to browse, the user may ask the following questions. Which structures are the most relevant to my query? Of the annotations retrieved, which ones are the most relevant? These questions are analogous to those commonly asked of website searches with regard to which topics and which web pages are estimated to be the most relevant to a given query. User demand to expand the utilities of web search engines has led to the development of more efficient and effective methods to retrieve the topics and web pages most relevant to a given text query [11].

With the goal of achieving improved efficiency and effectiveness in searches for protein structures and their associated annotation categories, a ranking tool, KB-Rank, is described. The KB-Rank tool provides a means to retrieve a list of protein structural chains and annotation categories that are relevant to the provided text query. Structural chains within each retrieved category are ranked according to their estimated relevance to the queried text. The annotation categories are also presented according to their estimated relevance. These utilities can be used to address a variety of searches that are conducted by users of protein structural databases. The tool facilitates informational searches to learn more about particular topics, e.g. the retrieval of information associated with a particular disease. An example of an informational search is one made to gain a better understanding of the pathogenic mechanisms of asthma. Navigational searches are also enabled that provide a means to identify specific structural chains that can be used to address particular research questions. One such type of search is to find structures that may be used in a structure-based drug design protocol. For example, protein structural chains may be used in drug design strategies for the treatment of melanoma.

Materials and methods

Annotation assembly

The assembly and integration of the protein annotations from open sources is done weekly to coordinate with the release of new protein structures and to ensure that the analysis is up to date for all available structures. Annotations are mapped to protein structures at the level of the protein structural chain. A full list of protein structural chains is available from the PDB ftp site at <ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt>. The following annotations are assembled. Cellular and biochemical pathway assignments were extracted from BioCyc [12], CellMap [13], HumanCyc [13], INOH [14], and the NCI Pathway Interaction Database (PID) [15]. Small molecule associations were from BioCyc, BindingDB [16], HumanCyc, DrugBank [17], ChEBI [18], ChEMBL [19], and SMPDB [20]. SNPs3D [21] and OMIM [22, 23] provided disease associations. Molecular functions, biological processes, and cellular components were from the Gene Ontology (GO) classification system [24], as assigned in SIFTS [25]. Enzyme classifications were from the EC2PDB database [26]. Structural domain assignments were provided through the CATH [27] and SCOP [28] databases. Sequence domain assignments were identified through the Pfam database [29]. Further structural groups were based on the jFatCat alignment algorithm [30, 31]. The FEATURE resource provided predictions of functional sites [32]. The annotations utilized in the current study have been described previously for the purpose of the prediction of protein function [33], and a more complete description of their assembly is provided therein.

Query and presentation of protein structures and annotation categories

At the first stage of the text query, the search is over the text fields associated with the primary literature citations of the protein structures and text associated with domains within the structures as retrieved by the Pfam protein family database [29]. The fields from the primary literature citations include the title, author list, abstract, medical subject headings or MeSH terms, and the substance list. Text from the Pfam database is from the description and comment fields. A search over all of the fields provides a means to identify and rank structure entries as a list of PDBIDs. Such a search using text from the primary literature fields has been implemented in the SBKB [34]. The scoring for the ranking is implemented using MySQL’s “Match…Against” method [35]. The following gives a summary of the formula utilized as discussed at the URL <http://forge.mysql.com/wiki/MySQL_Internals_Algorithms>.

$$ \text{rank} = \frac{\log(\text{dtf}) + 1}{\text{sumdtf}} \times \frac{\text{U}}{1 + 0.0115 \times \text{U}} \times \log\left( \frac{\text{N} - \text{nf}}{\text{nf}} \right) $$

In the equation, the variable dtf is the number of times the term appears in the document; sumdtf is the sum of the (log(dtf) + 1) values for all terms in the same document; U is the number of unique terms in the document; N is the total number of documents; and nf is the number of documents that contain the term. Based on the keyword match within the text fields and using the rank equation above, structural entries or PDBIDs are retrieved and ranked. The first 200 entries found by the text search are saved for further analysis.
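As an illustration of how the terms of the rank equation combine, the following minimal Python sketch evaluates the weight of a single term in a single document. It is not the MySQL implementation. The function name `term_weight`, the representation of documents as lists of tokens, and the choice to return zero when the logarithm is undefined are all assumptions made for this example.

```python
import math
from collections import Counter


def term_weight(term, doc_terms, all_docs):
    """Evaluate the rank equation for one term in one document.

    doc_terms: list of tokens in the document being scored (assumed form).
    all_docs:  list of token lists, one per document in the collection.
    """
    counts = Counter(doc_terms)
    dtf = counts[term]                      # occurrences of the term in the document
    if dtf == 0:
        return 0.0                          # term absent: no contribution

    # sumdtf: sum of (log(dtf) + 1) over every distinct term in the document
    sumdtf = sum(math.log(c) + 1 for c in counts.values())
    U = len(counts)                         # number of unique terms in the document
    N = len(all_docs)                       # total number of documents
    nf = sum(1 for d in all_docs if term in d)  # documents containing the term

    if nf == 0 or nf >= N:
        return 0.0                          # assumption: treat undefined log as zero

    tf_part = (math.log(dtf) + 1) / sumdtf
    length_part = U / (1 + 0.0115 * U)
    idf_part = math.log((N - nf) / nf)
    return tf_part * length_part * idf_part


# Toy collection of four tokenized "documents" (hypothetical data).
docs = [
    ["asthma", "il4", "asthma"],
    ["kinase", "melanoma"],
    ["kinase", "raf"],
    ["ferritin"],
]
weight = term_weight("asthma", docs[0], docs)
```

In this toy collection, the term asthma appears twice in the first document and in no other document, so all three factors of the equation are positive. A term that does not occur in the scored document receives a weight of zero.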

The next task is to order the protein chains within the retrieved structural entries. From all the structural chains within the entries retrieved by a given text search, a nonredundant set is obtained by identifying representative chains that are nonidentical in primary amino acid sequence. All the annotations associated with these nonidentical structural chains are then retrieved. For each of the structural chains, an array of values that corresponds to each of the annotations retrieved is created. If a structural chain has a given annotation, as found in any of the representative chains, a one is entered for that position in the array. If not, a zero is entered at that position. A matrix of structural chains versus annotation presence or absence is then generated from these arrays. To rank the structural chains, the array for each structural chain is multiplied by the entire matrix created from the set of chains that are representative in primary sequence. All the elements of the product are summed to obtain a rank value. Structural chains are ordered according to the rank value, which is referred to as the relevance score. Annotation categories are also ranked according to the average relevance score of the structural chains in each category. See Fig. 1 for an illustration of the method. A comparable method was used to find relevant transcription factor binding sites among potential promoter sequences [36].
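The scoring step can be sketched in a few lines of Python. The sketch reflects one reading of the description above: multiplying a chain's 0/1 array by the full matrix and summing the elements of the product is equivalent to summing, over the annotations the chain carries, the number of chains in the matrix that carry each annotation. The function names, chain labels, and matrix values are hypothetical and are not taken from the KB-Rank implementation.

```python
def relevance_scores(profiles):
    """profiles maps a chain label to its 0/1 annotation array (equal lengths)."""
    rows = list(profiles.values())
    # Number of chains carrying each annotation (column sums of the matrix).
    column_totals = [sum(column) for column in zip(*rows)]
    # Sum of the elements of (chain array x transposed matrix).
    return {
        chain: sum(a * t for a, t in zip(row, column_totals))
        for chain, row in profiles.items()
    }


def rank_chains(profiles):
    """Return (chain, score) pairs, highest relevance score first."""
    scores = relevance_scores(profiles)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def category_scores(profiles, annotation_names):
    """Average relevance score of the chains assigned to each annotation."""
    scores = relevance_scores(profiles)
    averages = {}
    for j, name in enumerate(annotation_names):
        members = [scores[chain] for chain, row in profiles.items() if row[j]]
        if members:
            averages[name] = sum(members) / len(members)
    return averages


# Hypothetical profile matrix: three chains, four annotations A to D.
profiles = {
    "chain1": [1, 0, 1, 1],
    "chain2": [1, 0, 1, 0],
    "chain3": [0, 1, 0, 0],
}
ranking = rank_chains(profiles)
categories = category_scores(profiles, ["A", "B", "C", "D"])
```

With this toy matrix, chain1 scores 5, chain2 scores 4, and chain3 scores 1, so chains that share the prevalent annotations A and C rise to the top. Category D, held only by chain1, receives the highest average score.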

Fig. 1

A truncated annotation profile matrix, structures versus annotations, provides a schematic of the method that is used to order the protein structural chains according to their relevance to a queried text. If an annotation is associated with a given structure, the entry at the corresponding point in the matrix is a one, and zero otherwise. Each structural chain is ranked based on the product of its annotation profile array multiplied by the entire annotation profile matrix that was created for the given search. The product for each chain is referred to as the chain’s relevance score for the given text search. See text for details. In the diagram shown, the structure with PDBID 1HZI chain A is ranked highest, followed by 2ZEC chain A, and so on. The annotations are: Annotation A: CATH homologous family Trypsin-like serine proteases; Annotation B: SCOP superfamily Ferritin-like; Annotation C: GO cellular component extracellular region; Annotation D: SNPs3D disease association hypertension; Annotation E: ChEMBL small molecule association L-Serine; and Annotation F: jFatCat structural group with similarity to the structure with PDBID 3GOV, chain B

Results

The web interface

An interactive web interface for the KB-Rank tool was created to search and browse the retrieved protein structural chains and their associated annotation categories. The main page is shown in Fig. 2. A text search box is provided whereby the user can initiate a query. Annotation categories that are available for browsing include cellular pathways retrieved from the National Cancer Institute’s Pathway Interaction Database [15]; superfamily designations provided by the SCOP database [28]; small molecule associations and metabolic pathway associations as assigned in BioCyc [12]; enzyme classification assignments from EC2PDB [26, 37]; molecular function, biological process, and cellular component term designations as found in the Gene Ontology term hierarchy [24]; and small molecule associations that are assigned within the ChEBI [18], ChEMBL [38], SMPDB [20], and DrugBank [17] resources. The resources are listed in the tabs at the top of the search results page, where one can choose to retrieve results based on each. As the annotation categories are presented on the web interface, links are provided to the annotation provider’s website. The link leads either to the home page of the resource or to the page that describes the annotation category selected, whichever is appropriate. A legend is provided on the results page to give a summary of each resource and the annotations that are utilized from each.

Fig. 2

The landing or main page of the KB-Rank tool describes its utility as a means to identify protein structures and biomedical annotations via text search. Example types of search terms are protein functions or disease associations of the protein structures. The search term shown is asthma

A utility of the KB-Rank query tool is that annotation categories and structural chains are ordered and presented according to their estimated relevance to the queried text. Relevance scores are used as described in Materials and Methods. To make the interpretation of the relevance scores more visually intuitive, colors are used to indicate where each annotation category or structural chain lies within the entire range of the scores. As an analogy to a traffic light, a green color indicates that a category or structural chain is most associated with the queried text, while a red color indicates that it is least relevant. Colors in between are used to indicate intermediate scores and corresponding relevance. The coloring method is comparable with that used within the protein modeling portal [39], where model quality for a predicted structure, rather than relevance to text query, is similarly assessed.
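The exact color scale used by the web interface is not specified here. As a rough sketch of the traffic light idea, the following Python function places a score within the range of scores for a search and interpolates from red through yellow to green. The function name, the linear interpolation, and the hexadecimal color output are assumptions for illustration only.

```python
def score_to_color(score, lowest, highest):
    """Map a relevance score to a hex color between red (low) and green (high)."""
    if highest == lowest:
        fraction = 1.0                      # assumption: a single score is shown as green
    else:
        fraction = (score - lowest) / (highest - lowest)
    fraction = min(1.0, max(0.0, fraction))  # clamp to the range [0, 1]

    # Red -> yellow over the lower half of the range, yellow -> green over the upper half.
    if fraction < 0.5:
        red, green = 255, round(510 * fraction)
    else:
        red, green = round(510 * (1 - fraction)), 255
    return f"#{red:02x}{green:02x}00"


# Hypothetical relevance scores for three chains in one category.
scores = {"chain1": 5, "chain2": 3, "chain3": 1}
lowest, highest = min(scores.values()), max(scores.values())
colors = {chain: score_to_color(s, lowest, highest) for chain, s in scores.items()}
```

In this toy example the highest scoring chain is colored green, the lowest red, and the chain midway through the range yellow.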

User case scenarios

Illustrative user case scenarios are now described. For the first scenario, the aim is to perform an informational search on a particular topic. An example search concerns the disease asthma. The user wants to learn more about that disease based on a review of the protein structures and annotation categories that are found to be relevant. Upon executing the text search, the user selects from annotation categories that can be browsed. A selection of the Gene Ontology resource for the categories within the ontology domain of molecular function is shown in Fig. 3, panel A. The highest ranked molecular function category retrieved is interleukin-4 receptor binding, GO:0005136. A review of the primary literature shows that interactions of interleukin-4 are involved in the proinflammatory response in asthma, and that interleukin-4 protein mediates the development of allergic reactions [40, 41]. For the cellular component ontology domain of GO, the results show that the highest ranking category is the extracellular matrix, GO:0031012 (Fig. 3, panel B). Based on a literature review, it is known that abnormal extracellular matrix components are deposited in asthmatic patients [42]. Also, in fatal cases of asthma, the fractional area of the extracellular matrix within airway smooth muscle is larger [43]. The categories of interleukin-4 receptor binding and extracellular matrix were found to be the highest ranking in their respective Gene Ontology domains for the queried text asthma. That corresponds well with each category’s importance regarding the pathogenesis of the disease. The utility of the tool in this case is that it aids the user in efficiently collecting relevant information about the disease.

Fig. 3

Panel A shows the resolution of the structures retrieved by the queried text asthma into molecular function categories as assigned in the Gene Ontology hierarchy. The results indicate that, for protein structures associated with asthma, the molecular functions associated with cytokine activity are prevalent. Also prevalent are protease activities. Panel B shows the resolution of the structures retrieved by the queried text asthma into cellular component categories as assigned in the GO hierarchy. The results for this annotation category show that the protein participants in the disease engage in activity at the extracellular matrix. The ranking of the categories indicates that the component extracellular matrix, shown in green, is relatively more relevant to the disease than the component stored secretory granule, which is shown in orange-yellow

A second utility of the KB-Rank tool is that it orders the protein structural chains within each annotation category according to their relative relevance to the queried text. That utility can aid the user in identifying, among all the chains retrieved, those that are relatively more important to a queried topic. The text search for asthma is further used to demonstrate that utility. As shown in Fig. 3, panel A, cytokine activity, GO:0005125, is ranked fifth among the list of molecular function categories retrieved. The link provided for the 26 structural chains within that category can be expanded. In Fig. 4, the structure of interleukin-5, PDB:1HUL chain A, is listed in the middle portion of the list. The associated orange color indicates that the structure is estimated to have intermediate relevance to the queried text. At the top of the list retrieved, but not shown in Fig. 4, is interleukin-4, PDB:1HZI chain A. It has an associated bright green color that corresponds to the highest ranking structural chain found for the cytokine activity annotation category.

Fig. 4

A list of the structures retrieved by the queried text asthma ordered according to their relative relevance to the disease. The results indicate that the cytokines eotaxin and IL-5, which are shown in orange, are estimated to be relatively less relevant than the cytokine IL-4. IL-4 is much higher in the list and has an associated green color. The PDBID and chain designation are given for each entry

Based on a review of the literature, the importance of IL-4 and IL-5 in the development of asthma can be assessed. It is known that IL-4 contributes in a variety of ways to the development of asthma, one of which is the stimulation of the differentiation of Th0 lymphocytes into Th2 lymphocytes [40, 44]. Th2 lymphocytes secrete other cytokines that include additional IL-4, IL-5, IL-9, and IL-13. IL-5 thereby has a secondary role in disease development as compared to IL-4 in terms of the sequence of the disease mechanism. Further, IL-4 based therapies for asthma have shown improved clinical outcomes while IL-5 based therapies have not [45]. These results indicate the relative importance of the two cytokines in the pathogenesis of asthma, and that matches the relevance ranking found by the KB-Rank search tool. The ranking of the structures by the tool thereby provides a starting point for further understanding of the disease mechanism with regard to the important protein players and their roles.

In addition to providing informational searches that utilize the ranking of the structural chains and annotation categories, the KB-Rank tool also supports navigational searches. In a navigational search, the purpose is to identify a particular structural chain that can be used for further investigation and research. One type of navigational search is the identification of structural chains that can be used in a structure-based drug design (SBDD) protocol and virtual screening. For that application, a user searches for a potential drug target that is particularly important to the disease of interest [46]. Selection is further made to find those protein structures that are druggable, i.e. protein structures that have binding pockets and/or that can accommodate a drug molecule [47].

As an illustrative example of a navigational search for SBDD, a search was made to identify protein structures that can be targeted to treat melanoma. See Fig. 5, where the text query is melanoma. Based in part on information from the DrugBank resource, the highest ranked small molecule found by the search is 5-Bromo-N-(2,3-Dihydroxypropoxy)-3,4-Difluoro-2-[(2-Fluoro-4-Iodophenyl)Amino] Benzamide, DrugBank:DB03115. The entry in DrugBank for the small molecule indicates that it is an experimental molecule, and that the protein target is MEK1, PDB:3E8N chain A. The finding that the MEK1 structure binds a small, drug-like molecule indicates that it is likely a druggable target. The next step was to verify that MEK1 plays an important role in the mechanism of disease in melanoma. The primary citation for the structure of MEK1, PDB:3E8N chain A, indicates that it is targeted for the treatment of various types of cancer including lung, colon, melanoma, pancreatic, and prostate cancer [48, 49]. The MEK1 protein is within a signal transduction cascade, the RAS-RAF-mitogen-activated protein kinase (MAPK)/extracellular signal-regulated kinase (ERK) kinase (MEK)-ERK pathway, that leads to cancer [50].

Fig. 5

Structures associated with the disease melanoma were searched. The drug associations of the structures retrieved were resolved based on the DrugBank resource. The structure of the B-Raf kinase is found to interact with the drug Sorafenib. The search demonstrates an application for the identification of structures that can be used for structure-based drug design for the treatment of melanoma. The result of the search provides a protein of known three-dimensional structure that is known to be a druggable protein target for the disease

Upon examination of the other small molecules found from DrugBank for the query of melanoma, we see that the second molecule listed is the drug Sorafenib. It inhibits another protein along the RAS-RAF-MEK-ERK pathway, B-Raf kinase, PDB:3C4D chain B. Inhibition with Sorafenib has not proven to be effective in clinical trials for melanoma [51, 52]. However, the protein structure is demonstrated to be druggable, and further inhibitors of that target have been developed and demonstrated to have effective anti-melanoma effects in humans [53, 54]. These results indicate that applications of SBDD for the B-Raf kinase target are ongoing and yielding effective results. The identification of the structures of B-Raf kinase and MEK1 with the KB-Rank tool as structures that can be used for SBDD for the treatment of melanoma demonstrates that the tool provides a point of entry for the identification of known and potential protein structural targets. The high ranks of the viable targets found, based on the text searches, illustrate the utility of the tool for that purpose.

Discussion

The KB-Rank tool provides a means to attach a relevance score to structural chains and/or associated categories retrieved by a given text query. It is anticipated that as more annotations are utilized for the ranking process, e.g. through the addition of more annotations associated with the primary amino acid sequence and/or the three-dimensional structures of the protein chains, the display order will more accurately reflect the order of their relevance to the queried text. The annotation categories can be expanded within the types of annotations that have already been assembled. These types include additional three-dimensional structural characteristics, small molecule interaction assignments, functional site assignments, and cellular/biochemical pathway designations. The resultant granularity for the searches and subsequent ranking is at the chain level rather than at the level of the structural entry as found in the PDB. That has the advantage of narrowing down a search to a particular chain within an entry that has multiple chains. It also makes it possible to identify a relevant protein chain that resides within a complex that may not be directly relevant to the text searched.

Protein function and disease associations are anticipated topics for searches. At the first stage of the search, a text search is implemented over the summary fields extracted from PubMed abstracts of the primary citations of the protein structures and the descriptive fields of constituent domains of the structures as extracted from the Pfam database. In the second stage, an integrated set of annotations is used to categorize the functional roles of the protein structural chains, and subsequently to rank the retrieved chains by their expected relevance to the queried text. The annotations used for the final ranking need not contain a match with the queried text; they need only be prevalent in the structures retrieved by the text search. The prevalence of a given annotation within the structural chains retrieved is used as an indicator that it is relevant to the queried text. Structures with a relatively larger number of the prevalent annotations will be ranked relatively higher. Also, structures that have been well characterized, with a relatively larger number of annotations of any kind, will tend to be ranked higher as well. That tendency is analogous to what is found for web page ranking, where the interest level in a web page, as reflected by the number of its links and its link structure [55], is used to facilitate the ranking.

The data integration effort forms the substrate for the search tool, and the connections forged between the annotations lend further utility to it. For example, UniProt entries are connected with chain entries from the PDB, and DrugBank entries are connected with UniProt entries within the integrated database that is utilized by the KB-Rank tool. A result is that, for the melanoma search example, the user can identify a small molecule in DrugBank that is associated with melanoma and be provided with a relevant protein structural chain. The result demonstrates that data integration is an important component of the tool’s functionality and utility. To complete the data integration, sequence comparisons are done to map the protein chains to annotations. The mapping of entries in BindingDB to the structural chains was done by finding the corresponding sequences with greater than 90% sequence identity through sequence comparison using the BLAST program [56]. The SIFTS resource and the UniProt data files are also utilized to provide connections between the protein structural chains and a host of sequence and functional information [3, 25, 57].

To improve the calculation of the rank score, sequence redundancy of the protein chains was considered. Repetition of the same annotation profile due to the inclusion of chains identical in primary sequence ultimately causes such chains to be ranked unduly higher in the search results. A study by Devos and Valencia demonstrated that protein chains with as high as 95% identity can have different annotation profiles [58]. To remove chains redundant in primary sequence but limit the loss of annotations, chains were considered redundant only if they were identical in sequence [33]. As discussed in the “Materials and methods” section, the representatives of these redundant sequence groups were used to calculate the relevance scores. Through the removal of chains identical in primary sequence, the annotation profile matrix created for a given search became more accurately weighted and, when used, generated more accurate relevance scores.
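The grouping of chains that are identical in primary sequence can be illustrated with a short Python sketch. The chain labels and sequences below are hypothetical, and the rule for picking a representative (the alphabetically first chain label in each group) is an assumption made for this example; the selection rule used by KB-Rank is not specified here.

```python
def representative_chains(chain_sequences):
    """Group chains with identical sequences and pick one representative per group.

    chain_sequences maps a chain label to its amino acid sequence.
    Returns a dict mapping each representative to the sorted members of its group.
    """
    groups = {}
    for chain_id, sequence in chain_sequences.items():
        groups.setdefault(sequence, []).append(chain_id)

    representatives = {}
    for members in groups.values():
        ordered = sorted(members)
        # Assumption: the alphabetically first label stands in for the group.
        representatives[ordered[0]] = ordered
    return representatives


# Hypothetical chains: e1_A and e1_B share a sequence, e2_A differs.
chain_sequences = {
    "e1_B": "MKTAYIAK",
    "e1_A": "MKTAYIAK",
    "e2_A": "GSHMLEDP",
}
nonredundant = representative_chains(chain_sequences)
```

Only the representatives would then contribute rows to the annotation profile matrix, so an entry with many copies of the same chain does not inflate the counts for its annotations.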

The organization of the text search of the KB-Rank tool application is user-friendly, intuitive, and interactive. As part of the web tool, computer applications access specified annotations only at the required times. The search process itself is implemented in steps that are organized in a hierarchical fashion, and each step is run according to the user’s request. The organization makes the tool scalable with regard to the further addition of informative annotations from a variety of data sources.

The results of performing a search with the KB-Rank tool include an ordered list of annotation categories and an ordered list of protein structural chains within each category. As each protein structural chain is displayed, links are provided that include a redirection to the corresponding annotation page with chain specific information at the SBKB. At the corresponding page in the SBKB, annotations that are specific to the chain at a resolution below the annotation category can be retrieved. As examples, such links include more specific structural domains contained within the chains, and the differential tissue specific expression patterns that can be found through resources that are linked to the SBKB. In that way, the KB-Rank tool can be used in conjunction with the SBKB to retrieve annotations at different levels of granularity.

Conclusion

The KB-Rank tool provides a means to improve the efficiency and accuracy of searches to identify protein structural chains and functional annotations relevant to a given text query. User search scenarios were described that demonstrate the tool’s utilities for informational and navigational searches. An example informational search was an examination of the protein structures and functional annotations that have a role in the disease asthma. An example navigational search identified potential structures that may be used to further investigate potential treatments for melanoma via a structure-based drug design strategy. We demonstrate, through the illustrative examples, how annotations from different biomedical resources were integrated to enable research. Features of the tool include a staged integration of biomedical text information and the subsequent use of annotations of protein structural chains. It allows the user to effectively identify protein structural chains and annotation categories given a text search regarding protein functional or disease associations.