An important step in functional systems biology is to elucidate the relationships between biomolecules. Interactions between proteins define biological pathways, and knowledge of the processes in which proteins are involved is essential to gaining a fundamental understanding of the cellular machinery. The study of protein interactions is a pressing biological imperative, and thus characterizing protein interaction partners is crucial to our understanding of both the function of individual proteins and the organization of entire biological processes [1, 2].

More and more interaction data are being published in the literature as a result of the development of high-throughput experimental technologies, such as the yeast two-hybrid screening and affinity purification coupled with mass spectroscopy. These experimental techniques make it possible to study protein interactions on a much larger scale, although they suffer at times from poor resolution. To provide reliable protein interaction data for biologists, interaction databases such as Molecular Interactions Database (MINT) [3] and IntAct [4] manually detect and curate protein interactions from different information sources. However, it is becoming difficult for database curators to keep up with the rapidly expanding literature and the increasing number of newly discovered proteins.

In addition to the rate at which interaction data are being produced, there are other challenges for manual interaction curation. Experimental methods are not equally reliable, and when they extract protein interactions curators must place emphasis on thorough description of the experimental evidence. Furthermore, many authors continue to use ambiguous gene or protein names in their reports, or they fail to provide the organism or tissue from which the genes or proteins originate. Difficulty in mapping gene and protein names to SwissProt/UniProt [5, 6] identifiers increases the work of an annotator, who must gather more information from references, supplemental material, and so on. Finally, many types of interactions are scattered in the literature, although many are irrelevant to physical protein-protein interactions (for example, physical interaction [MI:0218] from the Molecular Interaction Ontology [7] is defined as interaction among molecules that can be direct). Genetic interactions (MI:0208: functional relationship among genes revealed by the phenotype of cells carrying combined mutations of those genes) are considered to be distinct from physical interactions between proteins, and they are not currently curated. Similarly, neither drug-drug interactions nor interactions between protein complexes and proteins are considered to be relevant in physical interaction curation.

Because of the accumulation of interaction data in the biomedical literature and the challenges that present for manual curation, there is an urgent need for text-mining tools to facilitate the extraction of such information. In particular, the extraction of physical protein-protein interactions, defined as the co-localization or direct interaction between protein molecules, is becoming extremely important because physical interactions are the most reliable data produced in high-throughput experiments. The development of effective text-mining tools could aid the mapping of proteins to SwissProt/UniProt identifiers, as well as the discovery of experimental evidence for interactions and the discrimination of physical interactions from other types of interactions. In comparison with previous studies on bio-text mining, BioCreative 2006 [8, 9] addressed some of these difficulties, such as normalizing gene/protein mentions to molecule identifiers, discriminating physical interactions from other interactions, and gathering as much reliable experimental evidence as possible.

The protein-protein interaction (PPI) task of BioCreative 2006 is comprised of several subtasks: the interaction article subtask (IAS) to determine whether an article (abstract only) is relevant to some physical interactions; the interaction pair subtask (IPS), in which interacting protein pairs should be extracted from full-text articles; the interaction sentence subtask, in which participants are required to submit a list of summary sentences for each interaction; and the interaction method subtask, in which the experimental detection methods should be given for each interaction.

Here, we present the methods and results from our participation in the PPI task of BioCreative 2006 [10]. By using Kullback Leibler divergence, we study the quantitative divergence between the training data and the final test data in the IAS, indicating that this originates from not being able to provide an adequate set of irrelevant articles. We propose solutions to overcome this issue and, in addition to the term features, other features are studied to reduce the distribution divergence such as the string, entity, and template features. Information fusion from both the feature perspective and the classifier perspective is studied, and our results rank in first place in terms of accuracy and in second place in terms of area under the receiving operator characteristic curve (AUC) in the benchmark evaluation. With this improvement, the tool may be useful in practical interaction curation.

In addition, we propose a named entity recognition framework that utilizes the information on the organism in articles. We present a quantitative analysis of how the extraction of physical interactions is influenced by the errors caused by the named entity recognition module. We point out that the framework is extremely important because, in the interaction curation task, protein names must be normalized to molecule identifiers so that molecular properties such as sequence can easily be identified. Finally, a profile-based method is proposed for the IPS. The goal of the task is to extract the protein pairs that have experimentally verified evidence, which requires the curator to collect information from multiple sentences in the article. This goal is different from the general aim of PPI extraction systems, most of which extract PPIs at the sentence level. Inspired by experience in curation, we constructed a profile vector for each candidate interaction from the whole article. By integrating evidence from the whole article, a better prediction is achieved for robust interaction curation. Our results were ranked in first place in terms of F1 score on the SwissProt-only subset, and in second place on the entire dataset. However, these results clearly need improvement if they are to be useful for any real task.

Results and discussion

Article filtering for efficient interaction curation

Automatically filtering out articles that are irrelevant to interaction curation will be useful to database curators. According to the reports from database curation projects, it takes 2 to 3 hours to process a paper even for highly qualified curators [1]. The IAS of BioCreative 2006 specifically addressed this issue of how to assess an article filtering tool in order to facilitate this process [8]. The task is difficult because the relevance of some articles cannot be determined through reading their abstracts alone, and curators usually must obtain evidence from the full text. Moreover, articles describing genetic interactions are hard to separate from those with physical interactions.

In the training dataset, there are 3,536 articles relevant to physical interactions and 1,959 irrelevant ones, and the official test dataset has 375 relevant and 375 irrelevant articles. A serious problem in this task is that the performance with the training data is much better than that with the official test data (0.95 versus 0.80 in terms of F1 score), which was also observed by [11]. To analyze the problem, 750 articles (375 positive) were taken out of the training corpus at random and defined as the leave-out dataset. The top 50 features, whose significance was measured using the χ2 test, were selected from the remaining training dataset. Based on these 50 features, three probability distributions were estimated from the leave-out dataset by using Equation 3 (see below), from the remaining training dataset, and from the official test dataset. We then calculated the average Kullback Leibler divergence (defined by Equation 4 [see below]) between two distributions to measure the divergence between distributions (Table 1), where the term features are unigrams/bigrams and the string features are strings with seven characters.

Table 1 The average Kullback Leibler divergence between the distributions of different datasets

For Pr(x|c+), the probability of a feature x occurring in the relevant articles, there was no remarkable difference between the term distributions estimated from the leave-out dataset, the remaining training dataset, or official test dataset. In other words, the three different datasets have almost the same term distribution. However, significant differences were observed for Pr(x|c-), the probability that a feature x appeared in the irrelevant articles, whose distributions are illustrated in Figure 1. There is much greater divergence between the distribution estimated from the official test set and that from the training dataset (0.992 versus 0.188). We hypothesize that the term distribution is different in the official test set, and that this may be the reason why the model did not hold well in the official test dataset. This was also verified by [11], in which much better performance was obtained when the training dataset and final test dataset were reversed. When the string is selected as a feature, the divergence diminished notably (0.992 versus 0.188), which might explain why the string feature proved even better than the term feature in these runs, as shown in Table 2.

Figure 1
figure 1

The probability of a feature x occurring in irrelevant articles. The figure shows the three distributions of the leave-out dataset, remaining training dataset, and official test dataset. The probability of a feature x occurring in irrelevant articles (Pr(x|c-)) in different datasets are shown (only 40 features are listed here).

Table 2 Article filtering performance with different features and classifiers

We might conclude that the problem originates from the fact that the irrelevant articles are not sufficiently representative of the entire sample space. In the interaction curation task, irrelevant articles are more randomly distributed, where some articles describing genetic interactions are very similar to those dealing with physical interactions, some discuss other types of interactions (for example, drug-drug interactions), and some are completely different and can easily be filtered out. It is difficult to provide a good set of representative irrelevant articles, and so these irrelevant articles introduce more uncertainty and bias into the learning machines.

In an attempt to overcome the problem, we first took strings as features based on the above analysis. Furthermore, we propose a new scheme (defined by Equation 5 [see below]) to diminish the divergence between the training data and the test data. The new scheme takes into account the probability of a feature being observed in both relevant articles and in irrelevant ones, instead of simply using TF*IDF (term frequency × inverse document frequency). Alternatively, we tried to incorporate more high-level semantic features such as the named entity features and template features. The entities were recognized using ABNER (A Biomedical Named Entity Recognizer) [12], including the protein, DNA, RNA, cell line, and cell type. The TF was calculated for these entity features, and template features were exploited to represent the specific syntactic dependency between entities. In a third attempt, we integrated more information from different classifiers, and by fusing different classifiers the performance was boosted markedly.

We first studied how the features influence performance in terms of classification using a support vector machine (SVM) with a linear kernel as the classification model (Table 2). The string features easily defeated the term features (F score: 0.788 versus 0.756; AUC: 0.841 versus 0.803), and we speculated that this was because the string features are more powerful, thus eliminating the divergence between the training data and the test data (as mentioned above). Note that an attraction of the entity features is the very high recall obtained (0.96), indicating that almost all the original relevant articles have been selected out. This is very useful if the precision of the classification is to be improved by further processing. By integrating all of the features together, the AUC is further improved to 0.861.

Second, we studied how this issue is influenced by different classification models, each model analyzing the data from a different point of view. SVM learns to separate data by a decision hyperplane, whereas the naïve Bayes classifier and multinomial classifier estimate probability distributions and try to interpret data from the probability perspective. The linear kernel SVM requires the data to be represented as feature vectors, whereas the p-spectrum kernel SVM simply views an example as a string. The different description powers can be combined by AdaBoost [13] and the best performance approached the needs of practical usage, with a precision of 80% and a recall of 90%.

Readers should note that the results presented in Table 2 are significant at the 0.02 level because we performed t-test experiments to determine whether the observed improvements were statistically significant. More details are presented in our paper published elsewhere [14].

Normalizing protein names to SwissProt identifiers

It is extremely useful to normalize protein names with molecule identifiers, which will largely ease the process of interaction curation. However, the task is challenging because inconsistent naming terminologies are used. It is common in reported research to cite just a few, nonstandard abbreviations, or to mention proteins without specifying species or organisms, or without specifying isoforms. This problem can be exemplified as follows.

1. Common terms, such as p53, are not easily normalized without any contextual information.

2. The same term is used to name different molecules from the same or related genes but different organisms. For example, PI3K may refer to different molecules in mouse (P42337), human (P42336), and cow (P32871), whose genes have the same term PIK3CA.3. The same term is used to name different isoforms of molecules. For example, in mouse PI3K refers to both Q8BTI9 (the β isoform of the protein) and O35904 (the δ isoform). There are two important steps in normalizing names to SwissProt identifiers. First, the terms of database entries must be curated to canonical forms, and the new terms used to detect the mention of proteins. Second, the ambiguities of multiple mapping of protein mentions to molecule identifiers should be removed by using organism and contextual information. The following rules are used to curate database entries.

1. The gene names/synonyms and gene product names/synonyms for the same entry are included.

2. Prefixes and suffixes that are not crucial for entity identification are removed. For example, the prefixes c, n, and a of PKC are removed where these prefixes mean conventional, novel, and atypical, respectively.

3. Terms with digits or Roman/Greek numbers are transformed into a unified format: alphabetical + white space + digits. This rule affects examples such as the following: IL-2 and IL2 become IL 2; and CNTFR alpha, CNTFR A, and CNTFR I become CNTFR 1.

4. Terms that are not in abbreviated forms are converted to lowercase.

About 230,000 normalized entries are produced from SwissProt database and, as mentioned previously, there are many ambiguous mappings to database identifiers, even with normalized terms. To solve these ambiguities, the nearest neighbor principle is used, based on the organism context. The presumption here is that each protein name belongs to a particular organism context. Organisms in each sentence are identified, and the organism context of a protein name is defined by organisms appearing in adjacent sentences. The organism of candidate proteins is determined by the nearest neighbor principle.

Our work on protein mention normalization is very simple and coarse, and there is still much room for improvement. There is a thorough study on how to create a dictionary by combining multiple gene/protein databases together [15]. A number of spelling variation rules were studied in that work, and the investigators pointed out that many rules appeared to have no effect and some appeared to have a detrimental effect on precision. These discoveries will be very useful in making improvements to our module in the future.

In the IPS of BioCreative 2006, 740 full-text articles are provided for training and 358 for testing. The articles were provided for evaluating the extraction of protein pairs, but there was no separate step to evaluate protein mention normalization. However, the organizers returned to participants the results related to normalization, which were evaluated in a different manner. Our results are shown in Table 3, in which the average results were based on 45 runs from 16 teams. These results were from the official evaluation but they were not published in the BioCreative workshop. Clearly, our results are better than the average results (ours > mean + dev). However, the results presented here are considerably poorer than the results obtained from the gene normalization (GN) task of BioCreative 2006. The best result for the GN task was a precision of 78.9%, a recall of 83.3%, and a F1 score of81.0%, which are significantly better than the results we present here. Readers should note the major differences between results in the PPI task and those in the GN task. First, the PPI task required participants to process both abstracts and full texts, whereas only abstracts were processed in the GN task. Second, the measurement in the two tasks was different. In the PPI task, the gold standard set was made up of only the protein molecules that have annotated interactions. Other correctly identified proteins without interaction annotation were not included in the gold standard set. In the GN task, the gold standard set included all the gene identifiers that should be normalized, and all of the submitted identifiers were evaluated.

Table 3 Comparative results for protein name normalization

Physical interaction extraction

The module for interaction extraction will greatly facilitate the process of interaction curation for database curators. It is not always easy to identify a single sentence that clearly describes an interaction in a paper. However, most previous methods extract interactions at the sentence level [1622], where each sentence is handled independently. Inspired by the fact that curators must gather sufficient evidence to decide whether an article claims that there is a physical interaction, we propose a profile-based method to extract physical interactions by integrating evidence from the whole document.

The results shown in Table 4 confirm that our method performs much better than others (ours > mean + 2 × dev). In the benchmark evaluation, our results rank first in terms of F1 score, second in terms of precision, and third in terms of recall on the SwissProt-only subset, whereas on the entire dataset our results were all in second place (F1 score, precision, and recall). Also, our method outperforms traditional template-based methods, considering that ONBIRES (Ontology-Based Biological Relation Extraction System) represents the state-of-the-art performance [22].

Table 4 Comparative results for interaction pair extraction

There are reasons for the success of our profile-based method in this particular task. First, by integrating evidence from the whole article, the method is more robust when extracting physical interactions. For example, the sentence 'A interacts with B' will definitely be taken as a positive example by template-based methods, although it may describe a genetic interaction. In the profile-based method other evidence is required to make such a claim. Second, abundant features such as term features, entity features, template features, and position features are all integrated into the method. Here, we analyze the errors in detail in order to identify the problems hindering overall performance. There are 798 manually annotated interaction pairs in the 358 test articles, and although 339 protein pairs were extracted, only 100 of these are true positive pairs. There were 8,172 pairs that co-exist, and many of these include incorrectly recognized names (Figure 2). Among the 239 false-positive errors (area III), the first 50 errors were manually checked, and they fell into three categories.

Figure 2
figure 2

Errors of interaction pair extraction. The figure shows the distribution of errors in the interaction pair extraction. The blue ellipse contains 798 annotated pairs, the yellow ellipse 8,172 coincident pairs, and the green circle 339 extracted pairs. I, 100 true-positive samples; II, 166 coincident but false-negative samples; III, 239 false-positive samples; IV, 7,135 true-negative samples; V, 532 false-negative samples but never coincident.

1. Twenty-two errors were due to incorrectly normalized names. For example, in the sentence 'BAF60c interacts directly with PPAR gamma in vitro', the annotated interaction is Q6STE5 (BAF60c)-P37231 (PPAR gamma), and although we correctly extracted the interaction, the names were unfortunately normalized to Q6P9Z1-O19052.

2. Twelve errors were due to false-positive names in sentences where protein A and B physically interacted, and a false positive recognized protein C coupled to A or B.

3. Sixteen errors were due to the classifier, which included classifying nonphysical interaction pairs as exhibiting a physical interaction and other problems. This problem is partly due to the classification model and partly to the incompleteness of the training set, which does not provide evidence of samples that truly interacted.

Among the 698 false negative errors (areas II + V), the majority were caused by the identification and normalization of protein mentions (532 errors), whereas 166 were due to the interaction extraction model. In the 166 pairs that co-existed, we found that 37 were negatively classified because the sentences did not contain sufficient evidence. Examples of such sentences include 'A activates B' or 'Camptothecin-induced nuclear export of A does not require B'. This problem is also due to the fact that the evidence of physical interactions is not confined to a single sentence. The remaining 129 errors are believed to be caused by our classifier.

From these analysis, we conclude that the difficulty of protein name normalization leads to the majority of errors, producing about 64% (34/50) false-positive errors and 76.2% (532/698) false-negative errors. The second problem is that of incomplete annotation. Because the annotation only specifies the interacted protein ID in the article without the passages providing evidence or the location of these molecules, it makes the training process untraceable and the process of error analysis extremely difficult. Currently, a major limitation of our method is the requirement of protein coincidence within a sentence. This is not always the case in practical interaction curation, where curators often find evidence from contextual sentences, each of which may contain only one protein of the interacting pair. For example, the sentence 'the two proteins are co-purified together' may describe a physical interaction, even though both proteins are mentioned in the preceding sentences instead of in this sentence itself.


In this report we discuss three key issues related to practical interaction curation. Specifically, we deal with filtering articles irrelevant to physical PPIs, identifying the mention of proteins and normalizing them to molecule identifiers, and extracting experimentally verified interactions. Different levels of features, including the string, term, named entity, and template features, are exploited to study the problem of distribution divergence between the training and test data. An AdaBoost-based information fusion technique is studied to integrate the various powers of description of different classifiers. Through these improvements, high-performance article filtering produces a system that may facilitate the process of interaction curation. Although the current state of protein name identification and normalization leaves much to be desired, our method utilizes the organism information to reduce ambiguity and, with further improvements, it will aid biologists. The profile-based interaction extraction method combines evidence from sentences in the whole document, thereby providing some improvement in the ability to predict physical interactions. In comparison with traditional methods that extract interactions at the sentence level, our method utilizes information from the whole article.

There are still many difficulties and challenges in extracting biologically meaningful knowledge, for example recognizing biological molecules with widely accepted identifiers and mining physical interactions with experimentally verified evidence. This report provides methods to help resolve these problems both from the perspective of the feature and that of the classifier.

Materials and methods

We first present the architecture of our method (shown in Figure 3), and then we describe the models and algorithms used in our method. There are three major modules in our framework: the first for filtering irrelevant articles; the second for identifying and normalizing protein mentions to SwissProt identifiers; and the third for extracting PPIs.

Figure 3
figure 3

The system architecture of our method. Blue rectangles are the three main modules in our system. The figure shows the architecture of our system, and there are three main modules in the system that have been colored in blue. MR, molecule recognition; PPI, protein-protein interaction.

Article filtering module

We studied three models in the article filtering module (the naïve Bayes classifier, multinomial classifier, and SVM classifier [23]). All of these classifiers require the prior selection of features that represent the data that are to be classified. As mentioned before, there are a large number of term, string, and template features, and thus we use the χ2 test to select the most significant ones. The naïve Bayes model for article filtering is defined as follows:

R N B ( d ) = log Pr N B ( c + | d ) Pr N B ( c | d ) = log Pr ( c + ) Pr ( c ) + i = 1 n log Pr ( w i | c + ) i = 1 n log Pr ( w i | c ) MathType@MTEF@5@5@+=feaafiart1ev1aqatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciGacaGaaeqabaqabeGadaaakeaafaqadeGabaaabaGaamOuamaaBaaaleaacaWGobGaamOqaaqabaGccaGGOaGaamizaiaacMcacqGH9aqpciGGSbGaai4BaiaacEgajuaGdaWcaaqaaiGaccfacaGGYbWaaSbaaeaacaWGobGaamOqaaqabaGaaiikaiaadogadaWgaaqaaiabgUcaRaqabaGaaiiFaiaadsgacaGGPaaabaGaciiuaiaackhadaWgaaqaaiaad6eacaWGcbaabeaacaGGOaGaam4yamaaBaaabaGaeyOeI0cabeaacaGG8bGaamizaiaacMcaaaaakeaacqGH9aqpciGGSbGaai4BaiaacEgajuaGdaWcaaqaaiGaccfacaGGYbGaaiikaiaadogadaWgaaqaaiabgUcaRaqabaGaaiykaaqaaiGaccfacaGGYbGaaiikaiaadogadaWgaaqaaiabgkHiTaqabaGaaiykaaaakiabgUcaRmaaqahabaGaciiBaiaac+gacaGGNbGaciiuaiaackhacaGGOaGaam4DamaaBaaaleaacaWGPbaabeaakiaacYhacaWGJbWaaSbaaSqaaiabgUcaRaqabaGccaGGPaaaleaacaWGPbGaeyypa0JaaGymaaqaaiaad6gaa0GaeyyeIuoakiabgkHiTmaaqahabaGaciiBaiaac+gacaGGNbGaciiuaiaackhacaGGOaGaam4DamaaBaaaleaacaWGPbaabeaakiaacYhacaWGJbWaaSbaaSqaaiabgkHiTaqabaGccaGGPaaaleaacaWGPbGaeyypa0JaaGymaaqaaiaad6gaa0GaeyyeIuoaaaaaaa@7DAE@

Where d denotes an article, c+/c- denotes relevant/irrelevant articles, w i indicates a feature, and R NB (d) is the output score indicating the degree of relevance. The multinomial model implies a different distribution:

R M N ( d ) = log Pr M N ( c + | d ) Pr M N ( c | d ) = log Pr ( c + ) Pr ( c ) + i = 1 n x i log Pr ( w i | c + ) i = 1 n x i log Pr ( w i | c ) MathType@MTEF@5@5@+=feaafiart1ev1aqatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciGacaGaaeqabaqabeGadaaakeaafaqadeGabaaabaGaamOuamaaBaaaleaacaWGnbGaamOtaaqabaGccaGGOaGaamizaiaacMcacqGH9aqpciGGSbGaai4BaiaacEgajuaGdaWcaaqaaiGaccfacaGGYbWaaSbaaeaacaWGnbGaamOtaaqabaGaaiikaiaadogadaWgaaqaaiabgUcaRaqabaGaaiiFaiaadsgacaGGPaaabaGaciiuaiaackhadaWgaaqaaiaad2eacaWGobaabeaacaGGOaGaam4yamaaBaaabaGaeyOeI0cabeaacaGG8bGaamizaiaacMcaaaaakeaacqGH9aqpciGGSbGaai4BaiaacEgajuaGdaWcaaqaaiGaccfacaGGYbGaaiikaiaadogadaWgaaqaaiabgUcaRaqabaGaaiykaaqaaiGaccfacaGGYbGaaiikaiaadogadaWgaaqaaiabgkHiTaqabaGaaiykaaaakiabgUcaRmaaqahabaGaamiEamaaBaaaleaacaWGPbaabeaakiGacYgacaGGVbGaai4zaiGaccfacaGGYbGaaiikaiaadEhadaWgaaWcbaGaamyAaaqabaGccaGG8bGaam4yamaaBaaaleaacqGHRaWkaeqaaOGaaiykaaWcbaGaamyAaiabg2da9iaaigdaaeaacaWGUbaaniabggHiLdGccqGHsisldaaeWbqaaiaadIhadaWgaaWcbaGaamyAaaqabaGcciGGSbGaai4BaiaacEgaciGGqbGaaiOCaiaacIcacaWG3bWaaSbaaSqaaiaadMgaaeqaaOGaaiiFaiaadogadaWgaaWcbaGaeyOeI0cabeaakiaacMcaaSqaaiaadMgacqGH9aqpcaaIXaaabaGaamOBaaqdcqGHris5aaaaaaa@8211@

Where x i denotes the number of times that a feature w i appears in the document d. In these two models, we must estimate the probability of each feature appearing in both relevant and irrelevant articles. This can be easily implemented using the equation below:

Pr ( w i | c + ) = 1 + N ( w i , c + ) V + w j N ( w j , c + ) = 1 + d j P O S T F ( w i , d j ) V + w k d j P O S T F ( w k , d j ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciGacaGaaeqabaqabeGadaaakeaafaqaaeGadaaabaGaciiuaiaackhacaGGOaGaam4DamaaBaaaleaacaWGPbaabeaakiaacYhacaWGJbWaaSbaaSqaaiabgUcaRaqabaGccaGGPaaabaGaeyypa0dabaqcfa4aaSaaaeaacaaIXaGaey4kaSIaamOtaiaacIcacaWG3bWaaSbaaeaacaWGPbaabeaacaGGSaGaam4yamaaBaaabaGaey4kaScabeaacaGGPaaabaGaamOvaiabgUcaRmaaqababaGaamOtaiaacIcacaWG3bWaaSbaaeaacaWGQbaabeaacaGGSaGaam4yamaaBaaabaGaey4kaScabeaacaGGPaaabaGaam4DamaaBaaabaGaamOAaaqabaaabeGaeyyeIuoaaaaakeaaaeaacqGH9aqpaeaajuaGdaWcaaqaaiaaigdacqGHRaWkdaaeqaqaaiaadsfacaWGgbGaaiikaiaadEhadaWgaaqaaiaadMgaaeqaaiaacYcacaWGKbWaaSbaaeaacaWGQbaabeaacaGGPaaabaGaamizamaaBaaabaGaamOAaaqabaGaeyicI4Saamiuaiaad+eacaWGtbaabeGaeyyeIuoaaeaacaWGwbGaey4kaSYaaabeaeaadaaeqaqaaiaadsfacaWGgbGaaiikaiaadEhadaWgaaqaaiaadUgaaeqaaiaacYcacaWGKbWaaSbaaeaacaWGQbaabeaacaGGPaaabaGaamizamaaBaaabaGaamOAaaqabaGaeyicI4Saamiuaiaad+eacaWGtbaabeGaeyyeIuoaaeaacaWG3bWaaSbaaeaacaWGRbaabeaaaeqacqGHris5aaaaaaaaaa@76EC@

Where V is the total number of features, POS is the set of relevant documents, and TF(w i , d j ) is the frequency that the feature w i is observed in the document d j .

These two models are called probabilistic models, because they interpret the data by estimating a probability distribution. As mentioned previously, to analyze the quantitative difference of two distributions, we define the average Kullback Leibler divergence as follows:

A K L ( p , q ) = 1 2 x ( q ( x ) log q ( x ) p ( x ) + p ( x ) log p ( x ) q ( x ) ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciGacaGaaeqabaqabeGadaaakeaacaWGbbGaam4saiaadYeacaGGOaGaamiCaiaacYcacaWGXbGaaiykaiabg2da9KqbaoaalaaabaGaaGymaaqaaiaaikdaaaGcdaaeqbqaamaabmaabaGaamyCaiaacIcacaWG4bGaaiykaiGacYgacaGGVbGaai4zaKqbaoaalaaabaGaamyCaiaacIcacaWG4bGaaiykaaqaaiaadchacaGGOaGaamiEaiaacMcaaaGccqGHRaWkcaWGWbGaaiikaiaadIhacaGGPaGaciiBaiaac+gacaGGNbqcfa4aaSaaaeaacaWGWbGaaiikaiaadIhacaGGPaaabaGaamyCaiaacIcacaWG4bGaaiykaaaaaOGaayjkaiaawMcaaaWcbaGaamiEaaqab0GaeyyeIuoaaaa@5A05@

Where p and q are two distributions over the random variable x. If the two distributions are identical, then the AKL is 0; otherwise it is positive.

The SVM is a discriminative model, which constructs a hyperplane in the feature space to separate the data into categories. The classification decision is made by calculating the distance of a sample from the hyperplane and in this module; we investigated two types of SVM models. The first is a traditional SVM with a linear kernel and in which each sample is represented as a feature vector. Instead of using TF*IDF, we proposed a new computational scheme to overcome the issue of divergence between the training set and test set, as follows:

T F * M L P ( w i , d j ) = T F ( w i , d j ) * log Pr ( w i | c + ) Pr ( w i | c ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciGacaGaaeqabaqabeGadaaakeaacaWGubGaamOraiaacQcacaWGnbGaamitaiaadcfacaGGOaGaam4DamaaBaaaleaacaWGPbaabeaakiaacYcacaWGKbWaaSbaaSqaaiaadQgaaeqaaOGaaiykaiabg2da9iaadsfacaWGgbGaaiikaiaadEhadaWgaaWcbaGaamyAaaqabaGccaGGSaGaamizamaaBaaaleaacaWGQbaabeaakiaacMcacaGGQaGaciiBaiaac+gacaGGNbqcfa4aaSaaaeaaciGGqbGaaiOCaiaacIcacaWG3bWaaSbaaeaacaWGPbaabeaacaGG8bGaam4yamaaBaaabaGaey4kaScabeaacaGGPaaabaGaciiuaiaackhacaGGOaGaam4DamaaBaaabaGaamyAaaqabaGaaiiFaiaadogadaWgaaqaaiabgkHiTaqabaGaaiykaaaaaaa@5884@

Where TF(w i , d j ) is the frequency of the feature wi observed in the document dj. The computational scheme performs much better than TF*IDF in the benchmark evaluation. The decision variable in the SVM model is as follows:

R S V M ( x ) = b + x i S V y i α i K ( x i , x ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciGacaGaaeqabaqabeGadaaakeaacaWGsbWaaSbaaSqaaiaadofacaWGwbGaamytaaqabaGccaGGOaGaamiEaiaacMcacqGH9aqpcaWGIbGaey4kaSYaaabeaeaacaWG5bWaaSbaaSqaaiaadMgaaeqaaOGaeqySde2aaSbaaSqaaiaadMgaaeqaaOGaam4saiaacIcacaWG4bWaaSbaaSqaaiaadMgaaeqaaOGaaiilaiaadIhacaGGPaaaleaacaWG4bWaaSbaaWqaaiaadMgaaeqaaSGaeyicI4Saam4uaiaadAfaaeqaniabggHiLdaaaa@4B15@

Where SV means the support vectors. The kernel function provides an alternative mechanism to represent data in a composite manner in addition to the feature-vector representation. For instance, the p-spectrum kernel computes the number of common substrings shared by two input samples [24]:

K p ( x , y ) = < φ p ( x ) , φ p ( y ) > = u Θ p φ u p ( y ) * φ u p ( y ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciGacaGaaeqabaqabeGadaaakeaacaWGlbWaaSbaaSqaaiaadchaaeqaaOGaaiikaiaadIhacaGGSaGaamyEaiaacMcacqGH9aqpcqGH8aapcqaHgpGzdaahaaWcbeqaaiaadchaaaGccaGGOaGaamiEaiaacMcacaGGSaGaeqOXdy2aaWbaaSqabeaacaWGWbaaaOGaaiikaiaadMhacaGGPaGaeyOpa4Jaeyypa0ZaaabeaeaacqaHgpGzdaqhaaWcbaGaamyDaaqaaiaadchaaaGccaGGOaGaamyEaiaacMcaaSqaaiaadwhacqGHiiIZcqqHyoqudaahaaadbeqaaiaadchaaaaaleqaniabggHiLdGccaGGQaGaeqOXdy2aa0baaSqaaiaadwhaaeaacaWGWbaaaOGaaiikaiaadMhacaGGPaaaaa@59AD@
φ u p ( x ) = | { ( v 1 , v 2 ) | x = v 1 u v 2 , u Θ p } | MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciGacaGaaeqabaqabeGadaaakeaacqaHgpGzdaqhaaWcbaGaamyDaaqaaiaadchaaaGccaGGOaGaamiEaiaacMcacqGH9aqpdaabdaqaamaacmaabaGaaiikaiaadAhadaWgaaWcbaGaaGymaaqabaGccaGGSaGaamODamaaBaaaleaacaaIYaaabeaakiaacMcacaGG8bGaamiEaiabg2da9iaadAhadaWgaaWcbaGaaGymaaqabaGccaWG1bGaamODamaaBaaaleaacaaIYaaabeaakiaacYcacaWG1bGaeyicI4SaeuiMde1aaWbaaSqabeaacaWGWbaaaaGccaGL7bGaayzFaaaacaGLhWUaayjcSdaaaa@505C@

Where x and y are two strings (or documents) defined in the alphabet Θ, and Θpindicates all possible substrings of length p. In our method, we take p = 7, which is about 1.5 times the average length of unigrams. An example of how to construct string features is shown in Table 5. An article here is treated as a string, and no other semantics are considered. This low-level representation reduces the distribution divergence between the training and test data.

Table 5 An example for constructing string features

Molecule recognition module

Identifying protein mentions and normalizing them to molecule identifiers is a necessary step toward the extraction of protein interactions. In contrast to traditional named entity recognition tasks, this task requires the submitted protein pairs be mapped to unique SwissProt identifiers rather than presenting the original names in the text. We not only must identify named entities but also we must map them to unique molecule identifiers. As shown in Figure 4, there are four main processes in the molecule recognition module: database curation, organism detection, dictionary-based matching, and removing ambiguity from the mapped names.

Figure 4
figure 4

The flowchart of the molecule recognition module. Gray boxes are the input of our molecule recognition module and the figure illustrates the flowchart of the molecule recognition module.

After curation, there are in total 230,000 protein identifiers, and more than 1 million terms. Obviously, it is not feasible to use all of the terms during the dictionary-based matching process. Moreover, the same terms, particularly abbreviations, may correspond to many protein identifiers. This is common when the same gene products only differ in organisms, and thus the organism context is crucial to remove such ambiguities. We first detect the organism information in an article, and then this information is used to rule out irrelevant database entries and to remove ambiguities when the terms are mapped to multiple protein identifiers. Our assumption here is that physical interactions described in one paper should occur only within a few organisms. The organism database used here is the NCBI (National Center for Biotechnology Information) taxonomy [25]. Dictionary-based matching is used to detect organisms, and the five most frequent organisms are left such that each sentence can be linked with several detected organisms. To remove the ambiguity from the mapping of the names identified to molecule identifiers, the nearest neighbor principle is used, implying that the organism associated to a recognized name is the organism in the nearest sentence where the name is detected.

Profile-based PPI extraction moduleTo extract experimentally verified physical interactions in practical interaction curation, curators usually collect evidence from multiple sentences. Previous methods to extract protein interactions are all at the sentence level, where each sentence is processed independently, and thus they fail to synthesize information from multiple sentences. Our profile-based method is able to exploit profile features from multiple sources throughout the whole document, and for each candidate interacting protein pair a profile vector is constructed from multiple sentences. In comparison with traditional methods, the profile-based method is more robust for the IPS of BioCreative 2006 and gained the best and the second best results on the SwissProt-only subset and the entire dataset, respectively.

Every protein pair coincident in a sentence is viewed as an interaction candidate. For each pair, profile features are calculated from all the sentences in which the pair coincides. The corresponding bit is set to 1 if the feature is found in these sentences (Figure 5). Through such a representation, information from the whole document can be integrated together and a SVM with the linear kernel can be trained on the profile feature vectors. There are three types of profile features.

Figure 5
figure 5

The profile vector in the extraction of interaction protein pairs. The construction of the profile vector for each candidate protein pair is shown in this figure. The term feature (unigram/bigram), template feature, and position feature are used in this process.

1. One hundred and sixty-eight unigram/bigram features. One hundred of these features are selected using the χ2 test, and 68 are taken manually from the branches of the physical interaction and detection method in the Molecular Interaction (MI) ontology [7].

2. Ninety-one template features. These features are generated in a semi-supervised manner [26] and their form is like 'Protein1 * bind to * Protein2', where * means that any word can be omitted. Some template examples are listed in Table 6.

Table 6 Examples for template features used in the profile-based method

3. Two position features. One of which is whether the two proteins coincide in the title and the other is whether they coincide in the abstract.

Our method is more robust than the traditional methods because a single description, such as 'Protein1 binds to Protein2', does not necessarily indicate the existence of a physical interaction. However, if there is other evidence, such as 'The binding of Protein1 to Protein2 is determined by Y2H', then the interaction is more trustworthy. Clearly, more evidence will strengthen confidence in the interaction. In addition, our algorithm is more robust when the performance of the molecule recognition module is far from satisfactory. For example, in the sentence 'The Y2H experiment proved the interaction between Protein1 and Protein2, CGA ...', CGA, whichis the sequence of Protein2, will be recognized as chromogranin A precursor, and then it will coincide with Protein1 and Protein2. The previous methods will fail, and although these false pairs are less statistically significant across the whole document, our method is able to resolve the problem by incorporating evidence from multiple sentences.