Background

Proteins are linear chains of amino acids that fold into complicated three-dimensional (3D) structures. Through these different structures, proteins perform specific functions in biological processes [1-14]. To study the structure-function relationship, biologists have a strong demand for protein structure retrieval systems that search for similar sequences or 3D structures [15]. Pairwise protein comparison is one of the main functions of such retrieval systems [16]. The need to retrieve or classify proteins using 3D structure or sequence-based similarity underlies many biomedical applications. In drug discovery, researchers search for proteins that share specific chemical properties as sources for new treatments. In folding simulations, similar intermediate structures may be indicative of a common folding pathway [17].

Related work

The structural comparison problem in protein structure retrieval systems has been extensively studied. A rapid protein structure retrieval system named ProtDex2 was proposed by Aung and Tan [18], in which information retrieval techniques were adopted to perform rapid database searches without accessing each 3D structure in the database. The retrieval process was based on an inverted-file index constructed on feature vectors describing the relationships between the secondary structure elements (SSEs) of all the protein structures in the database. To evaluate the similarity score between a query protein structure and a database protein structure, they adopted and modified the well-known ∑(tf × idf) scoring scheme commonly used in document retrieval systems [19]. In [20, 21], a 3D shape-based approach was presented by Daras et al. The method relied primarily on the geometric 3D structure of the proteins, extracted from the corresponding PDB files, and secondarily on their primary and secondary structures. Additionally, characteristic attributes of the primary and secondary structures of the protein molecules were extracted, forming attribute-based descriptor vectors. The descriptor vectors were then weighted, and an integrated descriptor vector was produced. To compare a pair of protein descriptor vectors, Daras et al. [20, 21] used two similarity metrics: the first was based on the Euclidean distance [22] between the descriptor vectors, and the second on the Mean Euclidean Distance Measure [20, 21].

Later, Marsolo and Parthasarathy presented two normalized, stand-alone representations of proteins that enabled fast and efficient object retrieval based on sequence or structure information [17, 23]. For range queries, they specified a range value r and retrieved all the proteins from the database that lie within a distance r of the query, where distance refers to the standard Euclidean distance [22]. In [24], Sael et al. introduced a global surface shape representation based on 3D Zernike descriptors for protein structure similarity search. In their study, three distance measures were used for comparing the 3D Zernike descriptors of protein surface shapes: the Euclidean distance, the Manhattan distance [25], and a correlation coefficient-based distance. A fast protein comparison algorithm, IR Tableau, was developed by Zhang et al. [26] for protein retrieval, which leveraged the tableau representation to compare protein tertiary structures. IR Tableau compared tableaux using feature indexing techniques and applied a number of similarity functions to compare pairs of protein vectors, namely cosine similarity [27], the Jaccard index [28], the Tanimoto coefficient [29], and the Euclidean distance.

The basic components of a protein retrieval system include a way to represent proteins and a dissimilarity measure that compares a pair of proteins. Most of the aforementioned studies focus on the feature representation of the proteins while neglecting the comparison of the feature vectors: they usually apply a simple similarity or dissimilarity measure, such as the Euclidean distance used in [17, 20, 21, 23, 24, 26]. As a result, most existing protein comparison techniques suffer from the following two bottlenecks:

  • The dissimilarity measure is a pairwise distance measure, computed as d(x_0, x_i) considering only the query protein x_0 and a database protein x_i. It ignores the other proteins in the database, neglecting the effects of these contextual proteins. If we instead consider the distribution of the entire protein database X = {x_j}, j = 1, ..., N, and compute the dissimilarity as d(x_0, x_i | X), the retrieval performance may benefit from the contextual proteins {x_j}, j ≠ i.

  • The dissimilarity measure is computed in an unsupervised way, which does not use the known class labels L = {l_j}, j = 1, ..., N, of the database proteins. Although we may have no idea whether x_0 and x_i belong to the same class (e.g., have the same fold type, l_0 = l_i) or not (l_0 ≠ l_i), we do know the prior class information L of the other proteins. In all of the previous studies, the prior class labels L were not used to compute the dissimilarity d(x_0, x_i).

Due to these two bottlenecks, traditional protein retrieval systems using pairwise, unsupervised dissimilarity measures usually do not achieve satisfactory performance, even though many effective protein feature descriptors have been developed. In this paper, we investigate the dissimilarity measure itself and propose a novel learning algorithm to improve the performance of a given dissimilarity measure.

Recent research in machine learning has pointed out that contextual information can be used to improve dissimilarity or similarity measures. Such algorithms are called contextual or context-sensitive dissimilarity learning [30-34]. Unlike the traditional pairwise distance d(x_0, x_i), which considers only the two proteins in question, a contextual dissimilarity also considers the contextual proteins X when computing d(x_0, x_i | X). The existing contextual similarity learning algorithms fall mainly into the following two categories:

Dissimilarity regulation

The first contextual dissimilarity measure (CDM) was proposed by Jegou et al. [30, 31], and it significantly improved the accuracy of image search. The CDM takes the local distribution of the vectors into account and iteratively estimates distance update terms in the spirit of Sinkhorn's scaling algorithm [35], thereby modifying the neighborhood structure. This regularization was motivated by the observation that a good ranking is usually not symmetric in an image search system. In this paper, we focus on this type of contextual dissimilarity learning.

Similarity transduction on graph

In [32, 33], Bai et al. provided a novel perspective on shape retrieval tasks by considering the existing shapes as a group and studying their similarity to the query shape in a graph structure. For a given similarity measure, a new similarity was learned through graph transduction. The learning was done in an iterative manner so that the neighbors of a given shape influence its final similarity to the query. The basic idea is related to the PageRank algorithm, which forms a foundation of Google Web search. This method was further improved by Wang et al. in [36]. Similar learning algorithms have also been used to rank proteins in a protein database [37, 38]. Kuang et al. [37] proposed a general graph-based propagation algorithm called MotifProp to detect more subtle similarity relationships than pairwise comparison methods. In [38], Weston et al. reviewed RankProp, a ranking algorithm that exploits the global network structure of similarity relationships among proteins in a database by performing a diffusion operation on a protein similarity network with weighted edges.

The drawbacks of the above algorithms are twofold. On the one hand, most of them do not utilize the class label information L of the database objects and thus work in an unsupervised way. The only one that uses L is [38]; however, the algorithm proposed in [38] has basically the same framework as [32, 33, 37], i.e., the protein label information L is only used to estimate the parameters. On the other hand, the "context" is fixed in the iterative algorithms of most of the transduction methods [32, 33, 37, 38]. A better way is to update the context using the learned similarity measures, as in [30, 31].

To overcome these drawbacks, we develop a novel contextual dissimilarity learning algorithm to improve the performance of a protein retrieval system. The novel dissimilarity measure is regularized by the dissimilarity of the contextual proteins (neighboring proteins), while the contextual proteins are updated coherently using the learned dissimilarities. The basic idea comes from [39, 40], which assume that if two local features in two images are similar, their contexts are likely to be similar. In contrast to [30, 31], which use the neighborhood as a single context, we partition the neighborhood into several hierarchical sub-contexts according to the learned dissimilarities. With these sub-contexts, we compute the dissimilarity between the sub-contexts of a pair of proteins and construct the hierarchical sub-contextual dissimilarity vector. Moreover, using the label information L, we select pairs of proteins belonging to the same class, {(x_i, x_j) | l_i = l_j}, as relevant protein pairs, and pairs from different classes, {(x_k, x_l) | l_k ≠ l_l}, as irrelevant protein pairs.

Finally, we train a support vector machine (SVM) [41] to distinguish between the relevant and the irrelevant protein pairs. The output of the SVM will further be used to regularize the dissimilarity in an iterative manner.

Methods

This section describes our contextual protein-protein dissimilarity learning algorithm, which utilizes the contextual proteins and class label information of the database proteins to index and search protein structures efficiently. We will demonstrate that our idea is general in the sense that it can be used to improve the existing similarity/dissimilarity measures.

Protein structure retrieval framework

In a protein retrieval system, the query and the database proteins are first represented as feature vectors. Here, we denote the query protein feature vector as x_0 and the database protein feature vectors as X = {x_1, x_2, ..., x_N}, where N is the number of proteins in the database. Then, based on a distance measure d_{0i} = d(x_0, x_i), we compute the distance between x_0 and each protein in the database, i.e., {d_{01}, d_{02}, ..., d_{0N}}. The database proteins are then ranked according to these distances, and the k most similar ones are returned as the retrieval results. We illustrate the outline of the protein retrieval system in Figure 1, and give a minimal sketch of this baseline pipeline below.
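To make the pipeline concrete, the following minimal Python sketch ranks database proteins by distance to a query; the function name and the use of plain Euclidean distance are illustrative assumptions, not the system's actual feature extraction or distance measure:

```python
import numpy as np

def retrieve(x0, X, k=10):
    """Rank database proteins by distance to the query x0 and return the top k.

    x0 : (d,) query feature vector
    X  : (N, d) database feature vectors
    Plain Euclidean distance stands in here for the generic measure d(x0, xi).
    """
    d0 = np.linalg.norm(X - x0, axis=1)   # {d01, d02, ..., d0N}
    ranking = np.argsort(d0)              # ascending: most similar first
    return ranking[:k], d0[ranking[:k]]
```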

Figure 1 Flowchart of protein retrieval systems.

ProDis-ContSHC: the contextual dissimilarity learning algorithm

In this section, we introduce the novel contextual protein-protein dissimilarity learning algorithm. We first give the definition of the hierarchical context of a protein, which is used to compute the contextual dissimilarity and regularize the dissimilarity measure. Then, a more discriminative regularization factor is learned using the class labels of the database proteins. Finally, we propose the Supervised regulating of Protein-protein Dissimilarity and updating of the Hierarchical Context Coherently in an iterative manner, resulting in the ProDis-ContSHC algorithm.

Using hierarchical context to regularize the dissimilarity measure

Here, we define a protein x_i's context as its K nearest neighbors $\mathcal{N}(i)$. The dissimilarity between two such contexts is measured by the contextual dissimilarity

$$ r_{ij} = \frac{1}{K^2} \sum_{m \in \mathcal{N}(i),\ n \in \mathcal{N}(j)} d_{mn} $$
(1)

The contextual dissimilarity is illustrated in Figure 2(a).

Figure 2 Illustration of context-based dissimilarity and hierarchical context-based dissimilarity. The two proteins x_i and x_j, whose dissimilarity is to be measured, are shown in the first row. The nearest neighbors of each protein are listed below it as its context. (a) The traditional context $\mathcal{N}(i)$; (b) the proposed hierarchical context $\mathcal{N}_p(i)$, p ∈ {1, 2, 3}.

Furthermore, instead of averaging all the pairwise dissimilarities between the two contexts $\mathcal{N}(i)$ and $\mathcal{N}(j)$, we propose the hierarchical context, splitting the context $\mathcal{N}(i)$ into P "sub-contexts" $\mathcal{N}_p(i)$, p = 1, ..., P, according to their distances to x_i. More specifically, the sub-context $\mathcal{N}_p(i)$ is defined as

$$ \mathcal{N}_p(i) = \left\{ x_j \;\middle|\; x_j \text{ is among the } k'\text{-th to } k''\text{-th nearest neighbors of } x_i \text{ according to } \{d_{ij}\},\ j \in \{1, \ldots, i-1, i+1, \ldots, N\} \right\} $$
(2)

where k' = (p - 1) × κ, k'' = (p - 1) × κ + κ, κ is the size of a sub-context, and P is the number of sub-contexts. In this way, we can compute the contextual dissimilarity by averaging the dissimilarities of the sub-contexts as

$$ r_{ij} = \frac{1}{P} \sum_{p} \frac{1}{\kappa^2} \sum_{m \in \mathcal{N}_p(i),\ n \in \mathcal{N}_p(j)} d_{mn} = \frac{1}{P} \sum_{p} d_{ij}(p) $$
(3)

where $d_{ij}(p) = \frac{1}{\kappa^2} \sum_{m \in \mathcal{N}_p(i),\ n \in \mathcal{N}_p(j)} d_{mn}$, p = 1, ..., P, is the hierarchical sub-contextual dissimilarity. Figure 2(b) illustrates the idea of sub-contextual dissimilarity.

Intuitively, if the contexts of two proteins are dissimilar to each other (r_ij is higher than average), the proteins should have a higher dissimilarity value, and vice versa. We implement this by multiplying by a coefficient, namely the ratio of r_ij to the average of all the contextual dissimilarities, $\bar{r} = \frac{1}{N^2} \sum_{i,j} r_{ij}$:

$$ d_{ij}^{*} = d_{ij} \times \frac{r_{ij}}{\bar{r}} = d_{ij} \times \delta_{ij} $$
(4)

Here, $\delta_{ij} = r_{ij} / \bar{r}$ is a regularization factor for d_ij, with which we can improve d_ij using its contextual information. Moreover, this procedure can be carried out in an iterative manner: we can use the regularized dissimilarity measure $d_{ij}^{*}$ to re-define the new hierarchical context $\mathcal{N}_p(i)$. In this way, we learn the protein-protein dissimilarity $d_{ij}^{*}$ and the hierarchical context $\mathcal{N}_p(i)$ coherently. A minimal sketch of this unsupervised update is given below.
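The following is a minimal numpy sketch of Eqs. (1)-(4), assuming a symmetric pairwise distance matrix D; the function and variable names are ours, not from the original implementation:

```python
import numpy as np

def sub_contexts(D, i, P, kappa):
    """Split x_i's neighborhood into P sub-contexts N_p(i) of kappa
    neighbors each, ordered by distance to x_i (Eq. 2)."""
    order = np.argsort(D[i])
    order = order[order != i]                  # exclude x_i itself
    return [order[(p - 1) * kappa : p * kappa] for p in range(1, P + 1)]

def contextual_dissimilarity(D, i, j, P, kappa):
    """Hierarchical contextual dissimilarity r_ij (Eq. 3): the average of the
    sub-contextual dissimilarities d_ij(p) between matching sub-contexts."""
    Ni, Nj = sub_contexts(D, i, P, kappa), sub_contexts(D, j, P, kappa)
    d_p = np.array([D[np.ix_(Ni[p], Nj[p])].mean() for p in range(P)])
    return d_p.mean(), d_p                     # r_ij and the vector [d_ij(1..P)]

def regularize_unsupervised(D, P=3, kappa=5):
    """One unsupervised update d*_ij = d_ij * (r_ij / r_bar) (Eq. 4)."""
    N = D.shape[0]
    R = np.zeros_like(D, dtype=float)
    for i in range(N):
        for j in range(N):
            R[i, j], _ = contextual_dissimilarity(D, i, j, P, kappa)
    return D * (R / R.mean())                  # R.mean() plays the role of r_bar
```

Calling regularize_unsupervised repeatedly re-derives the sub-contexts from the updated distances, which is exactly the coherent iterative update described above.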

Supervised regularization factor learning

We utilize the label information L = {l_1, ..., l_N} of the database proteins to learn a better regularization factor δ_ij. The class information is adopted in both the intraclass and the interclass dissimilarity computation to maximize the Fisher criterion [42] for protein class separability. First, we select a number of protein pairs {γ = (i, j) | i, j = 1, ..., N}. For each pair, we compute the hierarchical contextual dissimilarities and organize them as a P-dimensional dissimilarity vector $\mathbf{d}_\gamma = [d_{ij}(1)\ d_{ij}(2)\ \cdots\ d_{ij}(P)]$, as shown in Figure 3. Then, inspired by the score fusion rule [43, 44] and using L, we label each pair γ = (i, j) as a relevant pair, y_γ = +1, if l_i = l_j, or as an irrelevant pair, y_γ = -1, otherwise.

Figure 3 Differentiating relevant and irrelevant proteins by classification. (x_i, x_j) is assumed to be a relevant pair and (x_i, x_k) an irrelevant pair. The contextual dissimilarity vectors of the two pairs are distinguished by a binary SVM model.

Now, with the training samples $\Gamma = \{(\mathbf{d}_\gamma, y_\gamma)\}$, $\gamma = 1, \ldots, \binom{N}{2}$, we train a binary SVM [41] classifier to distinguish the relevant pairs from the irrelevant pairs. The publicly available package SVMlight [45] is applied to train the SVM on our training set Γ. This package allows us to optimize a number of parameters and offers the option of different kernel functions to obtain the best classification performance [46]. The separating hyperplane generated by the SVM model is given by

$$ f(\mathbf{d}) = \mathbf{d} \cdot \mathbf{w} + b $$
(5)

where $\mathbf{w}$ is a vector orthogonal to the hyperplane and b is an offset parameter; $\mathbf{w}$ and b are chosen to minimize $\|\mathbf{w}\|^2$ subject to the following conditions:

$$ y_\gamma (\mathbf{d}_\gamma \cdot \mathbf{w} + b) \geq 1 $$
(6)

for all $1 \leq \gamma \leq \binom{N}{2}$, where $\binom{N}{2}$ is the total number of examples (protein pairs). An SVM model with a linear decision boundary is shown in Figure 3, distinguishing the relevant protein pairs from the irrelevant ones. Note that not all $\binom{N}{2}$ possible protein pairs need to be included to train the SVM model (5). For any pair of proteins (x_i, x_j), after computing its contextual dissimilarity vector $\mathbf{d}_{ij}$, the trained SVM classifier is applied to obtain the signed distance of this point to the margin boundary of the SVM, $\tilde{y}_{ij} = f(\mathbf{d}_{ij})$. $\tilde{y}_{ij}$ is thus a measure of the dissimilarity of the contexts of this pair of proteins, and it can be used to form a regularization factor as

$$ \delta_{ij} = \exp\left( -\frac{\tilde{y}_{ij}}{\sigma} \right) = \exp\left( -\frac{\mathbf{d}_{ij} \cdot \mathbf{w} + b}{\sigma} \right) $$
(7)

where σ is a scaling parameter of the factor. With this regularization factor learned from the contextual proteins, we regularize the dissimilarity d_ij of the protein pair (x_i, x_j) as

$$ d_{ij}^{*} = d_{ij} \times \delta_{ij} $$
(8)
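A sketch of this supervised factor, Eqs. (5)-(8), is given below. It uses scikit-learn's LinearSVC as a stand-in for SVMlight, reuses contextual_dissimilarity from the previous sketch, and samples training pairs randomly from the labeled rows; all names and defaults are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

def supervised_factors(D, labels, P, kappa, sigma=1.0, n_pairs=2000,
                       label_idx=None, seed=0):
    """Learn delta_ij = exp(-f(d_ij)/sigma) (Eq. 7) for every protein pair.

    D      : (M, M) current dissimilarity matrix
    labels : (M,) class labels; only rows in label_idx are used for training
    """
    M = D.shape[0]
    idx = np.arange(M) if label_idx is None else np.asarray(label_idx)
    rng = np.random.default_rng(seed)

    feats, ys = [], []
    while len(ys) < n_pairs:
        # in practice an equal number of relevant and irrelevant pairs is
        # selected (see the efficient implementation section)
        i, j = rng.choice(idx, size=2, replace=False)
        _, d_vec = contextual_dissimilarity(D, i, j, P, kappa)
        feats.append(d_vec)
        ys.append(+1 if labels[i] == labels[j] else -1)
    svm = LinearSVC().fit(np.array(feats), np.array(ys))     # Eqs. (5)-(6)

    delta = np.ones_like(D, dtype=float)
    for i in range(M):
        for j in range(M):
            _, d_vec = contextual_dissimilarity(D, i, j, P, kappa)
            y_tilde = svm.decision_function(d_vec.reshape(1, -1))[0]
            delta[i, j] = np.exp(-y_tilde / sigma)           # Eq. (7)
    return delta      # the regularized measure is D * delta (Eq. 8)
```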

Updating the context and dissimilarity coherently

With the learned dissimilarity measure $d_{ij}^{*}$, we can re-define the "context" of a protein x_i according to its dissimilarity to all the other proteins, $d_{ij}^{*}$, j ∈ {0, ..., i-1, i+1, ..., N}. The new "hierarchical context" relying on $d_{ij}^{*}$ is denoted as $\mathcal{N}_p^{*}(i)$, p = 1, ..., P. In this way, we develop an iterative algorithm that learns $d_{ij}^{*}$ and $\mathcal{N}_p^{*}(i)$, p = 1, ..., P, coherently. Since $\mathcal{N}_p^{*}(i)$ implicitly depends on $d_{ij}^{*}$ through the nearest neighbors of x_i, we use a fixed-point recursion method [47] to solve for $d_{ij}^{*}$. In each iteration, $\mathcal{N}_p^{*}(i)$ is first computed using the previous estimate of $d_{ij}^{*}$, which is then updated by multiplying by the regularization factor δ_ij as in (8). The iterations are carried out T times, as given in Algorithm 1.

With the learned dissimilarity matrix $D^{(T+1)}$, we use $D^{(T+1)}(0; 1, \ldots, N)$ as the dissimilarities between the query protein x_0 and the database proteins {x_1, ..., x_N}, and rank the database proteins in ascending order of dissimilarity.

Efficient implementation of ProDis-ContSHC

The proposed learning algorithm is time-consuming and therefore not directly suitable for real-time protein retrieval systems. Here, we propose several techniques to significantly improve its efficiency.

  • Similar to [33], in order to increase the computational efficiency, it is possible to run ProDis-ContSHC on only part of the database of known proteins. Hence, for each query protein x_0, we first retrieve the N' ≪ N most similar proteins and perform ProDis-ContSHC on only those proteins, learning a new dissimilarity matrix D' of size (N' + 1) × (N' + 1). Here, we assume that all the relevant proteins are among the top N' most similar proteins. This strategy is illustrated in Figure 4(a) and 4(b).

Figure 4 Efficient implementation of ProDis-ContSHC. (a) Performing ProDis-ContSHC on the original matrix of size (N + 1) × (N + 1) from the entire dataset; (b) performing ProDis-ContSHC on a subset of the database proteins, i.e., a dissimilarity matrix of size (N' + 1) × (N' + 1); (c) using the symmetry of the dissimilarity matrix to reduce the training time.

  • Most dissimilarity and similarity measures are symmetric, i.e., d_ij = d_ji. As can be observed in (13), the regularization of d_ij is also symmetric. Therefore, an efficient learning algorithm can exploit this property: all the computation results for the pair (i, j) (such as d_ij and δ_ij) can be reused directly for (j, i). In this way, we save almost half of the computational time, as shown in Figure 4(c).

  • A bottleneck of ProDis-ContSHC may be the training procedure for the SVM model in each iteration. For a database of N proteins belonging to C classes, there are $\binom{N}{2}$ protein pairs, of which $\sum_{c=1}^{C} \binom{N_c}{2}$ are relevant pairs and $\sum_{c < c'} N_c \times N_{c'}$ are irrelevant pairs, where C is the number of protein classes and N_c is the number of proteins in the c-th class ($\sum_{c=1}^{C} N_c = N$). There may thus be a huge number of protein pairs available for SVM training. However, it is not necessary to include all of them in the training process: one can select a small but equal number of relevant and irrelevant pairs to train the SVM classifier, which effectively reduces the SVM training time (see the sketch after this list).
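The sketch below illustrates the balanced pair selection described in the last item; the function name and the rejection-sampling strategy are our own illustrative choices:

```python
import numpy as np

def sample_balanced_pairs(labels, n_each, seed=0):
    """Draw n_each relevant (same-label) and n_each irrelevant
    (different-label) protein pairs for SVM training, instead of
    enumerating all C(N, 2) possible pairs."""
    rng = np.random.default_rng(seed)
    N = len(labels)
    relevant, irrelevant = [], []
    while len(relevant) < n_each or len(irrelevant) < n_each:
        i, j = rng.choice(N, size=2, replace=False)
        if labels[i] == labels[j] and len(relevant) < n_each:
            relevant.append((i, j))
        elif labels[i] != labels[j] and len(irrelevant) < n_each:
            irrelevant.append((i, j))
    return relevant, irrelevant
```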

Algorithm 1 ProDis-ContSHC: Supervised Learning of Protein Dissimilarity and Updating the Hierarchical Context Coherently.

Require: Input $D = [d_{ij}]_{(N+1) \times (N+1)}$: matrix of pairwise protein feature distances, where x_0 is the query protein and {x_1, ..., x_N} are the database proteins;

Require: Input κ: size of the hierarchical sub-context;

Require: Input P: number of hierarchical sub-contexts;

Initialize dissimilarity matrix: D(1) = D;

for t = 1, ... , T do

Update the hierarchical context for each protein x_i: $\mathcal{N}_p^{(t)}(i)$, p = 1, ..., P,

$$ \mathcal{N}_p^{(t)}(i) = \left\{ x_j \;\middle|\; x_j \text{ is among the } k'\text{-th to } k''\text{-th nearest neighbors of } x_i \text{ according to } D^{(t)}(i; 1, \ldots, N) \right\} $$
(9)

where k' = (p - 1) × κ, k'' = (p - 1) × κ + κ, and $D^{(t)}(i; 0, \ldots, N) = [d_{i0}^{(t)}, \ldots, d_{iN}^{(t)}]$.

Compute the contextual protein dissimilarity vector $\mathbf{d}_{ij}^{(t)}$ for each pair of proteins (i, j), i, j ∈ {0, ..., N}:

$$ \mathbf{d}_{ij}^{(t)} = \left[\, d_{ij}^{(t)}(1)\ \ d_{ij}^{(t)}(2)\ \ \cdots\ \ d_{ij}^{(t)}(P) \,\right] $$
(10)

where $d_{ij}^{(t)}(p) = \frac{1}{\kappa^2} \sum_{m \in \mathcal{N}_p^{(t)}(i),\ n \in \mathcal{N}_p^{(t)}(j)} d_{mn}^{(t)}$.

Select relevant and irrelevant protein pairs, label them y_γ = +1 and y_γ = -1 respectively, and train an SVM model on their contextual dissimilarity vectors $\mathbf{d}_\gamma^{(t)}$:

$$ f^{(t)}(\mathbf{d}) = \mathbf{w}^{(t)} \cdot \mathbf{d} + b^{(t)} $$
(11)

Compute the signed distance to the SVM margin boundary for the contextual dissimilarity vector $\mathbf{d}_{ij}^{(t)}$ of each pair of proteins, $\tilde{y}_{ij}^{(t)} = f^{(t)}(\mathbf{d}_{ij}^{(t)})$, and set the regularization factor for this pair:

$$ \delta_{ij}^{(t)} = \exp\left( -\frac{\tilde{y}_{ij}^{(t)}}{\sigma} \right) $$
(12)

Update the pairwise protein dissimilarity measures:

for i = 0, 1, ... , N do

for j = 0, 1, ... , N do

$$ d_{ij}^{(t+1)} = d_{ij}^{(t)} \times \delta_{ij}^{(t)} $$
(13)

end for

end for

$$ D^{(t+1)} = \left[\, d_{ij}^{(t+1)} \,\right]_{(N+1) \times (N+1)} $$

end for

Output the dissimilarity matrix: $D^{(T+1)}$.
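Putting the pieces together, the following is a compact sketch of Algorithm 1, built on the supervised_factors helper sketched earlier; row 0 of D holds the query's distances, rows 1..N the database proteins, and all parameter defaults are illustrative:

```python
import numpy as np

def prodis_contshc(D, labels, P=3, kappa=5, sigma=1.0, T=5):
    """Iterative ProDis-ContSHC loop (Algorithm 1).

    D      : (N+1, N+1) pairwise distances; index 0 is the query protein
    labels : (N+1,) labels; labels[0] is unknown and excluded from training
    """
    D_t = D.astype(float).copy()                 # D(1) = D
    database_rows = np.arange(1, D.shape[0])     # only these carry labels
    for t in range(T):
        # Steps (9)-(12): contexts, contextual dissimilarity vectors, the
        # SVM, and the factors delta_ij are all recomputed from the current
        # D_t inside supervised_factors.
        delta = supervised_factors(D_t, labels, P, kappa, sigma,
                                   label_idx=database_rows)
        D_t = D_t * delta                        # step (13)
    return D_t

# Retrieval with the learned matrix: rank database proteins by their learned
# dissimilarity to the query (row 0), in ascending order.
# D_learned = prodis_contshc(D, labels)
# ranking = np.argsort(D_learned[0, 1:]) + 1
```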

Benchmark sets

To evaluate the proposed ProDis-ContSHC algorithm, we conduct experiments on two different benchmark sets, i.e., the ones used in [21] and [26] respectively.

ASTRAL 1.73 protein domain dataset

Following [26], we use the following database and queries as our first benchmark set:

Database

The ASTRAL 1.73 [48] 95% sequence-identity non-redundant data set is used as the protein database. We generate our index database from the tableau data set published by Stivala et al. [49], which contains 15,169 entries.

Queries

A query data set containing 200 randomly selected protein domains is used in our experiment. For each query, a list that contains all the proteins in the respective index database is returned with the ranking scores.

We generate a vector of features x for a given protein based on its tableau representation [49].

FSSP/DALI protein dataset

To evaluate the performance of the proposed methods, a portion of the FSSP database [50] is selected, as in [21]. This dataset contains 3,736 proteins classified into 30 classes and is constructed according to the DALI algorithm [51, 52]. The number of proteins per class varies from 2 to 561. For protein feature representation, the following two features are extracted from the 3D structure and the sequence of each protein, as in [20, 21]:

  • The Polar-Fourier transform, resulting in the FT02 features;

  • Krawtchouk moments, resulting in the Kraw00 features.

The descriptor vectors are weighted and an integrated descriptor vector is produced as x, which will be used for the protein retrieval tasks.

Results and discussion

Results on ASTRAL 1.73 dataset

To compare a query protein x_0 to a protein x_i in the ASTRAL 1.73 dataset, we use the cosine similarity [27] as the baseline similarity measure, as in [26]. The cosine similarity simply calculates the cosine of the angle between the two vectors x_i and x_j:

$$ s_{ij} = C(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\|\, \|x_j\|} $$
(14)

A higher cosine similarity score implies a smaller angle between the two vectors. Although ProDis-ContSHC is proposed for learning a protein-protein dissimilarity d_ij, it can easily be extended to learn a similarity s_ij as well: the only difference is to set the regularization factor to $\delta_{ij} = \exp(\tilde{y}_{ij} / \sigma)$ instead of $\delta_{ij} = \exp(-\tilde{y}_{ij} / \sigma)$ in (7), as sketched below.
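The sign flip is the entire change; the following snippet (our own illustrative helper, not code from the paper) makes it explicit:

```python
import numpy as np

def regularization_factor(y_tilde, sigma=1.0, similarity=False):
    """Regularization factor from the SVM output y_tilde (Eq. 7).
    For a dissimilarity, a relevant-looking context (large y_tilde) should
    shrink the value; for a similarity, it should enlarge it instead."""
    sign = 1.0 if similarity else -1.0
    return np.exp(sign * y_tilde / sigma)
```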

ROC curve and precision-recall curve performance

SCOP [53] fold classification is used as the ground truth to evaluate the performance of the different methods. To compare accuracy fairly, we use the receiver operating characteristic (ROC) curve [54], the area under the ROC curve (AUC) [54], and the precision-recall curve [55]. Given a query protein x_0 belonging to the SCOP fold l_0, the top k proteins returned by a search algorithm are considered hits, and the remaining proteins are considered misses. For the i-th ranked protein x_i belonging to the SCOP fold l_i: if l_i = l_0 and i ≤ k, x_i is a true positive (TP); if l_i ≠ l_0 and i ≤ k, x_i is a false positive (FP); if l_i ≠ l_0 and i > k, x_i is a true negative (TN); otherwise, x_i is a false negative (FN). Using these definitions, we compute the true positive rate (TPR), the false positive rate (FPR), recall, and precision as follows:

$$ \mathrm{TPR} = \frac{TP}{P} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{N} = \frac{FP}{FP + TN} $$
(15)
$$ \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP} $$
(16)

TPR_k, FPR_k, Recall_k, and Precision_k are calculated for all 1 ≤ k ≤ N, where N is the size of the database. The ROC curve plots the points with FPR_k as the abscissa and TPR_k as the ordinate; the precision-recall curve plots Recall_k and Precision_k as abscissa and ordinate, respectively. We use the area under the ROC curve (AUC) as a single-figure measure of the quality of a ROC curve [54], and the AUC averaged over all queries to evaluate the performance of a method. A sketch of these computations is given below.
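For reference, the following sketch computes the ROC and precision-recall points of Eqs. (15)-(16) for a single query, and the AUC by the trapezoidal rule; it is an illustrative implementation under the definitions above:

```python
import numpy as np

def roc_pr_points(ranked_labels, query_label):
    """TPR/FPR and precision/recall at every cutoff k (Eqs. 15-16) for one
    query, given the database labels in ascending-distance order."""
    rel = np.asarray(ranked_labels) == query_label   # hit relevance per rank
    tp = np.cumsum(rel)                              # TP at each cutoff k
    fp = np.cumsum(~rel)                             # FP at each cutoff k
    tpr = tp / rel.sum()                             # TPR_k = Recall_k
    fpr = fp / (~rel).sum()
    precision = tp / np.arange(1, len(rel) + 1)      # Precision_k = TP / k
    # AUC by the trapezoidal rule over the (FPR, TPR) curve, from (0, 0)
    f = np.concatenate(([0.0], fpr))
    t = np.concatenate(([0.0], tpr))
    auc = np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0)
    return fpr, tpr, precision, auc
```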

To demonstrate the contribution of the supervised learning idea, we also compare ProDis-ContSHC with its unsupervised counterpart, ProDis-ContHC, a contextual dissimilarity algorithm based on unsupervised learning that is likewise applied to improve the cosine similarity. We also compare with the widely used contextual dissimilarity measure (CDM) [30, 31], which takes into account the local distribution of the vectors and iteratively estimates distance update terms in the spirit of Sinkhorn's scaling algorithm, thereby modifying the neighborhood structures.

The performance of the different methods is compared in Figure 5. Figure 5(a) shows the ROC curves of the original cosine similarity and of its improved versions by three contextual similarity learning algorithms on the ASTRAL 1.73 [48] 95% dataset, with different numbers of proteins returned for each query. It can be seen from Figure 5(a) that the TPR of all the methods increases as the FPR grows. This is because, with the number of queries fixed, when the number k of proteins returned per query is very small, the returned proteins are not enough to "represent" the class features of the query, which causes a low TPR; meanwhile, most of the returned proteins are highly likely to belong to the same class as the query, resulting in a low FPR. Moreover, the TPR is almost 100% when the FPR > 50%. The ROC curve of ProDis-ContSHC completely encloses the ROC curves of the other three methods, which implies that ProDis-ContSHC is the best of the four and that supervised learning outperforms unsupervised learning for this purpose. ProDis-ContHC is the second best method, which demonstrates the contribution of the hierarchical sub-context idea to traditional contextual dissimilarity measures. The overall AUC results are listed in Table 1, from which similar conclusions can be drawn. Notably, the AUC of ProDis-ContSHC is very close to 1, meaning that ProDis-ContSHC works almost perfectly on this dataset. We further compare the four methods by their precision-recall curves, shown in Figure 5(b). The proposed contextual similarity learning algorithms significantly outperform the traditional methods, and ProDis-ContSHC is again consistently the best of the four.

Figure 5 Performance of similarity measures on the ASTRAL 1.73 90% dataset. (a) The ROC curves of the original similarity measure and of the measures improved by ProDis-ContSHC, ProDis-ContHC, and CDM, respectively. (b) The corresponding precision-recall curves.

Table 1 Performance of different retrieval methods on the ASTRAL 1.73 dataset

Regarding efficiency, in this experiment the learning time of ProDis-ContSHC is longer than that of ProDis-ContHC and the CDM. This is because, in each iteration of the learning algorithm, a quadratic programming problem with many training protein pairs has to be solved to train the SVM. In addition, the computation of the regularization factor in the supervised similarity learning algorithm requires more function evaluations.

We also compare the proposed algorithms with seven other protein retrieval methods, i.e., IR Tableau [26], tableau search [56], QP tableau [49], Yakusa [57], SHEBA [58], VAST [59, 60], and TOPS [61, 62]. The overall AUC values are shown in Table 1. It can be concluded that the tableau feature-based methods do not always achieve better performance than the other methods; tableau search, for example, is outperformed by several of them. Among the existing tableau feature-based methods, IR Tableau performs best, while Yakusa and SHEBA show comparable performance. As seen in Table 1, the AUC of the proposed algorithms is clearly better than that of all the other methods.

Improving different similarity measures via contextual dissimilarity learning algorithms

To further evaluate the robustness of our method, we test the behavior of ProDis-ContSHC and the other contextual similarity learning algorithms on different similarity measures. A group of experiments is conducted on the ASTRAL 1.73 95% dataset with the following similarity measures:

  • The cosine similarity [27] as introduced in the previous section.

  • The Jaccard index [28]: it is defined as the size of the intersection divided by the size of the union of two sets, i.e.,

    $$ J(x_i, x_j) = \frac{|x_i \cap x_j|}{|x_i \cup x_j|} $$
    (17)
  • The Tanimoto coefficient [29]: it is a generalization of the Jaccard index, defined as

    $$ T(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\|^2 + \|x_j\|^2 - x_i \cdot x_j} $$
    (18)
  • The squared Euclidean distance [22]: another means of measuring the proximity of a pair of proteins, computed as

    $$ d_{ij} = (x_i - x_j) \cdot (x_i - x_j) = \sum_m \left( x_i(m) - x_j(m) \right)^2 $$
    (19)

where x_i(m) is the m-th element of the vector x_i. A sketch of these four base measures is given after this list.
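The following sketch implements the four base measures; the function names are ours, and the Jaccard index is written for binary feature vectors:

```python
import numpy as np

def cosine(xi, xj):                       # Eq. (14)
    return xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))

def jaccard(xi, xj):                      # Eq. (17), binary feature vectors
    xi, xj = xi.astype(bool), xj.astype(bool)
    return (xi & xj).sum() / (xi | xj).sum()

def tanimoto(xi, xj):                     # Eq. (18), real-valued extension
    dot = xi @ xj
    return dot / (xi @ xi + xj @ xj - dot)

def squared_euclidean(xi, xj):            # Eq. (19)
    diff = xi - xj
    return diff @ diff
```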

ProDis-ContSHC, ProDis-ContHC, and the CDM algorithm are applied to improve each of these similarity measures. The AUC values of the corresponding retrieval systems are plotted in Figure 6. In general, improving the original similarity measure with ProDis-ContSHC leads to the largest improvement. The only exception is the Tanimoto coefficient, on which ProDis-ContSHC has a slightly lower AUC than ProDis-ContHC, but an AUC comparable to the CDM. One possible reason is that the supervised classifier fails to capture the real distribution of the contextual similarity in this case. ProDis-ContHC, in turn, also performs better than the CDM algorithm and the original similarity measures. This strongly suggests that our previous conclusions are valid and consistent: hierarchical sub-contextual information can remarkably improve traditional context-based similarity measures, and supervised learning can further improve the accuracy for most input similarity measures.

Figure 6 Performance of the similarity learning algorithms on different base measures on the ASTRAL 1.73 90% dataset. The four base measures tested are the cosine similarity [27], the Jaccard index [28], the Tanimoto coefficient [29], and the Euclidean distance [22].

Results on FSSP/DALI dataset

Unlike the previous experiment, which used a similarity measure, here we use the Euclidean distance [22] between a pair of proteins as the baseline dissimilarity measure, as in [20, 21]. In this way, we can see how our algorithms work with both similarity and dissimilarity measures. For a query protein x_0, the pairwise Euclidean distances d_{0i}, i = 1, 2, ..., N, are ranked, and the top k proteins are returned as the retrieval results. To evaluate the performance of the proposed algorithms, we test them on both the protein retrieval and the protein classification tasks, following [20, 21].

Performance on protein retrieval

The effectiveness of the proposed dissimilarity learning algorithm is first evaluated on the protein retrieval task. In this case, each protein x_i ∈ X of the dataset is used as a query x_0, and the retrieved proteins are ranked according to their shape dissimilarity d_{0j} to the query, where j = 1, 2, ..., i-1, i+1, ..., N. We again use the precision-recall curve to assess the performance of the proposed methods, where precision is the proportion of the retrieved proteins that are relevant to the query and recall is the proportion of the relevant proteins in the entire dataset that are retrieved.

To test the robustness and consistency of our methods, we apply them to three different protein descriptor vectors, i.e., Daras et al.'s FT02, Kraw00, and FT02&Kraw00 [20, 21] geometric descriptor vectors. We also apply the unsupervised version of our algorithm, ProDis-ContHC, and the CDM algorithm to the same dissimilarity measure and the same descriptor vectors for comparison with ProDis-ContSHC. Figure 7 shows the precision-recall curves for the different algorithms on the different protein descriptor vectors. As mentioned in [20, 21], there is always a tradeoff between the precision and recall values. This is clearly shown in Figure 7(a), (b), and (c), in which the algorithms reach their peak precision values at the smallest recall values. ProDis-ContSHC performs clearly better than every other method, with ProDis-ContHC second best. This is consistent with what was observed in the previous experiment, in which a similarity measure was used; our algorithms can thus consistently improve any similarity/dissimilarity measure. Among the three protein descriptor vectors, ProDis-ContSHC performs best on the combined vector, Kraw00&FT02, because the combined vector carries complementary information from both descriptors for predicting the relationship between the query and the database proteins.

Figure 7 Performance of dissimilarity measures on the FSSP/DALI dataset. (a) The precision-recall curves of the original dissimilarity measure and of the measures improved by ProDis-ContSHC, ProDis-ContHC, and CDM, respectively, with the descriptor vector FT02&Kraw00; (b) the precision-recall curves with the descriptor vector FT02; (c) the precision-recall curves with the descriptor vector Kraw00.

Performance on protein classification

The performance of the method is also evaluated in terms of the overall classification accuracy [20, 21]. More specifically, each protein x_i in the database is used in turn as the query x_0, a dissimilarity measure is applied after removing that protein from the database (a "leave-one-out" experiment [63]), and a class label l_0 is assigned to the query according to the label of the nearest database protein. The overall classification accuracy is given by:

$$ \mathrm{Overall\ Classification\ Accuracy} = \frac{\text{Number of correctly predicted proteins}}{\text{Total number of proteins in the database}} $$
(20)
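A compact sketch of this leave-one-out nearest-neighbor evaluation (Eq. 20), under our own naming, is:

```python
import numpy as np

def loo_nn_accuracy(D, labels):
    """Leave-one-out 1-NN classification accuracy (Eq. 20): each protein is
    queried against the rest and takes the label of its nearest neighbor
    under the (learned) dissimilarity matrix D."""
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)        # remove the query from the database
    labels = np.asarray(labels)
    nearest = np.argmin(D, axis=1)     # nearest database protein per query
    return np.mean(labels[nearest] == labels)
```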

We again conduct this experiment with the three descriptor vectors, i.e., FT02, Kraw00, and FT02&Kraw00. The overall classification accuracies are shown in Table 2. ProDis-ContSHC achieves an accuracy consistently higher than 99% on all three descriptor vectors, and each dissimilarity measure achieves its highest accuracy on Kraw00&FT02. Among the four dissimilarity measures, ProDis-ContSHC has the highest accuracy, with ProDis-ContHC second best. The conclusion has therefore been demonstrated for both similarity and dissimilarity measures, on different datasets, and with different descriptor vectors.

Table 2 Overall classification accuracy using different protein descriptors and the Euclidean distance measure

Conclusions

We have introduced in this paper a novel contextual dissimilarity learning algorithm for protein-protein comparison in protein database retrieval tasks. Its strength resides in the use of the hierarchical context between a pair of proteins and of the class label information. Extensive experiments have demonstrated that this novel algorithm outperforms both the traditional context-based methods and its unsupervised counterpart.

We formulate the protein dissimilarity learning problem as a context-based classification problem. Under this formulation, we regularize the protein pairwise dissimilarity in a supervised way rather than in the traditional unsupervised way. To the best of our knowledge, this is the first study on supervised contextual dissimilarity learning. We propose a novel algorithm, ProDis-ContSHC, which updates a protein's hierarchical sub-contexts and the dissimilarity measure coherently. The regularization factors are learned based on the classification of relevant and irrelevant protein pairs, and the algorithm works in an iterative manner.

Experimental results demonstrate that the supervised methods are almost always better than their unsupervised counterparts on all the databases and all the feature vectors. The proposed method, though presented here mainly for protein database retrieval tasks, can easily be extended to other tasks, such as RNA sequence-structure pattern indexing [64], retrieval of high-throughput phenotype data [65], and retrieval of genomic annotations from large genomic position datasets [66]. The approach may also be extended to database retrieval and pattern classification problems in other domains, such as medical image retrieval [67-69], speech recognition, and texture classification [70].