We use the following notation throughout this paper: let \(\mathbf {c} \equiv (c_1, \ldots , c_{d(\mathbf {c})})^{\top }\) be a feature vector of a compound and \(\mathbf {p} \equiv (p_1, \ldots , p_{d(\mathbf {p})})^{\top }\) be a feature vector of a protein, where \(d(\mathbf {c})\) and \(d(\mathbf {p})\) are the numbers of dimensions of vectors \(\mathbf {c}\) and \(\mathbf {p}\), respectively.
The ranking prediction model \(f\) in learning-to-rank is represented as \(f(\mathbf {x}) = f(\varvec {\Phi }(\mathbf {c}, \mathbf {p}))\), where \(\mathbf {x} \equiv \varvec {\Phi }(\mathbf {c}, \mathbf {p})\) is the input feature vector and \(\varvec {\Phi }\) is a feature map. In this section, we explain the method proposed by Zhang et al. [7], which uses the tensor product as \(\varvec {\Phi }\), as well as the proposed method PKRank, a learning-to-rank-based VS method using a pairwise kernel. As shown below, the former is a special case of the latter.
Previously proposed method (tensor product)
Zhang et al. [7] introduced the tensor product as feature map \(\varvec {\Phi }\) as follows:
$$\varvec {\Phi }(\mathbf {c}, \mathbf {p})=\mathbf {c}\otimes \mathbf {p},$$
(1)
where \(\otimes\) is the tensor product operator. If \(\mathbf {c}\) is a \(d(\mathbf {c})\)-dimensional feature vector and \(\mathbf {p}\) is a \(d(\mathbf {p})\)-dimensional feature vector, then \(\varvec {\Phi }(\mathbf {c}, \mathbf {p})=\mathbf {c}\otimes \mathbf {p}\) is a \(d(\mathbf {c})\times d(\mathbf {p})\)-dimensional feature vector. Zhang et al. used the general descriptor [11] (GD, 32 dimensions) as the compound feature vector \(\mathbf {c}\), and the composition, transition, and distribution feature [12] (CTD, 147 dimensions) as the protein feature vector \(\mathbf {p}\). Hence, they used a 4,704-dimensional feature vector as input to the ranking prediction model \(f\). GD and CTD represent the physicochemical properties of a compound and a protein, respectively.
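For concreteness, the following is a minimal NumPy sketch of Eq. (1); the random vectors merely stand in for GD and CTD descriptors and are not real data.

```python
# A minimal sketch of Eq. (1): the tensor (outer) product of a compound
# feature vector and a protein feature vector, flattened into a single
# input feature vector. The dimensions follow GD (32) and CTD (147);
# the random values are placeholders, not real descriptors.
import numpy as np

rng = np.random.default_rng(0)
c = rng.random(32)   # compound feature vector (GD)
p = rng.random(147)  # protein feature vector (CTD)

x = np.outer(c, p).ravel()  # Phi(c, p) = c (tensor) p
print(x.shape)              # (4704,)
```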
Proposed method (PKRank)
The pairwise kernel [9] was originally proposed in the context of the drug–target interaction problem [8]. The pairwise kernel \(k: (\mathbb {R}^{d(\mathbf {c})} \times \mathbb {R}^{d(\mathbf {p})}) \times (\mathbb {R}^{d(\mathbf {c})} \times \mathbb {R}^{d(\mathbf {p})}) \rightarrow \mathbb {R}\) is defined between two compound–protein pairs \((\mathbf {c}, \mathbf {p})\) and \((\mathbf {c}', \mathbf {p}')\) as follows:
$$k((\mathbf {c}, \mathbf {p}),(\mathbf {c}', \mathbf {p}'))$$
(2)
$$=\varvec {\Phi }(\mathbf {c}, \mathbf {p})^{\top } \varvec {\Phi }(\mathbf {c}', \mathbf {p}'),$$
(3)
$$=(\varvec {\Phi }_{\text {com}}(\mathbf {c})\otimes \varvec {\Phi }_{\text {pro}}(\mathbf {p}))^{\top }(\varvec {\Phi }_{\text {com}}(\mathbf {c}')\otimes \varvec {\Phi }_{\text {pro}}(\mathbf {p}')),$$
(4)
$$=(\varvec {\Phi }_{\text {com}}(\mathbf {c})^{\top } \varvec {\Phi }_{\text {com}}(\mathbf {c}'))\times (\varvec {\Phi }_{\text {pro}}(\mathbf {p})^{\top } \varvec {\Phi }_{\text {pro}}(\mathbf {p}')),$$
(5)
$$=k_{\text {com}}(\mathbf {c}, \mathbf {c}') \times k_{\text {pro}}(\mathbf {p}, \mathbf {p}'),$$
(6)
where \(k_{\text {com}}: \mathbb {R}^{d(\mathbf {c})} \times \mathbb {R}^{d(\mathbf {c})} \rightarrow \mathbb {R}\) is a compound kernel between two compounds, and \(k_{\text {pro}}: \mathbb {R}^{d(\mathbf {p})} \times \mathbb {R}^{d(\mathbf {p})} \rightarrow \mathbb {R}\) is a protein kernel between two proteins.
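The factorization in Eqs. (3)–(6) can be checked numerically. The sketch below, using placeholder vectors, evaluates the pairwise kernel as the product \(k_{\text {com}} \times k_{\text {pro}}\) and compares it with the inner product of explicit tensor-product features for the linear case.

```python
# A minimal check of Eqs. (3)-(6): with linear compound and protein
# kernels, the product k_com * k_pro equals the inner product of the
# explicit tensor-product features. Vectors are random placeholders.
import numpy as np

def pairwise_kernel(c, p, c2, p2, k_com, k_pro):
    # Eq. (6): k((c, p), (c', p')) = k_com(c, c') * k_pro(p, p')
    return k_com(c, c2) * k_pro(p, p2)

linear = lambda a, b: a @ b

rng = np.random.default_rng(1)
c, c2 = rng.random(32), rng.random(32)
p, p2 = rng.random(147), rng.random(147)

lhs = pairwise_kernel(c, p, c2, p2, linear, linear)
rhs = np.outer(c, p).ravel() @ np.outer(c2, p2).ravel()
print(np.isclose(lhs, rhs))  # True
```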
RankSVM [13] is a learning-to-rank model based on a pairwise approach using SVM. Zhang et al. compared several learning-to-rank prediction models and found that RankSVM performed best. Like SVM, RankSVM can be kernelized; thus, it can be used with the pairwise kernel.
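As an illustration of the pairwise approach (a sketch, not the implementation used in this paper), a linear RankSVM can be emulated by training a standard linear SVM on difference vectors of preference pairs:

```python
# A minimal sketch of the pairwise transform behind linear RankSVM:
# each preference "item i ranks above item j" becomes a binary example
# with feature x_i - x_j. This is the linear variant; PKRank instead
# uses the kernelized form with a pairwise-kernel Gram matrix.
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X, y):
    diffs, labels = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:  # i is more active than j
                diffs.append(X[i] - X[j]); labels.append(+1)
                diffs.append(X[j] - X[i]); labels.append(-1)
    return np.asarray(diffs), np.asarray(labels)

rng = np.random.default_rng(2)
X = rng.random((20, 50))  # placeholder pair features Phi(c, p)
y = rng.random(20)        # placeholder activity values
X_diff, y_diff = pairwise_transform(X, y)
model = LinearSVC().fit(X_diff, y_diff)
scores = X @ model.coef_.ravel()  # higher score = ranked higher
```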
Our proposed PKRank is a learning-to-rank method that combines the pairwise kernel with RankSVM. Training PKRank involves two steps: (1) the Gram matrix K of the pairwise kernel is generated; (2) RankSVM is trained on this Gram matrix together with the order of activity of compounds against the target proteins. PKRank requires \(k_{\text {com}}\) and \(k_{\text {pro}}\) to generate the Gram matrix K. Figure 1 shows an overview of the training of PKRank, and Fig. 2 shows an overview of the generation of the Gram matrix K.
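Following Eq. (6), the Gram matrix K can be assembled from precomputed compound and protein Gram matrices. In the sketch below, the index arrays mapping each training pair to its compound and protein are assumed bookkeeping, not part of the original description.

```python
# A minimal sketch of generating the pairwise-kernel Gram matrix K
# (cf. Fig. 2) from Eq. (6). K_com and K_pro are Gram matrices over the
# distinct compounds and proteins; ci[a] and pi[a] give the compound and
# protein index of training pair a (assumed bookkeeping arrays).
import numpy as np

def pairwise_gram(K_com, K_pro, ci, pi):
    # K[a, b] = k_com(c_a, c_b) * k_pro(p_a, p_b)
    return K_com[np.ix_(ci, ci)] * K_pro[np.ix_(pi, pi)]

# Example: 3 compounds, 2 proteins, 4 training pairs.
K_com = np.eye(3) + 0.1     # placeholder compound Gram matrix
K_pro = np.eye(2) + 0.2     # placeholder protein Gram matrix
ci = np.array([0, 1, 2, 0])
pi = np.array([0, 0, 1, 1])
K = pairwise_gram(K_com, K_pro, ci, pi)  # shape (4, 4)
```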
If both \(k_{\text {com}}\) and \(k_{\text {pro}}\) are linear kernels \(k(\mathbf {x},\mathbf {x}') = \varvec {\Phi }(\mathbf {x})^{\top } \varvec {\Phi }(\mathbf {x}') \equiv \mathbf {x}^{\top }\mathbf {x}'\), then \(\varvec {\Phi }_{\text {com}}(\mathbf {c}) = \mathbf {c}\) and \(\varvec {\Phi }_{\text {pro}}(\mathbf {p}) = \mathbf {p}\) from (5) and (6), and hence \(\varvec {\Phi }(\mathbf {c},\mathbf {p}) = \mathbf {c}\otimes \mathbf {p}\) from (3) and (4). This is exactly the tensor product method; hence, that method is a special case of PKRank.
PKRank has several advantages: (a) The tensor product method cannot handle high-dimensional feature vectors, because the tensor product itself is high-dimensional [as described above, \(\mathbf {c}\otimes \mathbf {p}\) is a \(d(\mathbf {c})\times d(\mathbf {p})\)-dimensional feature vector]; PKRank can, because the pairwise kernel uses the compound kernel \(k_{\text {com}}\) and the protein kernel \(k_{\text {pro}}\) rather than an explicit feature map \(\varvec {\Phi }(\mathbf {c}, \mathbf {p})\). (b) In the tensor product method, the compound kernel \(k_{\text {com}}\) and the protein kernel \(k_{\text {pro}}\) are fixed to the linear kernel, as described above; PKRank can use any kernel function. (c) PKRank can use similarity measures directly for prediction: the pairwise kernel method needs only the Gram matrix, whose elements represent similarities between compounds or between proteins, so an explicit feature vector representation \(\varvec {\Phi }_{\text {com}}\) of a compound or \(\varvec {\Phi }_{\text {pro}}\) of a protein is not always required.
To make full use of advantage (a) of PKRank, we introduce extended-connectivity fingerprints [14] (ECFP4, 2,048 dimensions) as a compound feature vector; ECFP4 is a topological fingerprint representing the presence of substructures. The tensor product method cannot handle ECFP4 because of its large dimensionality (with ECFP4 and CTD, the input to the ranking prediction model would be a 301,056-dimensional feature vector). With regard to advantage (b), we introduce a polynomial kernel, a radial basis function (RBF) kernel, and the Tanimoto kernel [15] for the compound kernel \(k_{\text {com}}\), and a polynomial kernel and an RBF kernel for the protein kernel \(k_{\text {pro}}\). To exploit advantage (c), we introduce the normalized Smith–Waterman score (nSW) [16] for the protein kernel \(k_{\text {pro}}\), a normalized local alignment score between sequences. The nSW is computed from two amino acid sequences alone and measures the similarity between proteins; thus, it can be used directly as the protein kernel \(k_{\text {pro}}\).
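For reference, ECFP4 can be generated with common cheminformatics toolkits; a sketch using RDKit (a Morgan fingerprint with radius 2 and 2,048 bits, which corresponds to ECFP4) is shown below. The SMILES string is a placeholder, and the paper does not prescribe a specific toolkit.

```python
# A minimal sketch of computing an ECFP4-style fingerprint with RDKit:
# a Morgan fingerprint with radius 2 folded into 2,048 bits. The SMILES
# string is only an example molecule.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

c = np.zeros(2048, dtype=np.int8)      # binary compound feature vector
DataStructs.ConvertToNumpyArray(fp, c)
```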
Kernels
We use a linear kernel, a polynomial kernel, an RBF kernel, and the Tanimoto kernel for the compound kernel \(k_{\text {com}}\), and a linear kernel, an RBF kernel, and the normalized Smith–Waterman score (nSW) for the protein kernel \(k_{\text {pro}}\). Here, we explain these kernels; a sketch implementing them follows the list.
- A linear kernel between two feature vectors \(\mathbf {x}, \mathbf {x}'\) is
$$k(\mathbf {x}, \mathbf {x}')=\mathbf {x}^{\top }\mathbf {x}' .$$
(7)
As noted above, if both \(k_{\text {com}}\) and \(k_{\text {pro}}\) are linear kernels, PKRank is equivalent to the tensor product method.
- A polynomial kernel between two feature vectors \(\mathbf {x}, \mathbf {x}'\) is
$$k(\mathbf {x}, \mathbf {x}')=(\mathbf {x}^{\top }\mathbf {x}'+1)^{z}.$$
(8)
It has a hyper-parameter \(z\); its tuning procedure is explained in Sect. 3.4.
- An RBF kernel between two feature vectors \(\mathbf {x}, \mathbf {x}'\) is
$$k(\mathbf {x}, \mathbf {x}')=\exp (-\gamma \Vert \mathbf {x}-\mathbf {x}'\Vert ^2).$$
(9)
The RBF kernel is widely used in kernel-based machine learning. It has a hyper-parameter \(\gamma\); its tuning procedure is explained in Sect. 3.4.
- The Tanimoto kernel between two binary vectors \(\mathbf {x}, \mathbf {x}'\) is
$$k(\mathbf {x}, \mathbf {x}')=\frac{\mathbf {x}^{\top }\mathbf {x}'}{\mathbf {x}^{\top }\mathbf {x}+{\mathbf {x}'}^{\top }\mathbf {x}'-\mathbf {x}^{\top }\mathbf {x}'}.$$
(10)
The Tanimoto coefficient measures the similarity between compounds represented by binary features, and it is used as a compound kernel in the drug–target interaction problem. We use the Tanimoto kernel only with ECFP4, which is a binary feature.
- The normalized Smith–Waterman score (nSW) between two proteins \(seq, seq'\) is
$$k(seq, seq') =\frac{\text {SW}(seq, seq')}{\sqrt{\text {SW}(seq, seq)}\sqrt{\text {SW}(seq', seq')}}, $$
(11)
where \(\text {SW}(\cdot ,\cdot )\) is the Smith–Waterman score, which is a local alignment score between amino acid sequences.
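The following sketch implements the kernels above in Python. The hyper-parameter values are placeholders (their tuning is described in Sect. 3.4), and the nSW function assumes Biopython's local aligner with a BLOSUM62 matrix and common gap penalties, which the text does not specify.

```python
# Minimal sketches of Eqs. (7)-(11). Hyper-parameter values and the
# alignment settings (BLOSUM62, gap penalties) are assumptions.
import numpy as np
from Bio.Align import PairwiseAligner, substitution_matrices

def linear_kernel(x, x2):
    return x @ x2                                   # Eq. (7)

def polynomial_kernel(x, x2, z=2):
    return (x @ x2 + 1.0) ** z                      # Eq. (8)

def rbf_kernel(x, x2, gamma=0.1):
    d = x - x2
    return np.exp(-gamma * (d @ d))                 # Eq. (9)

def tanimoto_kernel(x, x2):
    inner = x @ x2                                  # Eq. (10), binary vectors
    return inner / (x @ x + x2 @ x2 - inner)

_aligner = PairwiseAligner()
_aligner.mode = "local"                             # Smith-Waterman alignment
_aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
_aligner.open_gap_score = -10.0                     # assumed gap penalties
_aligner.extend_gap_score = -0.5

def nsw_kernel(seq, seq2):
    sw = _aligner.score                             # Eq. (11)
    return sw(seq, seq2) / np.sqrt(sw(seq, seq) * sw(seq2, seq2))
```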
Table 1 Three datasets used as benchmarks in this study (PDE, CTS, ADOR)