Background

DNA-binding proteins (DBPs) play a vital role in various biomolecular processes, including DNA transcription and replication. To detect DNA-binding proteins via biological assays, researchers have usually employed the electrophoretic mobility shift assay, chromatin immunoprecipitation, the yeast one-hybrid (Y1H) system and X-ray crystallography. However, these methods are still time-consuming and extremely expensive. Machine learning-based methods have therefore been developed to solve the problem of detecting DNA-binding proteins [1,2,3].

In the identification of DNA-binding proteins, the main task is to determine whether an unknown protein can bind to DNA. In previous work, many researchers detected DBPs on the basis of structural information. Nimrod et al. [4] constructed a random forest prediction model for DNA-binding protein recognition using the average surface electrostatic potential, dipole moment and amino acid conservation patterns; Bhardwaj et al. [5] used overall charge, surface patches and composition features to train a predictive model via the Support Vector Machine (SVM) [6]. Ahmad et al. [7] trained a neural network model to predict DBPs, using protein features that included the net charge of the protein, the electric dipole moment and the quadrupole moment tensor.

The number of known protein sequences is far larger than the number of known protein structures: only a small fraction of proteins have relevant structural information, and most do not. Therefore, structure-based models cannot be widely used to detect DBPs. A sequence-based method [8] constructed a Support Vector Machine (SVM) model with amino acid composition and physicochemical property information. Liu and Cai et al. [9,10,11] extracted overall amino acid composition and Pseudo Amino Acid Composition (PseAAC) to represent protein features. Liu et al. [12] developed a model called iDNAPro-PseAAC, which extends PseAAC with evolutionary information from the protein sequence. Kumar et al. [13] used the Position Specific Scoring Matrix (PSSM) to propose an SVM-based classifier called DNAbinder. The PSSM is produced by the PSI-BLAST software [14] and captures evolutionary conservation information. Local-DPP [1] captured local conservation information from the PSSM and trained an ensemble model to predict DBPs. DBPPred [15] employed Random Forest (RF) to obtain the optimal feature subset and trained a Gaussian Naive Bayes model for predicting DBPs. Zou et al. utilized a Fuzzy Kernel Ridge Regression model with Multi-View Sequence Features (FKRR-MVSF) [16] to predict DBPs. To further improve the accuracy of DBP prediction, Ding et al. [17] employed a Multi-Kernel SVM based on Heuristic Kernel Alignment (MKSVM-HKA) to integrate different features from protein sequences. In addition, a multiple kernel-based fuzzy SVM model [18] for DNA-binding proteins was also developed to improve prediction performance. Liu et al. [19] proposed a stacking framework model, named MSFBinder, for predicting DBPs by orchestrating multi-view features. Rahman et al. [20] developed a DNA-binding Protein Prediction model using Chou's general PseAAC (DPP-PseAAC) and an SVM-based Recursive Feature Elimination (RFE) approach. Adilina et al. [21] extracted several features via PseAAC and carried out two different types of feature selection to build a predictive model of DBPs.

In practical applications, sequence-based approaches are more adaptable. DNA methylation sites, recombination spots, Post-Translational Modification (PTM) sites and Protein-Protein Interactions (PPIs) have all been predicted by sequence-based methods. In recent years, machine learning methods have been widely used in bioinformatics [16, 17, 22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38], and some biological problems are solved very well, including O-GlcNAcylation sites [23], protein subcellular localization [25, 39, 40], methyladenosine sites [22, 26], drug-target interactions [27,28,29,30,31, 37, 41], drug-drug interactions [42, 43], lncRNA-protein interactions [35, 36], protein crystallization prediction [32, 44], potential disease-associated microRNAs [24, 33, 34, 45, 46] and other RNAs [47,48,49,50].

Inspired by previous work [1, 8, 9, 11, 13, 16, 17], we propose a new predictive model for DNA-binding proteins based on a multi-kernel support vector machine. Firstly, several types of features are extracted from protein sequences, and these features are employed to construct kernel matrices. We use the Multi-Kernel Learning based on Centered Kernel Alignment (MKL-CKA) algorithm to combine these kernels and obtain an integrated kernel for training the SVM model. We call this model the Multi-Kernel SVM (MKSVM) model. Finally, MKSVM is utilized to detect new DNA-binding proteins. Compared with other state-of-the-art models, the proposed method achieves better results. The accuracies of our model are \(84.19\%\) and \(83.7\%\) on the PDB1075 (leave-one-out test) and PDB186 (independent test) data sets, respectively.

Results

In this section, we test our method on the PDB1075 and PDB186 data sets. Firstly, we perform Leave-One-Out Cross Validation (LOOCV) on PDB1075. Next, our model is trained on PDB1075 and tested on PDB186. Other existing methods are also tested on PDB1075 and PDB186. The data sets and source code (written in Python) are available at https://figshare.com/s/cf56cef6659c7eed16c9.

Data sets

The details of the PDB1075 and PDB186 data sets are listed in Table 1. The benchmark data sets (PDB1075 and PDB186) are selected from the Protein Data Bank (PDB) [51]. No two sequences share more than \(25\%\) similarity. Protein sequences shorter than 50 amino acids or containing the character ‘X’ were removed. The PDB1075 data set (constructed by Liu et al. [9]) is used to test our model under LOOCV. The PDB186 data set (constructed by Lou et al. [15]) is used for independent testing.

Table 1 Details of the two benchmark data sets
Table 2 The ACC of different parameter values on PDB1075 (five-fold cross validation)

Measurements

The main measures for performance evaluation are Accuracy (ACC), Matthews Correlation Coefficient (MCC), Sensitivity (SN), Specificity (SP) and the Area Under the ROC Curve (AUC). ACC, SN, SP and MCC are calculated as follows:

$$\begin{aligned} ACC&=\frac{TP+TN}{TP+FP+TN+FN} \end{aligned}$$
(1a)
$$\begin{aligned} SN&=\frac{TP}{TP+FN} \end{aligned}$$
(1b)
$$\begin{aligned} SP&=\frac{TN}{TN+FP} \end{aligned}$$
(1c)
$$\begin{aligned} MCC&=\frac{TP \times TN - FP \times FN}{\sqrt{(TP+FN)\times (TN+FP) \times (TP+FP) \times (TN+FN)}} \end{aligned}$$
(1d)

where TP is the number of correctly predicted positive samples, TN is the number of correctly predicted negative samples, FN is the number of false negatives and FP is the number of false positives. The Area Under the Curve (AUC) is obtained by calculating the area under the Receiver Operating Characteristic (ROC) curve. The higher the AUC, the better the predictive performance.
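As a minimal sketch (not the authors' code), the metrics in Eqs. (1a)–(1d) can be computed directly from the confusion-matrix counts:

```python
# Minimal sketch of Eqs. (1a)-(1d): ACC, SN, SP and MCC from TP, TN, FP, FN counts.
import math

def evaluate(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + fp + tn + fn)                      # Eq. (1a)
    sn = tp / (tp + fn)                                        # Eq. (1b), sensitivity
    sp = tn / (tn + fp)                                        # Eq. (1c), specificity
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0        # Eq. (1d)
    return acc, sn, sp, mcc
```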

Parameters selection

To achieve the best performance, we need to select the optimal parameters of the predictive model. In this section, we employ the grid-search method to select the optimal parameters for the SVM model.

Parameter selection for features

To select the optimal parameters of the NMBAC and PsePSSM features, we test different values of lg (NMBAC) and \(lag_{max}\) (PsePSSM) under five-fold cross validation on the PDB1075 data set. We set the range of lg and \(lag_{max}\) from 5 to 45 (in steps of 5). The prediction results in Table 2 show that the optimal lg (NMBAC) is 30 and the optimal \(lag_{max}\) (PsePSSM) is 10 in this study.
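A possible sketch of this parameter scan is shown below; here extract_feature stands for a hypothetical descriptor-extraction routine (e.g. NMBAC with a given lg, or PsePSSM with a given \(lag_{max}\)), not a function from our released code.

```python
# Hedged sketch: scan one feature parameter from 5 to 45 (step 5) under 5-fold CV
# and keep the value with the best mean accuracy. "extract_feature" is a hypothetical
# callable mapping (sequences, parameter value) -> feature matrix.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def scan_parameter(extract_feature, sequences, y, values=range(5, 50, 5)):
    scores = {}
    for v in values:
        X = extract_feature(sequences, v)          # hypothetical descriptor extraction
        scores[v] = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
    best = max(scores, key=scores.get)
    return best, scores
```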

Selection of C and \(\boldsymbol{\gamma}\)  

For the selection of the SVM parameters, we use the grid-search method with 5-fold Cross Validation (5-CV). We set the range of each parameter from \(2^{-5}\) to \(2^{5}\) with step \(2^{1}\). The optimal parameters are shown in Table 3.
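A hedged sketch of this grid search (using scikit-learn rather than our released code, with a synthetic feature matrix standing in for a real descriptor) is given below.

```python
# Hedged sketch: grid search over C and gamma in 2^-5 ... 2^5 with 5-fold CV.
# The synthetic data from make_classification is a placeholder for a real descriptor.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)  # placeholder data
param_grid = {
    "C": [2.0 ** p for p in range(-5, 6)],       # 2^-5, 2^-4, ..., 2^5
    "gamma": [2.0 ** p for p in range(-5, 6)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```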

Table 3 The optimal parameters for SVM (single kernel)
Table 4 The performance of different kernels (RBF kernel) on PDB1075 data set (leave one out)
Table 5 The weight of six kernels (RBF kernel) by MKL-CKA
Table 6 The sensitivity of different kernels (features) on PDB1075 data set (under the specificity of 0.5)

Before combining multiple kernels, the parameter \(\gamma\) of each of the 6 kernels is taken from its single-kernel model (Table 3). To find the optimal C for MKSVM (with an average weight for each kernel), we also use the above range of C. The ACC values for different C values are shown in Fig. 1. When \(C=2\) (\(\log _{2}C=1\)), MKSVM (average weight for each kernel) achieves the best ACC (\(82.8\%\)). In our study, the parameter C of MKSVM (with MKL-CKA) is the same as that of MKSVM with the mean-weighted kernel.

Fig. 1
figure 1

ACC values for different values of C on the PDB1075 data set (five-fold cross validation)

To obtain the optimal parameter \(\lambda\) of MKL-CKA, we try different values of \(\lambda\) from 0 to 1 (in steps of 0.05) under 5-CV on the PDB1075 data set. The results are shown in Fig. 2. When \(\lambda = 0.8\), the ACC value is the highest, so we set 0.8 as the optimal parameter \(\lambda\) of MKL-CKA.

Fig. 2
figure 2

ACC values for different values of \(\lambda\) on the PDB1075 data set (five-fold cross validation)

Performance analysis on PDB1075

We test the performance of different kernels (features) on PDB1075 (under LOOCV). The results are shown in Table 4 and Fig. 3.

Fig. 3
figure 3

ROC comparison of different kernels (features) under the leave-one-out test on the PDB1075 data set

As can be seen from the table, the results of multi-kernel learning are much better than those of the single-kernel models. The PSSM-AB (MCC: 0.547), PSSM-DWT (MCC: 0.522) and PsePSSM (MCC: 0.573) kernels, which use PSSM information, perform better than GE (MCC: 0.432), MCD (MCC: 0.417) and NMBAC (MCC: 0.424). We calculate the weights of the six kernels with the MKL-CKA method (Table 5). The integrated kernel (with MKL-CKA) achieves the highest ACC (\(84.2\%\)), MCC (0.684), SN (\(85.9\%\)), SP (\(82.6\%\)) and AUC (0.914). Clearly, the integrated kernel (with MKL-CKA) outperforms the mean-weighted kernel.

Under a specificity of 0.5 (on the PDB1075 data set), the sensitivity values of the different kernels are as follows: \({\mathbf {K}}_{GE}\): 0.8857, \({\mathbf {K}}_{MCD}\): 0.8495, \({\mathbf {K}}_{NMBAC}\): 0.8590, \({\mathbf {K}}_{PSSM-AB}\): 0.9352, \({\mathbf {K}}_{PsePSSM}\): 0.9657, \({\mathbf {K}}_{PSSM-DWT}\): 0.9523, mean-weighted kernel: 0.9847, and \({\mathbf {K}}_{MKL-CKA}\): 0.9885. Some kernels introduce bias into the learning process. MKL-CKA can filter noisy kernels (reducing kernel bias) by assigning them low weights, and the sensitivity of MKL-CKA (0.9885) is better than that of every single kernel. Although our MKL algorithm improves the sensitivity by only a few percentage points, the purpose of MKL is to filter noisy features (kernels) and integrate multiple effective features. Table 6 lists the sensitivity of the different kernels (features) on the PDB1075 data set (under a specificity of 0.5).

Table 7 The running time of different kernels (features) on PDB1075 data set (training)
Table 8 The performance of different kernel functions on PDB1075 data set (Five-fold cross validation)

We also evaluate the running time of the models with different kernels. The results are shown in Table 7. The programs are run on a computer with an Intel Core i5 3.2 GHz CPU and 8 GB of RAM. The running times (s) of our methods are \({\mathbf {K}}_{GE}\): 0.418, \({\mathbf {K}}_{MCD}\): 3.79, \({\mathbf {K}}_{NMBAC}\): 0.627, \({\mathbf {K}}_{PSSM-AB}\): 0.678, \({\mathbf {K}}_{PsePSSM}\): 3.7, \({\mathbf {K}}_{PSSM-DWT}\): 3.47, mean-weighted kernel: 28.7, and MKL-CKA: 68, respectively. Because multiple kernel matrices are calculated and the weight of each kernel matrix must be estimated, MKL-CKA is the most time-consuming.

Furthermore, other kernel functions (e.g. the linear, polynomial and sigmoid kernels) are also tested. We compare the RBF kernel with these 3 types of kernel functions under five-fold cross validation. The results are listed in Table 8, which shows that the RBF kernel obtains better ACC on GE (\(69.97\%\)), MCD (\(70.21\%\)), PSSM-AB (\(76.54\%\)), PSSM-DWT (\(76.26\%\)) and PsePSSM (\(78.36\%\)), respectively. MKL-CKA is also employed to combine the 6 features with each of the four kernel functions. The RBF kernel (with MKL-CKA) achieves the best ACC (\(83.01\%\)).

Comparison to existing predictors on PDB1075

Table 9 Compared with existing methods on PDB1075 data set (LOOCV)

The MKSVM (with MKL-CKA) model and other methods are also tested on the PDB1075 data set (under LOOCV). The ACC, MCC, SN and SP results are listed in Table 9. The existing methods include IDNA-Prot|dis [2], DNAbinder [13], iDNAPro-PseAAC [10], Kmer1+ACC [12], iDNA-Prot [52], DNA-Prot [53], PseDNA-Pro [9], MKSVM-HKA [17], MSFBinder [19], FKRR-MVSF [16] and Local-DPP [1]. Among these methods, MKSVM-HKA (MCC: 0.63), MSFBinder (MCC: 0.67), FKRR-MVSF (MCC: 0.67), iDNAPro-PseAAC (MCC: 0.53), PseDNA-Pro (MCC: 0.53), IDNA-Prot|dis (MCC: 0.54) and Local-DPP (MCC: 0.59) also obtain good performance. Local-DPP and iDNAPro-PseAAC take advantage of the PSSM feature to improve performance. MKSVM-HKA, FKRR-MVSF and MSFBinder employ an MKL algorithm and an ensemble strategy to integrate multiple sources of information and further improve the predictive accuracy. Our method (MKSVM with MKL-CKA) is also based on MKL and achieves the best MCC (0.68). Although the SP value of MSFBinder (\(83.09\%\)) is higher than that of our method (\(82.55\%\)), our method achieves the highest ACC (\(84.19\%\)), MCC (0.68) and SN (\(85.91\%\)).

Statistical significance tests of the differences are necessary. The results in Table 10 show that our method achieves statistically significant improvements over the other methods (P-value \(<0.05\), by t-test, in terms of MCC). The comparison is performed under 10-fold cross validation on PDB1075. The difference between Local-DPP and our method is significant (P-value: 6.0421E\(-\)6). Compared with MKSVM-HKA (P-value: 1.5438E\(-\)4), MSFBinder (P-value: 0.0098) and FKRR-MVSF (P-value: 0.0103), our method also shows significantly better prediction accuracy.

Table 10 The statistics of different methods
Table 11 The results of comparison between MKSVM (with MKL-CKA) model and other existing methods on PDB186 data set (independent test)

Independent test

To further evaluate the performance of the MKSVM (with MKL-CKA) model, we train the MKSVM model on PDB1075 and test it on the PDB186 data set. The comparison results are shown in Table 11.

Our method achieves \(83.7\%\), 0.691, \(93.6\%\), and \(74.2\%\) in ACC, MCC, SN and SP, respectively. The results of the independent test show that our method predicts DBPs reliably. Adilina's work (MCC: 0.670), MKSVM-HKA (MCC: 0.648), MSFBinder (MCC: 0.616) and FKRR-MVSF (MCC: 0.676) also obtain good results on PDB186. Adilina et al. [21] employed 7 types of features and a feature selection strategy to construct their predictive model. FKRR-MVSF [16] and MKSVM-HKA [17] utilized MKL algorithms to combine several features. MSFBinder [19] built a stacking framework model from multiple features. These multiple-information fusion-based methods achieved better results, and our method (MKSVM with MKL-CKA) performs better (MCC: 0.691) than most existing models on the PDB186 data set. These results show that the fusion of multiple sources of information can improve the performance of the prediction model. We also test the performance of Random Forest (RF) and a Feed-forward Neural Network (FNN) on PDB186. RF and FNN achieve MCCs of 0.593 and 0.520, respectively; SVM can achieve better performance on small data sets.

Discussion

How to describe and integrate protein information is the main difficulty in predicting DNA-binding proteins. In our study, MKL-CKA is utilized to integrate 6 types of features and achieves better results on the PDB1075 (MCC: 0.68) and PDB186 (MCC: 0.69) data sets. Other methods, such as FKRR-MVSF, MKSVM-HKA, MSFBinder and Adilina's work, also obtain good performance. We find that multiple-information fusion-based methods have better generalization performance in DBP prediction. To obtain the optimal kernel weights, MKL-CKA maximizes the alignment score between the feature space and the label space; the ideal kernel (label space) contains the category information of the training samples. The Laplacian smoothing term further regularizes the weight values. The performance of MKL-CKA (MCC: 0.684) is better than that of the mean-weighted kernel (MCC: 0.664) on PDB1075 (LOOCV). The process of MKL is similar to feature selection: MKL weights each kernel matrix (6 types of features), and whether a predictive model is based on MKL or on feature selection, the noisy features can be effectively filtered.

Conclusion

Although many models have been constructed to predict DBPs, they can still be optimized to improve accuracy. Existing methods do not consider the removal of outliers from the data sets. In the future, we will filter noisy samples and improve the predictive accuracy of DBP prediction using fuzzy theory and ensemble strategies.

Methods

DBP identification can be considered a traditional binary classification problem, and we use the SVM algorithm to construct the predictive model. First, we extract features of the protein from its sequence information. Six types of kernel matrices are constructed from these features. These kernels are then integrated into an optimal kernel (including training and testing kernels) by the Multi-Kernel Learning based on Centered Kernel Alignment (MKL-CKA) algorithm. We employ the combined kernel to build an SVM model and identify DBPs. Figure 4 shows the framework of MKSVM (with MKL-CKA). Firstly, six types of features are extracted from the protein sequences. Then, six kernels are built with the Radial Basis Function (RBF). The MKL-CKA algorithm combines the 6 kernels. Finally, we use the combined kernel and the SVM algorithm to construct the final predictive model to detect DBPs.

Fig. 4
figure 4

The framework of our method

Sequence feature

Six types of features are extracted from the protein sequence information, including PSSM-based Discrete Wavelet Transform (PSSM-DWT) [54], PSSM-based Average Blocks (PSSM-AB) [55], Pseudo-PSSM (PsePSSM) [10, 12, 56, 57], the Multi-scale Continuous and Discontinuous descriptor (MCD) [58], Global Encoding (GE) [59] and Normalized Moreau-Broto Autocorrelation (NMBAC) [60, 61]. These features have been described in detail in the related literature. We employ the RBF to construct six types of kernels. The RBF is defined as follows:

$$\begin{aligned} K_{ij}=K({\mathbf {x}}_{i},{\mathbf {x}}_{j}) = exp(-\gamma \Vert {\mathbf {x}}_{i} - {\mathbf {x}}_{j} \Vert ^{2}), \ i,j=1,2,...,N \end{aligned}$$
(2)

where \(\gamma\) is the kernel bandwidth. We can obtain a kernel set \({\mathbf {K}}\) as follows:

$$\begin{aligned} {\mathbf {K}}= \left\{ {\mathbf {K}}_{GE}, {\mathbf {K}}_{MCD}, {\mathbf {K}}_{NMBAC}, {\mathbf {K}}_{PSSM-AB}, {\mathbf {K}}_{PSSM-DWT}, {\mathbf {K}}_{PsePSSM} \right\} \end{aligned}$$
(3)
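As an illustration, the following minimal sketch builds the kernel set of Eqs. (2) and (3); the feature arrays and their dimensions are random placeholders for the real descriptors, and the \(\gamma\) values would come from Table 3 rather than the constants assumed here.

```python
# Sketch of Eqs. (2)-(3): one RBF kernel matrix per descriptor.
# Feature matrices and gamma values below are placeholders, not the real data.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
N = 100                                            # number of proteins (placeholder)
features = {
    "GE": rng.random((N, 120)),
    "MCD": rng.random((N, 300)),
    "NMBAC": rng.random((N, 210)),
    "PSSM-AB": rng.random((N, 400)),
    "PSSM-DWT": rng.random((N, 500)),
    "PsePSSM": rng.random((N, 220)),
}
gammas = {name: 1.0 for name in features}          # per-kernel gamma from Table 3 (assumed here)

kernels = {name: rbf_kernel(x, gamma=gammas[name]) for name, x in features.items()}  # Eq. (2)
```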

Support vector machine

The Support Vector Machine (SVM) is a classification algorithm developed by Vapnik [6]. By finding the optimal hyperplane, the data set is separated into positive and negative points. The instance-label pairs (training samples) are {\({\mathbf {x}}_{i},y_{i}\)}, with \({\mathbf {x}}_{i}\in {\mathbf {R}}^{d \times 1}\), \(i=1,2,...,N\) and labels \(y_{i}\in \{ +1,-1\}\). The decision function is defined as follows:

$$\begin{aligned} f({\mathbf {x}}) = sign[\sum _{i=1}^N y_{i}\alpha _{i}\cdot K({\mathbf {x}},{\mathbf {x}}_{i})+b] \end{aligned}$$
(4)

The coefficients \(\pmb {\alpha }\) are estimated by solving a Quadratic Programming (QP) problem:

$$\begin{aligned}&Maximize \quad \sum _{i=1}^N \alpha _{i} - \frac{1}{2}\sum _{i=1}^N \sum _{j=1}^N \alpha _{i}\alpha _{j}\cdot y_{i}y_{j}\cdot K({\mathbf {x}}_{i},{\mathbf {x}}_{j}) \end{aligned}$$
(5a)
$$\begin{aligned}&s.t. \quad 0 \le \alpha _{i} \le C \end{aligned}$$
(5b)
$$\begin{aligned}&\sum _{i=1}^N \alpha _{i}y_{i} = 0, i=1,2,...,N \end{aligned}$$
(5c)

\({\mathbf {x}}_{i}\) is a support vector when the corresponding \(\alpha _{i} > 0\). C denotes the tradeoff between the margin and the misclassification error. We build the SVM model with LIBSVM [62] (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and employ the grid-search method to obtain the optimal parameters of the SVM.
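The sketch below illustrates one way to train and apply an SVM on a precomputed kernel matrix (using scikit-learn's LIBSVM-backed SVC rather than the raw LIBSVM interface); the helper names are ours, not part of LIBSVM.

```python
# Hedged sketch: SVM on a precomputed kernel, as needed once the kernels of Eq. (3)
# are combined into a single training/testing kernel.
from sklearn.svm import SVC

def train_precomputed_svm(K_train, y_train, C=2.0):
    """K_train: N x N kernel over training proteins; y_train: labels in {+1, -1}."""
    clf = SVC(C=C, kernel="precomputed")
    clf.fit(K_train, y_train)
    return clf

def predict_precomputed_svm(clf, K_test_train):
    """K_test_train: M x N kernel between M test proteins and the N training proteins."""
    return clf.predict(K_test_train)
```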

Multiple kernel learning

Because of its strong theoretical guarantees and excellent experimental performance, the MKL-CKA [63, 64] method is adopted in our study. MKL-CKA is a multi-kernel learning algorithm based on kernel alignment. The optimal kernel is calculated as follows:

$$\begin{aligned}&{\mathbf {K}}^{*} = \sum _{i=1}^{m} \beta _{i} {\mathbf {K}}_{i}, \end{aligned}$$
(6a)
$$\begin{aligned}&{\mathbf {K}}_{i} \in {\mathbf {R}}^{N \times N}, \end{aligned}$$
(6b)
$$\begin{aligned}&\sum _{i=1}^{m} \beta _{i} = 1 \end{aligned}$$
(6c)

where m is the number of kernels and \(\beta _{i}\) is the weight of the kernel \({\mathbf {K}}_{i}\).

The kernel alignment value is defined as follows:

$$\begin{aligned} A({\mathbf {P}},{\mathbf {Q}}) = \frac{\left\langle {\mathbf {P}},{\mathbf {Q}} \right\rangle _{F}}{\Vert {\mathbf {P}} \Vert _{F} \Vert {\mathbf {Q}} \Vert _{F}} \end{aligned}$$
(7)

where \({\mathbf {P}}, {\mathbf {Q}} \in {\mathbf {R}}^{N \times N}\), \(\left\langle {\mathbf {P}},{\mathbf {Q}} \right\rangle _{F} = Trace({\mathbf {P}}^{T}{\mathbf {Q}})\) is the Frobenius inner product and \(\Vert {\mathbf {P}} \Vert _{F} = \sqrt{ \left\langle {\mathbf {P}},{\mathbf {P}} \right\rangle _{F}}\) is Frobenius norm.
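A direct, minimal transcription of Eq. (7) in Python (the function name is ours) is:

```python
# Eq. (7): kernel alignment, i.e. cosine similarity between two kernel matrices
# under the Frobenius inner product.
import numpy as np

def kernel_alignment(P, Q):
    inner = np.trace(P.T @ Q)                      # Frobenius inner product <P, Q>_F
    return inner / (np.linalg.norm(P, "fro") * np.linalg.norm(Q, "fro"))
```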

The kernel alignment score can be interpreted as the cosine similarity between two kernels: the higher the alignment score, the greater the similarity between the kernels. We want the alignment score between the combined kernel (feature space) and the ideal kernel (label space) to be high. The centered kernel alignment objective is therefore formulated as follows:

$$\begin{aligned}&\underset{\pmb {\beta } \ge 0}{\text{ max }} \quad CA({\mathbf {K}}^{*},{\mathbf {y}}_{train}{\mathbf {y}}_{train}^{T}) = \underset{\pmb {\beta } \ge 0}{\text{ max }}\quad \frac{\left\langle {\mathbf {U}}_{N}{\mathbf {K}}^{*}{\mathbf {U}}_{N},{\mathbf {y}}_{train}{\mathbf {y}}_{train}^{T} \right\rangle _{F}}{\Vert {\mathbf {U}}_{N}{\mathbf {K}}^{*}{\mathbf {U}}_{N} \Vert _{F} \Vert {\mathbf {y}}_{train}{\mathbf {y}}_{train}^{T} \Vert _{F}} \end{aligned}$$
(8a)
$$\begin{aligned}&s.t. \ {\mathbf {K}}^{*} = \sum _{i=1}^{m} \beta _{i}{\mathbf {K}}_{i}, \end{aligned}$$
(8b)
$$\begin{aligned}&\beta _{i} \ge 0, \ i = 1,2,...,m, \end{aligned}$$
(8c)
$$\begin{aligned}&\sum _{i=1}^{m} \beta _{i} = 1 \end{aligned}$$
(8d)

where \({\mathbf {U}}_{N} = {\mathbf {I}}_{N} - (1/N){\mathbf {l}}_{N}{\mathbf {l}}_{N}^{T}\), \({\mathbf {U}}_{N} \in {\mathbf {R}}^{N \times N}\) is the centering matrix, \({\mathbf {I}}_{N} \in {\mathbf {R}}^{N \times N}\) denotes the identity matrix and \({\mathbf {l}}_{N}\) is the all-ones vector. Equation (8) can therefore be written as follows:

$$\begin{aligned}&\underset{\mathbf {\pmb {\beta }} \ge 0}{\text{ max }} \quad \frac{\pmb {\beta }^{T}{\mathbf {a}}}{\sqrt{\pmb {\beta }^{T}{\mathbf {M}}\pmb {\beta }}} \end{aligned}$$
(9a)
$$\begin{aligned}&s.t. \ {\mathbf {K}}^{*} = \sum _{i=1}^{m} \beta _{i}{\mathbf {K}}_{i}, \end{aligned}$$
(9b)
$$\begin{aligned}&\beta _{i} \ge 0, \ i = 1,2,...,m, \end{aligned}$$
(9c)
$$\begin{aligned}&\sum _{i=1}^{m} \beta _{i} = 1 \end{aligned}$$
(9d)

In Eq. (9), \({\mathbf {a}}\in {\mathbf {R}}^{m \times 1}\) and \({\mathbf {M}}\in {\mathbf {R}}^{m \times m}\) are given by Eqs. (10) and (11).

$$\begin{aligned} \begin{aligned} {\mathbf {a}}&= \left( \left\langle {\mathbf {U}}_{N}{\mathbf {K}}_{1}{\mathbf {U}}_{N},{\mathbf {y}}_{train}{\mathbf {y}}_{train}^{T} \right\rangle _{F} ,...,\left\langle {\mathbf {U}}_{N}{\mathbf {K}}_{m}{\mathbf {U}}_{N},{\mathbf {y}}_{train}{\mathbf {y}}_{train}^{T} \right\rangle _{F} \right) ^{T} \in {\mathbf {R}}^{m \times 1} \end{aligned} \end{aligned}$$
(10)
$$\begin{aligned} {\mathbf {M}}&= \left[ \begin{array}{cccc} M_{1,1} &{} M_{1,2} &{} \cdots &{} M_{1,m} \\ M_{2,1} &{} M_{2,2} &{} \cdots &{} M_{2,m} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ M_{m,1} &{} M_{m,2} &{} \cdots &{} M_{m,m} \end{array} \right] _{m \times m} \end{aligned}$$
(11a)
$$\begin{aligned} M_{e,f}&= \left\langle {\mathbf {U}}_{N}{\mathbf {K}}_{e}{\mathbf {U}}_{N},{\mathbf {U}}_{N}{\mathbf {K}}_{f}{\mathbf {U}}_{N} \right\rangle _{F} \end{aligned}$$
(11b)
$$\begin{aligned} e,f&=1,2,...,m \end{aligned}$$
(11c)
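For clarity, a minimal sketch (assuming the m training kernels are available as NumPy arrays and the labels are \(\pm 1\)) of how \({\mathbf {a}}\) and \({\mathbf {M}}\) could be assembled is:

```python
# Sketch of Eqs. (10)-(11): vector a and matrix M from the centered training kernels
# and the ideal kernel y y^T.
import numpy as np

def centering_matrix(N):
    return np.eye(N) - np.ones((N, N)) / N                     # U_N = I_N - (1/N) l_N l_N^T

def build_a_and_M(kernels, y):
    N = len(y)
    U = centering_matrix(N)
    ideal = np.outer(y, y)                                     # ideal kernel y_train y_train^T
    centered = [U @ K @ U for K in kernels]
    a = np.array([np.trace(Kc.T @ ideal) for Kc in centered])                    # Eq. (10)
    M = np.array([[np.trace(Ke.T @ Kf) for Kf in centered] for Ke in centered])  # Eq. (11)
    return a, M
```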

Equation (9) can also be represented as:

$$\begin{aligned}&\underset{\mathbf {\beta } \ge 0}{\text{ min }} \quad \pmb {\beta }^{T}{\mathbf {M}}\pmb {\beta } - 2\pmb {\beta }^{T}{\mathbf {a}} \end{aligned}$$
(12a)
$$\begin{aligned}&s.t. \ {\mathbf {K}}^{*} = \sum _{i=1}^{m} \beta _{i}{\mathbf {K}}_{i}, \end{aligned}$$
(12b)
$$\begin{aligned}&\beta _{i} \ge 0, \ i = 1,2,...,m, \end{aligned}$$
(12c)
$$\begin{aligned}&\sum _{i=1}^{m} \beta _{i} = 1 \end{aligned}$$
(12d)

To prevent extreme situations (where the weight of one kernel is close to 1 and the remaining weights are close to 0), we employ a Laplacian regularization term to smooth the weights:

$$\begin{aligned} \begin{aligned} \sum _{i,j=1}^{m} (\beta _{i} - \beta _{j})^2 W_{ij}&= \sum _{i,j=1}^{m} (\beta _{i}^2 + \beta _{j}^2 - 2 \beta _{i} \beta _{j}) W_{ij}\\&= \sum _{i=1}^{m} \beta _{i}^2 D_{ii} + \sum _{j=1}^{m} \beta _{j}^2 D_{jj} - 2 \sum _{i,j=1}^{m} \beta _{i} \beta _{j} W_{ij}\\&= 2 \pmb {\beta }^{T} {\mathbf {L}} \pmb {\beta } \end{aligned} \end{aligned}$$
(13)

In Eq. (13), \(i,j=1,...,m\) and \({\mathbf {W}}\in {\mathbf {R}}^{m \times m}\) holds the cosine similarities between pairs of kernels, which can be calculated by Eq. (7). \({\mathbf {D}} \in {\mathbf {R}}^{m \times m}\) is a diagonal matrix with \(D_{ii} = \sum _{j=1}^{m} W_{ij}\), and \({\mathbf {L}}\in {\mathbf {R}}^{m \times m}\) is the graph Laplacian matrix, obtained as \({\mathbf {L}} = {\mathbf {D}} -{\mathbf {W}}\). Equations (12) and (13) are integrated as follows:

$$\begin{aligned}&\underset{\mathbf {\beta } \ge 0}{\text{ min }} \quad \pmb {\beta }^{T}{\mathbf {M}}\pmb {\beta } - 2\pmb {\beta }^{T}{\mathbf {a}} + \lambda \pmb {\beta }^{T} {\mathbf {L}} \pmb {\beta }=\underset{\mathbf {\beta } \ge 0}{\text{ min }} \quad \pmb {\beta }^{T}({\mathbf {M}}+ \lambda {\mathbf {L}})\pmb {\beta } - 2\pmb {\beta }^{T}{\mathbf {a}} \end{aligned}$$
(14a)
$$\begin{aligned}&s.t. \ {\mathbf {K}}^{*} = \sum _{i=1}^{m} \beta _{i}{\mathbf {K}}_{i}, \end{aligned}$$
(14b)
$$\begin{aligned}&\beta _{i} \ge 0, \ i = 1,2,...,m, \end{aligned}$$
(14c)
$$\begin{aligned}&\sum _{i=1}^{m} \beta _{i} = 1 \end{aligned}$$
(14d)

where \(\lambda\) is a hyperparameter of MKL-CKA. Finally, the weights are obtained according to Eq. (14), and we calculate the optimal kernel by Eq. (6a).
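As a concrete illustration, the following hedged sketch (not our exact implementation) builds the graph Laplacian from pairwise kernel alignments, solves Eq. (14) over the probability simplex with a general-purpose SLSQP solver (the choice of solver is an assumption), and forms the optimal kernel of Eq. (6a).

```python
# Hedged sketch of MKL-CKA weight estimation: Laplacian smoothing (Eq. (13)),
# the quadratic objective of Eq. (14), and the combined kernel of Eq. (6a).
import numpy as np
from scipy.optimize import minimize

def graph_laplacian(kernels):
    """L = D - W, where W_ij is the kernel alignment of Eq. (7) between kernels i and j."""
    m = len(kernels)
    W = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            W[i, j] = np.trace(kernels[i].T @ kernels[j]) / (
                np.linalg.norm(kernels[i], "fro") * np.linalg.norm(kernels[j], "fro"))
    D = np.diag(W.sum(axis=1))                                 # D_ii = sum_j W_ij
    return D - W

def mkl_cka_weights(a, M, L, lam=0.8):
    """Minimize beta^T (M + lam*L) beta - 2 beta^T a with beta >= 0 and sum(beta) = 1."""
    m = len(a)
    Q = M + lam * L
    objective = lambda b: b @ Q @ b - 2.0 * (b @ a)
    constraints = ({"type": "eq", "fun": lambda b: b.sum() - 1.0},)
    bounds = [(0.0, None)] * m
    start = np.full(m, 1.0 / m)                                # start from equal (mean) weights
    result = minimize(objective, start, method="SLSQP", bounds=bounds, constraints=constraints)
    return result.x

def combine_kernels(kernels, beta):
    """K* = sum_i beta_i K_i (Eq. (6a))."""
    return sum(b * K for b, K in zip(beta, kernels))
```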