A pipeline for improved QSAR analysis of peptides: physiochemical property parameter selection via BMSF, near-neighbor sample selection via semivariogram, and weighted SVR regression and prediction
In this paper, we present a pipeline for improved QSAR analysis of peptides. The modeling involves a double selection procedure: feature selection is performed first, followed by sample selection, before the final regression analysis. Five hundred and thirty-one physicochemical property parameters of amino acids were used as descriptors to characterize the structure of peptides. These high-dimensional descriptors were then passed through the binary matrix shuffling filter (BMSF), a feature selection step that yields a low-dimensional set of important features. Each descriptor that passes the BMSF filter also receives a weight defined by its contribution to reducing the estimation error. The selected features served as the predictors for subsequent sample selection and modeling. Based on the weighted Euclidean distances between samples, a common range was determined with a high-dimensional semivariogram and then used as a threshold to select the near-neighbor samples from the training set. For each sample to be predicted, the QSAR model was built by support vector regression (SVR) on the weighted, selected features, using only that sample's own set of near-neighbor training samples. Prediction was then conducted for each test sample accordingly. The performance of the pipeline was tested on QSAR analyses of angiotensin-converting enzyme inhibitor and HLA-A*0201 data sets, and prediction accuracy improved in both applications. The pipeline optimizes QSAR modeling from both the feature selection and the sample selection perspectives, which leads to improved accuracy over single selection methods. We expect the pipeline to be widely applicable to regression prediction problems.
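The near-neighbor selection step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the weighted Euclidean distance takes the form sqrt(sum_k w_k (x_k - y_k)^2), which the abstract does not spell out. The feature weights, the toy data, and the value of `common_range` are hypothetical; in the pipeline the weights would come from BMSF and the range from the fitted semivariogram.

```python
import math


def weighted_distance(x, y, weights):
    """Weighted Euclidean distance, assumed form: sqrt(sum_k w_k * (x_k - y_k)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def select_near_neighbors(query, train_X, weights, common_range):
    """Return indices of training samples whose distance to `query` is within the range."""
    return [
        i
        for i, row in enumerate(train_X)
        if weighted_distance(query, row, weights) <= common_range
    ]


# Hypothetical feature weights (stand-ins for BMSF-derived weights).
weights = [0.5, 0.3, 0.2]

# Toy training samples described by three selected features.
train_X = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [3.0, 0.0, 0.0],
    [0.5, 0.5, 2.0],
]

# Sample to be predicted and a hypothetical common range (threshold).
query = [0.0, 0.0, 0.0]
common_range = 1.2

# Indices of the training samples that would be used to fit this query's own model.
neighbors = select_near_neighbors(query, train_X, weights, common_range)
print(neighbors)
```

In the full pipeline, an SVR model would then be fitted on the rows indexed by `neighbors` only, and a separate neighbor set and model would be built for each test sample.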
Keywords: Peptides; Quantitative structure–activity regression; Feature selection; Semivariogram; Support vector regression
This work was supported by the Doctoral Foundation of Ministry of Education of China (No. 20124320110002), the Scientific Research Fund of the Hunan Provincial Financial Department (No. 62020411074) and the Postgraduate Scientific Research Innovation Project of Hunan Province, China (No. CX2013B306). The work of H. Wang was partially supported by a grant from the Simons Foundation (#246077).
Conflict of interest
The authors declare no conflict of interest.