Introduction

Cancer encompasses a group of diseases characterized by the unregulated growth of abnormal cells, with the capacity to infiltrate adjacent healthy tissues and disseminate via the circulatory or lymphatic system, a phenomenon referred to as metastasis1. Angiogenesis, the process by which new blood vessels are formed, is regarded as one of the key processes enabling the proliferation and metastatic spread of cancer cells: it supplies oxygenated blood and nutrients and removes waste products2. Most solid tumors, such as those of the lung, breast, colon, prostate, and many other organs, rely heavily on angiogenesis for their growth and progression. Lung cancer, for example, accounted for 2.5 million (12.4%) of all newly diagnosed cancer cases in 2022. In the same year, an estimated 1.8 million cancer deaths (18.7%) were attributed to lung cancer, making it a leading cause of cancer-related mortality3. Currently, the modulation and suppression of angiogenesis are at the forefront of cancer therapy research4,5, with significant implications for the treatment of other angiogenesis-dependent diseases such as blindness, rheumatoid arthritis, and psoriasis6,7,8.

Several peptides, originating from diverse proteins, possess the capability to inhibit angiogenesis5,9,10,11,12. However, experimental techniques employed in the discovery and optimization of anti-angiogenic peptides (AAPs) have been markedly time-consuming, costly, and arduous, prompting the need for more efficient and effective approaches. With the growing number of AAPs available in databases, identifying potential AAPs based on computational models is highly desirable.

To date, several computational methods have been developed for the identification of AAPs. AntiAngioPred13 considers peptide features of amino acid composition (AAC) and dipeptide composition (DPC) with seven machine learning (ML) models to predict AAPs, among which the support vector machine (SVM) with AAC features yields the best prediction performance. Blanco et al.14 take AAC, DPC, and tripeptide composition as feature types, filter irrelevant numerical features using a T-test, and conduct experiments with four ML models, among which the generalized linear model achieves the highest prediction accuracy. TargetAntiAngio16 uses AAC, pseudo amino acid composition (PseAAC), and amphiphilic pseudo amino acid composition features with a random forest model to predict AAPs. AntAngioCOOL15 uses PseAAC, several composition-based features, physicochemical profiles of five amino acids from the N- and C-termini, and atomic profiles as features, and evaluates 227 classifiers; the three models achieving the highest sensitivity, specificity, and accuracy, respectively, are included in its software package. AAPred-CNN is a deep learning model that adopts multiple convolution channels to extract local features of input sequences and is reported to outperform existing methods in all evaluation metrics.

While existing studies have shown some advancements, most of them are limited by their choice of feature types and the lack of a systematic approach to determine suitable feature subsets and optimize ML models. More importantly, there is still room to improve the prediction accuracy. In this study, we present AAPL, a sequence-based predictor of AAPs that considers a comprehensive set of features from peptide sequences. We considered 58 different feature types, producing a total of 4335 numeric values for each sequence. Next, feature ranking and a heuristic feature selection algorithm were applied to determine the best feature subset for ML. The hyperparameters of six different ML models were then optimized with respect to the selected feature subset. The resulting models were used to conduct cross validation and independent tests for benchmark analysis. The evaluation results, derived from two independent test datasets, indicate that AAPL significantly outperforms current methods in terms of prediction accuracy. These results affirm that AAPL, by leveraging a comprehensive array of features extracted from peptide sequences, serves as an efficient and precise approach for the identification of AAPs.

Materials and methods

Datasets

Two datasets used in several studies8,14,15,16 were downloaded from the AntiAngioPred13 web server. The first dataset consisted of 133 experimentally validated AAPs and 135 non-AAPs randomly selected from SwissProt17. To minimize bias, the dataset included only peptides with pairwise sequence identity below 70%, and the length distributions of AAPs and non-AAPs were similar. An 80% split of the dataset, termed S212 (105 positive and 107 negative peptides), was used for model development and cross validation. The remaining 20% of sequences, termed S56, consisted of 28 AAPs and 28 non-AAPs and was reserved for the independent test.

In addition to the datasets of full peptide length, we also considered terminus datasets, i.e., datasets consisting of the first 15 amino acids from the N-termini of peptides8,13,16. Two terminus datasets were obtained from TargetAntiAngio16. The one for model development and cross validation, termed NT-S160, consisted of 80 AAPs and 80 non-AAPs. The one for independent test, termed NT-S40, consisted of 19 AAPs and 21 non-AAPs.

Workflow of AAPL

Figure 1 illustrates the workflow of AAPL. The main dataset (consisting of 80% of sequences) was used for feature engineering, including procedures of feature encoding, normalization, and a heuristic algorithm for feature selection. The selected feature subset was then used to tune the hyperparameters of ML models using a Bayesian optimization algorithm. Next, the ML models were employed for cross validation with the main dataset and independent test with an independent dataset. The entire workflow is detailed in the following sections.

Figure 1

Workflow of AAPL.

Feature encoding

We transformed amino acid sequences into numerical representations using a variety of compositional and physicochemical properties of peptides. A total of 58 feature types were considered, including peptide length, AAC, DPC, grouped amino acid composition (GAAC)18, grouped di-peptide composition (GDPC)19, composition of k-spaced amino acid pairs (CKSAAP)20, amphiphilic pseudo amino acid composition (APAAC)21,22,23, composition of k-spaced amino acid group pairs (CKSAAGP)24,25, Shannon entropy at protein level (SEP)26,27, Shannon entropy at residue level (SER)26,27, Composition-Transition-Distribution (CTD)28, conjoint triad (CTriad)29, and dipeptide deviation from expected mean (DDE)30, among others. The entire list of 58 feature types and the size of each (number of numeric values) is given in Supplementary Table S1. Each sequence was associated with a feature vector of 4335 entries.
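To make the encoding concrete, the following minimal Python sketch computes two of the simpler feature types, AAC (20 values) and DPC (400 values). It is illustrative only; the function names and the example sequence are hypothetical, and the full AAPL pipeline covers 58 feature types.

```python
# Minimal sketch of two feature types; not the exact encoders used by AAPL.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def aac(seq: str) -> list[float]:
    """Amino acid composition: frequency of each of the 20 residues (20 values)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def dpc(seq: str) -> list[float]:
    """Dipeptide composition: frequency of each ordered residue pair (400 values)."""
    n = len(seq) - 1
    return [sum(seq[i:i + 2] == d for i in range(n)) / n for d in DIPEPTIDES]

# A hypothetical peptide; the real feature vector concatenates all 58 types (4335 values).
vector = aac("ACDWKRRSC") + dpc("ACDWKRRSC")
```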

Normalization

Robust normalization was applied to the compositional and physicochemical features with RobustScaler from the Python Scikit-learn package31. RobustScaler converts a feature value $x_i$ to $y_i$ using the following equation.

$$y_{i} = \frac{x_{i} - \text{Median}(X)}{Q3(X) - Q1(X)},$$
(1)

in which Q3(X) and Q1(X) stand for the 3rd quartile and the 1st quartile of feature X. The scaled values of each feature have a median of zero and an interquartile range of one across all sequences.
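As a minimal sketch, this normalization step can be reproduced with scikit-learn's RobustScaler, whose defaults implement Eq. (1); the feature matrix below is a random stand-in for the real encoded data.

```python
# Sketch of robust normalization per Eq. (1); X is a placeholder feature matrix.
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.random.rand(212, 4335)   # hypothetical: 212 sequences x 4335 features
scaler = RobustScaler()         # centers on the median, scales by Q3 - Q1
Y = scaler.fit_transform(X)     # each column: median 0, interquartile range 1
```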

Machine learning methods

Six ML models were considered in this study: support vector machines (SVM)32, linear discriminant analysis (LDA)33, random forest (RF)34, extremely randomized trees (ET)35, light gradient boosting machine (LightGBM)36, and categorical boosting (CB)37. SVM is a kernel-based learning algorithm that transforms the data into a higher-dimensional space and then searches for a hyperplane in that space that maximizes the margin between the two classes. We used the radial basis function as the kernel for SVM. LDA finds linear combinations of features that maximize the separation between classes while minimizing the variance within each class. RF constructs a multitude of decision trees, each trained on a random subset of the data with a random subset of features. The predictions of the individual trees are then averaged to produce the final prediction. ET is similar to RF but differs in two key ways: (1) ET splits nodes on all features instead of a random subset of features, and (2) ET builds each tree on the entire training set rather than a random subset. LightGBM and CB are two implementations of the gradient boosting algorithm, which constructs a sequence of weak learners, each trained to minimize the residual error of the previous one. The final prediction of a gradient boosting model is a weighted average of the predictions of the individual weak learners.
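A minimal sketch instantiating the six classifiers is given below; the constructors use default hyperparameters (tuned later with Optuna), and the RBF kernel is set for SVM as described above.

```python
# Sketch of the six ML models considered; defaults are tuned later.
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

models = {
    "SVM": SVC(kernel="rbf", probability=True),   # radial basis function kernel
    "LDA": LinearDiscriminantAnalysis(),
    "RF": RandomForestClassifier(),
    "ET": ExtraTreesClassifier(),
    "LightGBM": LGBMClassifier(),
    "CB": CatBoostClassifier(verbose=0),
}
```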

Heuristic algorithm for feature selection

Given n features, the number of potential feature subsets is $2^n$. Exhaustively evaluating all possible combinations to find the optimal subset is computationally infeasible, since n = 4335 in this study. A heuristic algorithm for feature selection is therefore proposed. First, the Boruta program38, a wrapper approach utilizing a random forest algorithm for feature ranking, was employed to analyze all features. The algorithm progressively eliminates features that are statistically less relevant than randomized features, thereby generating a prioritized list of features based on their importance. For a given dataset, the ranked feature list is expressed as an ordered set $F = \{F_1, F_2, \ldots, F_{4335}\}$, sorted in descending order of feature importance. Next, a heuristic approach was applied to determine the feature subset for the dataset. The top-N feature subset is defined as $FS_N = \{F_1, F_2, \ldots, F_N\}$. Iterative five-fold cross validation runs using $FS_N$ were performed with SVM, LDA, RF, ET, LightGBM, and CB under their default hyperparameters (based on the settings of the Scikit-learn package). The experiments were carried out via PyCaret39 and the results were evaluated with MCC. Let $MCC_N^i$ be the MCC for ML model i using $FS_N$. The best MCC based on $FS_N$ is defined as

$${Best\_MCC}_{N}=\underset{i}{\text{max}}({MCC}_{N}^{i})$$
(2)

where i corresponds to any of the six ML models (SVM, LDA, RF, ET, LightGBM, and CB). The process was repeated iteratively, with N ranging from 50 to 200 in increments of 10. The best feature subset (BFS) is defined as

$$BFS = {FS}_{j}, \quad \text{where } {Best\_MCC}_{j}=\underset{k}{\text{max}}({Best\_MCC}_{k})$$
(3)

The purpose of feature selection is to determine the best feature subset for the given dataset. Using the default hyperparameters of the six ML models in this task circumvents the lengthy process of model optimization. Moreover, using ML models built on different rationales has the advantage that the selected feature subsets are not biased towards a particular model. The heuristic algorithm was applied to S212 and NT-S160, yielding two feature subsets; a sketch of the procedure is given below.
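The sketch below illustrates the heuristic under stated assumptions: X and y are the normalized feature matrix and labels, models is the dictionary from the earlier sketch, and Boruta's ranking is used as the importance order. It approximates Eqs. (2)-(3) rather than reproducing the exact PyCaret runs.

```python
# Sketch of Boruta ranking plus the top-N scan (N = 50..200, step 10).
import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
boruta = BorutaPy(rf, n_estimators="auto", random_state=1)
boruta.fit(X, y)                      # X: (n_seqs, 4335) matrix, y: 0/1 labels
ranked = np.argsort(boruta.ranking_)  # feature indices, most relevant first

best_mcc, best_subset = -1.0, None    # tracks Best_MCC and BFS (Eq. 3)
for n in range(50, 201, 10):
    fs_n = ranked[:n]                 # top-N subset FS_N
    for model in models.values():     # six models with default hyperparameters
        mcc = cross_val_score(model, X[:, fs_n], y, cv=5,
                              scoring="matthews_corrcoef").mean()
        if mcc > best_mcc:
            best_mcc, best_subset = mcc, fs_n
```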

Optimization of machine learning models

After the feature subset was determined, the hyperparameters of the ML models were optimized on S212 and NT-S160 with Optuna40, an automatic hyperparameter optimization package that samples the search space and prunes unpromising trials using a Bayesian model. We employed the tree-structured Parzen estimator (TPE) algorithm41 as the search algorithm of Optuna, owing to its ability to converge to the global optimum faster than randomized search while demanding less computing time than grid search. After the optimized hyperparameters were obtained for each model, ten-fold cross validation was performed to evaluate each predictor.
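A minimal Optuna sketch for the SVM model is shown below; the search ranges and trial count are assumptions for illustration, and analogous objectives would be written for the other five models. X_sel denotes the columns of the selected feature subset.

```python
# Sketch of TPE-based hyperparameter tuning for SVM; ranges are illustrative.
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def objective(trial):
    c = trial.suggest_float("C", 1e-2, 1e2, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1e1, log=True)
    model = SVC(kernel="rbf", C=c, gamma=gamma)
    # X_sel: selected feature columns; y: labels (assumed available)
    return cross_val_score(model, X_sel, y, cv=10,
                           scoring="matthews_corrcoef").mean()

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=1))
study.optimize(objective, n_trials=100)   # trial count is an assumption
best_params = study.best_params
```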

To perform an independent test, we carried out an additional training phase utilizing the optimized parameters of each model on the complete S212 or NT-S160 dataset. The resulting models, benefiting from an additional 10% of training data compared to the models obtained from cross validation, were used to conduct independent tests based on S56 and NT-S40.

Evaluation metrics

For benchmark comparison, prediction results were evaluated with accuracy, precision, recall (or sensitivity), specificity, F1-score (the harmonic mean of precision and recall), and MCC (Matthews correlation coefficient), defined as follows:

$$\text{Accuracy (Acc)} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$
(4)
$$\text{Precision (Pre)} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
(5)
$$\text{Recall (Rec)} = \text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
(6)
$$\text{Specificity (Sp)} = \frac{\text{TN}}{\text{TN} + \text{FP}}$$
(7)
$$\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}$$
(8)

where TP, TN, FP, and FN denote the counts of true positives, true negatives, false positives, and false negatives, respectively. Accuracy, precision, recall, and specificity range from 0 to 1, where a higher value reflects better predictive performance. MCC spans from −1 to 1, signifying entirely negative and entirely positive correlations, respectively, with an MCC of 0 indicating random prediction. In addition to the above metrics, the AUC (area under the receiver operating characteristic curve)42, a non-parametric and threshold-independent measure, was also included for evaluation. An AUC value of 1 indicates a perfect model, while an AUC value of 0.5 indicates a random guesser.
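As a sketch, the metrics above can be computed with scikit-learn as follows; y_true, y_pred, and y_score (predicted probabilities) are assumed to come from a trained model.

```python
# Sketch computing Eqs. (4)-(8) plus F1 and AUC with scikit-learn.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "Acc": accuracy_score(y_true, y_pred),
    "Pre": precision_score(y_true, y_pred),
    "Rec": recall_score(y_true, y_pred),       # sensitivity
    "Sp": tn / (tn + fp),                      # specificity, Eq. (7)
    "F1": f1_score(y_true, y_pred),
    "MCC": matthews_corrcoef(y_true, y_pred),
    "AUC": roc_auc_score(y_true, y_score),     # threshold-independent
}
```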

Results and discussion

Amino acid and dipeptide composition analyses

The AAC analysis of AAPs and non-AAPs in S212 is illustrated in Fig. S1. The analysis reveals that the residues C, P, R, S, and W are predominantly found in AAPs, whereas the residues A, E, I, L, and V are more common in non-AAPs. The DPC analysis for S212 is shown in Fig. S2. Certain dipeptides such as CG, CN, CS, HG, HH, SP, and SC are predominant in AAPs, while dipeptides such as AA, EL, EV, IA, and NK are more common in non-AAPs. These analyses suggest that the presence of certain amino acids and their pairwise combinations plays a pivotal role in modulating the angiogenic properties of peptides.
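A minimal sketch of such a composition analysis is shown below, reusing the aac() encoder sketched earlier; aap_seqs and non_aap_seqs are hypothetical lists of positive and negative sequences.

```python
# Sketch comparing mean amino acid composition between AAPs and non-AAPs.
import numpy as np

aac_pos = np.mean([aac(s) for s in aap_seqs], axis=0)      # mean AAC over AAPs
aac_neg = np.mean([aac(s) for s in non_aap_seqs], axis=0)  # mean AAC over non-AAPs
for a, p, q in zip(AMINO_ACIDS, aac_pos, aac_neg):
    print(f"{a}: AAP={p:.3f}  non-AAP={q:.3f}  diff={p - q:+.3f}")
```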

Selected feature subsets

Cross validation results for S212 and NT-S160 based on different feature numbers were evaluated with MCC, and the results are shown in Fig. 2. It can be seen that the best feature numbers for S212 and NT-S160 are 150 and 120, respectively. Using these feature numbers, the best MCCs achieved by the ML models are 0.683 and 0.651 for S212 and NT-S160, respectively. The full lists of 150 features for S212 and 120 features for NT-S160 are given in Supplementary Tables S2 and S3, respectively. The Gini importance produced by the random forest model is also listed for comparison. The features of S212 and NT-S160 are associated with 34 and 32 feature types, respectively, among which 28 feature types are shared. Table 1 lists the top 5 most frequent feature types for S212 and NT-S160. All of them are among the 28 shared feature types, despite differences in the number of features selected per feature type. The full lists of feature types for S212 and NT-S160 are given in Table S4.

Figure 2

MCCs of the best-performing ML models based on different feature numbers on (A) S212 and (B) NT-S160. The highest MCC in each panel is circled in red.

Table 1 The numbers of selected features and the sizes of the top 5 most frequent feature types from the selected feature subsets of S212 and NT-S160.

Feature exploration and biological relevance

The selected feature subset of S212 (Supplementary Table S2) includes features related to certain amino acids, including Ala, Cys, Ser, Trp, Leu, and Phe: for example, amino acid composition (AAC) for Ala, Cys, and Ser; distance distribution of residues43 (DDR) for Ala, Cys, and Trp; and Shannon entropy at residue level (SER) for Ala, Cys, and Ser. This is in good agreement with the propensities shown in Supplementary Fig. S1, namely that Cys, Ser, and Trp are prevalent in AAPs, while Ala, Leu, and Phe are more common in non-AAPs. There is also a strong association between hydrophobicity and the majority of the selected CTD features, consistent with the established fact that AAPs have a relatively high incidence of hydrophobic residues44. The presence of aliphatic residues appears to be a significant characteristic in CKSAAGP and GDPC. This observation aligns with the fact that aliphatic amino acids such as Ala, Ile, Leu, and Val tend to occur more frequently in non-AAPs, as indicated in Supplementary Fig. S1. In addition, the feature subset incorporates generalized feature types, such as Ez45, Z346, and Z547. Ez characterizes the empirical residue-based potential for protein insertion into lipid membranes, a process governed by complex factors such as hydrophobic interactions, electrostatic forces, and hydrogen bonding. Meanwhile, Z3 and Z5 are multidimensional descriptors capturing hydrophobicity, charge, aromaticity, polarity, and other physicochemical properties critical to peptide behavior and function.

It can be observed from Supplementary Table S3 that compared to S212, NT-S160 exhibits a significantly smaller number of selected features across AAC, GDPC, and CKSAAGP, with only 1, 1, and 4 features selected, respectively, in contrast to 3, 4, and 19 features selected for S212. These feature types, which are derived from specific amino acids or their combinations, are likely influenced by the shorter sequence lengths present in NT-S160. Supplementary Table S3 reveals that hydrophobicity remains a predominant characteristic among the selected features of the CTD descriptor for NT-S160. Additionally, the feature types Ez, Z3, and Z5, which capture various physicochemical properties, still contribute significantly to the selected features, accounting for 14, 9, and 21 features, respectively.

Benchmark results of cross validation

Table 2 shows the benchmark results of cross validation using the six ML models on S212. It can be seen that SVM, achieving an MCC of 0.642, outperforms the other models in all evaluation measures. SVM outperforms CB, the second-best model, by 8.7%, 3.7%, and 4.2% in MCC, AUC, and accuracy, respectively. The precision of SVM is 0.840, an improvement over the other models of 3.6% to 9.9%. On the other hand, SVM achieves a recall of 0.825, higher than the other models by a substantial margin of 7.6% to 11.5%, suggesting that recall plays a more critical role than precision in explaining SVM's improvement in MCC on S212. Table 3 shows the benchmark results of cross validation on NT-S160. SVM achieves an MCC of 0.598 and the highest value in all evaluation measures. The MCCs of ET, RF, and CB are above 0.5. Similarly, SVM outperforms the other models by 5.0% to 16.7% in recall and by 2.9% to 9.6% in precision, suggesting that the improved recall (or sensitivity) is the main reason why SVM yields the highest MCC on NT-S160. The ROC curves of the six ML models for S212 and NT-S160 are illustrated in Supplementary Fig. S3A,B, respectively.

Table 2 Benchmark results of cross validation on S212.
Table 3 Benchmark results of cross validation on NT-S160.

Benchmark results of independent tests

Table 4 shows the benchmark results on S56 for the six ML models and three existing predictors, AAPred-CNN, TargetAntiAngio, and AntiAngioPred. It can be observed that SVM yields the highest recall, accuracy, and MCC. SVM improves the MCC and recall of AAPred-CNN, the state-of-the-art method, by 5.3% and 17.8%, respectively, though AAPred-CNN achieves the highest precision of 0.815. The AUC of SVM, 0.828, is comparable to the highest AUC of 0.830, produced by TargetAntiAngio. It can also be seen that AntiAngioPred yields the lowest MCC and recall, indicating a large number of false negatives. This is likely because AntiAngioPred relies on a single feature type, amino acid composition, for prediction. The ROC curves of the six ML models for S56 are illustrated in Supplementary Fig. S4A.

Table 4 Benchmark results of independent test on S56.

The evaluation results on NT-S40 are shown in Table 5. Consistent with previous studies, the overall MCCs of all methods improve compared to the results on S56, the dataset of full-length peptide sequences. SVM achieves an MCC of 0.756, which is 5.9%, 19.6%, and 24.6% higher than AAPred-CNN, TargetAntiAngio, and AntiAngioPred, respectively. Notably, LightGBM, RF, ET, and CB also outperform the three existing methods in MCC. AAPred-CNN produces a precision and specificity of 1.000 but suffers from the second-lowest recall of 0.650, indicating a large number of false negatives. In contrast, TargetAntiAngio produces the highest recall of 0.905 but the lowest precision among all methods, indicating a large number of false positives. Overall, these benchmark comparisons demonstrate that our SVM model yields a significant improvement in MCC over existing methods. The ROC curves of the six ML models for NT-S40 are illustrated in Supplementary Fig. S4B.

Table 5 Benchmark results of independent test on NT-S40.

We further analyzed the correlation between each model's prediction probability output and the true positive rate (TPR), calculated as the number of actual AAPs divided by the total number of sequences predicted within a given probability range. As illustrated in Fig. 3A,B, all six ML models demonstrate strong positive correlations between the TPR and the prediction probability on both datasets. In other words, a higher prediction probability indicates a higher likelihood that the sequence is an actual AAP.
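A sketch of this analysis is given below; the bin width of 0.1 is an assumption, and y_true and y_score are the labels and model probabilities for the test set, as numpy arrays.

```python
# Sketch binning prediction probabilities and computing per-bin TPR (Fig. 3).
import numpy as np

edges = np.arange(0.0, 1.01, 0.1)              # hypothetical bin width of 0.1
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (y_score >= lo) & (y_score < hi)    # sequences predicted in this range
    if mask.any():
        tpr = y_true[mask].sum() / mask.sum()  # actual AAPs / sequences in bin
        print(f"[{lo:.1f}, {hi:.1f}): n={mask.sum()}, TPR={tpr:.2f}")
```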

Figure 3

True positive rate (TPR) and sequence number (denoted as SeqNum in the figure) with respect to prediction probability of 6 ML models on (A) S56 and (B) NT-S40. Prediction probability for each sequence is obtained from the output of each machine learning model. True positive rate is calculated as the number of AAPs divided by the total number of sequences predicted within the range of the prediction probability.

Prediction accuracy with respect to peptide properties

Prediction results from the independent tests based on SVM, the most accurate model, were further analyzed with respect to three peptide properties, namely the ratios of hydrophobic, hydrophilic, and charged residues within a peptide. In this study, the hydrophobic amino acids are V, I, L, M, F, W, and C; the hydrophilic amino acids are R, N, D, E, Q, H, K, S, and T; and the charged amino acids are E, D, R, K, and H. As illustrated in Fig. 4, the prediction accuracy is positively correlated with the ratio of hydrophobic residues within a peptide for S56 (Fig. 4A) and NT-S40 (Fig. 4D). On the other hand, the prediction accuracy is negatively correlated with the ratio of hydrophilic residues (Fig. 4B for S56 and Fig. 4E for NT-S40) and the ratio of charged residues (Fig. 4C for S56 and Fig. 4F for NT-S40). These results suggest that peptides with more hydrophobic residues and fewer hydrophilic and charged residues are predicted more accurately, in agreement with prior studies stating that hydrophobicity is an important characteristic of AAPs16,44. The analyses point towards potential areas for future refinement, specifically improving predictions for peptides with fewer hydrophobic residues and a higher proportion of hydrophilic and charged residues.
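The three ratios can be computed as in the sketch below, using the residue groups defined above; the function name is illustrative.

```python
# Sketch computing the residue ratios analyzed in Fig. 4.
HYDROPHOBIC = set("VILMFWC")
HYDROPHILIC = set("RNDEQHKST")
CHARGED = set("EDRKH")

def residue_ratios(seq: str) -> dict[str, float]:
    n = len(seq)
    return {
        "hydrophobic": sum(c in HYDROPHOBIC for c in seq) / n,
        "hydrophilic": sum(c in HYDROPHILIC for c in seq) / n,
        "charged": sum(c in CHARGED for c in seq) / n,
    }
```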

Figure 4

Analyses of prediction accuracy versus different peptide properties from the independent tests. Curves in panels A, B, and C show the mean accuracy for peptides from S56 with different ratios of hydrophobic, hydrophilic, and charged residues, respectively. Curves in panels D, E, and F are defined analogously for peptides from NT-S40. Bars in each panel show the number of peptides within the ratio ranges specified on the x-axis.

Efficacy of the selected feature subset

In this study, each sequence is originally encoded with various compositional, physicochemical, and biological features, yielding a feature vector of 4335 numeric values. The selected feature subsets, consisting of 150 and 120 numeric values for S212 and NT-S160, respectively, play an important role in the enhanced prediction accuracy. To validate the efficacy of the feature subsets for discriminating AAPs, we applied t-distributed stochastic neighbor embedding (t-SNE)48,49 to visualize the data distributions on a two-dimensional plane. As illustrated in Fig. 5, the t-SNE distributions of negatives and positives overlap substantially when all 4335 numeric features are used for S212 (Fig. 5A) and NT-S160 (Fig. 5C). Conversely, the distributions of negatives and positives are better separated using the selected feature subsets for S212 (Fig. 5B) and NT-S160 (Fig. 5D). Taking Fig. 5B as an example, the positives are more concentrated in the upper left and upper right regions, while the negatives are more concentrated in the lower left region. This reveals that the feature subsets for the two datasets are informative and contribute to the improved prediction performance of the ML models.
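A minimal sketch of the t-SNE comparison is shown below; X, y, and best_subset follow the earlier sketches, and the perplexity and random seed are assumptions.

```python
# Sketch of the t-SNE visualization in Fig. 5 (all features vs. selected subset).
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

for X_in, title in [(X, "All 4335 features"), (X[:, best_subset], "Selected subset")]:
    emb = TSNE(n_components=2, perplexity=30, random_state=1).fit_transform(X_in)
    plt.figure()
    plt.scatter(emb[y == 0, 0], emb[y == 0, 1], label="Negatives (non-AAPs)")
    plt.scatter(emb[y == 1, 0], emb[y == 1, 1], label="Positives (AAPs)")
    plt.title(title)
    plt.legend()
plt.show()
```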

Figure 5

t-SNE distributions of (A) S212 using 4335 numeric features, (B) S212 using the 150 selected features, (C) NT-S160 using 4335 numeric features, and (D) NT-S160 using the 120 selected features. Negatives refer to non-AAPs and positives refer to AAPs.

Conclusion

In this study, we present AAPL, a sequence-based predictor for AAP prediction with improved accuracy. Sequences were encoded with a comprehensive set of features, encompassing a total of 4335 numeric values based on 58 different feature types, followed by a feature ranking process that produced a list of features ordered by importance. The best feature number was determined with a heuristic algorithm comprising iterative runs of cross validation using six different ML models. The feature subset corresponding to the highest MCC was then used for fine-tuning the hyperparameters of each ML model with a Bayesian optimization algorithm. The feature subsets and the optimized ML models were then applied to conduct cross validation and independent tests. We considered two datasets, one consisting of full-length sequences and the other consisting of the first 15 residues from the peptide N-termini. The feature subsets of the former and the latter comprise 150 and 120 numeric values, respectively. The independent test for the former shows that AAPL achieves an MCC of 0.671, which is 5.3%, 11.1%, and 26.1% higher than AAPred-CNN, TargetAntiAngio, and AntiAngioPred, respectively. The independent test for the latter shows that AAPL achieves an MCC of 0.756, which is 5.9%, 19.6%, and 24.6% higher than the same three methods, respectively. Evaluation results reveal that AAPL's higher recall, rather than precision, drives its superior prediction capability. Further analyses demonstrate that peptides with more hydrophobic residues and fewer hydrophilic and charged residues are predicted with higher accuracy. The efficacy of the selected feature subsets was further validated with t-SNE plots: compared to the original 4335 numeric features, the selected feature subsets result in greater separation between AAPs and non-AAPs, suggesting that they help the ML models achieve improved prediction accuracy. Overall, the study shows that our machine learning-based approach achieves reasonably good accuracy in identifying AAPs.

It is crucial to note that our models were developed using balanced datasets comprising equal proportions of AAPs and non-AAPs. This approach, albeit deviating from the anticipated real-world distribution, was adopted to facilitate model training and evaluation without introducing potential biases arising from class imbalance. Nonetheless, the work is considered an initial stage of development and future efforts can be made to explore the impact of class imbalance, thereby improving the robustness and generalizability of AAPL with more non-AAPs. In addition, future availability of experimentally validated AAPs could pave the way for deep learning-based prediction models.