Introduction

The biomedical field faces a significant challenge in the development of pharmacological compounds that can be efficiently delivered to binding sites. Cell-Penetrating Peptides (CPPs) provide a safe and effective means of delivering therapeutic agents and other cargoes into cells without causing damage to the cell membrane. Such cargo may include nucleic acids, proteins, peptides, nanoparticles, fluorophores, small therapeutic compounds, and peptide nucleic acids [1,2,3,4]. CPPs share common structural and physicochemical features, including short amino acid sequences consisting of 4–40 residues, which typically adopt α-helical structures [1, 5,6,7]. They are often amphiphilic or cationic, soluble in water, partially hydrophobic, and rich in arginine and lysine residues [6, 8,9,10].

CPPs have been extensively studied for their potential use as drug delivery systems and diagnostic tools in various medical areas, such as immunotherapy [11], neurological disorders [12], and cancer [13]. Although the number of clinical trials involving CPPs has increased, only one CPP has been approved by the European Medicines Agency (EMA) [1, 14]. The design and testing of different CPPs in vitro and in vivo can be expensive and labor-intensive [15, 16]. Therefore, efficient computational tools and methodologies are necessary for rapid and accurate identification of suitable CPPs. Recently, many computational resources have been used to provide information on CPP design and uptake ability, including Machine Learning (ML) approaches such as C2Pred [17], CPPred-RF [18], SkipCPP-Pred [19], CellPPD-MOD [20], ML-based prediction of CPP (MLCPP) [21, 22], Kernel Extreme Learning Machine-based prediction (KELM-CPPpred) [23], and StackCPPred [24]. However, existing methods rely solely on classification approaches because the data available in current databases are qualitative. One of the most commonly used databases, CPPsite 2.0, published in 2016, contains qualitative data for over 1,800 CPP sequences [2].

We created POSEIDON (Peptidic Objects SEquence-based Interaction with cellular DOmaiNs), a comprehensive database containing quantitative uptake values and physicochemical properties of 1,315 cell-penetrating peptides across various scenarios, to fill gaps in current CPP design. POSEIDON is the most extensive database of quantitative CPP uptake values, with up-to-date information and unique data collection. Furthermore, POSEIDON includes a processed dataset, built with a well-defined methodological approach, which makes it an ideal benchmark for the development of new ML algorithms. By leveraging this database, coupled with cell line genomic features, we developed a novel ML regression model that accurately predicts CPP uptake efficiency.

Methods

Data extraction and curation

The general workflow for data collection is shown in Fig. 1, which depicts the collection, organization, and extraction of accurate and relevant information from various sources to create a centralized and annotated database. CPP sequences and associated features were first collected from the CPPsite 2.0 database [2]. This yielded a first dataset of 1,855 entries, each corresponding to one CPP and its features. The information retrieved from this database included the CPP identifier, its name, and corresponding sequence, along with information such as PubMed IDs, cell lines used in the study, and cargo coupled to the CPP. All scientific articles referenced in CPPsite 2.0 were manually curated to fill POSEIDON with quantitative CPP uptake values and their respective units. Uptake values were recorded when quantitative data were available in plots or when they were directly mentioned by the authors. In addition, the temperature, concentration, time for CPP incubation, and uptake evaluation methods from the referenced articles were manually annotated. Only peptides with quantitative information were retained in the dataset, reducing the number of curated entries to 906, which corresponds to 676 unique CPPs.

Fig. 1

Overall workflow of data collection

Subsequently, we conducted a thorough literature search to supplement the database with manually curated samples. This process involved extensive and careful examination of relevant publications to identify additional data points. To this end, another 228 CPP-related articles were retrieved from PubMed using the query ((((CPP) AND (Cell Penetrating Peptide)) OR (Cell-penetrating Peptide)) AND (Cellular Uptake)) AND (("2015/11/19"[Date - Publication] : "2022/08/01"[Date - Publication])). These articles were evaluated, and quantitative experimental information was added when available. The final database comprised 2,371 entries, of which 1,315 were unique CPPs and 1,056 were CPPs with different uptake conditions. The latter refers to peptides that were tested repeatedly under different conditions, such as varying cargoes, cell lines, temperatures, or incubation times, to analyze their uptake capacity under each condition.

To develop a suitable ML approach, it was necessary to refine the dataset to ensure the uniformity of the target variable (Uptake) in units, values, and experimental determination approaches. The following steps were performed to obtain a benchmark dataset for ML training and testing:

  • Rows lacking information on concentration or with unclear peptide sequences were excluded, resulting in 2,067 remaining samples.

  • Only samples determined by fluorescence were retained, as other methods would yield different target variables, leaving 1,765 samples.

  • Samples with relative uptake efficiencies were excluded because they could not be interpreted as actual experimental values for ML purposes, reducing the dataset to 1,563 samples.

  • Samples with unusable peptide concentration information were removed, leaving 1,316 samples.

  • Peptide sequences that contained an excessive number of anomalous amino acids or non-peptide sequences were manually curated and excluded, resulting in 1,274 samples.

  • Outliers for Uptake were removed, resulting in a final dataset of 1,263 samples.

  • Since the original dataset had the same CPPs appearing multiple times under different uptake conditions, we kept these repetitions in the ML dataset, because varying uptake conditions are important factors for developing an ML predictor. As a result, the 1,263 samples of the ML dataset correspond to 642 unique CPPs, some of which appear multiple times under different uptake conditions.
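The filtering steps above can be sketched as a sequence of row-level checks. This is a minimal illustration rather than the authors' pipeline: the field names (`sequence`, `concentration_uM`, `method`, `unit_is_relative`, `uptake`), the method labels, and the toy uptake values are all assumptions made for the example. The sequence check is deliberately stricter than the manual curation described in the text, and the outlier rule is omitted because the text does not specify it.

```python
import math

# Hypothetical labels; the real dataset may name its methods differently.
FLUORESCENCE_METHODS = {"flow cytometry", "fluorescence microscopy",
                        "fluorescence spectroscopy", "FACS"}
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def is_usable(row):
    """Return True if a row passes the (assumed) benchmark filters."""
    seq = row.get("sequence") or ""
    if not seq or row.get("concentration_uM") is None:
        return False                      # missing sequence or concentration
    if row.get("method") not in FLUORESCENCE_METHODS:
        return False                      # keep fluorescence readouts only
    if row.get("unit_is_relative"):
        return False                      # drop relative uptake efficiencies
    if not set(seq.upper()) <= STANDARD_RESIDUES:
        return False                      # simplification: standard residues only
    uptake = row.get("uptake")
    return uptake is not None and uptake > 0   # log10 needs a positive value


def build_benchmark(rows):
    """Filter rows and attach the log10-transformed target."""
    return [{**row, "log10_uptake": math.log10(row["uptake"])}
            for row in rows if is_usable(row)]


# Toy rows with invented uptake values, for illustration only.
rows = [
    {"sequence": "RRRRRRRR", "concentration_uM": 10.0,
     "method": "flow cytometry", "unit_is_relative": False, "uptake": 1000.0},
    {"sequence": "GRKKRRQRRR", "concentration_uM": None,
     "method": "flow cytometry", "unit_is_relative": False, "uptake": 250.0},
    {"sequence": "KLALKLALKAL", "concentration_uM": 5.0,
     "method": "radioactivity", "unit_is_relative": False, "uptake": 80.0},
    {"sequence": "RQIKIWFQNRRMKWKK", "concentration_uM": 5.0,
     "method": "flow cytometry", "unit_is_relative": True, "uptake": 1.4},
]
benchmark = build_benchmark(rows)
```

Only the first toy row survives all filters; the others fail on concentration, evaluation method, and relative units, respectively.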

The POSEIDON original dataset and the ML predictor dataset are available at the following GitHub repository: https://github.com/MoreiraLAB/poseidon/tree/main/data. These datasets are stored under the names “CPP_dataset.csv” and “CPP_ML.csv”, respectively.

Feature extraction

To prepare the dataset for ML, the POSEIDON pipeline incorporates various features that aim to characterize peptides, cell lines, and experimental conditions.

The peptide features can be further classified into three subcategories:

  • Whole-peptide features were obtained using the Peptides R package [25].

  • In-house position one-hot encoding features based on the size of the longest peptide. One-hot encoding is a reliable and interpretable method for representing categorical data such as amino acids in peptides [26, 27]. It is compatible with traditional ML algorithms, is robust to data variations, and minimizes information loss.

  • Annotation-based features, in which the sequence anomaly type and location were substituted with the closest amino acids (Additional file 1: Table S1).
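As an illustration of the position one-hot encoding described above, the sketch below pads every peptide to the length of the longest one and sets a single indicator per occupied position. It assumes the 20 standard amino acids in alphabetical one-letter order; the function name `one_hot_positions` is hypothetical, and the substitution of sequence anomalies by the closest amino acid is not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_positions(sequence, max_length):
    """Encode a peptide as a flat 0/1 vector of length max_length * 20.

    Residue a at position p maps to index p * 20 + AA_INDEX[a]; positions
    past the end of the peptide stay all-zero (padding).
    """
    if len(sequence) > max_length:
        raise ValueError("sequence longer than max_length")
    vector = [0] * (max_length * len(AMINO_ACIDS))
    for pos, aa in enumerate(sequence.upper()):
        if aa in AA_INDEX:  # unknown symbols are left as all-zero here
            vector[pos * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1
    return vector


peptides = ["RRRR", "GRKKRRQRRR"]
max_len = max(len(p) for p in peptides)          # size of the longest peptide
encoded = [one_hot_positions(p, max_len) for p in peptides]
```

Each encoded vector has the same length regardless of peptide length, which is what allows the vectors to be stacked into a single feature matrix. It also explains why these features dominate the feature space when the longest peptide is long.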

Cell line features (736 in total) were obtained from the Genomics of Drug Sensitivity in Cancer (GDSC) database [28] and matched with the cell lines of the POSEIDON dataset. Each POSEIDON cell line was tagged as a true match if it was present in the GDSC. The POSEIDON dataset contained 43 cell lines available in the GDSC (Additional file 1: Table S2).

Finally, the experimental conditions were characterized using several variables (71 in total), including concentration (μM), categorical temperature (°C), incubation time availability and duration (in minutes), and curated cargo to avoid repetition (Additional file 1: Table S3). Before any feature filtering, this added up to 2,908 features (Table 1).

Table 1 POSEIDON features for the ML summary table

Data pre-processing and statistics treatment

Data cleaning, visualization, selection, and preprocessing of the raw dataset were performed using the programming language R (version 4.1.0) [29]. Peptides with unknown uptake values were excluded from the final dataset, as the methodologies used in these studies did not quantitatively measure peptide internalization. The resulting dataset consisted of 2,371 peptides with quantitative values, varying units, and uptake-evaluation techniques.

Subsequently, statistical analysis of the data was performed using RStudio (version 1.4.1717) [30]. The tidyverse package (version 1.3.1), which includes dplyr for data manipulation and ggplot2 for data visualization [31], was used for the data analysis.

To construct the processed dataset, the Python programming language (version 3.10.8) was used in combination with NumPy (version 1.24.1), Pandas (version 1.5.2), and scikit-learn (version 1.2.0). The usable samples were extracted and are accessible on GitHub (https://github.com/MoreiraLAB/poseidon). The dataset underwent several uniformization steps, such as incubation time uniformization, temperature encoding, valid peptide sequence generation, and conversion of the target variable (peptide uptake) to log10 form, which provides a more comprehensible scale.

Feature extraction was performed as described, resulting in 1,330 usable features after removing features with null variance. Each retained feature can be fully explained and linked to real information, as depicted on the website. A random 70–30 data split was performed, and the data were normalized using the mean and standard deviation of the training set, which were applied to both the training and test sets. The decision to retain dimensionality without reduction was supported by several factors: the sample size of the dataset, the relevance of domain-specific features, the robust performance of the model on an independent test set encompassing 30% of the total data, the need for transparency to facilitate interpretability, and the model's evident ability to withstand overfitting despite its high dimensionality. Notably, this high dimensionality was driven by the inclusion of relevant one-hot encoding features that accounted for 98% of the feature space.
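The split and normalization described above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the published code: the function name `split_and_standardize` and the random seed are hypothetical, and null-variance features are detected here on the training portion so that no division by zero can occur, whereas the text does not state at which stage they were removed.

```python
import numpy as np


def split_and_standardize(X, y, test_fraction=0.3, seed=0):
    """Random split, then z-score both sets with training-set statistics."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    X_train, X_test = X[train_idx], X[test_idx]

    # Drop constant (null-variance) features, judged on the training set.
    keep = X_train.std(axis=0) > 0
    X_train, X_test = X_train[:, keep], X_test[:, keep]

    # The mean and standard deviation come from the training set only and
    # are reused unchanged on the test set, avoiding information leakage.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    return ((X_train - mean) / std, (X_test - mean) / std,
            y[train_idx], y[test_idx], keep)


# Synthetic data: 100 samples, 5 features, one of them constant.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
X[:, 2] = 1.0
y = rng.normal(size=100)
X_train, X_test, y_train, y_test, keep = split_and_standardize(X, y)
```

After this step the training features have zero mean and unit standard deviation by construction, while the test features are only approximately standardized, which is the expected behavior.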

Machine learning models deployment and optimization

After constructing the training and test sets as described, a battery of ML models from scikit-learn (1.2.0) [32] was implemented following hyperparameter optimization (Table 3). In particular, xgboost (1.7.3) [33] and TensorFlow (2.11.0) [34] models were optimized using ray[tune] (2.2.0) [35] (parameter ranges in Table 2). The tested models were a Support Vector Machine (SVM), Stochastic Gradient Descent (SGD), k-Nearest Neighbors (kNN), Decision Tree (DT), Random Forest (RF), Extreme Randomized Trees (ERT), eXtreme Gradient Boosting (XGB), Deep Neural Network (DNN), and forked Neural Network (fNN). While most of these models are standard imports from their respective packages, the fNN was designed for this study and comprises a neural network with a separate point of entry for each feature block type. All models were fitted on the training set and evaluated on an independent test set. We evaluated the performance of our regression ML models using several metrics, including Root Mean Squared Error (RMSE), Mean Squared Error (MSE), Mean Absolute Error (MAE), Pearson correlation, Spearman correlation, and coefficient of determination (r2).
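For reference, the evaluation metrics listed above can be computed as in the following sketch. The helper names `rankdata` and `regression_metrics` are hypothetical. The sketch assumes that r2 is the coefficient of determination, 1 − SS_res/SS_tot, and that Spearman correlation is the Pearson correlation of average ranks; the text does not state which implementations were used.

```python
import numpy as np


def rankdata(values):
    """Average ranks (1-based), so tied values share the same rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, Pearson, Spearman and r2 as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "MAE": float(np.mean(np.abs(residuals))),
        "Pearson": float(np.corrcoef(y_true, y_pred)[0, 1]),
        "Spearman": float(np.corrcoef(rankdata(y_true),
                                      rankdata(y_pred))[0, 1]),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy values on a log10-uptake-like scale, for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

Because RMSE is the square root of MSE, the two metrics rank models identically; reporting both mainly helps readers compare errors on the scale of the target variable.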

Table 2 Hyperparameter optimization parameters for all the tested models

POSEIDON front-end implementation

A web server freely available to the scientific community can be found at https://moreiralab.com/resources/poseidon/. The web server runs on Nginx under a Linux operating system. To develop the web interface, Flask [36] was used as the back end, and HTML, CSS, and JavaScript were used for the front end, in conjunction with Plotly [37] for dynamic plot visualization.

Upon navigating to the POSEIDON platform, users are greeted with an intuitive interface designed to facilitate the submission of peptide sequences for prediction. Detailed instructions are provided on the homepage to guide users through the input process. This involves the following steps:

  • Users input peptide sequence(s) into a designated text field within the interface.

  • After entering the sequence, users can customize properties, such as peptide concentration, incubation time, temperature, and cell line type.

  • Users are required to provide a valid email address to which the prediction results will be sent.

  • To initiate the prediction process, users must click the 'Submit' button.

After submission, the POSEIDON prediction is swiftly computed, and the results are delivered on a separate page. Users are notified via email when a run succeeds.

Data and associated code underpinning the analyses presented herein are accessible via the repository at https://github.com/MoreiraLAB/poseidon.

Results

Database description

The POSEIDON database is a unique collection of recent information on CPPs, including quantitative cellular uptake values that have been experimentally obtained for each peptide. In addition to including all peptides in the CPPsite 2.0 database for which experimental quantitative cellular uptake data are available, POSEIDON has been highly enriched with up-to-date mining of the available literature.

A dataset of 2,371 entries was obtained through several steps of data acquisition and preprocessing, providing information about uptake evaluation methods, uptake conditions (such as temperature, cell line, and time of CPP incubation), uptake values, uptake units, cargoes, and peptide sequence. Both the CPPsite 2.0 and POSEIDON databases share information on peptide sequences, characteristics, modifications, validation methods, and cargo types. However, POSEIDON stands out because it offers quantitative uptake values for CPPs, whereas CPPsite 2.0 provides qualitative data.

POSEIDON covers all types of CPPs, including those composed of L-amino acids, D-amino acids, mixed L- and D-amino acids, and non-natural amino acids (Fig. 2A). The composition of CPPs revealed that certain residues, such as arginine, lysine, and leucine, were prominent in CPPs, whereas methionine, aspartate, tyrosine, and asparagine residues were not enriched (Fig. 2C). The positively charged residues arginine and lysine, which are abundant in POSEIDON (Fig. 2B), interact with negatively charged cell membrane components, increasing cellular uptake. The amphiphilic nature of CPPs, owing to their cationic and hydrophobic residues, enhances their interactions with the cell membrane and improves cell penetration [38] or cargo interaction [39].

Fig. 2

Representation of peptide composition in the POSEIDON database, raw data in blue, and benchmark data in red based on A chirality/modifications of CPP, B the type of amino acid, and C quantification of the amino acid composition of CPPs. The data pertain to peptides without non-natural amino acids

This database provides peptide sequences, from which physicochemical properties can be directly calculated. Our dataset contained a large number of peptides with lengths of less than 10 amino acids (n = 821) and between 11 and 20 amino acids (n = 1,029), as shown in Fig. 3A. Most CPPs exhibit molecular weights ranging from 1 to 1.5 kDa. Both charge distribution and peptide length enable CPPs to interact with various cell-surface molecules, significantly influencing the selection of an entry pathway [40]. Although several factors are involved, such as the physicochemical properties of the peptide and its cargo, the internalization routes of CPPs are primarily directed towards two major pathways: endocytosis (an active or energy-dependent process) and membrane translocation (a direct or passive energy-independent process) [41]. We therefore analyzed the distribution of the cell lines, as they play a key role in peptide cell penetration. POSEIDON showed that more than 100 cell lines are associated with CPP internalization. As shown in Fig. 3B, most CPPs were tested in HeLa cells (n = 597), followed by MCF7 (n = 162), A549 (n = 137), CHO (n = 97), CHO-K1 (n = 94), and HEK293T cells (n = 82). The diversity of cell lines ensures that CPP/cell line combinations can be analyzed using this database.

Fig. 3

CPP features in both datasets (raw data in blue and benchmark data in red). A Length of peptide sequences in the database. B The 10 most used cell lines according to the dataset

Scientific studies have shown that CPPs can carry a variety of cargoes, ranging from fluorophores to nucleic acids. Thus, the cargoes associated with each peptide are available in POSEIDON. As expected, our dataset demonstrated that fluorescein isothiocyanate (FITC), fluorescein, and carboxyfluorescein were the cargoes most strongly associated with CPPs (Fig. 4A). As shown in Fig. 4B, most CPPs in the dataset were associated with fluorophores (n = 4,368), followed by small ligands (n = 795), nanoparticles (n = 633), proteins (n = 600), and nucleic acids (n = 471).

Fig. 4

Distribution of CPPs in POSEIDON according to A cargo, B cargo type, C uptake evaluation methods, and D uptake units. A and B represent both datasets: raw data in blue and benchmark data in red

Flow cytometry was the most commonly employed method for uptake evaluation in this dataset, accounting for 1,349 entries, whereas fluorescence microscopy, fluorescence spectroscopy, and Fluorescence-Activated Cell Sorting (FACS) were employed for 289, 247, and 155 entries, respectively (Fig. 4C). However, as shown in Fig. 4D, there was a high degree of variability in the uptake units, and several studies used slightly different designations for the same unit.

After standardizing identical units to a unique designation, the mean fluorescence intensity was the most frequently employed unit in this dataset, with 481 entries. The different units presented in Fig. 4D highlight the lack of standardization in CPP uptake evaluations conducted in previous studies, which hinders the comparison and analysis of CPP uptake data. Although there are currently no standardized methods for CPP uptake evaluation, flow cytometry has been employed significantly more frequently than the other methods. This suggests that it is possible to establish a general method using specific, easily attainable controls, allowing a large amount of quantitative data to be acquired and compared more adequately and easily. This database also provides information on the temperature and time of CPP incubation. Due to the nature of CPPs and their internalization mechanisms, changes in certain conditions, such as temperature, can significantly impact the uptake of CPPs by cells, often due to alterations in the underlying mechanism [42,43,44]. Thus, these data are highly valuable for the development of new approaches.

Processed database description

The POSEIDON database uptake-prediction methods developed in this study rely exclusively on fluorescence measurements. This approach was selected because other methods can produce inconsistent results, leading to discrepancies in the derived uptake units. Therefore, to establish a reliable benchmark dataset, we selected CPPs that were evaluated using fluorescence methods, resulting in a dataset of 1,274 entries. After removing outliers, the final dataset contained 1,263 entries.

As shown in red in the figures, most amino acids were L-amino acids (Fig. 2A) and were essentially hydrophobic and polar charged (Fig. 2B). Similar to the raw dataset, arginine, lysine, and leucine were present in large numbers in the CPP sequences, in contrast to methionine, aspartate, asparagine, and tyrosine residues, which were not prominent in CPPs (Fig. 2C).

The benchmark dataset included CPP sequences of various sizes, with sequences consisting of 11–20 residues being the most common (n = 619), followed by sequences with fewer than 10 residues (n = 316), and sequences consisting of 21–30 residues (n = 265) (Fig. 3A, red). In terms of cell lines, HeLa cells were the most frequently used, as in the raw dataset. However, the benchmark dataset showed the emergence of HepG2, Jurkat, and bEnd.3 cell lines as among the most frequently used cell lines for CPPs. Regarding cargo, the benchmark dataset showed a slightly different trend than the raw dataset, with Dil, rhodamine (Rho), small interfering RNA (siRNA), and TAM being highly associated with CPPs. Fluorophores were the most common cargo (n = 1,249), followed by nanoparticles (n = 198), small ligands (n = 165), nucleic acids (n = 110), and proteins (n = 56) (Fig. 4B, red).

Additional information emerges from a correlation analysis between the features and the processed target variable. Among the 30 features that exhibited the highest Pearson correlation with the target variable (Additional file 1: Table S4), 50% were position-encoding features and one-third were genomic features. Only two whole-sequence features were present in the top 30, whereas cargo contributed three. Although experimental features such as concentration and temperature were not in the top 30, they appear among the top 100, as shown in the additional figures on the website.
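A ranking of this kind can be obtained by computing the Pearson correlation of every feature column with the target and sorting by magnitude. The sketch below is illustrative only: the function name `top_correlated_features` and the feature names are hypothetical, and ranking by absolute value is an assumption, since the text does not say whether negative correlations were ranked by magnitude.

```python
import numpy as np


def top_correlated_features(X, y, names, k=30):
    """Rank features by absolute Pearson correlation with the target."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # center each feature column
    yc = y - y.mean()                # center the target
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        # Constant features have zero denominator; report r = 0 for them.
        r = np.where(denom > 0, (Xc * yc[:, None]).sum(axis=0) / denom, 0.0)
    order = np.argsort(-np.abs(r))[:k]
    return [(names[i], float(r[i])) for i in order]


# Toy matrix: one perfectly correlated, one constant, one weakly correlated.
y_toy = np.array([0.0, 1.0, 2.0, 3.0])
X_toy = np.array([[0.0, 1.0, 0.0],
                  [1.0, 1.0, 1.0],
                  [2.0, 1.0, 0.0],
                  [3.0, 1.0, 1.0]])
ranking = top_correlated_features(
    X_toy, y_toy, ["feature_a", "feature_b", "feature_c"], k=2)
```

In the toy example, `feature_a` ranks first with a correlation of 1, followed by `feature_c`; the constant `feature_b` is assigned a correlation of 0 and falls outside the top two.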

Performance of the different predictors

After implementing the hyperparameter optimization pipeline (Table 3), the best-performing models were XGB and DNN, as indicated by their evaluation metrics on the independent test set that did not participate in either training or hyper-parameter optimization (Table 4). Specifically, both models achieved high r2 scores, exceeding 0.76, whereas the other methods barely surpassed the 0.70 threshold. Furthermore, they exhibited high correlation metrics, with Pearson correlations above 0.87 and Spearman correlations above 0.88. Consequently, the final prediction pipeline of POSEIDON displays predictions generated by both DNN and XGB models.

Table 3 Optimal parameters for optimized ML models
Table 4 Results for the best performance of each optimized ML model

Discussion

CPPs have great potential in therapy and diagnosis; however, identifying new and efficient CPPs can be costly and time-consuming. Consequently, computational biological studies have become increasingly important in this field, although they have mainly focused on the qualitative features of CPPs. POSEIDON addresses this gap by offering a novel up-to-date database that includes quantitative experimental uptake efficiency data and serves as a benchmark for the field. The POSEIDON database and prediction pipeline have provided several important insights into the rapidly evolving field of CPP research. First, it is evident that effective CPPs are characterized by an abundance of positively charged amino acids, which is biochemically logical because it allows peptides to leverage the electrostatic differences inside and outside the cell, thereby augmenting cellular internalization. The internalization mechanism of CPPs nevertheless remains a subject of ongoing debate, with CPP concentration, charge, and amphipathicity emerging as crucial factors. The intricate processes governing CPP internalization involve a combination of endocytic and direct translocation mechanisms [41]. The positive charge, particularly from arginine residues, significantly influences CPP uptake, with arginine being more favorable for delivery and CPP activity than lysine. Amphipathic peptides can directly penetrate the cell membrane at low concentrations, whereas non-amphipathic CPPs rely on endocytosis [6]. Regarding CPP concentration, endocytosis is typically the predominant mechanism under physiological conditions and at low peptide concentrations. In contrast, at higher peptide concentrations, direct translocation across the plasma membrane becomes more prevalent [41]. Further investigation of the specific mechanisms employed by CPPs with different physicochemical properties and concentrations will provide valuable insights into the complex dynamics governing cellular uptake.

Second, fluorophores have a significant influence on measured CPP activity: their presence is methodologically required, and they are highly correlated with the uptake variable, implying that they may take part in molecular interactions. Moreover, the presence of cargo can modify the CPP uptake pathway, as demonstrated by the observed impact of cargo size and binding methodology on the CPP translocation mechanism [41, 45].

Third, genomic descriptors play a crucial role in this process, which was not adequately addressed before POSEIDON. Notably, mutation of the NRAS gene, which is linked to cell division in cancer, was found to be the variable most correlated with CPP uptake, followed closely by mutation of IDH1, which is associated with the expression of isocitrate dehydrogenase 1, a key player in the Krebs Cycle. Exploring the biological relationship between these genes (and several other highly ranked ones) and CPPs might be a worthy endeavor.

Fourth, CPP penetration into cells is influenced by the cell line owing to differences in membrane composition, receptor expression, and intracellular mechanisms. These factors affect the effectiveness and penetration mechanism of CPPs. Understanding CPP behavior in specific cell lines is crucial for accurate results because findings may not apply universally: studies on various cell lines reveal cell-dependent preferences for specific CPPs [41]. This also supports targeted CPP application in various biological and therapeutic contexts.

The POSEIDON database is not only the largest but also a comprehensive, curated database with CPP information. The inclusion of an extensive range of experimental characteristics in our dataset underscores the complexity inherent in CPP behavior. The prediction method employed by POSEIDON is unique in that it effectively considers CPP uptake activity as a continuous variable, unlike previous efforts that only featured categorical predictions. Our approach also includes multiple previously unused sources of information, which will allow users to test sequence anomalies, select tissue-specific cell lines, choose up to two cargoes per peptide, and adjust experimental conditions, such as temperature, concentration, and incubation time. We ensured that the algorithm incorporated all relevant parameters, thereby enabling it to capture intricate and nonlinear relationships among the variables. This approach enhances the predictive capacity of the model, making it adept at handling multifaceted experimental conditions encountered in various studies.

Assessing the POSEIDON ML approach in comparison with other prediction methods is challenging, mainly because of the limited availability of similar approaches. One exception is the work of Dowaidar et al., who developed Fragment Quantitative Structure–Activity Relationship (FQSAR) models [46]. These models were specifically tailored to forecast the biological activity of CPPs in peptide-based transfection systems (PBTS) and were trained on only 11 data points, yet achieved r2 values ranging from 0.906 to 0.961 across various models. POSEIDON nevertheless stands out with high correlation metrics and low errors on a much larger dataset, demonstrating its ability to predict CPP uptake under different conditions.

Conclusion

POSEIDON provides the first quantitative data on cellular uptake, methodology, units, and experimental conditions, making it an exceptional tool. The POSEIDON database, a recently launched, open-source, and comprehensive resource, focuses exclusively on curated CPPs with quantitative uptake values. Each CPP in the database is accompanied by physicochemical properties, cell line, cargo, sequence, uptake evaluation method, concentration, temperature, and incubation time. The POSEIDON predictor is also groundbreaking, as it was the first tool to predict CPP uptake based on quantitative uptake and genomic data. With its dynamic, free, and easy-to-use interface, users can easily submit a peptide sequence and obtain computational predictions of its uptake in various cell lines. Additionally, users can customize properties, such as peptide concentration, incubation time, temperature, and cell line type. The POSEIDON database is a unique resource for researchers to develop new methodologies and predictors for CPP sequence design, based on uptake values.