Background

The CRISPR-Cas adaptive immune system is one of the most widespread immunity strategies in prokaryotes against invading bacteriophages and plasmids [1, 2]. To counteract and overcome different CRISPR-Cas immune systems, bacteriophages have evolved anti-CRISPR proteins (Acrs), which were first discovered in Pseudomonas aeruginosa phages in 2013 [3]. Since then, a growing number of Acrs have been shown to inactivate multiple CRISPR-Cas subtypes [3,4,5,6,7].

Several methods have been proposed to identify Acrs, including "guilt-by-association" studies [6, 8], self-targeting CRISPR arrays [6, 7], and metagenomic DNA screening [9, 10]. These methods assume that new Acrs resemble previously characterized ones. However, most known Acrs share little sequence similarity with one another. Traditional screening methods based on homology search are therefore unreliable and require substantial prior knowledge of known Acrs. For instance, the "guilt-by-association" method searches for homologs of helix-turn-helix (HTH)-containing proteins that are typically encoded downstream of Acrs [11]; its performance becomes unstable when the known Acrs share low similarity with the queried protein. A computational approach that requires less prior knowledge of known Acrs would therefore provide a new perspective on Acr identification. Machine learning algorithms with appropriate features could reveal the underlying characteristics of Acrs and identify new Acrs without prior knowledge.

Recently, several machine learning methods have been presented for predicting Acrs, and a number of Acr-related web resources are available, such as Anti-CRISPRdb [12], AcrHub [13], AcrDB [14], CRISPRminer2 [15], AcRanker [14, 16], AcrFinder [17], AcrCatalog [18] and PaCRISPR [19]. Anti-CRISPRdb, AcrDB, and AcrCatalog are online Acr databases, while AcrHub, CRISPRminer2, AcRanker, AcrFinder and PaCRISPR are prediction web servers. Eitzinger et al. developed AcRanker, which uses an XGBoost ranking model to predict candidate Acrs based only on protein sequence information [16]. Wang et al. proposed PaCRISPR, an ensemble learning-based predictor, to identify Acrs from protein datasets derived from genome and metagenome sequencing projects [19]. Gussow et al. proposed a machine learning approach using a random forest model with extremely randomized trees to expand the repertoire of Acr families [20]. These machine learning methods have contributed greatly to the discovery of Acrs. However, the most appropriate features or feature combinations for Acr prediction have not been systematically assessed. For instance, PaCRISPR identifies Acrs using only evolutionary features, and AcRanker uses only amino acid composition features. Gussow et al. predicted Acrs based on sequence alignment and a heuristic secondary screen of a few known Acrs. Because previous work neither fully assessed feature combinations nor avoided reliance on prior knowledge, we propose a novel, effective and robust machine learning framework to help identify Acrs.

This study presents an ensemble machine learning method, called PreAcrs, to efficiently and accurately predict Acrs from protein sequences. Specifically, we used three features and eight different machine learning methods to train our model. The training dataset contained 412 experimentally validated Acrs and 412 non-Acrs, and the independent dataset contained 176 experimentally validated Acrs and 176 non-Acrs. PreAcrs outperformed other existing predictors, achieving an AUC of 0.972 on the independent dataset.

Results and discussion

Performance evaluation of five different features

To find appropriate feature encoding methods, we evaluated and compared the performance of nine machine learning methods (SVM, KNN, MLP, LR, RF, XGBoost, LightGBM, CatBoost and the ensemble methods) for each feature encoding based on randomized fivefold cross-validation. The cross-validation results of the classifiers are shown in Table 1.

Table 1 Performance comparison of different features and classifiers based on the fivefold cross-validation

We used five feature encoding methods (AAC, PAAC, PSSM_AC, RPSSM, and SSA) to convert each protein into a feature vector. RPSSM was the strongest of the five encodings, achieving the highest AUC value across the eight classifiers (Fig. 1), with PSSM_AC second only to RPSSM. Both of these features are derived from PSSM files, indicating that evolutionary information makes an outstanding contribution to Acr prediction. RPSSM outperformed PSSM_AC in most classifiers (except LR). The pre-trained model feature SSA also achieved good performance for most classifiers, better than the sequence features AAC and PAAC. PAAC, which contains more sequence information than AAC, showed higher AUC values than AAC for all classifiers. The sequence features AAC and PAAC performed relatively poorly compared with the other features. One explanation is that evolutionary features and the pre-trained feature encode more valuable and appropriate information about protein sequences, whereas sequence features may contain redundant information that reduces the accuracy of Acr prediction. The PreAcrs model therefore uses the features PSSM_AC, RPSSM and SSA. As shown in Additional file 2: Table S2, the RPSSM-based model achieved the best prediction performance among the three features on the independent test, the PSSM_AC-based model was second, and the SSA-based model showed lower prediction accuracy than the other two. In addition, the AUC value of PSSM_AC&SSA was 0.953, rising to 0.969 after adding the RPSSM feature. The two-feature ensembles PSSM_AC&RPSSM and RPSSM&SSA also achieved excellent AUC values (0.967 and 0.961, respectively). Therefore, RPSSM contributed the most to the PreAcrs model in predicting Acrs.

Fig. 1

The ROC curves of the five single features and the AAC&PAAC&RPSSM ensemble feature based on fivefold cross-validation

Performance evaluation of eight different single classifiers and ensemble classifiers

For most feature encodings, the LightGBM, CatBoost and SVM classifiers outperformed the other single classifiers (excluding the ensemble classifiers) in terms of PRE (Table 1). This observation is consistent with Fernandez-Delgado et al. [21], who found that SVM was most likely the best classifier among 17 machine learning methods evaluated on various public datasets. Moreover, Ke et al. [22] demonstrated that LightGBM achieves better performance than other models on multiple public datasets and can handle high-dimensional features and large-scale data. CatBoost has been shown to be superior to XGBoost and LightGBM on a set of publicly available datasets [23]. Although LightGBM obtained the highest PRE values among the eight classifiers for PSSM_AC and SSA in this study, CatBoost performed better than LightGBM for RPSSM. In addition, CatBoost showed excellent performance on other metrics, such as AUC and MCC. SVM obtained the highest PRE values among the eight classifiers for the AAC and PAAC features. These results imply that SVM, LightGBM and CatBoost provide outstanding prediction ability, with SVM tending to excel on sequence features. Additionally, the LightGBM classifier achieved the highest PRE value of 1.00 when trained on the PSSM_AC feature, meaning that its predicted positive samples are very likely to be true positives, which could be beneficial for the virtual screening of Acrs.

To compare the performance of the various classifiers fairly, other measurements were also considered, such as SP, SN, and MCC. As a crucial evaluation metric, MCC takes all four entries of the confusion matrix into account and therefore reflects performance comprehensively. CatBoost demonstrated powerful and stable performance in terms of MCC across the five features, while MLP outperformed the other single classifiers on the RPSSM feature according to MCC. The highest MCC value overall was 0.763, obtained when MLP was trained on the RPSSM feature. These results provide further evidence that performance varies considerably across features and classifiers, and that relying on a single feature and a single model to identify Acrs is unreliable.

Although some single classifiers showed good performance for predicting Acrs, relying on only one classifier may not be sufficiently robust and reliable. To build a more comprehensive, reliable, and robust predictor, three ensemble methods based on the eight single classifiers were adopted in this study, each integrating the base classifiers according to a different principle. Table 1 and Fig. 2 show that the three ensemble methods achieved higher AUC values than the single classifiers for most features, demonstrating the superiority of ensemble learning. This observation is supported by the study of Zou et al. [24].

Fig. 2

The six metrics (PRE, SP, SN, F-score, ACC and MCC) of the various classifiers for the five feature encodings based on fivefold cross-validation

Performance evaluation of various ensemble features

As mentioned above, the five features were each trained with eight different classifiers. Since a single feature cannot comprehensively represent Acrs, we attempted to integrate the five single features in two ways: ensemble features and combination features. For combination features, we concatenated single features into one vector to train the models [25,26,27], and explored the contribution of a variety of combined features to the Acr prediction models (Additional file 1: Table S1). For ensemble features, we first trained the eight different classifiers (including the ensemble classifier) with the five single features, and then integrated the resulting per-feature classifiers into an ensemble model. This study discusses ensemble features in detail because they performed better than combination features. For every single feature in each classifier, we obtained a probability score of being an Acr. The output of a two-feature ensemble model is obtained by averaging the predictive scores of the two single features in the same model; for example, we averaged the scores obtained by the AAC feature and the PAAC feature trained in the same SVM model and labelled the result 'AAC&PAAC'. Three-feature ensemble models were obtained by averaging the predictive scores of three single features in the same model, denoted Feature1&Feature2&Feature3, and the four- and five-feature ensembles were constructed analogously, as shown in the sketch below. The averaged predictive score was used as the final score of the ensemble feature in each classifier. In the cross-validation results, the ensemble features achieved good performance for Acr identification. Comparing all ensemble features, PSSM_AC&RPSSM&SSA showed the best performance with the highest AUC value, followed by PAAC&PSSM_AC&RPSSM and then PSSM_AC&RPSSM. Notably, all of the top 12 ensemble features in Additional file 2: Table S2 include the RPSSM encoding, again demonstrating that RPSSM plays an essential role in Acr prediction.
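The score-averaging step is simple; the following is a minimal NumPy sketch (the function name and the toy score values are ours, for illustration only):

```python
import numpy as np

def ensemble_scores(*score_vectors):
    """Average per-protein Acr probability scores produced by the same classifier
    trained on different single features (e.g. AAC and PAAC)."""
    return np.mean(np.vstack(score_vectors), axis=0)

# e.g. the 'AAC&PAAC' score for three test proteins (toy scores)
aac_scores = np.array([0.91, 0.20, 0.75])
paac_scores = np.array([0.85, 0.30, 0.60])
print(ensemble_scores(aac_scores, paac_scores))  # -> [0.88, 0.25, 0.675]
```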

Performance evaluation of ensemble learning model

As shown above, ensemble classifiers with the five single features demonstrated an excellent ability to predict Acrs, and the Sta-LR method obtained the best overall performance. We therefore used the Sta-LR classifier to train the various features in this study. We also compared combination features with ensemble features in the same model; the ensemble features achieved superior performance to the combination features in most classifiers. Among all models, the Sta-LR classifier using the PSSM_AC, RPSSM and SSA features (the three-feature ensemble PSSM_AC&RPSSM&SSA) achieved the highest average AUC value of 0.969, together with a PRE of 0.978, an MCC of 0.754, an ACC of 0.866 and an F-score of 0.848 in the fivefold cross-validation test. Based on these findings, we constructed the PreAcrs predictor with the following default setting: the eight machine learning classifiers (SVM, KNN, MLP, LR, RF, XGBoost, LightGBM, CatBoost) are integrated into an ensemble classifier (Sta-LR); the three features PSSM_AC, RPSSM, and SSA are each trained by a separate Sta-LR classifier, yielding three models; and the final PreAcrs score is obtained by averaging the scores of the three models. The PreAcrs predictor achieved stable and accurate prediction performance in both the fivefold cross-validation and the independent dataset.
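The following is a minimal sketch, using scikit-learn, xgboost, lightgbm and catboost, of how this default configuration could be assembled. Hyperparameter grids, mRMR feature selection and cross-validation are omitted, and the function and variable names are ours rather than part of the published implementation:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

def make_sta_lr():
    """Stacking ensemble of the eight base classifiers with an LR meta-learner."""
    base = [
        ("svm", SVC(probability=True)),
        ("knn", KNeighborsClassifier()),
        ("mlp", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier()),
        ("xgb", XGBClassifier(learning_rate=0.1)),
        ("lgbm", LGBMClassifier()),
        ("cat", CatBoostClassifier(verbose=0)),
    ]
    return StackingClassifier(estimators=base, final_estimator=LogisticRegression())

def preacrs_predict(train_features, y_train, test_features):
    """train_features / test_features: dicts mapping a feature name
    (PSSM_AC, RPSSM, SSA) to its encoded matrix for the same proteins."""
    scores = []
    for name in ("PSSM_AC", "RPSSM", "SSA"):
        model = make_sta_lr().fit(train_features[name], y_train)
        scores.append(model.predict_proba(test_features[name])[:, 1])
    return np.mean(scores, axis=0)  # averaged Acr probability per protein
```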

Performance comparison with other existing methods

To further evaluate the performance of the PreAcrs predictor, we compared PreAcrs with the state-of-the-art Acr predictor PaCRISPR. This machine learning model was proposed by Wang et al. [19] and significantly outperformed other methods such as AcRanker and BLAST on their independent dataset. PaCRISPR adopts four evolutionary features (PSSM-composition, DPC-PSSM, PSSM_AC and RPSSM) and is constructed from 10 SVM classifiers. In addition, a BLAST-based predictor, AcRanker and a hidden Markov model (HMM) based predictor were included in the comparison. For the BLAST-based predictor, each protein in the independent dataset was searched against all samples in the training dataset using BLAST+ [28] and was predicted as an Acr when its best hit was a positive sample. The predictions of the other three predictors were obtained from the web server (https://pacrispr.erc.monash.edu/AcrHub).

Figure 3 shows that PreAcrs outperforms the other predictors on the independent dataset in terms of AUC and AUPRC, demonstrating that PreAcrs is better suited than the other predictors to capturing the intrinsic patterns of non-homologous Acrs. According to the other metrics (Table 2), HMM obtained higher PRE and SP values than PreAcrs, but this does not indicate that HMM outperformed PreAcrs; it simply reflects a lower false positive count, most likely because HMM is prone to predicting queried proteins as non-Acrs. HMM uses probabilistic models to search for homologous protein sequences, and as a homology-based baseline it made biased predictions: it labelled Acrs with extremely high precision (the lowest FP) but misclassified many true Acrs as non-Acrs (the highest FN). In other words, HMM obtained the best PRE at the cost of predicting most Acrs as non-Acrs, an observation also reported by Wang et al. [19]. When FN and FP are both taken into account, HMM performed poorly. According to the more comprehensive metrics, such as ACC, F-score and MCC, PreAcrs outperformed the other four approaches.

Fig. 3

The precision-recall curves (A) and ROC curves (B) of the two existing state-of-the-art methods and PreAcrs on the independent dataset

Table 2 Performance comparison between PreAcrs and existing methods based on the independent test

As a case study, we list the predictive scores of five experimentally validated Acrs from the independent test to further evaluate the performance of PreAcrs (Table 3). PreAcrs achieved better performance than PaCRISPR and AcRanker. For AcrIIA7 and AcrIIA9, PaCRISPR predicted lower scores (the predictive score of AcrIIA7 was 0.407), whereas PreAcrs gave these Acrs higher scores. For AcrIIC2, PaCRISPR performed better, but PreAcrs also gave a considerable score. PaCRISPR considers only four features derived from evolutionary information and an SVM model, while PreAcrs additionally incorporates the SSA feature from a pre-trained model and eight different models. By considering more information and more classifiers, PreAcrs showed more robust and accurate prediction performance.

Table 3 The predictive scores of the case study Acrs

Conclusions

The identification of candidate Acrs plays a vital role in manipulating the CRISPR-Cas machinery as a tool for gene editing or gene therapy. Using machine learning to identify new Acrs from protein sequences can accelerate their discovery. In this work, we proposed a machine learning-based ensemble framework, PreAcrs, to accurately and efficiently identify Acrs from protein sequences. PreAcrs extracts distinctive characteristics from experimentally validated Acrs by combining evolutionary features with a pre-trained model feature across multiple models, and the features are trained by an ensemble classifier built from eight base classifiers. The PreAcrs predictor displayed good performance for predicting new Acrs in terms of both prediction accuracy and robustness. We anticipate that PreAcrs will be widely used for Acr prediction and help researchers gain a comprehensive understanding of Acrs. Although PreAcrs performs well compared with existing methods, it still has some limitations. One limitation is that only the mRMR algorithm is used to select significant features, so biases in this step may reduce the predictive accuracy. Another limitation is that PreAcrs does not provide a visual, user-friendly website, which may make it difficult for some biologists to analyze Acrs. In future work, we may use multiple feature selection algorithms to calculate feature importance and obtain a more reasonable feature set, and build a powerful, user-friendly and interactive website.

Methods and materials

Overall framework of PreAcrs

Figure 4 shows the overall workflow of the PreAcrs framework, including five major steps: Dataset collection and curation, Feature encoding, Feature selection, Model training, and Model validation. These steps are described in the following sections.

Fig. 4

The flowchart of the PreAcrs framework for Acrs prediction. The five major steps for constructing PreAcrs include data collection, feature encoding, feature selection, model construction, and performance evaluation

Dataset collection and curation

To build a powerful Acr predictive model, we constructed a training dataset and an independent test dataset, each comprising two parts: positive samples (experimentally validated Acrs) and negative samples (non-Acrs). As mentioned above, Anti-CRISPRdb, AcrDB, and AcrCatalog are online databases of anti-CRISPR proteins. The Anti-CRISPRdb database was last updated in January 2021 and contains 1378 experimentally validated entries.

The AcrDB and AcrCatalog are databases of computationally predicted Acrs. In this study, we collected experimentally validated Acrs from Anti-CRISPRdb, the most recently updated database, which contains more experimentally validated Acrs than the others. We extracted 1378 experimentally validated Acrs from Anti-CRISPRdb [12] and 17 newly discovered, experimentally validated Acrs from NCBI. To build a robust machine learning model and eliminate redundant Acrs, we used CD-HIT [29] to remove highly homologous sequences, setting the identity threshold to 70% (sequences with more than 70% similarity were removed). This yielded 588 Acr sequences with lengths ranging from 50 to 350 residues. The 588 Acrs were randomly divided into two parts with a ratio of 7:3, giving 412 Acrs in the training dataset and 176 Acrs in the independent dataset.
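For illustration, redundancy removal at a 70% identity threshold can be invoked as follows; this is a sketch assuming the cd-hit binary is installed and on the PATH, and the FASTA file names are hypothetical:

```python
import subprocess

subprocess.run([
    "cd-hit",
    "-i", "acrs_collected.fasta",  # hypothetical input: the collected Acr sequences
    "-o", "acrs_nr70.fasta",       # non-redundant output sequences
    "-c", "0.7",                   # sequence identity threshold of 70%
    "-n", "5",                     # word size suitable for thresholds >= 0.7
], check=True)
```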

Because there is no standard set of non-Acrs, constructing a comprehensive and reasonable non-Acr dataset is a challenging and important task. In this study, we followed the work of Wang et al. [19] to construct the non-Acr dataset. Because the range of Acr sequence lengths is fixed and most Acrs have been found in a limited set of phages and mobile genetic elements (MGEs), the negative samples were selected from UniProt using four strict criteria: (1) they must not be known or putative Acrs; (2) they must be isolated from phages or bacterial MGEs (known or putative MGEs); (3) they must have < 40% sequence similarity to each other and to the 588 positive samples; and (4) their lengths must fall between 50 and 350 residues. According to these criteria, 1571 non-Acrs were obtained. We then randomly selected 412 non-Acrs as negative samples for the training dataset and 176 non-Acrs as negative samples for the independent dataset, with each negative sample included in only one dataset. In this way, the training dataset contains 412 positive and 412 negative samples, and the independent test dataset contains 176 positive and 176 negative samples (Table 4). In addition, we chose 5 Acrs from the independent dataset as a case study.

Table 4 The statistics of datasets employed in this study

Feature encoding

To find features that better represent Acrs, we first evaluated 18 types of features, including the composition of k-spaced amino acid pairs (CKSAAP), amino acid composition (AAC), pseudo amino acid composition (PAAC), bidirectional long short-term memory (BiLSTM) embeddings, soft sequence alignment (SSA) embeddings, PSSM_AC, RPSSM and PSSM-composition (Table 5 and Additional file 3: Table S3). Considering the computational requirements and predictive performance, we selected five features (AAC, PAAC, PSSM_AC, RPSSM, SSA), which fall into three groups: sequence features, evolutionary features, and pre-trained model features. These features have been widely applied in feature encoding research [19, 30, 31] and have achieved good performance in predicting protein properties and functions [32,33,34,35,36,37,38]. The five features adopted in this study are described below.

Table 5 Features: the sequence and structural features calculated and their dimensionalities

AAC

As one of the most important features, amino acid composition (AAC) has been successfully applied in many bioinformatics fields, for example, protein structure classification [30], thermophilic protein prediction [39], and protein–protein interaction identification [40]. For AAC, each sequence is represented by a 20-dimensional numerical vector in which each element corresponds to the frequency of one amino acid type in the whole protein sequence [41]. Each element of the AAC of a given protein \(P\) is calculated by the following formula:

$$P=\left[\begin{array}{c}{p}_{1}\\ {p}_{2}\\ \begin{array}{c}\vdots \\ {p}_{20}\end{array}\end{array}\right]$$

with

$${p}_{i}=\frac{{c}_{i}}{L}, (i=1, 2, \cdots , 20)$$

where \({c}_{i}\) is the number of residues of native amino acid type \(i\) in the whole protein \(P\) sequence, and \(L\) is the length of the protein \(P\) sequence. Thus, \({p}_{i}\) is the frequency of amino acid type \(i\) in protein \(P\).
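As an illustration, a minimal Python implementation of the AAC encoding could look like this (the function name and the toy sequence are ours):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aac(sequence: str):
    """Return the 20-dimensional amino acid composition of a protein sequence."""
    counts = Counter(sequence)
    length = len(sequence)
    return [counts.get(aa, 0) / length for aa in AMINO_ACIDS]

print(aac("MKVLAAGLLK"))  # frequency of A, C, D, ... for a toy sequence
```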

PAAC

Pseudo-amino acid composition (PAAC) was proposed by Chou [42] for predicting cellular protein attributes and has been widely used in many studies [31, 43]. This group of descriptors incorporates sequence-order information, hydrophobicity values, hydrophilicity values, and side-chain mass. PAAC is defined by 20 + λ discrete numbers:

$$\mathrm{P}=\left[{p}_{1},{p}_{2},\dots ,{p}_{20},{p}_{20+1},\dots ,{p}_{20+\lambda }\right]$$

with

$${p}_{c}=\frac{{f}_{c}}{\sum_{i=1}^{20}{f}_{i}+\omega \sum_{j=1}^{\lambda }{\theta }_{j}}, \quad (1\le c\le 20)$$
$${p}_{c}=\frac{\omega {\theta }_{c-20}}{\sum_{i=1}^{20}{f}_{i}+\omega \sum_{j=1}^{\lambda }{\theta }_{j}},\quad (20+1\le c\le 20+\lambda )$$
$${\theta }_{j}=\frac{1}{L-j}\sum_{i=1}^{L-j}\Theta (P({S}_{i}),P({S}_{i+j})),\quad (j=1, 2, \cdots , \lambda )$$

where \({f}_{c}\) is the normalized frequency of amino acid \(c\) in the protein sequence, \(L\) is the length of the protein, \(\omega\) is the weight factor, and \({\theta }_{j}\) is the j-th tier sequence-order correlation factor. \(\Theta (P({S}_{i}),P({S}_{i+j}))\) represents the correlation function, and λ is the maximum correlation length. This study used iLearnPlus [44] to extract the PAAC feature from protein sequences, generating a 23-dimensional feature vector for each protein.

PSSM-AC

PSSM-AC is derived from the Position-Specific Scoring Matrix (PSSM) by applying the auto covariance (AC) transformation to each column of the PSSM, measuring the average correlation between two elements of the same column separated by a distance g [45, 46]. Each sequence is represented in PSSM-AC by a 20 × G-dimensional vector computed with the following formula:

$$PSSM-AC(j,g)=\frac{1}{L-g}\sum_{i=1}^{L-g}{(P}_{i,j}-\overline{{P }_{j}})\times ({P}_{i+g,j}-\overline{{P }_{j}})$$

with

$$\overline{{P }_{j}}=\frac{1}{L}\sum_{i=1}^{L}{P}_{i,j},\quad (j=1, 2, 3, \cdots , 20)$$

where \({P}_{i,j}\) represents the PSSM value at the i-th row and j-th column, and \(\overline{{P }_{j}}\) is the average value of column \(j\) over the whole protein sequence. \(G\) is a number smaller than the length \(L\) of the protein sequence, and \(g\) ranges over 1, 2, …, G; \(G\) is set to 10 in this study [47]. Therefore, a 200-dimensional feature vector is generated for each protein.
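A minimal NumPy sketch of this transformation, assuming the PSSM has already been parsed into an L × 20 array (parsing of the PSI-BLAST output is omitted):

```python
import numpy as np

def pssm_ac(pssm: np.ndarray, G: int = 10) -> np.ndarray:
    """Auto covariance transform of an L x 20 PSSM into a 20*G-dimensional vector."""
    L = pssm.shape[0]
    centered = pssm - pssm.mean(axis=0)  # subtract the per-column means
    feats = []
    for g in range(1, G + 1):            # lag g = 1 .. G
        for j in range(pssm.shape[1]):   # column j = 1 .. 20
            feats.append(np.dot(centered[:L - g, j], centered[g:, j]) / (L - g))
    return np.asarray(feats)             # 200-dimensional for G = 10
```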

RPSSM

According to the work of Li et al. [48], the original PSSM profile (L × 20) can be reduced to an L × 10 matrix by merging some of its columns. RPSSM is then obtained by exploring the local sequence information of the L × 10 reduced PSSM [49, 50]:

$$re-PSSM=({P}_{1}, {P}_{2},{P}_{3}, \cdots , {P}_{10})$$

and

$${P}_{1}=\frac{{p}_{F}+{p}_{Y}+{p}_{W}}{3}, {P}_{2}=\frac{{p}_{M}+{p}_{L}}{2}, {P}_{3}=\frac{{p}_{I}+{p}_{V}}{2}, {P}_{4}=\frac{{p}_{A}+{p}_{T}+{p}_{S}}{3}$$
$${P}_{5}=\frac{{p}_{N}+{p}_{H}}{2}, {P}_{6}=\frac{{p}_{Q}+{p}_{E}+{p}_{D}}{3}, {P}_{7}=\frac{{p}_{R}+{p}_{K}}{2}, {P}_{8}={p}_{C}, {P}_{9}={p}_{G}, {P}_{10}={p}_{P}$$

where \(p_{A} ,p_{R} , \ldots ,p_{V}\) represent the 20 columns in the original PSSM profile corresponding to the 20 amino acids. The re-PSSM is further transformed into a 10-dimensional vector:

$${E}_{j}=\frac{1}{L}\sum_{i=1}^{L}{({p}_{i,j}-{\overline{p}}_{j})}^{2}$$

and

$$\overline{{p }_{j}}=\frac{1}{L}\sum_{i=1}^{L}{p}_{i,j}, (j=1, 2, \cdots , 10; i=1, 2, \cdots , L)$$

Additionally, the re-PSSM can be further transformed into a 10 × 10 matrix to capture the local sequence-order information by this formula:

$${E}_{j, t}=\frac{1}{L-1}\sum_{i=1}^{L-1}\frac{{({p}_{i, j}-{p}_{i+1,t})}^{2}}{2},\quad (j,t=1, 2, 3,\cdots , 10)$$

where \({p}_{i,j}\) represents the element at the i-th row and j-th column of the re-PSSM. Finally, a 110-dimensional RPSSM feature is obtained by combining \({E}_{j,t}\) and \({E}_{j}\):

$$RPSSM=[{E}_{\mathrm{1,1}},{E}_{\mathrm{1,2}},\cdots ,{E}_{\mathrm{10,10}},{E}_{1},\cdots ,{E}_{10}]$$
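A NumPy sketch of the RPSSM encoding is shown below; it assumes the PSSM columns follow the standard PSI-BLAST amino acid order (ARNDCQEGHILKMFPSTWYV), and the column grouping follows the ten merged columns defined above:

```python
import numpy as np

AA_ORDER = "ARNDCQEGHILKMFPSTWYV"  # assumed PSSM column order
GROUPS = ["FYW", "ML", "IV", "ATS", "NH", "QED", "RK", "C", "G", "P"]

def rpssm(pssm: np.ndarray) -> np.ndarray:
    """Convert an L x 20 PSSM into the 110-dimensional RPSSM feature."""
    idx = {aa: i for i, aa in enumerate(AA_ORDER)}
    # reduce to L x 10 by averaging the grouped columns
    reduced = np.stack(
        [pssm[:, [idx[a] for a in group]].mean(axis=1) for group in GROUPS], axis=1
    )
    # global term E_j: per-column variance of the reduced PSSM
    e_j = ((reduced - reduced.mean(axis=0)) ** 2).mean(axis=0)  # 10 values
    # local term E_{j,t}: averaged squared differences between adjacent rows
    diff = reduced[:-1, :, None] - reduced[1:, None, :]         # (L-1, 10, 10)
    e_jt = (diff ** 2 / 2).mean(axis=0)                         # 100 values
    return np.concatenate([e_jt.ravel(), e_j])                  # 110-dimensional
```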

Pretrained SSA embedding

The pretrained SSA embedding model is obtained by combining a pre-trained language model with soft sequence alignment (SSA) [51]. First, an embedding matrix \(R^{L\times 121}\) is produced for each sequence by stacked BiLSTM encoders, where L is the protein sequence length [52]. The pretrained embedding model is then trained and optimized with SSA, which can be described by the following formulas. For convenience, suppose two embedding matrices \(P_1\) (\(R^{L1\times 121}\)) and \(P_2\) (\(R^{L2\times 121}\)) of two different protein sequences with lengths L1 and L2, respectively:

$${P}_{1}=[{x}_{1},{x}_{2},\cdots ,{x}_{L1}], {P}_{2}=[{y}_{1},{y}_{2},\cdots ,{y}_{L2}]$$

where \(x_i\) and \(y_j\) are 121-dimensional vectors.

The following formula represents the similarity of P1 and P2:

$$\widehat{p}=-\frac{1}{A}\sum_{i=1}^{L1}\sum_{j=1}^{L2}{\alpha }_{ij}\Vert {x}_{i}-{{y}_{j}\Vert }_{1}$$

and

$$A=\sum_{i=1}^{L1}\sum_{j=1}^{L2}{\alpha }_{ij}, { \alpha }_{ij}={\delta }_{ij}+{\varepsilon }_{ij}-{\delta }_{ij}{\varepsilon }_{ij}$$

with

$${\delta }_{ij}=\frac{exp(-\Vert {x}_{i}-{y}_{j}\Vert _{1})}{\sum_{k=1}^{L2}exp(-\Vert {x}_{i}-{y}_{k}\Vert _{1})}, \quad {\varepsilon }_{ij}=\frac{exp(-\Vert {x}_{i}-{y}_{j}\Vert _{1})}{\sum_{k=1}^{L1}exp(-\Vert {x}_{k}-{y}_{j}\Vert _{1})}$$

The SSA embedding model converts each protein sequence into an embedded matrix \(R^{L\times 121}\), and finally an average pooling operation yields a 121-dimensional feature vector.
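The pooling step itself is straightforward; a minimal sketch, assuming the per-residue embedding has already been computed by the pretrained model (here replaced by a random placeholder matrix):

```python
import numpy as np

# hypothetical L x 121 per-residue embedding for a protein of length 200
embedding = np.random.rand(200, 121)
ssa_feature = embedding.mean(axis=0)  # average pooling -> 121-dimensional vector
```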

Feature selection

Original features are represented by high-dimensional vectors or matrices, which can cause severe problems for machine learning algorithms, such as overfitting, time-consuming training, and high demand for computing resources. Therefore, identifying the most informative features plays a vital role in improving performance. As one of the most popular feature selection algorithms, minimum redundancy maximum relevance (mRMR) was proposed by Peng et al. [53] and has been applied in many studies with robust performance [54,55,56]. In this study, mRMR was used to identify the most important features and improve the generalization ability of the model.
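The sketch below illustrates the greedy mRMR idea: at each step it adds the feature with the largest relevance-minus-redundancy score. It uses scikit-learn's mutual information estimator for relevance; for simplicity, redundancy is approximated here by the mean absolute Pearson correlation with already selected features, rather than the mutual information used in the original algorithm:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mrmr_select(X: np.ndarray, y: np.ndarray, k: int):
    """Greedily select k feature indices by relevance minus redundancy."""
    relevance = mutual_info_classif(X, y)        # MI between each feature and the label
    corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-feature correlation matrix
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        if not selected:
            best = remaining[int(np.argmax(relevance[remaining]))]
        else:
            scores = [relevance[j] - corr[j, selected].mean() for j in remaining]
            best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```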

Machine learning algorithm

In this study, we focused on traditional machine learning classification methods, including support vector machine, k-nearest neighbor, multi-layer perceptron, logistic regression, random forest, extreme gradient boosting, light gradient boosting machine and CatBoost, together with ensemble methods that integrate these eight classifiers through a hard voting strategy or stacking classifiers. More information is given in the following subsections.

Support vector machine

Support vector machine (SVM) was first proposed by Vapnik et al. [57] and has been successfully applied to binary classification problems in bioinformatics [25, 58, 59]. Two parameters, cost (C) and gamma (γ), affect the performance of an SVM model with the RBF kernel. In this study, we used a grid search strategy to optimize C and γ in the space {2−6, 2−5, …, 25, 26}, and the SVM classifier was constructed with the optimal values of C and γ.
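A sketch of this grid search with scikit-learn (the scoring metric and fold count are our illustrative choices):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [2.0 ** i for i in range(-6, 7)],      # 2^-6 ... 2^6
    "gamma": [2.0 ** i for i in range(-6, 7)],  # 2^-6 ... 2^6
}
svm_search = GridSearchCV(SVC(kernel="rbf", probability=True),
                          param_grid, scoring="roc_auc", cv=5)
# svm_search.fit(X_train, y_train)  # X_train / y_train: encoded features and labels
```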

K-nearest neighbor

K-nearest neighbor (KNN) is a fundamental classifier that has been applied in predicting protein function [60], extracting protein–protein interaction information [61], and predicting eukaryotic protein subcellular localization [62]. The performance of KNN is directly affected by the parameter k. In this study, a grid search within the space \(\left\{ {1,2, \ldots ,\max \left\{ {\sqrt {FeaNum} ,\frac{FeaNum}{2}} \right\}} \right\}\) was applied to optimize k during model training, where FeaNum is the number of features used in modelling.
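For example, the search over k could be expressed as follows (again a sketch; the scoring metric and fold count are illustrative choices):

```python
import math
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def tune_knn(X, y):
    fea_num = X.shape[1]
    k_max = int(max(math.sqrt(fea_num), fea_num / 2))  # upper bound of the k grid
    search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": list(range(1, k_max + 1))},
                          scoring="roc_auc", cv=5)
    return search.fit(X, y)
```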

Multi-layer perception

Multi-layer perceptron (MLP) is a type of artificial neural network (ANN) [63, 64] and has been applied in many bioinformatics studies, such as the prediction of protein structure classes [65], protein tertiary structure [66], and DNA–protein binding sites [67]. In this study, an MLP classifier with two hidden layers was trained; the first and second hidden layers have 64 and 32 nodes, respectively, and the maximum number of training iterations is 1000.

Logistic regression

Logistic regression (LR) is widely used to predict the probability of an event occurring [59, 68] and can be represented by the following formula:

$$p(y)=\frac{1}{1+{e}^{-({\beta }_{0}+{\beta }_{1}\chi )}}$$

where p(y) is the expected probability of dependent variable \(\mathrm{y}\), and β0 and β1 are constants.

Random forest

The random forest (RF) classifier was proposed by Breiman [69] and has been used in the prediction of type IV secreted effector proteins [70] and protein structural classes [59]. To find the optimal number of trees M and number of features mtry, we used a grid search to optimize \(\mathrm{M}\) and \(\mathrm{mtry}\) within the spaces \(\{1, 2,\cdots ,\mathrm{max}\left\{\sqrt{FeaNum},\frac{FeaNum}{2}\right\}\}\) and {1, 6, 11, 16}, respectively, where FeaNum is the number of features adopted during modeling.

XGBoost

Extreme gradient boosting (XGBoost) is a scalable end-to-end tree boosting system [71] and has been widely used as a fast and highly effective machine learning method [72, 73]. Eitzinger et al. implemented AcRanker using XGBoost to identify Acrs [14, 16]. In this study, the default parameters were adopted for the XGBoost model, except for a learning rate of 0.1.

LightGBM

Light gradient boosting machine (LightGBM) shows excellent performance when the feature dimension is high and the data size is large [21]. LightGBM has been used in identifying miRNA targets [74], predicting protein–protein interactions [75], and predicting blood–brain barrier penetration [76]. This study used the LightGBM Python package with default parameters.

CatBoost

CatBoost achieves state-of-the-art results by successfully handling categorical features and calculating leaf values with a new scheme that helps reduce overfitting [23]. CatBoost has been applied in various tasks, including the modelling of relationships between molecular structure and biological activity [77] and the identification of pyroptosis-related molecular subtypes of lung adenocarcinoma [78]. In this study, the parameters of CatBoost were set to their default values.

Ensemble learning method

This study proposed three ensemble models to construct more robust and reliable classifiers for predicting new Acr proteins. They integrate the eight classifiers described above (SVM, KNN, MLP, LR, RF, XGBoost, LightGBM, and CatBoost) through either a hard voting rule (Ens-vote) or stacking classifiers with a logistic regression (Sta-LR) or gradient boosting (Sta-GBC) meta-classifier [79].
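With scikit-learn, the three ensembles can be built from a shared list of base estimators, for instance as sketched below (the base estimator list would contain the eight classifiers described above; the function name is ours):

```python
from sklearn.ensemble import (GradientBoostingClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression

def build_ensembles(base_estimators):
    """base_estimators: list of (name, classifier) pairs for the eight base models."""
    ens_vote = VotingClassifier(estimators=base_estimators, voting="hard")
    sta_lr = StackingClassifier(estimators=base_estimators,
                                final_estimator=LogisticRegression())
    sta_gbc = StackingClassifier(estimators=base_estimators,
                                 final_estimator=GradientBoostingClassifier())
    return ens_vote, sta_lr, sta_gbc
```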

Performance assessment

Fairly evaluating the predictive performance of classification methods is an essential subject in machine learning. In this study, we used six measurements, namely sensitivity (SN), specificity (SP), accuracy (ACC), precision (PRE), F-score, and Matthews correlation coefficient (MCC) [80], which are defined as:

$$SN=\frac{TP}{TP+FN}$$
$$SP=\frac{TN}{TN+FP}$$
$$PRE=\frac{TP}{TP+FP}$$
$$ACC=\frac{TP+TN}{TP+FP+TN+FN}$$
$$F-score=2\times \frac{TP}{2TP+FP+FN}$$
$$\mathrm{MCC}=\frac{TP\times TN-FP\times FN}{\sqrt{\left(TP+FN\right)\times \left(TN+FP\right)\times \left(TP+FP\right)\times \left(TN+FN\right)}}$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. In addition, the area under the receiver operating characteristic (ROC) curve (AUC) was used to assess performance, where the ROC curve plots the true positive rate against the false positive rate. All methods were evaluated based on fivefold cross-validation.
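These measurements map directly onto scikit-learn metric functions; a sketch of an evaluation helper is shown below (specificity is computed as the recall of the negative class):

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    """y_pred: predicted 0/1 labels; y_score: predicted Acr probabilities."""
    return {
        "SN": recall_score(y_true, y_pred),               # sensitivity
        "SP": recall_score(y_true, y_pred, pos_label=0),  # specificity
        "PRE": precision_score(y_true, y_pred),
        "ACC": accuracy_score(y_true, y_pred),
        "F-score": f1_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),
    }
```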