1 Introduction

Protein methylation is a vital post-translational modification (PTM) that regulates various biological processes, including gene expression, protein-protein interactions, DNA repair, and signal transduction. This modification involves adding a methyl group to specific amino acid residues, such as lysine or arginine, within proteins. Identifying protein methylation sites is important for both fundamental research and drug discovery. Substantial research progress has revealed the involvement of protein methylation in several human diseases, including neurotic disorders [1], multiple sclerosis, coronary heart disease [2], cancer [3, 4], and rheumatoid arthritis [5]. Therefore, precise prediction of methylated sites is crucial for understanding the underlying molecular mechanisms associated with protein methylation. Traditional experimental techniques like ChIP-chip [6], mass spectrometry [7], and methylation-specific antibody probing [8] can identify methylation sites in proteins but are time-consuming and expensive. In the era of big data, there is therefore a growing demand for machine learning-based tools that predict methylation sites accurately and rapidly.

Machine learning algorithms have shown great potential in predicting protein methylation sites from protein sequences in recent years. These algorithms can learn complex patterns from large datasets and extract sequence features, enabling accurate and efficient predictions. For example, some researchers used a deep learning model [9] to identify protein methylation sites from amino acid sequences and achieved an accuracy of 87.04%. Another notable example is iMethyl-PseAAC [10], a Support Vector Machine (SVM) model designed specifically for predicting protein methylation sites. It achieved high prediction accuracy using the position-specific scoring matrix and pseudo-amino acid composition features. Chen et al. [11] presented MeMo, which combined an SVM algorithm with an orthogonal binary coding scheme to retrieve information from the primary sequences. MeMo showed improved prediction performance compared to traditional sequence-based methods. He et al. [12] devised MethyCancer, a machine learning-based tool for identifying cancer-related methylation sites. They employed an ensemble model that combined multiple machine-learning algorithms to enhance prediction accuracy. Various bioinformatics techniques have been devised during the past few years for predicting different PTM locations in protein sequences [13,14,15,16,17,18,19,20,21,22,23,24].

However, the predictions of machine learning algorithms are often difficult to interpret, which hinders their practical application in biology and medicine. Explainable artificial intelligence (XAI) techniques aim to provide clear and comprehensive explanations of model predictions. XAI can help scientists understand how machine learning algorithms arrive at their predictions and gain insight into the biological processes that underlie those predictions.

In this work, we propose a novel method, RMSxAI, for predicting protein methylated sites during protein synthesis using machine learning algorithms. Additionally, we interpret the predictions by applying XAI techniques. We used a database of methylated and unmethylated protein sequences and extracted multiple features, including dipeptide composition, amino acid composition, physicochemical properties, and distribution information. We used several machine learning algorithms, including random forests (RF), SVM, the naive Bayes classifier, K-nearest neighbors (KNN), and fuzzy SVM (FSVM), to predict protein methylation sites and evaluated their performance using a variety of metrics. We also used various XAI concepts to explain the predictions and to understand the biological mechanisms behind protein methylation. The workflow of the proposed model is shown in Fig. 1.

Fig. 1 Steps in methylation site prediction. The first step consists of data collection and removal of redundant sequences. The second step includes feature extraction, such as dipeptide composition, amino acid composition, normal distribution, and physicochemical properties. Next, build models for various machine learning algorithms. Then, evaluate the model using different evaluation metrics. Finally, interpret the output of the model using XAI techniques.

Our approach has potential implications for both basic science and drug discovery. Predicting protein methylation sites can deliver insight into the biological functions of proteins and can support the development of new treatments for diseases, including cancer and neurodegenerative diseases [25]. The XAI approach used in our study could also be applied to other machine-learning models in computational biology and medicine to make their predictions more interpretable.

The remaining sections are organized as follows. Section 2 describes the data used in the research, the feature extraction, the machine learning algorithms, and the evaluation metrics used to assess all models. Section 3 discusses the results and compares various algorithms for the identification of protein methylation sites. Section 4 explains the interpretation process of XAI. Finally, Section 5 concludes the paper and discusses implications for research and future directions in this area.

2 Materials and methods

2.1 Data set

We gathered experimental evidence of in vivo methylated arginine sites from the manuscript [26] and the UniProt database (version 2015_06) [27]. The selected dataset consists of experimentally verified methylation sites. Searches were performed using the terms ’arginine,’ ’methylation,’ and ’methylation site’ to ensure the reliability of the data included. After careful analysis, peptides/proteins mentioned in PubMed search publications from June 2015 to December 2015 were added to the database. The collected data did not include methylated sites reported only in vitro, which lack significant in vivo evidence. Ambiguous sites/proteins, including those with non-standard amino acids, inconsistent sites, small proteins (fewer than 30 amino acids), and protein fragments, were removed. To maintain the reliability of the data, methylated sites from the PhosphoSitePlus database [28] were not considered, owing to the lack of detailed information on the experimental sources and of additional supporting evidence confirming the PTM. We selected this dataset because most state-of-the-art methods used it, and a fair comparison of the proposed model with those methods requires the same dataset.

Notably, a significant portion of our methylation data is identical to the information shown on PhosphoSitePlus, which is reported to have been extracted from the literature. The data set consists of 2596 sequences, of which 1298 are methylated peptide sequences. The negative peptide sequences were generated from arginine sites not marked as methylated in the same protein sequences from which the positive data were selected. Negative peptide sequences that were redundant with positive sequences were removed using CD-HIT-2d [29] with a 40% similarity cut-off. Each peptide sequence is 19 residues long and is a segment of the full protein sequence. We combined the positive and negative sequences to form a benchmark data set, which was divided into two parts: 80% of the data for model training and the remaining 20% for testing.
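The extraction of 19-residue peptides around arginine sites can be sketched as follows. This is only an illustration, assuming that the arginine sits at the center of the window (9 flanking residues on each side) and that sites too close to a terminus are skipped; the text does not state how such edge cases were handled, and the function name `arginine_windows` is hypothetical.

```python
def arginine_windows(protein, flank=9):
    """Return (1-based position, window) pairs for arginine-centered peptides.

    Assumption: windows have length 2 * flank + 1 (19 by default) and
    arginines without a full flank on both sides are skipped.
    """
    windows = []
    for pos, residue in enumerate(protein):
        has_full_flank = pos >= flank and pos + flank < len(protein)
        if residue == "R" and has_full_flank:
            windows.append((pos + 1, protein[pos - flank: pos + flank + 1]))
    return windows


# Hypothetical protein fragment with a single central arginine.
example = "A" * 9 + "R" + "A" * 9
print(arginine_windows(example))
```

Each returned window would then be labeled positive or negative depending on whether its central arginine is annotated as methylated.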

2.2 Feature representation

The process of extracting features from primary sequences is vital in bioinformatics. It requires analysis and coding of relevant information about the protein sequence according to numerical or categorical features. These features then serve as essential inputs for various machine learning algorithms for classification and prediction. Some of the features that are extracted from amino acid sequences are described below:

2.2.1 Amino acid composition

This method computes the composition of individual amino acids in the protein sequence. The frequency of every amino acid is computed to represent the primary sequence as a 20-dimensional feature vector [30, 31]. For instance, suppose a primary sequence S has length k; then the following equation can be used to find the frequency of every amino acid.

$$\begin{aligned} A_i = \frac{N_i}{k}, \; \; \; \; \; \; i=1 \; to \; 20 \end{aligned}$$
(1)

where k indicates the sequence length and \(N_i\) denotes the frequency count of \(i^{th}\) amino acid in the primary sequence.
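Eq. (1) can be sketched in a few lines of Python. The alphabet ordering and the example peptide below are illustrative assumptions, not taken from the data set.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def amino_acid_composition(sequence):
    """Return the 20-dimensional vector A_i = N_i / k of Eq. (1)."""
    k = len(sequence)
    if k == 0:
        raise ValueError("sequence must not be empty")
    return [sequence.count(aa) / k for aa in AMINO_ACIDS]


peptide = "MKRSAARGGRGGFRGGRGG"  # hypothetical 19-residue peptide
aac = amino_acid_composition(peptide)
print(dict(zip(AMINO_ACIDS, aac)))
```

For a sequence made up only of the 20 standard residues, the entries of the vector sum to 1.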

2.2.2 Physicochemical properties

This process involves the calculation of various properties, such as molecular weight, isoelectric point, hydrophobicity, and other structural tendencies like turn fraction, helix fraction, GRAVY (grand average of hydropathy), and aromaticity [32, 33]. These descriptors capture the main physical and chemical characteristics of proteins, providing a better understanding of their behavior.

(a) Hydrophobicity: Hydrophobicity is the physical property of amino acids that repel water and prefer non-polar environments. Various scales are available to measure the hydrophobicity of amino acids, including the Kyte-Doolittle, Hessa, and Eisenberg scales [34]. The Kyte-Doolittle scale assigns each amino acid a numerical value related to its hydrophobic or hydrophilic (water-attracting) properties. Positive values indicate greater hydrophobicity, while negative values indicate stronger hydrophilicity. The hydrophobicity index (KD) for an amino acid i is estimated using the formula below:

$$\begin{aligned} KD_i = \frac{\sum _{j=1}^{n} x_{ij}}{n} \end{aligned}$$
(2)

where n is the protein sequence length and \(x_{ij}\) is the hydrophobicity value of \(i^{th}\) amino acid at \(j^{th}\) position in the protein sequence.
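One reading of Eq. (2) is the average Kyte-Doolittle value over all positions of the sequence, which can be sketched as below. The dictionary holds the standard Kyte-Doolittle hydropathy values; the function name `mean_hydropathy` is illustrative.

```python
# Standard Kyte-Doolittle hydropathy values for the 20 amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(sequence):
    """Average the per-residue hydropathy values over the n positions."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


print(mean_hydropathy("MKRSAARGGRGGFRGGRGG"))  # hypothetical peptide
```

A positive result suggests a predominantly hydrophobic peptide, a negative one a predominantly hydrophilic peptide.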

(b) Charge and Polarity: To estimate the net charge of a protein sequence, the number of positively charged amino acids can be counted and the number of negatively charged amino acids subtracted. The result indicates whether the protein is positively charged, negatively charged, or neutral [35].

For charge, consider a protein sequence S having length n, where \(S = s_1 s_2 \ldots s_n\) and \(s_i\) indicate the \(i^{th}\) amino acid in the sequence. Let \(\text {pH}\) be the pH value at which the net charge is calculated.

The net charge of the protein sequence is computed by using the below equation:

$$\begin{aligned} \text {charge} = {\left\{ \begin{array}{ll} \sum _{i=1}^{n} \left( \frac{1}{1 + 10^{(\text {pKa}\_\text {values}[s_i] - \text {pH})}} \right) , \\ \text {if } \text {protonated}[s_i] = \text {True} \\ \sum _{i=1}^{n} \left( -\frac{1}{1 + 10^{(\text {pH} - \text {pKa}\_\text {values}[s_i])}} \right) , \\ \text {if } \text {protonated}[s_i] = \text {False} \end{array}\right. } \end{aligned}$$
(3)

where \(\text {pKa}\_\text {values}[s_i]\) is the pKa value of amino acid \(s_i\) and \(\text {protonated}[s_i]\) indicates whether amino acid \(s_i\) is protonated at the given pH.

The resulting \(\text {charge}\) provides the net charge of the protein sequence at the specified pH value. The overall polarity of the protein sequence can be estimated by counting the number of polar and non-polar amino acid residues. The relative abundance of polar and non-polar residues provides insight into the hydrophilic or hydrophobic nature of the protein. Once the features are extracted, they can be used for various tasks such as classification, clustering, or prediction. Let \(\text {polar}\_\text {aa}\) be the set of polar amino acid residues and \(\text {nonpolar}\_\text {aa}\) be the set of non-polar amino acid residues. The following formula can be used to compute the polarity score:

$$\begin{aligned} \text {polarity score} = \frac{\text {polar}\_\text {count}}{\text {nonpolar}\_\text {count}} \end{aligned}$$
(4)

where \(\text {polar}\_\text {count}\) is the count of polar amino acid residues in the sequence, and \(\text {nonpolar}\_\text {count}\) is the count of non-polar amino acid residues in the sequence. If \(\text {nonpolar}\_\text {count}\) is 0, meaning there are no non-polar amino acids in the sequence, the function returns \(-1\) to indicate an undefined polarity score. The resulting polarity score measures the relative proportion of polar to non-polar amino acids in the sequence, indicating its overall polarity.
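Eq. (4), including the \(-1\) sentinel for an undefined ratio, can be sketched as follows. The membership of the two residue sets is an assumption made for illustration, since classifications of polar and non-polar residues vary between sources and the sets used in this study are not listed.

```python
# Assumed residue classes; other groupings are in common use.
POLAR_AA = set("RNDQEHKSTYC")
NONPOLAR_AA = set("AGILMFPWV")


def polarity_score(sequence):
    """Return polar_count / nonpolar_count, or -1 if no non-polar residue."""
    polar_count = sum(1 for aa in sequence if aa in POLAR_AA)
    nonpolar_count = sum(1 for aa in sequence if aa in NONPOLAR_AA)
    if nonpolar_count == 0:
        return -1  # undefined ratio, as described in the text
    return polar_count / nonpolar_count


print(polarity_score("MKRSAARGGRGGFRGGRGG"))  # hypothetical peptide
```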

(c) Isoelectric point: The isoelectric point (pI) of a protein can be estimated using the Henderson-Hasselbalch equation [36]:

$$\begin{aligned} pI = \frac{pKa_1 + pKa_2}{2} \end{aligned}$$
(5)

where \(pKa_2\) is the ionization constant of the most basic group with a pKa value less than the pH and \(pKa_1\) is the ionization constant of the most acidic group with a pKa value greater than the pH. The pKa values for the ionizable groups are specific to the amino acids in the primary sequence.

2.2.3 Dipeptide composition

Dipeptides consist of two adjacent amino acid residues in a protein sequence and play a significant role in revealing the complex structure and function of proteins [37]. By studying the frequency and distribution of dipeptides, scientists can learn more about the relationship between amino acid pairs, revealing important information about protein structure, stability, association, and enzymatic activity. Dipeptide analysis is a powerful tool for understanding the complex mechanisms behind protein folding, binding interactions, and structural changes. Additionally, dipeptide-based signatures facilitate the development of machine-learning models for protein classification, prediction, and engineering. There are 400 possible dipeptide pairs, and the frequency of every pair is computed to describe a primary sequence by a 400-dimensional fixed-length vector. Eq. 6 is used to compute the dipeptide composition of the primary sequence S with length k.

$$\begin{aligned} D_i=\frac{n_i}{k}, \; \; \; \; \; \; \; i=1 \; to \; 400 \end{aligned}$$
(6)

where \(n_i\) is the count that tells how many times the \(i^{th}\) dipeptide pair occurs within the sequence S.
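A minimal sketch of Eq. (6) is given below. It counts overlapping adjacent pairs and, following the equation as written, normalizes by the sequence length k; the ordering of the 400 dipeptides is an illustrative assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400-dimensional vector D_i = n_i / k of Eq. (6)."""
    k = len(sequence)
    if k == 0:
        raise ValueError("sequence must not be empty")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for j in range(k - 1):
        pair = sequence[j:j + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return [counts[d] / k for d in DIPEPTIDES]


dpc = dipeptide_composition("GGRGG")
print(dpc[DIPEPTIDES.index("GG")])
```

Because a sequence of length k contains only k - 1 adjacent pairs, the entries sum to (k - 1)/k rather than 1 under this normalization.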

2.2.4 Normal distribution

The normal distribution, sometimes called the Gaussian distribution, is a continuous probability distribution widely used in statistics and probability theory [38]. This distribution is characterized by a symmetric bell curve, where the mean and standard deviation control the central tendency and spread of the distribution. The mode, median, and mean are equal in a normal distribution, and about 68% of the data lie within one standard deviation of the mean. Additionally, around 95% of the data fall within two standard deviations of the mean, and almost all data points are within three standard deviations. Consider a protein sequence S having length n, where \(S = s_1 s_2 \ldots s_n\) and \(s_i\) indicates the \(i^{th}\) amino acid in the sequence.

Let \(x_i\) be the property value associated with amino acid \(s_i\); we have used frequencies of each amino acid in the protein sequence as the property value. Let \(\sigma _i\) be the standard deviation and \(\mu _i\) be the mean of the property value for amino acid \(s_i\). The normal distribution value \(\text {ND}(x_i)\) for amino acid \(s_i\) can be estimated using the probability density function:

$$\begin{aligned} \text {ND}(x_i) = \frac{1}{\sqrt{2\pi }\sigma _i} \cdot e^{-\frac{(x_i - \mu _i)^2}{2\sigma _i^2}} \end{aligned}$$
(7)

The resulting \(\text {ND}(x_i)\) represents the normal distribution value of the property corresponding to amino acid \(s_i\) in the primary sequence.
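The density in Eq. (7) can be evaluated directly, as in the sketch below. How \(\mu _i\) and \(\sigma _i\) are estimated is not specified in the text, so the numeric values passed to the function here are purely hypothetical.

```python
import math


def normal_density(x, mu, sigma):
    """Evaluate the Gaussian probability density function of Eq. (7)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * sigma)


# Hypothetical values: residue frequency 0.26, mean 0.05, std. dev. 0.10.
print(normal_density(0.26, mu=0.05, sigma=0.10))
```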

In this study, we employed a comprehensive set of features to characterize and analyze protein sequences and to gain deeper insights into their structural and functional attributes. The features encompassed diverse aspects, including the frequency of individual amino acids, the fractional representation of each amino acid, occurrences of dipeptides, and normal distribution values associated with each amino acid. Furthermore, physicochemical properties such as molecular weight, charge, polarity, hydrophobicity, aromaticity, instability index, GRAVY, helix fraction, turn fraction, and sheet fraction were considered. In the present study, each protein sequence is represented by a 471-dimensional feature vector. This multifaceted feature set allowed us to construct a holistic profile of the protein sequences under investigation. These features capture the primary sequence characteristics as well as the physicochemical properties that contribute to the proteins’ overall structural and functional aspects.

2.3 Machine learning algorithms

2.3.1 KNN classifier

KNN is a supervised machine learning technique that may be used for regression and classification problems [39]. It is a non-parametric technique that uses the k training samples nearest to the test sample to generate predictions. For classification tasks, KNN assigns a new sample the majority class among its nearest neighbors. In regression analysis, the algorithm calculates the mean of the k nearest neighbors to make the estimation.

Unlike other algorithms, KNN does not require an explicit training phase, as it stores all training samples in memory. The main parameter in KNN is the value of k, which determines the number of neighbors taken into account when making the prediction. The choice of k can affect the algorithm’s performance and is often made empirically, for example by cross-validation. Mathematically, the KNN classifier assigns the class label \(y_{\text {new}}\) to \(x_{\text {new}}\) using the majority class of its k nearest neighbors:

$$\begin{aligned} y_{\text {new}} = \underset{y}{\text {argmax}} \sum _{x_i \text { among k nearest neighbors}} \delta (y_i, y) \end{aligned}$$
(8)

where \(y_{\text {new}}\) is the predicted class label of \(x_{\text {new}}\), \(y_i\) is the class label of the \(i^{th}\) nearest neighbor, and \(\delta (y_i, y)\) is the Kronecker delta function that returns 1 if \(y_i = y\) and 0 otherwise. The parameter k determines the number of neighbors to consider. A larger k results in a smoother decision boundary, while a smaller k captures more local details.
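The majority vote of Eq. (8) can be sketched without external libraries. The Euclidean distance and the toy two-dimensional data are assumptions for illustration; the distance metric used in the study is not stated.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x_new, k=3):
    """Return the majority label among the k nearest training samples."""
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], x_new),  # Euclidean (assumed)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: label 1 for "methylated", 0 for "unmethylated".
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))
```

Using an odd k for a two-class problem avoids tied votes.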

2.3.2 Naive Bayes classifier

Naive Bayes is a machine learning algorithm often utilized in classification and prediction tasks. It is known for its simplicity and efficiency, especially when dealing with extensive data sets [40]. The Naive Bayes algorithm is based on Bayes’ theorem, which relates the probability of a hypothesis (in this context, a class label) to the observed evidence (in this context, the feature values). Given a data set with m instances and n features, and a set of C possible class labels \(\{C_1, C_2, \ldots , C_C\}\), the Naive Bayes classifier assigns a class label \(C_i\) to a new instance \(\textbf{x}_{\text {new}} = (x_1, x_2, \ldots , x_n)\) by calculating the conditional probabilities of each class given the features:

$$\begin{aligned} P(C_i | \textbf{x}_{\text {new}}) = \frac{P(\textbf{x}_{\text {new}} | C_i) \cdot P(C_i)}{P(\textbf{x}_{\text {new}})} \end{aligned}$$
(9)

where \(P(\textbf{x}_{\text {new}} | C_i)\) is the likelihood of the features \(\textbf{x}_{\text {new}}\) given class \(C_i\) and \(P(C_i)\) is the prior probability of class \(C_i\). \(P(C_i | \textbf{x}_{\text {new}})\) is the posterior probability of class \(C_i\) given the new instance \(\textbf{x}_{\text {new}}\). \(P(\textbf{x}_{\text {new}})\) is the marginal likelihood of the features \(\textbf{x}_{\text {new}}\).

The “naive” assumption behind Naive Bayes is that, given the class labels, the features are conditionally independent. This assumption simplifies the likelihood term:

$$\begin{aligned} P(\textbf{x}_{\text {new}} | C_i) = P(x_1 | C_i) \cdot P(x_2 | C_i) \cdot \cdots \cdot P(x_n | C_i) \end{aligned}$$
(10)

The Naive Bayes classifier then assigns the new instance to the class having the maximum posterior probability:

$$\begin{aligned} \text {Predicted class} = \underset{C_i}{\text {argmax}} \left( P(C_i) \cdot \prod _{j=1}^{n} P(x_j | C_i) \right) \end{aligned}$$
(11)

In practice, probabilities \(P(x_j | C_i)\) are often estimated from the training data using techniques like frequency counting or kernel density estimation. Naive Bayes is often used as a basic algorithm for text classification and spam filtering tasks.
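The decision rule of Eq. (11) is usually evaluated in log space to avoid numerical underflow when many probabilities are multiplied. The sketch below assumes Gaussian class-conditional likelihoods for continuous features, which is one common choice and not necessarily the variant used in this study; the function names are illustrative.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the prior and per-feature (mean, variance) for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(rows) / len(X), stats)
    return model


def predict_gaussian_nb(model, x_new):
    """Return argmax over classes of log P(C_i) + sum_j log P(x_j | C_i)."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for value, (mean, var) in zip(x_new, stats):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (4.0, 5.0), (4.2, 4.8), (3.8, 5.2)]
y = [0, 0, 0, 1, 1, 1]
print(predict_gaussian_nb(fit_gaussian_nb(X, y), (1.1, 2.1)))
```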

2.3.3 SVM classifier

The SVM is a supervised learning algorithm for regression and classification. It works by finding an optimal hyperplane that divides data points into different classes [41]. The SVM algorithm tries to learn a decision boundary with the largest separation between the two classes. The margin is the distance between the decision boundary and the nearest data points of each class. The goal is to determine a hyperplane that separates the classes while minimizing the classification error. Given a data set with m instances and n features, and a set of C possible class labels \(\{C_1, C_2, \ldots , C_C\}\), the SVM algorithm finds a hyperplane that best classifies the data into distinct classes and maximizes the margin among the classes.

Let \(\textbf{X}\) represent the feature matrix of size \(m \times n\), where each row \(\textbf{x}_i\) corresponds to the \(i^{th}\) instance, and \(\textbf{y}\) be the corresponding vector of class labels (\(\textbf{y} = [y_1, y_2, \ldots , y_m]^T\)). The optimization problem of finding the optimal hyperplane can be formulated as:

$$\begin{aligned} \begin{aligned} \text {minimize} \quad&\frac{1}{2} \Vert \textbf{w}\Vert ^2 + C \sum _{i=1}^{m} \xi _i \\ \text {subject to} \quad&y_i (\textbf{w}^T \textbf{x}_i + b) \ge 1 - \xi _i, \quad \forall i = 1, 2, \ldots , m \\&\xi _i \ge 0, \quad \forall i = 1, 2, \ldots , m \end{aligned} \end{aligned}$$
(12)

where \(\textbf{w}\) is the weight vector of the hyperplane and b is the bias term. \(\xi _i\) are slack variables that allow for soft-margin classification, and C is a regularization parameter that balances the trade-off between minimizing the classification error and maximizing the margin.

The decision function for classifying a new instance \(\textbf{x}_{\text {new}}\) is given by:

$$\begin{aligned} f(\textbf{x}_{\text {new}}) = \text {sign}(\textbf{w}^T \textbf{x}_{\text {new}} + b) \end{aligned}$$
(13)

The SVM seeks to determine the optimum values of \(\textbf{w}\) and b that maximize the margin among the classes while ensuring that data points are correctly classified, either outside the margin or within a tolerance controlled by \(\xi _i\), on the correct side of the hyperplane. The effectiveness of SVM has been shown in different areas, such as image recognition, text classification, and bioinformatics. However, the selection of the kernel function and its associated parameters may affect the results. Additionally, training an SVM can be computationally demanding, especially when dealing with large data sets.

2.3.4 Fuzzy SVM classifier

FSVM is a variant of the SVM algorithm that integrates fuzzy logic into the decision-making process. It operates as a supervised learning approach capable of handling uncertainty present in the input data [42]. In FSVM, the algorithm assigns membership values to each data point in the training set, considering their proximity to the decision boundary.

Given a data set with m instances and n features, and a set of C possible class labels \(\{C_1, C_2, \ldots , C_C\}\), the FSVM algorithm extends the traditional SVM by introducing a fuzziness factor to account for uncertainty in class assignments. Let \(\textbf{X}\) represent the feature matrix of size \(m \times n\), where each row \(\textbf{x}_i\) corresponds to the \(i^{th}\) instance, and \(\textbf{y}\) be the corresponding vector of class labels (\(\textbf{y} = [y_1, y_2, \ldots , y_m]^T\)). The optimization problem of finding the optimal hyperplane for FSVM can be formulated as:

$$\begin{aligned} \begin{aligned} \text {minimize} \quad&\frac{1}{2} \Vert \textbf{w}\Vert ^2 + C \sum _{i=1}^{m} \mu _i \xi _i \\ \text {subject to} \quad&y_i (\textbf{w}^T \textbf{x}_i + b) \ge 1 - \xi _i, \quad \forall i = 1, 2, \ldots , m \\&and \quad \xi _i \ge 0, \quad \forall i = 1, 2, \ldots , m \\&and \quad \mu _i \in [0, 1], \quad \forall i = 1, 2, \ldots , m \end{aligned} \end{aligned}$$
(14)

where b is the bias and \(\textbf{w}\) is the weight vector of the hyperplane. \(\xi _i\) are slack variables that allow for soft-margin classification. \(\mu _i\) are fuzzy membership values representing each instance’s degree of class membership. They capture the uncertainty in class assignments. C is a regularization parameter that balances the trade-off between minimizing the classification error and maximizing the margin. \(y_i\) is the class label of instance \(\textbf{x}_i\). The following function is used for classifying a new instance \(\textbf{x}_{\text {new}}\):

$$\begin{aligned} f(\textbf{x}_{\text {new}}) = \text {sign}\left( \textbf{w}^T \textbf{x}_{\text {new}} + b - \frac{1}{C}\sum _{i=1}^{m} \mu _i \xi _i\right) \end{aligned}$$
(15)

The FSVM aims to determine the optimum \(\textbf{w}\), b, and \(\mu _i\), which maximizes the margin among the classes while considering the fuzzy membership values and handling misclassified instances through slack variables. FSVM has exhibited strong performance in uncertain or imprecise data domains, including medical diagnosis, image processing, and pattern recognition.
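The objectives in Eqs. (12) and (14) differ only in the membership weights \(\mu _i\). The sketch below evaluates both for a given hyperplane, assuming a linear kernel, labels in {-1, +1}, and slack values taken at their optimum \(\xi _i = \max (0, 1 - y_i(\textbf{w}^T \textbf{x}_i + b))\). It does not solve the optimization problem, the decision rule shown is the plain sign function of Eq. (13), and all function names and data are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decide(w, b, x_new):
    """Sign decision of Eq. (13); returns +1 or -1."""
    return 1 if dot(w, x_new) + b >= 0 else -1


def soft_margin_objective(w, b, X, y, C=1.0, memberships=None):
    """Evaluate 0.5 * ||w||^2 + C * sum(mu_i * xi_i).

    With memberships=None every mu_i is 1, which gives Eq. (12);
    passing values in [0, 1] gives the fuzzy objective of Eq. (14).
    """
    if memberships is None:
        memberships = [1.0] * len(X)
    slack = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    penalty = sum(mu * xi for mu, xi in zip(memberships, slack))
    return 0.5 * dot(w, w) + C * penalty


X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, -1, 1]
w, b = (1.0, 0.0), 0.0
print(soft_margin_objective(w, b, X, y))
print(soft_margin_objective(w, b, X, y, memberships=[1.0, 1.0, 0.5]))
```

Lowering the membership of the third sample, which lies inside the margin, reduces its contribution to the penalty; this is how FSVM down-weights uncertain instances.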

2.3.5 RF classifier

RF is used for classification and regression analysis. It is an ensemble learning technique that combines multiple decision trees to increase the stability and accuracy of the model. The RF generates many trees, each built from a randomly selected subset of the data points and features of the training data [43]. Each decision tree is therefore trained on a different data and feature set. The final estimate is determined by aggregating the estimates of all the decision trees.

Consider a data set with m instances and n features, and a set of C possible class labels \(\{C_1, C_2, \ldots , C_C\}\), the RF algorithm creates an ensemble of decision trees to carry out the classification. Let \(\textbf{X}\) represent the feature matrix having size \(m \times n\), where every row \(\textbf{x}_i\) corresponds to the \(i^{th}\) instance, and \(\textbf{y}\) be the corresponding vector of class labels (\(\textbf{y} = [y_1, y_2, \ldots , y_m]^T\)).

The Random Forest algorithm constructs T decision trees using a bootstrapped subset of the data. Each tree is trained to minimize impurity or maximize information gain at each node, leading to diverse trees. The final classification is determined through a majority vote of the predictions from each tree. For a new instance \(\textbf{x}_{\text {new}}\), the RF prediction \(y_{\text {new}}\) is given by:

$$\begin{aligned} y_{\text {new}} = \text {mode}\left( f_1(\textbf{x}_{\text {new}}), f_2(\textbf{x}_{\text {new}}), \ldots , f_T(\textbf{x}_{\text {new}}) \right) \end{aligned}$$
(16)

where \(f_i(\textbf{x}_{\text {new}})\) is the output of the \(i^{th}\) decision tree for the new instance \(\textbf{x}_{\text {new}}\) and \(\text {mode}\) returns the most frequently occurring class label among the individual tree predictions.
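The aggregation step in Eq. (16) is a simple majority vote. In the sketch below the "trees" are hypothetical one-line threshold rules standing in for trained decision trees, so only the voting logic is illustrated.

```python
from collections import Counter


def forest_predict(trees, x_new):
    """Return the most frequent class label among the tree predictions."""
    votes = Counter(tree(x_new) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for T = 3 trained trees.
trees = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.5),
]
print(forest_predict(trees, (0.9, 0.2)))
```

Using an odd number of trees for a two-class problem avoids tied votes.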

3 Experimental results

This section assesses the performance of multiple classifiers to select the best-performing classifier to identify the protein methylated sites. The performance of the proposed model is also compared with previous state-of-the-art arginine methylation predictors.

3.1 Evaluation metrics

Evaluation metrics quantify the effectiveness of the proposed model. We compared the performance of the proposed model with several classifiers and with recent predictors in the field. The assessment criteria used in this study include recall, precision, accuracy, specificity, F1 score, Matthews correlation coefficient (MCC), and area under the curve (AUC). These metrics provide information about the proposed method’s classification accuracy, predictive power, and discrimination power. They are computed from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

AUC was obtained from the receiver operating characteristic (ROC) curve using the trapezoidal approximation. Another measure for assessing the quality of a binary (two-class) classification is the Matthews Correlation Coefficient (MCC). It uses true and false positives and negatives and is considered a balanced measure that can be used even when the classes are of different sizes. MCC is the correlation coefficient between the observed and predicted binary classifications, and the F1 score is the harmonic mean of precision and recall.

$$\begin{aligned}&Accuracy = \frac{(TP+TN)}{(TP+FP+TN+FN)} \end{aligned}$$
(17)
$$\begin{aligned}&Recall \; or \; Sensitivity = \frac{TP}{(TP+FN)} \end{aligned}$$
(18)
$$\begin{aligned}&Precision = \frac{TP}{TP+FP} \end{aligned}$$
(19)
$$\begin{aligned}&Specificity = \frac{TN}{FP+TN} \end{aligned}$$
(20)
$$\begin{aligned}&MCC = \frac{(TP*TN-FN*FP)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \end{aligned}$$
(21)
$$\begin{aligned}&F1-measure = 2*\frac{(precision*recall)}{(precision+recall)} \end{aligned}$$
(22)
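Eqs. (17) to (22) can be computed directly from the four confusion-matrix counts, as sketched below. The counts in the example are hypothetical and are not results from this study; returning 0 for an undefined MCC is a common convention that is assumed here.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics of Eqs. (17)-(22) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "precision": precision,
        "specificity": tn / (fp + tn),
        # Assumed convention: MCC is 0 when the denominator vanishes.
        "mcc": (tp * tn - fn * fp) / mcc_denominator if mcc_denominator else 0.0,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for a 100-sample test set.
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```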

3.2 Performance of various classifiers

In this study, we evaluated the results of different machine-learning algorithms to identify methylation sites in proteins. The extracted features from the protein sequences are physicochemical properties, dipeptide composition, normal distribution, and amino acid composition. We trained various machine learning models on these features, including the KNN, SVM, FSVM, RF, and Naive Bayes classifiers. Our test results are shown in Table 1.

Table 1 Performance comparison of various machine learning classifiers for the prediction of arginine methylation sites

Table 1 shows that the accuracies of the KNN, SVM, FSVM, RF, and naive Bayes models are 71.9%, 78.7%, 80.9%, 88.4%, and 79.6%, respectively. In addition to accuracy, we also evaluated the performance using the F1 score, MCC, precision, specificity, recall, and AUC score. The RF model achieves 89% precision, 93% specificity, 0.772 MCC, 88% F1 score, and 89% recall (see Table 1). The specificities achieved by the KNN, SVM, FSVM, and naive Bayes models are 72.7%, 79.3%, 78.4%, and 78.8%, respectively. Moreover, the recalls obtained by the KNN, SVM, FSVM, and naive Bayes models are 72%, 79%, 81%, and 80%, respectively. We also generated ROC curves to evaluate the performance of the models mentioned above. The ROC curve shows the true positive rate (sensitivity) on the y-axis and the false positive rate (1-specificity) on the x-axis for various cut-off points of the predicted scores. The AUC is a combined measure of the sensitivity and specificity of the classification test, indicating its validity.

An AUC score indicates the overall performance of the binary classification model in terms of its ability to distinguish positive and negative examples. The maximum AUC of 1 indicates that the classifier perfectly distinguishes between positive and negative data. The ROC curves for the different machine learning models are shown in Fig. 2. We obtained AUC values of 0.78, 0.82, 0.86, 0.85, and 0.94 for the KNN, SVM, naive Bayes, FSVM, and RF models, respectively. AUC values in the range 0.75\(\le \) AUC \(\le \) 0.95 indicate an excellent ability to distinguish between methylated and unmethylated arginine sites. From Fig. 2 and Table 1, we can see that the RF model performs better than the other classifiers in terms of all evaluation metrics. Therefore, the RF model was adopted as the base classifier for predicting protein arginine methylated sites. Overall, our findings suggest that machine learning algorithms can be used to predict methylation sites in proteins accurately.
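The trapezoidal AUC computation can be sketched as follows: the ROC points are obtained by sweeping the decision threshold over the predicted scores, and the area is summed trapezoid by trapezoid. The labels and scores in the example are hypothetical, and the function assumes that both classes are present.

```python
def roc_auc_trapezoid(labels, scores):
    """Area under the ROC curve via the trapezoidal rule.

    labels: 1 for positive, 0 for negative; scores: higher means
    more likely positive. Tied scores are handled as a single threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


print(roc_auc_trapezoid([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```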

Fig. 2 Comparison of the ROC curves of different machine learning algorithms for predicting arginine methylated sites from protein sequences
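The construction of a ROC curve and its AUC described above can be sketched as follows. This is a minimal illustration under our own assumptions: the helper names `roc_curve_points` and `auc_trapezoid`, as well as the labels and scores, are hypothetical and do not reproduce the curves in Fig. 2.

```python
def roc_curve_points(y_true, scores):
    """Return (false positive rate, true positive rate) points for all thresholds."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Visit samples from the highest to the lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        # Samples with tied scores cross the threshold together.
        while i < len(order) and scores[order[i]] == threshold:
            if y_true[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels (1 = methylated, 0 = unmethylated) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

roc_points = roc_curve_points(y_true, scores)
auc = auc_trapezoid(roc_points)
print(f"AUC = {auc:.3f}")
```

In this toy example eight of the nine positive-negative pairs are ranked correctly, so the AUC equals 8/9, which matches the interpretation of the AUC as the probability that a randomly chosen positive example is scored above a randomly chosen negative one.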

Understanding protein methylated sites is essential for both epigenetic inheritance and proteome analysis of several human diseases. Determining which arginine sites in a protein can be methylated and which cannot is the first significant issue that has to be addressed in order to fully understand the methylation process and to support drug development. The proposed model can be used to identify arginine methylated sites in a protein. Moreover, methylation of proteins or DNA can result in epigenetic inheritance. Researchers and medical professionals have established a connection between the dysregulation of neurochemistry and the emergence of long-term health issues, including hypertension, diabetes, obesity, and depression. Understanding the processes behind these fundamental epigenetic phenomena would thus yield useful knowledge and suggestions for medication research.

3.3 Comparison of the RMSxAI with existing predictors

The proposed model (RMSxAI) uses RF to predict arginine methylated sites. To assess its effectiveness, RMSxAI was compared with previous state-of-the-art predictors. For a fair comparison, the identical arginine methylation data set was used for RMSxAI and for the state-of-the-art predictors. The comparison results are shown in Table 2. RMSxAI was compared with MeMo [11], BPB-PPMS [44], PMeS [45], MASA [46], iMethyl-PseAAC [10], PSSMe [47], MePred-RF [48], PRmePRed [26], DeepRMethylSite [9], and SSMFN [49] on the same data set provided by Kumar et al. [26]. The performance values of these ten predictors were taken from Lumbanraja et al. [49]. Except for specificity, the evaluation measures of the proposed model are higher than those of the existing arginine methylation site predictors. The accuracy, recall, and MCC of RMSxAI were 1.57% to 32.4%, 1.91% to 77%, and 0.04 to 0.61 higher than those of the state-of-the-art predictors, respectively. In conclusion, the proposed model performed better than the existing predictors at identifying arginine methylated sites.

Table 2 The comparison of the RMSxAI with previous predictors on the same data set provided by Kumar et al. [26]

4 Explainable artificial intelligence

To investigate the effectiveness of RMSxAI, we use XAI to explain the predictions of the proposed model. XAI is an approach that addresses the increasing need for accountability, transparency, and clarity in machine learning models. Machine learning models are often regarded as “black boxes” that are difficult to understand because of their intricate internal structure. XAI is a set of methods for understanding why a machine learning model makes a given prediction. With these methods, the choices made during the machine learning procedure can be tracked and explained. The prediction accuracy may be evaluated by running simulations and comparing the output of XAI with the outcomes in the dataset. Local Interpretable Model-Agnostic Explanations (LIME), which explains the individual predictions of a classifier, is a widely used approach for this purpose. We use the LIME technique to generate local explanations for the machine learning models. LIME works by generating synthetic samples similar to the input instance. The synthetic data are used to train a linear model, which is then used to explain the decision of the original model for that instance. The linear model approximates the estimate of the proposed model locally and identifies the features that matter most. We used LIME to explain the estimates produced by the proposed model. The LIME explanation of an arbitrary instance from the test data set is shown in Fig. 3.

Fig. 3 Explanation of an instance from the dataset using XAI
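The local surrogate idea behind LIME can be sketched in a few lines. The code below is a simplified illustration and not the LIME library or the pipeline used in this study: it perturbs an instance with Gaussian noise, weights each synthetic sample by its proximity to the instance, and fits a weighted linear model whose coefficients serve as local feature contributions. The black-box function, the feature values, and all parameter defaults are hypothetical, and the feature discretization that produces the value ranges in Fig. 3 is omitted.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def lime_style_explanation(predict, instance, n_samples=500,
                           scale=0.1, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear surrogate around one instance."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # Synthetic neighbour: the instance plus Gaussian noise on every feature.
        z = [v + rng.gauss(0.0, scale) for v in instance]
        sq_dist = sum((zi - vi) ** 2 for zi, vi in zip(z, instance))
        # Exponential kernel: closer samples influence the fit more.
        weights.append(math.exp(-sq_dist / kernel_width ** 2))
        rows.append([1.0] + z)          # leading 1.0 is the intercept column
        targets.append(predict(z))      # query the black-box model
    p = len(instance) + 1
    # Weighted normal equations: (X^T W X) beta = X^T W y.
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
             for j in range(p)] for i in range(p)]
    xtwy = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets))
            for i in range(p)]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[1:]                     # drop the intercept


def black_box(x):
    """Hypothetical stand-in classifier: probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-(4.0 * x[0] - 2.0 * x[1])))


# Hypothetical instance with three features; the third has no effect.
instance = [0.5, 0.3, 0.4]
local_weights = lime_style_explanation(black_box, instance)
for idx, w in enumerate(local_weights):
    print(f"feature {idx}: local weight {w:+.3f}")
```

A positive local weight pushes the prediction towards class “1” and a negative weight towards class “0”, which is how the bars in an explanation such as Fig. 3 are read. For a trained RF model, `black_box` would be replaced by the model's probability output for the positive class.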

It can be observed from Fig. 3 that the estimate of the proposed model depends on several properties, including the dipeptide composition, amino acid composition, and physicochemical properties of the protein. In Fig. 3a, we can see the properties that contribute to the prediction for an individual instance from the test data. Depending on the range in which a feature value falls, the feature contributes to either the positive or the negative class. Figure 3b shows the cumulative effect for each class and the individual features that determine the class. Here, “0” indicates that the methylation site is absent, and “1” indicates that the methylation site is present. The model predicts that this test instance is methylated with 81% confidence. All the features that contribute to this decision are shown in Fig. 3b. The main reasons for the decision are that the dipeptide composition of the pair RF is greater than 0.00 and the helix fraction (a physicochemical property) is greater than 0.32.

The average impact of each feature on the output of the proposed model, obtained using LIME, is shown in Fig. 4. The plot shows the ten most important features for predicting protein methylation sites. The main features include molecular weight, the composition of the dipeptide pair RG, the composition of amino acid G, and other physicochemical properties. These results indicate which features are most relevant when the proposed model predicts protein methylation sites.
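One plausible way to obtain such an average impact, assuming it is the mean absolute LIME weight of each feature over a set of explained instances, is sketched below. The aggregation rule, the function name `mean_absolute_impact`, the feature names, and the weights are all hypothetical and are not taken from Fig. 4.

```python
from collections import defaultdict


def mean_absolute_impact(explanations, top_k=10):
    """Rank features by their mean absolute local weight across instances.

    Each explanation maps a feature name to its local weight for one instance.
    A feature missing from an explanation counts as zero for that instance.
    """
    totals = defaultdict(float)
    for explanation in explanations:
        for feature, weight in explanation.items():
            totals[feature] += abs(weight)
    n = len(explanations)
    ranked = sorted(((feature, total / n) for feature, total in totals.items()),
                    key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical per-instance LIME weights for illustration only.
explanations = [
    {"molecular_weight": 0.12, "dipeptide_RG": 0.10, "amino_acid_G": -0.05},
    {"molecular_weight": -0.08, "dipeptide_RG": 0.06, "amino_acid_G": 0.03},
]

top_features = mean_absolute_impact(explanations, top_k=2)
for feature, impact in top_features:
    print(f"{feature}: {impact:.3f}")
```

Taking absolute values before averaging keeps features that push some instances towards class “1” and others towards class “0” from cancelling out in the ranking.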

The results of this study suggest that machine learning algorithms can be utilized to predict methylation sites in high-density proteins. Using XAI helps to understand how the proposed model makes decisions, which can be used to improve model accuracy and identify new patterns in data. The difficulties and constraints associated with XAI include user understanding, human bias, model complexity, security, and data privacy. Nevertheless, XAI remains important, since it is a vital step towards future deployments of more responsible and effective AI models.

Fig. 4 Feature importance for the identification of protein arginine methylation sites using XAI

5 Conclusion

In summary, this paper explores the use of machine learning algorithms to predict methylation sites and to explain the predictions through XAI. With this work, we demonstrate that machine learning algorithms can accurately predict methylation sites, which can help us understand epigenetic changes and their effects on gene expression.

Our results show that various machine learning algorithms, including RF, FSVM, and other classifiers, give good results in predicting methylation sites. These algorithms exploit the relationship between protein abundance and methylation patterns, and they demonstrate the ability to process extensive genomics data and produce accurate predictions. Applying machine learning algorithms in this context opens up new avenues to identify potential methylated sites and understand the mechanisms underlying epigenetic changes. In addition, the integration of XAI allows us to interpret and explain the predictions made by the machine learning models. XAI provides clarity and interpretability, helping researchers and scientists understand the complex decision-making process of machine learning models when predicting methylation sites.

In the future, more research is needed to explore the full potential of machine learning algorithms and XAI methods in methylation site prediction. Integration with other omics data, such as gene expression and chromatin accessibility, could provide a more comprehensive understanding of epigenetic regulation. In addition, developing software tools that combine machine learning and XAI methods would facilitate their adoption by the broader research community, thereby advancing epigenetics and personalized medicine.

This study provides essential lessons and valuable insights from applying XAI techniques to the prediction of methylation sites with machine learning algorithms. These advances could revolutionize our understanding of epigenetic changes and their impact on human health and disease.