RMSxAI: arginine methylation sites prediction from protein sequences using machine learning algorithms and explainable artificial intelligence

Dwivedi, Gaurav; Khandelwal, Monika; Rout, Ranjeet Kumar; Umer, Saiyed; Mallik, Saurav; Qin, Hong

doi:10.1007/s42452-024-05898-y

Research
Open access
Published: 16 June 2024

Volume 6, article number 329, (2024)
Discover Applied Sciences

Gaurav Dwivedi¹,
Monika Khandelwal¹,
Ranjeet Kumar Rout¹,
Saiyed Umer²,
Saurav Mallik³,⁴ &
Hong Qin⁵

Abstract

Protein methylation is a vital regulator of many biological processes at the post-translational level, and accurate prediction of protein methylation sites is essential for research and drug discovery. In this paper, we present a new method, namely RMSxAI, to predict the arginine methylation sites from primary sequences using machine learning algorithms and describe the predictions using explainable artificial intelligence (XAI) techniques. Leveraging experimentally validated methylated and unmethylated protein sequences from diverse organisms, we deduced several sequence features, encompassing physicochemical properties, amino acid composition, and evolutionary insights. Our results show that the proposed RMSxAI can predict protein methylation sites with high accuracy, bringing the F1 score up to 0.88 and overall accuracy up to 88.4%. We use various XAI methods to explain the output results. These include key features, partial occupancy maps, and local variation models that provide insight into key features and interactions that lead to predictions. Overall, our approach is relevant to research and drug discovery, and our results demonstrate the potential of machine learning algorithms and XAI methods to provide accurate and meaningful prediction of arginine methylation sites.

Article highlights

A novel method, RMSxAI, is proposed to predict arginine methylation sites from the protein sequences using machine learning algorithms.
It explores the sequence-based features, such as physicochemical properties, dipeptide composition, amino acid composition, and distribution information, to extract discriminative and informative features from the sequences.
It explains the feature importance and prediction results using explainable artificial intelligence.

1 Introduction

Protein methylation is a vital post-translational modification (PTM) that plays a significant role in regulating various biological processes, including the modulation of gene expression, protein-protein interactions, DNA repair, and signal transduction. This modification involves adding a methyl group to specific amino acid residues, such as lysine or arginine, within proteins. Identifying protein methylation sites is vital for fundamental research and drug discovery endeavors. Substantial research progress has revealed the involvement of protein methylation in several human diseases, including neurotic disorders [1], multiple sclerosis, coronary heart disease [2], cancer [3, 4], and rheumatoid arthritis [5]. Therefore, precise prediction of methylated sites is crucial for understanding the underlying molecular mechanisms associated with protein methylation. Traditional experimental techniques like ChIP-chip [6], mass spectrometry [7], and methylation-specific antibody probing [8] can identify methylation sites in proteins but are time-consuming and expensive. In the era of big data, there is therefore a growing demand for machine learning-based tools that predict methylation sites accurately and rapidly.

Machine learning algorithms have shown great potential in predicting protein methylation sites from protein sequences in recent years. These algorithms can learn complex patterns from large datasets and extract sequence features, enabling accurate and efficient predictions. For example, some researchers used a deep learning model [9] to identify protein methylation sites from amino acid sequences and achieved an accuracy of 87.04%. Among the research endeavors, one notable example is the iMethyl-PseAAC [10], an innovative Support Vector Machine (SVM) model designed specifically for predicting protein methylation sites. They achieved high prediction accuracy using the position-specific scoring matrix and the pseudo-amino acid composition features. Chen et al. [11] presented a pioneering methodology MeMo which ingeniously merged an SVM algorithm with orthogonal binary coding schemes to retrieve information from the primary sequences. MeMo showed improved prediction performance compared to traditional sequence-based methods. He et al. [12] devised MethyCancer, a machine learning-based tool for identifying cancer-related methylation sites. They employed an ensemble model that combined multiple machine-learning algorithms to enhance prediction accuracy. Various bioinformatics techniques have been devised during the past few years for predicting different PTM locations in protein sequences [13,14,15,16,17,18,19,20,21,22,23,24].

However, the predictions of machine learning algorithms are often difficult to interpret, hindering their practical application in biology and medicine. Explainable artificial intelligence (XAI) techniques aim to provide clear and comprehensive explanations of model predictions. XAI can help scientists understand how machine learning algorithms arrive at their predictions and gain insight into the biological processes that underlie those predictions.

In this work, we propose a novel method, RMSxAI, for predicting protein methylation sites using machine learning algorithms. Additionally, we elucidate and interpret the predictions by applying XAI techniques. We used a database of methylated and unmethylated protein sequences and extracted multiple features, including dipeptide composition, amino acid composition, physicochemical properties, and distribution information. We used several machine learning algorithms, including random forests (RF), SVM, naive Bayes classifier, K-nearest neighbors (KNN), and fuzzy SVM (FSVM), to predict protein methylation sites and evaluated their performance using a variety of metrics. We also used various XAI concepts to explain the predictions and to understand the biological mechanisms behind protein methylation. An overview of the proposed model is shown in Fig. 1.

Our approach has potential implications for both science and drug discovery. Predicting protein methylation sites can deliver insight into the biological functions of proteins and can be used to develop new treatments for diseases, including cancer and neurodegenerative diseases [25]. The XAI methods used in our study could also be applied to other machine-learning models in computational biology and medicine to make their predictions more interpretable.

The remainder of the paper is organized as follows: Sect. 2 describes the data used in the research, feature extraction, machine learning algorithms, and the evaluation metrics used to assess all models. Section 3 discusses the results along with a comparison of various algorithms for the identification of protein methylation sites. Section 4 explains the interpretation process of XAI. Finally, Sect. 5 concludes the paper and discusses implications for research and future directions in this area.

2 Materials and methods

2.1 Data set

We gathered experimental evidence of in vivo methylated arginine sites from the manuscript [26] and the UniProt database (version 2015_06) [27]. The selected dataset consists of experimentally verified methylation sites. Searches were performed using the terms ‘arginine,’ ‘methylation,’ and ‘methylation site’ to ensure the reliability of the data included. After careful analysis, peptides/proteins mentioned in publications found through PubMed searches from June 2015 to December 2015 were added to the database. The collected data did not include methylated regions reported in vitro, which lack significant in vivo evidence. Ambiguous regions/proteins, including those with non-standard amino acids, inconsistent regions, small proteins (fewer than 30 amino acids), and non-protein entries, were removed. To maintain the reliability of the data, methylated sites from the PhosphoSitePlus database [28] were not considered, because detailed information on the experimental sources and additional supporting evidence confirming the PTM were lacking. We selected this dataset because most state-of-the-art methods use it, and a fair comparison of the proposed model with those methods requires the same dataset.

Notably, a significant portion of our methylation data is identical to the information shown on PhosphoSitePlus, which is reported there as having been extracted from the literature. The data set consists of 2596 sequences, of which 1298 are methylated (positive) peptide sequences. The negative peptide sequences were generated from arginine sites not marked as methylated in the same protein sequences from which the positive data were selected. Using CD-HIT-2d [29] with a 40% similarity cut-off, negative peptide sequences that were redundant with the positive sequences were removed. Each peptide sequence has a length of 19 and is a segment of the full protein sequence. We combined the positive and negative sequences to form a benchmark data set, which is divided into two parts: 80% of the data is used for model training and the remaining 20% for testing.
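The peptide extraction and the 80/20 split described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes that each 19-residue peptide is centered on the candidate arginine (9 flanking residues per side) and that sites too close to a terminus are skipped, neither of which is stated in the text. The function names `arginine_windows` and `train_test_split` are hypothetical.

```python
import random

def arginine_windows(protein, flank=9):
    """Return (position, peptide) pairs for each arginine with complete flanks.

    Assumption: the peptide is centered on the arginine, giving a length of
    2 * flank + 1 = 19 for flank=9. Arginines closer than `flank` residues to
    either terminus are skipped here; the source does not say how they are handled.
    """
    windows = []
    for i, residue in enumerate(protein):
        if residue == "R" and i - flank >= 0 and i + flank < len(protein):
            windows.append((i, protein[i - flank:i + flank + 1]))
    return windows

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

For example, `arginine_windows("A" * 9 + "R" + "A" * 9)` yields a single 19-residue peptide with the arginine at its center.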

2.2 Feature representation

The process of extracting features from primary sequences is vital in bioinformatics. It requires analysis and coding of relevant information about the protein sequence according to numerical or categorical features. These features then serve as essential inputs for various machine learning algorithms for classification and prediction. Some of the features that are extracted from amino acid sequences are described below:

2.2.1 Amino acid composition

This method requires the composition of individual amino acids in the protein sequence. The frequency of every amino acid is computed to represent the primary sequence to the 20-dimensional feature vector [30, 31]. For instance, suppose a primary sequence S has length k; then the following equation can be used to find the frequency of every amino acid.

$$\begin{aligned} A_i = \frac{N_i}{k}, \; \; \; \; \; \; i=1 \; to \; 20 \end{aligned}$$

(1)

where k indicates the sequence length and $N_i$ denotes the frequency count of $i^{th}$ amino acid in the primary sequence.
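Equation 1 can be illustrated with a short sketch. The ordering of the 20 standard amino acids in `AMINO_ACIDS` and the function name are choices made for this example only.

```python
# Alphabetical one-letter codes of the 20 standard amino acids
# (the ordering is an assumption made for this example).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return the 20-dimensional vector A_i = N_i / k of Eq. 1."""
    k = len(sequence)
    return [sequence.count(aa) / k for aa in AMINO_ACIDS]
```

For the toy sequence `"AARG"`, the entries for A, R, and G are 0.5, 0.25, and 0.25, and all other entries are zero.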

2.2.2 Physicochemical properties

This process involves the calculation of various properties, such as molecular weight, isoelectric point, hydrophobicity, and other descriptors like turn fraction, helix fraction, GRAVY (grand average of hydropathy), and aromaticity [32, 33]. These descriptors capture the main physical and chemical properties of proteins, providing a better understanding of their behavior.

(a) Hydrophobicity: Hydrophobicity is the tendency of an amino acid to repel water and to prefer non-polar environments. Various scales are available to measure the hydrophobicity of amino acids, including the Kyte-Doolittle, Hessa, and Eisenberg scales [34]. The Kyte-Doolittle scale assigns each amino acid a numerical value related to its hydrophobic or hydrophilic (water-attracting) properties. Positive values indicate greater hydrophobicity, while negative values indicate stronger hydrophilicity. The hydrophobicity index (KD) for an amino acid i is estimated using the formula below:

$$\begin{aligned} KD_i = \frac{\sum _{j=1}^{n} x_{ij}}{n} \end{aligned}$$

(2)

where n is the protein sequence length and $x_{ij}$ is the hydrophobicity value of $i^{th}$ amino acid at $j^{th}$ position in the protein sequence.
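One way to read Eq. 2 is as the average Kyte-Doolittle value over all positions of the sequence. The sketch below follows that reading; it is an illustration under that assumption, and the function name is hypothetical. The table holds the published Kyte-Doolittle hydropathy values.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydrophobicity(sequence):
    """Average the Kyte-Doolittle value over the n positions of the sequence.

    Raises KeyError for non-standard residues, which the data set excludes.
    """
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)
```

A positive result indicates a predominantly hydrophobic peptide and a negative result a predominantly hydrophilic one.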

(b) Charge and Polarity: To measure the net charge of a protein sequence, the contribution of the positively charged amino acids is summed and the contribution of the negatively charged amino acids is subtracted. This calculation results in a value indicating whether the protein is positively charged, negatively charged, or neutral [35].

For charge, consider a protein sequence S having length n, where $S = s_1 s_2 \ldots s_n$ and $s_i$ indicate the $i^{th}$ amino acid in the sequence. Let $\text {pH}$ be the pH value at which the net charge is calculated.

The net charge of the protein sequence is computed by using the below equation:

$$\begin{aligned} \text {charge} = \sum _{i=1}^{n} q(s_i), \qquad q(s_i) = {\left\{ \begin{array}{ll} \frac{1}{1 + 10^{(\text {pH} - \text {pKa}\_\text {values}[s_i])}}, & \text {if } \text {protonated}[s_i] = \text {True} \\ -\frac{1}{1 + 10^{(\text {pKa}\_\text {values}[s_i] - \text {pH})}}, & \text {if } \text {protonated}[s_i] = \text {False} \end{array}\right. } \end{aligned}$$

(3)

where $\text {pKa}\_\text {values}[s_i]$ is the pKa value of the ionizable group of amino acid $s_i$, and $\text {protonated}[s_i]$ is True when that group is positively charged in its protonated form (a basic group) and False when it is negatively charged in its deprotonated form (an acidic group). Each term is the charged fraction of the group at the given pH according to the Henderson-Hasselbalch relation; residues without an ionizable group contribute zero.

The resulting $\text {charge}$ provides the net charge of the protein sequence at the specified pH value. The overall polarity of the protein sequence can be estimated by counting the numbers of polar and non-polar amino acid residues. Analyzing the relative abundance of polar and non-polar residues can provide insight into the hydrophilic or hydrophobic nature of the protein. Once the features are extracted, we can use them for various tasks such as classification, clustering, or prediction. Let $\text {polar}\_\text {aa}$ be the set of polar amino acid residues and $\text {nonpolar}\_\text {aa}$ be the set of non-polar amino acid residues. The following formula can be used to compute the polarity score:

$$\begin{aligned} \text {polarity score} = \frac{\text {polar}\_\text {count}}{\text {nonpolar}\_\text {count}} \end{aligned}$$

(4)

where $\text {polar}\_\text {count}$ is the count of polar amino acid residues in the sequence, and $\text {nonpolar}\_\text {count}$ is the count of non-polar amino acid residues in the sequence. If $\text {nonpolar}\_\text {count}$ is 0, meaning there are no non-polar amino acids in the sequence, the function returns $-1$ to indicate an undefined polarity score. The resulting polarity score measures the relative proportion of polar to non-polar amino acids in the sequence, indicating its overall polarity.
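A minimal sketch of Eq. 4, including the $-1$ return value for an undefined score, is given below. The assignment of residues to `POLAR_AA` and `NONPOLAR_AA` is an assumption made for this example; the text does not list the members of the two sets, and other groupings are in common use.

```python
# Assumed grouping of the 20 standard amino acids (not specified in the text).
POLAR_AA = set("RNDQEHKSTY")
NONPOLAR_AA = set("ACGILMFPWV")

def polarity_score(sequence):
    """Return polar_count / nonpolar_count, or -1 if there are no non-polar residues."""
    polar_count = sum(1 for residue in sequence if residue in POLAR_AA)
    nonpolar_count = sum(1 for residue in sequence if residue in NONPOLAR_AA)
    if nonpolar_count == 0:
        return -1  # undefined score, as described for Eq. 4
    return polar_count / nonpolar_count
```

A score above 1 means that polar residues outnumber non-polar ones under this grouping.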

(c) Isoelectric point: The isoelectric point (pI) of a protein can be estimated using the Henderson-Hasselbalch equation [36]:

$$\begin{aligned} pI = \frac{pKa_1 + pKa_2}{2} \end{aligned}$$

(5)

where $pKa_2$ is the ionization constant of the most basic group with a pKa value less than the pH and $pKa_1$ is the ionization constant of the most acidic group with a pKa value greater than the pH. The pKa values for the ionizable groups are specific to the amino acids in the primary sequence.

2.2.3 Dipeptide composition

Dipeptides consist of two adjacent amino acid residues in a protein sequence and play a significant role in revealing the complex structure and function of proteins [37]. By studying the frequency and distribution of dipeptides, scientists can learn more about the relationship between amino acid pairs, revealing important information about protein structure, stability, association, and enzymatic activity. Dipeptide analysis is a powerful tool for understanding the complex mechanisms behind protein folding, binding interactions, and structural changes. Additionally, dipeptide-based signatures facilitate the development of machine-learning models for protein classification, prediction, and engineering. There are 400 possible dipeptide pairs, and the frequency of every pair is computed to describe a primary sequence by a 400-dimensional fixed-length vector. Equation 6 is used to compute the dipeptide composition of the primary sequence S with length k.

$$\begin{aligned} D_i=\frac{n_i}{k}, \; \; \; \; \; \; \; i=1 \; to \; 400 \end{aligned}$$

(6)

where $n_i$ is the count that tells how many times the $i^{th}$ dipeptide pair occurs within the sequence S.
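Equation 6 can be sketched as below. The example assumes that dipeptides are counted over overlapping adjacent positions and keeps the sequence length k as the denominator, as written in Eq. 6; the ordering of the 400 pairs and the function name are choices made for this illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, e.g. "AA", "AC", ..., "YY" (ordering is an assumption).
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def dipeptide_composition(sequence):
    """Return the 400-dimensional vector D_i = n_i / k of Eq. 6."""
    k = len(sequence)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    # Slide over adjacent positions so that overlapping pairs are counted.
    for i in range(k - 1):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [counts[d] / k for d in DIPEPTIDES]
```

For a 19-residue peptide there are 18 adjacent pairs, so most of the 400 entries are zero and the vector is sparse.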

2.2.4 Normal distribution

The normal distribution, sometimes called the Gaussian distribution, is a continuous probability distribution widely used in statistics and probability theory [38]. This distribution is characterized by a symmetric bell curve, where the mean and the standard deviation control the central tendency and the spread of the distribution, respectively. The mode, median, and mean are equal in a normal distribution, and about 68% of the data lie within one standard deviation of the mean. Additionally, around 95% of the data fall within two standard deviations of the mean, and almost all data points lie within three standard deviations. Consider a protein sequence S having length n, where $S = s_1 s_2 \ldots s_n$ and $s_i$ indicates the $i^{th}$ amino acid in the sequence.

Let $x_i$ be the property value associated with amino acid $s_i$; we have used frequencies of each amino acid in the protein sequence as the property value. Let $\sigma _i$ be the standard deviation and $\mu _i$ be the mean of the property value for amino acid $s_i$. The normal distribution value $\text {ND}(x_i)$ for amino acid $s_i$ can be estimated using the probability density function:

$$\begin{aligned} \text {ND}(x_i) = \frac{1}{\sqrt{2\pi }\sigma _i} \cdot e^{-\frac{(x_i - \mu _i)^2}{2\sigma _i^2}} \end{aligned}$$

(7)

The resulting $\text {ND}(x_i)$ represents the normal distribution value of the property corresponding to amino acid $s_i$ in the primary sequence.
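The density in Eq. 7 can be evaluated as in the sketch below. How $\mu _i$ and $\sigma _i$ are estimated is not detailed in the text, so the example simply takes them as inputs; the function name is hypothetical.

```python
import math

def normal_density(x, mu, sigma):
    """Evaluate the normal probability density of Eq. 7 at x."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * sigma)
```

The density is largest at `x == mu` and symmetric around it, so property values far from the mean receive small ND values.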

In this study, we employed a comprehensive set of features to characterize and analyze protein sequences to gain deeper insights into their structural and functional attributes. The features utilized encompassed diverse aspects, including the frequency of individual amino acids, the fractional representation of each amino acid, occurrences of dipeptides, and normal distribution values associated with each amino acid. Furthermore, physicochemical properties such as molecular weight, charge, polarity, hydrophobicity, aromaticity, instability index, GRAVY, helix fraction, turn fraction, and sheet fraction were considered. In the present study, each protein sequence is represented by a 471-dimensional feature vector. This multifaceted feature set allowed us to construct a holistic profile of the protein sequences under investigation. These features capture the primary sequence characteristics and the physicochemical properties that contribute to the proteins’ overall structural and functional aspects.

2.3 Machine learning algorithms

2.3.1 KNN classifier

The KNN is a supervised machine learning technique that may be used for regression and classification problems [39]. It functions as a non-parametric technique, using the k training samples nearest to the test sample to generate predictions. For classification tasks, KNN assigns a new sample the majority class among its nearest neighbors. In regression analysis, the algorithm calculates the mean of the K nearest neighbors to make the estimation.

Unlike other algorithms, KNN does not require an explicit training phase, as it stores all training samples in memory. The main parameter in KNN is the K value, which determines the number of neighbors taken into account when making the prediction. The choice of K can affect the algorithm’s performance and is often made by cross-validation. Mathematically, the KNN classifier assigns the class label $y_{\text {new}}$ to $x_{\text {new}}$ using the majority class of its k nearest neighbors:

$$\begin{aligned} y_{\text {new}} = \underset{y}{\text {argmax}} \sum _{x_i \text { among k nearest neighbors}} \delta (y_i, y) \end{aligned}$$

(8)

where $y_{\text {new}}$ is the predicted class label of $x_{\text {new}}$, $y_i$ is the class label of the $i^{th}$ nearest neighbor, and $\delta (y_i, y)$ is the Kronecker delta function that returns 1 if $y_i = y$ and 0 otherwise. The parameter k determines the number of neighbors to consider. A larger k results in a smoother decision boundary, while a smaller k captures more local details.
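The majority vote of Eq. 8 can be sketched as follows. The example assumes Euclidean distance, which the text does not specify, and the function name is hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, x_new, k=3):
    """Assign x_new the majority label among its k nearest training samples (Eq. 8)."""
    # Assumption: neighbors are ranked by Euclidean distance.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], x_new))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Using an odd k for a two-class problem avoids tied votes.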

2.3.2 Naive Bayes classifier

Naive Bayes is a machine learning algorithm often utilized in classification and prediction tasks. It is known for its simplicity and efficiency, especially when dealing with extensive data sets [40]. The Naive Bayes algorithm is based on Bayes’ theorem, which relates the probability of a hypothesis (in this context, a class label) given the observed evidence (the feature values) to the probability of that evidence given the hypothesis. Given a data set with m instances and n features, and a set of C possible class labels $\{C_1, C_2, \ldots , C_C\}$, the Naive Bayes classifier assigns a class label $C_i$ to a new instance $\textbf{x}_{\text {new}} = (x_1, x_2, \ldots , x_n)$ by calculating the conditional probabilities of each class given the features:

$$\begin{aligned} P(C_i | \textbf{x}_{\text {new}}) = \frac{P(\textbf{x}_{\text {new}} | C_i) \cdot P(C_i)}{P(\textbf{x}_{\text {new}})} \end{aligned}$$

(9)

where $P(\textbf{x}_{\text {new}} | C_i)$ is the likelihood of the features $\textbf{x}_{\text {new}}$ given class $C_i$ and $P(C_i)$ is the prior probability of class $C_i$. $P(C_i | \textbf{x}_{\text {new}})$ is the posterior probability of class $C_i$ given the new instance $\textbf{x}_{\text {new}}$. $P(\textbf{x}_{\text {new}})$ is the marginal likelihood of the features $\textbf{x}_{\text {new}}$.

The “naive” assumption behind Naive Bayes is that, given the class labels, the features are conditionally independent. This assumption simplifies the likelihood term:

$$\begin{aligned} P(\textbf{x}_{\text {new}} | C_i) = P(x_1 | C_i) \cdot P(x_2 | C_i) \cdot \cdots \cdot P(x_n | C_i) \end{aligned}$$

(10)

The Naive Bayes classifier then assigns the new instance to the class having the maximum posterior probability:

$$\begin{aligned} \text {Predicted class} = \underset{C_i}{\text {argmax}} \left( P(C_i) \cdot \prod _{j=1}^{n} P(x_j | C_i) \right) \end{aligned}$$

(11)

In practice, probabilities $P(x_j | C_i)$ are often estimated from the training data using techniques like frequency counting or kernel density estimation. Naive Bayes is often used as a baseline algorithm for text classification and spam filtering tasks.
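The decision rule of Eq. 11 can be sketched for continuous features as below. Note the substitution: this example models each $P(x_j | C_i)$ with a Gaussian density, which is a common choice but is not stated in the text to be the estimator used in this study. Log-probabilities are summed instead of multiplying probabilities to avoid numerical underflow; the function names and the small variance floor are choices made for this illustration.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the prior, per-feature means, and per-feature variances for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor keeps the variance positive when a feature is constant.
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x_new):
    """Return the class maximizing log P(C_i) + sum_j log P(x_j | C_i) (Eq. 11)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, v in zip(x_new, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score
    return max(model, key=log_posterior)
```

The marginal likelihood $P(\textbf{x}_{\text {new}})$ is omitted because it is the same for every class and does not change the argmax.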

2.3.3 SVM classifier

The SVM is a supervised learning algorithm for regression and classification. It works by defining an optimum hyperplane that divides data points into different classes [41]. The SVM algorithm tries to learn a decision boundary with the largest possible separation between the two classes. The margin is the distance between the decision boundary and the nearest data points in each class. The goal is to determine a hyperplane that separates the classes and generalizes well while minimizing the classification error. Given a data set with m instances and n features, and a set of C possible class labels $\{C_1, C_2, \ldots , C_C\}$, the SVM algorithm finds a hyperplane that best classifies the data into distinct classes and maximizes the margin among the classes.

Let $\textbf{X}$ represent the feature matrix of size $m \times n$, where each row $\textbf{x}_i$ corresponds to the $i^{th}$ instance, and $\textbf{y}$ be the corresponding vector of class labels ($\textbf{y} = [y_1, y_2, \ldots , y_m]^T$). The optimization problem of finding the optimal hyperplane can be formulated as:

$$\begin{aligned} \begin{aligned} \text {minimize} \quad&\frac{1}{2} \Vert \textbf{w}\Vert ^2 + C \sum _{i=1}^{m} \xi _i \\ \text {subject to} \quad&y_i (\textbf{w}^T \textbf{x}_i + b) \ge 1 - \xi _i, \quad \forall i = 1, 2, \ldots , m \\&\xi _i \ge 0, \quad \forall i = 1, 2, \ldots , m \end{aligned} \end{aligned}$$

(12)

where $\textbf{w}$ is the weight vector of the hyperplane and b is the bias term. $\xi _i$ are slack variables that allow for soft-margin classification, and C is a regularization parameter that balances the trade-off between minimizing the classification error and maximizing the margin.

The decision function for classifying a new instance $\textbf{x}_{\text {new}}$ is given by:

$$\begin{aligned} f(\textbf{x}_{\text {new}}) = \text {sign}(\textbf{w}^T \textbf{x}_{\text {new}} + b) \end{aligned}$$

(13)

The SVM seeks to determine the optimum values of $\textbf{w}$ and b that maximize the margin among the classes while ensuring that data points are correctly classified, either outside the margin or within a tolerance controlled by $\xi _i$, on the correct side of the hyperplane. The effectiveness of SVM has been shown in different areas, such as image recognition, text classification, and bioinformatics. However, the selection of the kernel function and its associated parameters may affect the results. Additionally, SVM training can be computationally demanding, especially when dealing with large data sets.
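To make Eqs. 12 and 13 concrete, the sketch below evaluates the soft-margin objective for a given linear hyperplane and applies the decision function. It does not solve the optimization problem; it only shows how the slack variables follow from the constraints once $\textbf{w}$ and b are fixed. Labels are assumed to be encoded as +1 and -1, a score of exactly zero is mapped to +1 by convention, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def soft_margin_objective(w, b, X, y, C=1.0):
    """Return (objective, slacks) for Eq. 12, with labels y_i in {+1, -1}.

    For fixed w and b, the smallest feasible slack is
    xi_i = max(0, 1 - y_i * (w . x_i + b)).
    """
    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks

def svm_decide(w, b, x_new):
    """Decision function of Eq. 13; a score of exactly zero is mapped to +1."""
    return 1 if dot(w, x_new) + b >= 0 else -1
```

A point with zero slack lies on or outside the margin, a slack between 0 and 1 marks a point inside the margin but correctly classified, and a slack above 1 marks a misclassified point.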

2.3.4 Fuzzy SVM classifier

FSVM is a variant of the SVM algorithm that integrates fuzzy logic into the decision-making process. It operates as a supervised learning approach capable of handling uncertainty present in the input data [42]. In FSVM, the algorithm assigns membership values to each data point in the training set, considering their proximity to the decision boundary.

Given a data set with m instances and n features, and a set of C possible class labels $\{C_1, C_2, \ldots , C_C\}$, the FSVM algorithm extends the traditional SVM by introducing a fuzziness factor to account for uncertainty in class assignments. Let $\textbf{X}$ represent the feature matrix of size $m \times n$, where each row $\textbf{x}_i$ corresponds to the $i^{th}$ instance, and $\textbf{y}$ be the corresponding vector of class labels ($\textbf{y} = [y_1, y_2, \ldots , y_m]^T$). The optimization problem of finding the optimal hyperplane for FSVM can be formulated as:

$$\begin{aligned} \begin{aligned} \text {minimize} \quad&\frac{1}{2} \Vert \textbf{w}\Vert ^2 + C \sum _{i=1}^{m} \mu _i \xi _i \\ \text {subject to} \quad&y_i (\textbf{w}^T \textbf{x}_i + b) \ge 1 - \xi _i, \quad \forall i = 1, 2, \ldots , m \\&and \quad \xi _i \ge 0, \quad \forall i = 1, 2, \ldots , m \\&and \quad \mu _i \in [0, 1], \quad \forall i = 1, 2, \ldots , m \end{aligned} \end{aligned}$$

(14)

where b is the bias and $\textbf{w}$ is the weight vector of the hyperplane. $\xi _i$ are slack variables that allow for soft-margin classification. $\mu _i$ are fuzzy membership values representing each instance’s degree of class membership. They capture the uncertainty in class assignments. C is a regularization parameter that balances the trade-off between minimizing the classification error and maximizing the margin. $y_i$ is the class label of instance $\textbf{x}_i$. The following function is used for classifying a new instance $\textbf{x}_{\text {new}}$:

$$\begin{aligned} f(\textbf{x}_{\text {new}}) = \text {sign}\left( \textbf{w}^T \textbf{x}_{\text {new}} + b - \frac{1}{C}\sum _{i=1}^{m} \mu _i \xi _i\right) \end{aligned}$$

(15)

The FSVM aims to determine the optimum $\textbf{w}$, b, and $\mu _i$, which maximizes the margin among the classes while considering the fuzzy membership values and handling misclassified instances through slack variables. FSVM has exhibited strong performance in uncertain or imprecise data domains, including medical diagnosis, image processing, and pattern recognition.

2.3.5 RF classifier

RF is used for classification and regression analysis. It functions as an ensemble learning technique that combines many decision trees to increase the stability and accuracy of the model. The RF generates many trees, each built by randomly selecting a subset of the data points and features from the training data [43]. Each decision tree is therefore trained on a different data and feature set. The final estimate is determined by aggregating the estimates of all the decision trees.

Consider a data set with m instances and n features, and a set of C possible class labels $\{C_1, C_2, \ldots , C_C\}$, the RF algorithm creates an ensemble of decision trees to carry out the classification. Let $\textbf{X}$ represent the feature matrix having size $m \times n$, where every row $\textbf{x}_i$ corresponds to the $i^{th}$ instance, and $\textbf{y}$ be the corresponding vector of class labels ($\textbf{y} = [y_1, y_2, \ldots , y_m]^T$).

The Random Forest algorithm constructs T decision trees using a bootstrapped subset of the data. Each tree is trained to minimize impurity or maximize information gain at each node, leading to diverse trees. The final classification is determined through a majority vote of the predictions from each tree. For a new instance $\textbf{x}_{\text {new}}$, the RF prediction $y_{\text {new}}$ is given by:

$$\begin{aligned} y_{\text {new}} = \text {mode}\left( f_1(\textbf{x}_{\text {new}}), f_2(\textbf{x}_{\text {new}}), \ldots , f_T(\textbf{x}_{\text {new}}) \right) \end{aligned}$$

(16)

where $f_i(\textbf{x}_{\text {new}})$ is the output of the $i^{th}$ decision tree for the new instance $\textbf{x}_{\text {new}}$ and $\text {mode}$ returns the most frequently occurring class label among the individual tree predictions.
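The bootstrap sampling and the majority vote of Eq. 16 can be sketched as below. Tree construction itself is omitted: each tree is represented by any callable that maps an instance to a class label, and the function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(m, seed=None):
    """Draw m row indices with replacement, as used to train one tree."""
    return random.Random(seed).choices(range(m), k=m)

def forest_predict(trees, x_new):
    """Return the most frequent label among the individual tree predictions (Eq. 16)."""
    votes = [tree(x_new) for tree in trees]
    return Counter(votes).most_common(1)[0][0]
```

As a toy usage, three one-rule "trees" such as `lambda x: int(x[0] > 0.5)` can be passed as `trees`, and the forest returns whichever label at least two of them agree on.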

3 Experimental results

This section assesses the performance of multiple classifiers to select the best-performing classifier to identify the protein methylated sites. The performance of the proposed model is also compared with previous state-of-the-art arginine methylation predictors.

3.1 Evaluation metrics

The evaluation metrics measure and analyze the effectiveness and efficiency of the proposed model. We conduct rigorous experiments using appropriate criteria to assess the effectiveness of the proposed model. We compared the performance of the proposed model with various frameworks and the latest technology in the field. The assessment criteria used in this study involved recall, precision, accuracy, area under the curve (AUC), F1 score, etc. These metrics provide information about the proposed method’s classification accuracy, predictive power, and discrimination power. We use a true positive (TP), true negative (TN), false positive (FP), and false negative (FN) based assessment method to evaluate the effectiveness of the proposed model. These measurements successfully measure the classification accuracy and the error of our method.

AUC was obtained from the receiver operating characteristic (ROC) curve using the trapezoidal approximation of the area under the curve. Another measure for assessing the quality of a binary (two-class) classification is the Matthews Correlation Coefficient (MCC). It uses true and false positives and negatives and is considered a balanced measure that can be used even when the classes are of different sizes. MCC is the correlation coefficient between the observed and predicted binary classifications, and the F1 measure, the harmonic mean of precision and recall, summarizes the accuracy of the classification.

$$\begin{aligned}&\text{Accuracy} = \frac{TP+TN}{TP+FP+TN+FN} \end{aligned}$$

(17)

$$\begin{aligned}&\text{Recall (Sensitivity)} = \frac{TP}{TP+FN} \end{aligned}$$

(18)

$$\begin{aligned}&\text{Precision} = \frac{TP}{TP+FP} \end{aligned}$$

(19)

$$\begin{aligned}&\text{Specificity} = \frac{TN}{FP+TN} \end{aligned}$$

(20)

$$\begin{aligned}&\text{MCC} = \frac{TP \times TN - FN \times FP}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \end{aligned}$$

(21)

$$\begin{aligned}&\text{F1-measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \end{aligned}$$

(22)
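As a worked illustration of Eqs. (17) to (22), the following sketch computes all six metrics from the four confusion-matrix counts. The function name, the zero-denominator convention (returning 0.0), and the example counts are assumptions made here for illustration; they are not values from the study.

```python
import math


def _safe_div(numerator, denominator):
    # Convention assumed here: report 0.0 when a denominator is zero
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics of Eqs. (17)-(22) from confusion-matrix counts."""
    accuracy = _safe_div(tp + tn, tp + fp + tn + fn)
    recall = _safe_div(tp, tp + fn)
    precision = _safe_div(tp, tp + fp)
    specificity = _safe_div(tn, fp + tn)
    mcc = _safe_div(
        tp * tn - fn * fp,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    f1 = _safe_div(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "mcc": mcc,
        "f1": f1,
    }


# Hypothetical counts, chosen only to exercise the formulas
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these example counts the accuracy is 85/100 = 0.85 and the recall is 40/50 = 0.80, which can be checked by hand against Eqs. (17) and (18).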

3.2 Performance of various classifiers

In this study, we evaluated several machine learning algorithms for identifying methylation sites in proteins. The features extracted from the protein sequences are physicochemical properties, dipeptide composition, normal distribution, and amino acid composition. We trained the KNN, SVM, FSVM, RF, and naive Bayes classifiers on these features. The test results are shown in Table 1.
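Two of the feature families named above can be illustrated with a short sketch. It assumes the usual definitions: amino acid composition as the relative frequency of each of the 20 standard residues, and dipeptide composition as the relative frequency of each of the 400 ordered residue pairs. The function names and the example peptide are hypothetical, and the exact normalization used by RMSxAI may differ.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Relative frequency of each standard amino acid (20 features)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Relative frequency of each ordered residue pair (400 features)."""
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    return {
        a + b: counts[a + b] / len(pairs)
        for a, b in product(AMINO_ACIDS, repeat=2)
    }


# Hypothetical peptide window around an arginine (R), for illustration only
window = "GRGRGFG"
aac = amino_acid_composition(window)
dpc = dipeptide_composition(window)
```

Residues outside the 20-letter alphabet are simply not reported in this sketch, so the frequencies sum to one only for sequences made of standard residues.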

Table 1 Performance comparison of various machine learning classifiers for the prediction of arginine methylation sites

Table 1 shows that the accuracies of the KNN, SVM, FSVM, RF, and naive Bayes models are 71.9%, 78.7%, 80.9%, 88.4%, and 79.6%, respectively. In addition to accuracy, we evaluated performance using the F1 score, MCC, precision, specificity, recall, and AUC. The RF model achieves 89% precision, 93% specificity, an MCC of 0.772, an F1 score of 88%, and 89% recall (see Table 1). The specificities achieved by the KNN, SVM, FSVM, and naive Bayes models are 72.7%, 79.3%, 78.4%, and 78.8%, respectively, and the corresponding recalls are 72%, 79%, 81%, and 80%. We also generated ROC curves to evaluate these models. A ROC curve plots the true positive rate (sensitivity) on the y-axis against the false positive rate (1 - specificity) on the x-axis for various cut-off points of the predicted score. The AUC summarizes the sensitivity and specificity of a classifier over all cut-off points in a single number.

The AUC score indicates the overall ability of a binary classification model to distinguish positive from negative examples. The maximum AUC of 1 corresponds to a classifier that separates positive and negative data perfectly. The ROC curves for the different machine learning models are shown in Fig. 2. We obtained AUC values of 0.78, 0.82, 0.86, 0.85, and 0.94 for the KNN, SVM, naive Bayes, FSVM, and RF models, respectively. AUC values in the range 0.75$\le $ AUC $\le $ 0.95 indicate an excellent ability to distinguish between methylated and unmethylated arginine sites. From Fig. 2 and Table 1, we can see that the RF model performs better than the other classifiers in terms of all evaluation metrics. Therefore, the RF model was adopted as the base classifier for predicting protein arginine methylation sites. Overall, our findings suggest that machine learning algorithms can be used to predict methylation sites in proteins accurately.
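The construction of a ROC curve and its trapezoidal AUC can be sketched as follows. The labels and scores are made-up values used only to exercise the code, and the function names are chosen here for illustration; the sketch assumes that both classes are present in the labels.

```python
def roc_points(labels, scores):
    """Sweep the cut-off from high to low and collect (FPR, TPR) points."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Samples sharing the same score enter at the same cut-off
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels (1 = methylated) and classifier scores
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_trapezoid(roc_points(labels, scores))
```

In this toy example, 8 of the 9 positive-negative pairs are ranked correctly, so the trapezoidal area equals 8/9, which matches the interpretation of AUC as the probability that a positive example scores higher than a negative one.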

Understanding protein methylation sites is essential for studying both epigenetic inheritance and the proteomics of several human diseases. Which arginine sites in a protein can be methylated, and which cannot? This is the first significant question that has to be answered in order to fully understand the methylation process and to support drug development, and the proposed model can be used to identify the arginine methylation sites in a protein. Moreover, methylation of proteins or DNA can result in epigenetic inheritance. Researchers and medical professionals have established a connection between the dysregulation of neurochemistry and the emergence of long-term health issues including hypertension, diabetes, obesity, and depression. Understanding the processes behind these fundamental epigenetic phenomena would therefore yield useful knowledge and suggestions for medication research.

3.3 Comparison of the RMSxAI with existing predictors

The proposed model (RMSxAI) uses RF to predict arginine methylation sites. To assess its effectiveness, RMSxAI was compared with previous state-of-the-art predictors; for a fair comparison, the same arginine methylation data set was used for all methods. The results of the comparison are shown in Table 2. RMSxAI was compared with MeMo [11], BPB-PPMS [44], PMeS [45], MASA [46], iMethyl-PseAAC [10], PSSMe [47], MePred-RF [48], PRmePRed [26], DeepRMethylSite [9], and SSMFN [49] on the data set provided by Kumar et al. [26]. The performance values of PMeS, BPB-PPMS, MASA, MeMo, PSSMe, iMethyl-PseAAC, MePred-RF, PRmePRed, DeepRMethylSite, and SSMFN were taken from Lumbanraja et al. [49]. Except for specificity, all evaluation measures of the proposed model are higher than those of the existing arginine methylation site predictors. The accuracy, recall, and MCC of RMSxAI were 1.57% to 32.4%, 1.91% to 77%, and 0.04 to 0.61 higher than those of the state-of-the-art predictors, respectively. In conclusion, the proposed model performed better than existing predictors at identifying arginine methylation sites.

Table 2 The comparison of the RMSxAI with previous predictors on the same data set provided by Kumar et al. [26]

4 Explainable artificial intelligence

To understand how RMSxAI arrives at its predictions, we use XAI to explain them. XAI is an approach that addresses the increasing need for accountability, transparency, and clarity in machine learning models. Machine learning models are often regarded as “black boxes” that are difficult to understand because of their intricate internal structure. XAI is a set of methods for understanding why a machine learning model makes a given prediction, so that the choices made during the machine learning procedure can be tracked and explained. The prediction accuracy may be evaluated by running simulations and comparing the output of XAI with the outcomes in the data set. Local Interpretable Model-Agnostic Explanations (LIME), which explains the individual predictions of a machine learning classifier, is one of the most widely used approaches for this. We use LIME to generate local explanations for the machine learning models. LIME works by creating synthetic samples similar to the input instance. These synthetic data are used to train a linear model, which approximates the original model near that instance and is used to explain its decision. The weights of the linear model indicate which features are most important for the prediction. We used LIME to explain the predictions of the proposed model. The LIME explanation of an arbitrary instance from the test data set is shown in Fig. 3.
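The perturb-and-fit idea behind LIME can be sketched in plain Python. This is a simplified variant under stated assumptions, not the LIME library and not the RMSxAI pipeline: it uses Gaussian perturbations around the instance, an exponential proximity kernel, and a weighted least-squares linear surrogate, and it omits the feature discretization and feature selection that the LIME library applies to tabular data. All names (`lime_sketch`, `solve_linear_system`, `black_box`) and parameter values are hypothetical.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def lime_sketch(black_box, instance, n_samples=500, scale=0.1,
                kernel_width=0.25, seed=0):
    """Fit a proximity-weighted linear surrogate around one instance."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # Synthetic sample near the instance
        z = [v + rng.gauss(0.0, scale) for v in instance]
        dist2 = sum((zi - vi) ** 2 for zi, vi in zip(z, instance))
        weights.append(math.exp(-dist2 / kernel_width ** 2))
        rows.append([1.0] + z)          # leading 1.0 is the intercept column
        targets.append(black_box(z))    # query the model being explained
    p = len(instance) + 1
    # Weighted normal equations: (X^T W X) beta = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
             for j in range(p)] for i in range(p)]
    xtwy = [sum(w * r[i] * y for w, r, y in zip(weights, rows, targets))
            for i in range(p)]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[0], beta[1:]  # intercept, per-feature local weights


# Toy black box (assumption): a probability driven mostly by the first feature
def black_box(z):
    return 1.0 / (1.0 + math.exp(-(4.0 * z[0] - 0.5 * z[1] - 1.0)))


intercept, local_weights = lime_sketch(black_box, [0.5, 0.3])
```

The sign and magnitude of each entry in `local_weights` play the role of the per-feature contributions in a LIME plot: a positive weight pushes the prediction towards the positive class near the explained instance, and a negative weight pushes it away.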

It can be observed from Fig. 3 that the prediction of the proposed model depends on several properties, including the dipeptide composition, amino acid composition, and physicochemical properties of the protein. Figure 3a shows the features contributing to the prediction for an individual instance from the test data. Each feature is listed with a value range, and depending on whether the feature value of the instance falls within that range, the feature contributes to the positive or the negative class. Figure 3b shows the cumulative effect for each class and the individual features that determine the class. Here, “0” indicates that the methylation site is absent, and “1” indicates that the methylation site is present. This test instance is predicted to be methylated with 81% confidence, and all the features that contribute to this decision are shown in Fig. 3b. The main reasons for the decision are that the dipeptide composition value of the pair RF is greater than 0.00 and that the helix fraction (a physicochemical property) is greater than 0.32.

The average impact of each feature on the output of the proposed model, computed using LIME, is shown in Fig. 4. The plot shows the ten most important features for predicting protein methylation sites. The main features include molecular weight, the composition of the dipeptide pair RG, the composition of amino acid G, and other physicochemical properties that affect the model output. These results demonstrate that the proposed model can predict protein methylation sites by identifying the most relevant features.
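One plausible way to obtain such a ranking is to average the absolute LIME weight of each feature over many explained instances. The sketch below assumes exactly that definition of "average impact", which the text does not spell out; the function name, feature names, and weights are invented for illustration.

```python
def mean_absolute_impact(explanations, top_k=10):
    """Rank features by mean absolute local weight across explanations.

    Each explanation is a dict {feature_name: weight}. A feature missing
    from an explanation is treated as having weight 0 for that instance.
    """
    if not explanations:
        raise ValueError("at least one explanation is required")
    totals = {}
    for explanation in explanations:
        for feature, weight in explanation.items():
            totals[feature] = totals.get(feature, 0.0) + abs(weight)
    averaged = {f: total / len(explanations) for f, total in totals.items()}
    ranked = sorted(averaged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical per-instance LIME weights (not values from the study)
explanations = [
    {"molecular_weight": 0.30, "dipeptide_RG": -0.20},
    {"molecular_weight": -0.10, "dipeptide_RG": 0.25, "aac_G": 0.05},
]
ranking = mean_absolute_impact(explanations)
```

Taking absolute values before averaging keeps features that push different instances in opposite directions from cancelling out, which is why a feature can rank highly even when its signed mean weight is close to zero.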

The results of this study suggest that machine learning algorithms can be used to predict methylation sites in proteins. XAI helps to explain how the proposed model makes its decisions, which can in turn be used to improve model accuracy and to identify new patterns in the data. The difficulties and constraints associated with XAI include user understanding, human bias, model complexity, security, and data privacy. Nevertheless, XAI is important because it is a vital step towards the future deployment of more responsible and effective AI models.

5 Conclusion

In summary, this research paper explores the use of machine learning algorithms to predict methylation sites and provide explanations through XAI. With this work, we demonstrate the ability of machine learning algorithms to accurately predict methylation sites that can help us understand epigenetic changes and their effects on gene expression.

Our results show that various machine learning algorithms, including RF, FSVM, and other classifiers, give good results in predicting methylation sites. These algorithms exploit the relationship between protein sequence features and methylation patterns, demonstrating their ability to process large data sets and produce accurate predictions. Applying machine learning algorithms in this context opens up new avenues for identifying potential methylation sites and understanding the mechanisms underlying epigenetic changes. In addition, the integration of XAI allows us to interpret and explain the predictions made by the machine learning models. XAI provides clarity and interpretability, helping researchers and scientists understand the complex decision-making process of machine learning models when predicting methylation sites.

In the future, more research is needed to explore the full potential of machine learning algorithms and XAI methods in methylation site prediction. Integration with other omics data, such as gene expression and chromatin accessibility, could provide a more comprehensive understanding of epigenetic regulation. In addition, developing software tools that combine machine learning and XAI methods would facilitate their adoption by the broader research community and thereby advance epigenetics and personalized medicine.

This study demonstrates essential lessons and valuable insights from XAI techniques for predicting methylation sites using machine learning algorithms. These advances could revolutionize our understanding of epigenetic changes and their impact on human health and disease.

Data availability

The data set used for this research is available from the UniProt database (www.uniprot.org).

References

1. Longo VD, Kennedy BK. Sirtuins in aging and age-related disease. Cell. 2006;126(2):257–68.
2. Chen X, Niroomand F, Liu Z, Zankl A, Katus H, Jahn L, Tiefenbacher C. Expression of nitric oxide related enzymes in coronary heart disease. Basic Res Cardiol. 2006;101:346–53.
3. Wang Y, Zhang S, Li F, Zhou Y, Zhang Y, Wang Z, Zhang R, Zhu J, Ren Y, Tan Y, et al. Therapeutic target database 2020: enriched resource for facilitating research and early development of targeted therapeutics. Nucleic Acids Res. 2020;48(D1):D1031–41.
4. Liu C, Chyr J, Zhao W, Xu Y, Ji Z, Tan H, Soto C, Zhou X, Initiative ADN. Genome-wide association and mechanistic studies indicate that immune response contributes to Alzheimer’s disease development. Front Genet. 2018;9:410.
5. Suzuki A, Yamada R, Yamamoto K. Citrullination by peptidylarginine deiminase in rheumatoid arthritis. Ann N Y Acad Sci. 2007;1108(1):323–39.
6. Johnson DS, Li W, Gordon DB, Bhattacharjee A, Curry B, Ghosh J, Brizuela L, Carroll JS, Brown M, Flicek P, et al. Systematic evaluation of variability in chip-chip experiments using predefined dna targets. Genome Res. 2008;18(3):393–403.
7. Ong S-E, Mittler G, Mann M. Identifying and quantifying in vivo methylation sites by heavy methyl silac. Nat Methods. 2004;1(2):119–26.
8. Boisvert F-M, Côté J, Boulanger M-C, Richard S. A proteomic analysis of arginine-methylated protein complexes. Mol Cell Proteom. 2003;2(12):1319–30.
9. Chaudhari M, Thapa N, Roy K, Newman RH, Saigo H, Dukka B. Deeprmethylsite: a deep learning based approach for prediction of arginine methylation sites in proteins. Mol Omics. 2020;16(5):448–54.
10. Qiu W-R, Xiao X, Lin W-Z, Chou K-C. imethyl-pseaac: identification of protein methylation sites via a pseudo amino acid composition approach. BioMed Research International. 2014.
11. Chen H, Xue Y, Huang N, Yao X, Sun Z. Memo: a web tool for prediction of protein methylation modifications. Nucleic Acids Res. 2006;34(suppl_2):W249–53.
12. He X, Chang S, Zhang J, Zhao Q, Xiang H, Kusonmano K, Yang L, Sun ZS, Yang H, Wang J. Methycancer: the database of human dna methylation and cancer. Nucleic Acids Res. 2007;36(suppl_1):D836–41.
13. Xu Y, Ding J, Wu L-Y, Chou K-C. isno-pseaac: predict cysteine s-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PloS One. 2013;8(2):e55844.
14. Qiu W-R, Xiao X, Lin W-Z, Chou K-C. iubiq-lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model. J Biomol Struct Dyn. 2015;33(8):1731–42.
15. Khandelwal M, Shabbir N, Umer S. Extraction of sequence-based features for prediction of methylation sites in protein sequences. Artif Intell Technol Comput Biol. 2022;29–46.
16. Xu Y, Wen X, Wen L-S, Wu L-Y, Deng N-Y, Chou K-C. initro-tyr: Prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. PloS One. 2014;9(8):e105018.
17. Qiu W-R, Sun B-Q, Xiao X, Xu Z-C, Chou K-C. iptm-mlys: identifying multiple lysine ptm sites and their different types. Bioinformatics. 2016;32(20):3116–23.
18. Liu L-M, Xu Y, Chou K-C. ipgk-pseaac: identify lysine phosphoglycerylation sites in proteins by incorporating four different tiers of amino acid pairwise coupling information into the general pseaac. Med Chem. 2017;13(6):552–9.
19. Xu Y, Wang Z, Li C, Chou K-C. ipreny-pseaayc: identify c-terminal cysteine prenylation sites in proteins by incorporating two tiers of sequence couplings into pseaac. Med Chem. 2017;13(6):544–51.
20. Xie H-L, Fu L, Nie X-D. Using ensemble svm to identify human gpcrs n-linked glycosylation sites based on the general form of chou’s pseaac. Protein Eng Des Sel. 2013;26(11):735–42.
21. Jia C, Lin X, Wang Z. Prediction of protein s-nitrosylation sites based on adapted normal distribution bi-profile bayes and chou’s pseudo amino acid composition. Int J Mol Sci. 2014;15(6):10410–23.
22. Zhang J, Zhao X, Sun P, Ma Z. Psno: predicting cysteine s-nitrosylation sites by incorporating various sequence-derived features into the general form of chou’s pseaac. Int J Mol Sci. 2014;15(7):11204–19.
23. Ju Z, He J-J. Prediction of lysine crotonylation sites by incorporating the composition of k-spaced amino acid pairs into chou’s general pseaac. J Mol Graph Model. 2017;77:200–4.
24. Khandelwal M, Kumar Rout R, Umer S, Mallik S, Li A. Multifactorial feature extraction and site prognosis model for protein methylation data. Brief Funct Genom. 2023;22(1):20–30.
25. Zhao J, Zou G, Xiao M, Lin Q, Wang Q, Liu J, Ma L. Cnnarginineme: A cnn structure for training models of predicting arginine methylation sites based on the one-hot encoding of peptide sequence. Available at SSRN 4045843.
26. Kumar P, Joy J, Pandey A, Gupta D. Prmepred: A protein arginine methylation prediction tool. PLoS One. 2017;12(8):e0183318.
27. UniProt Consortium. Uniprot: a hub for protein information. Nucleic Acids Res. 2015;43(D1):D204–12.
28. Hornbeck PV, Kornhauser JM, Tkachev S, Zhang B, Skrzypek E, Murray B, Latham V, Sullivan M. Phosphositeplus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 2012;40(D1):D261–70.
29. Huang Y, Niu B, Gao Y, Fu L, Li W. Cd-hit suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680–2.
30. Khandelwal M, Rout RK, Umer S. Protein-protein interaction prediction from primary sequences using supervised machine learning algorithm. In: 2022 12th international conference on cloud computing, data science & engineering (confluence). IEEE, 2022; pp. 268–272.
31. Roy S, Martinez D, Platero H, Lane T, Werner-Washburne M. Exploiting amino acid composition for predicting protein-protein interactions. PloS One. 2009;4(11):e7813.
32. Khandelwal M, Sheikh S, Rout RK, Umer S, Mallik S, Zhao Z. Unsupervised learning for feature representation using spatial distribution of amino acids in aldehyde dehydrogenase (aldh2) protein sequences. Mathematics. 2022;10(13):2228.
33. Rout RK, Umer S, Sheikh S, Sindhwani S, Pati S. Eightydvec: a method for protein sequence similarity analysis using physicochemical properties of amino acids. Comput Methods Biomech Biomed Eng Imaging Vis. 2022;10(1):3–13.
34. Hessa T, Meindl-Beinker NM, Bernsel A, Kim H, Sato Y, Lerch-Bader M, Nilsson I, White SH, Von Heijne G. Molecular code for transmembrane-helix recognition by the sec61 translocon. Nature. 2007;450(7172):1026–30.
35. da Rocha L, Baptista AM, Campos SR. Approach to study ph-dependent protein association using constant-ph molecular dynamics: application to the dimerization of $\beta $-lactoglobulin. J Chem Theory Comput. 2022;18(3):1982–2001.
36. Po HN, Senozan N. The henderson-hasselbalch equation: its history and limitations. J Chem Educ. 2001;78(11):1499.
37. Bhasin M, Raghava G. Eslpred: Svm-based method for subcellular localization of eukaryotic proteins using dipeptide composition and psi-blast. Nucleic Acids Res. 2004;32(suppl_2):W414–9.
38. Patel JK, Read CB. Handbook of the normal distribution, vol. 150. CRC Press; 1996.
39. Peterson LE. K-nearest neighbor. Scholarpedia. 2009;4(2):1883.
40. Webb GI, Keogh E, Miikkulainen R. Naïve bayes. Encyclopedia of Machine Learning. 2010;15(1):713–4.
41. Noble WS. What is a support vector machine? Nat Biotechnol. 2006;24(12):1565–7.
42. Lin C-F, Wang S-D. Fuzzy support vector machines. IEEE Trans Neural Netw. 2002;13(2):464–71.
43. Breiman L. Random forests. Mach Learn. 2001;45:5–32.
44. Shao J, Xu D, Tsai S-N, Wang Y, Ngai S-M. Computational identification of protein methylation sites through bi-profile Bayes feature extraction. PloS One. 2009;4(3):e4920.
45. Shi S-P, Qiu J-D, Sun X-Y, Suo S-B, Huang S-Y, Liang R-P. Pmes: prediction of methylation sites based on enhanced feature encoding scheme. PloS One. 2012;7(6):e38772.
46. Shien D-M, Lee T-Y, Chang W-C, Hsu JB-K, Horng J-T, Hsu P-C, Wang T-Y, Huang H-D. Incorporating structural characteristics for identification of protein methylation sites. J Comput Chem. 2009;30(9):1532–43.
47. Wen P-P, Shi S-P, Xu H-D, Wang L-N, Qiu J-D. Accurate in silico prediction of species-specific methylation sites based on information gain feature optimization. Bioinformatics. 2016;32(20):3107–15.
48. Wei L, Xing P, Shi G, Ji Z, Zou Q. Fast prediction of protein methylation sites using a sequence-based feature selection technique. IEEE/ACM Trans Comput Biol Bioinform. 2017;16(4):1264–73.
49. Lumbanraja FR, Mahesworo B, Cenggoro TW, Sudigyo D, Pardamean B. Ssmfn: a fused spatial and sequential deep learning model for methylation site prediction. PeerJ Comput Sci. 2021;7:e683.

Acknowledgements

We thank the US NSF award 1761839 and a catalyst award from the US National Academy of Medicine.

Funding

Hong Qin was supported by National Science Foundation (Grant No. 1761839).

Author information

Authors and Affiliations

National Institute of Technology Srinagar, Hazratbal, Srinagar, J&K, 190006, India
Gaurav Dwivedi, Monika Khandelwal & Ranjeet Kumar Rout
Aliah University, Newtown, Kolkata, West Bengal, 700156, India
Saiyed Umer
Harvard T H Chan School of Public Health, Boston, MA, 02115, USA
Saurav Mallik
University of Arizona, Tucson, AZ, 85721, USA
Saurav Mallik
University of Tennessee at Chattanooga, Chattanooga, TN, 37403, USA
Hong Qin

Contributions

Gaurav Dwivedi: Conceptualization of this study, Analysis and interpretation of data, Methodology, Writing-Original draft preparation. Monika Khandelwal: Conceptualization and analysis of this study, Methodology, Result analysis, Writing-Reviewing and editing. Ranjeet Kumar Rout: Conceptualization, Result analysis and Investigation, Writing-Reviewing and editing, Formal analysis. Saiyed Umer: Result analysis and Investigation, Writing-Reviewing and editing, Formal analysis. Saurav Mallik: Investigation, Writing-Reviewing and editing. Hong Qin: Investigation, Writing-Reviewing and editing.

Corresponding authors

Correspondence to Ranjeet Kumar Rout, Saurav Mallik or Hong Qin.

Ethics declarations

Competing interests

No conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Saurav Mallik and Hong Qin reviewed and edited the manuscript, and also validated the outcome. Both of them also supervised the project.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Dwivedi, G., Khandelwal, M., Rout, R.K. et al. RMSxAI: arginine methylation sites prediction from protein sequences using machine learning algorithms and explainable artificial intelligence. Discov Appl Sci 6, 329 (2024). https://doi.org/10.1007/s42452-024-05898-y

Received: 17 September 2023
Accepted: 16 April 2024
Published: 16 June 2024
DOI: https://doi.org/10.1007/s42452-024-05898-y
