Probing an optimal class distribution for enhancing prediction and feature characterization of plant virus-encoded RNA-silencing suppressors

Nath, Abhigyan; Subbiah, Karthikeyan

doi:10.1007/s13205-016-0410-1

Original Article
Open access
Published: 21 March 2016

Volume 6, article number 93, (2016)
3 Biotech

Abstract

To counter the host RNA silencing defense mechanism, many plant viruses encode RNA silencing suppressor proteins. These proteins share very low sequence and structural similarity, which hampers their annotation by sequence similarity-based search methods. Machine learning-based methods are a suitable alternative, but their performance is affected by factors such as class imbalance, incomplete learning and the selection of inappropriate features. In this paper, we propose a novel approach to the class imbalance problem that finds the optimal class distribution for enhancing the prediction accuracy for RNA silencing suppressors. The optimal class distribution was obtained using different resampling techniques with varying degrees of class distribution, from the natural distribution to the ideal (equal) distribution. The experimental results indicate that the optimal class distribution plays an important role in achieving near perfect learning. The best prediction results were obtained with the Sequential Minimal Optimization (SMO) learning algorithm: a sensitivity of 98.5 %, a specificity of 92.6 % and an overall accuracy of 95.3 % on tenfold cross validation, further validated using a leave-one-out cross validation test. Machine learning models trained on training sets oversampled with the synthetic minority oversampling technique (SMOTE) also performed better than models trained on either randomly undersampled or imbalanced training sets. Further, we have characterized the important discriminatory sequence features of RNA-silencing suppressors that distinguish these proteins from other protein families.

Introduction

RNA silencing is a common host defense mechanism in plants against many plant RNA/DNA viruses (Li et al. 2014a; Pérez-Cañamás and Hernández 2014; Valli et al. 2001). To counter it, these plant viruses encode RNA-silencing suppressors, which disturb the host RNA silencing pathway. The molecular basis of the mechanism by which these plant viruses encode RNA-silencing suppressors is still largely unknown. P1/HC–Pro of potyviruses, P19 of tombusviruses and the 2b proteins of cucumoviruses are some of the well-studied RNA silencing suppressors (Qu and Morris 2005), and new RNA silencing suppressors have recently been identified in a mastrevirus (Wang et al. 2014) and in a wheat dwarf virus (Liu et al. 2014). Recent studies have also pointed to the role of suppressors in modulating the function of microRNAs (Chapman et al. 2004; Dunoyer et al. 2004).

Annotation of putative members of this family is hampered by the high sequence diversity among plant virus-encoded RNA-silencing suppressors (Qu and Morris 2005). Sequence similarity-based search methods such as BLAST (Altschul et al. 1990) and PSI-BLAST (Altschul et al. 1997) have inherent limitations where sequence conservation is low. Jagga and Gupta (2014) emphasized the shortcomings of sequence similarity-based search methods such as PSI-BLAST in correctly annotating the members of this protein family. Machine learning methods trained on suitable, mathematically represented input feature vectors are a viable alternative. Different machine learning methods have previously been applied successfully to biological classification tasks (Kumari et al. 2015; Nath et al. 2012; Nath and Subbiah 2014). However, the true performance of machine learning methods is affected by factors such as class imbalance (Nath and Subbiah 2015a), imperfect learning due to missing example instances and the selection of inappropriate input features.

The class imbalance problem is quite common in biological datasets, where there is often a large difference in the number of instances belonging to the different classes and subclasses. Such imbalanced datasets bias the classifier towards the majority class and tend to produce a majority class classifier (Wei and Dunbrack 2013). In most cases the class of interest is the minority class, and this bias is the cause of lower sensitivity. Many methods have been proposed to deal with the class imbalance problem. It has been stressed previously that the natural class distribution may not be optimal for training (Lee 2014; Weiss and Provost 2003), and Wei and Dunbrack (2013) pointed out the requirement of a balanced training set for proper learning. In the current work, we propose a technique to achieve better learning of both the positive and negative classes by experimenting with different resampling methods that balance the dataset to varying degrees of class distribution. We repeated the experiments with different machine learning algorithms on the imbalanced dataset, on datasets oversampled with the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al. 2002) and on randomly undersampled datasets to find the optimal class distribution. We used sequence features such as amino acid composition, property group composition, dipeptide counts and property group n-grams to create the input feature vectors. Broadly, two types of approaches are used for handling class imbalance: (1) resampling methods, which are algorithm independent and transferable to different machine learning algorithms, and (2) internal approaches, which alter existing algorithms and their parameters to adapt them to an imbalanced class distribution. SMOTE and random undersampling are resampling methods. More sophisticated varieties of SMOTE exist (Barua et al. 2014; Han et al. 2005; Nakamura et al. 2013), but in the present study we have limited our focus to simple undersampling and SMOTE oversampling, as they have been found useful for many classifiers (Blagus and Lusa 2013) and in many biological classification problems (Batuwita and Palade 2009; MacIsaac et al. 2006; Xiao et al. 2011).

The current study explores the possible improvement in the prediction accuracy of machine learning algorithms using an optimal class distribution and presents in detail the behavior of the tested learning algorithms with varying degrees of resampling. The results also show that the prediction accuracy for plant virus-encoded RNA-silencing suppressor proteins can be improved using resampling techniques.

Materials and methods

Dataset

We used the dataset of Jagga and Gupta (2014), which consists of 208 plant virus-encoded RNA-silencing suppressor proteins (RSSPs) forming the positive class and 1321 non-suppressor proteins (NSPs) forming the negative class. CD-HIT (Li and Godzik 2006) was applied separately to the two classes of sequences to reduce redundancy at 70 % sequence identity. The positive class is the minority class, as the number of positive sequences is very small compared with the number of negative sequences, so its prediction will suffer from the class imbalance.

Extraction of feature vectors

The quality of the protein sequence attributes selected for the input feature vector greatly influences how well the concepts of a particular protein family are learned. We represented each protein sequence as a combination of the following sequence features, explained below, to create the input instances.

Amino acid composition feature

Different proteins evolve through the avoidance of and preference for specific amino acids, which leads to a characteristic percentage frequency composition that can be used successfully for discriminatory purposes (Nath and Subbiah 2014). We therefore took the percentage frequency of each of the 20 amino acids over the length of the protein sequence as one of the features of the input feature vector. It is calculated using the following formula:

$$ {\text{AA}}_{i} = \frac{{{\text{TC}}_{{{\text{AA}},i}} }}{{{\text{TC}}_{{{\text{res}}, i}} }} \times 100, $$

(1)

where AA denotes one of the 20 amino acid residues, AA_i denotes the percentage frequency of amino acid type ‘AA’ in the ith sequence, TC_AA,i denotes the total count of amino acids of type ‘AA’ in the ith sequence and TC_res,i denotes the total count of all residues in the ith sequence (i.e., the sequence length).
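Equation 1 can be illustrated with a minimal Python sketch. The function name `amino_acid_composition`, the one-letter alphabet constant and the toy sequence are assumptions made for illustration only; they are not taken from the original pipeline, which was run in WEKA.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> dict[str, float]:
    """Percentage frequency of each of the 20 amino acids, as in Eq. 1."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)   # TC_AA,i for every residue type
    length = len(sequence)       # TC_res,i: total count of all residues
    return {aa: 100.0 * counts[aa] / length for aa in AMINO_ACIDS}


# Toy sequence, not a protein from the dataset.
composition = amino_acid_composition("MARRSA")
print(round(composition["R"], 2))
```

Each sequence contributes 20 values that sum to 100 when it contains only standard residues.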

Amino acid property group composition feature

Amino acids can be grouped according to their physicochemical properties. Table 1 lists the amino acids belonging to the 11 different physicochemical groups. We took the percentage frequency composition of the 11 amino acid property groups, as used in (Nath et al. 2013), as the second feature. The formula for calculating this feature is given below.

Table 1 Physicochemical groupings of amino acids taken for the present study

$$ {\text{PG}}_{i} = \frac{{{\text{TC}}_{{{\text{PG}}, i}} }}{{{\text{TC}}_{{{\text{res}}, i}} }} \times 100, $$

(2)

where PG denotes one of the 11 different amino acid property groups, PG_i denotes the percentage frequency of specific ‘PG’ amino acid property group in the ith sequence, TC_PG,i denotes the total count of specific amino acid property group ‘PG’ in the ith sequence, TC_res,i denotes the total count of all residues in the ith sequence.

Dipeptide counts

Four hundred different dipeptides are possible from the 20 amino acids. To bring local sequence order and amino acid coupling into the prediction, we took the dipeptide counts as the third feature.
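A minimal sketch of the dipeptide feature follows. It assumes that dipeptides are counted over overlapping adjacent residue pairs and that pairs containing non-standard symbols are skipped; the text does not state either detail, and the function name `dipeptide_counts` is hypothetical.

```python
from itertools import product

# The 20 standard amino acids in one-letter code (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_counts(sequence: str) -> dict[str, int]:
    """Count each of the 400 possible dipeptides over overlapping pairs."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    sequence = sequence.upper()
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # assumption: ignore pairs with non-standard symbols
            counts[pair] += 1
    return counts


# Toy sequence, not a protein from the dataset: pairs MA, AR, RR, RS, SA.
counts = dipeptide_counts("MARRSA")
print(counts["RR"], sum(counts.values()))
```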

Property group n-grams

To account for the conservation of similar contiguous physicochemical amino acid property groups in the protein sequence, we calculated property group n-grams, where n is the window length. In the current work we took a window length of 2 as the fourth feature, calculated by the formula given below:

$$ {\text{Physicochemical}}\,\; 2 {\text{-grams}}:\,\;{\text{Small}} = \sum\limits_{i = 1}^{N - 1} {C\left({i,i + 1} \right)}, $$

(3)

where N denotes the length of the protein sequence and i denotes the position of the amino acid residue along the sequence. If the condition $ ({\text{aa}}_{i} \in S^{*} \;{\text{and}}\; {\text{aa}}_{i + 1} \in S^{*} ) $ is true then $ C(i,i + 1) = 1 $, otherwise $ C(i,i + 1) = 0 $, where the set of small amino acids is $ S^{*} = \{ {\text{Ala, Cys, Asp, Gly, Asn, Pro, Ser, Thr, Val}} \} $.

The above formula calculates the physicochemical 2-grams for the small amino acid group. The physicochemical 2-grams for the other ten physicochemical property groups were calculated in the same way. An example feature vector is provided in Supplementary Tables S1–S3.
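The property group composition (Eq. 2) and the property group 2-grams (Eq. 3) can be sketched for the small amino acid group, the only group whose members are listed in the text. The function names and the toy sequence are assumptions for illustration; the other ten groups of Table 1 would be handled the same way with their own residue sets.

```python
# S* from the text: Ala, Cys, Asp, Gly, Asn, Pro, Ser, Thr, Val (one-letter codes).
SMALL = set("ACDGNPSTV")


def group_composition(sequence: str, group: set[str]) -> float:
    """Percentage of residues that belong to the property group (Eq. 2)."""
    return 100.0 * sum(res in group for res in sequence) / len(sequence)


def group_2gram_count(sequence: str, group: set[str]) -> int:
    """Number of positions i where residues i and i+1 are both in the group (Eq. 3)."""
    return sum(
        1
        for i in range(len(sequence) - 1)
        if sequence[i] in group and sequence[i + 1] in group
    )


# Toy sequence, not from the dataset: A, S, T and G are small; M and R are not.
toy = "MASTRG"
small_pct = group_composition(toy, SMALL)
small_2grams = group_2gram_count(toy, SMALL)
print(round(small_pct, 2), small_2grams)
```

In the toy sequence only the adjacent pairs AS and ST lie entirely inside the group, so the 2-gram count reflects contiguity rather than overall abundance.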

Optimal balancing protocol

SMOTE

SMOTE was proposed by Chawla et al. (2002) for intelligent oversampling of minority samples, as opposed to random oversampling, which may bias the learning towards the overrepresented samples. It is a nearest neighbor-based method: for a particular minority sample it first finds the k nearest minority samples, then randomly selects j of them and creates synthetic minority samples between the original sample and each selected neighbor. The successful use of SMOTE in classification tasks has been shown in (Li et al. 2014b; Nath and Subbiah 2015b; Suvarna Vani and Durga Bhavani 2013).
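The interpolation step can be made concrete with a simplified, pure-Python sketch. It is not the WEKA implementation used in this study; the function name `smote`, its parameters and the toy points are assumptions. Following the convention of Chawla et al. (2002), an oversampling amount of N % corresponds to N/100 synthetic samples per minority sample, so `per_sample=5` is assumed here to mirror a 500 % setting.

```python
import math
import random


def smote(minority, per_sample, k=5, seed=0):
    """Create `per_sample` synthetic points for every minority sample.

    Each synthetic point lies on the line segment between a minority sample
    and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for base in minority:
        # k nearest minority neighbours of the base sample (itself excluded).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        for _ in range(per_sample):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # random position along the segment, in [0, 1)
            synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy two-dimensional minority class, not real feature vectors.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote(minority, per_sample=5, k=2)
print(len(synthetic))
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class.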

Classification protocol

SVM

Support vector machines (SVMs) are supervised learning algorithms based on the statistical learning theory of Vapnik (Vapnik 1995, 1998). In previous biological classification and prediction problems, SVMs have been found to be accurate, robust to noise and well suited to high dimensional datasets (Kandaswamy et al. 2011; Mishra et al. 2014; Pugalenthi et al. 2010). We used the sequential minimal optimization (SMO) algorithm (Platt 1999) for fast training of an SVM with a polynomial kernel with an exponent of 1 and C = 1 (C is the complexity parameter that SMO uses to build the hyperplane between the two classes; it governs the softness of the class margins).
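For reference, the role of the two settings can be written out in the standard soft-margin formulation. This is the textbook form rather than a description of WEKA internals, and it assumes a polynomial kernel without lower-order terms, in which case an exponent of p = 1 reduces to the linear kernel:

$$ K\left( {{\mathbf{x}},{\mathbf{x^{\prime}}}} \right) = \left( {{\mathbf{x}} \cdot {\mathbf{x^{\prime}}}} \right)^{p} ,\quad p = 1, $$

$$ \mathop {\min }\limits_{{{\mathbf{w}},b,\xi }} \;\frac{1}{2}\left\| {\mathbf{w}} \right\|^{2} + C\sum\limits_{i = 1}^{n} {\xi_{i} } \quad {\text{subject to}}\quad y_{i} \left( {{\mathbf{w}} \cdot {\mathbf{x}}_{i} + b} \right) \ge 1 - \xi_{i} ,\;\xi_{i} \ge 0. $$

A larger C penalizes the slack variables more heavily and gives a harder margin, while a smaller C tolerates more margin violations.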

All the experiments were carried out using WEKA (Hall et al. 2009), an open source Java-based machine learning platform. A schematic representation of the current methodology is given in Fig. 1.

Characterization of plant virus-encoded RNA-silencing suppressors

We used the ReliefF (Kira and Rendell 1992) feature ranking algorithm to rank the sequence features according to their discriminating ability. ReliefF is a nearest neighbor-based feature relevance algorithm. It starts by randomly selecting an instance and then searches for the nearest neighboring instances belonging to the same and the opposite class. It compares the attribute values of the instance with those of its nearest neighbors and assigns each attribute a weight according to its discriminating ability.
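The weighting idea can be sketched with the basic two-class Relief update of Kira and Rendell (1992). This is a simplified illustration and not the WEKA attribute evaluator used in the study: it visits every instance once instead of sampling at random, uses a single nearest hit and nearest miss, and the function name and toy data are assumptions.

```python
def relief_weights(X, y):
    """Basic two-class Relief over all instances (each class needs >= 2 rows)."""
    n, d = len(X), len(X[0])
    # Range of each feature, used to scale differences into [0, 1].
    spans = []
    for j in range(d):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(j, a, b) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [X[m] for m in range(n) if m != i and y[m] == y[i]]
        misses = [X[m] for m in range(n) if y[m] != y[i]]
        nearest_hit = min(hits, key=lambda row: distance(X[i], row))
        nearest_miss = min(misses, key=lambda row: distance(X[i], row))
        for j in range(d):
            # Reward features that differ from the nearest miss,
            # penalize features that differ from the nearest hit.
            weights[j] += (diff(j, X[i], nearest_miss) - diff(j, X[i], nearest_hit)) / n
    return weights


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y)
print([round(w, 3) for w in weights])
```

Features with larger weights are ranked as more discriminatory, which is the quantity visualized in a feature ranking heat map.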

Performance evaluation metrics

We used stratified tenfold cross validation to evaluate the various models. The performance of the machine learning algorithms was assessed with both threshold-dependent and threshold-independent parameters. These parameters are derived from the values of the confusion matrix: TP (true positives), the number of correctly predicted RSSPs; TN (true negatives), the number of correctly predicted NSPs; FP (false positives), the number of NSPs incorrectly predicted as RSSPs; and FN (false negatives), the number of RSSPs incorrectly predicted as NSPs. The formulas for calculating the evaluation parameters are given below:

Sensitivity: expresses the percentage of correctly predicted RSSPs.

$$ {\text{Sensitivity}} = \frac{{\text{TP}}}{{{\text{TP}} + {\text{FN}}}} \times 100. $$

(4)

Specificity: expresses the percentage of correctly predicted NSPs.

$$ {\text{Specificity}} = \frac{{\text{TN}}}{{{\text{TN}} + {\text{FP}}}} \times 100. $$

(5)

Accuracy: expresses the percentage of both correctly predicted RSSPs and NSPs.

$$ {\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FP}} + {\text{TN}} + {\text{FN}}}} \times 100. $$

(6)

AUC: the area under the receiver operating characteristic (ROC) curve, which summarizes the ROC by a single numerical value. It is a threshold-independent metric and can take values from 0 to 1 (Bradley 1997). A value of 0 indicates the worst case, 0.5 a random ranking and 1 the best prediction.

Youden’s Index: this performance metric evaluates the algorithm’s ability to avoid failure. Lower failure rates are expressed by higher index values (Youden 1950). It is calculated from sensitivity and specificity expressed as fractions (0 to 1) as:

$$ Y = \left( {\text{Sensitivity}} \right) - \left( {1 - {\text{Specificity}}} \right). $$

(7)

Dominance: expresses the relationship between the TP_rate (true-positive rate) and the TN_rate (true-negative rate) and was proposed by García et al. (2009). It is calculated as:

$$ {\text{Dominance}} = \left({\text{TP}} \_{\text{rate}}\right)-\left({\text{TN}} \_{\text{rate}}\right). $$

(8)

Its value ranges from −1 to +1. A dominance value of +1 means a perfect accuracy on the positive class and a value −1 means a perfect accuracy on the negative class. A value closer to zero means a balance between TP_rate and TN_rate.

g-mean: proposed by Kubat et al. (1997), this evaluation parameter shows the balance between sensitivity and specificity. It is the geometric mean of sensitivity and specificity and is calculated as:

$$ g{\text{-mean}} = \sqrt {{\text{Sensitivity}} \times {\text{Specificity}}}. $$

(9)
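Equations 4 to 9 can be collected into one small Python sketch. The confusion matrix counts below are hypothetical and chosen only to make the arithmetic easy to follow. Based on the scale of the values reported in the results, the sketch assumes that Youden's index and dominance are computed from fractional rates, while sensitivity, specificity, accuracy and g-mean are reported as percentages.

```python
import math


def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Threshold-dependent metrics derived from a 2 x 2 confusion matrix."""
    sensitivity = tp / (tp + fn)  # TP_rate, as a fraction
    specificity = tn / (tn + fp)  # TN_rate, as a fraction
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "sensitivity_pct": 100.0 * sensitivity,
        "specificity_pct": 100.0 * specificity,
        "accuracy_pct": 100.0 * accuracy,
        "youden": sensitivity - (1.0 - specificity),
        "dominance": sensitivity - specificity,
        "g_mean_pct": 100.0 * math.sqrt(sensitivity * specificity),
    }


# Hypothetical counts, not results from the paper.
metrics = evaluation_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```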

Results and discussion

We experimented with four different machine learning algorithms: (1) naive Bayes (NB), (2) the Fisher linear discriminant function (implemented as FLDA in WEKA), (3) support vector machines trained with sequential minimal optimization (SMO) and (4) K nearest neighbor (implemented as IBK in WEKA). Each was trained on the original imbalanced dataset, on randomly undersampled datasets (with varying class distribution) and on SMOTE oversampled datasets (with varying class distribution) to find the optimal class distribution for each classifier.

Learning performance on imbalanced dataset

When the machine learning algorithms were trained on the imbalanced dataset (Table 2), the overall accuracy of SMO and IBK exceeded 90 %, although with a large difference between their accuracies on the positive class (sensitivity) and the negative class (specificity). Training on the imbalanced dataset resulted in high specificity values for all the learning algorithms except naive Bayes. The negative dominance values of all the learning algorithms (except naive Bayes) likewise show a bias towards the TN_rate. This indicates that optimal learning, with high accuracy on both the positive class (sensitivity) and the negative class (specificity), is difficult when the numbers of positive and negative instances are imbalanced.

Table 2 Performance evaluation metrics of the different learning algorithms trained on the imbalanced datasets

Learning performance on randomly undersampled datasets

When the original imbalanced dataset was undersampled at different distribution rates to deal with the data imbalance problem, the nearest neighbor-based IBK method performed better than all the other machine learning algorithms, closely followed by SMO. The values of the performance evaluation parameters obtained at the different degrees of class distribution are recorded in Table 3. When the dataset was fully balanced by undersampling (undersampled 1:1), we obtained higher accuracy for the positive class samples than with any other undersampled dataset. The highest overall accuracy, 91.8 %, was obtained by IBK at an undersampling rate of 1:5, closely followed by SMO with 89.4 % accuracy.
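Random undersampling to a target class distribution can be sketched as follows. The sketch assumes that a rate of 1:r means one positive instance for every r negative instances, which the text does not state explicitly; the function name is hypothetical, and the integers merely stand in for the 208 positive and 1321 negative sequences of the dataset.

```python
import random


def random_undersample(positives, negatives, ratio, seed=0):
    """Keep all positives and a random subset of negatives at about 1:ratio."""
    target = min(len(negatives), round(ratio * len(positives)))
    rng = random.Random(seed)
    return list(positives), rng.sample(list(negatives), target)


# Placeholder identifiers for the 208 RSSPs and 1321 NSPs.
positives = list(range(208))
negatives = list(range(1321))

_, kept_1_to_5 = random_undersample(positives, negatives, ratio=5)
_, kept_1_to_1 = random_undersample(positives, negatives, ratio=1)
print(len(kept_1_to_5), len(kept_1_to_1))
```

Under this reading, a 1:5 distribution retains 1040 of the negative instances, whereas full balancing retains only 208 and therefore discards most of the negative class diversity.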

Table 3 Performance evaluation metrics of the different machine learning algorithms trained on the different randomly undersampled training sets

Learning performance on SMOTE oversampled datasets

SMO performed better than all the other machine learning algorithms on the SMOTE oversampled datasets, closely followed by FLDA. The values of the performance parameters are recorded in Table 4. One of the most noticeable effects of oversampling is the immediate increase in sensitivity for all four machine learning algorithms. Youden’s Index (which shows the model’s ability to avoid faults) increases steadily with the rate of SMOTE oversampling. The best trade-off among the evaluation parameters was obtained for the SMOTE 500 % dataset with SMO as the learning algorithm: 98.5 % sensitivity, 92.6 % specificity, 95.3 % overall accuracy and an AUC of 0.955. The high sensitivity indicates that the model is very accurate for the positive minority class samples. A positive dominance index of 0.059 also confirms that the model is good at predicting minority samples. The high Youden’s Index (0.911) indicates the model’s superior fault avoidance ability, and a g-mean of 95.5 indicates a good balance between sensitivity and specificity. ROC plots for the four machine learning algorithms trained on the best performing training set (SMOTE oversampled 500 % dataset) are shown in Fig. 2.
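As a consistency check, the derived indices follow directly from the reported sensitivity and specificity, taking Youden's index and dominance on the fractional scale and the g-mean on the percentage scale:

$$ Y = 0.985 - \left( {1 - 0.926} \right) = 0.911,\quad {\text{Dominance}} = 0.985 - 0.926 = 0.059,\quad g{\text{-mean}} = \sqrt {98.5 \times 92.6} \approx 95.5. $$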

Table 4 Performance evaluation metrics of the different machine learning algorithms trained on the different SMOTE oversampled training sets

To further validate the models trained on the SMOTE oversampled dataset (500 %), we used the leave-one-out cross validation test (Chou and Zhang 1995). It is deemed the most objective and robust test and has been used by many researchers for the assessment of classifier models (Chou and Cai 2004; Gao et al. 2005; Xie et al. 2013). The results are given in Table 5.

Table 5 Leave-one-out cross validation performance evaluation metrics on the best training set

Further, a corrected resampled paired t test was performed using WEKA with SMO as the baseline classifier. The t test was performed at the 5 % significance level. Each tenfold cross validation was repeated ten times (10 × 10 runs for each algorithm). The percentage of correctly predicted instances, AUC, TP rate and TN rate were used for comparison with the t test. The results of the t test are provided in the supplementary material (Table S4a–d).
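The statistic behind a corrected resampled paired t test can be sketched as below, using the commonly cited variance correction in which the usual 1/k term is replaced by 1/k + n_test/n_train. This is an illustration and not WEKA's code: the function name and the difference values are hypothetical, and for tenfold cross validation the test-to-train ratio is assumed to be 1/9.

```python
import math
import statistics


def corrected_resampled_t(differences, test_train_ratio):
    """t statistic for paired per-fold differences with the variance correction."""
    k = len(differences)
    mean_diff = statistics.mean(differences)
    variance = statistics.variance(differences)  # sample variance, k - 1 denominator
    if variance == 0:
        raise ValueError("differences have zero variance")
    return mean_diff / math.sqrt((1.0 / k + test_train_ratio) * variance)


# Hypothetical per-fold accuracy differences between two classifiers.
differences = [0.02, 0.01, 0.03, 0.00, 0.04]
t_corrected = corrected_resampled_t(differences, test_train_ratio=1 / 9)
t_uncorrected = corrected_resampled_t(differences, test_train_ratio=0.0)
print(round(t_corrected, 3), round(t_uncorrected, 3))
```

The statistic is compared with a t distribution with k − 1 degrees of freedom. The correction enlarges the variance estimate because training sets from repeated cross validation overlap, which makes the test more conservative than an ordinary paired t test.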

Comparing the results with previous study

We compared the evaluation metrics of the current study with those of the previous study; the values for the current best training set and the previously reported values are presented in Table 6.

Table 6 Comparison of the performance evaluation metrics of the current work with the previous methods

On comparison with the previous method, the current SMOTE (500 %) model achieved far better performance evaluation metrics.

Both SMOTE oversampling and random undersampling had the least effect on the performance of the naive Bayes algorithm; a similar observation was made by Daskalaki et al. (2006).

Characterization of RNA-silencing suppressors using sequence-based features

Figure 3 shows a heat map representation of the sequence attributes except the dipeptides, and Fig. 4 shows a heat map representation of the dipeptides. The color bar on the right side of each figure shows the color intensity proportional to the feature ranking scores, which are calculated according to discriminating ability. In Fig. 3, arginine and the polar and nonpolar property groups are the most useful discriminatory features. In Fig. 4, DF, SF, NN, DT, CW and CG are the most discriminatory dipeptides.

Arginines are relatively important in binding sites (Barnes 2007). Arginine also plays an important role in the suppressor activity of the PRS suppressor (2b) of a cucumber mosaic virus strain (CM95R) (Goto et al. 2007), where it facilitates binding to RNA, and in potato virus M, where mutational studies have shown the importance of arginines in suppression activity (Senshu et al. 2011). The importance of nonpolar amino acids, specifically isoleucine, in suppression activity is also emphasized in (Carr and Pathology 2007).

Conclusions

Machine learning-based approaches are apposite techniques, compared with sequence alignment-based methods, for the prediction of plant virus-encoded RNA-silencing suppressors and can become the superior alternative if the imbalanced dataset problem is properly resolved. The protein family classification problem intrinsically presents a class imbalance: the class of interest is a particular protein family, which constitutes the positive class, while the rest of the protein families belong to the negative class. Naturally, there is a large difference between the numbers of instances in the positive and negative classes. Depending on the mathematical representation of the protein sequences, machine learning-based approaches can capture hidden relationships among the calculated protein attributes and are therefore often better than alignment-based methods for protein classification. Because plant virus-encoded RNA-silencing suppressor classification presents a data imbalance problem, we compared the learning of different machine learning algorithms on imbalanced, SMOTE oversampled and randomly undersampled datasets. The results reported in this study showed that learning is non-optimal on imbalanced positive and negative class datasets. The machine learning algorithms behave differently under SMOTE oversampling and random undersampling: IBK performed better on randomly undersampled datasets, while SMO was superior to all other machine learning algorithms on SMOTE oversampled datasets. Better performance evaluation metrics were obtained on SMOTE oversampled datasets than on randomly undersampled datasets, and the best model was achieved with SMOTE oversampling and SMO as the learning algorithm. This also indicates that full (ideal) balancing between the positive and negative classes may not fully eliminate the classifier bias.

The current study provides evidence that the learning of different machine learning algorithms can be improved using an optimal class distribution and that a fully balanced class distribution need not be optimal for training. Individual accuracies on the positive and negative classes can be increased by changing the class distribution. Further, we ranked the calculated sequence features according to their ability to discriminate plant virus-encoded RNA-silencing suppressors from non-suppressors. The current pipeline can be applied to other protein family classification problems with different degrees of imbalance. The study explored the possible improvement in the prediction accuracy of four machine learning algorithms using an optimal class distribution that provides the best trade-off between dataset imbalance and dataset diversity, and presented in detail the behavior of the tested learning algorithms with varying degrees of resampling. The results show that the prediction accuracy for plant virus suppressor proteins can be improved using the optimal class distribution ratio.

Future research can incorporate additional diversifying techniques to deal with the related problem of incomplete learning. More sophisticated techniques can be developed to deal with the trade-off between the balancing factor and input instance diversity. Further research in this direction could lead to a standard for creating benchmark datasets for each specific biological problem.

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410. doi:10.1016/s0022-2836(05)80360-2
Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
Barnes MR (2007) Bioinformatics for geneticists: a bioinformatics primer for the analysis of genetic data. Wiley
Barua S, Islam MM, Xin Y, Murase K (2014) MWMOTE: majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26:405–425. doi:10.1109/TKDE.2012.232
Batuwita R, Palade V (2009) microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 25:989–995. doi:10.1093/bioinformatics/btp107
Blagus R, Lusa L (2013) SMOTE for high-dimensional class-imbalanced data. BMC Bioinform 14:106
Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn 30:1145–1159. doi:10.1016/S0031-3203(96)00142-2
Carr T, Pathology ISUP (2007) Genetic and molecular investigation of compatible plant-virus interactions. Iowa State University, Iowa
Chapman EJ, Prokhnevsky AI, Gopinath K, Dolja VV, Carrington JC (2004) Viral RNA silencing suppressors inhibit the microRNA pathway at an intermediate step. Genes Dev 18:1179–1186. doi:10.1101/gad.1201204
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Int Res 16:321–357
Chou K-C, Cai Y-D (2004) Predicting protein structural class by functional domain composition. Biochem Biophys Res Commun 321:1007–1009. doi:10.1016/j.bbrc.2004.07.059
Chou K, Zhang C (1995) Prediction of protein structural classes. Crit Rev Biochem Mol Biol 30:275–349
Daskalaki S, Kopanas I, Avouris NM (2006) Evaluation of classifiers for an uneven class distribution problem. Appl Artif Intell 20:381–417
Dunoyer P, Lecellier CH, Parizotto EA, Himber C, Voinnet O (2004) Probing the microRNA and small interfering RNA pathways with virus-encoded suppressors of RNA silencing. Plant Cell 16:1235–1250. doi:10.1105/tpc.020719
Gao Y, Shao S, Xiao X, Ding Y, Huang Y, Huang Z, Chou KC (2005) Using pseudo amino acid composition to predict protein subcellular location: approached with Lyapunov Index, Bessel function, and Chebyshev filter. Amino Acids 28:373–376. doi:10.1007/s00726-005-0206-9
García V, Mollineda RA, Sánchez JS (2009) Index of balanced accuracy: a performance measure for skewed class distributions. In: Araujo H, Mendonça A, Pinho A, Torres M (eds) Pattern recognition and image analysis, vol 5524. Lecture notes in computer science. Springer, Heidelberg, pp 441–448. doi:10.1007/978-3-642-02172-5_57
Goto K, Kobori T, Kosaka Y, Natsuaki T, Masuta C (2007) Characterization of silencing suppressor 2b of cucumber mosaic virus based on examination of its small RNA-binding abilities. Plant Cell Physiol 48:1050–1060. doi:10.1093/pcp/pcm074
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor Newsl 11:10–18. doi:10.1145/1656274.1656278
Han H, Wang W-Y, Mao B-H (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang D-S, Zhang X-P, Huang G-B (eds) Advances in intelligent computing, vol 3644. Lecture notes in computer science. Springer, Heidelberg, pp 878–887. doi:10.1007/11538059_91
Jagga Z, Gupta D (2014) Supervised learning classification models for prediction of plant virus encoded RNA silencing suppressors. PLoS ONE 9:e97446. doi:10.1371/journal.pone.0097446
Kandaswamy K, Pugalenthi G, Hazrati M, Kalies K-U, Martinetz T (2011) BLProt: prediction of bioluminescent proteins based on support vector machine and relief feature selection. BMC Bioinformatics 12:345
Kira K, Rendell LA (1992) A practical approach to feature selection. Paper presented at the proceedings of the ninth international workshop on machine learning, Aberdeen
Kubat M, Holte R, Matwin S (1997) Learning when negative examples abound. In: van Someren M, Widmer G (eds) Machine learning: ECML-97, vol 1224. Lecture notes in computer science. Springer, Heidelberg, pp 146–153. doi:10.1007/3-540-62858-4_79
Kumari P, Nath A, Chaube R (2015) Identification of human drug targets using machine-learning algorithms. Comput Biol Med 56:175–181. doi:10.1016/j.compbiomed.2014.11.008
Lee PH (2014) Resampling methods improve the predictive power of modeling in class-imbalanced datasets. Int J Environ Res Public Health 11:9776–9789. doi:10.3390/ijerph110909776
Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658–1659. doi:10.1093/bioinformatics/btl158
Li F, Huang C, Li Z, Zhou X (2014a) Suppression of RNA silencing by a plant DNA virus satellite requires a host calmodulin-like protein to repress RDR6 expression. PLoS Pathog 10:e1003921. doi:10.1371/journal.ppat.1003921
Li H, Pi D, Wang C (2014b) The prediction of protein-protein interaction sites based on RBF classifier improved by SMOTE. Math Probl Eng 2014:7. doi:10.1155/2014/528767
Liu Y, Jin W, Wang L, Wang X (2014) Replication-associated proteins encoded by wheat dwarf virus act as RNA silencing suppressors. Virus Res 190:34–39. doi:10.1016/j.virusres.2014.06.014
MacIsaac KD et al (2006) A hypothesis-based approach for identifying the binding specificity of regulatory proteins from chromatin immunoprecipitation data. Bioinformatics 22:423–429. doi:10.1093/bioinformatics/bti815
Mishra NK, Chang J, Zhao PX (2014) Prediction of membrane transport proteins and their substrate specificities using primary sequence information. PLoS ONE 9:e100278. doi:10.1371/journal.pone.0100278
Nakamura M, Kajiwara Y, Otsuka A, Kimura H (2013) LVQ-SMOTE—learning vector quantization based synthetic minority over-sampling technique for biomedical data. BioData Min 6:16
Nath A, Subbiah K (2014) Inferring biological basis about psychrophilicity by interpreting the rules generated from the correctly classified input instances by a classifier. Comput Biol Chem 53:198–203. doi:10.1016/j.compbiolchem.2014.10.002
Nath A, Subbiah K (2015a) Maximizing lipocalin prediction through balanced and diversified training set and decision fusion. Comput Biol Chem 59:101–110. doi:10.1016/j.compbiolchem.2015.09.011
Nath A, Subbiah K (2015b) Unsupervised learning assisted robust prediction of bioluminescent proteins. Comput Biol Med 68:27–36. doi:10.1016/j.compbiomed.2015.10.013
Nath A, Chaube R, Karthikeyan S (2012) Discrimination of psychrophilic and mesophilic proteins using random forest algorithm. In: Biomedical engineering and biotechnology (iCBEB), 2012 international conference, 28–30 May 2012, pp 179–182. doi:10.1109/iCBEB.2012.151
Nath A, Chaube R, Subbiah K (2013) An insight into the molecular basis for convergent evolution in fish antifreeze proteins. Comput Biol Med 43:817–821. doi:10.1016/j.compbiomed.2013.04.013
Pérez-Cañamás M, Hernández C (2014) Key importance of small RNA binding for the activity of a glycine/tryptophan (GW) motif-containing viral suppressor of RNA silencing. J Biol Chem. doi:10.1074/jbc.M114.593707
Platt JC (1999) Fast training of support vector machines using sequential minimal optimization. In: Advances in kernel methods. MIT Press, pp 185–208
Pugalenthi G, Kandaswamy KK, Suganthan PN, Archunan G, Sowdhamini R (2010) Identification of functionally diverse lipocalin proteins from sequence information using support vector machine. Amino Acids 39:777–783. doi:10.1007/s00726-010-0520-8
Qu F, Morris TJ (2005) Suppressors of RNA silencing encoded by plant viruses and their role in viral infections. FEBS Lett 579:5958–5964. doi:10.1016/j.febslet.2005.08.041
Senshu H et al (2011) A dual strategy for the suppression of host antiviral silencing: two distinct suppressors for viral replication and viral movement encoded by potato virus M. J Virol 85:10269–10278. doi:10.1128/jvi.05273-11
Suvarna Vani K, Durga Bhavani S (2013) SMOTE based protein fold prediction classification. In: Meghanathan N, Nagamalai D, Chaki N (eds) Advances in computing and information technology, vol 177. Advances in intelligent systems and computing. Springer, Heidelberg, pp 541–550. doi:10.1007/978-3-642-31552-7_55
Valli A, López-Moya JJ, García JA (2001) RNA silencing and its suppressors in the plant-virus interplay. In: eLS. Wiley. doi:10.1002/9780470015902.a0021261
Vapnik V (1995) The nature of statistical learning theory. Springer
Vapnik V (1998) Statistical learning theory. Wiley, New York
Wang Y, Dang M, Hou H, Mei Y, Qian Y, Zhou X (2014) Identification of an RNA silencing suppressor encoded by a mastrevirus. J Gen Virol 95:2082–2088. doi:10.1099/vir.0.064246-0
Wei Q, Dunbrack RL Jr (2013) The role of balanced training and testing data sets for binary classifiers in bioinformatics. PLoS ONE 8:e67863. doi:10.1371/journal.pone.0067863
Weiss GM, Provost F (2003) Learning when training data are costly: the effect of class distribution on tree induction. J Artif Int Res 19:315–354
Xiao J, Tang X, Li Y, Fang Z, Ma D, He Y, Li M (2011) Identification of microRNA precursors based on random forest with network-level representation method of stem-loop structure. BMC Bioinformatics 12:165
Xie H-L, Fu L, Nie X-D (2013) Using ensemble SVM to identify human GPCRs N-linked glycosylation sites based on the general form of Chou’s PseAAC. Protein Eng Des Sel 26:735–742. doi:10.1093/protein/gzt042
Youden WJ (1950) Index for rating diagnostic tests. Cancer 3:32–35. doi:10.1002/1097-0142(1950)3:1<32:AID-CNCR2820030106>3.0.CO;2-3

Acknowledgments

The authors are very grateful to the Department of Computer Science, Faculty of Science, Banaras Hindu University, for its support of this study.

Author information

Authors and Affiliations

Department of Computer Science, Banaras Hindu University, Varanasi, India
Abhigyan Nath & Karthikeyan Subbiah

Authors

Abhigyan Nath
Karthikeyan Subbiah

Corresponding authors

Correspondence to Abhigyan Nath or Karthikeyan Subbiah.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 19 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Nath, A., Subbiah, K. Probing an optimal class distribution for enhancing prediction and feature characterization of plant virus-encoded RNA-silencing suppressors. 3 Biotech 6, 93 (2016). https://doi.org/10.1007/s13205-016-0410-1

Received: 30 September 2015
Accepted: 03 March 2016
Published: 21 March 2016
DOI: https://doi.org/10.1007/s13205-016-0410-1

Introduction