Abstract
Gene identification has become an increasingly important task since the Human Genome Project. Splice site prediction lies at the heart of identifying human genes, so developing new methods that detect splice sites accurately is crucial. Machine learning classifiers are widely used for this purpose, and their performance depends mainly on the DNA encoding method (feature extraction) and on feature selection. Feature extraction methods try to capture as much of the information contained in the DNA sequences as possible, while feature selection methods remove redundant information and thereby expose biologically useful knowledge. According to the literature, Markovian models are popular encoding methods and the support vector machine (SVM) is regarded as the best algorithm for classifying splice sites. However, random forest (RF) may outperform the SVM in this domain when the same Markovian encoding methods are used. In this study, we investigated the performance of RF for both feature selection and classification in the splice site domain. We proposed three methods, namely MM1-RF, MM2-RF and MCM-RF, by combining RF with the first-order Markov model (MM1), the second-order Markov model (MM2) and the Markov chain model (MCM). We compared the performance of RF with that of the SVM on the HS3D and NN269 benchmark datasets. We also evaluated the proposed methods against other current state-of-the-art methods such as Reduced MM1-SVM, SVM-B and LVMM2. The experimental results show that RF outperforms the SVM on both donor and acceptor datasets when the same Markovian encoding methods are used. Furthermore, the RF classifier is much faster than the SVM classifier at detecting splice sites.
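As a rough illustration of what a first-order Markov (MM1) encoding might look like, the sketch below estimates position-specific transition probabilities from a set of aligned training sequences and maps each sequence to a vector of those probabilities. This is not the authors' implementation: the function names `fit_mm1` and `encode_mm1`, the `pseudocount` smoothing, and the restriction to the alphabet A, C, G, T are all assumptions made for the example. The resulting numeric vectors are the kind of input that could then be passed to a random forest or SVM classifier.

```python
ALPHABET = "ACGT"  # assumption: sequences contain only these four symbols


def fit_mm1(sequences, pseudocount=1.0):
    """Estimate a position-specific first-order Markov model.

    Returns (first, trans), where first[a] approximates P(s_0 = a) and
    trans[i][(a, b)] approximates P(s_i = b | s_{i-1} = a) for i >= 1.
    The pseudocount is a hypothetical smoothing choice, not from the paper.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")

    # Initialise every count with the pseudocount so no probability is zero.
    first_counts = {a: pseudocount for a in ALPHABET}
    pair_counts = [
        {(a, b): pseudocount for a in ALPHABET for b in ALPHABET}
        for _ in range(length)
    ]
    for s in sequences:
        first_counts[s[0]] += 1
        for i in range(1, length):
            pair_counts[i][(s[i - 1], s[i])] += 1

    total = sum(first_counts.values())
    first = {a: c / total for a, c in first_counts.items()}

    # Normalise each row (fixed previous symbol) into a distribution.
    trans = [None] * length  # index 0 is unused: it has no predecessor
    for i in range(1, length):
        trans[i] = {}
        for a in ALPHABET:
            row_total = sum(pair_counts[i][(a, b)] for b in ALPHABET)
            for b in ALPHABET:
                trans[i][(a, b)] = pair_counts[i][(a, b)] / row_total
    return first, trans


def encode_mm1(seq, model):
    """Map a sequence to one probability per position under the model."""
    first, trans = model
    features = [first[seq[0]]]
    for i in range(1, len(seq)):
        features.append(trans[i][(seq[i - 1], seq[i])])
    return features


# Toy usage: fit on three aligned 4-mers, then encode one of them.
toy_model = fit_mm1(["AGGT", "AGGT", "CGGT"])
toy_vector = encode_mm1("AGGT", toy_model)
```

In this sketch a single model is fitted on one class of sequences; an encoding that also uses probabilities estimated from the opposite class would be a natural variation, but which variant the study uses is not stated in the abstract.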
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Additional information
This article is part of the Topical collection on Systems Medicine
About this article
Cite this article
Pashaei, E., Ozen, M. & Aydin, N. Splice site identification in human genome using random forest. Health Technol. 7, 141–152 (2017). https://doi.org/10.1007/s12553-016-0157-z
DOI: https://doi.org/10.1007/s12553-016-0157-z