Splice site identification in human genome using random forest

  • Original Paper
  • Health and Technology

Abstract

Gene identification has become an increasingly important task since the Human Genome Project. Splice site prediction lies at the heart of identifying human genes, so developing new methods that detect splice sites accurately is crucial. Machine learning classifiers are widely used for this task, and their performance depends mainly on the DNA encoding method (feature extraction) and on feature selection. Feature extraction methods try to capture as much of the information in the DNA sequences as possible, while feature selection methods remove redundant information and thereby provide useful biological knowledge. According to the literature, Markovian models are popular encoding methods and the support vector machine (SVM) is regarded as the best algorithm for classifying splice sites. However, random forest (RF) may outperform the SVM in this domain when the same Markovian encoding methods are used. In this study, we investigated the performance of RF as both a feature selection method and a classifier in the splice site domain. We propose three methods, MM1-RF, MM2-RF and MCM-RF, which combine RF with the first-order Markov model (MM1), the second-order Markov model (MM2) and the Markov chain model (MCM), respectively. We compared the performance of RF with that of the SVM on the HS3D and NN269 benchmark datasets. We also compared the proposed methods with other current state-of-the-art methods such as Reduced MM1-SVM, SVM-B and LVMM2. The experimental results show that RF outperforms the SVM on both donor and acceptor datasets when the same Markovian encoding methods are used. Furthermore, the RF classifier is much faster than the SVM classifier at detecting splice sites.
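
To make the encoding step concrete, the following is a minimal sketch of a first-order Markov model (MM1) encoding in Python. It is an illustration under stated assumptions, not the authors' implementation: the function names fit_mm1 and encode_mm1, the add-one pseudocount, the toy donor-like sequences, and the choice to estimate probabilities from true sites only are all hypothetical. Each sequence of length L is turned into L - 1 features, one conditional probability P(s_i | s_{i-1}) per position.

```python
ALPHABET = "ACGT"

def fit_mm1(sequences, pseudocount=1.0):
    """Estimate position-specific first-order transition probabilities.

    Returns a list `probs` in which probs[i - 1][(prev, cur)] approximates
    P(s_i = cur | s_{i-1} = prev) for positions i = 1 .. L-1.
    The pseudocount is an assumption made here to avoid zero probabilities.
    """
    length = len(sequences[0])
    probs = []
    for i in range(1, length):
        # Start every (previous, current) pair at the pseudocount.
        counts = {(p, c): pseudocount for p in ALPHABET for c in ALPHABET}
        for seq in sequences:
            counts[(seq[i - 1], seq[i])] += 1
        # Normalise over the current nucleotide for each previous nucleotide.
        table = {}
        for p in ALPHABET:
            total = sum(counts[(p, c)] for c in ALPHABET)
            for c in ALPHABET:
                table[(p, c)] = counts[(p, c)] / total
        probs.append(table)
    return probs

def encode_mm1(seq, probs):
    """Map a sequence to one conditional probability per position."""
    return [probs[i - 1][(seq[i - 1], seq[i])] for i in range(1, len(seq))]

# Toy, made-up training sequences (not taken from HS3D or NN269).
train_pos = ["AAGGTAAG", "CAGGTGAG", "AAGGTAAG"]
probs = fit_mm1(train_pos)
features = encode_mm1("AAGGTAAG", probs)
print(features)
```

Feature vectors of this kind could then be passed to any classifier; with scikit-learn, for example, RandomForestClassifier exposes feature_importances_ after fitting, which is one way to rank encoded positions for feature selection.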

Author information

Corresponding author

Correspondence to Nizamettin Aydin.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

This article is part of the Topical collection on Systems Medicine

About this article

Cite this article

Pashaei, E., Ozen, M. & Aydin, N. Splice site identification in human genome using random forest. Health Technol. 7, 141–152 (2017). https://doi.org/10.1007/s12553-016-0157-z

  • DOI: https://doi.org/10.1007/s12553-016-0157-z
