Splice site identification in human genome using random forest

Pashaei, Elham; Ozen, Mustafa; Aydin, Nizamettin

doi:10.1007/s12553-016-0157-z

Original Paper
Published: 02 December 2016

Health and Technology, Volume 7, pages 141–152 (2017)

Abstract

Gene identification has become an increasingly important task owing to developments in the Human Genome Project. Splice site prediction lies at the heart of identifying human genes, so developing new methods that detect splice sites accurately is crucial. Machine learning classifiers are commonly used to detect splice sites, and their performance depends mainly on the DNA encoding method (feature extraction) and on feature selection. Feature extraction methods try to capture as much of the information in the DNA sequences as possible, while feature selection methods provide useful biological knowledge by removing redundant information. According to the literature, Markovian models are popular encoding methods and the support vector machine (SVM) is regarded as the best algorithm for classifying splice sites. However, random forest (RF) may outperform the SVM in this domain when the same Markovian encoding methods are used. In this study, we investigated the performance of RF for both feature selection and classification in the splice site domain. We propose three methods, MM1-RF, MM2-RF and MCM-RF, which combine RF with the first-order Markov model (MM1), the second-order Markov model (MM2), and the Markov chain model (MCM), respectively. We compared the performance of RF with that of the SVM on the HS3D and NN269 benchmark datasets. We also compared the proposed methods with other current state-of-the-art methods such as Reduced MM1-SVM, SVM-B and LVMM2. The experimental results show that RF outperforms the SVM on both the donor and acceptor datasets when the same Markovian encoding methods are used. Furthermore, the RF classifier is much faster than the SVM classifier at detecting splice sites.
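
To make the idea of a first-order Markov encoding concrete, the following is a minimal sketch in Python. The abstract names MM1 but does not spell out how it is estimated, so the details here are assumptions: transition probabilities are taken to be position-specific, they are estimated from aligned training sequences of equal length, and add-one smoothing is applied. The function names fit_mm1 and encode_mm1 are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

ALPHABET = "ACGT"
Model = Tuple[Dict[str, float], List[Dict[Tuple[str, str], float]]]


def fit_mm1(seqs: List[str], pseudocount: float = 1.0) -> Model:
    """Estimate a position-specific first-order Markov model (assumed form).

    Returns (initial, transitions), where initial[b] approximates P(s_0 = b)
    and transitions[i - 1][(a, b)] approximates P(s_i = b | s_{i-1} = a).
    Sequences must contain only A, C, G, T; other symbols raise KeyError.
    """
    seqs = [s.upper() for s in seqs]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must be aligned to the same length")

    # Smoothed nucleotide frequencies at the first position.
    init_counts = {b: pseudocount for b in ALPHABET}
    for s in seqs:
        init_counts[s[0]] += 1
    total = sum(init_counts.values())
    initial = {b: c / total for b, c in init_counts.items()}

    # Smoothed transition frequencies for every later position.
    transitions = []
    for i in range(1, length):
        counts = {(a, b): pseudocount for a in ALPHABET for b in ALPHABET}
        for s in seqs:
            counts[(s[i - 1], s[i])] += 1
        probs = {}
        for a in ALPHABET:
            row_total = sum(counts[(a, b)] for b in ALPHABET)
            for b in ALPHABET:
                probs[(a, b)] = counts[(a, b)] / row_total
        transitions.append(probs)
    return initial, transitions


def encode_mm1(seq: str, model: Model) -> List[float]:
    """Map one sequence to a numeric vector with one probability per position."""
    initial, transitions = model
    seq = seq.upper()
    if len(seq) != len(transitions) + 1:
        raise ValueError("sequence length does not match the model")
    features = [initial[seq[0]]]
    for i, probs in enumerate(transitions, start=1):
        features.append(probs[(seq[i - 1], seq[i])])
    return features


if __name__ == "__main__":
    # Toy aligned sequences, invented for illustration only.
    model = fit_mm1(["AGGT", "AGGT", "CGGT", "AGGA"])
    print(encode_mm1("AGGT", model))
```

Each sequence becomes a fixed-length vector of probabilities, which is the kind of numeric input a classifier such as a random forest or an SVM could then be trained on. A second-order variant would condition each position on the two preceding nucleotides instead of one.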

Author information

Authors and Affiliations

Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey
Elham Pashaei & Nizamettin Aydin
Department of Pathology & Immunology, Baylor College of Medicine, Houston, TX 77030, USA
Mustafa Ozen

Corresponding author

Correspondence to Nizamettin Aydin.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

This article is part of the Topical collection on Systems Medicine

About this article

Cite this article

Pashaei, E., Ozen, M. & Aydin, N. Splice site identification in human genome using random forest. Health Technol. 7, 141–152 (2017). https://doi.org/10.1007/s12553-016-0157-z

Received: 28 June 2016
Accepted: 25 November 2016
Published: 02 December 2016
Issue Date: March 2017
DOI: https://doi.org/10.1007/s12553-016-0157-z
