Abstract
Accurate identification of splice junctions in a DNA sequence is an active area of research. Knowing where splice junctions occur provides valuable information about the internal genomic structure of a sequence and aids its deeper analysis and interpretation. The major problems faced during gene analysis are the diversity, complexity and uncertain nature of DNA sequences. The application of computational techniques based on machine learning algorithms to this problem has attracted enormous attention in the last few decades. This study presents hybrid machine learning ensemble approaches that address the splice junction problem more effectively. Multiple classifier systems consisting of random subspace, rotation forest and boosting methods are implemented and validated on a real genome sequence dataset. A novel feature selection technique based on attribute correlation estimation with a best-first search strategy is proposed. The average prediction accuracy achieved in identifying the splice junctions is more than 98 %. All computations are performed at the 95 % confidence level. The results presented in this study are superior to those of the state-of-the-art approaches in the literature. This work strengthens the viability of extending and applying machine learning models to similar problems.
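The abstract does not spell out how the correlation-based feature selection works, so the following is only an illustrative sketch under stated assumptions. It assumes a CFS-style merit score (high feature-to-class correlation, low feature-to-feature correlation) measured with Pearson correlation, and a simple best-first search that stops after a fixed number of non-improving expansions. The function names (`pearson`, `merit`, `best_first_select`), the `max_stale` parameter and the toy data are all hypothetical and are not taken from the article.

```python
import heapq
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def merit(subset, columns, target):
    """Assumed CFS-style merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean absolute feature-to-class correlation.
    r_cf = sum(abs(pearson(columns[i], target)) for i in subset) / k
    # Mean absolute feature-to-feature correlation over all pairs in the subset.
    pairs = [(i, j) for i in subset for j in subset if i < j]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def best_first_select(columns, target, max_stale=5):
    """Best-first forward search; stops after max_stale non-improving expansions."""
    best, best_score = [], 0.0
    frontier = [(0.0, [])]  # min-heap of (-merit, sorted feature indices)
    visited = {frozenset()}
    stale = 0
    while frontier and stale < max_stale:
        neg_merit, subset = heapq.heappop(frontier)
        score = -neg_merit
        if score > best_score + 1e-12:
            best, best_score, stale = subset, score, 0
        else:
            stale += 1
        # Expand the popped subset by adding one unused feature at a time.
        for f in range(len(columns)):
            child = frozenset(subset) | {f}
            if f in subset or child in visited:
                continue
            visited.add(child)
            heapq.heappush(frontier, (-merit(child, columns, target), sorted(child)))
    return best, best_score


if __name__ == "__main__":
    # Toy data (hypothetical): feature 1 duplicates feature 0, feature 2 is unrelated to the class.
    y = [0, 0, 0, 0, 1, 1, 1, 1]
    cols = [list(y), list(y), [0, 1, 0, 1, 0, 1, 0, 1]]
    print(best_first_select(cols, y))
```

In this toy setting the search keeps one of the two identical informative features and rejects both the redundant copy and the uninformative one, which is the behaviour a correlation-based merit is meant to encourage. For categorical nucleotide attributes, a measure such as symmetrical uncertainty would likely be a better fit than Pearson correlation; that substitution is left as an assumption here.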
Additional information
Communicated by E. Lughofer.
About this article
Cite this article
Mandal, I. A novel approach for predicting DNA splice junctions using hybrid machine learning algorithms. Soft Comput 19, 3431–3444 (2015). https://doi.org/10.1007/s00500-014-1550-z
DOI: https://doi.org/10.1007/s00500-014-1550-z