Abstract
Knowledge of DNA sequences is indispensable for basic biological research. Researchers use DNA sequencing for many purposes, including molecular biology research and sequence comparison for individual identification. Automated DNA sequencing devices use four-color chromatograms, or base-calling signals, to indicate the strength of hybridization for each base channel. Typically, the relative strengths of the peaks at each base location are used to quantify the quality and/or reliability of individual readings. However, assessing the overall quality of a whole DNA trace file remains an open problem. Classifying raw DNA trace files as high or low quality is therefore important for the efficient use of resources. In this study, we used several supervised machine learning approaches, including logistic regression and ensemble decision trees, to identify high- or acceptable-quality chromatogram files, and we compared their prediction performance. To develop and test our ideas, we used a public DNA trace repository consisting of 1626 high- and 631 low-quality files, as labeled by our expert molecular biologist. Our results indicate that, although all of the methods tried offer comparable and acceptable performance, the random forest decision tree algorithm combined with adaptive boosting ensemble learning shows slightly higher prediction accuracy with as few as four features.
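Because the two classes are unequal in size (1626 high- versus 631 low-quality files), reported accuracies are easier to interpret against a trivial baseline. The following is a minimal sketch, not the authors' procedure: it computes the accuracy of always predicting the majority class and the per-fold class counts of a stratified split. The function names are hypothetical, and the fold count k = 10 is an assumption, since the abstract does not state the validation protocol.

```python
# Class counts taken from the abstract.
N_HIGH = 1626
N_LOW = 631


def majority_baseline_accuracy(n_high, n_low):
    """Accuracy of a trivial classifier that always predicts the larger class."""
    return max(n_high, n_low) / (n_high + n_low)


def stratified_fold_sizes(n_high, n_low, k):
    """Return per-fold (high, low) test-set counts for a stratified k-fold split.

    Each class is divided as evenly as possible across the k folds, so every
    fold keeps roughly the same class ratio as the full data set.
    """
    def split(n):
        base, extra = divmod(n, k)
        # The first `extra` folds receive one additional sample.
        return [base + 1 if i < extra else base for i in range(k)]

    return list(zip(split(n_high), split(n_low)))


if __name__ == "__main__":
    baseline = majority_baseline_accuracy(N_HIGH, N_LOW)
    print(f"Majority-class baseline accuracy: {baseline:.3f}")

    # k = 10 is an assumed value, chosen only for illustration.
    for i, (high, low) in enumerate(stratified_fold_sizes(N_HIGH, N_LOW, 10), 1):
        print(f"fold {i}: {high} high-quality, {low} low-quality")
```

Under these counts the baseline is roughly 72%, so a classifier is only informative to the extent that its accuracy exceeds that figure.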
Ethics declarations
Conflict of interest
The authors state that there is no conflict of interest.
Cite this article
Kurt, S., Öz, E., Aşkın, Ö.E. et al. Classification of nucleotide sequences for quality assessment using logistic regression and decision tree approaches. Neural Comput & Applic 29, 251–262 (2018). https://doi.org/10.1007/s00521-017-2960-5
DOI: https://doi.org/10.1007/s00521-017-2960-5