Classification of nucleotide sequences for quality assessment using logistic regression and decision tree approaches

  • New Trends in data pre-processing methods for signal and image classification
Neural Computing and Applications

Abstract

Knowledge of DNA sequences is indispensable for basic biological research. Many researchers use DNA sequencing for various purposes, including molecular biology research and sequence comparison for individual identification. Automated DNA sequencing devices use four-color chromatograms, or base-calling signals, to indicate the strength of hybridization for each base channel. Typically, the relative strengths of the peaks at each base location are used to quantify the quality and/or reliability of individual readings. However, assessing the overall quality of whole DNA trace files remains an open problem. Classifying raw DNA trace files as high or low quality is therefore important for the efficient utilization of resources. In this study, we used several supervised machine learning approaches, including logistic regression and ensemble decision trees, to identify high- or acceptable-quality chromatogram files, and compared their prediction performance. To develop and test our ideas, we used a public DNA trace repository consisting of 1626 high-quality and 631 low-quality files labeled by our expert molecular biologist. Our results indicate that, although all of the methods tried offer comparable and acceptable performance, the random forest decision tree algorithm with adaptive boosting ensemble learning shows slightly higher prediction accuracy with as few as four features.
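
To make the classification setup concrete, the following is a minimal sketch of a two-class logistic regression evaluated against a majority-class baseline. Only the class counts (1626 and 631) and the number of features (four) come from the abstract. The features themselves, the synthetic Gaussian data, the 80/20 split, and the helper names `make_synthetic`, `fit_logistic`, and `accuracy` are assumptions made for illustration; this is not the authors' feature set or evaluation protocol, and it does not cover the ensemble decision tree methods.

```python
import math
import random

N_HIGH, N_LOW = 1626, 631   # class counts stated in the abstract
N_FEATURES = 4              # "as few as four features"; the features here are synthetic


def make_synthetic(seed=0):
    """Hypothetical stand-in data: Gaussian features whose mean shifts by class."""
    rng = random.Random(seed)
    rows = []
    for label, count in ((1, N_HIGH), (0, N_LOW)):
        mean = 1.0 if label == 1 else -1.0
        for _ in range(count):
            features = [rng.gauss(mean, 1.5) for _ in range(N_FEATURES)]
            rows.append((features, label))
    rng.shuffle(rows)
    return rows


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, w, b):
    """Estimated probability that a sample belongs to the high-quality class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit_logistic(rows, lr=0.1, epochs=100):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    w = [0.0] * N_FEATURES
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * N_FEATURES
        grad_b = 0.0
        for x, y in rows:
            err = predict_proba(x, w, b) - y
            for j in range(N_FEATURES):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(rows, w, b):
    """Fraction of samples classified correctly at a 0.5 probability threshold."""
    hits = sum((predict_proba(x, w, b) >= 0.5) == (y == 1) for x, y in rows)
    return hits / len(rows)


data = make_synthetic()
split = int(0.8 * len(data))            # simple hold-out split (an assumption)
train, test = data[:split], data[split:]
w, b = fit_logistic(train)

# Accuracy obtained by always predicting "high quality" on the full data set.
baseline = N_HIGH / (N_HIGH + N_LOW)
test_acc = accuracy(test, w, b)
print(f"majority-class baseline: {baseline:.3f}")
print(f"held-out accuracy:       {test_acc:.3f}")
```

Because roughly 72% of the files are high quality, a classifier that always answers "high quality" already reaches about 72% accuracy, so reported accuracies are best read against that baseline rather than against 50%.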

Author information

Corresponding author

Correspondence to Serkan Kurt.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Kurt, S., Öz, E., Aşkın, Ö.E. et al. Classification of nucleotide sequences for quality assessment using logistic regression and decision tree approaches. Neural Comput & Applic 29, 251–262 (2018). https://doi.org/10.1007/s00521-017-2960-5

  • DOI: https://doi.org/10.1007/s00521-017-2960-5
