Classification of nucleotide sequences for quality assessment using logistic regression and decision tree approaches

Kurt, Serkan; Öz, Ersoy; Aşkın, Öyküm Esra; Öz, Yeliz Yücel

doi:10.1007/s00521-017-2960-5

New Trends in data pre-processing methods for signal and image classification
Published: 05 April 2017

Neural Computing and Applications, Volume 29, pages 251–262 (2018)

Serkan Kurt¹, Ersoy Öz², Öyküm Esra Aşkın² & Yeliz Yücel Öz³,⁴

Abstract

Knowledge of DNA sequences is indispensable for basic biological research. Many researchers use DNA sequencing for purposes that include molecular biology research and sequence comparison for individual identification. Automated DNA sequencing devices use four-color chromatograms, or base-calling signals, to indicate the strength of hybridization for each base channel. Typically, the relative strengths of the peaks at each base location are used to quantify the quality and/or reliability of individual readings. However, assessing the overall quality of whole DNA trace files remains an open problem. Classifying raw DNA trace files as high or low quality is therefore important for the efficient use of resources. In this study, we used several supervised machine learning approaches, including logistic regression and ensemble decision trees, to identify high- or acceptable-quality chromatogram files, and we compared their prediction performance. To develop and test our ideas, we used a public DNA trace repository consisting of 1626 high-quality and 631 low-quality files labeled by our expert molecular biologist. Our results indicate that, although all of the methods tried offer comparable and acceptable performance, the random forest decision tree algorithm combined with adaptive boosting (AdaBoost) ensemble learning shows slightly higher prediction accuracy with as few as four features.
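To make the setup in the abstract concrete, the following is a minimal sketch of one of the named methods, logistic regression, applied to a binary high/low quality label. It is not the authors' pipeline: the abstract does not list the four features, so the feature values below are synthetic placeholders, and only the class sizes (1626 and 631) come from the text. All function and variable names (`make_synthetic`, `fit_logistic`, `accuracy`, and so on) are hypothetical, and plain batch gradient descent is assumed here purely for illustration.

```python
import math
import random

N_HIGH, N_LOW = 1626, 631  # class sizes reported in the abstract


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic(n_features=4, seed=0):
    """Invented placeholder features; only the class sizes come from the abstract."""
    rng = random.Random(seed)
    rows = [([rng.gauss(1.0, 1.0) for _ in range(n_features)], 1) for _ in range(N_HIGH)]
    rows += [([rng.gauss(-1.0, 1.0) for _ in range(n_features)], 0) for _ in range(N_LOW)]
    rng.shuffle(rows)
    return rows


def predict_proba(x, w, b):
    """Estimated probability that a trace with features x is high quality."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


def fit_logistic(rows, epochs=50, lr=0.1):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, d = len(rows), len(rows[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in rows:
            err = predict_proba(x, w, b) - y
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(rows, w, b):
    """Fraction of rows whose thresholded prediction matches the label."""
    hits = sum((predict_proba(x, w, b) >= 0.5) == (y == 1) for x, y in rows)
    return hits / len(rows)


rows = make_synthetic()
split = int(0.8 * len(rows))  # simple 80/20 hold-out split
train, test = rows[:split], rows[split:]
w, b = fit_logistic(train)

baseline = N_HIGH / (N_HIGH + N_LOW)  # accuracy of always answering "high quality"
test_accuracy = accuracy(test, w, b)
print(f"majority-class baseline: {baseline:.4f}")
print(f"held-out accuracy:       {test_accuracy:.4f}")
```

The baseline line is the part that rests directly on the abstract: with 1626 of 2257 files labeled high quality, a classifier that always predicts "high" is already right about 72% of the time, so any reported accuracy is best read against that figure. The held-out accuracy printed here reflects only the invented data and says nothing about the paper's results.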

Author information

Authors and Affiliations

Department of Electronics and Communications Engineering, Faculty of Electrical and Electronics Engineering, Yildiz Technical University, Istanbul, Turkey
Serkan Kurt
Department of Statistics, Faculty of Arts and Sciences, Yildiz Technical University, Istanbul, Turkey
Ersoy Öz & Öyküm Esra Aşkın
Molecular Biology-Biotechnology, Istanbul Technical University, Istanbul, Turkey
Yeliz Yücel Öz
Iontek A.Ş., Istanbul, Turkey
Yeliz Yücel Öz

Corresponding author

Correspondence to Serkan Kurt.

Ethics declarations

Conflict of interest

The authors state that they have no conflict of interest.

About this article

Cite this article

Kurt, S., Öz, E., Aşkın, Ö.E. et al. Classification of nucleotide sequences for quality assessment using logistic regression and decision tree approaches. Neural Comput & Applic 29, 251–262 (2018). https://doi.org/10.1007/s00521-017-2960-5

Received: 08 October 2016
Accepted: 21 March 2017
Published: 05 April 2017
Issue Date: April 2018
DOI: https://doi.org/10.1007/s00521-017-2960-5
