Abstract
Building machine learning models that interpret the genetic information carried in DNA, RNA, and protein sequences is an active and growing area of research. This study aims to identify, predict, and classify gene families from DNA sequences, and to detect medical anomalies for the early diagnosis of genetic variation. Gene sequences were drawn from three DNA sequence text files containing 4380 human, 1682 chimpanzee, and 820 dog DNA sequences. A separate genetic disorder dataset, with 35 features across 22083 patient records, was used to predict genetic abnormalities. Labelling, correlation analysis, exploratory data analysis, and predictive modelling were carried out on both datasets. The prediction systems used Logistic Regression, Gaussian Naive Bayes, K Neighbors, Decision Tree, Random Forest, Gradient Boosting, CatBoost, Multinomial Naive Bayes, and SVC classifiers. On the DNA sequencing dataset, the Multinomial Naive Bayes Classifier achieved the best accuracy at 94.42%, while the K Neighbors, Decision Tree, Random Forest, and SVC classifiers reached 71.98%, 74.85%, 86.82%, and 79.6% respectively. On the genetic disorder dataset, the best-performing model was CatBoost with an R2CV score of 54.72%. The R2CV scores of Logistic Regression, Gaussian Naive Bayes, K Neighbors, Decision Tree, Random Forest, Extreme Gradient Boosting, Light Gradient Boosting Machine, and Gradient Boosting Classifier were 47.36%, 34.16%, 45.27%, 40.83%, 52.36%, 48.89%, 48.75%, and 53.34% respectively. Future work will classify genetic disorders using extensive medical history, sequence data, deep learning models, federated machine learning, and transfer learning.
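The best DNA-sequence result reported above came from a Multinomial Naive Bayes classifier. The abstract does not say how sequences were turned into features, so the following is only a minimal sketch under stated assumptions: sequences are split into overlapping k-mers (a common choice, assumed here), the classifier is a hand-written multinomial Naive Bayes with Laplace smoothing, and the toy sequences and the labels `family_A` and `family_B` are invented for illustration. It is not the authors' pipeline.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k=3):
    """Split a DNA string into overlapping, lowercased k-mers."""
    seq = seq.lower()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class TinyMultinomialNB:
    """Multinomial Naive Bayes over k-mer counts with Laplace smoothing."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k          # k-mer length (assumed, not from the paper)
        self.alpha = alpha  # smoothing strength

    def fit(self, sequences, labels):
        self.class_counts = Counter(labels)
        self.kmer_counts = defaultdict(Counter)
        for seq, label in zip(sequences, labels):
            self.kmer_counts[label].update(kmers(seq, self.k))
        # Vocabulary = every k-mer seen in any training sequence.
        self.vocab = {km for c in self.kmer_counts.values() for km in c}
        return self

    def log_scores(self, seq):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.kmer_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n / total)  # log prior
            for km in kmers(seq, self.k):
                if km in self.vocab:  # skip k-mers never seen in training
                    score += math.log((counts[km] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, seq):
        scores = self.log_scores(seq)
        return max(scores, key=scores.get)


# Hypothetical toy data: two made-up "gene families".
train_seqs = ["ATGATGATGATG", "ATGATGCATGAT", "CCGGCCGGCCGG", "CCGGCCGCCGGC"]
train_labels = ["family_A", "family_A", "family_B", "family_B"]

model = TinyMultinomialNB(k=3).fit(train_seqs, train_labels)
prediction_a = model.predict("GATGATGA")
prediction_b = model.predict("GGCCGGCC")
```

Counting k-mers turns each sequence into a bag of "words", which is why a text-style classifier such as multinomial Naive Bayes applies naturally. In practice, a library implementation and a held-out test set would replace this toy setup.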
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Upadhyay, V., Harbhajanka, S., Pangaonkar, S., Gunjan, R. (2023). Exploratory Data Analysis and Prediction of Human Genetic Disorder and Species Using DNA Sequencing. In: Arai, K. (eds) Proceedings of the Future Technologies Conference (FTC) 2023, Volume 2. FTC 2023. Lecture Notes in Networks and Systems, vol 814. Springer, Cham. https://doi.org/10.1007/978-3-031-47451-4_14
DOI: https://doi.org/10.1007/978-3-031-47451-4_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47450-7
Online ISBN: 978-3-031-47451-4
eBook Packages: Intelligent Technologies and Robotics (R0)