Exploratory Data Analysis and Prediction of Human Genetic Disorder and Species Using DNA Sequencing

  • Conference paper
Proceedings of the Future Technologies Conference (FTC) 2023, Volume 2 (FTC 2023)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 814)

Abstract

Developing machine learning models that capture the genetic information expressed in DNA, RNA, and protein sequences is an active area of exploration and a growing need. This work aims to identify, predict, and classify gene families from DNA sequences, and to detect medical anomalies so that genetic variation can be diagnosed early. The study assessed gene sequences from three DNA sequence text files containing 4380 human, 1682 chimpanzee, and 820 dog DNA sequences. The genetic disorder dataset comprises 35 features, which were used to predict genetic abnormalities across 22083 patient records. Labelling, correlation analysis, exploratory data analysis, and prediction systems were carried out for both datasets. The prediction systems used the Logistic Regression, Gaussian Naive Bayes, K Neighbors, Decision Tree, Random Forest, Gradient Boosting, CatBoost, Multinomial Naive Bayes, and SVC classifiers. On the DNA sequencing dataset, the Multinomial Naive Bayes Classifier achieved the best accuracy, 94.42%, while the K Neighbors, Decision Tree, Random Forest, and SVC classifiers reached 71.98%, 74.85%, 86.82%, and 79.6%, respectively. On the genetic disorder dataset, the best-performing model was CatBoost, with an R2CV score of 54.72%. Logistic Regression, Gaussian Naive Bayes, K Neighbors, Decision Tree, Random Forest, Extreme Gradient Boosting, Light Gradient Boosting Machine, and Gradient Boosting Classifier obtained R2CV scores of 47.36%, 34.16%, 45.27%, 40.83%, 52.36%, 48.89%, 48.75%, and 53.34%, respectively. Future work will classify genetic disorders using extensive medical history, sequence data, deep learning models, federated machine learning, and transfer learning.
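To make the species-classification setup more concrete, the following is a minimal sketch of a Multinomial Naive Bayes classifier over DNA sequences. It is not the authors' implementation: the abstract does not say how sequences were turned into features, so the overlapping k-mer counts used here are an assumption, and the training sequences, class labels, and the names `kmers` and `KmerNaiveBayes` are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k=3):
    """Split a DNA string into overlapping, lowercased k-mers."""
    seq = seq.lower()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerNaiveBayes:
    """Multinomial Naive Bayes over k-mer counts with Laplace smoothing.

    Hypothetical helper class; k-mer features are an assumption,
    not something stated in the abstract.
    """

    def __init__(self, k=3, alpha=1.0):
        self.k = k
        self.alpha = alpha

    def fit(self, sequences, labels):
        self.class_counts = Counter(labels)
        self.kmer_counts = defaultdict(Counter)
        self.vocab = set()
        for seq, label in zip(sequences, labels):
            counts = Counter(kmers(seq, self.k))
            self.kmer_counts[label].update(counts)
            self.vocab.update(counts)
        # Total number of k-mer occurrences observed per class.
        self.totals = {
            c: sum(self.kmer_counts[c].values()) for c in self.class_counts
        }
        return self

    def predict(self, seq):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for c in self.class_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.class_counts[c] / n_docs)
            denom = self.totals[c] + self.alpha * vocab_size
            for kmer in kmers(seq, self.k):
                if kmer not in self.vocab:
                    continue  # skip k-mers never seen during training
                count = self.kmer_counts[c][kmer]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Toy, made-up training data: two artificial classes, not real species.
train_seqs = ["ATATATATATAT", "AATTAATTAATT", "GCGCGCGCGCGC", "GGCCGGCCGGCC"]
train_labels = ["at_rich", "at_rich", "gc_rich", "gc_rich"]

model = KmerNaiveBayes(k=3).fit(train_seqs, train_labels)
print(model.predict("ATATAATTATAT"))
print(model.predict("GCGGCCGCGCGC"))
```

Counting k-mers treats each sequence like a bag of short "words", which is what lets a multinomial model be applied to it. A real experiment would also need a held-out test split and a longer k; both are omitted here for brevity.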

Author information

Corresponding author

Correspondence to Sakshi Harbhajanka.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Upadhyay, V., Harbhajanka, S., Pangaonkar, S., Gunjan, R. (2023). Exploratory Data Analysis and Prediction of Human Genetic Disorder and Species Using DNA Sequencing. In: Arai, K. (eds) Proceedings of the Future Technologies Conference (FTC) 2023, Volume 2. FTC 2023. Lecture Notes in Networks and Systems, vol 814. Springer, Cham. https://doi.org/10.1007/978-3-031-47451-4_14
