Gene Sequence Classification Using K-mer Decomposition and Soft-Computing-Based Approach

Kumar, Sanjeev

doi:10.1007/978-981-16-1696-9_17

Sanjeev Kumar

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1381)

Abstract

The healthcare industry is moving toward personalized medicine, which uses an individual's genetic information to tailor medical treatment to that individual. The DNA sequence of a genome consists of several genes, which are the basic building blocks of an organism. A human genome consists of 20–30 thousand genes. Some of these genes are involved in the growth and development of the body, some are responsible for critical diseases (influenza, Ebola, dengue), and the remainder are non-coding (junk) genes. Identifying and classifying these genes into a few biologically meaningful groups (coding, non-coding, and viral) is useful for diagnosis and treatment. In this paper, k-mer (substring of length k) frequency decomposition and a soft-computing-based approach are used to identify and classify a large set of unknown genes into meaningful groups. The approach first takes a DNA sequence and computes a vector of the proportions of every possible k-mer. These vectors serve as feature vectors on which a well-known supervised classification algorithm (a multi-mode Naive Bayes classifier) is trained. Experiments show that the proposed approach achieves the highest accuracy (\(\ge 90\%\)) along with the lowest running time in comparison with other state-of-the-art methods.
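
The two steps described in the abstract (k-mer proportion vectors, then a Naive Bayes classifier trained on them) can be sketched roughly as follows. This is a minimal illustration, not the chapter's implementation: the ACGT alphabet, the choice k = 2, the smoothing value `alpha`, the toy sequences, and the labels `a_rich` and `gc_rich` are all assumptions made for the example, and a small multinomial Naive Bayes is used as a stand-in because the abstract does not specify the "multi-mode" variant.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"  # assumed nucleotide alphabet


def kmer_proportions(sequence, k):
    """Proportion of each of the 4**k possible k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing symbols outside ACGT (for example N) are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


class TinyMultinomialNB:
    """Hypothetical stand-in classifier: multinomial Naive Bayes with additive smoothing."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha  # assumed smoothing constant

    def fit(self, vectors, labels):
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(vectors))
            # Sum each feature over the class, then normalise to a distribution.
            sums = [sum(col) + self.alpha for col in zip(*rows)]
            total = sum(sums)
            self.log_likelihood[c] = [math.log(s / total) for s in sums]
        return self

    def predict(self, vector):
        def score(c):
            weights = self.log_likelihood[c]
            return self.log_prior[c] + sum(x * w for x, w in zip(vector, weights))
        return max(self.classes, key=score)


# Toy training data (invented for illustration only).
K = 2
train_sequences = ["AAAAAAAAAA", "AAAATAAAAA", "GCGCGCGCGC", "GGCCGGCCGC"]
train_labels = ["a_rich", "a_rich", "gc_rich", "gc_rich"]

model = TinyMultinomialNB().fit(
    [kmer_proportions(s, K) for s in train_sequences], train_labels
)
prediction = model.predict(kmer_proportions("GCGGCGCC", K))
```

Because each sequence is reduced to a fixed-length vector of 4**k proportions, sequences of different lengths become directly comparable, and larger k gives a longer but sparser feature vector.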

Author information

Authors and Affiliations

Department of CEA, GLA University, Mathura, India
Sanjeev Kumar

Authors

Sanjeev Kumar

Corresponding author

Correspondence to Sanjeev Kumar.

Editor information

Editors and Affiliations

Department of Computer Science, Shobhit University Gangoh, Gangoh, Uttar Pradesh, India
Tarun K. Sharma
Gwangju Institute of Science and Technology, Gwangju, Korea (Republic of)
Chang Wook Ahn
Department of Instrumentation and Control Engineering, Dr. B. R. Ambedkar National Institute of Technology, Jalandhar, Punjab, India
Om Prakash Verma
Department of Electrical Engineering, Indian Institute of Technology Delhi, New Delhi, Delhi, India
Bijaya Ketan Panigrahi

About this paper

Cite this paper

Kumar, S. (2021). Gene Sequence Classification Using K-mer Decomposition and Soft-Computing-Based Approach. In: Sharma, T.K., Ahn, C.W., Verma, O.P., Panigrahi, B.K. (eds) Soft Computing: Theories and Applications. Advances in Intelligent Systems and Computing, vol 1381. Springer, Singapore. https://doi.org/10.1007/978-981-16-1696-9_17

DOI: https://doi.org/10.1007/978-981-16-1696-9_17
Published: 27 June 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-1695-2
Online ISBN: 978-981-16-1696-9
eBook Packages: Intelligent Technologies and Robotics (R0)