Gene Sequence Classification Using K-mer Decomposition and Soft-Computing-Based Approach

  • Conference paper
Soft Computing: Theories and Applications

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1381)

Abstract

The healthcare industry is moving toward personalized medicine, which uses an individual's genetic information to tailor medical treatment to that individual. The DNA sequence of a genome contains many genes, which are the basic building blocks of an organism; a human genome consists of 20–30 thousand genes. Some of these genes are involved in the growth and development of the body, some are responsible for critical diseases (influenza, Ebola, dengue), and the rest are non-coding (junk) genes. Identifying and classifying these genes into a few biologically meaningful groups (coding, non-coding, and viral) is useful for diagnosis and treatment. In this paper, k-mer (substring of length k) frequency decomposition and a soft-computing-based approach are used to identify and classify a large set of unknown genes into such groups. The method first takes a DNA sequence and computes a vector of the proportions of every possible k-mer. These vectors serve as feature vectors, on which a well-known supervised classification algorithm (a multi-mode Naive Bayes classifier) is trained. Experiments show that the proposed approach achieves the highest accuracy (\(\ge 90\%\)) and the lowest running time in comparison with other state-of-the-art methods.
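
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the classifier's details, so a small multinomial Naive Bayes with Laplace smoothing is assumed here as a stand-in, and the names `kmer_proportions`, `NaiveBayes`, the `alpha` parameter, and the toy sequences and labels are all hypothetical. For the alphabet {A, C, G, T}, each feature vector has 4^k entries.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def kmer_proportions(sequence, k=3):
    """Proportion of every possible k-mer, in lexicographic order (4**k values)."""
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows containing symbols outside ACGT (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]

class NaiveBayes:
    """Assumed stand-in: multinomial Naive Bayes with Laplace smoothing."""

    def fit(self, vectors, labels, alpha=1.0):
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(vectors))
            # Smoothed per-feature mass for this class, then normalised.
            sums = [sum(col) + alpha for col in zip(*rows)]
            total = sum(sums)
            self.log_likelihood[c] = [math.log(s / total) for s in sums]
        return self

    def predict(self, vector):
        def score(c):
            weights = self.log_likelihood[c]
            return self.log_prior[c] + sum(x * w for x, w in zip(vector, weights))
        return max(self.classes, key=score)

# Toy data (hypothetical): AT-rich versus GC-rich sequences, k = 2.
train = [
    ("ATATATATAT", "at_rich"),
    ("AATTAATTAA", "at_rich"),
    ("GCGCGCGCGC", "gc_rich"),
    ("GGCCGGCCGG", "gc_rich"),
]
features = [kmer_proportions(seq, k=2) for seq, _ in train]
labels = [label for _, label in train]
model = NaiveBayes().fit(features, labels)

print(model.predict(kmer_proportions("TATATTAATA", k=2)))
print(model.predict(kmer_proportions("CGCGGCCGCG", k=2)))
```

Using proportions rather than raw counts makes sequences of different lengths comparable, at the cost of discarding how much evidence each sequence carries; with real data, k and the smoothing strength would need tuning.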

Author information

Corresponding author

Correspondence to Sanjeev Kumar.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Kumar, S. (2021). Gene Sequence Classification Using K-mer Decomposition and Soft-Computing-Based Approach. In: Sharma, T.K., Ahn, C.W., Verma, O.P., Panigrahi, B.K. (eds) Soft Computing: Theories and Applications. Advances in Intelligent Systems and Computing, vol 1381. Springer, Singapore. https://doi.org/10.1007/978-981-16-1696-9_17
