A novel numerical mapping method based on entropy for digitizing DNA sequences

  • New Trends in data pre-processing methods for signal and image classification
Neural Computing and Applications

Abstract

Digital signal processing has recently been widely applied in genomics, and one such application is the identification of protein-coding regions. Where is a protein coded? How much is encoded? Where are growth and development regulated? These questions can be answered by classifying the regions of a DNA sequence as exons or introns. Signal processing methods operate on numerical signals, whereas a DNA sequence is symbolic; it must therefore be converted from a symbolic sequence to a numeric sequence during data preprocessing, prior to analysis. The bases in a DNA sequence are represented by the four letters A, G, C and T, and a numerical mapping assigns a numeric value to each letter. Several numerical mapping techniques exist in the literature. In this paper, a novel numerical mapping approach is proposed for converting the string to numerical values. In this approach, each codon is mapped by an improved fractional derivative of the Shannon equation. Three methods are used to predict exon regions: singular value decomposition (SVD), the discrete Fourier transform (DFT) and the short-time Fourier transform (STFT). The performance of the proposed mapping technique is evaluated with each of these three methods. The proposed technique identified protein-coding regions more successfully than the predominant existing mapping techniques under the SVD, DFT and STFT methods.
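To make the symbolic-to-numeric step concrete, the following is a minimal sketch of one well-known existing mapping, the binary indicator (Voss-style) representation, followed by a DFT-based measure of period-3 power. It is not the entropy-based mapping proposed in the paper, whose formula is not given in the abstract. The function names (`indicator_sequences`, `dft_power`, `period3_power`) are hypothetical and chosen for illustration only.

```python
import cmath

BASES = "ACGT"


def indicator_sequences(dna):
    """Map a DNA string to four 0/1 sequences, one per base.

    Assumption: this is the binary indicator mapping, used here only as an
    example of an existing numerical mapping, not the paper's method.
    """
    dna = dna.upper()
    return {b: [1 if ch == b else 0 for ch in dna] for b in BASES}


def dft_power(signal, k):
    """Squared magnitude of the k-th DFT coefficient of a real sequence."""
    n = len(signal)
    coeff = sum(
        x * cmath.exp(-2j * cmath.pi * k * i / n)
        for i, x in enumerate(signal)
    )
    return abs(coeff) ** 2


def period3_power(dna):
    """Spectral power at frequency index N/3, summed over the four bases.

    A pronounced value at N/3 is the period-3 signature commonly
    associated with protein-coding regions.
    """
    n = len(dna)
    if n == 0 or n % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    k = n // 3
    return sum(dft_power(seq, k) for seq in indicator_sequences(dna).values())
```

For example, `period3_power("ATG" * 10)` is large because every base recurs with period 3, while `period3_power("A" * 30)` is close to zero. In practice the measure would be evaluated over a sliding window along the sequence, which is the idea behind the STFT approach mentioned above.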

Author information

Corresponding author

Correspondence to Ibrahim Turkoglu.

Ethics declarations

Conflict of interest

There is no conflict of interest.

About this article

Cite this article

Das, B., Turkoglu, I. A novel numerical mapping method based on entropy for digitizing DNA sequences. Neural Comput & Applic 29, 207–215 (2018). https://doi.org/10.1007/s00521-017-2871-5

  • DOI: https://doi.org/10.1007/s00521-017-2871-5
