Abstract
DNA sequences are the basic data type processed in most biological data analyses. A key component of such analyses is sequence classification, a methodology widely used for sequential data of different kinds. Applying it to DNA sequences, however, requires a suitable representation of the sequences, which is still an open research problem. Machine Learning (ML) methodologies have made a fundamental contribution to solving this problem, and Deep Neural Network (DNN) models have recently shown strongly encouraging results. In this chapter, we address specific classification problems related to two biological scenarios: (A) metagenomics and (B) chromatin organization. In both, DNA sequences are the input data for the classification methodologies. In particular, we study and test the efficacy of (1) different DNA sequence representations and (2) several Deep Learning (DL) architectures that process sequences to solve the related supervised classification problems. Although these architectures were developed for specific classification tasks, we think they could serve as a starting point for developing other DNN models that process the same kind of input.
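As a rough illustration of what a "DNA sequence representation" can look like, the sketch below encodes a sequence in two common ways: a one-hot matrix with one row per nucleotide, and a fixed-length vector of overlapping k-mer counts. The abstract does not say which representations the chapter evaluates, so both choices, along with the function names `one_hot` and `kmer_counts` and the handling of unknown symbols, are assumptions made only for this example.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed nucleotide alphabet for this sketch


def one_hot(seq):
    """Encode each nucleotide as a 4-dimensional indicator vector."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        if base in index:  # symbols outside ACGT (e.g. N) stay all-zero
            row[index[base]] = 1
        rows.append(row)
    return rows


def kmer_counts(seq, k=2):
    """Count overlapping k-mers; output follows lexicographic k-mer order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing unknown symbols
            counts[window] += 1
    return [counts[kmer] for kmer in kmers]
```

The one-hot form keeps positional information and has a length that depends on the sequence, which suits architectures that scan along the sequence. The k-mer vector always has 4^k entries regardless of sequence length, but discards where each k-mer occurred.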
Acknowledgements
Additional support to Giosué Lo Bosco and Domenico Amato has been granted by Project INdAM - GNCS “Computational Intelligence methods for Digital Health”.
Copyright information
© 2021 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Amato, D. et al. (2021). Classification of Sequences with Deep Artificial Neural Networks: Representation and Architectural Issues. In: Elloumi, M. (eds) Deep Learning for Biomedical Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-71676-9_2
DOI: https://doi.org/10.1007/978-3-030-71676-9_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71675-2
Online ISBN: 978-3-030-71676-9
eBook Packages: Biomedical and Life Sciences (R0)