Abstract
Proper gene expression plays a vital role in the function of an organism. Recent advances in DNA microarray technology make it possible to monitor the expression levels of thousands of genes. One of the important tasks in this context is to understand the underlying mechanisms of gene regulation. Recently, researchers have focused on identifying local DNA elements, or motifs, to infer the relation between the expression of a gene and its nucleotide sequence. This study proposes a novel data-adaptive representation approach for supervised learning to predict the response associated with biological sequences. Biological sequences such as DNA and protein sequences are a class of categorical sequences. In machine learning, categorical sequences are generally mapped to a lower-dimensional representation for learning tasks to avoid problems with high dimensionality. The proposed method, SW-RF (sliding window-random forest), is a feature-based approach that learns a representation for categorical sequences in two main steps. In the first step, each sequence is represented by overlapping subsequences of constant length. In the second step, a tree-based learner is trained on this representation to obtain a bag-of-words-like representation: for each sequence, the frequency of its subsequences at the terminal nodes of the tree. Any classifier can then be trained on the learned representation. Here, a lasso logistic regression is trained on it to facilitate the identification of important patterns for the classification task. Our experiments show that the proposed approach provides significantly better accuracy on both synthetic data and DNA promoter sequence data. Moreover, missing values, a common problem in microarray datasets, are handled efficiently by the tree learners in SW-RF.
Although the focus of this paper is on biological sequences, SW-RF is flexible in handling any categorical sequence data from different applications.
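The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn's `RandomForestClassifier` as the tree-based learner, an integer encoding of the DNA alphabet (which treats symbols as ordered, a simplification), and that every window inherits the label of its parent sequence. The helper names, the planted motif `GATTACA`, and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

ALPHABET = "ACGT"  # assumed alphabet; any categorical alphabet would do


def sliding_windows(seq, width):
    """Step 1: all overlapping subsequences of a constant length."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]


def window_table(seqs, labels, width):
    """Stack the integer-encoded windows of all sequences.

    Assumption: each window inherits the label of its parent sequence.
    `owner` records which sequence every window came from.
    """
    rows, y, owner = [], [], []
    for idx, (seq, label) in enumerate(zip(seqs, labels)):
        for window in sliding_windows(seq, width):
            rows.append([ALPHABET.index(ch) for ch in window])
            y.append(label)
            owner.append(idx)
    return np.array(rows), np.array(y), np.array(owner)


def leaf_histograms(forest, X, owner, n_seqs):
    """Step 2: per sequence, count how many windows reach each tree node.

    Only terminal nodes receive counts; columns that correspond to
    internal nodes stay at zero.
    """
    leaves = forest.apply(X)  # shape: (n_windows, n_trees)
    blocks = []
    for t, tree in enumerate(forest.estimators_):
        counts = np.zeros((n_seqs, tree.tree_.node_count))
        np.add.at(counts, (owner, leaves[:, t]), 1)
        blocks.append(counts)
    return np.hstack(blocks)


# Hypothetical toy data: class 1 sequences carry a planted motif.
rng = np.random.default_rng(0)


def random_seq(length, motif=None):
    s = "".join(rng.choice(list(ALPHABET), size=length))
    if motif:
        pos = int(rng.integers(0, length - len(motif) + 1))
        s = s[:pos] + motif + s[pos + len(motif):]
    return s


labels = [i % 2 for i in range(60)]
seqs = [random_seq(40, "GATTACA" if lab else None) for lab in labels]

X, y, owner = window_table(seqs, labels, width=7)
forest = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
forest.fit(X, y)
features = leaf_histograms(forest, X, owner, len(seqs))

# L1-penalised (lasso) logistic regression on the learned representation.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(features, labels)
```

In this sketch, the nonzero coefficients of `clf` point to terminal nodes, and therefore to groups of similar subsequences, that matter for the classification. A held-out split would be needed before drawing any conclusion about accuracy.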
Funding
This research is supported by the Air Force Office of Scientific Research under Grant FA9550-17-1-0138.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
This article is part of the Topical Collection on Special Issue on Data Mining in Healthcare Informatics
Cite this article
Cakin, H., Gorgulu, B., Baydogan, M.G. et al. A Data Adaptive Biological Sequence Representation for Supervised Learning. J Healthc Inform Res 2, 448–471 (2018). https://doi.org/10.1007/s41666-018-0038-5
DOI: https://doi.org/10.1007/s41666-018-0038-5