Journal of Healthcare Informatics Research, Volume 2, Issue 4, pp 448–471

A Data Adaptive Biological Sequence Representation for Supervised Learning

  • Hande Cakin
  • Berk Gorgulu
  • Mustafa Gokce Baydogan
  • Na Zou
  • Jing Li
Research Article
Part of the following topical collections:
  1. Special Issue on Data Mining in Healthcare Informatics


Proper expression of genes plays a vital role in the function of an organism. Recent advancements in DNA microarray technology allow for monitoring the expression level of thousands of genes. One of the important tasks in this context is to understand the underlying mechanisms of gene regulation. Recently, researchers have focused on identifying local DNA elements, or motifs, to infer the relation between the expression and the nucleotide sequence of the gene. This study proposes a novel data adaptive representation approach for supervised learning to predict the response associated with biological sequences. Biological sequences such as DNA and protein are a class of categorical sequences. In machine learning, categorical sequences are generally mapped to a lower dimensional representation for learning tasks to avoid problems with high dimensionality. The proposed method, namely SW-RF (sliding window-random forest), is a feature-based approach requiring two main steps to learn a representation for categorical sequences. In the first step, each sequence is represented by overlapping subsequences of constant length. In the second step, a tree-based learner is trained on this representation to obtain a bag-of-words-like representation: for each sequence, the frequency of its subsequences at the terminal nodes of the tree. After representation learning, any classifier can be trained on the learned representation. Here, a lasso logistic regression is trained on the learned representation to facilitate the identification of important patterns for the classification task. Our experiments show that the proposed approach provides significantly better results in terms of accuracy on both synthetic data and DNA promoter sequence data. Moreover, a common problem for microarray datasets, namely missing values, is handled efficiently by the tree learners in SW-RF. Although the focus of this paper is on biological sequences, SW-RF is flexible in handling any categorical sequence data from different applications.
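
The two-step representation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn and NumPy, a one-hot encoding of each window, that every window inherits the label of its parent sequence, and a toy dataset with a planted motif. All function names (`windows`, `one_hot`, `window_table`, `leaf_histograms`) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

ALPHABET = "ACGT"

def windows(seq, w):
    """Step 1: all overlapping length-w subsequences of seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def one_hot(window):
    """Assumed encoding: one indicator per (position, symbol) pair."""
    vec = np.zeros(len(window) * len(ALPHABET))
    for pos, ch in enumerate(window):
        vec[pos * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
    return vec

def window_table(seqs, w):
    """Stack the encoded windows of all sequences; owner[i] is the
    index of the sequence that row i was cut from."""
    rows, owner = [], []
    for idx, s in enumerate(seqs):
        for win in windows(s, w):
            rows.append(one_hot(win))
            owner.append(idx)
    return np.array(rows), np.array(owner)

def leaf_histograms(forest, X, owner, n_seqs):
    """Step 2: per sequence, count how many of its windows land in
    each node of each tree (only terminal nodes get nonzero counts)."""
    leaves = forest.apply(X)  # shape (n_windows, n_trees)
    sizes = [t.tree_.node_count for t in forest.estimators_]
    offsets = np.concatenate(([0], np.cumsum(sizes)[:-1]))
    H = np.zeros((n_seqs, sum(sizes)))
    for t in range(leaves.shape[1]):
        np.add.at(H, (owner, offsets[t] + leaves[:, t]), 1)
    return H

# Toy data (assumption): positives carry a planted motif.
rng = np.random.default_rng(0)
seqs, y = [], []
for k in range(60):
    s = "".join(rng.choice(list(ALPHABET), size=40))
    if k % 2 == 1:
        i = int(rng.integers(0, 40 - 6 + 1))
        s = s[:i] + "TATAAT" + s[i + 6:]
    seqs.append(s)
    y.append(k % 2)
y = np.array(y)

w = 6
X, owner = window_table(seqs, w)
forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X, y[owner])  # each window is labelled by its parent sequence
H = leaf_histograms(forest, X, owner, len(seqs))

# Lasso (L1-penalised) logistic regression on the learned representation.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(H, y)
```

Because the L1 penalty drives many coefficients to zero, the nonzero entries of `clf.coef_` point to the terminal nodes, and hence the subsequence patterns, that matter most for the classification. The window length `w`, the number of trees, and `C` are arbitrary choices made for this sketch.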


Keywords

Gene expression, Biological sequences, Time series, Categorical, Classification, Representation learning


Funding Information

This research is supported by the Air Force Office of Scientific Research, Grant FA9550-17-1-0138.

Compliance with Ethical Standards

Conflict of Interest

The authors declare that they have no conflict of interest.


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Industrial Engineering, Boğaziçi University, İstanbul, Turkey
  2. Texas A&M University, College Station, USA
  3. Arizona State University, Tempe, USA
