A Data Adaptive Biological Sequence Representation for Supervised Learning

  • Research Article
  • Journal of Healthcare Informatics Research

Abstract

Proper gene expression plays a vital role in the function of an organism. Recent advances in DNA microarray technology allow the expression levels of thousands of genes to be monitored. An important task in this context is to understand the underlying mechanisms of gene regulation. Recently, researchers have focused on identifying local DNA elements, or motifs, to infer the relation between the expression of a gene and its nucleotide sequence. This study proposes a novel data adaptive representation approach for supervised learning to predict the response associated with biological sequences. Biological sequences such as DNA and protein are a class of categorical sequences. In machine learning, categorical sequences are generally mapped to a lower-dimensional representation for learning tasks to avoid problems with high dimensionality. The proposed method, SW-RF (sliding window-random forest), is a feature-based approach that learns a representation for categorical sequences in two main steps. In the first step, each sequence is represented by overlapping subsequences of constant length. In the second step, a tree-based learner is trained on this representation to obtain a bag-of-words-like representation: for each sequence, the frequency of its subsequences in the terminal nodes of the tree. After representation learning, any classifier can be trained on the learned representation. Here, a lasso logistic regression is trained on the learned representation to facilitate the identification of patterns that are important for the classification task. Our experiments show that the proposed approach provides significantly better accuracy on both synthetic data and DNA promoter sequence data. Moreover, missing values, a common problem in microarray datasets, are handled efficiently by the tree learners in SW-RF. Although the focus of this paper is on biological sequences, SW-RF is flexible enough to handle categorical sequence data from other applications.
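The two representation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `sliding_windows` and `leaf_histogram` are hypothetical, and `toy_tree` is a hand-written stand-in for a trained tree learner, assumed here only so that the example runs without a machine learning library. In practice the terminal-node assignment would come from trees fitted to the windowed data.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence


def sliding_windows(sequence: str, width: int) -> List[str]:
    """Step 1: return every overlapping subsequence of length `width`."""
    if width <= 0:
        raise ValueError("width must be positive")
    # A sequence shorter than `width` yields an empty list.
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


def leaf_histogram(
    sequence: str,
    width: int,
    assign_leaf: Callable[[str], Hashable],
    leaves: Sequence[Hashable],
) -> List[int]:
    """Step 2: count how many windows of `sequence` fall in each terminal node.

    `assign_leaf` maps one window to a terminal-node identifier. In this
    sketch it is an arbitrary callable; it stands in for a trained tree.
    """
    counts = Counter(assign_leaf(w) for w in sliding_windows(sequence, width))
    return [counts[leaf] for leaf in leaves]


def toy_tree(window: str) -> str:
    """Hypothetical stand-in for a trained tree (an assumption, not from the paper).

    Routes a window to one of two "terminal nodes" by its G/C content.
    """
    gc = sum(base in "GC" for base in window)
    return "gc_rich" if gc * 2 > len(window) else "at_rich"


windows = sliding_windows("ACGTATGC", 3)
features = leaf_histogram("ACGTATGC", 3, toy_tree, ["gc_rich", "at_rich"])
print(windows)   # six overlapping windows of length 3
print(features)  # one count per terminal node
```

The resulting count vector has one entry per terminal node, so every sequence maps to a fixed-length feature vector regardless of its own length. Such vectors could then be passed to any classifier, for example an L1-penalized logistic regression as in the abstract.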

Funding

This research is supported by Air Force Office of Scientific Research Grant FA9550-17-1-0138.

Author information

Corresponding author

Correspondence to Mustafa Gokce Baydogan.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the Topical Collection on Special Issue on Data Mining in Healthcare Informatics

About this article

Cite this article

Cakin, H., Gorgulu, B., Baydogan, M.G. et al. A Data Adaptive Biological Sequence Representation for Supervised Learning. J Healthc Inform Res 2, 448–471 (2018). https://doi.org/10.1007/s41666-018-0038-5

  • DOI: https://doi.org/10.1007/s41666-018-0038-5
