
Journal of Healthcare Informatics Research

Volume 2, Issue 4, pp 448–471

A Data Adaptive Biological Sequence Representation for Supervised Learning

  • Hande Cakin
  • Berk Gorgulu
  • Mustafa Gokce Baydogan
  • Na Zou
  • Jing Li
Research Article
Part of the following topical collections:
  1. Special Issue on Data Mining in Healthcare Informatics

Abstract

Proper expression of genes plays a vital role in the function of an organism. Recent advancements in DNA microarray technology allow for monitoring the expression levels of thousands of genes. One of the important tasks in this context is to understand the underlying mechanisms of gene regulation. Recently, researchers have focused on identifying local DNA elements, or motifs, to infer the relation between the expression and the nucleotide sequence of a gene. This study proposes a novel data adaptive representation approach for supervised learning to predict the response associated with biological sequences. Biological sequences such as DNA and protein are a class of categorical sequences. In machine learning, categorical sequences are generally mapped to a lower dimensional representation for learning tasks to avoid problems with high dimensionality. The proposed method, namely SW-RF (sliding window-random forest), is a feature-based approach requiring two main steps to learn a representation for categorical sequences. In the first step, each sequence is represented by overlapping subsequences of constant length. Then a tree-based learner is trained on this representation to obtain a bag-of-words-like representation, namely, the frequency of subsequences on the terminal nodes of the tree for each sequence. After representation learning, any classifier can be trained on the learned representation. A lasso logistic regression is trained on the learned representation to facilitate the identification of important patterns for the classification task. Our experiments show that the proposed approach provides significantly better results in terms of accuracy on both synthetic data and DNA promoter sequence data. Moreover, a common problem for microarray datasets, namely missing values, is handled efficiently by the tree learners in SW-RF. Although the focus of this paper is on biological sequences, SW-RF is flexible in handling any categorical sequence data from different applications.

Keywords

Gene expression · Biological sequences · Time series · Categorical · Classification · Representation learning

1 Introduction

Machine learning on large biological sequence databases has received considerable interest over the past decade with the increase in sequential data due to advancements in microarray technology. An important challenge in the analysis of these sequential databases stems from the high dimensionality of the sequences.

Much of the research has focused on finding a high-level representation by transforming the original data to another domain to reduce the dimension and capture certain relevant information. On the other hand, efficient analysis of sequential data is not only a concern of bioinformaticians [4]. Similar problems exist in different domains such as online marketing, where the sequence of pages or the path viewed by users is analyzed to understand purchasing behavior [18], and telecommunication, where the task is to predict a future event (i.e., the failure of a piece of equipment) based on past events (i.e., the alarm messages) [22].

With the recent advancements in whole-genome approaches to measure gene expression, researchers have focused on learning the structural and dynamical properties of gene regulatory networks (GRN) [4]. A GRN consists of regulators which interact with each other and with other substances to govern the gene expression levels of mRNA and proteins. The regulators can be DNA, RNA, protein, or a combination of these. The human genome has approximately 20,000 protein-coding genes and these genes are expressed to generate proteins at specific levels in specific cells [14]. The creation of body structures (i.e., cells or tissues) is achieved in this way. Important factors driving this process are transcription factors (TFs). Regulation of genes is achieved by these specialized proteins, which "bind" to short sequences on the DNA; these binding sites are called TF binding sites (TFBSs). Describing gene expression as a function of regulatory inputs specified by interactions between proteins and DNA is important to understand the underlying mechanisms by which genes are regulated [4]. Generally, pattern recognition algorithms are used to identify the sets of DNA sequence elements, or motifs, that are predictive of a gene's expression pattern. Genetic variations in the DNA sequence may indicate a person's risk of a particular disease or reaction to certain medications [17]. For example, in cancer genomics, panels of specific genes have been identified in malignant tumors. DNA sequence could be used to detect variations or mutations in the established panel of genes for immediate treatment decisions and potential response to chemotherapy drugs. A second example is the use of whole-genome sequencing for predicting a future risk of a specific disease even when the identified variants have no known clinical significance. A healthy newborn could be screened for a potential risk of genetic disease onset in early childhood. It is important to understand DNA sequence so that decision makers or healthcare professionals can determine when and how to use sequencing.

Biological sequences such as DNA are a class of sequential data. Sequential data can be described as a collection of events that take place in an order, where events can be represented with categories (e.g., nucleotides), numbers (e.g., temperature), or a combination of both (e.g., network traffic direction and payload size) [24]. Problems involving sequential data have an important difference compared to regular data mining problems since the information is embedded in the order of the sequence. In other words, there is no explicit feature describing a data point. In traditional learning tasks, each data point is represented with a feature vector describing its characteristics. However, sequential data are represented by consecutive observations, and the length of the sequences can be very large depending on the sequential process. Therefore, an additional step is required to represent given sequences in a more effective manner.

The methods to represent categorical sequences can be categorized under three main groups, namely, similarity-based approaches, feature-based approaches, and model-based approaches. Similarity-based approaches are among the most popular techniques, in which alignment-based kernels are introduced to solve the classification problem with support vector machines (SVM) [11]. Depending on the area of application, similarity-based methods can be successful; however, they lack interpretability. Additionally, their computational performance deteriorates substantially with increasing data size as they require storing the complete training data and calculating similarities between all pairs of sequences.

Besides similarity-based approaches, there are model-based approaches. These approaches aim to discover the underlying model of the observed sequences and use this model (i.e., logic) for the classification of new instances. Under the assumption that sequences follow a parametric distribution, Markovian models are introduced as probabilistic model-based methods. These methods work with the conditional probabilities of occurrences under the Markov property. Variations of Markov models are used in the field of categorical sequence representation. Hidden Markov models (HMM) are the most famous approach to represent sequential data in this category [6]. These approaches learn a model for the sequence and use the model parameters (i.e., learned transition and emission probabilities) as an input to learning algorithms. Model-based methods are generalizable and have been applied successfully to various domains and problems. However, training HMMs might be problematic since there are several parameters to tune and learn (i.e., the number of hidden states and state transition probabilities) and the performance is affected significantly by the choice of initialization parameters.

Computational inefficiency and the non-interpretable nature of similarity-based and model-based approaches have made researchers focus on feature-based approaches. The most popular approach in this category is the k-mers method. After all possible combinations of the observations are considered to generate words of length k, the frequencies of these words are used to represent the sequence. For instance, the 3-mers representation of a DNA sequence is a feature vector of length 4^3 = 64 as there are 4 different nucleotides (i.e., A, T, G, C). It is a representation which has many applications in various domains such as speech recognition [9, 25], text categorization, and protein and genomic sequences. Furthermore, these methods perform significantly better than similarity-based approaches computationally since they do not require the storage of the complete training data. On the other hand, obtaining a representation with larger k values can be problematic for cases where the number of categories is large. For example, there are protein sequences composed of 20 different letters [1]. Obtaining long words is almost impossible with the k-mers representation for such datasets.
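To make the k-mers representation concrete, the following minimal Python sketch (not part of the original study; the helper name kmer_features and the toy sequence are hypothetical) builds the 4^k-dimensional frequency vector for a single DNA sequence.

```python
from itertools import product

def kmer_features(sequence, k=3, alphabet="ACGT"):
    """Frequency vector over all len(alphabet)**k possible words of length k."""
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]  # 4^3 = 64 words for DNA, k = 3
    counts = dict.fromkeys(vocabulary, 0)
    for i in range(len(sequence) - k + 1):          # slide over the sequence one position at a time
        word = sequence[i:i + k]
        if word in counts:                          # ignore windows containing unknown symbols
            counts[word] += 1
    return [counts[w] for w in vocabulary]

features = kmer_features("ATGCGATACG", k=3)
print(len(features), sum(features))                 # 64 features, 8 overlapping 3-mers
```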

There are some variants of k-mers introduced to decrease the computational complexity and memory requirements of k-mers. KRAKEN [23] is a method that depends only on mapping the k-mers representation to the lowest common ancestors (LCA), namely, the deepest node that has two descendant nodes. It aims to decrease the complexity of k-mers without losing any performance. In order to speed up the process further and to decrease the memory requirements of KRAKEN, [16] suggests CLARK. CLARK defines the k-mers representation as a k-spectrum, which is a representation vector of length 4^k that counts the frequencies of all possible k-mers. This representation can be thought of as a protection against noise since rare k-mers are eliminated.

Although some variants of k-mers have been introduced to speed up the process, computational complexity and memory requirements are still major problems in DNA sequencing. Considering this fact, data adaptive representations are proposed to reduce the dimensionality. k-mers is an unsupervised representation learning approach focusing on the generation of a frequency matrix [15, 16, 23]. However, a supervised representation learning approach that makes use of the class information of the sequences has the potential to generate a scalable feature set [3]. Such approaches make use of the response information for each sequence to obtain a compact representation.

This study proposes a novel supervised representation learning method, namely, sliding window-random forest (SW-RF), for categorical sequence representation and classification. The proposed method can be utilized to find representative DNA sequence elements that are predictive of a gene's expression. Moreover, it is capable of working with any biological sequence and all varieties of categorical sequences.

Essentially, SW-RF is a combination of sliding windows and a tree-based ensemble learner. Therefore, it inherits the advantages of tree learners. SW-RF has the computational efficiency advantage of tree learners and can handle large dimensional data successfully. Moreover, it does not require processing each sequence separately since tree classifiers evaluate the data as a whole. The tree learner used in this study is the well-known classification and regression trees algorithm proposed by [8]. We evaluate the performance of the proposed representation method on several categorical sequence datasets. First, we perform experiments on synthetic datasets to uncover the characteristics and evaluate the effectiveness of the approach. Then, a DNA promoter sequence dataset is used to illustrate the interpretability and accuracy of the approach on a real-world dataset. Moreover, we conduct further experiments on sequences with missing values and sequences with varying lengths to elaborate on the robustness of the approach in adverse conditions. To illustrate the generalizability of the proposed method, further experiments on protein sequences and numerical time series are also considered.

The remainder of this paper is organized as follows. Section 2 describes the framework for learning the representation. In Section 3, we illustrate the basics of the approach on a simple example. Sections 4 and 5 demonstrate the effectiveness and efficiency of our proposed approach by testing on synthetic and real-world datasets. Section 6 illustrates the generalizability of SW-RF on protein sequences and numerical time series. Section 7 provides conclusions and potential future work.

2 Sliding Window - Random Forest Method

The SW-RF method for categorical sequence representation and classification proceeds in four steps. The first step is to represent the given sequence with a sliding window and to assign the class label of the corresponding sequence to all sub-windows (i.e., subsequences). The second step is to construct a decision tree classifier based on the step-one representation. The third step is to generate a feature vector/matrix by using the frequency of each pattern that occurs on the terminal nodes of the decision tree/random forest. Finally, the last step is to train any supervised learning algorithm on this representative vector.

Consider an observation sequence set X which is an N × T matrix where N is the number of sequences and T is the length of each sequence. Then, x_{nt} ∈ {1, 2, 3, …, M} represents the observation at the t-th position in sequence n. x_{nt} is a categorical variable with M levels and y_n is the corresponding response variable. Parameter M is referred to as the alphabet size. For illustration purposes, the sequences are assumed to be of the same length, T, although our approach can handle sequences of different lengths. The observed sequences X and responses Y are defined as follows:
$$X = \left[\begin{array}{ccc} x_{11} & \dots & x_{1T} \\ \vdots & \ddots & \vdots \\ x_{N1} & \dots & x_{NT} \end{array}\right] \qquad Y = \left[\begin{array}{c} y_{1} \\ \vdots \\ y_{N} \end{array}\right]$$
In the first step, a sliding window with pre-determined window size W is shifted horizontally one position at a time over each row of X to generate all subsequences of length W. All subsequences are merged to create a new feature matrix named the RF matrix. Note that for each categorical sequence, T − W + 1 subsequences are created and added to RF. The point-by-point sliding scheme considers all possible subsequences of length W, which guarantees that the new representation does not miss any important patterns. For illustration, the representation SW_1 obtained by applying the sliding window over the first sequence in X (i.e., x_{11} … x_{1T}) and the RF_1 matrix obtained by assigning the response variable y_1 to all subsequences from the first sequence are provided below. The same procedure is repeated for each categorical sequence and the resulting matrices are combined vertically to create RF. Therefore, the combined matrix RF has N × (T − W + 1) rows and (W + 1) columns:
$$SW_{1} = \left[\begin{array}{ccc} x_{11} & \dots & x_{1W} \\ x_{12} & \dots & x_{1(W+1)} \\ \vdots & \ddots & \vdots \\ x_{1(T-W+1)} & \dots & x_{1T} \end{array}\right] \qquad RF_{1} = \left[\begin{array}{cccc} x_{11} & \dots & x_{1W} & y_{1} \\ x_{12} & \dots & x_{1(W+1)} & y_{1} \\ \vdots & \ddots & \vdots & \vdots \\ x_{1(T-W+1)} & \dots & x_{1T} & y_{1} \end{array}\right]$$

$$SW_{n} = \left[\begin{array}{ccc} x_{n1} & \dots & x_{nW} \\ x_{n2} & \dots & x_{n(W+1)} \\ \vdots & \ddots & \vdots \\ x_{n(T-W+1)} & \dots & x_{nT} \end{array}\right] \qquad RF_{n} = \left[\begin{array}{cccc} x_{n1} & \dots & x_{nW} & y_{n} \\ x_{n2} & \dots & x_{n(W+1)} & y_{n} \\ \vdots & \ddots & \vdots & \vdots \\ x_{n(T-W+1)} & \dots & x_{nT} & y_{n} \end{array}\right]$$

$$SW = \left[\begin{array}{ccc} x_{11} & \dots & x_{1W} \\ x_{12} & \dots & x_{1(W+1)} \\ \vdots & \ddots & \vdots \\ x_{1(T-W+1)} & \dots & x_{1T} \\ \vdots & \ddots & \vdots \\ x_{N1} & \dots & x_{NW} \\ x_{N2} & \dots & x_{N(W+1)} \\ \vdots & \ddots & \vdots \\ x_{N(T-W+1)} & \dots & x_{NT} \end{array}\right] \qquad RF = \left[\begin{array}{cccc} x_{11} & \dots & x_{1W} & y_{1} \\ x_{12} & \dots & x_{1(W+1)} & y_{1} \\ \vdots & \ddots & \vdots & \vdots \\ x_{1(T-W+1)} & \dots & x_{1T} & y_{1} \\ \vdots & \ddots & \vdots & \vdots \\ x_{N1} & \dots & x_{NW} & y_{N} \\ x_{N2} & \dots & x_{N(W+1)} & y_{N} \\ \vdots & \ddots & \vdots & \vdots \\ x_{N(T-W+1)} & \dots & x_{NT} & y_{N} \end{array}\right]$$

In these matrices, the columns correspond to the symbol positions s_1, …, s_W (plus the label column y in the RF matrices).
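The construction of the RF matrix can be sketched in a few lines of Python; this is an illustrative, hypothetical implementation (function and variable names are not from the paper), not the authors' R code.

```python
import numpy as np

def build_rf_matrix(X, y, W):
    """Stack all length-W sliding windows of every sequence, attaching the
    sequence label and the sequence id to each subsequence (the RF matrix)."""
    windows, labels, seq_ids = [], [], []
    for n, seq in enumerate(X):
        for t in range(len(seq) - W + 1):           # T - W + 1 windows per sequence
            windows.append(seq[t:t + W])
            labels.append(y[n])
            seq_ids.append(n)
    return np.array(windows), np.array(labels), np.array(seq_ids)

# toy example: N = 2 sequences of length T = 6 over the alphabet {a, ..., e}
X = [list("eaddba"), list("bbbcdc")]
y = [0, 1]
SW, y_sub, ids = build_rf_matrix(X, y, W=3)
print(SW.shape)                                     # (8, 3): 2 * (6 - 3 + 1) rows, W symbol columns
```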
The second step is to construct a tree-based ensemble on the combined representation from Step 1 (RF). The base learner used in this ensemble is the classification tree proposed by [8]. Classification trees recursively partition the input space by making binary splits until a termination condition is met. As a result of each split, observations are divided into two groups (nodes) according to a splitting rule. The splitting rule is determined with the aim of gathering similar observations together. In a mathematical sense, splits are selected to minimize the total impurity in the terminal (i.e., leaf) nodes. Classification trees use the Gini index as the impurity measure and select the split that maximizes the information gain in the corresponding nodes. The motivation for using an ensemble, namely, random forests, is to avoid the greedy nature of a single tree [7].

The ensemble is trained on the given RF matrix that has the response vector y as the label and the column vectors s_1, …, s_W representing the symbols as predictors. We are interested in quantifying how the subsequences generated from the categorical sequences (RF) are distributed over the terminal nodes. This is very similar to the bag-of-words (BoW) approach traditionally used in computer vision problems. By applying a sliding window, T − W + 1 subsequences are created for each categorical sequence. When a classification tree is trained on such data, the subsequences of each sequence reside in certain terminal nodes. The subsequences in the same terminal node share similar characteristics because of the recursive partitioning scheme of decision trees, and a sequence can be represented by the frequency of its subsequences at each terminal node. We then obtain the final representation where rows correspond to sequences and columns represent the terminal nodes of the decision tree. As mentioned, we fit J trees on random subsamples of features and instances as in random forests [7] to avoid potential problems with overfitting. The trees are trained in a breadth-first order and the final representation is the concatenation of the frequency vectors. We benefit from the desirable characteristics of tree-based learning (i.e., computational efficiency, missing value handling by surrogate splits, etc.) at this stage.
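A rough Python sketch of this bag-of-terminal-nodes step is given below. It is an approximation under stated assumptions: scikit-learn trees require numeric inputs and do not offer CART-style categorical subset splits or surrogate splits, so the symbols are simply ordinal-encoded here; all function names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def terminal_node_representation(SW, y_sub, ids, n_sequences,
                                 n_trees=10, max_leaves=16, seed=0):
    """Represent each original sequence by the frequency of its subsequences
    in every terminal node of a random forest trained on the RF matrix."""
    symbols = np.unique(SW)
    encoded = np.searchsorted(symbols, SW)          # ordinal encoding of the categorical symbols
    forest = RandomForestClassifier(n_estimators=n_trees, max_leaf_nodes=max_leaves,
                                    random_state=seed).fit(encoded, y_sub)
    leaf_ids = forest.apply(encoded)                # (n_subsequences, n_trees) terminal node indices
    blocks = []
    for j in range(n_trees):
        nodes = np.unique(leaf_ids[:, j])
        freq = np.zeros((n_sequences, len(nodes)))
        for col, node in enumerate(nodes):
            hit = leaf_ids[:, j] == node            # subsequences falling into this terminal node
            freq[:, col] = np.bincount(ids[hit], minlength=n_sequences)
        blocks.append(freq)
    return np.hstack(blocks)                        # sequences x terminal nodes, concatenated over trees
```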

The last step consists of applying a supervised classification method to the proposed representation. Any supervised learning algorithm can be utilized on the proposed representation. However, the extracted feature space becomes wide and sparse with an increasing number of terminal nodes. In order to facilitate interpretability with the proposed sparse representation, we train a lasso logistic regression model as the final classifier. The lasso regression model has a shrinkage effect, which imposes sparsity on the feature selection in the model. It forces some coefficients to be zero and leaves others non-zero. The predictors with non-zero coefficients are considered as having a relatively higher impact on the response variable.
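As a hedged illustration of this final step (the paper uses an R lasso implementation; the toy representation Phi and labels y_seq below are stand-ins, not real data), an L1-penalized logistic regression with cross-validated regularization could look as follows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# toy stand-ins for the learned representation: Phi (sequences x terminal nodes) and labels y_seq
rng = np.random.default_rng(0)
Phi = rng.integers(0, 5, size=(200, 12)).astype(float)
y_seq = rng.integers(0, 2, size=200)

# scikit-learn parameterizes the lasso penalty as C = 1 / lambda; cv selects it by AUC
lasso = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=10,
                             scoring="roc_auc", Cs=20).fit(Phi, y_seq)
important_nodes = np.flatnonzero(lasso.coef_.ravel())
print("terminal nodes with non-zero coefficients:", important_nodes)
```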

3 Illustration

All steps of SW-RF are described using the sample dataset depicted in Table 1. There are N = 200 sequences of length T_n = 20. The alphabet size, M, is equal to 5. There is a binary class variable y_n corresponding to each sequence. The sequences with class variable y_n = 1 are manipulated by inserting a class-indicator subsequence {a, b, c} at randomly selected positions in order to check whether the proposed model is able to capture the indicator pattern.
Table 1

First 25 categorical sequences of the sample dataset, N = 200, Tn = 20, M = 5, yn ∈{0,1}

 

Seq  x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20  y
  1   e  a  d  d  b  a  c  e  e  b   b   c   e   d   a   a   b   —   —   —  0
  2   b  b  b  c  d  c  c  a  e  c   c   a   c   a   b   c   e   b   a   e  1
  3   a  b  a  a  b  c  d  d  c  d   b   c   a   b   b   c   e   d   c   a  1
  4   b  b  e  a  b  a  b  a  a  a   d   b   b   b   b   a   b   c   a   b  1
  5   c  e  e  d  e  c  e  d  b  d   b   a   a   b   c   b   a   c   c   d  1
  6   d  b  a  d  e  d  c  d  c  d   c   c   e   c   b   e   d   c   b   a  0
  7   d  d  c  d  c  b  c  a  d  a   c   b   c   d   b   a   c   a   d   c  0
  8   b  b  e  d  e  b  d  c  c  b   c   b   b   e   a   a   d   d   a   d  0
  9   e  b  c  c  e  c  a  d  b  b   c   e   d   a   d   a   c   d   d   a  0
 10   e  e  b  e  e  b  a  c  c  b   b   a   d   a   a   e   d   e   c   a  0
 11   d  a  e  e  d  c  a  c  b  a   d   c   d   d   a   a   a   e   e   b  0
 12   d  e  c  a  e  c  c  a  a  b   c   b   a   a   d   c   c   d   e   b  1
 13   d  c  a  b  e  d  d  c  b  d   b   a   a   b   e   b   b   b   e   e  0
 14   e  a  c  e  b  e  d  c  c  a   c   e   a   b   c   b   c   d   b   b  1
 15   e  d  d  c  b  e  d  c  e  a   b   d   e   c   a   e   c   b   d   e  0
 16   d  c  c  d  e  b  d  d  e  c   d   d   e   b   e   a   b   e   c   b  0
 17   a  e  e  d  e  d  d  d  d  d   c   c   a   d   d   b   d   e   d   a  0
 18   e  a  b  a  d  a  d  d  e  d   e   d   b   d   a   c   e   a   d   e  0
 19   e  a  b  d  d  e  d  e  b  a   a   b   e   e   b   a   d   e   d   b  0
 20   a  c  d  d  e  a  a  a  d  c   e   a   c   d   b   d   a   b   a   b  0
 21   b  a  d  d  b  a  c  d  c  d   c   b   b   a   b   e   c   b   b   e  0
 22   e  c  e  a  e  e  a  e  d  a   d   a   d   a   d   c   a   e   e   a  0
 23   b  a  c  e  c  e  c  d  a  b   c   d   e   b   b   e   e   b   e   b  1
 24   c  a  b  b  c  c  e  e  a  e   d   b   b   a   b   c   e   a   b   c  1
 25   d  a  e  d  c  c  c  e  a  c   a   d   a   e   b   e   c   b   b   b  0

SW-RF begins with constructing a sliding window representation for a given W. For illustration purposes, W is chosen as 3. Then the RF_1 matrix, i.e., the matrix that contains all subsequences of length 3 of categorical sequence 1, is generated; its rows appear as the first block of Table 2.
Note that the y values of all rows in RF_1 are equal to 0, which is the label of the 1st categorical sequence in the dataset, because if a categorical sequence belongs to a certain class, all of its subsequences should belong to the same class. The RF matrix is constructed by merging the subsequence matrices of all categorical sequences, in other words, by row-wise stacking all RF_n for n in 1, …, N. Therefore, when W is equal to 3, the sliding window representation matrix RF has (20 − 3 + 1) × 200 = 3600 rows and 3 + 1 = 4 columns. Table 2 shows the RF matrix with the corresponding categorical sequence ids (SeqId) and subsequence ids (SubseqId).
Table 2

RF representation with categorical sequence and subsequence ids for the sample dataset

SubseqId  SeqId  s1  s2  s3   y
       1      1   e   a   d   0
       2      1   a   d   d   0
       3      1   d   d   b   0
       4      1   d   b   a   0
       5      1   b   a   c   0
       6      1   a   c   e   0
       7      1   c   e   e   0
       8      1   e   e   b   0
     ...    ...   .   .   .   .
    1796     19   c   a   e   0
    1797     19   a   e   e   0
    1798     19   e   e   b   0
    1799     19   e   b   c   0
    1800     19   b   c   b   0
    1801     20   b   a   d   0
    1802     20   a   d   e   0
    1803     20   d   e   e   0
    1804     20   e   e   d   0
    1805     20   e   d   d   0
     ...    ...   .   .   .   .
    3593    200   c   b   c   1
    3594    200   b   c   a   1
    3595    200   c   a   d   1
    3596    200   a   d   a   1
    3597    200   d   a   b   1
    3598    200   a   b   d   1
    3599    200   b   d   b   1
    3600    200   d   b   a   1

For the sake of easy interpretation, only one tree (i.e., J = 1) is trained on the given RF matrix that has the classification labels y as the response vector and s_1, s_2, s_3 as predictors. The resulting tree is schematized in Fig. 1. Our aim is to find a mapping for each sequence by using the terminal nodes of this decision tree, which are displayed as rectangles in Fig. 1. There are 12 terminal nodes in the tree and each of them has a unique ID, as can be seen in Fig. 1. The unique terminal node IDs are {5, 7, 8, 9, 11, 15, 16, 17, 20, 21, 22, 23}.
Fig. 1

A sample decision tree trained on representation RF

Application of the sliding window constructs (20 − 3 + 1) = 18 subsequences for each categorical sequence. Each of these subsequences is given as an observation to the decision tree. Based on the decision tree in Fig. 1, each subsequence ends up in a single terminal node. Therefore, each subsequence can be represented as a binary vector showing the presence of the subsequence in each node. A sample binary representation is shown in Table 3 with the corresponding categorical sequence ids and subsequence ids.
Table 3

Binary representations of the subsequences with respect to their presence on the terminal nodes

  

                        Terminal nodes
SubseqID  SeqID    5  7  8  9  11  15  16  17  20  21  22  23
       1      1    1  0  0  0   0   0   0   0   0   0   0   0
       2      1    0  0  0  0   0   0   0   0   0   0   0   1
       3      1    0  0  0  0   0   0   0   0   0   0   0   1
       4      1    0  0  0  0   1   0   0   0   0   0   0   0
     ...    ...    .  .  .  .   .   .   .   .   .   .   .   .
    3599    200    0  0  0  0   0   0   0   0   0   0   0   1
    3600    200    0  0  0  0   1   0   0   0   0   0   0   0

SW-RF requires a single representation vector for each categorical sequence. A frequency vector is therefore constructed for each categorical sequence as the frequency of its subsequences residing at each terminal node. This is realized simply by summing the binary vectors over their sequence ids (SeqId). The resulting representation contains a single frequency vector for each categorical sequence. Table 4 demonstrates the final representation for the sample data. Each column in the final representation forms a feature and is denoted φ_q, where q indexes the terminal node IDs; p is the number of terminal nodes, which is equal to 12 in this example.
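With the binary indicators of Table 3 held in a data frame, this aggregation is a one-line group-by sum; the small excerpt below is hypothetical and only illustrates the operation.

```python
import pandas as pd

# one row per subsequence, a SeqID column, and one 0/1 column per terminal node (toy excerpt)
binary_df = pd.DataFrame({
    "SeqID":   [1, 1, 1, 2, 2, 2],
    "node_5":  [1, 0, 0, 0, 1, 0],
    "node_23": [0, 1, 1, 0, 0, 1],
})
final_repr = binary_df.groupby("SeqID").sum()   # frequency of subsequences per terminal node
print(final_repr)
```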
Table 4

Final representation table for the sample dataset

 

                     Terminal nodes
SeqID    5  7  8  9  11  15  16  17  20  21  22  23   Class
    1    1  0  1  1   3   0   3   1   1   0   1   6   0
    2    2  0  2  0   3   1   1   4   1   0   1   3   1
    3    0  0  2  1   2   0   1   3   1   0   3   5   1
    4    1  0  0  1   2   0   0   8   1   0   3   2   1
    5    0  2  0  1   1   0   1   3   1   1   1   7   1
    6    2  1  0  2   4   0   0   1   0   0   0   8   0
    7    2  0  1  1   3   2   1   1   0   0   2   5   0
    8    3  0  0  2   2   1   0   3   0   0   0   7   0
    9    1  1  1  2   4   0   2   0   0   0   0   7   0
   10    2  0  0  1   3   0   1   3   0   1   0   7   0
   11    1  0  0  3   2   1   0   1   0   0   2   8   0
   12    1  1  0  1   3   0   0   5   1   0   1   5   1
   13    2  0  0  2   4   0   0   3   0   0   1   6   0
   14    1  0  1  2   4   2   0   2   1   0   0   5   1
   15    3  0  0  2   4   0   0   0   0   0   0   9   0
  ...    .  .  .  .   .   .   .   .   .   .   .   .   .

Note that the row sums in Table 4 equal 18 because 18 subsequences are created for each sequence when the sliding window is applied. For instance, sequence 1 has no subsequence that fits the rule set of terminal node 7. However, it has 3 subsequences that fit the rule set of terminal node 16. Table 5 provides the rule sets of terminal nodes 7 and 16 in detail. The symbols s1, s2, and s3 shown in the rule sets represent the first, second, and third columns (observations) in the sliding window representation provided in Table 2.
Table 5

Rule set of terminal node 7 and terminal node 16

Terminal Node 7

1) s2 == {a, b, c}

 2) s3 == {d, e}

  3) s1 == {b, c, e}

   4) s2 == {c}

    6) s1 == {c, e}

7)* weights = 89

Terminal Node 16

1) s2 == {a, b, c}

 2) s3 == {a, b, c}

  10) s1 == {a, b, c}

   12) s1 == {b, c}

    13) s3 == {c}

     14) s1 == {b}

      16)* weights = 94

After obtaining the final representation (Table 4) for each categorical sequence, a lasso regression model is trained to predict the binary class variable. The regularization parameter λ is selected internally using 10-fold cross-validation, and the λ providing the best likelihood is selected for model training. We consider the area under the ROC curve (AUC) as the evaluation metric in our calculations. AUC is a well-known and widely used metric for the comparison of binary classification algorithms. AUC is known to be a better measure than accuracy empirically in terms of consistency and discriminancy as discussed by [13].

With the proposed representation, each feature refers to a certain rule that describes a pattern; features with non-zero coefficients are interesting in the sense that they imply important motifs related to the class. The fitted model is as follows:
$$y \sim 0.6109(\varphi_{20})-0.0379 (\varphi_{23}) $$
The model indicates that terminal node 20 is significant for predicting the binary class variable with a positive sign and terminal node 23 is significant with a negative sign. When we look at the rule sets of node 20 and node 23 in detail, as in Table 6, we can see that node 20 has a rule set that matches our inserted indicator subsequence {a, b, c}, whereas node 23 captures the opposite pattern. Moreover, Table 4 shows that categorical sequences from class 1 tend to include subsequences represented by terminal node 20, whereas categorical sequences from class 0 tend to include subsequences represented by terminal node 23. The average cross-validation AUC for the best lambda parameter is close to 1, as illustrated in Fig. 2. This also confirms the reliability of the model in terms of accuracy.
Table 6

Rule set of terminal node 20 and terminal node 23

Terminal Node 20

1) s2 == {a, b, c}

 2) s3 == {a, b, c}

  10) s1 == {a, b, c}

   12) s1 == {a}

    18) s3 == {c}

     19) s2 == {b}

      20)* weights = 110

Terminal Node 23

1) s2 == {d, e}

 23)* weights = 1319

Fig. 2

Plot of AUC with varying lambda values. The dashed line on the left represents the lambda value providing the maximum average AUC, whereas the one on the right is the lambda value which gives the most regularized model such that performance is within one standard error of the maximum average AUC

4 Experimental Setup

The SW-RF method is tested on both synthetic datasets and a real DNA promoter sequence dataset. We compare SW-RF to state-of-the-art methods. Additionally, the performance of SW-RF is evaluated on modified synthetic datasets with varying lengths and missing values to illustrate the robustness of the approach in adverse conditions. SW-RF is implemented in the R programming language and experimentation is performed on a Windows 10 × 64 system with an Intel Core i5-3337U CPU @ 1.60 GHz × 4 and 6 GB RAM. Although the CPU can handle four threads in parallel, only a single thread is used in the experiments.

4.1 Synthetic Dataset

A collection of synthetic datasets with varying problem characteristics is generated to characterize the behavior of the proposed approach in terms of accuracy and computational efficiency. Two HMM models are used to generate categorical sequences from the two classes. In order to observe the performance of the algorithms as a function of the problem parameters, datasets are generated with varying numbers of sequences, sequence lengths, and alphabet sizes. All combinations of the parameters summarized in Table 7 are applied with five replications; therefore, in total 120 datasets are generated with the respective parameters. A minimal sketch of this generation scheme is given below.
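The exact HMM parameters used for data generation are not reported in the paper; the Python sketch below only illustrates the idea with two hypothetical HMMs that share a transition matrix but differ in their emission probabilities, one per class.

```python
import numpy as np

def sample_hmm(T, start, trans, emit, rng):
    """Sample one categorical sequence of length T from a simple HMM."""
    n_states, n_symbols = emit.shape
    state = rng.choice(n_states, p=start)
    seq = []
    for _ in range(T):
        seq.append(rng.choice(n_symbols, p=emit[state]))    # emit a symbol from the current state
        state = rng.choice(n_states, p=trans[state])        # move to the next hidden state
    return seq

rng = np.random.default_rng(0)
M = 5                                                       # alphabet size
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
emit_class0 = np.full((2, M), 1 / M)                        # class 0: uniform emissions
emit_class1 = np.array([[0.4, 0.3, 0.1, 0.1, 0.1],          # class 1: skewed emissions
                        [0.1, 0.1, 0.1, 0.3, 0.4]])
X = [sample_hmm(100, start, trans, emit_class0, rng) for _ in range(50)] + \
    [sample_hmm(100, start, trans, emit_class1, rng) for _ in range(50)]
y = [0] * 50 + [1] * 50
```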
Table 7

Parameter settings used to generate the synthetic datasets

Parameter                          Values
The number of sequences (N)        50, 100, 250
The length of sequence n (Tn)      50, 100, 200, 500
Alphabet size (M)                  5, 10

4.2 DNA Sequence Promoter Dataset

A DNA promoter sequence dataset is obtained from eight human cell lines: H1-hESC, GM12878, HMEC, HSMM, HUVEC, K562, NHEK, and NHLF. The karyotype of K562 is cancer and the others have a normal karyotype. Table 8 provides details about the characteristics of the cell lines. Naturally, there are 23 chromosome pairs in each cell, of which 22 are autosomes and the 23rd pair determines the sex. The letters of each chromosome of each cell line are from the DNA alphabet A, C, G, and T. Each chromosome has a different number of genes, but every gene of each chromosome has the same sequence length, equal to 2100.
Table 8

Cell lines used as the source of experimental material

Cell       Tier   Lineage           Tissue                Karyotype   Sex
K562       1      Mesoderm          Blood                 Cancer      F
GM12878    1      Mesoderm          Blood                 Normal      F
NHEK       3      Ectoderm          Skin                  Normal      U
NHLF       3      Endoderm          Lung                  Normal      U
HUVEC      2      Mesoderm          Blood vessel          Normal      U
HSMM       3      Mesoderm          Muscle                Normal      U
H1-hESC    1      Inner cell mass   Embryonic stem cell   Normal      M

Moreover, the response variable for each gene of each chromosome is included in the dataset. The response variable is gene expression, which refers to the complex process by which the information encoded in DNA is converted into a functional product. It is a binary variable: if the genetic information stored in a gene is used to produce a substance, the gene expression becomes "1", otherwise it is "0". To sum up, each row in the dataset corresponds to one gene and for each gene sequence there is a corresponding binary gene expression variable. Using the DNA sequence promoter dataset, the aim is to classify whether a specific gene has a gene expression of "1" or "0".

5 Algorithm Evaluation

We compare SW-RF with the k-mers algorithm and HMM models with different parameter settings. For each method, the related representation is obtained by applying the given method. All representation methods have method-related parameters such as the number of hidden states for HMM, the value of k for k-mers, and the window size W for SW-RF. The parameters of each method are chosen by 10-fold cross-validation from the parameter values provided in Table 9. After the representations are learned, a lasso logistic regression is trained where λ values are selected by 10-fold cross-validation. Unless otherwise stated, the number of trees J is set to 100. Random forests are known to provide robust results if the number of trees is set large enough [7] and we have identified that predictive performance stabilizes around 60 trees based on our preliminary experiments.
Table 9

Set of parameters considered in 10-fold cross-validation

Method   Parameter                               Values
HMM      Number of hidden states (H)             2, 4, 6
k-mers   Length of words (k)                     3, 4, 5
SW-RF    Window size (W)                         3, 4, 5
SW-RF    Maximum number of terminal nodes (R)    36, 64, 128

The state-of-the-art methods are compared in terms of their theoretical complexity as well as classification performance (AUC) and computational times. The time complexity of the sliding window algorithm is O(T_n · W), where T_n is the length of the sequence and W is the chosen window size. The time complexity of the k-mers representation is O(V), where V stands for the number of distinct k-mers and equals M^k, with M the alphabet size and k the length of the subsequences [5]. The space complexity of the k-mers representation is O(V · k), and it also increases exponentially with k. The HMM representation, trained with the Baum-Welch inference algorithm, has a time complexity of O(H² · T_n) per iteration, where H is the number of hidden states and T_n is the length of the sequence.
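A back-of-the-envelope comparison of the feature-space sizes handed to the final classifier illustrates why this matters; the numbers below use example settings (a protein alphabet and ensemble sizes comparable to Table 9) and are not results from the paper.

```python
# k-mers produces M^k columns, SW-RF at most J * R columns (J trees, R leaves per tree)
M, k = 20, 5          # e.g. protein alphabet with words of length 5
J, R = 100, 128       # example ensemble settings
print("k-mers features:", M ** k)   # 3,200,000 columns, grows exponentially in k
print("SW-RF features :", J * R)    # 12,800 columns, independent of M and k
```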

5.1 Comparative Performance on Synthetic Dataset

According to the AUC results reported in Table 10, the proposed method provides significantly better results than HMM and k-mers for each parameter combination of the synthetic dataset. Computational time analysis is performed for two metrics, namely, the preparation time of the competitor HMM and k-mers representations and of the proposed SW-RF representation, and the training and testing time of the classification process on each representation. Each algorithm is implemented in R and run on a Windows 10 × 64 system with an Intel Core i5-3337U CPU @ 1.60 GHz × 4 and 6 GB RAM.
Table 10

AUC values on synthetic data set with different length of sequence, number of instances and alphabet size against competitor approaches

  

                                   Alphabet size = 5              Alphabet size = 10
Length of     Number of       HMM     k-mers   SW-RF          HMM     k-mers   SW-RF
sequence      instances
 50            50             0.542   0.553    0.647          0.497   0.510    0.558
 50           100             0.630   0.610    0.747          0.500   0.547    0.719
 50           250             0.676   0.660    0.766          0.503   0.594    0.778
100            50             0.614   0.612    0.848          0.507   0.548    0.694
100           100             0.683   0.671    0.835          0.499   0.575    0.817
100           250             0.727   0.768    0.854          0.496   0.665    0.874
200            50             0.618   0.718    0.932          0.495   0.608    0.903
200           100             0.741   0.820    0.940          0.510   0.687    0.932
200           250             0.803   0.875    0.946          0.502   0.770    0.952
500            50             0.663   0.854    0.991          0.495   0.742    0.991
500           100             0.835   0.934    0.992          0.506   0.820    0.995
500           250             0.865   0.977    0.994          0.499   0.891    0.997
Grand average                 0.700   0.754    0.874          0.501   0.663    0.851

All entries are average AUC values.

The computational time required for SW-RF and the competitor approaches is provided in Table 11. The time required for HMM does not change significantly with changing problem parameters. As the alphabet size increases, the computational time for k-mers increases significantly with the increase in the representation size. The computational requirements of SW-RF are not affected significantly by changes in the problem parameters. The computational performance of the SW-RF representation is also examined under different parameter settings to make the characteristics of the proposed algorithm clear. The parameter settings of interest are the window size for the sliding window, the maximum number of terminal nodes, and the number of trees for the random forest. Figure 3 illustrates the results of the sensitivity analyses. Computation times increase with the increase in the window size from 10 to 25. The maximum number of terminal nodes causes a sharp increase in training and testing time, but only a smooth increase in representation preparation time. An increasing number of decision trees causes an increase in representation preparation time. The empirical behavior is consistent with the theoretical complexity of SW-RF.
Table 11

Computational performances of HMM, k-mers and SW-RF on synthetic data set with changing sequence length and number of instances

  

Alphabet size = 5

Length of   Number of     HMM           HMM            k-mers        k-mers         SW-RF         SW-RF
sequence    instances     Preparation   Train & test   Preparation   Train & test   Preparation   Train & test
 50          50           0.048         0.695           0.987         0.550          0.278         0.487
 50         100           0.053         0.860           1.886         1.273          0.532         1.714
 50         250           0.070         0.609           4.648         4.156          1.172         8.497
100          50           0.093         0.880           1.814         0.560          0.446         0.555
100         100           0.096         1.073           3.61          1.247          0.866         1.803
100         250           0.121         0.671           9.006         6.960          1.953         15.476
200          50           0.174         0.953           3.528         0.573          0.803         0.64
200         100           0.185         2.132           7.014         1.160          1.467         1.636
200         250           0.221         0.822          17.460         4.956          3.66          10.274
500          50           0.426         0.847           8.568         0.573          1.843         0.742
500         100           0.457         1.684          17.287         1.133          3.470         1.616
500         250           0.539         1.229          43.218         3.320          9.612         5.203
Grand average             0.207         1.038           9.919         2.205          2.175         4.053

Alphabet size = 10

Length of   Number of     HMM           HMM            k-mers        k-mers         SW-RF         SW-RF
sequence    instances     Preparation   Train & test   Preparation   Train & test   Preparation   Train & test
 50          50           0.047         0.918           1.031         0.966          0.402         0.463
 50         100           0.055         2.733           1.979         2.810          0.633         1.110
 50         250           0.068         0.678           4.929        11.665          1.289         5.309
100          50           0.092         0.884           1.893         1.436          0.565         0.547
100         100           0.099         3.385           3.771         3.914          0.974         1.374
100         250           0.123         0.717           9.271        14.625          2.083         5.908
200          50           0.180         1.221           3.619         1.972          0.899         0.625
200         100           0.187         3.141           7.164         5.070          1.593         1.572
200         250           0.226         0.742          18.109        16.900          3.829         5.665
500          50           0.4379        1.102           8.895         2.780          1.925         0.717
500         100           0.460         3.520          17.669         6.411          3.544         1.471
500         250           0.551         0.852          44.773        19.546          9.673         4.385
Grand average             0.210         1.658          10.258         7.341          2.284         2.429

Fig. 3

Empirical complexity of the SW-RF representation with respect to window size, maximum number of terminal nodes, and number of decision trees, respectively

5.2 Performance on Data with Varying Lengths

The previous section indicates that SW-RF performs well on datasets consisting of sequences with constant length. SW-RF is also capable of handling sequences with variable lengths without losing its predictive accuracy. SW-RF generates a sliding window representation of the sequences in the dataset and this representation is given to a random forest classifier as an input. Therefore, each sequence is transformed into a collection of subsequences of equal length and provided to the random forest classifier. This allows SW-RF to handle sequences with varying lengths. Table 12 provides the performance of SW-RF on synthetic data generated with different lengths. The length of the sequences in each dataset is selected uniformly from the interval provided. Also, noise is introduced to the data to make the task harder for SW-RF. From Table 12, it can be observed that SW-RF works with high accuracy on the data with varying lengths, without a significant performance loss.
Table 12

AUC values on synthetic data with varying lengths

  

                                     Alphabet size
Length of        Number of          10        20
sequence         instances         AUC       AUC
10–100           200               0.97      0.97
20–200           200               0.92      0.97
50–500           200               0.92      0.92
100–1000         200               0.75      0.84
Average performance                0.89      0.92

5.3 Performance on Data with Missing Values

SW-RF utilizes tree-based methods to obtain the new representation. Decision trees can handle missing values using surrogate splits without any imputation [8]. Therefore, SW-RF is also capable of working with missing data. For illustration, new datasets are created by removing 5, 10, 25, and 50 percent of the observations from the synthetic dataset of Table 12 with sequence length 10-100, alphabet size 10, and 200 instances. Table 13 provides the performance of SW-RF on these datasets. These results clearly illustrate that SW-RF can perform with missing values. Naturally, the performance decreases as the missing ratio increases. However, SW-RF still performs well with missing values.
Table 13

AUC values on synthetic data with changing missing proportion

Missing ratio   AUC
0               0.970
0.05            0.902
0.1             0.886
0.25            0.782
0.5             0.686

5.4 Comparative Performance on DNA Sequence Promoter Dataset

The proposed SW-RF representation and the competitor HMM and k-mers representation methods are trained on the DNA promoter sequence dataset for each cell line separately. Moreover, for each cell line, 10-fold cross-validation is applied with five replications. Therefore, the presented results are averages over the individual results of the separate replications. We perform cross-validation to determine the best parameter settings using the parameter levels in Table 9. Then, the obtained parameter settings are used on the test part of the dataset. The AUC values of both steps are recorded in order to see whether there is a significant difference between the AUC value of the cross-validation step and the AUC value of the test step; the aim of this comparison is to show that SW-RF does not suffer from overfitting. When the cross-validation AUC values are compared to the test AUC values, there is no significant difference that would indicate overfitting. Consequently, the results in Table 14 indicate that our proposed SW-RF representation method outperforms the HMM and k-mers representations for each cell line. While the HMM representation provides the worst results in terms of AUC, the k-mers representation provides moderate results that are close to the results of the HMM representation. In order to evaluate whether the difference between the means of the representation methods is significant or not, a t test is conducted at the 0.05 significance level. The results show that the mean differences between SW-RF and the HMM representation, and between SW-RF and the k-mers representation, are statistically significant (Table 14).
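The per-replication AUC values behind this test are not listed in the paper, but the flavor of the comparison can be reproduced on the average test AUC values of Table 14 with a paired t-test; this is only an illustrative sketch, not the authors' exact test.

```python
from scipy import stats

# average test AUC per cell line, copied from Table 14
sw_rf = [0.766, 0.766, 0.736, 0.717, 0.774, 0.798, 0.779, 0.736]
kmers = [0.612, 0.621, 0.614, 0.603, 0.624, 0.636, 0.634, 0.606]
t, p = stats.ttest_rel(sw_rf, kmers)        # paired over the eight cell lines
print(f"t = {t:.2f}, p = {p:.4g}")          # p far below the 0.05 level
```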
Table 14

AUC values of HMM, k-mers and SW-RF on DNA sequence promoter data set

 

                     HMM                     k-mers                  SW-RF
Cell            Avg cvAUC  Avg testAUC   Avg cvAUC  Avg testAUC   Avg cvAUC  Avg testAUC
GM12878         0.565      0.567         0.612      0.612         0.770      0.766
H1hESC          0.603      0.603         0.620      0.621         0.770      0.766
HMEC            0.615      0.616         0.614      0.614         0.740      0.736
HSMM            0.602      0.603         0.603      0.603         0.720      0.717
HUVEC           0.601      0.602         0.624      0.624         0.777      0.774
K562            0.578      0.580         0.636      0.636         0.800      0.798
NHEK            0.595      0.597         0.634      0.634         0.782      0.779
NHLF            0.588      0.588         0.606      0.606         0.739      0.736
Grand average   0.593      0.594         0.619      0.619         0.765      0.758

It is also possible to identify the subsequences that have non-zero coefficients, as illustrated in Section 3. For terminal nodes with non-zero coefficients, the distribution of the nucleotides is examined, since detecting sequences that have an effect on the gene expression level is believed to be beneficial for biological research. A similar study was previously conducted on the same dataset by Zou [26], who proposed a transfer learning approach based on k-mers to predict the same response. These biological sequences are so-called motifs and can be represented with a position-specific scoring matrix (PSSM) [19]. Figure 4 provides a motif logo example for the cell line GM12878 for W = 5. The PSSM represents the distribution of the nucleotides for the corresponding terminal node of the tree. These results are consistent with the discussion that GC-rich (guanine-cytosine rich) DNA elements are important transcriptional regulatory elements in the promoter, enhancer, and locus control regions of many eukaryotic genes from several species [10, 21].
Fig. 4

Example of a motif logo for cell line GM12878, W = 5. The first column refers to the terminal node id; the second column is the corresponding coefficient

6 Evaluation of SW-RF on different applications

In the previous sections, the success of our method is demonstrated on synthetic datasets as well as real DNA sequence data. However, the application domain of the SW-RF method extends well beyond these applications. The proposed method is suitable for any kind of categorical sequence dataset and it can even be used to classify numerical sequences with some pre-processing. In addition, the applications of SW-RF to biological sequences are not limited to DNA sequence data; it can be applied to any biological sequence such as protein, DNA, or RNA. In order to demonstrate this generalizability, SW-RF is applied to both protein classification and numerical time series classification tasks.

6.1 Protein classification

Classification of different protein families allows understanding the functional roles and structures of the proteins. In this experiment, SW-RF is applied to a protein dataset to determine the family of each protein sequence. The protein dataset is a sample of 2112 sequences taken from www.uniprot.org, where each sequence is drawn from an alphabet of 20 amino acids. Each sequence has one of two functions: "Might take part in the signal recognition particle (SRP) pathway" and "Binds to DNA and alters its conformation". The functions are treated as the sequence labels. Experiments with 5-fold cross-validation show that SW-RF with various window sizes W yields an AUC of 1. For demonstration purposes, SW-RF with W = 5 is applied to the protein dataset and motif logo examples of important terminal nodes are demonstrated in Fig. 5. For example, the "YYDD" subsequence is a signature subsequence for the "Might take part in the signal recognition particle (SRP) pathway" function. "Y" refers to the amino acid tyrosine, which is known to occur in proteins that are part of signal transduction processes [20].
Fig. 5

Example of a motif logo for the protein data, W = 5

6.2 Numerical time series classification

The SW-RF method is essentially suggested for categorical sequences; however, it is possible to apply the proposed method to any numerical time series with additional preprocessing. As an illustration, SW-RF is applied to 3 ECG datasets (ECG200, ECGFiveDays, TwoLeadECG) from [2]. The ECG200 dataset contains two classes of time series, normal heartbeat and myocardial infarction, and the aim is to distinguish heartbeats with myocardial infarction from normal healthy heartbeats. The ECGFiveDays dataset contains ECG recordings of a 67-year-old male recorded on two different dates, with the aim of distinguishing the two types of recordings. TwoLeadECG is an ECG dataset that contains two types of signals, and the task is to classify signal 0 and signal 1.

Pre-processing requires discretization of the continuous time series, which is realized by applying symbolic aggregate approximation (SAX) [12]. For a given alphabet size and dimensionality parameter, SAX divides the series into intervals and represents each interval with its mean. The alphabet size determines the number of distinct mean levels to be taken into account and the dimensionality parameter decides the number of observations to be considered for discretization. All ECG datasets are discretized by applying SAX with alphabet size α and dimensionality parameter β. Figure 6 contains an illustration of the discretization process on a sample time series from the ECG200 dataset. The discretized time series are expressed by sequences of categorical variables and thus form categorical time series. SW-RF is then applied to the categorical time series for classification. Table 15 provides a comparison of the classification accuracies of SW-RF against Euclidean distance, dynamic time warping (DTW), and derivative dynamic time warping (DDTW) followed by a 1-nearest neighbor classifier. DTW is considered one of the most well-known benchmarks in time series classification. DTW measures the distance between two time series with varying temporal patterns and shifts by calculating the optimal match using dynamic programming. DDTW is suggested as a variant of DTW which calculates the distance between two time series as a weighted combination of the DTW distance between the time series and the DTW distance between their first order differences. Although there is a loss of information due to discretization, SW-RF yields a classification accuracy that is comparable to the state-of-the-art approaches.
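A minimal SAX discretization, assuming z-normalization, piecewise aggregate approximation, and Gaussian breakpoints as described in [12], can be sketched as follows; the function name and the synthetic input series are hypothetical, not taken from the paper's experiments.

```python
import numpy as np
from scipy.stats import norm

def sax(series, alphabet_size=4, n_segments=16):
    """z-normalise, average the series over n_segments pieces (PAA),
    then map every piece mean to a symbol via Gaussian breakpoints."""
    x = (series - np.mean(series)) / np.std(series)
    paa = np.array([seg.mean() for seg in np.array_split(x, n_segments)])
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])  # e.g. [-0.67, 0, 0.67]
    return np.searchsorted(breakpoints, paa)        # integer symbols 0 .. alphabet_size - 1

ecg_like = np.sin(np.linspace(0, 4 * np.pi, 128)) + 0.1 * np.random.randn(128)
print(sax(ecg_like))                                 # length-16 categorical sequence fed to SW-RF
```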
Fig. 6

Sample SAX representation for an alphabet size of four and a desired dimensionality of 16. The time series length is 128. Horizontal grids differentiate the mean levels implied by each symbol and red boxes visualize the symbol assignments

Table 15

Comparison of SW-RF with benchmark methods on 3 ECG datasets

 

              SW-RF    Euclidean 1NN   DTW_R1 1NN   DTW_Rn 1NN   DDTW_R1 1NN   DDTW_Rn 1NN
ECG200        0.830    0.880           0.770        0.880        0.830         0.890*
ECGFiveDays   0.810*   0.797           0.768        0.797        0.687         0.696
TwoLeadECG    0.923    0.747           0.905        0.868        0.995*        0.982

The accuracy of the best performing approach on each dataset is marked with an asterisk.

7 Conclusion and Future Work

This study proposes a novel data adaptive representation approach for categorical sequential data. The proposed SW-RF method offers a two-step processing of categorical sequences to obtain a meaningful representation. Applying a tree-based ensemble to a simple sliding window representation of the sequential data and summarizing each sequence as a frequency distribution of its subsequences over the terminal nodes provide some advantageous properties to SW-RF compared to state-of-the-art representation methods. It gives relatively better classification accuracy as the number of observation symbols increases. Furthermore, the SW-RF representation can learn longer words depending on the setting of the maximum number of terminal nodes of the decision trees. Our experiments show that the proposed approach provides significantly better results in terms of accuracy on both synthetic data and DNA promoter sequence data. We have shown that important motifs related to gene expression can be identified when the representation is used for training a lasso logistic regression. The success of SW-RF is also illustrated on alternative applications such as protein sequence and numerical time series classification. Although not considered in this study, the proposed approach can be extended to regression problems in a straightforward manner.

Notes

Funding Information

This research is supported by Air Force Office of Scientific Research Grant FA9550-17-1-0138.

Compliance with Ethical Standards

Conflict of Interest

The authors declare that they have no conflict of interest.

References

  1. Bacardit J, Stout M, Hirst JD, Valencia A, Smith RE, Krasnogor N (2009) Automated alphabet reduction for protein datasets. BMC Bioinf 10(1):6
  2. Bagnall A, Lines J, Bostrom A, Large J, Keogh E (2017) The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Min Knowl Disc 31(3):606–660
  3. Baydogan MG, Runger G (2015) Learning a symbolic representation for multivariate time series classification. Data Min Knowl Disc 29(2):400–422
  4. Beer MA, Tavazoie S (2004) Predicting gene expression from sequence. Cell 117(2):185–198
  5. Benoit G, Peterlongo P, Mariadassou M, Drezen E, Schbath S, Lavenier D, Lemaitre C (2016) Multiple comparative metagenomics using multiset k-mer counting. PeerJ Comput Sci 2:e94
  6. Blasiak S, Rangwala H (2011) A hidden Markov model variant for sequence classification. In: IJCAI proceedings-international joint conference on artificial intelligence, vol 22, p 1192
  7. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
  8. Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC Press
  9. Brown PF, deSouza PV, Mercer RL, Pietra VJD, Lai JC (1992) Class-based n-gram models of natural language. Comput Linguist 18(4):467–479
  10. Hapgood JP, Riedemann J, Scherer SD (2001) Regulation of gene expression by GC-rich DNA cis-elements. Cell Biol Int 25(1):17–31
  11. Kuksa P, Pavlovic V (2009) Efficient alignment-free DNA barcode analytics. BMC Bioinforma 10(14):S9
  12. Lin J, Keogh E, Wei L, Lonardi S (2007) Experiencing SAX: a novel symbolic representation of time series. Data Min Knowl Disc 15:107–144
  13. Ling CX, Huang J, Zhang H (2003) AUC: a better measure than accuracy in comparing learning algorithms. In: Conference of the Canadian society for computational studies of intelligence, Springer, pp 329–341
  14. MacNeil LT, Walhout AJ (2011) Gene regulatory networks and the role of robustness and stochasticity in the control of gene expression. Genome Res 21(5):645–657
  15. Meher PK, Sahu TK, Rao A (2016) Identification of species based on DNA barcode using k-mer feature vector and random forest classifier. Gene 592(2):316–324
  16. Ounit R, Wanamaker S, Close TJ, Lonardi S (2015) CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 16(1):236
  17. Phillips KA, Trosman JR, Kelley RK, Pletcher MJ, Douglas MP, Weldon CB (2014) Genomic sequencing: assessing the health care system, policy, and big-data implications. Health Aff 33(7):1246–1253
  18. Richter C, Luboschik M, Röhlig M, Schumann H (2015) Sequencing of categorical time series. In: 2015 IEEE conference on visual analytics science and technology (VAST), IEEE, pp 213–214
  19. Stormo GD, Schneider TD, Gold L, Ehrenfeucht A (1982) Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res 10(9):2997–3011
  20. Ullrich A, Schlessinger J (1990) Signal transduction by receptors with tyrosine kinase activity. Cell 61(2):203–212
  21. Vinogradov AE (2003) DNA helix: the importance of being GC-rich. Nucleic Acids Res 31(7):1838–1844
  22. Weiss GM, Hirsh H (1998) Learning to predict rare events in categorical time-series data. In: Proceedings of the AAAI/ICML workshop on time-series analysis, Madison, Wisconsin
  23. Wood DE, Salzberg SL (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15(3):R46
  24. Xing Z, Pei J, Keogh E (2010) A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter 12(1):40–48
  25. Zissman MA, Singer E (1994) Automatic language identification of telephone speech messages using phoneme recognition and n-gram modeling. In: IEEE international conference on acoustics, speech and signal processing (ICASSP-94), vol 1, pp 305–308
  26. Zou N (2015) A probabilistic framework of transfer learning: theory and application. Arizona State University

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Industrial Engineering, Boğaziçi University, İstanbul, Turkey
  2. Texas A&M University, College Station, USA
  3. Arizona State University, Tempe, USA
