A Data Adaptive Biological Sequence Representation for Supervised Learning
Proper expression of genes plays a vital role in the function of an organism. Recent advancements in DNA microarray technology allow the expression levels of thousands of genes to be monitored. One of the important tasks in this context is to understand the underlying mechanisms of gene regulation. Recently, researchers have focused on identifying local DNA elements, or motifs, to infer the relation between the expression and the nucleotide sequence of a gene. This study proposes a novel data-adaptive representation approach for supervised learning to predict the response associated with biological sequences. Biological sequences such as DNA and protein are a class of categorical sequences. In machine learning, categorical sequences are generally mapped to a lower-dimensional representation for learning tasks to avoid problems with high dimensionality. The proposed method, namely SW-RF (sliding window-random forest), is a feature-based approach that learns a representation for categorical sequences in two main steps. In the first step, each sequence is represented by overlapping subsequences of constant length. In the second step, a tree-based learner is trained on this representation to obtain a bag-of-words-like representation: for each sequence, the frequency with which its subsequences fall on the terminal nodes of the tree. After representation learning, any classifier can be trained on the learned representation. Here, a lasso logistic regression is trained on the learned representation to facilitate the identification of important patterns for the classification task. Our experiments show that the proposed approach provides significantly better results in terms of accuracy on both synthetic data and DNA promoter sequence data. Moreover, a common problem for microarray datasets, namely missing values, is handled efficiently by the tree learners in SW-RF.
Although the focus of this paper is on biological sequences, SW-RF is flexible in handling any categorical sequence data from different applications.
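The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn and NumPy, encodes nucleotides as integer codes (a simplification, since scikit-learn trees treat such codes as ordered), labels every window with the class of its parent sequence, and uses hypothetical helper names (`sliding_windows`, `fit_window_forest`, `leaf_frequencies`, `make_toy_data`) and arbitrary hyperparameters chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

ALPHABET = "ACGT"  # assumed alphabet; a protein alphabet would work the same way


def sliding_windows(seq, width):
    """Step 1: all overlapping subsequences of length `width`, as integer codes."""
    codes = [ALPHABET.index(ch) for ch in seq]
    return np.array([codes[i:i + width] for i in range(len(codes) - width + 1)])


def fit_window_forest(seqs, labels, width, n_trees=20, seed=0):
    """Train a random forest on windows; each window inherits its sequence label."""
    windows = [sliding_windows(s, width) for s in seqs]
    X = np.vstack(windows)
    y = np.repeat(labels, [len(w) for w in windows])
    forest = RandomForestClassifier(
        n_estimators=n_trees, max_leaf_nodes=8, random_state=seed
    )
    return forest.fit(X, y)


def leaf_frequencies(forest, seqs, width):
    """Step 2: per sequence, count how many windows land on each tree node.

    Only terminal nodes receive counts; internal nodes stay at zero.
    """
    rows = []
    for s in seqs:
        leaves = forest.apply(sliding_windows(s, width))  # (n_windows, n_trees)
        counts = [
            np.bincount(leaves[:, t], minlength=est.tree_.node_count)
            for t, est in enumerate(forest.estimators_)
        ]
        rows.append(np.concatenate(counts))
    return np.array(rows, dtype=float)


def make_toy_data(n=40, length=30, motif="TATAAT", seed=0):
    """Random DNA strings; a motif is planted in every odd-indexed sequence."""
    rng = np.random.default_rng(seed)
    seqs, labels = [], []
    for i in range(n):
        s = "".join(rng.choice(list(ALPHABET), size=length))
        if i % 2 == 1:
            pos = int(rng.integers(0, length - len(motif) + 1))
            s = s[:pos] + motif + s[pos + len(motif):]
        seqs.append(s)
        labels.append(i % 2)
    return seqs, np.array(labels)


if __name__ == "__main__":
    seqs, labels = make_toy_data()
    forest = fit_window_forest(seqs, labels, width=6)
    features = leaf_frequencies(forest, seqs, width=6)
    # L1-penalised (lasso) logistic regression on the learned representation
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(features, labels)
    print("non-zero coefficients:", int(np.count_nonzero(clf.coef_)))
```

In this sketch, the non-zero lasso coefficients point to tree nodes, and the split rules leading to those nodes indicate which window positions and symbols are associated with the class, which is one way the identification of important patterns could be approached.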
Keywords: Gene expression, Biological sequences, Time series, Categorical, Classification, Representation learning
This research is supported by Air Force Office of Scientific Research Grant FA9550-17-1-0138.
Compliance with Ethical Standards
Conflict of Interest
The authors declare that they have no conflict of interest.