
Journal of Healthcare Informatics Research

Volume 2, Issue 4, pp 448–471

A Data Adaptive Biological Sequence Representation for Supervised Learning

  • Hande Cakin
  • Berk Gorgulu
  • Mustafa Gokce Baydogan
  • Na Zou
  • Jing Li
Research Article
Part of the following topical collections:
  1. Special Issue on Data Mining in Healthcare Informatics

Abstract

Proper expression of genes plays a vital role in the function of an organism. Recent advancements in DNA microarray technology allow for monitoring the expression levels of thousands of genes. One of the important tasks in this context is to understand the underlying mechanisms of gene regulation. Recently, researchers have focused on identifying local DNA elements, or motifs, to infer the relation between the expression and the nucleotide sequence of a gene. This study proposes a novel data adaptive representation approach for supervised learning to predict the response associated with biological sequences. Biological sequences such as DNA and protein are a class of categorical sequences. In machine learning, categorical sequences are generally mapped to a lower dimensional representation for learning tasks to avoid problems with high dimensionality. The proposed method, namely SW-RF (sliding window-random forest), is a feature-based approach requiring two main steps to learn a representation for categorical sequences. In the first step, each sequence is represented by overlapping subsequences of constant length. Then a tree-based learner is trained on this representation to obtain a bag-of-words-like representation, namely, the frequency of subsequences on the terminal nodes of the tree for each sequence. After representation learning, any classifier can be trained on the learned representation. A lasso logistic regression is trained on the learned representation to facilitate the identification of important patterns for the classification task. Our experiments show that the proposed approach provides significantly better results in terms of accuracy on both synthetic data and DNA promoter sequence data. Moreover, a common problem for microarray datasets, namely missing values, is handled efficiently by the tree learners in SW-RF. Although the focus of this paper is on biological sequences, SW-RF is flexible in handling any categorical sequence data from different applications.

Keywords

Gene expression · Biological sequences · Time series · Categorical · Classification · Representation learning

1 Introduction

Machine learning on large biological sequence databases has received considerable interest over the past decade with the increase in sequential data due to advancements in microarray technology. An important challenge in the analysis of these sequential databases stems from the high dimensionality of the sequences.

Much of the research has focused on finding a high-level representation by transforming the original data to another domain to reduce the dimension and capture certain relevant information. On the other hand, efficient analysis of sequential data is not only a concern of bioinformaticians [4]. Similar problems exist in different domains such as online marketing, where the sequence of pages or the path viewed by users is analyzed to understand purchasing behavior [18], and telecommunication, where the task is to predict a future event (i.e., the failure of a piece of equipment) based on past events (i.e., the alarm messages) [22].

With the recent advancements in whole-genome approaches to measure gene expression, researchers have focused on learning the structural and dynamical properties of gene regulatory networks (GRN) [4]. A GRN consists of regulators which interact with each other and with other substances to govern the gene expression levels of mRNA and proteins. The regulators can be DNA, RNA, protein, or a combination of these. The human genome has approximately 20,000 protein-coding genes and these genes are expressed to generate proteins at specific levels in specific cells [14]. The creation of body structures (i.e., cells or tissues) is achieved in this way. Important factors driving this process are transcription factors (TFs). Regulation of genes is achieved by these specialized proteins, which "bind" to short sequences on the DNA; these binding sites are called TF binding sites (TFBSs). Describing gene expression as a function of regulatory inputs specified by interactions between proteins and DNA is important to understand the underlying mechanisms by which genes are regulated [4]. Generally, pattern recognition algorithms are used to identify the sets of DNA sequence elements, or motifs, that are predictive of a gene's expression pattern. Genetic variations in the DNA sequence may indicate a person's risk of a particular disease or reaction to certain medications [17]. For example, in cancer genomics, panels of specific genes have been identified in malignant tumors. DNA sequence could be used to detect variations or mutations in the established panel of genes for immediate treatment decisions and potential response to chemotherapy drugs. A second example is the use of whole-genome sequencing for predicting a future risk of a specific disease even when the identified variants have no known clinical significance. A healthy newborn could be screened for a potential risk of genetic disease onset in early childhood. It is important to understand DNA sequence so that decision makers or healthcare professionals can determine when and how to use sequencing.

Biological sequences such as DNA are a class of sequential data. Sequential data can be described as a collection of events that take place in an order, where events can be represented with categories (e.g., nucleotides), numbers (e.g., temperature), or a combination of both (e.g., network traffic direction and payload size) [24]. Problems involving sequential data have an important difference compared to regular data mining problems since the information is embedded in the order of the sequence. In other words, there is no explicit feature describing a data point. In traditional learning tasks, each data point is represented with a feature vector describing its characteristics. However, sequential data are represented by consecutive observations, and the length of the sequences can be very large depending on the sequential process. Therefore, an additional step is required to represent given sequences in a more effective manner.

The methods to represent categorical sequences can be categorized under three main groups, namely, similarity-based approaches, feature-based approaches, and model-based approaches. Similarity-based approaches are among the most popular techniques, in which alignment-based kernels are introduced to solve the classification problem with support vector machines (SVM) [11]. Depending on the area of application, similarity-based methods can be successful; however, they lack interpretability. Additionally, their computational performance deteriorates substantially with increasing data size as they require storing the complete training data and calculating similarities between all pairs of sequences.

Besides similarity-based approaches, there are model-based approaches. These approaches aim to discover the underlying model of the observed sequences and use this model (i.e., logic) for the classification of new instances. Under the assumption that sequences follow a parametric distribution, Markovian models are introduced as probabilistic model-based methods. These methods work with the conditional probabilities of occurrences under the Markov property. Variations of Markov models are used in the field of categorical sequence representation. Hidden Markov models (HMM) are the most famous approach to represent sequential data in this category [6]. These approaches learn a model for the sequence and use the model parameters (i.e., learned transition and emission probabilities) as an input to learning algorithms. Model-based methods are generalizable and have been applied successfully to various domains and problems. However, training HMMs might be problematic since there are several parameters to tune and learn (i.e., the number of hidden states and state transition probabilities) and the performance is affected significantly by the choice of initialization parameters.

Computational inefficiency and the non-interpretable nature of similarity-based and model-based approaches have made researchers focus on feature-based approaches. The most popular approach in this category is the k-mers method. After all possible combinations of the observations are considered to generate words of length k, the frequencies of these words are used to represent the sequence. For instance, the 3-mers representation of a DNA sequence is a feature vector of length 4^3 = 64 as there are 4 different nucleotides (i.e., A, T, G, C). It is a representation which has many applications in various domains such as speech recognition [9, 25], text categorization, and protein and genomic sequences. Furthermore, these methods perform significantly better than similarity-based approaches computationally since they do not require the storage of the complete training data. On the other hand, obtaining a representation with larger k values can be problematic for cases where the number of categories is large. For example, there are protein sequences composed of 20 different letters [1]. Obtaining long words is almost impossible with the k-mers representation for such datasets.
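To make the k-mers representation concrete, the following minimal Python sketch (not part of the original study; the helper name kmer_features and the toy sequence are hypothetical) builds the 4^k-dimensional frequency vector for a single DNA sequence.

```python
from itertools import product

def kmer_features(sequence, k=3, alphabet="ACGT"):
    """Frequency vector over all len(alphabet)**k possible words of length k."""
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]  # 4^3 = 64 words for DNA, k = 3
    counts = dict.fromkeys(vocabulary, 0)
    for i in range(len(sequence) - k + 1):          # slide over the sequence one position at a time
        word = sequence[i:i + k]
        if word in counts:                          # ignore windows containing unknown symbols
            counts[word] += 1
    return [counts[w] for w in vocabulary]

features = kmer_features("ATGCGATACG", k=3)
print(len(features), sum(features))                 # 64 features, 8 overlapping 3-mers
```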

There are some variants of k-mers introduced to decrease the computational complexity and memory requirements of k-mers. KRAKEN [23] is a method that depends only on mapping the k-mers representation to the lowest common ancestors (LCA), namely, the deepest node that has two descendant nodes. It aims to decrease the complexity of k-mers without losing any performance. In order to speed up the process further and to decrease the memory requirements of KRAKEN, [16] suggests CLARK. CLARK defines the k-mers representation as a k-spectrum, which is a representation vector of length 4^k that counts the frequencies of all possible k-mers. This representation can be thought of as a protection against noise since rare k-mers are eliminated.

Although some variants of k-mers have been introduced to speed up the process, computational complexity and memory requirements are still major problems in DNA sequencing. Considering this fact, data adaptive representations are proposed to reduce the dimensionality. k-mers is an unsupervised representation learning approach focusing on the generation of a frequency matrix [15, 16, 23]. However, a supervised representation learning approach that makes use of the class information of the sequences has the potential to generate a scalable feature set [3]. Such approaches make use of the response information for each sequence to obtain a compact representation.

This study proposes a novel supervised representation learning method, namely, sliding window-random forest (SW-RF), for categorical sequence representation and classification. The proposed method can be utilized to find representative DNA sequence elements that are predictive of a gene's expression. Moreover, it is capable of working with any biological sequence and all varieties of categorical sequences.

Essentially, SW-RF is a combination of sliding windows and a tree-based ensemble learner. Therefore, it inherits the advantages of tree learners. SW-RF has the computational efficiency advantage of tree learners and can handle large dimensional data successfully. Moreover, it does not require processing each sequence separately since tree classifiers evaluate the data as a whole. The tree learner used in this study is the well-known classification and regression trees algorithm proposed by [8]. We evaluate the performance of the proposed representation method on several categorical sequence datasets. First, we perform experiments on synthetic datasets to uncover the characteristics and evaluate the effectiveness of the approach. Then, a DNA promoter sequence dataset is used to illustrate the interpretability and accuracy of the approach on a real-world dataset. Moreover, we conduct further experiments on sequences with missing values and sequences with varying lengths to elaborate on the robustness of the approach in adverse conditions. To illustrate the generalizability of the proposed method, further experiments on protein sequences and numerical time series are also considered.

The remainder of this paper is organized as follows. Section 2 describes the framework for learning the representation. In Section 3, we illustrate the basics of the approach on a simple example. Sections 4 and 5 demonstrate the effectiveness and efficiency of our proposed approach by testing on synthetic and real-world datasets. Section 6 illustrates the generalizability of SW-RF on protein sequences and numerical time series. Section 7 provides conclusions and potential future work.

2 Sliding Window - Random Forest Method

The SW-RF method for categorical sequence representation and classification proceeds in four steps. The first step is to represent the given sequence with a sliding window and to assign the class label of the corresponding sequence to all sub-windows (i.e., subsequences). The second step is to construct a decision tree classifier based on the step-one representation. The third step is to generate a feature vector/matrix by using the frequency of each pattern that occurs on the terminal nodes of the decision tree/random forest. Finally, the last step is to train any supervised learning algorithm on this representative vector.

Consider an observation sequence set X which is an N × T matrix where N is the number of sequences and T is the length of each sequence. Then, x_{nt} ∈ {1, 2, 3, …, M} represents the observation at the t-th position in sequence n. x_{nt} is a categorical variable with M levels and y_n is the corresponding response variable. Parameter M is referred to as the alphabet size. For illustration purposes, the sequences are assumed to be of the same length, T, although our approach can handle sequences of different lengths. The observed sequences X and responses Y are defined as follows:
$$X = \left[\begin{array}{ccc} x_{11} & \dots & x_{1T} \\ \vdots & \ddots & \vdots \\ x_{N1} & \dots & x_{NT} \end{array}\right] \qquad Y = \left[\begin{array}{c} y_{1} \\ \vdots \\ y_{N} \end{array}\right]$$
In the first step, a sliding window with pre-determined window size W is shifted horizontally one position at a time over each row of X to generate all subsequences of length W. All subsequences are merged to create a new feature matrix named the RF matrix. Note that for each categorical sequence, T − W + 1 subsequences are created and added to RF. The point-by-point sliding scheme considers all possible subsequences of length W, which guarantees that the new representation does not miss any important patterns. For illustration, the representation SW_1 obtained by applying the sliding window over the first sequence in X (i.e., x_{11} … x_{1T}) and the RF_1 matrix obtained by assigning the response variable y_1 to all subsequences from the first sequence are provided below. The same procedure is repeated for each categorical sequence and the resulting matrices are combined vertically to create RF. Therefore, the combined matrix RF has N × (T − W + 1) rows and (W + 1) columns:
$$SW_{1} = \left[\begin{array}{ccc} x_{11} & \dots & x_{1W} \\ x_{12} & \dots & x_{1(W+1)} \\ \vdots & \ddots & \vdots \\ x_{1(T-W+1)} & \dots & x_{1T} \end{array}\right] \qquad RF_{1} = \left[\begin{array}{cccc} x_{11} & \dots & x_{1W} & y_{1} \\ x_{12} & \dots & x_{1(W+1)} & y_{1} \\ \vdots & \ddots & \vdots & \vdots \\ x_{1(T-W+1)} & \dots & x_{1T} & y_{1} \end{array}\right]$$

$$SW_{n} = \left[\begin{array}{ccc} x_{n1} & \dots & x_{nW} \\ x_{n2} & \dots & x_{n(W+1)} \\ \vdots & \ddots & \vdots \\ x_{n(T-W+1)} & \dots & x_{nT} \end{array}\right] \qquad RF_{n} = \left[\begin{array}{cccc} x_{n1} & \dots & x_{nW} & y_{n} \\ x_{n2} & \dots & x_{n(W+1)} & y_{n} \\ \vdots & \ddots & \vdots & \vdots \\ x_{n(T-W+1)} & \dots & x_{nT} & y_{n} \end{array}\right]$$

$$SW = \left[\begin{array}{ccc} x_{11} & \dots & x_{1W} \\ x_{12} & \dots & x_{1(W+1)} \\ \vdots & \ddots & \vdots \\ x_{1(T-W+1)} & \dots & x_{1T} \\ \vdots & \ddots & \vdots \\ x_{N1} & \dots & x_{NW} \\ x_{N2} & \dots & x_{N(W+1)} \\ \vdots & \ddots & \vdots \\ x_{N(T-W+1)} & \dots & x_{NT} \end{array}\right] \qquad RF = \left[\begin{array}{cccc} x_{11} & \dots & x_{1W} & y_{1} \\ x_{12} & \dots & x_{1(W+1)} & y_{1} \\ \vdots & \ddots & \vdots & \vdots \\ x_{1(T-W+1)} & \dots & x_{1T} & y_{1} \\ \vdots & \ddots & \vdots & \vdots \\ x_{N1} & \dots & x_{NW} & y_{N} \\ x_{N2} & \dots & x_{N(W+1)} & y_{N} \\ \vdots & \ddots & \vdots & \vdots \\ x_{N(T-W+1)} & \dots & x_{NT} & y_{N} \end{array}\right]$$

In these matrices, the columns correspond to the symbol positions s_1, …, s_W (plus the label column y in the RF matrices).
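The construction of the RF matrix can be sketched in a few lines of Python; this is an illustrative, hypothetical implementation (function and variable names are not from the paper), not the authors' R code.

```python
import numpy as np

def build_rf_matrix(X, y, W):
    """Stack all length-W sliding windows of every sequence, attaching the
    sequence label and the sequence id to each subsequence (the RF matrix)."""
    windows, labels, seq_ids = [], [], []
    for n, seq in enumerate(X):
        for t in range(len(seq) - W + 1):           # T - W + 1 windows per sequence
            windows.append(seq[t:t + W])
            labels.append(y[n])
            seq_ids.append(n)
    return np.array(windows), np.array(labels), np.array(seq_ids)

# toy example: N = 2 sequences of length T = 6 over the alphabet {a, ..., e}
X = [list("eaddba"), list("bbbcdc")]
y = [0, 1]
SW, y_sub, ids = build_rf_matrix(X, y, W=3)
print(SW.shape)                                     # (8, 3): 2 * (6 - 3 + 1) rows, W symbol columns
```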
The second step is to construct a tree-based ensemble on the combined representation from Step 1 (RF). The base learner used in this ensemble is the classification tree proposed by [8]. Classification trees recursively partition the input space by making binary splits until a termination condition is met. As a result of each split, observations are divided into two groups (nodes) according to a splitting rule. The splitting rule is determined with the aim of gathering similar observations together. In a mathematical sense, splits are selected to minimize the total impurity in the terminal (i.e., leaf) nodes. Classification trees use the Gini index as the impurity measure and select the split that maximizes the information gain in the corresponding nodes. The motivation for using an ensemble, namely, random forests, is to avoid the greedy nature of a single tree [7].

The ensemble is trained on the given RF matrix that has the response vector y as the label and the column vectors s_1, …, s_W representing the symbols as predictors. We are interested in quantifying how the subsequences generated from the categorical sequences (RF) are distributed over the terminal nodes. This is very similar to the bag-of-words (BoW) approach traditionally used in computer vision problems. By applying a sliding window, T − W + 1 subsequences are created for each categorical sequence. When a classification tree is trained on such data, the subsequences of each sequence reside in certain terminal nodes. The subsequences in the same terminal node share similar characteristics because of the recursive partitioning scheme of decision trees, and a sequence can be represented by the frequency of its subsequences at each terminal node. We then obtain the final representation where rows correspond to sequences and columns represent the terminal nodes of the decision tree. As mentioned, we fit J trees on random subsamples of features and instances as in random forests [7] to avoid potential problems with overfitting. The trees are trained in a breadth-first order and the final representation is the concatenation of the frequency vectors. We benefit from the desirable characteristics of tree-based learning (i.e., computational efficiency, missing value handling by surrogate splits, etc.) at this stage.
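A rough Python sketch of this bag-of-terminal-nodes step is given below. It is an approximation under stated assumptions: scikit-learn trees require numeric inputs and do not offer CART-style categorical subset splits or surrogate splits, so the symbols are simply ordinal-encoded here; all function names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def terminal_node_representation(SW, y_sub, ids, n_sequences,
                                 n_trees=10, max_leaves=16, seed=0):
    """Represent each original sequence by the frequency of its subsequences
    in every terminal node of a random forest trained on the RF matrix."""
    symbols = np.unique(SW)
    encoded = np.searchsorted(symbols, SW)          # ordinal encoding of the categorical symbols
    forest = RandomForestClassifier(n_estimators=n_trees, max_leaf_nodes=max_leaves,
                                    random_state=seed).fit(encoded, y_sub)
    leaf_ids = forest.apply(encoded)                # (n_subsequences, n_trees) terminal node indices
    blocks = []
    for j in range(n_trees):
        nodes = np.unique(leaf_ids[:, j])
        freq = np.zeros((n_sequences, len(nodes)))
        for col, node in enumerate(nodes):
            hit = leaf_ids[:, j] == node            # subsequences falling into this terminal node
            freq[:, col] = np.bincount(ids[hit], minlength=n_sequences)
        blocks.append(freq)
    return np.hstack(blocks)                        # sequences x terminal nodes, concatenated over trees
```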

The last step consists of applying a supervised classification method to the proposed representation. Any supervised learning algorithm can be utilized on the proposed representation. However, the extracted feature space becomes wide and sparse with an increasing number of terminal nodes. In order to facilitate interpretability with the proposed sparse representation, we train a lasso logistic regression model as the final classifier. The lasso regression model has a shrinkage effect, which imposes sparsity on the feature selection in the model. It forces some coefficients to be zero and leaves others non-zero. The predictors with non-zero coefficients are considered as having a relatively higher impact on the response variable.
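As a hedged illustration of this final step (the paper uses an R lasso implementation; the toy representation Phi and labels y_seq below are stand-ins, not real data), an L1-penalized logistic regression with cross-validated regularization could look as follows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# toy stand-ins for the learned representation: Phi (sequences x terminal nodes) and labels y_seq
rng = np.random.default_rng(0)
Phi = rng.integers(0, 5, size=(200, 12)).astype(float)
y_seq = rng.integers(0, 2, size=200)

# scikit-learn parameterizes the lasso penalty as C = 1 / lambda; cv selects it by AUC
lasso = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=10,
                             scoring="roc_auc", Cs=20).fit(Phi, y_seq)
important_nodes = np.flatnonzero(lasso.coef_.ravel())
print("terminal nodes with non-zero coefficients:", important_nodes)
```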

3 Illustration

All steps of SW-RF are described using the sample dataset depicted in Table 1. There are N = 200 sequences of length T_n = 20. The alphabet size, M, is equal to 5. There is a binary class variable y_n corresponding to each sequence. The sequences with class variable y_n = 1 are manipulated by inserting a class-indicator subsequence {a, b, c} at randomly selected positions in order to check whether the proposed model is able to capture the indicator pattern.
Table 1

First 25 categorical sequences of the sample dataset, N = 200, Tn = 20, M = 5, yn ∈{0,1}

 

Seq  x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20  y
  1   e  a  d  d  b  a  c  e  e  b   b   c   e   d   a   a   b   —   —   —  0
  2   b  b  b  c  d  c  c  a  e  c   c   a   c   a   b   c   e   b   a   e  1
  3   a  b  a  a  b  c  d  d  c  d   b   c   a   b   b   c   e   d   c   a  1
  4   b  b  e  a  b  a  b  a  a  a   d   b   b   b   b   a   b   c   a   b  1
  5   c  e  e  d  e  c  e  d  b  d   b   a   a   b   c   b   a   c   c   d  1
  6   d  b  a  d  e  d  c  d  c  d   c   c   e   c   b   e   d   c   b   a  0
  7   d  d  c  d  c  b  c  a  d  a   c   b   c   d   b   a   c   a   d   c  0
  8   b  b  e  d  e  b  d  c  c  b   c   b   b   e   a   a   d   d   a   d  0
  9   e  b  c  c  e  c  a  d  b  b   c   e   d   a   d   a   c   d   d   a  0
 10   e  e  b  e  e  b  a  c  c  b   b   a   d   a   a   e   d   e   c   a  0
 11   d  a  e  e  d  c  a  c  b  a   d   c   d   d   a   a   a   e   e   b  0
 12   d  e  c  a  e  c  c  a  a  b   c   b   a   a   d   c   c   d   e   b  1
 13   d  c  a  b  e  d  d  c  b  d   b   a   a   b   e   b   b   b   e   e  0
 14   e  a  c  e  b  e  d  c  c  a   c   e   a   b   c   b   c   d   b   b  1
 15   e  d  d  c  b  e  d  c  e  a   b   d   e   c   a   e   c   b   d   e  0
 16   d  c  c  d  e  b  d  d  e  c   d   d   e   b   e   a   b   e   c   b  0
 17   a  e  e  d  e  d  d  d  d  d   c   c   a   d   d   b   d   e   d   a  0
 18   e  a  b  a  d  a  d  d  e  d   e   d   b   d   a   c   e   a   d   e  0
 19   e  a  b  d  d  e  d  e  b  a   a   b   e   e   b   a   d   e   d   b  0
 20   a  c  d  d  e  a  a  a  d  c   e   a   c   d   b   d   a   b   a   b  0
 21   b  a  d  d  b  a  c  d  c  d   c   b   b   a   b   e   c   b   b   e  0
 22   e  c  e  a  e  e  a  e  d  a   d   a   d   a   d   c   a   e   e   a  0
 23   b  a  c  e  c  e  c  d  a  b   c   d   e   b   b   e   e   b   e   b  1
 24   c  a  b  b  c  c  e  e  a  e   d   b   b   a   b   c   e   a   b   c  1
 25   d  a  e  d  c  c  c  e  a  c   a   d   a   e   b   e   c   b   b   b  0

SW-RF begins with constructing a sliding window representation for a given W. For illustration purposes, W is chosen as 3. Then the RF_1 matrix, i.e., the matrix that contains all subsequences of length 3 of categorical sequence 1, is generated; its rows appear as the first block of Table 2.
Note that the y values of all rows in RF_1 are equal to 0, which is the label of the 1st categorical sequence in the dataset, because if a categorical sequence belongs to a certain class, all of its subsequences should belong to the same class. The RF matrix is constructed by merging the subsequence matrices of all categorical sequences, in other words, by row-wise stacking all RF_n for n in 1, …, N. Therefore, when W is equal to 3, the sliding window representation matrix RF has (20 − 3 + 1) × 200 = 3600 rows and 3 + 1 = 4 columns. Table 2 shows the RF matrix with the corresponding categorical sequence ids (SeqId) and subsequence ids (SubseqId).
Table 2

RF representation with categorical sequence and subsequence ids for the sample dataset

SubseqId  SeqId  s1  s2  s3   y
       1      1   e   a   d   0
       2      1   a   d   d   0
       3      1   d   d   b   0
       4      1   d   b   a   0
       5      1   b   a   c   0
       6      1   a   c   e   0
       7      1   c   e   e   0
       8      1   e   e   b   0
     ...    ...   .   .   .   .
    1796     19   c   a   e   0
    1797     19   a   e   e   0
    1798     19   e   e   b   0
    1799     19   e   b   c   0
    1800     19   b   c   b   0
    1801     20   b   a   d   0
    1802     20   a   d   e   0
    1803     20   d   e   e   0
    1804     20   e   e   d   0
    1805     20   e   d   d   0
     ...    ...   .   .   .   .
    3593    200   c   b   c   1
    3594    200   b   c   a   1
    3595    200   c   a   d   1
    3596    200   a   d   a   1
    3597    200   d   a   b   1
    3598    200   a   b   d   1
    3599    200   b   d   b   1
    3600    200   d   b   a   1

For the sake of easy interpretation, only one tree (i.e., J = 1) is trained on the given RF matrix that has the classification labels y as the response vector and s_1, s_2, s_3 as predictors. The resulting tree is schematized in Fig. 1. Our aim is to find a mapping for each sequence by using the terminal nodes of this decision tree, which are displayed as rectangles in Fig. 1. There are 12 terminal nodes in the tree and each of them has a unique ID, as can be seen in Fig. 1. The unique terminal node IDs are {5, 7, 8, 9, 11, 15, 16, 17, 20, 21, 22, 23}.
Fig. 1

A sample decision tree trained on representation RF

Application of the sliding window constructs (20 − 3 + 1) = 18 subsequences for each categorical sequence. Each of these subsequences is given as an observation to the decision tree. Based on the decision tree in Fig. 1, each subsequence ends up in a single terminal node. Therefore, each subsequence can be represented as a binary vector showing the presence of the subsequence in each node. A sample binary representation is shown in Table 3 with the corresponding categorical sequence ids and subsequence ids.
Table 3

Binary representations of the subsequences with respect to their presence on the terminal nodes

  

                        Terminal nodes
SubseqID  SeqID    5  7  8  9  11  15  16  17  20  21  22  23
       1      1    1  0  0  0   0   0   0   0   0   0   0   0
       2      1    0  0  0  0   0   0   0   0   0   0   0   1
       3      1    0  0  0  0   0   0   0   0   0   0   0   1
       4      1    0  0  0  0   1   0   0   0   0   0   0   0
     ...    ...    .  .  .  .   .   .   .   .   .   .   .   .
    3599    200    0  0  0  0   0   0   0   0   0   0   0   1
    3600    200    0  0  0  0   1   0   0   0   0   0   0   0

SW-RF requires a single representation vector for each categorical sequence. A frequency vector is therefore constructed for each categorical sequence as the frequency of its subsequences residing at each terminal node. This is realized simply by summing the binary vectors over their sequence ids (SeqId). The resulting representation contains a single frequency vector for each categorical sequence. Table 4 demonstrates the final representation for the sample data. Each column in the final representation forms a feature and is denoted φ_q, where q indexes the terminal node IDs; p is the number of terminal nodes, which is equal to 12 in this example.
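With the binary indicators of Table 3 held in a data frame, this aggregation is a one-line group-by sum; the small excerpt below is hypothetical and only illustrates the operation.

```python
import pandas as pd

# one row per subsequence, a SeqID column, and one 0/1 column per terminal node (toy excerpt)
binary_df = pd.DataFrame({
    "SeqID":   [1, 1, 1, 2, 2, 2],
    "node_5":  [1, 0, 0, 0, 1, 0],
    "node_23": [0, 1, 1, 0, 0, 1],
})
final_repr = binary_df.groupby("SeqID").sum()   # frequency of subsequences per terminal node
print(final_repr)
```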
Table 4

Final representation table for the sample dataset

 

                     Terminal nodes
SeqID    5  7  8  9  11  15  16  17  20  21  22  23   Class
    1    1  0  1  1   3   0   3   1   1   0   1   6   0
    2    2  0  2  0   3   1   1   4   1   0   1   3   1
    3    0  0  2  1   2   0   1   3   1   0   3   5   1
    4    1  0  0  1   2   0   0   8   1   0   3   2   1
    5    0  2  0  1   1   0   1   3   1   1   1   7   1
    6    2  1  0  2   4   0   0   1   0   0   0   8   0
    7    2  0  1  1   3   2   1   1   0   0   2   5   0
    8    3  0  0  2   2   1   0   3   0   0   0   7   0
    9    1  1  1  2   4   0   2   0   0   0   0   7   0
   10    2  0  0  1   3   0   1   3   0   1   0   7   0
   11    1  0  0  3   2   1   0   1   0   0   2   8   0
   12    1  1  0  1   3   0   0   5   1   0   1   5   1
   13    2  0  0  2   4   0   0   3   0   0   1   6   0
   14    1  0  1  2   4   2   0   2   1   0   0   5   1
   15    3  0  0  2   4   0   0   0   0   0   0   9   0
  ...    .  .  .  .   .   .   .   .   .   .   .   .   .

Note that the row sums in Table 4 equal 18 because 18 subsequences are created for each sequence when the sliding window is applied. For instance, sequence 1 has no subsequence that fits the rule set of terminal node 7. However, it has 3 subsequences that fit the rule set of terminal node 16. Table 5 provides the rule sets of terminal nodes 7 and 16 in detail. The symbols s1, s2, and s3 shown in the rule sets represent the first, second, and third columns (observations) in the sliding window representation provided in Table 2.
Table 5

Rule set of terminal node 7 and terminal node 16

Terminal Node 7

1) s2 == {a, b, c}

 2) s3 == {d, e}

  3) s1 == {b, c, e}

   4) s2 == {c}

    6) s1 == {c, e}

7)* weights = 89

Terminal Node 16

1) s2 == {a, b, c}

 2) s3 == {a, b, c}

  10) s1 == {a, b, c}

   12) s1 == {b, c}

    13) s3 == {c}

     14) s1 == {b}

      16)* weights = 94

After obtaining the final representation (Table 4) for each categorical sequence, a lasso regression model is trained to predict the binary class variable. The regularization parameter λ is selected internally using 10-fold cross-validation, and the λ providing the best likelihood is selected for model training. We consider the area under the ROC curve (AUC) as the evaluation metric in our calculations. AUC is a well-known and widely used metric for the comparison of binary classification algorithms. AUC is known to be a better measure than accuracy empirically in terms of consistency and discriminancy as discussed by [13].

With the proposed representation, each feature refers to a certain rule that describes a pattern; features with non-zero coefficients are interesting in the sense that they imply important motifs related to the class. The fitted model is as follows:
$$y \sim 0.6109(\varphi_{20})-0.0379 (\varphi_{23}) $$
The model indicates that terminal node 20 is significant for predicting the binary class variable with a positive sign and terminal node 23 is significant with a negative sign. When we look at the rule sets of node 20 and node 23 in detail, as in Table 6, we can see that node 20 has a rule set that matches our inserted indicator subsequence {a, b, c}, whereas node 23 captures the opposite pattern. Moreover, Table 4 shows that categorical sequences from class 1 tend to include subsequences represented by terminal node 20, whereas categorical sequences from class 0 tend to include subsequences represented by terminal node 23. The average cross-validation AUC for the best lambda parameter is close to 1, as illustrated in Fig. 2. This also confirms the reliability of the model in terms of accuracy.
Table 6

Rule set of terminal node 20 and terminal node 23

Terminal Node 20

1) s2 == {a, b, c}

 2) s3 == {a, b, c}

  10) s1 == {a, b, c}

   12) s1 == {a}

    18) s3 == {c}

     19) s2 == {b}

      20)* weights = 110

Terminal Node 23

1) s2 == {d, e}

 23)* weights = 1319

Fig. 2

Plot of AUC with varying lambda values. The dashed line on the left represents the lambda value providing the maximum average AUC, whereas the one on the right is the lambda value which gives the most regularized model such that performance is within one standard error of the maximum average AUC

4 Experimental Setup

The SW-RF method is tested on both synthetic datasets and a real DNA promoter sequence dataset. We compare SW-RF to state-of-the-art methods. Additionally, the performance of SW-RF is evaluated on modified synthetic datasets with varying lengths and missing values to illustrate the robustness of the approach in adverse conditions. SW-RF is implemented in the R programming language and experimentation is performed on a Windows 10 × 64 system with an Intel Core i5-3337U CPU @ 1.60 GHz × 4 and 6 GB RAM. Although the CPU can handle four threads in parallel, only a single thread is used in the experiments.

4.1 Synthetic Dataset

A collection of synthetic datasets with varying problem characteristics is generated to characterize the behavior of the proposed approach in terms of accuracy and computational efficiency. Two HMM models are used to generate categorical sequences from the two classes. In order to observe the performance of the algorithms as a function of the problem parameters, datasets are generated with varying numbers of sequences, sequence lengths, and alphabet sizes. All combinations of the parameters summarized in Table 7 are applied with five replications; therefore, in total 120 datasets are generated with the respective parameters. A minimal sketch of this generation scheme is given below.
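The exact HMM parameters used for data generation are not reported in the paper; the Python sketch below only illustrates the idea with two hypothetical HMMs that share a transition matrix but differ in their emission probabilities, one per class.

```python
import numpy as np

def sample_hmm(T, start, trans, emit, rng):
    """Sample one categorical sequence of length T from a simple HMM."""
    n_states, n_symbols = emit.shape
    state = rng.choice(n_states, p=start)
    seq = []
    for _ in range(T):
        seq.append(rng.choice(n_symbols, p=emit[state]))    # emit a symbol from the current state
        state = rng.choice(n_states, p=trans[state])        # move to the next hidden state
    return seq

rng = np.random.default_rng(0)
M = 5                                                       # alphabet size
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
emit_class0 = np.full((2, M), 1 / M)                        # class 0: uniform emissions
emit_class1 = np.array([[0.4, 0.3, 0.1, 0.1, 0.1],          # class 1: skewed emissions
                        [0.1, 0.1, 0.1, 0.3, 0.4]])
X = [sample_hmm(100, start, trans, emit_class0, rng) for _ in range(50)] + \
    [sample_hmm(100, start, trans, emit_class1, rng) for _ in range(50)]
y = [0] * 50 + [1] * 50
```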
Table 7

Parameter settings used to generate the synthetic datasets

Parameter                          Values
The number of sequences (N)        50, 100, 250
The length of sequence n (Tn)      50, 100, 200, 500
Alphabet size (M)                  5, 10

4.2 DNA Sequence Promoter Dataset

A DNA promoter sequence dataset is obtained from eight human cell lines: H1-hESC, GM12878, HMEC, HSMM, HUVEC, K562, NHEK, and NHLF. The karyotype of K562 is cancer and the others have a normal karyotype. Table 8 provides details about the characteristics of the cell lines. Naturally, there are 23 chromosome pairs in each cell, of which 22 are autosomes and the 23rd pair determines the sex. The letters of each chromosome of each cell line are from the DNA alphabet A, C, G, and T. Each chromosome has a different number of genes, but every gene of each chromosome has the same sequence length, equal to 2100.
Table 8

Cell lines used as the source of experimental material

Cell       Tier   Lineage           Tissue                Karyotype   Sex
K562       1      Mesoderm          Blood                 Cancer      F
GM12878    1      Mesoderm          Blood                 Normal      F
NHEK       3      Ectoderm          Skin                  Normal      U
NHLF       3      Endoderm          Lung                  Normal      U
HUVEC      2      Mesoderm          Blood vessel          Normal      U
HSMM       3      Mesoderm          Muscle                Normal      U
H1-hESC    1      Inner cell mass   Embryonic stem cell   Normal      M

Moreover, the response variable for each gene of each chromosome is included in the dataset. The response variable is gene expression, which refers to the complex process by which the information encoded in DNA is converted into a functional product. It is a binary variable: if the genetic information stored in a gene is used to produce a substance, the gene expression becomes "1", otherwise it is "0". To sum up, each row in the dataset corresponds to one gene and for each gene sequence there is a corresponding binary gene expression variable. Using the DNA sequence promoter dataset, the aim is to classify whether a specific gene has a gene expression of "1" or "0".

5 Algorithm Evaluation

We compare SW-RF with the k-mers algorithm and HMM models with different parameter settings. For each method, the related representation is obtained by applying the given method. All representation methods have method-related parameters such as the number of hidden states for HMM, the value of k for k-mers, and the window size W for SW-RF. The parameters of each method are chosen by 10-fold cross-validation from the parameter values provided in Table 9. After the representations are learned, a lasso logistic regression is trained where λ values are selected by 10-fold cross-validation. Unless otherwise stated, the number of trees J is set to 100. Random forests are known to provide robust results if the number of trees is set large enough [7] and we have identified that predictive performance stabilizes around 60 trees based on our preliminary experiments.
Table 9

Set of parameters considered in 10-fold cross-validation

Method   Parameter                               Values
HMM      Number of hidden states (H)             2, 4, 6
k-mers   Length of words (k)                     3, 4, 5
SW-RF    Window size (W)                         3, 4, 5
SW-RF    Maximum number of terminal nodes (R)    36, 64, 128

The state-of-the-art methods are compared in terms of their theoretical complexity as well as classification performance (AUC) and computational times. The time complexity of the sliding window algorithm is O(T_n · W), where T_n is the length of the sequence and W is the chosen window size. The time complexity of the k-mers representation is O(V), where V stands for the number of distinct k-mers and equals M^k, with M the alphabet size and k the length of the subsequences [5]. The space complexity of the k-mers representation is O(V · k), and it also increases exponentially with k. The HMM representation, trained with the Baum-Welch inference algorithm, has a time complexity of O(H² · T_n) per iteration, where H is the number of hidden states and T_n is the length of the sequence.
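A back-of-the-envelope comparison of the feature-space sizes handed to the final classifier illustrates why this matters; the numbers below use example settings (a protein alphabet and ensemble sizes comparable to Table 9) and are not results from the paper.

```python
# k-mers produces M^k columns, SW-RF at most J * R columns (J trees, R leaves per tree)
M, k = 20, 5          # e.g. protein alphabet with words of length 5
J, R = 100, 128       # example ensemble settings
print("k-mers features:", M ** k)   # 3,200,000 columns, grows exponentially in k
print("SW-RF features :", J * R)    # 12,800 columns, independent of M and k
```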

5.1 Comparative Performance on Synthetic Dataset

According to the AUC results reported in Table 10, the proposed method provides significantly better results than HMM and k-mers for each parameter combination of the synthetic dataset. Computational time analysis is performed for two metrics, namely, the preparation time of the competitor HMM and k-mers representations and of the proposed SW-RF representation, and the training and testing time of the classification process on each representation. Each algorithm is implemented in R and run on a Windows 10 × 64 system with an Intel Core i5-3337U CPU @ 1.60 GHz × 4 and 6 GB RAM.
Table 10

AUC values on synthetic data set with different length of sequence, number of instances and alphabet size against competitor approaches

  

                                   Alphabet size = 5              Alphabet size = 10
Length of     Number of       HMM     k-mers   SW-RF          HMM     k-mers   SW-RF
sequence      instances
 50            50             0.542   0.553    0.647          0.497   0.510    0.558
 50           100             0.630   0.610    0.747          0.500   0.547    0.719
 50           250             0.676   0.660    0.766          0.503   0.594    0.778
100            50             0.614   0.612    0.848          0.507   0.548    0.694
100           100             0.683   0.671    0.835          0.499   0.575    0.817
100           250             0.727   0.768    0.854          0.496   0.665    0.874
200            50             0.618   0.718    0.932          0.495   0.608    0.903
200           100             0.741   0.820    0.940          0.510   0.687    0.932
200           250             0.803   0.875    0.946          0.502   0.770    0.952
500            50             0.663   0.854    0.991          0.495   0.742    0.991
500           100             0.835   0.934    0.992          0.506   0.820    0.995
500           250             0.865   0.977    0.994          0.499   0.891    0.997
Grand average                 0.700   0.754    0.874          0.501   0.663    0.851

All entries are average AUC values.

The computational time required for SW-RF and the competitor approaches is provided in Table 11. The time required for HMM does not change significantly with changing problem parameters. As the alphabet size increases, the computational time for k-mers increases significantly with the increase in the representation size. The computational requirements of SW-RF are not affected significantly by changes in the problem parameters. The computational performance of the SW-RF representation is also examined under different parameter settings to make the characteristics of the proposed algorithm clear. The parameter settings of interest are the window size for the sliding window, the maximum number of terminal nodes, and the number of trees for the random forest. Figure 3 illustrates the results of the sensitivity analyses. Computation times increase with the increase in the window size from 10 to 25. The maximum number of terminal nodes causes a sharp increase in training and testing time, but only a smooth increase in representation preparation time. An increasing number of decision trees causes an increase in representation preparation time. The empirical behavior is consistent with the theoretical complexity of SW-RF.
Table 11

Computational performances of HMM, k-mers and SW-RF on synthetic data set with changing sequence length and number of instances

  

Alphabet size = 5

Length of   Number of     HMM           HMM            k-mers        k-mers         SW-RF         SW-RF
sequence    instances     Preparation   Train & test   Preparation   Train & test   Preparation   Train & test
 50          50           0.048         0.695           0.987         0.550          0.278         0.487
 50         100           0.053         0.860           1.886         1.273          0.532         1.714
 50         250           0.070         0.609           4.648         4.156          1.172         8.497
100          50           0.093         0.880           1.814         0.560          0.446         0.555
100         100           0.096         1.073           3.61          1.247          0.866         1.803
100         250           0.121         0.671           9.006         6.960          1.953         15.476
200          50           0.174         0.953           3.528         0.573          0.803         0.64
200         100           0.185         2.132           7.014         1.160          1.467         1.636
200         250           0.221         0.822          17.460         4.956          3.66          10.274
500          50           0.426         0.847           8.568         0.573          1.843         0.742
500         100           0.457         1.684          17.287         1.133          3.470         1.616
500         250           0.539         1.229          43.218         3.320          9.612         5.203
Grand average             0.207         1.038           9.919         2.205          2.175         4.053

Alphabet size = 10

Length of   Number of     HMM           HMM            k-mers        k-mers         SW-RF         SW-RF
sequence    instances     Preparation   Train & test   Preparation   Train & test   Preparation   Train & test
 50          50           0.047         0.918           1.031         0.966          0.402         0.463
 50         100           0.055         2.733           1.979         2.810          0.633         1.110
 50         250           0.068         0.678           4.929        11.665          1.289         5.309
100          50           0.092         0.884           1.893         1.436          0.565         0.547
100         100           0.099         3.385           3.771         3.914          0.974         1.374
100         250           0.123         0.717           9.271        14.625          2.083         5.908
200          50           0.180         1.221           3.619         1.972          0.899         0.625
200         100           0.187         3.141           7.164         5.070          1.593         1.572
200         250           0.226         0.742          18.109        16.900          3.829         5.665
500          50           0.4379        1.102           8.895         2.780          1.925         0.717
500         100           0.460         3.520          17.669         6.411          3.544         1.471
500         250           0.551         0.852          44.773        19.546          9.673         4.385
Grand average             0.210         1.658          10.258         7.341          2.284         2.429

Fig. 3

Empirical complexity of the SW-RF representation with respect to window size, maximum number of terminal nodes, and number of decision trees, respectively

5.2 Performance on Data with Varying Lengths

The previous section indicates that SW-RF performs well on datasets consisting of sequences with constant length. SW-RF is also capable of handling sequences with variable lengths without losing its predictive accuracy. SW-RF generates a sliding window representation of the sequences in the dataset and this representation is given to a random forest classifier as an input. Therefore, each sequence is transformed into a collection of subsequences of equal length and provided to the random forest classifier. This allows SW-RF to handle sequences with varying lengths. Table 12 provides the performance of SW-RF on synthetic data generated with different lengths. The length of the sequences in each dataset is selected uniformly from the interval provided. Also, noise is introduced to the data to make the task harder for SW-RF. From Table 12, it can be observed that SW-RF works with high accuracy on the data with varying lengths, without a significant performance loss.
Table 12

AUC values on synthetic data with varying lengths

  

                                     Alphabet size
Length of        Number of          10        20
sequence         instances         AUC       AUC
10–100           200               0.97      0.97
20–200           200               0.92      0.97
50–500           200               0.92      0.92
100–1000         200               0.75      0.84
Average performance                0.89      0.92

5.3 Performance on Data with Missing Values

SW-RF utilizes tree-based methods to obtain the new representation. Decision trees can handle missing values using surrogate splits without any imputation [8]. Therefore, SW-RF is also capable of working with missing data. For illustration, new datasets are created by removing 5, 10, 25, and 50 percent of the observations from the synthetic dataset of Table 12 with sequence length 10-100, alphabet size 10, and 200 instances. Table 13 provides the performance of SW-RF on these datasets. These results clearly illustrate that SW-RF can perform with missing values. Naturally, the performance decreases as the missing ratio increases. However, SW-RF still performs well with missing values.
Table 13

AUC values on synthetic data with changing missing proportion

Missing ratio   AUC
0               0.970
0.05            0.902
0.1             0.886
0.25            0.782
0.5             0.686

5.4 Comparative Performance on DNA Sequence Promoter Dataset

The proposed SW-RF representation and the competitor HMM and k-mers representation methods are trained on the DNA promoter sequence dataset for each cell line separately. Moreover, for each cell line, 10-fold cross-validation is applied with five replications. Therefore, the presented results are averages over the individual results of the separate replications. We perform cross-validation to determine the best parameter settings using the parameter levels in Table 9. Then, the obtained parameter settings are used on the test part of the dataset. The AUC values of both steps are recorded in order to see whether there is a significant difference between the AUC value of the cross-validation step and the AUC value of the test step; the aim of this comparison is to show that SW-RF does not suffer from overfitting. When the cross-validation AUC values are compared to the test AUC values, there is no significant difference that would indicate overfitting. Consequently, the results in Table 14 indicate that our proposed SW-RF representation method outperforms the HMM and k-mers representations for each cell line. While the HMM representation provides the worst results in terms of AUC, the k-mers representation provides moderate results that are close to the results of the HMM representation. In order to evaluate whether the difference between the means of the representation methods is significant or not, a t test is conducted at the 0.05 significance level. The results show that the mean differences between SW-RF and the HMM representation, and between SW-RF and the k-mers representation, are statistically significant (Table 14).
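The per-replication AUC values behind this test are not listed in the paper, but the flavor of the comparison can be reproduced on the average test AUC values of Table 14 with a paired t-test; this is only an illustrative sketch, not the authors' exact test.

```python
from scipy import stats

# average test AUC per cell line, copied from Table 14
sw_rf = [0.766, 0.766, 0.736, 0.717, 0.774, 0.798, 0.779, 0.736]
kmers = [0.612, 0.621, 0.614, 0.603, 0.624, 0.636, 0.634, 0.606]
t, p = stats.ttest_rel(sw_rf, kmers)        # paired over the eight cell lines
print(f"t = {t:.2f}, p = {p:.4g}")          # p far below the 0.05 level
```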
Table 14

AUC values of HMM, k-mers and SW-RF on DNA sequence promoter data set

 

                     HMM                     k-mers                  SW-RF
Cell            Avg cvAUC  Avg testAUC   Avg cvAUC  Avg testAUC   Avg cvAUC  Avg testAUC
GM12878         0.565      0.567         0.612      0.612         0.770      0.766
H1hESC          0.603      0.603         0.620      0.621         0.770      0.766
HMEC            0.615      0.616         0.614      0.614         0.740      0.736
HSMM            0.602      0.603         0.603      0.603         0.720      0.717
HUVEC           0.601      0.602         0.624      0.624         0.777      0.774
K562            0.578      0.580         0.636      0.636         0.800      0.798
NHEK            0.595      0.597         0.634      0.634         0.782      0.779
NHLF            0.588      0.588         0.606      0.606         0.739      0.736
Grand average   0.593      0.594         0.619      0.619         0.765      0.758

It is also possible to identify the subsequences that have non-zero coefficients, as illustrated in Section 3. For terminal nodes with non-zero coefficients, the distribution of the nucleotides is examined, since detecting sequences that have an effect on the gene expression level is believed to be beneficial for biological research. A similar study was previously conducted on the same dataset by Zou [26], who proposed a transfer learning approach based on k-mers to predict the same response. These biological sequences are so-called motifs and can be represented with a position-specific scoring matrix (PSSM) [19]. Figure 4 provides a motif logo example for the cell line GM12878 for W = 5. The PSSM represents the distribution of the nucleotides for the corresponding terminal node of the tree. These results are consistent with the discussion that GC-rich (guanine-cytosine rich) DNA elements are important transcriptional regulatory elements in the promoter, enhancer, and locus control regions of many eukaryotic genes from several species [10, 21].
Fig. 4

Example of a motif logo for cell line GM12878, W = 5. The first column refers to the terminal node id; the second column is the corresponding coefficient

6 Evaluation of SW-RF on different applications

In the previous sections, the success of our method is demonstrated on synthetic datasets as well as real DNA sequence data. However, the application domain of the SW-RF method extends well beyond these applications. The proposed method is suitable for any kind of categorical sequence dataset and it can even be used to classify numerical sequences with some pre-processing. In addition, the applications of SW-RF to biological sequences are not limited to DNA sequence data; it can be applied to any biological sequence such as protein, DNA, or RNA. In order to demonstrate this generalizability, SW-RF is applied to both protein classification and numerical time series classification tasks.

6.1 Protein classification

Classification of different protein families allows understanding the functional roles and structures of the proteins. In this experiment, SW-RF is applied to a protein dataset to determine the family of each protein sequence. The protein dataset is a sample of 2112 sequences taken from www.uniprot.org, where each sequence is drawn from an alphabet of 20 amino acids. Each sequence has one of two functions: "Might take part in the signal recognition particle (SRP) pathway" and "Binds to DNA and alters its conformation". The functions are treated as the sequence labels. Experiments with 5-fold cross-validation show that SW-RF with various window sizes W yields an AUC of 1. For demonstration purposes, SW-RF with W = 5 is applied to the protein dataset and motif logo examples of important terminal nodes are demonstrated in Fig. 5. For example, the "YYDD" subsequence is a signature subsequence for the "Might take part in the signal recognition particle (SRP) pathway" function. "Y" refers to the amino acid tyrosine, which is known to occur in proteins that are part of signal transduction processes [20].
Fig. 5

Example of a motif logo for the protein data, W = 5

6.2 Numerical time series classification

The SW-RF method is essentially suggested for categorical sequences; however, it is possible to apply the proposed method to any numerical time series with additional preprocessing. As an illustration, SW-RF is applied to 3 ECG datasets (ECG200, ECGFiveDays, TwoLeadECG) from [2]. The ECG200 dataset contains two classes of time series, normal heartbeat and myocardial infarction, and the aim is to distinguish heartbeats with myocardial infarction from normal healthy heartbeats. The ECGFiveDays dataset contains ECG recordings of a 67-year-old male recorded on two different dates, with the aim of distinguishing the two types of recordings. TwoLeadECG is an ECG dataset that contains two types of signals, and the task is to classify signal 0 and signal 1.

Pre-processing requires discretization of the continuous time series, which is realized by applying symbolic aggregate approximation (SAX) [12]. For a given alphabet size and dimensionality parameter, SAX divides the series into intervals and represents each interval with its mean. The alphabet size determines the number of distinct mean levels to be taken into account and the dimensionality parameter decides the number of observations to be considered for discretization. All ECG datasets are discretized by applying SAX with alphabet size α and dimensionality parameter β. Figure 6 contains an illustration of the discretization process on a sample time series from the ECG200 dataset. The discretized time series are expressed by sequences of categorical variables and thus form categorical time series. SW-RF is then applied to the categorical time series for classification. Table 15 provides a comparison of the classification accuracies of SW-RF against Euclidean distance, dynamic time warping (DTW), and derivative dynamic time warping (DDTW) followed by a 1-nearest neighbor classifier. DTW is considered one of the most well-known benchmarks in time series classification. DTW measures the distance between two time series with varying temporal patterns and shifts by calculating the optimal match using dynamic programming. DDTW is suggested as a variant of DTW which calculates the distance between two time series as a weighted combination of the DTW distance between the time series and the DTW distance between their first order differences. Although there is a loss of information due to discretization, SW-RF yields a classification accuracy that is comparable to the state-of-the-art approaches.
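A minimal SAX discretization, assuming z-normalization, piecewise aggregate approximation, and Gaussian breakpoints as described in [12], can be sketched as follows; the function name and the synthetic input series are hypothetical, not taken from the paper's experiments.

```python
import numpy as np
from scipy.stats import norm

def sax(series, alphabet_size=4, n_segments=16):
    """z-normalise, average the series over n_segments pieces (PAA),
    then map every piece mean to a symbol via Gaussian breakpoints."""
    x = (series - np.mean(series)) / np.std(series)
    paa = np.array([seg.mean() for seg in np.array_split(x, n_segments)])
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])  # e.g. [-0.67, 0, 0.67]
    return np.searchsorted(breakpoints, paa)        # integer symbols 0 .. alphabet_size - 1

ecg_like = np.sin(np.linspace(0, 4 * np.pi, 128)) + 0.1 * np.random.randn(128)
print(sax(ecg_like))                                 # length-16 categorical sequence fed to SW-RF
```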
Fig. 6

Sample SAX representation for an alphabet size of four and a desired dimensionality of 16. The time series length is 128. Horizontal grids differentiate the mean levels implied by each symbol and red boxes visualize the symbol assignments

Table 15

Comparison of SW-RF with benchmark methods on 3 ECG datasets

 

              SW-RF    Euclidean 1NN   DTW_R1 1NN   DTW_Rn 1NN   DDTW_R1 1NN   DDTW_Rn 1NN
ECG200        0.830    0.880           0.770        0.880        0.830         0.890*
ECGFiveDays   0.810*   0.797           0.768        0.797        0.687         0.696
TwoLeadECG    0.923    0.747           0.905        0.868        0.995*        0.982

The accuracy of the best performing approach on each dataset is marked with an asterisk.

7 Conclusion and Future Work

This study proposes a novel data adaptive representation approach for categorical sequential data. The proposed SW-RF method offers a two-step processing of categorical sequences to obtain a meaningful representation. Applying a tree-based ensemble to a simple sliding window representation of the sequential data and summarizing each sequence as a frequency distribution of its subsequences over the terminal nodes provide some advantageous properties to SW-RF compared to state-of-the-art representation methods. It gives relatively better classification accuracy as the number of observation symbols increases. Furthermore, the SW-RF representation can learn longer words depending on the setting of the maximum number of terminal nodes of the decision trees. Our experiments show that the proposed approach provides significantly better results in terms of accuracy on both synthetic data and DNA promoter sequence data. We have shown that important motifs related to gene expression can be identified when the representation is used for training a lasso logistic regression. The success of SW-RF is also illustrated on alternative applications such as protein sequence and numerical time series classification. Although not considered in this study, the proposed approach can be extended to regression problems in a straightforward manner.

Notes

Funding Information

This research is supported by Air Force Office of Scientific Research Grant FA9550-17-1-0138.

Compliance with Ethical Standards

Conflict of Interest

The authors declare that they have no conflict of interest.

References

  1. Bacardit J, Stout M, Hirst JD, Valencia A, Smith RE, Krasnogor N (2009) Automated alphabet reduction for protein datasets. BMC Bioinf 10(1):6
  2. Bagnall A, Lines J, Bostrom A, Large J, Keogh E (2017) The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Min Knowl Disc 31(3):606–660
  3. Baydogan MG, Runger G (2015) Learning a symbolic representation for multivariate time series classification. Data Min Knowl Disc 29(2):400–422
  4. Beer MA, Tavazoie S (2004) Predicting gene expression from sequence. Cell 117(2):185–198
  5. Benoit G, Peterlongo P, Mariadassou M, Drezen E, Schbath S, Lavenier D, Lemaitre C (2016) Multiple comparative metagenomics using multiset k-mer counting. PeerJ Comput Sci 2:e94
  6. Blasiak S, Rangwala H (2011) A hidden Markov model variant for sequence classification. In: IJCAI proceedings-international joint conference on artificial intelligence, vol 22, p 1192
  7. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
  8. Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC Press
  9. Brown PF, deSouza PV, Mercer RL, Pietra VJD, Lai JC (1992) Class-based n-gram models of natural language. Comput Linguist 18(4):467–479
  10. Hapgood JP, Riedemann J, Scherer SD (2001) Regulation of gene expression by GC-rich DNA cis-elements. Cell Biol Int 25(1):17–31
  11. Kuksa P, Pavlovic V (2009) Efficient alignment-free DNA barcode analytics. BMC Bioinforma 10(14):S9
  12. Lin J, Keogh E, Wei L, Lonardi S (2007) Experiencing SAX: a novel symbolic representation of time series. Data Min Knowl Disc 15:107–144
  13. Ling CX, Huang J, Zhang H (2003) AUC: a better measure than accuracy in comparing learning algorithms. In: Conference of the Canadian society for computational studies of intelligence, Springer, pp 329–341
  14. MacNeil LT, Walhout AJ (2011) Gene regulatory networks and the role of robustness and stochasticity in the control of gene expression. Genome Res 21(5):645–657
  15. Meher PK, Sahu TK, Rao A (2016) Identification of species based on DNA barcode using k-mer feature vector and random forest classifier. Gene 592(2):316–324
  16. Ounit R, Wanamaker S, Close TJ, Lonardi S (2015) CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 16(1):236
  17. Phillips KA, Trosman JR, Kelley RK, Pletcher MJ, Douglas MP, Weldon CB (2014) Genomic sequencing: assessing the health care system, policy, and big-data implications. Health Aff 33(7):1246–1253
  18. Richter C, Luboschik M, Röhlig M, Schumann H (2015) Sequencing of categorical time series. In: 2015 IEEE conference on visual analytics science and technology (VAST), IEEE, pp 213–214
  19. Stormo GD, Schneider TD, Gold L, Ehrenfeucht A (1982) Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res 10(9):2997–3011
  20. Ullrich A, Schlessinger J (1990) Signal transduction by receptors with tyrosine kinase activity. Cell 61(2):203–212
  21. Vinogradov AE (2003) DNA helix: the importance of being GC-rich. Nucleic Acids Res 31(7):1838–1844
  22. Weiss GM, Hirsh H (1998) Learning to predict rare events in categorical time-series data. In: Proceedings of the AAAI/ICML workshop on time-series analysis, Madison, Wisconsin
  23. Wood DE, Salzberg SL (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15(3):R46
  24. Xing Z, Pei J, Keogh E (2010) A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter 12(1):40–48
  25. Zissman MA, Singer E (1994) Automatic language identification of telephone speech messages using phoneme recognition and n-gram modeling. In: IEEE international conference on acoustics, speech and signal processing (ICASSP-94), vol 1, pp 305–308
  26. Zou N (2015) A probabilistic framework of transfer learning: theory and application. Arizona State University

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Industrial Engineering, Boğaziçi University, İstanbul, Turkey
  2. Texas A&M University, College Station, USA
  3. Arizona State University, Tempe, USA
