Mobile Health, pp. 389–409
Time Series Feature Learning with Applications to Health Care
Abstract
Exponential growth in mobile health devices and electronic health records has resulted in a surge of large-scale time series data, which demands effective and fast machine learning models for analysis and discovery. In this chapter, we discuss a novel framework based on deep learning which automatically performs feature learning from heterogeneous time series data. It is well-suited for healthcare applications, where available data have many sparse outputs (e.g., rare diagnoses) and exploitable structures (e.g., temporal order and relationships between labels). Furthermore, we introduce a simple yet effective knowledge-distillation approach to learn an interpretable model while achieving the prediction performance of deep models. We conduct experiments on several real-world datasets and show the empirical efficacy of our framework and the interpretability of the mimic models.
Introduction
The breakthroughs in sensor technologies and wearable devices in health domains have led to a surge of large-volume time series data. This offers an unprecedented opportunity to improve future care by learning from past patient encounters. One important step towards this goal is learning richer and more meaningful features so that accurate prediction and effective analysis can be performed. Representation learning from time series health care data has attracted many machine learning and data mining researchers [17, 40, 45]. Most work has focused on discovering cluster structure (e.g., disease subtypes) via variations of Gaussian processes [23, 29].
Learning robust representations of time series health care data is especially challenging because the underlying causes of health and wellness span body systems and physiologic processes, creating complex and nonlinear relationships among observed measurements (e.g., patients with septic shock may exhibit fever or hypothermia). Whereas classic shallow models (e.g., cluster models) may struggle in such settings, properly trained deep neural networks can often discover, model, and disentangle these types of latent factors [5] and extract meaningful abstract concepts from simple data types [19]. Because of these properties, deep learning has achieved state-of-the-art results in speech recognition [15] and computer vision [39] and is well-suited to time series health data. Several recent works have demonstrated the potential of deep learning to derive insight from clinical data [18, 22, 38]. Nonetheless, the practical reality of neural networks remains challenging, and we face a variety of questions when applying them to a new problem:
First, do we have enough data? Deep learning’s success is often associated with massive data sets, with potentially millions of examples [10, 34], but in medicine “big data” often means an electronic health records (EHR) database [14, 23] with tens of thousands of cases. Other questions concern data preprocessing, model architecture, training procedures, etc. Answering these questions often requires time-consuming trial and error.
Second, in health domains, model interpretability is not only important but necessary, since primary care providers, physicians, and clinical experts alike depend on new healthcare technologies to help them monitor and make decisions about patient care. Interpretable models have been shown to be adopted faster by clinical staff and to lead to better quality of patient care [20, 27]. Therefore, we need to identify novel solutions which can provide interpretable models while achieving prediction performance similar to that of deep models in the healthcare domain.

We formulate a prior-based regularization framework for guiding the training of multi-label neural networks using medical ontologies and other structured knowledge. Our formulation is based on graph Laplacian priors [1, 3, 37, 43], which can represent any graph structure and incorporate arbitrary relational information. We apply graph Laplacian priors to the problem of training neural networks to classify physiologic time series with diagnostic labels, where there are many labels and severe class imbalance. Our framework is general enough to incorporate data-driven (e.g., comorbidity patterns) and hybrid priors.

We propose an efficient incremental training procedure for building a series of neural networks that detect meaningful patterns of increasing length. We use the parameters of an existing neural net to initialize the training of a new neural net designed to detect longer temporal patterns. This technique exploits both the well-known low-rank structure of neural network weight matrices [11] and structure in our data domain, including temporal smoothness.

We propose a novel knowledge distillation methodology called Interpretable Mimic Learning, in which we mimic the performance of state-of-the-art deep learning models using well-known Gradient Boosting Trees (GBT). Our experiments on a real-world hospital dataset show that our proposed Interpretable Mimic Learning models can achieve performance comparable to that of the deep learning models. We discuss the interpretable features learned by our Interpretable Mimic Learning models, which are validated by expert clinicians.
The proposed deep learning solutions in this chapter are general and are applicable to a wide variety of time series healthcare data including longitudinal data from electronic healthcare records (EHR), sensor data from intensive care units (ICU), sensor data from mobile health devices and so on. We will use an example of computational phenotyping from ICU time series data to demonstrate the effectiveness of our proposed approach.
Related Work
Deep learning approaches have achieved breakthrough results in several sequential and temporal data domains including language modeling [24], speech recognition [9, 15], and paraphrase detection [31]. We expect similar results in more general time series data domains, including healthcare. In natural language processing, distributed representations of words, learned from context using neural networks, have provided huge boosts in performance [35]. Our use of neural networks to learn representations of time series is similar: a window of time series observations can be viewed as the context for a single observation within that window.
In medical applications, many predictive tasks suffer from severe class imbalance since most conditions are rare. One possible remedy is to use side information, such as class hierarchy, as a rich prior to prevent overfitting and improve performance. Reference [32] is the first work that combines a deep architecture with a treebased prior to encode relations among different labels and label categories, but their work is limited to modeling a restricted class of side information.
Our incremental training method has clear and interesting connections to ongoing research into efficient methods for training deep architectures. It can be viewed as a greedy method for building deep architectures horizontally by adding units to one or more layers, and can be connected to two recent papers: Zhou et al. [44] described a two-step incremental approach to feature learning in an online setting, focused on data drift. Denil et al. [11] described an approach for predicting parameters of neural networks by exploiting the smoothness of input data and the low-rank structure of weight matrices.
As pointed out in the introduction, model interpretability is not only important but necessary in the healthcare domain. Decision trees [28]—due to their easy interpretability—have been quite successfully employed in the healthcare domain [6, 13, 41], and clinicians have embraced them to make informed decisions. However, decision trees can easily overfit, and they do not achieve good performance on datasets with missing values, which are common in today’s healthcare datasets. On the other hand, deep learning models have achieved remarkable performance in healthcare but are hard to interpret. Some recent works on deep learning interpretability in the computer vision field [12, 33, 42] show that interpreting deep learning features is possible, but the behavior of deep models may be more complex than previously believed. Therefore, we believe there is a need to identify novel solutions which can provide interpretable models while achieving prediction performance similar to that of deep models.
Mimicking the performance of deep learning models using shallow models is a recent breakthrough in deep learning which has captured the attention of the machine learning community. Ba and Caruana [2] showed empirically that shallow neural networks are capable of learning the same function as deep neural networks. They demonstrated this by first training a state-of-the-art deep model, and then training a shallow model to mimic the deep model. Motivated by the model compression idea from [7], the authors of [16] proposed an efficient knowledge distillation approach to transfer (dark) knowledge from model ensembles into a single model. These previous works motivate us to employ a mimic learning strategy to learn an interpretable model from a well-trained deep neural network.
Methods
In this section, we describe our framework for performing effective deep learning on clinical time series data. We begin by discussing the Laplacian graph-based prior framework that we use to perform regularization when training multi-label neural networks. This allows us to effectively train neural networks, even with smaller data sets, and to exploit structured domain knowledge, such as ontologies. We then describe our incremental neural network procedure, which we developed in order to rapidly train a collection of neural networks to detect physiologic patterns of increasing length. Finally, we describe a simple but effective knowledge distillation framework which recognizes interpretable features while maintaining the state-of-the-art classification performance of the deep learning models.
General Framework
Given a multivariate time series with P variables and length T, we can represent it as a matrix \(\mathbf{X} \in \mathbb{R}^{P\times T}\). A feature map for time series X is a function \(g: \mathbb{R}^{P\times T}\mapsto \mathbb{R}^{D}\) that maps X to a vector of features \(\mathbf{x} \in \mathbb{R}^{D}\) useful for machine learning tasks like classification, segmentation, and indexing. Given the recent successes of deep learning, it is natural to investigate its effectiveness for feature learning in clinical time series data.
Suppose we have a data set of N multivariate time series, each with P variables and K binary labels. Without loss of generality, we assume all time series have the same length T. After a simple mapping that stacks all T column vectors in X into one vector x, we have N labeled instances \(\{\mathbf{x}_{i},\mathbf{y}_{i}\}_{i=1}^{N}\), where \(\mathbf{x}_{i} \in \mathbb{R}^{D},\mathbf{y}_{i} \in \{ 0,1\}^{K},D = PT\). The goal of multi-label classification is to learn a function f which can be used to assign a set of labels to each instance \(\mathbf{x}_{i}\), such that y_{ ij } = 1 if the jth label is assigned to the instance \(\mathbf{x}_{i}\) and 0 otherwise.
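As a concrete sketch of this stacking map (with made-up dimensions; the variable names are ours, not the chapter's):

```python
import numpy as np

P, T = 30, 48            # variables and time steps (e.g., hourly ICU data)
K = 4                    # number of binary labels
N = 100                  # number of episodes

# Toy data: N multivariate time series and their multi-label targets.
rng = np.random.default_rng(0)
X_series = rng.random((N, P, T))        # each X_i is a P x T matrix
Y = rng.integers(0, 2, size=(N, K))     # each y_i lies in {0,1}^K

# Stack the T column vectors of each X_i into one vector x_i with D = P*T
# entries (column-by-column, so the first P entries are the observations
# at time step 0).
X_flat = X_series.transpose(0, 2, 1).reshape(N, P * T)
print(X_flat.shape)      # one D-dimensional instance per episode
```

Each row of `X_flat` is then an instance for any standard multi-label classifier.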
Our framework can easily be extended to other network architectures, hidden unit types, and training procedures.
Prior-Based Regularization
Deep neural networks are known to work best in big data scenarios with many training examples. When we have access to only a few examples of each class label, incorporating prior knowledge can improve learning. Thus, it is useful to have a general framework able to incorporate a wider range of prior information in a unified way. Graph Laplacian-based regularization [1, 3, 37, 43] provides one such framework and is able to incorporate any relational information that can be represented as a (weighted) graph, including the tree-based prior as a special case.
The graph Laplacian regularizer can represent any pairwise relationships between parameters. Here we discuss how to use different types of priors and the corresponding Laplacian regularizers to incorporate both structured domain knowledge (e.g., label hierarchies based on medical ontologies) and empirical similarities.
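As an illustration (our own sketch, not the chapter's code), the Laplacian penalty for per-label weight vectors can be computed as \(\mathrm{tr}(\boldsymbol{W}^{\top}\boldsymbol{L}\boldsymbol{W}) = \frac{1}{2}\sum_{ij}A_{ij}\|\mathbf{w}_{i}-\mathbf{w}_{j}\|^{2}\), where A is the label adjacency matrix:

```python
import numpy as np

def laplacian_penalty(W, A):
    """Graph-Laplacian regularizer: 0.5 * sum_ij A_ij ||w_i - w_j||^2.

    W : (K, H) matrix whose rows are per-label weight vectors.
    A : (K, K) symmetric adjacency matrix encoding label relations.
    """
    L = np.diag(A.sum(axis=1)) - A      # unnormalized graph Laplacian
    return np.trace(W.T @ L @ W)        # equals the pairwise sum above

# Labels 0 and 1 are related (A[0,1] = 1), so their weight vectors are
# pulled together; label 2 is unconstrained.
A = np.array([[0., 1., 0.],
              [1., 0., 0.],
              [0., 0., 0.]])
W = np.array([[1., 0.],
              [0., 1.],
              [5., 5.]])
print(laplacian_penalty(W, A))   # -> 2.0, i.e., ||w_0 - w_1||^2
```

Adding this term (scaled by a regularization constant) to the training loss penalizes dissimilar weights for related labels.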
Structured Domain Knowledge as a Tree-Based Prior
The graph Laplacian regularizer can represent a tree-based prior based on hierarchical relationships found in medical ontologies. In our experiments, we use diagnostic codes from the Ninth Revision of the International Classification of Diseases (ICD-9) system [25], which are widely used for classifying diseases and coding hospital data. The three digits (and two optional decimal digits) in each code form a natural hierarchy including broad body-system categories (e.g., Respiratory), individual diseases (e.g., Pneumonia), and subtypes (e.g., viral vs. pneumococcal pneumonia). The right part of Fig. 1 illustrates two levels of the hierarchical structure of the ICD-9 codes. When using ICD-9 codes as labels, we can treat their ontological structure as prior knowledge: if two diseases belong to the same category, we add an edge between them in the adjacency graph A.
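The edge-adding rule can be sketched as follows. The category test here is deliberately minimal (only the respiratory range 460–519 from the text); a full implementation would cover all ICD-9 groups:

```python
from itertools import combinations

import numpy as np

def icd9_tree_adjacency(codes):
    """Build a label adjacency matrix from ICD-9 codes: connect two
    diagnoses when their 3-digit prefixes fall in the same broad
    category. Category boundaries here are illustrative only."""
    def category(code):
        n = int(code[:3])               # leading 3 digits of the code
        return "respiratory" if 460 <= n <= 519 else "other"

    K = len(codes)
    A = np.zeros((K, K))
    for i, j in combinations(range(K), 2):
        if category(codes[i]) == category(codes[j]):
            A[i, j] = A[j, i] = 1.0     # same category -> add an edge
    return A

codes = ["486", "480.9", "250.0"]       # pneumonia, viral pneumonia, diabetes
A = icd9_tree_adjacency(codes)
print(A[0, 1], A[0, 2])                 # -> 1.0 0.0
```

The resulting A plugs directly into the Laplacian regularizer described above.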
Data-Driven Similarity as a Prior
Incremental Training
Next we describe our algorithm for efficiently training a series of deep models to discover and detect physiologic patterns of varying lengths. This framework utilizes a simple and robust strategy for incremental learning of larger neural networks from smaller ones by iteratively adding new units to one or more layers. Our strategy is founded upon intelligent initialization of the larger network’s parameters using those of the smaller network.
Given a multivariate time series \(\mathbf{X} \in \mathbb{R}^{P\times T}\), there are two ways to use feature maps of varying or increasing lengths. The first is to perform time series classification in an online setting in which we want to regularly reclassify a time series based on all available data. For example, we might want to reclassify (or diagnose) a patient after each new observation while also including all previous data. Second, we can apply a feature map g designed for a shorter time series of length T_{ S } to a longer time series of length T > T_{ S } using a sliding window approach: we apply g as a filter to subsequences of size T_{ S } with stride R_{ S } (there will be \(\frac{T-T_{S}+1} {R_{S}}\) such windows). Proper choice of window size T_{ S } and stride R_{ S } is critical for producing effective features. However, there is often no way to choose the right T_{ S } and R_{ S } beforehand without a priori knowledge (often unavailable). What is more, in many applications we are interested in multiple tasks (e.g., patient diagnosis and risk quantification), for which different values of T_{ S } and R_{ S } may work best. Thus, generating and testing features for many T_{ S } and R_{ S } is useful and often necessary. Doing this with neural nets can be computationally expensive and time-consuming.
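The sliding-window construction can be sketched as follows (variable names are ours):

```python
import numpy as np

def sliding_windows(X, T_s, R_s):
    """Extract subsequences of length T_s with stride R_s from a P x T
    series, yielding one flattened feature vector per window."""
    P, T = X.shape
    starts = range(0, T - T_s + 1, R_s)
    return np.stack([X[:, s:s + T_s].reshape(-1) for s in starts])

X = np.arange(2 * 48, dtype=float).reshape(2, 48)   # P = 2, T = 48
feats = sliding_windows(X, T_s=12, R_s=6)           # 50% overlap
print(feats.shape)   # -> (7, 24): 7 windows, each of P*T_s = 24 values
```

Each row of `feats` would then be passed through the feature map g (e.g., a trained network's hidden layer).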
To address this, we propose an incremental training procedure that leverages a neural net trained on windows of size T_{ S } to initialize and accelerate the training of a new neural net that detects patterns of length T^{ ′ } = T_{ S } + ΔT_{ S } (i.e., ΔT_{ S } additional time steps). That is, the input size of the first layer changes from D = PT_{ S } to D^{ ′ } = D + d = PT_{ S } + PΔT_{ S }.
Suppose that the existing and new networks have D^{(1)} and D^{(1)} + d^{(1)} hidden units in their first hidden layers, respectively. Recall that we compute the activations in our first hidden layer according to the formula \(\mathbf{h}^{(1)} =\sigma (\boldsymbol{W}^{(1)}\mathbf{x} +\boldsymbol{ b}^{(1)})\). This makes \(\boldsymbol{W}^{(1)}\) a D^{(1)} × D matrix and \(\boldsymbol{b}^{(1)}\) a D^{(1)}-vector; we have a row for each feature (hidden unit) in \(\mathbf{h}^{(1)}\) and a column for each input in \(\mathbf{x}\). From here on, we will treat the bias \(\boldsymbol{b}^{(1)}\) as a column in \(\boldsymbol{W}^{(1)}\) corresponding to a constant input and omit it from our notation.
The larger neural network has a (D^{(1)} + d^{(1)}) × (D + d) weight matrix \(\boldsymbol{W}^{{\prime}}{}^{(1)}\). The first D columns of \(\boldsymbol{W}^{{\prime}}{}^{(1)}\) correspond exactly to the D columns of \(\boldsymbol{W}^{(1)}\) because they take the same D inputs. In time series data, these inputs are the observations in the same T_{ S } × P matrix. We cannot guarantee the same identity for the first D^{(1)} rows of \(\boldsymbol{W}^{{\prime}}{}^{(1)}\), which correspond to the first D^{(1)} hidden units of \(\mathbf{h}^{{\prime}}{}^{(1)}\); nonetheless, we can reasonably assume that these hidden units are highly similar to those in \(\mathbf{h}^{(1)}\). Thus, we can think of constructing \(\boldsymbol{W}^{{\prime}}{}^{(1)}\) by adding d new columns and d^{(1)} new rows to \(\boldsymbol{W}^{(1)}\), giving three categories of new weights:

\(\varDelta \boldsymbol{W}_{ne}\): D^{(1)} × d weights that connect new inputs to existing features.

\(\varDelta \boldsymbol{W}_{en}\): d^{(1)} × D weights that connect existing inputs to new features.

\(\varDelta \boldsymbol{W}_{nn}\): d^{(1)} × d weights that connect new inputs to new features.
We now describe strategies for using \(\boldsymbol{W}^{(1)}\) to choose initial values for parameters in each category.
Algorithm 1 Similarity-based initialization
Similarity-Based Initialization for New Inputs
To initialize \(\varDelta \boldsymbol{W}_{ne}\), we leverage the fact that we can compute or estimate the similarity among inputs. Let K be a (D + d) × (D + d) kernel similarity matrix between the inputs to the larger neural network that we want to learn. We can estimate the weight between the ith new input (i.e., input D + i) and the jth hidden unit as a linear combination of the parameters for the existing inputs, weighted by each existing input’s similarity to the ith new input. This is shown in Algorithm 1.
The choice of K is a matter of preference and input type. A time-series-specific similarity measure might assign zero to each pair of inputs that represent different variables (i.e., different univariate time series) and otherwise emphasize temporal proximity using, e.g., a squared exponential kernel. A more general approach is to estimate similarity empirically, using sample covariance or cosine similarity. We find that the latter works well for both time series inputs and arbitrary hidden layers.
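A minimal sketch of this idea (our own simplification, not the exact Algorithm 1): each new input's weights are a similarity-weighted combination of the existing input columns of \(\boldsymbol{W}^{(1)}\):

```python
import numpy as np

def init_new_input_weights(W, K_sim):
    """Similarity-based initialization for new inputs (a sketch).

    W     : (H, D) existing input-to-hidden weights.
    K_sim : (D, d) similarity between existing and new inputs
            (e.g., cosine similarity of their observed values).
    Returns the (H, d) initial weights connecting new inputs to
    existing hidden units.
    """
    # Normalize each new input's similarities so they sum to one,
    # guarding against all-zero columns.
    S = K_sim / np.clip(K_sim.sum(axis=0, keepdims=True), 1e-12, None)
    return W @ S

W = np.array([[1., 3.],
              [2., 4.]])       # H = 2 hidden units, D = 2 existing inputs
K_sim = np.array([[3.],
                  [1.]])       # one new input, more similar to input 0
print(init_new_input_weights(W, K_sim))   # -> [[1.5], [2.5]]
```

The new input thus starts out behaving like a weighted blend of its most similar existing inputs, rather than as random noise.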
Algorithm 2 Gaussian sampling-based initialization
Sampling-Based Initialization for New Features
When initializing the weights in \(\varDelta \boldsymbol{W}_{en}\), we do not have a similarity structure to guide us, but the weights in \(\boldsymbol{W}^{(1)}\) still provide information. A simple but reasonable strategy is to sample random weights from the empirical distribution of entries in \(\boldsymbol{W}^{(1)}\). We have several choices here. The first regards whether to assume and estimate a parametric distribution (e.g., fit a Gaussian) or use a nonparametric approach, such as a kernel density estimator or histogram. The second regards whether to consider a single distribution over all weights or a separate distribution for each input.
For initializing the weights in \(\varDelta \boldsymbol{W}_{nn}\), which connect new inputs to new features, we could apply either strategy, as long as we have already initialized \(\varDelta \boldsymbol{W}_{en}\) and \(\varDelta \boldsymbol{W}_{ne}\). We found that estimating all new-feature weights (for existing or new inputs) from the same simple distribution (based on \(\boldsymbol{W}^{(1)}\)) worked best. Our full Gaussian sampling initialization strategy is shown in Algorithm 2.
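A minimal sketch of this sampling strategy, using the single-Gaussian choice discussed above (our own simplification of Algorithm 2):

```python
import numpy as np

def gaussian_sample_init(W, d_new_rows, n_cols, rng=None):
    """Sampling-based initialization for new hidden units: draw weights
    from a Gaussian fitted to the empirical distribution of the existing
    entries of W (one distribution over all weights)."""
    rng = rng or np.random.default_rng()
    mu, sigma = W.mean(), W.std()       # fit N(mu, sigma^2) to W's entries
    return rng.normal(mu, sigma, size=(d_new_rows, n_cols))

# Existing weights, e.g., a pretrained 100 x 50 first-layer matrix.
W = np.random.default_rng(0).normal(0.0, 0.1, size=(100, 50))
# Initialize 10 new hidden units over the enlarged 60-dimensional input.
dW_new = gaussian_sample_init(W, d_new_rows=10, n_cols=60,
                              rng=np.random.default_rng(1))
print(dW_new.shape)   # -> (10, 60)
```

The new rows thus start with the same weight scale as the trained network, which keeps the enlarged layer's activations in a sensible range before fine-tuning.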
Initializing Other Layers
This framework generalizes beyond the input and first layers. Adding d′ new hidden units to h′^{(1)} is equivalent to adding d′ new inputs to h′^{(2)}. If we compute the activations in h′^{(1)} for a given data set, these become the new inputs for h′^{(2)} and we can apply both the similarity and samplingbased strategies to initialize new entries in the expanded weight matrix \(\boldsymbol{W}'^{(2)}\). The same goes for all layers. While we can no longer design special similarity matrices to exploit known structure in the inputs, we can still estimate empirical similarity from training data activations in, e.g., h′^{(2)}.
Intuition suggests that if our initializations from the previous pretrained values are sufficiently good, we may be able to forgo pretraining and simply perform backpropagation. Thus, we choose to initialize with pretrained weights, then do supervised fine-tuning on all weights.
Interpretable Mimic Learning
In this section, we describe our simple and effective knowledge distillation framework: the Interpretable Mimic Learning method, also termed the GBT-mimic model, which trains Gradient Boosting Trees to mimic the performance of deep network models. Our mimic method aims to recognize interpretable features while maintaining the state-of-the-art classification performance of deep learning models.
Our interpretable mimic learning model using GBT has several advantages over existing methods. First, GBT is good at maintaining the performance of the original complex model, such as a deep network, by mimicking its predictions. Second, it provides better interpretability than the original model through its decision rules and tree structures. Furthermore, using soft targets from deep learning models avoids overfitting to the original data and provides good generalization, which standard decision tree methods cannot achieve.
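A minimal sketch of the mimic pipeline with scikit-learn; the "soft targets" here are a synthetic stand-in for a trained DNN's predicted probabilities, and the hyperparameters follow the experimental setup described later in this chapter (depth 3, 100 stages, shrinkage 0.1):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def train_mimic_gbt(X_train, soft_targets):
    """Fit Gradient Boosting Trees to a deep model's soft predictions
    (continuous scores), rather than to the original hard labels."""
    gbt = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                    max_depth=3)
    gbt.fit(X_train, soft_targets)   # soft_targets = dnn.predict(X_train)
    return gbt

rng = np.random.default_rng(0)
X = rng.random((200, 10))
# Stand-in for DNN output scores: a smooth function of two features.
soft = 1.0 / (1.0 + np.exp(-(X[:, 0] - X[:, 1])))
mimic = train_mimic_gbt(X, soft)
# Feature importances from the mimic model are what clinicians inspect.
print(mimic.feature_importances_.argmax())
```

In the real pipeline, `mimic.feature_importances_` and the individual tree rules provide the interpretations reported in the Interpretability section.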
Experiments
To evaluate our frameworks, we ran a series of classification and feature-learning experiments using several clinical time series datasets collected during the delivery of care in intensive care units (ICUs) at large hospitals. More details on these datasets are given in section “Dataset Descriptions”. In section “Benefits of Prior-Based Regularization”, we demonstrate the benefit of using priors (both knowledge- and data-driven) to regularize the training of multi-label neural nets. In section “Efficacy of Incremental Training”, we show that incremental training both speeds up the training of larger neural networks and preserves classification performance. We show the quantitative results of our interpretable mimic learning method in section “Interpretable Mimic Learning Results”, and the interpretations in section “Interpretability”.
Dataset Descriptions
We conduct the experiments on the following three real-world healthcare datasets.
Physionet Challenge 2012 Data
The first dataset comes from the PhysioNet Challenge 2012 website [30], a publicly available^{1} collection of multivariate clinical time series from 8000 ICU stays. Each episode is a multivariate time series of roughly 48 h containing over 30 variables. These data come from one ICU and four specialty units, including coronary care and cardiac and general surgery recovery units. We use the Training Set A subset, for which outcomes, including in-hospital mortality, are available. We resample the time series on an hourly basis and propagate measurements forward (or backward) in time to fill gaps. We scale each variable to fall within [0, 1]. We discuss the handling of entirely missing time series below.
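The hourly resampling, gap filling, and [0, 1] scaling can be sketched with pandas (the variable name and timestamps below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Irregularly sampled heart-rate measurements with a missing value.
ts = pd.DataFrame(
    {"hr": [80.0, np.nan, 95.0, np.nan]},
    index=pd.to_datetime(["2012-01-01 00:10", "2012-01-01 01:30",
                          "2012-01-01 03:20", "2012-01-01 05:00"]))

# Resample to an hourly grid, then propagate measurements forward
# (and backward, for leading gaps) in time.
hourly = ts.resample(pd.Timedelta(hours=1)).mean().ffill().bfill()

# Scale each variable to fall within [0, 1].
scaled = (hourly - hourly.min()) / (hourly.max() - hourly.min())
print(scaled["hr"].iloc[0], scaled["hr"].iloc[-1])   # -> 0.0 1.0
```

The same pattern applies per variable across all episodes; entirely missing series need the separate imputation discussed later.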
ICU Data
The second dataset consists of ICU clinical time series extracted from the electronic health records (EHR) system of a major hospital. The original dataset includes roughly ten thousand episodes of varying lengths, but we exclude episodes shorter than 12 h or longer than 128 h, yielding a dataset of 8500 multivariate time series of a dozen physiologic variables, which we resample once per hour and scale to [0, 1]. Each episode has zero or more associated diagnostic codes from the Ninth Revision of the International Classification of Diseases (ICD-9) [25]. From the raw 3–5 digit ICD-9 codes, we create a two-level hierarchy of labels and label categories using a two-step process. First, we truncate each code to the tens position (with some special cases handled separately), thereby merging related diagnoses and reducing the number of unique labels. Second, we treat the standard seventeen broad groups of codes (e.g., 460–519 for respiratory diseases), plus the supplementary V and E groups, as label categories. After excluding one category that is absent in our data, we have 67 unique labels and 19 categories.
VENT Data
The third dataset [21] consists of data from 398 patients with acute hypoxemic respiratory failure in the intensive care unit at Children’s Hospital Los Angeles (CHLA). It contains a set of 27 static features, such as demographic information and admission diagnoses, and another set of 21 temporal features (recorded daily), including monitoring features and discretized scores made by experts, during the initial 4 days of mechanical ventilation. The missing value rate of this dataset is 13.43%, with some patients/variables having a missing rate of > 30%. We perform simple imputation to fill the missing values, taking the majority value for binary variables and the empirical mean for the other variables.
Implementation Details
We implemented all neural networks on the Theano [4] and Keras [8] platforms, and the other baseline models with the scikit-learn [26] package. In the prior and incremental frameworks, we use multilayer perceptrons with up to five hidden layers (of the same size) of sigmoid units. The input layer has PT input units for P variables and T time steps, while the output layer has one sigmoid output unit per label. Except when we use our incremental training procedure, we initialize each neural network by training it as an unsupervised stacked denoising autoencoder (SDAE); this helps significantly because our datasets are relatively small and our labels are quite sparse. In the mimic learning framework, our DNN implementation has two hidden layers and one prediction layer, with each hidden layer twice as large as the input.
Benefits of Prior-Based Regularization
Our first set of experiments demonstrates the utility of using priors to regularize the training of multi-label neural networks, especially when labels are sparse and highly correlated or similar. From each time series, we extract all subsequences of length T = 12 in sliding window fashion, with an overlap of 50% (i.e., stride R = 0.5T), and each subsequence receives its episode’s labels (e.g., diagnostic code or outcome). We use these subsequences to train a single unsupervised SDAE with five layers and increasing levels of corruption (from 0.1 to 0.3), which we then use to initialize the weights for all supervised neural networks. The sparse multi-label nature of the data makes stratified k-fold cross-validation difficult, so we instead randomly generate a series of 80/20 random training/test splits of episodes and keep the first five that have at least one positive example for each label or category. At testing time, we measure classification performance for both frames and episodes. We make episode-level predictions by thresholding the mean score over all subsequences from that episode.
The ICU data set contains 8500 episodes varying in length from 12 to 128 h. The above subsequence procedure produces 50,000 subsequences. We treat the simultaneous prediction of all 86 diagnostic labels and categories as a multi-label prediction problem. This lends itself naturally to a tree-based prior because of the hierarchical structure of the labels and categories (Fig. 6a, b). However, we also test a data-based prior based on co-occurrence (Fig. 6c). Each neural network has an input layer of 156 units and five hidden layers of 312 units each.
The Physionet data set contains 3940 episodes, most of length 48 h, and yields 27,000 subsequences. These data have no natural label structure to leverage, so we simply test whether a data-based prior can improve performance. We create a small multi-label classification problem consisting of four binary labels with strong correlations, so that similarity-based regularization should help: in-hospital mortality (mortality), length-of-stay less than 3 days (los < 3), whether the patient had a cardiac condition (cardiac), and whether the patient was recovering from surgery (surgery). The mortality rate among patients with a length-of-stay of less than 3 days is nearly double the overall rate. The cardiac and surgery labels are created from a single original variable indicating which type of critical care unit the patient was admitted to; nearly 60% of cardiac patients had surgery. Figure 6d shows the co-occurrence similarity between the labels.
We impute missing time series (where a patient has no measurements of a variable) with the median value for patients in the same unit. This makes the cardiac and surgery prediction problems easier but serves to demonstrate the efficacy of our priorbased training framework. Each neural network has an input layer of 396 units and five hidden layers of 900 units each.
AUROC for classification
Level  Tasks  No prior  Co-occurrence  ICD-9 tree
Subsequence  All  0.7079 ± 0.0089  0.7169 ± 0.0087  0.7143 ± 0.0066
Subsequence  Categories  0.6758 ± 0.0078  0.6804 ± 0.0109  0.6710 ± 0.0070
Subsequence  Labels  0.7148 ± 0.0114  0.7241 ± 0.0093  0.7237 ± 0.0081
Episode  All  0.7245 ± 0.0077  0.7348 ± 0.0064  0.7316 ± 0.0062
Episode  Categories  0.6952 ± 0.0106  0.7010 ± 0.0136  0.6902 ± 0.0118
Episode  Labels  0.7308 ± 0.0099  0.7414 ± 0.0064  0.7407 ± 0.0070
Efficacy of Incremental Training

Full: Separately train each neural net, with unsupervised pretraining followed by supervised fine-tuning.

Incremental: Fully train the smallest (T_{ S } = 12) neural net and then use its weights to initialize supervised training of the next model (T_{ S } = 16). Repeat for subsequent networks.
AUROC for incremental training
Size  Level  Full  Inc  Prior+Full  Prior+Inc
16  Subseq.  0.6928  0.6874  0.6556  0.6581
16  Episode  0.7148  0.7090  0.6668  0.6744
20  Subseq.  0.6853  0.6593  0.6674  0.6746
20  Episode  0.7022  0.6720  0.6794  0.6944
24  Subseq.  0.7002  0.6969  0.6946  0.7008
24  Episode  0.7185  0.7156  0.7136  0.7171
Interpretable Mimic Learning Results

Baseline machine learning algorithms commonly used in the healthcare domain: Linear Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT), and Gradient Boosting Trees (GBT).

Neural network-based method (NN-based): Deep Neural Networks (DNN).

Our Interpretable Mimic Learning methods: for the NN-based method described above, we take its soft predictions and treat them as the training target of Gradient Boosting Trees. This method is denoted GBT-mimic-DNN.

Mortality (MOR) task: we predict whether the patient dies within 60 days after admission. In the dataset, there are 80 patients with a positive mortality label (patients who die).

Ventilator Free Days (VFD) task: we evaluate a surrogate outcome of morbidity and mortality (Ventilator Free Days, where a lower value is worse) by identifying patients who survive and are on a ventilator for longer than 14 days. Since a lower VFD is worse, a value ≤ 14 is a bad outcome; otherwise it is a good outcome. In the dataset, there are 235 patients with positive VFD labels (patients who survive and stay long enough on ventilators).
We train all the above methods with five different trials of five-fold random cross-validation. We run 50 epochs of stochastic gradient descent (SGD) with learning rate 0.001. For Decision Trees, we expand the nodes as deep as possible until all leaves are pure. For Gradient Boosting Trees, we use a stage shrinkage rate of 0.1 and a maximum of 100 boosting stages. We set the depth of each individual tree to 3, i.e., the number of terminal nodes is no more than 8, which is sufficient for boosting.
Classification results
Type  Method  MOR AUC (mean)  MOR AUC (std)  VFD AUC (mean)  VFD AUC (std)
Baseline  SVM  0.6431  0.059  0.7248  0.056
Baseline  LR  0.6888  0.068  0.7602  0.053
Baseline  DT  0.5965  0.081  0.6024  0.044
Baseline  GBT  0.7233  0.065  0.7630  0.051
NN-based  DNN  0.7288  0.084  0.7756  0.053
Mimic  GBT-mimic-DNN  0.7574  0.064  0.7835  0.054
Interpretability
Top features and corresponding importance scores
Task  Model          Features (importance scores)
MOR   GBT            MAP-D1 (0.052), PaO2-D2 (0.052), FiO2-D3 (0.037)
MOR   GBTmimic-DNN   MAP-D1 (0.031), δPF-D1 (0.031), PH-D1 (0.029)
VFD   GBT            MAP-D1 (0.035), MAP-D3 (0.033), PRISM12-ROM (0.030)
VFD   GBTmimic-DNN   MAP-D1 (0.042), PaO2-D0 (0.033), PRISM12-ROM (0.032)
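Importance scores of this kind come directly from the fitted tree ensemble. The sketch below shows the standard way to rank features by impurity-based importance in scikit-learn; the feature names are hypothetical placeholders for clinical variables such as mean airway pressure on a given day.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical feature names standing in for clinical variables.
feature_names = [f"feat{i}" for i in range(10)]
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A GBT mimic model is a regressor over soft targets; hard labels are
# cast to float here purely for illustration.
gbt = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
gbt.fit(X, y.astype(float))

# Rank features by their impurity-based importance, which is one common way
# to produce top-features tables like the one above.
ranked = sorted(zip(feature_names, gbt.feature_importances_),
                key=lambda p: p[1], reverse=True)
for name, score in ranked[:3]:
    print(f"{name} ({score:.3f})")
```

Because the mimic model is a single tree ensemble rather than a deep network, this ranking (and the individual trees) can be handed to clinicians for inspection.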
Discussion on Mobile Health
The deep learning solutions proposed in this chapter are general and applicable to a wide variety of time series healthcare data, including longitudinal data from electronic health records (EHR), sensor data from intensive care units (ICU), and sensor data from mobile health devices. Our frameworks are well suited to mobile healthcare data. For example, our incremental training approach supports time series classification in an online manner and can therefore efficiently exploit real-time sensor data collected on mobile devices. This enables real-time mobile health analytics that improve prediction outcomes and reduce healthcare costs.
Summary
In this chapter, we introduced a general framework based on deep learning for representation learning from time series health data. It can incorporate prior knowledge, such as formal ontologies (e.g., ICD-9 codes) and data-derived similarity, into deep learning models. Moreover, we presented a fast and scalable training procedure for deep network architectures of different sizes. We also proposed a simple yet effective knowledge-distillation approach, Interpretable Mimic Learning, which learns interpretable features for robust prediction while matching the performance of deep learning models. Experimental results on several real-world hospital datasets demonstrate the empirical efficacy and interpretability of our mimic models.
References
1. Ando, R.K., Zhang, T.: Learning on graph with Laplacian regularization. In: NIPS (2007)
2. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: NIPS, pp. 2654–2662 (2014)
3. Bahadori, M.T., Yu, Q.R., Liu, Y.: Fast multivariate spatio-temporal analysis via low rank tensor learning. In: NIPS (2014)
4. Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I.J., Bergeron, A., Bouchard, N., Bengio, Y.: Theano: new features and speed improvements. In: Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop (2012)
5. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. (2013)
6. Bonner, G.: Decision making for health care professionals: use of decision trees within the community mental health setting. Journal of Advanced Nursing 35(3), 349–356 (2001)
7. Buciluǎ, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541. ACM (2006)
8. Chollet, F.: Keras: Theano-based deep learning library. Code: https://github.com/fchollet. Documentation: http://keras.io
9. Dahl, G., Yu, D., Deng, L., Acero, A.: Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio, Speech, Language Process. (2012)
10. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)
11. Denil, M., Shakibi, B., Dinh, L., Ranzato, M., de Freitas, N.: Predicting parameters in deep learning. In: NIPS (2013)
12. Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing higher-layer features of a deep network. Dept. IRO, Université de Montréal, Tech. Rep. 4323 (2009)
13. Fan, C.-Y., Chang, P.-C., Lin, J.-J., Hsieh, J.: A hybrid model combining case-based reasoning and fuzzy decision tree for medical data classification. Applied Soft Computing 11(1), 632–644 (2011)
14. Goldberger, A., Amaral, L.N., Glass, L., Hausdorff, J., Ivanov, P., Mark, R., Mietus, J., Moody, G., Peng, C., Stanley, H.: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation (2000)
15. Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: ICML, pp. 1764–1772 (2014)
16. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
17. Ho, J.C., Ghosh, J., Sun, J.: Marble: high-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization. In: KDD (2014)
18. Kale, D., Che, Z., Liu, Y., Wetzel, R.: Computational discovery of physiomes in critically ill children using deep learning. In: DMMI Workshop, AMIA (2014)
19. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)
20. Kerr, K.F., Bansal, A., Pepe, M.S.: Further insight into the incremental value of new markers: the interpretation of performance measures and the importance of clinical context. American Journal of Epidemiology (2012)
21. Khemani, R.G., Conti, D., Alonzo, T.A., Bart III, R.D., Newth, C.J.: Effect of tidal volume in children with acute hypoxemic respiratory failure. Intensive Care Medicine 35(8), 1428–1437 (2009)
22. Lasko, T.A., Denny, J., Levy, M.: Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PLoS ONE (2013)
23. Marlin, B., Kale, D., Khemani, R., Wetzel, R.: Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In: IHI (2012)
24. Mikolov, T., Deoras, A., Kombrink, S., Burget, L., Černocký, J.: Empirical evaluation and combination of advanced language modeling techniques. In: INTERSPEECH (2011)
25. World Health Organization: International statistical classification of diseases and related health problems (2004)
26. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. JMLR (2011)
27. Peleg, M., Tu, S., Bury, J., Ciccarese, P., Fox, J., Greenes, R.A., Hall, R., Johnson, P.D., Jones, N., Kumar, A., et al.: Comparing computer-interpretable guideline models: a case-study approach. Journal of the American Medical Informatics Association 10(1), 52–68 (2003)
28. Quinlan, J.R.: Induction of decision trees. Machine Learning 1(1), 81–106 (1986)
29. Schulam, P., Wigley, F., Saria, S.: Clustering longitudinal clinical marker trajectories from electronic health data: applications to phenotyping and endotype discovery (2015)
30. Silva, I., Moody, G., Scott, D.J., Celi, L.A., Mark, R.G.: Predicting in-hospital mortality of ICU patients: the PhysioNet/Computing in Cardiology Challenge 2012. Computing in Cardiology (2012)
31. Socher, R., Huang, E., Pennin, J., Manning, C.D., Ng, A.Y.: Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In: NIPS (2011)
32. Srivastava, N., Salakhutdinov, R.R.: Discriminative transfer learning with tree-based priors. In: NIPS, pp. 2094–2102 (2013)
33. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013)
34. Torralba, A., Fergus, R., Freeman, W.T.: 80 million tiny images: a large data set for nonparametric object and scene recognition. PAMI (2008)
35. Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. In: ACL (2010)
36. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A.: Extracting and composing robust features with denoising autoencoders. In: ICML (2008)
37. Weinberger, K.Q., Sha, F., Zhu, Q., Saul, L.K.: Graph Laplacian regularization for large-scale semidefinite programming. In: NIPS (2006)
38. Wu, G., Kim, M., Wang, Q., Gao, Y., Liao, S., Shen, D.: Unsupervised deep feature learning for deformable registration of MR brain images. In: MICCAI (2013)
39. Wu, R., Yan, S., Shan, Y., Dang, Q., Sun, G.: Deep Image: scaling up image recognition. arXiv preprint arXiv:1501.02876 (2015)
40. Xiang, T., Ray, D., Lohrenz, T., Dayan, P., Montague, P.R.: Computational phenotyping of two-person interactions reveals differential neural response to depth-of-thought. PLoS Comput. Biol. (2012)
41. Yao, Z., Liu, P., Lei, L., Yin, J.: RC4.5 decision tree model and its applications to health care dataset. In: Proceedings of ICSSSM'05, vol. 2, pp. 1099–1103. IEEE (2005)
42. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV 2014, pp. 818–833. Springer (2014)
43. Zhang, T., Popescul, A., Dom, B.: Linear prediction models with graph regularization for web page categorization. In: KDD (2006)
44. Zhou, G., Sohn, K., Lee, H.: Online incremental feature learning with denoising autoencoders. In: AISTATS (2012)
45. Zhou, J., Wang, F., Hu, J., Ye, J.: From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records. In: KDD (2014)