Mobile Health pp 389-409

Time Series Feature Learning with Applications to Health Care

  • Zhengping Che
  • Sanjay Purushotham
  • David Kale
  • Wenzhe Li
  • Mohammad Taha Bahadori
  • Robinder Khemani
  • Yan Liu


Exponential growth in mobile health devices and electronic health records has resulted in a surge of large-scale time series data, which demands effective and fast machine learning models for analysis and discovery. In this chapter, we discuss a novel framework based on deep learning which automatically performs feature learning from heterogeneous time series data. It is well-suited for healthcare applications, where available data have many sparse outputs (e.g., rare diagnoses) and exploitable structures (e.g., temporal order and relationships between labels). Furthermore, we introduce a simple yet effective knowledge-distillation approach to learn an interpretable model while achieving the prediction performance of deep models. We conduct experiments on several real-world datasets and show the empirical efficacy of our framework and the interpretability of the mimic models.


The breakthroughs in sensor technologies and wearable devices in health domains have led to a surge of large-volume time series data. This offers an unprecedented opportunity to improve future care by learning from past patient encounters. One important step towards this goal is learning richer and more meaningful features so that accurate prediction and effective analysis can be performed. Representation learning from time series health care data has attracted many machine learning and data mining researchers [17, 40, 45]. Most work has focused on discovering cluster structure (e.g., disease subtypes) via variations of Gaussian processes [23, 29].

Learning robust representations of time series health care data is especially challenging because the underlying causes of health and wellness span body systems and physiologic processes, creating complex and nonlinear relationships among observed measurements (e.g., patients with septic shock may exhibit fever or hypothermia). Whereas classic shallow models (e.g., cluster models) may struggle in such settings, properly trained deep neural networks can often discover, model, and disentangle these types of latent factors [5] and extract meaningful abstract concepts from simple data types [19]. Because of these properties, deep learning has achieved state of the art results in speech recognition [15] and computer vision [39] and is well-suited to time series health data. Several recent works have demonstrated the potential of deep learning to derive insight from clinical data [18, 22, 38]. Nonetheless, the practical reality of neural networks remains challenging, and we face a variety of questions when applying them to a new problem:

First, do we have enough data? Deep learning’s success is often associated with massive data sets, with potentially millions of examples [10, 34], but in medicine “big data” often means an Electronic Health Records (EHRs) database [14, 23] with tens of thousands of cases. Other questions regard data preprocessing, model architecture, training procedures, etc. Answering these questions often requires time-consuming trial and error.

Second, in health domains, model interpretability is not only important but also necessary, since primary care providers, physicians, and clinical experts alike depend on new healthcare technologies to help them in monitoring and decision-making for patient care. A good interpretable model has been shown to result in faster adoption among clinical staff and in better quality of patient care [20, 27]. Therefore, we need to identify novel solutions that provide interpretable models and achieve prediction performance similar to that of deep models in the healthcare domain.

In this chapter, we explore and propose solutions to the challenges above. By exploiting unique properties of both our domain (e.g., ontologies) and our data (e.g., temporal order in time series), we can improve the performance of deep neural networks and make the training process more efficient. Our main contributions are as follows:
  • We formulate a prior-based regularization framework for guiding the training of multi-label neural networks using medical ontologies and other structured knowledge. Our formulation is based on graph Laplacian priors [1, 3, 37, 43], which can represent any graph structure and incorporate arbitrary relational information. We apply graph Laplacian priors to the problem of training neural networks to classify physiologic time series with diagnostic labels, where there are many labels and severe class imbalance. Our framework is general enough to incorporate data-driven (e.g., comorbidity patterns) and hybrid priors.

  • We propose an efficient incremental training procedure for building a series of neural networks that detect meaningful patterns of increasing length. We use the parameters of an existing neural net to initialize the training of a new neural net designed to detect longer temporal patterns. This technique exploits both the well-known low rank structure of neural network weight matrices [11] and structure in our data domain, including temporal smoothness.

  • We propose a novel knowledge distillation methodology called Interpretable Mimic Learning, in which we mimic the performance of state-of-the-art deep learning models using well-known Gradient Boosting Trees (GBT). Our experiments on a real-world hospital dataset show that our proposed Interpretable Mimic Learning models can achieve state-of-the-art performance comparable to the deep learning models. We discuss the interpretable features learned by our Interpretable Mimic Learning models, which are validated by expert clinicians.

The proposed deep learning solutions in this chapter are general and are applicable to a wide variety of time series healthcare data including longitudinal data from electronic healthcare records (EHR), sensor data from intensive care units (ICU), sensor data from mobile health devices and so on. We will use an example of computational phenotyping from ICU time series data to demonstrate the effectiveness of our proposed approach.

Related Work

Deep learning approaches have achieved breakthrough results in several sequential and temporal data domains including language modeling [24], speech recognition [9, 15], and paraphrase detection [31]. We expect similar results in more general time series data domains, including healthcare. In natural language processing, distributed representations of words, learned from context using neural networks, have provided huge boosts in performance [35]. Our use of neural networks to learn representations of time series is similar: a window of time series observations can be viewed as the context for a single observation within that window.

In medical applications, many predictive tasks suffer from severe class imbalance since most conditions are rare. One possible remedy is to use side information, such as class hierarchy, as a rich prior to prevent overfitting and improve performance. Reference [32] is the first work that combines a deep architecture with a tree-based prior to encode relations among different labels and label categories, but their work is limited to modeling a restricted class of side information.

Our incremental training method has clear and interesting connections to ongoing research into efficient methods for training deep architectures. It can be viewed as a greedy method for building deep architectures horizontally by adding units to one or more layers, and can be connected to two recent papers: Zhou et al. [44] described a two-step incremental approach to feature learning in an online setting and focused on data drift. Denil et al. [11] described an approach for predicting parameters of neural networks by exploiting the smoothness of input data and the low rank structure of weight matrices.

As pointed out in the introduction, model interpretability is not only important but also necessary in the healthcare domain. Decision trees [28], due to their easy interpretability, have been quite successfully employed in the healthcare domain [6, 13, 41], and clinicians have embraced them to make informed decisions. However, decision trees can easily overfit, and they do not achieve good performance on datasets with missing values, which are common in today's healthcare datasets. On the other hand, deep learning models have achieved remarkable performance in healthcare but are hard to interpret. Some recent works on deep learning interpretability in the computer vision field [12, 33, 42] show that interpreting deep learning features is possible but that the behavior of deep models may be more complex than previously believed. Therefore, we believe there is a need to identify novel solutions that provide interpretable models and achieve prediction performance similar to that of deep models.

Mimicking the performance of deep learning models using shallow models is a recent breakthrough in deep learning which has captured the attention of the machine learning community. Ba and Caruana [2] showed empirically that shallow neural networks are capable of learning the same function as deep neural networks. They demonstrated this by first training a state-of-the-art deep model, and then training a shallow model to mimic the deep model. Motivated by the model compression idea from [7], the authors of [16] proposed an efficient knowledge distillation approach to transfer (dark) knowledge from model ensembles into a single model. These previous works motivate us to employ a mimic learning strategy to learn an interpretable model from a well-trained deep neural network.


In this section, we describe our framework for performing effective deep learning on clinical time series data. We begin by discussing the Laplacian graph-based prior framework that we use to perform regularization when training multi-label neural networks. This allows us to effectively train neural networks, even with smaller data sets, and to exploit structured domain knowledge, such as ontologies. We then describe our incremental neural network procedure, which we developed in order to rapidly train a collection of neural networks to detect physiologic patterns of increasing length. Finally we describe a simple but effective knowledge distillation framework which recognizes interpretable features while maintaining the state-of-the-art classification performance of the deep learning models.

General Framework

Given a multivariate time series with P variables and length T, we can represent it as a matrix \(\mathbf{X} \in \mathbb{R}^{P\times T}\). A feature map for time series X is a function \(g: \mathbb{R}^{P\times T}\mapsto \mathbb{R}^{D}\) that maps X to a vector of features \(\mathbf{x} \in \mathbb{R}^{D}\) useful for machine learning tasks like classification, segmentation, and indexing. Given the recent successes of deep learning, it is natural to investigate its effectiveness for feature learning in clinical time series data.

Suppose we have a data set of N multivariate time series, each with P variables and K binary labels. Without loss of generality, we assume all time series have the same length T. After a simple mapping that stacks all T column vectors in \(\mathbf{X}\) into one vector \(\mathbf{x}\), we have N labeled instances \(\{\mathbf{x}_{i},\mathbf{y}_{i}\}_{i=1}^{N}\), where \(\mathbf{x}_{i} \in \mathbb{R}^{D},\mathbf{y}_{i} \in \{ 0,1\}^{K},D = PT\). The goal of multi-label classification is to learn a function f which can be used to assign a set of labels to each instance \(\mathbf{x}_{i}\) such that \(y_{ij} = 1\) if the jth label is assigned to the instance \(\mathbf{x}_{i}\) and 0 otherwise.
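The stacking step above can be sketched in a few lines of NumPy. This is only an illustration: the function name `stack_columns` and the toy matrix are assumptions, not part of the original work.

```python
import numpy as np

def stack_columns(X):
    """Stack the T column vectors of a P x T series into one vector of length D = P*T."""
    X = np.asarray(X, dtype=float)
    # order="F" walks the matrix column by column, i.e. time step by time step
    return X.reshape(-1, order="F")

# Toy series (hypothetical values): P = 2 variables, T = 3 time steps
X_toy = np.array([[0.0, 1.0, 2.0],
                  [3.0, 4.0, 5.0]])
x_toy = stack_columns(X_toy)
```

Here `x_toy` has length D = PT = 6, and its first P entries are the observations at the first time step.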

We use a deep feed-forward neural network, as shown in Fig. 1, with L hidden layers and an output prediction layer. We use \(\boldsymbol{\varTheta }= (\boldsymbol{\varTheta }_{hid},\boldsymbol{B})\) to denote the model parameters. \(\boldsymbol{\varTheta }_{hid} =\{\langle \boldsymbol{ W}^{(\ell)},\boldsymbol{b}^{(\ell)}\rangle \}_{\ell=1}^{L}\) denotes the weights for the hidden layers (each with \(D^{(\ell)}\) units), and the K columns \(\boldsymbol{\beta }_{k} \in \mathbb{R}^{D^{(L)} }\) of \(\boldsymbol{B} = [\boldsymbol{\beta }_{1}\boldsymbol{\beta }_{2}\cdots \boldsymbol{\beta }_{K}]\) are the prediction parameters. For convenience we denote \(\mathbf{h}^{(0)} = \mathbf{x}\) and \(D^{(0)} = D\).
Fig. 1

A miniature illustration of the deep network with the regularization on categorical structure

Throughout this section, we assume a neural network with fully connected layers, linear activation (\(\boldsymbol{W}^{(\ell)}\mathbf{h}^{(\ell-1)} +\boldsymbol{ b}^{(\ell)}\)) and sigmoid nonlinearities (\(\sigma (z) = 1/(1 +\exp \{ - z\})\)). We pretrain each hidden layer as a denoising autoencoder (DAE) [36] by minimizing the reconstruction loss using stochastic gradient descent. In the supervised training stage, without any regularization, we treat multi-label classification as K separate logistic regressions, so the neural net has K sigmoid output units. To simplify the notation, we let \(\mathbf{h}_{i} = \mathbf{h}_{i}^{(L)} \in \mathbb{R}^{D^{(L)} }\) denote the output of the top hidden layer for each instance \(\mathbf{x}_{i}\). The conditional likelihood of \(\mathbf{y}_{i}\) given \(\mathbf{x}_{i}\) and model parameters \(\boldsymbol{\varTheta }\) can be written as
$$\displaystyle\begin{array}{rcl} \log & \ p(\mathbf{y}_{i}\vert \mathbf{x}_{i},\boldsymbol{\varTheta }) =\sum _{ k=1}^{K}\left [y_{ik}\log \sigma (\boldsymbol{\beta }_{k}^{\top }\mathbf{h}_{i}) + (1 - y_{ik})\log (1 -\sigma (\boldsymbol{\beta }_{k}^{\top }\mathbf{h}_{i}))\right ]& {}\\ \end{array}$$

Our framework can easily be extended to other network architectures, hidden unit types, and training procedures.
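As a minimal sketch of the conditional log-likelihood above, the following NumPy code evaluates it for a batch of top-layer activations. The function names and the clipping constant `eps` are assumptions added for numerical safety; they are not taken from the original implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_log_likelihood(H, B, Y, eps=1e-12):
    """Per-instance log p(y_i | x_i, Theta) for K independent sigmoid outputs.

    H: N x D_L top hidden-layer activations h_i (one row per instance)
    B: D_L x K prediction parameters (column k is beta_k)
    Y: N x K binary label matrix
    """
    P = sigmoid(H @ B)               # P[i, k] = sigma(beta_k^T h_i)
    P = np.clip(P, eps, 1.0 - eps)   # assumption: clip to avoid log(0)
    return np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P), axis=1)

# Toy check (hypothetical values): zero activations give sigma = 0.5 for every label
H_toy = np.zeros((2, 4))
B_toy = np.ones((4, 3))
Y_toy = np.array([[1, 0, 1], [0, 0, 0]])
ll_toy = multilabel_log_likelihood(H_toy, B_toy, Y_toy)
```

With all predicted probabilities equal to 0.5, each instance contributes K·log(0.5) regardless of its labels.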

Prior-Based Regularization

Deep neural networks are known to work best in big data scenarios with many training examples. When we have access to only a few examples of each class label, incorporating prior knowledge can improve learning. Thus, it is useful to have a general framework able to incorporate a wider range of prior information in a unified way. Graph Laplacian-based regularization [1, 3, 37, 43] provides one such framework and is able to incorporate any relational information that can be represented as a (weighted) graph, including the tree-based prior as a special case.

Given a matrix \(\mathbf{A} \in \mathbb{R}^{K\times K}\) representing pairwise connections or similarities, the Laplacian matrix is defined as \(\mathbf{L} = \mathbf{C} -\mathbf{A}\), where \(\mathbf{C}\) is a diagonal matrix with kth diagonal element \(\mathrm{C}_{k,k} =\sum _{ k^{{\prime}}=1}^{K}(\mathrm{A}_{k,k^{{\prime}}})\). Given a set of K vectors of parameters \(\boldsymbol{\beta }_{k} \in \mathbb{R}^{D^{(L)} }\) and \(\boldsymbol{\beta } = [\boldsymbol{\beta }_{1}\boldsymbol{\beta }_{2}\cdots \boldsymbol{\beta }_{K}]^{\top }\), we have
$$\displaystyle\begin{array}{rcl} \mathrm{tr}(\boldsymbol{\beta }^{\top }\mathbf{L}\boldsymbol{\beta }) = \frac{1} {2}\sum _{1\leq k,k^{{\prime}}\leq K}\mathrm{A}_{k,k^{{\prime}}}\|\boldsymbol{\beta }_{k} -\boldsymbol{\beta }_{k^{{\prime}}}\|_{2}^{2},& & {}\\ \end{array}$$
where tr(⋅ ) represents the trace operator, the graph Laplacian regularizer enforces the parameters \(\boldsymbol{\beta }_{k}\) and \(\boldsymbol{\beta }_{k^{{\prime}}}\) to be similar, proportional to \(A_{k,k^{{\prime}}}\). The Laplacian regularizer can be combined with other regularizers \(R(\boldsymbol{\varTheta })\) (e.g., the Frobenius norm \(\|\boldsymbol{W}^{(\ell)}\|_{F}^{2}\) to keep hidden layer weights small), yielding the regularized loss function
$$\displaystyle\begin{array}{rcl} \mathscr{L} = -\sum _{i=1}^{N}\log p(\mathbf{y}_{ i}\vert \mathbf{x}_{i},\boldsymbol{\varTheta }) +\lambda R(\boldsymbol{\varTheta }) + \frac{\rho } {2}\mathrm{tr}(\boldsymbol{\beta }^{\top }\mathbf{L}\boldsymbol{\beta })& & {}\\ \end{array}$$
where ρ, λ > 0 are the Laplacian and other regularization hyperparameters, respectively. Note that the graph Laplacian regularizer is quadratic in terms of parameters and so does not add significantly to the computational cost.
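The identity behind the Laplacian regularizer can be checked numerically. The sketch below builds the Laplacian as the degree matrix minus the similarity matrix and compares the trace form with the pairwise-distance form; it assumes a symmetric similarity matrix, and all names and random values are illustrative.

```python
import numpy as np

def graph_laplacian(A):
    """Laplacian = diagonal degree matrix minus similarity matrix."""
    A = np.asarray(A, dtype=float)
    return np.diag(A.sum(axis=1)) - A

def laplacian_penalty(B, A):
    """tr(beta^T L beta), where beta = B^T is K x D_L and B has columns beta_k."""
    beta = B.T
    return np.trace(beta.T @ graph_laplacian(A) @ beta)

def pairwise_penalty(B, A):
    """0.5 * sum_{k,k'} A[k,k'] * ||beta_k - beta_k'||^2, computed directly."""
    K = A.shape[0]
    total = 0.0
    for k in range(K):
        for kp in range(K):
            total += A[k, kp] * np.sum((B[:, k] - B[:, kp]) ** 2)
    return 0.5 * total

rng = np.random.default_rng(0)
A_sym = rng.random((4, 4))
A_sym = (A_sym + A_sym.T) / 2          # assumption: similarities are symmetric
B_rand = rng.normal(size=(5, 4))       # D_L = 5, K = 4
trace_form = laplacian_penalty(B_rand, A_sym)
pair_form = pairwise_penalty(B_rand, A_sym)
```

The two values agree up to floating-point error, which is why labels with large similarity are pulled toward similar prediction parameters.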

The graph Laplacian regularizer can represent any pairwise relationships between parameters. Here we discuss how to use different types of priors and the corresponding Laplacian regularizers to incorporate both structured domain knowledge (e.g., label hierarchies based on medical ontologies) and empirical similarities.

Structured Domain Knowledge as a Tree-Based Prior

The graph Laplacian regularizer can represent a tree-based prior based on hierarchical relationships found in medical ontologies. In our experiments, we use diagnostic codes from the Ninth Revision of the International Classification of Diseases (ICD-9) system [25], which are widely used for classifying diseases and coding hospital data. The three digits (and two optional decimal digits) in each code form a natural hierarchy including broad body system categories (e.g., Respiratory), individual diseases (e.g., Pneumonia), and subtypes (e.g., viral vs. pneumococcal pneumonia). The right part of Fig. 1 illustrates two levels of the hierarchical structure of the ICD-9 codes. When using ICD-9 codes as labels, we can treat their ontological structure as prior knowledge. If two diseases belong to the same category, then we add an edge between them in the adjacency graph A.
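One way to read the edge rule above is sketched here: two labels are connected whenever they share a category. The function name and the toy category assignments are hypothetical, and unit edge weights are an assumption.

```python
import numpy as np

def tree_prior_adjacency(categories):
    """A[k, k'] = 1 if labels k and k' belong to the same category (k != k'), else 0."""
    c = np.asarray(categories)
    A = (c[:, None] == c[None, :]).astype(float)
    np.fill_diagonal(A, 0.0)   # no self-edges; they cancel in the Laplacian anyway
    return A

# Toy example: two respiratory labels and one circulatory label
A_tree = tree_prior_adjacency(["respiratory", "respiratory", "circulatory"])
```

The resulting matrix is block-structured by category, which is the kind of pattern a tree-based prior produces.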

Data-Driven Similarity as a Prior

Laplacian regularization is not limited to prior knowledge in the form of trees or ontologies. It can also incorporate empirical priors, in the form of similarity matrices, estimated from data. For example, we can use the co-occurrence matrix \(\mathbf{A} \in \mathbb{R}^{K\times K}\) whose elements are defined as follows:
$$\displaystyle{ \mathrm{A}_{k,k^{{\prime}}} = \frac{1} {N}\sum _{i=1}^{N}\mathscr{I}(y_{ ik}y_{ik^{{\prime}}} = 1) }$$
where N is the total number of the training data points, and \(\mathscr{I}(\cdot )\) is the indicator function. Given the fact that \(\mathrm{A}_{k,k^{{\prime}}}\) is the maximum likelihood estimation of the joint probability \(\mathbb{P}\{y_{ik} = 1,y_{ik^{{\prime}}} = 1\}\), regularization with the Laplacian constructed from the co-occurrence similarity matrix encourages the learning algorithm to find a solution for the deep network that predicts the pair-wise joint probability of the labels accurately. The co-occurrence similarity matrices of the labels in two datasets are shown in Fig. 6c and d.
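For binary labels, the co-occurrence matrix defined above reduces to a single matrix product. A minimal sketch, with a hypothetical toy label matrix:

```python
import numpy as np

def cooccurrence_matrix(Y):
    """A[k, k'] = fraction of training instances with y_ik = 1 and y_ik' = 1."""
    Y = np.asarray(Y, dtype=float)
    return (Y.T @ Y) / Y.shape[0]

# Toy labels: N = 4 instances, K = 3 labels
Y_labels = np.array([[1, 1, 0],
                     [1, 0, 0],
                     [0, 1, 1],
                     [1, 1, 0]])
A_cooc = cooccurrence_matrix(Y_labels)
```

Labels 1 and 2 co-occur in two of four instances, so the corresponding entry is 0.5; the diagonal holds each label's marginal frequency.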

Incremental Training

Next we describe our algorithm for efficiently training a series of deep models to discover and detect physiologic patterns of varying lengths. This framework utilizes a simple and robust strategy for incremental learning of larger neural networks from smaller ones by iteratively adding new units to one or more layers. Our strategy is founded upon intelligent initialization of the larger network’s parameters using those of the smaller network.

Given a multivariate time series \(\mathbf{X} \in \mathbb{R}^{P\times T}\), there are two ways in which to use feature maps of varying or increasing lengths. The first would be to perform time series classification in an online setting in which we want to regularly re-classify a time series based on all available data. For example, we might want to re-classify (or diagnose) a patient after each new observation while also including all previous data. Second, we can apply a feature map g designed for a shorter time series of length \(T_{S}\) to a longer time series of length \(T > T_{S}\) using the sliding window approach: we apply g as a filter to subsequences of size \(T_{S}\) with stride \(R_{S}\) (there will be approximately \(\frac{T-T_{S}+1} {R_{S}}\) such subsequences). Proper choice of window size \(T_{S}\) and stride \(R_{S}\) is critical for producing effective features. However, there is often no way to choose the right \(T_{S}\) and \(R_{S}\) beforehand without a priori knowledge (often unavailable). What is more, in many applications, we are interested in multiple tasks (e.g., patient diagnosis and risk quantification), for which different values of \(T_{S}\) and \(R_{S}\) may work best. Thus, generating and testing features for many \(T_{S}\) and \(R_{S}\) is useful and often necessary. Doing this with neural nets can be computationally expensive and time-consuming.
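The sliding window approach can be sketched as follows. The function name and toy data are assumptions. This version keeps only complete windows, so it returns floor((T − T_S)/R_S) + 1 subsequences.

```python
import numpy as np

def sliding_windows(X, window, stride):
    """Extract all complete subsequences of length `window` from a P x T series.

    Returns an array of shape (n_windows, P, window).
    """
    X = np.asarray(X)
    T = X.shape[1]
    if window > T:
        raise ValueError("window is longer than the series")
    starts = range(0, T - window + 1, stride)
    return np.stack([X[:, s:s + window] for s in starts])

# Toy series: P = 2 variables, T = 10 time steps
X_series = np.arange(20).reshape(2, 10)
windows = sliding_windows(X_series, window=4, stride=2)
```

Each extracted window can then be stacked into a vector and passed through the feature map g.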

To address this, we propose an incremental training procedure that leverages a neural net trained on windows of size \(T_{S}\) to initialize and accelerate the training of a new neural net that detects patterns of length \(T = T_{S} +\varDelta T_{S}\) (i.e., \(\varDelta T_{S}\) additional time steps). That is, the input size of the first layer changes from \(D = PT_{S}\) to \(D^{{\prime}} = D + d = PT_{S} + P\varDelta T_{S}\).

Suppose that the existing and new networks have \(D^{(1)}\) and \(D^{(1)} + d^{(1)}\) hidden units in their first hidden layers, respectively. Recall that we compute the activations in our first hidden layer according to the formula \(\mathbf{h}^{(1)} =\sigma (\boldsymbol{W}^{(1)}\mathbf{x} +\boldsymbol{ b}^{(1)})\). This makes \(\boldsymbol{W}^{(1)}\) a \(D^{(1)} \times D\) matrix and \(\boldsymbol{b}^{(1)}\) a \(D^{(1)}\)-vector; we have a row for each feature (hidden unit) in \(\mathbf{h}^{(1)}\) and a column for each input in \(\mathbf{x}\). From here on, we will treat the bias \(\boldsymbol{b}^{(1)}\) as a column in \(\boldsymbol{W}^{(1)}\) corresponding to a constant input and omit it from our notation.

The larger neural network has a \((D^{(1)} + d^{(1)}) \times (D + d)\) weight matrix \(\boldsymbol{W}^{{\prime}}{}^{(1)}\). The first D columns of \(\boldsymbol{W}^{{\prime}}{}^{(1)}\) correspond exactly to the D columns of \(\boldsymbol{W}^{(1)}\) because they take the same D inputs. In time series data, these inputs are the observations in the same \(T_{S} \times P\) matrix. We cannot guarantee the same identity for the first \(D^{(1)}\) rows of \(\boldsymbol{W}^{{\prime}}{}^{(1)}\), which correspond to the first \(D^{(1)}\) hidden units of \(\mathbf{h}^{{\prime}}{}^{(1)}\); nonetheless, we can make a reasonable assumption that these hidden units are highly similar to \(\mathbf{h}^{(1)}\). Thus, we can think of constructing \(\boldsymbol{W}^{{\prime}}{}^{(1)}\) by adding d new columns and \(d^{(1)}\) new rows to \(\boldsymbol{W}^{(1)}\).

As illustrated in Fig. 2, the new weights can be divided into three categories.
Fig. 2

How adding various units changes the weights \(\boldsymbol{W}\)

  • \(\varDelta \boldsymbol{W}_{ne}\): \(D^{(1)} \times d\) weights that connect new inputs to existing features.

  • \(\varDelta \boldsymbol{W}_{en}\): \(d^{(1)} \times D\) weights that connect existing inputs to new features.

  • \(\varDelta \boldsymbol{W}_{nn}\): \(d^{(1)} \times d\) weights that connect new inputs to new features.
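The block structure described by the three categories above can be made concrete with `numpy.block`. This sketch assumes the existing weights occupy the top-left corner, with rows as features and columns as inputs; the sizes are arbitrary.

```python
import numpy as np

def assemble_expanded_weights(W, dW_ne, dW_en, dW_nn):
    """Build the (D1 + d1) x (D + d) matrix from the existing weights and the new blocks.

    W:     D1 x D  existing inputs -> existing features
    dW_ne: D1 x d  new inputs      -> existing features
    dW_en: d1 x D  existing inputs -> new features
    dW_nn: d1 x d  new inputs      -> new features
    """
    return np.block([[W, dW_ne],
                     [dW_en, dW_nn]])

# Toy sizes (hypothetical): D1 = 3, D = 4, d1 = 5, d = 2
W_old = np.ones((3, 4))
W_big = assemble_expanded_weights(W_old, np.zeros((3, 2)), np.zeros((5, 4)), np.zeros((5, 2)))
```

The existing weights are carried over unchanged; only the three new blocks need initial values.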

We now describe strategies for using \(\boldsymbol{W}^{(1)}\) to choose initial values for parameters in each category.

Algorithm 1 Similarity-based initialization

Similarity-Based Initialization for New Inputs

To initialize \(\varDelta \boldsymbol{W}_{ne}\), we leverage the fact that we can compute or estimate the similarity among inputs. Let K be a (D + d) × (D + d) kernel similarity matrix between the inputs to the larger neural network that we want to learn. We can estimate the weight between the ith new input (i.e., input D + i) and the jth hidden unit as a linear combination of the parameters for the existing inputs, weighted by each existing input’s similarity to the ith new input. This is shown in Algorithm 1.

Choice of K is a matter of preference and input type. A time series-specific similarity measure might assign a zero for each pair of inputs that represents different variables (i.e., different univariate time series) and otherwise emphasize temporal proximity using, e.g., a squared exponential kernel. A more general approach might estimate similarity empirically, using sample covariance or cosine similarity. We find that the latter works well, for both time series inputs and arbitrary hidden layers.
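The following is one possible reading of the similarity-based strategy described above, not a reproduction of Algorithm 1. Each new input's weight to an existing hidden unit is a similarity-weighted combination of that unit's weights for the existing inputs. The row normalization of the similarities and all function names are assumptions.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Empirical cosine similarity between inputs; X is N x (D + d), one column per input."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0
    Xn = X / norms
    return Xn.T @ Xn

def similarity_init(W, S, normalize=True):
    """Initial values for the D1 x d block connecting new inputs to existing features.

    W: D1 x D existing weights; S: (D + d) x (D + d) similarity between all inputs.
    """
    D = W.shape[1]
    S_new_old = S[D:, :D]                      # similarity of each new input to each old input
    if normalize:                              # assumption: weights of the combination sum to 1
        denom = np.abs(S_new_old).sum(axis=1, keepdims=True)
        S_new_old = S_new_old / np.where(denom == 0, 1.0, denom)
    return W @ S_new_old.T

# Toy data: the third input is a scaled copy of the first, orthogonal to the second
X_data = np.array([[1.0, 0.0, 2.0],
                   [0.0, 1.0, 0.0]])
W_exist = np.array([[1.0, 2.0],
                    [3.0, 4.0]])
dW_ne_init = similarity_init(W_exist, cosine_similarity_matrix(X_data))
```

Because the new input is perfectly similar to the first existing input, it inherits that input's weights.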

Algorithm 2 Gaussian sampling-based initialization

Sampling-Based Initialization for New Features

When initializing the weights for \(\varDelta \boldsymbol{W}_{en}\), we do not have the similarity structure to guide us, but the weights in \(\boldsymbol{W}^{(1)}\) provide information. A simple but reasonable strategy is to sample random weights from the empirical distribution of entries in \(\boldsymbol{W}^{(1)}\). We have several choices here. The first regards whether to assume and estimate a parametric distribution (e.g., fit a Gaussian) or use a nonparametric approach, such as a kernel density estimator or histogram. The second regards whether to consider a single distribution over all weights or a separate distribution for each input.

In our experiments, we found that the existing weights often had recognizable distributions (e.g., Gaussian, see Figs. 3 and 4) and that it was simplest to estimate and sample from a parametric distribution. We also found that using a single distribution over all weights worked as well as, if not better than, a separate distribution for each input.
Fig. 3

Weight distributions for three layers of a neural network after pretraining

Fig. 4

Weight distributions for three layers of a neural network after finetuning

For initializing weights in \(\varDelta \boldsymbol{W}_{nn}\), which connect new inputs to new features, we could apply either strategy, as long as we have already initialized \(\varDelta \boldsymbol{W}_{en}\) and \(\varDelta \boldsymbol{W}_{ne}\). We found that estimating all new feature weights (for existing or new inputs) from the same simple distribution (based on \(\boldsymbol{W}^{(1)}\)) worked best. Our full Gaussian sampling initialization strategy is shown in Algorithm 2.
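A minimal sketch of the sampling idea, under the choices reported above (a single Gaussian fitted to all existing weights, used for every new feature), might look as follows. It is not a transcription of Algorithm 2; the function name and toy sizes are assumptions.

```python
import numpy as np

def gaussian_sampling_init(W, n_new_features, n_new_inputs, rng=None):
    """Sample weights for new features from one Gaussian fitted to all entries of W.

    Returns (dW_en, dW_nn): weights from existing inputs and from new inputs
    to the new features, respectively.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu, sigma = W.mean(), W.std()              # single distribution over all weights
    D = W.shape[1]
    new_rows = rng.normal(mu, sigma, size=(n_new_features, D + n_new_inputs))
    return new_rows[:, :D], new_rows[:, D:]

rng_init = np.random.default_rng(0)
W_trained = rng_init.normal(0.3, 0.1, size=(50, 40))   # stand-in for trained weights
dW_en_s, dW_nn_s = gaussian_sampling_init(W_trained, 200, 10, rng_init)
```

A kernel density estimate or per-input distributions could be swapped in at the line that fits `mu` and `sigma`.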

Initializing Other Layers

This framework generalizes beyond the input and first layers. Adding \(d^{{\prime}}\) new hidden units to \(\mathbf{h}^{(1)}\) is equivalent to adding \(d^{{\prime}}\) new inputs to \(\mathbf{h}^{(2)}\). If we compute the activations in \(\mathbf{h}^{(1)}\) for a given data set, these become the new inputs for \(\mathbf{h}^{(2)}\) and we can apply both the similarity and sampling-based strategies to initialize new entries in the expanded weight matrix \(\boldsymbol{W}'^{(2)}\). The same goes for all layers. While we can no longer design special similarity matrices to exploit known structure in the inputs, we can still estimate empirical similarity from training data activations in, e.g., \(\mathbf{h}^{(2)}\).

Intuition suggests that if our initializations from the previous pretrained values are sufficiently good, we may be able to forego pretraining and simply perform backpropagation. Thus, we choose to initialize with pretrained weights, then do the supervised finetuning on all weights.

Interpretable Mimic Learning

In this section, we describe our simple and effective knowledge distillation framework, the Interpretable Mimic Learning method (also termed the GBTmimic model), which trains Gradient Boosting Trees to mimic the performance of deep network models. Our mimic method aims to recognize interpretable features while maintaining the state-of-the-art classification performance of deep learning models.

The general training pipeline of the GBTmimic model is shown in Fig. 5. In the first step, we train a deep neural network with several hidden layers and one prediction layer, given the input features \(\boldsymbol{X}\) and target \(y\). We then take the activations of the highest hidden layer as the extracted features \(\boldsymbol{X}_{nn}\) from that deep network. In the second step, we train a mimic model, i.e., Gradient Boosting Regression Trees, given the raw input \(\boldsymbol{X}\) and the soft targets \(y_{nn}\) directly from the prediction layer of the neural network, to get the final output \(y_{m}\) with minimum mean squared error. After finishing the training procedure, we can directly apply the mimic model from the final step for the classification task.
Fig. 5

Training pipeline for mimic method

Fig. 6

Similarity matrix examples of different priors for the ICU (a–c) and Physionet (d) data sets. The x-axis and y-axis refer to the tasks. Colors represent the similarity values (black: 0; white: 1)

Our interpretable mimic learning model using GBT has several advantages over existing methods. First, GBT is good at maintaining the performance of an original complex model, such as a deep network, by mimicking its predictions. Second, it provides better interpretability than the original model through its decision rules and tree structures. Furthermore, using soft targets from deep learning models avoids overfitting to the original data and provides good generalization, which cannot be achieved by standard decision tree methods.
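The two-step pipeline can be sketched with scikit-learn. This is an illustration under several assumptions: the data are synthetic, a small `MLPClassifier` stands in for the deep network, and the hyperparameters are arbitrary. Only the overall flow (train a network, then fit gradient boosting regression trees to its soft targets) follows the description above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(400, 5))                      # synthetic inputs (illustrative only)
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).astype(int)    # synthetic binary target

# Step 1: train the network; two hidden layers, each twice the input size.
teacher = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=2000, random_state=0)
teacher.fit(X_syn, y_syn)
y_nn = teacher.predict_proba(X_syn)[:, 1]              # soft targets from the prediction layer

# Step 2: fit regression trees on the raw input and the soft targets
# (the default loss of GradientBoostingRegressor is squared error).
mimic = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
mimic.fit(X_syn, y_nn)
y_m = mimic.predict(X_syn)

mimic_mse = float(np.mean((y_m - y_nn) ** 2))          # how closely the mimic tracks the network
importances = mimic.feature_importances_               # one starting point for interpretation
```

For classification, `y_m` can be thresholded; the feature importances and the individual trees give a view of which inputs drive the mimic model's predictions.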


To evaluate our frameworks, we ran a series of classification and feature-learning experiments using several clinical time series datasets collected during the delivery of care in intensive care units (ICUs) at large hospitals. More details of these datasets are introduced in section “Dataset Descriptions”. In section “Benefits of Prior-Based Regularization”, we demonstrate the benefit of using priors (both knowledge- and data-driven) to regularize the training of multi-label neural nets. In section “Efficacy of Incremental Training”, we show that incremental training both speeds up training of larger neural networks and keeps classification performance. We show the quantitative results of our interpretable mimic learning method in section “Interpretable Mimic Learning Results”, and the interpretations in section “Interpretability”.

Dataset Descriptions

We conduct the experiments on the following three real world healthcare datasets.

Physionet Challenge 2012 Data

The first dataset comes from the PhysioNet Challenge 2012 website [30], which is a publicly available collection of multivariate clinical time series from 8000 ICU units. Each episode is a multivariate time series of roughly 48 h containing over 30 variables. These data come from one ICU and four specialty units, including coronary care and cardiac, and general surgery recovery units. We use the Training Set A subset, for which outcomes, including in-hospital mortality, are available. We resample the time series on an hourly basis and propagate measurements forward (or backward) in time to fill gaps. We scale each variable to fall within [0, 1]. We discuss handling of entirely missing time series below.
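The gap filling and scaling described above can be sketched for a single hourly variable as follows. The function names are assumptions, and leaving an entirely missing series untouched is a placeholder choice rather than the chapter's method.

```python
import numpy as np

def fill_gaps(series):
    """Forward-fill NaN gaps, then backward-fill any leading gap."""
    x = np.array(series, dtype=float)
    observed = ~np.isnan(x)
    if not observed.any():
        return x                                  # entirely missing: left for separate handling
    idx = np.where(observed, np.arange(len(x)), -1)
    idx = np.maximum.accumulate(idx)              # index of the last observation at or before t
    idx[idx < 0] = np.argmax(observed)            # leading gap: use the first observation
    return x[idx]

def min_max_scale(x):
    """Scale values linearly into [0, 1]; a constant series maps to zeros."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.nanmin(x), np.nanmax(x)
    return np.zeros_like(x) if hi == lo else (x - lo) / (hi - lo)

raw_hourly = [np.nan, 2.0, np.nan, np.nan, 5.0, np.nan]   # hypothetical hourly values
filled = fill_gaps(raw_hourly)
scaled = min_max_scale(filled)
```

In practice the scaling bounds would be computed per variable over the training data rather than per series; the per-series version here only keeps the example short.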

ICU Data

The second dataset consists of ICU clinical time series extracted from the electronic health records (EHRs) system of a major hospital. The original dataset includes roughly ten thousand episodes of varying lengths, but we exclude episodes shorter than 12 h or longer than 128 h, yielding a dataset of 8500 multivariate time series of a dozen physiologic variables, which we resample once per hour and scale to [0,1]. Each episode has zero or more associated diagnostic codes from the Ninth Revision of the International Classification of Diseases (ICD-9) [25]. From the raw 3–5 digit ICD-9 codes, we create a two level hierarchy of labels and label categories using a two-step process. First, we truncate each code to the tens position (with some special cases handled separately), thereby merging related diagnoses and reducing the number of unique labels. Second, we treat the standard seventeen broad groups of codes (e.g., 460–519 for respiratory diseases), plus the supplementary V and E groups as label categories. After excluding one category that is absent in our data, we have 67 unique labels and 19 categories.
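The two-step label construction can be illustrated with a small helper. The numeric ranges are the standard seventeen ICD-9 chapters. The handling of V and E codes (returning the letter for both label and category) and the assumption that numeric codes keep their leading zeros are simplifications of this sketch; the special cases mentioned above are not reproduced.

```python
ICD9_NUMERIC_GROUPS = [
    (1, 139), (140, 239), (240, 279), (280, 289), (290, 319), (320, 389),
    (390, 459), (460, 519), (520, 579), (580, 629), (630, 679), (680, 709),
    (710, 739), (740, 759), (760, 779), (780, 799), (800, 999),
]

def icd9_label_and_category(code):
    """Map a raw ICD-9 code to (label, category).

    label:    three-digit code truncated to the tens position (e.g., 486 -> 480)
    category: broad group range as a string (e.g., "460-519"), or "V" / "E"
    """
    code = code.strip().upper()
    if code[:1] in ("V", "E"):
        return code[:1], code[:1]                  # assumption: supplementary codes kept coarse
    major = int(code.replace(".", "")[:3])         # three-digit part, with or without a decimal point
    label = (major // 10) * 10
    for lo, hi in ICD9_NUMERIC_GROUPS:
        if lo <= major <= hi:
            return label, f"{lo:03d}-{hi:03d}"
    raise ValueError(f"unrecognized ICD-9 code: {code!r}")

pneumonia = icd9_label_and_category("486")
sepsis = icd9_label_and_category("038.9")
```

Labels that share a category string would then be connected in the tree-based prior.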


Another dataset [21] consists of data from 398 patients with acute hypoxemic respiratory failure in the intensive care unit at Children’s Hospital Los Angeles (CHLA). It contains a set of 27 static features, such as demographic information and admission diagnoses, and another set of 21 temporal features (recorded daily), including monitoring features and discretized scores made by experts, during the initial 4 days of mechanical ventilation. The missing value rate of this dataset is 13.43%, with some patients/variables having a missing rate of > 30%. We perform simple imputation for filling the missing values where we take the majority value for binary variables, and empirical mean for other variables.
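The simple imputation described above can be sketched as follows. The function name, the toy matrix, and the tie-breaking rule for binary variables (ties go to 1) are assumptions.

```python
import numpy as np

def simple_impute(X, binary_columns):
    """Fill NaNs with the majority value (binary columns) or the empirical mean (others)."""
    X = np.array(X, dtype=float)                  # copy so the input is not modified
    for j in range(X.shape[1]):
        col = X[:, j]                             # view into X
        missing = np.isnan(col)
        if not missing.any() or missing.all():
            continue
        observed_mean = col[~missing].mean()
        if j in binary_columns:
            fill = 1.0 if observed_mean >= 0.5 else 0.0   # assumption: ties resolve to 1
        else:
            fill = observed_mean
        col[missing] = fill
    return X

X_missing = [[1, 2.0],
             [np.nan, 4.0],
             [1, np.nan],
             [0, 6.0]]
X_imputed = simple_impute(X_missing, binary_columns={0})
```

Fill values should be estimated on the training split only, so that test data do not influence them.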

Implementation Details

We implemented all neural networks using the Theano [4] and Keras [8] platforms. We implement other baseline models based on the scikit-learn [26] package. In the prior and incremental frameworks, we use multilayer perceptrons with up to five hidden layers (of the same size) of sigmoid units. The input layer has PT input units for P variables and T time steps, while the output layer has one sigmoid output unit per label. Except when we use our incremental training procedure, we initialize each neural network by training it as an unsupervised stacked denoising autoencoder (SDAE), as this helps significantly because our datasets are relatively small and our labels are quite sparse. In the mimic learning framework, our DNN implementation has two hidden layers and one prediction layer. We set the size of each hidden layer to be twice as large as the input size.

Benefits of Prior-Based Regularization

Our first set of experiments demonstrates the utility of using priors to regularize the training of multi-label neural networks, especially when labels are sparse and highly correlated or similar. From each time series, we extract all subsequences of length T = 12 in sliding window fashion, with an overlap of 50% (i.e., stride R = 0.5T), and each subsequence receives its episode’s labels (e.g., diagnostic code or outcome). We use these subsequences to train a single unsupervised SDAE with five layers and increasing levels of corruption (from 0.1 to 0.3), which we then use to initialize the weights for all supervised neural networks. The sparse multi-label nature of the data makes stratified k-fold cross validation difficult, so we instead generate a series of random 80/20 training/test splits of episodes and keep the first five that have at least one positive example for each label or category. At testing time, we measure classification performance for both frames (i.e., subsequences) and episodes. We make episode-level predictions by thresholding the mean score for all subsequences from that episode.
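The windowing and episode-level prediction steps can be sketched as below. The function names, the time-major flattening order, the threshold of 0.5, and the toy episode with 13 variables are assumptions made for illustration; the text does not specify them.

```python
import numpy as np


def extract_subsequences(series, T=12, stride=6):
    """Slide a length-T window over a (length, P) series; return (n_windows, T * P)."""
    length = series.shape[0]
    starts = range(0, length - T + 1, stride)  # stride = 0.5 * T gives 50% overlap
    # Each window is flattened time step by time step (an assumed ordering).
    return np.stack([series[s:s + T].reshape(-1) for s in starts])


def episode_prediction(subsequence_scores, threshold=0.5):
    """Average (n_windows, n_labels) scores over windows, then threshold the mean."""
    mean_scores = np.mean(subsequence_scores, axis=0)
    return mean_scores, mean_scores >= threshold


# Toy 48-hour episode with 13 hypothetical variables, sampled hourly.
episode = np.random.default_rng(0).random((48, 13))
windows = extract_subsequences(episode)

# Two subsequences scored on two labels by some classifier (made-up scores).
scores = np.array([[0.2, 0.9], [0.4, 0.7]])
mean_scores, predicted = episode_prediction(scores)
```

A 48-hour episode yields windows starting at hours 0, 6, ..., 36. The sketch assumes every episode has at least T time steps, which holds here because shorter episodes are excluded.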

The ICU data set contains 8500 episodes varying in length from 12 to 128 h. The above subsequence procedure produces 50,000 subsequences. We treat the simultaneous prediction of all 86 diagnostic labels and categories as a multi-label prediction problem. This lends itself naturally to a tree-based prior because of the hierarchical structure of the labels and categories (Fig. 6a, b). However, we also test a data-based prior based on co-occurrence (Fig. 6c). Each neural network has an input layer of 156 units and five hidden layers of 312 units each.

The Physionet data set contains 3940 episodes, most of length 48 h, and yields 27,000 subsequences. These data have no such natural label structure to leverage, so we simply test whether a data-based prior can improve performance. We create a small multi-label classification problem consisting of four binary labels with strong correlations, so that similarity-based regularization should help: in-hospital mortality (mortality), length-of-stay less than 3 days (los < 3), whether the patient had a cardiac condition (cardiac), and whether the patient was recovering from surgery (surgery). The mortality rate among patients with length-of-stay less than 3 days is nearly double the overall rate. The cardiac and surgery labels are derived from a single original variable indicating which type of critical care unit the patient was admitted to; nearly 60% of cardiac patients had surgery. Figure 6d shows the co-occurrence similarity between the labels.

We impute missing time series (where a patient has no measurements of a variable) with the median value for patients in the same unit. This makes the cardiac and surgery prediction problems easier but serves to demonstrate the efficacy of our prior-based training framework. Each neural network has an input layer of 396 units and five hidden layers of 900 units each.

The results for Physionet are shown in Fig. 7. We observe two trends, both of which suggest that multi-label neural networks work well and that priors help. First, jointly learning features, even without regularization, can provide a significant benefit. Both multi-label neural networks dramatically improve performance for the surgery and cardiac tasks, which are strongly correlated and easy to detect because of our imputation procedure. Second, adding the co-occurrence prior yields clear improvements in the mortality and los < 3 tasks while maintaining the high performance in the other two tasks. Note that this is without tuning the regularization parameters.
Fig. 7. Physionet classification performance

Table 1 shows the results for the ICU data set. We report classification AUROC performance for both individual subsequences and episodes, computed across all outputs, as well as broken down into just labels and just categories. The priors provide some benefit but the improvement is not nearly as dramatic as it is for Physionet. We face a rather extreme case of class imbalance (some labels have fewer than 0.1% positive examples) multiplied across dozens of labels. In such settings, predicting all negatives yields a very low loss. We believe that even the prior-based regularization suffers from the imbalanced classes: enforcing similar parameters for equally rare labels may cause the model to make few positive predictions. However, the Co-Occurrence prior does provide a clear benefit, even in comparison to the ICD-9 prior. As Fig. 6c shows, this empirical prior captures not only the category/label relationship encoded by the ICD-9 tree prior but also includes valuable cross-category relationships that represent commonly co-morbid conditions.
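The remark that predicting all negatives yields a very low loss can be made concrete with a short calculation. The sketch below assumes a per-label cross-entropy loss (the loss function is not named in this section) and a classifier that outputs the same score, equal to the label's base rate, for every example.

```python
import math


def constant_prediction_loss(positive_rate):
    """Mean cross-entropy when every example receives the base rate as its score."""
    p = positive_rate
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


rare = constant_prediction_loss(0.001)    # label with 0.1% positive examples
balanced = constant_prediction_loss(0.5)  # balanced label, for comparison
```

For a label with 0.1% positives the constant predictor already reaches a loss of about 0.008, compared with about 0.693 for a balanced label, so the training signal for finding the rare positives is weak.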
Table 1. AUROC for classification

Level        Outputs      No prior          Co-occurrence     ICD-9 tree
Subsequence  All          0.7079 ± 0.0089   0.7169 ± 0.0087   0.7143 ± 0.0066
             Categories   0.6758 ± 0.0078   0.6804 ± 0.0109   0.6710 ± 0.0070
             Labels       0.7148 ± 0.0114   0.7241 ± 0.0093   0.7237 ± 0.0081
Episode      All          0.7245 ± 0.0077   0.7348 ± 0.0064   0.7316 ± 0.0062
             Categories   0.6952 ± 0.0106   0.7010 ± 0.0136   0.6902 ± 0.0118
             Labels       0.7308 ± 0.0099   0.7414 ± 0.0064   0.7407 ± 0.0070


Efficacy of Incremental Training

In these experiments we show that our incremental training procedure not only produces more effective classifiers (by allowing us to combine features of different lengths) but also speeds up training. We train a series of neural networks designed to model and detect patterns of lengths T_S = 12, 16, 20, 24. Each neural net has PT_S inputs (for P variables) and five layers of 2PT_S hidden units each. We use each neural network to make an episode-level prediction as before (i.e., the mean real-valued output for all frames) and then combine those predictions to make a single episode-level prediction. We compare two training strategies:
  • Full: Separately train each neural net, with unsupervised pretraining followed by supervised finetuning.

  • Incremental: Fully train the smallest (T_S = 12) neural net and then use its weights to initialize supervised training of the next model (T_S = 16). Repeat for subsequent networks.
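The incremental strategy reuses a smaller network's weights when initializing a larger one. The sketch below shows one simple way to do this for a single weight matrix: copy the trained weights into the matching block of the larger matrix and fill the remaining entries with small random values. The function `grow_weights`, the random fill, and the value P = 13 are assumptions for illustration; the rule the chapter uses for the new entries is not restated in this section and may differ.

```python
import numpy as np


def grow_weights(W_small, new_shape, scale=0.01, seed=0):
    """Embed a trained (inputs, hidden) matrix in a larger, randomly initialized one."""
    rng = np.random.default_rng(seed)
    # Placeholder initialization for weights that have no counterpart in the small net.
    W_large = rng.normal(0.0, scale, size=new_shape)
    rows, cols = W_small.shape
    W_large[:rows, :cols] = W_small  # reuse what the smaller network learned
    return W_large


P = 13  # hypothetical number of variables
# Stand-in for trained first-layer weights of the T_S = 12 network: (P*T_S, 2*P*T_S).
W12 = np.ones((P * 12, 2 * P * 12))
# Initial first-layer weights for the T_S = 16 network.
W16 = grow_weights(W12, (P * 16, 2 * P * 16))
```

Because the larger network starts from weights that already encode useful features, supervised training can begin directly, without a separate unsupervised pretraining pass.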

We begin by comparing the training time (in minutes) saved by incremental learning in Fig. 8. Incremental training provides an alternative way to initialize larger neural networks and allows us to forego unsupervised pretraining. What is more, supervised finetuning converges just as quickly for the incrementally initialized networks as it does for the fully trained network. As a result, incremental training reduces training time for a single neural net by half. Table 2 shows that incremental training reaches comparable performance. Moreover, the combination of incremental training and the Laplacian prior leads to better performance than using the Laplacian prior alone.
Fig. 8. Training time for different neural networks for full/incremental training strategies

Table 2. AUROC for incremental training

Interpretable Mimic Learning Results

We categorize the methods for our mimic learning framework into three groups:
  • Baseline machine learning algorithms which are popularly used in the healthcare domain: Linear Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT), and Gradient Boosting Trees (GBT).

  • Neural network-based method (NN-based): Deep Neural Networks (DNN).

  • Our Interpretable Mimic Learning methods: For the NN-based method described above, we take its soft predictions and treat them as the training targets of Gradient Boosting Trees. This method is denoted by GBTmimic-DNN.

To evaluate the mimic learning approach, we conduct two binary classification tasks on the VENT dataset.
  • Mortality (MOR) task: We predict whether the patient dies within 60 days after admission. In the dataset, there are 80 patients with a positive mortality label (patients who die).

  • Ventilator Free Days (VFD) task: We evaluate a surrogate outcome of morbidity and mortality (ventilator-free days, for which a lower value is worse) by identifying patients who survive and are on a ventilator for longer than 14 days. Because a lower VFD is worse, a value ≤ 14 is a bad outcome and a value > 14 is a good outcome. In the dataset, there are 235 patients with positive VFD labels (patients who survive and stay long enough on ventilators).

We train all the above methods with five different trials of fivefold random cross validation. For the DNN, we run 50 epochs of stochastic gradient descent (SGD) with learning rate 0.001. For Decision Trees, we expand the nodes as deep as possible until all leaves are pure. For Gradient Boosting Trees, we use a stage shrinking rate of 0.1 and a maximum of 100 boosting stages. We set the depth of each individual tree to 3, i.e., the number of terminal nodes is no more than 8, which is sufficient for boosting.
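The mimic pipeline can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' implementation: scikit-learn's `MLPClassifier` with its default optimizer stands in for the Theano/Keras DNN, the data are synthetic, and the student is fit and scored on the same samples for brevity. The boosting settings (shrinkage 0.1, 100 stages, depth 3) follow the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                  # synthetic features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic binary outcome

# Teacher: small network with two hidden layers, each twice the input size.
d = X.shape[1]
teacher = MLPClassifier(hidden_layer_sizes=(2 * d, 2 * d), max_iter=500, random_state=0)
teacher.fit(X, y)
soft_targets = teacher.predict_proba(X)[:, 1]  # soft predictions in [0, 1]

# Student: gradient boosting trees regress onto the teacher's soft predictions.
student = GradientBoostingRegressor(
    learning_rate=0.1, n_estimators=100, max_depth=3, random_state=0
)
student.fit(X, soft_targets)

mimic_scores = student.predict(X)
# Feature indices ordered from most to least important in the student model.
ranked = np.argsort(student.feature_importances_)[::-1]
```

The student's `feature_importances_` provide the kind of per-feature ranking that a tree ensemble offers and a neural network does not expose directly.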

Table 3 shows the prediction performance comparison of the models. We observe that for both the MOR and VFD tasks, the deep model obtains better performance than standard machine learning baselines; and our interpretable mimic methods obtain similar or better performance than the deep models.
Table 3. Classification results

AUC (mean): mean of the area under the ROC curve. AUC (std): standard deviation of the area under the ROC curve.


One advantage of decision tree methods is their interpretable feature selection and decision rules. Table 4 shows the most useful features found by GBT and our GBTmimic models, ranked by importance score across all cross-validation runs. We find that some important features are shared by several methods in these two tasks, e.g., MAP (Mean Airway Pressure) at day 1 and δPF (Change of PaO2/FIO2 Ratio) at day 1. Another interesting finding is that almost all the top features are temporal features, while among all static features, the PRISM (Pediatric Risk of Mortality) score, which was developed by and is widely used by doctors and medical experts, is the most useful variable.
Table 4. Top features and corresponding importance scores

Features (importance scores)
MAP-D1 (0.052), PaO2-D2 (0.052), FiO2-D3 (0.037)
MAP-D1 (0.031), δPF-D1 (0.031), PH-D1 (0.029)

MAP-D3 (0.033), PRISM12ROM (0.030)
MAP-D1 (0.042), PaO2-D0 (0.033), PRISM12ROM (0.032)


Discussion on Mobile Health

The deep learning solutions proposed in this chapter are general and apply to a wide variety of time series healthcare data, including longitudinal data from electronic health records (EHR), sensor data from intensive care units (ICU), and sensor data from mobile health devices. Our frameworks are well suited to mobile healthcare data. For example, our incremental training approach allows us to perform time series classification in an online manner and can therefore make efficient use of real-time sensor data collected on mobile devices. This enables real-time mobile health data analytics that can improve prediction outcomes and reduce healthcare costs.


In this chapter, we introduced a general framework based on deep learning for representation learning from time series health data. It can incorporate prior knowledge, such as formal ontologies (e.g., ICD-9 codes) and data-derived similarity, into deep learning models. Moreover, we presented a fast and scalable training procedure that shares learned weights across deep networks of different sizes. We also proposed a simple yet effective knowledge-distillation approach called Interpretable Mimic Learning, which learns interpretable features for making robust predictions while mimicking the performance of deep learning models. Experimental results on several real-world hospital datasets demonstrate the empirical efficacy of our framework and the interpretability of our mimic models.



  1. Ando, R.K., Zhang, T.: Learning on graph with Laplacian regularization. NIPS (2007)
  2. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems, pp. 2654–2662 (2014)
  3. Bahadori, M.T., Yu, Q.R., Liu, Y.: Fast multivariate spatio-temporal analysis via low rank tensor learning. In: NIPS (2014)
  4. Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I.J., Bergeron, A., Bouchard, N., Bengio, Y.: Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop (2012)
  5. Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. (2013)
  6. Bonner, G.: Decision making for health care professionals: use of decision trees within the community mental health setting. Journal of Advanced Nursing 35(3), 349–356 (2001)
  7. Buciluǎ, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541. ACM (2006)
  8. Chollet, F.: Keras: Theano-based deep learning library. Code: Documentation:
  9. Dahl, G., Yu, D., Deng, L., Acero, A.: Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio, Speech, Language Process (2012)
  10. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR (2009)
  11. Denil, M., Shakibi, B., Dinh, L., Ranzato, M., de Freitas, N.: Predicting parameters in deep learning. In: NIPS (2013)
  12. Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing higher-layer features of a deep network. Dept. IRO, Université de Montréal, Tech. Rep 4323 (2009)
  13. Fan, C.Y., Chang, P.C., Lin, J.J., Hsieh, J.: A hybrid model combining case-based reasoning and fuzzy decision tree for medical data classification. Applied Soft Computing 11(1), 632–644 (2011)
  14. Goldberger, A., Amaral, L.N., Glass, L., Hausdorff, J., Ivanov, P., Mark, R., Mietus, J., Moody, G., Peng, C., Stanley, H.: Physiobank, physiotoolkit, and physionet: Components of a new research resource for complex physiologic signals. Circulation (2000)
  15. Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764–1772 (2014)
  16. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
  17. Ho, J.C., Ghosh, J., Sun, J.: Marble: high-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization. In: KDD (2014)
  18. Kale, D., Che, Z., Liu, Y., Wetzel, R.: Computational discovery of physiomes in critically ill children using deep learning. In: DMMI Workshop, AMIA, vol. 2014
  19. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)
  20. Kerr, K.F., Bansal, A., Pepe, M.S.: Further insight into the incremental value of new markers: the interpretation of performance measures and the importance of clinical context. American journal of epidemiology p. kws210 (2012)
  21. Khemani, R.G., Conti, D., Alonzo, T.A., Bart III, R.D., Newth, C.J.: Effect of tidal volume in children with acute hypoxemic respiratory failure. Intensive care medicine 35(8), 1428–1437 (2009)
  22. Lasko, T.A., Denny, J., Levy, M.: Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PLoS ONE (2013)
  23. Marlin, B., Kale, D., Khemani, R., Wetzel, R.: Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In: IHI (2012)
  24. Mikolov, T., Deoras, A., Kombrink, S., Burget, L., Cernocký, J.: Empirical evaluation and combination of advanced language modeling techniques. In: INTERSPEECH (2011)
  25. Organization, W.H.: International statistical classification of diseases and related health problems (2004)
  26. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. JMLR (2011)
  27. Peleg, M., Tu, S., Bury, J., Ciccarese, P., Fox, J., Greenes, R.A., Hall, R., Johnson, P.D., Jones, N., Kumar, A., et al.: Comparing computer-interpretable guideline models: a case-study approach. Journal of the American Medical Informatics Association 10(1), 52–68 (2003)
  28. Quinlan, J.R.: Induction of decision trees. Machine learning 1(1), 81–106 (1986)
  29. Schulam, P., Wigley, F., Saria, S.: Clustering longitudinal clinical marker trajectories from electronic health data: Applications to phenotyping and endotype discovery (2015)
  30. Silva, I., Moody, G., Scott, D.J., Celi, L.A., Mark, R.G.: Predicting in-hospital mortality of ICU patients: The physionet/computing in cardiology challenge 2012. Computing in cardiology (2012)
  31. Socher, R., Huang, E., Pennin, J., Manning, C.D., Ng, A.Y.: Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In: NIPS (2011)
  32. Srivastava, N., Salakhutdinov, R.R.: Discriminative transfer learning with tree-based priors. In: NIPS, pp. 2094–2102 (2013)
  33. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013)
  34. Torralba, A., Fergus, R., Freeman, W.T.: 80 million tiny images: A large data set for nonparametric object and scene recognition. PAMI (2008)
  35. Turian, J., Ratinov, L., Bengio, Y.: Word representations: A simple and general method for semi-supervised learning. In: ACL (2010)
  36. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: ICML (2008)
  37. Weinberger, K.Q., Sha, F., Zhu, Q., Saul, L.K.: Graph Laplacian regularization for large-scale semidefinite programming. In: NIPS (2006)
  38. Wu, G., Kim, M., Wang, Q., Gao, Y., Liao, S., Shen, D.: Unsupervised deep feature learning for deformable registration of MR brain images. In: MICCAI (2013)
  39. Wu, R., Yan, S., Shan, Y., Dang, Q., Sun, G.: Deep image: Scaling up image recognition. arXiv:1501.02876 (2015)
  40. Xiang, T., Ray, D., Lohrenz, T., Dayan, P., Montague, P.R.: Computational phenotyping of two-person interactions reveals differential neural response to depth-of-thought. PLoS Comput. Biol. (2012)
  41. Yao, Z., Liu, P., Lei, L., Yin, J.: R-C4.5 decision tree model and its applications to health care dataset. In: Services Systems and Services Management, 2005. Proceedings of ICSSSM’05. 2005 International Conference on, vol. 2, pp. 1099–1103. IEEE (2005)
  42. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Computer Vision–ECCV 2014, pp. 818–833. Springer (2014)
  43. Zhang, T., Popescul, A., Dom, B.: Linear prediction models with graph regularization for web-page categorization. In: KDD (2006)
  44. Zhou, G., Sohn, K., Lee, H.: Online incremental feature learning with denoising autoencoders. In: AISTATS (2012)
  45. Zhou, J., Wang, F., Hu, J., Ye, J.: From micro to macro: Data driven phenotyping by densification of longitudinal electronic medical records. In: KDD (2014)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Zhengping Che (1)
  • Sanjay Purushotham (1)
  • David Kale (1)
  • Wenzhe Li (1)
  • Mohammad Taha Bahadori (1)
  • Robinder Khemani (2)
  • Yan Liu (1)

  1. Department of Computer Science, University of Southern California, Los Angeles, USA
  2. Children’s Hospital Los Angeles, Los Angeles, USA
