1 Introduction

Machine learning applied to healthcare data can generate actionable insights, ranging from predicting the onset of disease to streamlining hospital operations. Statistical models that leverage the variety and richness of clinical data are still relatively rare and offer an exciting avenue for further research (Callahan and Shah 2017). As an increasing amount of information becomes available, the medical field expects machine learning to become an indispensable tool for clinicians (Obermeyer and Emanuel 2016).

This information will come from various clinical and epidemiological sources. Claims records, clinical trials, and data from longitudinal studies have been an invaluable resource for medical research over the past several decades. In many of these datasets, data from individual subjects are gathered over time via continuous or repeated monitoring of both risk factors and health outcomes. For example, longitudinal cohort studies are used to discover relationships between exposures of interest and long-term health effects, including adverse events and chronic disease. By design, these studies mitigate recall bias in participants by collecting data prospectively, prior to knowledge of a possible subsequent event (Caruana et al. 2015).

Another valuable source of clinical data is Electronic Health Records (EHR). In recent years, widespread uptake of EHR has generated massive datasets that contain quantitative, qualitative, and transactional data (Murdoch and Detsky 2013). Hospital adoption of EHR has skyrocketed in part due to the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009, which provided $30 billion in incentives for hospitals and physician practices to adopt EHR systems (Birkhead et al. 2015). While these systems were primarily designed for archiving patient information and performing administrative healthcare tasks, many researchers have found secondary uses for these records in various clinical informatics applications (Shickel et al. 2018). Because heterogeneous labs, measurements, and notes are recorded for patients during each visit, EHR data have a rich and complex structure with time series information.

However, it is algorithms, and not merely datasets, that will prove transformative for the medical field (Obermeyer and Emanuel 2016). To make progress, we need to develop new statistical tools tailored to clinical applications, which address the challenges and leverage the common structure encountered in healthcare data. One of the most important issues is the ubiquitous presence of missing time series data (Pedersen et al. 2017), particularly for variables requiring complex, time-sensitive, or resource-intensive procedures to collect. There are many reasons for “missingness”, including missed study visits, patients lost to follow-up, missing information in source documents, lack of availability (e.g., laboratory tests that were not performed), and clinical scenarios preventing collection of certain variables (e.g., missing coma scale data in sedated patients) (Newgard and Lewis 2015). Thus, even within a single healthcare organization, creating a consistent dataset with a fixed set of covariates for individuals over multiple visits remains a challenge. Even in longitudinal studies, where a set of covariates is collected over time, missing data are pervasive and complete ascertainment of all variables is rare (Landrum and Becker 2001).

The presence of missing data poses considerable challenges in the analysis and interpretation of the results of clinical investigations (Wood et al. 2004), potentially weakening their validity and leading to biased inferences. Missing data may complicate interpretation or even invalidate an otherwise important study (Ware et al. 2012). Many methods commonly used for handling missing values during data analysis can yield biased results, decrease study power, or lead to underestimates of uncertainty, all reducing the chance of drawing valid conclusions (Newgard and Lewis 2015). As many statistical models and machine learning algorithms rely on complete datasets, it is key to handle the missing data appropriately.

1.1 Review of methods for handling missing values

In this section, we present some of the most common approaches for missing data imputation. First, we introduce fairly simple and intuitive techniques that do not require the use of sophisticated machine learning methods. We then provide brief descriptions of advanced missing data imputation algorithms, both general purpose methods as well as approaches tailored to medical records and time series.

Excluding observations that contain missing values has been standard practice in clinical research, primarily due to the lack of interpretable, accurate machine learning methods that can be easily applied by medical researchers (Sterne et al. 2009; Janssen et al. 2010). Unsurprisingly, complete case analysis may suffer from severe bias, and the reduced sample size results in lower study power (Newgard and Lewis 2015). Recent advances in machine learning have allowed missing values to be accurately imputed prior to running statistical analyses on the complete dataset. The benefit of the latter approach is that once a set (or multiple sets) of complete data has been generated, practitioners can easily apply their own learning algorithms to the imputed dataset. In healthcare settings, these datasets often contain numerous visits from the same person, corresponding to various patterns of missing data. This special structure challenges state-of-the-art missing data methods, which do not consider the connection of multiple observations to the same individual (Che et al. 2018).

A variety of machine learning approaches have been introduced in the literature to impute missing values while ignoring the potential dependency between observations of the same individual. The simplest is mean imputation, which uses the mean of the observed values to replace those missing for the same covariate (Little and Rubin 2019). However, mean imputation underestimates the variance and ignores the correlation between features, leading to poor imputation outcomes.

Another common method, bpca, uses the singular value decomposition (SVD) of the data matrix and information from a Bayesian prior distribution on the model parameters to impute missing values. This method outperforms basic SVD methods (Oba et al. 2003). In cases where the level of missing data is above 30%, we have found that this method reduces to mean imputation, leading to similar biases (Faria et al. 2018).

Joint modeling assumes the existence of a joint distribution on the entire dataset and a parametric density function on the data given model parameters. Current implementations of the method estimate the model parameters using an Expectation-Maximization (EM) approach in order to maximize the likelihood function. One widely used software package which implements this approach, Amelia I, assumes that data are drawn from a multivariate normal distribution (Honaker et al. 1999). In practice, healthcare data typically violate this condition (Sterne et al. 2009).

Recent review articles indicate that single imputation methods can lead to seriously misleading results and advise us to consider multiple imputation (Janssen et al. 2010; Little and Rubin 2019). This approach, implemented in the software package mice, allows for uncertainty about the missing data by creating several different plausible imputed datasets and appropriately combining results obtained from each of them (Schafer and Olsen 1998). The Amelia I package was extended to multiple imputation in the Amelia II algorithm (Honaker et al. 2011). Multiple imputation entails two stages: (1) generating replacement values for missing data and repeating this procedure many times, resulting in many datasets with replaced missing information, and (2) analyzing the many imputed datasets and combining the results (Li et al. 2015). As a result, multiple imputation methods are slower and require pooling results, which may not be appropriate for certain applications. For example, in clinical applications, where the interpretability of the underlying model matters, a single imputed dataset and simple predictive model may be preferred.

Most recently, Bertsimas et al. (2018b) proposed a general optimization framework with a predictive model-based cost function that can explicitly handle both continuous and categorical variables and can be used to generate single, as well as multiple, imputations. This optimization perspective has led to new scalable algorithms for more accurate data imputation. We describe this method, OptImpute, in more detail in Sect. 2.2 and use it as a foundation for the imputation method proposed in this paper.

The algorithms above are not tailored to multivariate time series datasets, despite the fact that covariates may be strongly correlated over time (Lipton et al. 2016). Preliminary work has been done demonstrating their performance in that setting (Zhang 2016). Recurrent Neural Network approaches have also been employed to handle missing values in time series among the covariates for a particular prediction task (Lipton et al. 2016; Che et al. 2018). However, these approaches differ from traditional imputation methods because they also use features derived from the missing pattern itself, and they require that the downstream learning method be a neural network. In contrast, our method produces a single imputed dataset that can be used as training data for any supervised learning method which is preferred for the downstream task.

In practice, simpler techniques are more commonly applied in the panel data setting. Researchers often opt for a moving average approach with a fixed time window using previous observations from the same individual (Flores et al. 2019). For example, the last-observation-carried-forward method imputes a present missing value by carrying the last non-missing value forward for a defined time period (Siddiqui and Ali 1998). However, these techniques ignore the correlation between covariates, which is leveraged by other, more advanced imputation methods. A few methods give weights to instances of the same patient in temporal data; for example, this approach has been applied to adverse drug event monitoring (Zhao and Henriksson 2016). Similar methods have also been applied in political science and economics, where time-series cross-sectional data are quite common (Shor et al. 2007).
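For concreteness, the following is a minimal Julia sketch of last-observation-carried-forward for a single patient's time-ordered measurements. It is an illustration only: the function name is ours, `missing` marks absent values, and the defined-time-window variant is omitted.

```julia
# Minimal last-observation-carried-forward (LOCF) sketch for one patient's
# time-ordered series; entries before the first observed value stay missing.
function locf(values::Vector{Union{Missing, Float64}})
    imputed = copy(values)
    last_seen = missing
    for i in eachindex(imputed)
        if ismissing(imputed[i])
            imputed[i] = last_seen      # carry the last observed value forward
        else
            last_seen = imputed[i]      # update the running last observation
        end
    end
    return imputed
end

locf([missing, 5.0, missing, missing, 7.0])  # -> [missing, 5.0, 5.0, 5.0, 7.0]
```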

1.2 Contributions

Given multivariate time series data, we develop a novel imputation method that utilizes optimization and machine learning techniques and outperforms state-of-the-art algorithms. Our contributions are as follows:

  1. We formulate the problem of missing data imputation with time series information under the MedImpute framework, extending the OptImpute framework proposed by Bertsimas et al. (2018b). Our approach can be adjusted to account for different imputation models based on predictive methods such as K-NN, SVM, and trees. We focus on a K-NN formulation to solve the problem and derive a corresponding fast first-order algorithm, med.knn. This method provides imputations for datasets with both continuous and categorical features and observations occurring at arbitrary points in time.

  2. We design a series of computational experiments on three real-world datasets with direct clinical implications. We consider the Framingham Heart Study (FHS) and the Parkinson’s Progression Markers Initiative (PPMI), two longitudinal datasets with rich time series data recorded at regular time intervals, and Electronic Health Record (EHR) data from the Dana Farber Cancer Institute (DFCI), which consists of less structured and sparser time series data. We provide a comprehensive framework for our experiments that tests the performance of our method across a diverse range of scenarios, varying parameters including: (1) the percentage of missing data, (2) the number of observations per individual, and (3) the mechanism of missing data. For the latter, we consider different mechanisms for the longitudinal and EHR datasets, corresponding to the different patterns of missing data which are typically observed in real-world datasets. We demonstrate that med.knn obtains the best predictive performance and the lowest imputation error as we vary the missing percentage from 10% to 50%. In addition, we show that for all datasets, the relative performance of med.knn improves as we increase the number of observations per individual. Finally, we demonstrate that med.knn performs well on missing patterns commonly encountered in practice for both longitudinal studies and EHR data. These improvements are relative to the best of the comparator methods among amelia, moving average, mean, bpca, mice, and opt.knn, which are described in Sect. 3.

  3. We propose a new custom tuning procedure to efficiently learn the hyperparameters in the optimization problem, avoiding the use of traditional approaches such as grid search. Our methodology decouples the problem into multiple parts, enabling parallel computation that can decrease the run time. We create synthetic EHR data to test the scaling performance of the algorithm as we increase the number of observations and features. Our results show that the custom tuning approach leads to both superior scaling performance and better imputation accuracy compared to standard cross-validation. The tuning procedure is described in Sect. 2.4, and the scaling experiments with synthetic data are provided in Sect. 4.

The structure of the paper is as follows. In Sect. 2, we describe our framework for the imputation of clinical covariates in time series and our proposed method, med.knn. In Sect. 3, we describe computational experiments on three real-world datasets evaluating both imputation and prediction accuracy. In Sect. 4, we present scaling experiments on simulated clinical datasets. In Sect. 5, we discuss properties of our algorithm and key insights from our experiments. We conclude our work in Sect. 6.

2 Methods

In this section, we describe our proposed method for imputation. In Sect. 2.1, we define the variables and notation used in this paper. In Sect. 2.2, we review the OptImpute framework for missing data imputation. In Sect. 2.3, we introduce our new imputation framework, MedImpute, which directly models clinical covariates in time series, and we present the K-Nearest Neighbors (K-NN) based formulation. In Sect. 2.4, we describe a custom tuning procedure to efficiently learn the hyperparameters in the optimization problem. Finally, in Sect. 2.5 we provide the detailed steps of the first-order method med.knn that can be used to find high-quality solutions.

2.1 Variables and notation

In this paper, we consider the single imputation problem for which our task is to fill in the missing values of dataset \(\mathbf {X} \in {\mathbb {R}}^{n \times p}\) with n observations (rows) and p features (columns). Without loss of generality, we assume that the first \(p_0\) features are continuous and that the next \(p_1 = p - p_0\) features are categorical, and the missing and known indices are specified by the following sets:

$$\begin{aligned} \begin{aligned} \mathcal {M}_0&= \{(i,d) : \text {entry}~x_{id}~\text {is missing}, ~1 \le d \le p_0, {~1 \le i \le n} \}, \\ \mathcal {N}_0&= \{(i,d) : \text {entry}~x_{id}~\text {is known}, ~1 \le d \le p_0, {~1 \le i \le n}\},\\ \mathcal {M}_1&= \{(i,d) : \text {entry}~x_{id}~\text {is missing}, ~p_0 + 1 \le d \le p_0 + p_1, {~1 \le i \le n}\},\\ \mathcal {N}_1&= \{(i,d) : \text {entry}~x_{id}~\text {is known}, ~p_0 + 1 \le d \le p_0 + p_1, {~1 \le i \le n}\},\\ \mathcal {I}&= \{i : \mathbf {x}_i~\text {has one or more missing values}\}. \end{aligned} \end{aligned}$$
(1)

Here, \(\mathcal {M}_0\), \(\mathcal {M}_1\) are the sets of indices of the missing values in the continuous and categorical variables, respectively. Similarly, \(\mathcal {N}_0\), \(\mathcal {N}_1\) are the sets of indices of the known values in the continuous and categorical variables, respectively. \(\mathcal {I}\) is the set of rows which contains at least one missing value.

We suppose that all of the continuous variables are normalized to have unit SD and that the dth categorical variable takes a value among \(k_d\) classes. Given this data, we introduce the decision variables \(\mathbf {W} \in {\mathbb {R}}^{n \times p_0}\), \(\mathbf {V} \in \{1,\ldots ,k_{p_0 + 1}\} \times \ldots \times \{1, \ldots , k_{p_0 + p_1}\}\) to be the matrices of imputed continuous and categorical variables, respectively. For each entry \(x_{id}\), \(w_{id}\) is the imputed value if \(d \in \{1,\ldots , p_0\}\), and \(v_{id}\) is the imputed value if \(d \in \{p_0 + 1,\ldots , p_0 + p_1\}\). We refer to the full imputation for observation \(\mathbf {x}_i\) as \((\mathbf {w}_i, \mathbf {v}_i)\). For the MedImpute method, we also assume that each observation \(\mathbf {x}_i\) corresponds to a particular patient with the unique ID \(y_i\), observed at time-stamp \(t_i\).
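To make the notation concrete, the following Julia sketch materializes the index sets of Eq. (1) from a data matrix that uses `missing` for absent entries. The function name and representation are illustrative assumptions, not part of the formulation.

```julia
# Hypothetical construction of the index sets in Eq. (1): X is n x p with
# `missing` entries, the first p0 columns are continuous, the rest categorical.
function index_sets(X::Matrix, p0::Int)
    n, p = size(X)
    M0 = [(i, d) for i in 1:n, d in 1:p0 if ismissing(X[i, d])]
    N0 = [(i, d) for i in 1:n, d in 1:p0 if !ismissing(X[i, d])]
    M1 = [(i, d) for i in 1:n, d in (p0+1):p if ismissing(X[i, d])]
    N1 = [(i, d) for i in 1:n, d in (p0+1):p if !ismissing(X[i, d])]
    I  = [i for i in 1:n if any(ismissing, @view X[i, :])]  # rows with missing values
    return M0, N0, M1, N1, I
end
```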

2.2 Review of OptImpute

Next, we review the OptImpute framework for general imputation, which we use as a foundation for our method. In this approach, we formulate the missing data problem as an optimization problem in which all entries are simultaneously filled in and used as covariates to predict the other entries. Our key decision variables are the imputed values \(\{w_{id}: (i,d) \in \mathcal {M}_0 \}\) and \(\{v_{id}: (i,d) \in \mathcal {M}_1 \}\). We also introduce auxiliary decision variables \(\mathbf {Z}\). With any given set of imputed values and corresponding data \(\mathbf {X}\), we associate a cost function \(c(\cdot )\). Thus, our objective is to solve the following optimization problem:

$$\begin{aligned} \begin{aligned}&\begin{aligned} \min ~~ c(\mathbf {Z}, \mathbf {W}, \mathbf {V}; \mathbf {X}) \end{aligned}\\&\begin{aligned} \text {s.t.}~~~~~ w_{id}&= x_{id}&(i,d) \in \mathcal {N}_0,\\ v_{id}&= x_{id}&(i,d) \in \mathcal {N}_1,\\ (\mathbf {Z},&\mathbf {W}, \mathbf {V}) \in \mathcal {Z}, \\ \end{aligned} \end{aligned} \end{aligned}$$
(2)

where \(\mathcal {Z}\) is the set of all feasible combinations \((\mathbf {Z}, \mathbf {W}, \mathbf {V})\) of auxiliary vectors and imputations. In this paper, we only consider an OptImpute formulation based upon K-Nearest Neighbors (K-NN); however, it is also possible to consider formulations based upon SVM and trees (Bertsimas et al. 2018b).

In the K-NN formulation, the objective is to impute the missing values so that each point is as close to its K-nearest neighbors as possible. First, we define a distance metric on the dataset. Given two observations i and j, we say that the distance between them is:

$$\begin{aligned} d_{ij} := \sum _{d=1}^{p_0} (w_{id} - w_{jd})^2 + \sum _{d=p_0+1}^{p_0+p_1} \mathbb {1}_{\{v_{id} \ne v_{jd}\}}. \end{aligned}$$
(3)

In this distance metric, we weight the contributions from the continuous and categorical variables equally, but it is also possible to introduce a scaling factor to weight these terms differently. Given this distance metric, we introduce the binary variables \(\mathbf {Z} \in \{0,1\}^{|\mathcal {I}| \times n}\), where

$$\begin{aligned} z_{ij} = \left\{ \begin{array}{ll} 1, &\quad \hbox {if}~j~\hbox {is among the}~K\hbox {-nearest neighbors of}~i\\ &\quad \text {with respect to distance metric}~(3), \\ 0, &\quad \text {otherwise.} \end{array} \right. \end{aligned}$$
(4)

The OptImpute formulation with the K-NN objective function is

$$\begin{aligned} \begin{aligned}&\begin{aligned} \min ~~ \sum _{i\in \mathcal {I}}\sum _{j=1}^n z_{ij}\left( \sum _{d=1}^{p_0} (w_{id}-w_{jd})^2 + \sum _{d=p_0+1}^{p_0+p_1} \mathbb {1}_{\{v_{id}\ne v_{jd}\}}\right) \end{aligned}\\&\begin{aligned} \text {s.t.}~~~~~ w_{id}&= x_{id}&(i,d) \in \mathcal {N}_0,\\ v_{id}&= x_{id}&(i,d) \in \mathcal {N}_1,\\ z_{ii}&= 0&i \in \mathcal {I}, \\ \sum _{j=1}^n z_{ij}&=K&i \in \mathcal {I},\\ \mathbf {Z}&\in \{0,1\}^{|\mathcal {I}| \times n}, \\ \end{aligned} \end{aligned} \end{aligned}$$
(5)

where \(\mathcal {I} = \{i : \mathbf {x}_i~\text {has one or more missing values}\}\). Problem (5) is non-convex with integer constraints for the categorical variables. In order to solve this problem, the authors find near-optimal feasible solutions using first-order methods with random and targeted warm starts, resulting in a new imputation algorithm called opt.knn (Bertsimas et al. 2018b).

At a high level, the opt.knn algorithm works as follows. The user provides as input an incomplete data matrix \(\mathbf {X}\), a convergence threshold \(\delta _0 > 0\), and a warm start imputation \((\mathbf {W}^0, \mathbf {V}^0)\). The output of the algorithm is the full matrix \(\mathbf {X}^{imp}\) with the imputed variables. In each iteration, we alternate between updating the auxiliary variables \(\mathbf {Z}\) and updating the imputation \((\mathbf {W}, \mathbf {V})\) using either Coordinate Descent (CD) or Block Coordinate Descent (BCD). The problem of updating \(\mathbf {Z}\) given an imputation reduces to a simple sorting procedure on the distances. To update \((\mathbf {W}, \mathbf {V})\) in CD, we locally optimize each imputed value (\(w_{id}\) or \(v_{id}\)) one at a time. To update \((\mathbf {W}, \mathbf {V})\) in BCD, for each continuous or categorical feature we solve a Quadratic Optimization problem or a Mixed-Integer Optimization problem, respectively. We continue updating these values until the objective value improves by less than \(\delta _0\). Since the objective function value decreases by at least \(\delta _0\) at every iteration until the algorithm terminates, the number of iterations before termination is at most:

$$\begin{aligned} T = \frac{1}{\delta _0} c(\mathbf {Z}^0, \mathbf {W}^0, \mathbf {V}^0;\,\mathbf {X}), \end{aligned}$$
(6)

where \(\mathbf {W}^0, \mathbf {V}^0\) are the warm start values, \(\mathbf {X}\) is the data, and \(\mathbf {Z}^0\) are the initialized auxiliary variables. There are no analytical guarantees that the algorithm will find the globally optimal solution (Wright 2015). We repeat this process for multiple warm starts and take the solution with the best objective value as the final imputation. The algorithm for a single warm start is summarized in Algorithm 1.

[Algorithm 1: opt.knn for a single warm start]
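To make the alternating structure concrete, the following Julia sketch shows the scheme for continuous features only. It is a simplified illustration under our own naming, not the authors' implementation: `update_imputation!` is a hypothetical stand-in for the CD or BCD step, and the warm start is assumed to be already filled into `W`.

```julia
# Simplified alternating-minimization skeleton in the spirit of opt.knn,
# restricted to continuous features. W is n x p0 with a warm start imputation
# already filled in; missing_rows lists the rows with missing entries.
function knn_impute_skeleton!(W::Matrix{Float64}, missing_rows::Vector{Int},
                              K::Int, δ0::Float64, update_imputation!)
    n = size(W, 1)
    obj_prev = Inf
    Z = [Int[] for _ in missing_rows]
    while true
        # Step 1: update Z by sorting distances (excluding the point itself).
        for (r, i) in enumerate(missing_rows)
            dists = [j == i ? Inf : sum(abs2, W[i, :] .- W[j, :]) for j in 1:n]
            Z[r] = partialsortperm(dists, 1:K)  # indices of the K nearest neighbors
        end
        # Step 2: re-impute the missing entries given the neighbor assignment.
        update_imputation!(W, missing_rows, Z)
        # Step 3: evaluate the K-NN objective and test for convergence.
        obj = sum(sum(sum(abs2, W[i, :] .- W[j, :]) for j in Z[r])
                  for (r, i) in enumerate(missing_rows))
        obj_prev - obj < δ0 && break
        obj_prev = obj
    end
    return W
end
```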

2.3 MedImpute

In this section, we present the MedImpute framework for imputation of clinical covariates in time series. We extend the general OptImpute framework by weighting instances of the same person in the imputation model. We focus on the K-NN classifier and provide the specific formulation to solve this problem. Our new framework takes into account the time series structure frequently encountered in healthcare data. In addition, unlike univariate time series methods, this approach leverages statistical correlations between multiple clinical covariates.

Suppose that we are given the same problem setup for single imputation as described in Sect. 2.2. In addition, assume that each observation i corresponds to an individual patient with unique identifier \(y_i \in \{1,\ldots ,M\}\) recorded at a particular time point. For datasets with multiple observations of individuals over time, we have \(M < n\). Define \(t_i \in {\mathbb {R}}^+\) as the number of (days/months/years) after a reference date that observation i was recorded. It follows that \(|t_i - t_j|\) is the time difference in (days/months/years) between observations i and j. Note that this framework captures the common structure of many clinical datasets collected over time, including longitudinal studies, insurance claims, and EHR data.

For each clinical covariate \(d = 1,\ldots ,p\), we introduce the parameters \(\alpha _d\) and \(h_d\), which we learn via a custom tuning procedure described in Sect. 2.4. The first parameter \(\alpha _d \in [0,1]\) is the relative weight given to the time series component of the objective function for variable d. At the extremes, \(\alpha _d = 0\) corresponds to imputing covariate d under the OptImpute objective, and \(\alpha _d = 1\) corresponds to imputing covariate d using each individual’s time series information independently. The second parameter \(h_d \in (0,\infty )\) is the halflife parameter for covariate d, so named because it is the halflife of the exponential decay function \(f(x) = 2 ^ {-x / h_d}\) that determines the relative weights for multiple observations of the same patient.

We introduce the parameter \(h_d\) so that observations from the same individual at nearby points in time are weighted most heavily in the imputation. We make this design decision under the assumption that each clinical covariate can be approximated as a continuous function which is relatively smooth over time. For example, Body Mass Index (BMI) is a clinical covariate whose values are relatively smooth over time. Under this model, we assume that a BMI measurement from one week ago is more predictive of a patient’s current BMI than a BMI measurement from one year ago. However, we do not make any assumptions about how much more or less predictive these measurements are, only that their relative weights follow an exponential decay. The halflife of this decay for covariate d is the modelling parameter that we refer to as \(h_d\).

For each pair of observations ij, covariate d, and corresponding halflife parameter \(h_d\), define the two derived parameters:

$$\begin{aligned} \begin{aligned}&\begin{aligned} C_{ijd} = \left\{ \begin{array}{ll} 2^{-|t_i - t_j| / h_d}, &\quad \text {if}~y_i = y_j, \\ 0, &\quad \text {otherwise}, \\ \end{array} \right. \end{aligned}\\&\begin{aligned} {\overline{C}}_{ijd} = \frac{C_{ijd}}{ \displaystyle \sum \nolimits _{\{j' : y_i = y_{j'}, j' \ne i\}} C_{ij'd} }. \end{aligned} \end{aligned} \end{aligned}$$
(7)

The first derived parameter \(C_{ijd}\) is the relative weight that observation j is given for time-series based imputation of observation i in covariate d. Note that this parameter is nonzero only when \(y_i = y_j\), i.e., when i and j are observations from the same patient. For example, if \(h_d = 7\) days, then past observations of covariate d from one week and two weeks ago from the same patient would be given relative weights 0.5 and 0.25, respectively. The second derived parameter, \({\overline{C}}_{ijd}\), is the normalized variant of \(C_{ijd}\): the relative weight that observation j is given to impute observation i in covariate d, divided by the sum of the relative weights of all other observations from the same patient in covariate d.
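The derived parameters of Eq. (7) are straightforward to compute; the following Julia sketch does so for one covariate, under the illustrative assumption that patient IDs and time stamps are given as plain vectors.

```julia
# Hypothetical computation of the weights in Eq. (7) for one covariate with
# halflife h; ids[i] is the patient ID and t[i] the time stamp of observation i.
function decay_weights(ids::Vector{Int}, t::Vector{Float64}, h::Float64)
    n = length(ids)
    C = [ids[i] == ids[j] && i != j ? 2.0^(-abs(t[i] - t[j]) / h) : 0.0
         for i in 1:n, j in 1:n]
    # Normalize each row over the other observations of the same patient;
    # rows for patients with a single observation keep all-zero weights.
    Cbar = C ./ max.(sum(C, dims=2), eps())
    return Cbar
end
```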

The MedImpute formulation with the K-NN objective function is

$$\begin{aligned} \begin{aligned}&\begin{aligned} \min ~~ {\frac{1}{K}} \sum _{i\in \mathcal {I}} \sum _{j=1}^n z_{ij}\left( \sum _{d=1}^{p_0} (1 - \alpha _d) (w_{id}-w_{jd})^2 + \sum _{d=p_0+1}^{p_0+p_1} (1 - \alpha _d) \mathbb {1}_{\{v_{id}\ne v_{jd}\}}\right) \\ + \sum _{i\in \mathcal {I}} \sum _{j=1}^n\left( \sum _{d=1}^{p_0} \alpha _d {{\overline{C}}_{ijd}}(w_{id}-w_{jd})^2 + \sum _{d=p_0+1}^{p_0+p_1} \alpha _d {{\overline{C}}_{ijd}}\mathbb {1}_{\{v_{id}\ne v_{jd}\}}\right) \\ \end{aligned}\\&\begin{aligned} \text {s.t.}~~~~~ w_{id}&= x_{id}&(i,d) \in \mathcal {N}_0,\\ v_{id}&= x_{id}&(i,d) \in \mathcal {N}_1,\\ z_{ii}&= 0&i \in \mathcal {I}, \\ \sum _{j=1}^n z_{ij}&=K&i \in \mathcal {I},\\ \mathbf {Z}&\in \{0,1\}^{|\mathcal {I}| \times n}, \\ \end{aligned} \end{aligned} \end{aligned}$$
(8)

where \(\mathcal {I} = \{i : \mathbf {x}_i~\hbox {has one or more missing values}\}\hbox { and }\alpha _d , {\overline{C}}_{ijd}\) are constants. This problem is equivalent to (5) plus a penalty term in the objective for each feature d with different weights \(\alpha _d\) in order to account for instances of the same person in the dataset. At the optimal solution, the objective function is the sum of the distances from each point to its K-nearest neighbors with respect to distance metric (3), plus the sum of the distances from each point to other observations from the same individual.

We derive a fast algorithm to provide high-quality solutions to this problem using first-order methods with random restarts, alternately updating the binary variables and the imputed values as in opt.knn (Bertsekas 1999). In Algorithm 2, we summarize the med.knn method for a single warm start. In the next section, we describe the steps of this algorithm in detail.

MedImpute provides a flexible framework that can easily be extended. For example, we may consider predictive models besides K-NN, such as support vector machines and decision-tree-based methods, by adjusting the objective functions of the corresponding OptImpute formulations appropriately. We refer the reader to Bertsimas et al. (2018b) for more discussion of these alternate formulations, which are a possible area of future work. In these cases, we add the same penalty term to the objective functions that we added in formulation (8), and we solve using first-order methods with random starts. In this manuscript, we focus on the K-NN formulation because of its simplicity and its closeness to medical practice. The idea of imputing a patient’s missing values using the mean or the mode of the covariates from the individuals most similar to that observation is intuitive, and various implementations of the heuristic K-NN approach are already widely accepted and used in practice (Crookston and Finley 2008). For these reasons, we extend this approach by combining the time series component with an optimization framework.

The method can also be adapted to a multiple imputation setting. However, while multiple imputation has been considered for several years to be the most accurate method for dealing with missing data (Rubin 1996), there is a tradeoff because single imputation is more interpretable. In particular, with single imputation we obtain one downstream predictive model that can be easily presented and explained to an entire clinical team, which is a critical step in the process of data-driven medical research (Shrive et al. 2006).

2.4 Learning \(\alpha _d\) and \(h_d\)

In this section, we describe a custom tuning procedure to efficiently learn \(\alpha _d\) and \(h_d\), which are hyperparameters in the optimization problem (8). We run this custom tuning procedure as a pre-processing step before the med.knn algorithm, which allows us to learn these parameters without using cross-validation. This is a heuristic procedure which decouples the problem into multiple parts, first learning \(h_d\) for each covariate, and then learning \(\alpha _d\) for each covariate. As a result, this custom tuning procedure is more computationally efficient and scales to larger problem sizes than cross-validation. In Sect. 4, we present the results from computational experiments comparing the speed and imputation accuracy of this custom tuning procedure against a traditional cross-validation method for selecting \(\alpha _d\) and \(h_d\).

In the first step of the custom tuning procedure, we learn the halflife parameter \(h_d\) for each covariate. As in cross-validation, we tune the halflife parameters over a discrete range of values, denoted as \(\mathcal H\). For example, in the computational experiments, we set \(\mathcal H = \{1, 7, 30, 90, 365, 1000\}\), representing halflife values of 1 day, 1 week, 1 month, etc. For each covariate d, we compute the leave-one-out error for each halflife value \(h_d \in \mathcal H\). In particular, to compute the leave-one-out error for the halflife value \(h_d\), first we derive the weights \({\overline{C}}_{ijd}\), then we impute the known values in covariate d using these weights, and finally we compute the sum-of-squared errors. Afterwards, we select the halflife parameter \(h_d\) which yields the lowest leave-one-out error.

For each continuous covariate \(d \in \{1, \ldots , p_0\}\), the leave-one-out error is defined as:

$$\begin{aligned} \sum _{\{i: (i,d) \in \mathcal {N}_0\}} (x_{id} - {\hat{w}}_{id})^2, \end{aligned}$$
(9)

where:

$$\begin{aligned} {\hat{w}}_{id} := \sum _{j=1}^n {\overline{C}}_{ijd} x_{jd}. \end{aligned}$$
(10)

Here, \({\hat{w}}_{id}\) is equivalent to the MedImpute imputation of a continuous covariate \(x_{id}\) when \(\alpha _d = 1\). For each categorical covariate \(d \in \{p_0 + 1, \ldots , p_0 + p_1\}\), the leave-one-out error is defined as:

$$\begin{aligned} \sum _{\{i: (i,d) \in \mathcal {N}_1\}} \mathbb {1}_{\{x_{id} \ne {\hat{v}}_{id}\}}, \end{aligned}$$
(11)

where:

$$\begin{aligned} {\hat{v}}_{id} := \arg \max _{v_{id}} \sum _{j=1}^n {\overline{C}}_{ijd} \mathbb {1}_{\{x_{jd} = v_{id}\}}. \end{aligned}$$
(12)

Intuitively, \({\hat{v}}_{id}\) is the weighted mode of covariate d, where the weights are \({\overline{C}}_{ijd}\). This is equivalent to the MedImpute imputation of the categorical covariate \(x_{id}\) when \(\alpha _d = 1\).
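To illustrate this first tuning step, the Julia sketch below selects the halflife minimizing the leave-one-out error (9) for one continuous covariate; it reuses the `decay_weights` sketch above and restricts the sum in Eq. (10) to known entries, which is an assumption on our part.

```julia
# Sketch: choose the halflife h_d minimizing the leave-one-out error (9) for a
# continuous covariate x (with `missing` entries) over the candidate set H.
function tune_halflife(x, ids, t, H = [1.0, 7.0, 30.0, 90.0, 365.0, 1000.0])
    known = findall(!ismissing, x)
    best_h, best_err = first(H), Inf
    for h in H
        Cbar = decay_weights(ids, t, h)  # diagonal is zero, so x[i] never predicts itself
        err = sum((x[i] - sum(Cbar[i, j] * x[j] for j in known))^2 for i in known)
        if err < best_err
            best_h, best_err = h, err
        end
    end
    return best_h
end
```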

Note that we are able to learn \(h_d\) independently from \(\alpha _d\) because the choice of \({\overline{C}}_{ijd}\) which minimizes the objective function (8) for one fixed value of \(\alpha _d\) also minimizes it for any other choice of \(\alpha _d \in [0,1]\). Similarly, we can learn the halflife parameters \(\{h_1, h_2, \ldots , h_p\}\) independently from one another, because the optimal choice of \(h_d\) which minimizes the objective function (8) does not depend upon the values of \(\{h_1, \ldots , h_{d-1}, h_{d+1}, \ldots , h_p\}\). Therefore, in this custom tuning procedure, we take advantage of this fact and tune each of the halflife parameters as an initial step.

In the second step of the custom tuning procedure, we learn the MedImpute weight parameter \(\alpha _d\) for each covariate. As in cross-validation, we tune the MedImpute weight parameters over a discrete range of values, denoted as \(\mathcal {A}\). For example, in the computational experiments, we set \(\mathcal {A} = \{0, 0.05, \ldots , 0.95, 1.0\}\), denoting relative MedImpute weights of 0%, 5%, ..., 100%, respectively. For each covariate d, we compute the k-fold error for each MedImpute weight value \(\alpha _d \in \mathcal {A}\). In particular, to compute the k-fold error for the MedImpute weight value \(\alpha _d\), first we split the dataset into k subsets (aka “folds”), next we impute each data subset using the rest of the subsets as training data, and finally we compute the total sum-of-squared errors across all of the folds. We select the MedImpute weight parameter \(\alpha _d\) which yields the lowest k-fold error. For continuous covariates, the k-fold error is defined as:

$$\begin{aligned} \sum _{\ell = 1}^k \sum _{\{i: (i,d) \in \mathcal {N}_0^\ell \}} (x_{id} - {\hat{w}}_{id}^\ell )^2, \end{aligned}$$
(13)

where \(\mathcal {N}_0^\ell\) are the known continuous values in the \(\ell\)th fold. The imputed values \({\hat{w}}_{id}^\ell\) are given by:

$$\begin{aligned} {\hat{w}}_{id}^\ell := (1 - \alpha _d) w^{\text {OPT}_\ell }_{id} + \alpha _d \sum _{\{j: (j,d) \in \mathcal {N}_0 \setminus \mathcal {N}_0^\ell \}} {\overline{C}}_{ijd} x_{jd}, \end{aligned}$$
(14)

where \(w^{\text {OPT}_\ell }_{id}\) is the OptImpute imputation of \(x_{id}\) using the data from the other \(k-1\) folds, and \(\mathcal {N}_0 \setminus \mathcal {N}_0^\ell\) are the known continuous values not in the \(\ell\)th fold. For categorical covariates, the k-fold error is defined as:

$$\begin{aligned} \sum _{\ell = 1}^k \sum _{\{i: (i,d) \in \mathcal {N}_1^\ell \}} \mathbb {1}_{\{x_{id} \ne {\hat{v}}^\ell _{id}\}}, \end{aligned}$$
(15)

where \(\mathcal {N}_1^\ell\) are the known categorical values in the \(\ell\)th fold. The imputed values \({\hat{v}}_{id}^\ell\) are given by:

$$\begin{aligned} {\hat{v}}_{id}^\ell := \arg \max _{v_{id}} \left[ (1 - \alpha _d) \mathbb {1}_{\{v^{\text {OPT}_\ell }_{id} = v_{id}\}} + \alpha _d \sum _{\{j: (j,d) \in \mathcal {N}_1 \setminus \mathcal {N}_1^\ell \}} {\overline{C}}_{ijd} \mathbb {1}_{\{x_{jd} = v_{id}\}} \right] , \end{aligned}$$
(16)

where \(v^{\text {OPT}_\ell }_{id}\) is the OptImpute imputation of \(x_{id}\) using the data from the other \(k-1\) folds, and \(\mathcal {N}_1 \setminus \mathcal {N}_1^\ell\) are the known categorical values not in the \(\ell\)th fold. Intuitively, \({\hat{v}}_{id}^\ell\) is the weighted mode of the OptImpute value and the other known values of the same covariate, where the weights are \((1 - \alpha _d)\) and \(\alpha _d {\overline{C}}_{ijd}\), respectively.
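To illustrate the second step, the sketch below blends a fold's OptImpute prediction with the patient's time-series estimate as in Eq. (14) and selects the weight with the lowest held-out squared error. The arrays `x_true`, `w_opt`, and `w_ts` are hypothetical stand-ins for the held-out truths and the two per-fold predictions.

```julia
# Eq. (14) for a single entry: blend the OptImpute prediction with the
# patient's own time-series estimate using the candidate weight α.
blend(w_opt::Float64, w_ts::Float64, α::Float64) = (1 - α) * w_opt + α * w_ts

# Choose α by total squared error on held-out entries across all folds.
function tune_alpha(x_true, w_opt, w_ts, A = 0.0:0.05:1.0)
    errs = [sum((x_true .- blend.(w_opt, w_ts, α)).^2) for α in A]
    return A[argmin(errs)]
end
```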

Finally, we note that there is one more hyperparameter that we may tune for the med.knn algorithm: K, the number of nearest neighbors. In the computational experiments, we fix \(K = 10\), which works well for the datasets that we consider here. Previously, it has been shown that the OptImpute methods are relatively robust even if their hyperparameters are misspecified (Bertsimas et al. 2018b). Thus, while the accuracy of the med.knn algorithm can be improved slightly by tuning over K, the relative improvement in imputation accuracy is outweighed by the increased computational cost.

[Algorithm 2: med.knn for a single warm start]

2.5 The med.knn algorithm

In this section, we provide details for the updates in the med.knn imputation algorithm. This is a first-order method to find locally optimal solutions to Problem (8). As in the opt.knn algorithm, we alternately update \(\mathbf {Z}\) and \((\mathbf {W}, \mathbf {V})\) until the solution converges. The update for \(\mathbf {Z}\) is identical to the one for opt.knn, and is computed with a simple sorting procedure on the distances. However, the update for \((\mathbf {W}, \mathbf {V})\) is modified and depends upon the MedImpute parameters \(\alpha _d\), \(C_{ijd}\). As in opt.knn, we can update the values of \((\mathbf {W}, \mathbf {V})\) either with Block Coordinate Descent (BCD) or Coordinate Descent (CD), which are described in the following subsections. The opt.knn updates for both BCD and CD are equivalent to the corresponding med.knn updates when \(\alpha _d = 0\) for all \(d = 1,\ldots ,p\).

2.5.1 Block coordinate descent

In this approach, we update all of the imputed values at once. We call this approach BCD because we update the variables \((\mathbf {W}, \mathbf {V})\) as an entire block, keeping \(\mathbf {Z}\) fixed. Our formulation, Problem (8), decomposes by dimension into \(p_0\) Quadratic Optimization problems for the continuous features and \(p_1\) Mixed Integer Optimization problems for the categorical features. To update the imputed values \(\mathbf {w}^d\) for continuous feature \(d = 1,\ldots ,p_0\), we solve:

$$\begin{aligned} \begin{aligned}&\begin{aligned} \min _{\mathbf {w}^d}~~ \sum _{i \in \mathcal {I}}\sum _{j=1}^n z_{ij} (1 - \alpha _d)(w_{id}-w_{jd})^2 + \sum _{i \in \mathcal {I}}\sum _{j=1}^n \alpha _d {{\overline{C}}_{ijd}} (w_{id}-w_{jd})^2 \\ \end{aligned}\\&\begin{aligned} \text {s.t.}~~~~~ w_{id}&= x_{id}&(i,d) \in \mathcal {N}_0.\\ \end{aligned} \end{aligned} \end{aligned}$$
(17)

Taking the partial derivative of the objective function with respect to \(w_{id}\) for some missing entry \((i,d) \in \mathcal {M}_0\) and setting it to zero, we obtain after some simplifications:

$$\begin{aligned} \begin{aligned} 0 =&\bigg ((1-\alpha _d)K + \alpha _d + \sum _{j \in \mathcal {I}}[(1 - \alpha _d)z_{ji} + \alpha _d {\overline{C}}_{jid}]\bigg )w_{id}\\&- \sum _{(j,d)\in \mathcal {M}_0} [(1 - \alpha _d)(z_{ij} + z_{ji}) + \alpha _d ({\overline{C}}_{ijd} + {\overline{C}}_{jid})]w_{jd}\\&- \sum _{(j,d) \in \mathcal {N}_0} [(1 - \alpha _d)(z_{ij} + \mathbb {1}_{\{j \in \mathcal {I}\}} z_{ji}) + \alpha _d({\overline{C}}_{ijd} + \mathbb {1}_{\{j \in \mathcal {I}\}} {\overline{C}}_{jid})]x_{jd}. \end{aligned} \end{aligned}$$
(18)

This follows directly from equation (9) in Bertsimas et al. (2018b). For each feature \(d = 1,\ldots ,p_0\), we have a system of equations of the above form which we can solve to determine the optimal imputed values \(w_{id}, (i,d) \in \mathcal {M}_0\). Simplifying the notation, suppose that the missing values for dimension d are \(\widetilde{\mathbf {w}}^d:= (w_{1d}, \ldots , w_{ad})\) and the known values are \(\mathbf {x}^d:=(x_{(a+1)d}, \ldots , x_{nd})\). Then the set of optimal imputed values \(w_{id}, (i,d) \in \mathcal {M}_0\) is the solution to the linear system

$$\begin{aligned} ((1-\alpha _d)\mathcal {Q}+\alpha _d \mathbf {P})\widetilde{\mathbf {w}}^d = ((1-\alpha _d)\mathcal {R}+\alpha _d \mathbf {Y})\mathbf {x}^d, \end{aligned}$$
(19)

where the matrices \(\mathcal {Q}\), \(\mathbf {P}\), \(\mathcal {R}\), and \(\mathbf {Y}\) are defined as

$$\begin{aligned} \mathcal {Q}&= \begin{bmatrix} K + \sum _{j \in \mathcal {I}}z_{j1} - 2z_{11} & -z_{12}-z_{21} & \ldots & -z_{1a} -z_{a1} \\ -z_{21}-z_{12} & K + \sum _{j \in \mathcal {I}}z_{j2}- 2z_{22} & \ldots & -z_{2a} -z_{a2} \\ \vdots & \vdots & \ddots & \vdots \\ -z_{a1}-z_{1a}& -z_{a2}-z_{2a} & \ldots & K + \sum _{j \in \mathcal {I}}z_{ja} - 2z_{aa} \end{bmatrix},\end{aligned}$$
(20)
$$\begin{aligned} \mathbf {P}&= \begin{bmatrix} \sum _{j \in \mathcal {I}}{\overline{C}}_{j1d} - 2{\overline{C}}_{11d} & -{\overline{C}}_{12d}-{\overline{C}}_{21d} & \ldots & - {\overline{C}}_{1ad}-{\overline{C}}_{a1d} \\ -{\overline{C}}_{21d}-{\overline{C}}_{12d} & \sum _{j \in \mathcal {I}} {\overline{C}}_{j2d} - 2{\overline{C}}_{22d} & \ldots & - {\overline{C}}_{2ad}-{\overline{C}}_{a2d} \\ \vdots & \vdots & \ddots & \vdots \\ -{\overline{C}}_{a1d}-{\overline{C}}_{1ad}& -{\overline{C}}_{a2d} - {\overline{C}}_{2ad} & \ldots & \sum _{j \in \mathcal {I}}{\overline{C}}_{jad}- 2{\overline{C}}_{aad} \end{bmatrix}, \end{aligned}$$
(21)
$$\begin{aligned} \mathcal {R}&= \begin{bmatrix} z_{1(a+1)} + \mathbb {1}_{\{(a+1)\in \mathcal {I}\}}z_{(a+1)1} & \ldots & z_{1n} + \mathbb {1}_{\{n \in \mathcal {I}\}}z_{n1} \\ \vdots & \ddots & \vdots \\ z_{a(a+1)} + \mathbb {1}_{\{(a+1)\in \mathcal {I}\}}z_{(a+1)a} & \ldots & z_{an} + \mathbb {1}_{\{n \in \mathcal {I}\}}z_{na} \end{bmatrix}, \end{aligned}$$
(22)
$$\begin{aligned} \mathbf {Y}&= \begin{bmatrix} {\overline{C}}_{1(a+1)d} + \mathbb {1}_{\{(a+1)\in \mathcal {I}\}}{\overline{C}}_{(a+1)1d} & \ldots & {\overline{C}}_{1nd} + \mathbb {1}_{\{n \in \mathcal {I}\}}{\overline{C}}_{n1d} \\ \vdots & \ddots & \vdots \\ {\overline{C}}_{a(a+1)d} + \mathbb {1}_{\{(a+1)\in \mathcal {I}\}}{\overline{C}}_{(a+1)ad} & \ldots & {\overline{C}}_{and} + \mathbb {1}_{\{n \in \mathcal {I}\}}{\overline{C}}_{nad} \end{bmatrix}. \end{aligned}$$
(23)

Provided that the matrix \((1 - \alpha _d)\mathcal {Q}+\alpha _d \mathbf {P}\) is invertible, this system has the closed-form solution

$$\begin{aligned} \widetilde{\mathbf {w}}^d = ((1 - \alpha _d)\mathcal {Q}+\alpha _d \mathbf {P})^{-1}((1 - \alpha _d)\mathcal {R}+\alpha _d \mathbf {Y}) \mathbf {x}^d \end{aligned}$$
(24)

to this system of equations for each feature \(d = 1,\ldots ,p_0\). To update the imputed values \(\mathbf{v}^d\) for each categorical feature \(d = (p_0+1),\ldots ,p\), we solve the following mixed-integer optimization problem:

$$\begin{aligned} \begin{aligned}&\begin{aligned}&{\min _{\mathbf{v}^d}}&\sum _{i \in \mathcal {I}}\sum _{j=1}^n ((1 - \alpha _d)z_{ij} + \alpha _d {{\overline{C}}_{ijd}}) y_{ij} \end{aligned}\\&\begin{aligned} \text {s.t.}~~~&v_{id} = x_{id}&(i,d) \in \mathcal {N}_1, \\&v_{id} - v_{jd} \le y_{ij}k_d&i=1,\ldots ,n, j=1,\ldots ,n,\\&v_{jd} - v_{id} \le y_{ij}k_d&i=1,\ldots ,n, j=1,\ldots ,n,\\&\mathbf {y} \in \{0,1\}^{|\mathcal {I}| \times n}.\\ \end{aligned} \end{aligned} \end{aligned}$$
(25)

This is a Mixed Integer Optimization problem which is practically solvable, as is the BCD update for opt.knn. Since the BCD update step requires inverting a matrix with \(O(n^2)\) entries and solving an optimization problem with \(O(n^2)\) binary variables, this method works best for smaller problem sizes (\(n \le 10{,}000\)).
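For illustration, the following Julia sketch performs the continuous BCD update by solving the linear system (19), assuming the matrices of Eqs. (20)-(23) have already been assembled (their construction is omitted); a backslash solve is used instead of forming the inverse in (24) explicitly.

```julia
# Sketch of the BCD update for one continuous feature d: solve Eq. (19) for
# the imputed values, given pre-assembled Q, P, R, Y and the known values.
function bcd_update(Q, P, R, Y, x_known::Vector{Float64}, α::Float64)
    A = (1 - α) * Q + α * P              # left-hand-side system matrix
    b = ((1 - α) * R + α * Y) * x_known  # right-hand side from the known entries
    return A \ b                         # imputed values for the missing entries
end
```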

2.5.2 Coordinate descent

In CD, we update the imputed values one at a time. In order to update the imputed value for \(x_{id}\), we fix all of the variables in Problem (8) except for \(w_{id}\) or \(v_{id}\) and solve the corresponding one-dimensional optimization problem. This results in fast, closed-form updates for both the continuous and categorical variables. Each \(w_{id}, (i,d)\in \mathcal {M}_0\) is imputed as the minimizer of the following:

$$\begin{aligned} \begin{aligned} \min _{w_{id}}~~ \sum _{r\in \mathcal {I}}\sum _{j=1}^n z_{rj}\sum _{d=1}^{p_0} (1 - \alpha _d)(w_{rd}-w_{jd})^2 + \sum _{r\in \mathcal {I}}\sum _{j=1}^n\sum _{d=1}^{p_0} \alpha _d {{\overline{C}}_{rjd}}(w_{rd}-w_{jd})^2. \end{aligned} \end{aligned}$$
(26)

Solving the above gives the closed-form solution for every \((i,d)\in \mathcal {M}_0\):

$$\begin{aligned} w_{id} = \frac{\sum _{j=1}^n ((1 - \alpha _d)z_{ij} + \alpha _d {\overline{C}}_{ijd})w_{jd} + \sum _{j \in \mathcal {I}}((1 - \alpha _d)z_{ji}+ \alpha _d {\overline{C}}_{jid})w_{jd}}{(1 - \alpha _d)K + \alpha _d + \sum _{j \in \mathcal {I}}((1 - \alpha _d)z_{ji}+\alpha _d {\overline{C}}_{jid})}. \end{aligned}$$
(27)

Similarly, each categorical variable \(v_{id}, (i,d) \in \mathcal {M}_1\) is imputed as the minimizer of the following:

$$\begin{aligned} \begin{aligned} \min _{v_{id}}~~ \sum _{r\in \mathcal {I}}\sum _{j=1}^n z_{rj}\sum _{d=p_0+1}^{p_0+p_1} (1 - \alpha _d)\mathbb {1}_{\{v_{rd}\ne v_{jd}\}} + \sum _{r\in \mathcal {I}}\sum _{j=1}^n\sum _{d=p_0+1}^{p_0+p_1} \alpha _d {{\overline{C}}_{rjd}}\mathbb {1}_{\{v_{rd}\ne v_{jd}\}}. \end{aligned} \end{aligned}$$
(28)

Suppose that the value of categorical variable \(v_{id}\) is one of \(k_d\) distinct categories \(\{1, 2,\ldots , k_d\}\). Then, the solution to problem (28) is

$$\begin{aligned} \arg \max _{k \in \{1,\ldots ,k_d\}} \Bigg [\sum _{j=1}^n \bigg ((1 - \alpha _d)z_{ij}+\alpha _d {{\overline{C}}_{ijd}}\bigg )\mathbb {1}_{\{v_{jd} =k\}} + \sum _{j\in \mathcal {I}} \bigg ((1 - \alpha _d)z_{ji}+\alpha _d {{\overline{C}}_{jid}}\bigg )\mathbb {1}_{\{v_{jd} =k\}}\Bigg ]. \end{aligned}$$
(29)

Here, we set the imputed variable to be the value with the highest frequency in the neighborhood, with instances of the same person i receiving additional weight calibrated by the parameters \(\{{\overline{C}}_{ijd}\}_{j=1}^n\) and \(\alpha _d\).

This approach scales to large problem sizes (n in the 100,000’s), and it is the method that we implement for the computational experiments.
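For concreteness, the Julia sketch below implements the closed-form update (27) for a single missing continuous entry. The n x n representation of \(\mathbf {Z}\) (with zero rows for complete observations) and the argument names are simplifying assumptions of ours.

```julia
# Coordinate-descent update (27) for entry (i, d): w holds the current imputed
# column d (known entries fixed to their observed values), z is an n x n 0/1
# matrix with z[i, j] = 1 iff j is one of i's K nearest neighbors, Cbar are
# the normalized decay weights for covariate d, and rows_with_missing is I.
function cd_update(w, z, Cbar, i::Int, rows_with_missing, K::Int, α::Float64)
    num = sum(((1 - α) * z[i, j] + α * Cbar[i, j]) * w[j] for j in eachindex(w)) +
          sum(((1 - α) * z[j, i] + α * Cbar[j, i]) * w[j] for j in rows_with_missing)
    den = (1 - α) * K + α +
          sum((1 - α) * z[j, i] + α * Cbar[j, i] for j in rows_with_missing)
    return num / den
end
```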

3 Computational experiments on real-world clinical datasets

In this section, we run a series of computational experiments testing the performance of med.knn imputing missing values in real-world clinical datasets. In Sect. 3.1, we provide an overview of the three datasets and their baseline characteristics. In Sect. 3.2, we describe the mechanisms for generating Missing Not at Random (MNAR) data that are used in some of the experiments. In Sect. 3.3, we describe the setup of the computational experiments, and we describe the imputation methods that we run for comparison across all of the computational experiments. In Sect. 3.4, we report the results of the experiments on the imputation tasks. In Sect. 3.5, we report the results of the experiments on the downstream predictive tasks. In Sect. 3.6 we discuss the results and major takeaways from the computational experiments.

3.1 Description of real-world clinical datasets

In this section we describe the three real-world clinical datasets used in the computational experiments. In Sect. 3.1.1, we describe the FHS dataset. In Sect. 3.1.2, we describe the DFCI dataset. Finally, in Sect. 3.1.3, we describe the PPMI dataset.

3.1.1 Framingham heart study (FHS) dataset

The FHS was started in 1948 with the goal of observing a large population of healthy adults over time to better understand the factors that lead to cardiovascular disease. Over 80 variables were collected from 5209 people over a period of more than 40 years. The FHS is arguably the most influential longitudinal study in the field of cardiovascular and cerebrovascular research. Its data have now been used in more than 2400 studies, and the study is considered one of the top 10 cardiology advances of the twentieth century, alongside the electrocardiogram and open-heart surgery (Daniel Levy 2006).

In our computational experiments, we consider all individuals from the FHS Original Cohort (National Heart, Lung, and Blood Institute, Boston University 2012) with 10 or more observations, which includes \(M = 1107\) unique patients. For each patient, we take the 10 most recent observations, so the dataset has \(n = 11{,}070\) observations total. We include \(p = 13\) covariates: six continuous (Age, Body Mass Index, Systolic Blood Pressure, High-Density Lipoproteins, Hematocrit, Blood Glucose levels) and seven categorical (Gender, Smoking, presence of Cardiovascular Disease, presence of Atrial Fibrillation, presence of diabetes, current prescription of antihypertensive medication, presence of Left Ventricular Hypertrophy from ECG results).

Overall, there are 12.56% missing values in the FHS dataset. The percentage of missing values in each covariate is shown in Table 5 in Appendix 7.1. Due to the design of the longitudinal study, the 10 observations for each patient occur at regular intervals spaced 2 years apart, for a total span of 18 years. For the imputation tasks, we add in additional missing values to the FHS dataset, and evaluate the accuracy of med.knn and comparison methods against the ground-truth values. For the downstream tasks, we evaluate classification models which predict 10-year risk of stroke given the imputed training data.

3.1.2 Dana farber cancer institute (DFCI) dataset

The DFCI dataset was obtained from a recently published work on predicting mortality in late-stage cancer patients (Bertsimas et al. 2018a). In this study, the authors retrospectively obtained patient data from EHR and linked Social Security Administration mortality data for cancer patients at the Dana Farber Cancer Institute / Brigham and Women’s Cancer Center from 2004 through 2014. Predictive models were fit for the entire population and for individual cancers, including breast, lung, colorectal, kidney, and prostate cancer. Study eligibility required adult patients who had received at least one anticancer treatment over the course of their care, including chemotherapy, immunotherapy, and targeted therapy.

In our computational experiments, we consider all patients with late-stage breast cancer from the DFCI dataset. Each observation corresponds to a patient initiating an anticancer regimen which was systematically recorded in the hospital’s database. As a result, for every patient who followed more than one regimen, multiple observations were collected. For each patient, we include all of their observations in either the training set or testing set, respectively. In total, we have 12,206 observations that correspond to 5987 unique patients. This includes 3228 individuals who have just one line of therapy and therefore only appear once in this dataset. For each observation, there are 106 covariates which describe the patient at that point in time, including demographics, lab tests, vital signs, current medications, medical history, biomarkers, and variables derived from the patient’s temporal EHR history.

Overall, there are 10.79% missing values in the DFCI dataset. The percentage of missing values in each covariate is shown in Table 6 in Appendix 7.1. Due to the nature of this observational study, the observations for each patient occur at irregular intervals, which correspond to hospital visits. In addition, in the dataset each patient has anywhere from 1 to 12 observations. In Appendix 7.1, we provide some more details on the DFCI dataset, including the distribution of observations per patient (see Fig. 13) and summary statistics of the time intervals between each visit (see Table 8). For the imputation tasks, we add in additional missing values to the DFCI dataset, and evaluate the accuracy of med.knn and comparison methods against the ground-truth values. For the downstream tasks, we evaluate classification models which predict 60-day risk of mortality given the imputed training data.

3.1.3 Parkinson’s progression markers initiative (PPMI) dataset

The PPMI (Marek et al. 2011) was a landmark observational clinical study with the aim of comprehensively evaluating patient cohorts using imaging, biologic sampling, and clinical and behavioral data to identify biomarkers of Parkinson’s disease progression.

In our computational experiments, we consider data from the PPMI baseline examination as well as the following three years of follow-up. In this longitudinal study, 20 patients appeared in only one follow-up examination, 33 in two, while the remaining 352 participated in all clinical evaluations. As a result, in total we have 1547 observations corresponding to 405 distinct patients. For each observation, there are 116 covariates which describe the demographic characteristics, the results of behavioral tests, clinical test results, as well as the presence or absence of genetic mutations related to the disease.

Overall, there are 2.61% missing values in the PPMI dataset. The percentage of missing values in each covariate is shown in Table 7 in Appendix 7.1. Due to the design of the longitudinal study, the up to 4 observations for each patient occur at regular intervals spaced 1 year apart, for a total span of 3 years. For the imputation tasks, we add in additional missing values to the PPMI dataset, and evaluate the accuracy of med.knn and comparison methods against the ground-truth values. For the downstream tasks, we evaluate regression models which predict the Montreal Cognitive Assessment (MoCA) score one year in advance. The MoCA score is a rapid screening instrument for mild cognitive dysfunction, a clinical state that often progresses to dementia (Nasreddine et al. 2005).

3.2 Mechanisms for generating missing not at random (MNAR) data

Missing data can either be Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR) (Little and Rubin 2019). The type of missingness can be determined through an understanding of the specific feature and what systematic biases may exist in its collection process. Different types of missingness must be treated differently for meaningful analysis. In reality, missing data are most commonly associated with the MNAR category where the presence of unknown values is systematically related to unobserved factors.

In this section, we describe mechanisms for generating Missing Not at Random (MNAR) data for our computational experiments. We consider different mechanisms for the longitudinal and EHR datasets corresponding to the different patterns of missing data which are typically observed in real-world datasets. In Sect. 3.2.1, we describe the missing data mechanism that we use for the MNAR experiments on the two longitudinal datasets: FHS and PPMI. In Sect. 3.2.2, we describe the missing data mechanism that we use for the MNAR experiments on the EHR dataset: DFCI.

For all MNAR experiments, the total percentage of missing data is fixed at \(30\%\). For each individual experiment, we assume that the dataset is (\(\gamma \cdot 30\%\) MNAR, \((1-\gamma )\cdot 30\%\) MCAR), where \(\gamma\) is a constant that we select between 0 and 1. To generate the missing data patterns, we first generate the \(\gamma \cdot 30\%\) MNAR patterns and then randomly select an additional \((1-\gamma )\cdot 30\%\) subset of the data to be Missing Completely at Random (MCAR). In the following two sections, we describe the specific ways that we generate MNAR data for longitudinal studies and EHR data, which are influenced by real-world missing data mechanisms.

3.2.1 MNAR mechanism for data from longitudinal studies

In longitudinal studies, missing data patterns often result from changes in the experiment design. Researchers may decide to include an additional set of variables as the study progresses over time due to new information from other investigations. Thus, it is common for feature d to be missing for the first \(t_d\) rounds of long-term longitudinal studies. For example, ECG results were only first recorded in the FHS study 14 years after the study began (D’Agostino et al. 2013; Mahmood et al. 2014).

To generate the \(\gamma \cdot 30\%\) MNAR patterns under this mechanism, we use the following process. First, we randomly select a covariate d and draw \(t_d\) from the discrete uniform distribution on \(\{1,2,\ldots ,N\}\), where \(N = 10\) for the FHS dataset and \(N = 4\) for the PPMI dataset. The value \(t_d\) corresponds to the last round of the longitudinal study in which covariate d is missing. For example, if \(t_d = 2\) for the covariate “Left Ventricular Hypertrophy” (LVH), then the value for LVH will be missing for all observations in the first two clinical examinations. We continue this process until we have introduced \(\gamma \cdot 30\%\) MNAR missing values. Afterwards, we introduce additional MCAR missing values into the remaining dataset in order to obtain the final dataset with 30% missing values.
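The following Julia sketch illustrates this mechanism under our own naming: `round_of` maps each observation to its study round, the matrix element type is assumed to admit `missing`, and the loop may slightly overshoot the target fraction before stopping.

```julia
# Sketch of the longitudinal MNAR mechanism: repeatedly pick a covariate d and
# a round t_d, and mask d for all observations in rounds 1..t_d, until at
# least the target fraction of entries is missing.
function mask_longitudinal_mnar!(X, round_of::Vector{Int}, N::Int, target_frac::Float64)
    n, p = size(X)
    while count(ismissing, X) / (n * p) < target_frac
        d = rand(1:p)        # covariate chosen at random
        t_d = rand(1:N)      # last round in which covariate d is missing
        for i in 1:n
            round_of[i] <= t_d && (X[i, d] = missing)
        end
    end
    return X
end
```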

3.2.2 MNAR mechanism for data from EHR

In EHR data, missing data patterns may be correlated with the severity of the patient’s condition. Consider the case of a patient whose physician suspects chronic kidney disease. The associated record is more likely to have a recorded value for the Glomerular Filtration Rate, since it is a direct indicator of kidney function (Levey et al. 2005). Therefore, observed values are more likely to be below the threshold of \(60 \text {mL/min/1.73 m}^2\), since they correspond to sicker patients.

To generate the \(\gamma \cdot 30\%\) MNAR patterns under this mechanism, we suppose that the missing indicators are independent Bernoulli random variables, where the probability that entry \(x_{id}\) is missing equals the probability that a normal random variable \(N(x_{id}, \epsilon )\) exceeds a covariate-specific threshold. The threshold for each covariate d is the quantile of \(\mathbf {X}^d\) which corresponds to the desired missing percentage level \(\gamma \cdot 30\%\). Then, we introduce additional MCAR missing values into the remaining dataset in order to obtain a final dataset with 30% missing values in total for this experiment.
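
The sketch below illustrates this mechanism under two assumptions of ours: the threshold is taken to be the \((1 - 0.3\gamma)\) quantile of the observed column, so that roughly \(\gamma \cdot 30\%\) of the entries fall above it, and \(\epsilon\) is treated as a standard deviation. Rather than evaluating the normal tail probability in closed form, each Bernoulli indicator is realized by a single noisy draw, which has exactly the stated missingness probability. Names and the default value of \(\epsilon\) are illustrative.

    using Random, Statistics

    # Sketch of the EHR MNAR mechanism: entry (i, d) is missing with probability
    # P(N(x_id, ε) > τ_d), where τ_d is a per-covariate quantile threshold.
    function ehr_mnar_mask(rng, X::Matrix{Float64}; total=0.30, γ=0.5, ε=0.1)
        n, p = size(X)
        mask = falses(n, p)
        for d in 1:p
            τ = quantile(X[:, d], 1 - γ * total)   # threshold for covariate d
            for i in 1:n
                # one draw of N(x_id, ε) realizes the Bernoulli indicator
                mask[i, d] = X[i, d] + ε * randn(rng) > τ
            end
        end
        while count(mask) < total * n * p           # top up with MCAR entries
            mask[rand(rng, 1:n), rand(rng, 1:p)] = true
        end
        return mask
    end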

3.3 Experimental setup

In this section, we describe the setup of computational experiments that compare med.knn to other state-of-the-art imputation methods. We use data from three distinct sources to test the performance of our algorithm on both longitudinal cohort study and EHR datasets. The codebase for the computational experiments is publicly available at https://github.com/colin78/medimpute_computational_experiments.

In our experiments, we take the full dataset to be the ground truth. First, we normalize the data so that each continuous covariate has mean zero and standard deviation one. Then, we run some of the most commonly used and state-of-the-art imputation methods to predict the missing values and compare them against med.knn. The methods that we compare are as follows:

  1. Mean (mean): This is the simplest method. For each continuous feature, we impute the mean of the observed values and, for each categorical feature, we impute the mode of the observed values (Little and Rubin 2019).

  2. Moving average (moving.avg): This method takes into account only observations of the same entity (i.e., patient) and imputes their average within a given time window (see the sketch following this list). In cases where only one observation per entity is available, the method reduces to the mean. For each dataset, we consider a different time horizon depending on the relative time scale of the data (i.e., years, months, or days). Implemented in the Julia programming language.

  3. Bayesian principal component analysis (bpca): This method combines a singular value decomposition (SVD) of the data matrix with information from a Bayesian prior distribution on the model parameters to impute missing values (Oba et al. 2003). Implemented using the pcaMethods package in the R programming language.

  4. Multivariate imputation via chained equations (mice): In this multiple imputation method, we begin from m random starts and iteratively update each one to produce m independent imputations. In each iteration, we update the imputed values in feature d by drawing from a distribution conditional on all other features (van Buuren and Groothuis-Oudshoorn 2011). We use classification trees for the categorical features and regression trees for the continuous features. Implemented using the mice package in the R programming language.

  5. Multiple imputation with bootstrap expectation maximization (Amelia II): This is another multiple imputation method that builds upon the Amelia I framework, which assumes that the data are jointly distributed as multivariate normal and uses an expectation-maximization (EM) algorithm with bootstrapping (Honaker et al. 2011; King et al. 2001). A newer version of the method allows for the imputation of cross-sectional time series data: it builds a general model of patterns within variables across time by creating a sequence of polynomials of the time index, and is thus able to capture variables that are recorded over time within a cross-sectional unit and vary smoothly over time. Implemented using the Amelia package in the R programming language.

  6. OptImpute under the K-NN objective (opt.knn): This method finds a high-quality solution to Problem (5), minimizing the sum of distances from each point to its K-nearest neighbors (Bertsimas et al. 2018b). We find solutions to this problem using Algorithm 1 with the CD update. Fixing \(K = 10\), we use several warm and random restarts and select the imputation with the best objective value. Implemented using the OptImpute package in the Julia programming language.

  7. MedImpute under the K-NN objective (med.knn): This method finds a high-quality solution to Problem (8), minimizing the sum of distances from each point to its K-nearest neighbors and to other instances of the same individual. We find solutions to this problem using Algorithm 2 with the CD update. For each feature d, we perform cross-validation to tune the parameters \(\alpha _d, h_d\), with the rest of the MedImpute parameters set equal to zero. Fixing \(K = 10\), we use several warm and random restarts and select the imputation with the best objective value. Implemented in the Julia programming language.

For each experiment, we evaluate the imputation accuracy of each method using the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) metrics, which are extended to accommodate both continuous and categorical covariates. Let \(\mathcal {M}_0^{test}\), \(\mathcal {M}_1^{test}\) be the hold-out sets for the missing continuous and categorical covariates, respectively. We define the MAE and RMSE metrics to be:

$$\begin{aligned}&\text {MAE} := \frac{1}{|\mathcal {M}_0^{test}|} \sum _{(i,d) \in \mathcal {M}_0^{test}} |w_{id} - x_{id}| + \frac{1}{|\mathcal {M}_1^{test}|} \sum _{(i,d) \in \mathcal {M}_1^{test}} \mathbb {1}_{\{v_{id} \ne x_{id}\}}, \end{aligned}$$
(30)
$$\begin{aligned}&\text {RMSE} := \sqrt{\frac{1}{|\mathcal {M}_0^{test}|} \sum _{(i,d) \in \mathcal {M}_0^{test}} (w_{id} - x_{id})^2 + \frac{1}{|\mathcal {M}_1^{test}|} \sum _{(i,d) \in \mathcal {M}_1^{test}} \mathbb {1}_{\{v_{id} \ne x_{id}\}}}. \end{aligned}$$
(31)
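
A direct Julia transcription of Eqs. (30)–(31) is shown below; `W` and `V` hold the imputed continuous and categorical values, `X` the ground truth, and `Mcont`/`Mcat` the held-out index sets of \((i, d)\) pairs. The names are ours, not from the paper's codebase.

    using Statistics

    # Eq. (30): mean absolute error over continuous entries plus the
    # misclassification rate over categorical entries.
    mae(W, V, X, Mcont, Mcat) =
        mean(abs(W[i, d] - X[i, d]) for (i, d) in Mcont) +
        mean(V[i, d] != X[i, d] for (i, d) in Mcat)

    # Eq. (31): square root of the summed mean squared error and
    # categorical misclassification rate.
    rmse(W, V, X, Mcont, Mcat) =
        sqrt(mean((W[i, d] - X[i, d])^2 for (i, d) in Mcont) +
             mean(V[i, d] != X[i, d] for (i, d) in Mcat))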

In addition to comparing the accuracy of each method on the imputation task, we also compare their performance on downstream predictive tasks tailored to each dataset. In these experiments, we use the imputation methods to fill in the missing values, and then train machine learning models on the completed datasets. By comparing the accuracy of the predictive models on the downstream tasks, we can see the relative impact of using one imputation method versus another in a machine learning pipeline. For the FHS dataset, the downstream task is to predict 10-year risk of stroke, a classification task. For the DFCI dataset, the downstream task is to predict 60-day risk of mortality, which is also a classification task. For the PPMI dataset, the downstream task is to predict the Montreal Cognitive Assessment (MoCA) score for the next year, which is a regression task.

To evaluate the accuracy on the downstream predictive task, we first split the patients from the completed dataset into a training and testing set using a 75%/25% ratio. For the longitudinal datasets (FHS and PPMI), we include only one visit per patient, the most recent one. Thus, the time series component of the dataset is present only in the missing data imputation process, not in the supervised learning part of the experiment. This setup allows us to quantify the relative benefit of med.knn per individual. For the EHR dataset (DFCI), we include all of the observations from each patient in either the training or the testing set for the supervised learning task.

Next, we train predictive models on the training set and report the out-of-sample accuracy on the testing set. For the classification tasks, we train \(\ell _1\)-regularized logistic regression models and report the out-of-sample Area Under the Receiver Operating Characteristic Curve (AUC). For the regression task, we train \(\ell _1\)-regularized linear regression models and report the out-of-sample Mean Absolute Error (MAE). These two metrics are commonly used evaluation criteria in machine learning (Hastie et al. 2009). We repeat all experiments for 25 random seeds and average the results. Each iteration corresponds to a different random split of the patients into the training and testing sets, a random warmstart, and a randomly generated missing data pattern. In particular, we note that the patient IDs and the time stamps corresponding to each row of the dataset are maintained across the different random seeds, so that the temporal sequence of the records remains the same as in the original dataset.
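
For reference, the out-of-sample AUC can be computed directly from its rank-statistic definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, counting ties as one half. A quadratic-time Julia sketch (sufficient for test sets of this size; not the paper's evaluation code):

    # AUC as the Wilcoxon rank statistic; O(n_pos * n_neg) pairwise comparisons.
    function auc(scores::Vector{Float64}, labels::Vector{Bool})
        pos, neg = scores[labels], scores[.!labels]
        total = 0.0
        for sp in pos, sn in neg
            total += sp > sn ? 1.0 : (sp == sn ? 0.5 : 0.0)
        end
        return total / (length(pos) * length(neg))
    end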

We artificially create missing data under different mechanisms and random patterns to compare the imputation accuracy of the proposed method. The missing data generation process is applied independently to each column. For a fixed missing percentage \(f\%\), we remove the number of known values needed for each feature to reach the \(f\%\) target. The patient ID \(y_i\) is not factored into the missing data generation process, and all rows are treated as independent observations. If the existing percentage of missing data for a column is already higher than the target \(f\%\), we do not generate any artificial missing values for that covariate, and the feature does not contribute to the estimation of the imputation accuracy metrics.

Given this framework for evaluating imputation methods on both imputation and downstream tasks, we conduct a variety of experiments which vary the pattern of the missing data. In particular, we conduct three different types of experiments that correspond to variations in the form of missing data that we frequently encounter in medical datasets:

  1. Percentage of missing data: We generate patterns of missing data for various percentages ranging from 10 to 50% under the missing completely at random (MCAR) mechanism. Given a target proportion of missing data f (e.g., \(f = 20\%\)), we remove \(f\%\) of the observed values in each column completely at random, independently of the other columns.

  2. Number of observations per patient: With the missing percentage fixed at 50% MCAR, we vary the time frame during which patient observations are included in the imputation task. Our goal is to quantify the effect of the time series component as we vary its intensity.

  3. Mechanism of missing data: With the missing percentage fixed at 30%, we vary the missing data mechanism from Missing Completely At Random (MCAR) to Missing Not At Random (MNAR) on a gradient scale. In particular, we suppose that the missing pattern is (\(\gamma \cdot 30\%\) MNAR, \((1 - \gamma )\cdot 30\%\) MCAR), where \(\gamma\) varies from 0 to 1. We consider two different MNAR mechanisms that correspond to distinct missing data patterns observed in longitudinal studies and EHR.

The objective of the first set of experiments is to determine which imputation methods perform best at high and low levels of missing data. For these experiments, we also report the results from statistical hypothesis tests (Friedman Rank and pairwise t-tests) to evaluate whether the rankings and differences between the imputation algorithms are statistically significant. The objective of the second set of experiments is to determine how the performance of med.knn and other imputation methods varies as the amount of time series information available on each patient fluctuates. Finally, the objective of the third set of experiments is to determine how robust each imputation method is with respect to the missing data mechanism; the two mechanisms for generating MNAR data are described in Sect. 3.2. Below, we summarize all of the steps required to run one of the computational experiments for a single random seed (a compact sketch of this driver loop follows the list):

  1. Fix a random seed s, a dataset, a desired missingness percentage level \(f\%\), a missing data imputation method, and a value for the \(\gamma\) parameter.

  2. Generate a random missing data pattern in the given dataset using the targeted percentage of missing values \(f\%\), the random seed s, and the value of the \(\gamma\) parameter.

  3. Impute the missing values in the provided dataset using the specified algorithm (e.g., med.knn, mean, bpca).

  4. Calculate the imputation error using the MAE and RMSE metrics (see Eqs. 30, 31) on the artificially generated missing data.

  5. Split the patients in the dataset into a training and testing set using a 75%/25% ratio. For the longitudinal datasets, include only the most recent observation from each individual in the training and testing sets. For the EHR (DFCI) dataset, include all of the observations from each individual in the training or testing set.

  6. Train a downstream predictive model on the training set using the cv.glmnet function from the R glmnet package (Friedman et al. 2009). For the FHS and DFCI datasets, which have binary outcome variables, train a logistic regression model with \(\ell _1\) regularization. For the PPMI dataset, which has a continuous outcome variable, train a linear regression model with \(\ell _1\) regularization.

  7. Report the out-of-sample performance of the trained model on the testing set. For the classification tasks, report the out-of-sample AUC, and for the regression task, report the out-of-sample MAE.

3.4 Imputation results

In this section, we provide the results from all experiments on the imputation tasks. In particular, we present the imputation results from the (1) percentage of missing data, (2) number of observations per patient, and (3) mechanism of missing data experiments.

Percentage of missing data

In Fig. 1, we show the MAE imputation accuracy results from the first set of experiments, in which we vary the percentage of missing data from 10 to 50% and the missing data mechanism is fixed to MCAR. We present the exact values and standard errors for this plot in the Appendix in Table 9. Across all of the datasets, med.knn achieves the lowest average MAE for all of the missing percentages tested. On the FHS longitudinal dataset with 50% MCAR data, med.knn has an average MAE of 0.289 compared to the next best method opt.knn with an average MAE of 0.503, a 42.54% reduction. Similarly, on the PPMI longitudinal dataset with 50% MCAR data, med.knn has an average MAE of 1.286 compared to the next best method opt.knn with an average MAE of 1.99, a 35.37% reduction. On the DFCI dataset with 50% MCAR data, med.knn has an average MAE of 3.568 compared to the next best method mean with an average MAE of 4.367, an 18.30% reduction.

Fig. 1 Imputation errors for each method using the MAE metric on the FHS, DFCI, and PPMI datasets, varying the percentage of missing data from 10 to 50%. The missing data mechanism is fixed to MCAR.

In Fig. 2, we present the RMSE imputation accuracy results. In general, the results are similar to the MAE results, and med.knn produces the imputation with the lowest RMSE across all experiments. One notable difference is that on the DFCI dataset, the relative improvement of med.knn compared to bpca, moving.avg, and mean is much smaller. Because the mean imputation method performs relatively well, this suggests that there are some difficult-to-impute covariates in the DFCI dataset which result in large RMSE values for all of the more complex methods.

Fig. 2 Imputation errors for each method using the RMSE metric on the FHS, DFCI, and PPMI datasets, varying the percentage of missing data from 10 to 50%. The missing data mechanism is fixed to MCAR.

In Table 1, we present the results from the Friedman Rank test for each of the Missing Data imputation experiments. In this statistical test, we compare the relative rank of med.knn against the relative ranks of the comparator methods for each of the 25 random seeds. These results demonstrate that the med.knn method is consistently ranked higher than the others across each of the experiments.

Table 1 The Friedman rank test results for the imputation tasks varying the percentage of missing data from 10 to 50% MCAR, using either the MAE or RMSE metric for comparison

In Table 2, we present the results from the pairwise t-tests for each of the experiments. In this statistical test, we evaluate the differences in MAE between med.knn and each of the comparison methods. In all of the experiments, we observe that the differences in MAE are statistically significant, with p-values less than 0.001. In most cases, we observe that the relative improvement of med.knn decreases as the percentage of missing data increases. This is because the comparator methods perform similarly across all levels of missing data from 10 to 50%, while med.knn performs best at the lowest missing percentages. One exception is mice on the PPMI dataset, which declines in performance rapidly as the percentage of missing data increases. Another exception is the bpca method, which surprisingly improves in performance as the percentage of missing data increases for the DFCI and PPMI datasets. One explanation for these results could be that bpca is overfitting on the datasets which have few missing values.

In the “Appendix”, we present the MedImpute hyperparameters which were selected in the Missing Percentage experiments for the FHS dataset. In Table 15, we show the median halflife parameters that were selected for each covariate at each missing percentage. We observe that most of the halflife parameters are consistent across different levels of missing data, and for many of the covariates the highest halflife parameter of 1000 days was selected. This suggests that for these covariates, a measurement from 1000 days ago may still significantly inform the measurement for the same patient today. In addition, we may be able to improve the performance of the method by considering even longer halflife values. In Table 16, we show the median alpha parameters that were selected during validation for each covariate at each missing percentage. In all cases, the alpha parameter is at least 0.5, and in many cases it equals 1. This suggests that for these covariates, the time series part of the objective function is more important for the imputation than the K-nearest neighbors part. In addition, we observe that the selected alpha parameter generally decreases or remains the same as the percentage of missing data increases. This suggests that as the percentage of missing data increases, the time series part of the objective function should be weighted less heavily in the imputation, because there is less time series information available for each observation in the dataset.

Table 2 Pairwise t-tests between med.knn and benchmark methods for imputation tasks varying the percentage of missing data from 10 to 50% MCAR, using the MAE metric for comparison

Number of observations per patient

In Fig. 3, we present the MAE imputation accuracy results from the experiments in which we vary the number of observations per patient. We present the exact values and standard errors in this plot in the Appendix in Table 11. Across all of the experiments, we observe that as the time horizon increases, the performance of med.knn generally improves. This is expected, because as the time horizon increases, we include more observations per patient in the dataset, so there is more time series information that can be leveraged during the imputation process.

Similarly, the imputation accuracy of the moving.avg method generally improves as the time horizon increases. One notable exception is the FHS dataset, where the MAE of the moving.avg method increases as the time horizon grows from 10 to 20 years, while the MAE of med.knn remains relatively constant. From this, we can deduce that past observations of patients in the FHS dataset from 10 to 20 years prior have little predictive power for the other imputed values, which causes simple time series methods such as moving.avg to perform worse with more data. In contrast, the med.knn method has an exponential halflife parameter that we can tune so that observations from 10+ years ago are weighted less heavily in the imputation, so its performance remains about the same with the additional data.
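
To illustrate the role of \(h_d\): under an exponential half-life weighting of the assumed form \(2^{-\Delta t / h_d}\) (the precise functional form used by MedImpute is given in the formulation; this only conveys the tuning intuition), an observation loses half its weight every \(h_d\) days:

    halflife_weight(Δt, h) = 2.0^(-Δt / h)   # assumed exponential decay form

    halflife_weight(0,    1000.0)  # 1.0   -> today's observation, full weight
    halflife_weight(1000, 1000.0)  # 0.5   -> half weight after one half-life
    halflife_weight(3650, 1000.0)  # ~0.08 -> ten-year-old observations fade out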

One surprising trend that we observe in these graphs is the performance of amelia, which is another imputation method that takes time series information into account. On the DFCI dataset, as the time horizon increases, the imputation error increases. In addition, on the FHS dataset, as the time horizon increases, the imputation error remains about the same. Only in the PPMI dataset does the performance of amelia noticeably improve as the time horizon increases.

Fig. 3 Imputation errors for each method using the MAE metric on the FHS, DFCI, and PPMI datasets, varying the time horizon which determines the number of observations per patient. The missing data mechanism is fixed to MCAR, and the total percentage of missing data is fixed to 50%.

In Fig. 4, we present the RMSE imputation accuracy results for the Observations Per Patient experiments. The results are similar to the MAE imputation accuracy results, and med.knn produces the imputation with the lowest RMSE across all experiments. One characteristic of the RMSE results is that they are much noisier, and in particular on the DFCI dataset the RMSE values do not decrease monotonically in a smooth fashion. Since the RMSE metric is more sensitive to outliers than the MAE metric, this suggests that there may be some outliers in the DFCI data which are added into the dataset at different time horizons.

Fig. 4 Imputation errors for each method using the RMSE metric on the FHS, DFCI, and PPMI datasets, varying the time horizon which determines the number of observations per patient. The missing data mechanism is fixed to MCAR, and the total percentage of missing data is fixed to 50%.

In addition to evaluating the imputation accuracy of med.knn on datasets with varying numbers of observations per patient, we can also evaluate the imputation accuracy on subsets of patients within the DFCI dataset which have varying numbers of observations. In Fig. 5, we present the imputation errors for med.knn on the DFCI dataset with 30% MCAR missing data, for subgroups of patients which have \(1,2,\ldots ,12\) observations per patient in the dataset. Overall, the MAE for the entire dataset is 3.331. For patients with one visit, and therefore one observation in the dataset, the average MAE is almost 3.5. In contrast, for patients with 10 or more visits, the average MAE is below 2.5. This suggests that in datasets with heterogeneous numbers of observations per patient, the med.knn imputation may be most accurate for the patients with the most observations in the dataset.

Fig. 5 Imputation errors for med.knn on the DFCI dataset with 30% MCAR missing data for subgroups of patients which have varying numbers of visits in the dataset.

Overall, from the Observations Per Patient experiments, we can conclude that the med.knn method performs best when additional time series information is available. As the time horizon increases, the imputation accuracy of med.knn generally improves or remains the same, while in a few cases the other time series methods, moving.avg and amelia, perform significantly worse with additional time series data. In addition, the imputation accuracy of the methods which do not take time series information into account (bpca, mean, mice, opt.knn) remains relatively constant as the time horizon varies. Furthermore, within a dataset that has heterogeneous numbers of observations per patient, such as an EHR dataset, we may expect med.knn to most accurately impute values for the patients with the most observations.

In the “Appendix”, we present the MedImpute hyperparameters which were selected in the Observations Per Patient experiments for the FHS dataset. First, in Table 17, we show the median halflife parameters that were selected for each covariate in each experiment. For OPP \(\le 2\), the selection of the halflife parameter does not impact the imputation, so the halflife parameter is set to 1 for each covariate. For OPP \(\ge 3\), the halflife parameters remain relatively constant for each covariate as the number of observations per patient varies. In Table 18, we show the median alpha parameters that were selected for each covariate. When OPP \(= 1\), there is no time series information in the dataset, so the alpha parameter is set to 0 for each covariate. For OPP \(\ge 2\), the selected alpha parameters remain relatively constant for each covariate, with a few gradual trends for some of the covariates. For some covariates, such as Age, Body Mass Index, and Systolic Blood Pressure, the selected alpha parameter gradually increases as OPP increases, while for others, such as Blood Glucose and High-Density Lipoproteins, it gradually decreases. This suggests that the addition of more time series data may change the med.knn imputation of each covariate differently.

Mechanism of missing data

In Fig. 6, we present the MAE imputation accuracy results from the experiments in which we vary the mechanism of missing data. We present the exact values and standard errors in this plot in the Appendix in Table 13. Across all of these experiments, we observe that med.knn has the best average MAE values by a significant margin.

In general, the imputation error of all of the imputation methods increases or remains the same as the proportion of MNAR data increases. Two exceptions are the moving.avg method on the FHS dataset and the amelia method in the DFCI experiments, which both improve in performance at first as a small proportion of MNAR data is added. One possible explanation is that the MNAR data acts as a regularizer which helps these methods avoid overfitting to the dataset. However, in most cases the imputation error increases or remains constant as the percentage of MNAR data increases.

In the FHS MNAR experiments, the performance of all of the methods remains relatively constant; however, the imputation error of moving.avg improves at \(\gamma = 0.1\). Because moving.avg is the second-best performing method in these experiments, this means that the edge of the med.knn method slightly decreases here. In the PPMI MNAR experiments, the imputation error of all methods increases approximately linearly as the proportion of MNAR data increases. In the DFCI MNAR experiments, the imputation error for all methods except amelia increases sharply at \(\gamma = 0.1\), and then increases linearly as \(\gamma\) increases further. As a result, for the experiments on the DFCI and PPMI datasets, the absolute improvement of med.knn over the comparator methods remains about the same as the proportion of MNAR data increases.

Fig. 6 Imputation errors for each method using the MAE metric on the FHS, DFCI, and PPMI datasets, varying the ratio of the missing data mechanism from \(\gamma = 0\) (30% MCAR, 0% MNAR) to \(\gamma = 1\) (0% MCAR, 30% MNAR). The total percentage of missing data is fixed to 30%.

In Fig. 7, we present the RMSE imputation accuracy results for the missing data mechanism experiments. The results are largely consistent with the MAE imputation accuracy results. In particular, med.knn produces the imputation with the lowest RMSE by a significant margin across all experiments.

Fig. 7 Imputation errors for each method using the RMSE metric on the FHS, DFCI, and PPMI datasets, varying the ratio of the missing data mechanism from \(\gamma = 0\) (30% MCAR, 0% MNAR) to \(\gamma = 1\) (0% MCAR, 30% MNAR). The total percentage of missing data is fixed to 30%.

Overall, these experiments demonstrate that the med.knn method performs well relative to the other imputation methods even as the mechanism of missing data changes. In the MNAR experiments for the longitudinal datasets, FHS and PPMI, the relative imputation accuracy of the comparator methods remains approximately the same, with the med.knn method performing best, with the exception of the moving.avg method, which performs significantly worse. Thus, we can conclude that the med.knn method is well suited for imputing missing values under the MNAR mechanism designed for longitudinal datasets described in Sect. 3.2.1. In the MNAR experiments for the EHR dataset DFCI, the relative imputation accuracy of the comparator methods again remains approximately the same, with the med.knn method performing best, with the exception of the amelia method, which performs significantly better. Therefore, we can also conclude that med.knn is suitable for imputing missing values under the MNAR mechanism for EHR datasets described in Sect. 3.2.2.

In the “Appendix”, we present the MedImpute hyperparameters which were selected in Mechanism of Missing Data experiments for the FHS dataset. In Tables 19 and 20, we show the median halflife and alpha parameters that were selected for each covariate for each experiment, respectively. Across all of the experiments, we observe that the parameters selected during the validation procedure remain almost exactly constant. We conclude that varying the missing data mechanism for the FHS dataset according to the approach outlined in Sect. 3.2.1 has little impact on the med.knn imputation for this dataset.

3.5 Prediction results

In this section, we provide the results from all experiments on the downstream prediction tasks. In particular, we present the downstream prediction results from the (1) Percentage of Missing Data, (2) Number of Observations Per Patient, and (3) Mechanism of Missing Data experiments. For the FHS and DFCI datasets, on which we train and evaluate classification models, we report the average out-of-sample AUC results. For the PPMI dataset, on which we train and evaluate regression models, we report the average out-of-sample MAE results.

Percentage of missing data

In Fig. 8, we present the performance on the downstream tasks from the experiments in which we vary the percentage of missing data. We present the exact values and standard errors for this plot in the Appendix in Table 10. Across all of the datasets, the med.knn method performs best, and the downstream performance of all methods generally declines as the missing level increases. In particular, the AUC values generally decrease for the classification tasks and the MAE values generally increase for the regression task as the percentage of missing data increases.

For the FHS dataset, while the downstream performance of all methods declines as the percentage of missing data increases, the downstream performance of med.knn declines least rapidly. In particular, with \(20\%\) missing data, the downstream AUC of med.knn is 0.897, compared to a downstream AUC of 0.861 for the second-best method bpca and a baseline AUC of 0.901 with no additional missing data. With \(50\%\) missing data, the downstream AUC of med.knn is 0.864, compared to 0.826 for the second-best method moving.avg.

Similarly, for the DFCI dataset, the med.knn method performs best across all levels of missing data, and the downstream AUC values generally decrease as the missing level increases. The only exception is for the amelia method, where we do not observe a smooth trend because this method does not converge in some cases. In addition, the relative improvement of med.knn compared to the other imputation methods is lower for this dataset. At \(50\%\) missing data, the downstream AUC of med.knn is 0.889, compared to 0.884 for the second-best method bpca and the baseline AUC of 0.92 with no additional missing data.

Lastly, on the PPMI dataset, we observe the same trends: the med.knn method performs best, and the performance of all methods declines as the missing level increases. In this case, the downstream MAE for each method increases as the percentage of missing data increases. Across all levels of missing data, med.knn achieves the lowest downstream MAE. At \(50\%\) missing data, the downstream MAE of med.knn is 1.917, compared to 2.092 for the second-best method opt.knn and the baseline MAE of 1.170 with no additional missing data.

Fig. 8 Downstream accuracy results for each method on the FHS, DFCI, and PPMI datasets, varying the percentage of missing data from 10 to 50% according to the MCAR mechanism. On each plot, we overlay the downstream accuracy of a baseline model trained with no additional missing data as a dotted blue line (shaded with standard error bars).

In Table 3, we present the results from the Friedman Rank tests for each of the downstream predictive tasks varying the percentage of missing data. Similar to Friedman Rank tests for the imputation tasks, each test is significant with a p-value less than 0.001. These results demonstrate that the med.knn method is consistently ranked higher than the others for each of the downstream predictive tasks.

Table 3 The Friedman Rank test results for the downstream predictive tasks varying the percentage of missing data from 10 to 50% MCAR

In Table 4, we present the results from the pairwise t-tests for each of the experiments. In this statistical test, we evaluate the differences in downstream predictive performance between med.knn and each of the comparison methods. We consider the differences in downstream AUC for the classification tasks and the differences in downstream MAE for the regression task. In most of the experiments, we observe that the differences in downstream AUC/MAE are statistically significant, with p-values less than 0.001. These results demonstrate that the relative improvement in imputation accuracy for the med.knn method carries over to a relative improvement in performance on the downstream predictive tasks with different levels of MCAR data. Between the two classification tasks, we observe that med.knn gives larger improvements in AUC on the FHS dataset than on the DFCI dataset. In addition, we observe that as the percentage of missing data increases, the relative improvement of med.knn generally increases. These results are expected, because as the percentage of missing data increases, the impact of the imputation method on the training data and the final prediction task increases as well. Since med.knn provides substantial improvements in imputation accuracy for all levels of missing data, having larger amounts of missing data generally leads to larger gains in downstream predictive accuracy. There are a few exceptions to this, for example amelia, bpca, mean, and opt.knn on the PPMI dataset, and moving.avg on the DFCI dataset. In these cases, the largest improvement for med.knn occurs at the 10% missing level. For these examples, it follows that med.knn does a much better job at reconstructing the training dataset with 10% missing data, but the other methods begin to catch up as the percentage of missing data increases.

Table 4 Pairwise t-tests between med.knn and benchmark methods for downstream predictive tasks varying the percentage of missing data from 10 to 50% MCAR

Number of observations per patient

In Fig. 9, we present the performance on the downstream tasks from the experiments in which we vary the time horizon which determines the number of observations per patient. We present the exact values and standard errors in this plot in the Appendix in Table 12. Across all of the experiments, we observe that the downstream performance of med.knn tends to improve as the time horizon increases, so that the dataset includes more observations per patient. However, for each dataset, after a certain point there are diminishing returns, so that adding more observations per patient to the dataset does not improve the performance on the downstream task.

For the FHS dataset, in which the task is to predict 10-year risk of stroke, the downstream AUC of med.knn starts to plateau at a time horizon of 6 years. For the DFCI dataset, in which the task is to predict 60-day risk of mortality, the downstream AUC of med.knn starts to plateau around 3 years. Similarly, for the PPMI dataset, in which the task is to predict the next-year MoCA score, the downstream MAE reaches a minimum value at 3 years.

In comparison to the other methods, we observe that med.knn tends to perform relatively better with more observations per patient in the dataset. This indicates that the med.knn method is able to leverage the additional time series information more efficiently than the other methods. The only exception is amelia on the DFCI dataset, which outperforms med.knn at time horizons of 3 and 5 years. However, we observe that the amelia method is more unstable, and med.knn outperforms this method at the longest time horizon of 10 years.

Fig. 9 Downstream accuracy results for each method on the FHS, DFCI, and PPMI datasets, varying the time horizon which determines the number of observations per patient. In these experiments, the missing data mechanism is fixed to MCAR, and the total percentage of missing data is fixed to 50%. On each plot, we overlay the downstream accuracy of a baseline model trained with no additional missing data as a dotted blue line (shaded with standard error bars).

Mechanism of missing data

In Fig. 10, we present the performance on the downstream tasks from the experiments in which we vary the mechanism of missing data. We present the exact values and standard errors for this plot in the Appendix in Table 14. In all of the experiments, we observe that med.knn achieves the best downstream accuracy, typically by a substantial margin.

In the FHS dataset, the average AUC for med.knn remains at or above approximately 0.89 across all proportions of MNAR data, while the second-best performing method, moving.avg, has an average AUC below 0.87. In the PPMI dataset, the downstream MAE values for all of the methods increase approximately linearly as the ratio of MNAR data increases. As a result, the relative improvement of med.knn on downstream tasks remains large across all of the MNAR experiments on longitudinal datasets.

On the other hand, the relative improvement of med.knn on downstream tasks is more varied for the MNAR experiments on EHR data. In the DFCI dataset, the downstream AUC values for each of the methods increase significantly at \(\gamma = 0.1\), and then decrease gradually as \(\gamma\) increases further. These results are somewhat counterintuitive, because the imputation errors for most of these methods increase significantly at \(\gamma = 0.1\) and then increase gradually afterwards. One possible explanation is that the DFCI dataset has some outlier values that tend to be missing under the MNAR mechanism for electronic health record data (described in Sect. 3.2.2), which typically skew the downstream prediction results. At the peak, when \(\gamma = 0.1\), the relative improvement of med.knn is very small, with a downstream AUC of 0.916 compared to the next best method mice, which has a downstream AUC of 0.915. At the extreme, when \(\gamma = 1\), the downstream AUC of med.knn is 0.912, compared to 0.904 for the next best methods (mice and bpca).

Fig. 10 Downstream accuracy results for each method on the FHS, DFCI, and PPMI datasets, varying the ratio of the missing data mechanism from \(\gamma = 0\) (30% MCAR, 0% MNAR) to \(\gamma = 1\) (0% MCAR, 30% MNAR). On each plot, we overlay the downstream accuracy of a baseline model trained with no additional missing data as a dotted blue line (shaded with standard error bars).

3.6 Discussion of the computational experiments on real-world clinical datasets

In this section, we discuss the major takeaways from the computational experiments on real-world clinical datasets. For each dataset, we consider downstream models to predict patient outcomes that are clinically relevant, in order to simulate the performance of med.knn in practical applications. For the FHS and PPMI datasets, which are longitudinal studies, the clinical outcomes of interest are 10-year risk of stroke and next year MoCA score, which can be predicted using the most recent observation for each patient. For the DFCI dataset, which is an EHR dataset, the clinical outcome of interest is 60-day risk of mortality for late-stage cancer patients, which requires us to train models using all of the observations from each patient (using the latest observation for each patient would bias the results). As a result, the evaluation of the downstream models is different between the datasets. Furthermore, we conduct non-identical experiments on each dataset due to inherent dissimilarities in the time series structure.

Due to the significant differences between each dataset, we can draw separate conclusions from each one as a separate case study. The FHS dataset is a long term longitudinal study with many patients, few covariates, and a downstream classification task. In contrast, the PPMI dataset is a shorter longitudinal study with fewer patients, more covariates, and a downstream regression task. Finally, the DFCI dataset is an EHR dataset with irregularly recorded observations, the most patients, the most covariates, and a downstream classification task. The results from the computational experiments demonstrate that med.knn performs well across this range of diverse case studies. In particular, we show that this method performs well on datasets with: (1) large or small numbers of patients, (2) large or small numbers of covariates, and (3) regularly or irregularly recorded observations. Moreover, the application of med.knn for imputation led to improved downstream predictive performance on two binary classification tasks and one regression task.

Prior to training the downstream models, we do not perform any further preprocessing on the imputed data, so we preserve the correlation structure of the original dataset. As a result, since these are real-world datasets, there may be unexpected correlations between the predictors which impact the accuracy of the downstream models. One could apply PCA or another dimensionality-reduction method to transform the feature space prior to training downstream models on the imputed datasets. However, this analysis is outside of the scope of this set of computational experiments.

In the Percentage of Missing Data experiments, we observe that increased imputation accuracy does not always translate into increased downstream model accuracy. For example, on the DFCI dataset, bpca performs poorly on the imputation task (see Fig. 1), but is one of the top-performing methods on the downstream predictive task (see Fig. 8). This is possible because in the downstream predictive task, some features are more significant than others, so having a large imputation error on the insignificant features may only result in a small decline in downstream model accuracy. However, we also observed that in all datasets, med.knn consistently performed best on both the imputation and downstream tasks, by a significant margin in most cases. These results suggest that for all three of the real-world datasets considered here, med.knn leads to improvements in imputation accuracy on the clinically significant covariates in each downstream model.

In the OPP experiments, the major trend that we observe is that the med.knn method performs significantly better with more time series data. For example, in the FHS dataset, the imputation accuracy and downstream performance of med.knn improve dramatically as OPP increases from one to four. This makes sense because, as we include more observations per patient in the dataset, there is more relevant information available to impute the missing covariates for each patient. We expect that this explains why the relative improvement of med.knn is less significant on the DFCI dataset in several of the experiments. In this dataset, over half of the patients have a single observation, so there is limited time series information available to fill in the missing values for these patients. In contrast, in the FHS dataset, every patient has 10 observations in the full dataset, so there is more data available to aid the imputation.

In the MNAR experiments, we demonstrate that med.knn works under missing data mechanisms that are frequently encountered in practice. Longitudinal studies often contain systematic missing information on some clinical examinations based on decisions made by the designers of the study. For example, the Framingham Heart Study dataset has expanded over time as clinicians have incorporated more and more variables that are suspected to be correlated with heart disease (Mahmood et al. 2014). However, since some of these variables were not recorded initially, they are systematically missing from this dataset. In EHR datasets, the clinical covariates recorded at each visit typically vary based on the health condition of the patient. Patients at higher risk are likely to undergo more detailed medical examinations, resulting in fewer missing values. Through the MNAR experiments for each case study, we show that med.knn is an effective method for imputing missing values under these specific mechanisms of missing data for longitudinal studies and EHR datasets.

4 Scaling experiments on simulated clinical datasets

In this section, we present scaling experiments on simulated clinical datasets. In Sect. 4.1, we describe the data generation process which allows us to construct simulated longitudinal clinical datasets with tens of thousands of observations and hundreds of features. In Sect. 4.2, we describe the experimental setup of the scaling experiments, which considers two variations of the med.knn method. In Sect. 4.3, we report the results of the scaling experiments, including the imputation accuracy and timing results.

4.1 Simulated data: Synthea

We create synthetic EHR data with the Synthea patient population simulator to test the performance of the algorithm on larger problem instances in both the number of observations and the number of features. Synthea is an open-source synthetic patient generator that models the medical history of patients using specific demographic information (Walonoski et al. 2018). Patient records are generated using simulation processes that follow disease progression patterns published in the medical literature. For each synthetic patient, Synthea data contains a complete medical history, including medications, allergies, medical encounters, and social determinants of health. We pre-process the records, combining them into a single dataset that contains a summary of all the information available at each visit.

Since we leverage this data source for experiments testing the scalability of the algorithm, we do not limit the number of observations to a specific value. Each patient in the data is associated with 20 distinct visits (observations) on average. We aggregate the EHR into 344 distinct features. Each experiment randomly samples a subset of these features to compare the computational time needed by the algorithm. The covariates that comprise the data include demographic characteristics, diagnosis and procedure codes, medical prescriptions, and lab test results. We do not include any downstream prediction task.

4.2 Experimental setup for the scaling experiments

In this section, we go over the experimental setup for the scaling experiments. We use synthetically generated EHR data, varying both the number of observations n and the number of features p. Our goal is to evaluate the scaling performance and accuracy of the algorithm, comparing the two proposed methods for tuning the hyperparameters \(\alpha _d\) and \(h_d\).

One of the most well-established approaches for hyperparameter tuning in machine learning is K-fold cross-validation (Kohavi et al. 1995). In the time series setting, Bergmeir et al. (2018) showed that this technique is applicable for time series models, in particular for autoregression models. However, due to the large number of combinations of values for \(\alpha _d\) and \(h_d\), the computation time of K-fold cross-validation for med.knn scales at a quadratic rate as the number of covariates increases. For this reason, we propose a custom tuning procedure to select the hyperparameters. We conduct a series of experiments comparing the following hyperparameter selection processes:

  1. Grid search: This approach uses the well-established ten-fold cross-validation process to determine the hyperparameters \(h_d\) and \(\alpha _d\) for every variable. Prior to running the algorithm, 10% of the values of each feature are artificially removed. A grid of candidate values is defined, and all combinations are evaluated for each feature individually when solving the reduced version of the dataset. The grid for \(\alpha _d\) was set to \([0.0, 0.1,\ldots ,1.0]\) and for \(h_d\) to [90, 180, 365, 1000].

  2. Custom tuning: The custom tuning procedure proposed in Sect. 2.4. This is a heuristic method that decomposes the problem into multiple parts, first learning \(h_d\) for each covariate and then learning \(\alpha _d\) for each covariate (see the sketch following this list). This approach does not involve cross-validation and allows for parallel computation, as the problem is fully decoupled.

For each experiment, we evaluate the imputation accuracy of each approach using the MAE and RMSE metrics, as defined in Eqs. 30 and 31. In addition, we compare their scaling performance by measuring the average time needed for completion. In these experiments, we did not consider a downstream prediction task as in Sect. 3. Here, we limit the experiments to the Percentage of Missing Data type, following the experimental setup of Sect. 3.3.

We vary the number of features over [50, 100, 200, 300] and the number of observations over [1000, 12,500, 25,000, 50,000, 75,000]. These bounds were chosen because they represent the most common spectrum of problem sizes that we encounter in healthcare applications. We repeat all experiments for five random seeds and average the results.

4.3 Results of the scaling experiments

In this section, we present the results from the scaling experiments. In Fig. 11, we show the timing results. While both methods scale to the largest problem size with \(n = 75{,}000\) observations and \(p = 300\) features, the Custom Tuning procedure is 60.42% faster at this size than Grid Search, the traditional cross-validation procedure. Across all experiments, Custom Tuning is on average 87.05% faster than Grid Search. We notice that for the smaller problem sizes, the Custom Tuning approach leads to almost instantaneous completion, while Grid Search requires up to 12 h to solve.

Fig. 11 Average time for MedImpute methods to complete imputation tasks on the Synthea dataset using different procedures for hyperparameter tuning, with varying numbers of observations n and features p in the dataset.

Figure 12 presents the imputation accuracy results. The two procedures lead to minimal differences in imputation performance. Across all experiments, the Custom Tuning procedure is slightly more accurate than the Grid Search procedure, with an average improvement of 4.36% in MAE. The gap between the two procedures is largest when \(n \in [25{,}000, 50{,}000]\), with an average reduction of 8.81% in the imputation error. We also note that only when \(n=1000\) does Grid Search outperform the new method, with the MAE of Custom Tuning higher by 2.82% on average. In all other combinations, Custom Tuning leads to more accurate results, with the maximum improvement reaching a reduction of 10.48% (\(n=50{,}000\), \(p=100\)). Detailed results for the RMSE metric are provided in Fig. 14 in the “Appendix”.

Fig. 12 Average MAE imputation errors for MedImpute methods on the Synthea dataset using different procedures for hyperparameter tuning, with varying numbers of observations n and features p in the dataset.

4.4 Discussion of the scaling experiments on simulated clinical datasets

The results from the scaling experiments demonstrate that the custom tuning procedure for the MedImpute hyperparameters \(\alpha _d\) and \(h_d\) is both effective and efficient. In particular, the proposed method significantly reduces the required computational time while also giving a slight improvement in imputation accuracy compared to traditional cross-validation. Using this methodology, we are able to scale the algorithm to larger problem instances without sacrificing imputation performance.

An analysis of the runtime complexity of the two hyperparameter selection methods provides further insights into these results. The key bottleneck of the med.knn algorithm is computing the K-NN assignment on \(\mathbf {X}\) to update \(\mathbf {Z}\) in each coordinate descent step, which requires \({\mathcal {O}}(n \log n)\) operations. The Grid Search procedure requires \({\mathcal {O}}(p^2)\) iterations to identify the best values for \(\alpha _d\) and \(h_d\), so the complete runtime for this method is \({\mathcal {O}}(n p^2 \log n)\). On the other hand, the Custom tuning procedure only requires \({\mathcal {O}}(p)\) iterations because each hyperparameter for each covariate can be computed independently of the remaining covariates. As a result, this method scales in a linear fashion with respect to the number of covariates, and the full runtime is \({\mathcal {O}}(n p \log n)\).
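As a back-of-the-envelope illustration of this gap (our own worked example, using the complexities stated above):

$$\begin{aligned} \frac{{\mathcal {O}}(n p^2 \log n)}{{\mathcal {O}}(n p \log n)} = {\mathcal {O}}(p), \end{aligned}$$

so at \(p = 300\) features, the decoupled procedure saves roughly a factor of 300 in K-NN assignment computations.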

Despite these theoretical asymptotic runtime guarantees, we recognize that the med.knn method with the Custom Tuning procedure still takes up to 16 h on datasets with \(n\sim\) 50,000 observations. However, given that the imputation task usually takes place once, in the pre-processing part of the data analysis, we believe that the time cost is not prohibitive. Moreover, the Custom Tuning process allows for decoupling the problem into smaller instances. Thus, the application of parallel computing techniques can further improve the scaling performance of the algorithm.

5 Discussion

MedImpute is an extension of the OptImpute framework introduced by Bertsimas et al. (2018b). MedImpute uses the same optimization approach to the missing data problem. However, its formulation is significantly different from and more general than the OptImpute formulations, in order to incorporate the additional time series information present in cross-sectional data. The new formulation provides a structured way of accounting for observations from the same entity and of re-weighting the objective function to incorporate time series information. As a result, the imputation algorithm med.knn from the MedImpute framework outperforms opt.knn from the OptImpute framework and other benchmark imputation methods on real-world clinical datasets with patients observed over time.

In the MedImpute formulation, two new parameters are introduced, \(\alpha _d, h_d\), that are specific to each covariate d. The proposed Custom Tuning procedure allows for learning the values of these parameters more efficiently than a traditional Grid Search approach. In addition, these parameters are interpretable in a clinical context, yielding insights regarding the significance of time in their determination. For example, in the FHS dataset, we learn different values of \(\alpha _d\) for chronic disease indicators such as Type 2 Diabetes Mellitus (T2DM) and lab values such as Systolic Blood Pressure (SBP). It is likely that an individual diagnosed with T2DM will continue to have this diagnosis regardless of the other covariates (American Diabetes Association 2010), so MedImpute finds \(\alpha _d\) relatively close to 1 for this feature. On the other hand, the lab measurement of SBP may vary significantly within a single day (Millar-Craig et al. 1978), so previous observations of this covariate from the same individual provide relatively less information. For this feature, MedImpute finds \(\alpha _d\) closer to 0, so that the K-nearest neighbors are weighted more heavily in the imputation. In addition, we learn \(h_d\) to determine the relative weights that we give to observations of feature d from the same individual based on the time elapsed. MedImpute selects higher values of \(h_d\) for features that change slowly over time, such as the Body Mass Index, and lower values for features that change rapidly, such as SBP.

Beyond the healthcare setting, cross-sectional datasets are also quite common in other areas such as finance and economics. Our algorithm can be generalized and applied to any data where there is a time series component and multiple observations are tied to the same entity. The entity may represent a patient, as we portray in this work, or something else that is observed over time such as a financial organization, region, or country. Therefore, the MedImpute imputation framework and the associated med.knn algorithm may be applied to impute missing values in other domains as well.

6 Conclusions

In this paper, we propose the optimization framework MedImpute, which addresses the missing data problem for multivariate time series data encountered in medical applications. We introduce a new imputation algorithm, med.knn, that yields high quality solutions using optimization techniques combined with fast first-order methods. Through computational experiments on three real-world clinical datasets, including two longitudinal studies and one EHR dataset, we show that med.knn offers statistically significant gains in imputation quality over state-of-the-art imputation methods, which lead to improved out-of-sample performance on downstream tasks. Through scaling experiments on a synthetic EHR dataset, we demonstrate that med.knn can complete datasets with tens of thousands of observations and hundreds of features. As a flexible, accurate, and intuitive approach, MedImpute has the potential to become an indispensable tool for applications with longitudinal missing data. Promising areas for future work include: (1) applications of this method to longitudinal datasets that are not related to healthcare, (2) additional experiments to assess the performance on downstream predictive tasks with transformed feature spaces, and (3) extensions of the optimization framework to incorporate more specialized structure that is present in longitudinal healthcare datasets.