Secondary Analysis of Electronic Health Records, pp 143–162
Missing Data
Abstract
In this chapter, the reader will learn about common sources for missing data, how missing data can be classified depending on the origin of missingness, what options are available for handling missing data and how to choose the most appropriate technique for a specific dataset.
Keywords
Missing data, Missingness, Imputation, Interpolation
Learning Objectives

What are the different types of missing data, and the sources for missingness.

What options are available for dealing with missing data.

What techniques exist to help choose the most appropriate technique for a specific dataset.
13.1 Introduction
Missing data is a problem affecting most databases, and electronic health records (EHR) are no exception. Because most statistical models operate only on complete observations of exposure and outcome variables, it is necessary to deal with missing data, either by deleting incomplete observations or by replacing any missing values with an estimated value based on the other information available, a process called imputation. Both methods can significantly affect the conclusions that can be drawn from the data.
Identifying the source of “missingness” is important, as it influences the choice of the imputation technique. Schematically, several cases are possible: (i) the value is missing because it was forgotten or lost; (ii) the value is missing because it was not applicable to the instance; (iii) the value is missing because it is of no interest to the instance. If we were to put this in a medical context: (i) the variable is measured but for some unidentifiable reason the values are not electronically recorded, e.g. disconnection of sensors, errors in communicating with the database server, accidental human omission, electricity failures, and others; (ii) the variable is not measured during a certain period of time due to an identifiable reason, for instance the patient is disconnected from the ventilator because of a medical decision; (iii) the variable is not measured because it is unrelated to the patient's condition and provides no clinically useful information to the physician [1].
An important distinction must be made between data missing for identifiable and for unidentifiable reasons. In the first case, imputing values can be inadequate and add bias to the dataset, so the data is said to be nonrecoverable. On the other hand, when data is missing for unidentifiable reasons, it is assumed that values are missing because of random and unintended causes. This type of missing data is classified as recoverable.
The first section of this chapter describes the theory behind some commonly used methods to handle missing data. The second part illustrates the advantages and disadvantages of these methods by applying them to actual datasets that were created to study the relation between mortality and insertion of indwelling arterial catheters (IAC) in the intensive care unit (ICU).
13.2 Part 1—Theoretical Concepts
In knowledge discovery in databases, data preparation is the most crucial and time-consuming task, and it strongly influences the success of the research. Variable selection consists of identifying a useful subset of potential predictors from a large set of candidates (please refer to Chap. 5, Data Analysis, for further information on feature selection). Rejecting variables with an excessive number of missing values (e.g. >50 %) is usually a good rule of thumb; however, it is not a risk-free procedure. Rejecting a variable may lead to a loss of predictive power and of the ability to detect statistically significant differences, and it can be a source of bias, affecting the representativeness of the results. For these reasons, variable selection needs to be tailored to the missing data mechanism. Imputation can be done before and/or after variable selection.
Handling missing data generally involves three steps:

Identify patterns and reasons for missing data;

Analyse the proportion of missing data;

Choose the best imputation method.
13.2.1 Types of Missingness
The mechanisms by which the data is missing will affect some assumptions supporting our data imputation methods. Three major mechanisms of missingness of the data can be described, depending on the relation between observed (available) and unobserved (missing) data.

Missing Completely at Random (MCAR): the missingness is independent of both the observed and the unobserved measurements. In this case the probability of an observation being missing does not depend on any data value, and reduces to \( \Pr(R \mid x_{o}, x_{m}) = \Pr(R) \), where \( R \) is the missingness indicator, \( x_{o} \) the observed data and \( x_{m} \) the missing data. As an example, imagine that a doctor forgets to record the gender of every sixth patient that enters the ICU. There is no hidden mechanism related to any variable and it does not depend on any characteristic of the patients.

Missing at Random (MAR): in this case the probability of a value being missing is related only to the observable data, i.e., the observed data is statistically related to the missing variables and it is possible to estimate the missing values from the observed data. This case is not completely “random”, but it is the most general case where we can ignore the missing mechanism, as we control the information upon which the missingness depends, namely the observed data. Said otherwise, the probability that some data is missing for a particular variable does not depend on the values of that variable, after adjusting for observed values. Mathematically, the probability of missingness reduces to \( \Pr(R \mid x_{o}, x_{m}) = \Pr(R \mid x_{o}) \). For example, if elderly people are less likely to inform the doctor that they have had pneumonia before, the response rate of the variable pneumonia will depend on the variable age.

Missing Not at Random (MNAR): this refers to the case when neither MCAR nor MAR holds. The missing data depends on both missing and observed values. Determining the missing mechanism is usually impossible, as it depends on unseen data. Hence the importance of performing sensitivity analyses and testing how the inferences hold under different assumptions. For example, we can imagine that patients with low blood pressure tend to have their blood pressure measured less frequently (the missing data for the variable “blood pressure” partially depends on the values of the blood pressure).
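The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the function names, the probabilities (0.3, 0.6, 0.1), the age cut-off of 65 years and the blood pressure cut-off of 100 are assumptions chosen for the example, not values from the chapter. It removes blood pressure values with a probability that is constant (MCAR), that depends on the observed age (MAR), or that depends on the blood pressure itself (MNAR).

```python
import random

def mask_blood_pressure(ages, pressures, mechanism, rng):
    """Return a copy of `pressures` in which some values are replaced by None."""
    masked = []
    for age, bp in zip(ages, pressures):
        if mechanism == "MCAR":
            p_missing = 0.3                        # Pr(R): same for every patient
        elif mechanism == "MAR":
            p_missing = 0.6 if age >= 65 else 0.1  # Pr(R | x_o): driven by observed age
        elif mechanism == "MNAR":
            p_missing = 0.6 if bp < 100 else 0.1   # driven by the value that goes missing
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked.append(None if rng.random() < p_missing else bp)
    return masked

def observed_mean(values):
    """Mean of the values that remain observed."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept)

# Synthetic cohort (hypothetical numbers, for illustration only).
rng = random.Random(0)
ages = [rng.uniform(20, 90) for _ in range(5000)]
pressures = [rng.gauss(120, 20) for _ in range(5000)]
for mechanism in ("MCAR", "MAR", "MNAR"):
    masked = mask_blood_pressure(ages, pressures, mechanism, rng)
    print(mechanism, round(observed_mean(masked), 1))
```

Because low values are removed preferentially under the MNAR rule, the mean of the remaining observations is pulled upwards, which is the kind of bias that cannot be detected from the observed data alone.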
13.2.2 Proportion of Missing Data
Examples of missing data in EHR

            Gender  Glucose  AST  Age
Patient 1   ?       120      ?    ?
Patient 2   M       105      ?    68
Patient 3   F       203      45   63
Patient 4   M       145      ?    42
Patient 5   M       89       ?    80
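As a minimal sketch, the proportion of missing data per variable and per record can be computed directly from the example table above. The dictionary layout and the function names are assumptions made for illustration; missing entries (shown as “?” in the table) are encoded as `None`.

```python
# Example table, with "?" encoded as None.
records = {
    "Patient 1": {"Gender": None, "Glucose": 120, "AST": None, "Age": None},
    "Patient 2": {"Gender": "M", "Glucose": 105, "AST": None, "Age": 68},
    "Patient 3": {"Gender": "F", "Glucose": 203, "AST": 45, "Age": 63},
    "Patient 4": {"Gender": "M", "Glucose": 145, "AST": None, "Age": 42},
    "Patient 5": {"Gender": "M", "Glucose": 89, "AST": None, "Age": 80},
}

def missing_by_variable(records):
    """Fraction of records in which each variable is missing."""
    variables = list(next(iter(records.values())))
    n = len(records)
    return {v: sum(r[v] is None for r in records.values()) / n for v in variables}

def missing_by_record(records):
    """Fraction of variables that are missing in each record."""
    return {pid: sum(x is None for x in r.values()) / len(r)
            for pid, r in records.items()}

print(missing_by_variable(records))
print(missing_by_record(records))
```

In this example AST is missing for four of the five patients, and Patient 1 is missing three of the four variables, so both the column view and the row view are worth inspecting.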
13.2.3 Dealing with Missing Data
Overview of Methods for Handling Missing Data
The methods should be tailored to the dataset of interest, the reasons for missingness and the proportion of missing data. In general, a method is chosen for its simplicity and its ability to introduce as little bias as possible in the dataset.
When data are MCAR or MAR, a researcher can ignore the reasons for missing data, which simplifies the choice of the methods to apply. In principle, any of the methods described below can then be applied, keeping in mind that case deletion generally leads to valid inferences only under MCAR [10]. Nevertheless, it is difficult to obtain empirical evidence about whether or not the data are MCAR or MAR. A valid strategy is to examine the sensitivity of the results to the MCAR and MAR assumptions by comparing several analyses; the differences in results across analyses may indicate which assumptions are the most relevant.
A significant body of evidence has focused on comparing the performance of missing data handling methods, both in general [2, 3, 4] and in the context of specific factors such as proportion of missing data and sample size [5, 6, 7]. More detailed technical aspects, and applications of these methods in various fields, can also be found in the works of Jones and Little [8, 9].
The methods fall into three broad categories:
 1. Deletion methods (listwise deletion, i.e. complete-case analysis; pairwise deletion, i.e. available-case analysis)
 2. Single imputation methods (mean/mode substitution, linear interpolation, hot deck and cold deck)
 3. Model-based methods (regression, multiple imputation, k-nearest neighbors)
Deletion Methods
The simplest way to deal with missing data is to discard the cases or observations that have missing values. In general, case deletion methods lead to valid inferences only for MCAR [10]. There are three ways of doing this: complete-case analysis, available-case analysis, and weighting methods.
Complete-Case Analysis (Listwise Deletion)
Complete-case analysis discards every observation that has a missing value in any of the variables of interest. The principal assumption is that the remaining subsample is representative of the population, and will thus not bias the analysis towards a subgroup. This assumption is rather restrictive and assumes an MCAR mechanism. Listwise deletion often produces unbiased regression slope estimates, as long as missingness is not a function of the outcome variable. The biggest advantage of this method is its simplicity: it is reasonable to use it when the number of discarded observations is small relative to the total. Its main drawbacks are reduced statistical power (because it reduces the number of samples n, the estimates will have larger standard errors), waste of information, and possible bias of the analysis, especially if data is not MCAR.
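Available-Case Analysis
Available-case analysis (pairwise deletion) uses, for each individual analysis, all the cases in which the variables involved in that analysis are observed, so that different analyses may rely on different subsets of the data.
Weighting-Case Analysis
Weighting methods assign weights to the complete cases, derived from a model of the missingness, in order to reduce the bias introduced in the available-case analysis.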
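The difference between listwise and pairwise deletion can be illustrated with a short sketch. The function names are hypothetical, and the data are the glucose and age columns of the example table shown earlier, with missing entries encoded as `None`.

```python
rows = [
    {"glucose": 120, "age": None},
    {"glucose": 105, "age": 68},
    {"glucose": 203, "age": 63},
    {"glucose": 145, "age": 42},
    {"glucose": 89, "age": 80},
]

def complete_cases(rows):
    """Listwise deletion: keep only rows with no missing value at all."""
    return [r for r in rows if all(v is not None for v in r.values())]

def available_case_mean(rows, variable):
    """Pairwise deletion for a single-variable statistic:
    use every row in which this particular variable is observed."""
    values = [r[variable] for r in rows if r[variable] is not None]
    return sum(values) / len(values)

kept = complete_cases(rows)
print(len(kept), "of", len(rows), "rows survive listwise deletion")
print("glucose mean, complete cases:", available_case_mean(kept, "glucose"))
print("glucose mean, available cases:", available_case_mean(rows, "glucose"))
```

Listwise deletion drops the first patient entirely because the age is missing, even though the glucose value was recorded, whereas the available-case mean of glucose still uses all five measurements.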
Single-Value Imputation
In single imputation, missing values are filled by some type of “predicted” values [9, 11]. Single imputation ignores uncertainty and almost always underestimates the variance. Multiple imputation overcomes this problem by taking into account both within-imputation and between-imputation uncertainty.
Mean and Median
The simplest imputation method is to substitute missing values by the mean or the median of that variable. Using the median is more robust in the presence of outliers in the observed data. The main disadvantages are that (1) it reduces variability, thereby artificially lowering the estimated errors compared to deletion approaches, and (2) it disregards the relationship between variables, thereby decreasing their correlation. While this method diminishes the bias of using a non-representative sample, it introduces other biases.
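A minimal sketch of mean and median substitution, which also shows the loss of variability mentioned above. The function name and the example ages are assumptions for illustration.

```python
import statistics

def impute_constant(values, strategy="mean"):
    """Replace None by the mean (or median) of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

ages = [68, 63, 42, 80, None, None]
observed = [v for v in ages if v is not None]
completed = impute_constant(ages, "mean")

print("imputed with mean:  ", completed)
print("imputed with median:", impute_constant(ages, "median"))
# The imputed points sit exactly on the mean, so the sample variance shrinks.
print("variance before:", statistics.variance(observed))
print("variance after: ", statistics.variance(completed))
```

Every imputed point contributes zero squared deviation from the mean while increasing n, which is why the variance of the completed variable is always smaller than that of the observed values.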
Linear Interpolation
This method is particularly suitable for time series. In linear interpolation, a missing value is computed by interpolating the values of the previous and next available measurements for the patient. For example, if the natremia changes from 132 to 136 mEq/L in 8 h, one can reasonably assume that its value was close to 134 mEq/L at midpoint.
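The natremia example can be reproduced with a few lines of code. This is a sketch under the assumption that measurements are given as parallel lists of times (in hours) and values, with `None` marking a missing value; gaps at the start or end of the series have no neighbor on one side and are left untouched.

```python
def interpolate_linear(times, values):
    """Fill None entries that lie between two observed points.

    Leading and trailing gaps are left as None, because linear
    interpolation needs an observation on both sides.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        t0, t1 = times[left], times[right]
        v0, v1 = values[left], values[right]
        for i in range(left + 1, right):
            fraction = (times[i] - t0) / (t1 - t0)
            filled[i] = v0 + fraction * (v1 - v0)
    return filled

# Natremia (mEq/L) measured at 0 h and 8 h, missing at 4 h.
print(interpolate_linear([0, 4, 8], [132, None, 136]))
```

Because the times are used explicitly, unevenly spaced measurements are handled correctly: a value missing 2 h after the first measurement would be placed a quarter of the way between the two observations.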
Hot Deck and Cold Deck
In the hot deck method, a missing attribute value is replaced with a value from an estimated distribution of the current data. It is especially used in survey research [9]. Hot deck is typically implemented in two stages. First, the data is partitioned into clusters, and then each instance with missing data is associated with one cluster. The complete cases in a cluster are used to fill in the missing values. This can be done by calculating the mean or mode of the attribute within a cluster. Cold deck imputation is similar to hot deck, except that the data source is different from the current dataset. Hot-deck imputation replaces the missing data by realistic values that preserve the variable distribution. However, it underestimates the standard errors and the variability [12].
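The two-stage procedure can be sketched as follows. As a simplifying assumption, the clusters are not learned from the data but defined by an observed categorical variable (a hypothetical `unit` field), and missing values are filled with the cluster mean; a cluster without any observed donor is left unchanged.

```python
from collections import defaultdict

def hot_deck_mean(rows, cluster_key, target):
    """Fill missing `target` values with the mean of the observed
    values in the same cluster (rows sharing the same `cluster_key`)."""
    donors = defaultdict(list)
    for row in rows:
        if row[target] is not None:
            donors[row[cluster_key]].append(row[target])

    completed = []
    for row in rows:
        row = dict(row)  # copy, so the input is not modified
        pool = donors.get(row[cluster_key], [])
        if row[target] is None and pool:
            row[target] = sum(pool) / len(pool)
        completed.append(row)
    return completed

rows = [
    {"unit": "A", "lactate": 1.0},
    {"unit": "A", "lactate": 2.0},
    {"unit": "A", "lactate": None},
    {"unit": "B", "lactate": 4.0},
    {"unit": "B", "lactate": None},
]
for row in hot_deck_mean(rows, "unit", "lactate"):
    print(row)
```

A cold deck variant would build the `donors` pools from an external dataset instead of from `rows`.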
Last Observation Carried Forward
This method, sometimes called the “sample-and-hold” method [13], is specific to longitudinal designs. It imputes the missing value with the last available observation of the individual. It therefore assumes that the individual's value has not changed at all since the last measurement, which is often unrealistic [14].
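A minimal sketch of last observation carried forward for a single patient's time-ordered measurements. The function name is an assumption; values missing before the first observation have nothing to carry forward and stay missing.

```python
def locf(values):
    """Carry the last observed value forward over subsequent gaps."""
    filled = []
    last = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

heart_rate = [None, 82, None, None, 95, None]
print(locf(heart_rate))
```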
Model-Based Imputation
In model-based imputation, a predictive model is created to estimate values that will substitute the missing data. In this case, the dataset is divided into two subsets: one with no missing values for the variable under evaluation (used for training the model) and one containing the missing values that we want to estimate. Several modeling methods can be used, such as regression, logistic regression, neural networks and other parametric and nonparametric modeling techniques. There are two main drawbacks to this approach: the values estimated by the model are usually better behaved than the true values, and the models perform poorly if the observed and missing variables are independent.
Linear Regression
In this model, all the available variables are used to create a linear regression model using the available observations of the variable of interest as output. The advantage of this method is that it takes into account the relationship between variables, unlike mean/median imputation. The disadvantages are that it overestimates the model fit and the correlation between the variables, as it does not take into account the uncertainty in the missing data, and that it underestimates variances and covariances. A method that was created to introduce uncertainty is stochastic linear regression (see below).
The case of multivariate imputation is more complex, as missing values exist for several variables, which do not follow the same pattern of missingness across the observations. The method used is a multivariate extension of the linear model and relies on an iterative process carried out until convergence.
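The univariate case can be sketched with an ordinary least squares fit. For brevity the example assumes a single, fully observed predictor `x` rather than all available variables, and it does not cover the iterative multivariate procedure; the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def regression_impute(x, y):
    """Fit on rows where y is observed, then predict the missing y values."""
    observed = [(a, b) for a, b in zip(x, y) if b is not None]
    slope, intercept = fit_line([a for a, _ in observed],
                                [b for _, b in observed])
    return [intercept + slope * a if b is None else b for a, b in zip(x, y)]

x = [1, 2, 3, 4]
y = [2.0, 4.0, None, 8.0]
print(regression_impute(x, y))
```

All imputed points fall exactly on the fitted line, which is why this approach overstates the correlation between the variables.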
Stochastic Regression
Stochastic regression imputation aims to reduce the bias by an extra step of augmenting each predicted score with a residual term. This residual term is normally distributed with a mean of zero and a variance equal to the residual variance from the regression of the predictor on the target. This method preserves the variability in the data and yields unbiased parameter estimates with MAR data. However, the standard error tends to be underestimated, because the uncertainty about the imputed values is not included, which increases the risk of type I error [15].
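A sketch of the extra noise step, again assuming a single fully observed predictor. The function name is hypothetical, and the residual variance is estimated with n − 2 degrees of freedom, which is one common choice rather than something specified in the text.

```python
import math
import random

def stochastic_regression_impute(x, y, rng):
    """Regression imputation with a random residual added to each prediction."""
    observed = [(a, b) for a, b in zip(x, y) if b is not None]
    n = len(observed)
    mean_x = sum(a for a, _ in observed) / n
    mean_y = sum(b for _, b in observed) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in observed)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual standard deviation of the fit (n - 2 degrees of freedom).
    residuals = [b - (intercept + slope * a) for a, b in observed]
    sigma = math.sqrt(sum(e * e for e in residuals) / (n - 2))

    return [intercept + slope * a + rng.gauss(0.0, sigma) if b is None else b
            for a, b in zip(x, y)]

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, None, None]
print(stochastic_regression_impute(x, y, random.Random(0)))
```

Running the function with different random seeds gives different imputed values, which is the property that multiple imputation exploits.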
Multiple-Value Imputation
Multiple imputation proceeds in three steps:

Imputation: the missing values are filled in using any method of choice, leading to M ≥ 2 completed datasets (5–10 is generally sufficient) [10]. In these M multiply-imputed datasets, all the observed values are the same, but the imputed values are different, reflecting the uncertainty about imputation [10].

Analysis: each of the M completed datasets is analysed (e.g. a logistic regression classifier for mortality prediction is built), which gives M analyses.

Pooling: the M analyses are integrated into a final result, for example by computing the mean (and 95 % CI) of the M analyses.
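The three steps can be sketched as follows. Several choices here are assumptions made for illustration and are not prescribed by the text: the imputation step draws each missing value at random from the observed values (a deliberately simple stand-in for a proper imputation model), the analysis step estimates a mean and its sampling variance, and the pooling step uses Rubin's rules, in which the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance.

```python
import random
import statistics

def impute_once(values, rng):
    """Step 1: fill each gap with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

def analyse(completed):
    """Step 2: estimate the mean and the variance of that estimate."""
    return statistics.mean(completed), statistics.variance(completed) / len(completed)

def pool(estimates, variances):
    """Step 3: combine M analyses with Rubin's rules."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)       # average sampling variance
    between = statistics.variance(estimates)  # spread across imputations
    return pooled, within + (1 + 1 / m) * between

def multiple_imputation(values, m, rng):
    results = [analyse(impute_once(values, rng)) for _ in range(m)]
    return pool([q for q, _ in results], [u for _, u in results])

ages = [68.0, 63.0, 42.0, 80.0, None, 55.0, None, 71.0]
estimate, total_variance = multiple_imputation(ages, m=5, rng=random.Random(0))
half_width = 1.96 * total_variance ** 0.5  # normal approximation for a 95 % CI
print(estimate, (estimate - half_width, estimate + half_width))
```

The between-imputation term is what single imputation ignores; it widens the confidence interval to reflect the uncertainty about the imputed values.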
K-Nearest Neighbors
K-nearest neighbors (kNN) can be used for handling missing values. Here, they will be filled with the mean of the k values coming from the k most similar complete observations. The similarity of two observations is determined, after normalization of the dataset, using a distance function which can be Euclidean, Manhattan, Mahalanobis, Pearson, etc. The main advantage of the kNN algorithm is that, given enough data, it can predict with reasonable accuracy the conditional probability distribution around a point and thus make well-informed estimations. It can predict qualitative and quantitative (discrete and continuous) attributes. Another advantage of this method is that the correlation structure of the data is taken into consideration. The choice of k is critical: a high value of k would include observations that are significantly different from the target observation, while a low value of k risks missing relevant ones.
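A minimal sketch of kNN imputation for one numeric target variable. It assumes that all other variables are numeric and fully observed, uses min-max normalization and the Euclidean distance, and fills each gap with the unweighted mean of the k nearest complete observations; the function and variable names are hypothetical.

```python
import math

def knn_impute(rows, target, k):
    """Fill missing `target` values with the mean over the k nearest donors."""
    features = [name for name in rows[0] if name != target]

    # Min-max normalization so that no feature dominates the distance.
    low = {f: min(r[f] for r in rows) for f in features}
    high = {f: max(r[f] for r in rows) for f in features}

    def scaled(row):
        return [(row[f] - low[f]) / ((high[f] - low[f]) or 1.0) for f in features]

    donors = [r for r in rows if r[target] is not None]
    completed = []
    for row in rows:
        row = dict(row)  # copy, so the input is not modified
        if row[target] is None:
            point = scaled(row)
            nearest = sorted(donors, key=lambda d: math.dist(scaled(d), point))[:k]
            row[target] = sum(d[target] for d in nearest) / len(nearest)
        completed.append(row)
    return completed

rows = [
    {"hr": 60, "sbp": 120, "age": 30},
    {"hr": 62, "sbp": 118, "age": 34},
    {"hr": 100, "sbp": 90, "age": 80},
    {"hr": 98, "sbp": 92, "age": None},
]
print(knn_impute(rows, "age", k=1)[-1])
print(knn_impute(rows, "age", k=3)[-1])
```

With k equal to the number of donors, every donor contributes equally and the result is simply the mean of the observed ages, which illustrates how a large k drifts towards mean imputation.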
13.2.4 Choice of the Best Imputation Method
Different imputation methods are expected to perform differently on various datasets. We describe here a generic and simple method that can be used to evaluate the performance of various imputation methods on your own dataset, in order to help select the most appropriate method. Of note, this simple approach does not test the effect of deletion methods. A more complex approach is described in the case study below, in which the performance of a predictive model is tested on the dataset completed by various imputation methods.
 1. Use a sample of your own dataset that does not contain any missing data (will serve as ground truth).
 2. Introduce increasing proportions of missing data at random (e.g. 5–50 % in 5 % increments).
 3. Reconstruct the missing data using the various methods.
 4. Compute the sum of squared errors (SSE) between the reconstructed and the original data, for each method and each proportion of missing data.
 5. Repeat steps 1–4 a number of times (10 times for example) and compute the average performance of each method (average SSE).
 6. Plot the average SSE versus proportion of missing data (1 plot per imputation method), similarly to the example shown in Fig. 13.4.
 7. Choose the method that performs best at the level of missing data in your dataset. E.g. if your data had 10 % of missing data, you would want to pick kNN; at 40 % linear regression performs better (made-up data, for illustrative purpose only).
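Steps 1 to 5 of this procedure can be sketched for a single variable as follows. Only mean and median imputation are compared here to keep the example short, the ground-truth values are synthetic, and all function names are hypothetical; the plotting step is omitted.

```python
import random
import statistics

def mask_at_random(values, proportion, rng):
    """Step 2: hide a given proportion of the values, chosen at random."""
    n_missing = round(proportion * len(values))
    hidden = set(rng.sample(range(len(values)), n_missing))
    return [None if i in hidden else v for i, v in enumerate(values)]

def impute_mean(values):
    fill = statistics.mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def impute_median(values):
    fill = statistics.median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def sse(truth, completed):
    """Step 4: sum of squared errors against the ground truth."""
    return sum((t - c) ** 2 for t, c in zip(truth, completed))

def evaluate(truth, methods, proportions, repeats, rng):
    """Steps 2-5: average SSE per method and per proportion of missing data."""
    scores = {name: {} for name in methods}
    for proportion in proportions:
        totals = {name: 0.0 for name in methods}
        for _ in range(repeats):
            masked = mask_at_random(truth, proportion, rng)  # same mask for all methods
            for name, method in methods.items():
                totals[name] += sse(truth, method(masked))   # step 3 + step 4
        for name in methods:
            scores[name][proportion] = totals[name] / repeats
    return scores

rng = random.Random(0)
truth = [rng.gauss(65, 15) for _ in range(200)]  # step 1: complete ground truth
methods = {"mean": impute_mean, "median": impute_median}
proportions = [p / 100 for p in range(5, 55, 5)]
for name, by_proportion in evaluate(truth, methods, proportions, 10, rng).items():
    print(name, {p: round(s) for p, s in by_proportion.items()})
```

Applying every method to the same masked copy within each repetition keeps the comparison fair, since differences in SSE then come from the methods and not from which values happened to be hidden.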
13.3 Part 2—Case Study
Two analyses were performed:
 1. Various proportions of missing data at random were inserted into the variable “age”, then imputed using the various methods described above. The distribution of the imputed observations was compared to the original distribution for all the methods.
 2. The performance of imputed datasets with different degrees of missingness was tested on a predictive model (logistic regression to predict mortality), first for univariate missing data (the variable age), then for all the variables (multivariate).
The code used to generate the analyses and the figures is provided in the accompanying R functions document.
13.3.1 Proportion of Missing Data and Possible Reasons for Missingness
Missing data in some of the variables of the IAC and non-IAC datasets

                          IAC                Non-IAC
                          # points  %        # points  %
Arterial line time day    0         0        792       100
Hospital length of stay   0         0        0         0
Age                       0         0        0         0
Gender                    0         0        0         0
Weight first              39        3.96     71        8.96
SOFA first                2         0.20     4         0.51
Hemoglobin first          2         0.20     5         0.63
Bilirubin first           418       42.48    365       46.09
…
Since IAC are mainly used for continuous hemodynamic monitoring and for arterial blood sampling for blood gas analysis, we can expect a higher percentage of missing data in blood gas-related variables in the non-IAC group. We can also expect that patient diagnoses are often able to provide an explanation for the lack of specific laboratory results: if a certain test is not ordered because it will most likely provide no clinical insight, a missing value will occur; it is fair to estimate that such a value lies within a normal range. In both cases, the fact that data is missing contains information about the response, thus it is MNAR. Body mass index (BMI) has a relatively high percentage of missing data. Assuming that this variable is calculated automatically from the weight and height of patients, we can conclude that this data is MAR: because the height and/or weight are missing, BMI cannot be calculated. If the weight is missing because someone forgot to enter it into the system, then it is MCAR. Besides the missing data mechanism, it is also important to consider the sample distribution in each variable, as some imputation methods assume specific data distributions, usually the normal distribution.
13.3.2 Univariate Missingness Analysis
In this section, the specific influence of each imputation method will be explored for the variable age, using all the other variables. Two different levels of missingness (20 and 40 %) were artificially introduced in the datasets. The original dataset represents the ground truth, to which the imputed datasets were compared using frequency histograms.
Complete-Case Analysis
This method is only usable when there is a small percentage of missing data. It does not require any assumption about the distribution of the missing data, other than that the complete cases should be representative of the original population, which is difficult to prove.
Single Value Imputation
Mean and Median Imputation
Linear Regression Imputation
Stochastic Linear Regression Imputation
K-Nearest Neighbors
We limit the demonstration to the case where k = 1. In the extreme case where all neighbors are used without weights, this method converges to the mean imputation.
Multiple Imputation

Multivariate normal regression

Linear regression
13.3.3 Evaluating the Performance of Imputation Methods on Mortality Prediction
This test aims to assess the generalization capabilities of the models constructed using imputed data, and check their performance by comparing them to the original data. All the methods described previously were used to reconstruct a sample of both IAC and non-IAC datasets, with increasing proportions of missing data at random, first only on the variable age (univariate), then on all the variables in the dataset (multivariate). A logistic regression model was built on the reconstructed data and tested on a sample of the original data (that does not contain imputations or missing data).
The performance of the models is evaluated in terms of area under the receiver operating characteristic curve (AUC), accuracy (correct classification rate), sensitivity (true positive rate, TPR, also known as recall), specificity (true negative rate, TNR) and Cohen’s kappa. All the methods were compared against a reference logistic regression that was fitted with the original data without missingness. The results were averaged over a 10-fold cross-validation and the AUC results are presented graphically.
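For reference, the evaluation metrics listed above can be computed from binary labels as in the sketch below. The function names and the example labels and scores are made up for illustration; the code assumes that both classes are present and that predictions do not all agree with chance, so that no denominator is zero. The AUC is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, counting ties as one half.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and Cohen's kappa for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    n = len(pairs)

    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "kappa": (accuracy - expected) / (1 - expected),
    }

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up labels (1 = died), predicted classes and predicted risks.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.35, 0.6]
print(classification_metrics(y_true, y_pred))
print("AUC:", auc(y_true, scores))
```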
Overall, and quite surprisingly, the methods had a reasonable performance even for 80 % of missingness in every variable. The reason behind this is that almost half of the variables are binary, and because of their relation with the output, reconstructing them from frequent values in each class is usually the best guess. The decrease in AUC was due to a decrease in the sensitivity, as the specificity values remained more or less unchanged with the increase in missingness. The method that performed the best overall in terms of AUC was multiple imputation with linear regression. In IAC it achieved a minimum AUC of 0.81 at 70 % of missingness, compared with a reference AUC of 0.84, and in non-IAC it achieved an AUC of 0.85 at 70 % of missingness, close to the reference AUC of 0.89.
13.4 Conclusion
Missing data is a widespread problem in EHR due to the nature of medical information itself, the massive amounts of data collected, the heterogeneity of data standards and recording devices, data transfers and conversions, and finally human errors and omissions. When dealing with the problem of missing data, just like in many other domains of data mining, there is no one-size-fits-all approach, and the data scientist should ultimately rely on robust evaluation tools when choosing an imputation method to handle missing values in a particular dataset.
Take-Home Messages

Always evaluate the reasons for missingness: is it MCAR/MAR/MNAR?

What is the proportion of missing data per variable and per record?

Multiple imputation approaches generally perform better than other methods.

Evaluation tools must be used to tailor the imputation methods to a particular dataset.
References
 1. Cismondi F, Fialho AS, Vieira SM, Reti SR, Sousa JMC, Finkelstein SN (2013) Missing data in medical databases: impute, delete or classify? Artif Intell Med 58(1):63–72
 2. Peng CY, Harwell MR, Liou SM, Ehman LH (2006) Advances in missing data methods and implications for educational research
 3. Peugh JL, Enders CK (2004) Missing data in educational research: a review of reporting practices and suggestions for improvement. Rev Educ Res 74(4):525–556
 4. Young W, Weckman G, Holland W (2011) A survey of methodologies for the treatment of missing values within datasets: limitations and benefits. Theor Issues Ergon Sci 12(1):15–43
 5. Alosh M (2009) The impact of missing data in a generalized integer-valued autoregression model for count data. J Biopharm Stat 19(6):1039–1054
 6. Knol MJ, Janssen KJM, Donders ART, Egberts ACG, Heerdink ER, Grobbee DE, Moons KGM, Geerlings MI (2010) Unpredictable bias when using the missing indicator method or complete case analysis for missing confounder values: an empirical example. J Clin Epidemiol 63(7):728–736
 7. Little RJA, Rubin DB (2002) Missing data in experiments. In: Statistical analysis with missing data. Wiley, pp 24–40
 8. Jones MP (1996) Indicator and stratification methods for missing explanatory variables in multiple linear regression. J Am Stat Assoc 91(433):222–230
 9. Little RJA (2016) Statistical analysis with missing data. Wiley, New York
 10. Schafer JL (1999) Multiple imputation: a primer. Stat Methods Med Res 8(1):3–15
 11. de Waal T, Pannekoek J, Scholtus S (2011) Handbook of statistical data editing and imputation. Wiley, New York
 12. Roth PL (1994) Missing data: a conceptual review for applied psychologists. Pers Psychol 47(3):537–560
 13. Hug CW (2009) Detecting hazardous intensive care patient episodes using real-time mortality models. Thesis, Massachusetts Institute of Technology
 14. Wood AM, White IR, Thompson SG (2004) Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals. Clin Trials 1(4):368–376
 15. Enders CK (2010) Applied missing data analysis, 1st edn. The Guilford Press, New York
 16. Rubin DB (1988) An overview of multiple imputation. In: Proceedings of the survey research section, American Statistical Association, pp 79–84
 17. Saeed M, Villarroel M, Reisner AT, Clifford G, Lehman LW, Moody G, Heldt T, Kyaw TH, Moody B, Mark RG (2011) Multiparameter intelligent monitoring in intensive care II (MIMIC-II): a public-access intensive care unit database. Crit Care Med 39(5):952–960
 18. Scott DJ, Lee J, Silva I, Park S, Moody GB, Celi LA, Mark RG (2013) Accessing the public MIMIC-II intensive care relational database for clinical research. BMC Med Inform Decis Mak 13(1):9
 19. Schafer JL, Olsen MK (1998) Multiple imputation for multivariate missing-data problems: a data analyst’s perspective. Multivar Behav Res 33(4):545–571
Copyright information
Open Access This chapter is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, duplication, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, a link is provided to the Creative Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work’s Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.