NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data
A common problem in metabolomics data analysis is the presence of a substantial number of missing values, which can complicate, bias, or even prevent certain downstream analyses. One of the most widely used solutions is to impute the missing metabolite abundances with a k-nearest neighbors (kNN) algorithm. kNN implicitly assumes that missing values are distributed uniformly at random across the dataset, but this is typically not true in metabolomics, where many values are missing because they fall below the limit of detection of the analytical instrumentation.
Here, we explore the impact of nonuniformly distributed missing values (missing not at random, or MNAR) on imputation performance. We present a new model for generating synthetic missing data and a new algorithm, No-Skip kNN (NS-kNN), that accounts for MNAR values to provide more accurate imputations.
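The abstract does not spell out how NS-kNN works, so the following Python sketch only illustrates one plausible reading of "no-skip". It assumes that ordinary kNN skips any neighbor lacking the target metabolite, whereas the no-skip variant keeps that neighbor and stands in the minimum observed value of the metabolite, on the premise that the value was probably below the detection limit. The function name `knn_impute`, the samples-by-metabolites layout, the RMS distance over shared metabolites, and the unweighted neighbor mean are all assumptions made for illustration; this is not the authors' MATLAB implementation.

```python
import numpy as np

def knn_impute(X, k=3, no_skip=False):
    """Impute NaNs in X (rows: samples, columns: metabolites) from the k nearest samples.

    Hypothetical sketch: with no_skip=True, a neighbor that lacks the target
    metabolite is kept and assigned that metabolite's minimum observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Assumed stand-in for values below the limit of detection.
    col_min = np.nanmin(X, axis=0)
    filled = np.where(missing, col_min, X) if no_skip else X
    out = X.copy()
    for i, j in zip(*np.nonzero(missing)):
        candidates = []
        for n in range(X.shape[0]):
            if n == i or np.isnan(filled[n, j]):
                continue  # neighbor has no usable value for metabolite j
            shared = ~missing[i] & ~missing[n]
            if not shared.any():
                continue  # no metabolites in common, distance undefined
            dist = np.sqrt(np.mean((X[i, shared] - X[n, shared]) ** 2))
            candidates.append((dist, filled[n, j]))
        if candidates:
            candidates.sort(key=lambda c: c[0])
            out[i, j] = np.mean([value for _, value in candidates[:k]])
    return out

# Toy data: samples 0 and 2 are close to each other and both lack metabolite 2.
X_demo = np.array([
    [1.0, 2.0, np.nan],
    [1.0, 2.0, 6.0],
    [1.1, 2.1, np.nan],
    [1.5, 2.5, 8.0],
    [9.0, 9.0, 2.0],
])
skip_result = knn_impute(X_demo, k=2)
no_skip_result = knn_impute(X_demo, k=2, no_skip=True)
```

In this toy case the no-skip variant yields a lower estimate for the missing entries, because the closest neighbor is itself missing the metabolite and contributes the low stand-in value instead of being passed over.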
We compare the imputation errors of the original kNN algorithm (using two distance metrics), NS-kNN, and a recently developed algorithm, KNN-TN, when applied to multiple experimental datasets with different types and levels of missing data.
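A comparison of this kind can be sketched in Python as follows: remove known values from a complete dataset, impute them, and score the result against the held-out truth. The sketch assumes a simple mixture in which the MNAR share of removed values is taken from the lowest abundances overall and the rest are removed uniformly at random. The names `add_missing` and `nrmse`, the global (rather than per-metabolite) threshold, and the normalization by the standard deviation of the true data are illustrative assumptions, not the missing-data model or error measure used in the study.

```python
import numpy as np

def add_missing(X, frac_missing, frac_mnar, rng):
    """Return (X with NaNs, boolean mask) for a mix of MNAR and random missingness.

    Hypothetical model: the MNAR share removes the lowest values overall,
    mimicking a detection limit; the remainder is removed uniformly at random.
    """
    X = np.asarray(X, dtype=float)
    n_total = int(round(frac_missing * X.size))
    n_mnar = int(round(frac_mnar * n_total))
    order = np.argsort(X.ravel())
    mnar_idx = order[:n_mnar]            # lowest abundances
    remaining = order[n_mnar:]
    random_idx = rng.choice(remaining, size=n_total - n_mnar, replace=False)
    mask = np.zeros(X.size, dtype=bool)
    mask[mnar_idx] = True
    mask[random_idx] = True
    mask = mask.reshape(X.shape)
    return np.where(mask, np.nan, X), mask

def nrmse(X_true, X_imputed, mask):
    """RMSE over the removed entries, normalized by the std of the true data."""
    err = X_true[mask] - X_imputed[mask]
    return np.sqrt(np.mean(err ** 2)) / np.std(X_true)

rng = np.random.default_rng(0)
X_full = np.arange(1.0, 21.0).reshape(4, 5)
X_holes, mask = add_missing(X_full, frac_missing=0.3, frac_mnar=0.5, rng=rng)
```

Sweeping `frac_mnar` from 0 to 1 while holding `frac_missing` fixed gives one way to examine how an imputation method's error changes as the share of MNAR values grows.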
Our results show that NS-kNN typically outperforms kNN when at least 20–30% of missing values in a dataset are MNAR. NS-kNN also has lower imputation errors than KNN-TN on realistic datasets when at least 50% of missing values are MNAR.
Accounting for the nonuniform distribution of missing values in metabolomics data can significantly improve the results of imputation algorithms. The NS-kNN method imputes missing metabolomics data more accurately than existing kNN-based approaches when used on realistic datasets.
Keywords: Metabolomics, kNN, Imputation, Missing data, GC–MS
The authors acknowledge the National Science Foundation (MCB-1254382) and the National Institutes of Health (R35-GM119701) for financial support.
JYL participated in the design of the study, carried out the computational experiments, and helped draft the manuscript. MPS conceived of the study, participated in the design of the study, and helped draft the manuscript. All authors read and approved the final manuscript.
Compliance with ethical standards
The article does not contain any studies with human and/or animal participants.
Conflict of interest
The authors declare no conflicts of interest.
The MATLAB code developed in this study is accessible via https://github.com/gtStyLab/NSkNN.