Advertisement

Metabolomics

, 14:153 | Cite as

NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data

  • Justin Y. Lee
  • Mark P. StyczynskiEmail author
Original Article

Abstract

Introduction

A common problem in metabolomics data analysis is the existence of a substantial number of missing values, which can complicate, bias, or even prevent certain downstream analyses. One of the most widely-used solutions to this problem is imputation of missing values using a k-nearest neighbors (kNN) algorithm to estimate missing metabolite abundances. kNN implicitly assumes that missing values are uniformly distributed at random in the dataset, but this is typically not true in metabolomics, where many values are missing because they are below the limit of detection of the analytical instrumentation.

Objectives

Here, we explore the impact of nonuniformly distributed missing values (missing not at random, or MNAR) on imputation performance. We present a new model for generating synthetic missing data and a new algorithm, No-Skip kNN (NS-kNN), that accounts for MNAR values to provide more accurate imputations.

Methods

We compare the imputation errors of the original kNN algorithm using two distance metrics, NS-kNN, and a recently developed algorithm KNN-TN, when applied to multiple experimental datasets with different types and levels of missing data.

Results

Our results show that NS-kNN typically outperforms kNN when at least 20–30% of missing values in a dataset are MNAR. NS-kNN also has lower imputation errors than KNN-TN on realistic datasets when at least 50% of missing values are MNAR.

Conclusion

Accounting for the nonuniform distribution of missing values in metabolomics data can significantly improve the results of imputation algorithms. The NS-kNN method imputes missing metabolomics data more accurately than existing kNN-based approaches when used on realistic datasets.

Keywords

Metabolomics kNN Imputation Missing data GC–MS 

Notes

Acknowledgements

The authors acknowledge the National Science Foundation (MCB-1254382) and the National Institutes of Health (R35-GM119701) for financial support.

Author contributions

JYL participated in the design of the study, carried out the computational experiments, and helped draft the manuscript. MPS conceived of the study, participated in the design of the study, and helped draft the manuscript. All authors read and approved the final manuscript.

Compliance with ethical standards

Ethical approval

The article does not contain any studies with human and/or animal participants.

Conflict of interest

The authors declare no conflicts of interest.

Software availability

The MATLAB code developed in this study is accessible via https://github.com/gtStyLab/NSkNN.

Supplementary material

11306_2018_1451_MOESM1_ESM.docx (7.2 mb)
Supplementary material 1 (DOCX 7412 KB)

References

  1. Armitage, E. G., Godzien, J., Alonso-Herranz, V., Lopez-Gonzalvez, A., & Barbas, C. (2015). Missing value imputation strategies for metabolomics data. Electrophoresis, 36, 3050–3060.CrossRefGoogle Scholar
  2. Barnard, J., & Meng, X. L. (1999). Applications of multiple imputation in medical studies: from AIDS to NHANES. Statistical Methods in Medical Research, 8, 17–36.CrossRefGoogle Scholar
  3. Boeckel, J. N., Palapies, L., Zeller, T., Reis, S. M., von Jeinsen, B., Tzikas, S., Bickel, C., Baldus, S., Blankenberg, S., Munzel, T., Zeiher, A. M., Lackner, K. J., & Keller, T. (2015). Estimation of values below the limit of detection of a contemporary sensitive troponin I assay improves diagnosis of acute myocardial infarction. Clinical Chemistry, 61, 1197–1206.CrossRefGoogle Scholar
  4. Chen, H., Quandt, S. A., Grzywacz, J. G., & Arcury, T. A. (2011). A distribution-based multiple imputation method for handling bivariate pesticide data with values below the limit of detection. Environ Health Perspect, 119, 351–356.CrossRefGoogle Scholar
  5. Di Guida, R., Engel, J., Allwood, J. W., Weber, R. J., Jones, M. R., Sommer, U., Viant, M. R., & Dunn, W. B. (2016). Non-targeted UHPLC-MS metabolomic data processing methods: a comparative investigation of normalisation, missing value imputation, transformation and scaling. Metabolomics, 12, 93.CrossRefGoogle Scholar
  6. Dromms, R. A., & Styczynski, M. P. (2012). Systematic applications of metabolomics in metabolic engineering. Metabolites, 2, 1090–1122.CrossRefGoogle Scholar
  7. Fiehn, O., Garvey, W. T., Newman, J. W., Lok, K. H., Hoppel, C. L., & Adams, S. H. (2010). Plasma metabolomic profiles reflective of glucose homeostasis in non-diabetic and type 2 diabetic obese African-American women. PLoS ONE, 5, e15234.CrossRefGoogle Scholar
  8. Gromski, P. S., Xu, Y., Kotze, H. L., Correa, E., Ellis, D. I., Armitage, E. G., Turner, M. L., & Goodacre, R. (2014). Influence of missing values substitutes on multivariate analysis of metabolomics data. Metabolites, 4, 433–452.CrossRefGoogle Scholar
  9. Hrydziuszko, O., & Viant, M. R. (2011). Missing values in mass spectrometry based metabolomics: an undervalued step in the data processing pipeline. Metabolomics, 8, 161–174.CrossRefGoogle Scholar
  10. Hu, L. Y., Huang, M. W., Ke, S. W., & Tsai, C. F. (2016). The distance function effect on k-nearest neighbor classification for medical datasets. Springerplus, 5, 1304.CrossRefGoogle Scholar
  11. Kim, H., Golub, G. H., & Park, H. (2005). Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics, 21, 187–198.CrossRefGoogle Scholar
  12. Lazar, C., Gatto, L., Ferro, M., Bruley, C., & Burger, T. (2016). Accounting for the multiple natures of missing values in label-free quantitative proteomics data sets to compare imputation strategies. Journal of Proteome Research, 15, 1116–1125.CrossRefGoogle Scholar
  13. Lee, M., Rahbar, M. H., Brown, M., Gensler, L., Weisman, M., Diekman, L., & Reveille, J. D. (2018). A multiple imputation method based on weighted quantile regression models for longitudinal censored biomarker data with missing values at early visits. BMC Medical Research Methodology, 18, 8.CrossRefGoogle Scholar
  14. Liu, Y., & Brown, S. D. (2014). Imputation of left-censored data for cluster analysis. Journal of Chemometrics, 28, 148–160.CrossRefGoogle Scholar
  15. Niehaus, T. D., Gerdes, S., Hodge-Hanson, K., Zhukov, A., Cooper, A. J., ElBadawi-Sidhu, M., Fiehn, O., Downs, D. M., & Hanson, A. D. (2015). Genomic and experimental evidence for multiple metabolic functions in the RidA/YjgF/YER057c/UK114 (Rid) protein family. BMC Genomics, 16, 382.CrossRefGoogle Scholar
  16. Shah, J. S., Rai, S. N., DeFilippis, A. P., Hill, B. G., Bhatnagar, A., & Brock, G. N. (2017). Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies. BMC Bioinformatics, 18, 114.CrossRefGoogle Scholar
  17. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., & Altman, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525.CrossRefGoogle Scholar
  18. Wei, R., Wang, J., Jia, E., Chen, T., Ni, Y., & Jia, W. (2018a). GSimp: A Gibbs sampler based left-censored missing value imputation approach for metabolomics studies. PLoS Computational Biology, 14, e1005973.CrossRefGoogle Scholar
  19. Wei, R., Wang, J., Su, M., Jia, E., Chen, S., Chen, T., & Ni, Y. (2018b). Missing Value imputation approach for mass spectrometry-based metabolomics data. Scientific Reports, 8, 663.CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. 1.School of Chemical & Biomolecular EngineeringGeorgia Institute of TechnologyAtlantaUSA

Personalised recommendations