Abstract
Disguised missing data, an emerging data quality problem named by Pearson in 2006, is a special kind of missing data: the values are not literally absent from the data entries, but they do not reflect the true facts and so may severely bias analysis results. In this paper, we present a novel problem in detecting disguised missing data, namely finding the data group most prone to a specific disguise value. We show that this problem can be formalized as an optimization problem and propose a genetic-algorithm-based method to solve it. In preliminary experiments on real datasets, our method discovers the same optimal data groups as an exhaustive method. A further evaluation on the FDA adverse drug event reporting dataset shows that our method yields results similar to those reached through manual examination by experienced analysts.
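To make the idea of searching for the data group most prone to a disguise value concrete, the following is a minimal sketch and not the method of the paper. The abstract does not state the objective function, so the score used here (observed minus expected count of the disguise value in a group, scaled by the dataset size to stay in integers) is an assumption, as are the toy attributes, the `DISGUISE` value, and all function names and parameters.

```python
import itertools
import random

# Hypothetical categorical attributes that define candidate data groups.
DOMAINS = [("clinic", "web", "phone"), ("F", "M"), ("north", "south")]
DISGUISE = "1900-01-01"  # assumed disguise value in a hypothetical target field


def make_rows():
    """Build toy records; web submissions carry the disguise value more often."""
    rows = []
    for combo in itertools.product(*DOMAINS):
        for i in range(4):
            disguised = combo[0] == "web" and i < 3
            rows.append((combo, DISGUISE if disguised else "real"))
    return rows


def matches(group, attrs):
    """A group is a tuple of conditions; None acts as a wildcard."""
    return all(g is None or g == a for g, a in zip(group, attrs))


def fitness(group, rows):
    """Assumed score: n * (observed - expected) disguise count in the group."""
    n = len(rows)
    k = sum(t == DISGUISE for _, t in rows)
    members = [t for attrs, t in rows if matches(group, attrs)]
    k_g = sum(t == DISGUISE for t in members)
    return k_g * n - len(members) * k


def random_group(rng):
    return tuple(rng.choice((None,) + d) for d in DOMAINS)


def mutate(group, rng, rate=0.2):
    """Resample each condition with probability `rate`."""
    return tuple(
        rng.choice((None,) + d) if rng.random() < rate else g
        for g, d in zip(group, DOMAINS)
    )


def crossover(a, b, rng):
    """Uniform crossover: take each condition from either parent."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def genetic_search(rows, pop_size=12, generations=30, seed=0):
    rng = random.Random(seed)
    wildcard = (None,) * len(DOMAINS)
    population = [wildcard] + [random_group(rng) for _ in range(pop_size - 1)]

    def score(group):
        return fitness(group, rows)

    for _ in range(generations):
        children = [max(population, key=score)]  # elitism keeps the best group
        while len(children) < pop_size:
            # Binary tournament selection for each parent.
            a = max(rng.sample(population, 2), key=score)
            b = max(rng.sample(population, 2), key=score)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    return max(population, key=score)


def exhaustive_search(rows):
    """Score every possible group; feasible only for tiny search spaces."""
    candidates = itertools.product(*[(None,) + d for d in DOMAINS])
    return max(candidates, key=lambda g: fitness(g, rows))


if __name__ == "__main__":
    rows = make_rows()
    print("exhaustive:", exhaustive_search(rows))
    print("genetic:   ", genetic_search(rows))
```

On this toy data the exhaustive search returns `("web", None, None)`. The genetic search is stochastic, so comparing the fitness of its result against the exhaustive optimum mirrors, at a very small scale, the kind of check the abstract describes.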
References
Belen, R.: Detecting disguised missing data. Master's thesis, Middle East Technical University (2009)
Belen, R., Temizel, T.T.: A framework to detect disguised missing data. In: Senthil Kumar, A.V. (ed.) Knowledge Discovery Practices and Emerging Applications of Data Mining: Trends and New Domains, pp. 1–22. IGI Global, Hershey (2010)
FDA Adverse Event Reporting System. http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm083765.htm
Hua, M., Pei, J.: Cleaning disguised missing data: a heuristic approach. In: Proceedings of 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 950–958 (2007)
Hua, M., Pei, J.: DiMaC: a system for cleaning disguised missing data. In: Proceedings of 2008 ACM SIGMOD International Conference on Management of Data, pp. 1263–1266 (2008)
Little, R., Rubin, D.: Statistical Analysis with Missing Data. Wiley Publishers, New York (1987)
Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, Cambridge (1996)
Natarajan, K., Li, J., Koronios, A.: Detecting mis-entered values in large data sets. In: Proceedings of the 4th World Congress on Engineering Asset Management, pp. 805–812 (2009)
Pearson, R.K.: The problem of disguised missing data. ACM SIGKDD Explor. Newslett. 8(1), 83–92 (2006)
UCI Machine Learning Repository: Pima Indians Diabetes Data Set. http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Lin, WY., Feng, WY. (2014). Detecting the Data Group Most Prone to a Specific Disguise Value. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science, vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_10
DOI: https://doi.org/10.1007/978-3-319-13186-3_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13185-6
Online ISBN: 978-3-319-13186-3
eBook Packages: Computer Science, Computer Science (R0)