Detecting the Data Group Most Prone to a Specific Disguise Value

Lin, Wen-Yang; Feng, Wen-Yu

doi:10.1007/978-3-319-13186-3_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8643)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Disguised missing data, an emerging data quality problem whose name was coined by Pearson in 2006, is a special kind of missing data: values that are not literally missing from the data entries but do not reflect the true facts, and so may severely bias analysis results. In this paper, we present a novel problem in detecting disguised missing data, namely finding the data group most prone to a specific disguise value. We show that this problem can be formalized as an optimization problem and propose a genetic-algorithm-based method to solve it. In preliminary experiments on real datasets, our method discovers the same optimal data groups as those obtained by an exhaustive method. A further evaluation on the FDA adverse drug event reporting dataset shows that our method yields results similar to those reached through manual examination by experienced analysts.
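
The abstract does not state the objective function or the chromosome encoding, so the following Python sketch only illustrates the general idea under assumed definitions. Here a data group is a conjunction of attribute-value conditions, its fitness is the fraction of its records carrying the disguise value (zero below a minimum support), and the search uses a plain genetic algorithm with tournament selection, uniform crossover, mutation, and elitism. All names (`fitness`, `genetic_search`, `exhaustive`, `make_toy_rows`), the toy data, and the parameter values are hypothetical and are not taken from the paper.

```python
import random
from itertools import product
from typing import Dict, List, Optional, Tuple

Record = Dict[str, str]
# Assumed encoding: one gene per grouping attribute, None meaning "any value".
Chromosome = Tuple[Optional[str], ...]


def fitness(rows: List[Record], attrs: List[str], chrom: Chromosome,
            target: str, disguise: str, min_support: int) -> float:
    """Assumed objective: share of the group's records whose target equals the disguise value."""
    group = [r for r in rows
             if all(v is None or r[a] == v for a, v in zip(attrs, chrom))]
    if len(group) < min_support:
        return 0.0  # groups that are too small are not considered
    return sum(r[target] == disguise for r in group) / len(group)


def exhaustive(rows, attrs, domains, target, disguise, min_support) -> Chromosome:
    """Baseline: score every possible group and return the best one."""
    candidates = product(*[[None] + domains[a] for a in attrs])
    return max(candidates,
               key=lambda c: fitness(rows, attrs, c, target, disguise, min_support))


def genetic_search(rows, attrs, domains, target, disguise, min_support,
                   pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    """Simple GA: tournament selection, uniform crossover, per-gene mutation, elitism."""
    rng = random.Random(seed)

    def random_gene(attr):
        return rng.choice([None] + domains[attr])

    def score(chrom):
        return fitness(rows, attrs, chrom, target, disguise, min_support)

    pop = [tuple(random_gene(a) for a in attrs) for _ in range(pop_size)]
    best = max(pop, key=score)
    for _ in range(generations):
        nxt = [best]  # elitism: carry the best group found so far forward
        while len(nxt) < pop_size:
            p1 = max(rng.sample(pop, 2), key=score)  # binary tournament
            p2 = max(rng.sample(pop, 2), key=score)
            child = tuple(rng.choice(pair) for pair in zip(p1, p2))
            child = tuple(random_gene(a) if rng.random() < mutation_rate else g
                          for a, g in zip(attrs, child))
            nxt.append(child)
        pop = nxt
        best = max(pop, key=score)
    return best, score(best)


def make_toy_rows() -> List[Record]:
    """Hypothetical data: weight "0" stands in for a disguise value, concentrated in (F, old)."""
    rows = []
    for sex in ("F", "M"):
        for age in ("young", "old"):
            n_disguised = 4 if (sex, age) == ("F", "old") else 1
            for i in range(5):
                rows.append({"sex": sex, "age": age,
                             "weight": "0" if i < n_disguised else "70"})
    return rows


if __name__ == "__main__":
    rows = make_toy_rows()
    attrs = ["sex", "age"]
    domains = {"sex": ["F", "M"], "age": ["young", "old"]}
    print("exhaustive:", exhaustive(rows, attrs, domains, "weight", "0", 3))
    print("genetic:   ", genetic_search(rows, attrs, domains, "weight", "0", 3))
```

On a search space this small the exhaustive baseline is trivial. The genetic search is only worth its cost when the number of attribute-value combinations makes enumeration impractical, which is presumably the setting the paper targets.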

References

Belen, R.: Detecting disguised missing data. Master thesis, The Middle East Technical University (2009)
Belen, R., Temizel, T.T.: A framework to detect disguised missing data. In: Senthil Kumar, A.V. (ed.) Knowledge Discovery Practices and Emerging Applications of Data Mining: Trends and New Domains, pp. 1–22. IGI Global, Hershey (2010)
FDA Adverse Event Reporting System. http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm083765.htm
Hua, M., Pei, J.: Cleaning disguised missing data: a heuristic approach. In: Proceedings of 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 950–958 (2007)
Hua, M., Pei, J.: DiMaC: a system for cleaning disguised missing data. In: Proceedings of 2008 ACM SIGMOD International Conference on Management of Data, pp. 1263–1266 (2008)
Little, R., Rubin, D.: Statistical Analysis with Missing Data. Wiley Publishers, New York (1987)
Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, Cambridge (1996)
Natarajan, K., Li, J., Koronios, A.: Detecting mis-entered values in large data sets. In: Proceedings of the 4th World Congress on Engineering Asset Management, pp. 805–812 (2009)
Pearson, R.K.: The Problem of Disguised Missing Data. ACM SIGKDD Explor. Newslett. 8(1), 83–92 (2006)
UCI Machine Learning Repository: Pima Indians Diabetes Data Set. http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

Author information

Authors and Affiliations

Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, 811, Taiwan
Wen-Yang Lin & Wen-Yu Feng

Corresponding author

Correspondence to Wen-Yang Lin.

Editor information

Editors and Affiliations

National Chiao Tung University, Hsinchu, Taiwan
Wen-Chih Peng
Google Research, Mountain View, California, USA
Haixun Wang
University of Melbourne, Melbourne, Victoria, Australia
James Bailey
National Cheng Kung University, Tainan, Taiwan
Vincent S. Tseng
Japan Advanced Institute of Science and Technology, Nomi City, Japan
Tu Bao Ho
Nanjing University, Nanjing, China
Zhi-Hua Zhou
National Chengchi University, Taipei, Taiwan
Arbee L.P. Chen

About this paper

Cite this paper

Lin, WY., Feng, WY. (2014). Detecting the Data Group Most Prone to a Specific Disguise Value. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science, vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_10

DOI: https://doi.org/10.1007/978-3-319-13186-3_10
Published: 26 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13185-6
Online ISBN: 978-3-319-13186-3
eBook Packages: Computer Science, Computer Science (R0)