Adaptive Label Cleaning for Error Detection on Tabular Data

Zhang, Yaru; Qin, Jianbin; Mao, Rui; Ji, Yan; Wang, Yaoshu; Ali, Muhammad Asif

doi:10.1007/978-981-97-2421-5_5

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14334)

Included in the following conference series:

Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data

Abstract

Existing supervised methods for error detection require access to clean labels to train the classification model, yet the majority of error detection algorithms ignore the harm that noisy labels do to detection models. In this paper, we design an effective approach for error detection when both data values and labels may be noisy. Specifically, we present AdaptiveClean, a method for error detection on tabular data with noisy training labels. We introduce an effective strategy that chooses the most representative instance to clean. For feature extraction, we use four existing error detection algorithms to handle multiple types of errors. To reduce the negative effect of noisy training labels on the classification model, we use an adaptive label-cleaning method that iteratively trains an arbitrary ML model. Our approach not only prioritizes erroneous instances but also cleans the noisy labels that affect the classifier most. Performance evaluation on five different datasets shows that AdaptiveClean outperforms the best baseline error detection system by 0.01 to 0.12 in terms of F1 score.
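The abstract gives no algorithmic details, so the following is only a minimal sketch of the general idea of iterative label cleaning, not AdaptiveClean itself. Everything in it is an assumption made for illustration: the nearest-centroid classifier stands in for "any arbitrary ML model", the `suspicion` score is a hypothetical selection rule, the `oracle` callback simulates a human who supplies a correct label, and the toy feature vectors imitate the 0/1 outputs of four detectors.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit(features: Sequence[Vector], labels: Sequence[int]) -> Dict[int, List[float]]:
    """Train a nearest-centroid model: one mean vector per class (0 = clean, 1 = error)."""
    model: Dict[int, List[float]] = {}
    for cls in (0, 1):
        rows = [f for f, y in zip(features, labels) if y == cls]
        if rows:
            model[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict(model: Dict[int, List[float]], x: Vector) -> int:
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda cls: sq_dist(model[cls], x))


def suspicion(model: Dict[int, List[float]], x: Vector, y: int) -> float:
    """Hypothetical score: larger means x sits closer to the opposite class."""
    other = 1 - y
    if y not in model or other not in model:
        return 0.0
    return sq_dist(model[y], x) - sq_dist(model[other], x)


def iterative_label_cleaning(
    features: Sequence[Vector],
    noisy_labels: Sequence[int],
    oracle: Callable[[int], int],
    budget: int,
) -> Tuple[List[int], Dict[int, List[float]]]:
    """Retrain, pick the most suspicious uncleaned label, ask the oracle, repeat."""
    labels = list(noisy_labels)
    cleaned = set()
    for _ in range(budget):
        model = fit(features, labels)
        candidates = [i for i in range(len(labels)) if i not in cleaned]
        if not candidates:
            break
        pick = max(candidates, key=lambda i: suspicion(model, features[i], labels[i]))
        labels[pick] = oracle(pick)  # assumed to return the correct label
        cleaned.add(pick)
    return labels, fit(features, labels)


def f1_score(truth: Sequence[int], pred: Sequence[int]) -> float:
    """F1 for the positive (error) class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (invented): each row holds the 0/1 votes of four detectors for one cell.
features = [
    [1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1],  # true errors
    [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0],  # truly clean
]
truth = [1, 1, 1, 1, 0, 0, 0, 0]
noisy = [1, 1, 0, 1, 0, 0, 0, 1]  # labels at indices 2 and 7 are flipped

cleaned_labels, final_model = iterative_label_cleaning(
    features, noisy, oracle=lambda i: truth[i], budget=2
)
predictions = [predict(final_model, x) for x in features]
print(cleaned_labels, f1_score(truth, predictions))
```

The loop spends its labeling budget on the instances whose current label disagrees most with the retrained model, which is one simple way to prioritize the labels that distort the classifier; the paper's actual selection strategy may differ.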

Acknowledgements

This research is supported by the National Key R&D Program of China 2021YFB3301500, Shenzhen Continuous Support Grant 20200811104054002, Guangdong Provincial National Science Foundation 2019A1515111047, and Shenzhen Colleges and Universities Continuous Support Grant 20220810142731001.

Author information

Authors and Affiliations

Shenzhen Institute of Computing Sciences, Shenzhen University, Shenzhen, China
Yaru Zhang, Jianbin Qin, Rui Mao, Yan Ji & Yaoshu Wang
King Abdullah University of Science and Technology, Jeddah, Saudi Arabia
Muhammad Asif Ali

Corresponding author

Correspondence to Jianbin Qin.

Editor information

Editors and Affiliations

Peng Cheng Laboratory, Shenzhen, China
Xiangyu Song
China University of Geosciences, Wuhan, China
Ruyi Feng
China University of Geosciences, Wuhan, China
Yunliang Chen
Deakin University, Burwood, VIC, Australia
Jianxin Li
University of Exeter, Exeter, UK
Geyong Min

About this paper

Cite this paper

Zhang, Y., Qin, J., Mao, R., Ji, Y., Wang, Y., Ali, M.A. (2024). Adaptive Label Cleaning for Error Detection on Tabular Data. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14334. Springer, Singapore. https://doi.org/10.1007/978-981-97-2421-5_5

DOI: https://doi.org/10.1007/978-981-97-2421-5_5
Published: 12 May 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2420-8
Online ISBN: 978-981-97-2421-5
eBook Packages: Computer Science, Computer Science (R0)
