Abstract
Existing supervised methods for error detection require access to clean labels to train the classification model, yet the majority of error detection algorithms ignore the harm that noisy labels do to detection models. In this paper, we design an effective approach for error detection when both data values and labels may be noisy. Specifically, we present AdaptiveClean, a method for error detection on tabular data with noisy training labels. We introduce an effective strategy that chooses the most representative instances to clean. For feature extraction, we use four existing error detection algorithms to handle multiple types of errors. To reduce the negative effect of noisy training labels on the classification model, we use an adaptive label-cleaning method that iteratively trains an arbitrary ML model. Our approach not only prioritizes erroneous instances but also cleans first the noisy labels that most affect the classifier. Performance evaluation on five different datasets shows that AdaptiveClean outperforms the best baseline error detection system by 0.01 to 0.12 in terms of F1 score.
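The iterative label-cleaning idea described in the abstract can be illustrated with a minimal sketch. This is not the AdaptiveClean implementation: the nearest-centroid classifier, the selection rule (pick the instance whose current label disagrees most with the model), the `oracle` callable that returns a verified label, and every function name are assumptions made purely for illustration. Each row of `X` stands in for the feature vector that a set of error detectors would produce for one data cell, and labels are 1 for "error" and 0 for "clean".

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def fit_centroids(X: Sequence[Vector], y: Sequence[int]) -> Dict[int, List[float]]:
    """Train a toy model: one mean feature vector per binary label (0 or 1)."""
    centroids: Dict[int, List[float]] = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        if rows:
            centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def disagreement(x: Vector, label: int, centroids: Dict[int, List[float]]) -> float:
    """Positive when x lies closer to the other class's centroid than to its own."""
    other = 1 - label
    if label not in centroids or other not in centroids:
        return 0.0
    return sq_dist(x, centroids[label]) - sq_dist(x, centroids[other])


def iterative_label_clean(
    X: Sequence[Vector],
    noisy_y: Sequence[int],
    oracle: Callable[[int], int],  # hypothetical: returns the verified label of row i
    budget: int,
) -> List[int]:
    """Retrain on the current labels, then verify the most suspicious one; repeat."""
    y = list(noisy_y)
    checked = set()
    for _ in range(budget):
        centroids = fit_centroids(X, y)  # retrain on the current (partly cleaned) labels
        candidates = [i for i in range(len(X)) if i not in checked]
        if not candidates:
            break
        # Assumed selection rule: the label that conflicts most with the model.
        i = max(candidates, key=lambda j: disagreement(X[j], y[j], centroids))
        y[i] = oracle(i)
        checked.add(i)
    return y
```

Retraining after every verified label lets each correction influence which label is examined next, which is the sense in which such a loop is adaptive. Any classifier that exposes a per-instance confidence could replace the centroid model in this sketch.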
Acknowledgements
This research is supported by the National Key R&D Program of China 2021YFB3301500, Shenzhen Continuous Support Grant 20200811104054002, Guangdong Provincial National Science Foundation 2019A1515111047, and Shenzhen Colleges and Universities Continuous Support Grant 20220810142731001.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, Y., Qin, J., Mao, R., Ji, Y., Wang, Y., Ali, M.A. (2024). Adaptive Label Cleaning for Error Detection on Tabular Data. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14334. Springer, Singapore. https://doi.org/10.1007/978-981-97-2421-5_5
DOI: https://doi.org/10.1007/978-981-97-2421-5_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2420-8
Online ISBN: 978-981-97-2421-5
eBook Packages: Computer Science, Computer Science (R0)