Adaptive Label Cleaning for Error Detection on Tabular Data

  • Conference paper
Web and Big Data (APWeb-WAIM 2023)

Abstract

Existing supervised methods for error detection require clean labels to train the classification model, yet the majority of error detection algorithms ignore the harm that noisy labels do to detection models. In this paper, we design an effective approach for error detection when both data values and labels may be noisy. Specifically, we present AdaptiveClean, a method for error detection on tabular data with noisy training labels. We introduce an effective strategy that chooses the most representative instances to clean. For feature extraction, we use four existing error detection algorithms to handle multiple types of errors. To reduce the negative effect of noisy training labels on the classification model, we use an adaptive label-cleaning method that iteratively trains an arbitrary ML model. Our approach not only prioritizes erroneous instances but also cleans the noisy labels that affect the classifier most. Performance evaluation on five different datasets shows that AdaptiveClean outperforms the best baseline error detection system by 0.01 to 0.12 in terms of F1 score.
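The general idea of iterative label cleaning can be illustrated with a minimal sketch. It is not the AdaptiveClean implementation, whose details are not given in the abstract. The sketch assumes each cell is described by four binary detector flags, substitutes a simple nearest-centroid scorer for the "arbitrary ML model", and uses a hypothetical `oracle` callback that returns the true label of a queried instance. All function names and the toy data are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def fit_centroids(X: Sequence[Vector], y: Sequence[int]) -> Dict[int, List[float]]:
    """Return the mean feature vector of each class (0 = clean, 1 = error)."""
    centroids: Dict[int, List[float]] = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        if rows:
            centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def error_score(centroids: Dict[int, List[float]], x: Vector) -> float:
    """Score in [0, 1]; higher means x lies closer to the 'error' centroid."""
    if len(centroids) < 2:
        return float(next(iter(centroids)))  # only one class is present
    d0 = sum((a - b) ** 2 for a, b in zip(x, centroids[0])) ** 0.5
    d1 = sum((a - b) ** 2 for a, b in zip(x, centroids[1])) ** 0.5
    return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)


def iterative_label_cleaning(
    X: Sequence[Vector],
    noisy_y: Sequence[int],
    oracle: Callable[[int], int],
    rounds: int = 3,
    batch: int = 1,
) -> Tuple[List[int], Dict[int, List[float]]]:
    """Retrain on the current labels, then ask the oracle to clean the
    labels that disagree most with the model, and repeat."""
    y = list(noisy_y)
    cleaned = set()
    for _ in range(rounds):
        centroids = fit_centroids(X, y)
        # Suspicion = gap between the model's score and the current label.
        suspicion = [abs(error_score(centroids, x) - t) for x, t in zip(X, y)]
        candidates = sorted(
            (i for i in range(len(X)) if i not in cleaned),
            key=lambda i: suspicion[i],
            reverse=True,
        )
        for i in candidates[:batch]:
            y[i] = oracle(i)  # hypothetical human or ground-truth lookup
            cleaned.add(i)
    return y, fit_centroids(X, y)


# Toy data: each row holds the flags raised by four assumed detectors.
X = [
    [0, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0],  # truly clean cells
    [1, 1, 1, 1], [1, 1, 1, 0], [1, 0, 1, 1],  # truly erroneous cells
]
truth = [0, 0, 0, 1, 1, 1]
noisy = [0, 0, 1, 1, 1, 0]  # labels at indices 2 and 5 are flipped

cleaned_y, model = iterative_label_cleaning(
    X, noisy, oracle=lambda i: truth[i], rounds=2, batch=1
)
```

In this toy run the two flipped labels are the ones with the largest disagreement, so a budget of two oracle queries is spent on them rather than on labels that were already correct. Any classifier that outputs a score could replace the centroid model.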

Acknowledgements

This research is supported by the National Key R&D Program of China 2021YFB3301500, Shenzhen Continuous Support Grant 20200811104054002, Guangdong Provincial National Science Foundation 2019A1515111047, and Shenzhen Colleges and Universities Continuous Support Grant 20220810142731001.

Author information

Corresponding author

Correspondence to Jianbin Qin.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Zhang, Y., Qin, J., Mao, R., Ji, Y., Wang, Y., Ali, M.A. (2024). Adaptive Label Cleaning for Error Detection on Tabular Data. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14334. Springer, Singapore. https://doi.org/10.1007/978-981-97-2421-5_5

  • DOI: https://doi.org/10.1007/978-981-97-2421-5_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2420-8

  • Online ISBN: 978-981-97-2421-5

  • eBook Packages: Computer Science, Computer Science (R0)
