
Data collection and quality challenges in deep learning: a data-centric AI perspective

  • Regular Paper
  • Published in The VLDB Journal

Abstract

Data-centric AI is at the center of a fundamental shift in software engineering, where machine learning becomes the new software, powered by big data and computing infrastructure. Software engineering thus needs to be rethought, with data becoming a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation: without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are becoming mainstream. Unfortunately, many real-world datasets are small, dirty, biased, or even poisoned. In this survey, we study the research landscape for data collection and data quality, primarily for deep learning applications. Data collection is important because recent deep learning approaches need less feature engineering but far larger amounts of data. For data quality, we study data validation, cleaning, and integration techniques. Even when the data cannot be fully cleaned, we can still cope with imperfect data during model training using robust model training techniques. In addition, while bias and fairness have been studied less in traditional data management research, they have become essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well poised to solve these problems.
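As a concrete illustration of the group fairness measures surveyed here, the sketch below computes the demographic parity difference of binary predictions with respect to a binary sensitive attribute. This is a minimal, self-contained example; the function name and toy data are illustrative and not taken from the paper.

```python
# Demographic parity difference: |P(yhat=1 | group=0) - P(yhat=1 | group=1)|.
# A value of 0 means both groups receive positive predictions at the same rate.

def demographic_parity_diff(y_pred, group):
    """y_pred: list of 0/1 predictions; group: list of 0/1 sensitive attributes."""
    def positive_rate(g):
        members = [p for p, a in zip(y_pred, group) if a == g]
        return sum(members) / len(members)
    return abs(positive_rate(0) - positive_rate(1))

# Hypothetical toy data: 8 individuals, 4 per group.
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]

print(demographic_parity_diff(y_pred, group))  # 0.75 vs 0.25 -> 0.5
```

Unfairness mitigation techniques of the kind this survey covers (pre-, in-, and post-processing) aim to drive such a gap toward zero while preserving accuracy.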



References

  1. Amazon Mechanical Turk. https://www.mturk.com/. Accessed 13 July 2022

  2. Amazon SageMaker Ground Truth. https://aws.amazon.com/sagemaker/groundtruth/. Accessed 13 July 2022

  3. Amazon scraps secret AI recruiting tool that showed bias against women. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G. Accessed 13 July 2022

  4. Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., Wallach, H.M.: A reductions approach to fair classification. In: ICML, pp. 60–69 (2018)

  5. Agrawal, P., Arya, R., Bindal, A., Bhatia, S., Gagneja, Godlewski, J., Low, Y., Muss, T., Paliwal, M.M., Raman, S., Shah, V., Shen, Sugden, L., Zhao, K., Wu, M.-C.: Data platform for machine learning. In: SIGMOD, pp. 1803–1816 (2019)

  6. Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H.C., Kamar, E., Nagappan, N., Nushi, B., Zimmermann, T.: Software engineering for machine learning: a case study. In: ICSE, pp. 291–300 (2019)

  7. Angwin, J., Larson, J., Mattu, S., Kirchner, L.: Machine bias: there's software used across the country to predict future criminals. And it's biased against blacks (2016)

  8. Anwar, S., Barnes, N.: Real image denoising with feature attention. In: CVPR, pp. 3155–3164 (2019)

  9. Asudeh, A., Jin, Z., Jagadish, H.V.: Assessing and remedying coverage for a given dataset. In: ICDE, pp. 554–565 (2019)

  10. Bach, S.H., Rodriguez, D., Liu, Y., Luo, C., Shao, H., Xia, C., Sen, S., Ratner, A., Hancock, B., Alborzi, H., Kuchhal, R., Ré, C., Malkin, R.: Snorkel Drybell: a case study in deploying weak supervision at industrial scale. In: SIGMOD, pp. 362–375 (2019)

  11. Baltrusaitis, T., Ahuja, C., Morency, L.-P.: Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 423–443 (2019)

  12. Barocas, S., Hardt, M., Narayanan, A.: Fairness and machine learning. fairmlbook.org. http://www.fairmlbook.org (2019)

  13. Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C.Y., Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., Koo, C.Y., Lew, L., Mewald, C., Modi, A.N., Polyzotis, N., Ramesh, S., Roy, S., Whang, S.E., Wicke, M., Wilkiewicz, J., Zhang, X., Zinkevich, M.: TFX: a tensorflow-based production-scale machine learning platform. In: KDD, pp. 1387–1395 (2017)

  14. Bellamy, R.K.E., Dey, K., Hind, M., et al.: AI fairness 360: an extensible toolkit for detecting and mitigating algorithmic bias. IBM J. Res. Dev. 63, 4:1-4:15 (2019)

  15. Berk, R., Heidari, H., Jabbari, S., Kearns, M., Roth, A.: Fairness in criminal justice risk assessments: the state of the art (2017)

  16. Berthelot, D., Carlini, N., Cubuk, E.D., Kurakin, A., Sohn, K., Zhang, H., Raffel, C.: Remixmatch: semi-supervised learning with distribution matching and augmentation anchoring. In: ICLR (2020)

  17. Berthelot, D., Carlini, N., Goodfellow, I.J., Papernot, N., Oliver, A., Raffel, C.: Mixmatch: a holistic approach to semi-supervised learning. In: NeurIPS, pp. 5050–5060 (2019)

  18. Biessmann, F., Golebiowski, J., Rukat, T., Lange, D., Schmidt, P.: Automated data validation in machine learning systems. IEEE Data Eng. Bull. 44(1), 51–65 (2021)

  19. Biggio, B., Corona, I., Maiorca, D., Nelson, B., Srndic, N., Laskov, P., Giacinto, G., Roli, F.: Evasion attacks against machine learning at test time. In: ECML PKDD, pp. 387–402. Springer (2013)

  20. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT, pp. 92–100. ACM, New York (1998)

  21. Boehm, M., Antonov, I., Baunsgaard, S., Dokter, M., Ginthör, R., Innerebner, K., Klezin, F., Lindstaedt, S.N., Phani, A., Rath, B., Reinwald, B., Siddiqui, S., Wrede, S.B.: Systemds: a declarative machine learning system for the end-to-end data science lifecycle. In: CIDR (2020)

  22. Breck, E., Zinkevich, M., Polyzotis, N., Whang, S., Roy, S.: Data validation for machine learning. In: MLSys (2019)

  23. Brickley, D., Burgess, M., Noy, N.F.: Google dataset search: building a search engine for datasets in an open web ecosystem. In: WWW, pp. 1365–1375 (2019)

  24. CrowdFlower Data Science Report. https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf

  25. Cafarella, M.J., Halevy, A.Y., Lee, H., Madhavan, J., Yu, C., Wang, D.Z., Wu, E.: Ten years of webtables. PVLDB 11(12), 2140–2149 (2018)

  26. Cambronero, J., Feser, J.K., Smith, M.J., Madden, S.: Query optimization for dynamic imputation. Proc. VLDB Endow. 10(11), 1310–1321 (2017)

  27. Chakraborty, A., Alam, M., Dey, V., Chattopadhyay, A., Mukhopadhyay, D.: Adversarial attacks and defences: a survey. CoRR arXiv:1810.00069 (2018)

  28. Chang, H.-S., Learned-Miller, E.G., McCallum, A.: Active bias: training more accurate neural networks by emphasizing high variance samples. In: NeurIPS, pp. 1002–1012 (2017)

  29. Che, Z., Purushotham, S., Cho, K., Sontag, D., Liu, Y.: Recurrent neural networks for multivariate time series with missing values. Nat. Sci. Rep. 8(1), 6085 (2018)

  30. Chen, A., Chow, A., Davidson, A., DCunha, A., Ghodsi, A., Hong, S.A., Konwinski, A., Mewald, C., Murching, S., Nykodym, T., Ogilvie, P., Parkhe, M., Singh, A., Xie, F., Zaharia, M., Zang, R., Zheng, J., Zumar, C.: Developments in mlflow: a system to accelerate the machine learning lifecycle. In: DEEM@SIGMOD, pp. 5:1–5:4 (2020)

  31. Chen, I.Y., Johansson, F.D., Sontag, D.A.: Why is my classifier discriminatory? In: NeurIPS, pp. 3543–3554 (2018)

  32. Chen, T., Guestrin, C.: Xgboost: a scalable tree boosting system. In: KDD, pp. 785–794 (2016)

  33. Cheng, Y., Diakonikolas, I., Ge, R.: High-dimensional robust mean estimation in nearly-linear time. In: SODA, pp. 2755–2771. SIAM (2019)

  34. Choi, K., Grover, A., Singh, T., Shu, R., Ermon, S.: Fair generative modeling via weak supervision. In: ICML, pp. 1887–1898 (2020)

  35. Chouldechova, A.: Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5(2), 153–163 (2017)

  36. Chouldechova, A., Roth, A.: A snapshot of the frontiers of fairness in machine learning. Commun. ACM 63(5), 82–89 (2020)

  37. Chzhen, E., Denis, C., Hebiri, M., Oneto, L., Pontil, M.: Leveraging labeled and unlabeled data for consistent fair binary classification. In: NeurIPS, pp. 12739–12750 (2019)

  38. Cotter, A., Jiang, H., Sridharan, K.: Two-player games for efficient non-convex constrained optimization. In: ALT, pp. 300–332 (2019)

  39. Cretu, G.F., Stavrou, A., Locasto, M.E., Stolfo, S.J., Keromytis, A.D.: Casting out demons: sanitizing training data for anomaly sensors. In: IEEE S&P, pp. 81–95 (2008)

  40. Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: Autoaugment: learning augmentation strategies from data. In: CVPR, pp. 113–123 (2019)

  41. Data age 2025. https://www.seagate.com/our-story/data-age-2025/

  42. Data-centric AI resource hub. https://datacentricai.org/

  43. Data prep still dominates data scientists’ time, survey finds. https://www.datanami.com/2020/07/06/data-prep-still-dominates-data-scientists-time-survey-finds/

  44. Diakonikolas, I., Kamath, G., Kane, D., Li, J., Moitra, A., Stewart, A.: Robust estimators in high-dimensions without the computational intractability. SIAM J. Comput. 48(2), 742–864 (2019)

  45. Dieterich, W., Mendoza, C., Brennan, T.: Compas risk scales: demonstrating accuracy equity and predictive parity. Technical report, Northpointe Inc. (2016)

  46. Doan, A., Halevy, A.Y., Ives, Z.G.: Principles of Data Integration. Morgan Kaufmann, Burlington (2012)

  47. Dolatshah, M., Teoh, M., Wang, J., Pei, J.: Cleaning crowdsourced labels using oracles for statistical classification. PVLDB 12(4), 376–389 (2018)

  48. Dong, X.L., Rekatsinas, T.: Data integration and machine learning: a natural synergy. In: KDD, pp. 3193–3194 (2019)

  49. Dreves, M., Huang, G., Peng, Z., Polyzotis, N., Rosen, E., Paul Suganthan, G.C.: Validating data and models in continuous ML pipelines. IEEE Data Eng. Bull. 44(1), 42–50 (2021)

  50. Dua, D., Graff, C.: UCI machine learning repository (2017)

  51. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.S.: Fairness through awareness. In: ITCS, pp. 214–226 (2012)

  52. Facets—visualization for ML datasets. https://pair-code.github.io/facets/. Accessed 13 July 2022

  53. Feldman, M., Friedler, S.A., Moeller, J., Scheidegger, C., Venkatasubramanian, S.: Certifying and removing disparate impact. In: KDD, pp. 259–268 (2015)

  54. Fernandez, R.C., Abedjan, Z., Koko, F., Yuan, G., Madden, S., Stonebraker, M.: Aurum: a data discovery system. In: ICDE, pp. 1001–1012 (2018)

  55. Foster, D.P., Stine, R.A.: Alpha-investing: a procedure for sequential control of expected false discoveries. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 70(2), 429–444 (2008)

  56. GCP AI platform data labeling service. https://cloud.google.com/ai-platform/data-labeling/docs. Accessed 13 July 2022

  57. Google apologises for Photos app’s racist blunder. https://www.bbc.com/news/technology-33347866. Accessed 13 July 2022

  58. Goel, K., Albert, G., Li, Y., Ré, C.: Model patching: closing the subgroup performance gap with data augmentation. In: ICLR (2021)

  59. Goodfellow, I.J.: NIPS 2016 tutorial: generative adversarial networks. CoRR arXiv:1701.00160 (2017)

  60. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., Bengio, Y.: Generative adversarial nets. In: NIPS, pp. 2672–2680 (2014)

  61. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: ICLR (2015)

  62. Gordon, J.: Introducing tensorflow hub: a library for reusable machine learning modules in tensorflow (2018)

  63. Grafberger, S., Stoyanovich, J., Schelter, S.: Lightweight inspection of data preprocessing in native machine learning pipelines. In: CIDR (2021)

  64. Halevy, A.Y., Korn, F., Noy, N.F., Olston, C., Polyzotis, N., Roy, S., Whang, S.E.: Goods: organizing Google’s datasets. In: SIGMOD, pp. 795–806 (2016)

  65. Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I.W., Sugiyama, M.: Co-teaching: robust training of deep neural networks with extremely noisy labels. In: NeurIPS, pp. 8536–8546 (2018)

  66. Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. In: NIPS, pp. 3315–3323 (2016)

  67. Hashimoto, T.B., Srivastava, M., Namkoong, H., Liang, P.: Fairness without demographics in repeated loss minimization. In: Dy, J.G., Krause, A. (eds.) ICML, vol. 80, pp. 1934–1943. PMLR (2018)

  68. Hazelwood, K.M., Bird, S., Brooks, D.M., Chintala, S., Diril, U., Dzhulgakov, D., Fawzy, M., Jia, B., Jia, Y., Kalro, A., Law, J., Lee, K., Lu, J., Noordhuis, P., Smelyanskiy, M., Xiong, L., Wang, X.: Applied machine learning at Facebook: a datacenter infrastructure perspective. In: HPCA, pp. 620–629 (2018)

  69. Hendrycks, D., Mu, N., Cubuk, E.D., Zoph, B., Gilmer, J., Lakshminarayanan, B.: Augmix: a simple data processing method to improve robustness and uncertainty. In: ICLR (2020)

  70. Heo, G., Roh, Y., Hwang, S., Lee, D., Whang, S.E.: Inspector gadget: a data programming-based labeling system for industrial images. In: PVLDB (2021)

  71. Hermann, J., Del Balso, M.: Meet michelangelo: Uber's machine learning platform (2017)

  72. Hodge, V.J., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22(2), 85–126 (2004)

  73. Huber, P.J.: Robust estimation of a location parameter. In: Kotz, S., Johnson, N.L. (eds.) Breakthroughs in Statistics, pp. 492–518. Springer, Berlin (1992)

  74. Ilyas, I.F., Chu, X.: Data Cleaning. ACM, New York (2019)

  75. Ilyas, I.F., Rekatsinas, T.: Machine learning and data cleaning: Which serves the other? J. Data Inf. Qual. (2021). Just Accepted

  76. Iosifidis, V., Ntoutsi, E.: Adafair: cumulative fairness adaptive boosting. In: CIKM, pp. 781–790 (2019)

  77. Jiang, H., Nachum, O.: Identifying and correcting label bias in machine learning. In: AISTATS, pp. 702–712 (2020)

  78. Jiang, L., Zhou, Z., Leung, T., Li, L.-J., Fei-Fei, L.: Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels. In: ICML, pp. 2309–2318 (2018)

  79. Kaggle. https://www.kaggle.com

  80. Kamiran, F., Calders, T.: Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst. 33(1), 1–33 (2011)

  81. Kamishima, T., Akaho, S., Asoh, H., Sakuma, J.: Fairness-aware classifier with prejudice remover regularizer. In: ECML PKDD, pp. 35–50 (2012)

  82. Karlas, B., Li, P., Wu, R., Gürel, N.M., Chu, X., Wu, W., Zhang, C.: Nearest neighbor classifiers over incomplete information: from certain answers to certain predictions. Proc. VLDB Endow. 14(3), 255–267 (2020)

  83. Khademi, A., Lee, S., Foley, D., Honavar, V.: Fairness in algorithmic decision making: an excursion through the lens of causality. In: WWW, pp. 2907–2914 (2019)

  84. Khani, F., Liang, P.: Removing spurious features can hurt accuracy and affect groups disproportionately. In: FAccT, pp. 196–205. ACM (2021)

  85. Kilbertus, N., Rojas-Carulla, M., Parascandolo, G., Hardt, M., Janzing, D., Schölkopf, B.: Avoiding discrimination through causal reasoning. In: NeurIPS, pp. 656–666 (2017)

  86. Kim, H., Lee, K., Hwang, G., Suh, C.: Crash to not crash: learn to identify dangerous vehicles using a simulator. In: AAAI, pp. 978–985 (2019)

  87. Koh, P.W., Steinhardt, J., Liang, P.: Stronger data poisoning attacks break data sanitization defenses. CoRR arXiv:1811.00741 (2018)

  88. Krishnan, S., Wang, J., Wu, E., Franklin, M.J., Goldberg, K.: Activeclean: interactive data cleaning for statistical modeling. PVLDB 9(12), 948–959 (2016)

  89. Kurach, K., Lucic, M., Zhai, X., Michalski, M., Gelly, S.: The GAN landscape: losses, architectures, regularization, and normalization. CoRR arXiv:1807.04720 (2018)

  90. Kusner, M.J., Loftus, J., Russell, C., Silva, R.: Counterfactual fairness. In: NeurIPS, pp. 4066–4076 (2017)

  91. Lahoti, P., Beutel, A., Chen, J., Lee, K., Prost, F., Thain, N., Wang, X., Chi, E.: Fairness without demographics through adversarially reweighted learning. In: NeurIPS (2020)

  92. Lamy, A.L., Zhong, Z.: Noise-tolerant fair classification. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) NeurIPS, pp. 294–305 (2019)

  93. Lee, D.J.L., Parameswaran, A.G.: The case for a visual discovery assistant: a holistic solution for accelerating visual data exploration. IEEE Data Eng. Bull. 41(3), 3–14 (2018)

  94. Lee, J.-G., Roh, Y., Song, H., Whang, S.E.: Machine learning robustness, fairness, and their convergence. In: KDD, pp. 4046–4047 (2021)

  95. Li, J., Socher, R., Hoi, S.C.H.: Dividemix: learning with noisy labels as semi-supervised learning. In: ICLR (2020)

  96. Li, P., Rao, X., Blase, J., Zhang, Y., Chu, X., Zhang, C.: CleanML: a benchmark for joint data cleaning and machine learning [experiments and analysis]. In: ICDE (2021)

  97. Liu, Z., Park, J.H., Rekatsinas, T., Tzamos, C.: On robust mean estimation under coordinate-level corruption. In: ICML, pp. 6914–6924. PMLR (2021)

  98. Liu, Z., Park, J., Rekatsinas, T., Tzamos, C.: On robust mean estimation under coordinate-level corruption. In: ICML, pp. 6914–6924 (2021)

  99. Malach, E., Shalev-Shwartz, S.: Decoupling “when to update” from “how to update”. In: NIPS, pp. 960–970 (2017)

  100. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on bias and fairness in machine learning. CoRR arXiv:1908.09635 (2019)

  101. Melgar, L.A., Dao, D., Gan, S., Gürel, N.M., Hollenstein, N., Jiang, J., Karlas, B., Lemmin, T., Li, T., Li, Y., Rao, X., Rausch, J., Renggli, C., Rimanic, L., Weber, M., Zhang, S., Zhao, Z., Schawinski, K., Wu, W., Zhang, C.: Ease.ml: a lifecycle management system for machine learning. In: CIDR (2021)

  102. Meng, D., Chen, H.: Magnet: a two-pronged defense against adversarial examples. In: Thuraisingham, B.M., Evans, D., Malkin, T., Xu, D. (eds.) ACM SIGSAC, pp. 135–147 (2017)

  103. Metzen, J.H., Genewein, T., Fischer, V., Bischoff, B.: On detecting adversarial perturbations. In: ICLR (2017)

  104. Miller, R.J., Nargesian, F., Zhu, E., Christodoulakis, C., Pu, K.Q., Andritsos, P.: Making open data transparent: data discovery on open data. IEEE Data Eng. Bull. 41(2), 59–70 (2018)

  105. Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: Su, K.-Y., Su, J., Wiebe, J. (eds.) ACL, pp. 1003–1011 (2009)

  106. Nabi, R., Shpitser, I.: Fair inference on outcomes. In: AAAI, pp. 1931–1940 (2018)

  107. Neutatz, F., Chen, B., Abedjan, Z., Wu, E.: From cleaning before ML to cleaning for ML. IEEE Data Eng. Bull. 44(1), 24–41 (2021)

  108. Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV, pp. 69–84 (2016)

  109. Principles for AI ethics. https://research.samsung.com/artificial-intelligence. Accessed 13 July 2022

  110. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE TKDE 22(10), 1345–1359 (2010)

  111. Papernot, N., McDaniel, P.D., Wu, X., Jha, S., Swami, A.: Distillation as a defense to adversarial perturbations against deep neural networks. In: IEEE SP, pp. 582–597 (2016)

  112. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: an imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) NeurIPS, pp. 8024–8035. Curran Associates, Inc. (2019)

  113. Patrini, G., Rozza, A., Menon, A.K., Nock, R., Qu, L.: Making deep neural networks robust to label noise: a loss correction approach. In: CVPR, pp. 2233–2241 (2017)

  114. Paudice, A., Muñoz-González, L., György, A., Lupu, E.C.: Detection of adversarial training examples in poisoning attacks through anomaly detection. CoRR arXiv:1802.03041 (2018)

  115. Pelekis, N., Ntrigkogias, C., Tampakis, P., Sideridis, S., Theodoridis, Y.: Hermoupolis: a trajectory generator for simulating generalized mobility patterns. In: ECML PKDD, pp. 659–662 (2013)

  116. Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J.M., Weinberger, K.Q.: On fairness and calibration. In: NIPS, pp. 5680–5689 (2017)

  117. Polyzotis, N., Roy, S., Whang, S.E., Zinkevich, M.: Data management challenges in production machine learning. In: SIGMOD, pp. 1723–1726 (2017)

  118. Polyzotis, N., Roy, S., Whang, S.E., Zinkevich, M.: Data lifecycle challenges in production machine learning: a survey. SIGMOD Rec. 47(2), 17–28 (2018)

  119. Qayyum, A., Qadir, J., Bilal, M., Al-Fuqaha, A.: Secure and robust machine learning for healthcare: a survey. IEEE Rev. Biomed. Eng. 14, 156–180 (2020)

  120. Responsible AI practices. https://ai.google/responsibilities/responsible-ai-practices. Accessed 13 July 2022

  121. Responsible AI principles from Microsoft. https://www.microsoft.com/en-us/ai/responsible-ai. Accessed 13 July 2022

  122. Ratner, A., Bach, S.H., Ehrenberg, H., Fries, J., Wu, S., Ré, C.: Snorkel: rapid training data creation with weak supervision. PVLDB 11(3), 269–282 (2017)

  123. Ratner, A., Bach, S.H., Ehrenberg, H.R., Fries, J.A., Wu, S., Ré, C.: Snorkel: rapid training data creation with weak supervision. VLDB J. 29(2–3), 709–730 (2020)

  124. Ratner, A.J., Ehrenberg, H.R., Hussain, Z., Dunnmon, J., Ré, C.: Learning to compose domain-specific transformations for data augmentation. In: NIPS, pp. 3239–3249 (2017)

  125. Redyuk, S., Kaoudi, Z., Markl, V., Schelter, S.: Automating data quality validation for dynamic data ingestion. In: EDBT, pp. 61–72 (2021)

  126. Reed, S.E., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., Rabinovich, A.: Training deep neural networks on noisy labels with bootstrapping. In: ICLR (2015)

  127. Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: Holoclean: holistic data repairs with probabilistic inference. PVLDB 10(11), 1190–1201 (2017)

  128. Renggli, C., Rimanic, L., Gürel, N.M., Karlas, B., Wu, W., Zhang, C.: A data quality-driven view of mlops. IEEE Data Eng. Bull. 44(1), 11–23 (2021)

  129. Ricci, F., Rokach, L., Shapira, B. (eds.): Recommender Systems Handbook. Springer, Berlin (2015)

  130. Roh, Y., Heo, G., Whang, S.E.: A survey on data collection for machine learning: a big data-AI integration perspective. IEEE TKDE (2019)

  131. Roh, Y., Lee, K., Whang, S.E., Suh, C.: FR-Train: a mutual information-based approach to fair and robust training. In: ICML (2020)

  132. Roh, Y., Lee, K., Whang, S.E., Suh, C.: Fairbatch: batch selection for model fairness. In: ICLR. OpenReview.net (2021)

  133. Roh, Y., Lee, K., Whang, S.E., Suh, C.: Sample selection for fair and robust training. In: NeurIPS (2021)

  134. Software 2.0. https://medium.com/@karpathy/software-2-0-a64152b37c35

  135. South Korean AI chatbot pulled from Facebook after hate speech towards minorities. https://www.theguardian.com/world/2021/jan/14/time-to-properly-socialise-hate-speech-ai-chatbot-pulled-from-facebook. Accessed 13 July 2022

  136. Salimi, B., Rodriguez, L., Howe, B., Suciu, D.: Interventional fairness: causal database repair for algorithmic fairness. In: SIGMOD, pp. 793–810 (2019)

  137. Schelter, S., Böse, J.-H., Kirschnick, J., Klein, T., Seufert, S.: Automatically tracking metadata and provenance of machine learning experiments. In: Workshop on ML Systems at NIPS (2017)

  138. Schelter, S., Grafberger, S., Schmidt, P., Rukat, T., Kießling, M., Taptunov, A., Bießmann, F., Lange, D.: Differential data quality verification on partitioned data. In: ICDE, pp. 1940–1945 (2019)

  139. Schelter, S., Lange, D., Schmidt, P., Celikel, M., Bießmann, F., Grafberger, A.: Automating large-scale data quality verification. Proc. VLDB Endow. 11(12), 1781–1794 (2018)

  140. Schelter, S., Rukat, T., Biessmann, F.: JENGA: a framework to study the impact of data errors on the predictions of machine learning models. In: EDBT, pp. 529–534 (2021)

  141. Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., Dennison, D.: Hidden technical debt in machine learning systems. In: NIPS, pp. 2503–2511 (2015)

  142. Settles, B.: Active learning. In: Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers (2012)

  143. Shafahi, A., Huang, W.R., Najibi, M., Suciu, O., Studer, C., Dumitras, T., Goldstein, T.: Poison frogs! Targeted clean-label poisoning attacks on neural networks. In: NeurIPS, pp. 6106–6116 (2018)

  144. Shang, L.: Denoising natural images based on a modified sparse coding algorithm. Appl. Math. Comput. 205(2), 883–889 (2008)

  145. Shen, Z., Liu, J., He, Y., Zhang, X., Xu, R., Yu, H., Cui, P.: Towards out-of-distribution generalization: a survey. arXiv:2108.13624 (2021)

  146. Sheng, V.S., Provost, F.J., Ipeirotis, P.G.: Get another label? Improving data quality and data mining using multiple, noisy labelers. In: KDD, pp. 614–622 (2008)

  147. Sinha, A., Namkoong, H., Duchi, J.C.: Certifying some distributional robustness with principled adversarial training. In: ICLR (2018)

  148. Solans, D., Biggio, B., Castillo, C.: Poisoning attacks on algorithmic fairness. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds.) ECML PKDD, vol. 12457, pp. 162–177. Springer (2020)

  149. Song, H., Kim, M., Lee, J.-G.: SELFIE: refurbishing unclean samples for robust deep learning. In: ICML, pp. 5907–5915 (2019)

  150. Song, H., Kim, M., Park, D., Lee, J.-G.: Learning from noisy labels with deep neural networks: a survey. CoRR arXiv:2007.08199 (2020)

  151. Song, H., Kim, M., Park, D., Shin, Y., Lee, J.-G.: Robust learning by self-transition for handling noisy labels. In: KDD, pp. 1490–1500 (2021)

  152. Stonebraker, M., Ilyas, I.F.: Data integration: the current status and the way forward. IEEE Data Eng. Bull. 41(2), 3–9 (2018)

  153. Stonebraker, M., Rezig, E.K.: Machine learning and big data: what is important? IEEE Data Eng. Bull. 42, 3–7 (2019)

  154. Trusting AI. https://www.research.ibm.com/artificial-intelligence/trusted-ai/. Accessed 13 July 2022

  155. Tae, K.H., Whang, S.E.: Slice tuner: a selective data acquisition framework for accurate and fair machine learning models. In: SIGMOD, pp. 1771–1783. ACM (2021)

  156. Tarvainen, A., Valpola, H.: Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In: NIPS, pp. 1195–1204 (2017)

  157. Terrizzano, I.G., Schwarz, P.M., Roth, M., Colino, J.E.: Data wrangling: the challenging journey from the wild to the lake. In: CIDR (2015)

  158. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., Abbeel, P.: Domain randomization for transferring deep neural networks from simulation to the real world. In: IROS, pp. 23–30 (2017)

  159. Tremblay, J., Prakash, A., Acuna, D., Brophy, M., Jampani, V., Anil, C., To, T., Cameracci, E., Boochoon, S., Birchfield, S.: Training deep networks with synthetic data: bridging the reality gap by domain randomization. In: CVPR Workshops, pp. 969–977 (2018)

  160. Triguero, I., García, S., Herrera, F.: Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowl. Inf. Syst. 42(2), 245–284 (2015)

  161. Tukey, J.W.: A survey of sampling from contaminated distributions. In: Contributions to Probability and Statistics, pp. 448–485 (1960)

  162. van Buuren, S., Groothuis-Oudshoorn, K.: mice: multivariate imputation by chained equations in R. J. Stat. Softw. 45(3), 1–67 (2011)

  163. Varma, P., Ré, C.: Snuba: automating weak supervision to label training data. Proc. VLDB Endow. 12(3), 223–236 (2018)

  164. Vartak, M., Rahman, S., Madden, S., Parameswaran, A.G., Polyzotis, N.: SEEDB: efficient data-driven visualization recommendations to support visual analytics. PVLDB 8(13), 2182–2193 (2015)

  165. Venkatasubramanian, S.: Algorithmic fairness: measures, methods and representations. In: PODS, p. 481 (2019)

  166. Wang, H., Liu, B., Li, C., Yang, Y., Li, T.: Learning with noisy labels for sentence-level sentiment classification. In: EMNLP (2019)

  167. Wang, J., Liu, Y., Levy, C.: Fair classification with group-dependent label noise. In: Elish, M.C., Isaac, W., Zemel, R.S. (eds.) FAccT, pp. 526–536. ACM (2021)

  168. Wang, S., Guo, W., Narasimhan, H., Cotter, A., Gupta, M.R., Jordan, M.I.: Robust optimization for fairness with noisy protected groups. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.-F., Lin, H.-T. (eds.) NeurIPS (2020)

  169. Whang, S.E., Lee, J.-G.: Data collection and quality challenges for deep learning. Proc. VLDB Endow. 13(12), 3429–3432 (2020)

  170. Xin, D., Petersohn, D., Tang, D., Yifan, W., Gonzalez, J.E., Hellerstein, J.M., Joseph, A.D., Parameswaran, A.G.: Enhancing the interactivity of dataframe queries by leveraging think time. IEEE Data Eng. Bull. 44(1), 66–78 (2021)

  171. Xu, D., Yuan, S., Zhang, L., Wu, X.: Fairgan: fairness-aware generative adversarial networks. In: IEEE Big Data, pp. 570–575 (2018)

  172. Xu, H., Liu, X., Li, Y., Jain, A.K., Tang, J.: To be robust or to be fair: towards fairness in adversarial training. In: Meila, M., Zhang, T. (eds.) ICML, vol. 139, pp. 11492–11501. PMLR (2021)

  173. Xu, W., Evans, D., Qi, Y.: Feature squeezing: detecting adversarial examples in deep neural networks. In: NDSS (2018)

  174. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: ACL, pp. 189–196, Stroudsburg, PA, USA (1995). Association for Computational Linguistics

  175. Yun, S., Han, D., Chun, S., Oh, S.J., Yoo, Y., Choe, J.: Cutmix: regularization strategy to train strong classifiers with localizable features. In: ICCV, pp. 6022–6031 (2019)

  176. Zafar, M.B., Valera, I., Gomez-Rodriguez, M., Gummadi, K.P.: Fairness beyond disparate treatment & disparate impact: learning classification without disparate mistreatment. In: WWW, pp. 1171–1180. ACM (2017)

  177. Zafar, M.B., Valera, I., Gomez-Rodriguez, M., Gummadi, K.P.: Fairness constraints: mechanisms for fair classification. In: AISTATS, pp. 962–970 (2017)

  178. Zhang, B.H., Lemoine, B., Mitchell, M.: Mitigating unwanted biases with adversarial learning. In: AIES, pp. 335–340 (2018)

  179. Zhang, H., Chu, X., Asudeh, A., Navathe, S.B.: Omnifair: a declarative system for model-agnostic group fairness in machine learning. In: SIGMOD, pp. 2076–2088 (2021)

  180. Zhang, H., Davidson, I.: In: FAccT, pp. 138–148. ACM (2021)

  181. Zhang, H., Cissé, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: beyond empirical risk minimization. In: ICLR (2018)

  182. Zhang, J., Bareinboim, E.: Fairness in decision-making: the causal explanation formula. In: AAAI (2018)

  183. Zhang, Y., Ives, Z.G.: Finding related tables in data lakes for interactive data science. In: SIGMOD, pp. 1951–1966 (2020)

  184. Zhao, Z., De Stefani, L., Zgraggen, E., Binnig, C., Upfal, E., Kraska, T.: Controlling false discoveries during interactive data exploration. In: SIGMOD, pp. 527–540 (2017)

  185. Zhou, Y., Goldman, S.A.: Democratic co-learning. In: IEEE ICTAI, pp. 594–602 (2004)

  186. Zhou, Z.-H., Li, M.: Tri-training: exploiting unlabeled data using three classifiers. IEEE TKDE 17(11), 1529–1541 (2005)

  187. Zhu, C., Ronny Huang, W., Li, H., Taylor, G., Studer, C., Goldstein, T.: Transferable clean-label poisoning attacks on deep neural nets. In: ICML, pp. 7614–7623 (2019)

  188. Zhu, X.: Semi-supervised learning literature survey. Technical report, Computer Sciences, University of Wisconsin-Madison (2005)


Acknowledgements

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00157 and 2020-0-00862) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2018R1A5A1059921 and NRF-2022R1A2C2004382).

Author information

Corresponding author

Correspondence to Steven Euijong Whang.

Additional information


This article extends tutorials the authors delivered at the VLDB 2020 [169] and KDD 2021 [94] conferences.


About this article


Cite this article

Whang, S.E., Roh, Y., Song, H. et al. Data collection and quality challenges in deep learning: a data-centric AI perspective. The VLDB Journal 32, 791–813 (2023). https://doi.org/10.1007/s00778-022-00775-9

