
Data collection and quality challenges in deep learning: a data-centric AI perspective

  • Regular Paper
  • Published in The VLDB Journal

Abstract

Data-centric AI is at the center of a fundamental shift in software engineering, where machine learning becomes the new software, powered by big data and computing infrastructure. Software engineering thus needs to be rethought, with data becoming a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation: without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are becoming mainstream. Unfortunately, many real-world datasets are small, dirty, biased, or even poisoned. In this survey, we study the research landscape for data collection and data quality, primarily for deep learning applications. Data collection is important because recent deep learning approaches need less feature engineering but far larger amounts of data. For data quality, we study data validation, cleaning, and integration techniques. Even when the data cannot be fully cleaned, we can still cope with imperfect data during model training using robust model training techniques. In addition, while bias and fairness have been studied less in traditional data management research, they have become essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well poised to solve these problems.
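As a concrete illustration of the group fairness measures surveyed here, the sketch below computes the demographic parity difference of binary predictions with respect to a binary sensitive attribute. This is a minimal, self-contained example; the function name and toy data are illustrative and not taken from the paper.

```python
# Demographic parity difference: |P(yhat=1 | group=0) - P(yhat=1 | group=1)|.
# A value of 0 means both groups receive positive predictions at the same rate.

def demographic_parity_diff(y_pred, group):
    """y_pred: list of 0/1 predictions; group: list of 0/1 sensitive attributes."""
    def positive_rate(g):
        members = [p for p, a in zip(y_pred, group) if a == g]
        return sum(members) / len(members)
    return abs(positive_rate(0) - positive_rate(1))

# Hypothetical toy data: 8 individuals, 4 per group.
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]

print(demographic_parity_diff(y_pred, group))  # 0.75 vs 0.25 -> 0.5
```

Unfairness mitigation techniques of the kind this survey covers (pre-, in-, and post-processing) aim to drive such a gap toward zero while preserving accuracy.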



References

  1. Amazon Mechanical Turk. https://www.mturk.com/. Accessed 13 July 2022

  2. Amazon SageMaker Ground Truth. https://aws.amazon.com/sagemaker/groundtruth/. Accessed 13 July 2022

  3. Amazon scraps secret AI recruiting tool that showed bias against women. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G. Accessed 13 July 2022

  4. Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., Wallach, H.M.: A reductions approach to fair classification. In: ICML, pp. 60–69 (2018)

  5. Agrawal, P., Arya, R., Bindal, A., Bhatia, S., Gagneja, Godlewski, J., Low, Y., Muss, T., Paliwal, M.M., Raman, S., Shah, V., Shen, Sugden, L., Zhao, K., Wu, M.-C.: Data platform for machine learning. In: SIGMOD, pp. 1803–1816 (2019)

  6. Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H.C., Kamar, E., Nagappan, N., Nushi, B., Zimmermann, T.: Software engineering for machine learning: a case study. In: ICSE, pp. 291–300 (2019)

  7. Angwin, J., Larson, J., Mattu, S., Kirchner, L.: Machine bias: there's software used across the country to predict future criminals. And it's biased against blacks (2016)

  8. Anwar, S., Barnes, N.: Real image denoising with feature attention. In: CVPR, pp. 3155–3164 (2019)

  9. Asudeh, A., Jin, Z., Jagadish, H.V.: Assessing and remedying coverage for a given dataset. In: ICDE, pp. 554–565 (2019)

  10. Bach, S.H., Rodriguez, D., Liu, Y., Luo, C., Shao, H., Xia, C., Sen, S., Ratner, A., Hancock, B., Alborzi, H., Kuchhal, R., Ré, C., Malkin, R.: Snorkel Drybell: a case study in deploying weak supervision at industrial scale. In: SIGMOD, pp. 362–375 (2019)

  11. Baltrusaitis, T., Ahuja, C., Morency, L.-P.: Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 423–443 (2019)

  12. Barocas, S., Hardt, M., Narayanan, A.: Fairness and machine learning. fairmlbook.org. http://www.fairmlbook.org (2019)

  13. Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C.Y., Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., Koo, C.Y., Lew, L., Mewald, C., Modi, A.N., Polyzotis, N., Ramesh, S., Roy, S., Whang, S.E., Wicke, M., Wilkiewicz, J., Zhang, X., Zinkevich, M.: TFX: a tensorflow-based production-scale machine learning platform. In: KDD, pp. 1387–1395 (2017)

  14. Bellamy, R.K.E., Dey, K., Hind, M., et al.: AI fairness 360: an extensible toolkit for detecting and mitigating algorithmic bias. IBM J. Res. Dev. 63, 4:1-4:15 (2019)

  15. Berk, R., Heidari, H., Jabbari, S., Kearns, M., Roth, A.: Fairness in criminal justice risk assessments: the state of the art (2017)

  16. Berthelot, D., Carlini, N., Cubuk, E.D., Kurakin, A., Sohn, K., Zhang, H., Raffel, C.: Remixmatch: semi-supervised learning with distribution matching and augmentation anchoring. In: ICLR (2020)

  17. Berthelot, D., Carlini, N., Goodfellow, I.J., Papernot, N., Oliver, A., Raffel, C.: Mixmatch: a holistic approach to semi-supervised learning. In: NeurIPS, pp. 5050–5060 (2019)

  18. Biessmann, F., Golebiowski, J., Rukat, T., Lange, D., Schmidt, P.: Automated data validation in machine learning systems. IEEE Data Eng. Bull. 44(1), 51–65 (2021)

  19. Biggio, B., Corona, I., Maiorca, D., Nelson, B., Srndic, N., Laskov, P., Giacinto, G., Roli, F.: Evasion attacks against machine learning at test time. In: ECML PKDD, pp. 387–402. Springer (2013)

  20. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT, pp. 92–100. ACM, New York (1998)

  21. Boehm, M., Antonov, I., Baunsgaard, S., Dokter, M., Ginthör, R., Innerebner, K., Klezin, F., Lindstaedt, S.N., Phani, A., Rath, B., Reinwald, B., Siddiqui, S., Wrede, S.B.: Systemds: a declarative machine learning system for the end-to-end data science lifecycle. In: CIDR (2020)

  22. Breck, E., Zinkevich, M., Polyzotis, N., Whang, S., Roy, S.: Data validation for machine learning. In: MLSys (2019)

  23. Brickley, D., Burgess, M., Noy, N.F.: Google dataset search: building a search engine for datasets in an open web ecosystem. In: WWW, pp. 1365–1375 (2019)

  24. CrowdFlower Data Science Report. https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf

  25. Cafarella, M.J., Halevy, A.Y., Lee, H., Madhavan, J., Yu, C., Wang, D.Z., Wu, E.: Ten years of webtables. PVLDB 11(12), 2140–2149 (2018)

  26. Cambronero, J., Feser, J.K., Smith, M.J., Madden, S.: Query optimization for dynamic imputation. Proc. VLDB Endow. 10(11), 1310–1321 (2017)

  27. Chakraborty, A., Alam, M., Dey, V., Chattopadhyay, A., Mukhopadhyay, D.: Adversarial attacks and defences: a survey. CoRR arXiv:1810.00069 (2018)

  28. Chang, H.-S., Learned-Miller, E.G., McCallum, A.: Active bias: training more accurate neural networks by emphasizing high variance samples. In: NeurIPS, pp. 1002–1012 (2017)

  29. Che, Z., Purushotham, S., Cho, K., Sontag, D., Liu, Y.: Recurrent neural networks for multivariate time series with missing values. Nat. Sci. Rep. 8(1), 6085 (2018)

  30. Chen, A., Chow, A., Davidson, A., DCunha, A., Ghodsi, A., Hong, S.A., Konwinski, A., Mewald, C., Murching, S., Nykodym, T., Ogilvie, P., Parkhe, M., Singh, A., Xie, F., Zaharia, M., Zang, R., Zheng, J., Zumar, C.: Developments in mlflow: a system to accelerate the machine learning lifecycle. In: DEEM@SIGMOD, pp. 5:1–5:4 (2020)

  31. Chen, I.Y., Johansson, F.D., Sontag, D.A.: Why is my classifier discriminatory? In: NeurIPS, pp. 3543–3554 (2018)

  32. Chen, T., Guestrin, C.: Xgboost: a scalable tree boosting system. In: KDD, pp. 785–794 (2016)

  33. Cheng, Y., Diakonikolas, I., Ge, R.: High-dimensional robust mean estimation in nearly-linear time. In: SODA, pp. 2755–2771. SIAM (2019)

  34. Choi, K., Grover, A., Singh, T., Shu, R., Ermon, S.: Fair generative modeling via weak supervision. In: ICML, pp. 1887–1898 (2020)

  35. Chouldechova, A.: Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5(2), 153–163 (2017)

  36. Chouldechova, A., Roth, A.: A snapshot of the frontiers of fairness in machine learning. Commun. ACM 63(5), 82–89 (2020)

  37. Chzhen, E., Denis, C., Hebiri, M., Oneto, L., Pontil, M.: Leveraging labeled and unlabeled data for consistent fair binary classification. In: NeurIPS, pp. 12739–12750 (2019)

  38. Cotter, A., Jiang, H., Sridharan, K.: Two-player games for efficient non-convex constrained optimization. In: ALT, pp. 300–332 (2019)

  39. Cretu, G.F., Stavrou, A., Locasto, M.E., Stolfo, S.J., Keromytis, A.D.: Casting out demons: sanitizing training data for anomaly sensors. In: IEEE S&P, pp. 81–95 (2008)

  40. Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: Autoaugment: learning augmentation strategies from data. In: CVPR, pp. 113–123 (2019)

  41. Data age 2025. https://www.seagate.com/our-story/data-age-2025/

  42. Data-centric AI resource hub. https://datacentricai.org/

  43. Data prep still dominates data scientists’ time, survey finds. https://www.datanami.com/2020/07/06/data-prep-still-dominates-data-scientists-time-survey-finds/

  44. Diakonikolas, I., Kamath, G., Kane, D., Li, J., Moitra, A., Stewart, A.: Robust estimators in high-dimensions without the computational intractability. SIAM J. Comput. 48(2), 742–864 (2019)

  45. Dieterich, W., Mendoza, C., Brennan, T.: Compas risk scales: demonstrating accuracy equity and predictive parity. Technical report, Northpointe Inc. (2016)

  46. Doan, A., Halevy, A.Y., Ives, Z.G.: Principles of Data Integration. Morgan Kaufmann, Burlington (2012)

  47. Dolatshah, M., Teoh, M., Wang, J., Pei, J.: Cleaning crowdsourced labels using oracles for statistical classification. PVLDB 12(4), 376–389 (2018)

  48. Dong, X.L., Rekatsinas, T.: Data integration and machine learning: a natural synergy. In: KDD, pp. 3193–3194 (2019)

  49. Dreves, M., Huang, G., Peng, Z., Polyzotis, N., Rosen, E., Paul Suganthan, G.C.: Validating data and models in continuous ML pipelines. IEEE Data Eng. Bull. 44(1), 42–50 (2021)

  50. Dua, D., Graff, C.: UCI machine learning repository (2017)

  51. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.S.: Fairness through awareness. In: ITCS, pp. 214–226 (2012)

  52. Facets—visualization for ML datasets. https://pair-code.github.io/facets/. Accessed 13 July 2022

  53. Feldman, M., Friedler, S.A., Moeller, J., Scheidegger, C., Venkatasubramanian, S.: Certifying and removing disparate impact. In: KDD, pp. 259–268 (2015)

  54. Fernandez, R.C., Abedjan, Z., Koko, F., Yuan, G., Madden, S., Stonebraker, M.: Aurum: a data discovery system. In: ICDE, pp. 1001–1012 (2018)

  55. Foster, D.P., Stine, R.A.: Alpha-investing: a procedure for sequential control of expected false discoveries. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 70(2), 429–444 (2008)

  56. GCP AI platform data labeling service. https://cloud.google.com/ai-platform/data-labeling/docs. Accessed 13 July 2022

  57. Google apologises for Photos app’s racist blunder. https://www.bbc.com/news/technology-33347866. Accessed 13 July 2022

  58. Goel, K., Albert, G., Li, Y., Ré, C.: Model patching: closing the subgroup performance gap with data augmentation. In: ICLR (2021)

  59. Goodfellow, I.J.: NIPS 2016 tutorial: generative adversarial networks. CoRR arXiv:1701.00160 (2017)

  60. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., Bengio, Y.: Generative adversarial nets. In: NIPS, pp. 2672–2680 (2014)

  61. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: ICLR (2015)

  62. Gordon, J.: Introducing tensorflow hub: a library for reusable machine learning modules in tensorflow (2018)

  63. Grafberger, S., Stoyanovich, J., Schelter, S.: Lightweight inspection of data preprocessing in native machine learning pipelines. In: CIDR (2021)

  64. Halevy, A.Y., Korn, F., Noy, N.F., Olston, C., Polyzotis, N., Roy, S., Whang, S.E.: Goods: organizing Google’s datasets. In: SIGMOD, pp. 795–806 (2016)

  65. Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I.W., Sugiyama, M.: Co-teaching: robust training of deep neural networks with extremely noisy labels. In: NeurIPS, pp. 8536–8546 (2018)

  66. Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. In: NIPS, pp. 3315–3323 (2016)

  67. Hashimoto, T.B., Srivastava, M., Namkoong, H., Liang, P.: Fairness without demographics in repeated loss minimization. In: Dy, J.G., Krause, A. (eds.) ICML, vol. 80, pp. 1934–1943. PMLR (2018)

  68. Hazelwood, K.M., Bird, S., Brooks, D.M., Chintala, S., Diril, U., Dzhulgakov, D., Fawzy, M., Jia, B., Jia, Y., Kalro, A., Law, J., Lee, K., Lu, J., Noordhuis, P., Smelyanskiy, M., Xiong, L., Wang, X.: Applied machine learning at Facebook: a datacenter infrastructure perspective. In: HPCA, pp. 620–629 (2018)

  69. Hendrycks, D., Mu, N., Cubuk, E.D., Zoph, B., Gilmer, J., Lakshminarayanan, B.: Augmix: a simple data processing method to improve robustness and uncertainty. In: ICLR (2020)

  70. Heo, G., Roh, Y., Hwang, S., Lee, D., Whang, S.E.: Inspector gadget: a data programming-based labeling system for industrial images. In: PVLDB (2021)

  71. Hermann, J., Del Balso, M.: Meet michelangelo: Uber's machine learning platform (2017)

  72. Hodge, V.J., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22(2), 85–126 (2004)

  73. Huber, P.J.: Robust estimation of a location parameter. In: Kotz, S., Johnson, N.L. (eds.) Breakthroughs in Statistics, pp. 492–518. Springer, Berlin (1992)

  74. Ilyas, I.F., Chu, X.: Data Cleaning. ACM, New York (2019)

  75. Ilyas, I.F., Rekatsinas, T.: Machine learning and data cleaning: Which serves the other? J. Data Inf. Qual. (2021). Just Accepted

  76. Iosifidis, V., Ntoutsi, E.: Adafair: cumulative fairness adaptive boosting. In: CIKM, pp. 781–790 (2019)

  77. Jiang, H., Nachum, O.: Identifying and correcting label bias in machine learning. In: AISTATS, pp. 702–712 (2020)

  78. Jiang, L., Zhou, Z., Leung, T., Li, L.-J., Fei-Fei, L.: Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels. In: ICML, pp. 2309–2318 (2018)

  79. Kaggle. https://www.kaggle.com

  80. Kamiran, F., Calders, T.: Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst. 33(1), 1–33 (2011)

  81. Kamishima, T., Akaho, S., Asoh, H., Sakuma, J.: Fairness-aware classifier with prejudice remover regularizer. In: ECML PKDD, pp. 35–50 (2012)

  82. Karlas, B., Li, P., Wu, R., Gürel, N.M., Chu, X., Wu, W., Zhang, C.: Nearest neighbor classifiers over incomplete information: from certain answers to certain predictions. Proc. VLDB Endow. 14(3), 255–267 (2020)

  83. Khademi, A., Lee, S., Foley, D., Honavar, V.: Fairness in algorithmic decision making: an excursion through the lens of causality. In: WWW, pp. 2907–2914 (2019)

  84. Khani, F., Liang, P.: Removing spurious features can hurt accuracy and affect groups disproportionately. In: FAccT, pp. 196–205. ACM (2021)

  85. Kilbertus, N., Rojas-Carulla, M., Parascandolo, G., Hardt, M., Janzing, D., Schölkopf, B.: Avoiding discrimination through causal reasoning. In: NeurIPS, pp. 656–666 (2017)

  86. Kim, H., Lee, K., Hwang, G., Suh, C.: Crash to not crash: learn to identify dangerous vehicles using a simulator. In: AAAI, pp. 978–985 (2019)

  87. Koh, P.W., Steinhardt, J., Liang, P.: Stronger data poisoning attacks break data sanitization defenses. CoRR arXiv:1811.00741 (2018)

  88. Krishnan, S., Wang, J., Wu, E., Franklin, M.J., Goldberg, K.: Activeclean: interactive data cleaning for statistical modeling. PVLDB 9(12), 948–959 (2016)

  89. Kurach, K., Lucic, M., Zhai, X., Michalski, M., Gelly, S.: The GAN landscape: losses, architectures, regularization, and normalization. CoRR arXiv:1807.04720 (2018)

  90. Kusner, M.J., Loftus, J., Russell, C., Silva, R.: Counterfactual fairness. In: NeurIPS, pp. 4066–4076 (2017)

  91. Lahoti, P., Beutel, A., Chen, J., Lee, K., Prost, F., Thain, N., Wang, X., Chi, E.: Fairness without demographics through adversarially reweighted learning. In: NeurIPS (2020)

  92. Lamy, A.L., Zhong, Z.: Noise-tolerant fair classification. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) NeurIPS, pp. 294–305 (2019)

  93. Lee, D.J.L., Parameswaran, A.G.: The case for a visual discovery assistant: a holistic solution for accelerating visual data exploration. IEEE Data Eng. Bull. 41(3), 3–14 (2018)

  94. Lee, J.-G., Roh, Y., Song, H., Whang, S.E.: Machine learning robustness, fairness, and their convergence. In: KDD, pp. 4046–4047 (2021)

  95. Li, J., Socher, R., Hoi, S.C.H.: Dividemix: learning with noisy labels as semi-supervised learning. In: ICLR (2020)

  96. Li, P., Rao, X., Blase, J., Zhang, Y., Chu, X., Zhang, C.: CleanML: a benchmark for joint data cleaning and machine learning [experiments and analysis]. In: ICDE (2021)

  97. Liu, Z., Park, J.H., Rekatsinas, T., Tzamos, C.: On robust mean estimation under coordinate-level corruption. In: ICML, pp. 6914–6924. PMLR (2021)

  98. Liu, Z., Park, J., Rekatsinas, T., Tzamos, C.: On robust mean estimation under coordinate-level corruption. In: ICML, pp. 6914–6924 (2021)

  99. Malach, E., Shalev-Shwartz, S.: Decoupling “when to update” from “how to update”. In: NIPS, pp. 960–970 (2017)

  100. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on bias and fairness in machine learning. CoRR arXiv:1908.09635 (2019)

  101. Melgar, L.A., Dao, D., Gan, S., Gürel, N.M., Hollenstein, N., Jiang, J., Karlas, B., Lemmin, T., Li, T., Li, Y., Rao, X., Rausch, J., Renggli, C., Rimanic, L., Weber, M., Zhang, S., Zhao, Z., Schawinski, K., Wu, W., Zhang, C.: Ease.ml: a lifecycle management system for machine learning. In: CIDR (2021)

  102. Meng, D., Chen, H.: Magnet: a two-pronged defense against adversarial examples. In: Thuraisingham, B.M., Evans, D., Malkin, T., Xu, D. (eds.) ACM SIGSAC, pp. 135–147 (2017)

  103. Metzen, J.H., Genewein, T., Fischer, V., Bischoff, B.: On detecting adversarial perturbations. In: ICLR (2017)

  104. Miller, R.J., Nargesian, F., Zhu, E., Christodoulakis, C., Pu, K.Q., Andritsos, P.: Making open data transparent: data discovery on open data. IEEE Data Eng. Bull. 41(2), 59–70 (2018)

  105. Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: Su, K.-Y., Su, J., Wiebe, J. (eds.) ACL, pp. 1003–1011 (2009)

  106. Nabi, R., Shpitser, I.: Fair inference on outcomes. In: AAAI, pp. 1931–1940 (2018)

  107. Neutatz, F., Chen, B., Abedjan, Z., Wu, E.: From cleaning before ML to cleaning for ML. IEEE Data Eng. Bull. 44(1), 24–41 (2021)

  108. Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV, pp. 69–84 (2016)

  109. Principles for AI ethics. https://research.samsung.com/artificial-intelligence. Accessed 13 July 2022

  110. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE TKDE 22(10), 1345–1359 (2010)

  111. Papernot, N., McDaniel, P.D., Wu, X., Jha, S., Swami, A.: Distillation as a defense to adversarial perturbations against deep neural networks. In: IEEE SP, pp. 582–597 (2016)

  112. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: an imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) NeurIPS, pp. 8024–8035. Curran Associates, Inc. (2019)

  113. Patrini, G., Rozza, A., Menon, A.K., Nock, R., Qu, L.: Making deep neural networks robust to label noise: a loss correction approach. In: CVPR, pp. 2233–2241 (2017)

  114. Paudice, A., Muñoz-González, L., György, A., Lupu, E.C.: Detection of adversarial training examples in poisoning attacks through anomaly detection. CoRR arXiv:1802.03041 (2018)

  115. Pelekis, N., Ntrigkogias, C., Tampakis, P., Sideridis, S., Theodoridis, Y.: Hermoupolis: a trajectory generator for simulating generalized mobility patterns. In: ECML PKDD, pp. 659–662 (2013)

  116. Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J.M., Weinberger, K.Q.: On fairness and calibration. In: NIPS, pp. 5680–5689 (2017)

  117. Polyzotis, N., Roy, S., Whang, S.E., Zinkevich, M.: Data management challenges in production machine learning. In: SIGMOD, pp. 1723–1726 (2017)

  118. Polyzotis, N., Roy, S., Whang, S.E., Zinkevich, M.: Data lifecycle challenges in production machine learning: a survey. SIGMOD Rec. 47(2), 17–28 (2018)

  119. Qayyum, A., Qadir, J., Bilal, M., Al-Fuqaha, A.: Secure and robust machine learning for healthcare: a survey. IEEE Rev. Biomed. Eng. 14, 156–180 (2020)

  120. Responsible AI practices. https://ai.google/responsibilities/responsible-ai-practices. Accessed 13 July 2022

  121. Responsible AI principles from Microsoft. https://www.microsoft.com/en-us/ai/responsible-ai. Accessed 13 July 2022

  122. Ratner, A., Bach, S.H., Ehrenberg, H., Fries, J., Wu, S., Ré, C.: Snorkel: rapid training data creation with weak supervision. PVLDB 11(3), 269–282 (2017)

  123. Ratner, A., Bach, S.H., Ehrenberg, H.R., Fries, J.A., Wu, S., Ré, C.: Snorkel: rapid training data creation with weak supervision. VLDB J. 29(2–3), 709–730 (2020)

  124. Ratner, A.J., Ehrenberg, H.R., Hussain, Z., Dunnmon, J., Ré, C.: Learning to compose domain-specific transformations for data augmentation. In: NIPS, pp. 3239–3249 (2017)

  125. Redyuk, S., Kaoudi, Z., Markl, V., Schelter, S.: Automating data quality validation for dynamic data ingestion. In: EDBT, pp. 61–72 (2021)

  126. Reed, S.E., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., Rabinovich, A.: Training deep neural networks on noisy labels with bootstrapping. In: ICLR (2015)

  127. Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: Holoclean: holistic data repairs with probabilistic inference. PVLDB 10(11), 1190–1201 (2017)

  128. Renggli, C., Rimanic, L., Gürel, N.M., Karlas, B., Wu, W., Zhang, C.: A data quality-driven view of mlops. IEEE Data Eng. Bull. 44(1), 11–23 (2021)

  129. Ricci, F., Rokach, L., Shapira, B. (eds.): Recommender Systems Handbook. Springer, Berlin (2015)

  130. Roh, Y., Heo, G., Whang, S.E.: A survey on data collection for machine learning: a big data-AI integration perspective. IEEE TKDE (2019)

  131. Roh, Y., Lee, K., Whang, S.E., Suh, C.: FR-Train: a mutual information-based approach to fair and robust training. In: ICML (2020)

  132. Roh, Y., Lee, K., Whang, S.E., Suh, C.: Fairbatch: batch selection for model fairness. In: ICLR. OpenReview.net (2021)

  133. Roh, Y., Lee, K., Whang, S.E., Suh, C.: Sample selection for fair and robust training. In: NeurIPS (2021)

  134. Software 2.0. https://medium.com/@karpathy/software-2-0-a64152b37c35

  135. South Korean AI chatbot pulled from Facebook after hate speech towards minorities. https://www.theguardian.com/world/2021/jan/14/time-to-properly-socialise-hate-speech-ai-chatbot-pulled-from-facebook. Accessed 13 July 2022

  136. Salimi, B., Rodriguez, L., Howe, B., Suciu, D.: Interventional fairness: causal database repair for algorithmic fairness. In: SIGMOD, pp. 793–810 (2019)

  137. Schelter, S., Böse, J.-H., Kirschnick, J., Klein, T., Seufert, S.: Automatically tracking metadata and provenance of machine learning experiments. In: Workshop on ML Systems at NIPS (2017)

  138. Schelter, S., Grafberger, S., Schmidt, P., Rukat, T., Kießling, M., Taptunov, A., Bießmann, F., Lange, D.: Differential data quality verification on partitioned data. In: ICDE, pp. 1940–1945 (2019)

  139. Schelter, S., Lange, D., Schmidt, P., Celikel, M., Bießmann, F., Grafberger, A.: Automating large-scale data quality verification. Proc. VLDB Endow. 11(12), 1781–1794 (2018)

  140. Schelter, S., Rukat, T., Biessmann, F.: JENGA: a framework to study the impact of data errors on the predictions of machine learning models. In: EDBT, pp. 529–534 (2021)

  141. Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., Dennison, D.: Hidden technical debt in machine learning systems. In: NIPS, pp. 2503–2511 (2015)

  142. Settles, B.: Active learning. In: Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers (2012)

  143. Shafahi, A., Huang, W.R., Najibi, M., Suciu, O., Studer, C., Dumitras, T., Goldstein, T.: Poison frogs! Targeted clean-label poisoning attacks on neural networks. In: NeurIPS, pp. 6106–6116 (2018)

  144. Shang, L.: Denoising natural images based on a modified sparse coding algorithm. Appl. Math. Comput. 205(2), 883–889 (2008)

  145. Shen, Z., Liu, J., He, Y., Zhang, X., Xu, R., Yu, H., Cui, P.: Towards out-of-distribution generalization: a survey. arXiv:2108.13624 (2021)

  146. Sheng, V.S., Provost, F.J., Ipeirotis, P.G.: Get another label? Improving data quality and data mining using multiple, noisy labelers. In: KDD, pp. 614–622 (2008)

  147. Sinha, A., Namkoong, H., Duchi, J.C.: Certifying some distributional robustness with principled adversarial training. In: ICLR (2018)

  148. Solans, D., Biggio, B., Castillo, C.: Poisoning attacks on algorithmic fairness. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds.) ECML PKDD, vol. 12457, pp. 162–177. Springer (2020)

  149. Song, H., Kim, M., Lee, J.-G.: SELFIE: refurbishing unclean samples for robust deep learning. In: ICML, pp. 5907–5915 (2019)

  150. Song, H., Kim, M., Park, D., Lee, J.-G.: Learning from noisy labels with deep neural networks: a survey. CoRR arXiv:2007.08199 (2020)

  151. Song, H., Kim, M., Park, D., Shin, Y., Lee, J.-G.: Robust learning by self-transition for handling noisy labels. In: KDD, pp. 1490–1500 (2021)

  152. Stonebraker, M., Ilyas, I.F.: Data integration: the current status and the way forward. IEEE Data Eng. Bull. 41(2), 3–9 (2018)

  153. Stonebraker, M., Rezig, E.K.: Machine learning and big data: what is important? IEEE Data Eng. Bull. 42, 3–7 (2019)

  154. Trusting AI. https://www.research.ibm.com/artificial-intelligence/trusted-ai/. Accessed 13 July 2022

  155. Tae, K.H., Whang, S.E.: Slice tuner: a selective data acquisition framework for accurate and fair machine learning models. In: SIGMOD, pp. 1771–1783. ACM (2021)

  156. Tarvainen, A., Valpola, H.: Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In: NIPS, pp. 1195–1204 (2017)

  157. Terrizzano, I.G., Schwarz, P.M., Roth, M., Colino, J.E.: Data wrangling: the challenging journey from the wild to the lake. In: CIDR (2015)

  158. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., Abbeel, P.: Domain randomization for transferring deep neural networks from simulation to the real world. In: IROS, pp. 23–30 (2017)

  159. Tremblay, J., Prakash, A., Acuna, D., Brophy, M., Jampani, V., Anil, C., To, T., Cameracci, E., Boochoon, S., Birchfield, S.: Training deep networks with synthetic data: bridging the reality gap by domain randomization. In: CVPR Workshops, pp. 969–977 (2018)

  160. Triguero, I., García, S., Herrera, F.: Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowl. Inf. Syst. 42(2), 245–284 (2015)

  161. Tukey, J.W.: A survey of sampling from contaminated distributions. In: Contributions to Probability and Statistics, pp. 448–485 (1960)

  162. van Buuren, S., Groothuis-Oudshoorn, K.: mice: multivariate imputation by chained equations in R. J. Stat. Softw. 45(3), 1–67 (2011)

  163. Varma, P., Ré, C.: Snuba: automating weak supervision to label training data. Proc. VLDB Endow. 12(3), 223–236 (2018)

  164. Vartak, M., Rahman, S., Madden, S., Parameswaran, A.G., Polyzotis, N.: SEEDB: efficient data-driven visualization recommendations to support visual analytics. PVLDB 8(13), 2182–2193 (2015)

  165. Venkatasubramanian, S.: Algorithmic fairness: measures, methods and representations. In: PODS, p. 481 (2019)

  166. Wang, H., Liu, B., Li, C., Yang, Y., Li, T.: Learning with noisy labels for sentence-level sentiment classification. In: EMNLP (2019)

  167. Wang, J., Liu, Y., Levy, C.: Fair classification with group-dependent label noise. In: Elish, M.C., Isaac, W., Zemel, R.S. (eds.) FAccT, pp. 526–536. ACM (2021)

  168. Wang, S., Guo, W., Narasimhan, H., Cotter, A., Gupta, M.R., Jordan, M.I.: Robust optimization for fairness with noisy protected groups. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.-F., Lin, H.-T. (eds.) NeurIPS (2020)

  169. Whang, S.E., Lee, J.-G.: Data collection and quality challenges for deep learning. Proc. VLDB Endow. 13(12), 3429–3432 (2020)

  170. Xin, D., Petersohn, D., Tang, D., Yifan, W., Gonzalez, J.E., Hellerstein, J.M., Joseph, A.D., Parameswaran, A.G.: Enhancing the interactivity of dataframe queries by leveraging think time. IEEE Data Eng. Bull. 44(1), 66–78 (2021)

  171. Xu, D., Yuan, S., Zhang, L., Wu, X.: Fairgan: fairness-aware generative adversarial networks. In: IEEE Big Data, pp. 570–575 (2018)

  172. Xu, H., Liu, X., Li, Y., Jain, A.K., Tang, J.: To be robust or to be fair: towards fairness in adversarial training. In: Meila, M., Zhang, T. (eds.) ICML, vol. 139, pp. 11492–11501. PMLR (2021)

  173. Xu, W., Evans, D., Qi, Y.: Feature squeezing: detecting adversarial examples in deep neural networks. In: NDSS (2018)

  174. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: ACL, pp. 189–196, Stroudsburg, PA, USA (1995). Association for Computational Linguistics

  175. Yun, S., Han, D., Chun, S., Oh, S.J., Yoo, Y., Choe, J.: Cutmix: regularization strategy to train strong classifiers with localizable features. In: ICCV, pp. 6022–6031 (2019)

  176. Zafar, M.B., Valera, I., Gomez-Rodriguez, M., Gummadi, K.P.: Fairness beyond disparate treatment & disparate impact: learning classification without disparate mistreatment. In: WWW, pp. 1171–1180. ACM (2017)

  177. Zafar, M.B., Valera, I., Gomez-Rodriguez, M., Gummadi, K.P.: Fairness constraints: mechanisms for fair classification. In: AISTATS, pp. 962–970 (2017)

  178. Zhang, B.H., Lemoine, B., Mitchell, M.: Mitigating unwanted biases with adversarial learning. In: AIES, pp. 335–340 (2018)

  179. Zhang, H., Chu, X., Asudeh, A., Navathe, S.B.: Omnifair: a declarative system for model-agnostic group fairness in machine learning. In: SIGMOD, pp. 2076–2088 (2021)

  180. Zhang, H., Davidson, I.: In: FAccT, pp. 138–148. ACM (2021)

  181. Zhang, H., Cissé, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: beyond empirical risk minimization. In: ICLR (2018)

  182. Zhang, J., Bareinboim, E.: Fairness in decision-making: the causal explanation formula. In: AAAI (2018)

  183. Zhang, Y., Ives, Z.G.: Finding related tables in data lakes for interactive data science. In: SIGMOD, pp. 1951–1966 (2020)

  184. Zhao, Z., De Stefani, L., Zgraggen, E., Binnig, C., Upfal, E., Kraska, T.: Controlling false discoveries during interactive data exploration. In: SIGMOD, pp. 527–540 (2017)

  185. Zhou, Y., Goldman, S.A.: Democratic co-learning. In: IEEE ICTAI, pp. 594–602 (2004)

  186. Zhou, Z.-H., Li, M.: Tri-training: exploiting unlabeled data using three classifiers. IEEE TKDE 17(11), 1529–1541 (2005)

  187. Zhu, C., Ronny Huang, W., Li, H., Taylor, G., Studer, C., Goldstein, T.: Transferable clean-label poisoning attacks on deep neural nets. In: ICML, pp. 7614–7623 (2019)

  188. Zhu, X.: Semi-supervised learning literature survey. Technical report, Computer Sciences, University of Wisconsin-Madison (2005)


Acknowledgements

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00157 and 2020-0-00862) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2018R1A5A1059921 and NRF-2022R1A2C2004382).

Author information

Corresponding author

Correspondence to Steven Euijong Whang.

Additional information


This article extends tutorials the authors delivered at the VLDB 2020 [169] and KDD 2021 [94] conferences.


About this article


Cite this article

Whang, S.E., Roh, Y., Song, H. et al. Data collection and quality challenges in deep learning: a data-centric AI perspective. The VLDB Journal 32, 791–813 (2023). https://doi.org/10.1007/s00778-022-00775-9

