Data distribution tailoring revisited: cost-efficient integration of representative data

  • Regular Paper
The VLDB Journal

Abstract

Data scientists often build data sets for analysis by drawing on available data sources. A major challenge is ensuring that the resulting data set adequately represents the relevant demographic groups or other variables of interest. Whether data is obtained from an experiment or from a data provider, a single source may not meet the desired distribution requirements, so combining data from multiple sources is often necessary. The data distribution tailoring (DT) problem aims to cost-efficiently collect a unified data set from multiple sources. In this paper, we present major optimizations and generalizations of previous algorithms for this problem. When the group distributions of the sources are known, we present a novel algorithm, RatioColl, that outperforms the existing algorithm, which is based on the coupon collector’s problem. When the distributions are unknown, we propose multi-armed bandit algorithms with a decaying exploration rate that, unlike the existing algorithm for unknown DT, do not require prior information. Through theoretical analysis and extensive experiments, we demonstrate the effectiveness of our proposed algorithms.
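
To make the unknown-distribution setting concrete, the following is a minimal illustrative sketch of an epsilon-greedy bandit with a decaying exploration rate that picks which source to sample next. It is not the paper's algorithm: the function name `collect`, the reward definition (useful samples per unit cost), and the decay schedule `c / sqrt(t)` are all assumptions chosen for illustration, and the sources are simulated as fixed group distributions that the learner cannot see.

```python
import random


def collect(sources, costs, need, c=1.0, seed=0):
    """Hypothetical sketch: sample sources until every group quota is met.

    sources: list of dicts mapping group -> probability (hidden from the learner)
    costs:   cost of drawing one sample from each source
    need:    dict mapping group -> number of samples still required
    """
    rng = random.Random(seed)
    remaining = dict(need)
    pulls = [0] * len(sources)   # samples drawn from each source
    useful = [0] * len(sources)  # samples that filled an open quota
    total_cost = 0.0
    t = 0
    while any(v > 0 for v in remaining.values()):
        t += 1
        eps = min(1.0, c / t ** 0.5)  # assumed decay schedule
        if rng.random() < eps:
            # Explore: pick a source uniformly at random.
            i = rng.randrange(len(sources))
        else:
            # Exploit: best observed useful-samples-per-cost; try unpulled sources first.
            i = max(
                range(len(sources)),
                key=lambda j: float("inf") if pulls[j] == 0
                else useful[j] / (pulls[j] * costs[j]),
            )
        groups, probs = zip(*sources[i].items())
        g = rng.choices(groups, weights=probs)[0]  # simulate one draw
        pulls[i] += 1
        total_cost += costs[i]
        if remaining.get(g, 0) > 0:
            remaining[g] -= 1
            useful[i] += 1
    return total_cost, pulls


if __name__ == "__main__":
    srcs = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
    print(collect(srcs, costs=[1.0, 1.0], need={"a": 5, "b": 5}))
```

The sketch needs no prior estimate of the source distributions, which is the property the abstract highlights. Its reward estimate is deliberately simple: it does not discount a source whose useful groups have already been filled, so it should be read only as a picture of the explore/exploit trade-off.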




Author information

Corresponding author

Correspondence to Jiwon Chang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research is supported in part by NSF 1741022, 2107290, 1934565, 2107050, the Google research scholar award, and the Schwartz Discover Grant.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chang, J., Cui, B., Nargesian, F. et al. Data distribution tailoring revisited: cost-efficient integration of representative data. The VLDB Journal (2024). https://doi.org/10.1007/s00778-024-00849-w

  • DOI: https://doi.org/10.1007/s00778-024-00849-w
