Completeness of Datasets Documentation on ML/AI Repositories: An Empirical Investigation

  • Conference paper
Progress in Artificial Intelligence (EPIA 2023)

Abstract

ML/AI is arguably the field of computer science and computer engineering that has received the most attention and funding over the last decade. Data is the key element of ML/AI, so it is becoming increasingly important to ensure that users are fully aware of the quality of the datasets they use, and of the process that generated them, so that possible negative downstream impacts can be tracked, analysed, and, where possible, mitigated. One tool that can help in this respect is dataset documentation. The aim of this work is to investigate the state of dataset documentation practices by measuring the completeness of the documentation of several popular datasets in ML/AI repositories. We created a dataset documentation schema, the Documentation Test Sheet (dts), that identifies the information that should always be attached to a dataset (to ensure proper dataset choice and informed use), according to relevant studies in the literature. We checked 100 popular datasets from four different repositories against the dts to investigate which information was present. Overall, we observed a lack of relevant documentation, especially about the context of data collection and data processing, highlighting a paucity of transparency.
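As a rough illustration of what measuring documentation completeness against a checklist can look like, the sketch below scores each dataset by the share of checklist fields that are filled in, and reports how often each field is documented across datasets. It is not the authors' implementation: the field names, the example records, and the scoring rule (unweighted fraction of non-empty fields) are assumptions made for illustration only, and the actual dts fields are those listed in the paper's appendices.

```python
# Hypothetical checklist; the real dts fields are defined in the paper's Appendix A.
CHECKLIST = [
    "description",
    "collection_context",
    "processing_steps",
    "license",
    "maintainer",
]


def is_present(doc, field):
    """A field counts as documented if it holds non-blank text."""
    value = doc.get(field) or ""
    return bool(str(value).strip())


def completeness(doc, checklist=CHECKLIST):
    """Fraction of checklist fields documented for a single dataset."""
    return sum(is_present(doc, f) for f in checklist) / len(checklist)


def field_coverage(docs, checklist=CHECKLIST):
    """For each field, the fraction of datasets that document it."""
    n = len(docs)
    return {f: sum(is_present(d, f) for d in docs.values()) / n for f in checklist}


# Made-up documentation records for two hypothetical datasets.
docs = {
    "dataset_a": {"description": "Handwritten digits", "license": "CC BY 4.0"},
    "dataset_b": {
        "description": "Product reviews",
        "collection_context": "Scraped from a public forum",
        "processing_steps": "   ",  # whitespace only, treated as missing
        "license": "MIT",
        "maintainer": "Example Lab",
    },
}

scores = {name: completeness(doc) for name, doc in docs.items()}
coverage = field_coverage(docs)
```

Under these assumptions, `scores` shows which datasets are poorly documented overall, while `coverage` shows which kinds of information tend to be missing across a repository.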

Notes

  1. The appendices available at https://doi.org/10.5281/zenodo.8052683 contain: the dts (A), the provenance of the fields of information composing it (B), the metadata of the selected datasets (C), the reading principles that guided the documentation investigation (D), the raw results (E) and additional tables and figures (F).

  2. For reasons of space, summary tables with raw data are presented in Appendix E.

  3. https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research

  4. Due to the unavailability of direct download count APIs, datasets were sorted by upvotes via APIs and then sorted by download count, as presented in the results.

Acknowledgements

This study was carried out within the FAIR - Future Artificial Intelligence Research and received funding from the European Union Next-GenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) - MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 - D.D. 1555 11/10/2022, PE00000013). This manuscript reflects only the authors’ views and opinions; neither the European Union nor the European Commission can be considered responsible for them.

Author information

Corresponding author

Correspondence to Marco Rondina.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Rondina, M., Vetrò, A., De Martin, J.C. (2023). Completeness of Datasets Documentation on ML/AI Repositories: An Empirical Investigation. In: Moniz, N., Vale, Z., Cascalho, J., Silva, C., Sebastião, R. (eds) Progress in Artificial Intelligence. EPIA 2023. Lecture Notes in Computer Science, vol 14115. Springer, Cham. https://doi.org/10.1007/978-3-031-49008-8_7

  • DOI: https://doi.org/10.1007/978-3-031-49008-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-49007-1

  • Online ISBN: 978-3-031-49008-8

  • eBook Packages: Computer Science, Computer Science (R0)
