Selecting third-party libraries: the data scientist’s perspective

Nadi, Sarah; Sakr, Nourhan

doi:10.1007/s10664-022-10241-3

Published: 07 December 2022

Volume 28, article number 15, (2023)

Empirical Software Engineering

Abstract

With the increased reliance on data-driven decisions and software services, data scientists are becoming an integral part of many software teams and enterprise operations. To perform their tasks, data scientists rely on various third-party libraries (e.g., pandas in Python for data wrangling or ggplot in R for data visualization). Selecting the right library to use is often a difficult task, with many factors influencing this selection. Although much research has examined the factors that software developers take into account when selecting a library, it is unclear whether these factors influence data scientists’ library selection in the same way, especially given several differences between the two groups. To address this gap, we replicate a recent survey of library selection factors, but target data scientists instead of software developers. Our survey of 90 participants shows that data scientists consider several factors when selecting libraries to use, with technical factors such as the usability of the library, fit for purpose, and documentation being the three most influential factors. Additionally, we find that there are 11 factors that data scientists rate differently than software developers. For example, data scientists are influenced more by the collective experience of the community but less by the library’s security or license. We also uncover new factors that influence data scientists’ library selection, such as the statistical rigor of the library. We triangulate our survey results with feedback from five focus groups involving 18 additional data science experts with various roles, whose input allows us to further interpret our survey results. We discuss the implications of our findings for data science library maintainers as well as researchers who want to design recommender and/or comparison systems that help data scientists with library selection.

Notes

The layout of the survey sometimes combines questions from different categories to improve its flow. For example, we ask participants about their current role at the beginning of the factor ratings to contextualize the information, while we keep all optional demographic questions at the end. The exact survey we use is available on our artifact page (https://doi.org/10.6084/m9.figshare.16563885.v1).
https://www.linkedin.com/help/linkedin/answer/1584/inmail-messages?lang=en
Thanks to the authors for releasing their raw rating data (Larios Vargas et al. 2020a), which allowed us to reproduce their results and enabled a direct distribution comparison.

References

Abdalkareem R, Nourry O, Wehaibi S, Mujahid S, Shihab E (2017) Why do developers use trivial packages? an empirical case study on npm. In: Proceedings of the 11th joint meeting on foundations of software engineering, ser. ESEC/FSE 2017. Association for Computing Machinery, New York, pp 385–395. https://doi.org/10.1145/3106237.3106267
Biswas S, Wardat M, Rajan H (2021) The art and practice of data science pipelines: a comprehensive study of data science pipelines in theory, in-the-small, and in-the-large. arXiv:2112.01590
Czerwonka J, Nagappan N, Schulte W, Murphy B (2013) Codemine: building a software development data analytics platform at microsoft. IEEE Softw 30(4):64–71
De La Mora FL, Nadi S (2018a) An empirical study of metric-based comparisons of software libraries. In: Proceedings of the 14th international conference on predictive models and data analytics in software engineering, ser. PROMISE’18. Association for Computing Machinery, New York, pp 22–31. https://doi.org/10.1145/3273934.3273937
De La Mora FL, Nadi S (2018b) Which library should i use?: A metric-based comparison of software libraries. In: Proceedings of the 40th IEEE/ACM international conference on software engineering: new ideas and emerging technologies results (ICSE-NIER), pp 37–40
Dong H, Zhou S, Guo J, Kästner C (2021) Splitting, renaming, removing: a study of common cleaning activities in jupyter notebooks. In: Proceedings of the 9th international workshop on realizing artificial intelligence synergies in software engineering (RAISE), p 11
El-Hajj R, Nadi S (2020) LibComp: an IntelliJ plugin for comparing Java libraries. In: Proceedings of the 28th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, ser. ESEC/FSE 2020. Association for Computing Machinery, New York, pp 1591–1595. https://doi.org/10.1145/3368089.3417922
Gizas A, Christodoulou S, Papatheodorou T (2012) Comparative evaluation of javascript frameworks. In: Proceedings of the 21st international conference on world wide web. WWW ’12 Companion. Association for Computing Machinery, New York, pp 513–514. https://doi.org/10.1145/2187980.2188103
Harris H, Murphy S, Vaisman M (2013) Analyzing the analyzers: an introspective survey of data scientists and their work. O’Reilly Media, Inc.
Hora A, Valente MT (2015) Apiwave: keeping track of api popularity and migration. In: Proceedings of the 31st IEEE international conference on software maintenance and evolution, ser. ICSME ’15. IEEE Computer Society, Washington, pp 321–323
Hu J, Joung J, Jacobs M, Gajos KZ, Seltzer MI (2020) Improving data scientist efficiency with provenance. In: 2020 IEEE/ACM 42nd international conference on software engineering (ICSE), pp 1086–1097
Kaggle (2020) Kaggle’s 2020 state of data science and machine learning survey. https://www.kaggle.com/kaggle-survey-2020
Kandel S, Paepcke A, Hellerstein JM, Heer J (2012) Enterprise data analysis and visualization: an interview study. IEEE Trans Vis Comput Graph 18 (12):2917–2926
Kery MB, Radensky M, Arya M, John BE, Myers BA (2018) The story in the notebook: exploratory data science using a literate programming tool. In: Proceedings of the 2018 CHI conference on human factors in computing systems, pp 1–11
Kim M, Zimmermann T, DeLine R, Begel A (2016) The emerging role of data scientists on software development teams. In: Proceedings of the 38th IEEE/ACM international conference on software engineering (ICSE), IEEE, pp 96–107
Kim M, Zimmermann T, DeLine R, Begel A (2018) Data scientists in software teams: state of the art and challenges. IEEE Trans Softw Eng 44 (11):1024–1038
Kontio J, Lehtola L, Bragge J (2004) Using the focus group method in software engineering: obtaining practitioner and user experiences. In: Proceedings of the international symposium on empirical software engineering (ISESE’04), IEEE, pp 271–280
Kross S, Guo PJ (2019) Practitioners teaching data science in industry and academia: expectations, workflows, and challenges. Association for Computing Machinery, New York, pp 1–14. https://doi.org/10.1145/3290605.3300493
Larios Vargas E, Aniche M, Treude C, Bruntink M, Gousios G (2020a) Selecting third-party libraries: the practitioners’ perspective. https://doi.org/10.5281/zenodo.3979446
Larios Vargas E, Aniche M, Treude C, Bruntink M, Gousios G (2020b) Selecting third-party libraries: the practitioners’ perspective. In: Proceedings of the 28th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering (ESEC/FSE). Association for Computing Machinery, New York, pp 245–256. https://doi.org/10.1145/3368089.3409711
Ma Y, Mockus A, Zaretzki R, Bichescu B, Bradley R (2020) A methodology for analyzing uptake of software technologies among developers. IEEE Trans Softw Eng 48(2):485–501
Matplotlib (2021). https://matplotlib.org/
Metwalli SA (2020) Data visualization 101: how to choose a python plotting library. https://towardsdatascience.com/data-visualization-101-how-to-choose-a-python-plotting-library-853460a08a8a
Mileva YM, Dallmeier V, Burger M, Zeller A (2009) Mining trends of library usage. In: Proceedings of the joint international and annual ERCIM workshops on principles of software evolution (IWPSE) and software evolution (Evol) workshops, ser. IWPSE-Evol ’09. ACM, New York, pp 57–62
Muller M, Lange I, Wang D, Piorkowski D, Tsay J, Liao QV, Dugan C, Erickson T (2019) How data science workers work with data: discovery, capture, curation, design, creation. In: Proceedings of the 2019 CHI conference on human factors in computing systems, pp 1–15
Myers BA, Stylos J (2016) Improving api usability. Commun ACM 59(6):62–69
Nahar N, Zhou S, Lewis G, Kästner C (2022) Collaboration challenges in building ml-enabled systems: communication, documentation, engineering, and process. In: Proceedings of the 44th international conference on software engineering (ICSE ’22)
Nguyen G, Dlugolinsky S, Bobák M, Tran V, García ÁL, Heredia I, Malík P, Hluchỳ L (2019) Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artif Intell Rev 52(1):77–124
Ni A, Ramos D, Yang AZH, Lynce I, Manquinho V, Martins R, Le Goues C (2021) Soar: a synthesis approach for data science api refactoring. In: 2021 IEEE/ACM 43rd international conference on software engineering (ICSE), pp 112–124
Pandas (2021). https://pandas.pydata.org/
Pano A, Graziotin D, Abrahamsson P (2018) Factors and actors leading to the adoption of a javascript framework. Empir Softw Eng 23(6):3503–3534
Patil DJ (2011) Building data science teams. O’Reilly Media, Inc.
Piccioni M, Furia CA, Meyer B (2013) An empirical study of api usability. In: ACM/IEEE international symposium on empirical software engineering and measurement, pp 5–14
Pressman RS (2005) Software engineering: a practitioner’s approach. Macmillan, Palgrave
Psallidas F, Zhu Y, Karlas B, Interlandi M, Floratou A, Karanasos K, Wu W, Zhang C, Krishnan S, Curino C, et al. (2019) Data science through the looking glass and what we found there. arXiv:1912.09536
Ralph P, bin Ali N, Baltes S, Bianculli D, Diaz J, Dittrich Y, Ernst N, Felderer M, Feldt R, Filieri A, de França BBN, Furia CA, Gay G, Gold N, Graziotin D, He P, Hoda R, Juristo N, Kitchenham B, Lenarduzzi V, Martínez J, Melegati J, Mendez D, Menzies T, Molleri J, Pfahl D, Robbes R, Russo D, Saarimäki N, Sarro F, Taibi D, Siegmund J, Spinellis D, Staron M, Stol K, Storey M-A, Taibi D, Tamburri D, Torchiano M, Treude C, Turhan B, Wang X, Vegas S (2020) Empirical standards for software engineering research. arXiv:2010.03525
Robillard MP, DeLine R (2011) A field study of API learning obstacles. Empir Softw Eng 16(6):703–732
Robinson S (2018) The best machine learning libraries in python. https://stackabuse.com/the-best-machine-learning-libraries-in-python/
Siebert J, Groß J, Schroth C (2021) A systematic review of packages for time series analysis. Eng Proc 5(1):22. https://www.mdpi.com/2673-4591/5/1/22. https://doi.org/10.3390/engproc2021005022
Sol T (2021) Choosing an open source machine learning library? here’s the list! https://gbksoft.com/blog/choosing-an-open-source-machine-learning-library-heres-the-list/
Stack Overflow (2021). https://stackoverflow.com/
Stančin I, Jović A (2019) An overview and comparison of free python libraries for data mining and big data analysis. In: 42nd international convention on information and communication technology, electronics and microelectronics (MIPRO), IEEE, pp 977–982
The SciPy community (2021) SciPy library. https://www.scipy.org/
The SciPy community (2021) Wilcoxon rank sum test. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ranksums.html
Tensorflow (2021). https://www.tensorflow.org/
Teyton C, Falleri J-R, Blanc X (2012) Mining library migration graphs. In: Proceedings of the 19th working conference on reverse engineering (WCRE), pp 289–298
Teyton C, Falleri J-R, Palyart M, Blanc X (2014) A study of library migrations in java. J Softw Evol Process 26(11):1030–1052
The Economist (2017) The world’s most valuable resource is no longer oil, but data. The Economist Group Limited, London. https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data
Thung F, Lo D, Lawall J (2013) Automated library recommendation. In: Proceedings of the 20th working conference on reverse engineering (WCRE), pp 182–191
Uddin G, Khomh F (2017) Automatic summarization of API reviews. In: Proceedings of the 32nd IEEE/ACM international conference on automated software engineering, ser. ASE ’17
What you should know about the different data science job titles (2020). https://www.linkedin.com/pulse/what-you-should-know-different-data-science-job-big-data-scientist/
Wickham H, Chang W, Lionel Henry TLP, Takahashi K, Wilke C, Woo K, Yutani H, Dunnington D (2021) ggplot. https://ggplot2.tidyverse.org/
Wickham H, François R, Henry L, Müller K (2021) dplyr. https://dplyr.tidyverse.org/
Xu B, An L, Thung F, Khomh F, Lo D (2020) Why reinventing the wheels? an empirical study on library reuse and re-implementation. Empir Softw Eng 25(1):755–789
Yang C, Zhou S, Guo JL, Kästner C (2021) Subtle bugs everywhere: generating documentation for data wrangling code. In: Proceedings of the 36th IEEE/ACM international conference on automated software engineering (ASE), vol 11
Zhang AX, Muller M, Wang D (2020) How do data science workers collaborate? roles, workflows, and tools. Proc ACM Human-Comput Interact 4 (CSCW1):1–23. https://doi.org/10.1145/3392826

Acknowledgements

We would like to thank all our participants who filled out the survey. Thanks to Andrew Nady for help with early data analysis scripts. Sarah Nadi’s research is funded by the Canada Research Chairs program.

Author information

Authors and Affiliations

University of Alberta, Edmonton, AB, Canada
Sarah Nadi
The American University in Cairo, Cairo, Egypt
Nourhan Sakr

Corresponding author

Correspondence to Sarah Nadi.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Communicated by: Alexander Serebrenik

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Nadi, S., Sakr, N. Selecting third-party libraries: the data scientist’s perspective. Empir Software Eng 28, 15 (2023). https://doi.org/10.1007/s10664-022-10241-3

Accepted: 09 September 2022
Published: 07 December 2022
DOI: https://doi.org/10.1007/s10664-022-10241-3
