
Selecting third-party libraries: the data scientist’s perspective

Empirical Software Engineering

Abstract

With the increased reliance on data-driven decisions and software services, data scientists are becoming an integral part of many software teams and enterprise operations. To perform their tasks, data scientists rely on various third-party libraries (e.g., pandas in Python for data wrangling or ggplot in R for data visualization). Selecting the right library is often difficult, with many factors influencing the choice. While there has been a lot of research on the factors that software developers take into account when selecting a library, it is not clear whether these factors influence data scientists’ library selection in the same way, especially given several differences between the two groups. To address this gap, we replicate a recent survey of library selection factors, but target data scientists instead of software developers. Our survey of 90 participants shows that data scientists consider several factors when selecting libraries, with technical factors such as the usability of the library, fit for purpose, and documentation being the three most influential. Additionally, we find that there are 11 factors that data scientists rate differently than software developers. For example, data scientists are influenced more by the collective experience of the community but less by the library’s security or license. We also uncover new factors that influence data scientists’ library selection, such as the statistical rigor of the library. We triangulate our survey results with feedback from five focus groups involving 18 additional data science experts with various roles, whose input allows us to further interpret our survey results. We discuss the implications of our findings for data science library maintainers as well as for researchers who want to design recommender and/or comparison systems that help data scientists with library selection.


Notes

  1. Note that the layout of the survey sometimes combines questions of different categories to optimize the flow of the survey. For example, we ask participants about their current role at the beginning of the factor ratings to contextualize the information, while we keep all optional demographic questions at the end. The exact survey we use is available on our artifact page (https://doi.org/10.6084/m9.figshare.16563885.v1).

  2. https://www.linkedin.com/help/linkedin/answer/1584/inmail-messages?lang=en

  3. Thanks to the authors for releasing their raw rating data (Larios Vargas et al. 2020a), which allowed us to reproduce their results and enabled a direct distribution comparison.
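
A direct distribution comparison of the kind mentioned in Note 3 can be illustrated with a rank-based two-sample test. The reference list points to the SciPy documentation for the Wilcoxon rank sum test, so the sketch below implements that test in plain Python (midranks for ties, normal approximation, no tie correction). It is only an illustration under stated assumptions: the function name `rank_sum_test` and the two rating lists are hypothetical, and the numbers are invented rather than taken from either survey.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Tied values receive the average of the ranks they span (midranks).
    No tie correction is applied to the variance.
    Returns (z, p): z > 0 means the values in x tend to rank higher than y.
    """
    n1, n2 = len(x), len(y)
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    values = [v for v, _ in pooled]

    # Assign midranks: every run of equal values gets the mean of its ranks.
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        i = j

    # Rank sum of the first sample, standardized under the null hypothesis.
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (r1 - expected) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p


# Hypothetical 5-point Likert ratings of one factor by the two groups.
data_scientists = [5, 4, 4, 5, 3, 4, 5, 4]
developers = [3, 2, 4, 3, 3, 2, 4, 3]

z, p = rank_sum_test(data_scientists, developers)
print(f"z = {z:.3f}, p = {p:.4f}")
```

With real survey data, one such comparison would be run per factor, and a correction for multiple comparisons may be appropriate; whether and how the original analysis did so is not stated in this section.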

References

  • Abdalkareem R, Nourry O, Wehaibi S, Mujahid S, Shihab E (2017) Why do developers use trivial packages? an empirical case study on npm. In: Proceedings of the 11th joint meeting on foundations of software engineering, ser. ESEC/FSE 2017. https://doi.org/10.1145/3106237.3106267. Association for Computing Machinery, New York, pp 385–395

  • Biswas S, Wardat M, Rajan H (2021) The art and practice of data science pipelines: a comprehensive study of data science pipelines in theory, in-the-small, and in-the-large. arXiv:2112.01590

  • Czerwonka J, Nagappan N, Schulte W, Murphy B (2013) Codemine: building a software development data analytics platform at microsoft. IEEE Softw 30(4):64–71


  • De La Mora FL, Nadi S (2018a) An empirical study of metric-based comparisons of software libraries. In: Proceedings of the 14th international conference on predictive models and data analytics in software engineering, ser. PROMISE’18. https://doi.org/10.1145/3273934.3273937. Association for Computing Machinery, New York, pp 22–31

  • De La Mora, FL, Nadi S (2018b) Which library should i use?: A metric-based comparison of software libraries. In: Proceedings of the 40th IEEE/ACM international conference on software engineering: new ideas and emerging technologies results (ICSE-NIER), pp 37–40

  • Dong H, Zhou S, Guo J, Kästner C (2021) Splitting, renaming, removing: a study of common cleaning activities in jupyter notebooks. In: Proceedings of the 9th international workshop on realizing artificial intelligence synergies in software engineering (RAISE), p 11

  • El-Hajj R, Nadi S (2020) LibComp: an IntelliJ plugin for comparing Java libraries. In: Proceedings of the 28th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, ser. ESEC/FSE 2020. https://doi.org/10.1145/3368089.3417922. Association for Computing Machinery, New York, pp 1591–1595

  • Gizas A, Christodoulou S, Papatheodorou T (2012) Comparative evaluation of javascript frameworks. In: Proceedings of the 21st international conference on world wide web. WWW ’12 Companion. https://doi.org/10.1145/2187980.2188103. Association for Computing Machinery, New York, pp 513–514

  • Harris H, Murphy S, Vaisman M (2013) Analyzing the analyzers: an introspective survey of data scientists and their work. O’Reilly Media, Inc.

  • Hora A, Valente MT (2015) Apiwave: keeping track of api popularity and migration. In: Proceedings of the 31st IEEE international conference on software maintenance and evolution, ser. ICSME ’15. IEEE Computer Society, Washington, pp 321–323

  • Hu J, Joung J, Jacobs M, Gajos KZ, Seltzer MI (2020) Improving data scientist efficiency with provenance. In: 2020 IEEE/ACM 42nd international conference on software engineering (ICSE), pp 1086–1097

  • Kaggle (2020) Kaggle’s 2020 state of data science and machine learning survey. https://www.kaggle.com/kaggle-survey-2020

  • Kandel S, Paepcke A, Hellerstein JM, Heer J (2012) Enterprise data analysis and visualization: an interview study. IEEE Trans Vis Comput Graph 18 (12):2917–2926


  • Kery MB, Radensky M, Arya M, John BE, Myers BA (2018) The story in the notebook: exploratory data science using a literate programming tool. In: Proceedings of the 2018 CHI conference on human factors in computing systems, pp 1–11

  • Kim M, Zimmermann T, DeLine R, Begel A (2016) The emerging role of data scientists on software development teams. In: Proceedings of the 38th IEEE/ACM international conference on software engineering (ICSE), IEEE, pp 96–107

  • Kim M, Zimmermann T, DeLine R, Begel A (2018) Data scientists in software teams: state of the art and challenges. IEEE Trans Softw Eng 44 (11):1024–1038


  • Kontio J, Lehtola L, Bragge J (2004) Using the focus group method in software engineering: obtaining practitioner and user experiences. In: Proceedings of the international symposium on empirical software engineering (ISESE’04), IEEE, pp 271–280

  • Kross S, Guo PJ (2019) Practitioners teaching data science in industry and academia: expectations, workflows, and challenges. Association for Computing Machinery, New York, pp 1–14. https://doi.org/10.1145/3290605.3300493


  • Larios Vargas E, Aniche M, Treude C, Bruntink M, Gousios G (2020a) Selecting third-party libraries: the practitioners’ perspective. https://doi.org/10.5281/zenodo.3979446

  • Larios Vargas E, Aniche M, Treude C, Bruntink M, Gousios G (2020b) Selecting third-party libraries: the practitioners’ perspective. In: Proceedings of the 28th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering (ESEC/FSE). https://doi.org/10.1145/3368089.3409711. Association for Computing Machinery, New York, pp 245–256

  • Ma Y, Mockus A, Zaretzki R, Bichescu B, Bradley R (2020) A methodology for analyzing uptake of software technologies among developers. IEEE Trans Softw Eng 48(2):485–501


  • Matplotlib (2021). https://matplotlib.org/

  • Metwalli SA (2020) Data visualization 101: how to choose a python plotting library. https://towardsdatascience.com/data-visualization-101-how-to-choose-a-python-plotting-library-853460a08a8a

  • Mileva YM, Dallmeier V, Burger M, Zeller A (2009) Mining trends of library usage. In: Proceedings of the joint international and annual ERCIM workshops on principles of software evolution (IWPSE) and software evolution (Evol) workshops, ser. IWPSE-Evol ’09. ACM, New York, pp 57–62

  • Muller M, Lange I, Wang D, Piorkowski D, Tsay J, Liao QV, Dugan C, Erickson T (2019) How data science workers work with data: discovery, capture, curation, design, creation. In: Proceedings of the 2019 CHI conference on human factors in computing systems, pp 1–15

  • Myers BA, Stylos J (2016) Improving api usability. Commun ACM 59(6):62–69


  • Nahar N, Zhou S, Lewis G, Kästner C (2022) Collaboration challenges in building ml-enabled systems: communication, documentation, engineering, and process. In: Proceedings of the 44th international conference on software engineering (ICSE ’22)

  • Nguyen G, Dlugolinsky S, Bobák M, Tran V, García ÁL, Heredia I, Malík P, Hluchỳ L (2019) Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artif Intell Rev 52(1):77–124


  • Ni A, Ramos D, Yang AZH, Lynce I, Manquinho V, Martins R, Le Goues C (2021) Soar: a synthesis approach for data science api refactoring. In: 2021 IEEE/ACM 43rd international conference on software engineering (ICSE), pp 112–124

  • Pandas (2021). https://pandas.pydata.org/

  • Pano A, Graziotin D, Abrahamsson P (2018) Factors and actors leading to the adoption of a javascript framework. Empir Softw Eng 23(6):3503–3534


  • Patil DJ (2011) Building data science teams. O’Reilly Media, Inc.


  • Piccioni M, Furia CA, Meyer B (2013) An empirical study of api usability. In: ACM/IEEE international symposium on empirical software engineering and measurement, pp 5–14

  • Pressman RS (2005) Software engineering: a practitioner’s approach. Macmillan, Palgrave


  • Psallidas F, Zhu Y, Karlas B, Interlandi M, Floratou A, Karanasos K, Wu W, Zhang C, Krishnan S, Curino C, et al. (2019) Data science through the looking glass and what we found there. arXiv:1912.09536

  • Ralph P, bin Ali N, Baltes S, Bianculli D, Diaz J, Dittrich Y, Ernst N, Felderer M, Feldt R, Filieri A, de França BBN, Furia CA, Gay G, Gold N, Graziotin D, He P, Hoda R, Juristo N, Kitchenham B, Lenarduzzi V, Martínez J, Melegati J, Mendez D, Menzies T, Molleri J, Pfahl D, Robbes R, Russo D, Saarimäki N, Sarro F, Taibi D, Siegmund J, Spinellis D, Staron M, Stol K, Storey M-A, Taibi D, Tamburri D, Torchiano M, Treude C, Turhan B, Wang X, Vegas S (2020) Empirical standards for software engineering research. arXiv:2010.03525

  • Robillard MP, DeLine R (2011) A field study of API learning obstacles. Empir Softw Eng 16(6):703–732


  • Robinson S (2018) The best machine learning libraries in python. https://stackabuse.com/the-best-machine-learning-libraries-in-python/

  • Siebert J, Groß J, Schroth C (2021) A systematic review of packages for time series analysis. Eng Proc 5(1):22. https://www.mdpi.com/2673-4591/5/1/22. https://doi.org/10.3390/engproc2021005022


  • Sol T (2021) Choosing an open source machine learning library? here’s the list! https://gbksoft.com/blog/choosing-an-open-source-machine-learning-library-heres-the-list/

  • Stack Overflow (2021). https://stackoverflow.com/

  • Stančin I, Jović A (2019) An overview and comparison of free python libraries for data mining and big data analysis. In: 42nd international convention on information and communication technology, electronics and microelectronics (MIPRO), IEEE, pp 977–982

  • The SciPy community (2021) SciPy library. https://www.scipy.org/

  • The SciPy community (2021) Wilcoxon rank sum test. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ranksums.html

  • Tensorflow (2021). https://www.tensorflow.org/

  • Teyton C, Falleri J-R, Blanc X (2012) Mining library migration graphs. In: Proceedings of the 19th working conference on reverse engineering (WCRE), pp 289–298

  • Teyton C, Falleri J-R, Palyart M, Blanc X (2014) A study of library migrations in java. J Softw Evol Process 26(11):1030–1052


  • The Economist (2017) The world’s most valuable resource is no longer oil, but data. The Economist Group Limited, London. https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data


  • Thung F, Lo D, Lawall J (2013) Automated library recommendation. In: Proceedings of the 20th working conference on reverse engineering (WCRE), pp 182–191


  • Uddin G, Khomh F (2017) Automatic summarization of API reviews. In: Proceedings of the 32nd IEEE/ACM international conference on automated software engineering, ser. ASE ’17

  • What you should know about the different data science job titles (2020). https://www.linkedin.com/pulse/what-you-should-know-different-data-science-job-big-data-scientist/

  • Wickham H, Chang W, Henry L, Pedersen TL, Takahashi K, Wilke C, Woo K, Yutani H, Dunnington D (2021) ggplot. https://ggplot2.tidyverse.org/

  • Wickham H, François R, Henry L, Müller K (2021) dplyr. https://dplyr.tidyverse.org/

  • Xu B, An L, Thung F, Khomh F, Lo D (2020) Why reinventing the wheels? an empirical study on library reuse and re-implementation. Empir Softw Eng 25(1):755–789


  • Yang C, Zhou S, Guo JL, Kästner C (2021) Subtle bugs everywhere: generating documentation for data wrangling code. In: Proceedings of the 36th IEEE/ACM international conference on automated software engineering (ASE), vol 11

  • Zhang AX, Muller M, Wang D (2020) How do data science workers collaborate? roles, workflows, and tools. Proc ACM Human-Comput Interact 4 (CSCW1):1–23. https://doi.org/10.1145/3392826



Acknowledgements

We would like to thank all our participants who filled out the survey. Thanks to Andrew Nady for help with early data analysis scripts. Sarah Nadi’s research is funded by the Canada Research Chairs program.

Author information


Corresponding author

Correspondence to Sarah Nadi.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Communicated by: Alexander Serebrenik

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Nadi, S., Sakr, N. Selecting third-party libraries: the data scientist’s perspective. Empir Software Eng 28, 15 (2023). https://doi.org/10.1007/s10664-022-10241-3



  • DOI: https://doi.org/10.1007/s10664-022-10241-3
