Abstract
Processes and practices used in data science projects have been evolving, especially over the last decade. They differ from their software engineering counterparts. However, data science relies on software to a large extent, and, once taken into use, the results of a data science project are often embedded in a software context. Hence, seeking synergy between software engineering and data science might open promising avenues. While there are various studies on data science workflows and on data science project teams, there have been no attempts to combine these two closely interlinked aspects. Furthermore, existing studies usually focus on practices within a single company. Our study fills these gaps with a multi-company case study that concentrates both on the roles found in data science project teams and on the process. In this paper, we have studied a number of practicing data scientists to understand the typical process flow of a data science project. In addition, we studied the roles involved and the teamwork that takes place in the data context. Our analysis revealed three main elements of data science projects: Experimentation, Development Approach, and Multi-disciplinary team(work). These key concepts are further broken down into 13 sub-themes in total. The themes we found pinpoint critical elements and challenges of data science projects, which are still often carried out in an ad hoc fashion. Finally, we compare the results with modern software development to analyse how well the two match.
Acknowledgements
The authors wish to thank the professionals who provided their time and experience for our interviews. This study would not have been possible without them.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Aho, T., Sievi-Korte, O., Kilamo, T., Yaman, S., Mikkonen, T. (2020). Demystifying Data Science Projects: A Look on the People and Process of Data Science Today. In: Morisio, M., Torchiano, M., Jedlitschka, A. (eds) Product-Focused Software Process Improvement. PROFES 2020. Lecture Notes in Computer Science, vol 12562. Springer, Cham. https://doi.org/10.1007/978-3-030-64148-1_10
DOI: https://doi.org/10.1007/978-3-030-64148-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-64147-4
Online ISBN: 978-3-030-64148-1
eBook Packages: Computer Science, Computer Science (R0)