Demystifying Data Science Projects: A Look on the People and Process of Data Science Today

  • Conference paper
Product-Focused Software Process Improvement (PROFES 2020)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 12562)

Abstract

Processes and practices used in data science projects have been reshaped, especially over the last decade, and they differ from their software engineering counterparts. However, data science relies on software to a large extent, and, once taken into use, the results of a data science project are often embedded in a software context. Hence, seeking synergy between software engineering and data science might open promising avenues. While there are various studies on data science workflows and data science project teams, there have been no attempts to combine these two closely interlinked aspects. Furthermore, existing studies usually focus on practices within one company. Our study fills these gaps with a multi-company case study that concentrates on both the roles found in data science project teams and the process. In this paper, we study a number of practicing data scientists to understand the typical process flow of a data science project. In addition, we study the roles involved and the teamwork that takes place in the data context. Our analysis revealed three main elements of data science projects: Experimentation, Development Approach, and Multi-disciplinary team(work). These key concepts are further broken down into 13 sub-themes in total. The themes pinpoint critical elements and challenges in data science projects, which are still often carried out in an ad-hoc fashion. Finally, we compare the results with modern software development to analyse how well the two match.

Notes

  1. Available at https://documentation.sas.com/?docsetId=emref&docsetTarget=n061bzurmej4j3n1jnj8bbjjm1a2.htm&docsetVersion=15.1.

  2. The interview protocol is available at https://drive.google.com/file/d/1rKvt_10oeINv0hXvQUQHgIgtFyEj9sAf/view?usp=sharing.

Acknowledgements

The authors wish to thank the professionals who provided their time and experience for our interviews. This study would not have been possible without them.

Author information

Corresponding author

Correspondence to Timo Aho.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Aho, T., Sievi-Korte, O., Kilamo, T., Yaman, S., Mikkonen, T. (2020). Demystifying Data Science Projects: A Look on the People and Process of Data Science Today. In: Morisio, M., Torchiano, M., Jedlitschka, A. (eds) Product-Focused Software Process Improvement. PROFES 2020. Lecture Notes in Computer Science, vol 12562. Springer, Cham. https://doi.org/10.1007/978-3-030-64148-1_10

  • DOI: https://doi.org/10.1007/978-3-030-64148-1_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-64147-4

  • Online ISBN: 978-3-030-64148-1

  • eBook Packages: Computer Science, Computer Science (R0)
