The CRISP-DCW Method for Distributed Computing Workflows

  • Marco Spruit
  • Stijn Meijers
Conference paper
Part of the Springer Proceedings in Complexity book series (SPCOM)

Abstract

Big data analysis is becoming a crucial part of many organizations, which is popularizing the distributed computing paradigm. Within the emerging research field of Applied Data Science, several notable methods help analysts and scientists create their analytical processes. For distributed computing problems, however, no such methods are available yet. Therefore, to support data analysts, scientists and software engineers in creating distributed computing processes, we present the CRoss-Industry Standard Process for Distributed Computing Workflows (CRISP-DCW) method. The CRISP-DCW method lets users create distributed computing workflows by following a predefined cycle and using reference manuals, through which the critical elements of such a workflow are developed for the context at hand. With the method’s reference manuals and predefined steps, data scientists can spend less time developing big data processing workflows, which increases their efficiency. The results were evaluated with experts and found to be satisfactory. We therefore argue that the CRISP-DCW method provides a good starting point for applied data scientists to develop and document their distributed computing workflows, making their processes both more efficient and more effective.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands
