
Use case-based evaluation of workflow optimization strategy in real-time computation system

  • Saima Gulzar Ahmad
  • Hikmat Ullah Khan (email author)
  • Samia Ijaz
  • Ehsan Ullah Munir
Article

Abstract

With the advent of the big data era, data stream computing has emerged as a well-known approach to optimizing data-intensive workflows. Apache STORM is an open-source, real-time distributed computation system for processing data streams and has been adopted by well-known organizations such as Twitter, Yahoo, Alibaba, Baidu, and Groupon. Workflows are implemented as topologies in STORM. The main factor that controls the execution performance of a workflow in STORM is the strategy for scheduling the topology components (spouts and bolts). In this paper, we evaluate and analyze the performance of our Partition-based Data-intensive Workflow optimization Algorithm (PDWA) in Apache STORM using a use case workflow, EURExpressII. It is a real-world application workflow that builds a transcriptome-wide atlas of gene expression for the developing mouse embryo, established by ribonucleic acid (RNA) in situ hybridization. PDWA partitions the application task graph so that data movement between partitions is minimized. Each partition is then mapped onto a single machine, which executes the tasks of that partition. This mapping provides the minimum execution time for that partition. The algorithm also includes partial task duplication, which further enhances performance. A STORM-based computing cluster deployed in an OpenStack cloud is used as the computing environment. The PDWA-based optimizer is evaluated with data sets of different sizes. The results show that PDWA improves average execution time by 21% across different data set sizes and varying numbers of execution nodes. In addition, the comparative results show that, on average, the efficiency of PDWA is 20.4% higher than that of the STORM default scheduler (SDS).
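
The partitioning idea in the abstract can be illustrated with a small sketch. This is not the PDWA algorithm, whose details are not given here. It is a simple greedy heuristic, assumed purely for illustration, that merges tasks along the heaviest data edges so that little data crosses partition boundaries. The task names, edge weights, and the function names `cross_partition_data` and `greedy_partition` are all hypothetical, and the sketch ignores task duplication and load balance.

```python
from typing import Dict, List, Tuple

# (producer task, consumer task, data volume moved along the edge)
Edge = Tuple[str, str, float]


def cross_partition_data(edges: List[Edge], assignment: Dict[str, int]) -> float:
    """Total data volume on edges whose endpoints are in different partitions."""
    return sum(w for u, v, w in edges if assignment[u] != assignment[v])


def greedy_partition(tasks: List[str], edges: List[Edge], k: int) -> Dict[str, int]:
    """Merge partitions along the heaviest edges until at most k remain.

    Illustrative heuristic only (an assumption, not PDWA). If the graph is
    disconnected, more than k partitions may remain.
    """
    # Start with every task in its own partition.
    assignment = {t: i for i, t in enumerate(tasks)}
    remaining = len(tasks)
    for u, v, _ in sorted(edges, key=lambda e: e[2], reverse=True):
        if remaining <= k:
            break
        pu, pv = assignment[u], assignment[v]
        if pu != pv:
            # Fold partition pv into pu so this heavy edge becomes local.
            for t in tasks:
                if assignment[t] == pv:
                    assignment[t] = pu
            remaining -= 1
    return assignment


# Hypothetical four-stage topology: one spout feeding a chain of bolts.
tasks = ["spout", "bolt_a", "bolt_b", "bolt_c"]
edges = [("spout", "bolt_a", 100.0), ("bolt_a", "bolt_b", 10.0), ("bolt_b", "bolt_c", 80.0)]
assignment = greedy_partition(tasks, edges, k=2)
print(cross_partition_data(edges, assignment))  # prints 10.0
```

In this toy case the two heavy edges end up inside partitions and only the light edge between `bolt_a` and `bolt_b` crosses machines, which is the property the abstract attributes to partition-based placement.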

Keywords

Workflow optimization; STORM topology; Partitions; Data intensive; Stream data processing

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Saima Gulzar Ahmad (1)
  • Hikmat Ullah Khan (1), email author
  • Samia Ijaz (1)
  • Ehsan Ullah Munir (1)

  1. COMSATS University Islamabad, Islamabad, Pakistan
