Abstract
We introduce FlowFlex, a highly generic and effective scheduler for flows of MapReduce jobs connected by precedence constraints. Such a flow can result, for example, from a single user-level Pig, Hive or Jaql query. Each flow is associated with an arbitrary function describing the cost incurred in completing the flow at a particular time. The overall objective is to minimize either the total cost (minisum) or the maximum cost (minimax) of the flows. Our contributions are both theoretical and practical. Theoretically, we advance the state of the art in malleable parallel scheduling with precedence constraints. We employ resource augmentation analysis to provide bicriteria approximation algorithms for both minisum and minimax objective functions. As corollaries, we obtain approximation algorithms for total weighted completion time (and thus average completion time and average stretch), and for maximum weighted completion time (and thus makespan and maximum stretch). Practically, the average-case performance of the FlowFlex scheduler is excellent, significantly better than that of other approaches. Specifically, we demonstrate via extensive experiments the overall performance of FlowFlex relative to optimal and also relative to other, standard MapReduce scheduling schemes. All told, FlowFlex dramatically extends the capabilities of the earlier Flex scheduler for singleton MapReduce jobs while simultaneously providing a solid theoretical foundation for both.
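To make the two objectives concrete, the following is a minimal sketch of how minisum and minimax costs could be evaluated for a fixed set of flow completion times. It is not the FlowFlex algorithm itself; the flow names, weights, completion times, and the helper `weighted_completion` are hypothetical values chosen for illustration. Weighted completion time is used as the per-flow cost function, which is one of the special cases named in the abstract.

```python
from typing import Callable, Dict

# A per-flow cost function maps a completion time to a cost.
CostFn = Callable[[float], float]


def minisum(costs: Dict[str, CostFn], completion: Dict[str, float]) -> float:
    """Total cost over all flows (the minisum objective value)."""
    return sum(costs[flow](completion[flow]) for flow in costs)


def minimax(costs: Dict[str, CostFn], completion: Dict[str, float]) -> float:
    """Largest cost over all flows (the minimax objective value)."""
    return max(costs[flow](completion[flow]) for flow in costs)


def weighted_completion(weight: float) -> CostFn:
    """Hypothetical helper: cost = weight * completion time."""
    return lambda t: weight * t


# Hypothetical flows with illustrative weights and completion times.
costs = {
    "q1": weighted_completion(2.0),
    "q2": weighted_completion(1.0),
    "q3": weighted_completion(0.5),
}
completion = {"q1": 10.0, "q2": 30.0, "q3": 40.0}

total = minisum(costs, completion)  # total weighted completion time
worst = minimax(costs, completion)  # maximum weighted completion time
```

Under these assumptions, setting every weight to 1 makes `minimax` return the makespan, and `minisum` divided by the number of flows gives the average completion time. A scheduler would search over feasible schedules to minimize one of these values; the sketch only evaluates them for a given schedule.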
Keywords
- Completion Time
- Precedence Constraint
- Total Weighted Completion Time
- Average Completion Time
- Maximum Stretch
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2013 IFIP International Federation for Information Processing
About this paper
Cite this paper
Nagarajan, V., Wolf, J., Balmin, A., Hildrum, K. (2013). FlowFlex: Malleable Scheduling for Flows of MapReduce Jobs. In: Eyers, D., Schwan, K. (eds) Middleware 2013. Middleware 2013. Lecture Notes in Computer Science, vol 8275. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45065-5_6
DOI: https://doi.org/10.1007/978-3-642-45065-5_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45064-8
Online ISBN: 978-3-642-45065-5
eBook Packages: Computer Science, Computer Science (R0)