Cloud Computing for Data-Intensive Applications

pp 277-302


Cross-Phase Optimization in MapReduce

  • Benjamin Heintz, University of Minnesota
  • Abhishek Chandra, University of Minnesota
  • Jon Weissman, University of Minnesota



MapReduce has proven remarkably effective for a wide variety of data-intensive applications, but it was designed to run on large, single-site, homogeneous clusters. Researchers have begun to explore how far the original MapReduce assumptions can be relaxed to accommodate skewed workloads, iterative applications, and heterogeneous computing environments. This chapter continues that exploration by applying MapReduce to widely distributed data across distributed computation resources. This problem arises when datasets are generated and stored at multiple sites, as is common in many scientific domains and, increasingly, in e-commerce applications. It also occurs when multi-site resources, such as geographically separated data centers, are applied to the same application. Using Hadoop, we show that the absence of network homogeneity, node homogeneity, and data locality leads to poor performance. The underlying problem is that the interaction between MapReduce phases becomes pronounced in the presence of heterogeneous network behavior. In this chapter, we propose new cross-phase optimization techniques that allow otherwise independent MapReduce phases to influence one another. We propose techniques that optimize the push and map phases together, enabling push-map overlap and allowing map behavior to feed back into push dynamics. Similarly, we propose techniques that optimize the map and reduce phases together, allowing shuffle cost to feed back into map scheduling decisions. We evaluate the benefits of our techniques on both Amazon EC2 and PlanetLab. The experimental results show the potential of these techniques: performance improves by 7 to 18%, depending on the execution environment and application.
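
The idea of push-map overlap can be illustrated with a toy timing model. The sketch below is not the chapter's implementation. It assumes a single data source feeding a single mapper over one link, with hypothetical per-chunk push and map times, and compares a strict barrier between the two phases against a pipeline in which mapping of a chunk starts as soon as that chunk has arrived. The function names and the numbers are illustrative assumptions only.

```python
from typing import Sequence


def barrier_makespan(push_times: Sequence[float], map_times: Sequence[float]) -> float:
    """Completion time when every chunk is pushed before any map work starts."""
    if len(push_times) != len(map_times):
        raise ValueError("need one push time and one map time per chunk")
    return sum(push_times) + sum(map_times)


def overlapped_makespan(push_times: Sequence[float], map_times: Sequence[float]) -> float:
    """Completion time when a chunk is mapped as soon as it has arrived
    and the (single, hypothetical) mapper is idle."""
    if len(push_times) != len(map_times):
        raise ValueError("need one push time and one map time per chunk")
    arrival = 0.0      # time at which the current chunk finishes arriving
    mapper_free = 0.0  # time at which the mapper finishes its previous chunk
    for push, work in zip(push_times, map_times):
        arrival += push                    # chunks are pushed back to back
        start = max(arrival, mapper_free)  # wait for both the data and the mapper
        mapper_free = start + work
    return mapper_free


# Hypothetical workload: three chunks, slow link, faster mapper.
push = [4.0, 4.0, 4.0]
work = [3.0, 3.0, 3.0]
print(barrier_makespan(push, work))     # 21.0
print(overlapped_makespan(push, work))  # 15.0
```

In this toy model the overlapped schedule hides all but the last chunk's map time behind the push, so the gain depends on how the link speed compares with the mapper speed. That dependence is one reason heterogeneous networks make the interaction between phases matter.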