From automatically translating documents to analyzing electoral voting patterns; from computing personalized movie recommendations to predicting flu epidemics: all of these tasks are made possible by the success and proliferation of the MapReduce parallel programming paradigm. Yet almost ten years after the system was introduced, we still do not have a good understanding of which problems can and cannot be efficiently computed in MapReduce.
In this talk I will give an overview of the MapReduce framework, and explain its connections to both Valiant’s Bulk Synchronous Parallel (BSP) model and the classical PRAM model of parallel computing. To demonstrate the power of the MapReduce model I will present the Sample-and-Prune approach, which finds an approximate coreset of manageable size, thereby reducing the problem from the realm of ‘Big Data’ to that of ‘Small Data.’
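To make the Sample-and-Prune idea concrete, here is a toy single-machine sketch applied to maximum k-coverage; this is an illustration, not the algorithm from the talk, and the function name, problem choice, and `memory_limit` parameter are assumptions made for the example. The loop alternates between sampling a machine-sized subset, running classical greedy on that sample, and pruning elements that can no longer contribute, so each round shrinks ‘Big Data’ toward ‘Small Data.’

```python
import random

def sample_and_prune(universe_sets, k, memory_limit, seed=0):
    """Toy sketch of the Sample-and-Prune pattern for max k-coverage.

    Repeatedly: (1) sample a subset of the remaining sets small enough
    to fit on one machine, (2) run classical greedy on the sample to
    extend the solution, (3) prune sets whose marginal coverage is now
    zero, since greedy would never pick them. In a real MapReduce
    implementation, the pruning step runs in parallel across machines.
    """
    rng = random.Random(seed)
    solution = []   # indices of chosen sets
    covered = set() # elements covered so far
    remaining = list(range(len(universe_sets)))

    while len(solution) < k and remaining:
        # Step 1: sample what fits in one machine's memory.
        sample = rng.sample(remaining, min(memory_limit, len(remaining)))
        # Step 2: greedy on the sample only ("Small Data").
        while len(solution) < k and sample:
            best = max(sample, key=lambda i: len(universe_sets[i] - covered))
            if not (universe_sets[best] - covered):
                break  # no set in the sample adds anything new
            solution.append(best)
            covered |= universe_sets[best]
            sample.remove(best)
        # Step 3: prune sets with no marginal gain.
        remaining = [i for i in remaining
                     if i not in solution and universe_sets[i] - covered]
    return solution, covered
```

When the sample size is a constant fraction of memory, a constant number of such rounds suffices in expectation, which is what makes the pattern attractive in the MapReduce setting.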
I will conclude by discussing other considerations that make a large difference when working with MapReduce in practice, but have so far resisted careful theoretical analysis.