Abstract
In this chapter we consider situations in which a single host computer is inadequate because the volume of data or the processing demand exceeds the capacity of the host. A popular solution distributes the data and computations across a network of computers, or cluster, sometimes created just for the task. In this scenario, each cluster node (a computing unit) stores and processes a subset of the data, and the results are merged once all nodes have completed their tasks. For this solution to succeed, the computational algorithm must conform to a certain structure and the cluster execution must be managed. The Hadoop environment and the MapReduce programming design provide the management and the algorithmic structure. Hadoop is a collection of software and services that builds the cluster, distributes the data across the cluster, and controls the data processing algorithms and the transmission of results. The MapReduce programming design ensures scalability, and scalability ensures that the results are independent of the cluster configuration. After a discussion of the essential components, the reader is guided through an introductory application of Hadoop and MapReduce.
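The map-then-merge structure described above can be sketched in a few lines of Python. This is a minimal, illustrative word-count example, not the chapter's program: in Hadoop Streaming the mapper and reducer would be separate scripts reading standard input, but here they are plain functions (with made-up names) so the data flow between the map, shuffle, and reduce stages is easy to follow.

```python
# Minimal sketch of the MapReduce pattern (illustrative names and data).
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map stage: emit a (key, value) pair for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reducer(key, values):
    # Reduce stage: merge all values seen for one key into a single result.
    return key, sum(values)

def run_job(lines):
    # Shuffle stage: sort the mapped pairs so equal keys are adjacent,
    # which is what Hadoop does between the map and reduce stages.
    pairs = sorted(p for line in lines for p in mapper(line))
    return dict(reducer(k, (v for _, v in g))
                for k, g in groupby(pairs, key=itemgetter(0)))

counts = run_job(["a b a", "b a"])  # {'a': 3, 'b': 2}
```

Because each mapper sees only its own subset of lines and each reducer sees only one key's values, the same program scales from one node to many without changing the result, which is the scalability property the abstract refers to.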
Notes
1. Notable commercial vendors are Cloudera, Hortonworks, and MapR.
2. A cluster created and controlled by HDFS may consist of thousands of nodes.
3. Specifically, the 50 U.S. states, the District of Columbia, and Puerto Rico.
4. Medicaid expenditures were $495.8 billion.
5. There are minor formatting differences between the year 2014 data file and the data files for the years 2012 and 2013.
6. Recall that p% of a distribution is smaller in value than the pth percentile and (100 − p)% of the distribution is larger. Thus, the median is the 50th percentile.
7. The percentile function will compute percentiles from any object that can be converted to an array by NumPy.
8. A conditional statement such as if m % 3 == 0 may be used to select every third record for processing.
9. The reader may be able to obtain an academic discount or free trial.
10. The default version was 2.7.2 at the time of this writing.
11. At the time of this writing, Hue, Pig, and Hive are included in the default configuration. These programs are not necessary.
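Notes 6 through 8 can be illustrated together in a short sketch. The payment values below are made up for illustration; the calls shown are standard NumPy and Python, not code from the chapter.

```python
# Illustrates notes 6-8: percentiles via NumPy and selecting every third
# record with the modulus operator. The data values are invented.
import numpy as np

payments = [120.0, 80.0, 200.0, 150.0, 95.0]

# np.percentile accepts any object convertible to an array (note 7),
# and the 50th percentile is the median (note 6).
median = np.percentile(payments, 50)

# Keep every third record (m = 0, 3, 6, ...), as in note 8.
subset = [x for m, x in enumerate(payments) if m % 3 == 0]
```

For this five-value list the median is 120.0 (the middle of the sorted values 80, 95, 120, 150, 200), and the subset retains records 0 and 3.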
References
Centers for Medicare & Medicaid Services, NHE Fact Sheet (2015). https://www.cms.gov/research-statistics-data-and-systems/statistics-trends-and-reports/nationalhealthexpenddata/nhe-fact-sheet.html
J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, in Proceedings of the Sixth Symposium on Operating System Design and Implementation (2004), pp. 107–113
J. Dean, S. Ghemawat, MapReduce: a flexible data processing tool. Commun. Assoc. Comput. Mach. 53 (1), 72–77 (2010)
J. Janssens, Data Science at the Command Line (O’Reilly Media, Sebastopol, 2014)
B. Lublinsky, K.T. Smith, A. Yakubovich, Hadoop Solutions (Wiley, Indianapolis, 2013)
N. Super, The geography of Medicare: explaining differences in payment and costs. Natl. Health Policy Forum (792) (2003)
Wikipedia, List of zip code prefixes - Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/List_of_ZIP_code_prefixes. Accessed 30 Apr 2016
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this chapter
Steele, B., Chandler, J., Reddy, S. (2016). Hadoop and MapReduce. In: Algorithms for Data Science. Springer, Cham. https://doi.org/10.1007/978-3-319-45797-0_4
Print ISBN: 978-3-319-45795-6
Online ISBN: 978-3-319-45797-0
eBook Packages: Computer Science (R0)