Abstract
In this chapter we consider situations in which a single host computer is inadequate because the volume of data or the processing demand exceeds the capacity of the host. A popular solution distributes the data and computations across a network of computers, or cluster, sometimes created just for the task. In this scenario, each cluster node (a computing unit) stores and processes a subset of the data, and the results are merged once all nodes have completed their tasks. For this solution to succeed, the computational algorithm must conform to a certain structure and the cluster execution must be managed. The Hadoop environment and the MapReduce programming design provide the management and the algorithmic structure. Hadoop is a collection of software and services that builds the cluster, distributes the data across the cluster, and controls the data processing algorithms and the transmission of results. The MapReduce programming design ensures scalability, and scalability ensures that the results are independent of the cluster configuration. After a discussion of the essential components, the reader is guided through an introductory application of Hadoop and MapReduce.
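The map-then-merge structure described above can be sketched in a few lines of Python. This is a minimal, illustrative word-count example, not the chapter's program: in Hadoop Streaming the mapper and reducer would be separate scripts reading standard input, but here they are plain functions (with made-up names) so the data flow between the map, shuffle, and reduce stages is easy to follow.

```python
# Minimal sketch of the MapReduce pattern (illustrative names and data).
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map stage: emit a (key, value) pair for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reducer(key, values):
    # Reduce stage: merge all values seen for one key into a single result.
    return key, sum(values)

def run_job(lines):
    # Shuffle stage: sort the mapped pairs so equal keys are adjacent,
    # which is what Hadoop does between the map and reduce stages.
    pairs = sorted(p for line in lines for p in mapper(line))
    return dict(reducer(k, (v for _, v in g))
                for k, g in groupby(pairs, key=itemgetter(0)))

counts = run_job(["a b a", "b a"])  # {'a': 3, 'b': 2}
```

Because each mapper sees only its own subset of lines and each reducer sees only one key's values, the same program scales from one node to many without changing the result, which is the scalability property the abstract refers to.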
Notes
1. Notable commercial vendors are Cloudera, Hortonworks, and MapR.
2. A cluster created and controlled by HDFS may consist of thousands of nodes.
3. Specifically, the 50 U.S. states, the District of Columbia, and Puerto Rico.
4. Medicaid expenditures were $495.8 billion.
5. There are minor formatting differences between the year 2014 data file and the data files for the years 2012 and 2013.
6. Recall that p% of a distribution is smaller in value than the pth percentile and (100 − p)% of the distribution is larger. Thus, the median is the 50th percentile.
7. The percentile function will compute percentiles from any object that can be converted to an array by NumPy.
8. A conditional statement such as if m % 3 == 0 may be used to select every third record for processing.
9. The reader may be able to obtain an academic discount or free trial.
10. The default version was 2.7.2 at the time of this writing.
11. At the time of this writing, Hue, Pig, and Hive are included in the default configuration. These programs are not necessary.
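Notes 6 through 8 can be illustrated together in a short sketch. The payment values below are made up for illustration; the calls shown are standard NumPy and Python, not code from the chapter.

```python
# Illustrates notes 6-8: percentiles via NumPy and selecting every third
# record with the modulus operator. The data values are invented.
import numpy as np

payments = [120.0, 80.0, 200.0, 150.0, 95.0]

# np.percentile accepts any object convertible to an array (note 7),
# and the 50th percentile is the median (note 6).
median = np.percentile(payments, 50)

# Keep every third record (m = 0, 3, 6, ...), as in note 8.
subset = [x for m, x in enumerate(payments) if m % 3 == 0]
```

For this five-value list the median is 120.0 (the middle of the sorted values 80, 95, 120, 150, 200), and the subset retains records 0 and 3.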
References
Centers for Medicare & Medicaid Services, NHE Fact Sheet (2015). https://www.cms.gov/research-statistics-data-and-systems/statistics-trends-and-reports/nationalhealthexpenddata/nhe-fact-sheet.html
J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, in Proceedings of the Sixth Symposium on Operating System Design and Implementation (2004), pp. 107–113
J. Dean, S. Ghemawat, MapReduce: a flexible data processing tool. Commun. Assoc. Comput. Mach. 53 (1), 72–77 (2010)
J. Janssens, Data Science at the Command Line (O’Reilly Media, Sebastopol, 2014)
B. Lublinsky, K.T. Smith, A. Yakubovich, Hadoop Solutions (Wiley, Indianapolis, 2013)
N. Super, The geography of Medicare: explaining differences in payment and costs. Natl. Health Policy Forum (792) (2003)
Wikipedia, List of zip code prefixes - Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/List_of_ZIP_code_prefixes. Accessed 30 Apr 2016
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this chapter
Steele, B., Chandler, J., Reddy, S. (2016). Hadoop and MapReduce. In: Algorithms for Data Science. Springer, Cham. https://doi.org/10.1007/978-3-319-45797-0_4
Print ISBN: 978-3-319-45795-6
Online ISBN: 978-3-319-45797-0
eBook Packages: Computer Science (R0)