Hadoop and MapReduce

Algorithms for Data Science

Abstract

In this chapter we consider situations in which a single host computer is inadequate because the data volume or processing demand exceeds the capacity of the host. A popular solution distributes the data and computations across a network of computers or a short-lived network created for the task (a cluster). In this scenario, each cluster node (a computing unit) stores and processes a subset of the data. The results are merged into one once all nodes have completed their tasks. For this solution to succeed, the computational algorithm must conform to a certain structure and the cluster execution must be managed. The Hadoop environment and the MapReduce programming design provide the management and algorithmic structure. Hadoop is a collection of software and services that builds the cluster, distributes the data across the cluster, and controls the data processing algorithms and the transmission of results. The MapReduce programming design ensures scalability, and scalability ensures that the results are independent of the cluster configuration. The reader is guided through an introductory application of Hadoop and MapReduce after a discussion of the essential components.
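The structure described in the abstract can be illustrated with a minimal single-process sketch. It is not Hadoop code and does not use any Hadoop API: the "nodes" are simulated by splitting a list, and the record layout (state, payment), the sample values, and the function names `mapper`, `reducer`, `partition`, and `run_job` are all assumptions made for illustration.

```python
from collections import defaultdict


def mapper(record):
    """Emit one (key, value) pair per record; here (state, payment)."""
    state, payment = record
    yield state, payment


def reducer(key, values):
    """Merge all values that share a key into a single total."""
    return key, sum(values)


def partition(records, n_nodes):
    """Split the records round-robin across n_nodes simulated nodes."""
    return [records[i::n_nodes] for i in range(n_nodes)]


def run_job(records, n_nodes):
    # Map phase: each simulated node processes only its own subset.
    mapped = []
    for subset in partition(records, n_nodes):
        for record in subset:
            mapped.extend(mapper(record))

    # Shuffle phase: group the emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce phase: merge each group into one result per key.
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical records: (state abbreviation, payment in whole dollars).
records = [("MT", 100), ("ID", 250), ("MT", 50), ("WY", 75), ("ID", 25)]

one_node = run_job(records, n_nodes=1)
three_nodes = run_job(records, n_nodes=3)

# The merged totals do not depend on how many nodes shared the work.
assert one_node == three_nodes
print(one_node)
```

Because the reducer only sums values grouped by key, the totals are the same for any number of simulated nodes, which is the sense in which the results are independent of the cluster configuration.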

Notes

  1. Notable commercial vendors are Cloudera, Hortonworks, and MapR.

  2. A cluster created and controlled by HDFS may consist of thousands of nodes.

  3. Specifically, the 50 U.S. states, District of Columbia, and Puerto Rico.

  4. Medicaid expenditures were $495.8 billion.

  5. There are minor formatting differences between the 2014 data file and the data files for 2012 and 2013.

  6. Recall that p% of a distribution is smaller in value than the pth percentile and (100 − p)% of the distribution is larger. Thus, the median is the 50th percentile.

  7. The percentile function will compute percentiles from any object that can be converted to an array by NumPy.

  8. A conditional statement such as if m % 3 == 0 may be used to select every third record for processing.

  9. The reader may be able to obtain an academic discount or free trial.

  10. The default version was 2.7.2 at the time of this writing.

  11. At the time of this writing, Hue, Pig and Hive are included in the default configuration. These programs are not necessary.
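Notes 6 through 8 can be illustrated together with a short sketch. The payment values and the helper name `every_third` are made up for illustration; only `numpy.percentile` is an actual library function, and it is assumed here that the counter m in note 8 is a zero-based record index.

```python
import numpy as np


def every_third(records):
    """Keep the records whose zero-based index m satisfies m % 3 == 0."""
    return [record for m, record in enumerate(records) if m % 3 == 0]


# Hypothetical payment amounts, one per record.
payments = [12.0, 7.5, 30.0, 18.0, 9.0, 22.5, 15.0, 40.0, 5.0, 27.0]

# Indices 0, 3, 6 and 9 are retained.
sample = every_third(payments)

# The median is the 50th percentile (note 6).
median = np.percentile(sample, 50)

# A plain list or a tuple works equally well, since NumPy can convert
# either one to an array (note 7).
median_from_tuple = np.percentile(tuple(sample), 50)

assert median == median_from_tuple
print(sample, median)
```

Sampling every third record in this way reduces the volume of data to roughly one third, which can be convenient when testing a program before running it on the full data set.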

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Steele, B., Chandler, J., Reddy, S. (2016). Hadoop and MapReduce. In: Algorithms for Data Science. Springer, Cham. https://doi.org/10.1007/978-3-319-45797-0_4

  • DOI: https://doi.org/10.1007/978-3-319-45797-0_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45795-6

  • Online ISBN: 978-3-319-45797-0

  • eBook Packages: Computer Science, Computer Science (R0)
