A Deep Dive into the Hadoop World to Explore Its Various Performances


Part of the book series: Studies in Big Data ((SBD,volume 17))

Abstract

The volume of data handled by today's enterprises has grown at an exponential rate over the last few years, and with it the need to process and analyze large datasets. Hadoop, an open-source framework from Apache, is now widely used to store and analyze large-scale datasets. To store and manage data across its cluster, Hadoop provides a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is written entirely in Java and is designed to store Big Data reliably and to stream it to user applications at high throughput. Hadoop has been adopted in recent years by well-known organizations such as Yahoo and Facebook and by various online shopping vendors. Meanwhile, other experiments in data-intensive computation have tried to parallelize data processing, but none has achieved the desired performance. Hadoop, with its MapReduce parallel data-processing capability, can meet these goals efficiently. This chapter first gives a detailed overview of HDFS. It then evaluates Hadoop's performance under various factors in different environments, showing how files smaller than the block size affect Hadoop's read/write (R/W) performance and how a job's execution time depends on the block size and the number of reducers. The chapter concludes with the practical challenges Hadoop currently faces and the scope for future work.
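
As a rough illustration of why files smaller than the block size matter (this sketch is not taken from the chapter), the snippet below counts how many blocks the same data volume occupies when stored as one large file versus many small files. Every non-empty file takes at least one block, so many small files mean many more blocks to track. The 128 MB block size and the helper names `block_count` and `total_blocks` are assumptions chosen for the example.

```python
# Illustrative sketch only: block counts for one large file vs. many small files.
# BLOCK_SIZE and the helper names are assumptions, not values from the chapter.

MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # assumed block size; configurable in a real cluster


def block_count(file_size: int, block_size: int) -> int:
    """Blocks occupied by one file: ceiling of size / block size (0 for an empty file)."""
    if file_size <= 0:
        return 0
    return -(-file_size // block_size)  # integer ceiling division


def total_blocks(file_sizes, block_size: int) -> int:
    """Total blocks occupied by a collection of files."""
    return sum(block_count(size, block_size) for size in file_sizes)


one_large = [1024 * MB]        # a single 1 GB file
many_small = [1 * MB] * 1024   # the same 1 GB split into 1 MB files

print(total_blocks(one_large, BLOCK_SIZE))   # 8
print(total_blocks(many_small, BLOCK_SIZE))  # 1024
```

Under these assumptions the same 1 GB of data occupies 8 blocks as a single file but 1024 blocks as 1 MB files, which hints at the metadata and per-file overhead the chapter's small-file experiments examine.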

Acknowledgments

The research is supported by Data Science & Analytic Lab of NIT Silchar. The authors would also like to thank the anonymous reviewers for their valuable and constructive comments on improving the chapter.

Author information

Corresponding author

Correspondence to Dipayan Dev.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Dev, D., Patgiri, R. (2016). A Deep Dive into the Hadoop World to Explore Its Various Performances. In: Mishra, B., Dehuri, S., Kim, E., Wang, GN. (eds) Techniques and Environments for Big Data Analysis. Studies in Big Data, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-319-27520-8_3

  • DOI: https://doi.org/10.1007/978-3-319-27520-8_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27518-5

  • Online ISBN: 978-3-319-27520-8

  • eBook Packages: Engineering, Engineering (R0)
