A Deep Dive into the Hadoop World to Explore Its Various Performances

Dev, Dipayan; Patgiri, Ripon

doi:10.1007/978-3-319-27520-8_3

Dipayan Dev & Ripon Patgiri

Part of the book series: Studies in Big Data (SBD, volume 17)

Abstract

The size of the data used in today's enterprises has been growing at exponential rates over the last few years. Simultaneously, the need to process and analyze large volumes of data has also increased. To handle and analyze large-scale datasets, Hadoop, an open-source framework from Apache, is now widely used. To manage and store all the resources across its cluster, Hadoop has a distributed file system called the Hadoop Distributed File System (HDFS). HDFS is written entirely in Java and is designed so that it can store Big Data reliably and stream it at high speed to user applications. Hadoop has been widely adopted in recent years by popular organizations such as Yahoo and Facebook and by various online shopping vendors. Meanwhile, experiments on data-intensive computation continue to seek ways to parallelize data processing, but none of them has achieved the desired performance. Hadoop, with its MapReduce parallel data processing capability, can achieve these goals efficiently. This chapter first provides a detailed overview of HDFS. It then evaluates Hadoop's performance under various factors in different environments. The chapter shows how files smaller than the block size affect Hadoop's read/write performance and how the execution time of a job depends on the block size and the number of reducers. The chapter concludes by presenting the real challenges Hadoop currently faces and the scope for future work.

Acknowledgments

The research is supported by Data Science & Analytic Lab of NIT Silchar. The authors would also like to thank the anonymous reviewers for their valuable and constructive comments on improving the chapter.

Author information

Authors and Affiliations

Department of Computer Science, NIT Silchar, Silchar, India
Dipayan Dev & Ripon Patgiri

Corresponding author

Correspondence to Dipayan Dev.

Editor information

Editors and Affiliations

School of Computer, KIIT University, Bhubaneswar, Odisha, India
Bhabani Shankar Prasad Mishra
Fakir Mohan University, Department of Information and Communicat, Odisha, India
Satchidananda Dehuri
Department of Systems Engineering, Ajou University, Suwon, Korea (Republic of)
Euiwhan Kim
Department of Industrial Engineerin, Ajou University, Suwon, Korea (Republic of)
Gi-Name Wang

About this chapter

Cite this chapter

Dev, D., Patgiri, R. (2016). A Deep Dive into the Hadoop World to Explore Its Various Performances. In: Mishra, B., Dehuri, S., Kim, E., Wang, GN. (eds) Techniques and Environments for Big Data Analysis. Studies in Big Data, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-319-27520-8_3

DOI: https://doi.org/10.1007/978-3-319-27520-8_3
Published: 06 February 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27518-5
Online ISBN: 978-3-319-27520-8
eBook Packages: Engineering, Engineering (R0)
