Abstract
The volume of data used in today's enterprises has grown at an exponential rate over the last few years, and the need to process and analyze it has grown with it. Hadoop, an open-source Apache framework, is now widely used to handle and analyze large-scale datasets. To store and manage data across its cluster, Hadoop provides a distributed file system called the Hadoop Distributed File System (HDFS). HDFS is written entirely in Java and is designed to store Big Data reliably and to stream it at high bandwidth to user applications. Hadoop has been widely adopted in recent years by organizations such as Yahoo and Facebook and by various online shopping vendors. Meanwhile, experiments on data-intensive computation continue to seek ways to parallelize data processing, but none has achieved the desired performance. Hadoop, with its MapReduce parallel data processing capability, can meet these goals efficiently. This chapter first provides a detailed overview of HDFS. It then evaluates Hadoop's performance against various factors in different environments. The chapter shows how files smaller than the block size affect Hadoop's read/write performance and how the execution time of a job depends on the block size and the number of reducers. The chapter concludes by outlining the practical challenges Hadoop currently faces and the scope for future work.
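As a rough illustration of why files smaller than the block size matter, the following minimal sketch counts how many HDFS blocks the same amount of data occupies when stored as one large file versus many small files. The 64 MiB block size, the file sizes, and the helper names are assumptions chosen for illustration, not values taken from the chapter. Each block is tracked as a separate entry by the NameNode and, with default input splitting, typically becomes its own map task, so the block count is a reasonable proxy for metadata and scheduling overhead.

```python
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB  # assumed block size, not taken from the chapter


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks one file occupies (integer ceiling division)."""
    return (file_size + block_size - 1) // block_size


def total_blocks(file_sizes, block_size: int = BLOCK_SIZE) -> int:
    """Total number of blocks across a collection of files."""
    return sum(block_count(size, block_size) for size in file_sizes)


# Hypothetical workloads holding the same 1 GiB of data.
one_large_file = [1024 * MB]         # a single 1 GiB file
many_small_files = [1 * MB] * 1024   # 1024 files of 1 MiB each

large_blocks = total_blocks(one_large_file)
small_blocks = total_blocks(many_small_files)

print(f"one large file:   {large_blocks} blocks")
print(f"many small files: {small_blocks} blocks")
```

Under these assumptions the small-file layout needs 64 times as many blocks for the same data, which is the kind of overhead the read/write experiments in the chapter are concerned with.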
Acknowledgments
This research is supported by the Data Science & Analytic Lab of NIT Silchar. The authors would also like to thank the anonymous reviewers for their valuable and constructive comments on improving the chapter.
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Dev, D., Patgiri, R. (2016). A Deep Dive into the Hadoop World to Explore Its Various Performances. In: Mishra, B., Dehuri, S., Kim, E., Wang, GN. (eds) Techniques and Environments for Big Data Analysis. Studies in Big Data, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-319-27520-8_3
DOI: https://doi.org/10.1007/978-3-319-27520-8_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27518-5
Online ISBN: 978-3-319-27520-8
eBook Packages: Engineering, Engineering (R0)