Abstract
Apache Hadoop, a representative large-scale data management technology, is an open-source framework for processing a variety of data such as SNS, medical, weather, and IoT data. Hadoop consists largely of HDFS, MapReduce, and YARN. Among these, we focus on improving the metadata management scheme of HDFS, the component responsible for storing and managing big data. We note that the current HDFS suffers from many problems in system utilization because of its file-based metadata management. To solve these problems, we propose a novel RDBMS-based metadata management scheme that improves the functional aspects of HDFS. Through an analysis of the latest HDFS, we first present five problems caused by its metadata management and derive three requirements for resolving them: robustness, availability, and scalability. We then design the overall architecture of an advanced HDFS, A-HDFS, that satisfies these requirements. In particular, we define functional modules according to HDFS operations and present a detailed design strategy for adding or modifying the individual components in the corresponding modules. Finally, we implement the proposed A-HDFS, validate its correctness through experimental evaluation, and show that it satisfies all the requirements. The proposed A-HDFS significantly enhances the HDFS metadata management scheme and, as a result, improves the stability, availability, and scalability of the entire system. Thus, the improved distributed file system based on A-HDFS can be exploited in various fields, and we expect more applications to be actively developed on top of it.
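To make the idea of table-based metadata concrete, the following is a minimal sketch of a file system namespace kept in a relational table instead of ad hoc files. It is an illustration under stated assumptions, not the A-HDFS schema or implementation: the `inodes` table, its columns, and the helper functions `open_namespace`, `lookup`, and `create` are all hypothetical, and SQLite stands in for a full RDBMS only to keep the example self-contained.

```python
import sqlite3

def open_namespace(path=":memory:"):
    """Open a database holding a hypothetical inode table (not the A-HDFS schema)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS inodes ("
        " id INTEGER PRIMARY KEY,"
        " parent_id INTEGER REFERENCES inodes(id),"
        " name TEXT NOT NULL,"
        " is_dir INTEGER NOT NULL,"
        " UNIQUE (parent_id, name))"
    )
    # The root directory always has id 1 and no parent.
    db.execute(
        "INSERT OR IGNORE INTO inodes (id, parent_id, name, is_dir) "
        "VALUES (1, NULL, '', 1)"
    )
    db.commit()
    return db

def lookup(db, path):
    """Resolve an absolute path to an inode id, or return None if it does not exist."""
    node = 1  # start at the root
    for part in [p for p in path.split("/") if p]:
        row = db.execute(
            "SELECT id FROM inodes WHERE parent_id = ? AND name = ?",
            (node, part),
        ).fetchone()
        if row is None:
            return None
        node = row[0]
    return node

def create(db, path, is_dir=False):
    """Create a file or directory entry; the insert runs in one transaction."""
    parent_path, _, name = path.rstrip("/").rpartition("/")
    parent = lookup(db, parent_path)
    if parent is None:
        raise FileNotFoundError(parent_path)
    with db:  # commits on success, rolls back if the insert fails
        cur = db.execute(
            "INSERT INTO inodes (parent_id, name, is_dir) VALUES (?, ?, ?)",
            (parent, name, int(is_dir)),
        )
    return cur.lastrowid
```

In this sketch, the uniqueness constraint on `(parent_id, name)` and the transactional insert are handled by the database rather than by application code, which is the kind of property a table-based design can delegate to the RDBMS.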
Notes
In the federation mode with multiple NameNodes [37], each NameNode is still responsible for a large number of DataNodes. Thus, even in federation mode, all accesses to data in a given set of DataNodes still go through a NameNode.
References
Apache Hadoop. http://hadoop.apache.org. Accessed 26 Mar 2017
Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O’Malley O, Radia S, Reed B, Baldeschwieler E (2013) Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, Santa Clara, Article No. 5
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies, Lake Tahoe, pp 1–10
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating Systems Design and Implementation, San Francisco, pp 137–149
Ghemawat S, Gobioff H, Leung S (2003) The Google file system. In: Proceedings of the 19th ACM Symposium on Operating Systems Principles, Lake George, pp 29–43
Kohl J, Neuman C (1993) The Kerberos Network Authentication Service (V5), RFC 1510
Dean J, Ghemawat S (2010) MapReduce: a flexible data processing tool. Commun ACM 53(1):72–77
Won HS, Nguyen MC (2015) Multitenant Hadoop across geographically distributed data centers. Strata + Hadoop World, Singapore, Oral Presentation
Elmasri R, Navathe SB (2015) Fundamentals of database systems, 6th edn. Pearson
Won HS, Nguyen MC, Gil MS, Moon YS (2015) Advanced resource management with access control for multitenant Hadoop. J Commun Netw 17(6):592–601
IBM Open Platform with Apache Hadoop. http://www-03.ibm.com/software/products/en/ibm-open-platform-with-apache-hadoop. Accessed 26 Mar 2017
Apache Hadoop on Amazon EMR. https://aws.amazon.com/elasticmapreduce/details/hadoop. Accessed 26 Mar 2017
Cloudera Enterprise with Apache Hadoop. http://www.cloudera.com/products/apache-hadoop.html. Accessed 26 Mar 2017
Hortonworks Data Platform with Apache Hadoop. http://hortonworks.com/hdp. Accessed 26 Mar 2017
Apache Hadoop for the MapR Converged Data Platform. https://www.mapr.com/products/mapr-distribution-including-apache-hadoop. Accessed 26 Mar 2017
Cassandra. http://cassandra.apache.org. Accessed 26 Mar 2017
Ceph. http://ceph.com. Accessed 26 Mar 2017
Lustre. http://lustre.org. Accessed 26 Mar 2017
OneFS. http://www.emc.com/en-us/storage/isilon/onefs-operating-system.htm. Accessed 26 Mar 2017
Tracey D, Sreenan C (2013) A holistic architecture for the internet of things, sensing services and big data. In: Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Delft, pp 546–553
Anderson JW, Kennedy KE, Ngo LB, Luckow A, Apon AW (2014) Synthetic data generation for the internet of things. In: Proceedings of 2014 IEEE International Conference on Big Data, Washington, DC, pp 171–176
Hromic H, Phuoc DL, Serrano M, Antonic A, Zarko IP, Hayes C, Decker S (2015) Real time analysis of sensor data for the internet of things by means of clustering and event processing. In: Proceedings of 2015 IEEE International Conference on Communications, London, pp 685–691
Rathore MM, Ahmad A, Paul A (2015) The internet of things based medical emergency management using Hadoop ecosystem. In: Proceedings of IEEE Sensors, Busan, pp 1–4
White T (2015) Hadoop: The Definitive Guide, 4th edn. O'Reilly Media
Liu X, Han J, Zhong Y, Han C, He X (2009) Implementing WebGIS on Hadoop: a case study of improving small file I/O performance on HDFS. In: Proceedings of 2009 IEEE International Conference on Cluster Computing and Workshops, New Orleans, pp 1–8
Zhang J, Wu G, Hu X, Wu X (2012) A distributed cache for Hadoop distributed file system in real-time cloud services. In: Proceedings of the 13th ACM/IEEE International Conference on Grid Computing, Beijing, pp 12–21
Lu X, Islam NS, Wasi-ur-Rahman M, Jose J, Subramoni H, Wang H, Panda DK (2013) High performance design of Hadoop RPC with RDMA over InfiniBand. In: Proceedings of the 42nd International Conference on Parallel Processing, Lyon, pp 641–650
He H, Du Z, Zhang W, Chen A (2016) Optimization strategy of Hadoop small file storage for big data in healthcare. J Supercomput 72(10):3696–3707
Tanenbaum AS (1992) Modern Operating Systems. Prentice-Hall, Upper Saddle River
Rabkin A, Katz RH (2013) How Hadoop Clusters Break. IEEE Softw 30(4):88–94
Cohen JC, Acharya S (2014) Towards a trusted HDFS storage platform: mitigating threats to Hadoop infrastructures using hardware-accelerated encryption with TPM-rooted key protection. J Inf Secur Appl 19(3):224–244
Borthakur D, Gray J, Sarma JS, Muthukkaruppan K, Spiegelberg N, Kuang H, Ranganathan K, Molkov D, Menon A, Rash S, Schmidt R (2011) Apache Hadoop goes realtime at Facebook. In: Proceedings of International Conference on Management of Data, ACM SIGMOD, Athens, pp 1071–1080
Hua X, Wu H, Li Z, Ren S (2014) Enhancing throughput of the Hadoop distributed file system for interaction-intensive tasks. J Parallel Distrib Comput 74(8):2770–2779
Neuman BC, Ts'o T (1994) Kerberos: an authentication service for computer networks. IEEE Commun Mag 32(9):33–38
Hairong Kuang synthetic load generator for NameNode testing. https://issues.apache.org/jira/browse/HADOOP-3992. Accessed 26 Mar 2017
Mukund Madhugiri NNBench for NameNode testing. https://issues.apache.org/jira/browse/HADOOP-2000. Accessed 26 Mar 2017
HDFS Federation. https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/Federation.html. Accessed 26 Mar 2017
HDFS Architecture. https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html. Accessed 26 Mar 2017
Organizations Powered by Apache Hadoop. https://wiki.apache.org/hadoop/PoweredBy. Accessed 26 Mar 2017
Held G (2010) A practical guide to content delivery networks, 2nd edn. CRC Press, Boca Raton
Acknowledgements
This work was partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIP) (No. 2016R1A2B4015929). This work was also partly supported by the ICT R&D program of MSIP/IITP [B0101-16-0233, Smart Networking Core Technology Development] and [R7117-16-0214, Development of an Intelligent Sampling and Filtering Techniques for Purifying Data Streams].
Cite this article
Won, H., Nguyen, M.C., Gil, MS. et al. Moving metadata from ad hoc files to database tables for robust, highly available, and scalable HDFS. J Supercomput 73, 2657–2681 (2017). https://doi.org/10.1007/s11227-016-1949-7
DOI: https://doi.org/10.1007/s11227-016-1949-7