Moving metadata from ad hoc files to database tables for robust, highly available, and scalable HDFS

Won, Heesun; Nguyen, Minh Chau; Gil, Myeong-Seon; Moon, Yang-Sae; Whang, Kyu-Young

doi:10.1007/s11227-016-1949-7

Published: 27 March 2017

Volume 73, pages 2657–2681, (2017)

The Journal of Supercomputing

Heesun Won¹,²,
Minh Chau Nguyen²,
Myeong-Seon Gil³,
Yang-Sae Moon³ &
…
Kyu-Young Whang¹

Abstract

As a representative large-scale data management technology, Apache Hadoop is an open-source framework for processing a variety of data such as SNS, medical, weather, and IoT data. Hadoop largely consists of HDFS, MapReduce, and YARN. Among them, we focus on improving the HDFS metadata management scheme responsible for storing and managing big data. We note that the current HDFS incurs many problems in system utilization due to its file-based metadata management. To solve these problems, we propose a novel metadata management scheme based on RDBMS for improving the functional aspects of HDFS. Through analysis of the latest HDFS, we first present five problems caused by its metadata management and derive three requirements of robustness, availability, and scalability for resolving these problems. We then design an overall architecture of the advanced HDFS, A-HDFS, which satisfies these requirements. In particular, we define functional modules according to HDFS operations and also present the detailed design strategy for adding or modifying the individual components in the corresponding modules. Finally, through implementation of the proposed A-HDFS, we validate its correctness by experimental evaluation and also show that A-HDFS satisfies all the requirements. The proposed A-HDFS significantly enhances the HDFS metadata management scheme and, as a result, ensures that the entire system improves its stability, availability, and scalability. Thus, we can exploit the improved distributed file system based on A-HDFS for various fields and, in addition, we can expect more applications to be actively developed.
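
The core idea of the abstract, keeping file system metadata in database tables rather than in ad hoc files, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `inode` table, its columns, and the helper functions are hypothetical and are not taken from A-HDFS, and SQLite merely stands in for the RDBMS. The sketch shows how a uniqueness constraint and transactions let the database, rather than hand-written file handling, keep the namespace consistent.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a store with a hypothetical namespace table (illustrative schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS inode (
               id        INTEGER PRIMARY KEY,
               parent_id INTEGER REFERENCES inode(id),
               name      TEXT NOT NULL,
               is_dir    INTEGER NOT NULL,
               UNIQUE (parent_id, name)  -- no duplicate names in one directory
           )"""
    )
    return conn


def add_entry(conn, parent_id, name, is_dir):
    """Insert a file or directory entry and return its id."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO inode (parent_id, name, is_dir) VALUES (?, ?, ?)",
            (parent_id, name, int(is_dir)),
        )
    return cur.lastrowid


def list_dir(conn, parent_id):
    """Return the names directly under a directory, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM inode WHERE parent_id = ? ORDER BY name",
        (parent_id,),
    )
    return [row[0] for row in rows]


def rename(conn, entry_id, new_parent_id, new_name):
    """Move or rename an entry as a single transaction."""
    with conn:
        conn.execute(
            "UPDATE inode SET parent_id = ?, name = ? WHERE id = ?",
            (new_parent_id, new_name, entry_id),
        )
```

In this sketch a rename is a single `UPDATE` inside a transaction, so a failure partway through leaves the namespace unchanged; this is the kind of robustness a database-backed design gets from the RDBMS itself.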

Notes

In a federation mode of multiple NameNodes [37], each NameNode is still responsible for a large number of DataNodes. Thus, all data accesses in a set of DataNodes are also done through a NameNode even in such a federation mode.

References

Apache Hadoop. http://hadoop.apache.org. Accessed 26 Mar 2017
Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O’Malley O, Radia S, Reed B, Baldeschwieler E (2013) Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, Santa Clara, Article No. 5
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies, Lake Tahoe, pp 1–10
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating Systems Design and Implementation, San Francisco, pp 137–149
Ghemawat S, Gobioff H, Leung S (2003) The Google file system. In: Proceedings of the 9th ACM Symposium on Operating Systems Principles, Lake George, pp 29–43
Kohl J, Neuman C (1993) The Kerberos Network Authentication Service (V5), RFC 1510
Dean J, Ghemawat S (2010) MapReduce: a flexible data processing tool. Commun ACM 53(1):72–77
Won HS, Nguyen MC (2015) Multitenant Hadoop across geographically distributed data centers. Strata + Hadoop World, Singapore, Oral Presentation
Elmasri R, Navathe SB (2015) Fundamentals of database systems, 6th edn. Pearson
Won HS, Nguyen MC, Gil MS, Moon YS (2015) Advanced resource management with access control for multitenant Hadoop. J Commun Netw 17(6):592–601
IBM Open Platform with Apache Hadoop. http://www-03.ibm.com/software/products/en/ibm-open-platform-with-apache-hadoop. Accessed 26 Mar 2017
Apache Hadoop on Amazon EMR. https://aws.amazon.com/elasticmapreduce/details/hadoop. Accessed 26 Mar 2017
Cloudera Enterprise with Apache Hadoop. http://www.cloudera.com/products/apache-hadoop.html. Accessed 26 Mar 2017
Hortonworks Data Platform with Apache Hadoop. http://hortonworks.com/hdp. Accessed 26 Mar 2017
Apache Hadoop for the MapR Converged Data Platform. https://www.mapr.com/products/mapr-distribution-including-apache-hadoop. Accessed 26 Mar 2017
Cassandra. http://cassandra.apache.org. Accessed 26 Mar 2017
Ceph. http://ceph.com. Accessed 26 Mar 2017
Lustre. http://lustre.org. Accessed 26 Mar 2017
OneFS. http://www.emc.com/en-us/storage/isilon/onefs-operating-system.htm. Accessed 26 Mar 2017
Tracey D, Sreenan C (2013) A holistic architecture for the internet of things, sensing services and big data. In: Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Delft, pp 546–553
Anderson JW, Kennedy KE, Ngo LB, Luckow A, Apon AW (2014) Synthetic data generation for the internet of things. In: Proceedings of 2014 IEEE International Conference on Big Data, Washington, DC, pp 171–176
Hromic H, Phuoc DL, Serrano M, Antonic A, Zarko IP, Hayes C, Decker S (2015) Real time analysis of sensor data for the internet of things by means of clustering and event processing. In: Proceedings of 2015 IEEE International Conference on Communications, London, pp 685–691
Rathore MM, Ahmad A, Paul A (2015) The internet of things based medical emergency management using Hadoop ecosystem. In: Proceedings of IEEE Sensors, Busan, pp 1–4
White T (2015) Hadoop: The Definitive Guide, 4th edn. O'Reilly Media
Liu X, Han J, Zhong Y, Han C, He X (2009) Implementing WebGIS on Hadoop: a case study of improving small file I/O performance on HDFS. In: Proceedings of 2009 IEEE International Conference on Cluster Computing and Workshops, New Orleans, pp 1–8
Zhang J, Wu G, Hu X, Wu X (2012) A distributed cache for Hadoop distributed file system in real-time cloud services. In: Proceedings of the 13th ACM/IEEE International Conference on Grid Computing, Beijing, pp 12–21
Lu X, Islam NS, Wasi-ur-Rahman M, Jose J, Subramoni H, Wang H, Panda DK (2013) High performance design of Hadoop RPC with RDMA over InfiniBand. In: Proceedings of the 42nd International Conference on Parallel Processing, Lyon, pp 641–650
He H, Du Z, Zhang W, Chen A (2016) Optimization strategy of Hadoop small file storage for big data in healthcare. J Supercomput 72(10):3696–3707
Tanenbaum AS (1992) Modern Operating Systems. Prentice-Hall, Upper Saddle River
Rabkin A, Katz RH (2013) How Hadoop Clusters Break. IEEE Softw 30(4):88–94
Cohen JC, Acharya S (2014) Towards a trusted HDFS storage platform: mitigating threats to Hadoop infrastructures using hardware-accelerated encryption with TPM-rooted key protection. J Inf Secur Appl 19(3):224–244
Borthakur D, Gray J, Sarma JS, Muthukkaruppan K, Spiegelberg N, Kuang H, Ranganathan K, Molkov D, Menon A, Rash S, Schmidt R (2011) Apache Hadoop goes realtime at Facebook. In: Proceedings of International Conference on Management of Data, ACM SIGMOD, Athens, pp 1071–1080
Hua X, Wu H, Li Z, Ren S (2014) Enhancing throughput of the Hadoop distributed file system for interaction-intensive tasks. J Parallel Distrib Comput 74(8):2770–2779
Neuman BC, Ts'o T (1994) Kerberos: an authentication service for computer networks. IEEE Commun Mag 32(9):33–38
Hairong Kuang synthetic load generator for NameNode testing. https://issues.apache.org/jira/browse/HADOOP-3992. Accessed 26 Mar 2017
Mukund Madhugiri NNBench for NameNode testing. https://issues.apache.org/jira/browse/HADOOP-2000. Accessed 26 Mar 2017
HDFS Federation. https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/Federation.html. Accessed 26 Mar 2017
HDFS Architecture. https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html. Accessed 26 Mar 2017
Organizations Powered by Apache Hadoop. https://wiki.apache.org/hadoop/PoweredBy. Accessed 26 Mar 2017
Held G (2010) A practical guide to content delivery network, 2nd edn. CRC Press, Boca Raton

Acknowledgements

This work was partly supported by the National Research Foundation of Korea (NRF) grant funded by Korean Government (MSIP) (No. 2016R1A2B4015929). This work was also partly supported by ICT R&D program of MSIP/IITP [B0101-16-0233, Smart Networking Core Technology Development] and [R7117-16-0214, Development of an Intelligent Sampling and Filtering Techniques for Purifying Data Streams].

Author information

Authors and Affiliations

School of Computing, KAIST, Daejeon, Korea
Heesun Won & Kyu-Young Whang
BigData Intelligence Research Department, ETRI, Daejeon, Korea
Heesun Won & Minh Chau Nguyen
Department of Computer Science, Kangwon National University, Chuncheon, Korea
Myeong-Seon Gil & Yang-Sae Moon

Corresponding author

Correspondence to Kyu-Young Whang.

About this article

Cite this article

Won, H., Nguyen, M.C., Gil, MS. et al. Moving metadata from ad hoc files to database tables for robust, highly available, and scalable HDFS. J Supercomput 73, 2657–2681 (2017). https://doi.org/10.1007/s11227-016-1949-7

Published: 27 March 2017
Issue Date: June 2017
DOI: https://doi.org/10.1007/s11227-016-1949-7
