Moving metadata from ad hoc files to database tables for robust, highly available, and scalable HDFS

The Journal of Supercomputing

Abstract

Apache Hadoop is a representative open-source framework for large-scale data management, used to process a wide variety of data such as SNS, medical, weather, and IoT data. Hadoop consists largely of HDFS, MapReduce, and YARN. Among them, we focus on improving the metadata management scheme of HDFS, the component responsible for storing and managing big data. We note that the current HDFS incurs many problems in system utilization because it manages metadata in files. To solve these problems, we propose a novel RDBMS-based metadata management scheme that improves the functional aspects of HDFS. Through an analysis of the latest HDFS, we first present five problems caused by its metadata management and derive three requirements for resolving them: robustness, availability, and scalability. We then design the overall architecture of an advanced HDFS, A-HDFS, which satisfies these requirements. In particular, we define functional modules according to HDFS operations and present a detailed design strategy for adding or modifying the individual components of each module. Finally, we implement the proposed A-HDFS, validate its correctness through experimental evaluation, and show that it satisfies all three requirements. The proposed A-HDFS significantly enhances the HDFS metadata management scheme and, as a result, improves the stability, availability, and scalability of the entire system. We can therefore exploit the improved distributed file system based on A-HDFS in various fields, and we expect more applications to be actively developed.
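To make the contrast between file-based and table-based metadata concrete, the following is a minimal sketch of file system metadata kept in relational tables. It is not the A-HDFS schema, which this page does not show. The table names (`inode`, `block`), their columns, and the helper functions are all hypothetical, and SQLite stands in for whatever RDBMS a real deployment would use. The point it illustrates is that a multi-row metadata update either commits as a whole or leaves no trace.

```python
import sqlite3


def open_metadata_store(path=":memory:"):
    """Open a metadata store; the schema below is hypothetical, not A-HDFS's."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    # One row per file or directory in the namespace.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS inode (
               id        INTEGER PRIMARY KEY,
               parent_id INTEGER REFERENCES inode(id),
               name      TEXT NOT NULL,
               is_dir    INTEGER NOT NULL,
               UNIQUE (parent_id, name))"""
    )
    # One row per data block belonging to a file.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS block (
               id        INTEGER PRIMARY KEY,
               inode_id  INTEGER NOT NULL REFERENCES inode(id),
               seq       INTEGER NOT NULL,
               num_bytes INTEGER NOT NULL CHECK (num_bytes >= 0))"""
    )
    return conn


def make_dir(conn, parent_id, name):
    """Insert a directory entry and return its id."""
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "INSERT INTO inode (parent_id, name, is_dir) VALUES (?, ?, 1)",
            (parent_id, name),
        )
    return cur.lastrowid


def create_file(conn, parent_id, name, block_sizes):
    """Insert a file and all of its blocks in a single transaction."""
    with conn:
        cur = conn.execute(
            "INSERT INTO inode (parent_id, name, is_dir) VALUES (?, ?, 0)",
            (parent_id, name),
        )
        inode_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO block (inode_id, seq, num_bytes) VALUES (?, ?, ?)",
            [(inode_id, seq, size) for seq, size in enumerate(block_sizes)],
        )
    return inode_id


def file_length(conn, inode_id):
    """Return the total size of a file as the sum of its block sizes."""
    row = conn.execute(
        "SELECT COALESCE(SUM(num_bytes), 0) FROM block WHERE inode_id = ?",
        (inode_id,),
    ).fetchone()
    return row[0]
```

In this sketch, if any block row is rejected (for example by the `CHECK` constraint), the surrounding transaction is rolled back and the file's `inode` row disappears with it, so the namespace never holds a half-created file. That all-or-nothing behaviour is the kind of robustness property a database offers without extra recovery code.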

Notes

  1. In federation mode with multiple NameNodes [37], each NameNode is still responsible for a large number of DataNodes. Thus, even in federation mode, every access to data on a given set of DataNodes still goes through a single NameNode.

References

  1. Apache Hadoop. http://hadoop.apache.org. Accessed 26 Mar 2017

  2. Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O’Malley O, Radia S, Reed B, Baldeschwieler E (2013) Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, Santa Clara, Article No. 5

  3. Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies, Lake Tahoe, pp 1–10

  4. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating Systems Design and Implementation, San Francisco, pp 137–149

  5. Ghemawat S, Gobioff H, Leung S (2003) The Google file system. In: Proceedings of the 19th ACM Symposium on Operating Systems Principles, Lake George, pp 29–43

  6. Kohl J, Neuman C (1993) The Kerberos Network Authentication Service (V5), RFC 1510

  7. Dean J, Ghemawat S (2010) MapReduce: a flexible data processing tool. Commun ACM 53(1):72–77

  8. Won HS, Nguyen MC (2015) Multitenant Hadoop across geographically distributed data centers. Strata + Hadoop World, Singapore, Oral Presentation

  9. Elmasri R, Navathe SB (2015) Fundamentals of database systems, 6th edn. Pearson

  10. Won HS, Nguyen MC, Gil MS, Moon YS (2015) Advanced resource management with access control for multitenant Hadoop. J Commun Netw 17(6):592–601

  11. IBM Open Platform with Apache Hadoop. http://www-03.ibm.com/software/products/en/ibm-open-platform-with-apache-hadoop. Accessed 26 Mar 2017

  12. Apache Hadoop on Amazon EMR. https://aws.amazon.com/elasticmapreduce/details/hadoop. Accessed 26 Mar 2017

  13. Cloudera Enterprise with Apache Hadoop. http://www.cloudera.com/products/apache-hadoop.html. Accessed 26 Mar 2017

  14. Hortonworks Data Platform with Apache Hadoop. http://hortonworks.com/hdp. Accessed 26 Mar 2017

  15. Apache Hadoop for the MapR Converged Data Platform. https://www.mapr.com/products/mapr-distribution-including-apache-hadoop. Accessed 26 Mar 2017

  16. Cassandra. http://cassandra.apache.org. Accessed 26 Mar 2017

  17. Ceph. http://ceph.com. Accessed 26 Mar 2017

  18. Lustre. http://lustre.org. Accessed 26 Mar 2017

  19. OneFS. http://www.emc.com/en-us/storage/isilon/onefs-operating-system.htm. Accessed 26 Mar 2017

  20. Tracey D, Sreenan C (2013) A holistic architecture for the internet of things, sensing services and big data. In: Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Delft, pp 546–553

  21. Anderson JW, Kennedy KE, Ngo LB, Luckow A, Apon AW (2014) Synthetic data generation for the internet of things. In: Proceedings of 2014 IEEE International Conference on Big Data, Washington, DC, pp 171–176

  22. Hromic H, Phuoc DL, Serrano M, Antonic A, Zarko IP, Hayes C, Decker S (2015) Real time analysis of sensor data for the internet of things by means of clustering and event processing. In: Proceedings of 2015 IEEE International Conference on Communications, London, pp 685–691

  23. Rathore MM, Ahmad A, Paul A (2015) The internet of things based medical emergency management using Hadoop ecosystem. In: Proceedings of IEEE Sensors, Busan, pp 1–4

  24. White T (2015) Hadoop: the definitive guide, 4th edn. O'Reilly Media

  25. Liu X, Han J, Zhong Y, Han C, He X (2009) Implementing WebGIS on Hadoop: a case study of improving small file I/O performance on HDFS. In: Proceedings of 2009 IEEE International Conference on Cluster Computing and Workshops, New Orleans, pp 1–8

  26. Zhang J, Wu G, Hu X, Wu X (2012) A distributed cache for Hadoop distributed file system in real-time cloud services. In: Proceedings of the 13th ACM/IEEE International Conference on Grid Computing, Beijing, pp 12–21

  27. Lu X, Islam NS, Wasi-ur-Rahman M, Jose J, Subramoni H, Wang H, Panda DK (2013) High performance design of Hadoop RPC with RDMA over InfiniBand. In: Proceedings of the 42nd International Conference on Parallel Processing, Lyon, pp 641–650

  28. He H, Du Z, Zhang W, Chen A (2016) Optimization strategy of Hadoop small file storage for big data in healthcare. J Supercomput 72(10):3696–3707

  29. Tanenbaum AS (1992) Modern operating systems. Prentice-Hall, Upper Saddle River

  30. Rabkin A, Katz RH (2013) How Hadoop clusters break. IEEE Softw 30(4):88–94

  31. Cohen JC, Acharya S (2014) Towards a trusted HDFS storage platform: mitigating threats to Hadoop infrastructures using hardware-accelerated encryption with TPM-rooted key protection. J Inf Secur Appl 19(3):224–244

  32. Borthakur D, Gray J, Sarma JS, Muthukkaruppan K, Spiegelberg N, Kuang H, Ranganathan K, Molkov D, Menon A, Rash S, Schmidt R (2011) Apache Hadoop goes realtime at Facebook. In: Proceedings of International Conference on Management of Data, ACM SIGMOD, Athens, pp 1071–1080

  33. Hua X, Wu H, Li Z, Ren S (2014) Enhancing throughput of the Hadoop distributed file system for interaction-intensive tasks. J Parallel Distrib Comput 74(8):2770–2779

  34. Neuman BC, Ts'o T (1994) Kerberos: an authentication service for computer networks. IEEE Commun Mag 32(9):33–38

  35. Kuang H. Synthetic load generator for NameNode testing. https://issues.apache.org/jira/browse/HADOOP-3992. Accessed 26 Mar 2017

  36. Madhugiri M. NNBench for NameNode testing. https://issues.apache.org/jira/browse/HADOOP-2000. Accessed 26 Mar 2017

  37. HDFS Federation. https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/Federation.html. Accessed 26 Mar 2017

  38. HDFS Architecture. https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html. Accessed 26 Mar 2017

  39. Organizations Powered by Apache Hadoop. https://wiki.apache.org/hadoop/PoweredBy. Accessed 26 Mar 2017

  40. Held G (2010) A practical guide to content delivery networks, 2nd edn. CRC Press, Boca Raton

Acknowledgements

This work was partly supported by a National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIP) (No. 2016R1A2B4015929). This work was also partly supported by the ICT R&D program of MSIP/IITP [B0101-16-0233, Smart Networking Core Technology Development] and [R7117-16-0214, Development of an Intelligent Sampling and Filtering Techniques for Purifying Data Streams].

Author information

Corresponding author

Correspondence to Kyu-Young Whang.

About this article

Cite this article

Won, H., Nguyen, M.C., Gil, MS. et al. Moving metadata from ad hoc files to database tables for robust, highly available, and scalable HDFS. J Supercomput 73, 2657–2681 (2017). https://doi.org/10.1007/s11227-016-1949-7

  • DOI: https://doi.org/10.1007/s11227-016-1949-7