Dr. Hadoop: an infinite scalable metadata management for Hadoop—How the baby elephant becomes immortal
In this exabyte-scale era, data grows at an exponential rate, which in turn generates a massive amount of metadata in the file system. Hadoop is the most widely used framework for dealing with big data. Because of this huge growth in metadata, however, many researchers have questioned the efficiency of Hadoop. It is therefore essential to create an efficient and scalable metadata management scheme for Hadoop. Hash-based mapping and subtree partitioning are both suitable for distributed metadata management, but each has a drawback. Subtree partitioning does not distribute the workload uniformly among the metadata servers, so metadata must be migrated to keep the load roughly balanced. Hash-based mapping distributes the load uniformly among NameNodes, which are the metadata servers of Hadoop, but it constrains metadata locality. In this paper, we present a circular metadata management mechanism named dynamic circular metadata splitting (DCMS). DCMS preserves metadata locality using consistent hashing and locality-preserving hashing, keeps replicated metadata for reliability, and dynamically distributes metadata among the NameNodes to keep the load balanced. The NameNode is the centralized heart of Hadoop: it keeps the directory tree of all files, so its failure is a single point of failure (SPOF) for the whole system. DCMS removes Hadoop's SPOF and provides efficient and scalable metadata management. The new framework is named 'Dr. Hadoop' after the names of the authors.
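The idea of placing metadata on a ring of NameNodes can be illustrated with a minimal Python sketch. It is not the authors' DCMS implementation: the class `MetadataRing`, the use of SHA-1 as the ring hash, hashing the parent directory as a simple stand-in for locality-preserving hashing, and storing replicas on the next NameNodes clockwise are all assumptions made for illustration.

```python
import bisect
import hashlib
import posixpath


def _hash(key: str) -> int:
    """Map a string to a point on the ring (SHA-1 is an assumed choice)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class MetadataRing:
    """Hypothetical consistent-hash ring of NameNodes (illustration only)."""

    def __init__(self, namenodes, replicas=3):
        self.replicas = replicas
        # Each NameNode sits at the ring position given by the hash of its name.
        self._ring = sorted((_hash(n), n) for n in namenodes)

    def add(self, namenode):
        bisect.insort(self._ring, (_hash(namenode), namenode))

    def remove(self, namenode):
        self._ring.remove((_hash(namenode), namenode))

    def locate(self, path):
        """Return the NameNodes holding metadata for `path`, primary first."""
        # Assumption: hashing the parent directory keeps files of one
        # directory on the same NameNodes, a crude form of locality.
        key = _hash(posixpath.dirname(path))
        points = [p for p, _ in self._ring]
        # First NameNode clockwise from the key, wrapping around the ring.
        start = bisect.bisect_left(points, key) % len(self._ring)
        count = min(self.replicas, len(self._ring))
        # Replicas go to the following NameNodes on the ring.
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = MetadataRing(["nn1", "nn2", "nn3", "nn4"])
owners_a = ring.locate("/user/alice/a.txt")
owners_b = ring.locate("/user/alice/b.txt")
print(owners_a == owners_b)  # files in the same directory share NameNodes
```

In this sketch, adding or removing a NameNode only changes the placement of keys adjacent to it on the ring, which is the property that makes consistent hashing attractive for rebalancing metadata without migrating everything.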
Keywords: Hadoop, NameNode, Metadata, Locality-preserving hashing, Consistent hashing
The authors would like to thank Data Science and Analytic Lab of NIT Silchar for the research.