Abstract
The Hadoop Distributed File System (HDFS) is widely used as the underlying storage engine by many Big Data processing frameworks, such as Hadoop MapReduce, HBase, Hive, and Spark. This makes the performance of HDFS a primary concern in the Big Data community. Recent studies have shown that HDFS cannot fully exploit the performance benefits of RDMA-enabled high-performance interconnects like InfiniBand. To address this, RDMA-enabled HDFS designs have been proposed in the literature, and they show better performance on RDMA-enabled networks. However, these designs are tightly integrated with specific versions of the Apache Hadoop distribution and cannot easily be used with other Hadoop distributions. In this paper, we propose an efficient RDMA-based plugin for HDFS that can be easily integrated with various Hadoop distributions and versions, such as Apache Hadoop 2.5 and 2.6, Hortonworks HDP, and Cloudera CDH. Performance evaluations show that our plugin delivers the performance expected of the hybrid RDMA-enhanced design, up to 3.7x improvement in TestDFSIO write, to all these distributions. We also demonstrate that our RDMA-based plugin can achieve up to 4.6x improvement over the Mellanox R4H (RDMA for HDFS) plugin.
This research is supported in part by National Science Foundation grants #CNS-1419123 and #IIS-1447804.
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Bhat, A., Islam, N.S., Lu, X., Wasi-ur-Rahman, M., Shankar, D., Panda, D.K. (2016). A Plugin-Based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS. In: Zhan, J., Han, R., Zicari, R. (eds) Big Data Benchmarks, Performance Optimization, and Emerging Hardware. BPOE 2015. Lecture Notes in Computer Science, vol 9495. Springer, Cham. https://doi.org/10.1007/978-3-319-29006-5_10
DOI: https://doi.org/10.1007/978-3-319-29006-5_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29005-8
Online ISBN: 978-3-319-29006-5
eBook Packages: Computer Science (R0)