Abstract
Hadoop has been widely used in clusters to build scalable, high-performance distributed file systems. However, the Hadoop Distributed File System (HDFS) is designed for managing large files. In small-file applications, metadata requests flood the network and consume most of the memory in the NameNode, which sharply degrades its performance. As a result, many web applications do not benefit from clusters that rely on a centralized metadata node, such as Hadoop. In this paper, we compare our Fat-Btree based data access method, which removes the central node from the cluster, with Hadoop, and show how the two perform under different file I/O workloads.
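To make the NameNode memory pressure concrete, the following back-of-the-envelope sketch estimates the heap consumed by file metadata. It is not taken from the paper: the figure of roughly 150 bytes per namespace object (file or block) is a commonly quoted rule of thumb, the 64 MiB block size is an assumed default, and the function name `namenode_heap_bytes` is hypothetical.

```python
# Assumption (rule of thumb, not a figure from the paper): each file or block
# object costs roughly 150 bytes of NameNode heap.
BYTES_PER_OBJECT = 150

# Assumption: 64 MiB HDFS block size.
DEFAULT_BLOCK_SIZE = 64 * 1024 * 1024


def namenode_heap_bytes(num_files: int, file_size: int,
                        block_size: int = DEFAULT_BLOCK_SIZE) -> int:
    """Estimate NameNode heap used by num_files files of file_size bytes each."""
    # Ceiling division: a non-empty file occupies at least one block,
    # even when it is far smaller than block_size.
    blocks_per_file = -(-file_size // block_size)
    # One object for the file itself plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    # Ten million 10 KiB files versus the same data packed into 64 MiB files.
    small = namenode_heap_bytes(10_000_000, 10 * 1024)
    packed = namenode_heap_bytes(1526, DEFAULT_BLOCK_SIZE)
    print(f"small files:  {small / 1e9:.1f} GB of NameNode heap")
    print(f"packed files: {packed / 1e6:.2f} MB of NameNode heap")
```

Under these assumptions the metadata cost scales with the number of files rather than the number of bytes stored, which is why a workload of many small files strains a single metadata node long before the data nodes run out of disk.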
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Luo, M., Yokota, H. (2010). Comparing Hadoop and Fat-Btree Based Access Method for Small File I/O Applications. In: Chen, L., Tang, C., Yang, J., Gao, Y. (eds) Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6184. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14246-8_20
DOI: https://doi.org/10.1007/978-3-642-14246-8_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14245-1
Online ISBN: 978-3-642-14246-8
eBook Packages: Computer Science, Computer Science (R0)