Comparing Hadoop and Fat-Btree Based Access Method for Small File I/O Applications

  • Conference paper
Web-Age Information Management (WAIM 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6184)

Abstract

Hadoop has been widely used in various clusters to build scalable, high-performance distributed file systems. However, the Hadoop Distributed File System (HDFS) is designed for managing large files. In small-file applications, metadata requests flood the network and consume most of the memory in the NameNode, which sharply hinders its performance. Therefore, many web applications do not benefit from clusters with a centralized metadata node, such as Hadoop. In this paper, we compare Hadoop with our Fat-Btree based data access method, which removes the central node from the cluster, and we show how the two perform in different file I/O applications.
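
To make the memory pressure described above concrete, the following is a minimal back-of-the-envelope sketch, not a measurement from the paper. It assumes a commonly quoted rule of thumb of roughly 150 bytes of NameNode memory per namespace object (file or block) and a 64 MiB block size; both figures, and the function name `namenode_metadata_bytes`, are assumptions introduced here for illustration only.

```python
# Rough estimate of NameNode memory consumed by namespace metadata.
# ASSUMPTION: ~150 bytes per namespace object (file or block). This is a
# commonly quoted rule of thumb, not a figure taken from the paper.
BYTES_PER_OBJECT = 150

# ASSUMPTION: 64 MiB HDFS block size (illustrative default).
DEFAULT_BLOCK_SIZE = 64 * 1024 * 1024


def namenode_metadata_bytes(total_bytes, file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Estimate metadata bytes when total_bytes is stored as equal-size files."""
    num_files = -(-total_bytes // file_size)                # ceiling division
    blocks_per_file = max(1, -(-file_size // block_size))   # at least one block
    objects = num_files * (1 + blocks_per_file)             # file object + its blocks
    return objects * BYTES_PER_OBJECT


TIB = 1024 ** 4
small = namenode_metadata_bytes(TIB, 10 * 1024)   # 1 TiB stored as 10 KiB files
large = namenode_metadata_bytes(TIB, 1024 ** 3)   # 1 TiB stored as 1 GiB files

print(f"10 KiB files: {small / 1024 ** 3:.1f} GiB of metadata")
print(f"1 GiB files:  {large / 1024 ** 2:.1f} MiB of metadata")
```

Under these assumptions, the same volume of data costs several orders of magnitude more NameNode memory when it is stored as small files, which is the effect the abstract attributes to a centralized metadata node.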

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Luo, M., Yokota, H. (2010). Comparing Hadoop and Fat-Btree Based Access Method for Small File I/O Applications. In: Chen, L., Tang, C., Yang, J., Gao, Y. (eds) Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6184. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14246-8_20

  • DOI: https://doi.org/10.1007/978-3-642-14246-8_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14245-1

  • Online ISBN: 978-3-642-14246-8

  • eBook Packages: Computer Science, Computer Science (R0)
