Comparing Hadoop and Fat-Btree Based Access Method for Small File I/O Applications

  • Min Luo
  • Haruo Yokota
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6184)

Abstract

Hadoop has been widely used in various clusters to build scalable, high-performance distributed file systems. However, the Hadoop Distributed File System (HDFS) is designed for managing large files. In small-file applications, metadata requests flood the network and consume most of the memory in the Namenode, which sharply degrades its performance. Therefore, many web applications do not benefit from clusters with a centralized metadata node, such as Hadoop. In this paper, we compare our Fat-Btree based data access method, which eliminates the central node from the cluster, with Hadoop. We show how their performance differs across different file I/O applications.

Keywords

Parallel Database, Fat-Btree, Hadoop, File I/O

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Min Luo (1)
  • Haruo Yokota (2)
  1. Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan
  2. Global Scientific Information and Computing Center, Tokyo Institute of Technology, Tokyo, Japan
