Handling Small Size Files in Hadoop: Challenges, Opportunities, and Review

  • Conference paper
Soft Computing in Data Analytics

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 758)

Abstract

Recent technological advancements in computing have led to the generation of data in volumes that traditional tools, processes, and systems cannot handle effectively. New techniques and frameworks have emerged to handle this big data. Hadoop is a prominent framework for managing huge amounts of data. It provides efficient means for the storage, retrieval, processing, and analytics of big data. Although Hadoop works very well with large files, its performance tends to degrade when it is required to process hundreds or thousands of small files. This paper puts forward the challenges and opportunities that may arise while handling a large number of small files. It also presents a comprehensive review of the various techniques available for efficiently handling small files in Hadoop, comparing them on performance parameters such as access time, read/write complexity, scalability, and processing speed.
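The degradation described in the abstract can be illustrated with a rough back-of-the-envelope sketch. It is not taken from the paper: it assumes a commonly quoted rule of thumb of about 150 bytes of NameNode memory per namespace object (one per file, one per block) and a 128 MiB block size, and the function names are hypothetical. Under those assumptions, it compares the metadata footprint of many small files with that of the same data merged into a single container file.

```python
# Assumed rule of thumb, not a figure from the paper: each file entry and each
# block costs roughly 150 bytes of NameNode heap.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size (128 MiB)


def namenode_bytes(file_count: int, file_size: int) -> int:
    """Estimate NameNode heap used by file_count files of file_size bytes each."""
    # Integer ceiling division; even a tiny file occupies at least one block.
    blocks_per_file = max(1, (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE)
    objects = file_count * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


def merged_bytes(file_count: int, file_size: int) -> int:
    """Estimate the footprint if the same data is stored as one merged file."""
    return namenode_bytes(1, file_count * file_size)


if __name__ == "__main__":
    n, size = 1_000_000, 10 * 1024  # one million files of 10 KiB each
    print("separate files:", namenode_bytes(n, size), "bytes of metadata")
    print("merged file:   ", merged_bytes(n, size), "bytes of metadata")
```

The estimate ignores directory entries, replication bookkeeping, and any index a merging scheme would need, so it only indicates the order of magnitude of the gap.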

Corresponding author

Correspondence to Mohd Abdul Ahad.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Ahad, M.A., Biswas, R. (2019). Handling Small Size Files in Hadoop: Challenges, Opportunities, and Review. In: Nayak, J., Abraham, A., Krishna, B., Chandra Sekhar, G., Das, A. (eds) Soft Computing in Data Analytics. Advances in Intelligent Systems and Computing, vol 758. Springer, Singapore. https://doi.org/10.1007/978-981-13-0514-6_62
