Abstract
Recent advances in computing have led to the generation of data in volumes that traditional tools, processes, and systems cannot handle effectively. New techniques and frameworks have emerged to handle this big data. Hadoop is a prominent framework for managing huge amounts of data; it provides efficient means for the storage, retrieval, processing, and analytics of big data. Although Hadoop works very well with large files, its performance tends to degrade when it must process hundreds or thousands of small files. This paper presents the challenges and opportunities that arise when handling a large number of small files. It also presents a comprehensive review of the techniques available for efficiently handling small files in Hadoop, compared on performance parameters such as access time, read/write complexity, scalability, and processing speed.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Ahad, M.A., Biswas, R. (2019). Handling Small Size Files in Hadoop: Challenges, Opportunities, and Review. In: Nayak, J., Abraham, A., Krishna, B., Chandra Sekhar, G., Das, A. (eds) Soft Computing in Data Analytics. Advances in Intelligent Systems and Computing, vol 758. Springer, Singapore. https://doi.org/10.1007/978-981-13-0514-6_62
DOI: https://doi.org/10.1007/978-981-13-0514-6_62
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0513-9
Online ISBN: 978-981-13-0514-6
eBook Packages: Intelligent Technologies and Robotics (R0)