Exploring large-scale small file storage for search engines

The Journal of Supercomputing

Abstract

Storing original pages as a very large number of small files degrades search engine performance. In this paper, we first analyze the disadvantages of the existing EXT3 file system in accessing small files. We then measure the compression ratio and speed of several compression algorithms to choose a suitable one for storage. We also design a file organization structure oriented to original pages, together with a read–write query tree, to store large-scale small files that need no modification. When search engines use these techniques to store original-page small files, access response time and wasted disk space decrease markedly.
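
The comparison of compression ratio and speed mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' benchmark: the three codecs are simply the ones shipped in the Python standard library, and `SAMPLE_PAGE`, `benchmark`, and `repeats` are hypothetical names introduced here. Real measurements would use actual crawled pages rather than the synthetic page below.

```python
import bz2
import lzma
import time
import zlib


def benchmark(data: bytes, repeats: int = 5) -> dict:
    """Return {codec: (compression_ratio, mean_seconds_per_call)}.

    The codec set is an assumption for illustration; the paper may have
    evaluated different algorithms.
    """
    codecs = {
        "zlib": zlib.compress,
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        for _ in range(repeats):
            packed = compress(data)
        elapsed = (time.perf_counter() - start) / repeats
        # Ratio > 1 means the compressed page is smaller than the original.
        results[name] = (len(data) / len(packed), elapsed)
    return results


# Hypothetical stand-in for one crawled original page.
SAMPLE_PAGE = (
    b"<html><body>"
    + b"<p>search engine original page</p>" * 200
    + b"</body></html>"
)

if __name__ == "__main__":
    for name, (ratio, seconds) in benchmark(SAMPLE_PAGE).items():
        print(f"{name}: ratio={ratio:.1f}, {seconds * 1e6:.0f} us per page")
```

A codec with a slightly lower ratio but much lower per-page time may be the better fit for a store that is read far more often than it is written; the sketch only exposes both numbers so that trade-off can be inspected.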

Acknowledgments

This work is supported by the National Basic Research Program of China (973 Program) under Grant No. 2011CB302605, the National Natural Science Foundation of China (NSFC) under Grant No. 61173145, and the Doctoral Program of Higher Education of China under Grant No. 20132302110037.

Author information

Corresponding author

Correspondence to Weizhe Zhang.

About this article

Cite this article

Zhang, W., Lu, G., He, H. et al. Exploring large-scale small file storage for search engines. J Supercomput 72, 2911–2923 (2016). https://doi.org/10.1007/s11227-015-1394-z

  • DOI: https://doi.org/10.1007/s11227-015-1394-z
