The Journal of Supercomputing, Volume 72, Issue 8, pp 3006–3032

MBFS: a parallel metadata search method based on Bloomfilters using MapReduce for large-scale file systems

  • Zhisheng Huo
  • Limin Xiao
  • Qiaoling Zhong
  • Shupan Li
  • Ang Li
  • Li Ruan
  • Shouxin Wang
  • Lihong Fu


Metadata search is an important way to access and manage file systems, and many solutions have been proposed to improve its performance. However, existing solutions, which build a separate metadata index inside or outside the file system using a dedicated data structure or database, use semantics and event notification to construct the index structure, or use sampling to search metadata directly on the namespace, suffer from several problems: high I/O overhead for keeping the metadata indexes consistent with the metadata, enormous space overhead for storing the indexes, low accuracy of results, and so on. To address these problems, this paper presents MBFS, a fast, accurate and lightweight metadata search method based on multi-dimensional Bloomfilters. We create a multi-dimensional Bloomfilter structure on the basis of the directory entry, which can prune sub-trees to narrow the search scope within the namespace. MBFS can produce fast and accurate answers for a class of complex searches over a file system using only a small number of disk accesses. Because MBFS resides in the file system, it needs no additional I/O to maintain consistency. Because it consists of Bloomfilters, which are composed of bits, it is a lightweight method with marginal space overhead. Moreover, MBFS employs MapReduce to speed up search in environments with multiple metadata servers. Extensive experiments are conducted to demonstrate the effectiveness of MBFS. The experimental results show that MBFS achieves excellent search latency and result accuracy with low space and time overhead.
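The pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the data layout, so the sketch assumes one Bloom filter per metadata attribute ("dimension") attached to each directory, summarizing every file in that directory's subtree. The names `BloomFilter`, `Directory`, `build_filters`, and `search`, the attributes `owner` and `ext`, and the filter parameters are all hypothetical. The MapReduce step across multiple metadata servers is not modeled.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class Directory:
    """Hypothetical directory node with one filter per attribute (assumption)."""

    def __init__(self, name, dimensions=("owner", "ext")):
        self.name = name
        self.files = []      # list of (file_name, metadata dict)
        self.subdirs = []
        self.filters = {dim: BloomFilter() for dim in dimensions}

    def build_filters(self):
        """Summarize the subtree bottom-up: own files plus all descendants."""
        for _, meta in self.files:
            for dim, bf in self.filters.items():
                if dim in meta:
                    bf.add(meta[dim])
        for child in self.subdirs:
            child.build_filters()
            for dim, bf in self.filters.items():
                bf.bits |= child.filters[dim].bits  # union of child summaries


def search(directory, query, path="", scanned=None):
    """Return paths of files matching every attribute in query."""
    # Prune: if any queried value is definitely absent, skip the whole subtree.
    if not all(directory.filters[d].might_contain(v) for d, v in query.items()):
        return []
    if scanned is not None:
        scanned.append(directory.name)
    here = f"{path}/{directory.name}"
    # Check real metadata so filter false positives never reach the results.
    hits = [f"{here}/{name}" for name, meta in directory.files
            if all(meta.get(d) == v for d, v in query.items())]
    for child in directory.subdirs:
        hits.extend(search(child, query, here, scanned))
    return hits


root, docs, src = Directory("root"), Directory("docs"), Directory("src")
root.subdirs += [docs, src]
docs.files.append(("report.pdf", {"owner": "alice", "ext": "pdf"}))
src.files.append(("main.c", {"owner": "bob", "ext": "c"}))
root.build_filters()

scanned = []
hits = search(root, {"owner": "bob", "ext": "c"}, scanned=scanned)
print(hits)
print(scanned)
```

In this toy tree, a query for files owned by `bob` with extension `c` skips the `docs` subtree without reading its entries, because the filters on `docs` report those values as absent. Results stay exact because each candidate file is still checked against its real metadata; a false positive in a filter costs only an unnecessary directory scan.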


  • Large-scale file systems
  • Parallel metadata search
  • Fast and accurate
  • Lightweight
  • Multi-dimensional Bloomfilters



This version has benefited greatly from the many detailed comments and suggestions of the anonymous reviewers, which the authors gratefully acknowledge. The work described in this paper was supported by the National Natural Science Foundation of China under Grant Nos. 61370059 and 61232009, the Beijing Natural Science Foundation under Grant No. 4152030, the fund of the State Key Laboratory of Software Development Environment under Grant No. SKLSDE-2014ZX-05, the Open Research Fund of The Academy of Satellite Application under Grant No. 2014_CXJJDSJ_04, the Fundamental Research Funds for the Central Universities under Grant Nos. YWF-14-JSJXY-14 and YWF-15-GJSYS-085, and the Open Project Program of the National Engineering Research Center for Science & Technology Resources Sharing Service (Beihang University).



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Zhisheng Huo (1, 2)
  • Limin Xiao (1, 2)
  • Qiaoling Zhong (1, 2)
  • Shupan Li (1, 2)
  • Ang Li (1, 2)
  • Li Ruan (1, 2)
  • Shouxin Wang (3)
  • Lihong Fu (3)

  1. State Key Laboratory of Software Development Environment, Beihang University, Beijing, China
  2. School of Computer Science and Engineering, Beihang University, Beijing, China
  3. Space Star Technology Co., Ltd, Beijing, China
