
Semantic-Aware Metadata Organization for Exact-Matching Queries

  • Yu Hua
  • Xue Liu
Chapter

Abstract

Existing data storage systems based on hierarchical directory-tree organization do not meet the scalability and functionality requirements of exponentially growing datasets and increasingly complex metadata queries in large-scale, Exabyte-level file systems with billions of files. This section proposes a novel decentralized semantic-aware metadata organization, called SmartStore, which exploits the semantics of files’ metadata to judiciously aggregate correlated files into semantic-aware groups by using information retrieval tools. The key idea of SmartStore is to limit the search scope of a complex metadata query to a single or a minimal number of semantically correlated groups, avoiding or alleviating brute-force search over the entire system. The decentralized design of SmartStore can improve system scalability and reduce query latency for complex queries (including range and top-k queries). Moreover, it is conducive to constructing semantic-aware caching and to supporting conventional filename-based point queries. We have implemented a prototype of SmartStore, and extensive experiments based on real-world traces show that SmartStore significantly improves system scalability and reduces query latency over database approaches. To the best of our knowledge, this is the first study on the implementation of complex queries in large-scale file systems (© 2012 IEEE. Reprinted, with permission, from Ref. [1]).
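
The scope-limiting idea in the abstract can be illustrated with a minimal sketch. This is not the SmartStore implementation: the `Group` class, the fixed `centers`, and the toy two-attribute metadata are all assumptions made for illustration, and nearest-center assignment stands in for the information-retrieval-based grouping the abstract mentions. Each group keeps a bounding box over its members' metadata, so a range query can skip any group whose box does not intersect the query.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical group of files with similar metadata, summarized by a bounding box."""
    files: dict = field(default_factory=dict)  # file name -> metadata tuple
    lo: tuple = None  # per-attribute minimum over members
    hi: tuple = None  # per-attribute maximum over members

    def add(self, name, meta):
        self.files[name] = meta
        self.lo = meta if self.lo is None else tuple(map(min, self.lo, meta))
        self.hi = meta if self.hi is None else tuple(map(max, self.hi, meta))

    def overlaps(self, q_lo, q_hi):
        # True if the group's bounding box intersects the query box.
        return all(l <= qh and h >= ql
                   for l, h, ql, qh in zip(self.lo, self.hi, q_lo, q_hi))


def build_groups(files, centers):
    """Assign each file to its nearest center (squared Euclidean distance)."""
    groups = [Group() for _ in centers]
    for name, meta in files.items():
        dists = [sum((a - b) ** 2 for a, b in zip(meta, c)) for c in centers]
        groups[dists.index(min(dists))].add(name, meta)
    return [g for g in groups if g.files]


def range_query(groups, q_lo, q_hi):
    """Return matching file names and the number of groups actually scanned."""
    hits, scanned = [], 0
    for g in groups:
        if not g.overlaps(q_lo, q_hi):
            continue  # prune: no file in this group can match
        scanned += 1
        hits += [n for n, m in g.files.items()
                 if all(ql <= v <= qh for v, ql, qh in zip(m, q_lo, q_hi))]
    return sorted(hits), scanned


# Toy metadata (assumed attributes): (size in KB, last-modified hour)
files = {"a.log": (4, 1), "b.log": (6, 2), "c.dat": (900, 20), "d.dat": (950, 22)}
groups = build_groups(files, centers=[(5, 1), (900, 21)])
result, scanned = range_query(groups, q_lo=(0, 0), q_hi=(10, 5))
print(result, scanned)  # → ['a.log', 'b.log'] 1
```

In this toy run the query touches one of the two groups; the second is pruned by its bounding box without inspecting any of its files, which is the effect the abstract describes as avoiding brute-force search.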

References

  1. Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, Semantic-aware metadata organization paradigm in next-generation file systems. IEEE Trans. Parallel Distrib. Syst. (TPDS) 2, 337–344 (2012)
  2. J. Nunez, High end computing file system and I/O R&D gaps roadmap, in High Performance Computer Science Week, ASCR Computer Science Research (2008)
  3. J.R. Douceur, J. Howell, Distributed directory service in the farsite file system, in Proceedings of the OSDI (2006), pp. 321–334
  4. S.A. Weil, S.A. Brandt, E.L. Miller, D.D.E. Long, C. Maltzahn, Ceph: a scalable, high-performance distributed file system, in Proceedings of the OSDI (2006)
  5. D. Agrawal, S. Das, A.E. Abbadi, Big data and cloud computing: new wine or just new bottles? in VLDB tutorial (2010)
  6. M. Stonebraker, U. Cetintemel, One size fits all: an idea whose time has come and gone, in Proceedings of the ICDE (2005)
  7. A.W. Leung, M. Shao, T. Bisson, S. Pasupathy, E.L. Miller, Spyglass: fast, scalable metadata search for large-scale storage systems, in Proceedings of the FAST (2009)
  8. D. Roselli, J. Lorch, T. Anderson, A comparison of file system workloads, in Proceedings of the USENIX Conference (2000), pp. 41–54
  9. A. Traeger, E. Zadok, N. Joukov, C. Wright, A nine year study of file system and storage benchmarking. ACM Trans. Storage 2, 1–56 (2008)
  10. A. Szalay, New challenges in petascale scientific databases, in Keynote Talk in Scientific and Statistical Database Management Conference (SSDBM) (2008)
  11. M. Seltzer, N. Murphy, Hierarchical file systems are dead, in Proceedings of the HotOS (2009)
  12. F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, R. Gruber, Bigtable: a distributed storage system for structured data, in Proceedings of the OSDI (2006)
  13. D.K. Gifford, P. Jouvelot, M.A. Sheldon, J.W. O'Toole, Semantic file systems, in Proceedings of the SOSP (1991)
  14. P. Gu, J. Wang, Y. Zhu, H. Jiang, P. Shang, A novel weighted-graph-based grouping algorithm for metadata prefetching. IEEE Trans. Comput. 1, 1–15 (2010)
  15. S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 391–407 (1990)
  16. C. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala, Latent semantic indexing: a probabilistic analysis. J. Comput. Syst. Sci. 61(2), 217–235 (2000)
  17. S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)
  18. T. Hofmann, Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst. (TOIS) 22(1), 89–115 (2004)
  19. T. Hofmann, Probabilistic latent semantic indexing, in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1999), pp. 50–57
  20. S. Doraimani, A. Iamnitchi, File grouping for scientific data management: lessons from experimenting with real traces, in Proceedings of the HPDC (2008)
  21. A. Leung, S. Pasupathy, G. Goodson, E. Miller, Measurement and analysis of large-scale network file system workloads, in Proceedings of the USENIX Conference (2008)
  22. P. Xia, D. Feng, H. Jiang, L. Tian, F. Wang, FARMER: a novel approach to file access correlation mining and evaluation reference model for optimizing peta-scale file systems performance, in Proceedings of the HPDC (2008)
  23. E. Riedel, M. Kallahalla, R. Swaminathan, A framework for evaluating storage system security, in Proceedings of the FAST (2002)
  24. S. Kavalanekar, B. Worthington, Q. Zhang, V. Sharda, Characterization of storage workload traces from production Windows servers, in Proceedings of the IEEE International Symposium on Workload Characterization (IISWC) (2008)
  25. D. Ellard, J. Ledlie, P. Malkani, M. Seltzer, Passive NFS tracing of email and research workloads, in Proceedings of the FAST (2003)
  26. Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with metadata semantic-awareness for next-generation file systems, Technical Report (University of Nebraska-Lincoln, TR-UNL-CSE-2008-0012, November, 2008)
  27. Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with semantic-awareness. FAST Work-in-Progress Report and Poster Session (February, 2009)
  28. B. Zhu, K. Li, H. Patterson, Avoiding the disk bottleneck in the data domain deduplication file system, in Proceedings of the FAST (2008)
  29. M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, P. Camble, Sparse indexing: large scale, inline deduplication using sampling and locality, in Proceedings of the FAST (2009)
  30. X. Liu, A. Aboulnaga, K. Salem, X. Li, CLIC: client-informed caching for storage servers, in Proceedings of the FAST (2009)
  31. M. Li, E. Varki, S. Bhatia, A. Merchant, TaP: table-based prefetching for storage caches, in Proceedings of the FAST (2008)
  32. A. Guttman, R-trees: a dynamic index structure for spatial searching, in Proceedings of the SIGMOD (1984)
  33. B. Bloom, Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
  34. Y. Hua, Y. Zhu, H. Jiang, D. Feng, L. Tian, Scalable and adaptive metadata management in ultra large-scale file systems, in Proceedings of the ICDCS (2008)
  35. J. Hartigan, M. Wong, Algorithm AS 136: a K-means clustering algorithm. Appl. Stat. 28, 100–108 (1979)
  36. G. Salton, A. Wong, C. Yang, A vector space model for information retrieval. J. Am. Soc. Inf. Retr. 3, 613–620 (1975)
  37. M. Berry, Z. Drmac, E. Jessup, Matrices, vector spaces, and information retrieval. SIAM Rev. 41, 335–362 (1999)
  38. G. Golub, C. Van Loan, Matrix Computations (Johns Hopkins University Press, USA, 1996)
  39. G. McLachlan, T. Krishnan, The EM Algorithm and Extensions (Wiley, New York, 1997)
  40. A. Dempster, N. Laird, D. Rubin et al., Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 39(1), 1–38 (1977)
  41. P. Moreno, P. Ho, N. Vasconcelos, A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications, in Advances in Neural Information Processing Systems (2004)
  42. Z. Rached, F. Alajaji, L. Campbell, The Kullback-Leibler divergence rate between Markov sources. IEEE Trans. Inf. Theory 50(5), 917–921 (2004)
  43. Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with semantic-awareness for next-generation file systems, in Proceedings of ACM/IEEE Supercomputing Conference (SC) (2009)
  44. P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in STOC (1998), pp. 604–613
  45. V. Gaede, O. Guenther, Multidimensional access methods. ACM Comput. Surv. 30(2), 170–231 (1998)
  46. C.A.N. Soules, G.R. Goodson, J.D. Strunk, G.R. Ganger, Metadata efficiency in versioning file systems, in Proceedings of the FAST (2003)
  47. L. Fan, P. Cao, J. Almeida, A.Z. Broder, Summary cache: a scalable wide area web cache sharing protocol. IEEE/ACM Trans. Netw. 8(3) (2000)
  48. Y. Zhu, H. Jiang, J. Wang, F. Xian, HBA: distributed metadata management for large cluster-based storage systems. IEEE Trans. Parallel Distrib. Syst. 19(4), 1–14 (2008)
  49. D. Comer, The ubiquitous B-tree. ACM Comput. Surv. 11(2), 121–137 (1979)
  50. C. du Mouza, W. Litwin, P. Rigaux, SD-Rtree: a scalable distributed Rtree, in Proceedings of the IEEE ICDE (2007), pp. 296–305
  51. A.J. Menezes, P.C. van Oorschot, S.A. Vanstone, Handbook of Applied Cryptography (CRC Press, Boca Raton, 1997)
  52. D.K. Gifford, P. Jouvelot, M.A. Sheldon, J.W. O'Toole Jr., Semantic file systems, in Proceedings of the SOSP (1991)
  53.
  54. C. Soules, G. Ganger, Connections: using context to enhance file search, in Proceedings of the SOSP (2005)
  55. J. Kleinberg, Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999)
  56.
  57. S. Patil, G. Gibson, GIGA+: scalable directories for shared file systems. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-110 (2008)

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Huazhong University of Science and Technology, Wuhan, China
  2. McGill University, Montreal, Canada
