Searchable Storage in Cloud Computing, pp. 67–97
Semantic-Aware Metadata Organization for Exact-Matching Queries
Abstract
Existing data storage systems based on the hierarchical directory-tree organization do not meet the scalability and functionality requirements of exponentially growing datasets and increasingly complex metadata queries in large-scale, exabyte-level file systems with billions of files. This section proposes a novel decentralized semantic-aware metadata organization, called SmartStore, which exploits the semantics of files’ metadata to judiciously aggregate correlated files into semantic-aware groups by using information retrieval tools. The key idea of SmartStore is to limit the search scope of a complex metadata query to a single or a minimal number of semantically correlated groups, thereby avoiding or alleviating brute-force search over the entire system. The decentralized design of SmartStore can improve system scalability and reduce query latency for complex queries (including range and top-k queries). Moreover, it is also conducive to semantic-aware caching and to conventional filename-based point queries. We have implemented a prototype of SmartStore, and extensive experiments based on real-world traces show that SmartStore significantly improves system scalability and reduces query latency compared with database approaches. To the best of our knowledge, this is the first study on the implementation of complex queries in large-scale file systems (© 2012 IEEE. Reprinted, with permission, from Ref. [1]).
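The scope-limiting idea in the abstract can be illustrated with a minimal Python sketch. It is not the SmartStore implementation: the `Group` class, the `range_query` function, the two-attribute metadata tuples (size in KB, modification day), and the hand-assigned groups are all hypothetical, and the grouping step that SmartStore performs with information retrieval tools is assumed to have already happened. Each group keeps a per-attribute bounding box, so a range query can skip every group whose box does not intersect the query.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A hypothetical semantic group: member files plus a bounding box."""
    files: dict = field(default_factory=dict)  # file name -> attribute tuple
    lo: tuple = ()  # per-attribute minimum over all members
    hi: tuple = ()  # per-attribute maximum over all members

    def add(self, name, attrs):
        """Insert a file and widen the bounding box to cover it."""
        self.files[name] = attrs
        if not self.lo:
            self.lo, self.hi = attrs, attrs
        else:
            self.lo = tuple(min(a, b) for a, b in zip(self.lo, attrs))
            self.hi = tuple(max(a, b) for a, b in zip(self.hi, attrs))

    def overlaps(self, qlo, qhi):
        """True if the group's bounding box intersects the query box."""
        return all(l <= qh and ql <= h
                   for l, h, ql, qh in zip(self.lo, self.hi, qlo, qhi))


def range_query(groups, qlo, qhi):
    """Return (sorted matching file names, number of groups scanned)."""
    hits, scanned = [], 0
    for g in groups:
        if not g.overlaps(qlo, qhi):
            continue  # whole group pruned without touching its files
        scanned += 1
        for name, attrs in g.files.items():
            if all(l <= a <= h for a, l, h in zip(attrs, qlo, qhi)):
                hits.append(name)
    return sorted(hits), scanned


# Illustrative data only: attributes are (size_kb, mtime_day).
small, large = Group(), Group()
small.add("a.log", (4, 10))
small.add("b.log", (6, 12))
large.add("x.iso", (4000, 300))
large.add("y.iso", (5200, 310))
groups = [small, large]

result = range_query(groups, (0, 0), (100, 20))
```

Here the query for small, recently modified files scans only the `small` group, because the bounding box of `large` lies entirely outside the query range. A real system would need a far richer grouping and indexing scheme; the sketch only shows why correlated grouping narrows the search.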
References
- 1. Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, Semantic-aware metadata organization paradigm in next-generation file systems. IEEE Trans. Parallel Distrib. Syst. (TPDS) 2, 337–344 (2012)
- 2. J. Nunez, High end computing file system and I/O R&D gaps roadmap, in High Performance Computer Science Week, ASCR Computer Science Research (2008)
- 3. J.R. Douceur, J. Howell, Distributed directory service in the farsite file system, in Proceedings of the OSDI (2006), pp. 321–334
- 4. S.A. Weil, S.A. Brandt, E.L. Miller, D.D.E. Long, C. Maltzahn, Ceph: a scalable, high-performance distributed file system, in Proceedings of the OSDI (2006)
- 5. D. Agrawal, S. Das, A.E. Abbadi, Big data and cloud computing: new wine or just new bottles? in VLDB tutorial (2010)
- 6. M. Stonebraker, U. Cetintemel, One size fits all: an idea whose time has come and gone, in Proceedings of the ICDE (2005)
- 7. A.W. Leung, M. Shao, T. Bisson, S. Pasupathy, E.L. Miller, Spyglass: fast, scalable metadata search for large-scale storage systems, in Proceedings of the FAST (2009)
- 8. D. Roselli, J. Lorch, T. Anderson, A comparison of file system workloads, in Proceedings of the USENIX Conference (2000), pp. 41–54
- 9. A. Traeger, E. Zadok, N. Joukov, C. Wright, A nine year study of file system and storage benchmarking. ACM Trans. Storage 2, 1–56 (2008)
- 10. A. Szalay, New challenges in petascale scientific databases, in Keynote Talk in Scientific and Statistical Database Management Conference (SSDBM) (2008)
- 11. M. Seltzer, N. Murphy, Hierarchical file systems are dead, in Proceedings of the HotOS (2009)
- 12. F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, R. Gruber, Bigtable: a distributed storage system for structured data, in Proceedings of the OSDI (2006)
- 13. D.K. Gifford, P. Jouvelot, M.A. Sheldon, J.W. O'Toole, Semantic file systems, in Proceedings of the SOSP (1991)
- 14. P. Gu, J. Wang, Y. Zhu, H. Jiang, P. Shang, A novel weighted-graph-based grouping algorithm for metadata prefetching. IEEE Trans. Comput. 1, 1–15 (2010)
- 15. S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 391–407 (1990)
- 16. C. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala, Latent semantic indexing: a probabilistic analysis. J. Comput. Syst. Sci. 61(2), 217–235 (2000)
- 17. S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)
- 18. T. Hofmann, Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst. (TOIS) 22(1), 89–115 (2004)
- 19. T. Hofmann, Probabilistic latent semantic indexing, in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1999), pp. 50–57
- 20. S. Doraimani, A. Iamnitchi, File grouping for scientific data management: lessons from experimenting with real traces, in Proceedings of the HPDC (2008)
- 21. A. Leung, S. Pasupathy, G. Goodson, E. Miller, Measurement and analysis of large-scale network file system workloads, in Proceedings of the USENIX Conference (2008)
- 22. P. Xia, D. Feng, H. Jiang, L. Tian, F. Wang, FARMER: a novel approach to file access correlation mining and evaluation reference model for optimizing peta-scale file systems performance, in Proceedings of the HPDC (2008)
- 23. E. Riedel, M. Kallahalla, R. Swaminathan, A framework for evaluating storage system security, in Proceedings of the FAST (2002)
- 24. S. Kavalanekar, B. Worthington, Q. Zhang, V. Sharda, Characterization of storage workload traces from production Windows servers, in Proceedings of the IEEE International Symposium on Workload Characterization (IISWC) (2008)
- 25. D. Ellard, J. Ledlie, P. Malkani, M. Seltzer, Passive NFS tracing of email and research workloads, in Proceedings of the FAST (2003)
- 26. Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with metadata semantic-awareness for next-generation file systems, Technical Report (University of Nebraska-Lincoln, TR-UNL-CSE-2008-0012, November, 2008)
- 27. Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with semantic-awareness. FAST Work-in-Progress Report and Poster Session (February, 2009)
- 28. B. Zhu, K. Li, H. Patterson, Avoiding the disk bottleneck in the data domain deduplication file system, in Proceedings of the FAST (2008)
- 29. M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, P. Camble, Sparse indexing: large scale, inline deduplication using sampling and locality, in Proceedings of the FAST (2009)
- 30. X. Liu, A. Aboulnaga, K. Salem, X. Li, CLIC: client-informed caching for storage servers, in Proceedings of the FAST (2009)
- 31. M. Li, E. Varki, S. Bhatia, A. Merchant, TaP: table-based prefetching for storage caches, in Proceedings of the FAST (2008)
- 32. A. Guttman, R-trees: a dynamic index structure for spatial searching, in Proceedings of the SIGMOD (1984)
- 33. B. Bloom, Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
- 34. Y. Hua, Y. Zhu, H. Jiang, D. Feng, L. Tian, Scalable and adaptive metadata management in ultra large-scale file systems, in Proceedings of the ICDCS (2008)
- 35. J. Hartigan, M. Wong, Algorithm AS 136: a K-means clustering algorithm. Appl. Stat. 28, 100–108 (1979)
- 36. G. Salton, A. Wong, C. Yang, A vector space model for information retrieval. J. Am. Soc. Inf. Retr. 3, 613–620 (1975)
- 37. M. Berry, Z. Drmac, E. Jessup, Matrices, vector spaces, and information retrieval. SIAM Rev. 41, 335–362 (1999)
- 38. G. Golub, C. Van Loan, Matrix Computations (Johns Hopkins University Press, USA, 1996)
- 39. G. McLachlan, T. Krishnan, The EM Algorithm and Extensions (Wiley, New York, 1997)
- 40. A. Dempster, N. Laird, D. Rubin et al., Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 39(1), 1–38 (1977)
- 41. P. Moreno, P. Ho, N. Vasconcelos, A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications, in Advances in Neural Information Processing Systems (2004)
- 42. Z. Rached, F. Alajaji, L. Campbell, The Kullback-Leibler divergence rate between Markov sources. IEEE Trans. Inf. Theory 50(5), 917–921 (2004)
- 43. Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, SmartStore: a new metadata organization paradigm with semantic-awareness for next-generation file systems, in Proceedings of ACM/IEEE Supercomputing Conference (SC) (2009)
- 44. P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in STOC (1998), pp. 604–613
- 45. V. Gaede, O. Guenther, Multidimensional access methods. ACM Comput. Surv. 30(2), 170–231 (1998)
- 46. C.A.N. Soules, G.R. Goodson, J.D. Strunk, G.R. Ganger, Metadata efficiency in versioning file systems, in Proceedings of the FAST (2003)
- 47. L. Fan, P. Cao, J. Almeida, A.Z. Broder, Summary cache: a scalable wide area web cache sharing protocol. IEEE/ACM Trans. Netw. 8(3) (2000)
- 48. Y. Zhu, H. Jiang, J. Wang, F. Xian, HBA: distributed metadata management for large cluster-based storage systems. IEEE Trans. Parallel Distrib. Syst. 19(4), 1–14 (2008)
- 49. D. Comer, The ubiquitous B-tree. ACM Comput. Surv. 11(2), 121–137 (1979)
- 50. C. du Mouza, W. Litwin, P. Rigaux, SD-Rtree: a scalable distributed Rtree, in Proceedings of the IEEE ICDE (2007), pp. 296–305
- 51. A.J. Menezes, P.C. van Oorschot, S.A. Vanstone, Handbook of Applied Cryptography (CRC Press, Boca Raton, 1997)
- 52. D.K. Gifford, P. Jouvelot, M.A. Sheldon, J.W. O'Toole Jr., Semantic file systems, in Proceedings of the SOSP (1991)
- 53. Google Desktop, http://www.desktop.google.com/
- 54. C. Soules, G. Ganger, Connections: using context to enhance file search, in Proceedings of the SOSP (2005)
- 55. J. Kleinberg, Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999)
- 56. Google, http://www.google.com/
- 57. S. Patil, G. Gibson, GIGA+: scalable directories for shared file systems. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-110 (2008)