Adventures in NoSQL for Metadata Management

  • Jay Lofstead
  • Ashleigh Ryan
  • Margaret Lawson
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11887)


This paper describes an attempt to use a NoSQL database engine to manage custom metadata, using a rich query interface as a motivating and descriptive example of the functionality desired. While the difficulties were numerous, the effort revealed several important considerations for how and when to use this alternative technology, along with some initial performance numbers showing the impact of those choices.


Keywords: Metadata · NoSQL · Cassandra · Spark



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Sandia National Laboratories, Albuquerque, USA
  2. Georgia Institute of Technology, Atlanta, USA
  3. University of Illinois, Urbana-Champaign, USA
