Storage backends of parallel compute clusters are still based mostly on magnetic disks, while newer and faster storage technologies such as flash-based SSDs or non-volatile random access memory (NVRAM) are deployed within compute nodes. Integrating these new storage technologies into scientific workflows is unfortunately still a mostly manual task, and most scientists therefore do not take advantage of the faster storage media. One approach to systematically integrating node-local SSDs or NVRAM into scientific workflows is to deploy ad hoc file systems over a set of compute nodes, which serve as temporary storage systems for single applications or longer-running campaigns. This paper presents results from the Dagstuhl Seminar 17202 “Challenges and Opportunities of User-Level File Systems for HPC” and discusses application scenarios as well as design strategies for ad hoc file systems using node-local storage media. The discussion includes open research questions, such as how to couple ad hoc file systems with the batch scheduling environment and how to schedule stage-in and stage-out processes of data between the storage backend and the ad hoc file systems. Also presented are strategies for building ad hoc file systems from reusable networking components and for improving storage device compatibility. Various interfaces and semantics are presented, for example those used by the three ad hoc file systems BeeOND, GekkoFS, and BurstFS. These three systems cover a range from file systems running in production to cutting-edge research focused on reaching the performance limits of the underlying devices.
Cite this article
Brinkmann, A., Mohror, K., Yu, W. et al. Ad Hoc File Systems for High-Performance Computing. J. Comput. Sci. Technol. 35, 4–26 (2020). https://doi.org/10.1007/s11390-020-9801-1
- parallel architectures
- distributed file system
- high-performance computing
- burst buffer
- POSIX (portable operating system interface)