Many scientific fields increasingly use high-performance computing (HPC) to process and analyze massive amounts of experimental data, while the storage systems in today’s HPC environments have to cope with new access patterns. These patterns include many metadata operations, small I/O requests, and randomized file I/O, whereas general-purpose parallel file systems have been optimized for sequential shared access to large files. Burst buffer file systems create a separate file system that applications can use to store temporary data. They aggregate the node-local storage available within the compute nodes or use dedicated SSD clusters, and they offer a peak bandwidth higher than that of the backend parallel file system without interfering with it. However, burst buffer file systems typically offer many features that a scientific application, running in isolation for a limited amount of time, does not require. We present GekkoFS, a temporary, highly scalable file system that has been specifically optimized for the aforementioned use cases. GekkoFS provides relaxed POSIX semantics, offering only the features that most (not all) applications actually require. GekkoFS is therefore able to provide scalable I/O performance and reaches millions of metadata operations even with a small number of nodes, significantly outperforming the capabilities of common parallel file systems.
Cite this article
Vef, MA., Moti, N., Süß, T. et al. GekkoFS — A Temporary Burst Buffer File System for HPC Applications. J. Comput. Sci. Technol. 35, 72–91 (2020). https://doi.org/10.1007/s11390-020-9797-6
- distributed file system
- high-performance computing (HPC)
- burst buffer
- POSIX (portable operating system interface)