The Journal of Supercomputing, Volume 61, Issue 3, pp 966–996

Reducing communication costs in collective I/O in multi-core cluster systems with non-exclusive scheduling

  • Kwangho Cha
  • Seungryoul Maeng


As the number of nodes in high performance computing (HPC) systems increases, collective I/O becomes an important issue, and I/O aggregators are the key factor in improving its performance. When an HPC system uses non-exclusive scheduling, different numbers of CPU cores per node can be assigned to MPI jobs; thus, I/O aggregators experience a disparity in their workloads and communication costs. Because communication behavior is influenced by the sequence of the I/O aggregators and by the number of CPU cores in neighboring nodes, changing the order of the nodes affects the communication costs of collective I/O. Few studies, however, have addressed how to determine an appropriate node sequence. In this study, we found that an inappropriate order of nodes increases the communication costs of collective I/O. To address this problem, we propose heuristic methods that regulate the node sequence, and we develop a prediction function to estimate MPI-IO performance when the proposed heuristics are used. Performance measurements indicate that the proposed scheme achieves its goal of preventing performance degradation in collective I/O. For instance, on a multi-core cluster system with the Lustre file system, the read bandwidth of MPI-Tile-IO improved by 7.61% to 17.21% and its write bandwidth by 17.05% to 26.49%.
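The abstract describes reordering nodes so that aggregator workloads, which vary with the number of CPU cores assigned per node under non-exclusive scheduling, are spread more evenly across the node sequence. The paper's actual heuristic functions are not reproduced on this page; the sketch below is a hypothetical illustration of one simple node-ordering heuristic in that spirit, interleaving nodes with many and few assigned cores so that heavily and lightly loaded aggregators alternate. The function name and the interleaving rule are assumptions for illustration, not the authors' algorithm.

```python
def reorder_nodes(cores_per_node):
    """Hypothetical node-sequence heuristic (not the paper's algorithm):
    interleave node indices sorted by assigned core count, alternating
    the most- and least-loaded remaining nodes, so that neighboring
    aggregators see a more balanced mix of per-node workloads."""
    # Node indices sorted by descending core count (stable for ties).
    order = sorted(range(len(cores_per_node)),
                   key=lambda i: cores_per_node[i], reverse=True)
    result = []
    lo, hi = 0, len(order) - 1
    while lo <= hi:
        result.append(order[lo])      # next most-loaded node
        if lo != hi:
            result.append(order[hi])  # next least-loaded node
        lo += 1
        hi -= 1
    return result

# Example: cores assigned per node by a non-exclusive scheduler.
cores = [8, 2, 8, 4, 2, 8]
print(reorder_nodes(cores))
```

A real implementation would feed the resulting sequence to the MPI-IO layer (e.g., when selecting and ordering I/O aggregators in a two-phase collective I/O), and the paper's prediction function would then estimate the resulting bandwidth; neither step is shown here.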


Keywords: Collective I/O · Cluster system · MPI-IO · Parallel I/O





Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. Computer Science Department, Korea Advanced Institute of Science and Technology, Daejeon, Korea
  2. Supercomputing Center, Korea Institute of Science and Technology Information, Daejeon, Korea
