Abstract
We investigate operating system noise, which we identify as one of the main reasons for a lack of synchronicity in parallel applications. Using a microbenchmark, we measure the noise on several contemporary platforms and find that, even with a general-purpose operating system, noise can be limited if certain precautions are taken. We then inject artificially generated noise into a massively parallel system and measure its influence on the performance of collective operations. Our experiments indicate that on extreme-scale platforms, the performance is correlated with the largest interruption to the application, even if the probability of such an interruption on a single process is extremely small. We demonstrate that synchronizing the noise can significantly reduce its negative influence.
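The scaling effect described above can be illustrated with a short calculation. The following is a minimal sketch, not the authors' model: it assumes that interruptions strike each process independently with the same probability during one collective operation, and that the collective finishes only when the slowest process does. The function names and parameters are hypothetical and chosen for illustration only.

```python
def prob_any_interrupted(p: float, n: int) -> float:
    """Probability that at least one of n processes is interrupted
    during a single collective operation.

    Assumption (not from the source): interruptions are independent
    across processes and each occurs with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    # Complement of "no process is interrupted".
    return 1.0 - (1.0 - p) ** n


def expected_extra_delay(p: float, n: int, detour: float) -> float:
    """Expected delay added to one collective when a single kind of
    interruption of length `detour` (any time unit) can occur.

    Assumption: the collective is delayed by the full detour whenever
    at least one participating process is interrupted.
    """
    return prob_any_interrupted(p, n) * detour


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from the paper.
    for n in (1, 1_000, 100_000):
        print(n, prob_any_interrupted(1e-4, n))
```

Under these assumptions, a per-process probability of 1e-4 yields a probability above 0.99 that some process is interrupted once 100,000 processes take part, which is why a rare but long interruption can dominate the cost of a collective at scale.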
Cite this article
Beckman, P., Iskra, K., Yoshii, K. et al. Benchmarking the effects of operating system interference on extreme-scale parallel machines. Cluster Comput 11, 3–16 (2008). https://doi.org/10.1007/s10586-007-0047-2