The Impact of Noise on the Scaling of Collectives: A Theoretical Approach
The performance of parallel applications running on large clusters is known to degrade because of interference from kernel and daemon activity on individual nodes, often referred to as noise. In this paper, we focus on an important class of parallel applications that repeatedly perform a computation phase followed by a collective operation such as a barrier. We model this behavior theoretically and rigorously demonstrate the effect of noise on the scalability of such applications. We study three natural and important classes of noise distributions: the exponential distribution, the heavy-tailed distribution, and the Bernoulli distribution. We show that such systems scale well in the presence of exponential noise, but that performance degrades drastically in the presence of heavy-tailed or Bernoulli noise.
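The setting described in the abstract can be illustrated with a small Monte Carlo sketch. It is not the paper's model or analysis. It only assumes that each process needs a fixed amount of work per phase plus an independent random noise delay, and that the barrier completes when the slowest process arrives. The function names, the noise parameters (rate 1.0, Pareto shape 1.5, a delay of 100 with probability 0.01), and the phase count are all hypothetical choices made for illustration.

```python
import random


def phase_time(n_procs, work, draw_noise, rng):
    """One compute phase ending in a barrier: the slowest process decides."""
    return max(work + draw_noise(rng) for _ in range(n_procs))


def mean_phase_time(n_procs, work, draw_noise, phases=200, seed=0):
    """Average phase time over several phases (seeded for repeatability)."""
    rng = random.Random(seed)
    total = sum(phase_time(n_procs, work, draw_noise, rng) for _ in range(phases))
    return total / phases


# Hypothetical noise models; the parameters are illustrative assumptions.
def exponential_noise(rng):
    return rng.expovariate(1.0)  # light tail, mean 1


def heavy_tailed_noise(rng):
    # Pareto with shape 1.5, shifted so the smallest possible delay is 0.
    return rng.paretovariate(1.5) - 1.0


def bernoulli_noise(rng):
    # A fixed delay of 100 with probability 0.01, otherwise no delay.
    return 100.0 if rng.random() < 0.01 else 0.0


if __name__ == "__main__":
    models = [
        ("exponential", exponential_noise),
        ("heavy-tailed", heavy_tailed_noise),
        ("Bernoulli", bernoulli_noise),
    ]
    for n in (16, 128, 1024):
        for name, model in models:
            print(n, name, round(mean_phase_time(n, 10.0, model), 2))
```

Under these assumptions, the mean phase time with exponential noise grows slowly as the process count rises, roughly like the logarithm of the process count. With the heavy-tailed and Bernoulli models it grows much faster, because the chance that at least one process suffers a large delay in a given phase approaches one.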