Performance Advantages of Partitioned Global Address Space Languages
For nearly a decade, the Message Passing Interface (MPI) has been the dominant programming model for high performance parallel computing, in large part because it is universally available and scales to thousands of processors. In this talk I will describe some of the alternatives to MPI based on a Partitioned Global Address Space model of programming, such as UPC and Titanium. I will show that these models offer significant advantages in performance as well as programmer productivity, because they allow the programmer to build global data structures and perform one-sided communication in the form of remote reads and writes, while still giving programmers control over data layout. In particular, I will show that these languages make more effective use of cluster networks with RDMA support, allowing them to outperform two-sided communication on both microbenchmarks and bandwidth-limited computational problems, such as global FFTs. The key optimization is overlap of communication with computation and pipelining communication. Surprisingly, sending smaller messages more frequently can be faster than a few large messages if overlap with computation is possible. This creates an interesting open problem for global scheduling of communication, since the simple strategy of maximum aggregation is not always best. I will also show some of the productivity advantages of these languages through application case studies, including complete Titanium implementations of two different application frameworks: an immersed boundary method package and an elliptic solver using adaptive mesh refinement.