Exploiting OpenMP to Provide Scalable SMP BLAS and LAPACK Routines
- First Online:
The present Fujitsu PRIMEPOWER 2000 system can have up to 128 processors in an SMP node. It is therefore desirable to provide users of this system with high performance parallel BLAS and LAPACK routines that scale to as many processors as possible. It is also desirable that users can obtain some level of parallel performance merely by relinking their codes with SMP Math Libraries. This talk outlines the major design decisions taken in providing OpenMP versions of BLAS and LAPACK routines to users, it discusses some of the algorithmic issues that have been addressed and it discusses some of short comings of OpenMP for this task.
A good deal has been learned about exploiting OpenMP in this on-going activity and the talk will attempt to identify what worked and what did not work. For instance, while OpenMP does not support recursion, some of the basic ideas behind linear algebra with recursive algorithms can be exploited to overlap sequential operations with parallel ones. As another example, the overheads of dynamic scheduling tended to outweigh the better load balancing that such a schedule provides so that static cyclic loop scheduling was more effective.