The Journal of Supercomputing, Volume 33, Issue 3, pp. 197–226

Hyperplane Grouping and Pipelined Schedules: How to Execute Tiled Loops Fast on Clusters of SMPs

  • Maria Athanasaki
  • Aristidis Sotiropoulos
  • Georgios Tsoukalas
  • Nectarios Koziris
  • Panayiotis Tsanakas

Abstract

This paper proposes a novel approach to the parallel execution of tiled iteration spaces on a cluster of SMP PC nodes. Each SMP node has multiple CPUs and a single memory-mapped PCI-SCI Network Interface Card. We apply a hyperplane-based grouping transformation to the tiled space to group independent neighboring tiles together and assign them to the same SMP node, thereby eliminating intranode (intragroup) communication. Groups are executed atomically inside each node, and nodes exchange data between successive group computations. We schedule groups far more efficiently by exploiting the inherent overlap between the communication and computation phases of successive atomic group executions. The resulting non-blocking schedule resembles a pipelined datapath, in which group computation phases are overlapped with communication phases instead of being interleaved with them. Our experimental results show that the proposed method outperforms previous approaches that rely on blocking communication or conventional grouping schemes.
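The gain from the non-blocking schedule can be illustrated with a minimal cost-model sketch (not the authors' implementation; group counts and phase durations are hypothetical). With blocking communication, each group's compute and communicate phases are interleaved and their costs add; in the pipelined schedule, the communication of group k is hidden behind the computation of group k+1, so steady-state cost per group is only the longer of the two phases.

```python
# Hedged sketch: makespan of a blocking vs. an overlapped (pipelined) schedule
# for a chain of atomic group executions on one SMP node.

def blocking_makespan(n_groups, t_comp, t_comm):
    # Blocking: compute and communicate phases alternate, so costs add up.
    return n_groups * (t_comp + t_comm)

def overlapped_makespan(n_groups, t_comp, t_comm):
    # Pipelined: after the first computation fills the pipeline, each further
    # group costs max(t_comp, t_comm), since the shorter phase is hidden;
    # the last group's communication drains the pipeline.
    return t_comp + (n_groups - 1) * max(t_comp, t_comm) + t_comm

if __name__ == "__main__":
    n, tc, tm = 10, 5.0, 3.0   # hypothetical phase durations
    print(blocking_makespan(n, tc, tm))    # 80.0
    print(overlapped_makespan(n, tc, tm))  # 53.0
```

When t_comm <= t_comp, the overlapped schedule approaches the compute-only lower bound n*t_comp, which is why the paper's pipelined datapath analogy yields its speedup over interleaved execution.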

Keywords

supernodes, loop tiling, tile grouping, pipelined schedules, hyperplanes



Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  • Maria Athanasaki (1)
  • Aristidis Sotiropoulos (1)
  • Georgios Tsoukalas (1)
  • Nectarios Koziris (1)
  • Panayiotis Tsanakas (1)

  1. School of Electrical and Computer Engineering, Computing Systems Laboratory, National Technical University of Athens, Zografou, Greece
