Nixdorf 1992: Parallel Architectures and Their Efficient Use pp 68-92
Massively parallel computing: Data distribution and communication
Abstract
We discuss some techniques for preserving locality of reference in index spaces when mapped to memory units in a distributed memory architecture. In particular, we discuss the use of multidimensional address spaces instead of linearized address spaces, partitioning of irregular grids, and placement of partitions among nodes. We also discuss a set of communication primitives we have found very useful on the Connection Machine systems in implementing scientific and engineering applications. We briefly review some of the techniques used to fully utilize the bandwidth of the binary cube network of the CM-2 and CM-200, and give some performance data from implementations of communication primitives.
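As a rough illustration of why a multidimensional address space can preserve locality on a binary cube, the sketch below encodes each axis of a grid index with a binary-reflected Gray code and concatenates the per-axis codes into one node address. This is a minimal sketch under stated assumptions, not the chapter's implementation: the function names (`gray`, `node_address`, `hamming`) and the 8 x 4 grid are hypothetical, and the abstract itself only names Gray codes and multidimensional addressing without giving details.

```python
from itertools import product


def gray(i: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return i ^ (i >> 1)


def node_address(index, bits):
    """Map a multidimensional grid index to a binary cube node address.

    index: tuple of grid coordinates, one per axis (hypothetical layout).
    bits:  tuple with the number of address bits assigned to each axis.
    Each axis is Gray-coded separately and the codes are concatenated.
    """
    addr = 0
    for coord, nbits in zip(index, bits):
        if not 0 <= coord < (1 << nbits):
            raise ValueError("coordinate does not fit in its address field")
        addr = (addr << nbits) | gray(coord)
    return addr


def hamming(a: int, b: int) -> int:
    """Number of differing address bits, i.e. cube distance between nodes."""
    return bin(a ^ b).count("1")


# Assumed example: an 8 x 4 grid on a 5-dimensional cube (32 nodes).
BITS = (3, 2)
SHAPE = tuple(1 << b for b in BITS)


def max_neighbor_distance():
    """Largest cube distance between grid neighbors, wraparound included."""
    worst = 0
    for idx in product(*(range(n) for n in SHAPE)):
        for axis, extent in enumerate(SHAPE):
            nbr = list(idx)
            nbr[axis] = (idx[axis] + 1) % extent
            d = hamming(node_address(idx, BITS), node_address(tuple(nbr), BITS))
            worst = max(worst, d)
    return worst


if __name__ == "__main__":
    print("max cube distance between grid neighbors:", max_neighbor_distance())
    # For contrast, plain binary encoding of adjacent indices 3 and 4:
    print("binary-coded distance for 3 -> 4:", hamming(3, 4))
```

With this encoding every pair of adjacent grid points, including the wraparound pairs along each axis, lands on cube nodes whose addresses differ in exactly one bit, whereas plain binary encoding of the adjacent indices 3 and 4 differs in three bits.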
Keywords
Local Memory, Hamiltonian Path, Address Space, Gray Code, Index Space