Ensemble Architectures and their Algorithms: An Overview

  • S. Lennart Johnsson
Conference paper
Part of The IMA Volumes in Mathematics and Its Applications book series (IMA, volume 13)


In recent years the number of commercially available parallel computer architectures has increased dramatically. The number of processors in these systems varies from a few up to 64k for the Connection Machine. In this paper we discuss some of the technology issues that are the underlying driving force, and focus on a particular class of parallel computer architectures often called ensemble architectures, which are interesting candidates for future high-performance computing systems. The ensemble configurations discussed here are linear arrays, 2-dimensional arrays, binary trees, shuffle-exchange networks, Boolean cubes, and cube-connected cycles. We discuss a few algorithms for arbitrary data permutations, and some particular data permutation and distribution algorithms used in standard matrix computations. Special attention is given to data routing. Distributed routing algorithms in which elements with distinct origins and distinct destinations do not traverse the same communication link permit a maximum degree of pipelined communication. The linear algebra computations discussed are: matrix transposition, matrix multiplication, dense and general banded system solvers, linear recurrence solvers, tridiagonal system solvers, fast Poisson solvers, and, very briefly, iterative methods.
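A standard device behind several of the ensemble configurations named above is the binary-reflected Gray code, which embeds a ring (or linear array) of 2^n elements in a Boolean n-cube so that ring neighbours occupy cube nodes differing in exactly one bit. The sketch below illustrates only this well-known property; the function names `gray` and `hamming` are ours, not notation from the paper.

```python
def gray(i):
    # Binary-reflected Gray code of i: XOR each bit with the bit above it.
    return i ^ (i >> 1)

def hamming(a, b):
    # Number of bit positions in which a and b differ.
    return bin(a ^ b).count("1")

# Place ring element j at cube node gray(j). Consecutive ring
# positions then map to cube nodes at Hamming distance 1, i.e.
# to neighbouring nodes of the Boolean n-cube.
n = 4
ring = [gray(j) for j in range(2 ** n)]
assert all(hamming(ring[j], ring[(j + 1) % 2 ** n]) == 1
           for j in range(2 ** n))
```

Because `gray` is a bijection on `0 .. 2**n - 1`, the embedding uses every cube node exactly once, which is why Gray codes appear among the keywords of papers on cube-configured ensembles.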


Keywords: binary tree, Gaussian elimination, computation graph, systolic array, Gray code
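The matrix transposition mentioned in the abstract is commonly organised on Boolean cubes as a recursive block exchange: the two diagonal quadrants are transposed in place, and the two off-diagonal quadrants are swapped and transposed. The following is a sequential sketch of that recursion for power-of-two matrix orders, illustrative only; it is not the distributed algorithm of the paper, and `transpose` is our naming.

```python
def transpose(m):
    # Recursive block transpose of an n-by-n matrix, n a power of two:
    # transpose the diagonal quadrants, swap-and-transpose the rest.
    n = len(m)
    if n == 1:
        return [row[:] for row in m]
    h = n // 2
    a = [row[:h] for row in m[:h]]   # top-left quadrant
    b = [row[h:] for row in m[:h]]   # top-right quadrant
    c = [row[:h] for row in m[h:]]   # bottom-left quadrant
    d = [row[h:] for row in m[h:]]   # bottom-right quadrant
    a, b, c, d = transpose(a), transpose(c), transpose(b), transpose(d)
    return ([a[i] + b[i] for i in range(h)] +
            [c[i] + d[i] for i in range(h)])

m = [[1, 2], [3, 4]]
# transpose(m) → [[1, 3], [2, 4]]
```

On a cube-configured ensemble the quadrant swap at each level of the recursion becomes an exchange across one cube dimension, which is what makes the recursive formulation natural there.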





Copyright information

© Springer-Verlag New York Inc. 1988

Authors and Affiliations

  • S. Lennart Johnsson, Departments of Computer Science and Electrical Engineering, Yale University, USA
