Tolerating Communication Latency through Dynamic Thread Invocation in a Multithreaded Architecture
Communication latency is a key parameter affecting the performance of distributed-memory multiprocessors. Instruction-level multithreading attempts to tolerate latency by overlapping communication with computation. This chapter examines the multithreading capabilities of the EM-X distributed-memory multiprocessor through empirical studies. The EM-X provides hardware support for dynamic function spawning and instruction-level multithreading. This support includes a by-passing mechanism for direct remote reads and writes, hardware FIFO thread scheduling, and dedicated instructions for generating fixed-size communication packets based on one-sided communication. Two problems, bitonic sorting and the Fast Fourier Transform (FFT), are selected for the experiments. Parameters that characterize multithreading performance are investigated, including the number of threads, the number of thread switches, the run length, and the number of remote reads. Experimental results indicate that the best communication performance occurs with two to four threads. Using more than eight threads is found to be inefficient and adversely affects overall performance. FFT yielded over 95% overlapping, owing to a large amount of computation and communication parallelism across threads. Even in the absence of thread computation parallelism, multithreading helps overlap over 35% of the communication time for bitonic sorting.
Keywords: Fast Fourier Transform, Switching Cost, Communication Time, Direct Memory Access, Remote Memory
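
As a rough illustration of why bitonic sorting stresses communication, the sketch below implements the textbook bitonic sorting network on a single Python list. It is not the EM-X implementation described in the chapter. The function name `bitonic_sort` and the `compare_exchanges` counter are hypothetical, introduced only for this example. Each compare-exchange reads a partner element at index `i ^ j`; on a distributed-memory machine some of these partner reads would be remote, depending on how the data is distributed, which this sketch does not model.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network.

    Returns the sorted list and the number of compare-exchange steps.
    Illustrative only: runs sequentially on one list, with no threads
    and no real communication.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(values)
    compare_exchanges = 0  # hypothetical counter: one partner read per step
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:  # handle each pair once
                    compare_exchanges += 1
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a, compare_exchanges


if __name__ == "__main__":
    result, steps = bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5])
    print(result, steps)
```

Every compare-exchange does very little arithmetic per partner read, which is consistent with the abstract's remark that bitonic sorting offers little computation to overlap with communication.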