A Parallelizing Compiler Cooperative Heterogeneous Multicore Processor Architecture

  • Yasutaka Wada
  • Akihiro Hayashi
  • Takeshi Masuura
  • Jun Shirako
  • Hirofumi Nakano
  • Hiroaki Shikano
  • Keiji Kimura
  • Hironori Kasahara
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6760)

Abstract

Heterogeneous multicore architectures, which integrate several kinds of accelerator cores alongside general-purpose processor cores, have attracted much attention as a way to achieve high performance with low power consumption. To attain effective high performance, high application software productivity, and low power consumption on heterogeneous multicores, cooperation between the architecture and a parallelizing compiler is important. This paper proposes a compiler-cooperative heterogeneous multicore architecture and a parallelizing compilation scheme for it. The performance of the proposed scheme is evaluated with an MP3 encoder on a heterogeneous multicore that integrates Hitachi and Renesas’ SH4A processor cores and Hitachi’s FE-GA accelerator cores. The heterogeneous multicore achieves a 14.34-fold speedup with two SH4As and two FE-GAs, and a 26.05-fold speedup with four SH4As and four FE-GAs, relative to sequential execution on a single SH4A. Cooperation between the heterogeneous multicore architecture and the parallelizing compiler makes it possible to achieve high performance within a short development period.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Yasutaka Wada
    • 1
  • Akihiro Hayashi
    • 1
  • Takeshi Masuura
    • 1
  • Jun Shirako
    • 1
  • Hirofumi Nakano
    • 1
  • Hiroaki Shikano
    • 1
  • Keiji Kimura
    • 1
  • Hironori Kasahara
    • 1
  1. Department of Computer Science and Engineering, Waseda University, Shinjuku-ku, Japan
