
Challenges and possible approaches: towards the petaflops computers

  • Review Article
  • Published in Frontiers of Computer Science in China

Abstract

In parallel with R&D efforts in the USA and Europe, China's National High-tech R&D Program has set the goal of developing petaflops computers. Researchers and engineers worldwide are searching for appropriate methods and technologies to build petaflops computer systems. Based on a discussion of important design issues in developing petaflops computers, this paper identifies the major technological challenges, including the memory wall, low-power system design, interconnects, and programming support. Current efforts to address some of these challenges and to pursue possible solutions for petaflops systems are presented. Several existing systems are briefly introduced as examples, including Roadrunner, the Cray XT5 Jaguar, Dawning 5000A/6000, and Lenovo DeepComp 7000. Architectures proposed by Chinese researchers for implementing petaflops computers are also introduced, and the advantages of these architectures as well as the difficulties in their implementation are discussed. Finally, future research directions in the development of high-productivity computing systems are discussed.



Author information


Correspondence to Depei Qian.

Cite this article

Qian, D., Zhu, D. Challenges and possible approaches: towards the petaflops computers. Front. Comput. Sci. China 3, 273–289 (2009). https://doi.org/10.1007/s11704-009-0022-6

