A Systematic Design Space Exploration Approach to Customising Multi-Processor Architectures: Exemplified Using Graphics Processors

  • Ben Cope
  • Peter Y. K. Cheung
  • Wayne Luk
  • Lee Howes
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6760)


A systematic approach to customising Homogeneous Multi-Processor (HoMP) architectures is described. The approach involves a novel design space exploration tool and a parameterisable system model. Post-fabrication customisation options for using reconfigurable logic with a HoMP are classified. The use of the approach to explore pre- and post-fabrication customisation options that optimise an architecture’s critical paths is then described. The approach and its steps are demonstrated using the architecture of a graphics processor. We also analyse on-chip and off-chip memory access for systems with one or more processing elements (PEs), and study the impact of the number of threads per PE on the amount of off-chip memory access and the number of cycles for each output. It is shown that post-fabrication customisation of a graphics processor can improve performance by up to four times at negligible area cost.


Design Space; Processing Element; Cache Size; Design Space Exploration; Graphic Processor
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Ben Cope (1)
  • Peter Y. K. Cheung (1)
  • Wayne Luk (2)
  • Lee Howes (2)
  1. Department of Electrical & Electronic Engineering, Imperial College London, UK
  2. Department of Computing, Imperial College London, UK
