Improving Memory Subsystem Performance Using ViVA: Virtual Vector Architecture

  • Joseph Gebis
  • Leonid Oliker
  • John Shalf
  • Samuel Williams
  • Katherine Yelick
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5455)

Abstract

The disparity between microprocessor clock frequencies and memory latency is a primary reason why many demanding applications run well below peak achievable performance. Software-controlled scratchpad memories, such as the Cell local store, attempt to ameliorate this discrepancy by enabling precise control over data movement; however, scratchpad technology confronts the programmer and compiler with an unfamiliar and difficult programming model. In this work, we present the Virtual Vector Architecture (ViVA), which combines the memory semantics of vector computers with a software-controlled scratchpad memory to provide a more effective and practical approach to latency hiding. ViVA requires minimal changes to the core design and could thus be easily integrated with conventional processor cores. To validate our approach, we implemented ViVA on the Mambo cycle-accurate full-system simulator, which was carefully calibrated to match the performance of our underlying PowerPC Apple G5 architecture. Results show that ViVA delivers significant performance benefits over scalar techniques for a variety of memory access patterns as well as two important memory-bound compact kernels, corner turn and sparse matrix-vector multiplication, achieving a 2x–13x improvement compared to the scalar version. Overall, our preliminary ViVA exploration points to a promising, power-efficient approach for improving application performance on leading microprocessors with minimal design and complexity costs.

Keywords

Cache Line · Memory Subsystem · Memory Access Pattern · Scratchpad Memory · Corner Turn

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Joseph Gebis (1, 2)
  • Leonid Oliker (1, 2)
  • John Shalf (1)
  • Samuel Williams (1, 2)
  • Katherine Yelick (1, 2)
  1. Lawrence Berkeley National Laboratory, CRD/NERSC, Berkeley, USA
  2. CS Division, University of California at Berkeley, Berkeley, USA