L1 Cache and TLB Enhancements to the RAMpage Memory Hierarchy

  • Philip Machanick
  • Zunaid Patel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2823)


The RAMpage hierarchy moves main memory up a level to replace the lowest-level cache by an equivalent-sized SRAM main memory, with a TLB caching page translations for that main memory. This paper illustrates how more aggressive components higher in the hierarchy increase the fraction of total execution time spent waiting for DRAM. For an instruction issue rate of 1 GHz, the simulated standard hierarchy waited for DRAM 10% of the time, increasing to 40% at an instruction issue rate of 8 GHz. For a larger L1 cache, the fraction of time waiting for DRAM was even higher. RAMpage with context switches on misses was able to hide almost all DRAM latency. A larger TLB was shown to increase the viable range of RAMpage SRAM page sizes.
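The scaling effect the abstract describes can be sketched with a back-of-envelope model: if the DRAM miss penalty is fixed in wall-clock time while per-instruction compute time shrinks as the issue rate rises, the stall fraction grows. This is only an illustrative sketch, not the paper's simulator; the miss rate and DRAM latency below are assumed values chosen to show the trend, not figures from the paper.

```python
def dram_wait_fraction(issue_rate_hz, miss_rate, dram_latency_s):
    """Fraction of total execution time spent stalled on DRAM,
    assuming compute time per instruction shrinks with issue rate
    while the DRAM miss penalty stays fixed in seconds."""
    compute_per_insn = 1.0 / issue_rate_hz
    stall_per_insn = miss_rate * dram_latency_s
    return stall_per_insn / (compute_per_insn + stall_per_insn)

# Hypothetical parameters, for illustration only.
MISS_RATE = 2e-3        # DRAM references per instruction (assumed)
DRAM_LATENCY = 50e-9    # 50 ns miss penalty (assumed)

for f_ghz in (1, 2, 4, 8):
    frac = dram_wait_fraction(f_ghz * 1e9, MISS_RATE, DRAM_LATENCY)
    print(f"{f_ghz} GHz: {frac:.0%} of time waiting for DRAM")
```

With these assumed parameters the stall fraction rises from roughly 9% at 1 GHz to over 40% at 8 GHz, qualitatively matching the trend reported for the standard hierarchy; RAMpage's context switches on misses attack the same fixed stall term by overlapping it with other work.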





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Philip Machanick (School of ITEE, University of Queensland, Brisbane, Australia)
  • Zunaid Patel (School of Computer Science, University of the Witwatersrand, Johannesburg, South Africa)
