Evaluation of Large L3 Caches Using TPC-H Trace Samples
In this chapter we evaluate the miss rates of four L3 cache architectures for small-scale multiprocessors. Eight processors are partitioned into 1, 2, 4, or 8 clusters of 8, 4, 2, or 1 processors, respectively. Each cluster has a large L3 cache, and the aggregate L3 capacity in each of the four architectures varies between 64 MB and 1 GB. The target workload of our evaluation is decision support systems. We use bus trace samples obtained during the execution of a 100 GB TPC-H benchmark on an 8-way multiprocessor. The 12 time samples were taken at one-hour intervals during the first day of execution of TPC-H. Each sample contains 64 M bus references.
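The four cluster organizations can be enumerated with a few lines of Python. This is only an illustrative sketch: the abstract gives the aggregate L3 range (64 MB to 1 GB) but not how capacity is divided, so the even split across clusters is an assumption, as are the contiguous processor-to-cluster mapping and the names `cluster_configs` and `cluster_of`.

```python
PROCESSORS = 8          # from the abstract: an 8-way multiprocessor
MB = 1 << 20

def cluster_configs(aggregate_bytes):
    """List the four organizations for one aggregate L3 capacity.

    Assumption: the aggregate capacity is split evenly across clusters.
    """
    configs = []
    for clusters in (1, 2, 4, 8):
        configs.append({
            "clusters": clusters,
            "procs_per_cluster": PROCESSORS // clusters,
            "l3_bytes_per_cluster": aggregate_bytes // clusters,
        })
    return configs

def cluster_of(proc_id, clusters):
    """Map a processor id (0..7) to its cluster.

    Assumption: processors are assigned to clusters in contiguous groups.
    """
    return proc_id // (PROCESSORS // clusters)

if __name__ == "__main__":
    for cfg in cluster_configs(64 * MB):
        print(cfg)
```

Under these assumptions, a 64 MB aggregate gives one 64 MB shared cache at one extreme and eight private 8 MB caches at the other.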
We first show the distribution of bus references across samples and across processors within the same sample. The major problem with time samples is the cold-start misses at the beginning of each sample. We show the cache warm-up rate for all cluster architectures and cache sizes. Unfortunately, systems with aggregate L3 cache sizes above 128 MB are never completely warm with our samples of 64 M references. Thus we evaluate cache architectures under three conditions: cold cache at the beginning of each sample, warm sets only, and stitched trace. We classify misses to understand their cause. Observations are similar across all three simulation types. We also show that the 12 time samples exhibit similar behavior.
One of the major observations from the twelve 64 M-reference trace samples is the large number of interprocessor and I/O coherence misses.
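The difference between the cold-cache and warm-sets-only conditions can be illustrated with a small simulator. The sketch below is not the chapter's simulator. It assumes a set-associative cache with LRU replacement and 64-byte blocks, and it reads "warm sets only" as counting a reference only when the set it maps to is already full. The class `SetAssocCache` and the function `miss_rates` are hypothetical names.

```python
from collections import OrderedDict

class SetAssocCache:
    """Set-associative cache with LRU replacement (assumed policy)."""

    def __init__(self, num_sets, ways, block_bytes=64):
        self.num_sets = num_sets
        self.ways = ways
        self.block_bytes = block_bytes
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Return (hit, set_was_warm) for one reference."""
        block = addr // self.block_bytes
        s = self.sets[block % self.num_sets]
        warm = len(s) == self.ways       # set already filled before this access
        hit = block in s
        if hit:
            s.move_to_end(block)         # mark as most recently used
        else:
            if warm:
                s.popitem(last=False)    # evict the least recently used block
            s[block] = True
        return hit, warm

def miss_rates(trace, cache):
    """Return (cold-start miss rate, warm-sets-only miss rate)."""
    refs = misses = warm_refs = warm_misses = 0
    for addr in trace:
        hit, warm = cache.access(addr)
        refs += 1
        misses += not hit
        if warm:                         # count only references to filled sets
            warm_refs += 1
            warm_misses += not hit
    cold = misses / refs if refs else 0.0
    warm_rate = warm_misses / warm_refs if warm_refs else None
    return cold, warm_rate
```

The cold-cache figure charges every compulsory fill at the start of a sample as a miss, while the warm-sets figure ignores references until their set has filled. When the cache is much larger than the sample's footprint, few sets ever fill and the warm-sets figure rests on few references, which is consistent with the warm-up problem described above for the larger caches.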
Keywords: Address Space, Cache Size, Shared Cache, Trace Sample, Large Cache