BADCO: Behavioral Application-Dependent Superscalar Core Models

International Journal of Parallel Programming

Abstract

Microarchitecture research and development rely heavily on simulators. The ideal simulator would be simple and easy to develop, and it would be precise, accurate and very fast. But the ideal simulator does not exist, and microarchitects use different sorts of simulators at different stages of the development of a processor, depending on whether accuracy or simulation speed matters more. Approximate microarchitecture models, which trade accuracy for simulation speed, are very useful for research and design space exploration, provided the loss of accuracy remains acceptable. Behavioral superscalar core modeling is a possible way to trade accuracy for simulation speed in situations where the focus of the study is not the core itself. In this approach, a superscalar core is viewed as a black box emitting requests to the uncore at certain times. A behavioral core model can be connected to a detailed uncore model. Behavioral core models are built from detailed simulations. Once the time to build the model is amortized, significant simulation speedups can be obtained. We describe and study a new method for defining behavioral models for modern superscalar cores. The proposed Behavioral Application-Dependent Superscalar Core model, BADCO, predicts the execution time of a thread running on a superscalar core with an error of less than 10% in most cases. We show that BADCO is qualitatively accurate, being able to predict how performance changes when we change the uncore. The simulation speedups we obtained are typically between one and two orders of magnitude.

Notes

  1. Lee et al. used SimpleScalar sim-outorder [1] for their experiments.

  2. If an L1 miss Y is data-dependent on a delayed L1 hit which is waiting for a cache line requested by a previous L1 miss X, then Y is considered data-dependent on X [2].

  3. The only exception is an L1 store miss, because store requests are processed at commit on x86 architectures.

  4. The fetch stall models the pipeline flush on real architectures.

  5. Feedback mechanisms in the prefetcher throttle prefetch requests when needed. The occupancy of the MSHR and the utilization of the bus, for example, can be monitored by the prefetch controller to decide whether or not to issue prefetch requests.

  6. We attach a DL1 miss request to the first \(\mu \)op (load or store) accessing that cache line. We attach a DL1 prefetch to the \(\mu \)op triggering the prefetch. We attach a write-back request to the same \(\mu \)op to which the request causing the write-back is attached.

  7. The Zesto model implements next-line prefetching for the instructions, but does not pipeline the instruction misses. Node fetching mimics this behavior.

  8. For instance, the Zesto simulator allocates an MSHR entry for each delayed hit (i.e., hits on a pending miss). Our BADCO machine does the same in our experiments. This is why we simulate an unlimited MSHR for generating trace TL, so as to capture all potential delayed hits in the trace.

  9. Each request to the uncore is attached to a single \(\mu \)op.

  10. D(X) is null or 0 when there is no request \(\mu \)op whose CT is less than the IT of X. This happens only at the beginning of the trace.

  11. The uncore simulator was extracted from Zesto.
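
The feedback described in note 5 can be illustrated with a minimal sketch. It is not taken from Zesto or from BADCO: the `UncoreState` type, the `should_issue_prefetch` function and both threshold values are hypothetical, chosen only to show how MSHR occupancy and bus utilization could gate prefetch requests.

```python
from dataclasses import dataclass


@dataclass
class UncoreState:
    """Hypothetical snapshot of the quantities a prefetch controller monitors."""
    mshr_used: int          # MSHR entries currently allocated
    mshr_size: int          # total number of MSHR entries
    bus_utilization: float  # fraction of bus cycles in use, between 0.0 and 1.0


def should_issue_prefetch(state: UncoreState,
                          max_mshr_occupancy: float = 0.75,
                          max_bus_utilization: float = 0.80) -> bool:
    """Return True when both monitored resources are below their thresholds.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    occupancy = state.mshr_used / state.mshr_size
    return (occupancy < max_mshr_occupancy
            and state.bus_utilization < max_bus_utilization)


# A lightly loaded uncore accepts the prefetch; a nearly full MSHR or a
# saturated bus suppresses it.
idle = UncoreState(mshr_used=2, mshr_size=16, bus_utilization=0.30)
mshr_full = UncoreState(mshr_used=14, mshr_size=16, bus_utilization=0.30)
bus_busy = UncoreState(mshr_used=2, mshr_size=16, bus_utilization=0.95)
decisions = [should_issue_prefetch(s) for s in (idle, mshr_full, bus_busy)]
```

A real prefetch controller may use other signals or adaptive thresholds; the sketch only captures the idea that demand traffic is given priority when shared resources are scarce.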

References

  1. Austin, T., Larson, E., Ernst, D.: SimpleScalar: an infrastructure for computer system modeling. IEEE Comput. 35(2), 59–67 (2002). http://www.simplescalar.com/

  2. Chen, X.E., Aamodt, T.M.: Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs. In: Proceedings of the 41st International Symposium on Microarchitecture (2008)

  3. Cho, S., Demetriades, S., Evans, S., Jin, L., Lee, H., Lee, K., Moeng, M.: TPTS: a novel framework for very fast manycore processor architecture simulation. In: Proceedings of the 37th International Conference on Parallel Processing (2008)

  4. Durbhakula, M., Pai, V.S., Adve, S.: Improving the accuracy vs. speed tradeoff for simulating shared-memory multiprocessors with ILP processors. In: Proceedings of the 5th International Symposium on High-Performance Computer Architecture (1999)

  5. Eyerman, S., Eeckhout, L., Karkhanis, T., Smith, J.E.: A mechanistic performance model for superscalar out-of-order processors. ACM Trans. Comput. Syst. 27(2) (2009)

  6. Eyerman, S., Smith, J.E., Eeckhout, L.: Characterizing the branch misprediction penalty. In: Proceedings of the International Symposium on Performance Analysis of Systems and Software (2011)

  7. Fields, B.A., Bodik, R., Hill, M.D., Newburn, C.J.: Using interaction costs for microarchitectural bottleneck analysis. In: Proceedings of the 36th International Symposium on Microarchitecture (2003)

  8. Fields, B., Rubin, S., Bodik, R.: Focusing processor policies via critical-path prediction. In: Proceedings of the 28th International Symposium on Computer Architecture (2001)

  9. Genbrugge, D., Eyerman, S., Eeckhout, L.: Interval simulation: raising the level of abstraction in architectural simulation. In: Proceedings of the 16th International Symposium on High-Performance Computer Architecture (2010)

  10. Goldschmidt, S.R., Hennessy, J.L.: The accuracy of trace-driven simulations of multiprocessors. In: Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (1993)

  11. İpek, E., McKee, S., Caruana, R., de Supinski, B., Schulz, M.: Efficiently Exploring Architectural Design Spaces Via Predictive Modeling, vol. 40. ACM (2006)

  12. Joseph, P., Vaswani, K., Thazhuthaveetil, M.: Construction and use of linear regression models for processor performance analysis. In: High-Performance Computer Architecture, 2006. The Twelfth International Symposium on, pp. 99–108. IEEE (2006)

  13. Kanaujia, S., Papazian, I.E., Chamberlain, J., Baxter, J.: FastMP: a multi-core simulation methodology. In: Workshop on Modeling, Benchmarking and Simulation (2006)

  14. Karkhanis, T.S., Smith, J.E.: A first-order superscalar processor model. In: Proceedings of the 31st International Symposium on Computer Architecture (2004)

  15. Lee, K., Cho, S.: In-N-Out: reproducing out-of-order superscalar processor behavior from reduced in-order traces. In: Proceedings of the IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) (2011)

  16. Lee, K., Evans, S., Cho, S.: Accurately approximating superscalar processor performance from traces. In: Proceedings of the International Symposium on Performance Analysis of Systems and Software (2009)

  17. Li, Y., Lee, B., Brooks, D., Hu, Z., Skadron, K.: CMP design space exploration subject to physical constraints. In: Proceedings of the 12th International Symposium on High Performance Computer Architecture (2006)

  18. Loh, G., Subramaniam, S., Xie, Y.: Zesto: a cycle-level simulator for highly detailed microarchitecture exploration. In: Proceedings of the International Symposium on Performance Analysis of Systems and Software (2009)

  19. Loh, G.: A time-stamping algorithm for efficient performance estimation of superscalar processors. In: Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (2001)

  20. Moses, J., Illikkal, R., Iyer, R., Huggahalli, R., Newell, D.: ASPEN: towards effective simulation of threads & engines in evolving platforms. In: Proceedings of the 12th IEEE/ACM International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (2004)

  21. Mutlu, O., Kim, H., Armstrong, D., Patt, Y.: Understanding the effects of wrong-path memory references on processor performance. In: Proceedings of the 3rd Workshop on Memory Performance Issues: In Conjunction with the 31st International Symposium on Computer Architecture, pp. 56–64. ACM (2004)

  22. Noonburg, D.B., Shen, J.P.: Theoretical modeling of superscalar processor performance. In: Proceedings of the 27th International Symposium on Microarchitecture (1994)

  23. Rico, A., Duran, A., Cabarcas, F., Etsion, Y., Ramirez, A., Valero, M.: Trace-driven simulation of multithreaded applications. In: Proceedings of the International Symposium on Performance Analysis of Systems and Software (2011)

  24. Ryckbosch, F., Polfliet, S., Eeckhout, L.: Fast, accurate, and validated full-system software simulation on x86 hardware. IEEE Micro 30(6), 46–56 (2010)

  25. Sendag, R., Yilmazer, A., Yi, J., Uht, A.: Quantifying and reducing the effects of wrong-path memory references in cache-coherent multiprocessor systems. In: Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, pp. 10-pp. IEEE (2006)

  26. Sorin, D.J., Pai, V.S., Adve, S.V., Vernon, M.K., Wood, D.A.: Analytic evaluation of shared-memory systems with ILP processors. In: Proceedings of the 25th International Symposium on Computer Architecture (1998)

  27. Zhao, L., Iyer, R., Moses, J., Illikkal, R., Makineni, S., Newell, D.: Exploring large-scale CMP architectures using ManySim. IEEE Micro 27(4), 21–33 (2007)

Author information

Corresponding author

Correspondence to Ricardo A. Velásquez.

Additional information

This work was partially supported by the European Research Council Advanced Grant DAL No 267175 and a PhD fellowship funded by Region Bretagne.

Appendix

Let us assume that the execution of a program by a superscalar processor can be modeled as a graph, where nodes represent certain events and edges represent dependences between events [7, 8]. Each edge is annotated with a latency. Let us assume that requests to the uncore are a subset of the graph edges, and that all the requests have the same latency \(X\). We enumerate all the possible paths (i.e., dependence chains) in the graph and denote by \(N_k\) the number of requests on path \(k\). The length of path \(k\), with \(L_k\) denoting the total latency of its non-request edges, is

$$\begin{aligned} T_k (X) = L_k + N_k X \end{aligned}$$

and the total execution time is the length of the longest path

$$\begin{aligned} T (X) = \max _k \, T_k(X) = T_{p(X)} (X) \end{aligned}$$

where \(p(X)\) is the longest path. \(N_{p(X)}\) is the slope of \(T(X)\) at \(X\). Let us consider \(X < Y\). We have

$$\begin{aligned} T_{p(Y)}(X) \le T(X) \\ T_{p(X)}(Y) \le T(Y) \end{aligned}$$

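Since \(T(X) = T_{p(X)}(X)\) and \(T(Y) = T_{p(Y)}(Y)\), summing these two inequalities and cancelling the \(L\) terms gives \((N_{p(Y)} - N_{p(X)}) X \le (N_{p(Y)} - N_{p(X)}) Y\), which, given \(X < Y\), is possible only if \(N_{p(Y)} \ge N_{p(X)}\). The slope of \(T(X)\) is therefore non-decreasing in \(X\), hence \(T(X)\) is convex.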
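
The argument can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the three paths and their \((L_k, N_k)\) values are made up, and the function names are hypothetical. It evaluates \(T(X) = \max_k (L_k + N_k X)\) on an evenly spaced grid of latencies and verifies that the slope \(N_{p(X)}\) never decreases and that the second differences of \(T\) are non-negative.

```python
def execution_time(paths, x):
    """T(X): length of the longest path, each path given as (L_k, N_k)."""
    return max(l + n * x for l, n in paths)


def critical_path_requests(paths, x):
    """N_{p(X)}: number of requests on the longest path at latency x.

    When several paths tie, the one with the most requests is chosen,
    which corresponds to the right-hand slope of T at x.
    """
    _, n = max(paths, key=lambda p: (p[0] + p[1] * x, p[1]))
    return n


def is_convex_on_grid(values):
    """Check non-negative second differences on an evenly spaced grid."""
    return all(values[i - 1] - 2 * values[i] + values[i + 1] >= 0
               for i in range(1, len(values) - 1))


# Made-up dependence chains: (L_k, N_k).
paths = [(100, 0), (60, 2), (10, 5)]
latencies = [0, 10, 20, 30, 40]  # evenly spaced uncore latencies X

times = [execution_time(paths, x) for x in latencies]
slopes = [critical_path_requests(paths, x) for x in latencies]

print(times)   # [100, 100, 110, 160, 210]
print(slopes)  # [0, 0, 5, 5, 5]
print(is_convex_on_grid(times))  # True
```

At low latency the longest path contains no request, so the execution time is flat; as \(X\) grows, a path with more requests becomes critical and the slope increases, which is the convex shape the derivation predicts.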

About this article

Cite this article

Velásquez, R.A., Michaud, P. & Seznec, A. BADCO: Behavioral Application-Dependent Superscalar Core Models. Int J Parallel Prog 43, 130–157 (2015). https://doi.org/10.1007/s10766-013-0278-1

  • DOI: https://doi.org/10.1007/s10766-013-0278-1
