Efficient Sequential Consistency Using Conditional Fences

International Journal of Parallel Programming

Abstract

Among the various memory consistency models, the sequential consistency (SC) model is the most intuitive and lets programmers reason about their parallel programs most easily. Nevertheless, processor designers often choose to support relaxed memory consistency models, because the weaker ordering constraints imposed by such models allow more instructions to be reordered and enable higher performance. Programs running on machines that support weaker consistency models can be transformed into ones in which SC is enforced. The compiler does this by computing a minimal set of memory access pairs whose ordering guarantees SC. To ensure that these memory access pairs are not reordered, memory fences are inserted. Unfortunately, inserting such memory fences can significantly slow down the program. We observe that the ordering of the minimal set of memory accesses that the compiler strives to enforce is typically already enforced in the normal course of program execution. A study we conducted on programs with compiler-inserted memory fences shows that only 8% of the executed instances of the memory fences are really necessary to ensure SC. Motivated by this study, we propose the conditional fence mechanism, known as C-Fence, which utilizes compiler information to decide dynamically whether there is a need to stall at each fence, stalling only when necessary. Our experiments with SPLASH-2 benchmarks show that, with C-Fences and a centralized active table, programs can be transformed to enforce SC while incurring only a 12% slowdown, as opposed to a 43% slowdown using normal fence instructions. Our approach requires very little hardware support (<350 bytes of on-chip storage), and it avoids the use of speculation and its associated costs. Furthermore, to ameliorate the contention in the centralized active table that arises as the number of processors increases, we also design a distributed active table, which further improves the performance of C-Fence for a larger number of processors.
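
The idea of stalling only when necessary can be illustrated with a toy model. The sketch below is not the hardware design from the article; it assumes, purely for illustration, that the compiler tags each fence with a group id shared by the fences it must be ordered against, and that a fence needs to stall only if a fence of the same group is currently active on another processor. The names `ActiveTable`, `enter`, `leave`, and `count_stalls` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ActiveTable:
    """Toy centralized table of fences that are currently in flight.

    Assumption: fences carry a compiler-assigned group id, and only fences
    in the same group on different processors can conflict.
    """
    # group id -> set of processor ids with an active fence in that group
    active: dict = field(default_factory=dict)

    def enter(self, group: int, cpu: int) -> bool:
        """Register a fence and report whether it would have to stall."""
        cpus = self.active.setdefault(group, set())
        must_stall = any(other != cpu for other in cpus)
        cpus.add(cpu)
        return must_stall

    def leave(self, group: int, cpu: int) -> None:
        """Remove a fence once the accesses it guards have completed."""
        self.active.get(group, set()).discard(cpu)


def count_stalls(trace) -> int:
    """Replay (event, group, cpu) tuples and count fences that stall."""
    table = ActiveTable()
    stalls = 0
    for event, group, cpu in trace:
        if event == "enter":
            stalls += int(table.enter(group, cpu))
        else:
            table.leave(group, cpu)
    return stalls


# Hypothetical trace: four fence instances, only one overlaps a conflicting fence.
trace = [
    ("enter", 0, 0), ("leave", 0, 0),  # cpu 0 alone in group 0: no stall
    ("enter", 0, 1), ("enter", 0, 0),  # cpu 0 meets cpu 1's active fence: stall
    ("leave", 0, 1), ("leave", 0, 0),
    ("enter", 1, 0), ("enter", 2, 1),  # different groups: no stall
]
stalled = count_stalls(trace)
```

In this model an unconditional fence would stall at every `enter` event, whereas the conditional check stalls only when a conflicting fence is actually in flight, which mirrors the abstract's observation that most executed fence instances are unnecessary.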

Author information

Corresponding author

Correspondence to Changhui Lin.

About this article

Cite this article

Lin, C., Nagarajan, V. & Gupta, R. Efficient Sequential Consistency Using Conditional Fences. Int J Parallel Prog 40, 84–117 (2012). https://doi.org/10.1007/s10766-011-0176-3

  • DOI: https://doi.org/10.1007/s10766-011-0176-3
