Efficient Sequential Consistency Using Conditional Fences

Lin, Changhui; Nagarajan, Vijay; Gupta, Rajiv

doi:10.1007/s10766-011-0176-3

Published: 28 June 2011

Volume 40, pages 84–117, (2012)

International Journal of Parallel Programming

Changhui Lin¹,
Vijay Nagarajan² &
Rajiv Gupta¹

Abstract

Among the various memory consistency models, sequential consistency (SC) is the most intuitive and makes parallel programs the easiest to reason about. Nevertheless, processor designers often choose to support relaxed memory consistency models, because the weaker ordering constraints these models impose allow more instructions to be reordered and enable higher performance. Programs running on machines that support weaker consistency models can be transformed into ones in which SC is enforced. The compiler does this by computing a minimal set of memory access pairs whose ordering automatically guarantees SC. To ensure that these memory access pairs are not reordered, memory fences are inserted. Unfortunately, inserting such memory fences can significantly slow down the program. We observe that the ordering of the minimal set of memory accesses that the compiler strives to enforce is typically already enforced in the normal course of program execution. A study we conducted on programs with compiler-inserted memory fences shows that only 8% of the executed instances of the memory fences are really necessary to ensure SC. Motivated by this study, we propose the conditional fence mechanism, known as C-Fence, which utilizes compiler information to decide dynamically whether there is a need to stall at each fence, stalling only when necessary. Our experiments with SPLASH-2 benchmarks show that, with C-Fences and a centralized active table, programs can be transformed to enforce SC while incurring only a 12% slowdown, as opposed to a 43% slowdown with normal fence instructions. Our approach requires very little hardware support (<350 bytes of on-chip storage) and avoids the use of speculation and its associated costs. Furthermore, to ameliorate the contention in the centralized active table that arises as the number of processors increases, we also design a distributed active table, which further improves the performance of C-Fence for larger numbers of processors.
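To make the role of an ordered memory access pair concrete, the following is a minimal sketch that is not taken from the article. It uses the well-known store-buffering litmus test and a toy model in which a "relaxed" machine may execute each thread's two accesses in either order. The names `T0`, `T1`, and `outcomes` are illustrative assumptions. The sketch enumerates every interleaving and reports which register outcomes are reachable with and without program order preserved.

```python
from itertools import permutations

# Toy store-buffering litmus test (illustrative, not from the article).
# Thread T0: x = 1; r0 = y        Thread T1: y = 1; r1 = x
# Each operation is (thread, kind, variable).
T0 = [("T0", "store", "x"), ("T0", "load", "y")]
T1 = [("T1", "store", "y"), ("T1", "load", "x")]


def outcomes(preserve_program_order):
    """Return the set of reachable (r0, r1) pairs.

    If preserve_program_order is True, each thread's store must execute
    before its load, which is what a fence between the two accesses
    would enforce. If False, the pair may be reordered (assumed toy
    model of a relaxed machine).
    """
    results = set()
    for order in permutations(T0 + T1):
        if preserve_program_order:
            if order.index(T0[0]) > order.index(T0[1]):
                continue
            if order.index(T1[0]) > order.index(T1[1]):
                continue
        memory = {"x": 0, "y": 0}
        registers = {}
        for thread, kind, var in order:
            if kind == "store":
                memory[var] = 1
            else:
                registers[thread] = memory[var]
        results.add((registers["T0"], registers["T1"]))
    return results


sc_outcomes = outcomes(preserve_program_order=True)
relaxed_outcomes = outcomes(preserve_program_order=False)
print("ordered pairs (SC):", sorted(sc_outcomes))
print("reorderable pairs: ", sorted(relaxed_outcomes))
```

In this toy model the outcome `(0, 0)` appears only when the store and load pairs may be reordered, which is the kind of SC violation a fence between the two accesses rules out. Many interleavings produce an SC-legal outcome even without the ordering constraint, which loosely mirrors the abstract's observation that most executed fence instances are not needed.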

Author information

Authors and Affiliations

CSE Department, University of California at Riverside, Riverside, CA, 92521, USA
Changhui Lin & Rajiv Gupta
School of Informatics, University of Edinburgh, Edinburgh, UK
Vijay Nagarajan

Corresponding author

Correspondence to Changhui Lin.

About this article

Cite this article

Lin, C., Nagarajan, V. & Gupta, R. Efficient Sequential Consistency Using Conditional Fences. Int J Parallel Prog 40, 84–117 (2012). https://doi.org/10.1007/s10766-011-0176-3

Received: 31 January 2011
Accepted: 10 June 2011
Published: 28 June 2011
Issue Date: February 2012
DOI: https://doi.org/10.1007/s10766-011-0176-3
