International Journal of Parallel Programming, Volume 40, Issue 1, pp 84–117

Efficient Sequential Consistency Using Conditional Fences

  • Changhui Lin
  • Vijay Nagarajan
  • Rajiv Gupta

Abstract

Among the various memory consistency models, the sequential consistency (SC) model is the most intuitive and best enables programmers to reason about their parallel programs. Nevertheless, processor designers often choose to support relaxed memory consistency models because the weaker ordering constraints imposed by such models allow more instructions to be reordered and enable higher performance. Programs running on machines supporting weaker consistency models can be transformed into ones in which SC is enforced. The compiler does this by computing a minimal set of memory access pairs whose ordering automatically guarantees SC. To ensure that these memory access pairs are not reordered, memory fences are inserted. Unfortunately, inserting such memory fences can significantly slow down the program. We observe that the ordering of the minimal set of memory accesses that the compiler strives to enforce is typically already enforced in the normal course of program execution. A study we conducted on programs with compiler-inserted memory fences shows that only 8% of the executed instances of the memory fences are really necessary to ensure SC. Motivated by this study, we propose the conditional fence mechanism, known as C-Fence, which utilizes compiler information to decide dynamically whether there is a need to stall at each fence, stalling only when necessary. Our experiments with SPLASH-2 benchmarks show that, with C-Fences and a centralized active table, programs can be transformed to enforce SC while incurring only a 12% slowdown, as opposed to a 43% slowdown using normal fence instructions. Our approach requires very little hardware support (<350 bytes of on-chip storage) and avoids the use of speculation and its associated costs.
Furthermore, to ameliorate the contention in the centralized active table arising from the increasing number of processors, we also design a distributed active table, which further improves the performance of C-Fence for a larger number of processors.
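
The idea of stalling at a fence only when necessary can be illustrated with a small software model. The sketch below is not the hardware design from the article; it is a minimal illustration under stated assumptions: the compiler supplies, for each fence, a set of "associate" fences on other processors, and a fence has to stall only while one of its associates is currently active. The names `ActiveTable`, `conditional_fence`, `ASSOCIATES`, and the fence identifiers `F1` and `F2` are hypothetical.

```python
from collections import Counter


class ActiveTable:
    """Hypothetical centralized table recording which fences are in flight."""

    def __init__(self):
        self._active = Counter()  # fence id -> number of in-flight instances

    def activate(self, fence_id):
        self._active[fence_id] += 1

    def deactivate(self, fence_id):
        if self._active[fence_id] > 0:
            self._active[fence_id] -= 1

    def is_active(self, fence_id):
        return self._active[fence_id] > 0


def conditional_fence(table, fence_id, associates):
    """Return True if the fence must stall, False if it may proceed.

    Assumption: a fence needs to stall only while one of its
    compiler-identified associates is active on another processor.
    """
    stall = any(table.is_active(a) for a in associates)
    if not stall:
        table.activate(fence_id)  # proceed, but stay visible to associates
    return stall


# Compiler-provided associate sets (hypothetical fence ids).
ASSOCIATES = {"F1": {"F2"}, "F2": {"F1"}}

table = ActiveTable()
first = conditional_fence(table, "F1", ASSOCIATES["F1"])   # nothing active yet
second = conditional_fence(table, "F2", ASSOCIATES["F2"])  # F1 is still active
table.deactivate("F1")  # model: the accesses before F1 have completed
third = conditional_fence(table, "F2", ASSOCIATES["F2"])   # no associate active
print(first, second, third)  # prints: False True False
```

In this model only the second fence instance stalls, which mirrors the observation that most executed fence instances do not need to stall. A distributed variant would partition the table's entries across several structures to reduce contention; how the article does this is not described in the abstract.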

Keywords

Memory consistency; Sequential consistency; Interprocessor delay; Associates; Conditional fences; Active table

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. CSE Department, University of California at Riverside, Riverside, USA
  2. School of Informatics, University of Edinburgh, Edinburgh, UK
