Queue-Based and Adaptive Lock Algorithms for Scalable Resource Allocation on Shared-Memory Multiprocessors

Zhang, Deli; Lynch, Brendan; Dechev, Damian

doi:10.1007/s10766-014-0317-6

Published: 15 August 2014

International Journal of Parallel Programming, volume 43, pages 721–751 (2015)

Abstract

We present a scalable lock algorithm and an adaptive scheme for shared-memory multiprocessors that address the resource allocation problem, also known as the \(h\)-out-of-\(k\) mutual exclusion problem. In this problem, threads compete for \(k\) shared resources, and a thread may request an arbitrary number \(1\le h\le k\) of resources at the same time. The challenge is for each thread to acquire exclusive access to the desired resources while preventing deadlock and starvation. Many existing approaches solve this problem in a distributed system, but the explicit message-passing paradigm they adopt is not optimal for shared memory. Other applicable methods, such as two-phase locking and resource hierarchy, suffer from performance degradation under heavy contention and lack a desirable fairness guarantee. This work describes the first multi-resource lock algorithm that guarantees the strongest first-in, first-out fairness. Our methodology is based on a non-blocking queue in which competing threads spin on previous conflicting resource requests. In our experimental evaluation, we compared the overhead and scalability of our lock with the best available alternative approaches using a micro-benchmark. As contention increases, our multi-resource lock obtains an average eight-fold speed-up over the alternatives, including GNU C++’s lock method, Boost’s lock function, and Intel TBB’s queue mutex. To further improve performance at low levels of contention, we introduce an adaptive scheme that is composed of two different lock algorithms and alternates between them depending on the level of contention. Our experimental results show that, when system contention is not known a priori, the composite adaptive scheme achieves better overall performance than either lock used alone.
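The idea of waiting only on earlier conflicting requests can be illustrated with a small C++ sketch. This is not the algorithm from the article: the article uses a non-blocking queue on which threads spin, whereas the sketch below swaps in a mutex, a condition variable, and a `std::list` to keep the example short. The names `FifoMultiResourceLock`, `Request`, `kResources`, and `contended_increment_demo` are hypothetical and exist only for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <list>
#include <mutex>
#include <thread>
#include <vector>

constexpr std::size_t kResources = 64;    // assumed upper bound on k
using Request = std::bitset<kResources>;  // bit i set = resource i is requested

// Simplified, blocking illustration of FIFO multi-resource locking.
// Assumption: a mutex-protected list stands in for the non-blocking queue.
class FifoMultiResourceLock {
public:
    using Handle = std::list<Request>::iterator;

    Handle lock(const Request& wanted) {
        assert(wanted.any());
        std::unique_lock<std::mutex> guard(mutex_);
        // Requests are appended in arrival order, which gives FIFO fairness.
        Handle mine = pending_.insert(pending_.end(), wanted);
        // Proceed once no EARLIER request overlaps with this one.
        changed_.wait(guard, [&] { return !conflicts_with_earlier(mine); });
        return mine;
    }

    void unlock(Handle mine) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            pending_.erase(mine);
        }
        changed_.notify_all();  // later requests re-check their predecessors
    }

private:
    bool conflicts_with_earlier(std::list<Request>::const_iterator mine) const {
        for (auto it = pending_.cbegin(); it != mine; ++it)
            if ((*it & *mine).any()) return true;
        return false;
    }

    std::mutex mutex_;
    std::condition_variable changed_;
    std::list<Request> pending_;
};

// Demo: every thread requests resources 0 and 1, so all requests conflict
// and the plain counter is only ever touched by one thread at a time.
long contended_increment_demo(int threads, int iterations) {
    FifoMultiResourceLock lock;
    long counter = 0;
    Request wanted;
    wanted.set(0).set(1);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                auto handle = lock.lock(wanted);
                ++counter;
                lock.unlock(handle);
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

Two requests with disjoint bitsets never wait for each other in this sketch, while overlapping requests are served in arrival order. A faithful implementation would replace the mutex and list with a non-blocking queue and local spinning, as the abstract describes.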

Notes

A bitset is a data structure that contains an array of bits.
Also known as compare_exchange
Note that ABA is not an acronym. It refers to situations where a thread reads value A at some address and later attempts a CAS operation expecting value A. However, between the read and the CAS, another thread has changed the value from A to B and back to A, so the CAS operation succeeds when it should not.
It would be less considering memory reserved for the kernel.
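The ABA situation described in the notes can be reproduced deterministically in a single thread. The sketch below is only an illustration under stated assumptions: the interleaving of two threads is simulated by sequential stores, and the version-counter packing in the hypothetical helper `pack` is one common mitigation, not necessarily the one used in the article. `aba_demo` is likewise a hypothetical name.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <utility>

// Pack a 32-bit value and a 32-bit version counter into one CAS-able word.
constexpr std::uint64_t pack(std::uint32_t value, std::uint32_t version) {
    return (static_cast<std::uint64_t>(version) << 32) | value;
}

// Returns {plain CAS succeeded, versioned CAS succeeded}.
std::pair<bool, bool> aba_demo() {
    constexpr std::uint32_t A = 1, B = 2;

    // Plain word: the A -> B -> A change is invisible to the CAS.
    std::atomic<std::uint32_t> plain{A};
    std::uint32_t expected_plain = plain.load();  // "thread 1" reads A
    plain.store(B);                               // "thread 2" writes B
    plain.store(A);                               // "thread 2" restores A
    assert(plain.load() == A);
    bool plain_ok = plain.compare_exchange_strong(expected_plain, 3);

    // Versioned word: every write bumps the version, so the stale CAS fails.
    std::atomic<std::uint64_t> tagged{pack(A, 0)};
    std::uint64_t expected_tagged = tagged.load();  // reads (A, version 0)
    tagged.store(pack(B, 1));
    tagged.store(pack(A, 2));
    bool tagged_ok =
        tagged.compare_exchange_strong(expected_tagged, pack(3, 3));

    return {plain_ok, tagged_ok};
}
```

The plain CAS succeeds even though the word was modified twice in between, while the versioned CAS fails because the version counter no longer matches.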

References

1. Anderson, J.H., Kim, Y.J., Herman, T.: Shared-memory mutual exclusion: major research trends since 1986. Distrib. Comput. 16(2), 75–110 (2003)
2. Anderson, T.E.: The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel Distrib. Syst. 1(1), 6–16 (1990)
3. Awerbuch, B., Saks, M.: A dining philosophers algorithm with polynomial response time. In: Proceedings of the 31st Annual IEEE Symposium on Foundations of Computer Science, pp. 65–74 (1990)
4. Bar-Ilan, J., Peleg, D.: Distributed resource allocation algorithms. In: Segall, A., Zaks, S. (eds.) Distributed Algorithms. Lecture Notes in Computer Science, vol. 647, pp. 277–291. Springer Berlin Heidelberg (1992). doi:10.1007/3-540-56188-9_19
5. Bernstein, P., Goodman, N.: Timestamp based algorithms for concurrency control in distributed database systems. In: Proceedings of the 6th International Conference on Very Large Data Bases (1980)
6. Boehm, H.-J., Adve, S.V.: Foundations of the C++ concurrency memory model. In: ACM SIGPLAN Notices, vol. 43, pp. 68–78. ACM (2008)
7. Borkar, S.: Thousand core chips: a technology perspective. In: Proceedings of the 44th Annual Design Automation Conference, pp. 746–749. ACM (2007)
8. Craig, T.: Building FIFO and priority-queuing spin locks from atomic swap. Technical report, Citeseer (1994)
9. Damron, P., Fedorova, A., Lev, Y., Luchangco, V., Moir, M., Nussbaum, D.: Hybrid transactional memory. In: ACM SIGPLAN Notices, vol. 41, pp. 336–346. ACM (2006)
10. Datta, A.K., Devismes, S., Horn, F.: Self-stabilizing k-out-of-h exclusion in tree networks. Int. J. Found. Comput. Sci. 22(03), 657–677 (2011)
11. Dechev, D., Pirkelbauer, P., Stroustrup, B.: Lock-free dynamically resizable arrays. In: Principles of Distributed Systems, pp. 142–156. Springer (2006)
12. Dice, D., Marathe, V.J., Shavit, N.: Flat-combining NUMA locks. In: Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures, pp. 65–74. ACM (2011)
13. Dijkstra, E.W.: Hierarchical ordering of sequential processes. Acta Inform. 1(2), 115–138 (1971)
14. Eswaran, K.P., Gray, J.N., Lorie, R.A., Traiger, I.L.: The notions of consistency and predicate locks in a database system. Commun. ACM 19(11), 624–633 (1976)
15. Fischer, M.J., Lynch, N.A., Burns, J.E., Borodin, A.: Distributed FIFO allocation of identical resources using small shared space. ACM Trans. Program. Lang. Syst. 11(1), 90–114 (1989)
16. Fischer, M.J., Lynch, N.A., Burns, J.E., Borodin, A.: Resource allocation with immunity to limited process failure. In: 20th Annual IEEE Symposium on Foundations of Computer Science, pp. 234–254 (1979)
17. Fraser, K., Harris, T.: Concurrent programming without locks. ACM Trans. Comput. Syst. 25(2), 5 (2007)
18. Harris, T.L., Fraser, K., Pratt, I.A.: A practical multi-word compare-and-swap operation. In: Malkhi, D. (ed.) Distributed Computing. Lecture Notes in Computer Science, vol. 2508, pp. 265–279. Springer Berlin Heidelberg (2002). doi:10.1007/3-540-36108-1_18
19. Herlihy, M.: A methodology for implementing highly concurrent data objects. ACM Trans. Program. Lang. Syst. 15(5), 745–770 (1993)
20. Herlihy, M.: Wait-free synchronization. ACM Trans. Program. Lang. Syst. 13(1), 124–149 (1991)
21. Herlihy, M., Moss, J.E.B.: Transactional memory: architectural support for lock-free data structures. SIGARCH Comput. Archit. News 21(2), 289–300 (1993)
22. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming, Revised Reprint. Elsevier (2012)
23. Johnson, R., Pandis, I., Hardavellas, N., Ailamaki, A., Falsafi, B.: Shore-MT: a scalable storage manager for the multicore era. In: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, pp. 24–35. ACM (2009)
24. Karlsson, B.: Beyond the C++ Standard Library: An Introduction to Boost. Pearson Education, Upper Saddle River (2005)
25. Kogan, A., Petrank, E.: A methodology for creating fast wait-free data structures. In: ACM SIGPLAN Notices, vol. 47, pp. 141–150. ACM (2012)
26. Lomont, C.: Introduction to Intel advanced vector extensions. Technical report, Intel White Paper (2011)
27. Lynch, N.A.: Fast allocation of nearby resources in a distributed system. In: Proceedings of the Twelfth Annual ACM Symposium on Theory of Computing, pp. 70–81. ACM (1980)
28. Marathe, V.J., Moir, M.: Toward high performance nonblocking software transactional memory. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 227–236. ACM (2008)
29. Matveev, A., Shavit, N.: Reduced hardware transactions: a new approach to hybrid transactional memory. In: Proceedings of the 25th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 11–22. ACM (2013)
30. Mellor-Crummey, J.M., Scott, M.L.: Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst. 9(1), 21–65 (1991)
31. Michael, M.M., Scott, M.L.: Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing, pp. 267–275. ACM (1996)
32. Raynal, M.: A distributed solution to the k-out of-m resources allocation problem. In: Dehne, F., Fiala, F., Koczkodaj, W.W. (eds.) Advances in Computing and Information—ICCI’91. Lecture Notes in Computer Science, vol. 497, pp. 599–609. Springer Berlin Heidelberg (1991). doi:10.1007/3-540-54029-6_209
33. Raynal, M., Beeson, D.: Algorithms for Mutual Exclusion. MIT Press, Cambridge (1986)
34. Reddy, V.A., Mittal, P., Gupta, I.: Fair k mutual exclusion algorithm for peer to peer systems. In: The 28th International Conference on Distributed Computing Systems, ICDCS’08, pp. 655–662. IEEE (2008)
35. Rudolph, L., Segall, Z.: Dynamic decentralized cache schemes for MIMD parallel processors. In: Proceedings of the 11th Annual International Symposium on Computer Architecture, ISCA ’84, pp. 340–347. ACM (1984)
36. Scott, M.L., Scherer, W.N.: Scalable queue-based spin locks with timeout. In: Proceedings of the Eighth ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, PPoPP ’01, pp. 44–52. ACM (2001)
37. Shavit, N., Touitou, D.: Software transactional memory. Distrib. Comput. 10(2), 99–116 (1997)
38. Willhalm, T., Popovici, N.: Putting Intel threading building blocks to work. In: Proceedings of the 1st International Workshop on Multicore Software Engineering, pp. 3–4. ACM (2008)
39. Yoo, R.M., Hughes, C.J., Lai, K., Rajwar, R.: Performance evaluation of Intel transactional synchronization extensions for high-performance computing. In: Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, p. 19. ACM (2013)

Acknowledgments

This material is based upon work supported by the National Science Foundation under CCF Award No. 1218100. The authors would also like to thank Dimitry Vyukov for providing insightful implementation tips on the non-blocking queue.

Author information

Authors and Affiliations

University of Central Florida, Orlando, FL, USA
Deli Zhang, Brendan Lynch & Damian Dechev

Corresponding author

Correspondence to Deli Zhang.

Appendix: Hardware Transactional Memory with MRLock as Fallback

1.1 Outline

Intel’s Transactional Synchronization Extensions (TSX) offer support for best-effort hardware transactional memory (HTM). With Intel TSX, the processor executes transactions optimistically in hardware without explicit serialization. Under low contention, the majority of transactional regions can commit to memory successfully and efficiently using HTM. As contention increases, transactional aborts also increase because of detected data conflicts, and performance degrades visibly when transactions abort repeatedly. It is therefore common practice to limit the number of aborts allowed per transaction; transactions that reach this limit are re-routed to a software execution path. The Intel 64 and IA-32 Architectures Optimization Reference Manual recommends a wrapper for lock elision using TSX in which transactions that fail to elide the lock must acquire a global lock to commit to memory. While a global lock offers a solution, it does not scale, because global serialization of critical sections limits concurrency.
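The bounded-retry pattern described above can be sketched independently of the hardware. In the C++ sketch below, the hardware attempt is an assumption represented by a caller-supplied callable, so the example runs on machines without TSX; on TSX hardware that callable would typically wrap the RTM intrinsics `_xbegin`, `_xend`, and `_xabort` from `<immintrin.h>`. The names `Path` and `run_transaction` are hypothetical.

```cpp
#include <cassert>
#include <functional>

enum class Path { Hardware, Software };

// Try the hardware path up to max_attempts times, then fall back to software.
// try_hardware: returns true if the transaction committed in hardware.
// run_software: executes the same critical section under a software lock.
Path run_transaction(const std::function<bool()>& try_hardware,
                     const std::function<void()>& run_software,
                     int max_attempts) {
    assert(max_attempts >= 0);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (try_hardware()) return Path::Hardware;  // committed, done
        // Aborted: loop around and retry until the limit is reached.
    }
    run_software();  // abort limit reached, take the fallback path
    return Path::Software;
}
```

With a global lock as the fallback, `run_software` serializes every transaction that reaches it, which is the scalability problem noted above.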

In order to apply MRLock as the software fallback path, we define a resource to be a shared memory location that is part of a transaction’s read- and/or write-set, and each transactional resource is dynamically mapped to a bit in a bitset. We use an unordered mapping scheme where each element of the unordered map stores a memory address as the key and the assigned bit position as the mapped value. Transactions are initially executed using TSX. Once a transaction has reached the maximum allowed attempts to commit using HTM, it will be executed by our multi-resource lock manager. It is essential that transactions executing in hardware and software safely interact with each other in order to protect data integrity. We guarantee correct behavior by efficiently traversing the MRLock queue prior to every transaction attempted using HTM. During the queue traversal, we check for any conflicts between the enqueued requests and the bitset request of the pending transaction. If no conflicts exist, the transaction may safely proceed with the execution in hardware. Otherwise, the transaction is explicitly aborted and will wait for all conflicting requests to commit to memory before a retry.
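The address-to-bit mapping and the conflict check performed before each hardware attempt can be sketched as follows. This is a minimal single-threaded illustration under stated assumptions: the queue is modeled as a plain `std::vector` of bitsets rather than the MRLock queue, the mapper is not thread-safe, and the names `AddressMapper`, `ResourceSet`, `kMaxResources`, and `conflicts_with_queue` are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

constexpr std::size_t kMaxResources = 64;  // assumed bitset width
using ResourceSet = std::bitset<kMaxResources>;

// Dynamically assigns one bit position to each distinct memory address.
// Assumption: not thread-safe; a concurrent map would be needed in practice.
class AddressMapper {
public:
    std::size_t bit_for(const void* addr) {
        auto it = map_.find(addr);
        if (it != map_.end()) return it->second;  // already mapped
        assert(map_.size() < kMaxResources);
        std::size_t bit = map_.size();            // next free bit position
        map_.emplace(addr, bit);
        return bit;
    }

    // Builds the bitset request for a transaction's read/write set.
    ResourceSet request_for(const std::vector<const void*>& addrs) {
        ResourceSet request;
        for (const void* addr : addrs) request.set(bit_for(addr));
        return request;
    }

private:
    std::unordered_map<const void*, std::size_t> map_;  // address -> bit
};

// True if any enqueued software request shares a resource with `pending`.
bool conflicts_with_queue(const std::vector<ResourceSet>& enqueued,
                          const ResourceSet& pending) {
    for (const ResourceSet& request : enqueued)
        if ((request & pending).any()) return true;
    return false;
}
```

A hardware transaction whose request does not conflict with any enqueued request may proceed; otherwise it aborts explicitly and retries after the conflicting requests have committed.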

1.2 Evaluation

Performance experiments are performed on a Fourth Generation Intel Core™ processor with Intel TSX support. We evaluate our approach on a micro-benchmark consisting of read/write operations on an array. A similar benchmark was used by Matveev and Shavit [29], in which transactions read and write random locations of a shared array. The transactional regions within the micro-benchmark are short, which is representative of the intended usage of Intel TSX. Intel TSX provides sufficient resources to commit a common transactional region to memory; however, a transaction that exceeds the capacity limitations will experience frequent aborts. Each thread randomly selects \(h\) unique array positions to increment out of \(k\) total array positions. The resource contention is denoted by the fraction \(h/k\). The array increments are performed within a loop whose iteration count is set to 10,000. We vary the total number of resources in the range \(4\le k\le 64\) and the number of requested resources in the range \(4\le h\le k\). We compare our approach against Rochester Software Transactional Memory (RSTM) [28] and a Hybrid HTM-STM [9].
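The workload of a single benchmark thread might look roughly like the C++ sketch below. Several details are assumptions, since the text does not specify them: positions are re-drawn on every iteration, the generator is `std::mt19937`, and all synchronization (hardware transaction or lock) is omitted so that only the selection and increment logic is shown. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Draw h distinct positions from {0, ..., k-1} by shuffling and truncating.
std::vector<std::size_t> pick_unique(std::size_t h, std::size_t k,
                                     std::mt19937& rng) {
    assert(h >= 1 && h <= k);
    std::vector<std::size_t> positions(k);
    std::iota(positions.begin(), positions.end(), std::size_t{0});
    std::shuffle(positions.begin(), positions.end(), rng);
    positions.resize(h);
    return positions;
}

// Resource contention as defined in the text: the fraction h/k.
double contention_ratio(std::size_t h, std::size_t k) {
    return static_cast<double>(h) / static_cast<double>(k);
}

// One thread's loop: each iteration increments h distinct array positions.
// Assumption: the surrounding transaction or lock is left out of the sketch.
void run_iterations(std::vector<long>& array, std::size_t h,
                    std::size_t iterations, std::mt19937& rng) {
    for (std::size_t i = 0; i < iterations; ++i)
        for (std::size_t pos : pick_unique(h, array.size(), rng))
            ++array[pos];
}
```

For example, with \(k = 16\) and \(h = 4\) the contention ratio is 0.25, and 10,000 iterations add exactly 40,000 increments across the array.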

Figure 7 shows performance scaling with increasing resource contention at two, four, and eight threads. The y-axis represents the execution time on a logarithmic scale. The x-axis represents the resource contention ratios, where each labeled tick mark indicates the total number of available resources. The data plotted in the region to the left of a tick mark show the results for resource requests varied over \(4\le h\le k\), where the total number of available resources, \(k\), is indicated by the tick mark on the right of that region. We test the proposed methodology using both std::unordered_map and Intel TBB concurrent_hash_map for the dynamic hashing scheme. The Hybrid HTM-STM follows a trend similar to that of RSTM, but consistently under-performs its pure software transactional memory counterpart. This result is expected because the Hybrid HTM-STM must check the ownership record table prior to every read or write performed in a hardware transaction. The hardware transactions performed in the hybrid approach explicitly abort themselves if a conflict is detected with an ongoing software transaction, yielding a bias towards transactions committing in software rather than hardware. RSTM and the Hybrid HTM-STM both show execution times that increase following a logarithmic trend as the resource contention ratio increases.

As the number of threads increases, the contention for reads and writes on the array also increases. Figure 7a shows that, at two threads, our approach has slower execution times than both RSTM and Hybrid HTM-STM until the resource contention ratio reaches approximately 1/8. Once the resource contention ratio increases beyond 1/8, we maintain a faster execution time than both RSTM and Hybrid HTM-STM. Figure 7b shows the results when scaling the thread count up to four threads. Our approach yields slower execution times than RSTM and Hybrid HTM-STM at the 4/k configuration, but maintains faster execution times for all other configurations. At eight threads, Fig. 7c demonstrates that we outperform RSTM and Hybrid HTM-STM at all configurations.

About this article

Cite this article

Zhang, D., Lynch, B. & Dechev, D. Queue-Based and Adaptive Lock Algorithms for Scalable Resource Allocation on Shared-Memory Multiprocessors. Int J Parallel Prog 43, 721–751 (2015). https://doi.org/10.1007/s10766-014-0317-6

Received: 03 February 2014
Accepted: 30 July 2014
Published: 15 August 2014
Issue Date: October 2015
DOI: https://doi.org/10.1007/s10766-014-0317-6
