
1 Introduction

In recent decades, computer processors have undergone significant transformations. Initially, performance gains were driven by increasing the clock frequency and transistor density of single-core processors. However, energy consumption and heat dissipation rendered this trend unsustainable. As a solution, the industry shifted toward multicore processors, in line with Moore’s Law [4], which predicts a continued doubling of transistor density. Today, multicore processors power personal computers, smartphones, gaming consoles, servers, and supercomputers, allowing more tasks to be processed in parallel.

Multicore processors, however, do not improve software performance linearly. On the contrary, unlocking the parallel potential of these cores is a significant emerging challenge [1]. Multicore environments require software to distribute tasks effectively across cores, which involves complex synchronization and coordination. Without proper task allocation and data access management, errors such as data races and deadlocks can occur during execution, errors that do not arise in sequentially executed programs [6]. Moreover, because of the interdependencies among cores in multicore processors, parallel solutions to a single problem must be carefully designed to avoid performance bottlenecks and wasted resources.

Concurrent programming is essential because it can unlock the full capabilities of multicore processors, enabling software to handle complex tasks and large volumes of data [9]. At the same time, concurrency challenges system performance and stability. Over years of development, a highly concurrent system may evolve into an exceedingly complex structure, often accompanied by intricate data flows and performance constraints among system components. While locking mechanisms can ensure the safety of concurrent operations, improperly used locks may themselves become performance bottlenecks. Over time, these issues, if not carefully managed, can lead to performance degradation, affecting response time and reliability.

Locks serve as a fundamental mechanism for synchronizing multiple execution threads’ safe access to shared resources. Although locks are widely used in multicore embedded systems to ensure mutual exclusion, their usage also presents challenges. As concurrent systems become increasingly complex, proper lock management strategies are crucial for maintaining system performance [7]. Researchers have studied and developed various lock implementations to optimize performance, fairness, and predictability for different scenarios. In practice, however, the choice of lock type for a specific scenario is often based on developers’ assumptions, which may not match actual requirements. An ill-suited lock type can degrade performance and result in unfair resource access.

Research has found that no single lock implementation performs best in all scenarios [13], and locks that perform well in one context may exhibit severe performance issues in another. These findings underscore the necessity of lock performance analysis, such as the ability to monitor lock contention intensity in real time and adjust lock strategies accordingly [13]. However, real-time monitoring of lock contention within a process incurs a performance overhead, and such adjustments may be reactive rather than proactive.

Therefore, we do not seek a "one-size-fits-all" lock; the cost is simply prohibitive. Our goal is a straightforward approach for fine-tuning locks in a simple manner, enabling them to unlock the performance potential of specific applications. With this in mind, lock overhead must be kept to a minimum, and the lock’s CPU utilization strategy must be managed autonomously to overcome current performance bottlenecks. To this end, we propose a method for optimizing spin-lock parameters. Based on an M/G/1 queue model [5] analysis of lock overhead, the method identifies the optimal spin backoff parameter \(k\) in high-pressure scenarios to balance the costs associated with lock acquisition. With this method, applications in high-concurrency scenarios can perform better.

With this theory, we adjust the spinning behavior of locks within the system, flexibly choosing the spin threshold for different scenarios. This allows us to find the best performance balance while maintaining system stability. The optimization technique has been validated in open-source Apache RocketMQ as well as in the commercial RocketMQ instances offered by Alibaba Cloud, yielding a significant 37.58% improvement in message sending performance on a standard x86 CPU.

Furthermore, to validate the universality of this optimization method, we deployed RocketMQ on servers equipped with Alibaba Cloud’s newly developed ARM-architecture CPU and applied the optimization strategy described in this study. The optimized system achieved an additional 32.82% performance improvement, showing that our method’s effectiveness is not limited to a specific hardware architecture. We also tested the strategy when deploying older code versions on different CPUs and when employing various data persistence strategies; every scenario showed a performance improvement. These results confirm that our method can improve the overall performance of high-concurrency systems dealing with chaotic data flows and inter-component performance constraints.

2 Preliminaries

2.1 Apache RocketMQ

Apache RocketMQ [2] is a cloud-native distributed messaging and streaming platform designed for real-time data processing that spans collaboration scenarios across cloud, edge, and devices. Initially created by Alibaba Group, it was donated to the Apache Software Foundation in 2016 and has since become a prominent top-level project within the foundation. RocketMQ excels in handling diverse message queue models, including publish/subscribe, point-to-point messaging, delayed message delivery, and message sequencing. These capabilities fulfill the stringent requirements of applications that demand high scalability, reliability, and throughput.

RocketMQ’s adoption across various industries such as finance, e-commerce, Internet of Things (IoT), and big data analysis is a testament to its adaptability and capability to meet complex messaging challenges. The architecture of RocketMQ is centered around clusters of Brokers and NameServers. The Brokers manage message storage, lifecycle, and distribution, whereas NameServers play an essential role in service discovery and message routing with their lightweight design.

Additionally, RocketMQ supports transactional messaging and provides client libraries for popular programming languages such as Java, C++, and Go, making it easier for developers to build and scale high-performance distributed applications.

Known for its impressive performance, RocketMQ can process several hundred thousand messages per second without compromising stability or reliability. The system’s design inherently favors distributed deployment, which allows for effortless scalability to meet growing business needs. The platform also includes an extensive set of monitoring metrics and tools, simplifying management and operational tasks.

Our team, the original developers of Apache RocketMQ, has dedicated considerable effort to advancing its performance, focusing on maximizing throughput per machine. This commitment to performance enhancement is a driving force behind the research presented in this paper, situating our work within the broader context of innovation in distributed messaging systems.

2.2 Spin-Lock

For decades, spin-locks have been a key subject of study in concurrent programming, lauded for their role in regulating access to shared resources in multicore systems. Recognized for their minimal overhead in low-contention scenarios, spin-locks shine where locking time is expected to be short, allowing threads to simply ‘spin’ until a resource becomes available [3].

The simplest and perhaps most rudimentary spin-lock form is the test-and-set (TAS) lock. This implementation employs a brute-force approach, spinning aggressively using atomic operations to gain exclusive access to a resource. Although this simplicity is compelling, it’s also its Achilles’ heel: atomic operations can aggressively drain CPU execution cycles and negatively impact shared resources like the system’s bus and memory, thus hampering the performance of the spinning thread and reducing overall system throughput, more so when multiple threads contend for the same lock [3].
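For concreteness, a TAS lock can be sketched in a few lines of Java atop an atomic flag. This is a minimal illustrative sketch, not RocketMQ’s implementation; the class name is ours:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal test-and-set (TAS) spin-lock sketch. Every loop iteration issues an
// atomic read-modify-write, which is what floods the bus under contention.
final class TasLock {
    private final AtomicBoolean held = new AtomicBoolean(false);

    void lock() {
        // getAndSet atomically swaps in "true"; spin until we observe "was free".
        while (held.getAndSet(true)) {
            // busy-wait
        }
    }

    void unlock() {
        held.set(false);
    }
}
```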

To address the performance challenges associated with TAS, researchers proposed the test-and-test-and-set (TTAS) lock, an iteration aimed at reducing the heavy use of atomic operations [10]. The TTAS lock ameliorates spinning behavior by adding a preliminary, non-atomic check that the lock appears free before invoking the atomic operation. This strategy avoids unnecessary bus traffic and cache-coherence invalidations while the lock is known to be held. Nonetheless, TTAS still struggles under high contention, which is exacerbated in cache-coherent environments where threads vying for a lock trigger cacheline invalidations through coherence traffic, leading to additional delays and memory access overhead [13].
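The TTAS variant changes only the acquisition loop: it spins on a plain read and attempts the atomic swap only when the lock appears free (again, an illustrative sketch):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-test-and-set (TTAS) sketch. The inner read-only loop stays within
// the local cache line; the atomic swap is attempted only on an apparent release.
final class TtasLock {
    private final AtomicBoolean held = new AtomicBoolean(false);

    void lock() {
        while (true) {
            while (held.get()) {
                // read-only spin: no bus traffic while the lock is held
            }
            if (!held.getAndSet(true)) {
                return; // acquired
            }
            // lost the race to another thread; resume the read-only spin
        }
    }

    void unlock() {
        held.set(false);
    }
}
```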

To tackle the shortcomings of both TAS and TTAS in high-contention scenarios, we propose a spin-backoff strategy. Rather than spinning indiscriminately, contending threads back off, yielding the CPU, after a bounded number of failed attempts, as sketched below. This mitigates collision risk and, as a result, enhances system performance. The strategy has proven to enhance complex systems’ performance under high concurrency.
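A minimal Java sketch of the idea follows. The spin limit k is the parameter analyzed in Sect. 3, and the yield stands in for the model’s context-switch branch; class and method names are ours, and RocketMQ’s actual implementation differs:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Spin-backoff sketch: attempt the lock at most k times, then yield the CPU
// (a context switch, in the terms of the model in Sect. 3) and start over.
final class BackoffSpinLock {
    private final AtomicBoolean held = new AtomicBoolean(false);
    private final int k; // maximum spin attempts before backing off

    BackoffSpinLock(int k) { this.k = k; }

    void lock() {
        while (true) {
            for (int i = 0; i < k; i++) {
                if (!held.get() && !held.getAndSet(true)) {
                    return; // acquired within the spin phase
                }
            }
            Thread.yield(); // back off: give up the time slice, then retry
        }
    }

    void unlock() {
        held.set(false);
    }
}
```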

3 Modeling Spin-Lock Overheads

3.1 Fundamental Assumptions

To model the problem effectively, we consider an exponential distribution of lock contention probability together with an M/G/1 queue model [5]. The M/G/1 queueing model is a queueing-theory model in which arrivals follow a Poisson process (denoted by M), service times have a general distribution (denoted by G), and there is a single server (denoted by 1). It is used to analyze characteristics such as customer waiting times and queue lengths in single-server systems. Applying it to locks requires mapping lock behavior and system load onto the model’s terms. We consider several key variables in our model:

  • Arrival rate (\(\lambda \)): The average rate of lock requests per unit time.

  • Service rate (\(\mu \)): The average rate at which locks are released and successfully acquired by another thread per unit time.

  • System utilization ratio (\(\rho = \lambda / \mu \)): The ratio of the arrival rate to the service rate.

In the general M/G/1 queue model, the arrival process is Poisson and the service time follows a general distribution. When modeling the waiting time for a spin-lock, assuming that the lock holding time follows an exponential distribution means that each thread attempting to acquire the lock has an equal chance of success at any moment, regardless of how long it has already waited (the memoryless property). We therefore assume exponentially distributed lock holding times, which specializes the model to an M/M/1 queue, since the service time is also exponentially distributed.

One of the critical properties of the M/M/1 queue model is the average queue length (\(L_q\)), which is given by the formula:

$$\begin{aligned} L_q = \frac{\rho ^2}{1 - \rho } \end{aligned}$$

This formula shows that the average queue length is determined directly by the system utilization (\(\rho \)) and is independent of service time variance, owing to the nature of the exponential distribution. By Little’s law (\(L_q = \lambda W_q\)), the average waiting time in the queue \(W_q\) is then:

$$\begin{aligned} W_q = \frac{\rho ^2}{\lambda (1 - \rho )} \end{aligned}$$
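As a purely illustrative example (the numbers are ours, not measurements): with \(\lambda = 900\) lock requests per millisecond and \(\mu = 1000\) releases per millisecond, \(\rho = 0.9\), so

$$\begin{aligned} L_q = \frac{0.9^2}{1-0.9} = 8.1, \qquad W_q = \frac{0.9^2}{900 \cdot (1-0.9)} = 0.009\ \text {ms} \end{aligned}$$

that is, about eight threads wait for the lock on average, each for roughly 9 µs; Little’s law holds, since \(\lambda W_q = 900 \cdot 0.009 = 8.1 = L_q\).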

3.2 Modeling Process

To tailor our spin-lock model, we introduce additional parameters:

  • Spin time (\(T_s\)): The average time for one spin attempt, i.e., a single CAS (Compare-And-Swap) operation, typically on the order of nanoseconds.

  • Context switch time (\(T_c\)): The average time for one context switch, encompassing operations such as saving the current process state, loading another process’s state, and potential cache invalidation, typically on the order of microseconds.

  • Number of spins (\(k\)): The number of attempts a thread makes to acquire the lock before yielding.

Assuming a thread acquires the lock on the \(i\)th attempt (\(1 \le i \le k\)), the spin time spent is \(i \cdot T_s\). With each spin attempt independent, and the probability of acquiring the lock on any single attempt being \(1-\rho \), the probability of first success on the \(i\)th attempt is \(\rho ^{i-1} \cdot (1 - \rho )\).

We will obtain the total expected time \(E(T_{\text {total}})\) by combining the expected spin time with the expected cost of yielding. The expected spin time \(E(T_{\text {spin}})\) is the weighted sum over the spin attempts, where the weights are the probabilities of success on each attempt:

$$\begin{aligned} E(T_{\text {spin}}) = \sum _{i=1}^{k} (i \cdot T_s) \cdot \rho ^{i-1} \cdot (1-\rho ) \end{aligned}$$

Defining \(H(x) = \sum _{i=1}^{k} i \cdot x^{i-1}\), we can simplify the \(E(T_{\text {spin}})\) to:

$$\begin{aligned} E(T_{\text {spin}}) = T_s \cdot (1-\rho ) \cdot H(\rho ) \end{aligned}$$

To evaluate \(E(T_{\text {spin}})\) in closed form, we employ a geometric series and its derivative. Let \(G(x) = \sum _{i=0}^{k} x^i\) represent our geometric series, with the sum \(G(x) = \frac{1 - x^{k+1}}{1 - x}\). We note that \(H(x)\) is the derivative of \(G(x)\), resulting in:

$$\begin{aligned} H(x) = G'(x) = \frac{(1-x^k) - kx^{k}(1-x)}{(1-x)^2} \end{aligned}$$

By substituting \(x\) with \(\rho \), we obtain \(H(\rho )\), which is then used to compute \(E(T_{\text {spin}})\):

$$\begin{aligned} E(T_{\text {spin}}) = T_s \cdot (1-\rho ) \cdot H(\rho ) = T_s \cdot \frac{(1-\rho ^k) - k\rho ^{k}(1-\rho )}{1-\rho } \end{aligned}$$

As a quick check, setting \(k = 1\) reduces this to \(E(T_{\text {spin}}) = T_s \cdot (1-\rho )\), the cost of a single attempt weighted by its success probability \(1-\rho \), matching the direct computation.

If a thread fails to obtain the lock after \(k\) spins, it performs a context switch and subsequently waits in the queue. The expected time for this event is:

$$\begin{aligned} E(T_{\text {yield-total}}) = \rho ^k \cdot (k \cdot T_s + T_c + W_q) \end{aligned}$$

The total expected waiting time \(E(T_{\text {total}})\) is thus the sum of the expected spin time and the expected yield time:

$$\begin{aligned} E(T_{\text {total}}) = E(T_{\text {spin}}) + E(T_{\text {yield-total}}) \end{aligned}$$
$$\begin{aligned} E(T_{\text {total}}) = T_s \cdot \frac{(1-\rho ^k) - k\rho ^{k}(1-\rho )}{1-\rho } + \rho ^k \cdot \left( k \cdot T_s + T_c + \frac{\rho ^2}{\lambda (1 - \rho )} \right) \end{aligned}$$
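To make the trade-off concrete, the expression can be evaluated numerically. The Java sketch below scans \(k\) by decades and reports the model’s minimizer; all parameter values (\(T_s = 20\) ns, \(T_c = 5\) µs, \(\rho = 0.95\), \(\lambda = 10^6\)/s) are illustrative assumptions, not measured constants:

```java
// Numerical sketch: evaluate E(T_total) for candidate k and pick the minimum.
// All parameter values below are illustrative assumptions.
public final class SpinCostModel {
    public static void main(String[] args) {
        double ts = 20e-9;   // spin (CAS) cost: ~nanoseconds
        double tc = 5e-6;    // context-switch cost: ~microseconds
        double rho = 0.95;   // assumed utilization near saturation
        double lambda = 1e6; // assumed lock requests per second
        double wq = rho * rho / (lambda * (1 - rho)); // M/M/1 queueing delay

        int bestK = 1;
        double bestCost = Double.MAX_VALUE;
        for (int k = 1; k <= 1_000_000; k *= 10) {
            double spin = ts * (1 - Math.pow(rho, k) - k * Math.pow(rho, k) * (1 - rho)) / (1 - rho);
            double yield = Math.pow(rho, k) * (k * ts + tc + wq);
            double total = spin + yield;
            System.out.printf("k=%-8d E(T_total)=%.3e s%n", k, total);
            if (total < bestCost) { bestCost = total; bestK = k; }
        }
        System.out.println("model-optimal k = " + bestK);
    }
}
```

With these assumed values the scan bottoms out around \(k = 10^3\), which happens to echo the empirical optimum reported in Sect. 5; with different \(T_s\), \(T_c\), or \(\rho \) the minimizer moves, which is precisely why we tune \(k\) per deployment.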

3.3 Validation

To validate our model, we examine the behavior of the expected total time as the system utilization (\(\rho \)) approaches its limits. For a system under minimal load, where \(\rho \) approaches 0, the expected total time simplifies to:

$$\begin{aligned} \lim _{\rho \rightarrow 0} E(T_{\text {total}}) &= \lim _{\rho \rightarrow 0} T_s \cdot \frac{(1-\rho ^k) - k\rho ^{k}(1-\rho )}{1-\rho } \nonumber \\ &\quad + \lim _{\rho \rightarrow 0}\rho ^k \cdot \left( k \cdot T_s + T_c + \frac{\rho ^2}{\lambda (1 - \rho )} \right) \nonumber \\ &= T_s + 0 \\ &= T_s \end{aligned}$$

This implies that the spin cost is equivalent to \(T_s\), meaning that lock acquisition occurs immediately after a single compare-and-swap (CAS) operation.

Conversely, as the system load approaches its maximum capacity and \(\rho \) approaches 1, the expected spin time takes the indeterminate form \(0/0\) and requires the application of l’Hôpital’s rule:

$$\begin{aligned} \lim _{\rho \rightarrow 1}E(T_{\text {spin}}) = \lim _{\rho \rightarrow 1}T_s \cdot \frac{(1-\rho ^k) - k\rho ^{k}(1-\rho )}{1-\rho } \end{aligned}$$
$$\begin{aligned} \lim _{\rho \rightarrow 1}E(T_{\text {spin}}) = T_s \cdot \lim _{\rho \rightarrow 1}\frac{-k(k+1)\rho ^{k-1} + k(k+1)\rho ^{k}}{-1} = 0 \end{aligned}$$

This result indicates that, near saturation, the spin phase contributes nothing to the expected waiting time: the probability of acquiring the lock within \(k\) spins vanishes, so the cost shifts almost entirely to the yield path. Thus, the predominant component of the total expected time at high system utilization is:

$$\begin{aligned} \lim _{\rho \rightarrow 1}E(T_{\text {total}}) = \lim _{\rho \rightarrow 1}\rho ^k \cdot \left( k \cdot T_s + T_c + \frac{\rho ^2}{\lambda (1 - \rho )} \right) \end{aligned}$$

Hence, in scenarios where \(\rho \) is near 1, the time cost comprises k spins, a context switch, and the waiting time in the queue.

In summary, when \(\rho \) is low, increasing \(k\) rapidly reduces the probability of a thread spinning \(k\) times without acquiring the lock, minimizing context switch overhead. Conversely, when \(\rho \) is high, the main contribution to waiting time is \(E(T_{\text {yield-total}})\), which necessitates careful control of \(k\). A low \(k\) leads to more frequent context switching, while a high \(k\) results in extended spin times, reducing lock throughput. This highlights the need for a spinning strategy that balances \(k \cdot T_s\) against \(T_c + \frac{\rho ^2}{\lambda (1 - \rho )}\), particularly as \(\rho \) increases further and queueing times grow.

4 Spin-Lock Fine-Tuning

Following the mathematical foundation laid down in the preceding section, our focus shifts to the practical application of the model we have established. Our primary goal is to strategically determine the optimal value of \(k\), the number of spin attempts, so as to minimize the total expected waiting time of a thread contending for a lock in a system operating under varying loads, represented by \(\rho \).

System load, \(\rho \), fundamentally affects spin-lock performance. At peak system loads, we expect lock contention to be at its highest. This translates into a higher probability that a thread will have to wait before acquiring the lock. This scenario is particularly pertinent for optimizing our spin-lock strategy, as lock contention costs increase.

4.1 Strategy Overview

  1. Peak Load Simulation: The first step involves pushing the system to its maximum load to simulate an environment where lock contention is at its highest. By doing so, we ensure that \(\rho \) reflects a state of maximum contention, and that any optimizations we perform are directed at the most stressful operating conditions.

  2. Dynamic Tuning of k: In this high-contention scenario, we begin with a minimum \(k\) value of 1 and incrementally adjust it while monitoring the impact on system performance metrics (a search-loop sketch follows this list). Considering that \(\rho \) remains relatively constant under peak load, our task reduces to determining the optimal value of \(k\), balancing the trade-off between spinning costs and the costs of context switching and queueing delays.

  3. Performance Optimization and Monitoring: We aim to find the \(k\) value that maximizes the lock’s performance. When increasing \(k\) no longer yields performance gains, further spinning ceases to be beneficial. This ideal \(k\) ensures our spin-lock operates at peak efficiency under the existing system load, meaning we have identified the best balance between \(k \cdot T_s\) and \(T_c + \frac{\rho ^2}{\lambda (1 - \rho )}\).

  4. Mutex Adoption Strategy: If performance degrades as \(k\) increases from its initial value, it suggests that the lock contention is too intense for spinning to be effective. In such scenarios, a mutex lock, which involves less spinning, may be more appropriate. This decision is informed by the understanding that, under extreme contention, the overhead of excessive spinning outweighs its benefits.
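The following Java sketch condenses steps 2, 3, and 4 into a search loop. The measureThroughput hook is hypothetical; in our experiments its role was played by the RocketMQ stress test harness described in Sect. 5:

```java
import java.util.function.IntToDoubleFunction;

// Sketch of the peak-load tuning loop. measureThroughput(k) is a hypothetical
// hook that sets the lock's spin limit to k, runs a fixed-length stress test
// at peak load, and returns the observed throughput (e.g., send QPS).
public final class SpinTuner {
    public static int tune(IntToDoubleFunction measureThroughput) {
        int bestK = 1;
        double bestQps = measureThroughput.applyAsDouble(bestK);
        for (int k = 10; k <= 1_000_000; k *= 10) { // grow k geometrically
            double qps = measureThroughput.applyAsDouble(k);
            if (qps <= bestQps) {
                break; // gains have stopped: the previous k was the optimum
            }
            bestQps = qps;
            bestK = k;
        }
        // If k = 1 was never beaten, contention is too intense for spinning
        // to pay off, and a mutex is the better choice (step 4).
        return bestK;
    }
}
```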

The strategy we propose provides a systematic approach to identifying an optimal k value under peak system load. This value aims to balance the probability of lock acquisition against the costs associated with prolonged spinning and context switching. Importantly, while the optimal k is derived under high contention conditions, it offers a benchmark that performs efficiently across a range of loads. When the system load diminishes, the optimal k remains effective at quickly acquiring locks while minimizing context switching overhead.

In essence, this strategy does not limit its applicability to a single \(\rho \) but instead offers a versatile solution that accommodates the entire spectrum of system loads. By integrating our theoretical insights with practical, empirical observations, we ensure that our spin-lock strategy is both robust and adaptive, capable of maintaining lock performance in the face of fluctuating system demands.

5 Experiment

5.1 Variables

To demonstrate the universality of our spin-backoff strategy, we designed multiple scenarios for testing. The results indicate that our strategy significantly improves Apache RocketMQ performance in various contexts. The scenario variables are as follows:

Different CPU Architectures: Apache RocketMQ was initially designed and built for the x86 architecture and, over its development journey, has been ported to ARM-architecture CPUs, adapting to the diversification of hardware. Running on multiple CPU architectures stresses RocketMQ’s code in different ways, so we tested our proposed optimization strategy on more than one type of CPU to prove its effectiveness across architectures. In addition to a traditional x86 CPU, we tested Alibaba Cloud’s self-developed ARM CPUs, a novel architecture for which Apache RocketMQ had not previously been optimized.

Different Code Versions: Over roughly a decade of iteration, Apache RocketMQ has undergone significant changes, adding many enhanced features and increasing the complexity of the message publishing and receiving processes. Thus, we validated not only the latest code but also a stable version from two years ago, to demonstrate the enduring effectiveness of our strategy throughout code evolution.

Different Flush Policies: Apache RocketMQ is a message queue with built-in storage, so bottlenecks may arise from more than CPU performance. We also set different flush policies to simulate various data persistence approaches. An aggressive persistence approach (ASYNC) leads to asynchronous flushing, offering higher throughput at the risk of data loss in a crash. A conservative persistence approach (SYNC) uses synchronous flushing, which doesn’t report success to producers until messages are successfully written to disk. This ensures data integrity at the expense of lower throughput.
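For reference, the flush policy under test is selected through the Broker configuration; in broker.conf the two modes correspond to a single setting (shown here with the conservative value):

```properties
# broker.conf: data persistence strategy under test
# ASYNC_FLUSH = aggressive (higher throughput, possible loss on crash)
# SYNC_FLUSH  = conservative (ack only after the message reaches disk)
flushDiskType=SYNC_FLUSH
```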

5.2 Experimental Procedure

Based on the aforementioned variables, we arranged combinations for a total of \(2^3 = 8\) scenarios, measured the maximum original throughput of each, and applied our optimization strategy to each. Ultimately, our strategy identified an optimal spin parameter \(k\) in every high-pressure scenario, improving on the original maximum performance.

With the three variables above defining the scenarios, we kept all other parameters constant: we selected a 16 vCPU, 32 GiB Alibaba Cloud instance running CentOS 7.9 64-bit, with an ESSD PL1 cloud disk (1024 GiB, 50,000 IOPS), an internal network bandwidth of 10 Gbit/s, and a network packet transmission rate of 3 million PPS.

For the experiments in this chapter, we employed multiple stress test machines to send messages at full capacity to Apache RocketMQ servers. To mitigate disk pressure caused by high message volume, we set the message payload to just 2 bytes. This makes the total message size approximately 100 bytes. During this process, we implemented a backpressure strategy that kept the Broker’s processing load close to but not exceeding its limits. This is consistent with our theoretical design where \(\rho \) approaches 1.

Under these conditions, we tested the raw limit of performance (Send QPS) and applied the spin parameter optimization strategy mentioned in this paper. We obtained the optimal spin parameter k along with performance post-optimization. Table 1 records these metrics and calculates the final percentage of performance optimization.

Table 1. Performance Improvement Overview

Table 1 shows that, across all scenario combinations, our spin parameter search strategy is effective: by optimizing a single spin-lock on the message-sending path, we achieve a performance boost ranging from 5.45% to 37.58%.

The strategy shines especially on ARM CPUs. Taking the aggressive flush policy under the ARM architecture as an example: after two years of code iteration, Apache RocketMQ improved message sending throughput by 24,000 QPS, approximately 16.8%. Had the proposed strategy been applied to the code version from two years ago, it would have directly improved performance by 32.82%, roughly twice the improvement achieved through two years of iteration.

Obtaining such a significant performance boost in traditional high-concurrency systems is typically difficult, as it often implies code refactoring and phased rollout risks. Optimizing the spin-lock backoff strategy, by contrast, brings risk-free performance gains while maintaining thread safety under high concurrency. The improvement is akin to infusing water into a cup already filled with sand: it better utilizes CPU resources otherwise wasted by spinning.

Fig. 1. CPU usage with different \(k\) in the new version code, under the SYNC flush policy.

In addition, we examined CPU usage across various values of \(k\), measured as the percentage CPU utilization ("CPU%") reported by the pidstat command. Each experimental set ran for six minutes, and a total of eight sets of \(k\) values were tested: \(k\) ranging from \(10^0\) to \(10^6\), plus the scenario where \(k\) is infinite (continuous spinning). We take the most remarkable set of results as an example, namely the latest code version deployed on a machine with an x86 CPU using the SYNC flush mode. The results are presented in Fig. 1.
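For reproducibility, one six-minute CPU% series can be collected with pidstat from the sysstat package (the Broker PID placeholder is illustrative):

```bash
# Sample the Broker process's CPU% every 5 s, 72 times (= 6 minutes)
pidstat -u -p <broker-pid> 5 72
```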

The experimental results reveal that at \(k=10^3\), not only did the sending speed reach its peak (155,019.20 QPS), but CPU usage was also at its lowest. This suggests that our spin-backoff strategy successfully conserved CPU resources, thereby enhancing CPU efficiency. At this point, the CPU supported higher performance at lower utilization, indicating that the performance bottleneck had shifted elsewhere, such as to the disk.

In Table 1, we observed a 10.4% improvement in the performance of RocketMQ on an ARM CPU with the same \(k\) (\(10^3\)) and configuration parameters (latest code, SYNC flush mode). Furthermore, as illustrated in Fig. 1, CPU usage decreased substantially at \(k = 10^3\), falling from an average of over 1000% to around 750%. This decrease in resource consumption indicates that alleviating other system bottlenecks could lead to even more significant performance gains.

6 Conclusion & Future Work

In this paper, we focus on a common and challenging issue in high-concurrency systems: performance bottlenecks. We introduce an approach that uses a backoff spin-lock strategy to tackle these bottlenecks. A cost analysis of spin-locks establishes a quantitative relationship linking the expected overhead of lock contention to the number of spin attempts (\(k\)) and the system load (\(\rho \)). On this theoretical foundation, we explored a viable solution: tuning the parameter at peak system load, where \(\rho \) is close to 1, to find the optimal balance between spin-wait and context-switch costs. This approach has significantly improved system performance. Moreover, at lower \(\rho \) values the tuned spin cap \(k\) is rarely reached, since locks are acquired within a few attempts; it therefore does no harm while still reducing context-switching costs, ensuring operational efficiency under light load conditions.

The strategy for searching for the backoff spin-lock parameter, as proposed in this article, has been empirically validated on Apache RocketMQ as well as in commercial RocketMQ instances offered by Alibaba Cloud. Our tests confirmed the strategy’s effectiveness across different CPU architectures, including X86 and ARM, demonstrating its broad applicability. Moreover, we have evaluated the strategy’s stability by examining its performance across various versions of Apache RocketMQ, observing consistent and reliable behavior over time. The study also explored how the strategy performs in conjunction with different data flush approaches-both asynchronous and synchronous-ensuring that the backoff spin-lock optimization effectively enhances system performance under diverse system behaviors. In our experiments, the application of this spin-lock backoff parameter search strategy led to performance improvements ranging from 5.45% to 37.58%, showcasing its potential for enhancing system efficiency in multiple contexts.

The performance enhancements reported in this paper are invigorating for the industry: minute optimizations are hard-won in complex high-concurrency systems, let alone improvements as substantial as 30% at their peak. This finely tuned optimization method avoids extensive code overhauls and phased rollouts, yet brings safe and stable improvements to high-concurrency systems. It is like strategically pouring water into a cup of densely packed sand, utilizing CPU resources otherwise wasted on spinning and context switching.

Looking ahead, we aim to continue our exploration in both laboratory settings and industrial arenas to discover methods to quantify system load (\(\rho \)) accurately. This insight will provide us with additional controllable parameters, helping us refine our spin-lock strategy further. Consequently, it will provide more accurate and efficient performance optimization solutions for various high-concurrency systems. Through these endeavors, we aspire to contribute to deeper technological advancements and breakthrough industrial applications within high-concurrency systems.