
Scheduling computations with provably low synchronization overheads

Journal of Scheduling

Abstract

We present a Work Stealing scheduling algorithm that provably avoids most synchronization overheads by keeping processors’ deques entirely private by default and only exposing work when requested by thieves. This is the first paper that obtains bounds on the synchronization overheads that are (essentially) independent of the total amount of work, thus corresponding to a great improvement, in both algorithm design and theory, over state-of-the-art Work Stealing algorithms. Consider any computation with work \(T_{1}\) and critical-path length \(T_{\infty }\) executed by P processors using our scheduler. Our analysis shows that the expected execution time is \(O\left( \frac{T_{1}}{P} + T_{\infty }\right) \), and the expected synchronization overheads incurred during the execution are at most \(O\left( \left( C_{\mathrm{CAS}} + C_{\mathrm{MFence}}\right) PT_{\infty }\right) \), where \(C_{\mathrm{CAS}}\) and \(C_{\mathrm{MFence}}\), respectively, denote the maximum cost of executing a Compare-And-Swap instruction and a Memory Fence instruction.


Notes

  1. In particular, a processor \(p_i\) must execute an MFence instruction after writing the variable a[i] (in the \(\mathsf {update\_status}\) method) to guarantee that idle processors learn, in constant time, that \(p_i\) has work to be stolen (see Sewell et al., 2010), as is required to achieve the expected runtime bounds presented in their paper.

References

  • Acar, U. A., Blelloch, G. E., & Blumofe, R. D. (2002). The data locality of work stealing. Theory of Computing Systems, 35(3), 321–347. https://doi.org/10.1007/s00224-002-1057-3

  • Acar, U. A., Charguéraud, A., & Rainey, M. (2013). Scheduling parallel programs by work stealing with private deques. In ACM SIGPLAN symposium on principles and practice of parallel programming, PPoPP’13, February 23–27, 2013 (pp. 219–228). https://doi.org/10.1145/2442516.2442538

  • Agrawal, K., He, Y., Hsu, W., & Leiserson, C. E. (2007). Adaptive scheduling with parallelism feedback. In 21st International parallel and distributed processing symposium (IPDPS 2007), proceedings, 26–30 March 2007 (pp. 1–7). https://doi.org/10.1109/IPDPS.2007.370496

  • Agrawal, K., Leiserson, C. E., He, Y., & Hsu, W. (2008). Adaptive work-stealing with parallelism feedback. ACM Transactions on Computing Systems. https://doi.org/10.1145/1394441.1394443

  • Alon, N., & Spencer, J. (1992). The probabilistic method. Wiley.

  • Arora, N. S., Blumofe, R. D., & Plaxton, C. G. (1998). Thread scheduling for multiprogrammed multiprocessors. In SPAA (pp. 119–129). https://doi.org/10.1145/277651.277678

  • Arora, N. S., Blumofe, R. D., & Plaxton, C. G. (2001). Thread scheduling for multiprogrammed multiprocessors. Theory of Computing Systems, 34(2), 115–144. https://doi.org/10.1007/s00224-001-0004-z

  • Attiya, H., Guerraoui, R., Hendler, D., Kuznetsov, P., Michael, M. M., & Vechev, M. T. (2011). Laws of order: Expensive synchronization in concurrent algorithms cannot be eliminated. In Proceedings of the 38th ACM SIGPLAN-SIGACT symposium on principles of programming languages, POPL 2011, January 26–28, 2011 (pp. 487–498). https://doi.org/10.1145/1926385.1926442

  • Blumofe, R. D., Joerg, C. F., Kuszmaul, B. C., Leiserson, C. E., Randall, K. H., & Zhou, Y. (1996). Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1), 55–69. https://doi.org/10.1006/jpdc.1996.0107

  • Blumofe, R. D., & Leiserson, C. E. (1999). Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5), 720–748. https://doi.org/10.1145/324133.324234

  • Blumofe, R. D., & Papadopoulos, D. (1998). The performance of work stealing in multiprogrammed environments. In ACM sigmetrics performance evaluation review (Vol. 26, pp. 266–267).

  • Blumofe, R. D., Plaxton, C. G., & Ray, S. (1999). Verification of a concurrent deque implementation. University of Texas at Austin.

  • Chase, D., & Lev, Y. (2005). Dynamic circular work-stealing deque. In SPAA 2005: Proceedings of the 17th annual ACM symposium on parallelism in algorithms and architectures, July 18–20, 2005 (pp. 21–28). https://doi.org/10.1145/1073970.1073974

  • Dinan, J., Krishnamoorthy, S., Larkins, D. B., Nieplocha, J., & Sadayappan, P. (2008). Scioto: A framework for global-view task parallelism. In 2008 International conference on parallel processing, ICPP 2008, September 8–12, 2008 (pp. 586–593). https://doi.org/10.1109/ICPP.2008.44

  • Dinan, J., Larkins, D. B., Sadayappan, P., Krishnamoorthy, S., & Nieplocha, J. (2009). Scalable work stealing. In Proceedings of the ACM/IEEE conference on high performance computing, SC 2009, November 14–20, 2009. https://doi.org/10.1145/1654059.1654113

  • Endo, T., Taura, K., & Yonezawa, A. (1997). A scalable mark-sweep garbage collector on large-scale shared-memory machines. In Proceedings of the ACM/IEEE conference on supercomputing, SC 1997, November 15–21, 1997 (p. 48). https://doi.org/10.1145/509593.509641

  • Frigo, M., Leiserson, C. E., & Randall, K. H. (1998). The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN ’98 conference on programming language design and implementation (PLDI), June 17–19, 1998 (pp. 212–223). ACM. https://doi.org/10.1145/277650.277725

  • Hiraishi, T., Yasugi, M., Umatani, S., & Yuasa, T. (2009). Backtracking-based load balancing. In Proceedings of the 14th ACM SIGPLAN symposium on principles and practice of parallel programming, PPOPP 2009, February 14–18, 2009 (pp. 55–64). https://doi.org/10.1145/1504176.1504187

  • Lifflander, J., Krishnamoorthy, S., & Kalé, L. V. (2012). Work stealing and persistence-based load balancers for iterative overdecomposed applications. In The 21st international symposium on high-performance parallel and distributed computing, HPDC’12, June 18–22, 2012 (pp. 137–148). https://doi.org/10.1145/2287076.2287103

  • Michael, M. M., Vechev, M. T., & Saraswat, V. A. (2009). Idempotent work stealing. In Proceedings of the 14th ACM SIGPLAN symposium on principles and practice of parallel programming, PPOPP 2009, February 14–18, 2009 (pp. 45–54). https://doi.org/10.1145/1504176.1504186

  • Morrison, A., & Afek, Y. (2014). Fence-free work stealing on bounded TSO processors. In Architectural support for programming languages and operating systems, ASPLOS ’14, March 1–5, 2014 (pp. 413–426). https://doi.org/10.1145/2541940.2541987

  • Muller, S. K., & Acar, U. A. (2016). Latency-hiding work stealing: Scheduling interacting parallel computations with work stealing. In Proceedings of the 28th ACM symposium on parallelism in algorithms and architectures, SPAA 2016, July 11–13, 2016 (pp. 71–82). https://doi.org/10.1145/2935764.2935793

  • Sewell, P., Sarkar, S., Owens, S., Nardelli, F. Z., & Myreen, M. O. (2010). x86-TSO: A rigorous and usable programmer’s model for x86 multiprocessors. Communications of the ACM, 53(7), 89–97. https://doi.org/10.1145/1785414.1785443

  • Tchiboukdjian, M., Gast, N., Trystram, D., Roch, J., & Bernard, J. (2010). A tighter analysis of work stealing. In Algorithms and computation—21st international symposium, ISAAC 2010, December 15–17, 2010, proceedings, part II (pp. 291–302). https://doi.org/10.1007/978-3-642-17514-5_25

  • Tzannes, A., Barua, R., & Vishkin, U. (2011). Improving run-time scheduling for general-purpose parallel code. In L. Rauchwerger & V. Sarkar (Eds.), 2011 International conference on parallel architectures and compilation techniques, PACT 2011, October 10–14, 2011 (p. 216). IEEE Computer Society. https://doi.org/10.1109/PACT.2011.49

  • van Dijk, T., & van de Pol, J. C. (2014). Lace: Nonblocking split deque for work-stealing. In Europar 2014: Parallel processing workshops—Europar 2014 international workshops, August 25–26, 2014, revised selected papers, part II (pp. 206–217). https://doi.org/10.1007/978-3-319-14313-2_18

  • van Ede, T. (2015). Certainty in lockless concurrent algorithms: An informal proof of lace (Technical Report). University of Twente.


Acknowledgements

This work is supported by NOVA LINCS (UIDB/04516/2020) with the financial support of FCT—Fundação para a Ciência e a Tecnologia, through national funds. We would also like to thank Tomas Hruz and the anonymous reviewers for the helpful feedback.

Corresponding author

Correspondence to Hervé Paulino.


A Proofs

Lemma 3 is crucial for the performance analysis of Low-Cost Work Stealing. An analogous result has already been proved for concurrent deques (see Arora et al., 2001, Lemma 3). For the sake of completeness, we present its proof, which is a simple transcription of the original proof of Arora et al. (2001, Lemma 3), adapted for split deques.

Lemma 3 (Structural Lemma for split deques) Let \(v_{1}, \ldots , v_{k}\) denote the nodes stored in some processor p’s split deque, ordered from the bottom of the split deque to the top, at some point in the linearized execution of Low-Cost Work Stealing. Moreover, let \(v_{0}\) denote p’s assigned node (if any), and for \(i = 0,\ldots ,k\) let \(u_{i}\) denote the designated parent of \(v_{i}\) in the enabling tree. Then, for \(i = 1,\ldots ,k\), \(u_{i}\) is an ancestor of \(u_{i-1}\) in the enabling tree, and although \(v_{0}\) and \(v_{1}\) may have the same designated parent (i.e., \(u_{0} = u_{1}\)), for \(i = 2,3,\ldots ,k\), \(u_{i-1} \ne u_{i}\) (i.e., the ancestor relationship is proper).

Proof

Fix a particular split deque. The split deque state and assigned node only change when the assigned node is executed or a thief performs a successful steal. We prove the claim by induction on the number of assigned-node executions and steals since the split deque was last empty. In the base case, if the split deque is empty, then the claim holds vacuously. We now assume that the claim holds before a given assigned-node execution or successful steal, and we will show that it holds after. Specifically, before the assigned-node execution or successful steal, let \(v_{0}\) denote the assigned node; let k denote the number of nodes in the split deque; let \(v_{1},\ldots ,v_{k}\) denote the nodes in the split deque ordered from the bottom to top; and for \(i=0,\ldots ,k\), let \(u_{i}\) denote the designated parent of \(v_{i}\). We assume that either \(k = 0\), or for \(i = 1,\ldots ,k\), node \(u_{i}\) is an ancestor of \(u_{i-1}\) in the enabling tree, with the ancestor relationship being proper, except possibly for the case \(i = 1\). After the assigned-node execution or successful steal, let \({v_{0}}'\) denote the assigned node; let \({k}'\) denote the number of nodes in the split deque; let \({v_{1}}', \ldots , {v_{k}}'\) denote the nodes in the split deque ordered from bottom to top; and for \(i = 1,\ldots ,{k}'\), let \({u_{i}}'\) denote the designated parent of \({v_{i}}'\). We now show that either \({k}' = 0\), or for \(i = 1,\ldots ,{k}'\), node \({u_{i}}'\) is an ancestor of \({u_{i-1}}'\) in the enabling tree, with the ancestor relationship being proper, except possibly for the case \(i = 1\).

Consider the execution of the assigned node \(v_{0}\) by the owner.

If the execution of \(v_{0}\) enables 0 children, then the owner pops the bottommost node off its split deque and makes that node its new assigned node. If \(k = 0\), then the split deque is empty, the owner does not get a new assigned node, and \({k}' = 0\). If \(k > 0\), then the bottommost node \(v_{1}\) is popped and becomes the new assigned node, so \({k}' = k - 1\); in particular, if \(k = 1\), then \({k}' = 0\) and the claim holds vacuously. Otherwise, we rename the nodes as follows: for \(i = 0,\ldots ,{k}'\), we set \({v_{i}}' = v_{i + 1}\) and \({u_{i}}' = u_{i + 1}\). We now observe that for \(i = 1,\ldots ,{k}'\), node \({u_{i}}'\) is a proper ancestor of \({u_{i - 1}}'\) in the enabling tree.

If the execution of \(v_{0}\) enables 1 child x, then x becomes the new assigned node; the designated parent of x is \(v_{0}\); and \({k}' = k\). If \(k = 0\), then \({k}' = 0\). Otherwise, we can rename the nodes as follows. We set \({v_{0}}' = x\); we set \({u_{0}}' = v_{0}\); and for \(i = 1,\ldots ,{k}'\), we set \({v_{i}}' = v_{i}\) and \({u_{i}}' = u_{i}\). We now observe that for \(i = 1,\ldots ,{k}'\), node \({u_{i}}'\) is a proper ancestor of \({u_{i - 1}}'\) in the enabling tree. That \({u_{1}}'\) is a proper ancestor of \({u_{0}}'\) in the enabling tree follows from the fact that \(\left( u_{0},v_{0}\right) \) is an enabling edge.

In the most interesting case, the execution of the assigned node \(v_{0}\) enables 2 children x and y, with x being pushed onto the bottom of the split deque and y becoming the new assigned node. In this case, \(\left( v_{0},x\right) \) and \(\left( v_{0},y\right) \) are both enabling edges, and \({k}' = k + 1\). We now rename the nodes as follows. We set \({v_{0}}' = y\); we set \({u_{0}}' = v_{0}\); we set \({v_{1}}' = x\); we set \({u_{1}}' = v_{0}\); and for \(i = 2,\ldots ,{k}'\), we set \({v_{i}}' = v_{i - 1}\) and \({u_{i}}' = u_{i - 1}\). We now observe that \({u_{1}}' = {u_{0}}'\), and for \(i = 2,\ldots ,{k}'\), node \({u_{i}}'\) is a proper ancestor of \({u_{i - 1}}'\) in the enabling tree. That \({u_{2}}'\) is a proper ancestor of \({u_{1}}'\) in the enabling tree follows from the fact that \(\left( u_{0},v_{0}\right) \) is an enabling edge.

Finally, we consider a successful steal by a thief. In this case, the thief pops the topmost node \(v_{k}\) off the split deque, so \({k}' = k - 1\). If \(k = 1\), then \({k}' = 0\). Otherwise, we can rename the nodes as follows. For \(i = 0,\ldots ,{k}'\), we set \({v_{i}}' = v_{i}\) and \({u_{i}}' = u_{i}\). We now observe that for \(i = 1,\ldots ,{k}'\), node \({u_{i}}'\) is an ancestor of \({u_{i-1}}'\) in the enabling tree, with the ancestor relationship being proper, except possibly for the case \(i = 1\).

\(\square \)

Corollary 1 If \(v_{0},\,v_{1},\,\ldots ,\,v_{k}\) are as defined in the statement of Lemma 3, then we have \(w\left( v_{0}\right) \le w\left( v_{1}\right) < \cdots < w\left( v_{k-1}\right) < w\left( v_{k}\right) \).
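As a sanity check on the case analysis, the ordering of Corollary 1 can be exercised on a toy model. The Python sketch below is an illustrative abstraction, not the paper’s implementation: it tracks node weights only, replays the proof’s four events (0, 1, or 2 enabled children, and a successful steal), and asserts the corollary’s ordering after every event.

```python
import random

def check(assigned, deque):
    """Corollary 1: w(v0) <= w(v1) < w(v2) < ... < w(vk),
    where `deque` lists weights from bottom to top."""
    if deque:
        if assigned is not None:
            assert assigned <= deque[0]
        assert all(a < b for a, b in zip(deque, deque[1:]))

rng = random.Random(42)
assigned, deque = 50, []  # the root is assigned; the deque starts empty
for _ in range(10_000):
    if assigned is None:
        assigned = 50  # previous computation drained; start a fresh one
    event = rng.choice(["exec0", "exec1", "exec2", "steal"])
    if event == "exec0":
        assigned = deque.pop(0) if deque else None   # pop the bottommost node
    elif event == "exec1" and assigned > 1:
        assigned -= 1            # the single child becomes the assigned node
    elif event == "exec2" and assigned > 1:
        deque.insert(0, assigned - 1)  # one child is pushed onto the bottom
        assigned -= 1                  # the other becomes the assigned node
    elif event == "steal" and deque:
        deque.pop()                    # a thief removes the topmost node
    check(assigned, deque)
```

The model relies only on the fact that an enabled child is one level deeper in the enabling tree than its parent, so its weight is exactly one less; everything else about the scheduler is abstracted away.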

We are now able to bound the execution time of a computation depending on the number of idle iterations that take place during that computation’s execution. The following result is a trivial variant of Arora et al. (2001, Lemma 5), adjusted for the Low-Cost Work Stealing algorithm, and is only included for the sake of completeness.

Lemma 4 Consider any computation with work \(T_{1}\) being executed by P processors, under Low-Cost Work Stealing. The execution time is \(O\left( \frac{T_{1}}{P} + \frac{I}{P}\right) \), where I denotes the number of idle iterations executed by processors.

Proof

Consider two buckets to which we add tokens during the computation’s execution: the busy bucket and the idle bucket. At the end of each iteration, every processor places a token into one of these buckets: if the processor executed a node during the iteration, it places a token into the busy bucket; otherwise, it places a token into the idle bucket. Since each of the P processors completes at least one iteration in every C consecutive steps, at least P tokens are placed into the buckets during each such interval.

Because, by definition, the computation has \(T_{1}\) nodes, there will be exactly \(T_{1}\) tokens in the busy bucket when the computation’s execution ends. Moreover, as I denotes the number of idle iterations, it also corresponds to the number of tokens in the idle bucket when the computation’s execution ends. Thus, exactly \(T_{1} + I\) tokens are collected during the computation’s execution. Taking into account that for each C consecutive steps at least P tokens are placed into the buckets, we conclude that the number of steps required to collect all the tokens is at most \(C \cdot \left( \frac{T_{1}}{P} + \frac{I}{P}\right) \). Once all \(T_{1}\) busy tokens have been collected, the computation’s execution has terminated, implying the execution time is \(O\left( \frac{T_{1}}{P} + \frac{I}{P}\right) \). \(\square \)
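The token-counting argument can be replayed directly. The sketch below is a toy model with \(C = 1\) and hypothetical parameters: at every step each of the P processors deposits exactly one token, the run ends once all \(T_{1}\) busy tokens have been deposited, and the step count then matches the bound exactly.

```python
import random

def simulate(T1, P, p_busy=0.7, seed=1):
    """Bucket argument of Lemma 4 with C = 1: each step, every one of the
    P processors deposits one token -- busy if it executed a node, idle
    otherwise.  The run ends once T1 busy tokens (all the work) are in."""
    rng = random.Random(seed)
    busy = idle = steps = 0
    while busy < T1:
        steps += 1
        for _ in range(P):
            # A processor is busy with some probability; once the work is
            # exhausted, every remaining deposit is necessarily idle.
            if busy < T1 and rng.random() < p_busy:
                busy += 1
            else:
                idle += 1
    return steps, idle

T1, P = 1000, 8
steps, I = simulate(T1, P)
# With C = 1 the count is exact: every step deposits P tokens, and every
# token is either one of the T1 busy ones or one of the I idle ones.
assert steps * P == T1 + I
assert steps <= T1 / P + I / P
```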

The following lemma formalizes arguments already given in Arora et al. (2001), but considering the potential function we present.

Lemma 7 Consider some node u, ready at step i during the execution of a computation.

  1. If u gets assigned to a processor at that step, the potential drops by at least \(\frac{3}{4}\phi _{i}\left( u\right) \).

  2. If u becomes stealable at that step, the potential drops by at least \(\frac{3}{4}\phi _{i}\left( u\right) \).

  3. If u was already assigned to a processor and gets executed at that step, the potential drops by at least \(\frac{47}{64}\phi _{i}\left( u\right) \).

Proof

Regarding the first claim, if u was stealable the potential decreases from \(4^{3w\left( u\right) - 1}\) to \(4^{3w\left( u\right) -2}\). Otherwise, the potential decreases from \(4^{3w\left( u\right) }\) to \(4^{3w\left( u\right) -2}\), which is even more than in the previous case. Given that \(4^{3w\left( u\right) - 1} - 4^{3w\left( u\right) -2} = \frac{3}{4}\phi _{i}\left( u\right) \), we conclude that if u gets assigned the potential decreases by at least \(\frac{3}{4}\phi _{i}\left( u\right) \).

Regarding the second one, note that u was not stealable (because it became stealable at step i) and so the potential decreases from \(4^{3w\left( u\right) }\) to \(4^{3w\left( u\right) -1}\). So, if u becomes stealable, the potential decreases by \(4^{3w\left( u\right) } - 4^{3w\left( u\right) -1} = \frac{3}{4}\phi _{i}\left( u\right) \).

We now prove the last claim. Recall that, by our conventions regarding computations’ structure, each node within a computation’s dag can have an out-degree of at most two. Consequently, each node can be the designated parent of at most two other nodes in the enabling tree. Moreover, by definition, the weight of any node is strictly smaller than the weight of its designated parent, since it is deeper in the enabling tree than its designated parent. Consider the three possible scenarios:

0 nodes enabled:

   The potential decreased by \(\phi _{i}\left( u\right) \).

1 node enabled:

   The enabled node becomes the assigned node of the processor (that executed u). Let x denote the enabled node. Since x is the child of u in the enabling tree, it follows \(\phi _{i}\left( u\right) - \phi _{i+1}\left( x\right) = 4^{3w\left( u\right) - 2} - 4^{3w\left( x\right) - 2} = 4^{3w\left( u\right) - 2} - 4^{3\left( w\left( u\right) - 1 \right) - 2} = \frac{63}{64}\phi _{i}\left( u\right) \). Thus, for this situation, the potential decreases by \(\frac{63}{64}\phi _{i}\left( u\right) \).

2 nodes enabled:

   In this case, one of the enabled nodes immediately becomes the assigned node of the processor whilst the other is pushed onto the bottom of the split deque’s private part. Let x denote the enabled node that becomes the processor’s new assigned node and y the other enabled node. Since both x and y have u as their designated parent in the enabling tree, we have \(\phi _{i}\left( u\right) - \phi _{i+1}\left( x\right) - \phi _{i+1}\left( y\right) = 4^{3w\left( u\right) - 2} - 4^{3w\left( x\right) - 2} - 4^{3w\left( y\right) } = \frac{47}{64}\phi _{i}\left( u\right) \). As such, the potential decreases by \(\frac{47}{64}\phi _{i}\left( u\right) \), concluding the proof of the lemma. \(\square \)
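Each of the three drops can be checked mechanically. The sketch below encodes the potential function used in the paper (\(4^{3w-2}\) for an assigned node, \(4^{3w-1}\) for a stealable one, \(4^{3w}\) otherwise) and verifies the three fractions with exact rational arithmetic; the chosen weight is arbitrary, as the ratios do not depend on it.

```python
from fractions import Fraction

def phi(w, status):
    """Node potential, following the paper: 4^{3w-2} if the node is
    assigned, 4^{3w-1} if it is stealable, 4^{3w} otherwise."""
    exp = {"assigned": 3 * w - 2, "stealable": 3 * w - 1, "in_deque": 3 * w}
    return Fraction(4) ** exp[status]

w = 5  # arbitrary; the ratios below are independent of the weight

# Claim 1: a stealable node that gets assigned loses 3/4 of its potential
# (a non-stealable one loses even more).
assert phi(w, "stealable") - phi(w, "assigned") == Fraction(3, 4) * phi(w, "stealable")

# Claim 2: a node that becomes stealable loses 3/4 of its potential.
assert phi(w, "in_deque") - phi(w, "stealable") == Fraction(3, 4) * phi(w, "in_deque")

# Claim 3, worst case: executing an assigned node enables two children of
# weight w - 1; one becomes the new assigned node, the other is pushed onto
# the private (non-stealable) part of the split deque.
after = phi(w - 1, "assigned") + phi(w - 1, "in_deque")
assert phi(w, "assigned") - after == Fraction(47, 64) * phi(w, "assigned")
```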

The following lemma is a direct consequence of Corollary 1 and of the potential function’s properties. The result is a variant of Arora et al. (2001, Top-Heavy Deques), considering split deques instead of conventional fully concurrent deques, and our potential function instead of the original.

Lemma 8 Consider any step i and any processor \(p \in D_{i}\). The top-most node u in p’s split deque contributes at least \(\frac{4}{5}\) of the potential associated with p. That is, we have \(\phi _{i}\left( u\right) \ge \frac{4}{5}\Phi _{i}\left( p\right) \).

Proof

This lemma follows from Corollary 1. We prove it by induction on the number of nodes within p’s split deque.

Base case:

   As the base case, consider that p’s split deque contains a single node u. The processor may or may not have an assigned node. If it does not, we have \(\phi _{i}\left( u\right) = \Phi _{i}\left( p\right) \). Otherwise, let x denote p’s assigned node. Corollary 1 implies that \(w\left( u\right) \ge w\left( x\right) \). It follows that \(\Phi _{i}\left( p\right) = \phi _{i}\left( u\right) + \phi _{i}\left( x\right) = 4^{3w\left( u\right) - 1} + 4^{3w\left( x\right) - 2} \le \frac{5}{4}\phi _{i}\left( u\right) \). Thus, if p’s split deque contains a single node, we have \(\Phi _{i}\left( p\right) \le \frac{5}{4}\phi _{i}\left( u\right) \).

Induction step:

   Consider that p’s split deque now contains n nodes, where \(n \ge 2\), and let \(u,\,x\) denote the topmost and second topmost nodes, respectively, within the split deque. For the purpose of induction, assume the lemma holds for the first \(n - 1\) nodes (i.e., disregarding u): \(\Phi _{i}\left( p\right) - \phi _{i}\left( u\right) \le \frac{5}{4}\phi _{i}\left( x\right) \). Corollary 1 implies \(w\left( u\right) > w\left( x\right) \) and, since weights are integers, \(w\left( u\right) - 1 \ge w\left( x\right) \). It follows that \(\Phi _{i}\left( p\right) \le \frac{5}{4}\phi _{i}\left( x\right) + \phi _{i}\left( u\right) = \frac{5}{4}4^{3w\left( x\right) } + 4^{3w\left( u\right) - 1} \le \frac{5}{4}4^{3\left( w\left( u\right) - 1\right) } + 4^{3w\left( u\right) - 1} = \frac{69}{64}\phi _{i}\left( u\right) < \frac{5}{4}\phi _{i}\left( u\right) \), concluding the proof of the lemma. \(\square \)

The following result is a consequence of Lemma 8.

Lemma 9 Suppose a thief processor p chooses a processor \(q \in D_{i}\) as its victim at some step j, such that \(j \ge i\) (i.e., a steal attempt of p targeting q occurs at step j). Then, at step \(j + 2C\), the potential has decreased by at least \(\frac{3}{5}\Phi _{i}\left( q\right) \), either because the topmost node in q’s split deque got assigned, or because it became stealable.

Proof

Let u denote the topmost node of q’s split deque at the beginning of step i. We first prove that u either gets assigned or becomes stealable.

Three possible scenarios may take place due to p’s steal attempt targeting q’s split deque.

  • The invocation returns a node. If p stole u, then u gets assigned to p. Otherwise, some other processor removed u before p did, implying u got assigned to that other processor.

  • The invocation aborts. Since the split deque implementation meets the relaxed semantics on any good set of invocations, and because the Low-Cost Work Stealing algorithm only makes good sets of invocations, we conclude that some other processor successfully removed a topmost node from q’s split deque during the aborted steal attempt made by p. If the removed node was u, then u got assigned to a processor (either q or some thief that successfully stole u). Otherwise, u must have been previously stolen by a thief or popped by q, and thus became assigned to some processor.

  • The invocation returns empty. This situation can only occur if q’s split deque is completely empty, or if there is no node in the public part of q’s split deque.

    • For the first case, since \(q \in D_{i}\), some processor must have successfully removed u from q’s split deque. Consequently, u was assigned to a processor.

    • If there was no node in the public part of q’s split deque, p sets q’s targeted flag to true in a later step \({j}'\). Recall that, for each C consecutive instructions executed by a processor, at least one corresponds to a milestone. It follows that \({j}' \le j + C\). Furthermore, by observing Algorithm 1, we conclude that q will make and complete an invocation to updateBottom of its split deque in one of the C steps succeeding step \({j}'\). From that invocation, only two situations can arise:

      • No node becomes stealable. In this case, the private part of q’s split deque was empty, implying u was already assigned to some processor (either q or some thief).

      • A node becomes stealable. If the node that became stealable as a result of the updateBottom invocation was u, we are done. Otherwise, either u was assigned to a processor (which could have been q or some thief), or u had already been transferred to the public part of q’s split deque as a consequence of another thief’s steal attempt that also returned empty, implying that u either became assigned or became stealable. Thus, in any case, u either gets assigned to a processor or becomes stealable.

With this, we conclude that u either got assigned or became stealable by step \(j + 2C\).

From Lemma 8, we have \(\phi _{i}\left( u\right) \ge \frac{4}{5}\Phi _{i}\left( q\right) \). Furthermore, Lemma 7 proves that if u gets assigned the potential decreases by at least \(\frac{3}{4}\phi _{i}\left( u\right) \), and if u becomes stealable the potential also decreases by at least \(\frac{3}{4}\phi _{i}\left( u\right) \). Because u is either assigned or becomes stealable in any case, we conclude the potential associated with q at step \(j + 2C\) has decreased by at least \(\frac{3}{5}\Phi _{i}\left( q\right) \). \(\square \)

The following lemma is a trivial generalization of the original result presented in Arora et al. (2001, Balls and Weighted Bins). The only difference between the two results is the assumption of having at least B balls, rather than exactly B balls. Its proof is presented only for the sake of completeness and is (trivially) adapted from the proof of Arora et al. (2001, Balls and Weighted Bins).

Lemma 10 (Balls and Weighted Bins) Suppose we are given at least B balls, and exactly B bins. Each of the balls is tossed independently and uniformly at random into one of the B bins, where for \(i = 1,\ldots ,\,B\), bin i has a weight \(W_{i}\). The total weight is \(W = \sum _{i = 1}^{B} W_{i}\). For each bin i, we define the random variable \(X_{i}\) as

$$\begin{aligned} X_{i} = \left\{ \begin{array}{ll} W_{i} &{}\quad \text {if some ball lands in bin } i\\ 0 &{}\quad \text {otherwise} \end{array}\right. \end{aligned}$$

and define the random variable X as \(X = \sum _{i = 1}^{B} X_{i}\).

Then, for any \(\beta \) in the range \(0< \beta < 1\), we have

$$\begin{aligned} P\left\{ X \ge \beta W\right\} \ge 1 - \frac{1}{\left( 1 - \beta \right) e}. \end{aligned}$$

Proof

Consider the random variable \(W_{i} - X_{i}\) taking the value of \(W_{i}\) when no ball lands in bin i and 0 otherwise, and let \({B}'\) denote the total number of balls that are tossed. It follows \(E\left[ W_{i} - X_{i}\right] = W_{i}\left( 1 - \frac{1}{B}\right) ^{{B}'} \le \frac{W_{i}}{e}\). From the linearity of expectation, we have \(E\left[ W-X\right] \le \frac{W}{e}\). Markov’s inequality then implies \(P\left\{ W - X > \left( 1 - \beta \right) W\right\} = P\left\{ X < \beta W\right\} \le \frac{E\left[ W - X\right] }{\left( 1 - \beta \right) W} \le \frac{1}{\left( 1 - \beta \right) e}\). \(\square \)
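A quick Monte Carlo experiment (with illustrative, arbitrarily chosen bin weights) confirms that the bound of Lemma 10 holds, and in fact holds with considerable slack:

```python
import math
import random

def covered_weight(B, weights, num_balls, rng):
    """Toss num_balls balls u.a.r. into B bins; X is the total weight of
    the bins hit at least once."""
    hit = {rng.randrange(B) for _ in range(num_balls)}
    return sum(weights[i] for i in hit)

rng = random.Random(0)
B = 32
weights = [rng.random() for _ in range(B)]  # arbitrary nonnegative weights
W = sum(weights)
beta = 0.5

trials = 5000
successes = sum(covered_weight(B, weights, B, rng) >= beta * W
                for _ in range(trials))
empirical = successes / trials
bound = 1 - 1 / ((1 - beta) * math.e)  # about 0.264 for beta = 1/2
assert empirical >= bound  # the guarantee is loose; in practice much higher
```

With B balls and B bins the expected covered weight is roughly \(\left( 1 - \frac{1}{e}\right) W \approx 0.63\,W\), so the empirical success probability for \(\beta = \frac{1}{2}\) sits far above the \(1 - \frac{2}{e}\) guarantee.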

The following result states that for each P idle iterations that take place, with constant probability the potential drops by a constant factor. An analogous lemma was originally presented in Arora et al. (2001, Lemma 8) for the non-blocking Work Stealing algorithm. The result is a consequence of Lemmas 9 and 10, and its proof follows the same lines as the one presented in that study.

Lemma 11 Consider any step i and any later step j such that at least P idle iterations occur from i (inclusive) to j (exclusive). Then, we have

$$\begin{aligned} P\left\{ \Phi _{i} - \Phi _{j + 2C} \ge \frac{3}{10}\Phi _{i}\left( D_{i}\right) \right\} > \frac{1}{4}. \end{aligned}$$

Proof

By Lemma 9 we know that for each processor \(p \in D_{i}\) that is targeted by a steal attempt, the potential drops by at least \(\frac{3}{5}\Phi _{i}\left( p\right) \), at most 2C steps after being targeted.

When executing an idle iteration, a processor plays the role of a thief attempting to steal work from some victim. Thus, since P idle iterations occur from step i (inclusive) to step j (exclusive), at least P steal attempts take place during that same interval. We can think of each such steal attempt as a ball toss of Lemma 10.

For each processor p in \(D_{i}\), we assign it a weight \(W_{p} = \frac{3}{5}\Phi _{i}\left( p\right) \), and for each processor p in \(A_{i}\), we assign it a weight \(W_{p} = 0\). Clearly, the weights sum to \(W = \frac{3}{5}\Phi _{i}\left( D_{i}\right) \). Using \(\beta = \frac{1}{2}\) in Lemma 10, it follows that with probability at least \(1 - \frac{1}{\left( 1 - \beta \right) e} > \frac{1}{4}\), the potential decreases by at least \(\beta W = \frac{3}{10}\Phi _{i}\left( D_{i}\right) \), concluding the proof of this lemma. \(\square \)

Finally, we bound the expected number of idle iterations that take place during a computation’s execution using the Low-Cost Work Stealing algorithm. The result follows from Lemma 11 and is proved using arguments similar to those in the proof of Arora et al. (2001, Theorem 9); the proof presented here is an adaptation of the one given for that theorem.

Lemma 12 Consider any computation with work \(T_{1}\) and critical-path length \(T_{\infty }\) being executed by Low-Cost Work Stealing using P processors. The expected number of idle iterations is at most \(O\left( P T_{\infty }\right) \). Moreover, with probability at least \(1 - \varepsilon \) the number of idle iterations is at most \(O\left( \left( T_{\infty } + \ln \left( \frac{1}{\varepsilon }\right) \right) P\right) \).

Proof

To analyze the number of idle iterations, we break the execution into phases, each composed of \(\Theta \left( P\right) \) idle iterations. Then, we prove that with constant probability, a phase leads the potential to drop by a constant factor.

A computation’s execution begins when the root gets assigned to a processor. By definition, the weight of the root is \(T_{\infty }\), implying the potential at the beginning of a computation’s execution starts at \(\Phi _{0} = 4^{3T_{\infty } - 2}\). Furthermore, it is straightforward to deduce that the potential is 0 after (and only after) a computation’s execution terminates. We use these facts to bound the expected number of phases needed to decrease the potential down to 0. The first phase starts at step \(t_{1} = 1\) and ends at the first step \({t_{1}}'\) such that, at least P idle iterations took place during the interval \(\left[ t_{1},{t_{1}}' - 2C\right] \). The second phase starts at step \(t_{2} = {t_{1}}' + 1\), and so on.

Consider two consecutive phases starting at steps i and j, respectively. We now prove that \(P\left\{ \Phi _{j} \le \frac{7}{10}\Phi _{i}\right\} > \frac{1}{4}\). Recall that we can partition the potential as \(\Phi _{i} = \Phi _{i}\left( A_{i}\right) + \Phi _{i}\left( D_{i}\right) \). Since, from the beginning of each phase and until its last 2C steps, at least P idle iterations take place, then, by Lemma 11 it follows \(P\left\{ \Phi _{i} - \Phi _{j} \ge \frac{3}{10}\Phi _{i}\left( D_{i}\right) \right\} > \frac{1}{4}\). Now, we have to prove the potential also drops by a constant fraction of \(\Phi _{i} \left( A_{i}\right) \). Consider some processor \(p \in A_{i}\):

  • If p does not have an assigned node, then \(\Phi _{i}\left( p\right) = 0\).

  • Otherwise, if p has an assigned node u at step i, then \(\Phi _{i}\left( p\right) = \phi _{i}\left( u\right) \). Since each phase has more than C steps, p executes u before the next phase begins (i.e., before step j). Thus, the potential drops by at least \(\frac{47}{64}\phi _{i}\left( u\right) \) during that phase.

Summing over all \(p \in A_{i}\), it follows that \(\Phi _{i} - \Phi _{j} \ge \frac{47}{64}\Phi _{i}\left( A_{i}\right) \). Thus, no matter how \(\Phi _{i}\) is partitioned between \(\Phi _{i}\left( A_{i}\right) \) and \(\Phi _{i}\left( D_{i}\right) \), we have \(P\left\{ \Phi _{i} - \Phi _{j} \ge \frac{3}{10}\Phi _{i}\right\} > \frac{1}{4}\).

We say a phase is successful if it leads the potential to decrease by at least a \(\frac{3}{10}\) fraction. So, a phase succeeds with probability at least \(\frac{1}{4}\). Since the potential is an integer, and, as aforementioned, starts at \(\Phi _{0} = 4^{3T_{\infty } - 2}\) and ends at 0, then there can be at most \(\left( 3T_{\infty } - 2\right) \log _{\frac{10}{7}}\left( 4\right) < 12T_{\infty }\) successful phases. If we think of each phase as a coin toss, where the probability that we get heads is at least \(\frac{1}{4}\), then the expected number of coins we have to toss to get heads \(12T_{\infty }\) times is at most \(48T_{\infty }\). In the same way, the expected number of phases needed to obtain \(12T_{\infty }\) successful ones is at most \(48T_{\infty }\). Consequently, the expected number of phases is \(O\left( T_{\infty }\right) \). Moreover, as each phase contains \(O\left( P\right) \) idle iterations, the expected number of idle iterations is \(O\left( PT_{\infty }\right) \).

Now, suppose the execution takes \(n = 48T_{\infty } + m\) phases. Each phase succeeds with probability at least \(p = \frac{1}{4}\), meaning the expected number of successes is at least \(np = 12T_{\infty } + \frac{m}{4}\). We now compute the probability that the number of successes X is less than \(12T_{\infty }\). We use the Chernoff bound (Alon & Spencer, 1992), \(P\left\{ X< np - a\right\} < e^{-\frac{a^{2}}{2np}}\), with \(a = \frac{m}{4}\); it follows that \(np - a = 12T_{\infty }\). Choosing \(m = 48T_{\infty } + 16\ln \left( \frac{1}{\varepsilon }\right) \), we have \( P\left\{ X< 12T_{\infty }\right\} < e^{-\frac{\left( \frac{m}{4}\right) ^{2}}{2\left( 12T_{\infty } + \frac{m}{4}\right) }} \le e^{-\frac{m}{16}} \le e^{-\frac{16\ln \left( \frac{1}{\varepsilon }\right) }{16}} = \varepsilon \). Thus, the probability that the execution takes \(96T_{\infty } + 16\ln \left( \frac{1}{\varepsilon }\right) \) phases or more is less than \(\varepsilon \). We conclude that, with probability at least \(1 - \varepsilon \), the number of idle iterations is at most \(O\left( \left( T_{\infty } + \ln \left( \frac{1}{\varepsilon }\right) \right) P\right) \). \(\square \)
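The arithmetic behind the tail bound can be double-checked numerically. The sketch below evaluates the Chernoff expression with the proof’s choice of m and confirms that it is at most \(\varepsilon \) for a few sample values:

```python
import math

def tail_bound(t_inf, eps):
    """Chernoff tail e^{-a^2/(2np)} with n = 48*t_inf + m phases, p = 1/4,
    a = m/4, and m = 48*t_inf + 16*ln(1/eps) as chosen in the proof."""
    m = 48 * t_inf + 16 * math.log(1 / eps)
    a = m / 4
    np_ = (48 * t_inf + m) / 4  # np = 12*t_inf + m/4
    return math.exp(-a * a / (2 * np_))

for t_inf in (1, 10, 1000):
    for eps in (0.1, 0.01, 1e-6):
        # The proof's chain: tail <= e^{-m/16} <= eps.
        assert tail_bound(t_inf, eps) <= eps
```

The key step, \(\frac{\left( m/4\right) ^{2}}{2\left( 12T_{\infty } + m/4\right) } \ge \frac{m}{16}\), reduces to \(m \ge 48T_{\infty }\), which the choice of m guarantees for any \(\varepsilon \le 1\).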


Cite this article

Rito, G., Paulino, H. Scheduling computations with provably low synchronization overheads. J Sched 25, 107–124 (2022). https://doi.org/10.1007/s10951-021-00706-6
