The TBB scheduler is designed to keep worker threads, and their underlying cores, as busy as possible. If and when a worker thread becomes idle, it steals work from another thread so that it has something to do. When it steals, a thread is not aware of what parallel algorithm, loop or function originally created the task that it steals. Usually, where a task comes from is irrelevant, and so the best thing for the TBB library to do is to treat all available tasks equally and process them as quickly as possible.
However, if our application uses nested parallelism, the TBB library can steal tasks in a way that leads to an execution order that might not be expected by a developer. This execution order is not inherently dangerous; in fact, in most cases, it is exactly what we would like to happen. But if we make incorrect assumptions about how tasks may execute, we can create patterns that lead to unexpected or even disastrous results.
A small example that demonstrates this issue is shown in Figure 12-1. In the code, there are two parallel_for loops. In the body of the outer loop, a lock on mutex m is acquired. The thread that acquires this lock calls a second nested parallel_for loop while holding the lock. A problem arises if the thread that acquires the lock on m becomes idle before its inner loop is done; this can happen if worker threads steal away iterations but have not yet finished them when the master thread runs out of work. The master thread cannot simply exit the parallel_for, since it’s not done yet. To be efficient, this thread doesn’t just idly spin, waiting for the other threads to finish their work; who knows how long that could take? Instead, it keeps its current task on its stack and looks for additional work to keep itself busy until it can pick up where it left off. If this situation arises in Figure 12-1, there are two kinds of tasks in the system at the point that the thread is looking for work to steal – inner loop tasks and outer loop tasks. If the thread happens to steal and execute a task from the outer parallel_for, it will attempt to acquire a lock on m again. Since it already holds a lock on m, and a tbb::spin_mutex is not a recursive lock, there is a deadlock. The thread is trapped waiting for itself to release the lock!
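The listing in Figure 12-1 is not reproduced here, but a minimal sketch of the pattern just described looks roughly like the following; the function name and the bound N are illustrative, not taken from the book's code:

```cpp
#include <tbb/parallel_for.h>
#include <tbb/spin_mutex.h>

void nested_loops_with_lock(int N) {          // illustrative name and bound
  tbb::spin_mutex m;
  tbb::parallel_for(0, N, [&m, N](int i) {
    tbb::spin_mutex::scoped_lock lock(m);     // lock held across the nested loop
    // While waiting for this inner loop to finish, the thread may steal
    // another outer iteration, re-enter this body, and try to re-acquire m:
    // deadlock, since tbb::spin_mutex is not recursive.
    tbb::parallel_for(0, N, [](int j) {
      /* inner-loop work */
    });
  });
}
```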
After seeing this example, two questions commonly arise: (1) does anyone really write code like this? And, (2) can a thread really wind up stealing a task from the outer loop? The answer to both of these questions is, unfortunately, yes.
People in fact do write code like this – almost always unintentionally though. One common way this pattern might arise is if a lock is held while a library function is called. A developer may assume they know what a function does, but if they are not familiar with its implementation, they can be wrong. If the library call contains nested parallelism, the case shown in Figure 12-1 can be the result.
And yes, work stealing can cause this example to deadlock. Figure 12-2 shows how our example might fall into this terrible state.
In Figure 12-2(a), thread t0 starts the outer loop and acquires the lock on m. Thread t0 then starts the nested parallel_for and executes the left half of its iteration space. While thread t0 is busy, three other threads t1, t2, and t3 participate in the execution of tasks in the arena. Threads t1 and t2 steal outer loop iterations and are blocked waiting to acquire the lock on m, which t0 currently holds. Meanwhile, thread t3 randomly selects t0 to steal from and starts executing the right half of its inner loop. This is where things start to get interesting. Thread t0 completes the left half of the inner loop’s iterations and therefore will steal work to prevent itself from becoming idle. At this point it has two options: (1) if it randomly chooses thread t3 to steal from, it will execute more of its own inner loop or (2) if it randomly chooses thread t1 to steal from, it will execute one of the outer loop iterations. Remember that by default, the scheduler treats all tasks equally, so it doesn’t prefer one over the other. Figure 12-2(b) shows the unlucky choice where it steals from thread t1 and becomes deadlocked trying to acquire the lock it already holds since its outer task is still on its stack.
Another example that shows correctness issues is shown in Figure 12-3. Again, we see a set of nested parallel_for loops, but instead of a deadlock, we get unexpected results because of the use of thread local storage. In each task, a value is written to a thread local storage location, local_i, an inner parallel_for loop is executed, and then the thread local storage location is read. Because of the inner loop, a thread may steal work if it becomes idle, write another value to the thread local storage location, and then return to the outer task.
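Figure 12-3 is likewise not reproduced here; a minimal sketch of the pattern, assuming the thread-local value is kept in a tbb::enumerable_thread_specific and using illustrative names and bounds, might look like this:

```cpp
#include <tbb/parallel_for.h>
#include <tbb/enumerable_thread_specific.h>
#include <cassert>

void tls_surprise(int N, int M) {             // illustrative names and bounds
  tbb::enumerable_thread_specific<int> local_i;
  tbb::parallel_for(0, N, [&local_i, M](int i) {
    local_i.local() = i;                      // write this iteration's index to TLS
    tbb::parallel_for(0, M, [](int j) {
      /* inner-loop work */
    });
    // If the thread moonlighted on another outer iteration while waiting for
    // the inner loop, the thread-local slot may no longer hold i.
    assert(local_i.local() == i);             // can fire!
  });
}
```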
The TBB development team uses the term moonlighting for situations in which a thread has unfinished child tasks in flight and steals unrelated tasks to keep itself busy. Moonlighting is usually a good thing! It means that our threads are not sitting around idle. It’s only in limited situations when things go awry. In both of our examples, there was a bad assumption. They both assumed – not surprisingly – that because TBB has a non-preemptive scheduler, the same thread could never be executing an inner task and then start executing an outer task before it completed the inner task. As we’ve seen, because a thread can steal work while it’s waiting in nested parallelism, this situation can in fact occur. This typically benign behavior is only dangerous if we incorrectly depend on the thread executing the tasks in a mutually exclusive way. In the first case, a lock was held while executing nested parallelism – allowing the thread to pause the inner task and pick up an outer task. In the second case, the thread accessed thread local storage before and after nested parallelism and assumed the thread would not moonlight in between.
As we can see, these examples are different but share a common misconception. In the blog “The Work Isolation Functionality in Intel Threading Building Blocks” that is listed in the “For More Information” section at the end of this chapter, Alexei Katranov provides a three-step checklist for deciding when work isolation is needed to ensure correctness:
1. Is nested parallelism used (even indirectly, through third-party library calls)? If not, isolation is not needed; otherwise, go to the next step.
2. Is it safe for a thread to reenter the outer-level parallel tasks (as if there were recursion)? Storing to a thread-local value, re-acquiring a mutex already held by this thread, or reusing any other resource that must not be touched by the same thread again can all cause problems. If reentrance is safe, isolation is not needed; otherwise, go to the next step.
3. Isolation is needed. The nested parallelism has to be invoked inside an isolated region.
Creating an Isolated Region with this_task_arena::isolate
When we need isolation for correctness, we can use one of the isolate functions in the this_task_arena namespace:
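The exact signatures vary across TBB releases, but the interface is approximately the following (consult tbb/task_arena.h in your installation for the authoritative declarations):

```cpp
// Approximate declarations; newer TBB versions also provide an overload
// that returns the functor's result.
namespace tbb {
  namespace this_task_arena {
    template <typename F>
    void isolate(const F& f);   // executes f() within an isolated region
  }
}
```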
Figure 12-4 shows how to use this function to add an isolated region around the nested parallel_for from Figure 12-1. Within an isolated region, if a thread becomes idle because it must wait – for example at the end of a nested parallel_for – it will only be allowed to steal tasks spawned from within its own isolated region. This fixes our deadlock problem, because if a thread steals while waiting at the inner parallel_for in Figure 12-4, it will not be allowed to steal an outer task.
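Applied to the earlier sketch of the Figure 12-1 pattern, the fix might look like the snippet below (it additionally needs <tbb/task_arena.h>); again, this is an approximation of the book's listing, not a copy of it:

```cpp
tbb::parallel_for(0, N, [&m, N](int i) {
  tbb::spin_mutex::scoped_lock lock(m);
  tbb::this_task_arena::isolate([N]() {
    // While waiting here, the thread may still steal, but only tasks that
    // originated inside this isolated region, never another outer iteration.
    tbb::parallel_for(0, N, [](int j) {
      /* inner-loop work */
    });
  });
});
```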
When a thread becomes blocked within an isolated region, it will still randomly choose a thread from its task arena to steal from, but now must inspect tasks in that victim thread’s deque to be sure it steals only tasks that originated from within its isolated region.
The main properties of this_task_arena::isolate are nicely summarized, again in Alexei’s blog, as follows:
- The isolation only constrains threads that enter or join an isolated region. Worker threads outside of an isolated region can take any task, including a task spawned in an isolated region.
- When a thread without isolation executes a task spawned in an isolated region, it joins the region of that task and becomes isolated until the task is complete.
- Threads waiting inside an isolated region cannot process tasks spawned in other isolated regions (i.e., all regions are mutually isolated). Moreover, if a thread within an isolated region enters a nested isolated region, it cannot process tasks from the outer isolated region.
Oh No! Work Isolation Can Cause Its Own Correctness Issues!
Unfortunately, we can’t just indiscriminately apply work isolation. There are performance implications, which we will get to later, but more importantly, work isolation itself can cause deadlock if used incorrectly! Here we go again…
In particular, we have to be extra careful when we mix work isolation with TBB interfaces that separate spawning tasks from waiting for tasks – such as task_group and flow graphs. A thread that calls a wait interface inside one isolated region cannot participate in tasks spawned in a different isolated region while it waits. If enough threads get stuck in such a position, the application might run out of threads and forward progress will stop.
Let’s consider the example function shown in Figure 12-5. In the function splitRunAndWait, M tasks are spawned in task_group tg. But each spawn happens within a different isolated region.
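Figure 12-5 itself is not reproduced here; a hedged approximation of its structure, with doWork standing in for whatever work the tasks actually perform, is:

```cpp
#include <tbb/task_group.h>
#include <tbb/task_arena.h>

void doWork(int i);   // placeholder for the real work (illustrative)

void splitRunAndWait(int M, tbb::task_group& tg) {
  for (int i = 0; i < M; ++i) {
    // each task is spawned from inside its own isolated region
    tbb::this_task_arena::isolate([&tg, i]() {
      tg.run([i]() { doWork(i); });
    });
  }
  tg.wait();   // this wait is not itself inside an isolated region
}

void fig_12_5(int M) {
  tbb::task_group tg;
  splitRunAndWait(M, tg);   // called directly: no problem
}
```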
If we call function fig_12_5 directly, as is done in Figure 12-5, there is no problem. The call to tg.wait in splitRunAndWait is not inside of an isolated region itself, so the master thread and the worker threads can help with the different isolated regions and then move to other ones when they are finished.
But what if we change our main function to the one in Figure 12-6?
Now, the calls to splitRunAndWait are each made inside a different isolated region, and consequently the calls to tg.wait execute inside those isolated regions too. Each thread that calls tg.wait has to wait until its tg is finished, but it cannot steal any of the tasks that belong to its tg, or to any other task_group, because those tasks were spawned from different isolated regions! If M is large enough, we will likely wind up with all of our threads waiting in calls to tg.wait, with no threads left to execute any of the related tasks. So our application deadlocks.
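A hypothetical reconstruction of that problematic main, where the outer loop bound numCalls is illustrative rather than taken from the book, looks roughly like this:

```cpp
#include <tbb/parallel_for.h>

void fig_12_6(int numCalls, int M) {
  tbb::parallel_for(0, numCalls, [M](int) {
    tbb::this_task_arena::isolate([M]() {
      tbb::task_group tg;
      // tg.wait() inside splitRunAndWait now runs inside this isolated region,
      // yet the tasks it waits for were spawned in nested, different isolated
      // regions, so the waiting thread cannot help execute them.
      splitRunAndWait(M, tg);
    });
  });
}
```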
If we use an interface that separates spawns from waits, we can avoid this issue by making sure that we always wait in the same isolated region from which we spawn the tasks. We could, for example, rewrite the code from Figure 12-6 to move the call to run out into the outer region as shown in Figure 12-7.
Now, even if our main function uses a parallel loop and isolation, we no longer have a problem, since each thread that calls tg.wait will be able to execute the tasks from its tg:
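A sketch of that rewrite, approximating Figure 12-7, is shown below; keeping the per-task isolation inside the task body is an assumption about the original listing's intent. The key point is that tg.run is now called in the same region in which tg.wait executes:

```cpp
void splitRunAndWait(int M, tbb::task_group& tg) {
  for (int i = 0; i < M; ++i) {
    // run is now called in the same (outer) region where tg.wait executes,
    // so a thread blocked in tg.wait is allowed to steal these tasks
    tg.run([i]() {
      tbb::this_task_arena::isolate([i]() { doWork(i); });
    });
  }
  tg.wait();
}
```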
Even When It Is Safe, Work Isolation Is Not Free
In addition to potential deadlock issues, work isolation does not come for free from a performance perspective either, so even when it is safe to use, we need to use it judiciously. A thread that is not in an isolated region can choose any task when it steals, which means it can quickly pop the oldest task from a victim thread’s deque. If the victim has no tasks at all, it can also immediately pick another victim. However, tasks spawned in an isolated region, and their child tasks, are tagged to identify the isolated region they belong to. A thread that is executing in an isolated region must scan a chosen victim’s deque to find the oldest task that belongs to its isolated region – not just any old task will do. And the thread only knows that a victim has no tasks from its isolated region after scanning all of the available tasks and finding none from its region. Only then will it pick another victim to try to steal from. Threads stealing from within an isolated region have more overhead because they need to be pickier!