Anyone who has been around children (or who acts like a child) knows that sometimes the only way to stop children from bothering each other is to separate them. The same thing can be said of TBB tasks and algorithms. When tasks or algorithms just can’t get along, we can separate them using work isolation.

For example, when using nested parallelism, we need – in some limited situations – to create work isolation in order to ensure correctness. In this chapter, we walk through scenarios where this need arises and then provide a set of rules for determining when we need isolation for correctness. We also describe how the isolate function is used to create work isolation.

In other cases, we may want to create work isolation so that we can constrain where tasks execute for performance reasons by using explicit task arenas. Creating isolation in these cases is a double-edged sword. On one hand, we will be able to control things like the number of threads that will participate in different task arenas as a way to favor some tasks over others, or to use hooks in the TBB library to pin the threads to specific cores to optimize for locality. On the other hand, explicit task arenas make it more difficult for threads to participate in work outside of the arena they are currently assigned to. We discuss how to use class task_arena when we want to create isolation for performance reasons. We will also caution that while class task_arena can be used to create isolation to address correctness problems too, its higher overhead makes it less desirable for that purpose.

Work isolation is a valuable feature when required and used properly, but, as we will see throughout this chapter, it needs to be used cautiously.

Work Isolation for Correctness

The TBB scheduler is designed to keep worker threads, and their underlying cores, as busy as possible. If and when a worker thread becomes idle, it steals work from another thread so that it has something to do. When it steals, a thread is not aware of what parallel algorithm, loop or function originally created the task that it steals. Usually, where a task comes from is irrelevant, and so the best thing for the TBB library to do is to treat all available tasks equally and process them as quickly as possible.

However, if our application uses nested parallelism, the TBB library can steal tasks in a way that leads to an execution order that might not be expected by a developer. This execution order is not inherently dangerous; in fact, in most cases, it is exactly what we would like to happen. But if we make incorrect assumptions about how tasks may execute, we can create patterns that lead to unexpected or even disastrous results.

A small example that demonstrates this issue is shown in Figure 12-1. In the code, there are two parallel_for loops. In the body of the outer loop, a lock on mutex m is acquired. The thread that acquires this lock calls a second nested parallel_for loop while holding the lock. A problem arises if the thread that acquires the lock on m becomes idle before its inner loop is done; this can happen if worker threads steal away iterations but have not yet finished them when the master thread runs out of work. The master thread cannot simply exit the parallel_for, since it’s not done yet. To be efficient, this thread doesn’t just idly spin, waiting for the other threads to finish their work; who knows how long that could take? Instead, it keeps its current task on its stack and looks for additional work to keep itself busy until it can pick up where it left off. If this situation arises in Figure 12-1, there are two kinds of tasks in the system at the point that the thread is looking for work to steal – inner loop tasks and outer loop tasks. If the thread happens to steal and execute a task from the outer parallel_for, it will attempt to acquire a lock on m again. Since it already holds a lock on m, and a tbb::spin_mutex is not a recursive lock, there is a deadlock. The thread is trapped waiting for itself to release the lock!

Figure 12-1. Holding a lock while executing a nested parallel_for
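A minimal sketch of the pattern shown in Figure 12-1 follows; the function name fig_12_1 and the bounds N and M are placeholders, not the book’s exact code:

#include <tbb/tbb.h>

void fig_12_1(int N, int M) {
  tbb::spin_mutex m;
  tbb::parallel_for(0, N, [&m, M](int i) {
    // Acquire a non-recursive lock in the outer body...
    tbb::spin_mutex::scoped_lock lock(m);
    // ...and then run a nested parallel loop while holding it.
    // If this thread moonlights on another outer iteration while it waits
    // here, it will try to re-acquire m and deadlock.
    tbb::parallel_for(0, M, [](int j) { /* some work */ });
  });
}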

After seeing this example, two questions commonly arise: (1) does anyone really write code like this? And, (2) can a thread really wind up stealing a task from the outer loop? The answer to both of these questions is, unfortunately, yes.

People in fact do write code like this – almost always unintentionally though. One common way this pattern might arise is if a lock is held while a library function is called. A developer may assume they know what a function does, but if they are not familiar with its implementation, they can be wrong. If the library call contains nested parallelism, the case shown in Figure 12-1 can be the result.

And yes, work stealing can cause this example to deadlock. Figure 12-2 shows how our example might fall into this terrible state.

Figure 12-2. One potential execution of the task tree generated by the code in Figure 12-1

In Figure 12-2(a), thread t0 starts the outer loop and acquires the lock on m. Thread t0 then starts the nested parallel_for and executes the left half of its iteration space. While thread t0 is busy, three other threads t1, t2, and t3 participate in the execution of tasks in the arena. Threads t1 and t2 steal outer loop iterations and are blocked waiting to acquire the lock on m, which t0 currently holds. Meanwhile, thread t3 randomly selects t0 to steal from and starts executing the right half of its inner loop. This is where things start to get interesting. Thread t0 completes the left half of the inner loop’s iterations and therefore will steal work to prevent itself from becoming idle. At this point it has two options: (1) if it randomly chooses thread t3 to steal from, it will execute more of its own inner loop or (2) if it randomly chooses thread t1 to steal from, it will execute one of the outer loop iterations. Remember that by default, the scheduler treats all tasks equally, so it doesn’t prefer one over the other. Figure 12-2(b) shows the unlucky choice where it steals from thread t1 and becomes deadlocked trying to acquire the lock it already holds since its outer task is still on its stack.

Another example that shows correctness issues is shown in Figure 12-3. Again, we see a set of nested parallel_for loops, but instead of a deadlock, we get unexpected results because of the use of thread local storage. In each task, a value is written to a thread local storage location, local_i, an inner parallel_for loop is executed, and then the thread local storage location is read. Because of the inner loop, a thread may steal work if it becomes idle, write another value to the thread local storage location, and then return to the outer task.

Figure 12-3. Nested parallelism that can cause unexpected results due to the use of thread local storage
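A sketch of the pattern in Figure 12-3, assuming the thread local storage is a tbb::enumerable_thread_specific; the name local_i follows the text, and the rest is a placeholder:

#include <tbb/tbb.h>
#include <cstdio>

void fig_12_3(int N, int M) {
  tbb::enumerable_thread_specific<int> local_i;
  tbb::parallel_for(0, N, [&local_i, M](int i) {
    local_i.local() = i;   // write this iteration's index to thread local storage
    tbb::parallel_for(0, M, [](int j) { /* some work */ });
    // While waiting for the inner loop, this thread may moonlight on a
    // different outer iteration and overwrite its own thread local slot,
    // so the value read here is not guaranteed to still be i.
    if (local_i.local() != i)
      std::printf("unexpected: TLS value changed during nested parallelism\n");
  });
}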

The TBB development team uses the term moonlighting for situations in which a thread has unfinished child tasks in flight and steals unrelated tasks to keep itself busy. Moonlighting is usually a good thing! It means that our threads are not sitting around idle. It’s only in limited situations when things go awry. In both of our examples, there was a bad assumption. They both assumed – not surprisingly – that because TBB has a non-preemptive scheduler, the same thread could never be executing an inner task and then start executing an outer task before it completed the inner task. As we’ve seen, because a thread can steal work while it’s waiting in nested parallelism, this situation can in fact occur. This typically benign behavior is only dangerous if we incorrectly depend on the thread executing the tasks in a mutually exclusive way. In the first case, a lock was held while executing nested parallelism – allowing the thread to pause the inner task and pick up an outer task. In the second case, the thread accessed thread local storage before and after nested parallelism and assumed the thread would not moonlight in between.

As we can see, these examples are different but share a common misconception. In the blog “The Work Isolation Functionality in Intel Threading Building Blocks” that is listed in the “For More Information” section at the end of this chapter, Alexei Katranov provides a three-step checklist for deciding when work isolation is needed to ensure correctness:

  1. Is nested parallelism used (even indirectly, through third party library calls)? If not, isolation is not needed; otherwise, go to the next step.

  2. Is it safe for a thread to reenter the outer level parallel tasks (as if there was recursion)? Storing to a thread local value, re-acquiring a mutex already acquired by this thread, or other resources that should not be used by the same thread again can all cause problems. If reentrance is safe, isolation is not needed; otherwise, go to the next step.

  3. Isolation is needed. Nested parallelism has to be called inside an isolated region.

Creating an Isolated Region with this_task_arena::isolate

When we need isolation for correctness, we can use one of the isolate functions in the this_task_arena namespace.

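Roughly, the interface looks like the declaration below; exact signatures vary across TBB versions, and newer releases also provide an overload that returns the functor’s result:

namespace tbb {
  namespace this_task_arena {
    // Runs the functor f in an isolated region: while inside f, the calling
    // thread only steals tasks spawned within that same region.
    template<typename F>
    void isolate(const F& f);
  }
}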

Figure 12-4 shows how to use this function to add an isolated region around the nested parallel_for from Figure 12-1. Within an isolated region, if a thread becomes idle because it must wait – for example at the end of a nested parallel_for – it will only be allowed to steal tasks spawned from within its own isolated region. This fixes our deadlock problem, because if a thread steals while waiting at the inner parallel_for in Figure 12-4, it will not be allowed to steal an outer task.

Figure 12-4. Using the isolate function to prevent moonlighting in the case of nested parallelism
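A sketch of the fix, wrapping only the nested loop from the Figure 12-1 pattern in an isolated region (placeholder names and bounds as before):

#include <tbb/tbb.h>

void fig_12_4(int N, int M) {
  tbb::spin_mutex m;
  tbb::parallel_for(0, N, [&m, M](int i) {
    tbb::spin_mutex::scoped_lock lock(m);
    // The nested loop now runs in an isolated region: while waiting for it,
    // this thread may only steal tasks spawned inside this region, so it can
    // never pick up another outer iteration and try to re-acquire m.
    tbb::this_task_arena::isolate([M]() {
      tbb::parallel_for(0, M, [](int j) { /* some work */ });
    });
  });
}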

When a thread becomes blocked within an isolated region, it will still randomly choose a thread from its task arena to steal from, but now must inspect tasks in that victim thread’s deque to be sure it steals only tasks that originated from within its isolated region.

The main properties of this_task_arena::isolate are nicely summarized, again in Alexei’s blog, as follows:

  • The isolation only constrains threads that enter or join an isolated region. Worker threads outside of an isolated region can take any task including a task spawned in an isolated region.

  • When a thread without isolation executes a task spawned in an isolated region, it joins the region of this task and becomes isolated until the task is complete.

  • Threads waiting inside an isolated region cannot process tasks spawned in other isolated regions (i.e., all regions are mutually isolated). Moreover, if a thread within an isolated region enters a nested isolated region, it cannot process tasks from the outer isolated region.

Oh No! Work Isolation Can Cause Its Own Correctness Issues!

Unfortunately, we can’t just indiscriminately apply work isolation. There are performance implications, which we will get to later, but more importantly, work isolation itself can cause deadlock if used incorrectly! Here we go again…

In particular, we have to be extra careful when we mix work isolation with TBB interfaces that separate spawning tasks from waiting for tasks – such as task_group and flow graphs. A task that calls a wait interface in one isolated region cannot participate in tasks spawned in a different isolated region while it waits. If enough threads get stuck in such a position, the application might run out of threads and forward progress will stop.

Let’s consider the example function shown in Figure 12-5. In the function splitRunAndWait, M tasks are spawned in task_group tg. But each spawn happens within a different isolated region.

Figure 12-5. A function that calls run and wait on task_group tg. The call to run is made from within an isolated region.
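A sketch of what splitRunAndWait might look like, based on the description above; the task bodies and bounds are placeholders:

#include <tbb/tbb.h>

void splitRunAndWait(int M) {
  tbb::task_group tg;
  for (int i = 0; i < M; ++i) {
    // Each run() happens inside its own isolated region, so each spawned
    // task is tagged with a different isolation.
    tbb::this_task_arena::isolate([&tg]() {
      tg.run([]() { /* some work */ });
    });
  }
  tg.wait();   // the wait itself is not inside an isolated region here
}

void fig_12_5(int M) {
  splitRunAndWait(M);   // called directly, with no isolation around it
}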

If we call function fig_12_5 directly, as is done in Figure 12-5, there is no problem. The call to tg.wait in splitRunAndWait is not inside of an isolated region itself, so the master thread and the worker threads can help with the different isolated regions and then move to other ones when they are finished.

But what if we change our main function to the one in Figure 12-6?

Figure 12-6. A main function that calls splitRunAndWait, with each call made from within a different isolated region
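A sketch of such a main-level function, reusing the splitRunAndWait sketch above; the name fig_12_6 and the bounds are placeholders:

#include <tbb/tbb.h>

void splitRunAndWait(int M);   // see the sketch after Figure 12-5

void fig_12_6(int N, int M) {
  // Each call to splitRunAndWait now happens inside its own isolated region,
  // so the tg.wait inside it is also inside that region. A waiting thread
  // cannot steal tasks spawned in other isolated regions, so if enough
  // threads get stuck in tg.wait, the application deadlocks.
  tbb::parallel_for(0, N, [M](int i) {
    tbb::this_task_arena::isolate([M]() {
      splitRunAndWait(M);
    });
  });
}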

Now, the calls to splitRunAndWait are each made inside of different isolated regions, and subsequently the calls to tg.wait are made in those isolated regions. Each thread that calls tg.wait has to wait until its tg is finished but cannot steal any of the tasks that belong to its tg or any other task_group, because those tasks were spawned from different isolated regions! If M is large enough, we will likely wind up with all of our threads waiting in calls to tg.wait, with no threads left to execute any of the related tasks. So our application deadlocks.

If we use an interface that separates spawns from waits, we can avoid this issue by making sure that we always wait in the same isolated region from which we spawn the tasks. We could, for example, rewrite the code from Figure 12-6 to move the call to run out into the outer region as shown in Figure 12-7.

Figure 12-7. A function that calls run and wait on task_group tg. The calls to run and wait are now both made outside of the isolated region.
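A sketch of the corrected function, here called splitRunAndWaitFixed (a placeholder name): run and wait are issued from the same, non-isolated context, and any isolation that is still wanted is moved inside the task body:

#include <tbb/tbb.h>

void splitRunAndWaitFixed(int M) {
  tbb::task_group tg;
  for (int i = 0; i < M; ++i) {
    // run() is no longer wrapped in an isolated region; only the work
    // inside each task is isolated.
    tg.run([]() {
      tbb::this_task_arena::isolate([]() { /* some work */ });
    });
  }
  tg.wait();   // wait in the same context that spawned the tasks
}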

Now, even if our main function uses a parallel loop and isolation, we no longer have a problem, since each thread that calls tg.wait will be able to execute the tasks from its tg.
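For example, a caller along these lines (again a sketch, with placeholder names) stays deadlock-free because each waiting thread spawned its tasks from the same isolated context it waits in:

#include <tbb/tbb.h>

void splitRunAndWaitFixed(int M);   // see the sketch after Figure 12-7

void fig_12_7_caller(int N, int M) {
  tbb::parallel_for(0, N, [M](int i) {
    tbb::this_task_arena::isolate([M]() {
      // The tasks spawned by tg.run inside splitRunAndWaitFixed carry this
      // thread's current isolation, and tg.wait is called from that same
      // isolation, so the waiting thread can always help execute them.
      splitRunAndWaitFixed(M);
    });
  });
}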

Even When It Is Safe, Work Isolation Is Not Free

In addition to potential deadlock issues, work isolation does not come for free from a performance perspective either, so even when it is safe to use, we need to use it judiciously. A thread that is not in an isolated region can choose any task when it steals, which means it can quickly pop the oldest task from a victim thread’s deque. If the victim has no tasks at all, it can also immediately pick another victim. However, tasks spawned in an isolated region, and their children tasks, are tagged to identify the isolated region they belong to. A thread that is executing in an isolated region must scan a chosen victim’s deque to find the oldest task that belongs to its isolated region – not just any old task will do. And the thread only knows if a victim thread has no tasks from its isolated region after scanning all of the available tasks and finding none from its region. Only then will it pick another victim to try to steal from. Threads stealing from within an isolated region have more overhead because they need to be pickier!

Using Task Arenas for Isolation: A Double-Edged Sword

Work isolation restricts a thread’s options when it looks for work to do. We can isolate work using the isolate function as described in the previous section, or we can use class task_arena. The subset of the class task_arena interface relevant to this chapter is shown in Figure 12-8.

Figure 12-8. A subset of the class task_arena public interface
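An abbreviated, approximate declaration of that subset is shown below; the exact signatures (return types, reference qualifiers, default values) differ between TBB releases:

namespace tbb {

class task_arena {
public:
  static const int automatic;   // special "let TBB decide" value

  // max_concurrency: total number of thread slots in the arena;
  // reserved_for_masters: slots that only master (application) threads may fill.
  task_arena(int max_concurrency = automatic,
             unsigned reserved_for_masters = 1);

  // Join the arena if a slot is available and run f there; otherwise the
  // work is enqueued into the arena.
  template<typename F> void execute(F&& f);

  // Submit f for execution in the arena without waiting for it.
  template<typename F> void enqueue(F&& f);

  void initialize();
  void terminate();
  int max_concurrency() const;
};

} // namespace tbb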

It almost never makes sense to use class task_arena instead of the isolate function to create isolation solely to ensure correctness. That said, there are still important uses for class task_arena. Let’s look at the basics of class task_arena and, while doing so, uncover its strengths and weaknesses.

With the task_arena constructor, we can set the total number of slots for threads in the arena using the max_concurrency argument and the number of those slots that are reserved exclusively for master threads using the reserved_for_masters argument. More details on how task_arena can be used to control the number of threads used by computations are provided in Chapter 11.

Figure 12-9 shows a small example where a single task_arena ta2 is created, with max_concurrency=2, and a task that executes a parallel_for is executed in that arena.

Figure 12-9. A task_arena that has a maximum concurrency of 2
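A sketch of the example, omitting the instrumentation that counts participating threads and reports the number of logical cores (the bound N is a placeholder):

#include <tbb/tbb.h>

void fig_12_9(int N) {
  // An explicit arena with at most two thread slots (max_concurrency == 2).
  tbb::task_arena ta2(2);
  // The calling thread joins ta2 (or enqueues the work if no slot is free)
  // and runs the parallel_for there; at most one worker thread can help.
  ta2.execute([N]() {
    tbb::parallel_for(0, N, [](int i) { /* some work */ });
  });
}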

When a thread calls a task_arena’s execute method, it tries to join the arena as a master thread. If there are no available slots, it enqueues the task into the task arena. Otherwise, it joins the arena and executes the task in that arena. In Figure 12-9, the thread will join task_arena ta2, start the parallel_for, and then participate in executing tasks from the parallel_for. Since the arena has a max_concurrency of 2, at most one additional worker thread can join in and participate in executing tasks in that task arena. If we execute the instrumented example from Figure 12-9, available on GitHub, we see

There are 4 logical cores.
2 threads participated in ta2

Already we can start to see differences between isolate and class task_arena. It is true that only threads in ta2 will be able to execute tasks in ta2, so there is work isolation, but we were also able to set the maximum number of threads that can participate in executing the nested parallel_for.

Figure 12-10 takes this a step further by creating two task arenas, one with a max_concurrency of 2 and the other with a max_concurrency of 6. A parallel_invoke is then used to create two tasks, one that executes a parallel_for in ta2 and another that executes a parallel_for in ta6. Both parallel_for loops have the same number of iterations and spin for the same amount of time per iteration.

Figure 12-10. Using two task_arena objects to use six threads for one loop and two for another
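A sketch of the two-arena version, omitting the timing and thread-counting instrumentation (the bound N and the per-iteration spin work are placeholders):

#include <tbb/tbb.h>

void fig_12_10(int N) {
  tbb::task_arena ta2(2);   // at most 2 threads
  tbb::task_arena ta6(6);   // at most 6 threads
  // Run one loop in each arena concurrently; the loop in ta6 can use up to
  // three times as many threads, so we expect it to finish sooner.
  tbb::parallel_invoke(
    [&ta2, N]() {
      ta2.execute([N]() {
        tbb::parallel_for(0, N, [](int i) { /* some work */ });
      });
    },
    [&ta6, N]() {
      ta6.execute([N]() {
        tbb::parallel_for(0, N, [](int i) { /* some work */ });
      });
    });
}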

We have effectively divided up our eight threads into two groups, letting two of the threads work on the parallel_for in ta2 and six of the threads work on the parallel_for in ta6. Why would we do this? Perhaps we think the work in ta6 is more critical.

If we execute the code in Figure 12-10 on a platform with eight hardware threads, we will see output similar to

ta2_time == 0.500409
ta6_time == 0.169082
There are 8 logical cores.
2 threads participated in ta2
6 threads participated in ta6

This is the key difference between using isolate and task_arena to create isolation. When using task_arena, we are almost always more concerned with controlling the threads that participate in executing the tasks, rather than in the isolation itself. The isolation is not created for correctness but instead for performance. An explicit task_arena is a double-edged sword – it lets us control the threads that participate in the work but also builds a very high wall between them. When a thread leaves an isolated region created by isolate, it is free to participate in executing any of the other tasks in its arena. When a thread runs out of work to do in an explicit task_arena, it must travel back to the global thread pool and then find another arena that has work to do and has open slots.

Note

We just offered a KEY rule of thumb: Use isolate primarily to aid in correctness; use task_arenas primarily for performance.

Let’s consider our example in Figure 12-10 again. We created more slots in task_arena ta6. As a result, the parallel_for in ta6 completed much faster than the parallel_for in ta2. But after the work is done in ta6, the threads assigned to that arena return to the global thread pool. They are now idle but unable to help with the work in ta2 – the arena has only two slots for threads and they are already full!

The class task_arena abstraction is very powerful, but the high wall it creates between threads limits its practical applications. Chapter 11 discusses in more detail how class task_arena can be used alongside class task_scheduler_init and class global_control to control the number of threads that are available to specific parallel algorithms in a TBB application. Chapter 20 shows how we can use task_arena objects to partition work and schedule the work on specific cores in a Non-Uniform Memory Access (NUMA) platform to tune for data locality. In both chapters, we will see that task_arena is very useful but has drawbacks.

Don’t Be Tempted to Use task_arenas to Create Work Isolation for Correctness

In the specific use cases described in Chapters 11 and 20, the number of threads and even their placement on to particular cores are tightly controlled – and therefore we want to have different threads in the different arenas. In the general case though, the need for task_arena objects to manage and migrate threads just creates overhead.

As an example, let’s again look at a nested set of parallel_for loops, but now without a correctness problem. We can see the code and a possible task tree in Figure 12-11. If we execute this set of loops, all of the tasks will be spawned into the same task arena. When we used isolate in the previous section, all of the tasks were still kept in the same arena, but threads isolated themselves by inspecting tasks before they stole them to make sure they were allowed to take them according to isolation constraints.

Figure 12-11. An example of two nested parallel_for loops: (a) the source code and (b) the task tree
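A sketch of the loops in part (a) of the figure (placeholder bounds M and N):

#include <tbb/tbb.h>

void fig_12_11(int M, int N) {
  // Plain nested parallelism: all tasks go into the same arena and any
  // thread may steal any of them. There are no locks and no thread local
  // state, so no isolation is needed for correctness.
  tbb::parallel_for(0, M, [N](int i) {
    tbb::parallel_for(0, N, [](int j) { /* some work */ });
  });
}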

Now, let’s modify this simple nested loop example to create isolation using explicit task arena objects. If we want each thread that executes an iteration in the outer loop to only execute tasks from its own inner loop, which we easily achieved by using isolate in Figure 12-4, we can create local nested explicit task_arena instances within each outer body as shown in Figure 12-12(a) and Figure 12-12(b).

Figure 12-12. Creating an explicit task_arena for each outer loop body execution. Now, while executing in the inner arena, threads will be isolated from the outer work and unrelated inner loops.
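A sketch of this per-iteration arena approach (again with placeholder bounds); it is shown here to make the overhead discussion concrete, not as a recommendation:

#include <tbb/tbb.h>

void fig_12_12(int M, int N) {
  tbb::parallel_for(0, M, [N](int i) {
    // A brand-new arena per outer iteration: it must be created, populated
    // with worker threads that migrate through the global pool, and then
    // destroyed when the iteration finishes.
    tbb::task_arena nested;
    nested.execute([N]() {
      tbb::parallel_for(0, N, [](int j) { /* some work */ });
    });
  });
}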

If M == 4, there will be a total of five arenas, and when each thread calls nested.execute, it will be isolated from outer loop tasks as well as unrelated inner loop tasks. We have created a very elegant solution, right?

Of course not! Not only are we creating, initializing, and destroying several task_arena objects, but these arenas also need to be populated with worker threads. As described in Chapter 11, worker threads fill task arenas in proportion to the number of slots they have. If we have a system with four hardware threads, each arena will only get one thread! What’s the point in that? If we have more threads, they will be evenly divided among the task arenas. As each inner loop finishes, its threads will return to the global thread pool and then migrate to another task arena that has not yet finished. This is not a cheap operation!

Having many task arenas and migrating threads between them is simply not an efficient way to do load balancing. Our toy example in Figure 12-12(b) is shown with only four outer iterations; if there were many iterations, we would create and destroy task_arenas in each outer task. Our four worker threads would scramble around from task arena to task arena looking for work! Stick with the isolate function for these cases!

Summary

We have now learned how to separate TBB tasks and algorithms when they just can’t get along. We saw that nested parallelism combined with the way that stealing occurs in TBB can lead to dangerous situations if we are not careful. We then saw that the this_task_arena::isolate function can be used to address these situations, but it too must be used carefully or else we can create new problems.

We then discussed how we can use class task_arena when we want to create isolation for performance reasons. While class task_arena can be used to create isolation to address correctness, its higher overheads make it less desirable for that purpose. However, as we see in Chapters 11 and 20, class task_arena is an essential part of our toolbox when we want to control the number of threads used by an algorithm or to control the placement of threads on to cores.

For More Information

Alexei Katranov, “The Work Isolation Functionality in Intel Threading Building Blocks (Intel TBB),” https://software.intel.com/en-us/blogs/2018/08/16/the-work-isolation-functionality-in-intel-threading-building-blocks-intel-tbb.