Reference Work Entry

Encyclopedia of Parallel Computing

pp 2062-2071

Trace Scheduling

  • Stefan M. Freudenberger, Freudenberger Consulting

Definition

Trace scheduling is a global acyclic instruction scheduling technique in which the scheduling region consists of a linear acyclic sequence of basic blocks embedded in the control flow graph. Trace scheduling differs from other global acyclic scheduling techniques by allowing the scheduling region to be entered after the first instruction.

Trace scheduling was the first global instruction scheduling technique that was proposed and successfully implemented in both research and commercial compilers. By demonstrating that simple microcode operations could be statically compacted and scheduled on multi-issue hardware, trace scheduling provided the basis for making large amounts of instruction-level parallelism practical. Its first commercial implementation demonstrated that commercial codes could be statically compiled for multi-issue architectures, and thus greatly influenced and contributed to the performance of superscalar architectures. Today, the ideas of trace scheduling and its descendants are implemented in most compilers.

Discussion

Introduction

Global scheduling techniques are needed for processors that expose instruction-level parallelism (ILP), that is, processors that allow multiple operations to execute simultaneously. This situation may independently arise for two reasons: either because a processor issues more than a single operation during each clock cycle, or because a processor allows issuing independent operations while deeply pipelined operations are still executing. The number of independent operations that need to be found for an ILP processor is a function of both the number of operations issued per clock cycle, and the latency of operations, whether computational or memory. The latency of computational operations depends upon the design of the functional units. The latency of memory operations depends upon the design and latencies of caches and main memory, as well as on the availability of prefetch and cache-bypassing operations. Global scheduling techniques are needed for these processors because the number of independent operations available in a typical basic block is too small to fully utilize their available hardware resources. By expanding the scheduling region, more operations become available for scheduling. Global scheduling techniques differ from other global code motion techniques (such as loop-invariant code motion or partial redundancy elimination) because they take into account the available hardware resources (such as available functional units and operation issue slots).

Instruction scheduling techniques can be broadly classified based on the region that they schedule, and whether this region is cyclic or acyclic. Algorithms that schedule only single basic blocks are known as local scheduling algorithms; algorithms that schedule multiple basic blocks at once are known as global scheduling algorithms. Global scheduling algorithms that operate on entire loops of a program are known as cyclic scheduling algorithms, while methods that impose a scheduling barrier at the end of a loop body are known as acyclic scheduling algorithms. Global scheduling regions include regions consisting of a single basic block as a “degenerate” form of region, and acyclic schedulers may consider entire loops but, unlike cyclic schedulers, stop at the loops’ back edges (a back edge points to an ancestor in a depth-first traversal of the control flow graph; it captures the flow from one iteration of the loop to the start of the next iteration).

All scheduling algorithms can benefit from hardware support. When control-dependent operations that can cause side effects move above their controlling branch, they need to be either executed conditionally so that their effects only arise if the operation is executed in the original program order, or any side effects must be delayed until the point at which the operation would have been executed originally.

Hardware techniques to support this include predication of operations, implicit or explicit register renaming, and mechanisms to suppress or delay exceptions so that an incorrect exception is not signaled. Predication of operations controls, through an additional predicate operand, whether the side effects of the predicated operations become visible to the program state. The predicate operand can be implicit (such as the conditional execution of operations in branch delay slots depending on the outcome of the branch condition) or explicit (through an additional machine register operand); in the latter case, the predicate operand could simply be the same predicate that controls the conditional branch on which the operation was control-dependent in the original flow graph (in which case the predicated operation could move just above a single conditional branch). Register renaming refers to the technique where additional machine registers are used to hold the results of an operation until the point where the operation would have occurred in the original program order.

Global scheduling algorithms principally consist of two phases: region formation and schedule construction. Algorithms differ in the shape of the region and the global code motions permitted during scheduling. Depending on the region and the allowed code motions, compensation code may need to be inserted at appropriate places in the control flow graph to maintain the original program semantics; depending on the code motions allowed during scheduling, this compensation code may have to be inserted during the scheduling phase of the compiler itself.

Trace scheduling allows traces to be entered after the first operation and exited before the last operation. This complicates the determination of compensation code because the location of rejoin points cannot be determined before a trace has been scheduled. This leads to the following overall trace scheduling loop:

while (unscheduled operations remain)
{
    select trace T
    construct schedule for T
    bookkeeping:
        determine rejoin points to T
        generate compensation code
}

The remainder of this entry first discusses region formation and schedule construction in general and as it applies to trace scheduling, and then compares trace scheduling to other acyclic global scheduling techniques. Cyclic scheduling algorithms are discussed elsewhere.

Region Formation – Trace Picking

Traces were the first global scheduling region proposed, and represent contiguous linear paths through the code (Fig. 1). More formally, a trace consists of the operations of a sequence of basic blocks B_0, B_1, …, B_n with the properties that:

  • Each basic block is a predecessor of the next in the sequence (i.e., for each k = 0, …, n−1, B_k is a predecessor of B_{k+1}, and B_{k+1} is a successor of B_k in the control flow graph).

  • For any j, k there is no path B_j → B_k → B_j except for those that include B_0 (i.e., the code is cycle free, except that the entire region can be part of some encompassing loop).

Trace Scheduling. Fig. 1

Trace selection. The left diagram shows the selected trace. The right diagram illustrates the mutual-most-likely trace picking heuristic: assume that A is the last operation of the current trace, and that B is one of A’s successors. Here B is the most likely successor of A, and A is the most likely predecessor of B

Note that this definition does not exclude forward branches within the region, nor control flow that leaves the region and reenters it at a later point. This generality has been controversial in the research community because many felt that the added complexity of its implementation was not justified by its added benefit and has led to several alternative approaches that are discussed below.

Of the many ways in which one can form traces, the most popular is the following simple algorithm:
  • Pick the as-yet unscheduled operation with the largest expected execution frequency as the seed operation of the trace.

  • Grow the trace both forward in the direction of the flow graph as well as backward, picking the mutually most-likely successor (predecessor) operation to the currently last (first) operation on the trace.

  • Stop growing a trace when either no mutually most-likely successor (predecessor) exists, or when some heuristic trace length limit has been reached.

The mutually most-likely successor S of an operation P is the operation with the properties that:
  • S is the most likely successor of P;

  • P is the most likely predecessor of S.

For this definition, it is immaterial whether the likelihood that S follows P (P precedes S) is based on available profile data collected during earlier runs of the program, has been determined by a synthetic profile, or is based on source annotations in the program. Of course, the more benefit is derived from having picked the correct trace, the greater is the penalty when picking the wrong trace.
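The mutual-most-likely heuristic described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular compiler: the function name `pick_traces`, the dictionary-based flow graph, and the `max_len` limit are assumptions made for the example, and the input is assumed to have its back edges already removed so that traces stay acyclic. Nodes may stand for operations or for whole basic blocks.

```python
from collections import defaultdict


def pick_traces(edge_count, freq, max_len=16):
    """Partition flow-graph nodes into traces (hypothetical helper).

    edge_count: {(pred, succ): expected execution count}, back edges omitted
    freq:       {node: expected execution frequency}
    """
    succs, preds = defaultdict(dict), defaultdict(dict)
    for (p, s), count in edge_count.items():
        succs[p][s] = count
        preds[s][p] = count

    def most_likely(neighbors):
        return max(neighbors, key=neighbors.get) if neighbors else None

    def mutual_successor(p):
        s = most_likely(succs[p])
        return s if s is not None and most_likely(preds[s]) == p else None

    def mutual_predecessor(s):
        p = most_likely(preds[s])
        return p if p is not None and most_likely(succs[p]) == s else None

    unscheduled = set(freq)
    traces = []
    while unscheduled:
        # Seed: the hottest node not yet on a trace (name order breaks ties).
        seed = max(sorted(unscheduled), key=freq.get)
        unscheduled.remove(seed)
        trace = [seed]
        while len(trace) < max_len:          # grow forward
            nxt = mutual_successor(trace[-1])
            if nxt not in unscheduled:       # also stops when nxt is None
                break
            trace.append(nxt)
            unscheduled.remove(nxt)
        while len(trace) < max_len:          # grow backward
            prv = mutual_predecessor(trace[0])
            if prv not in unscheduled:
                break
            trace.insert(0, prv)
            unscheduled.remove(prv)
        traces.append(trace)
    return traces


# A diamond: A branches to B (likely) or C (unlikely); both rejoin at D.
edges = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
freq = {"A": 100, "B": 90, "C": 10, "D": 100}
traces = pick_traces(edges, freq)   # [["A", "B", "D"], ["C"]]
```

In this example the hot path A, B, D forms the first trace, and the off-trace block C is left to form a trace of its own.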

Trace picking is the region formation technique used for trace scheduling. Other acyclic region formation techniques and their relationship to trace scheduling are discussed below.

Region Enlargement

Trace selection alone typically does not expose enough ILP for the instruction scheduler of a typical ILP processor. Once the limit on the length of a “natural” trace has been reached (e.g., the entire loop body), region-enlargement techniques can be employed to further increase the size of the region, albeit at the cost of a larger code size for the program. Many enlargement techniques exploit the fact that programs iterate and grow the size of a region by making extra copies of highly iterated code, leading to a larger region that contains more ILP.

These code-replicating techniques have been criticized by advocates of other approaches, such as cyclic scheduling and loop-level parallel processing, because comparable benefits to larger schedule regions may be found using other techniques. However, no study appears to exist that quantifies such claims.

The simplest and oldest region-enlargement technique is loop unrolling (Fig. 2): to unroll a loop, duplicate its body several times and change the targets of the back edges of each copy but the last to point to the header of the next copy (so that the back edges of the last copy point back to the loop header of the first copy). Variants of loop unrolling include pre-/post-conditioning of a loop by k for counted (for) loops with unknown loop bounds (leading to two loops: a “fixup loop” that executes up to k iterations, and a “main loop” that is unrolled by k and has its internal exits removed; the fixup loop can precede or follow the main loop), and loop peeling by the expected small iteration count. When the iteration count of the fixup loop of a pre-/post-conditioned loop is small (which it typically is), the fixup loop is completely unrolled.
Trace Scheduling. Fig. 2

Simplified illustration of variants of loop unrolling. “if” and “goto” represent the loop control operations; “body” represents the part of the loop without loop-related control flow. In the general case (e.g., a while loop) the loop exit tests remain inside the loop. This is shown in the second column (“unrolled by 4”). For counted loops (i.e., for loops), the compiler can condition the unrolled loop so that the loop conditions can be removed from the main body of the loop. Two variants of this are shown in the two rightmost columns. Modern compilers will typically precede the loop with a zero trip count test and place the loop condition at the bottom of the loop. This removes the unconditional branch from the loop

Typically, loop unrolling is done before region formation so that the enlarged region becomes available to the region selector. This is done to keep the region selector simpler but may lead to phase-ordering issues, as loop unrolling has to guess the “optimal” unroll amount. At the same time, when loops are unrolled before region formation then the resulting code can be scalar optimized in the normal fashion; in particular height-reducing transformations that remove dependences between the individual copies of the unrolled loop body can expose a larger amount of parallelism between the individual iterations (Fig. 3). Needless to say, if no parallelism between the iterations exists or can be found, loop unrolling is ineffective.
Trace Scheduling. Fig. 3

Typical induction variable manipulations for loops. Downward arrows represent flow dependences; upward arrows represent anti dependences. Only the critical dependences are shown

Loop unrolling in many industrial compilers is often rather effective because a heuristically determined small amount of unrolling is sufficient to fill the resources of the target machine.
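As a source-level illustration of pre-conditioning, the following Python sketch unrolls a counted summation loop by 4. The function names are hypothetical, and a compiler would perform this transformation on its intermediate representation rather than on source code. The fixup loop runs first, so the main loop needs no internal exit tests; the induction variable is updated once per unrolled iteration and each copy of the body uses a constant offset, which removes the dependence chain through `i` between the copies.

```python
def total_rolled(a):
    total = 0
    i = 0
    while i < len(a):
        total += a[i]
        i += 1
    return total


def total_unrolled_by_4(a):
    n = len(a)
    total = 0
    i = 0
    # Fixup loop (pre-conditioning): executes n % 4 iterations.
    while i < n % 4:
        total += a[i]
        i += 1
    # Main loop: the remaining trip count is a multiple of 4, so the
    # exit tests between the four copies of the body can be removed.
    while i < n:
        total += a[i]
        total += a[i + 1]   # offsets replace three increments of i
        total += a[i + 2]
        total += a[i + 3]
        i += 4              # single induction-variable update
    return total
```

The accumulation into `total` remains a serial chain in this sketch; exposing parallelism there would require a further height-reducing transformation such as separate partial sums.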

Region Compaction – Instruction Scheduler

Once the scheduling region has been selected, the instruction scheduler assigns functional units of the target machine and time slots in the instruction schedule to each operation of the region. In doing so, the scheduler attempts to minimize an objective cost function while maintaining program semantics and obeying the resource limitations of the target architecture. Often, the objective cost function is the expected execution time, but other objective functions are possible (for example, code size and energy efficiency could be part of an objective function).

The semantics of a program defines certain sequential constraints or dependences that must be maintained by a valid execution. These dependences preclude some reordering of operations within a program. The data flow of a program imposes data dependences, and the control flow of a program imposes control dependences. (Note the difference between control flow and control dependence: block B is control dependent on block A if A precedes B along some path, but B does not post-dominate A. In other words, the result of the control decision made in A directly affects whether or not B is executed.)

There are three types of data dependences: read-after-write dependences (also called RAW, flow, or true dependences), write-after-read dependences (also called WAR or anti dependences), and write-after-write dependences (also called WAW or output dependences). The latter two types are also called false dependences because they can be removed by renaming.
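The three kinds of data dependence can be illustrated with a small Python sketch. It assumes a simplified model in which each operation is described only by the sets of registers it defines and uses; memory dependences are ignored, and the function name `dependences` is chosen for this example only.

```python
def dependences(first, second):
    """Kinds of data dependence from `first` to `second`.

    Each operation is a pair (defs, uses) of register-name sets, and
    `first` precedes `second` in the original program order.
    """
    defs1, uses1 = first
    defs2, uses2 = second
    kinds = set()
    if defs1 & uses2:
        kinds.add("RAW")   # flow (true) dependence
    if uses1 & defs2:
        kinds.add("WAR")   # anti dependence
    if defs1 & defs2:
        kinds.add("WAW")   # output dependence
    return kinds


add = ({"r1"}, {"r2", "r3"})     # r1 = r2 + r3
mul = ({"r4"}, {"r1"})           # r4 = r1 * 2
mov = ({"r2"}, set())            # r2 = 5
clr = ({"r1"}, set())            # r1 = 0
mov_renamed = ({"r9"}, set())    # r9 = 5  (destination renamed)

raw = dependences(add, mul)           # {"RAW"}
war = dependences(add, mov)           # {"WAR"}
waw = dependences(add, clr)           # {"WAW"}
none = dependences(add, mov_renamed)  # set(): renaming removed the WAR
```

The last case shows why the anti and output dependences are called false dependences: giving `mov` a fresh destination register removes the constraint, provided later uses of the value are renamed consistently.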

There are two types of control dependences: split dependences may prevent operations from moving below the exit of a basic block, and join dependences may prevent operations from moving above the entrance to a basic block. Control dependence does not constrain the relative order of operations within a basic block but rather expresses constraints on moving operations between basic blocks.

Both data and control dependences represent ordering constraints on the program execution, and hence induce a partial ordering on the operations. Any partial ordering can be represented as a directed acyclic graph (DAG), and DAGs are indeed often used by scheduling algorithms. Variants to the simple DAG are the data dependence graph (DDG), and the program dependence graph (PDG). All these graphs represent operations as nodes and dependences as edges (some graphs only express data dependences, while others include both data and control dependences).

Code Motion Between Adjacent Blocks

Two fundamental techniques, predication and speculation, are employed by schedulers (or earlier phases) to transform or remove control dependence. While it is sometimes possible to employ either technique, they represent independent techniques, and usually one is more natural to employ in a given situation. Speculation is used to move operations above a branch that is highly weighted in one direction; predication is used to collapse short sequences of alternative operations following a branch that is nearly equally likely in each direction. Predication can also play an important role in software pipelining.

Speculative code motion (or code hoisting and sometimes code sinking) moves operations above control-dominating branches (or below joins for sinking). In principle, this transformation does not always maintain the original program semantics, and in particular it may change the exception behavior of the program. If an operation may generate an exception and the exception recovery model does not allow speculative exceptions to be dismissed (ignored), then the compiler must generate recovery code that raises the exception at the original program point of the speculated operation. Unlike predication, speculation actually removes control dependences, and thus potentially reduces the length of the critical path of execution. Depending on the shape and size of recovery code, and if multiple operations are speculated, the addition of recovery code can lead to a substantial amount of code.

Predication is a technique in which, with hardware support, operations take an additional input operand, the predicate operand, which determines whether any effects of executing the operation are seen by the program. Thus, from an execution point of view, the operation is conditionally executed under the control of the predicate input. Hence, replacing a control-dependent operation with its predicated equivalent, whose predicate is equivalent to the condition of the control dependence, turns the control dependence into a data dependence.
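The effect of predication can be modeled in Python as follows. This is only a behavioral sketch under stated assumptions: `predicated_op` is a hypothetical stand-in for a hardware-predicated operation, and the example covers the simple case of two short alternatives after a branch.

```python
def branching(cond, a, b):
    # Original form: x is control dependent on the branch.
    if cond:
        x = a + b
    else:
        x = a - b
    return x


def predicated_op(pred, compute, old):
    """Model of one predicated operation (hypothetical helper).

    The operation always executes, but its result only becomes visible
    when the predicate is true; otherwise the old value is kept.
    """
    result = compute()
    return result if pred else old


def if_converted(cond, a, b):
    # After if-conversion there is no branch: both operations are issued,
    # and each depends on its predicate as an ordinary data input.
    p = bool(cond)
    q = not p
    x = None
    x = predicated_op(p, lambda: a + b, x)
    x = predicated_op(q, lambda: a - b, x)
    return x
```

Both versions compute the same result, but in the second one the scheduler sees only data dependences on `p` and `q`, at the price of issuing the operations of both alternatives.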

Trace Compaction

There are many different scheduling techniques, which can broadly be classified by features into cycle versus operation scheduling, linear versus graph-based, cyclic versus acyclic, and greedy versus backtracking. However, for trace scheduling itself the scheduling technique employed is not of major concern; rather, trace scheduling distinguishes itself from other global acyclic scheduling techniques by the way the scheduling region is formed, and by the kind of code motions permitted during scheduling. Hence these techniques will not be described here, and in the following, a greedy graph-based technique, namely list scheduling, will be used.
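For orientation, a greedy cycle-by-cycle list scheduler can be sketched as follows. The sketch makes several simplifying assumptions that are not part of the entry: operations are given in an order that is already a topological order of the dependence DAG, the only resource constraint is an issue width of `width` operations per cycle (functional-unit assignment is not modeled), and the priority function is the latency-weighted height of an operation in the DAG.

```python
def list_schedule(ops, deps, latency, width):
    """Greedy list scheduling sketch; returns {op: issue cycle}.

    ops:     operation names in source order (a topological order)
    deps:    {op: set of operations it depends on}
    latency: {op: cycles until its result is available}
    width:   maximum number of operations issued per cycle
    """
    succs = {o: [] for o in ops}
    for o in ops:
        for p in deps.get(o, ()):
            succs[p].append(o)

    # Priority: longest latency-weighted path from the op to the DAG's end.
    height = {}
    for o in reversed(ops):
        height[o] = latency[o] + max((height[s] for s in succs[o]), default=0)

    issue = {}
    remaining = list(ops)
    cycle = 0
    while remaining:
        # An op is ready once all of its predecessors have completed.
        ready = [o for o in remaining
                 if all(p in issue and issue[p] + latency[p] <= cycle
                        for p in deps.get(o, ()))]
        ready.sort(key=lambda o: -height[o])   # stable: ties keep source order
        for o in ready[:width]:
            issue[o] = cycle
            remaining.remove(o)
        cycle += 1
    return issue


ops = ["ld1", "ld2", "add", "mul", "st", "inc"]
deps = {"add": {"ld1", "ld2"}, "mul": {"add"}, "st": {"mul"}}
latency = {"ld1": 2, "ld2": 2, "add": 1, "mul": 2, "st": 1, "inc": 1}
schedule = list_schedule(ops, deps, latency, width=2)
# {"ld1": 0, "ld2": 0, "inc": 1, "add": 2, "mul": 3, "st": 5}
```

In the example, the independent operation `inc` fills the cycle in which the scheduler would otherwise wait for the two loads to complete.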

Compensation Code

During scheduling, typically only a very small number of operations can be moved freely between basic blocks without changing program semantics. Other operations may be moved only when additional compensation code is inserted at an appropriate place in order to maintain original program semantics. Trace scheduling is quite general in this regard. Recall that a trace may be entered after the first instruction, and exited before the last instruction. In addition, trace scheduling allows operations in the region (trace) to move freely during scheduling relative to entries (join points) to and exits (split points) from the current trace. A separate bookkeeping step restores the original program semantics after trace compaction through the introduction of compensation code. It is this freedom of code motion during scheduling, and the introduction of compensation code between the scheduling of individual regions, that represents a major difference between trace scheduling and other acyclic scheduling techniques.

Since trace scheduling allows operations to move above join points as well as below split points (conditional branches) in the original program order, the bookkeeping process includes the following kinds of compensation. Note that a complete discussion of all the intricacies of compensation code is well beyond the scope of this entry; however, the following is a list of the simple concepts that form the basis of many of the compensation techniques used in compilers.

No compensation (Fig. 4a). If the global motion of an operation on the trace does not change the relative order of operations with respect to split and join points, no compensation code is needed. This covers the situation when an operation moves above a split, in which case the operation becomes speculative, and requires compensation depending on the recovery model of exceptions: in the case of dismissible speculation, no compensation code is needed; in the case of recovery speculation, the compiler has to emit a recovery block to guarantee the timely delivery of exceptions for correctly speculated operations.
Trace Scheduling. Fig. 4

Basic scenarios for compensation code. In each diagram, the left part shows the selected trace, the right part shows the compacted code where operation B has moved above operation A

Split compensation (Fig. 4b). When an operation A moves below a split operation B (i.e., a conditional branch), a copy of A (called A′) must be inserted on the off-trace split edge. When multiple operations move below a split operation, they are all copied on the off-trace edge in source order. These copies are unscheduled, and hence will be picked and scheduled later during the trace scheduling of the program.

Join compensation (Fig. 4c). When an operation B moves above a join point A, a copy of B (called B′) must be inserted on the off-trace join edge. When multiple operations move above a join point, they are all copied on the off-trace edge in source order.

Join–Split compensation (Fig. 4d). When splits are allowed to move above join points, the situation becomes more complicated: when the split is copied on the rejoin edge, it must account for any split compensation and therefore introduce additional control paths with additional split copies.

These rules define the compensation code required to correctly maintain the semantics of the original program. The following observations can be used to heuristically control the amount of compensation code that is generated.

To limit split compensation, the Multiflow Trace Scheduling compiler [12], the first commercial compiler to implement trace scheduling, required that all operations that precede a split on the trace precede the split on the schedule. While this limits the amount of available parallelism, the intuitive explanation is that a trace represents the most likely execution path; the on-trace performance penalty of this restriction is small; and off-trace the same operations would have to be executed in the first place. Multiflow’s implementation excluded memory-store operations from this heuristic because in Multiflow’s Trace architecture stores were unconditional and hence could not move above splits; they were allowed to move below splits to avoid serialization between stores and loop exits in unrolled loops. The Multiflow compiler also restricted splits to remain in source order. Not only did this reduce the amount of compensation code, it also ensured that all paths created by compensation code are subsets of paths (possibly rearranged) in the flow graph before trace scheduling.

Another observation concerns the possible suppression of compensation copies [8] (Fig. 5): sometimes an operation C that moves above a join point following an operation B actually moves to a position on the trace that dominates the join point. When this happens, and the result of C is still available at the join point, no copy of C is needed. This situation often arises when loops with internal branches are unrolled. Without copy suppression, such loops can generate large amounts of redundant compensation code.
Trace Scheduling. Fig. 5

Compensation copy suppression. The left diagram shows the selected trace. The middle diagram shows the compacted code where operation C has moved above operation A together with the normal join compensation. The right diagram shows the result of compensation copy suppression assuming that C is available at Y
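The basic split and join compensation rules of this section can be sketched on a strongly simplified model. The sketch assumes that both the trace and its schedule are plain linear orders of operation names (a real schedule groups several operations per cycle), that a join is identified by the first on-trace operation that followed it in source order, and that splits do not move above joins; join–split compensation, speculation, and copy suppression are not modeled. All function names are hypothetical.

```python
def split_compensation(source, schedule, split):
    """Operations to copy onto the off-trace edge of `split`.

    These are the operations that preceded the split in source order
    but follow it in the schedule; copies are listed in source order.
    """
    src = {op: i for i, op in enumerate(source)}
    pos = {op: i for i, op in enumerate(schedule)}
    return [op for op in source
            if src[op] < src[split] and pos[op] > pos[split]]


def join_compensation(source, schedule, join_target):
    """Rejoin position and copies for a join that entered before `join_target`.

    The rejoin point is the earliest schedule position below which only
    operations from below the join remain. Operations from below the join
    that were scheduled above that position are copied onto the off-trace
    join edge in source order.
    """
    src = {op: i for i, op in enumerate(source)}
    pos = {op: i for i, op in enumerate(schedule)}
    j = src[join_target]
    rejoin = len(schedule)
    while rejoin > 0 and src[schedule[rejoin - 1]] >= j:
        rejoin -= 1
    copies = [op for op in source if src[op] >= j and pos[op] < rejoin]
    return rejoin, copies


# A moved below the conditional branch BR: copy A on BR's off-trace edge.
split_copies = split_compensation(["A", "BR", "C"], ["BR", "A", "C"], "BR")

# The join originally entered before C; C was scheduled above A and B.
rejoin, join_copies = join_compensation(
    ["A", "B", "C", "D"], ["C", "A", "B", "D"], "C")
```

In the join example the rejoin point can only be determined after scheduling: it is lowered to just before D, and a copy of C is placed on the rejoin edge.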

Bibliographic Notes and Further Reading

The simplest form of a scheduling region is a region where all operations come from a single-entry single-exit straight-line piece of code (i.e., a basic block). Since these regions do not contain any internal control flow, they can be scheduled using simple algorithms that maintain the partial order given by data dependences. (For simplicity, it is best to require that operations that could incur an exception must end their basic block, allowing the exception to be caught by an exception handler.)

Traces and trace scheduling were the first region-scheduling techniques proposed. They were introduced by Fisher [6, 7] and described more carefully in Ellis’ thesis [4]. By demonstrating that simple microcode operations could be statically compacted and scheduled on multi-issue hardware, trace scheduling provided the basis for making VLIW machines practical. Trace scheduling was implemented in the Multiflow compiler [12]; by demonstrating that commercial codes could be statically compiled for multi-issue architectures, this work also greatly influenced and contributed to the performance of superscalar architectures. Today, ideas of trace scheduling and its descendants are implemented in most compilers (e.g., GCC, LLVM, Open64, Pro64, as well as commercial compilers).

Trace scheduling inspired several other global acyclic scheduling techniques. The most important linear acyclic region-scheduling techniques are presented next.

Superblocks

Hwu and his colleagues on the IMPACT project developed a variant of trace scheduling called superblock scheduling. Superblocks are traces with the added restriction that the superblock must be entered at the top [2, 3]. Hence superblocks can be joined only before the first or after the last operation in the superblock. As such, superblocks are single-entry, multiple-exit traces.

Since superblocks do not contain join points, scheduling a superblock cannot generate any join or join–split compensation. By also prohibiting motion below splits, superblock scheduling avoids the need to generate compensation code outside the schedule region, and hence does not require a separate bookkeeping step. With these restrictions, superblock formation can be completed before scheduling starts, simplifying its implementation.

Superblock formation often includes a technique called tail duplication to increase the size of the superblock: tail duplication copies any operations that follow a rejoin in the original control flow graph and that are part of the superblock into the rejoin edge, thus effectively lowering the rejoin point to the end of the superblock. This is done at superblock formation time, before any compaction takes place [11].
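Tail duplication can be sketched on a small control flow graph as follows. The representation is an assumption made for this example: the graph is a dictionary that maps every block name to the list of its successors, the trace is a list of block names in which consecutive blocks are connected, and duplicated blocks are named by appending a prime to the original name.

```python
def tail_duplicate(cfg, trace):
    """Return a new CFG in which `trace` has no side entrances.

    cfg:   {block: [successor blocks]}, every block appears as a key
    trace: list of blocks; each one is a successor of the previous one
    """
    preds = {b: [] for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].append(b)

    trace_pred = {trace[k]: trace[k - 1] for k in range(1, len(trace))}
    # First block below the trace head that can be entered from elsewhere.
    first = next((k for k in range(1, len(trace))
                  if any(p != trace[k - 1] for p in preds[trace[k]])), None)

    new_cfg = {b: list(succs) for b, succs in cfg.items()}
    if first is None:
        return new_cfg                     # the trace already is a superblock

    tail = trace[first:]
    copy = {b: b + "'" for b in tail}
    # Duplicate the tail; edges inside the tail stay inside the copy.
    for b in tail:
        new_cfg[copy[b]] = [copy.get(s, s) for s in cfg[b]]
    # Redirect every side entrance into the tail to the duplicated block.
    for b in tail:
        for p in preds[b]:
            if p != trace_pred[b]:
                new_cfg[p] = [copy[b] if s == b else s for s in new_cfg[p]]
    return new_cfg


cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
superblock_cfg = tail_duplicate(cfg, ["A", "B", "D", "E"])
# C now branches to the copy D', so A, B, D, E is entered only at A.
```

The code growth is visible directly in the result: blocks D and E now exist twice, once on the superblock and once on the off-trace path that starts at C.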

A variant of superblock scheduling that allows speculative code motion is sentinel scheduling [14].

Hyperblocks

A different approach to global acyclic scheduling also originated with the IMPACT project. Hyperblocks are single-entry, multiple-exit traces (i.e., superblocks) whose internal control flow has been eliminated using predication [13].

Treegions

Treegions [9, 10] consist of the operations from a list of basic blocks B_0, B_1, …, B_n with the properties that:

  • For each j > 0, B_j has exactly one predecessor.

  • For each j > 0, the predecessor B_i of B_j is also on the list, where i < j.

Hence, treegions represent trees of basic blocks in the control flow graph. Since treegions do not contain any side entrances, each path through a treegion yields a superblock. Like superblock compilers, treegion compilers employ tail duplication and other region-enlarging techniques. More recent work by Zhou and Conte [16, 17] shows that treegions can be made quite effective without significant code growth.
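The two defining properties of a treegion translate directly into a check, sketched below in Python. The graph representation (a dictionary from each block to its successor list, with every block present as a key) and both function names are assumptions for this example; `grow_treegion` is one simple way to form a region from a root and is not taken from the cited work.

```python
def predecessors(cfg):
    preds = {b: set() for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].add(b)
    return preds


def is_treegion(cfg, blocks):
    """Check the treegion properties for the list B_0, ..., B_n."""
    preds = predecessors(cfg)
    index = {b: i for i, b in enumerate(blocks)}
    for j, b in enumerate(blocks[1:], start=1):
        if len(preds[b]) != 1:
            return False               # B_j must have exactly one predecessor
        (p,) = preds[b]
        if index.get(p, j) >= j:
            return False               # ... and it must appear earlier in the list
    return True


def grow_treegion(cfg, root):
    """Absorb successors with a single predecessor, starting at `root`."""
    preds = predecessors(cfg)
    region = [root]
    for b in region:                   # the list grows while it is traversed
        for s in cfg[b]:
            if len(preds[s]) == 1 and s not in region:
                region.append(s)
    return region


cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
region = grow_treegion(cfg, "A")       # ["A", "B", "C"]
```

Growth stops at D because D is a join point with two predecessors; D would become the root of the next treegion unless tail duplication is applied first.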

Nonlinear Regions

Nonlinear region approaches include percolation scheduling [1] and DAG-based scheduling [15]. Trace scheduling-2 [5] extends treegions by removing the restriction on side entrances. However, its implementation proved so difficult that its proposer eventually gave up on it, and no formal description or implementation of it is known to exist.

Related Entries

Modulo Scheduling and Loop Pipelining

Copyright information

© Springer Science+Business Media, LLC 2011