Trace scheduling is a global acyclic instruction scheduling technique in which the scheduling region consists of a linear acyclic sequence of basic blocks embedded in the control flow graph. Trace scheduling differs from other global acyclic scheduling techniques by allowing the scheduling region to be entered after the first instruction.
Trace scheduling was the first global instruction scheduling technique that was proposed and successfully implemented in both research and commercial compilers. By demonstrating that simple microcode operations could be statically compacted and scheduled on multi-issue hardware, trace scheduling provided the basis for making large amounts of instruction-level parallelism practical. Its first commercial implementation demonstrated that commercial codes could be statically compiled for multi-issue architectures, and thus greatly influenced and contributed to the performance of superscalar architectures. Today, the ideas of trace scheduling and its descendants are implemented in most compilers.
Global scheduling techniques are needed for processors that expose instruction-level parallelism (ILP), that is, processors that allow multiple operations to execute simultaneously. This situation may independently arise for two reasons: either because a processor issues more than a single operation during each clock cycle, or because a processor allows issuing independent operations while deeply pipelined operations are still executing. The number of independent operations that need to be found for an ILP processor is a function of both the number of operations issued per clock cycle, and the latency of operations, whether computational or memory. The latency of computational operations depends upon the design of the functional units. The latency of memory operations depends upon the design and latencies of caches and main memory, as well as on the availability of prefetch and cache-bypassing operations. Global scheduling techniques are needed for these processors because the number of independent operations available in a typical basic block is too small to fully utilize their available hardware resources. By expanding the scheduling region, more operations become available for scheduling. Global scheduling techniques differ from other global code motion techniques (such as loop-invariant code motion or partial redundancy elimination) because they take into account the available hardware resources (such as available functional units and operation issue slots).
Instruction scheduling techniques can be broadly classified based on the region that they schedule, and whether this region is cyclic or acyclic. Algorithms that schedule only single basic blocks are known as local scheduling algorithms; algorithms that schedule multiple basic blocks at once are known as global scheduling algorithms. Global scheduling algorithms that operate on entire loops of a program are known as cyclic scheduling algorithms, while methods that impose a scheduling barrier at the end of a loop body are known as acyclic scheduling algorithms. Global scheduling regions include regions consisting of a single basic block as a “degenerate” form of region, and acyclic schedulers may consider entire loops but, unlike cyclic schedulers, stop at the loops’ back edges (a back edge points to an ancestor in a depth-first traversal of the control flow graph; it captures the flow from one iteration of the loop to the start of the next iteration).
All scheduling algorithms can benefit from hardware support. When control-dependent operations that can cause side effects move above their controlling branch, they must either be executed conditionally, so that their effects arise only if the operation would have executed in the original program order, or their side effects must be delayed until the point at which the operation would originally have been executed.
Hardware techniques to support this include predication of operations, implicit or explicit register renaming, and mechanisms to suppress or delay exceptions in order to prevent an incorrect exception from being signaled. Predication of operations controls whether the side effects of the predicated operations become visible to the program state through an additional predicate operand. The predicate operand can be implicit (such as the conditional execution of operations in branch delay slots depending on the outcome of the branch condition) or explicit (through an additional machine register operand); in the latter case, the predicate operand could simply be the same predicate that controls the conditional branch on which the operation was control-dependent in the original flow graph (in which case the predicated operation could move just above a single conditional branch). Register renaming refers to the technique where additional machine registers are used to hold the results of an operation until the point where the operation would have occurred in the original program order.
Global scheduling algorithms principally consist of two phases: region formation and schedule construction. Algorithms differ in the shape of the region and the global code motions permitted during scheduling. Depending on the shape of the region and the code motions allowed during scheduling, compensation code must be inserted at appropriate places in the control flow graph to maintain the original program semantics.
Trace scheduling allows traces to be entered after the first operation and before the last operation. This complicates the determination of compensation code because the rejoin points cannot be determined until a trace has been scheduled. This leads to the following overall trace scheduling loop:
while (unscheduled operations remain)
    select trace T
    construct schedule for T
    determine rejoin points to T
    generate compensation code
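The driver loop above can be sketched as follows; the helper functions for trace selection, compaction, rejoin detection, and compensation generation are hypothetical placeholders for the phases described in this entry:

```python
def trace_schedule(ops, pick_trace, compact, find_rejoins, add_compensation):
    """Drive trace scheduling until every operation is scheduled."""
    scheduled = []
    unscheduled = set(ops)
    while unscheduled:                            # unscheduled operations remain
        trace = pick_trace(unscheduled)           # select trace T
        schedule = compact(trace)                 # construct schedule for T
        rejoins = find_rejoins(trace, schedule)   # determine rejoin points to T
        # compensation copies are fresh, as-yet unscheduled operations
        unscheduled |= add_compensation(trace, schedule, rejoins)
        unscheduled -= set(trace)
        scheduled.append(schedule)
    return scheduled
```

Note that compensation copies are fed back into the pool of unscheduled operations, so they are picked up and scheduled on later traces, exactly as the loop above prescribes.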
The remainder of this entry first discusses region formation and schedule construction in general and as it applies to trace scheduling, and then compares trace scheduling to other acyclic global scheduling techniques. Cyclic scheduling algorithms are discussed elsewhere.
Region Formation – Trace Picking
A trace is a sequence of basic blocks B_0, …, B_n with the following properties:
Each basic block is a predecessor of the next in the sequence (i.e., for each k = 0, …, n−1, B_k is a predecessor of B_k+1, and B_k+1 is a successor of B_k in the control flow graph).
For any j, k there is no path B_j → B_k → B_j except for those that include B_0 (i.e., the code is cycle free, except that the entire region can be part of some encompassing loop).
Note that this definition does not exclude forward branches within the region, nor control flow that leaves the region and reenters it at a later point. This generality has been controversial in the research community because many felt that the added complexity of its implementation was not justified by its added benefit and has led to several alternative approaches that are discussed below.
Pick the as-yet unscheduled operation with the largest expected execution frequency as the seed operation of the trace.
Grow the trace both forward in the direction of the flow graph as well as backward, picking the mutually most-likely successor (predecessor) operation to the currently last (first) operation on the trace.
Stop growing a trace when either no mutually most-likely successor (predecessor) exists, or when some heuristic trace length limit has been reached.
Block S is the mutually most-likely successor of block P if both of the following hold:
S is the most likely successor of P;
P is the most likely predecessor of S.
For this definition, it is immaterial whether the likelihood that S follows P (P precedes S) is based on available profile data collected during earlier runs of the program, has been determined by a synthetic profile, or is based on source annotations in the program. Of course, the greater the benefit derived from picking the correct trace, the greater the penalty for picking the wrong one.
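The trace-growing steps above can be sketched as follows, assuming profile data is available as per-edge execution counts (the data representation and function names are illustrative assumptions):

```python
def most_likely_succ(b, succ_counts):
    """Successor of b with the highest profiled edge count, or None."""
    edges = succ_counts.get(b, {})
    return max(edges, key=edges.get) if edges else None

def most_likely_pred(b, pred_counts):
    """Predecessor of b with the highest profiled edge count, or None."""
    edges = pred_counts.get(b, {})
    return max(edges, key=edges.get) if edges else None

def grow_trace(seed, succ_counts, pred_counts, available, limit=64):
    """Grow a trace forward and backward from a seed basic block."""
    trace = [seed]
    while len(trace) < limit:                # grow forward
        s = most_likely_succ(trace[-1], succ_counts)
        # stop unless the choice is mutual and s is still available
        if (s is None or s in trace or s not in available
                or most_likely_pred(s, pred_counts) != trace[-1]):
            break
        trace.append(s)
    while len(trace) < limit:                # grow backward symmetrically
        p = most_likely_pred(trace[0], pred_counts)
        if (p is None or p in trace or p not in available
                or most_likely_succ(p, succ_counts) != trace[0]):
            break
        trace.insert(0, p)
    return trace
```

The mutuality test is what distinguishes this from simply following the hottest outgoing edge: a block is added only if it also nominates the current trace end as its most likely neighbor.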
Trace picking is the region formation technique used for trace scheduling. Other acyclic region formation techniques and their relationship to trace scheduling are discussed below.
Trace selection alone typically does not expose enough ILP for the instruction scheduler of a typical ILP processor. Once the limit on the length of a “natural” trace has been reached (e.g., the entire loop body), region-enlargement techniques can be employed to further increase the size of the region, albeit at the cost of a larger code size for the program. Many enlargement techniques exploit the fact that programs iterate and grow the size of a region by making extra copies of highly iterated code, leading to a larger region that contains more ILP.
These code-replicating techniques have been criticized by advocates of other approaches, such as cyclic scheduling and loop-level parallel processing, because comparable benefits to larger schedule regions may be found using other techniques. However, no study appears to exist that quantifies such claims.
Loop unrolling in many industrial compilers is often rather effective because a heuristically determined small amount of unrolling is sufficient to fill the resources of the target machine.
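As an illustration of why unrolling exposes ILP, the following toy example unrolls a reduction by four with separate accumulators, breaking the single read-after-write chain so that the four additions in each iteration are mutually independent (the source-level rewrite stands in for what a compiler would do on its intermediate representation):

```python
def sum_rolled(xs):
    total = 0
    for x in xs:               # one add per iteration, serialized on `total`
        total += x
    return total

def sum_unrolled4(xs):
    n = len(xs) - len(xs) % 4
    t0 = t1 = t2 = t3 = 0
    for i in range(0, n, 4):   # four independent adds per iteration
        t0 += xs[i]
        t1 += xs[i + 1]
        t2 += xs[i + 2]
        t3 += xs[i + 3]
    total = t0 + t1 + t2 + t3
    for x in xs[n:]:           # epilogue for the leftover iterations
        total += x
    return total
```

A scheduler presented with the unrolled body can issue the four adds in parallel, whereas the rolled loop offers only one add per cycle of the dependence chain.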
Region Compaction – Instruction Scheduler
Once the scheduling region has been selected, the instruction scheduler assigns functional units of the target machine and time slots in the instruction schedule to each operation of the region. In doing so, the scheduler attempts to minimize an objective cost function while maintaining program semantics and obeying the resource limitations of the target architecture. Often, the objective cost function is the expected execution time, but other objective functions are possible (for example, code size and energy efficiency could be part of an objective function).
The semantics of a program defines certain sequential constraints or dependences that must be maintained by a valid execution. These dependences preclude some reordering of operations within a program. The data flow of a program imposes data dependences, and the control flow of a program imposes control dependences. (Note the difference between control flow and control dependence: block B is control dependent on block A if B post-dominates some successor of A but does not post-dominate A itself. In other words, the result of the control decision made in A directly affects whether or not B is executed.)
There are three types of data dependences: read-after-write dependences (also called RAW, flow, or true dependences), write-after-read dependences (also called WAR or anti dependences), and write-after-write dependences (also called WAW or output dependences). The latter two types are also called false dependences because they can be removed by renaming.
There are two types of control dependences: split dependences may prevent operations from moving below the exit of a basic block, and join dependences may prevent operations from moving above the entrance to a basic block. Control dependence does not constrain the relative order of operations within a basic block but rather expresses constraints on moving operations between basic blocks.
Both data and control dependences represent ordering constraints on the program execution, and hence induce a partial ordering on the operations. Any partial ordering can be represented as a directed acyclic graph (DAG), and DAGs are indeed often used by scheduling algorithms. Variants to the simple DAG are the data dependence graph (DDG), and the program dependence graph (PDG). All these graphs represent operations as nodes and dependences as edges (some graphs only express data dependences, while others include both data and control dependences).
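A minimal data-dependence-DAG builder for straight-line code might look as follows; the (destination, sources) tuple IR is an assumption made for this sketch:

```python
def build_ddg(ops):
    """ops: list of (dest, sources) tuples in program order.
    Returns edges (i, j, kind) meaning op j must follow op i."""
    edges = []
    for j, (dst_j, srcs_j) in enumerate(ops):
        for i in range(j):
            dst_i, srcs_i = ops[i]
            if dst_i in srcs_j:
                edges.append((i, j, "RAW"))   # flow (true) dependence
            if dst_j in srcs_i:
                edges.append((i, j, "WAR"))   # anti dependence
            if dst_j == dst_i:
                edges.append((i, j, "WAW"))   # output dependence
    return edges
```

In this representation, the WAR and WAW edges are the false dependences: renaming the second write of a register to a fresh register would remove them, leaving only the RAW edges.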
Code Motion Between Adjacent Blocks
Two fundamental techniques, predication and speculation, are employed by schedulers (or earlier phases) to transform or remove control dependence. While it is sometimes possible to employ either technique, they represent independent techniques, and usually one is more natural to employ in a given situation. Speculation is used to move operations above a branch that is highly weighted in one direction; predication is used to collapse short sequences of alternative operations following a branch that is nearly equally likely in each direction. Predication can also play an important role in software pipelining.
Speculative code motion (or code hoisting and sometimes code sinking) moves operations above control-dominating branches (or below joins for sinking). In principle, this transformation does not always maintain the original program semantics, and in particular it may change the exception behavior of the program. If an operation may generate an exception and the exception recovery model does not allow speculative exceptions to be dismissed (ignored), then the compiler must generate recovery code that raises the exception at the original program point of the speculated operation. Unlike predication, speculation actually removes control dependences, and thus potentially reduces the length of the critical path of execution. Depending on the shape and size of recovery code, and if multiple operations are speculated, the addition of recovery code can lead to a substantial amount of code.
Predication is a technique in which, with hardware support, operations have an additional input operand, the predicate operand, which determines whether any effects of executing the operation are seen by the program execution. Thus, from an execution point of view, the operation is conditionally executed under the control of the predicate input. Hence, changing a control-dependent operation into its predicated equivalent, whose predicate is equivalent to the condition of the control dependence, turns control dependence into data dependence.
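The effect of predication can be illustrated on a toy register-machine IR (the IR itself is an assumption): every operation carries a predicate register that gates its side effect, so an if/else hammock becomes straight-line code whose only dependence on the branch condition is a data dependence on the predicate:

```python
import operator

def run(ops, regs):
    """Execute predicated operations; a false predicate suppresses the write."""
    for pred, dst, fn, *srcs in ops:
        if regs[pred]:                       # predicate gates the side effect
            regs[dst] = fn(*(regs[s] for s in srcs))
    return regs

# if (a < b) r = a + b; else r = a - b;  as straight-line predicated code:
ops = [
    ("true", "p",  operator.lt,   "a", "b"),  # p := a < b
    ("true", "np", operator.not_, "p"),       # np := !p
    ("p",    "r",  operator.add,  "a", "b"),  # takes effect only if p
    ("np",   "r",  operator.sub,  "a", "b"),  # takes effect only if !p
]
```

Both arms are now fetched and "executed" unconditionally; only the arm whose predicate is true updates the program state, so the scheduler is free to interleave them.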
There are many different scheduling techniques, which can broadly be classified by features into cycle versus operation scheduling, linear versus graph-based, cyclic versus acyclic, and greedy versus backtracking. However, for trace scheduling itself the scheduling technique employed is not of major concern; rather, trace scheduling distinguishes itself from other global acyclic scheduling techniques by the way the scheduling region is formed, and by the kind of code motions permitted during scheduling. Hence these techniques will not be described here, and in the following, a greedy graph-based technique, namely list scheduling, will be used.
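A minimal cycle-driven list scheduler can be sketched as follows; the DAG encoding, the uniform treatment of latencies, and the simple index-order priority (standing in for a critical-path heuristic) are assumptions of this sketch:

```python
def list_schedule(n_ops, edges, latency, width):
    """edges: (i, j) means op j depends on op i (a DAG).
    latency[i]: cycles before op i's result is available.
    width: operations that can be issued per cycle.
    Returns the issue cycle assigned to each operation."""
    preds = {j: [] for j in range(n_ops)}
    for i, j in edges:
        preds[j].append(i)
    start = [None] * n_ops
    cycle = 0
    while any(s is None for s in start):
        # ready: unscheduled ops whose predecessors complete by this cycle
        ready = [j for j in range(n_ops)
                 if start[j] is None
                 and all(start[i] is not None and start[i] + latency[i] <= cycle
                         for i in preds[j])]
        for j in ready[:width]:            # greedily fill the issue slots
            start[j] = cycle
        cycle += 1
    return start
```

Each iteration of the outer loop fills one instruction (one cycle's worth of issue slots), which is why this style is called cycle scheduling; a production scheduler would order the ready list by a priority function such as height in the dependence DAG.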
During scheduling, typically only a very small number of operations can be moved freely between basic blocks without changing program semantics. Other operations may be moved only when additional compensation code is inserted at an appropriate place in order to maintain original program semantics. Trace scheduling is quite general in this regard. Recall that a trace may be entered after the first instruction, and exited before the last instruction. In addition, trace scheduling allows operations in the region (trace) to move freely during scheduling relative to entries (join points) to and exits (split points) from the current trace. A separate bookkeeping step restores the original program semantics after trace compaction through the introduction of compensation code. It is this freedom of code motion during scheduling, and the introduction of compensation code between the scheduling of individual regions, that represents a major difference between trace scheduling and other acyclic scheduling techniques.
Since trace scheduling allows operations to move above join points as well as below split points (conditional branches) in the original program order, the bookkeeping process includes the following kinds of compensation. Note that a complete discussion of all the intricacies of compensation code is well beyond the scope of this entry; however, the following is a list of the simple concepts that form the basis of many of the compensation techniques used in compilers.
Split compensation (Fig. 4b). When an operation A moves below a split operation B (i.e., a conditional branch), a copy of A (called A′) must be inserted on the off-trace split edge. When multiple operations move below a split operation, they are all copied on the off-trace edge in source order. These copies are unscheduled, and hence will be picked and scheduled later during the trace scheduling of the program.
Join compensation (Fig. 4c). When an operation B moves above a join point A, a copy of B (called B′) must be inserted on the off-trace join edge. When multiple operations move above a join point, they are all copied on the off-trace edge in source order.
Join–Split compensation (Fig. 4d). When splits are allowed to move above join points, the situation becomes more complicated: when the split is copied on the rejoin edge, it must account for any split compensation and therefore introduce additional control paths with additional split copies.
These rules define the compensation code required to correctly maintain the semantics of the original program. The following observations can be used to heuristically control the amount of compensation code that is generated.
To limit split compensation, the Multiflow Trace Scheduling compiler, the first commercial compiler to implement trace scheduling, required that all operations that precede a split on the trace also precede the split on the schedule. While this limits the amount of available parallelism, the intuitive justification is that a trace represents the most likely execution path: the on-trace performance penalty of this restriction is small, and off-trace the same operations would have to be executed anyway. Multiflow’s implementation excluded memory-store operations from this heuristic because in Multiflow’s Trace architecture stores were unconditional and hence could not move above splits; they were, however, allowed to move below splits to avoid serialization between stores and loop exits in unrolled loops. The Multiflow compiler also restricted splits to remain in source order. Not only did this reduce the amount of compensation code, it also ensured that all paths created by compensation code are subsets of paths (possibly rearranged) in the flow graph before trace scheduling.
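The join-compensation rule can be sketched as a small function: operations that were below a join point in source order but end up scheduled above the rejoin point must be copied, in source order, onto the off-trace join edge. The data shapes here (a source-order list and a map from operation to its position in the compacted schedule) are assumptions of this sketch:

```python
def join_compensation(source_order, schedule_pos, join_at):
    """source_order: trace operations below the join, in original order.
    schedule_pos: op -> position in the compacted schedule.
    join_at: schedule position where the off-trace edge rejoins.
    Returns the copies to place on the off-trace join edge."""
    # an op scheduled before the rejoin point has moved above the join
    return [op for op in source_order if schedule_pos[op] < join_at]
```

Because the result preserves source order, the copies placed on the join edge re-create exactly the computation an off-trace path would have skipped.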
Bibliographic Notes and Further Reading
The simplest form of a scheduling region is a region where all operations come from a single-entry single-exit straight-line piece of code (i.e., a basic block). Since these regions do not contain any internal control flow, they can be scheduled using simple algorithms that maintain the partial order given by data dependences. (For simplicity, it is best to require that operations that could incur an exception must end their basic block, allowing the exception to be caught by an exception handler.)
Traces and trace scheduling were the first region-scheduling techniques proposed. They were introduced by Fisher [6, 7] and described more carefully in Ellis’ thesis. By demonstrating that simple microcode operations could be statically compacted and scheduled on multi-issue hardware, trace scheduling provided the basis for making VLIW machines practical. Trace scheduling was implemented in the Multiflow compiler; by demonstrating that commercial codes could be statically compiled for multi-issue architectures, this work also greatly influenced and contributed to the performance of superscalar architectures. Today, ideas of trace scheduling and its descendants are implemented in most compilers (e.g., GCC, LLVM, Open64, Pro64, as well as commercial compilers).
Trace scheduling inspired several other global acyclic scheduling techniques. The most important linear acyclic region-scheduling techniques are presented next.
Hwu and his colleagues on the IMPACT project have developed a variant of trace scheduling called superblock scheduling. Superblocks are traces with the added restriction that the superblock must be entered at the top [2, 3]. Hence superblocks can be joined only before the first or after the last operation in the superblock. As such, superblocks are single-entry, multiple-exit traces.
Since superblocks do not contain join points, scheduling a superblock cannot generate any join or join–split compensation. By also prohibiting motion below splits, superblock scheduling avoids the need to generate compensation code outside the schedule region, and hence does not require a separate bookkeeping step. With these restrictions, superblock formation can be completed before scheduling starts, simplifying its implementation.
Superblock formation often includes a technique called tail duplication to increase the size of the superblock: tail duplication copies any operations that follow a rejoin in the original control flow graph and that are part of the superblock into the rejoin edge, thus effectively lowering the rejoin point to the end of the superblock. This is done at superblock formation time, before any compaction takes place.
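Tail duplication can be sketched as follows; the block naming and the list-of-blocks representation are illustrative assumptions:

```python
def tail_duplicate(superblock, join_index):
    """Copy the blocks at and after a side entrance (join_index) so the
    entering edge can be redirected to the copies, leaving the original
    superblock with a single entry at the top."""
    tail = superblock[join_index:]
    copies = [b + "_dup" for b in tail]   # fresh copies of the tail blocks
    return superblock, copies             # join edge now targets copies[0]
```

After the transformation, the original blocks form a single-entry superblock, and the off-trace path executes the duplicated tail instead of rejoining in the middle.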
A variant of superblock scheduling that allows speculative code motion is sentinel scheduling.
A different approach to global acyclic scheduling also originated with the IMPACT project. Hyperblocks are single-entry, multiple-exit traces (superblocks) that use predication to eliminate internal control flow.
A treegion is a list of basic blocks B_0, …, B_n with the following properties:
For each j > 0, B_j has exactly one predecessor.
For each j > 0, the predecessor B_i of B_j is also on the list, where i < j.
Hence, treegions represent trees of basic blocks in the control flow graph. Since treegions do not contain any side entrances, each path through a treegion yields a superblock. Like superblock compilers, treegion compilers employ tail duplication and other region-enlarging techniques. More recent work by Zhou and Conte [16, 17] shows that treegions can be made quite effective without significant code growth.
Nonlinear region approaches include percolation scheduling and DAG-based scheduling. Trace scheduling-2 extends treegions by removing the restriction on side entrances. However, its implementation proved so difficult that its proposer eventually gave up on it, and no formal description or implementation of it is known to exist.