1.1 Introduction
Traditional processor designs issue and execute at most one operation per cycle. To improve on this, instruction pipelining has become a standard feature of commercial processors. An extension to instruction pipelining is to provide parallel data paths so that several operations can be fetched, decoded and executed per cycle; such processors have been referred to as multiple-instruction-issue processors [4]. A critical issue in the design of multiple-instruction-issue processors is code scheduling. Code scheduling methods ensure that control dependences, data dependences and resource limitations are properly handled during concurrent execution; the goal is to produce a code schedule that minimises execution time while enforcing the correctness of execution. Code scheduling can be done at compile time (static scheduling) and/or at run time (dynamic scheduling) [2]. In dynamic scheduling, hardware rearranges instruction execution at run time to reduce stalls, allowing the program to run at full speed, but this requires a considerable investment in hardware. In static scheduling, the compiler arranges the instructions before execution, so the hardware can be simpler; however, when the compiler cannot find enough independent work, the functional units may be poorly utilised. This essay provides a critical comparative analysis of dynamic and static scheduling.
1.2 Dynamic and Static Scheduling
Code scheduling can be done at compile time (static scheduling) and/or at run time (dynamic scheduling). Static scheduling requires intelligent compilation support, whereas dynamic scheduling requires sophisticated hardware support. In practice, dynamic scheduling is assisted by static scheduling to improve performance and to reduce hardware cost; conversely, static scheduling is often assisted by hardware interlocking to enforce the correctness of execution. The compiler can schedule (move) instructions to reduce stalls: basic pipeline scheduling, for example, separates back-to-back load-use pairs so that a loaded value is not consumed in the immediately following cycle. The primary static scheduling technique uses the compiler to determine sets of operations that have their source operands ready and have no dependences among them. These operations can then be scheduled within the same instruction, subject only to hardware resource limits. Since each of the operations in an instruction is guaranteed by the compiler to be independent, the hardware can issue and execute these operations directly, with no dynamic analysis. These multi-operation instructions are very long in comparison with traditional single-operation instructions, and processors using this technique have been called Very Long Instruction Word (VLIW) processors. The primary dynamic scheduling technique uses special hardware to analyse the instruction stream at execution time and to determine which operations in the instruction stream are independent of all preceding operations and have their source operands available. These operations can then be issued and executed concurrently. Processors using dynamic scheduling techniques have been called superscalar or out-of-order processors.
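As a minimal illustration of what the compiler looks for when forming a VLIW instruction, the C sketch below contrasts two operations that share no operands, and so could occupy the same long instruction word, with two operations linked by a true dependence that forces them into separate issue slots. The function names are hypothetical and chosen purely for illustration.

```c
/* Hypothetical illustration of what a VLIW compiler looks for.  In
 * pack_together(), the two statements use disjoint operands, so a
 * compiler could place them in the same long instruction word.  In
 * must_serialize(), the second statement reads t, which the first one
 * writes (a true RAW dependence), so they cannot share an instruction. */
int pack_together(int a, int b, int c, int d)
{
    int x = a + b;   /* independent of the line below           */
    int y = c * d;   /* no shared operands: can issue together  */
    return x + y;
}

int must_serialize(int a, int b, int c)
{
    int t = a + b;   /* produces t                              */
    int u = t * c;   /* consumes t: must wait for the result    */
    return u;
}
```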
1.3 Program characteristics that support dynamic scheduling and static scheduling
In a static-issue processor, you can think of the set of instructions that issue in a given clock cycle, called an issue packet, as one large instruction with multiple operations. Because a static multiple-issue processor usually restricts what mix of instructions can be initiated in a given clock cycle, the packet behaves like a single instruction that allows several operations in certain predefined fields. This view led to the original name for this approach: Very Long Instruction Word (VLIW). Most static-issue processors also rely on the compiler to take on some responsibility for handling data and control hazards.
Loop unrolling is used to get more performance from a loop: multiple copies of the loop body are made, and after unrolling there is more ILP available because instructions from different iterations can be overlapped. During the unrolling process, the compiler introduces additional registers. The goal of this process, called register renaming, is to eliminate dependences that are not true data dependences but could either lead to potential hazards or prevent the compiler from flexibly scheduling the code. An anti-dependence, or name dependence, can also arise; it is an ordering forced by the reuse of a name, typically a register, rather than by a true dependence that carries a value between two instructions. Renaming the registers during the unrolling process allows the compiler to subsequently move these independent instructions so as to better schedule the code (a small code sketch of this transformation is given below). The renaming process eliminates the name dependences while preserving the true dependences.
Static branch predictors are sometimes used in processors where branch behaviour is expected to be highly predictable at compile time; static prediction can also be used to assist dynamic predictors. There are several ways to predict branch behaviour statically. The simplest scheme is to predict every branch as taken. A better alternative is to predict on the basis of branch direction, choosing backward-going branches to be taken and forward-going branches to be not taken. A still more accurate technique is to predict branches on the basis of profile information collected from earlier runs. Static branch prediction is useful for scheduling instructions when the branch delays are exposed by the architecture (either delayed or cancelling branches), for assisting dynamic predictors, and for determining which code paths are more frequent, which is a key step in code scheduling.
Superscalar processors decide on the fly how many instructions to issue. A statically scheduled superscalar must check for any dependences between instructions in the issue packet as well as between any issue candidate and any instruction already in the pipeline. An alternative to the superscalar approach is to rely on compiler technology not only to minimise the potential data hazard stalls but to actually format the instructions in a potential issue packet so that the hardware need not check explicitly for dependences. When the behaviour of branches is not well known, however, compiler techniques alone may not be able to uncover much ILP. With in-order scheduling, instructions are issued in program order, and if an instruction is stalled in the pipeline, no later instructions can proceed. Thus, if there is a dependence between two closely spaced instructions in the pipeline, a hazard arises and a stall results; if there are multiple functional units, these units can lie idle.
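The loop-unrolling and register-renaming transformation described above can be illustrated with a small C example. The unrolled version is hand-written to show what a compiler might produce; the function names, and the assumption that the trip count is a multiple of four, are purely illustrative, and the distinct temporaries t0 to t3 play the role of the renamed registers that remove the name dependences between iterations.

```c
/* Original loop: a naive schedule reuses the same register for the
 * loaded value in every iteration, creating name dependences that
 * serialize the schedule. */
void scale(double *x, double s, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = x[i] * s;
}

/* Unrolled by four with "renamed" temporaries t0..t3.  Assumes n is a
 * multiple of 4 for brevity; a real compiler would add cleanup code.
 * The four multiplies are independent and can be overlapped. */
void scale_unrolled(double *x, double s, int n)
{
    for (int i = 0; i < n; i += 4) {
        double t0 = x[i]     * s;
        double t1 = x[i + 1] * s;
        double t2 = x[i + 2] * s;
        double t3 = x[i + 3] * s;
        x[i]     = t0;
        x[i + 1] = t1;
        x[i + 2] = t2;
        x[i + 3] = t3;
    }
}
```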
When in-order issue leaves functional units idle in this way, out-of-order execution with out-of-order completion is needed, and this is where dynamic scheduling helps. Out-of-order execution, however, introduces the possibility of WAR and WAW hazards, which do not exist in the five-stage integer pipeline or its logical extension to an in-order floating-point pipeline. Out-of-order completion also creates major complications in handling exceptions. Dynamic scheduling with out-of-order completion must preserve exception behaviour in the sense that exactly those exceptions that would arise if the program were executed in strict program order actually do arise. Dynamically scheduled processors preserve exception behaviour by ensuring that no instruction can generate an exception until the processor knows that the instruction raising the exception will be executed. To allow out-of-order execution, the ID stage of the five-stage pipeline is divided into two stages: issue, which decodes instructions and checks for structural hazards, and read operands, which waits until no data hazards remain and then reads the operands. In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue); however, they can be stalled or bypass each other in the second stage (read operands) and thus enter execution out of order. Scoreboarding is a technique for allowing instructions to execute out of order, and Tomasulo's algorithm is a more sophisticated technique with several major enhancements over scoreboarding.
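The hazard classes discussed in this section can be made concrete with a small sketch. The code below is not a model of any particular scoreboard or Tomasulo implementation; it simply reduces each instruction to one destination and two source registers (a simplifying assumption) and classifies the dependence of a younger instruction on an older one as RAW, WAR or WAW.

```c
#include <stdio.h>

/* Each instruction reduced to one destination and two source registers. */
typedef struct { int dest, src1, src2; } Instr;

/* Classify the dependence of 'younger' on 'older'. */
const char *hazard(Instr older, Instr younger)
{
    if (younger.src1 == older.dest || younger.src2 == older.dest)
        return "RAW";                    /* true dependence               */
    if (younger.dest == older.src1 || younger.dest == older.src2)
        return "WAR";                    /* younger overwrites a source   */
    if (younger.dest == older.dest)
        return "WAW";                    /* both write the same register  */
    return "none";
}

int main(void)
{
    Instr div = { 1, 2, 3 };   /* r1 <- r2 / r3  (long latency)      */
    Instr add = { 4, 1, 5 };   /* r4 <- r1 + r5  : RAW on r1         */
    Instr sub = { 5, 6, 7 };   /* r5 <- r6 - r7  : WAR with 'add'    */

    printf("add after div: %s\n", hazard(div, add));   /* RAW */
    printf("sub after add: %s\n", hazard(add, sub));   /* WAR */
    return 0;
}
```

With in-order issue the WAR pair above would never be a problem; it is only when the sub is allowed to complete before the stalled add reads r5 that the hazard must be detected, which is exactly what scoreboarding and Tomasulo's algorithm handle.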
1.4 Trade-offs to consider when deciding between dynamic and static scheduling
As described above, the primary static scheduling technique uses the compiler to determine sets of operations that have their source operands ready and have no dependences within the set, so that they can be packed into one very long instruction (the VLIW approach), while the primary dynamic scheduling technique uses special hardware to find, at execution time, operations that are independent of all preceding operations and have their source operands available (the superscalar approach). Both of these techniques have advantages and disadvantages, and both have demonstrated significant problems that limit their achievable performance despite claims of strong peak performance.
Statically scheduled processors require that the latencies of all operations be fixed and known in advance. Because such processors do not support dynamic dependence detection, this restriction limits architectural changes to those which do not affect operation latencies. However, the absence of dynamic analysis hardware allows fast and simple implementations. Dynamically scheduled processors, by contrast, require a significant amount of hardware to perform the instruction-stream analysis, which scales as O(n²), where n is the number of operations being analysed. This analysis can rapidly become the critical path in the processor and can lead to an increase in the cycle time or the pipeline depth. However, the presence of dynamic analysis hardware means the compiler need not know anything about operation latencies, allowing significant flexibility in implementation changes and code-generation techniques. Although dynamic analysis is useful, in the typical superscalar processor a significant amount of redundant analysis is performed by both the compiler and the hardware; ignoring issues of code compatibility, this redundancy is inefficient and undesirable.
Dynamic scheduling executes instructions out of program order; it reduces RAW stalls and increases pipeline and functional-unit utilisation. Like loop unrolling, it increases the scheduling scope for more flexibility, reduces the impact of RAW hazards, and, through register renaming, removes the WAR and WAW violations that would otherwise result from reordering. Dynamic scheduling also offers performance portability: code does not have to be recompiled for each new machine. Hardware renaming gives dynamic scheduling more registers to work with, whereas the compiler is limited to the architectural registers and may not have enough of them to schedule well. For speculative reordering of memory operations, the compiler must be conservative, while the hardware can speculate. In practice the compiler, which has a larger scheduling scope, does as much as it can, and the rest is left to the hardware. In dynamic scheduling, hardware rearranges instruction execution to reduce stalls while maintaining data flow and exception behaviour, and it handles cases in which dependences are unknown at compile time.
It allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve. It also allows code compiled for one pipeline to run efficiently on a different pipeline, and it simplifies the compiler.
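The O(n²) cost of dynamic dependence analysis mentioned above can be made concrete with a small software sketch. The code below is not a description of real issue logic; it simply compares every operation in an n-entry window against every earlier operation, so the number of comparisons grows as n(n-1)/2. The register assignments in the example window are invented for illustration.

```c
#include <stdio.h>

typedef struct { int dest, src1, src2; } Op;

/* Does 'later' depend on 'earlier' through any register name? */
static int depends_on(Op later, Op earlier)
{
    return later.src1 == earlier.dest || later.src2 == earlier.dest ||  /* RAW */
           later.dest == earlier.src1 || later.dest == earlier.src2 ||  /* WAR */
           later.dest == earlier.dest;                                  /* WAW */
}

/* Compare every op in an n-entry window against every earlier op; the
 * number of comparisons is n*(n-1)/2, i.e. it grows as O(n*n) in the
 * window size.  Returns how many pairs turned out to be dependent. */
static int analyse_window(const Op *win, int n)
{
    int dependent_pairs = 0;
    for (int i = 1; i < n; i++)
        for (int j = 0; j < i; j++)
            if (depends_on(win[i], win[j]))
                dependent_pairs++;
    return dependent_pairs;
}

int main(void)
{
    Op window[4] = {
        { 1, 2, 3 },   /* r1 <- r2 op r3                          */
        { 4, 1, 5 },   /* r4 <- r1 op r5  (RAW on r1)             */
        { 6, 7, 8 },   /* independent of everything above         */
        { 1, 6, 9 },   /* r1 <- r6 op r9  (conflicts with all 3)  */
    };
    int n = 4;
    printf("%d comparisons, %d dependent pairs\n",
           n * (n - 1) / 2, analyse_window(window, n));
    return 0;
}
```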
1.5 Critical analysis of the use of dynamic and static scheduling for speculation
To speculate extensively, we must be able to identify and disambiguate memory references. This is difficult to do at compile time for integer programs that contain pointers. In a hardware-based scheme, dynamic run-time disambiguation of memory addresses is performed using the techniques of Tomasulo's algorithm, which allows loads to be moved past stores at run time. Support for speculative memory references can help overcome the conservatism of the compiler, but unless such approaches are used carefully, the overhead of the recovery mechanisms may swamp the advantages. Hardware-based speculation works better when control flow is unpredictable and when hardware-based branch prediction is superior to software-based branch prediction done at compile time; these properties hold for many integer programs. Because speculated instructions may slow down the computation when the prediction is incorrect, this difference in prediction accuracy is significant, and one result is that even statically scheduled processors normally include dynamic branch predictors. Hardware-based speculation maintains a completely precise exception model even for speculated instructions, and it does not require the compensation or bookkeeping code needed by ambitious software speculation mechanisms. Compiler-based approaches, on the other hand, may benefit from the ability to see further ahead in the code sequence, resulting in better code scheduling than a purely hardware-driven approach. Hardware-based speculation with dynamic scheduling does not require different code sequences to achieve good performance on different implementations of an architecture. Against these advantages stands a major disadvantage: supporting speculation in hardware is complex and requires additional hardware resources.
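A short example of the memory-disambiguation problem discussed above: when two pointers may refer to the same location, the compiler cannot safely move the load above the store, whereas hardware that compares the actual addresses at run time can. The function below is a hypothetical illustration, not taken from any particular benchmark.

```c
/* If p and q may point to the same location, the compiler cannot safely
 * hoist the load of *q above the store to *p at compile time.  Hardware
 * that compares the actual addresses at run time can move the load past
 * the store whenever the addresses differ. */
int update(int *p, int *q)
{
    *p = 42;         /* store through p                          */
    return *q + 1;   /* load through q: may or may not alias *p  */
}
```

In C, the restrict qualifier lets the programmer assert that such pointers never alias, which restores some of this freedom to the compiler; the hardware approach needs no such annotation.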
1.6 Conclusion
The relative merits of the static and dynamic scheduling approaches to exploiting ILP continue to be debated. Initially the two approaches were quite different, and the ability to manage the complexity of the hardware-intensive approach was in doubt; the development of several high-performance, high-clock-rate processors that use dynamic speculation has since eased those doubts. As both approaches have proven to have advantages, each has tended to incorporate techniques from the other. It remains unclear whether the two approaches will continue to move toward the middle, or whether a new architectural approach that truly combines the best of each will be developed.
References
____________________________________________________________________
[1] Hennessy, J. and Patterson, D. (2005) Computer Organization and Design: The Hardware/Software Interface, 3rd edition, Morgan Kaufmann.
[2] Hennessy, J. and Patterson, D. (2007) Computer Architecture: A Quantitative Approach, 4th edition, Morgan Kaufmann.
[3] Hesham, E.R. and Mostafa, A. (2005) Advanced Computer Architecture and Parallel Processing, John Wiley & Sons.
[4] Smith, M., Johnson, M. and Horowitz, M.A. (April 1989) "Limits on Multiple Instruction Issue", Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems.
[5] Princeton WordNet Web [Online]. Available from: http://wordnetweb.princeton.edu/perl/webwn?s=Computer+Architecture&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h= Accessed: 31 December 2012.