



# Optimization: What's the Point? (A Quick Review) Machine-Independent Optimizations: - e.g., constant propagation & folding, redundancy elimination, dead-code elimination, etc. - Goal: eliminate work Machine-Dependent Optimizations: - register allocation • Goal: reduce cost of accessing data - instruction scheduling • Goal: ??? - ...

















### Constraint #1: Hardware Resources • Processors have finite resources, and there are often constraints on how these resources can be used. Examples: - Finite issue width Limited functional units (FUs) per given instruction type Limited pipelining within a given functional unit (FU) Carnegie Mellon 15745: Instruction Scheduling

Carnegie Mellon











### Given Ambiguous Data Dependences, What To Do? x = a[i]; \*p = 1; y = \*q; \*r = z; • Conservative approach: don't reorder instructions - ensures correct execution - but may suffer poor performance • Aggressive approach? - is there a way to safely reorder instructions? Carnegie Mellon

### Why Data Dependences are Challenging x = a[i]; \*p = 1; y = \*q; \*r = z; \* which of these instructions can be reordered? • ambiguous data dependences are very common in practice - difficult to resolve, despite fancy pointer analysis [next lecture] Carnegie Mellon 15745: Instruction Scheduling 18 Carnegie Mellon



















### Representing Data Dependences: The Data Precedence Graph (DPG) · Two different kinds of edges: DPG Code true "edges": E (read-after-write) e = (10,11)'anti-edges": E' e' = (I1,I2) (write-after-read) e = (12,13) Why distinguish them? – do they affect scheduling differently? RAW: read waits for value to be computed WAR: write only needs to wait for read to start · What about output dependences? WAW: earlier write is removed by Dead Code Elimination 15745: Instruction Scheduling



### 

```
List Scheduling Algorithm
ready-list = root nodes in DPG; inflight-list = {};
while ((|ready-list|+|inflight-list| > 0) && an issue slot is available) {
   for op = (all nodes in ready-list in descending priority order) {
       if (an FU exists for op to start at cycle) {
          remove op from ready-list and add to inflight-list;
          add op to schedule at time cycle;
          if (op has an outgoing anti-edge)
              add all targets of op's anti-edges that are ready to ready-list;
   cycle = cycle + 1;
   for op = (all nodes in inflight-list)
       if (op finishes at time cycle) {
          remove op from inflight-list;
          check nodes waiting for op & add to ready-list if all operands available;
                                                                  Carnegie Mellon
15745: Instruction Scheduling
```

















### Contrasting Forward vs. Backward **List Scheduling** Forward Backward INT MEM Cycle MEM Cycle INT INT INT LDIb LDIc 1 LSL 1 LDId ADDa 2 LDIc 3 LDId ADDI LDIa CMP 5 ADDa 9 9 10 CMP 10 11 BR 11 · backward scheduling clusters work near the end · backward is better in this case, but this is not always true Carnegie Mellon 15745: Instruction Scheduling

## List Scheduling Wrap-Up The priority function can be arbitrarily sophisticated - e.g., filling branch delay slots in early RISC processors List scheduling is widely used, and it works fairly well It is limited, however, by basic block boundaries

### Evaluation of List Scheduling Cooper et al. propose "RBF" scheduling: - schedule each block M times forward & backward - break any priority ties randomly For real programs: - regular list scheduling works very well For synthetic blocks: - RBF wins when "available parallelism" (AP) is ~2.5 - for smaller AP, scheduling is too constrained - for larger AP, any decision tends to work well

### Efficient Instruction Scheduling for a Pipelined Architecture

Phillip B. Gibbons\* & Steven S. Muchnick\*\*

[My first publication. "PLDI" 1986]

Hewlett-Packard Laboratories 1501 Page Mill Road Palo Alto, CA 94304-1181

### Abstract

As part of an effort to develop an optimizing compiler for a pepileind architecture, a code reorganization algorithm has been developed that significantly reduces the number of runtime pipeline interlocks. In a pass after code generation, the algorithm uses a dag representation to heuristically schedule the instructions in each basic block.

Previous algorithms for reducing pipeline interlocks have had worst-case runtimes of at least  $O(n^4)$ . By using a dag representation which prevents scheduling deadlocks and a selection method that requires no lookahead, the resulting algorithm reorganizes instructions almost as effectively in practice, while having an  $O(n^3)$  worst-case runtime.

### 1. Introduction

The architecture we have studied has many features which

Fortunately, not all pairs of consecutive instructions cause pipeline hazards. In the architecture under consideration, the only hazards are register- and memory-based: 1) loading a register from memory followed by using that register as a source, 2) storing to any memory location followed by loading from my location, and 3) loading from memory followed by using ary register as the target of an arithmetic-flogical instruction or a load/store with address modification. Each of these pipeline hazards causes some potential implementation of the architecture to stall or inter-lock for one pipe cycle.

There are three approaches to reducing the number of pipeline interlocks incurred in executing a program, distinguished by the agent and the time when the code is inspected: either special hardware can do it during execution, or a person or software can do it before execution. The hardware approach has been used in the Control Data 6600 [Tho64] and the IBM 360/91 [Tom67], two of the fastest machines of their day. While reasonably effective, this approach is very expensive and can only span relatively short code sequences. Rymarczyk

### **Looking Ahead**

Monday: Pointer Analysis

- [ALSU 12.4, 12.6-12.7]
- Wednesday: Dynamic Code Optimization
- Friday: No class
- Following Monday & Wednesday: "Recent Research on Optimization"
  - Student-led discussions, in groups of 2, with 20 minutes/group
  - Read 3 papers on a topic, and lead a discussion in class

15-745: Instruction Scheduling

45

Carno