

# **High Performance CPU Cores**

**18-613**: Foundations of Computer Systems 5<sup>th</sup> Lecture, **Feb 26, 2019** 

**Instructor:** 

Franz Franchetti

Parts based on "15-740 Computer Architecture" by Nathan Beckmann



### **Midterm Exam: Mechanics**

### Pittsburgh (Sections A, B, and C)

- Part of 1x-x13 exam setup

  WEH Cluster, desktop computer, together with all others
- Time slot: March 5, 12pm EST we sign you up
- If you have a conflict you can start a bit earlier/later e-mail me ASAP

#### Silicon Valley (Section SA)

- Separate exam on your own laptop
  SV B23 118, bring charger for your laptop
- Time Slot (tentative)

  March 5, 4:30pm EST 7:30pm EST (max)

Sunday, March 3: Midterm Review (PIT and SV)



### Midterm Exam: 613 Extra Question

#### One extra question

Equally weighted with other 213 questions 12.5% of exam score

#### Three sub-questions

- State of the art in computing Covers main ideas from Lecture 2 (Multiple choice)
- C Language Tests your knowledge of C (Understand provided C code)
- Intel SIMD Extensions
  Fill holes in Intel SSE/AVX program (Instruction semantics provided)

Practice/example exam will be provided as PDF in Canvas



# From 213 Lecture 10: Superscalar Processor

- Definition: A superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.
- Benefit: without programming effort, superscalar processor can take advantage of the instruction level parallelism that most programs have
- Most modern CPUs are superscalar.
- Intel: since Pentium (1993)

# **Modern CPU Design**





### **Intel Haswell CPU**

- 8 Total Functional Units
- Multiple instructions can execute in parallel
  - 2 load, with address computation
  - 1 store, with address computation
  - 4 integer
  - 2 FP multiply
  - 1 FP add
  - 1 FP divide

#### Some instructions take > 1 cycle, but can be pipelined

| Instruction               | Latency | Cycles/Issue |
|---------------------------|---------|--------------|
| Load / Store              | 4       | 1            |
| Integer Multiply          | 3       | 1            |
| Integer/Long Divide       | 3-30    | 3-30         |
| Single/Double FP Multiply | 5       | 1            |
| Single/Double FP Add      | 3       | 1            |
| Single/Double FP Divide   | 3-15    | 3-15         |



## **Advanced Topics**

- Instruction encoding
- RISC vs. CISC
- Pipelining
- Out of order execution
- Intel Microarchitecture
- Latencies/Throughput, IACA tool



#### Intel® 64 and IA-32 Architectures Software Developer's Manual

Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4

NOTE: This document contains all four volumes of the Intel 64 and IA-32 Architectures Software Developer's Manual: Basic Architecture, Order Number 253665; Instruction Set Reference A-Z, Order Number 325383; System Programming Guide. Order Number 325384; Model-Specific Registers, Order Number 335592. Refer to all four volumes when evaluating your design needs.

> Order Number: 325462-069US January 2019



# 213 Lecture 5: Disassembling Object Code

#### Disassembled

```
0000000000400595 <sumstore>:
  400595:
           53
                             push
                                    %rbx
  400596: 48 89 d3
                                    %rdx,%rbx
                             mov
  400599:
           e8 f2 ff ff ff
                             callq 400590 <plus>
  40059e: 48 89 03
                                    %rax, (%rbx)
                             mov
  4005a1:
           5b
                                    %rbx
                             pop
  4005a2:
           c3
                             reta
```

#### Disassembler

```
objdump -d sum
```

- Useful tool for examining object code
- Analyzes bit pattern of series of instructions
- Produces approximate rendition of assembly code
- Can be run on either a .out (complete executable) or .o file

### **x86 ISA Evolution In Practice**

#### 40 years of x86 binary compatible, but 500x parallelism













1978/1979

8086/8088, 4.77/8 MHz 8087 non-IEEE FPU x86-16

#### 1985

80386, 12 MHz 80387 IEEE FPU x86-32

#### 1996

Pentium MMX 166 MHz x86-32 + MMX

#### 2004

Pentium 4F, 3.6 GHz Execution disable x86-64 + SSE3

#### 2011

Sandy Bridge, 3.6 GHz 2-8 cores, HD3000 GPU x86-64 + AVX

#### 2019

Xeon Platinum 8176F 28 cores, 3.8 GHz x86-64 + AVX-512

x86 is the abstraction, backwards-compatible ISA extensions necessary

### **x86 Instruction Format**

- Translation of byte-string to assembly instruction
- Instruction decoder needs to interpret the bytes
- A single instruction can be up to 15 bytes (longer -> core dump)

3. Some rare instructions can take an 8B immediate or 8B displacement.



Figure 2-1. Intel 64 and IA-32 Architectures Instruction Format

### **Instruction Format: Prefix**

- Group 1: LOCK, REPNE, BND
- Group 2: segment override, branch hints
- Group 3: operand size override
- Group 4: address size override
- 64-bit and SIMD: REX, VEX, EVEX





# **Instruction Format: Opcode**

- One, two or three bytes
- Opcode may spill 3 bit into ModR/M
- Fields in opcode can define behavior displacement size, register encoding, condition codes, sign extension
- SIMD instructions require various escape codes
   this carves out encoding space while keeping instruction length down
- GDB or disassembler decodes for us





# Instruction Format: ModR/M and SIB

- Mod: Addressing-form specifier
   32 values: 8 registers and 24 addressing modes
- Reg/Opcode: used to specify either register or opcode depends on primary Opcode field (previous slide)
- R/M: register or addressing mode
   may also be used for opcode information
- SIB: scale/index/base for certain addressing modes





# **Displacement and Immediate**

- Some addressing modes need displacements 1, 2 or 4 byte values
- Immediate
  Constant values go here: 1, 2, or 4 byte values





# What is missing so far?

- More than 8 registers
  REX prefix
- 64-bit immediate
  Semantics change of mov instruction: immediate -> load
- 3-operand AVX, AVX-512VEX and EVEX prefix



Figure 2-6. Memory Addressing With a SIB Byte



# An Example: ADD

#### ADD-Add

| Opcode           | Instruction                             | Op/<br>En | 64-bit<br>Mode | Compat/<br>Leg Mode | Description                                  |
|------------------|-----------------------------------------|-----------|----------------|---------------------|----------------------------------------------|
| 04 <i>ib</i>     | ADD AL, imm8                            | I         | Valid          | Valid               | Add imm8 to AL.                              |
| 05 iw            | ADD AX, imm16                           | I         | Valid          | Valid               | Add imm16 to AX.                             |
| 05 id            | ADD EAX, imm32                          | I         | Valid          | Valid               | Add imm32 to EAX.                            |
| REX.W + 05 id    | ADD RAX, imm32                          | I         | Valid          | N.E.                | Add imm32 sign-extended to 64-bits to RAX.   |
| 80 /0 ib         | ADD r/m8, imm8                          | MI        | Valid          | Valid               | Add imm8 to r/m8.                            |
| REX + 80 /0 ib   | ADD r/m8 <sup>*</sup> , imm8            | MI        | Valid          | N.E.                | Add sign-extended imm8 to r/m8.              |
| 81 /0 iw         | ADD r/m16, imm16                        | MI        | Valid          | Valid               | Add imm16 to r/m16.                          |
| 81 /0 id         | ADD r/m32, imm32                        | MI        | Valid          | Valid               | Add imm32 to r/m32.                          |
| REX.W + 81 /0 id | ADD r/m64, imm32                        | MI        | Valid          | N.E.                | Add imm32 sign-extended to 64-bits to r/m64. |
| 83 /0 ib         | ADD r/m16, imm8                         | MI        | Valid          | Valid               | Add sign-extended imm8 to r/m16.             |
| 83 /0 ib         | ADD r/m32, imm8                         | MI        | Valid          | Valid               | Add sign-extended imm8 to r/m32.             |
| REX.W + 83 /0 ib | ADD r/m64, imm8                         | MI        | Valid          | N.E.                | Add sign-extended imm8 to r/m64.             |
| 00 /r            | ADD r/m8, r8                            | MR        | Valid          | Valid               | Add r8 to r/m8.                              |
| REX + 00 /r      | ADD r/m8 <sup>*</sup> , r8 <sup>*</sup> | MR        | Valid          | N.E.                | Add r8 to r/m8.                              |
| 01 /r            | ADD r/m16, r16                          | MR        | Valid          | Valid               | Add r16 to r/m16.                            |
| 01 /r            | ADD r/m32, r32                          | MR        | Valid          | Valid               | Add r32 to r/m32.                            |
| REX.W + 01 /r    | ADD r/m64, r64                          | MR        | Valid          | N.E.                | Add r64 to r/m64.                            |
| 02 /r            | ADD r8, r/m8                            | RM        | Valid          | Valid               | Add r/m8 to r8.                              |
| REX + 02 /r      | ADD r8*, r/m8*                          | RM        | Valid          | N.E.                | Add r/m8 to r8.                              |
| 03 /r            | ADD r16, r/m16                          | RM        | Valid          | Valid               | Add r/m16 to r16.                            |
| 03 /r            | ADD r32, r/m32                          | RM        | Valid          | Valid               | Add r/m32 to r32.                            |
| REX.W + 03 /r    | ADD r64, r/m64                          | RM        | Valid          | N.E.                | Add r/m64 to r64.                            |

#### NOTES:

#### Complete specification: 2 full pages in the instruction manual

<sup>\*</sup>In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.



## **Advanced Topics**

- Instruction encoding
- RISC vs. CISC
- Pipelining
- Out of order execution
- Intel Microarchitecture
- Latencies/Throughput, IACA tool

### **Background:**

Nathan Beckmann: "15-740: Computer Architecture" <a href="https://www.cs.cmu.edu/afs/cs/academic/class/15740-f18/www/">https://www.cs.cmu.edu/afs/cs/academic/class/15740-f18/www/</a>



### RISC vs. CISC

### Complex Instruction Set Computing (CISC)

- variable length instructions: 1-321 bytes
- GP registers+special purpose registers+PC+SP+conditions
- Data: bytes to strings
- memory-memory instructions
- special instructions: e.g., crc, polyf, ...

### Reduced Instruction Set Computing (RISC)

- fixed length instructions: 4 bytes
- GP registers + PC
- load/store with few addressing modes



## The RISC Design Tenets

- Single-cycle execution
  CISC: many multicycle operations
- Hardwired (simple) control
  CISC: microcode for multi-cycle operations
- Load/store architecture

  CISC: register-memory and memory-memory
- Few memory addressing modes CISC: many modes
- Fixed-length instruction format CISC: many formats and lengths
- Reliance on compiler optimizations
  CISC: hand assemble to get good performance
- Many registers (compilers can use them effectively)
  CISC: few registers



# **Schools of ISA Design and Performance**

#### Complex instruction set computer (CISC)

- Complex instructions -> lots of work per instruction -> fewer instructions per program
- But... more cycles per instruction & longer clock period
- Modern  $\mu$ arch gets around most of this!

#### Reduced instruction set computer (RISC)

- Fine-grain instructions -> less work per instruction -> more instructions per program
- But... lower cycles per instruction & shorter clock period
- Heavy reliance on compiler to "do the right thing"

$$CPU\ Time = \frac{Instructions}{Program} \times \frac{Cycles}{Instruction} \times \frac{Seconds}{Cycle}$$



### Complexity: x86 vs. RISC-V



#### Intel® 64 and IA-32 Architectures Software Developer's Manual

Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4

NOTE: This document contains all four volumes of the Intel 64 and IA-32 Architectures Software Developer's Manual: Basic Architecture, Order Number 253665; Instruction Ser Reference A-Z, Order Number 325383; System Programming Guide, Order Number 325384; Model-Specific Registers, Order Number 335592. Refer to all four volumes when evaluating your design needs.

CISC: About 5,000 pages

Order Number: 325462-069US



RISC: 2 pages

### Intel's x86 Trick: RISC Inside

#### 1993: Intel wanted "out-of-order execution" in Pentium Pro

Hard to do with a coarse grain ISA like x86

### Solution? Translate x86 to RISC micro-ops internally (µops)

```
push $eax → store $eax, -4($esp)
addi $esp,$esp,-4
```

- Processor maintains x86 ISA externally for compatibility
- But executes RISC μISA internally for implementability
- Given translator, x86 almost as easy to implement as RISC
  - Intel implemented "out-of-order" before any RISC company!
  - "OoO" also helps x86 more (because ISA limits compiler)
- Different µops for different designs
  - Not part of the ISA specification 

    Implementation flexibility



### **Potential Micro-op Scheme**

#### Most instructions are a single micro-op

- Add, xor, compare, branch, etc.
- Loads example: mov -4(%rax), %ebx
- Stores example: mov %ebx, -4(%rax)

#### Each memory access adds a micro-op

- addl -4(%rax), %ebx is two micro-ops (load, add)
- addl %ebx, -4(%rax) is three micro-ops (load, add, store)

#### Function call (CALL) – 4 uops

 Get program counter, store program counter to stack, adjust stack pointer, unconditional jump to function start

#### Return from function (RET) – 3 uops

Adjust stack pointer, load return address from stack, jump register

#### Again, just a basic idea, micro-ops are specific to each chip



## **Advanced Topics**

- Instruction encoding
- RISC vs. CISC
- Pipelining
- Out of order execution
- Intel Microarchitecture
- Latencies/Throughput, IACA tool

### **Background:**

Nathan Beckmann: "15-740: Computer Architecture" <a href="https://www.cs.cmu.edu/afs/cs/academic/class/15740-f18/www/">https://www.cs.cmu.edu/afs/cs/academic/class/15740-f18/www/</a>

### Electrical & Computer ENGINEERING

# From 213 Lecture 10: Pipelining

```
long mult_eg(long a, long b, long c) {
   long p1 = a*b;
   long p2 = a*c;
   long p3 = p1 * p2;
   return p3;
}
```



|         | Time |     |     |     |       |       |       |
|---------|------|-----|-----|-----|-------|-------|-------|
|         | 1    | 2   | 3   | 4   | 5     | 6     | 7     |
| Stage 1 | a*b  | a*c |     |     | p1*p2 |       |       |
| Stage 2 |      | a*b | a*c |     |       | p1*p2 |       |
| Stage 3 |      |     | a*b | a*c |       |       | p1*p2 |

- Divide computation into stages
- Pass partial computations from stage to stage
- Stage i can start on new computation once values passed to i+1
- E.g., complete 3 multiplications in 7 cycles, even though each requires 3 cycles



# **Pipelining: A Closer Look**

#### **Unpipelined System**



Delay = 1.3 ns Throughput = 0.77 GHz

#### **3-Stage Pipeline**



#### **Deep Pipeline**



Clock



# **Limitation: Sequential Dependences**





- Op4 gets result from Op3!



# **Intel Pipelines**

| Year | Micro-architecture                          | Pipeline stages                         | max. Clock | Tech process |
|------|---------------------------------------------|-----------------------------------------|------------|--------------|
| 1989 | <u>486</u> (80486)                          | 3                                       | 100 MHz    | 1000 nm      |
| 1993 | P5 (Pentium)                                | 5                                       | 300 MHz    | 600 nm       |
| 1995 | P6 (Pentium Pro; later Pentium II)          | 14 (17 with load & store/retire)        | 450 MHz    | 350 nm       |
| 1999 | P6 (Pentium III) (Copper Mine)              | 12 (15 with load & store/retire)        | 1400 MHz   | 250 nm       |
| 2000 | NetBurst (Pentium 4) (Willamette)           | 20                                      | 2000 MHz   | 180 nm       |
| 2002 | NetBurst (Pentium 4), (Northwood, Gallatin) | 20 unified with branch prediction       | 3466 MHz   | 130 nm       |
| 2003 | Pentium M                                   | 10 (12 with fetch/retire)               | 2133 MHz   | 130 nm       |
| 2004 | NetBurst (Pentium 4) ( <u>Prescott</u> )    | 31 unified with branch prediction       | 3800 MHz   | 90 nm        |
| 2006 | <u>Intel Core</u>                           | 12 (14 with fetch/retire)               | 3000 MHz   | 65 nm        |
| 2007 | <u>Penryn</u>                               |                                         | 3333 MHz   |              |
| 2000 | <u>Nehalem</u>                              | 20 unified (14 without miss prediction) | 3600 MHz   | 45 nm        |
| 2008 | <u>Bonnell</u>                              | 16 (20 with prediction miss)            | 2100 MHz   |              |
| 2010 | Westmere                                    | 20 unified (14 without miss prediction) | 3730 MHz   |              |
| 2011 | <u>Saltwell</u>                             | 16 (20 with prediction miss)            | 2130 MHz   | 32 nm        |
| 2011 | Sandy Bridge                                | 14 (16; th f-t-t /ti)                   | 4000 MHz   |              |
| 2012 | Ivy Bridge                                  | 14 (16 with fetch/retire)               | 4100 MHz   |              |
| 2012 | <u>Silvermont</u>                           | 14-17 (16-19 with fetch/retire)         | 2670 MHz   | 22 nm        |
| 2013 | <u>Haswell</u>                              | 14/15: 1 5-1-1 /1:>                     | 4400 MHz   |              |
| 2014 | <u>Broadwell</u>                            | 14 (16 with fetch/retire)               | 3700 MHz   |              |
| 2015 | <u>Airmont</u>                              | 14-17 (16-19 with fetch/retire)         | 2640 MHz   |              |
| 2015 | <u>Skylake</u>                              | 14 (16 with fetch/retire)               | 4200 MHz   |              |
| 2016 | Goldmont                                    | 20 unified with branch prediction       | 2600 MHz   | 14 nm        |
| 2016 | Kaby Lake                                   | 14/46 with fatal (nation)               | 4500 MHz   |              |
| 2017 | Coffee Lake                                 | 14 (16 with fetch/retire)               | 5000 MHz   |              |
| 2017 | Goldmont Plus                               | ? 20 unified with branch prediction?    | 2800 MHz   |              |
| 2010 | <u>Cannon Lake</u>                          | 44/45 31 5 1 1 / 13 )                   | 3200 MHz   | 10 nm        |
| 2018 | Whiskey Lake                                | 14 (16 with fetch/retire)               | 4600 MHz   | 14 nm        |



## **Advanced Topics**

- Instruction encoding
- RISC vs. CISC
- Pipelining
- Out of order execution
- Intel Microarchitecture
- Latencies/Throughput, IACA tool

### **Background:**

Nathan Beckmann: "15-740: Computer Architecture" <a href="https://www.cs.cmu.edu/afs/cs/academic/class/15740-f18/www/">https://www.cs.cmu.edu/afs/cs/academic/class/15740-f18/www/</a>



### **Out-of-Order Execution**

### **In-order pipeline**



### **Out-of-order pipeline**



In-order commit



## **Problem: Dependencies**

#### Data Dependencies

- RAW: Read after write. (true dependence)
- WAR: write after read (false dependence)
- WAW: write after write (false dependence)

```
loop: ld
                 (r2+r9*8)
      ld
                 r4+r9*8)
     mult
     mult
      add
           r5
                 (r6+r9*8)
      st
      add
           r9 r9
                                    Alpha ASM Syntax:
      cmp
                                     inst dst, src
     bnz
            loop
```

Out-of-order execution: execute as soon as RAW allows



# Register Renaming

- "Architected" vs "Physical" registers level of indirection
  - Names: r1, r2, r3
  - Locations: p1, p2, p3, p4, p5, p6, p7
  - Original mapping:  $r1 \rightarrow p1$ ,  $r2 \rightarrow p2$ ,  $r3 \rightarrow p3$ , p4-p7 are "available"

| N/126 1261/ |   |
|-------------|---|
|             |   |
| MapTable    |   |
| i iapiabi   | _ |

| r1 | r2         | r3         |
|----|------------|------------|
| p1 | p2         | р3         |
| p4 | <b>p</b> 2 | р3         |
| p4 | p2         | <b>p</b> 5 |
| p4 | p2         | <b>p</b> 6 |

FreeList

| p4,p5,p6,p7 |
|-------------|
| p5,p6,p7    |
| p6,p7       |
| p7          |

Original insns

add 
$$r2,r3 \rightarrow r1$$
 add  $p2,p3 \rightarrow p4$   
sub  $r2,r1 \rightarrow r3$  sub  $p2,p4 \rightarrow p5$   
mul  $r2,r3 \rightarrow r3$  mul  $p2,p5 \rightarrow p6$   
div  $r1,4 \rightarrow r1$  div  $p4,4 \rightarrow p7$ 

Renamed insns

- Renaming conceptually write each register once
  - + Removes false dependences
  - + Leaves true dependences intact!
- When to reuse a physical register? After overwriting is done

Register renaming: resolves false dependencies

# **Out Of Order: Dynamic Scheduling**

#### Ready Table

| P2  | Р3                | P4                            | P5                      | P6                          | P7      |
|-----|-------------------|-------------------------------|-------------------------|-----------------------------|---------|
| Yes | Yes               |                               |                         |                             |         |
| Yes | Yes               | Yes                           |                         |                             |         |
| Yes | Yes               | Yes                           | Yes                     |                             | Yes     |
| Yes | Yes               | Yes                           | Yes                     | Yes                         | Yes     |
|     | Yes<br>Yes<br>Yes | Yes Yes<br>Yes Yes<br>Yes Yes | Yes Yes Yes Yes Yes Yes | Yes Yes Yes Yes Yes Yes Yes | Yes Yes |

add p2,p3
$$\rightarrow$$
p4  
sub p2,p4 $\rightarrow$ p5 and div p4,4 $\rightarrow$ p7  
mul p2,p5 $\rightarrow$ p6

- Instructions fetch/decoded/renamed into re-order buffer (ROB)
- Check which instructions are ready and execute earliest ready instruction

Dynamic Scheduling: instruction level parallelism and throughput



### **Branch Prediction**

- When encounter conditional branch, cannot determine where to continue fetching
  - Branch Taken: Transfer control to branch target
  - Branch Not-Taken: Continue with next instruction in sequence
- Cannot resolve until outcome determined by branch/integer unit





# **Branch Misprediction Invalidation**

```
Assume
401029:
         vmulsd (%rdx),%xmm0,%xmm0
40102d:
         add
                 $0x8,%rdx
                                           vector length = 100
401031:
                 %rax,%rdx
         cmp
                              i = 98
401034:
                 401029
         ine
                                           Predict Taken (OK)
401029:
         vmulsd (%rdx),%xmm0,%xmm0
40102d:
         add
                 $0x8,%rdx
401031:
                 %rax,%rdx
         cmp
                              i = 99
401034:
                 401029
         ine
                                           Predict Taken
                                           (Oops)
         vmulsd (%rdx),%xmm0,%xmm0
401029:
40102d:
         add
                 $0x8,%rdx
401031:
                 %rax,%rdx
         cmp
                              i = 100
401034:
                 401029
          ine
                                               Invalidate
401029:
         vmulsd (%rdx).%xmm0.%xmm0
401024.
         add
                 $0x8 %rdx
401031 •
                 %rax %rdx
          CMD
401034 •
                 101029
         ine
```

### **Branch Predictor**





#### **Advanced Topics**

- Instruction encoding
- RISC vs. CISC
- Pipelining
- Out of order execution
- Intel Microarchitecture
- Latencies/Throughput, IACA tool

## **Intel Skylake**





## **Intel Skylake Server: AVX-512**





#### **Intel Pipeline Example**





#### **Intel Microarchitectural Features**

- Superscalar out of order engine 8 functional units (port 0—7)
- Hyperthreading
   2 threads per core, 2 register files but only one set of functional units
- Zero idioms, move elimination recognize and optimize common simple operations
- Decoded Icache
  Store decoded micro-ops
- Store-to-load forwarding
   Data reused before stored to cache
- Loop stream detector detect loops post-decode (around 28 uops, 8 branches)
- Instruction ROM microcode (uop sequences) for less frequent instructions
- Micro-fusion, macro-fusion and uop unlamination combine or split micro-ops for efficiency and throughput



## Skylake Desktop CPU Die Shot



https://en.wikichip.org/wiki/intel/microarchitectures/skylake\_(client)



#### **Skylake By The Numbers**

- Decoder/Frontend5 uop/cycle decoder1.5k uop cache (6uop/cycle)
- OOO execution engine
   224 entry reorder buffer,
   about 350 uop
   97 entry scheduler
   48 entry branch order buffer
- Physical register file180 integer register168 vector registers
- Buffers
   store buffer: 56 entries
   load buffer: 72 entries
   Line fill buffer: 10 entries
- Power gating 10,000 cycles to wake up AVX2



https://en.wikichip.org/wiki/intel/microarchitectures/skylake\_(client)



## **Advanced Topics**

- Instruction encoding
- RISC vs. CISC
- Pipelining
- Out of order execution
- Intel Microarchitecture
- Latencies/Throughput, IACA tool



Intel® 64 and IA-32 Architectures
Optimization Reference Manual

Order Number: 248966-040 April 2018



# **Instruction Latency/Throughput Table**

| Instruction                        | Latency <sup>1</sup> |                 |                    | Throughput   |                 |                 |                    |              |
|------------------------------------|----------------------|-----------------|--------------------|--------------|-----------------|-----------------|--------------------|--------------|
| DisplayFamily_DisplayModel         | 06_4E,<br>06_5E      | 06_3D/<br>47/56 | 06_3C/4<br>5/46/3F | 06_3A<br>/3E | 06_4E,<br>06_5E | 06_3D/<br>47/56 | 06_3C/4<br>5/46/3F | 06_3A<br>/3E |
| VADDPD/PS ymm1, ymm2, ymm3         | 4                    | 3               | 3                  | 3            | 0.5             | 1               | 1                  | 1            |
| VADDSUBPD/PS ymm1, ymm2, ymm3      | 4                    | 3               | 3                  | 3            | 0.5             | 1               | 1                  | 1            |
| VANDNPD/PS ymm1, ymm2, ymm3        | 1                    | 1               | 1                  | 1            | 0.33            | 1               | 1                  | 1            |
| VANDPD/PS ymm1, ymm2, ymm3         | 1                    | 1               | 1                  | 1            | 0.33            | 1               | 1                  | 1            |
| VBLENDPD/PS ymm1, ymm2, ymm3, imm  | 1                    | 1               | 1                  | 1            | 0.33            | 0.33            | 0.33               | 0.5          |
| VBLENDVPD/PS ymm1, ymm2, ymm3, ymm | 1                    | 2               | 2                  | 1            | 1               | 2               | 2                  | 1            |
| VCMPPD/PS ymm1, ymm2, ymm3         | 4                    | 3               | 3                  | 3            | 0.5             | 1               | 1                  | 1            |

| DisplayFamily_DisplayModel     | Recent Intel Microarchitectures       |  |  |
|--------------------------------|---------------------------------------|--|--|
| 06_4EH, 06_5EH                 | Skylake microarchitecture             |  |  |
| 06_3DH, 06_47H, 06_56H         | Broadwell microarchitecture           |  |  |
| 06_3CH, 06_45H, 06_46H, 06_3FH | Haswell microarchitecture             |  |  |
| 06_3AH, 06_3EH                 | Ivy Bridge microarchitecture          |  |  |
| 06_2AH, 06_2DH                 | Sandy Bridge microarchitecture        |  |  |
| 06_25H, 06_2CH, 06_2FH         | Intel microarchitecture Westmere      |  |  |
| 06_1AH, 06_1EH, 06_1FH, 06_2EH | Intel microarchitecture Nehalem       |  |  |
| 06_17H, 06_1DH                 | Enhanced Intel Core microarchitecture |  |  |
| 06_0FH                         | Intel Core microarchitecture          |  |  |



# **Cache Sizes/Latencies/Throughput**

| Cache level                 | Category                           | Broadwell<br>Microarchitecture | Skylake Server<br>Microarchitecture |  |
|-----------------------------|------------------------------------|--------------------------------|-------------------------------------|--|
| L1 Data Cache<br>Unit (DCU) | Size [KB]                          | 32                             | 32                                  |  |
|                             | Latency [cycles]                   | 4-6                            | 4-6                                 |  |
|                             | Max bandwidth [bytes/cycles]       | 96                             | 192                                 |  |
|                             | Sustained bandwidth [bytes/cycles] | 93                             | 133                                 |  |
|                             | Associativity [ways]               | 8                              | 8                                   |  |
| L2 Mid-level Cache          | Size [KB]                          | 256                            | 1024 (1MB)                          |  |
| (MLC)                       | Latency [cycles]                   | 12                             | 14                                  |  |
|                             | Max bandwidth [bytes/cycles]       | 32                             | 64                                  |  |
|                             | Sustained bandwidth [bytes/cycles] | 25                             | 52                                  |  |
|                             | Associativity [ways]               | 8                              | 16                                  |  |
| L3 Last-level               | Size [MB]                          | Up to 2.5 per core             | up to 1.375 <sup>1</sup> per core   |  |
| Cache (LLC)                 | Latency [cycles]                   | 50-60                          | 50-70                               |  |
|                             | Max bandwidth [bytes/cycles]       | 16                             | 16                                  |  |
|                             | Sustained bandwidth [bytes/cycles] | 14                             | 15                                  |  |



## Intel Architecture Code Analyzer (IACA)

```
Intel(Ř) Architecture Code Analyzer Version – 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:<u>25:20 +0200)</u>
Analyzed File -
Binary Format - 64Dit
Architecture - BDW
Analysis Type - Inroughput
Throughput Analysis Report
Block Throughput: 5.00 Cycles
                                   Throughput Bottleneck: Port0. Port1. Port5
Port Binding In Cycles Per Iteration:
 Cycles | 5.0
                     15.0 10.0
                                       0.0 : 0.0
                                                     0.0 | 0.0 | 5.0 | 0.0 | 0.0
                  0.0
 - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
 - Data fetch pipe (on ports 2 and 3), CP - on a critical path
 - Macro Fusion with the previous instruction occurred
 - instruction micro-ops not bound to a port
 - Micro Fusion happened
 - ESP Tracking sync uop was issued
 - SSE instruction followed an AUX256/AUX512 instruction, dozens of cycles penalty is expected
 - instruction not supported, was not accounted in Analysis
 Num Of 1
                              Ports pressure in cycles
        | 0 - DU | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7
  Uops
         1 1.0
                                                                             CP | vfmadd231ps ymm0, ymm0, ymm0
                                                                                  vfmadd231ps ymm1, ymm1, ymm1
                      1.0
                                                                             CP
          1.0
                                                                             \mathbf{CP}
                                                                                  vfmadd231ps ymm2, ymm2, ymm2
                                                                                  vfmadd231ps ymm3, ymm3, ymm3
                      1.0
                                                                                  vfmadd231ps ymm4, ymm4, ymm4
         1 1.0
                      1.0
                                                                             CP
                                                                                  vfmadd231ps ymm5, ymm5, ymm5
          1.0
                                                                                  vfmadd231ps ymm6, ymm6, ymm6
                                                                                  vfmadd231ps ymm7, ymm7, ymm7
                                                                             CP
   1111
                      1.0
                                                                                  vfmadd231ps ymm8, ymm8, ymm8
          1.0
                                                                                  vfmadd231ps ymm9, ymm9, ymm9
                      1.0
                                                                                  vpaddd ymm10, ymm10, ymm10
                                                           1.0
                                                                             CP
                                                                                  vpaddd ymm11, ymm11, ymm11
                                                           1.0
                                                                             CP
                                                                                  vpaddd ymm12, ymm12, ymm12
                                                                             CP
                                                           1.0
                                                                                  vpaddd ymm13, ymm13, ymm13
                                                                                  vpaddd ymm14, ymm14, ymm14
     Num Of Uops: 15
```



#### Summary

#### What you think is happening

| 000000000 | 400595     | <sumstore< th=""><th>&gt;:</th><th></th></sumstore<> | >:    |                      |
|-----------|------------|------------------------------------------------------|-------|----------------------|
| 400595:   | 53         |                                                      | push  | %rbx                 |
| 400596:   | 48 89      | d3                                                   | mov   | %rdx,%rbx            |
| 400599:   | e8 f2      | ff ff ff                                             | callq | 400590 <plus></plus> |
| 40059e:   | 48 89      | 03                                                   | mov   | %rax,(%rbx)          |
| 4005a1:   | 5b         |                                                      | pop   | %rbx                 |
| 4005a2:   | <b>c</b> 3 |                                                      | retq  |                      |



Instructions are executed one by one

#### What really happens



Only way to know: timing of code