AMD’s Next Generation Microprocessor Architecture

Fred Weber

October 2001
"Hammer" Goals

• Build a next-generation system architecture which serves as the foundation for future processor platforms

• Enable a full line of server and workstation products
  – Leading edge x86 (32-bit) performance and compatibility
  – Native 64-bit support
  – Establish x86-64 Instruction Set Architecture
  – Extensive Multiprocessor support
  – RAS features

• Provide top-to-bottom desktop and mobile processors
Agenda

- x86-64™ Technology
- "Hammer" Architecture
- "Hammer" System Architecture
x86-64™ Technology
Why 64-Bit Computing?

- **Required for large memory programs**
  - Large databases
  - Scientific and Engineering Problems
    - Designing CPUs 😊
  
- **But,**
  - Limited Demand for Applications which require 64 bits
    - Most applications can remain 32-bit x86 code, as long as the processor continues to deliver leading-edge x86 performance
  
- **And,**
  - Software is a huge investment (tool chains, applications, certifications)
  - Instruction set is first and foremost a vehicle for compatibility
    - Binary compatibility
    - Interpreter/JIT support is increasingly important
x86-64 Instruction Set Architecture

• **x86-64 mode built on x86**
  - Similar to the previous extension from 16-bit to 32-bit
  - Vast majority of opcodes and features unchanged
  - Integer/Address register files and datapaths are native 64-bit
  - 48-Bit Virtual Address Space, 40-Bit Physical Address Space

• **Enhancements**
  - Add 8 new integer registers
  - Add PC relative addressing
  - Add full support for an SSE/SSE2-based floating-point Application Binary Interface (ABI)
    - including 16 registers
  - Additional registers and data size are added by reclaiming the one-byte increment/decrement opcodes (0x40-0x4F) for use as a single optional prefix

• **Public specification**
  - www.x86-64.org
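
As an illustration of the reclaimed 0x40-0x4F opcode range, the sketch below decodes such a prefix byte. The bit layout (high nibble 0100, followed by four flag bits usually labeled W, R, X, and B) follows the published x86-64 specification, where this byte is known as the REX prefix. The names `RexPrefix` and `decode_rex` are hypothetical and exist only for this example.

```python
from typing import NamedTuple, Optional


class RexPrefix(NamedTuple):
    w: bool  # set: 64-bit operand size
    r: bool  # extends the ModRM.reg field (reaches registers 8-15)
    x: bool  # extends the SIB.index field
    b: bool  # extends ModRM.rm, SIB.base, or the opcode register field


def decode_rex(byte: int) -> Optional[RexPrefix]:
    """Return the decoded prefix, or None if the byte is outside 0x40-0x4F."""
    if not 0x40 <= byte <= 0x4F:
        return None
    return RexPrefix(
        w=bool(byte & 0x08),
        r=bool(byte & 0x04),
        x=bool(byte & 0x02),
        b=bool(byte & 0x01),
    )


# 0x48 sets only W, selecting a 64-bit operand size.
print(decode_rex(0x48))
```

Because the whole 16-value range is one optional prefix, a decoder only needs this single range check before falling through to the unchanged x86 opcode map.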
x86-64 Programmer’s Model

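[Figure: x86-64 programmer's model, with a legend distinguishing state already "In x86" from state "Added by x86-64". Labels shown: RAX with its sub-registers EAX, AH, and AL; general-purpose registers (GPR) labeled EAX, EDI, and R8; and the program counter, EIP.]
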
• Compiler and Tool Chain is a straightforward port
• Instruction set is designed to offer all the advantages of CISC and RISC
  – Code density of CISC
  – Register usage and ABI models of RISC
  – Enables easy application of standard compiler optimizations
• SPECint2000 Code Generation (compared to 32-bit x86)
  – Code size grows <10%
    • Due mostly to instruction prefixes
  – Static Instruction Count **shrinks** by 10%
  – Dynamic Instruction Count **shrinks** by at least 5%
  – Dynamic Load/Store Count **shrinks** by 20%
  – All without any specific code optimizations
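
Taken together, the code-size and static-instruction-count figures bound how much the average instruction grows. The following is a back-of-the-envelope sketch using only the two percentages quoted above; the variable names are illustrative, and "<10%" is treated as exactly 10% to get an upper bound.

```python
code_size_growth = 0.10       # upper bound from the slide ("<10%")
static_count_change = -0.10   # static instruction count shrinks by 10%

# bytes per instruction = code size / instruction count
bytes_per_instr_ratio = (1 + code_size_growth) / (1 + static_count_change)

print(f"average instruction length grows by at most "
      f"{bytes_per_instr_ratio - 1:.0%}")
```

So the same program is expressed in fewer, somewhat longer instructions, which is consistent with the growth being attributed mostly to prefixes.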
x86-64™ Summary

- **Processor is fully x86 capable**
  - Full native performance with 32-bit applications and OS
  - Full compatibility (BIOS, OS, Drivers)

- **Flexible deployment**
  - Best-in-class 32-bit, x86 performance
  - Excellent 64-bit, x86-64 instruction execution when needed

- **Server, Workstation, Desktop, and Mobile share same architecture**
  - OS, Drivers and Applications can be the same
  - CPU vendor's focus is not split; ISV focus is not split
  - Support, optimization, etc. all designed to be the same
The "Hammer" Architecture

- L1 Instruction Cache
- L1 Data Cache
- L2 Cache
- "Hammer" Processor Core
- DDR Memory Controller

HyperTransport™
Processor Core Overview

Level 2 Cache
- L2 ECC
- L2 Tags
- L2 Tag ECC

Instr’n TLB
Level 1 Instr’n Cache

Data TLB
Level 1 Data Cache
- ECC
"Hammer" Pipeline

1. Fetch
2. Exec
3. L2
4. DRAM
Fetch/Decode Pipeline

Fetch

Fetch 1
Fetch 2
Pick
Decode 1
Decode 2
Pack
Pack/Decode
Execute Pipeline

- Dispatch
- Schedule
- AGU/ALU
- Data Cache 1
- Data Cache 2

Timeline:
- Exec: 1 ns
- Sequential Fetch
- Predicted Fetch
- Branch Target Address Calculator Fetch
- Mispredicted Fetch
Large Workload TLBs

[Figure: TLB and table-walk block diagram. Recoverable labels:]

- L1 Instruction TLB: 40-entry, fully associative, 4M/2M and 4k pages
- L1 Data TLB (Port 0 and Port 1): 40-entry, fully associative, 4M/2M and 4k pages
- L2 Data TLB: 512-entry, 4-way associative
- Page Descriptor Cache (PDC): 24-entry, PDP and PDE
- Flush Filter CAM: 32-entry, with ASN and Current ASN
- TLB entries: ASN, VA, PA
- Other labels: CR3, PDP, PDE; Probe Modify; TLB Reload; PDC Reload; L2 Data Cache
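
To give a sense of what these TLB sizes mean for large workloads, the sketch below computes TLB "reach", the amount of memory that can be mapped without a table walk, as entries times page size. The entry counts come from the slide. Two things are assumptions made for illustration: the slide does not name the L2 TLB page size, so 4k is used for that case, and the helper name `tlb_reach` is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every entry maps a distinct page."""
    return entries * page_size


l1_reach_4k = tlb_reach(40, 4 * KIB)    # L1 TLB, 4k pages
l1_reach_2m = tlb_reach(40, 2 * MIB)    # L1 TLB, 2M pages
l2_reach_4k = tlb_reach(512, 4 * KIB)   # L2 TLB, assuming 4k pages

print(f"L1, 4k pages: {l1_reach_4k // KIB} KiB")
print(f"L1, 2M pages: {l1_reach_2m // MIB} MiB")
print(f"L2, 4k pages: {l2_reach_4k // MIB} MiB")
```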
• **Integrated Memory Controller Details**
  - Memory controller details
    • 8 or 16-byte interface
    • 16-Byte interface supports
      - Direct connection to 8 registered DIMMs
      - Chipkill ECC
    • Unbuffered or Registered DIMMs
    • PC1600, PC2100, and PC2700 DDR memory

• **Integrated Memory Controller Benefits**
  - Significantly reduces DRAM latency
  - Memory latency improves as CPU and HyperTransport™ link speeds improve
  - Bandwidth and capacity grows with number of CPUs
  - Snoop probe throughput scales with CPU frequency
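
The peak bandwidth implied by the supported memory types and interface widths can be worked out directly. The sketch below is an approximation under stated assumptions: the transfer rates are the standard DDR rates behind the PC1600, PC2100, and PC2700 module names (they are not given on the slide), and the function name is made up for this example.

```python
# Millions of transfers per second for each module type (standard DDR rates,
# not taken from the slide).
DDR_RATES_MT_S = {
    "PC1600": 200.0,
    "PC2100": 266.0 + 2.0 / 3.0,
    "PC2700": 333.0 + 1.0 / 3.0,
}


def peak_bandwidth_gb_s(module: str, interface_bytes: int) -> float:
    """Theoretical peak in GB/s: transfers per second times bytes per transfer."""
    return DDR_RATES_MT_S[module] * 1e6 * interface_bytes / 1e9


for module in DDR_RATES_MT_S:
    for width in (8, 16):
        bandwidth = peak_bandwidth_gb_s(module, width)
        print(f"{module}, {width}-byte interface: {bandwidth:.1f} GB/s")
```

Since each processor brings its own controller, these per-processor figures add up as processors are added, which is the scaling claim made in the benefits list.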
Reliability and Availability

- L1 Data Cache ECC Protected
- L2 Cache and Cache Tags ECC Protected
- DRAM ECC Protected
  - With Chipkill ECC support
- On-chip and off-chip ECC-protected arrays include background hardware scrubbers
- Remaining arrays parity protected
  - L1 Instruction Cache, TLBs, Tags
  - Generally read-only data, which can be recovered
- Machine Check Architecture
  - Report failures and predictive failure results
  - Mechanism for hardware/software error containment and recovery
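
The reason parity is sufficient for read-only arrays can be shown in a few lines: parity only detects an error, but clean data can simply be fetched again from the next level. This is a minimal software sketch of that idea, not a description of the hardware; `parity_bit`, `read_with_parity`, and the `refetch` callback are hypothetical names.

```python
from typing import Callable


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 when the word has an odd number of set bits."""
    return bin(word).count("1") & 1


def read_with_parity(stored_word: int, stored_parity: int,
                     refetch: Callable[[], int]) -> int:
    """Return the stored word, or refetch it when the parity check fails."""
    if parity_bit(stored_word) == stored_parity:
        return stored_word
    # Single-bit error detected: the array holds a clean copy of data that
    # lives elsewhere, so recover by reloading rather than correcting.
    return refetch()


good = 0b10110010
parity = parity_bit(good)
corrupted = good ^ 0b100          # flip one bit
recovered = read_with_parity(corrupted, parity, refetch=lambda: good)
print(recovered == good)
```

Arrays that hold modified data cannot be recovered this way, which is why the data cache, L2, and DRAM carry ECC instead.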
• Next-generation computing performance goes beyond the microprocessor

• Screaming I/O for chip-to-chip communication
  - High bandwidth
  - Reduced pin count
  - Point-to-point links
  - Split transaction and full duplex

• Open standard
  - Industry enabler for building high bandwidth I/O subsystems
  - I/O subsystems: PCI-X, Gigabit Ethernet, InfiniBand, etc.

• Strong Industry Acceptance
  - 100+ companies evaluating the specification, with several licensing the technology through AMD (2000)
  - First HyperTransport technology-based south bridge announced by nVIDIA (June 2001)

• Enables scalable 2-8 processor SMP systems
  - Glueless MP
HT* = HyperTransport™ technology
HB = Host Bridge
Northbridge Overview

- System Request Queue (SRQ)
- Advanced Programmable Interrupt Controller (APIC)
- Crossbar (XBAR)
- Memory Controller (MCT)
- DRAM Controller (DCT)

- HyperTransport Link 0
- HyperTransport Link 1
- HyperTransport Link 2

- 64-bit Data
- 64-bit Command/Address
- 16-bit Data/Command/Address

- CPU 0 Data
- CPU 1 Data
- CPU 0 Probes
- CPU 1 Probes
- CPU 0 Requests
- CPU 1 Requests

- CPU 0 Int
- CPU 1 Int

- DRAM Data
- RAS/CAS/Cntl
Northbridge Command Flow

CPU 0
- Victim Buffer (8-entry)
- Write Buffer (4-entry)
- Instruction MAB (2-entry)
- Data MAB (8-entry)

CPU 1
- System Request Queue 24-entry
- Address MAP & GART
- HyperTransport™ Link 0 Input
- HyperTransport Link 1 Input
- HyperTransport Link 2 Input
- Memory Command Queue 20-entry

XBAR
- HyperTransport Link 0 Output
- HyperTransport Link 1 Output
- HyperTransport Link 2 Output

All buffers are 64-bit command/address
Northbridge Data Flow

All buffers are 64-byte cache lines

HyperTransport™ Link 0 input
HyperTransport Link 1 input
HyperTransport Link 2 input

8-entry Buffer
8-entry Buffer
8-entry Buffer

HyperTransport Link 0 output
HyperTransport Link 1 output
HyperTransport Link 2 output

System Request Data Queue 12-entry

Memory Data Queue 8-entry

to CPU
to Host Bridge
to DCT
Coherent HyperTransport™ Read Request

[Figure sequence, Steps 1 to 9: a four-processor system (CPU 0 to CPU 3, with attached memory and I/O) servicing a cache-line read. Message labels accumulate from frame to frame; only the labels new to each frame are listed.]

- Step 1: captions: Read Cache Line
- Step 2: captions: Read Cache Line; labels: 1: RdBlk
- Step 3: captions: Read Cache Line, Probe Request 0, Probe Request 2, Probe Request 3; labels: 2: RdBlk
- Step 4: captions: Probe Request 1, Probe Response 3, Probe Request 3; labels: 3: RdBlk, 3: PRQ0, 3: PRQ2, 3: PRQ3
- Step 5: captions: Probe Response 0, Probe Response 3, Read Response; labels: 4: PRQ1, 4: TRSP3
- Step 6: captions: Read Response, Probe Response 2; labels: 5: RDRSP, 5: TRSP3
- Step 7: captions: Read Response; labels: 5: TRSP0
- Step 9: no labels
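
The read-request sequence pictured in these steps can be approximated in code. The sketch below is a simplified model, not the protocol definition: it assumes the requester sends a read-block request to the node that owns the memory, that node probes every CPU, each probed CPU responds to the requester, and the owning node returns the data. Message names reuse the diagram labels (RdBlk, PRQ, TRSP, RDRSP), where TRSP appears to correspond to the "Probe Response" captions. The class, the choice of CPU 3 as requester and CPU 1 as home node, and the omission of hop-by-hop forwarding are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ReadTransaction:
    requester: int            # CPU issuing the read
    home: int                 # CPU whose memory controller owns the line
    num_cpus: int = 4
    log: list = field(default_factory=list)

    def run(self) -> bool:
        probe_responses = set()
        # Requester sends a read-block request to the home node.
        self.log.append(("RdBlk", self.requester, self.home))
        # Home node sends a probe request to every CPU.
        for cpu in range(self.num_cpus):
            self.log.append(("PRQ", self.home, cpu))
        # Each probed CPU reports its result to the requester.
        for cpu in range(self.num_cpus):
            self.log.append(("TRSP", cpu, self.requester))
            probe_responses.add(cpu)
        # Home node returns the cache line to the requester.
        self.log.append(("RDRSP", self.home, self.requester))
        # The read completes once every probe response has been collected.
        return len(probe_responses) == self.num_cpus


txn = ReadTransaction(requester=3, home=1)
done = txn.run()
for name, src, dst in txn.log:
    print(f"{name:6s} CPU {src} -> CPU {dst}")
```

In this model one read costs 2N + 2 messages for N processors, which is why probe throughput matters as the processor count grows.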
"Hammer" Architecture Summary

- **8th Generation microprocessor core**
  - Improved IPC and operating frequency
  - Support for large workloads

- **Cache subsystem**
  - Enhanced TLB structures
  - Improved branch prediction

- **Integrated DDR memory controller**
  - Reduced DRAM latency

- **HyperTransport™ technology**
  - Screaming I/O for chip-to-chip communication
  - Enables glueless MP
"Hammer" System Architecture
1-way
"Hammer" System Architecture
Glueless Multiprocessing: 2-way

[Figure: 2-way system diagram. Labels: "Hammer"; HyperTransport™ AGP; 8x AGP; HyperTransport™ PCI-X; Southbridge.]
"Hammer" System Architecture
Glueless Multiprocessing: 4-way

[Figure: 4-way system diagram. Labels: "Hammer"; HyperTransport™ AGP; 8x AGP; HyperTransport™ PCI-X; Southbridge.]
“Hammer” System Architecture
Glueless Multiprocessing: 8-way
MP System Architecture

- **Software view of memory is SMP**
  - Physical address space is flat and fully coherent
  - Latency difference between local and remote memory in an 8P system is comparable to the difference between a DRAM page hit and DRAM page conflict
  - DRAM location can be contiguous or interleaved

- **Multiprocessor support designed in from the beginning**
  - Lower overall chip count
  - All MP system functions use CPU technology and frequency

- **8P System parameters**
  - 64 DIMMs (up to 128GB) directly connected
  - 4 HyperTransport links available for IO (25GB/s)
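
The 8P memory figures follow from numbers given earlier in the deck, and a quick check ties them together. The sketch assumes all eight processors use the 16-byte interface with its 8 directly connected registered DIMMs, and it treats "GB" as a binary unit; both are assumptions for the sake of the arithmetic.

```python
GIB = 2 ** 30

cpus = 8
dimms_per_cpu = 8                    # 16-byte interface: 8 registered DIMMs
total_dimms = cpus * dimms_per_cpu

max_memory = 128 * GIB               # "up to 128GB"
per_dimm = max_memory // total_dimms

physical_address_space = 2 ** 40     # 40-bit physical address space

print(f"DIMMs in an 8P system: {total_dimms}")
print(f"Implied DIMM size: {per_dimm // GIB} GiB")
print(f"Fits in physical address space: {max_memory < physical_address_space}")
```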
The Rewards of Good Plumbing

- **Bandwidth**
  - 4P system designed to achieve **8GB/s** aggregate memory copy bandwidth
    - With data spread throughout system
  - **Leading edge bus based systems limited to about 2.1GB/s aggregate bandwidth** (3.2GB/s theoretical peak)

- **Latency**
  - Average unloaded latency in 4P system (page miss) is designed to be **140ns**
  - Average unloaded latency in 8P system (page miss) is designed to be **160ns**
  - Latency under load planned to increase much more slowly than bus based systems due to available bandwidth
  - Latency shrinks quickly with increasing CPU clock speed and HyperTransport link speed
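
The figures above can be put side by side with a little arithmetic. This sketch uses only the numbers quoted on the slide; the variable names are illustrative.

```python
hammer_4p_copy_gb_s = 8.0     # design target, aggregate memory copy
bus_sustained_gb_s = 2.1      # bus-based system, achieved
bus_peak_gb_s = 3.2           # bus-based system, theoretical peak

latency_4p_ns = 140.0
latency_8p_ns = 160.0

bandwidth_advantage = hammer_4p_copy_gb_s / bus_sustained_gb_s
bus_efficiency = bus_sustained_gb_s / bus_peak_gb_s
latency_penalty_8p = latency_8p_ns / latency_4p_ns - 1

print(f"4P bandwidth vs. bus-based system: {bandwidth_advantage:.1f}x")
print(f"Bus-based system efficiency: {bus_efficiency:.0%}")
print(f"8P latency increase over 4P: {latency_penalty_8p:.0%}")
```

Doubling the processor count from four to eight is designed to cost roughly one seventh more unloaded latency.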
"Hammer" Summary

- **8th generation CPU core**
  - Delivering high-performance through an optimum balance of IPC and operating frequency

- **x86-64™ technology**
  - Compelling 64-bit migration strategy without any significant sacrifice of existing code base
  - Full speed support for x86 code base
  - Unified architecture from notebook through server

- **DDR memory controller**
  - Significantly reduces DRAM latency

- **HyperTransport™ technology**
  - High-bandwidth I/O
  - Glueless MP

- **Foundation for future portfolio of processors**
  - Top-to-bottom desktop and mobile processors
  - High-performance 1-, 2-, 4-, and 8-way servers and workstations