# CPU SIMD: Quo vadis?

Technological Overview vasko.anton@gmail.com

# Content

- SIMD
- Existing SIMD technologies
- Future SIMD technologies
- Conclusion

# SIMD

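- One of the 4 computer architecture classes in Flynn's taxonomy
- Single Instruction Single Data (SISD): one operation on one data item at a time
- Single Instruction Multiple Data (SIMD): one operation on several data items at once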



# MMX

- MultiMedia eXtension
- Intel's technology
- Introduced in the P5 microarchitecture (Pentium MMX) in January 1997
- Later also supported by AMD (since K6)
- Integer calculations only; fixed-point arithmetic substitutes for floating point (FP)

# MMX - registers

- 8 new registers, MM0 to MM7, 64 bits wide
- Aliased onto the x87 FP register stack (no new physical registers)



# MMX – data types

4 new data types (64 bits wide):

- 1. 8 packed bytes
- 2. 4 packed words
- 3. 2 packed doublewords
- 4. 1 quadword



# MMX - instructions

#### 57 new instructions:

- 1. arithmetic (padd, psub, pmul ...)
- 2. comparison (pcmpeq, pcmpgt)
- 3. conversion (pack, punpck)
- 4. logical (pand, pandn, por, pxor)
- 5. shift (psll, psrl, psra)
- 6. data transfer (movq, movd)
- 7. state management (emms)
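
The packed arithmetic group can be pictured with a small scalar model. The sketch below is not MMX code: it is portable C that imitates what `paddb` (wraparound) and `paddusb` (unsigned saturation) do to the 8 byte lanes of a 64-bit register. The function names `model_paddb` and `model_paddusb` are made up for this illustration.

```c
#include <stdint.h>

/* Scalar model of paddb: eight independent 8-bit adds, each wrapping modulo 256. */
uint64_t model_paddb(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned x = (unsigned)(a >> (8 * lane)) & 0xFFu;
        unsigned y = (unsigned)(b >> (8 * lane)) & 0xFFu;
        unsigned sum = (x + y) & 0xFFu;        /* the carry never leaves the lane */
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}

/* Scalar model of paddusb: same lanes, but each sum is clamped at 255. */
uint64_t model_paddusb(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned x = (unsigned)(a >> (8 * lane)) & 0xFFu;
        unsigned y = (unsigned)(b >> (8 * lane)) & 0xFFu;
        unsigned sum = x + y;
        if (sum > 255u)
            sum = 255u;                        /* saturate instead of wrapping */
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}
```

Saturation is what makes these instructions useful for pixel data: a bright pixel that gets brighter stays white instead of wrapping around to black.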

# 3DNow!

- AMD's technology (reply to MMX)
- MMX extension
- Introduced in K6-2 microarchitecture in May 1998
- Not supported by Intel
- Integer and single-precision FP calculations
- New data type: two single-precision floats packed in one MM register

# 3DNow! - instructions

#### 21 new instructions:

- 1. arithmetic (pfadd, pfsub, pfmul, pfmin, ...)
- 2. horizontal accumulation (pfacc)
- 3. approximations (pfrcp, pfrsqrt, plus refinement iterations pfrcpit1, pfrcpit2, pfrsqit1)
- 4. comparison (pfcmpeq, ...)
- 5. conversion float-integer (pi2fd, pf2id)
- 6. state management (femms)
- 7. prefetching into L1
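
Two of the groups above are easier to grasp with a scalar model. The sketch below, in portable C, imitates the lane behaviour of `pfacc` and shows one textbook Newton-Raphson step for refining a reciprocal estimate. The type `packed2f` and both function names are invented for this illustration, and the refinement step is the generic formula, not a bit-exact model of `pfrcpit1`/`pfrcpit2`.

```c
/* Two packed single-precision floats, as held in one 64-bit MM register. */
typedef struct { float lo, hi; } packed2f;

/* Scalar model of pfacc dst, src: each operand is summed horizontally. */
packed2f model_pfacc(packed2f dst, packed2f src)
{
    packed2f r;
    r.lo = dst.lo + dst.hi;   /* low lane: sum of the destination's two floats */
    r.hi = src.lo + src.hi;   /* high lane: sum of the source's two floats */
    return r;
}

/* One Newton-Raphson step for 1/a, starting from the estimate x0. */
float refine_reciprocal(float a, float x0)
{
    return x0 * (2.0f - a * x0);
}
```

Each refinement step roughly doubles the number of correct bits, which is why a cheap low-precision estimate plus one or two iterations can replace a full division.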

# 3DNow!+

- Extension of 3DNow!
- Introduced in June 1999 in K7 (Athlon)
- 5 new FP instructions (pf2iw, pi2fw, pswapd, pfnacc, pfpnacc)
- 19 new MMX instructions:
  - mask moves (also non-temporal)
  - extracting/inserting from/to regs, shuffling
  - prefetching into different cache levels
  - min, max, avg, absolute differences

# SSE

- Streaming SIMD Extensions
- Intel's reply to 3DNow!
- Introduced in P6 microarchitecture (Pentium III) in February 1999
- Later supported also by AMD (since October 2001 in Athlon XP)
- Single-precision floating point calculations

# SSE – data types and registers

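- 8 physically new 128-bit XMM registers (16 in 64-bit mode)
- Requires OS support (the XMM state must be saved and restored on context switches)
- new data type:
  - 4 packed single-precision FP values
- integer views of an XMM register (16 bytes, 8 words, 4 dwords) become usable with SSE2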



# SSE - instructions

- 70 new instructions:
  - scalar and packed single-precision FP:
    - 1. arithmetic (add/sub, mul/div, rcp, sqrt/rsqrt, min/max)
    - 2. comparison (cmpss, ...)
    - 3. conversion float-int, truncation (cvtpi2ps, cvttps2pi, ...)
    - 4. logical (andps, andnps, orps, xorps)
  - integer instructions (pmin, pavg, psad, ...)
  - data movement (insert, extract, various mov\*s)
  - data shuffle and unpacking (shufps, ...)
  - state and cache memory management
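
As a minimal sketch of packed single-precision arithmetic, the function below adds two float arrays four lanes at a time with the SSE intrinsics `_mm_loadu_ps`, `_mm_add_ps` and `_mm_storeu_ps` from `<xmmintrin.h>`. It assumes a compiler that defines `__SSE__` when SSE is enabled (GCC and Clang do); otherwise it falls back to a plain loop. The function name `add_floats` is made up for this example.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics */
#endif

/* Element-wise out[i] = a[i] + b[i] for n floats. */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  /* one packed add, 4 lanes */
    }
#endif
    for (; i < n; i++)                               /* scalar tail or fallback */
        out[i] = a[i] + b[i];
}
```

The unaligned load and store variants are used so that the sketch works for any pointer; aligned data could use `_mm_load_ps` and `_mm_store_ps` instead.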

# SSE2

- 2nd iteration of the SSE instruction set
- Intel's technology
- Introduced in NetBurst microarchitecture (Pentium 4) in November 2000
- Supported also by AMD since 2003 (in Athlon 64)
- 128-bit integer (MMX) and double-precision floating point calculations

# SSE2 - instructions

- New data type: two packed double-precision FP values
- 144 new instructions:
  - scalar and packed double-precision FP:
    - 1. arithmetic (add/sub, mul/div, sqrt, min/max)
    - 2. comparison (cmpsd, ...)
    - 3. conversion double-int, double-single FP, truncation
    - 4. logical (andpd, andnpd, orpd, xorpd)
  - 64-bit MMX instructions extended to 128-bit
  - data movement, unaligned loads
  - data shuffle and unpacking
  - cache memory management

# SSE3

- 3rd iteration of the SSE instruction set
- Intel's technology
- Introduced in Pentium 4 (Prescott) in February 2004
- Supported also by AMD since 2005 (in Athlon 64 revision E)

# SSE3 - instructions

- 13 new instructions:
  - Arithmetic (addsubps, addsubpd)
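  - Horizontal operations for AoS (array of structures) data (haddps, haddpd, hsubps, hsubpd)
  - Loading (lddqu, movddup, movshdup, movsldup)
  - x87 FP-to-integer conversion with truncation (fisttp)
  - Thread synchronization for Hyper-Threading (monitor, mwait)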
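
The lane patterns of `addsubps` and `haddps` are unusual enough to deserve a picture. The following is a scalar model in portable C, not SSE3 code; the type `vec4f` and the function names are invented for this illustration.

```c
/* Four packed single-precision floats, as held in one XMM register. */
typedef struct { float v[4]; } vec4f;

/* Scalar model of addsubps: subtract in even lanes, add in odd lanes. */
vec4f model_addsubps(vec4f a, vec4f b)
{
    vec4f r;
    r.v[0] = a.v[0] - b.v[0];
    r.v[1] = a.v[1] + b.v[1];
    r.v[2] = a.v[2] - b.v[2];
    r.v[3] = a.v[3] + b.v[3];
    return r;
}

/* Scalar model of haddps: sums of adjacent pairs, first from a, then from b. */
vec4f model_haddps(vec4f a, vec4f b)
{
    vec4f r;
    r.v[0] = a.v[0] + a.v[1];
    r.v[1] = a.v[2] + a.v[3];
    r.v[2] = b.v[0] + b.v[1];
    r.v[3] = b.v[2] + b.v[3];
    return r;
}
```

The alternating subtract/add pattern matches complex multiplication, where the real part is a difference of products and the imaginary part is a sum.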

# SSSE3

- Supplemental SSE3
- Intel's revision of SSE3
- Introduced in Core microarchitecture in July 2006
- Not supported by AMD
- 16 new instructions, each in a 64-bit MMX and a 128-bit XMM form (32 encodings in total)

# SSSE3 - instructions

- Arithmetic:
  - psign, pabs
  - pmaddubsw (multiply unsigned by signed bytes, add adjacent pairs with saturation)
- AOS:
  - phadd, phsub (words, dwords, saturated signed words)
- Other:
  - palignr, pshufb
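
Of these, `pshufb` is the most flexible: every output byte is selected by its own control byte. A scalar model of the 128-bit form, written in portable C with the made-up name `model_pshufb`, might look like this (the output buffer is assumed not to overlap the source):

```c
#include <stdint.h>

/* Scalar model of the 128-bit pshufb: out[i] is chosen by control[i]. */
void model_pshufb(const uint8_t src[16], const uint8_t control[16],
                  uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        if (control[i] & 0x80u)
            out[i] = 0;                        /* high bit set: force zero */
        else
            out[i] = src[control[i] & 0x0Fu];  /* low 4 bits: source index */
    }
}
```

With a suitable control vector this one instruction covers byte reversal, broadcasts, and small table lookups.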

# SSE4

- 4th iteration of the SSE instruction set
- Subsets:
  - SSE4.1
  - SSE4.2
  - SSE4a
- New instructions based on feedback from developers

# SSE4.1

- Available in Penryn (and higher)
- Intel only
- 47 new instructions:
  - Arithmetic instructions for more integer data types (pmul, pmin)
  - Dot product for floating-point (dpps, dppd)
  - Rounding FPs via immediate argument
  - Blending, insert/extract instructions
  - Other instructions (movntdqa, ...)
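
The dot product instruction is controlled by an 8-bit immediate, which is easiest to see in a scalar model. The sketch below is portable C imitating `dpps`; the type `vec4f` and the name `model_dpps` are invented for this illustration.

```c
/* Four packed single-precision floats, as held in one XMM register. */
typedef struct { float v[4]; } vec4f;

/* Scalar model of dpps: masked dot product of two 4-float vectors. */
vec4f model_dpps(vec4f a, vec4f b, unsigned imm8)
{
    float prod[4];
    for (int i = 0; i < 4; i++)   /* bits 4..7: which products take part */
        prod[i] = (imm8 & (0x10u << i)) ? a.v[i] * b.v[i] : 0.0f;

    float sum = (prod[0] + prod[1]) + (prod[2] + prod[3]);

    vec4f r;
    for (int i = 0; i < 4; i++)   /* bits 0..3: which lanes receive the sum */
        r.v[i] = (imm8 & (1u << i)) ? sum : 0.0f;
    return r;
}
```

For example, an immediate of `0x71` computes a 3-component dot product and writes it to the lowest lane only, which suits xyz vectors stored with a padding element.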

# SSE4.2

- Available in Core i7 (and higher)
- Intel only
- Not multimedia instructions
- 7 new instructions:
  - String comparison (pcmpestri, pcmpestrm, pcmpistri, pcmpistrm)
  - 64-bit integer comparison (pcmpgtq)
  - **crc32** (accumulates a CRC-32C checksum)
  - popcnt (population count)
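
What `crc32` and `popcnt` compute can be sketched in software. The code below is a bit-by-bit scalar model in portable C, not the hardware instructions. It assumes the usual CRC-32C convention (initial value and final XOR of `0xFFFFFFFF`) around the accumulate step, and all function names are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of crc32 r32, r/m8: fold one byte into a CRC-32C accumulator. */
uint32_t model_crc32_u8(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++) {
        uint32_t mask = 0u - (crc & 1u);          /* all ones if the low bit is set */
        crc = (crc >> 1) ^ (0x82F63B78u & mask);  /* reflected Castagnoli polynomial */
    }
    return crc;
}

/* CRC-32C of a buffer, with the conventional initial value and final XOR. */
uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = model_crc32_u8(crc, p[i]);
    return crc ^ 0xFFFFFFFFu;
}

/* Scalar model of popcnt: count the set bits. */
unsigned model_popcnt(uint32_t x)
{
    unsigned count = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

The hardware instruction performs the inner loop for 1, 2, 4 or 8 bytes in a single step, which is where the speed-up over table-driven software comes from.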

# SSE4a

- Available in K10 (and higher)
- AMD only
- Not multimedia instructions
- 4+2 new instructions:
  - lzcnt (leading zero count)
  - popcnt (population count)
  - extrq/insertq (bit-field extract/insert)
  - movntss/movntsd (scalar streaming store)

# SSE4 Summary

- SSE4a = 4 AMD-specific instructions (extrq, insertq, movntss, movntsd) + 2 bit-counting instructions (lzcnt, popcnt); it shares none with SSE4.1, and only popcnt with SSE4.2
- SSE4 supporting madness:
  - SSE4.1 Intel Penryn (and higher)
  - SSE4.2 Intel Core i7 (and higher)
  - SSE4a AMD K10 (and higher)
- A programmer's (and compiler's) nightmare!
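
In practice this fragmentation is handled by checking CPU features at run time and dispatching to a matching code path. The sketch below assumes GCC or Clang on x86, where the built-ins `__builtin_cpu_init` and `__builtin_cpu_supports` are available; on any other compiler or architecture it simply reports the baseline. The enum `simd_path` and the function `choose_simd_path` are names invented for this example.

```c
/* Code paths a program might ship, from least to most capable. */
typedef enum {
    PATH_BASELINE = 0,   /* SSE/SSE2/SSE3 or plain scalar code */
    PATH_SSSE3,
    PATH_SSE41,
    PATH_SSE42
} simd_path;

/* Picks the most capable code path that the running CPU reports. */
simd_path choose_simd_path(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.2")) return PATH_SSE42;
    if (__builtin_cpu_supports("sse4.1")) return PATH_SSE41;
    if (__builtin_cpu_supports("ssse3"))  return PATH_SSSE3;
#endif
    return PATH_BASELINE;
}
```

A program would typically call this once at start-up and store a function pointer for each hot routine, so the check is not repeated in inner loops.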

# SSE5

- 5th iteration of the SSE instruction set
- AMD's reply to SSE4.1
- Should be introduced in Bulldozer in 2011
- Will NOT be supported by Intel
- 170 new instructions
- Not all SSE4 instructions included
- Not superset but competitor to SSE4

# SSE5 – Key Features

- Three and four operand syntax
- Floating-point fused multiply-accumulate
- Integer multiply-accumulate with/without saturation
- Permutations, rotations, shifts, conditional moves
- Precision control, rounding, conversions
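
To make "multiply-accumulate with saturation" and the four-operand form concrete, here is a scalar model of a single lane in portable C. It is a sketch of the arithmetic only and does not correspond to a specific SSE5 mnemonic; the name `mac_saturate` is made up for this illustration.

```c
#include <stdint.h>

/* d = a * b + c for one lane: 16-bit inputs, 32-bit accumulator, saturating.
 * The three sources stay untouched and the result goes to a fourth value,
 * which mirrors the non-destructive four-operand form. */
int32_t mac_saturate(int16_t a, int16_t b, int32_t c)
{
    int64_t wide = (int64_t)a * (int64_t)b + (int64_t)c;  /* exact in 64 bits */
    if (wide > INT32_MAX)
        return INT32_MAX;   /* clamp on positive overflow */
    if (wide < INT32_MIN)
        return INT32_MIN;   /* clamp on negative overflow */
    return (int32_t)wide;
}
```

With two-operand instructions the same computation needs an extra register copy, because the multiply would overwrite one of its inputs.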

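# SSE Summary

The 8 SSE revisions (SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE4a, SSE5) have 471 instructions in total: 70 + 144 + 13 + 16 + 47 + 7 + 4 + 170 = 471 (counting SSSE3 as 16 and SSE4a as 4)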



# AVX

- Intel Advanced Vector Extensions
- Intel's competitor to SSE5
- Should be introduced in Sandy Bridge in 2010
- Will be supported by AMD?
- Flexible design for further extensions (FMA)

# AVX – Key Features

- New 256-bit registers (YMM)
- Three and four operand instruction syntax
- Upgrade of some SSEn instructions to the new VEX encoding (> 200)
- Promotion of some 128-bit SSEn instructions to 256-bit instructions (< 100)
- New arithmetic and data processing instructions (encryption, broadcast, permute, fused-multiply-add)

# Conclusion

- MMX, 3DNow! are old
- Common to Intel & AMD: SSE, SSE2, SSE3
- SSEn will be (soon) replaced by AVX
- SSE5 is incompatible with AVX
- AVX has better design than SSE5
- AVX = Intel's vengeance for AMD64

# Thank you for your attention!
