List of x86 SIMD instructions (Wikipedia Lab Guide)

Advanced x86 SIMD Instruction Set Study Guide
1) Introduction and Scope
This study guide provides a deep technical dive into the evolution and mechanics of x86 Single Instruction, Multiple Data (SIMD) instruction set extensions. SIMD architectures are fundamental to modern high-performance computing, enabling parallel processing of data elements within wide registers. This guide focuses on the underlying principles, architectural details, practical implications, and defensive engineering considerations for MMX, SSE, AVX, AVX2, AVX-512, FMA, and AMX. The scope is strictly technical, aiming to equip learners with a robust understanding of these powerful instruction sets for system analysis, optimization, and security.
2) Deep Technical Foundations
SIMD instructions operate on vectors, which are fixed-size arrays of data elements (lanes). A single SIMD instruction performs the same operation on each corresponding lane of multiple operand vectors simultaneously. This contrasts with scalar operations, which process a single data element at a time.
Key Concepts:
Vector Registers: Wide registers designed to hold multiple data elements. Their width has progressively increased across SIMD extensions:
- MMX: 64-bit registers (mm0-mm7), aliased with the x87 FPU registers. This aliasing requires careful state management.
- SSE/SSE2: 128-bit registers (xmm0-xmm15 in 64-bit mode, xmm0-xmm7 in 32-bit mode). These are dedicated registers, separate from x87.
- AVX/AVX2: Extends the xmm registers to 256-bit registers (ymm0-ymm15). The lower 128 bits of ymmN map directly to xmmN.
- AVX-512: Extends the ymm registers to 512-bit registers (zmm0-zmm31). The lower 256 bits of zmmN map to ymmN, and the lower 128 bits map to xmmN.
Lane Width: The size of individual data elements within a vector. Common lane widths include 8-bit (byte), 16-bit (word), 32-bit (doubleword), 64-bit (quadword), and 128-bit (double quadword). The choice of lane width impacts the number of elements that fit into a vector.
Data Types: SIMD instructions can operate on various data types, including integers (signed/unsigned) and floating-point numbers (single-precision FP32, double-precision FP64, half-precision FP16, bfloat16). The specific instruction mnemonic often indicates the data type (e.g., PS for Packed Single-precision, PD for Packed Double-precision, PI for Packed Integer).
Instruction Prefixes: The encoding of SIMD instructions has evolved significantly to support wider registers, more operands, and advanced features.
- Legacy Prefixes (66h, 0Fh, 0F38h, 0F3Ah): Used for MMX, SSE, and early SSE2 instructions. These bytes modify or extend the primary opcode. For example, 0Fh selects an extended opcode map, and 66h often selects the 128-bit SSE/SSE2 form of an instruction.
- VEX Prefix: Introduced with AVX. A 2-byte (C5h) or 3-byte (C4h) prefix that enables three-operand syntax, access to registers up to ymm15, and more compact encoding. Key fields:
  - R, X, B: Extend the ModR/M reg field, the SIB index field, and the ModR/M/SIB base field, giving access to registers 8-15.
  - vvvv: Specifies an additional (usually second source) register, stored inverted.
  - W: Operand-size or opcode-extension bit; its meaning is instruction-specific (commonly 0 for 32-bit, 1 for 64-bit operand size).
  - L: Vector length (0 for 128-bit xmm, 1 for 256-bit ymm).
  - pp, mmmmm: Encode an implied legacy prefix (none/66h/F3h/F2h) and the opcode map (0Fh/0F38h/0F3Ah).
- EVEX Prefix: Introduced with AVX-512. A 4-byte prefix (first byte 62h) enabling 512-bit operations, opmask registers, embedded broadcast, rounding control, and access to zmm0-zmm31. Key fields:
  - R, X, B, R', V': Extended register specifiers, allowing all 32 vector registers to be addressed.
  - W: Operand-size bit (commonly 0 for 32-bit, 1 for 64-bit elements).
  - L'L: Vector length (00 for 128-bit, 01 for 256-bit, 10 for 512-bit).
  - pp, mmm: Implied legacy prefix and opcode map selection.
  - b: Broadcast / rounding / suppress-all-exceptions control.
  - z: Zeroing bit (masked-out destination lanes are zeroed if set, merged if clear).
  - aaa: Opmask register specifier (k0-k7).
Opcode Encoding: The specific operation, operand types, and vector length are encoded within the instruction's opcode and prefixes. The ModR/M byte, SIB byte (if present), and immediate operands further define the instruction's behavior.
Register Aliasing (MMX): MMX registers (mm0-mm7) share physical storage with the x87 FPU registers (st0-st7). This means that using an MMX instruction can overwrite x87 state, and vice versa, so careful management is required to avoid data corruption. For example, the EMMS (Empty MMX State) instruction is often needed to clear the MMX state and allow normal x87 operation.
3) Internal Mechanics / Architecture Details
3.1) MMX (MultiMedia eXtensions)
- Registers: 64-bit mm0 to mm7. These are aliases for the x87 FPU data registers.
- Data Types: Packed bytes (8 x 8-bit), packed words (4 x 16-bit), packed doublewords (2 x 32-bit). Operations are integer-based.
- Operation: Primarily two-operand instructions where the destination is also a source (e.g., PADDW mm1, mm2 performs mm1 = mm1 + mm2).
- Encoding: Uses legacy prefixes (0Fh followed by specific opcodes).
- Example: PADDW mm1, mm2 adds packed words from mm2 to mm1.
; Example: Adding two packed words (16-bit integers)
; mm1 = [w1, w2, w3, w4]
; mm2 = [v1, v2, v3, v4]
; Result in mm1: [w1+v1, w2+v2, w3+v3, w4+v4] (each w_i + v_i is a 16-bit addition)
; If overflow occurs in a lane, it wraps around (standard 16-bit integer arithmetic).
PADDW mm1, mm2
3.2) SSE (Streaming SIMD Extensions) & SSE2
- Registers: 128-bit xmm0 to xmm15 in 64-bit mode (xmm0-xmm7 in 32-bit mode). These are dedicated registers.
- Data Types:
  - SSE: Packed single-precision floating-point (4 x 32-bit). Instructions often end in PS.
  - SSE2: Packed double-precision floating-point (2 x 64-bit) and packed integers (16 x 8-bit, 8 x 16-bit, 4 x 32-bit, 2 x 64-bit). Instructions often end in PD (double-precision float) or use the P-prefixed integer forms.
- Operation: Two-operand form (OP xmm_dst, xmm_src). Three-operand forms (VOP xmm_dst, xmm_src1, xmm_src2) exist only as the later VEX-encoded (AVX) versions of these instructions.
- Encoding: Legacy prefixes for SSE and SSE2.
- Data Movement: Instructions like MOVDQA (Move Double Quadword Aligned) and MOVDQU (Move Double Quadword Unaligned) are crucial for efficient data loading/storing. MOVDQA requires the memory operand to be 16-byte aligned; MOVDQU does not.
- Example: PADDD xmm1, xmm2 adds packed doublewords (32-bit integers) from xmm2 to xmm1.
; Example: Adding two packed doublewords (32-bit integers)
; xmm1 = [d1, d2, d3, d4]
; xmm2 = [v1, v2, v3, v4]
; Result in xmm1: [d1+v1, d2+v2, d3+v3, d4+v4] (each d_i + v_i is a 32-bit addition)
; Overflow wraps around.
PADDD xmm1, xmm2
3.3) AVX (Advanced Vector Extensions)
Registers: Extends the xmm registers to 256-bit ymm0 to ymm15. The lower 128 bits of ymmN are accessible as xmmN.
Encoding: Primarily uses the 2- or 3-byte VEX prefix, enabling three-operand syntax and wider vectors.
Data Types: Supports packed single-precision (8 x 32-bit) and double-precision (4 x 64-bit) floating-point operations. Integer operations on 256-bit vectors were introduced with AVX2.
VEX Prefix Fields (see Section 2): R/X/B extend the register fields, L selects the vector length (0 for 128-bit, 1 for 256-bit), W is an instruction-specific operand-size bit, and pp/mmmmm select the implied legacy prefix and opcode map.
Example: VADDPS ymm1, ymm2, ymm3 performs element-wise addition of packed single-precision floats from ymm2 and ymm3, storing the result in ymm1. This is a three-operand instruction.
; Example: Adding two packed single-precision floats (32-bit floats)
; ymm1 = [f1, f2, f3, f4, f5, f6, f7, f8]
; ymm2 = [v1, v2, v3, v4, v5, v6, v7, v8]
; ymm3 = [w1, w2, w3, w4, w5, w6, w7, w8]
; Result in ymm1: [v1+w1, v2+w2, ..., v8+w8]
VADDPS ymm1, ymm2, ymm3
3.4) AVX2
- Registers: Continues to use 256-bit ymm registers.
- Key Additions:
  - Integer SIMD: Extends most 128-bit packed-integer instructions to 256-bit vectors (8, 16, 32, 64-bit elements).
  - Gather Instructions: Enable non-contiguous memory loads into vector registers. For example, VPGATHERDD loads doublewords from memory addresses computed from indices held in a vector register.
  - Fused Multiply-Add (FMA): FMA3 is a separate CPUID feature that typically ships alongside AVX2. FMA instructions perform (a * b) + c in a single step with a single rounding, often at higher precision and lower latency than separate multiply and add.
  - Bitwise/Shift Operations: Adds per-lane variable shifts (e.g., VPSLLVD, VPSRLVD); packed rotate instructions arrived later with AVX-512.
- Example: VPADDD ymm1, ymm2, ymm3 adds packed doublewords from ymm2 and ymm3, storing the result in ymm1.
; Example: Adding two packed 32-bit integers
; ymm1 = [d1, d2, d3, d4, d5, d6, d7, d8]
; ymm2 = [v1, v2, v3, v4, v5, v6, v7, v8]
; ymm3 = [w1, w2, w3, w4, w5, w6, w7, w8]
; Result in ymm1: [v1+w1, v2+w2, ..., v8+w8]
VPADDD ymm1, ymm2, ymm3
3.5) AVX-512
Registers: Introduces 512-bit zmm0 to zmm31. Each zmmN register is an extension of ymmN and xmmN. There are also 8 opmask registers (k0-k7), each architecturally 64 bits wide.
Encoding: Uses the 4-byte EVEX prefix, which is more flexible than VEX.
Key Features:
- Opmask Registers (k0-k7): Allow conditional execution of lanes within a vector operation. A 1 in a bit position enables the operation for that lane; a 0 disables it. This enables fine-grained control and efficient handling of irregular data.
- Broadcast: Allows a single scalar value from memory to be replicated across all lanes of a vector for an operation, controlled by the b bit in the EVEX prefix.
- Rounding Control: The EVEX prefix supports explicit per-instruction control of floating-point rounding modes.
- Zeroing vs. Merging: The z bit in the EVEX prefix determines whether masked-out lanes are zeroed or retain their previous values.
- Subsets: AVX-512 is modular, with subsets such as AVX-512F (Foundation), AVX-512CD (Conflict Detection), AVX-512BW (Byte/Word), AVX-512DQ (Doubleword/Quadword), AVX-512VL (Vector Length Extensions, allowing 128/256-bit operations with EVEX encoding), and AVX-512VNNI (Vector Neural Network Instructions).
EVEX Prefix Fields (see Section 2): extended register specifiers (R, X, B, R', V') address all 32 zmm registers; W selects the element/operand size; L'L selects the vector length (128/256/512-bit); b controls broadcast and rounding; z selects zeroing vs. merging; and aaa names the opmask register.
Example: VPADDD zmm1 {k2}, zmm2, zmm3 adds packed doublewords from zmm2 and zmm3, storing the result in zmm1. The operation is conditionally executed based on the bits set in opmask register k2. If a bit in k2 is 0, the corresponding lane in zmm1 is not updated (merging semantics, assuming the z bit is 0).
; Example: Masked addition of packed 32-bit integers
; zmm1 = [d1, ..., d16] (16x 32-bit lanes)
; zmm2 = [v1, ..., v16]
; zmm3 = [w1, ..., w16]
; k2 = [b1, ..., b16] (where b_i is 1 or 0, mapped to the 64-bit k register)
; Assuming EVEX prefix with z=0 (merge) and k=k2:
; Result in zmm1: [d1 if b1, v1+w1 else d1, ..., d16 if b16, v16+w16 else d16]
VPADDD zmm1 {k2}, zmm2, zmm3
3.6) FMA (Fused Multiply-Add)
- Concept: Combines a multiplication and an addition into a single instruction, often with higher precision and reduced latency. This is crucial for numerical stability and performance in linear algebra and signal processing.
- FMA3: Three-operand instructions (the common form). The digits in the mnemonic give the operand order for the multiply and add:
  - vfmadd132sd xmm1, xmm2, xmm3: xmm1 = (xmm1 * xmm3) + xmm2 (scalar double-precision)
  - vfmadd213sd xmm1, xmm2, xmm3: xmm1 = (xmm2 * xmm1) + xmm3
  - vfmadd231sd xmm1, xmm2, xmm3: xmm1 = (xmm2 * xmm3) + xmm1
  Similar mnemonics exist for packed data (vfmadd132ps, vfmadd132pd, etc.) and for fused multiply-subtract (vfmsub).
- FMA4: Four-operand instructions (less common, primarily on older AMD processors).
- Encoding: Uses VEX or EVEX prefixes.
- Data Types: FP32, FP64, FP16 (with AVX512-FP16), and BF16 (with the later AVX10.2 extensions).
- Example: VFMADD231PD ymm1, ymm2, ymm3 (packed double-precision) performs ymm1 = (ymm2 * ymm3) + ymm1 for each pair of double-precision elements.
; Example: Fused Multiply-Add for double-precision floats
; ymm1 = [a1, a2, a3, a4]
; ymm2 = [b1, b2, b3, b4]
; ymm3 = [c1, c2, c3, c4]
; Result in ymm1: [ (b1*c1)+a1, (b2*c2)+a2, (b3*c3)+a3, (b4*c4)+a4 ]
; The intermediate product (b_i * c_i) is kept at higher precision before adding a_i.
VFMADD231PD ymm1, ymm2, ymm3
3.7) AMX (Advanced Matrix Extensions)
- Registers: Introduces 8 tile registers (tmm0-tmm7), each capable of holding a two-dimensional block (tile) of data. These are not general-purpose vector registers.
- Concept: Designed for efficient matrix multiplication and other tensor operations, crucial for AI/ML workloads. AMX operates on tiles rather than fixed-width vectors.
- Tile Configuration: A tile-configuration state (loaded with the LDTILECFG instruction) defines the dimensions (rows, bytes per row) and layout of the tmm registers. This configuration must be set before using AMX compute instructions.
- Instructions: TDPBSSD, TDPBSUD, TDPBUSD, and TDPBUUD compute tile dot products over signed/unsigned byte operands; TDPBF16PS does the same for bfloat16 pairs. For example, TDPBSSD computes Tile_Dest = Tile_Dest + (Tile_A * Tile_B), where Tile_A and Tile_B hold signed bytes and products are accumulated into Tile_Dest as signed doublewords.
- Operation: Operates on tiles of data, not individual lanes as in SSE/AVX, which requires a different programming model focused on tile configuration and block operations.
4) Practical Technical Examples
4.1) Vector Addition in C with Intrinsics (AVX2)
This example demonstrates vector addition of two arrays using AVX2 intrinsics in C.
#include <immintrin.h> // For AVX/AVX2 intrinsics
#include <stdio.h>
#include <stdlib.h> // For aligned_alloc
// Define vector size for AVX2 (256 bits / 32 bits/float = 8 floats)
#define AVX2_FLOAT_VECTOR_SIZE 8
void vector_add_avx2(float *a, float *b, float *c, int n) {
int i = 0;
// Process elements in chunks of AVX2_FLOAT_VECTOR_SIZE floats
for (; i + AVX2_FLOAT_VECTOR_SIZE - 1 < n; i += AVX2_FLOAT_VECTOR_SIZE) {
// Load 8 floats from array a into a ymm register (unaligned load)
__m256 va = _mm256_loadu_ps(a + i);
// Load 8 floats from array b into a ymm register (unaligned load)
__m256 vb = _mm256_loadu_ps(b + i);
// Add the two ymm registers element-wise
__m256 vc = _mm256_add_ps(va, vb);
// Store the result back to array c (unaligned store)
_mm256_storeu_ps(c + i, vc);
}
// Handle remaining elements if n is not a multiple of AVX2_FLOAT_VECTOR_SIZE
for (; i < n; ++i) {
c[i] = a[i] + b[i];
}
}
int main() {
const int N = 20; // Example size, not a multiple of 8
// Allocate aligned memory for better performance with aligned loads/stores.
// C11 aligned_alloc requires the size to be a multiple of the alignment,
// so round the byte count up to a multiple of 32 (the ymm register width).
size_t bytes = ((N * sizeof(float) + 31) / 32) * 32;
float *arr_a = (float *)aligned_alloc(32, bytes);
float *arr_b = (float *)aligned_alloc(32, bytes);
float *arr_c = (float *)aligned_alloc(32, bytes);
if (!arr_a || !arr_b || !arr_c) {
perror("Failed to allocate aligned memory");
return 1;
}
// Initialize arrays
for (int i = 0; i < N; ++i) {
arr_a[i] = (float)(i + 1);
arr_b[i] = (float)(i * 0.5f);
}
vector_add_avx2(arr_a, arr_b, arr_c, N);
printf("Result of vector addition (AVX2):\n");
for (int i = 0; i < N; ++i) {
printf("%.2f ", arr_c[i]);
}
printf("\n");
free(arr_a);
free(arr_b);
free(arr_c);
return 0;
}
Explanation:
- _mm256_loadu_ps(ptr): Loads 256 bits (8 floats) from an unaligned memory address ptr into a __m256 register. _mm256_load_ps requires 32-byte alignment.
- _mm256_add_ps(a, b): Performs element-wise addition of two __m256 registers containing packed single-precision floats.
- _mm256_storeu_ps(ptr, val): Stores the contents of val to an unaligned memory address ptr. _mm256_store_ps requires 32-byte alignment.
- Alignment: Using aligned_alloc (or equivalent) aligns the data to 32 bytes, permitting the potentially faster aligned load/store intrinsics (_mm256_load_ps, _mm256_store_ps). The _loadu_/_storeu_ variants are used here for simplicity and robustness against unaligned data.
4.2) Masked Bitwise XOR with AVX-512
This example demonstrates a masked bitwise XOR operation using AVX-512 on unsigned int (32-bit) elements. Masked 32-bit integer operations on 512-bit vectors are part of AVX-512F (Foundation); AVX-512VL would additionally allow the same EVEX-masked operations on 128/256-bit vectors.
#include <immintrin.h> // For AVX-512 intrinsics
#include <stdio.h>
#include <stdlib.h> // For aligned_alloc
// Define vector size for AVX-512 with 32-bit integers (512 bits / 32 bits/uint = 16 uints)
#define AVX512_UINT_VECTOR_SIZE 16
// Function to convert a pattern of 0s and 1s into an AVX-512 opmask register
// This is a simplified helper; real implementations might use more direct intrinsics
// or bit manipulation.
__mmask16 create_mask16(const unsigned int *mask_pattern, int n) {
__mmask16 k_mask = 0;
for (int i = 0; i < n && i < 16; ++i) {
if (mask_pattern[i]) {
k_mask |= (1 << i);
}
}
return k_mask;
}
void masked_xor_avx512(unsigned int *a, unsigned int *b, unsigned int *c, const unsigned int *mask_pattern, int n) {
int i = 0;
// Process elements in chunks of AVX512_UINT_VECTOR_SIZE uints
for (; i + AVX512_UINT_VECTOR_SIZE - 1 < n; i += AVX512_UINT_VECTOR_SIZE) {
// Load 16 unsigned ints from array a
__m512i va = _mm512_loadu_si512(a + i);
// Load 16 unsigned ints from array b
__m512i vb = _mm512_loadu_si512(b + i);
// Create the opmask register from the pattern for this chunk
__mmask16 k_mask = create_mask16(mask_pattern + i, AVX512_UINT_VECTOR_SIZE);
// Perform masked XOR.
// _mm512_mask_xor_epi32:
// - First arg: The destination register to merge into (or zero if z=1).
// Using _mm512_setzero_si512() for zeroing semantics.
// - Second arg: The opmask register (k_mask).
// - Third arg: First source vector (va).
// - Fourth arg: Second source vector (vb).
// Lanes where k_mask bit is 1: va ^ vb is computed and written to dest.
// Lanes where k_mask bit is 0: The original value of the dest lane is kept (merge).
// If we wanted to zero out masked lanes, we'd use:
// __m512i vc = _mm512_mask_xor_epi32(_mm512_setzero_si512(), k_mask, va, vb);
__m512i vc = _mm512_mask_xor_epi32(va, k_mask, va, vb); // Merging XOR
// Store the result back to array c
_mm512_storeu_si512(c + i, vc);
}
// Handle remaining elements
for (; i < n; ++i) {
if (mask_pattern[i]) { // Conceptual masking for scalar part
c[i] = a[i] ^ b[i];
} else {
c[i] = a[i]; // Keep original value if mask is 0
}
}
}
int main() {
const int N = 32; // Example size, multiple of 16 for full 512-bit vector
// Allocate aligned memory (64 bytes for AVX-512 zmm registers)
unsigned int *arr_a = (unsigned int *)aligned_alloc(64, N * sizeof(unsigned int));
unsigned int *arr_b = (unsigned int *)aligned_alloc(64, N * sizeof(unsigned int));
unsigned int *arr_c = (unsigned int *)aligned_alloc(64, N * sizeof(unsigned int));
unsigned int *arr_mask_pattern = (unsigned int *)aligned_alloc(64, N * sizeof(unsigned int));
if (!arr_a || !arr_b || !arr_c || !arr_mask_pattern) {
perror("Failed to allocate aligned memory");
return 1;
}
// Initialize arrays
for (int i = 0; i < N; ++i) {
arr_a[i] = 0x11111111 * (i + 1);
arr_b[i] = 0x22222222 * (i + 1);
arr_mask_pattern[i] = (i % 4 == 0) ? 1 : 0; // XOR every 4th element
}
masked_xor_avx512(arr_a, arr_b, arr_c, arr_mask_pattern, N);
printf("Result of masked XOR operation (AVX-512):\n");
for (int i = 0; i < N; ++i) {
printf("0x%08X ", arr_c[i]);
if ((i + 1) % 4 == 0) printf("\n"); // Newline every 4 elements for readability
}
printf("\n");
free(arr_a);
free(arr_b);
free(arr_c);
free(arr_mask_pattern);
return 0;
}
Explanation of Masking:
- Opmask Registers (k0-k7): Architecturally 64-bit registers that control individual lanes. With 32-bit elements in a 512-bit vector there are 512 / 32 = 16 lanes, so the low 16 mask bits are used. A 1 in a bit position enables the operation for that lane; a 0 disables it.
- Masked Instructions: Intrinsics like _mm512_mask_xor_epi32 take an opmask (__mmask16 for 16 lanes of 32-bit data) as an argument.
- Zeroing vs. Merging:
  - Zeroing: Lanes whose mask bit is 0 are set to zero. The intrinsic call _mm512_mask_xor_epi32(_mm512_setzero_si512(), k_mask, va, vb) achieves this (as does the dedicated _mm512_maskz_xor_epi32 form).
  - Merging: Lanes whose mask bit is 0 retain the value of the first (merge-source) argument. The example uses merging semantics: _mm512_mask_xor_epi32(va, k_mask, va, vb), where va supplies both the merge source and the first XOR operand.
4.3) Packet Structure Analysis (Conceptual)
Understanding SIMD instructions is crucial for analyzing high-performance network packet processing. Intrusion detection systems (IDS), firewalls, and deep packet inspection (DPI) engines often leverage SIMD for rapid pattern matching and data extraction.
Consider a scenario where we need to check for a specific 16-byte signature within a network packet payload.
Scenario: Extracting and comparing 16-byte chunks of a packet payload using SSE2 for signature matching.
#include <emmintrin.h> // For SSE2 intrinsics
#include <stdio.h>
#include <string.h>
#include <stdlib.h> // For malloc, free
// Example signature to search for (16 bytes)
const unsigned char signature[16] = "SECRET_PATTERN_1"; // exactly 16 bytes; no terminating NUL is stored
// Function to find a 16-byte signature in a payload using SSE2
// Returns the offset of the signature if found, otherwise -1.
int find_signature_sse2(const unsigned char *payload, int payload_len) {
    // Load the target signature into an SSE register.
    // _mm_loadu_si128 loads 128 bits (16 bytes) from an unaligned address.
    __m128i sig = _mm_loadu_si128((const __m128i *)signature);
    // Slide a 16-byte window across the payload, one byte at a time.
    for (int i = 0; i + 16 <= payload_len; ++i) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(payload + i));
        // PCMPEQB: a byte lane becomes 0xFF where the two bytes are equal.
        __m128i eq = _mm_cmpeq_epi8(chunk, sig);
        // PMOVMSKB: gather the high bit of each byte lane into a 16-bit mask;
        // 0xFFFF means all 16 bytes matched.
        if (_mm_movemask_epi8(eq) == 0xFFFF)
            return i;
    }
    return -1;
}
---
## Source
- Wikipedia page: https://en.wikipedia.org/wiki/List_of_x86_SIMD_instructions