
AArch64: LLVM auto-vectorization of memcpy causes alignment faults with 8-byte aligned addresses #22491

Closed
Hotschmoe opened this issue Jan 14, 2025 · 8 comments
Labels
arch-aarch64 64-bit ARM backend-llvm The LLVM backend outputs an LLVM IR Module. bug Observed behavior contradicts documented or intended behavior

Comments

@Hotschmoe

Zig Version

0.13.0

Steps to Reproduce and Observed Behavior

The Bug

While the Zig compiler_rt implementation in compiler_rt/memcpy.zig is a simple byte-by-byte copy, LLVM's auto-vectorization transforms this into SIMD instructions (ldp/stp with Q registers) for copies larger than 32 bytes (0x20). These instructions require 16-byte alignment, but no alignment verification is performed before the transformation.

Technical Details

  • The source implementation is a simple byte-by-byte copy in compiler_rt
  • LLVM auto-vectorization transforms this into SIMD instructions
  • The resulting assembly uses ldp q0, q1, [x9, #-16] which requires 16-byte alignment (attempting to load 32 bytes using two 128-bit SIMD registers)
  • No alignment checks are performed before using SIMD instructions

Impact

  • Affects AArch64 bare metal code using memcpy with 8-byte aligned buffers
  • Particularly problematic for kernel/OS development where hardware structures are often 8-byte aligned
  • Results in alignment faults (EC=0x07) when copying > 32 bytes

Reproduction

https://github.com/Hotschmoe/zigv13_memcpy_failure

The included code demonstrates the issue by:

  1. Setting up buffers with 8-byte alignment
  2. Enabling alignment checking in SCTLR_EL1
  3. Attempting to memcpy 40 bytes (triggering SIMD path)
  4. Capturing the resulting alignment fault

Running the Test

  1. Build with Zig 0.13.0
  2. Run under QEMU (aarch64-virt)
  3. Observe alignment fault with:
    • EC (Exception Class): 0x07
    • ISS: 0x1E00000

(In the repo, you can run zig build run to see the fault.)

Additional Notes

See the repo https://github.com/Hotschmoe/zigv13_memcpy_failure for additional notes about the failure in my own project. Feel free to reach out if you need access to the files.

Environment

  • Zig 0.13.0
  • AArch64 bare metal (EL1)
  • Tested on QEMU virt machine

Expected Behavior

The implementation should do one of the following:

  1. Check buffer alignment before allowing SIMD optimization
  2. Have LLVM's auto-vectorization respect the actual buffer alignment
  3. Document that memcpy requires 16-byte alignment for optimal performance
  4. Use unaligned SIMD loads/stores when alignment isn't guaranteed

Actual Behavior

LLVM's auto-vectorization transforms the byte-by-byte implementation into SIMD instructions without alignment checks, causing alignment faults when buffers aren't 16-byte aligned.

@Hotschmoe Hotschmoe added the bug Observed behavior contradicts documented or intended behavior label Jan 14, 2025
@Hotschmoe
Author

I fixed this in my own project by overriding memcpy and importing it early in boot.s.
Though this leaves performance on the table, I will try a vectorized copy using 128-bit NEON loads/stores.

Both files work, but I still need to track performance metrics (compiler-generated vs memcpy_overwrite vs memcpy_overwrite_neon).

memcpy_overwrite.s

.section ".text"
.global memcpy    // Make it globally visible
.type memcpy, %function  // Declare it as a function
.align 4

memcpy:
    // x0 = destination (already in place)
    // x1 = source
    // x2 = length
    // Save the original destination for return value
    mov x3, x0

1:  // Local label for loop
    cbz x2, 2f          // If length is 0, exit
    ldrb w4, [x1], #1   // Load byte from source, increment
    strb w4, [x0], #1   // Store byte to dest, increment
    sub x2, x2, #1      // Decrement length
    b 1b                // Loop back

2:  // Exit label
    mov x0, x3          // Restore original destination for return
    ret

.size memcpy, . - memcpy

memcpy_overwrite_neon.s

.section ".text"
.global memcpy
.type memcpy, %function
.align 4

memcpy:
    // Arguments:
    // x0 = destination
    // x1 = source
    // x2 = length (number of bytes)
    //
    // We preserve x0 in x3 so we can return the original destination pointer.

    mov    x3, x0

    // Check that both src and dst are 16-byte aligned.
    // OR the pointers together: any set bit in the low 4 bits
    // means at least one of them is misaligned (mod 16).
    orr    x4, x0, x1
    tst    x4, #0xF

    // If either pointer is misaligned or length < 16, jump to the naive copy.
    // (You *could* also do a partial alignment fixup here; for simplicity, we jump.)
    b.ne   .Naive

    cmp    x2, #16
    b.lt   .Naive

    // Both aligned, and we have at least 16 bytes. 
    // We'll do the bulk copy using NEON 128-bit loads/stores.
    // First, figure out how many 16-byte blocks to copy.
    lsr    x6, x2, #4    // x6 = number_of_16B_blocks = x2 / 16 (lsr: length is unsigned)
    // We'll save the leftover (tail) in x7:
    and    x7, x2, #0xF  // leftover = x2 & 15

    // Loop copying 16 bytes at a time
.VectorLoop:
    cbz    x6, .Tail   // if x6 == 0, we're done with the vector loop
    // load a 16-byte chunk from [x1], store to [x0]
    ldr    q0, [x1], #16
    str    q0, [x0], #16
    subs   x6, x6, #1
    b.ne   .VectorLoop

.Tail:
    cbz    x7, .Done   // if leftover = 0, we’re done

    // Copy leftover bytes naively
.TailLoop:
    ldrb   w8, [x1], #1
    strb   w8, [x0], #1
    subs   x7, x7, #1
    b.ne   .TailLoop

    b      .Done

// ----------------------------------------------------
// Naive fallback for the entire length (misaligned case or too short)
.Naive:
    cbz    x2, .Done

.NaiveLoop:
    ldrb   w4, [x1], #1
    strb   w4, [x0], #1
    subs   x2, x2, #1
    b.ne   .NaiveLoop

// ----------------------------------------------------
.Done:
    mov    x0, x3   // Return the original destination
    ret

.size memcpy, . - memcpy

@dweiller
Contributor

If you'd like, you could test/use the memcpy implementation in #18912. I would think LLVM shouldn't mess up the alignment in that implementation, since it does check pointer alignments. If you try it and encounter any issues, let me know in that PR.

@andrewrk andrewrk added arch-aarch64 64-bit ARM backend-llvm The LLVM backend outputs an LLVM IR Module. labels Jan 25, 2025
@andrewrk andrewrk added this to the 0.14.0 milestone Jan 25, 2025
@andrewrk
Member

@dweiller I think the next step here is to fix the issues @jacobly0 pointed out in review comments on that PR. Then we could ask @Hotschmoe to confirm the issue is resolved.

@dweiller
Contributor

@Hotschmoe Are you able to confirm if this issue has been fixed with a compiler version 0.14.0-dev.2960-df1cd6272 or later?

@Hotschmoe
Author

Hotschmoe commented Jan 31, 2025

see
https://github.com/Hotschmoe/zigv13_memcpy_failure/tree/zig-dev-3008-test

it still fails using zig-windows-x86_64-0.14.0-dev.3008+7cef585f5

though I may be testing this incorrectly, I do not consider myself well versed in this low-level development

  • Also worth noting: my current fix for my own project, memcpy_overwrite.s, did not work here when I tested it. I am a bit busy this weekend, so I won't be able to fiddle with it for a few more days.

@dweiller
Contributor

dweiller commented Feb 1, 2025

Hmm, okay - I looked at the assembly for target aarch64-linux and it didn't look like there should be any unaligned ldp/stp instructions. Perhaps there is a problem with unaligned access via ldr/str instructions? I don't really know AArch64, but I read that they may or may not have to be aligned depending on certain flags or CPUs. Is it possible that the QEMU VM is insisting on stricter alignment requirements than the default for the target you're compiling for, and that there are CPU features you could pass via -mcpu that would fix it?

@alexrp
Member

alexrp commented Feb 1, 2025

You can try to experiment with -mcpu cortex_a72+strict_align.

I don't have time to open the Arm64 manual right now to verify, but since you're apparently doing freestanding development, it's entirely conceivable that the processor starts out with strict alignment checks enabled, while the compiler generally defaults to assuming non-strict alignment for code generation purposes. In that case, it's on the user to enable the strict_align feature.

@andrewrk andrewrk added the regression It worked in a previous version of Zig, but stopped working. label Feb 1, 2025
@Hotschmoe
Author

Hotschmoe commented Feb 1, 2025

Okay it looks like it was my fault.

I was "Enabling alignment checking in SCTLR_EL1"

But it turns out that other ARMv8 operating systems, like Android, do not enable alignment checking. With that bit left disabled, I have no issues and no alignment faults in my original project.

@alexrp alexrp closed this as not planned Feb 14, 2025
@alexrp alexrp removed the regression It worked in a previous version of Zig, but stopped working. label Feb 14, 2025
@jacobly0 jacobly0 removed this from the 0.14.0 milestone Feb 14, 2025