
AArch64: LLVM auto-vectorization of memcpy causes alignment faults with 8-byte aligned addresses #22491

Closed
Hotschmoe opened this issue Jan 14, 2025 · 8 comments
Labels
arch-aarch64 64-bit ARM backend-llvm The LLVM backend outputs an LLVM IR Module. bug Observed behavior contradicts documented or intended behavior

Comments

@Hotschmoe

Zig Version

0.13.0

Steps to Reproduce and Observed Behavior

The Bug

While the Zig compiler_rt implementation in compiler_rt/memcpy.zig is a simple byte-by-byte copy, LLVM's auto-vectorization transforms this into SIMD instructions (ldp/stp with Q registers) for copies larger than 32 bytes (0x20). These instructions require 16-byte alignment, but no alignment verification is performed before the transformation.

Technical Details

  • The source implementation is a simple byte-by-byte copy in compiler_rt
  • LLVM auto-vectorization transforms this into SIMD instructions
  • The resulting assembly uses ldp q0, q1, [x9, #-16] which requires 16-byte alignment (attempting to load 32 bytes using two 128-bit SIMD registers)
  • No alignment checks are performed before using SIMD instructions

Impact

  • Affects AArch64 bare metal code using memcpy with 8-byte aligned buffers
  • Particularly problematic for kernel/OS development where hardware structures are often 8-byte aligned
  • Results in alignment faults (EC=0x07) when copying > 32 bytes

Reproduction

https://github.com/Hotschmoe/zigv13_memcpy_failure

The included code demonstrates the issue by:

  1. Setting up buffers with 8-byte alignment
  2. Enabling alignment checking in SCTLR_EL1
  3. Attempting to memcpy 40 bytes (triggering SIMD path)
  4. Capturing the resulting alignment fault

Running the Test

  1. Build with Zig 0.13.0
  2. Run under QEMU (aarch64-virt)
  3. Observe alignment fault with:
    • EC (Exception Class): 0x07
    • ISS: 0x1E00000

(In the repo, you can run zig build run to see the fault.)

Additional Notes

See the repo https://github.com/Hotschmoe/zigv13_memcpy_failure for additional notes about the failure in my own project. Feel free to reach out if you need access to the files.

Environment

  • Zig 0.13.0
  • AArch64 bare metal (EL1)
  • Tested on QEMU virt machine

Expected Behavior

The implementation should do one of the following:

  1. Check buffer alignment before allowing SIMD optimization
  2. Have LLVM's auto-vectorization respect the actual buffer alignment
  3. Document that memcpy requires 16-byte alignment for optimal performance
  4. Use unaligned SIMD loads/stores when alignment isn't guaranteed

Actual Behavior

LLVM's auto-vectorization transforms the byte-by-byte implementation into SIMD instructions without alignment checks, causing alignment faults when buffers aren't 16-byte aligned.

@Hotschmoe Hotschmoe added the bug Observed behavior contradicts documented or intended behavior label Jan 14, 2025
@Hotschmoe
Author

I fixed this in my own project by overriding memcpy and importing it early in boot.s.
Though this leaves performance on the table, I will try a vectorized copy using 128-bit NEON loads/stores.

Both files work, but I still need to track performance metrics (compiler-generated vs memcpy_overwrite vs memcpy_overwrite_neon).

memcpy_overwrite.s

.section ".text"
.global memcpy    // Make it globally visible
.type memcpy, %function  // Declare it as a function
.align 4

memcpy:
    // x0 = destination (already in place)
    // x1 = source
    // x2 = length
    // Save the original destination for return value
    mov x3, x0

1:  // Local label for loop
    cbz x2, 2f          // If length is 0, exit
    ldrb w4, [x1], #1   // Load byte from source, increment
    strb w4, [x0], #1   // Store byte to dest, increment
    sub x2, x2, #1      // Decrement length
    b 1b                // Loop back

2:  // Exit label
    mov x0, x3          // Restore original destination for return
    ret

.size memcpy, . - memcpy

memcpy_overwrite_neon.s

.section ".text"
.global memcpy
.type memcpy, %function
.align 4

memcpy:
    // Arguments:
    // x0 = destination
    // x1 = source
    // x2 = length (number of bytes)
    //
    // We preserve x0 in x3 so we can return the original destination pointer.

    mov    x3, x0

    // Check that both src and dst are 16-byte aligned.
    // OR the pointers together: any set bit in the low 4 bits
    // means at least one of them is misaligned (mod 16).
    orr    x4, x0, x1
    tst    x4, #0xF

    // If either pointer is misaligned or length < 16, jump to the naive copy.
    // (You *could* also do a partial alignment fixup here; for simplicity, we jump.)
    b.ne   .Naive

    cmp    x2, #16
    b.lt   .Naive

    // Both aligned, and we have at least 16 bytes. 
    // We'll do the bulk copy using NEON 128-bit loads/stores.
    // First, figure out how many 16-byte blocks to copy.
    lsr    x6, x2, #4    // x6 = number_of_16B_blocks = x2 / 16 (lsr: length is unsigned)
    // We'll save the leftover (tail) in x7:
    and    x7, x2, #0xF  // leftover = x2 & 15

    // Loop copying 16 bytes at a time
.VectorLoop:
    cbz    x6, .Tail   // if x6 == 0, we're done with the vector loop
    // load a 16-byte chunk from [x1], store to [x0]
    ldr    q0, [x1], #16
    str    q0, [x0], #16
    subs   x6, x6, #1
    b.ne   .VectorLoop

.Tail:
    cbz    x7, .Done   // if leftover = 0, we’re done

    // Copy leftover bytes naively
.TailLoop:
    ldrb   w8, [x1], #1
    strb   w8, [x0], #1
    subs   x7, x7, #1
    b.ne   .TailLoop

    b      .Done

// ----------------------------------------------------
// Naive fallback for the entire length (misaligned case or too short)
.Naive:
    cbz    x2, .Done

.NaiveLoop:
    ldrb   w4, [x1], #1
    strb   w4, [x0], #1
    subs   x2, x2, #1
    b.ne   .NaiveLoop

// ----------------------------------------------------
.Done:
    mov    x0, x3   // Return the original destination
    ret

.size memcpy, . - memcpy

@dweiller
Contributor

If you'd like, you could test/use the memcpy implementation in #18912. I would think LLVM shouldn't mess up the alignment in that implementation, since it does check pointer alignments. If you try it and encounter any issues, let me know in that PR.

@andrewrk andrewrk added arch-aarch64 64-bit ARM backend-llvm The LLVM backend outputs an LLVM IR Module. labels Jan 25, 2025
@andrewrk andrewrk added this to the 0.14.0 milestone Jan 25, 2025
@andrewrk
Member

@dweiller I think the next step here is to fix the issues @jacobly0 pointed out in review comments on that PR. Then we could ask @Hotschmoe to confirm the issue is resolved.

@dweiller
Contributor

@Hotschmoe Are you able to confirm if this issue has been fixed with a compiler version 0.14.0-dev.2960-df1cd6272 or later?

@Hotschmoe
Author

Hotschmoe commented Jan 31, 2025

see
https://github.com/Hotschmoe/zigv13_memcpy_failure/tree/zig-dev-3008-test

it still fails using zig-windows-x86_64-0.14.0-dev.3008+7cef585f5

though I may be testing this incorrectly, I do not consider myself well versed in this low-level development

  • Also worth noting: my current fix for my own project, memcpy_overwrite.s, did not work here when I tested it. I am a bit busy this weekend, so I won't be able to fiddle with it for a few more days.

@dweiller
Contributor

dweiller commented Feb 1, 2025

Hmm, okay - I looked at the assembly for target aarch64-linux and it didn't look like there should be any unaligned ldp/stp instructions. Perhaps there is a problem with unaligned access via ldr/str instructions? I don't really know AArch64, but I read that they may or may not have to be aligned depending on certain flags or CPUs. Is it possible that the QEMU VM is insisting on stricter alignment requirements than the default for the target you're compiling for, and that there are CPU features you could pass via -mcpu that would fix it?

@alexrp
Member

alexrp commented Feb 1, 2025

You can try to experiment with -mcpu cortex_a72+strict_align.

I don't have time to open the Arm64 manual right now to verify, but since you're apparently doing freestanding development, it's entirely conceivable that the processor starts out with strict alignment checks enabled, while the compiler generally defaults to assuming non-strict alignment for code generation purposes. In that case, it's on the user to enable the strict_align feature.

@andrewrk andrewrk added the regression It worked in a previous version of Zig, but stopped working. label Feb 1, 2025
@Hotschmoe
Author

Hotschmoe commented Feb 1, 2025

Okay it looks like it was my fault.

I was "Enabling alignment checking in SCTLR_EL1"

But it turns out that other ARMv8 operating systems, like Android, do not enable alignment checking. With that bit left disabled, I have no issues and no alignment faults in my original project.

@alexrp alexrp closed this as not planned Feb 14, 2025
@alexrp alexrp removed the regression It worked in a previous version of Zig, but stopped working. label Feb 14, 2025
@jacobly0 jacobly0 removed this from the 0.14.0 milestone Feb 14, 2025