
memcpy-style copies of small fixed length arrays become memcpy, unless done via a loop with ops::Index #92993

Open
saethlin opened this issue Jan 17, 2022 · 4 comments
Labels
A-array Area: `[T; N]` A-codegen Area: Code generation C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@saethlin (Member)

godbolt demo

I have a [u8; 4] and I want to copy an unknown number of bytes from it into a &mut [u8]. Based on benchmarking, calling memcpy is much slower, on x86_64 at least, than doing a byte-by-byte copy. The fastest implementation of this pattern in Rust is:

// Safety: the caller must guarantee that len <= 4.
if len > 4 {
    // lets the optimizer assume len <= 4, eliding the bounds checks below
    unsafe { core::hint::unreachable_unchecked() };
}
for i in 0..len {
    // Safety: len <= 4 is guaranteed by the hint above
    unsafe { *dst.get_unchecked_mut(i) = src[i] };
}

That just doesn't seem right to me.
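For comparison, a sketch of the obvious safe version of the same copy (the function name is illustrative); because len is only known at runtime, this is the shape that typically lowers to an actual memcpy call:

```rust
// Sketch: the straightforward safe version. The variable-length
// slice-to-slice copy typically compiles to a call to memcpy rather
// than inlined loads and stores.
pub fn copy_prefix(src: &[u8; 4], dst: &mut [u8], len: usize) {
    dst[..len].copy_from_slice(&src[..len]);
}
```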

@saethlin (Member, Author)

This smells like a duplicate of #44099

@workingjubilee workingjubilee added the A-array Area: `[T; N]` label Mar 7, 2023
@Jules-Bertholet (Contributor)

@rustbot label +A-codegen +I-slow

@rustbot rustbot added A-codegen Area: Code generation I-slow Issue: Problems and improvements with respect to performance of generated code. labels May 3, 2023
@AlexTMjugador commented Aug 11, 2023

I just got bitten by this suboptimal codegen on a production project, and made a detailed write-up with performance analysis for my case here.

From my experience, calling memcpy for small buffers is especially bad on Linux musl targets, whose memcpy implementation is significantly less optimized than glibc's. If any sort of buffered I/O operation or small-slice copy is involved in the hot path of some piece of code (which potentially includes writing to any sort of in-memory buffer or BufWriter), runtime regressions of ~10% or more are to be expected purely because of memcpy implementation differences.

This is a fairly old problem that has been partially addressed with small-copy optimizations in the Read and Write implementations for slices, but even those optimizations fall back to the slow path for copies larger than one byte, and it'd be great to see some progress made on this.
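The small-copy optimization referenced above special-cases very short copies before falling back to the general, memcpy-backed path. A minimal sketch of that pattern (illustrative, not std's actual code):

```rust
// Sketch of the small-copy fast-path pattern: handle a one-byte copy
// with a direct store, and only fall back to the general slice copy
// (which may become a memcpy call) for longer copies.
pub fn write_bytes(dst: &mut [u8], src: &[u8]) -> usize {
    let n = src.len().min(dst.len());
    if n == 1 {
        // single-byte fast path: no memcpy call
        dst[0] = src[0];
    } else {
        dst[..n].copy_from_slice(&src[..n]);
    }
    n
}
```

As the comment above notes, this only helps copies of exactly one byte; a two-byte copy already takes the slow path.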

@iximeow (Contributor) commented Mar 31, 2024

comparing a fixed-size copy and unknown-but-small-size copy, including llvm opt remarks: https://godbolt.org/z/MacG1Yrx3

even in the case of a fixed-size copy there's a memcpy for at least a moment:

note: <source>:16:9 loop-idiom (success): Formed a call to llvm.memcpy.p0.p0.i64() intrinsic
    from load and store instruction in _ZN7example23small_copy_fixed_amount17ha5560e32ccb9b385E function

so in the typical case we're pretty dependent on LLVM deciding to turn calls to the intrinsic memcpy back into loads and stores!
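the godbolt comparison is roughly between functions of this shape (a sketch; the first name mirrors the opt remark above, the bodies are illustrative):

```rust
// Fixed-size copy: LLVM's loop-idiom pass forms an llvm.memcpy
// intrinsic from this loop, which later passes can lower back into
// plain loads and stores because the length is a constant.
pub fn small_copy_fixed_amount(src: &[u8; 4], dst: &mut [u8; 4]) {
    for i in 0..4 {
        dst[i] = src[i];
    }
}

// Unknown-but-small-size copy: the memcpy intrinsic survives all the
// way to a libc call because the length is not a compile-time constant.
pub fn small_copy_variable_amount(src: &[u8; 4], dst: &mut [u8], len: usize) {
    dst[..len].copy_from_slice(&src[..len]);
}
```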

that, in turn, seems to happen in one of three places, two of which require a constant-size len for the memcpy, and the third is fairly target-specific. that seems to be why this remains a memcpy all the way through to a libc call.

fsrm in particular is an interesting detail, because for processors supporting fast rep mov it's likely faster than a loop. this LKML patch referencing it includes a handy dandy graph comparing a loop vs rep mov for copies up to 31 bytes long (assuming src and dest are not in cache. i'm curious how this holds up when they are.)

that seems like an indication that the "best" thing to do here is improve LLVM to handle the memcpy better, rather than avoiding the memcpy in LLVM entirely.

(i recognize most of the above is more suitable for an LLVM issue, but prevalence of the memcpy intrinsic in code that optimizes well was a surprise!)

bonus: related poor codegen from clang and workaround-ish in rust

a contributing factor to this being more noticeable in Rust is noalias. loop-idiom doesn't fire to transform load-store loops into memcpy if source and dest can alias.

once you know that, it's very easy to replicate this with clang (clang godbolt), or to "fix" it by using rust types that might alias (rustc godbolt).
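a minimal sketch of the aliasing distinction: a copy loop over raw pointers carries no noalias guarantee, so loop-idiom leaves it as per-byte loads and stores, while the equivalent loop over &mut/& references (which are noalias) is eligible for the rewrite into llvm.memcpy:

```rust
// Sketch: a copy loop over raw pointers. Because src and dst may
// alias, LLVM's loop-idiom pass does not rewrite this loop into an
// llvm.memcpy intrinsic; it stays as per-byte loads and stores.
//
// Safety: the caller must ensure both pointers are valid for `len`
// bytes of access (overlap is allowed; the copy proceeds forward
// one byte at a time).
pub unsafe fn copy_maybe_aliasing(src: *const u8, dst: *mut u8, len: usize) {
    for i in 0..len {
        *dst.add(i) = *src.add(i);
    }
}
```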

iximeow added a commit to iximeow/yaxpeax-x86 that referenced this issue Jun 24, 2024
this empty commit reproduces a github comment that describes the work on
commits from this point back to, roughly, 1.2.2. since many commits
between these two points are interesting in the context of performance
optimization (especially uarch-relevant tweaks), many WIP commits are
preserved. as a result there is no clear squash merge, and this commit
will be the next best thing.

on Rust 1.68.0 and a Xeon E3-1230 V2, relative changes are measured
roughly as:
  starting at ed4f238:
    - non-fmt ns/decode: 15ns
    - non-fmt instructions/decode: 94.6
    - non-fmt IPC: 1.71
    - fmt ns/decode+display: 91ns
    - fmt instructions/decode+display: 683.8
    - fmt IPC: 2.035

  ending at 6a5ea10
    - non-fmt ns/decode: 15ns
    - non-fmt instructions/decode: 94.6
    - non-fmt IPC: 1.71
    - fmt ns/decode+display: 47ns
    - fmt instructions/decode+display: 329.6
    - fmt IPC: 1.898

for an overall ~50% reduction in runtimes to display instructions.
writing into InstructionTextBuffer reduces overhead another ~10%.

-- original message follows --

this is where much of iximeow/yaxpeax-arch#7
originated.

`std::fmt` as a primary writing mechanism has.. some limitations:
* rust-lang/rust#92993 (comment)
* llvm/llvm-project#87440
* rust-lang/rust#122770

and some more interesting more fundamental limitations - writing to a
`T: fmt::Write` means implementations don't know if it's possible to
write bytes in reverse order (useful for printing digits) or if it's OK
to write too many bytes and then only advance `len` by the correct
amount (useful for copying variable-length-but-short strings like
register names). these are both perfectly fine to a `String` or `Vec`,
less fine to do to a file descriptor like stdout.

at the same time, `Colorize` and traits depending on it are very broken,
for reasons described in yaxpeax-arch.

so, this adapts `yaxpeax-x86` to use the new `DisplaySink` type for
writing, with optimizations where appropriate and output spans for
certain kinds of tokens - registers, integers, opcodes, etc. it's not
a perfect replacement for Colorize-to-ANSI-supporting-outputs but it's
more flexible and i think can be made right.

along the way this completes the move of `safer_unchecked` out to
yaxpeax-arch (ty @5225225 it's still so useful), cleans up some docs,
and comes with a few new test cases.

because of the major version bump of yaxpeax-arch, and because this
removes most functionality of the Colorize impl - it prints the
correct words, just without coloring - this is itself a major version
bump to 2.0.0. yay! this in turn is a good point to change the
`Opcode` enums from being tuple-like to struct-like, and i've done so
in
1b8019d.

full notes in CHANGELOG ofc. this is notes for myself when i'm trying
to remember any of this in two years :)
@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Feb 14, 2025