Implement fill_via_chunks without using zerocopy or unsafe. #1575

Open · briansmith wants to merge 1 commit into master from b/less-zerocopy
Conversation

briansmith

  • Added a CHANGELOG.md entry

Summary

Reduce use of zerocopy.

Motivation

Make progress towards reducing the set of dependencies.

Details

@briansmith force-pushed the b/less-zerocopy branch 5 times, most recently from f559d51 to 8fe7fcc on February 4, 2025 at 00:31
briansmith (Author) commented Feb 4, 2025

I looked at the generated x86_64 assembly of my original naive implementation:

    // Always using little endian for portability of results.
    dest.chunks_mut(size).zip(src).for_each(|(dest, src)| {
        dest.iter_mut()
            .zip(src.to_le_bytes().as_ref())
            .for_each(|(dest, src)| *dest = *src)
    });

It didn't get compiled into a single memcpy. So I've updated the PR with another implementation:

    dest.iter_mut()
        .zip(src.iter().flat_map(|t| t.to_le_bytes()))
        .for_each(|(dest, src)| *dest = src);

This alternative implementation doesn't result in a call to memcpy at all. (I haven't benchmarked it; I'm on a laptop that isn't reliable enough, performance-wise, to run benchmarks on.)

Notably, in both of these implementations, there are no calls to panic functions in the generated assembly, unlike the present (before the PR) code. This is a benefit to certain applications.

Regardless, I think it's just a matter of code golf to find a reasonable implementation, if this one isn't it.
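
For context, here is a self-contained sketch of that second fragment, specialised to u32 (the function name is hypothetical; the PR generalises this over a trait providing to_le_bytes):

    // Hypothetical standalone version of the flat_map approach, specialised
    // to u32; the PR generalises over a trait that provides to_le_bytes().
    fn fill_bytes_le(src: &[u32], dest: &mut [u8]) {
        dest.iter_mut()
            // Each u32 is flattened into its 4 little-endian bytes;
            // zip stops at whichever side runs out first.
            .zip(src.iter().flat_map(|t| t.to_le_bytes()))
            .for_each(|(dest, src)| *dest = src);
    }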

TheIronBorn (Contributor) commented Feb 4, 2025

Here's a method that uses a single memcpy for most of the data but a manual loop for the remainder:

for (chunk, src) in dest.chunks_exact_mut(SIZE).zip(&mut *src) {
    chunk.copy_from_slice(&src.to_le_bytes());
}
// using src.iter().by_ref() doesn't produce a memcpy
let n = (dest.len() / SIZE) * SIZE;
for (dest, src) in dest
    .chunks_exact_mut(SIZE)
    .into_remainder()
    .iter_mut()
    .zip(src.iter().flat_map(|x| x.to_le_bytes()).skip(n))
{
    *dest = src;
}

Bluefinger commented Feb 4, 2025

The above won't compile, as chunk.copy_from_slice(&src.to_le_bytes()); expects a &[u8] but we actually get an associated-type return. The following should result in a compilable version that yields a memcpy:

pub fn fill_via_chunks<T: Observable>(src: &mut [T], dest: &mut [u8]) -> (usize, usize) {
    let size = core::mem::size_of::<T>();
    let byte_len = min(core::mem::size_of_val(src), dest.len());
    let num_chunks = (byte_len + size - 1) / size;
    let n = (dest.len() / size) * size;

    dest.chunks_exact_mut(size)
        .zip(&mut *src)
        .for_each(|(chunk, src)| {
            chunk
                .iter_mut()
                .zip(src.to_le_bytes())
                .for_each(|(dest, src)| {
                    *dest = src;
                })
        });

    dest.chunks_exact_mut(size)
        .into_remainder()
        .iter_mut()
        .zip(src.iter().flat_map(|x| x.to_le_bytes()).skip(n))
        .for_each(|(dest, src)| {
            *dest = src;
        });

    (num_chunks, byte_len)
}

Here's the godbolt result of the above code: https://rust.godbolt.org/z/GfWxb9nzG in case folks want to see for themselves or poke around further.

Basically, chunks_exact is needed to get the compiler to infer correctly when to optimise to the memcpy case.
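
To illustrate why (a minimal sketch with a hypothetical function name, specialised to u32): chunks_exact_mut guarantees every yielded chunk has exactly the requested length, so the equal-length check inside copy_from_slice is statically satisfied and the loop can collapse into a memcpy.

    // Every chunk from chunks_exact_mut(4) has length 4, matching the
    // [u8; 4] from to_le_bytes, so copy_from_slice's length check (and its
    // panic path) can be optimised away, leaving a plain memcpy.
    fn copy_words(src: &[u32], dest: &mut [u8]) {
        for (chunk, word) in dest.chunks_exact_mut(4).zip(src) {
            chunk.copy_from_slice(&word.to_le_bytes());
        }
        // With dest.chunks_mut(4) instead, the final chunk could be short,
        // so the length check would survive and block the optimisation.
    }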

TheIronBorn (Contributor)

Right, thank you. I didn't abstract enough.

Also, I found this tail method produces better codegen:

let chunks = dest.chunks_exact_mut(size);
if let Some(src) = src.get(chunks.len()) {
    chunks
        .into_remainder()
        .iter_mut()
        .zip(src.to_le_bytes())
        .for_each(|(dest, src)| {
            *dest = src;
        });
}

https://rust.godbolt.org/z/s95dWrafe

But I don't know much about zerocopy, so I can't speak to any performance change from dropping it.

josephlr (Member) commented Feb 4, 2025

I think @Bluefinger is on the right track with chunks_exact_mut, although we can make things a bit more readable by using for loops and the iterator's len() method to do the num_chunks calculation:

fn fill_via_chunks<T: Observable>(src: &[T], dest: &mut [u8]) -> (usize, usize) {
    let size = core::mem::size_of::<T>();

    let mut dest_it = dest.chunks_exact_mut(size);
    let it = dest_it.by_ref().zip(src);
    let num_chunks = it.len();
    let byte_len = num_chunks * size;

    for (dest_chunk, src_chunk) in it {
        dest_chunk.copy_from_slice(src_chunk.to_le_bytes().as_ref());
    }

    let dest_tail = dest_it.into_remainder();
    let n = dest_tail.len();
    if n > 0 && src.len() > num_chunks {
        dest_tail.copy_from_slice(&src[num_chunks].to_le_bytes().as_ref()[..n]);
        (num_chunks + 1, byte_len + n)
    } else {
        (num_chunks, byte_len)
    }
}

I honestly think this code is more readable than the existing code using zerocopy, mostly because it uses the iterator methods to do most of the num_chunks/byte_len math.

This gets compiled fairly well (Godbolt link) as one memcpy and an additional optional memcpy if we need to do a partial copy. Seems good enough.
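
As a concrete usage example (hypothetical values, assuming u32 implements the Observable trait used here), filling a 10-byte buffer from three u32 words copies two full chunks plus a 2-byte tail:

    let src: [u32; 3] = [0x0403_0201, 0x0807_0605, 0x0c0b_0a09];
    let mut dest = [0u8; 10];
    let (consumed, filled) = fill_via_chunks(&src, &mut dest);
    // Two full 4-byte chunks, then 2 bytes of the third word.
    assert_eq!((consumed, filled), (3, 10));
    assert_eq!(dest, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);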

josephlr (Member) left a comment:

This is the only use of zerocopy in rand_core, so we can remove it from rand_core/Cargo.toml.

@briansmith force-pushed the b/less-zerocopy branch 2 times, most recently from f867f03 to 4e5b6b2 on February 4, 2025 at 15:49
briansmith (Author) commented:

Thanks everybody! I updated the PR with a version that uses your ideas.

I am still frustrated that it compiles to two memcpys instead of one.

briansmith (Author) commented:

Also, in this new version, the compiler is still able to statically verify that there are no panics (no calls into panic infrastructure in the generated assembly).

Commit message: Reduce the use of `zerocopy` as a step towards eliminating the dependency.
josephlr (Member) left a comment:

Looks good. I have a few style nits, but they are equivalent to your code, so feel free to ignore them if you prefer how it's currently written.


let zipped = dest.by_ref().zip(src.by_ref());
let num_chunks = zipped.len();
zipped.for_each(|(dest, src)| dest.copy_from_slice(src.to_le_bytes().as_ref()));
Member:

Nit: It seems like this crate generally prefers for loops to for_each. Is there a reason to use for_each here? Is it because of variable shadowing?

Reply:

The compiler doesn't optimise for loops and for_each the same way, and to ensure certain optimisations take place, you need to use for_each. In maybe 90% of cases, for loops are just fine, but when targeting particular optimisation cases, for_each is usually what you need to reach for, particularly when you are trying to get autovectorisation to kick in.
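
For reference, the two forms are semantically identical; only the desugaring differs (a sketch with hypothetical function names, specialised to u32):

    // External iteration: the for loop desugars to repeated
    // Iterator::next() calls with the body outside the iterator.
    fn fill_with_for(src: &[u32], dest: &mut [u8]) {
        for (chunk, word) in dest.chunks_exact_mut(4).zip(src) {
            chunk.copy_from_slice(&word.to_le_bytes());
        }
    }

    // Internal iteration: for_each (built on Iterator::fold by default)
    // hands the whole body to the iterator, which sometimes optimises better.
    fn fill_with_for_each(src: &[u32], dest: &mut [u8]) {
        dest.chunks_exact_mut(4)
            .zip(src)
            .for_each(|(chunk, word)| chunk.copy_from_slice(&word.to_le_bytes()));
    }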

Member:

For the record, I have nothing against usage of for_each in this crate.

Reply:

Just as a reference: https://rust.godbolt.org/z/3qfned8vn

This shows the compiled output diff between the for loop and for_each. The for_each version optimises to the two memcpy ops, whereas the for loop optimises to a single memcpy plus a separate path that does not use memcpy.

Contributor:

Huh. Do we want the tail case to use memcpy? In the worst case it's only 7 bytes.
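
For comparison, here is a tail written without copy_from_slice (a sketch; dest_tail and src_word are hypothetical names for the remainder slice and the next source word):

    // Hypothetical: copy the at-most-(size - 1)-byte tail one byte at a
    // time instead of calling copy_from_slice on a subslice.
    let src_bytes = src_word.to_le_bytes();
    for (d, s) in dest_tail.iter_mut().zip(&src_bytes) {
        *d = *s;
    }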

zipped.for_each(|(dest, src)| dest.copy_from_slice(src.to_le_bytes().as_ref()));

let byte_len = num_chunks * size;
if let (dest_tail @ [_, ..], Some(src)) = (dest.into_remainder(), src.next()) {
Member:

Nit: I find the if let pattern here a little confusing; could we just do:

    if let Some(src) = src.next() {
        // We have consumed all full chunks of dest, but not src.
        let dest = dest.into_remainder();
        let n = dest.len();
        if n > 0 {
            dest.copy_from_slice(&src.to_le_bytes().as_ref()[..n]);
            return (num_chunks + 1, byte_len + n)
        }
    }
    (num_chunks, byte_len)

Member:

The usage of a slice match pattern to check for a non-empty tail is not so obvious, no.

This alternative compiles to identical assembly so I'll take it by preference.

Comment on lines 133 to 136
pub fn fill_via_u32_chunks(src: &mut [u32], dest: &mut [u8]) -> (usize, usize) {
// TODO(SemVer): src: `&[u32]` as we don't mutate it.
fill_via_chunks(src, dest)
}
Member:

Does anyone know why this method is public at all? It is used in the impl of RngCore for BlockRng<R: BlockRngCore>, which is crate-local. I doubt RNGs would want to use it directly. It looks like this pub fn pre-dates BlockRng.

I suggest that we deprecate these two methods.
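
A deprecation could look like the following sketch (the since version and note text are illustrative placeholders, not something decided in this thread):

    // Sketch only: version and message are placeholders.
    #[deprecated(
        since = "TBD",
        note = "used internally by BlockRng; not intended as public API"
    )]
    pub fn fill_via_u32_chunks(src: &mut [u32], dest: &mut [u8]) -> (usize, usize) {
        fill_via_chunks(src, dest)
    }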

Contributor:

I was wondering that myself

dhardy (Member) left a comment:

To do:

  • Alternative code as suggested above
  • Deprecate methods as noted above
  • Add a note to the changelog, under [Unreleased] heading
  • (Optional): do not use memcpy for tail case

After that I'm happy with this: the code is clear enough, at most two uses of memcpy seems fine, and the benchmark results are fine.

