Implement fill_via_chunks without using zerocopy or unsafe. #1575

Open · briansmith wants to merge 1 commit into master from b/less-zerocopy
Conversation

briansmith

  • Added a CHANGELOG.md entry

Summary

Reduce use of zerocopy.

Motivation

Make progress towards reducing the set of dependencies.

Details

@briansmith force-pushed the b/less-zerocopy branch 5 times, most recently from f559d51 to 8fe7fcc on February 4, 2025 at 00:31
briansmith (Author) commented Feb 4, 2025

I looked at the generated x86_64 assembly of my original naive implementation:

    // Always using little endian for portability of results.
    dest.chunks_mut(size).zip(src).for_each(|(dest, src)| {
        dest.iter_mut()
            .zip(src.to_le_bytes().as_ref())
            .for_each(|(dest, src)| *dest = *src)
    });

It didn't get compiled into a single memcpy. So I've updated the PR with another implementation:

    dest.iter_mut()
        .zip(src.iter().flat_map(|t| t.to_le_bytes()))
        .for_each(|(dest, src)| *dest = src);

This alternative implementation doesn't result in a call to memcpy at all. (I haven't benchmarked it; I'm on a laptop that isn't reliable enough, performance-wise, to run benchmarks on.)

Notably, in both of these implementations, there are no calls to panic functions in the generated assembly, unlike the present (before the PR) code. This is a benefit to certain applications.

Regardless, I think it's just a matter of code golf to find a reasonable implementation, if this one isn't it.
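
For context, here is a self-contained sketch of that second fragment, specialised to u32 (the function name is hypothetical; the PR generalises this over a trait providing to_le_bytes):

    // Hypothetical standalone version of the flat_map approach, specialised
    // to u32; the PR generalises over a trait that provides to_le_bytes().
    fn fill_bytes_le(src: &[u32], dest: &mut [u8]) {
        dest.iter_mut()
            // Each u32 is flattened into its 4 little-endian bytes;
            // zip stops at whichever side runs out first.
            .zip(src.iter().flat_map(|t| t.to_le_bytes()))
            .for_each(|(dest, src)| *dest = src);
    }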

TheIronBorn (Contributor) commented Feb 4, 2025

Here's a method that uses a single memcpy for most of the data but a manual loop for the remainder:

for (chunk, src) in dest.chunks_exact_mut(SIZE).zip(&mut *src) {
    chunk.copy_from_slice(&src.to_le_bytes());
}
// using src.iter().by_ref() doesn't produce a memcpy
let n = (dest.len() / SIZE) * SIZE;
for (dest, src) in dest
    .chunks_exact_mut(SIZE)
    .into_remainder()
    .iter_mut()
    .zip(src.iter().flat_map(|x| x.to_le_bytes()).skip(n))
{
    *dest = src;
}

Bluefinger commented Feb 4, 2025

The above won't compile, as chunk.copy_from_slice(&src.to_le_bytes()); expects a &[u8] but we actually get an associated-type return. The following should result in a compilable version that yields a memcpy:

pub fn fill_via_chunks<T: Observable>(src: &mut [T], dest: &mut [u8]) -> (usize, usize) {
    let size = core::mem::size_of::<T>();
    let byte_len = min(core::mem::size_of_val(src), dest.len());
    let num_chunks = (byte_len + size - 1) / size;
    let n = (dest.len() / size) * size;

    dest.chunks_exact_mut(size)
        .zip(&mut *src)
        .for_each(|(chunk, src)| {
            chunk
                .iter_mut()
                .zip(src.to_le_bytes())
                .for_each(|(dest, src)| {
                    *dest = src;
                })
        });

    dest.chunks_exact_mut(size)
        .into_remainder()
        .iter_mut()
        .zip(src.iter().flat_map(|x| x.to_le_bytes()).skip(n))
        .for_each(|(dest, src)| {
            *dest = src;
        });

    (num_chunks, byte_len)
}

Here's the godbolt result of the above code: https://rust.godbolt.org/z/GfWxb9nzG in case folks want to see for themselves or poke around further.

Basically, chunks_exact is needed to get the compiler to infer correctly when to optimise to the memcpy case.
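
To illustrate why (a minimal sketch with a hypothetical function name, specialised to u32): chunks_exact_mut guarantees every yielded chunk has exactly the requested length, so the equal-length check inside copy_from_slice is statically satisfied and the loop can collapse into a memcpy.

    // Every chunk from chunks_exact_mut(4) has length 4, matching the
    // [u8; 4] from to_le_bytes, so copy_from_slice's length check (and its
    // panic path) can be optimised away, leaving a plain memcpy.
    fn copy_words(src: &[u32], dest: &mut [u8]) {
        for (chunk, word) in dest.chunks_exact_mut(4).zip(src) {
            chunk.copy_from_slice(&word.to_le_bytes());
        }
        // With dest.chunks_mut(4) instead, the final chunk could be short,
        // so the length check would survive and block the optimisation.
    }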

TheIronBorn (Contributor)

Right, thank you. I didn't abstract enough.

Also, I found this tail method produces better codegen:

let chunks = dest.chunks_exact_mut(size);
if let Some(src) = src.get(chunks.len()) {
    chunks
        .into_remainder()
        .iter_mut()
        .zip(src.to_le_bytes())
        .for_each(|(dest, src)| {
            *dest = src;
        });
}

https://rust.godbolt.org/z/s95dWrafe

But I don't know much about zerocopy, so I can't speak to any performance change from dropping it.

josephlr (Member) commented Feb 4, 2025

I think @Bluefinger is on the right track with chunks_exact_mut, although we can make things a bit more readable by using for loops and the iterator's len() method to do the num_chunks calculation:

fn fill_via_chunks<T: Observable>(src: &[T], dest: &mut [u8]) -> (usize, usize) {
    let size = core::mem::size_of::<T>();

    let mut dest_it = dest.chunks_exact_mut(size);
    let it = dest_it.by_ref().zip(src);
    let num_chunks = it.len();
    let byte_len = num_chunks * size;

    for (dest_chunk, src_chunk) in it {
        dest_chunk.copy_from_slice(src_chunk.to_le_bytes().as_ref());
    }

    let dest_tail = dest_it.into_remainder();
    let n = dest_tail.len();
    if n > 0 && src.len() > num_chunks {
        dest_tail.copy_from_slice(&src[num_chunks].to_le_bytes().as_ref()[..n]);
        (num_chunks + 1, byte_len + n)
    } else {
        (num_chunks, byte_len)
    }
}

I honestly think this code is more readable than the existing code using zerocopy, mostly because it uses the iterator methods to do most of the num_chunks/byte_len math.

This gets compiled fairly well (Godbolt link) as one memcpy and an additional optional memcpy if we need to do a partial copy. Seems good enough.
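
As a concrete usage example (hypothetical values, assuming u32 implements the Observable trait used here), filling a 10-byte buffer from three u32 words copies two full chunks plus a 2-byte tail:

    let src: [u32; 3] = [0x0403_0201, 0x0807_0605, 0x0c0b_0a09];
    let mut dest = [0u8; 10];
    let (consumed, filled) = fill_via_chunks(&src, &mut dest);
    // Two full 4-byte chunks, then 2 bytes of the third word.
    assert_eq!((consumed, filled), (3, 10));
    assert_eq!(dest, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);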

josephlr (Member) left a comment:

This is the only use of zerocopy in rand_core, so we can remove it from rand_core/Cargo.toml.

@briansmith force-pushed the b/less-zerocopy branch 2 times, most recently from f867f03 to 4e5b6b2 on February 4, 2025 at 15:49
briansmith (Author) commented:

Thanks everybody! I updated the PR with a version that uses your ideas.

I am still frustrated that it compiles to two memcpys instead of one.

briansmith (Author) commented:

Also, in this new version, the compiler is still able to statically verify that there are no panics (no calls into panic infrastructure in the generated assembly).

Commit message: Reduce the use of `zerocopy` as a step towards eliminating the dependency.
josephlr (Member) left a comment:

Looks good. I have a few style nits, but they are equivalent to your code, so feel free to ignore them if you prefer how it's currently written.


let zipped = dest.by_ref().zip(src.by_ref());
let num_chunks = zipped.len();
zipped.for_each(|(dest, src)| dest.copy_from_slice(src.to_le_bytes().as_ref()));
Member:

Nit: It seems like this crate generally prefers for loops to for_each. Is there a reason to use for_each here? Is it because of variable shadowing?

Reply:

The compiler doesn't optimise for loops and for_each the same way, and to ensure certain optimisations take place, you need to use for_each. In maybe 90% of cases, for loops are just fine, but when targeting particular optimisation cases, for_each is usually what you need to reach for, particularly when you are trying to get autovectorisation to kick in.
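
For reference, the two forms are semantically identical; only the desugaring differs (a sketch with hypothetical function names, specialised to u32):

    // External iteration: the for loop desugars to repeated
    // Iterator::next() calls with the body outside the iterator.
    fn fill_with_for(src: &[u32], dest: &mut [u8]) {
        for (chunk, word) in dest.chunks_exact_mut(4).zip(src) {
            chunk.copy_from_slice(&word.to_le_bytes());
        }
    }

    // Internal iteration: for_each (built on Iterator::fold by default)
    // hands the whole body to the iterator, which sometimes optimises better.
    fn fill_with_for_each(src: &[u32], dest: &mut [u8]) {
        dest.chunks_exact_mut(4)
            .zip(src)
            .for_each(|(chunk, word)| chunk.copy_from_slice(&word.to_le_bytes()));
    }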

Member:

For the record, I have nothing against usage of for_each in this crate.

Reply:

Just as a reference: https://rust.godbolt.org/z/3qfned8vn

This shows the compiled output diff between the for loop and for_each. The for_each version optimises to the two memcpy ops, whereas the for loop optimises to a single memcpy plus a separate path that does not use memcpy.

Contributor:

Huh. Do we want the tail case to use memcpy? In the worst case it's only 7 bytes.
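
For comparison, here is a tail written without copy_from_slice (a sketch; dest_tail and src_word are hypothetical names for the remainder slice and the next source word):

    // Hypothetical: copy the at-most-(size - 1)-byte tail one byte at a
    // time instead of calling copy_from_slice on a subslice.
    let src_bytes = src_word.to_le_bytes();
    for (d, s) in dest_tail.iter_mut().zip(&src_bytes) {
        *d = *s;
    }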

zipped.for_each(|(dest, src)| dest.copy_from_slice(src.to_le_bytes().as_ref()));

let byte_len = num_chunks * size;
if let (dest_tail @ [_, ..], Some(src)) = (dest.into_remainder(), src.next()) {
Member:

Nit: I find the if let pattern here a little confusing; could we just do:

    if let Some(src) = src.next() {
        // We have consumed all full chunks of dest, but not src.
        let dest = dest.into_remainder();
        let n = dest.len();
        if n > 0 {
            dest.copy_from_slice(&src.to_le_bytes().as_ref()[..n]);
            return (num_chunks + 1, byte_len + n)
        }
    }
    (num_chunks, byte_len)

Member:

The usage of a slice match pattern to check for a non-empty tail is not so obvious, no.

This alternative compiles to identical assembly so I'll take it by preference.

Comment on lines 133 to 136
pub fn fill_via_u32_chunks(src: &mut [u32], dest: &mut [u8]) -> (usize, usize) {
// TODO(SemVer): src: `&[u32]` as we don't mutate it.
fill_via_chunks(src, dest)
}
Member:

Does anyone know why this method is public at all? It is used in the impl of RngCore for BlockRng<R: BlockRngCore>, which is crate-local. I doubt RNGs would want to use it directly. It looks like this pub fn pre-dates BlockRng.

I suggest that we deprecate these two methods.
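
A deprecation could look like the following sketch (the since version and note text are illustrative placeholders, not something decided in this thread):

    // Sketch only: version and message are placeholders.
    #[deprecated(
        since = "TBD",
        note = "used internally by BlockRng; not intended as public API"
    )]
    pub fn fill_via_u32_chunks(src: &mut [u32], dest: &mut [u8]) -> (usize, usize) {
        fill_via_chunks(src, dest)
    }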

Contributor:

I was wondering that myself

dhardy (Member) left a comment:

To do:

  • Alternative code as suggested above
  • Deprecate methods as noted above
  • Add a note to the changelog, under [Unreleased] heading
  • (Optional): do not use memcpy for tail case

After that I'm happy with this: the code is clear enough, at most two uses of memcpy seems fine, and the benchmark results are fine.

